Answers are due in pdf format to AVENUE anytime before Friday November
8, 5:00pm. Assignments will not be accepted after this time.
----------------------------------------------------------------------
1. Below are four partial sequences of cytochrome oxidase I from some
(hopefully familiar) Drosophila species. This is the COI fragment
that is being used for the Barcode of Life project.
We will use the "puzzle" program to estimate the alpha parameter
for these sequences. Puzzle is the last program listed in chpt9
on the web page http://evol.mcmaster.ca/p3S03.html.
(1) You will want the advanced form.
(2) Enter email and paste the sequence data.
(3) Declare it to be DNA sequence.
(4) Scroll down to "control options" and un-select 'approximate
quartet likelihood' and set 'number of puzzling steps' to
zero. (these are options appropriate to do some other features
of the program).
(5) Scroll down to "rate heterogeneity options". In the pull-down
menu select "gamma distributed rates" (do not enter an alpha
rate).
(6) Run the program by clicking on the 'run puzzle' button.
This program does lots of things but the only thing we are interested
in today, is the rate heterogeneity estimate that it provides.
In the output file scroll down to this section. You will see an
estimate of alpha. This program estimates the parameter of the
gamma distribution by fitting a discrete version of the distribution
with a model to the sequence data. By discrete I mean it splits
the distribution into (by default) eight equal sections (it splits
the DISTRIBUTION into eight equal sections, not the sites into
equal categories). (What is meant by "fitting" will be discussed
more later). It also estimates the relative rates of each of these
eight sections.
(a) Provide the estimate of alpha and the eight rates to your TA.
(b) Confirm that the mean of the rates is one (they are equally probable).
(c) Check the distribution in your notes; is the gamma
distribution with this value of alpha roughly symmetrical or
skewed to the right (with greater density on the left).
Lastly, this program estimates which sites belong to which category.
Something not necessary for a gamma distribution with a simple
JC correction.
5 663
D.mont ---GGAACATTATATTTTATCTTTGGTGCTTGAGCAGGAATAGTAGGAACATCTTTAAGA
D.arizo ------ACTTTATATTTTATTTTCGGAGCTTGAGCTGGGATAGTGGGAACTTCTTTAAGA
D.mauri ATTGGAACTTTATATTTTATCTTTGGAGCTTGAGCTGGGATAGTAGGTACATCCCTAAGA
D.melan ATTGGAACTTTATATTTTATTTTTGGAGCTTGAGCTGGAATAGTTGGAACATCTTTAAGA
D.moja ------ACTTTATATTTTATTTTCGGAGCTTGAGCTGGAATAGTAGGAACTTCTTTAAGA
ATTTTAATTCGAGCTGAATTAGGTCACCCAGGAGCTTTAATTGGAGATGATCAAATTTAT
ATTTTAATTCGTGCCGAATTAGGTCATCCAGGTGCACTAATTGGAGATGATCAAATTTAT
ATTTTAATTCGAGCCGAATTAGGTCATCCTGGAGCATTAATTGGAGATGATCAAATTTAT
ATTTTAATTCGAGCTGAATTAGGACATCCTGGAGCATTAATTGGAGATGATCAAATTTAT
ATTTTAATTCGTGCTGAATTAGGTCATCCAGGTGCACTAATTGGAGATGATCAAATTTAC
AATGTTATTGTTACTGCTCATGCTTTTGTTATAATTTTTTTCATAGTTATACCAATTATA
AATGTAATTGTTACAGCACATGCTTTTGTAATAATTTTTTTTATAGTAATGCCTATTATA
AATGTAATTGTAACTGCACATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTATA
AATGTAATTGTAACTGCACATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTATA
AATGTAATTGTCACAGCACACGCTTTTGTAATAATTTTTTTTATAGTAATACCTATTATA
ATTGGAGGATTCGGTAATTGACTTGTACCTTTAATATTAGGAGCTCCTGATATAGCTTTT
ATTGGAGGATTTGGAAATTGACTGGTGCCTTTAATACTAGGGGCCCCTGATATAGCATTC
ATTGGTGGATTTGGAAATTGATTGGTACCTTTAATATTAGGTGCTCCTGATATAGCATTT
ATTGGTGGATTTGGAAATTGATTAGTGCCTTTAATATTAGGTGCTCCTGATATAGCATTC
ATTGGAGGATTTGGAAATTGACTAGTACCTTTAATACTCGGGGCCCCTGATATAGCATTC
CCCCGAATAAACAATATAAGATTTTGACTATTACCTCCTGCTTTATCCTTACTTTTAGTT
CCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCAGCTTTAACTTTATTATTAGTA
CCACGAATAAATAATATAAGATTCTGACTACTACCCCCTGCTCTTTCTCTATTATTAGTG
CCACGAATAAATAATATAAGATTTTGACTTCTACCTCCTGCTCTTTCTTTACTATTAGTA
CCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCAGCTTTAACTTTATTATTAGTA
AGAAGAATAGTTGAAAATGGAGCTGGGACAGGATGAACCGTATACCCCCCATTATCGGCA
AGCAGTATAGTTGAAAACGGAGCTGGAACAGGGTGAACTGTCTACCCTCCGCTATCTTCA
AGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCACCTTTATCTGCT
AGTAGAATAGTTGAAAATGGAGCTGGGACAGGATGAACTGTTTATCCACCTCTATCCGCT
AGCAGTATAGTTGAAAACGGAGCTGGAACAGGGTGAACTGTCTACCCTCCGCTATCTTCA
GGAATTGCTCATGGAGGAGCTTCTGTTGATCTAGCTATTTTTTCTTTACATTTAGCTGGA
GGTATTGCCCATGGAGGAGCCTCAGTAGATTTAGCAATTTTTTCTTTACATTTAGCAGGA
GGAATTGCCCATGGTGGAGCTTCAGTTGATTTAGCTATTTTTTCTTTACATTTAGCTGGA
GGAATTGCTCATGGTGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTACATTTAGCAGGA
GGTATTGCCCATGGAGGAGCCTCAGTAGATTTAGCAATTTTTTCTTTACATTTAGCAGGA
ATCTCTTCAATTTTAGGAGCTGTAAATTTCATTACAACTGTAATTAATATACGATCTTCA
ATTTCTTCTATTTTAGGTGCCGTAAATTTTATTACAACAGTAATTAACATACGATCAACT
ATTTCTTCAATTTTAGGAGCTGTAAATTTTATTACAACTGTAATTAATATACGATCTACA
ATTTCTTCAATTTTAGGAGCTGTAAATTTTATTACAACTGTAATTAATATACGATCAACA
ATTTCTTCTATTTTAGGGGCCGTAAATTTTATTACAACAGTAATTAATATACGATCAACT
GGAATTACTTTAGACCGAATACCTTTATTTGTTTGATCAGTTGTAATTACAGCTTTATTA
GGAATTACTCTTGACCGTATGCCTTTATTTGTGTGATCTGTTGTAATTACAGCTTTATTA
GGAATTTCATTAGATCGTATACCTTTATTTGTTTGATCAGTAGTTATTACTGCTTTATTA
GGAATTTCATTAGATCGTATACCTTTATTTGTTTGATCAGTAGTTATTACTGCTTTATTA
GGAATTACTCTTGACCGTATACCTTTATTTGTGTGATCTGTTGTAATCACAGCTTTATTA
CTACTTTTATCCTTACCAGTTTTAGCCGGAGCTATCACTATATTATTAACAGACCGAAAT
TTACTTTTATCATTACCTGTCTTAGCCGGGGCTATTACAATATTATTAACAGACCGAAAT
TTACTTTTATCATTACCAGTATTAGCAGGAGCTATCACTATATTATTAACAGATCGAAAT
TTATTATTATCACTTCCAGTACTAGCAGGAGCTATTACTATATTATTAACAGATCGAAAT
CTACTGTTATCATTACCTGTCTTAGCCGGGGCTATTACAATATTATTAACAGACCGAAAT
CTTAACACCTCATTTTTTGACCCGGCTGGTGGGGGGGATCCAATTTTATACCAACACTTA
TTAAATACTTCATTTTTTGACCCCGCTGGAGGAGGAGACCCAATTCTTTACCAACATTTA
TTAAATACATCATTTTTTGACCCAGCTGGAGGAGGAGATCCTATTTTATACCAACATTTA
TTAAATACATCATTTTTTGACCCAGCGGGAGGAGGAGATCCTATTTTATACCAACATTTA
TTAAATACTTCATTTTTTGACCCCGCAGGAGGAGGAGACCCAATTCTTTACCAACATTTA
TTT
TTT
TTT
TTT
TTT
# RATE HETEROGENEITY
#
# Model of rate heterogeneity: Gamma distributed rates
# Gamma distribution parameter alpha (estimated from data set): 0.13 (S.E. 0.02)
# Number of Gamma rate categories: 8
#
# Rates and their respective probabilities used in the likelihood function:
#
# Category Relative rate Probability
# 1 0.0000 0.1250
# 2 0.0000 0.1250
# 3 0.0003 0.1250
# 4 0.0056 0.1250
# 5 0.0496 0.1250
# 6 0.2868 0.1250
# 7 1.3130 0.1250
# 8 6.3447 0.1250
#
#
# Categories 1-8 approximate a continuous Gamma-distribution with expectation 1
# and variance 7.89.
#
# Combination of categories that contributes the most to the likelihood
# (computation done without clock assumption assuming quartet-puzzling tree):
#
# 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 1 1 1 1 1 8 1 1 7 1 1 8 1 1 1
# 1 1 1 1 1 8 1 1 8 1 1 1 1 1 8 1 1 8 1 1 8 1 1 8 8 1 1 1 1 1
# 1 1 1 1 1 1 1 1 1 1 1 8 1 1 8 1 1 1 1 1 1 1 1 8 1 1 7 1 1 8
# 1 1 8 1 1 8 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8
# 1 1 1 1 1 8 1 1 1 1 1 8 1 1 8 1 1 8 1 1 8 1 1 1 1 1 1 7 1 8
#
# (b) 6.3447+1.313+0.2868+0.0496+0.0056+0.0003+0+0/8 = 8/8 = 1.0
# (c) The distribution with alpha = 0.13 is highly skewed.
----------------------------------------------------------------------
2. Assume that the frequencies of the four nucleotides are T: 0.1, A:
0.1, G: 0.4 and C: 0.4. We will assume that these are the
equilibrium frequencies of the nucleotides such that given a large
amount of mutation the frequencies would converge to these values (some
thermophilic bacteria have a highly skewed GC content apparently to
aid their adaptation to these extreme temperatures: presumably the
triple bonds of G/C are better than the double bounds of A/T).
a) If two sequences in these two beasts diverged for an infinite
amount of time, what would be the raw sequence distance expected
(i.e. a count as in D=k/n)?
b) What would this be normally in the absence of a GC content bias?
c) What would Jukes-Cantor (JC) report as the degree of divergence
for this bias and to what value would JC approach as the
estimated divergence in the absence of this bias?
d) Hence, JC should be corrected for ...?
# a) If the two sequences are random then you can calculate that if you
# have X in one sequence the choice in the second sequence is
# random but constrained by the frequencies. So an A across from a T
# should occur with pure random chance as 0.1 * 0.1. An A across
# form a G with 0.1 * 0.4 ... and so on. Keep going.
#
#
# You will have identical A's, 0.1*0.1; identical T's, 0.1*0.1;
# identical C's 0.4*0.4; identical G's 0.4*0.4. So total identical is
# .01+0.01+0.16+0.16 = 0.34. So total different is 0.66.
#
# D = 0.66
#
# b) D = 0.75.
#
# c) K = -(3/4) ln(1-(4/3)D)
# = 1.59
# otherwise K -> infinity
#
# d) Hence all measures of distance should (and usually are) corrected
# for base frequency bias.
----------------------------------------------------------------------
3. Seq1 VWEDNWDDD
Seq2 VSEDNRDDD
Seq3 VWDDNWDED
a) Calculate pairwise alignment scores between all the sequences using the scoring method: match = 1, mismatch = 0.
# (This is the scoring scheme for the simplest identity matrix)
# seq1 vs seq2 = 7
# seq1 vs seq3 = 7
# seq2 vs seq3 = 5
b) Is Seq1 more similar to Seq2 or Seq3?
# seq1 is equally similar to seq2 as it is to seq3, because both
# seq2 and seq3 have 2 mismatches compared to seq1.
c) What happens if you use another scoring scheme, with a different value for a match and a different value for a mismatch?
# This does not depend on the exact scores in the identity matrix, e.g.
# if you assume match = 2, mismatch = -2, then seq1 is still equally
# similar to seq2 as it is to seq3.
d) Calculate the pairwise alignment scores between all the sequences using the BLOSUM62 matrix. (This is found in Table 7.2 on page 134 of the course textbook)
#Score of similarity between 1 and 2 = 4 - 3 + 5 + 6 + 6 - 3 + 6 + 6 + 6 = 33
#Score of similarity between 1 and 3 = 4 + 11 + 2 + 6 + 6 + 11 + 6 + 2 + 6 = 54
#Score of similarity between 2 and 3 = 4 - 3 + 2 + 6 + 6 - 3 + 6 + 2 + 6 = 26
e) Is Seq1 more similar to Seq2 or Seq3? (use the scores you calculated in d))
#Sequence 1 is more similar to sequence 3 than to sequence 2.
f) Explain which of these substitution matrices you would use to obtain a biologically meaningful similarity score?
# The BLOSUM62 matrix contains substitution scores derived from observed
# substitution rates in blocks of well-aligned homologs. These sequences
# were shaped by evolution itself, so is a biologically meaningful
# measure of protein sequence similarity
----------------------------------------------------------------------
4. a) When we include the Γ distribution with our molecular models - say JC69 + Γ - what are we exactly doing?
# We are controlling for variations in substitution rates between
# different sites due to codon positions, chromosomal location, and
# functional constraints. If it weren't included, it is likely that our
# calculated distances will be off from the "true" result.
b) Why would we want to control for Invariant sites (+ I)?
# Having a substantial number of Invariant sites may cause us to
# underestimate the total difference. Some sites are very likely to be
# conserved as they may contain deleterious SNPs that would otherwise
# prevent any reproductive success. Thus, by controlling for these
# invariant sites we can accurately measure the functional distance
# between two sequences.
----------------------------------------------------------------------
5. Let's build our own BLOSUM-Like Matrix! Section 7.2.2 in Chapter 8 of
the textbook is going to be useful. Also, you can round when
reporting, but don't round when carrying through to the next step.
To make this easier we'll use an Amino Acid Alphabet with only 5 residues:
B, J, O, U, Z (These and X are the only letters not already used)
Our (Fake) BLOCKS database has 3 conserved regions from 5 Species
>Region 1 Alignment
ZJJUUBUUJZ
ZJJUOBUUZZ
ZJJOUBUUJU
ZJJUUJOUJZ
ZJJUOBJUJZ
>Region 2 Alignment
OOBOZUZOBZ
ZOBBZUZOBZ
OOBUUUZOBZ
UOBUZUZOBZ
OOBOZUJOBZ
>Region 3 Alignment
BBJOOBBUZB
BBBOOOBUZB
BBJOOJBUUB
BBJOOBBOBB
JBJOOBBZZB
Below is the lower triangle of an incomplete, symmetric pair-count table from this database.
The value in the table is the number of times a pair occurs in the same site
in the same Block.
Ex. Counting ZU pairs
Z and U appear in the same site in R1-10, R2-1, R2-5, R3-8, and R3-9
In the 10th site of region 1 there is one U and 4 Z's,
this means there are 6 possible ZZ pairs: (1,2) (1,4) (1,5) (2,4) (2,5) (4,5),
4 possible ZU pairs: (1,3) (2,3) (4,3) (5,3), and no UU pairs.
Site 1 of Region 2 has one ZU Pair, 3 OZ's 3 OU's, and 3 OO's
R2-4 is the same as R1-10: 6 ZZ, 4 ZU
R3-8 has 3U's and 1 Z and 1 O, therefore: 3 ZU, 3 UU, 3 OU, and 1 OZ
R3-9: 3 ZZ, 3 ZU, 3 ZB, and 1 BU
Adding up the ZU's we get 4 + 1 + 4 + 3 + 3 = 15
If you're partial to math In a given site in a region The count for a
particular pair (f_ij) is n_i Choose 2 if i == j or n_i * n_j if i != j
Ex R1-10 has 4 Z's and 1 U. 4Z Choose 2 is 6ZZ, and 1U * 4Z is 4UZ.
B J O U Z
B -
J 15 32
O 5 2 45
U 3 3 - 36
Z 3 8 4 15 41
a) What values should be in the table for BB and OU?
# f_BB = 65 f_OU = 23
b) The frequency for a particular pair (q_ij) is the count for that pair divided by the
sum of counts across all pairs. Calculate the entire Frequency table (Hint the table should sum to 1).
For Marks: What is the frequency for ZZ and OU?
#
# Total Count = 300
# q_ZZ = 41/300 OR 0.137 q_OU = 23/300 OR 0.077
#
# q_ij = f_ij / 300
# Frequency Table
# B J O U Z
# B 0.217
# J 0.05 0.107
# O 0.017 0.007 0.15
# U 0.01 0.01 0.077 0.12
# Z 0.01 0.027 0.013 0.05 0.137
c) What is the frequency of each amino acid in all pairs?
In other words, if you took a random pair, what is the chances it would contain B for example? (Hint: They should sum to 1)
# p_i = q_ii + 0.5*sum(q_ij) for all j != i
#
# B = 13/50 OR 0.26
# J = 23/150 OR 0.153
# O = 31/150 OR 0.207
# U = 29/150 OR 0.193
# Z = 28/150 OR 0.187
d) Calculate the table of expected frequencies for each pair. (Hint the table should sum to 1)
For Marks: What is the expected frequency of BB and BO?
#
# e_BB = (13/50)^2 = 169/2500 OR 0.0676
# e_BO = 2 * 13/50 * 31/150 = 403/3750 = 0.10747
#
# e_ij = p_i^2 if i=j OR 2*p_i*p_j otherwise
# Expected Frequency Table
# B J O U Z
#B 0.0676
#J 0.07973 0.02351
#O 0.10747 0.06338 0.04271
#U 0.10053 0.05929 0.07991 0.03738
#Z 0.09707 0.05724 0.07716 0.07218 0.03484
e) Calculate the log odds ('lod') matrix. Congratz! You made a BLOSUM-Like Matrix.
For Marks: Use the matrix to score the following alignment using 'lod' rounded to nearest integer.
BOUZJOUZZ
ZOUBJUUZU
# Odds_ij = q_ij / e_ij
# logodds_ij = s_ij = 2*log_2(Odds_ij)
# = 2*log_2(q_ij / e_ij)
#
# Log-odds Matrix
# B J O U Z
#B 3.4
#J -1.3 4.4
#O -5.4 -6.5 3.6
#U -6.7 -5.1 -0.1 3.4
#Z -6.6 -2.2 -5.1 -1.1 3.9
#
# Alignment Score = -7 + 4 + 3 + -7 + 4 - 0 + 3 + 4 -1 = 3
----------------------------------------------------------------------