Answers are due in pdf format to AVENUE anytime before Friday November
22, 5:00pm. Assignments will not be accepted after this time.
----------------------------------------------------------------------
1. Here is an unrooted tree ((A:p,B:q):q,C:p,D:q);
or if you prefer
A     C
 \___/
 /   \
B     D
with the branches leading to A and to C of length p, and the branches
leading to B, to D, and between the two groups of length q.
In our case we will assume that p=0.5 and q=0.1.
Then the phylogenetically informative patterns and their probabilities
of occurrence are
ABCD
----
xxyy 0.061
xyxy 0.189
xyyx 0.061
(from Felsenstein).
As a general rule, for any statistical test as you get more and
more data, the results _should_ become more and more certain.
If there are 100bp of sequence from each species, which topology
would parsimony prefer? If there are 10,000bp analyzed? If you had
huge amounts of data to analyze this question, say 100 million bp
analyzed (a huge amount of work and data), which tree would parsimony
prefer? Is this the correct tree?
#
# In each case the tree ((A,C),B,D) would be preferred. The correct
# tree is ((A,B),C,D).
#
# This perverse property is called inconsistency: given more and more
# data you become more and more certain of the wrong tree.
#
# Parsimony is inconsistent for many instances of the problem of 'long
# branch attraction' (aka 'The Felsenstein Zone')
#
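# A minimal sketch of the reasoning above, assuming each informative
# pattern supports exactly one topology (xxyy -> AB, xyxy -> AC,
# xyyx -> AD) and using only the probabilities quoted in the question.
# The names pattern_probs and expected_support are illustrative.

```python
# Site-pattern probabilities quoted in the question (from Felsenstein).
# Each informative pattern supports exactly one unrooted topology.
pattern_probs = {
    "xxyy": ("((A,B),C,D)", 0.061),
    "xyxy": ("((A,C),B,D)", 0.189),
    "xyyx": ("((A,D),B,C)", 0.061),
}


def expected_support(n_sites):
    """Expected number of informative sites supporting each topology."""
    return {tree: prob * n_sites for tree, prob in pattern_probs.values()}


for n_sites in (100, 10_000, 100_000_000):
    support = expected_support(n_sites)
    # Parsimony prefers the topology supported by the most informative sites.
    preferred = max(support, key=support.get)
    print(n_sites, preferred)
```

# The gap between the expected counts grows in proportion to the number
# of sites, so more data only strengthens the preference for ((A,C),B,D).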
----------------------------------------------------------------------
2. We are going to build a consensus tree by hand! The following table
represents the frequency of tree splits of certain taxa from 6
different tree topologies:
Topology/ 1 2 3 4 5 6
Grouping
FG | ABCDE * * * * * * 6
BC | ADEFG * * * * * * 6
DE | ABCFG * * * * * 5
ABC | DEFG * * * * * 5
ADE | BCFG * 1
DFG | ABCE * 1
For example, the first row of the table tells you that you see the
grouping of taxa (F,G) and (A,B,C,D,E) in all 6 tree topologies.
For this set of 6 trees, give the definition and draw:
a) the strict consensus tree (use a newick format for the tree).
# consists of all groups that occurred 100% of the time (occurred in
# all the trees), the rest being ignored
# ((B,C),(F,G),E,A,D);
b) 50%/Majority Rule consensus tree (at each fork in the tree,
please write the number indicating how many times the group which
consists of the taxa to the left of the fork occurred; similar to
how branch length would be recorded in a newick formatted tree).
# consists of all groups that occur more than 50% of the time
# ((A,(B,C):6):5, (F,G):6, (D,E):5);
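# The two definitions can be sketched as filters over the split counts
# in the table. This is only an illustration: each split is identified
# here by its smaller side, and the names split_counts,
# strict_consensus and majority_rule are hypothetical.

```python
# Split counts copied from the table (number of trees, out of 6,
# in which each grouping appears). Each split is keyed by its smaller side.
split_counts = {
    ("F", "G"): 6,
    ("B", "C"): 6,
    ("D", "E"): 5,
    ("A", "B", "C"): 5,
    ("A", "D", "E"): 1,
    ("D", "F", "G"): 1,
}
n_trees = 6


def strict_consensus(counts, n_trees):
    """Keep only the groups found in every tree."""
    return [split for split, count in counts.items() if count == n_trees]


def majority_rule(counts, n_trees):
    """Keep the groups found in more than 50% of the trees."""
    return [split for split, count in counts.items() if count > n_trees / 2]


print(strict_consensus(split_counts, n_trees))
print(majority_rule(split_counts, n_trees))
```

# The strict filter keeps (F,G) and (B,C); the majority-rule filter also
# keeps (D,E) and (A,B,C), which are the groups in the trees above.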
----------------------------------------------------------------------
3. The invariant method is a way of constructing trees based on patterns which
are not a function of branch lengths. Let's try it! Below is an alignment from 4
species.
Pos  1   5   10   15   20
Seq1 GAGGGGTCCTAATGTGGCAC
Seq2 TTAAGCGAGACAACGTCCGC
Seq3 ATTCTACCCTATCTCGGATG
Seq4 CACCCTAAAACCAGATCGCG
a) We can assign the nucleotides to the letters X, Y, Z, and W in the following
manner:
The nucleotide in sequence 1 is assigned as 'X'.
For sequences 2,3, and 4 they are assigned relative to Seq1.
If SeqN matches Seq1 then it is also an 'X'.
Else If the nucleotide in SeqN can be obtained from that in Seq1
by a transition then it is assigned 'Y'.
Else If the nucleotide in SeqN can be obtained from that in Seq1
by a transversion it is either 'Z' or 'W'.
If 'Z' hasn't been assigned yet, 'Z' is assigned to the nucleotide in SeqN.
Else If SeqN matches 'Z', then it is a 'Z'.
Else 'W' is assigned.
Examples:
In position 1 we have G,T,A, and C.
G is in sequence one and is assigned X.
T is obtainable by a transversion of G, and we don't have a Z yet, so it is assigned Z.
C is also a transversion of G, but we have a Z and T != C, so it is assigned W.
A is a transition of G, so it is assigned Y
The pattern for Position 1 is XZYW (codes listed in the order Seq1 to Seq4).
In position 10 we have T,A,T, and A.
T in Seq 1 and 3 is assigned X,
A in Seq 2 and 4 is assigned Z.
The pattern For Position 10 is XZXZ.
For Marks: Complete the following table of Pattern Counts
XXZZ 1    XZXZ -    XZZX 1
XYZW 2    XZYW 4    XZWY 0
XXZW 3    XZXW 1    XZWX 1
XYZZ -    XZYZ 1    XZZY -
# Coded Alignment
# Position                  1       5        10        15        20
# Seq1 GAGGGGTCCTAATGTGGCAC X X X X X X X X X X X X X X X X X X X X
# Seq2 TTAAGCGAGACAACGTCCGC Z Z Y Y X Z Z Z Z Z Z X Z Z Z Z Z X Y X
# Seq3 ATTCTACCCTATCTCGGATG Y Z Z Z Z Y Y X X X X Z Y W Y X X Z Z Z
# Seq4 CACCCTAAAACCAGATCGCG W X W Z W W W Z W Z Z W Z X W Z Z W W Z
#
# XYZZ Occurs Once
# XZXZ Occurs 5 times
# XZZY Never Occurs
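# The coding rules from part a) can be sketched as follows. This is an
# illustrative implementation, not part of the assignment; the names
# TRANSITIONS and code_column are hypothetical.

```python
from collections import Counter

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def code_column(column):
    """Code one alignment column (Seq1..Seq4) relative to the Seq1 base."""
    ref = column[0]
    z_base = None  # the first transversion base seen becomes 'Z'
    codes = []
    for base in column:
        if base == ref:
            codes.append("X")
        elif (ref, base) in TRANSITIONS:
            codes.append("Y")
        elif z_base is None or base == z_base:
            z_base = base
            codes.append("Z")
        else:
            codes.append("W")
    return "".join(codes)


alignment = [
    "GAGGGGTCCTAATGTGGCAC",  # Seq1
    "TTAAGCGAGACAACGTCCGC",  # Seq2
    "ATTCTACCCTATCTCGGATG",  # Seq3
    "CACCCTAAAACCAGATCGCG",  # Seq4
]
patterns = [code_column(column) for column in zip(*alignment)]
counts = Counter(patterns)

print(patterns[0], patterns[9])  # positions 1 and 10
print(counts["XYZZ"], counts["XZXZ"], counts["XZZY"])
```

# A Counter returns 0 for a pattern that never occurs, so the three
# missing table entries can be read off directly.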
b) An invariant is a value one can calculate which is constant for a particular
property regardless of other properties. In this case we want a value which
does not change for incorrect topologies regardless of sequence. Lake in 1987
found a set of invariants which are zero for incorrect topologies, and
non-zero for correct topologies (with caveats, of course). In general these
invariants are the number of sites supporting the topology minus the number of
sites conflicting with the topology. There is an invariant
for each of the 3 possible topologies with 4 species.
Topology1 ((1,2),(3,4)) -> (XXZZ + XYZW) - (XXZW + XYZZ)
Topology2 ((1,3),(2,4)) -> (XZXZ + XZYW) - (XZXW + XZYZ)
Topology3 ((1,4),(2,3)) -> (XZZX + XZWY) - (XZWX + XZZY)
For Marks: Calculate the Invariant for each of the topologies
# Topology1: (1 + 2) - (3 + 1) = -1
# Topology2: (5 + 4) - (1 + 1) = 7
# Topology3: (1 + 0) - (1 + 0) = 0
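# The same arithmetic as a short sketch, using the completed pattern
# counts. The dictionary layout and the name invariant_terms are
# illustrative choices, not part of Lake's method.

```python
# Completed pattern counts for the 20-site alignment.
counts = {
    "XXZZ": 1, "XZXZ": 5, "XZZX": 1,
    "XYZW": 2, "XZYW": 4, "XZWY": 0,
    "XXZW": 3, "XZXW": 1, "XZWX": 1,
    "XYZZ": 1, "XZYZ": 1, "XZZY": 0,
}

# For each topology: (patterns in favour, patterns against).
invariant_terms = {
    "Topology1": (("XXZZ", "XYZW"), ("XXZW", "XYZZ")),
    "Topology2": (("XZXZ", "XZYW"), ("XZXW", "XZYZ")),
    "Topology3": (("XZZX", "XZWY"), ("XZWX", "XZZY")),
}

invariants = {}
for topology, (favour, against) in invariant_terms.items():
    in_favour = sum(counts[p] for p in favour)
    opposed = sum(counts[p] for p in against)
    invariants[topology] = in_favour - opposed

print(invariants)
```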
c) With large numbers of sites and randomness it is unlikely that the
invariants will exactly equal 0, so a Chi-Squared test is done to test if the
invariant is significantly different from 0.
In this case X^2 = (P - B)^2 / (P + B), where
P = count in favour, e.g. (XXZZ + XYZW) for Topology1, and
B = count against, e.g. (XZWX + XZZY) for Topology3.
This simplifies to X^2 = (Invariant)^2 / (sum of the four pattern counts for that topology).
For Marks: Using a Chi-Squared test with 1 degree of freedom, what is the
chance that the invariant for each topology is actually equal to 0?
Hint: you will have to find a Chi-Squared distribution table on the internet.
# Topology1: X^2 = (-1)^2/7 = 0.143, Chi-Squared test: 70.5%
# Topology2: X^2 = (7)^2/11 = 4.455, Chi-Squared test: 3.5%
# Topology3: X^2 = (0)^2/2 = 0, Chi-Squared test: 100%
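# Instead of a table, the p-value for 1 degree of freedom can be
# computed directly, since the upper tail of a Chi-Squared distribution
# with 1 df equals erfc(sqrt(X^2 / 2)). A minimal sketch using only the
# standard library; the function name chi_square_p_value is hypothetical.

```python
import math


def chi_square_p_value(favour, against):
    """Return (X^2, p) for the test that favour - against is really 0 (1 df)."""
    total = favour + against
    if total == 0:
        return 0.0, 1.0  # no sites at all: nothing to reject
    x2 = (favour - against) ** 2 / total
    # Upper-tail probability of a Chi-Squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p


# (count in favour, count against) for each topology.
tests = {"Topology1": (3, 4), "Topology2": (9, 2), "Topology3": (1, 1)}
for topology, (favour, against) in tests.items():
    x2, p = chi_square_p_value(favour, against)
    print(f"{topology}: X^2 = {x2:.3f}, p = {p:.3f}")
```

# Only Topology2 has a p-value below 0.05, so it is the only topology
# whose invariant differs significantly from 0.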
d) Draw the tree which is best supported by the invariant method.
# ((1,3),(2,4));
# 1     2
#  \___/
#  /   \
# 3     4
e) All of the 12 patterns used in these invariants include transversions, but
none consists only of transitions. How might this affect the performance of
this method?
# Any of the following:
# This method cannot work in the situation where two organisms are very closely
# related and only differ by transitions
# This method will underestimate transition based differences
# If the 4 rates of transversions aren't equal the results may be biased.
----------------------------------------------------------------------
4. An unrooted phylogenetic tree of the COX genes was constructed by the neighbour-joining method and is found as an attached diagram (this tree is adapted from (Zou et al. 1999)).
a) Write the Newick tree format for this tree.
# a) Note: there are a few possibilities for drawing this unrooted tree. Here are just two:
#(((((RAT-2, MOU-2), HUM-2), CHI-2), TRO-2), ((HUM-1, SHE-1),MOU-1));
#(((((((SHE-1, HUM-1),MOU-1), TRO-2),CHI-2),HUM-2),MOU-2),RAT-2);
b) Draw this unrooted tree in rooted form. Use your biological knowledge to decide where the root can be. Make sure that your tree is the most parsimonious one.
# in Newick format:
# ((((HUM-1,SHE-1),MOU-1),(((RAT-2,MOU-2),HUM-2),CHI-2)),TRO-2)
c) Clearly indicate on the tree you drew in part b) where possible gene duplications and gene losses have occurred.
# Deletions occurred in sheep-COX2, rat-COX1, and chicken-COX1 (the genes missing from the tree; rat-COX2 is present as RAT-2).
# A gene duplication event likely occurred at the ancestral node of the (mouse-COX1, human-COX1, sheep-COX1) clade and the (chick-COX2, human-COX2, rat-COX2, mouse-COX2) clade.
# ((((HUM-1,SHE-1),MOU-1),(((RAT-2,MOU-2),HUM-2),CHI-2))gene_duplication_at_this_node,TRO-2)
#see attached answer figure A for a visual. gene duplication is denoted by a black square
We perform 100 bootstrap experiments and find that in 25 of them the Human-COX1 and Mouse-COX1 genes form a clade on their own.
d) Based on this, add a single bootstrap value to the COX gene tree you drew in part b)
# The bootstrap value of 75 would be located at the ancestral node of sheep-COX1 and human-COX1.
# ((((HUM-1,SHE-1)75,MOU-1),(((RAT-2,MOU-2),HUM-2),CHI-2)),TRO-2)
#see attached answer figure A for a visual. Bootstrap value is 75.
In a more recent study, the ortholog of the Human-COX1 gene was found in trout: Trout-COX1.
e) Does this finding change any of your answers in a), b), and c)? What would the new answers be (draw these on a new rooted tree)?
# This finding suggests that the gene duplication occurred earlier, but the deletions do not change. Most likely the bootstrap value would not change either. The most parsimonious tree is displayed in the attached answer figure B. gene duplication is denoted as a black square and the bootstrap value is 75.
# (((HUM-1,SHE-1),MOU-1,TRO-1),(((RAT-2,MOU-2),HUM-2),CHI-2,TRO-2)gene_duplication_at_this_node)
----------------------------------------------------------------------
5. a) Parsimony is another way to create a phylogenetic tree. In two sentences
explain the process of creating such a tree and what it represents.
# A parsimonious tree is built by arranging the tips in such a way that
# it represents the fewest number of evolutionary changes. This tree
# represents what would happen if nature optimized its changes.
b) Build a parsimonious tree using the following four sequences
Seq1 ATTGTATCCCA
Seq2 ATTGAATCCGA
Seq3 ATTGAAACCGA
Seq4 ATTGTATCCCG
#The informative sites
#
#Seq1 TC
#Seq2 AG
#Seq3 AG
#Seq4 TC
#
# Number of changes at the informative sites: 2
#
# The Tree
#
# Seq1       Seq2
#     \__2__/
#     /     \
# Seq4       Seq3
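#
# The answer can be checked with a short sketch that finds the
# informative sites and scores the three possible unrooted topologies
# with Fitch's counting rule. This is an illustration only; the
# function names is_informative and fitch_changes are hypothetical.

```python
seqs = {
    "Seq1": "ATTGTATCCCA",
    "Seq2": "ATTGAATCCGA",
    "Seq3": "ATTGAAACCGA",
    "Seq4": "ATTGTATCCCG",
}


def is_informative(column):
    """Informative if at least two states each occur in at least two taxa."""
    return sum(1 for base in set(column) if column.count(base) >= 2) >= 2


def fitch_changes(pair_a, pair_b, site):
    """Fitch count of changes at one site for the unrooted tree (pair_a | pair_b)."""
    changes = 0
    node_sets = []
    for left, right in (pair_a, pair_b):
        left_set, right_set = {seqs[left][site]}, {seqs[right][site]}
        shared = left_set & right_set
        if shared:
            node_sets.append(shared)
        else:
            node_sets.append(left_set | right_set)
            changes += 1  # the two tips in this pair disagree
    if not node_sets[0] & node_sets[1]:
        changes += 1  # a change on the internal branch
    return changes


columns = list(zip(*seqs.values()))
informative = [i for i, column in enumerate(columns) if is_informative(column)]

topologies = [
    (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    (("Seq1", "Seq4"), ("Seq2", "Seq3")),
]
scores = {
    topology: sum(fitch_changes(*topology, site) for site in informative)
    for topology in topologies
}
best = min(scores, key=scores.get)
print(informative, best, scores[best])
```

# Sites are indexed from 0 in the sketch, so the informative sites
# reported are alignment positions 5 and 10.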
----------------------------------------------------------------------