Phylogenetic Trees
• Bioinformatics is the research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those to acquire, store,
organize, analyze, or visualize such data.
• Objectives of molecular phylogeny:
• Reconstruct the correct evolutionary relationships among biological entities
• Estimate the time of divergence between biological entities
• Chronicle the sequence of events along evolutionary lineages
• In any molecular phylogenetic reconstruction the following 4 points need to be addressed.
1. Molecular sequences
2. Sequence alignment is the essential preliminary to tree reconstruction
3. Converting the alignment data into a phylogenetic tree
4. Assessing accuracy of a reconstructed tree
• Evolutionary relationships are illustrated by means of a phylogenetic tree or a dendrogram.
• Genetic distance is the number of mutation/evolutionary events between species since their divergence.
• Phylogenetic tree is a branching diagram showing relationships between species (or higher taxa) based on
shared common ancestry.
• Constructed by studying organismal features (=characters) that vary among species.
• It is based on the assumption that if two organisms possess similar characters, it is likely that they inherited
these features from a common ancestor. The likeness of organisms due to shared ancestry is called
homology.
Not all likeness is inherited from a common ancestor
• The acquisition of similar characteristics by species from different evolutionary branches, with natural
selection shaping analogous adaptations to similar ecological roles, is called convergent evolution.
• Similarity due to convergent evolution is termed non-homologous similarity or homoplasy.
• Phylogeny helps us to understand and classify the diversity of life on earth.
• It also enables us to test evolutionary hypotheses:
• trait evolution
• coevolution
• mode and pattern of speciation
• correlated trait evolution
• biogeography
• geographic origins
• age of different taxa
• nature of molecular evolution
• disease epidemiology
Three main approaches:
1. Phenetic (or numerical taxonomy):
• Based upon the degree of overall similarity (phenotype) between organisms, using as many
characteristics of the organisms being classified as possible.
• Each organism is then compared with every other for all characters measured, and the number of
similarities (or differences) is calculated.
• The organisms are then clustered in such a way that the most similar are grouped close together and the
more different ones are linked more distantly.
• Pheneticists do not seek to make the distinction between homologous and analogous similarities.
• A phenogram (=branching diagram) of species is then constructed using these distance measures.
2. Cladistic (or phylogenetic):
• Based on the assumption that members of a group share a common evolutionary history, and are thus
more "closely related" to one another than they are to other groups of organisms.
• A phylogenetic hypothesis that attempts to reconstruct the course of evolution by grouping organisms
relative to a common ancestor.
• Branching diagrams called cladograms - these are hypotheses about the relationship among taxa based on
certain shared characteristics
3. Evolutionary systematics:
• Based strongly on evolutionary relationships.
• Suggest that the degree of genetic differences between lineages should be used in addition to their
genealogical (evolutionary) similarities when developing taxonomic classifications.
• The branching diagrams prepared by evolutionary systematists are called phylogenetic trees.
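The phenetic comparison described under approach 1 can be sketched in a few lines of Python. This is only a minimal illustration: the function name count_differences, the three taxa and the 0/1 character table are hypothetical, and every character is given equal weight.

```python
from itertools import combinations

def count_differences(characters):
    """Count, for every pair of taxa, the characters in which they differ."""
    return {
        (a, b): sum(x != y for x, y in zip(characters[a], characters[b]))
        for a, b in combinations(characters, 2)
    }

# Hypothetical characters, in order: wings, feathers, fur, warm-blooded
traits = {
    "sparrow": (1, 1, 0, 1),
    "bat":     (1, 0, 1, 1),
    "mouse":   (0, 0, 1, 1),
}

differences = count_differences(traits)
for pair, n_diff in differences.items():
    print(pair, n_diff)
```

In this toy table the wings of the sparrow and the bat count as a similarity even though they are analogous, which is the distinction a purely phenetic comparison does not make.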
The most popular and frequently used methods of tree building can be classified into two major categories:
A. Phenetic methods based on distances: measures the pair-wise distance/dissimilarity between two genes,
the actual size of which depends on different definitions, and constructs the tree totally from the resultant
distance matrix.
• Distance-based methods are more rapid and less computationally intensive than character based methods,
but the actual characters are discarded once the distance matrix is derived.
1. Unweighted pair group method with arithmetic mean (UPGMA)
2. Neighbor joining (NJ) (Saitou & Nei 1987)
3. Weighted neighbor joining (Weighbor)
4. Fitch-Margoliash (FM) and minimum evolution (ME) methods
• UPGMA and NJ are relatively fast and are suitable for analyzing large data sets that are not very strongly
similar.
B. Cladistic methods based on characters: evaluate all possible trees and seek the one that optimizes a chosen
criterion, such as the fewest evolutionary changes or the highest likelihood.
• Make use of all known evolutionary information, i.e. the individual substitutions among the sequences, to
determine the most likely ancestral relationships.
1. Maximum parsimony (MP)
2. Maximum likelihood (ML)
• Both methods assume that a set of sequences descended from a common ancestor by mutation and selection
processes, without hybridization or other horizontal gene transfer.
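As an illustration of the distance-based category, UPGMA (the first method listed under A) can be sketched as repeated merging of the two closest clusters, with distances to the new cluster averaged in proportion to cluster size. The function name upgma, the four taxa and the distance values are hypothetical, and the sketch does no tie-breaking beyond taking the first minimum.

```python
from itertools import combinations

def upgma(names, dist):
    """dist maps frozenset({a, b}) -> distance. Returns a list of (cluster, height) merges."""
    d = dict(dist)
    clusters = {name: 1 for name in names}  # cluster -> number of taxa it contains
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        merges.append((new, d[frozenset((a, b))] / 2))  # node height = half the distance
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted arithmetic mean.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
    return merges

# Hypothetical pairwise distances between four taxa
pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 8,
         ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = upgma(["A", "B", "C", "D"], dist)
for cluster, height in merges:
    print(cluster, height)
```

The last merge holds the whole tree as nested tuples. Because only the distance matrix is passed in, the sketch also shows why the individual characters play no further role once the matrix has been derived.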
B. MSA
1. Load notepad file containing sequences in FASTA format.
2. Select .phy and .aln as output formats.
3. Remove gap-only columns.
4. Do complete alignment.
5. Name as infile.phy
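To make the file handling in these steps concrete, the sketch below removes gap-only columns from an alignment and writes it in a strict sequential PHYLIP layout (a header line with the number of sequences and sites, then a 10-character name followed by the sequence). The function names and the two toy sequences are hypothetical, and the file produced by the alignment server may be interleaved rather than sequential.

```python
def drop_gap_only_columns(seqs, gap="-"):
    """Remove alignment columns in which every sequence has a gap."""
    names = list(seqs)
    columns = [col for col in zip(*seqs.values()) if set(col) != {gap}]
    return {name: "".join(col[i] for col in columns) for i, name in enumerate(names)}

def to_phylip(seqs):
    """Format aligned sequences as sequential PHYLIP text."""
    n_sites = len(next(iter(seqs.values())))
    lines = [f" {len(seqs)} {n_sites}"]
    for name, seq in seqs.items():
        # PHYLIP names are padded (or cut) to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

aligned = {"seq1": "AC-GT", "seq2": "A--GT"}  # toy alignment
cleaned = drop_gap_only_columns(aligned)
phylip_text = to_phylip(cleaned)
print(phylip_text)
```

Writing phylip_text to a file named infile.phy would give a file of the kind expected in the tree construction steps.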
A multiple sequence alignment is a collection of three or more sequences (protein or nucleic acid) that are partially
or completely aligned.
• Homologous residues are aligned in columns across the length of the sequences. These aligned residues are
homologous in an evolutionary sense: they are presumably derived from a common ancestor.
• The residues in each column are also presumed to be homologous in a structural sense: aligned residues
tend to occupy corresponding positions in the three-dimensional structure of each aligned protein.
• There is a unique advantage of multiple sequence alignment because it reveals more biological information
than many pair-wise alignments can.
o Allows the identification of conserved sequence patterns and motifs in the whole sequence family,
which are not obvious to detect by comparing only two sequences.
o Many conserved and functionally critical amino acid residues can be identified in a protein multiple
alignment.
o Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of
sequence families and prediction of protein secondary and tertiary structures.
o Multiple sequence alignment also has applications in designing degenerate polymerase chain reaction
(PCR) primers based on multiple related sequences.
• The most popular web-based program for performing progressive multiple sequence alignment is ClustalW.
• The ClustalW algorithm proceeds in three stages.
o In stage 1, the global alignment approach is used to create pairwise alignments of every sequence that
is to be included in a multiple sequence alignment.
o In the second stage, a guide tree is calculated from the distance (or similarity) matrix. There are two
principal ways to construct a guide tree: the unweighted pair group method of arithmetic averages
(UPGMA) and the neighbor-joining method.
o The two main features of a tree are its topology (branching order) and branch lengths (which can be
drawn so that they are proportional to evolutionary distance). Thus, the tree reflects the relatedness
of all the proteins to be multiply aligned. Guide trees are usually not considered true phylogenetic
trees, but instead are templates used in the third stage of ClustalW to define the order in which
sequences are added to a multiple alignment. A guide tree is estimated from a distance matrix based
on the percent identities between sequences you are aligning. In contrast, a phylogenetic tree almost
always includes a model to account for multiple substitutions that commonly occur at the position of
aligned amino acids (or nucleotides).
o In stage 3, the multiple sequence alignment is created in a series of steps based on the order
presented in the guide tree. The algorithm first selects the two most closely related sequences from
the guide tree and creates a pair-wise alignment. These two sequences appear at the terminal nodes
of the tree, that is, the locations of extant sequences.
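The contrast drawn above between a guide tree (based on percent identity) and a phylogenetic tree (based on a model that accounts for multiple substitutions) can be illustrated with two distance functions. The Jukes-Cantor correction is used here only as one simple example of such a model; it is an assumption of this sketch, not something the ClustalW stages themselves prescribe, and the function names and sequences are hypothetical.

```python
import math

def p_distance(seq1, seq2, gap="-"):
    """Proportion of aligned, ungapped positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotides; only defined for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGT", "ACGTACGA")  # one difference in eight positions
d = jukes_cantor(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as the observed proportion of differences, because some positions may have changed more than once.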
By default, the degree of conservation observed in each column of the alignment is displayed using the
following symbols:
1. "*" means that the residues or nucleotides in that column are identical in all sequences in the
alignment.
2. ":" means that conserved substitutions have been observed, according to the color table.
3. "." means that semi-conserved substitutions are observed.
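The first of these symbols is easy to reproduce. The sketch below marks fully identical columns with "*" and leaves every other column blank; it deliberately does not attempt the ":" and "." symbols, which depend on the substitution groups used by the alignment program. The function name identity_line and the toy sequences are hypothetical.

```python
def identity_line(seqs, gap="-"):
    """Return a string with '*' under columns that are identical in all sequences."""
    marks = []
    for col in zip(*seqs):
        identical = len(set(col)) == 1 and col[0] != gap
        marks.append("*" if identical else " ")
    return "".join(marks)

toy_alignment = ["ACGT", "ACCT", "AC-T"]
line = identity_line(toy_alignment)
for seq in toy_alignment:
    print(seq)
print(line)
```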
TRIMMING
1. Open the .aln text file and remove the peripheral empty columns from the N- and C-terminal ends of the sequences.
2. Save the file and then repeat the MSA steps.
3. The file in .phy format is then used for the next step.
Note:
• This step can be skipped if the retrieved sequences are of similar length and good quality.
• Do only external trimming if necessary.
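External trimming can also be sketched programmatically. The version below assumes that "peripheral empty columns" means the columns at either end in which at least one sequence has a gap; that reading, the function name trim_ends and the toy alignment are assumptions of this sketch. Internal gaps are left untouched.

```python
def trim_ends(seqs, gap="-"):
    """Cut leading and trailing columns that contain a gap in any sequence."""
    columns = list(zip(*seqs.values()))
    gap_free = [i for i, col in enumerate(columns) if gap not in col]
    if not gap_free:
        return {name: "" for name in seqs}
    start, stop = gap_free[0], gap_free[-1] + 1
    return {name: seq[start:stop] for name, seq in seqs.items()}

untrimmed = {"a": "--ACGT-", "b": "TTAC-TA"}  # toy alignment
trimmed = trim_ends(untrimmed)
print(trimmed)
```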
C. TREE CONSTRUCTION
1. Copy infile.phy into the Phylip exe folder.
2. Open Seqboot and enter the file name, i.e., infile.phy.
3. Set the number of bootstrap replicates (press R and change to 1000) and press Y to accept the changes.
4. Rename the output file as infile.
5. Open dnadist. Press M to analyse multiple data sets, enter D (data sets), enter the number of data sets to be
analysed (100) and press Y.
6. Delete the old infile. Rename the output file as infile.
7. Open neighbor and enter the file name.
8. Define the outgroup by pressing O and entering the sequence number. Press M and change to 100. Press Y to accept.
9. An outfile and an outtree will be generated. Delete outfile and infile. Rename outtree as intree.
10. Open consense and enter the file name intree. Define the outgroup by pressing O and entering the sequence number.
Press Y to accept.
11. Rename outtree as x.tre
Phylip (PHYlogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees)
created by Joseph Felsenstein. These programs are used in a sequential way in which the output from the first
program is used as an input for the next program just by renaming the files. The input for a program is taken from a
file called ‘infile’ and the results are written to a file called ‘outfile’. Some programs generate two output files:
outfile and outtree.
A. Character based:
1. Maximum parsimony method: It is a character-based method which infers a phylogenetic tree by
minimizing the total number of evolutionary steps or total tree length for a given set of data. It is also
referred to as sequence based tree reconstruction method.
2. Maximum likelihood method: Uses an explicit model of sequence evolution to find the tree that gives the
highest likelihood of the observed data.
B. Distance matrix based:
3. Neighbour joining: Evolutionary distances are calculated for all operational taxonomic units, and a tree is built
in which the distances between the operational taxonomic units match these distances. Distance methods
summarize the differences between sequences by calculating a pairwise distance measure between all
aligned sequences.
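The parsimony criterion described under the character-based methods (minimizing the total number of evolutionary steps) can be illustrated by counting changes on a fixed tree. The sketch below uses Fitch's counting procedure for one site at a time, which is one common way of doing this and is named here as an assumption rather than as the procedure any particular program follows. The taxon names t1 to t4, the trees and the alignment are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one site on a rooted binary tree.

    tree is a nested tuple of taxon names; states maps taxon name -> character.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed here
            return shared
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes

def tree_length(tree, alignment):
    """Sum the per-site change counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )

alignment = {"t1": "GGA", "t2": "GGA", "t3": "TGA", "t4": "TGT"}
tree_1 = (("t1", "t2"), ("t3", "t4"))
tree_2 = (("t1", "t3"), ("t2", "t4"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

Under the parsimony criterion, the tree with the smaller total length would be preferred for this toy data set.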
Neighbour joining method
• This algorithm does not assume a molecular clock and adjusts for rate variation among
branches.
• Steps:
o It begins with an unresolved star-like tree. Each pair is evaluated for being joined, and the sum of all
branch lengths of the resultant tree is calculated.
o The pair that yields the smallest sum is considered the closest neighbors and is thus joined.
o A new branch is inserted between them and the rest of the tree and the branch length is recalculated.
o This process is repeated until only one terminal is present.
• NJ method is comparatively rapid and generally gives better results than UPGMA method.
• But it produces only one tree and neglects other possible trees, which might be as good as NJ trees, if not
significantly better.
• Moreover, since errors in distance estimates are exponentially larger for longer distances, under some
conditions this method will yield a biased tree.
• Can deal with unequal mutation rates.
• Allows incorporation of evolutionary models
• The NJ tree is an approximation of the minimum evolution tree (the tree whose total branch length is
minimal).
• In that sense, the NJ method is very similar to parsimony because branch lengths represent substitutions.
• NJ always produces unrooted trees, which need to be rooted by the outgroup method.
• NJ performs well when substitution rates vary among lineages. Thus NJ should find the correct tree if
distances are well estimated.
• The phylogenetic information expressed by an unrooted tree resides entirely in its internal branches.
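The neighbour joining steps listed above can be sketched as follows. The pair-selection criterion and branch-length formulas follow the usual textbook formulation of the method; the function name neighbor_joining, the internal node labels and the five-taxon distance matrix are hypothetical, and ties are simply resolved by taking the first minimum.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """dist maps frozenset({a, b}) -> distance. Returns edges (node, node, length)."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimizes the total branch length of the tree.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        new = f"node{next_id}"
        next_id += 1
        edges += [(i, new, len_i), (j, new, len_j)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Hypothetical distances between five taxa
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(k): v for k, v in pairs.items()}

edges = neighbor_joining(["a", "b", "c", "d", "e"], dist)
for edge in edges:
    print(edge)
```

The two branches leading to a joined pair may have different lengths, which reflects the point that the method does not assume equal rates along all lineages. The result is a list of edges with no root, in line with NJ producing unrooted trees.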
Some of the programs in PHYLIP that can be used for molecular sequence data analysis are as follows:
1. Parsimony method programs
o dnapars: DNA parsimony
o dnapenny: DNA parsimony using branch-and-bound
o protpars: Protein parsimony
o pars: Discrete character parsimony
o mix: Mixed method parsimony
2. Maximum likelihood programs
o dnaml: DNA maximum likelihood without molecular clock
o dnamlk: DNA maximum likelihood with molecular clock
o proml: Protein maximum likelihood
3. Computation of distance
o dnadist: Aligned nucleic acid sequences
o protdist: Aligned protein sequences
4. Distance matrix method programs
o neighbor: Neighbor-joining or UPGMA
o fitch: Fitch-Margoliash and least-squares methods
o kitsch: Fitch-Margoliash and least-squares methods with molecular clock
5. Manipulation and visualization of phylogenetic tree
o drawtree: Draw a tree.
o drawgram: Draw a cladogram or a phenogram.
o consense: Consensus tree program.
o retree: Unroot a tree
• Consensus trees
This program constructs a consensus tree from multiple trees. For example, Dnapars can produce multiple trees,
which can be summarized by the program Consense. The results of bootstrapping are also summarized by the
program Consense as a majority-rule consensus tree.
• Seqboot: Reads in a data set, and produces multiple data sets from it by bootstrap resampling
• Resampling procedure
The idea behind resampling (bootstrapping and other methods) is to assess how reliable a tree we can produce
with the dataset at hand. Initially, the sequence alignment is analyzed in the usual way. Then, resampling proceeds
by first creating a number (100-10000) of random datasets from the original dataset. These random datasets are
analyzed in exactly the same way the original dataset was analyzed, and the results from the random datasets are
summarized by constructing a majority rule consensus tree (program Consense).
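The creation of resampled data sets can be sketched as sampling alignment columns with replacement, which is the bootstrap variant of the procedure described above. This is not the Seqboot implementation; the function name bootstrap_alignment, the taxon labels, the toy alignment and the use of fixed seeds are all assumptions made for illustration.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Build one pseudo-replicate by sampling alignment columns with replacement."""
    n_sites = len(next(iter(seqs.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every sequence, so columns stay intact.
    return {name: "".join(seq[i] for i in picks) for name, seq in seqs.items()}

alignment = {"taxonA": "ACGTAC", "taxonB": "ACGTTC", "taxonC": "ATGTTC"}

# 100 replicates, each the same size as the original alignment
replicates = [bootstrap_alignment(alignment, random.Random(seed)) for seed in range(100)]
print(replicates[0])
```

Each replicate would then be analysed in the same way as the original alignment, giving one tree per replicate.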
D. TREE VIEWING
1. Open file.
2. Click phylogram.
3. Click tree tab and show internal edge label.
4. Print file as pdf/save as graphic.
Bootstrapping is the generation of artificial data sets by random sampling, with replacement, from the actual data
set. Analyzing the artificial data sets gives an idea of how much the results might change if the study were
replicated many times. The support of each internal branch is expressed as the percentage of replicates in which it
appears, and serves as a measure of the reliability of that branch. High bootstrap values (>70%) indicate
well-supported branches. Lower percentages indicate that there is insufficient information in the sequences to be
sure about that part of the resulting tree.
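How support values arise from the replicate trees can be sketched by counting how often each group of taxa appears. In this illustration every replicate tree is represented simply as a set of clades (each clade a frozenset of taxon names); that representation, the function names and the four toy trees are hypothetical and are not the file format read by Consense.

```python
from collections import Counter

def clade_support(replicate_trees):
    """Percentage of replicate trees in which each clade appears."""
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    n_trees = len(replicate_trees)
    return {clade: 100.0 * count / n_trees for clade, count in counts.items()}

def majority_rule(support, threshold=50.0):
    """Clades present in more than `threshold` percent of the replicates."""
    return {clade for clade, pct in support.items() if pct > threshold}

AB, AC, DE = frozenset("AB"), frozenset("AC"), frozenset("DE")
replicate_trees = [{AB, DE}, {AB, DE}, {AB, DE}, {AC, DE}]  # toy replicate trees

support = clade_support(replicate_trees)
consensus_clades = majority_rule(support)
for clade, pct in support.items():
    print(sorted(clade), pct)
```

In this toy example the group A+B would be labelled 75 on the consensus tree and D+E would be labelled 100, while A+C falls below the majority-rule threshold and is left out.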
Cladograms are diagrams which depict the relationships between different groups of taxa called “clades”. By
depicting these relationships, cladograms reconstruct the evolutionary history (phylogeny) of the taxa. Cladograms
can also be called “phylogenies” or “trees”. Cladograms are constructed by grouping organisms together based on
their shared derived characteristics.
• Characters can be any aspect of the phenotype
1. Morphology
2. Physiology
3. Behavior
4. DNA
Phylogram: A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny. The
branch lengths are proportional to the amount of inferred evolutionary change. Each node is called a taxonomic
unit. Internal nodes are generally called hypothetical taxonomic units. In a phylogenetic tree, each node with
descendants represents the most recent common ancestor of the descendants, and the edge lengths (if present)
may, in some trees, be interpreted as time estimates.
In phylogram displays, the branch lengths are proportional to amino acid changes, and the tree is accompanied by a
scale bar. On the other hand, branch lengths are not proportional to amino acid changes in the cladogram. A
cladogram portrays evolutionary relationships within species and populations.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes) in a vertical order and joins
related ones by levels of branches extending to one side of the list.
• Characters shared by all members of the group, including the ancestor, are referred to as
symplesiomorphies.
• Homologous characters that already exist in a common ancestor - primitive characters or plesiomorphic
character.
• Homologous characters that have evolved more recently and, therefore, only occur among certain species
in the cladogram - derived characters or apomorphic characters.
• Synapomorphies are shared derived characters that have arisen within the group since it diverged from a
common ancestor.