
Phylogenetic Trees

This document discusses constructing phylogenetic trees using bioinformatics tools like Clustal X and Phylip to align nucleotide sequences and interpret evolutionary relationships. It outlines retrieving 16S rRNA gene sequences from databases like NCBI, performing multiple sequence alignment (MSA) using Clustal, and generating phylogenetic trees in Phylip format that can be viewed in TreeView. Key steps include downloading sequences, performing MSA to align homologous residues, and using distance-based and character-based tree building methods like neighbor joining and maximum parsimony to construct trees from the alignment.

Uploaded by

Manisha Bisht

Construction of phylogenetic trees with the help of bioinformatics tools (Clustal X, Phylip, NJ) and its interpretation.

• Bioinformatics is the research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those to acquire, store,
organize, analyze, or visualize such data.
• Objectives of molecular phylogeny:
• Reconstruct the correct evolutionary relationships among biological entities
• Estimate the time of divergence between biological entities
• Chronicle the sequence of events along evolutionary lineages
• In any molecular phylogenetic reconstruction the following 4 points need to be addressed.
1. Molecular sequences
2. Sequence alignment is the essential preliminary to tree reconstruction
3. Converting the alignment data into a phylogenetic tree
4. Assessing accuracy of a reconstructed tree
• Evolutionary relationships are illustrated by means of a phylogenetic tree or a dendrogram.
• Genetic distance is the number of mutation/evolutionary events between species since their divergence.
• Phylogenetic tree is a branching diagram showing relationships between species (or higher taxa) based on
shared common ancestry.
• Constructed by studying organismal features (=characters) that vary among species.
• It is based on the assumption that if two organisms possess similar characters, it is likely that they inherited
these features from a common ancestor. The likeness of organisms due to shared ancestry is called
homology.
Not all likeness is inherited from a common ancestor
• The acquisition of similar characteristics in species from different evolutionary branches, due to sharing
similar ecological roles with natural selection shaping analogous adaptations, is called convergent evolution.
• Similarity due to convergent evolution is termed non-homologous similarity or homoplasy.
• Phylogeny helps us to understand and classify the diversity of life on earth.
• It also enables us to test evolutionary hypotheses:
• trait evolution
• coevolution
• mode and pattern of speciation
• correlated trait evolution
• biogeography
• geographic origins
• age of different taxa
• nature of molecular evolution
• disease epidemiology
Three main approaches:
1. Phenetic (or numerical taxonomy):
• Based upon the degree of overall similarity (phenotype) between organisms, using as many
characteristics of the organisms being classified as possible.
• Each organism is then compared with every other for all characters measured, and the number of
similarities (or differences) is calculated.
• The organisms are then clustered in such a way that the most similar are grouped close together and the
more different ones are linked more distantly.
• Pheneticists do not seek to make the distinction between homologous and analogous similarities.
• A phenogram (=branching diagram) of species is then constructed using these distance measures.
2. Cladistic (or phylogenetic):
• Based on the assumption that members of a group share a common evolutionary history, and are thus
more "closely related" to one another than they are to other groups of organisms.
• A phylogenetic hypothesis that attempts to reconstruct the course of evolution by grouping organisms
relative to a common ancestor.
• Branching diagrams called cladograms - these are hypotheses about the relationship among taxa based on
certain shared characteristics
3. Evolutionary systematics:
• Based strongly on evolutionary relationships.
• Suggest that the degree of genetic differences between lineages should be used in addition to their
genealogical (evolutionary) similarities when developing taxonomic classifications.
• The branching diagrams prepared by evolutionary systematists are called phylogenetic trees.

The most popular and frequently used methods of tree building can be classified into two major categories

A. Phenetic methods based on distances: measures the pair-wise distance/dissimilarity between two genes,
the actual size of which depends on different definitions, and constructs the tree totally from the resultant
distance matrix.
• Distance-based methods are more rapid and less computationally intensive than character based methods,
but the actual characters are discarded once the distance matrix is derived.
1. Unweighted pair group method with arithmetic mean (UPGMA),
2. Neighbor joining (NJ)…Saitou & Nei 1987
3. Weighted Neighbor-Joining….Weighbor
4. Fitch-Margoliash (FM) and Minimum Evolution (ME) Methods
• UPGMA and NJ are relatively fast and suitable for analyzing large data sets that are not very strongly
similar.
B. Cladistic methods based on characters: evaluate all possible trees and seek the one that best explains the
evolution of the characters under an optimality criterion.
• Make use of all known evolutionary information, i.e. the individual substitutions among the sequences, to
determine the most likely ancestral relationships.
1. Maximum parsimony (MP)
2. Maximum likelihood (ML)
• Both methods assume that a set of sequences descended from a common ancestor by mutation and selection
processes, without hybridization or other horizontal gene transfer.

Principal steps for construction of a phylogenetic tree

A. Retrieve nucleotide sequences from database


In this exercise we can retrieve nucleotide sequences from the NCBI website (https://www.ncbi.nlm.nih.gov/).
For example, to download the E. coli 16S rRNA gene you can:
1. Go to the NCBI site.
2. In the ALL DATABASE dropdown menu, select Nucleotide.
3. In the search box, type E. coli 16S rRNA gene
4. Select the option which gives an approximately 1400 bp sequence.
5. Click and go to the page which gives you the metadata of the gene as well as the sequence.
6. Click on FASTA format and copy the sequence including the ‘>’ mark.
7. Open a notepad file and keep pasting the sequences into the file one after another as demonstrated in the
class.
8. Add the ingroup taxon sequences and one outgroup taxon sequence belonging to an unrelated species.
Note:
• This will be your text file of sequence which you will use in the next step of Multiple Sequence
Alignment (MSA).
• Please download and install the following software in your computer
• Clustal: http://www.clustal.org/clustal2/
• Phylip: http://evolution.genetics.washington.edu/phylip/install.html
• TreeView: Shared on google drive

There are three major public DNA databases: GenBank (NCBI), the European Nucleotide Archive (EMBL-EBI) and the DNA Data Bank of Japan (DDBJ).

16S rRNA Sequences retrieved from NCBI


Ingroup taxa
Accession number    Species
AY519130.1          Sphingobium francense Sp+
EF190507.1          Sphingobium chinhatense IP26
AY519129.1          Sphingobium indicum B90A
AF039168.1          Sphingobium japonicum UT26
AF159257.2          Sphingomonas chungbukensis DJ77
AB042233.1          Sphingomonas herbicidovorans
AM489507.1          Sphingobium olei
X87161.1            S. chlorophenolica

Outgroup taxon
Accession number    Species
AF281031.1          Zymomonas mobilis ATCC10988
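
The accessions in the table can also be collected into one FASTA file programmatically instead of by copy and paste. The following is a minimal sketch, assuming the NCBI E-utilities efetch endpoint is used; the helper names efetch_url and parse_fasta are illustrative and not part of the original exercise, and the URL is only built here, not fetched.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (network access is needed to actually use the URL).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accessions):
    """Build an efetch URL that returns the given accessions as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)

def parse_fasta(text):
    """Return {header: sequence} from FASTA text; headers lose the leading '>'."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records

# Accession numbers taken from the table above.
ingroup = ["AY519130.1", "EF190507.1", "AY519129.1", "AF039168.1",
           "AF159257.2", "AB042233.1", "AM489507.1", "X87161.1"]
outgroup = ["AF281031.1"]
url = efetch_url(ingroup + outgroup)
```

Opening the resulting URL in a browser (or with any HTTP client) should return all nine records in FASTA format, which can be saved as the text file used for the MSA step; parse_fasta can then be used to check that nine sequences were retrieved.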

B. MSA
1. Load notepad file containing sequences in FASTA format.
2. Select .phy (PHYLIP) and .aln (Clustal) as output formats.
3. Remove gap only columns.
4. Do complete alignment.
5. Name as infile.phy
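
Clustal writes the .phy file itself, but it helps to know what the file contains. The sketch below is an illustration only, assuming the classic PHYLIP layout (a header line with the number of sequences and sites, then a name padded to 10 characters followed by the aligned sequence); the function name to_phylip and the toy sequences are hypothetical.

```python
def to_phylip(alignment):
    """Format {name: aligned_sequence} as PHYLIP text, one sequence per line."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    nsites = lengths.pop()
    lines = [f" {len(alignment)} {nsites}"]
    for name, seq in alignment.items():
        # Classic PHYLIP names occupy exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

# Toy alignment, not real 16S rRNA data.
example = to_phylip({"taxonA": "ACGT-", "taxonB": "ACGTA"})
```

Long names are truncated to 10 characters here, so names should be kept short and unique before alignment.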

Homologous sequences can be divided into two groups:


• Orthologous sequences: These are the sequences that differ because they are found in different species,
having diverged through speciation (e.g., human alpha-globin and mouse alpha-globin).
• Paralogous sequences: These are the sequences that differ because of a gene duplication event (e.g.,
human alpha-globin and human beta-globin, various versions of both).
• Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence
alignment) sequences by searching for a series of individual characters or character patterns
(nucleotides/amino acids) that are in the same order in the sequences. It is fundamental to inferring
homology (common ancestry) and function. Sequence alignment is useful for discovering functional,
structural, and evolutionary information in biological sequences.
• It is necessary to align the conserved and non-conserved residues across all the sequences when comparing
two or more sequences. These residues form a pattern from which the relationship between the sequences
can be determined. When the sequences are aligned, it is possible to identify the locations of insertions or
deletions since their divergence from a common ancestor. There are three possibilities:
• The bases match, which means there is no change since their divergence.
• The bases mismatch, which means there is a substitution since their divergence.
• There is a base in one sequence, no base in the other, which means there is an insertion or a deletion since
their divergence.
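
The three possibilities can be counted column by column once two sequences are aligned. The following is a small sketch with a hypothetical function name and toy sequences, assuming gaps are written as "-".

```python
def classify_columns(seq1, seq2, gap="-"):
    """Count matches, mismatches and indels between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(seq1, seq2):
        if a == gap and b == gap:
            continue  # a gap-only column carries no information for this pair
        if a == gap or b == gap:
            counts["indel"] += 1      # base in one sequence, none in the other
        elif a == b:
            counts["match"] += 1      # no change since divergence
        else:
            counts["mismatch"] += 1   # substitution since divergence
    return counts

counts = classify_columns("ACGT-A", "ACCTGA")
```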

A multiple sequence alignment is a collection of three or more sequences (protein or nucleic acid) that are partially
or completely aligned.

• Homologous residues are aligned in columns across the length of the sequences. These aligned residues are
homologous in an evolutionary sense: they are presumably derived from a common ancestor.
• The residues in each column are also presumed to be homologous in a structural sense: aligned residues
tend to occupy corresponding positions in the three-dimensional structure of each aligned protein.
• There is a unique advantage of multiple sequence alignment because it reveals more biological information
than many pair-wise alignments can.
o Allows the identification of conserved sequence patterns and motifs in the whole sequence family,
which are not obvious to detect by comparing only two sequences.
o Many conserved and functionally critical amino acid residues can be identified in a protein multiple
alignment.
o Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of
sequence families and prediction of protein secondary and tertiary structures.
o Multiple sequence alignment also has applications in designing degenerate polymerase chain reaction
(PCR) primers based on multiple related sequences.
• The most popular web-based program for performing progressive multiple sequence alignment is ClustalW.
• The ClustalW algorithm proceeds in three stages.
o In stage 1, the global alignment approach is used to create pairwise alignments of every sequence that
is to be included in a multiple sequence alignment.
o In the second stage, a guide tree is calculated from the distance (or similarity) matrix. There are two
principal ways to construct a guide tree: the unweighted pair group method of arithmetic averages
(UPGMA) and the neighbor-joining method.
o The two main features of a tree are its topology (branching order) and branch lengths (which can be
drawn so that they are proportional to evolutionary distance). Thus, the tree reflects the relatedness
of all the proteins to be multiply aligned. Guide trees are usually not considered true phylogenetic
trees, but instead are templates used in the third stage of ClustalW to define the order in which
sequences are added to a multiple alignment. A guide tree is estimated from a distance matrix based
on the percent identities between sequences you are aligning. In contrast, a phylogenetic tree almost
always includes a model to account for multiple substitutions that commonly occur at the position of
aligned amino acids (or nucleotides).
o In stage 3, the multiple sequence alignment is created in a series of steps based on the order
presented in the guide tree. The algorithm first selects the two most closely related sequences from
the guide tree and creates a pair-wise alignment. These two sequences appear at the terminal nodes
of the tree, that is, the locations of extant sequences.
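
The distance matrix behind a guide tree, and the kind of correction for multiple substitutions that a phylogenetic analysis adds, can be sketched as follows. This is an illustration rather than ClustalW's own code: distances are taken as the proportion of differing sites (1 minus the fractional identity), and the Jukes-Cantor formula is used as one simple example of a substitution model for nucleotides. Function names and sequences are assumptions.

```python
import math

def p_distance(seq1, seq2, gap="-"):
    """Proportion of differing sites, ignoring columns where either sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotide data."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(alignment, corrected=False):
    """Return a symmetric {name: {name: distance}} matrix for an alignment."""
    names = list(alignment)
    matrix = {a: {} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = p_distance(alignment[a], alignment[b])
            if corrected:
                d = jukes_cantor(d)
            matrix[a][b] = matrix[b][a] = d
    return matrix

# Toy alignment, not real data.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "ACGAACGTTA"}
raw = distance_matrix(toy)
jc = distance_matrix(toy, corrected=True)
```

The corrected distance is always slightly larger than the raw proportion of differences, because some sites are assumed to have changed more than once.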

By default, the degree of conservation observed in each column of the alignment is displayed using the
following symbols:
1. "*" means that the residues or nucleotides in that column are identical in all sequences in the
alignment.
2. ":" means that conserved substitutions have been observed, according to the color table.
3. "." means that semi-conserved substitutions are observed.

TRIMMING
1. Open the .aln text file and remove the peripheral empty columns from both ends of the alignment (the 5' and 3' ends for nucleotide sequences).
2. Save the file and then repeat the MSA steps.
3. The resulting file in .phy format is used for the next step.
Note:
• This step can be skipped if the retrieved sequences are of similar length and good quality.
• Do only external trimming if necessary.
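
External trimming can be sketched in a few lines. The version below is one possible reading of the step, assuming that a peripheral column is removed when at least one sequence has a gap in it; internal gaps are left untouched. The function name and sequences are hypothetical.

```python
def trim_ends(alignment, gap="-"):
    """Remove leading and trailing columns in which any sequence has a gap."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def complete(col):
        # True when every sequence has a residue (no gap) in this column.
        return all(s[col] != gap for s in seqs)

    start = 0
    while start < length and not complete(start):
        start += 1
    end = length
    while end > start and not complete(end - 1):
        end -= 1
    return {name: seq[start:end] for name, seq in alignment.items()}

trimmed = trim_ends({
    "a": "--ACGT-A--",
    "b": "GGACGTTAC-",
    "c": "-GACGTTACC",
})
```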

C. TREE CONSTRUCTION
1. Copy infile.phy into the Phylip exe folder.
2. Open seqboot and enter the file name, i.e., infile.phy
3. Select the number of bootstrap replicates (press R and change to 1000) and press Y to accept the changes.
4. Rename the output file as infile.
5. Open dnadist, press M to analyse multiple datasets, enter D, then enter the number of datasets to be analysed (100)
and press Y.
6. Delete the old infile. Rename the output file as infile.
7. Open neighbor and enter the file name.
8. Define the outgroup by pressing O and entering its sequence number. Press M and change to 100. Press Y to accept.
9. An outfile and an outtree will be generated. Delete outfile and infile. Rename outtree as intree.
10. Open consense and enter the file name intree. Define the outgroup by pressing O and entering its sequence number. Press Y
to accept.
11. Rename outtree as x.tre

Phylip (PHYlogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees)
created by Joseph Felsenstein. These programs are used in a sequential way in which the output from the first
program is used as an input for the next program just by renaming the files. The input for a program is taken from a
file called ‘infile’ and the results are written in a file called ‘outfile’. Some programs generate two output files:
outfile and outtree.

• PHYLIP analyzes DNA and amino acid sequences in three different ways:

A. Character based:
1. Maximum parsimony method: It is a character-based method which infers a phylogenetic tree by
minimizing the total number of evolutionary steps or total tree length for a given set of data. It is also
referred to as sequence based tree reconstruction method.
2. Maximum likelihood method: Uses a model of sequence evolution to find the tree that gives the
highest likelihood of the observed data.
B. Distance matrix based
3. Neighbour joining: Evolutionary distances are calculated for all operational taxonomic units, and a tree is built
in which the distances between the operational taxonomic units match these distances. Distance methods
summarize the differences between sequences by calculating a pairwise distance measure between all
aligned sequences.
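
To make "minimizing the total number of evolutionary steps" concrete, the sketch below counts the minimum number of changes a given tree needs to explain an alignment, using Fitch's algorithm for a fixed rooted binary tree. It is an illustration and not the procedure of dnapars; trees are written as nested tuples of taxon names, and the taxa and sequences are invented.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0          # a leaf has its observed state, no cost
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    nsites = len(next(iter(alignment.values())))
    total = 0
    for site in range(nsites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

toy = {"A": "AAG", "B": "AGG", "C": "GAT", "D": "GGT"}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), toy)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), toy)
```

Under the parsimony criterion the tree with the lower score is preferred; a full analysis repeats this scoring over many candidate trees.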
Neighbour joining method

• This algorithm does not assume a molecular clock and adjusts for rate variation among
branches.
• Steps:
o It begins with an unresolved star-like tree. Each pair is evaluated for being joined, and the sum of all
branch lengths of the resultant tree is calculated.
o The pair that yields the smallest sum is considered the closest neighbors and is thus joined.
o A new branch is inserted between them and the rest of the tree, and the branch lengths are recalculated.
o This process is repeated until the tree is fully resolved.
• NJ method is comparatively rapid and generally gives better results than UPGMA method.
• But it produces only one tree and neglects other possible trees, which might be as good as NJ trees, if not
significantly better.
• Moreover, since errors in distance estimates are exponentially larger for longer distances, under some
condition, this method will yield a biased tree.
• Can deal with unequal mutation rates.
• Allows incorporation of evolutionary models
• The NJ tree is an approximation of the minimum evolution tree (the tree whose total branch length is
minimum).
• In that sense, the NJ method is very similar to parsimony because branch lengths represent substitutions.
• NJ always produces unrooted trees, which need to be rooted by the outgroup method.
• NJ performs well when substitution rates vary among lineages. Thus NJ should find the correct tree if
distances are well estimated.
• The phylogenetic information expressed by an unrooted tree resides entirely in its internal branches.
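
The joining steps described above can be sketched in Python. This is a minimal illustration, not the code of PHYLIP's neighbor program: pairs are chosen with the commonly used Q-criterion, which is understood to select the same pair as minimising the sum of branch lengths, and the function name, the internal node labels (node1, node2, ...) and the five-taxon distance matrix are assumptions made for the example.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, parent_node, length) of an unrooted NJ tree.

    names: list of taxon labels; dist: symmetric {a: {b: distance}} mapping.
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        counter += 1
        u = f"node{counter}"
        # Branch lengths from the joined pair to the new internal node.
        len_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k in (a, b):
                continue
            d[u][k] = d[k][u] = (d[a][k] + d[b][k] - d[a][b]) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy = {x: {y: rows[i][j] for j, y in enumerate(labels) if i != j}
       for i, x in enumerate(labels)}
edges = neighbor_joining(labels, toy)
```

For this toy matrix, a and b are joined first with branch lengths 2 and 3. The result is unrooted; rooting still requires an outgroup, as noted in the list above.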
List of some of the programs in PHYLIP that can be used for the molecular sequence data analysis are as follows:
1. Parsimony method programs
o dnapars: DNA parsimony
o dnapenny: DNA parsimony using branch-and-bound
o protpars: Protein parsimony
o pars: Discrete character parsimony
o mix: Mixed method parsimony
2. Maximum likelihood programs
o dnaml: DNA maximum likelihood without molecular clock
o dnamlk: DNA maximum likelihood with molecular clock
o proml: Protein maximum likelihood
3. Computation of distance
o dnadist: Aligned nucleic acid sequences
o protdist: Aligned protein sequences
4. Distance matrix method programs
o neighbor: Neighbor-joining or UPGMA
o fitch: Fitch-Margoliash and least-squares methods
o kitsch: Fitch-Margoliash and least-squares methods with molecular clock
5. Manipulation and visualization of phylogenetic tree
o drawtree: Draw a tree.
o drawgram: Draw a cladogram or a phenogram.
o consense: Consensus tree program.
o retree: Unroot a tree

• Consensus trees
This program constructs a consensus tree from multiple trees. For example, Dnapars can produce multiple trees,
which can be summarized by the program Consense. The results of bootstrapping are also summarized by the
program Consense as a majority-rule consensus tree.
• Seqboot: Reads in a data set, and produces multiple data sets from it by bootstrap resampling
• Resampling procedure
The idea behind resampling (bootstrapping and other methods) is to assess how reliable a tree we can produce
with the dataset at hand. Initially, the sequence alignment is analyzed in the usual way. Then, resampling proceeds
by first creating a number (100-10000) of random datasets from the original dataset. These random datasets are
analyzed in exactly the same way the original dataset was analyzed, and the results from the random datasets are
summarized by constructing a majority rule consensus tree (program Consense).
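
The resampling step can be sketched as follows. This is an illustration of the idea rather than Seqboot's implementation: each replicate is built by drawing alignment columns at random, with replacement, until it has as many columns as the original. The function names, the seed handling and the toy alignment are assumptions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one replicate made of columns sampled with replacement."""
    nsites = len(next(iter(alignment.values())))
    columns = [rng.randrange(nsites) for _ in range(nsites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates=100, seed=1):
    """Return a list of bootstrap replicate alignments."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

toy = {"a": "ACGTAC", "b": "ACGTTT"}
replicates = bootstrap_replicates(toy, n_replicates=5, seed=3)
```

Whole columns are sampled, so the residues of different sequences at one site always stay together.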

D. TREE VIEWING
1. Open file.
2. Click phylogram.
3. Click tree tab and show internal edge label.
4. Print file as pdf/save as graphic.

Bootstrapping is the generation of artificial data sets by random sampling, with replacement, from the actual data
set. Analyzing the artificial data sets gives an idea of how much the results might change if the study were
replicated many times. The support of each internal branch is expressed as the percentage of replicates in which it appears. It is a parameter used to
test the reliability of each internal branch. High bootstrap values (>70%) indicate reliable branches. Lower percentages
indicate that there is insufficient information in the sequences to be sure about the resulting tree.
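
How a bootstrap percentage is obtained can be sketched by counting how often each group of taxa appears among the replicate trees. The example assumes that trees are already rooted with the outgroup and written as nested tuples; it is not the Consense program, and the four replicate trees are invented.

```python
from collections import Counter

def clades(tree):
    """Return (leaves, clades) where clades is the set of leaf groups in the tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found

def bootstrap_support(replicate_trees):
    """Percentage of replicate trees in which each clade occurs."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(clades(tree)[1])
    n = len(replicate_trees)
    return {clade: 100 * count / n for clade, count in counts.items()}

trees = [((("A", "B"), "C"), "Out")] * 3 + [((("A", "C"), "B"), "Out")]
support = bootstrap_support(trees)
```

Here the group A+B appears in three of four replicates, so its branch would be labelled 75.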

Understanding tree topology

Cladograms are diagrams which depict the relationships between different groups of taxa called “clades”. By
depicting these relationships, cladograms reconstruct the evolutionary history (phylogeny) of the taxa. Cladograms
can also be called “phylogenies” or “trees”. Cladograms are constructed by grouping organisms together based on
their shared derived characteristics.
• Characters can be any aspect of the phenotype
1. Morphology
2. Physiology
3. Behavior
4. DNA
Phylogram: A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny. The
branch lengths are proportional to the amount of inferred evolutionary change. Each node is called a taxonomic
unit. Internal nodes are generally called hypothetical taxonomic units. In a phylogenetic tree, each node with
descendants represents the most recent common ancestor of the descendants, and the edge lengths (if present)
may in some trees be interpreted as time estimates.
In phylogram displays, the branch lengths are proportional to amino acid changes, and the tree is accompanied by a
scale bar. On the other hand, branch lengths are not proportional to amino acid changes in the cladogram. A
cladogram portrays evolutionary relationships within species and populations.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes) in a vertical order and joins
related ones by levels of branches extending to one side of the list.

• Topology: the branching patterns of the tree.


• Root: the common ancestor of all taxa.
• Clade: a group of two or more taxa (DNA sequences) that includes both their common ancestor and all
their descendants.
• A set of species descended from a common ancestral species. It is a synonym of monophyletic group.
• Node: a branch point in a tree (a presumed ancestral OTU).
• Branch: defines the relationship between the taxa in terms of descent and ancestry.
• Branch length (scaled trees only): represents the number of changes that have occurred in the branch.
• Any group of theoretically related organisms of interest to an investigator is called an ingroup.
• Any group used for comparative purposes in a phylogenetic analysis is called an outgroup.
• Organisms or species that share derived character states form subsets called clades.
• Unrooted trees: Represent phylogenetic relationships but do not provide an evolutionary path. In an unrooted tree,
an external node represents a contemporary organism. Internal nodes represent common ancestors of
some of the external nodes. In this case, the tree shows the relationship between organisms A, B, C & D
and does not tell us anything about the series of evolutionary events that led to these genes (see figure
above). There is also no way to tell whether or not a given internal node is a common ancestor of any 2
external nodes.
• Rooted trees: One of the external nodes is used as an outgroup, and the point at which it joins the rest of
the tree becomes the root, the common ancestor of all the other external nodes. The outgroup therefore
enables the root of a tree to be located and the correct evolutionary pathway to be identified. In the above
case, five different evolutionary pathways are possible using an outgroup, each depicted by a different rooted tree.
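
The count of five follows from the number of branches in an unrooted tree of four taxa, since the root can be placed on any branch. A short sketch of the standard counting formulas for fully resolved (bifurcating) trees is given below; the function names are illustrative.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * ... * 3 * 1 for odd k >= 1."""
    result = 1
    for factor in range(k, 1, -2):
        result *= factor
    return result

def branches_in_unrooted_tree(n_taxa):
    """Each branch is a possible root position, giving one rooted tree each."""
    return 2 * n_taxa - 3

def unrooted_topologies(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa >= 3."""
    return odd_double_factorial(2 * n_taxa - 5)

def rooted_topologies(n_taxa):
    """Number of distinct rooted bifurcating trees for n_taxa >= 2."""
    return odd_double_factorial(2 * n_taxa - 3)

roots_for_four = branches_in_unrooted_tree(4)
unrooted_for_nine = unrooted_topologies(9)
```

With the nine sequences of this exercise there are already 135135 possible unrooted trees, which is why exhaustive character-based searches become slow as taxa are added.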

• Characters shared by all members of the group, including the ancestor, are referred to as
symplesiomorphies.
• Homologous characters that already exist in a common ancestor - primitive characters or plesiomorphic
character.
• Homologous characters that have evolved more recently and, therefore, only occur among certain species
in the cladogram - derived characters or apomorphic characters.
• The term synapomorphy is used for shared derived characters that have arisen within the group since it diverged from a
common ancestor.
