Phylogenetic Analysis

Phylogenetic sequence analysis

The comparison of biological sequence information to infer the relationships among the existing organisms from which the sequences were derived.

Besides revealing the history of organisms, it can be used as the basis for powerful analysis of RNA structure.

Relationships are usually represented by a "tree" diagram, with branches occurring where a mutation resulted in a sequence in the progeny different from that of the ancestor.

Assume that:

Organisms studied are related by a common ancestor
Characteristics of the sequence change over time
Branches are bifurcations (only one type of change occurs at a time)
Sequences studied are descended from a single ancestral sequence (i.e., they are homologs)

Homologs include:

orthologs - genes derived from a common ancestor by speciation, which have usually retained a similar function
paralogs - genes arising by duplication within a genome, which have often acquired a new function
xenologs - genes acquired by horizontal transfer from another organism

Analysis of non-orthologs can be complicated, especially if you don't realize they aren't orthologs.

One of the best-studied examples of phylogenetic analysis is the rRNA of the small ribosomal subunit (ssu-rRNA). The reasons for this include:

It has an ortholog, but only an ortholog, in all cells
It is large enough to provide large information content (1500-2000 nt), but small enough to make sequencing and analysis simple
It has both rapidly evolving regions and slowly evolving and invariant regions
- slowly evolving regions are easy to align accurately
- slowly evolving regions allow analysis of very distantly related organisms
- rapidly evolving regions can be used to detect relationships of closely related organisms
Many thousands of sequences have already been determined from a vast diversity of organisms for comparison

ssu-rRNA has conserved sequences very close to 5' and 3' ends of RNA, so PCR primers can be designed to amplify almost the entire gene from essentially any organism.

So you can analyze the sequences of a known organism, or you can identify what organisms are growing in a population in strange environments:

Yellowstone Park Octopus Spring picture from Jim Brown, NCSU.

Close-up of bacterial mats from Octopus Spring, also from Jim Brown. The water coming out of the spring is near boiling and cools as it flows downstream. The colored regions are microbial growth.

Obsidian Pool, Yellowstone, from Jim Brown. Note the boiling slurry of water and minerals at the bottom. The analysis was complicated by contamination of samples with runoff from the valley that included microbes from the gut of the bison grazing nearby.

So water, sludge or microbial mat samples were taken and DNA extracted for amplification. You can thus identify rRNA from organisms that you don't know how to cultivate.

The analysis

Alignment of sequences
Selecting a substitution matrix
Tree building
Evaluation

Alignment

Since the most closely related sequences can be aligned most accurately, they should be aligned first, as CLUSTALW does using its guide tree.
The choice of alignment tool can be critical to the analysis, since the positioning of gaps and mutations can affect the grouping of sequence characters for numerical analysis (you want to be sure that each position in the alignment contains residues that are structurally homologous, otherwise you will be comparing apples with oranges in some instances).
Placement of gaps in sequences can also be problematic, because most phylogenetic programs will not allow analysis of positions in the alignment that include gaps. Manual refinement of the alignment can help to minimize unnecessary gaps. If there is ambiguity about a particular region, alternative versions of the alignment can be analyzed to see what effect each alternative has on the interpretation.
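
As a rough illustration of the gap problem described above, the sketch below removes every alignment column that contains a gap, which is roughly what a program that refuses gapped positions would be left with. The function name drop_gapped_columns and the toy alignment are hypothetical, and the gap characters "-" and "." are an assumption about how the alignment file marks gaps.

```python
def drop_gapped_columns(alignment, gap_chars="-."):
    """Return a copy of the alignment without any column that contains a gap.

    alignment maps sequence names to aligned strings of equal length.
    """
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    names = list(alignment)
    # Transpose so that each tuple holds one alignment column.
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if not any(ch in gap_chars for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "seq1": "GAUC-GGA",
    "seq2": "GAUCAGGA",
    "seq3": "GA-CAGGU",
}
ungapped = drop_gapped_columns(toy_alignment)
for name, seq in ungapped.items():
    print(name, seq)
```

Two of the eight columns are lost here, which shows how quickly unnecessary gaps eat into the data available for tree building.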

Substitution model

Quantitation of differences in aligned sequences relies upon a quantitative model of the probability of a particular substitution. Typically, these are the same types of matrices that are used to create the alignment. Common examples for protein sequences are BLOSUM and PAM. For protein-coding regions, matrices exist for all 61 sense codons. For non-coding nucleic acids, matrices take into account the higher frequency of transition (purine -> purine or pyrimidine -> pyrimidine [R->R or Y->Y]) substitutions than transversion (Y->R or R->Y) substitutions.
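
The transition/transversion distinction above can be sketched in a few lines. This is only an illustration: it assumes an RNA alphabet (A, C, G, U), the function names are hypothetical, and the two toy sequences are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}  # assumes RNA; a DNA version would use T instead of U


def classify_substitution(a, b):
    """Classify an aligned pair of bases as identical, transition, transversion or other."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"      # R->R or Y->Y
    if {a, b} <= PURINES | PYRIMIDINES:
        return "transversion"    # R->Y or Y->R
    return "other"               # gaps or ambiguity codes


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        kind = classify_substitution(a, b)
        if kind in counts:
            counts[kind] += 1
    return counts


# Hypothetical aligned sequences, for illustration only.
counts = count_substitutions("GAUCGA", "AAUUGC")
print(counts)
```

A substitution matrix for non-coding sequences would then weight the two classes differently rather than just counting them.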

Tree building

A similarity matrix is constructed from the alignment scores.
Sequence similarities are converted to differences (1-similarity) since the differences are indicative of evolutionary distance.
But the differences you observe could be the result of more than one mutation:

e.g. a G in organism 1 and a C in organism 2 could arise by a direct G->C mutation, or there could have been more steps: G->A->C
also, even unrelated nucleotide sequences are 25% similar by chance

So differences are adjusted upward to estimate evolutionary distance
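
One common way to make this upward adjustment is the Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), where p is the observed fraction of differing positions. The notes do not say which correction is used, so treat the choice of Jukes-Cantor here as an assumption; the function names and toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed fraction of aligned positions that differ (1 - similarity)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed difference p.

    At p >= 0.75 the sequences are as different as random sequences
    (25% identical by chance), so the estimate is returned as infinity.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical aligned sequences differing at 2 of 10 positions.
observed = p_distance("GAUCGAUCGA", "GAUCGAUCAU")
corrected = jukes_cantor_distance(observed)
print(f"observed difference {observed:.3f}, corrected distance {corrected:.3f}")
```

The corrected value is always at least as large as the observed difference, and the gap between the two widens as the sequences diverge.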

Tree building proceeds, using the evolutionary distances to construct branches between organisms.

Neighbor joining: builds the tree by constructing branches between the most closely related organisms, then working your way out through the distance matrix.
Parsimony: seeks the tree that requires the fewest changes to explain the differences.
Maximum likelihood: examines the data and looks for the model that is most likely to have produced it. Looks at differences in variability within the alignment and assigns substitution values accordingly.
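
To make the neighbor-joining idea more concrete, here is a minimal sketch of a single joining step. It assumes the standard Saitou-Nei criterion, in which the pair to join minimizes Q(i, j) = (n - 2) d(i, j) - sum(d(i, .)) - sum(d(j, .)), so the chosen pair is not always simply the pair with the smallest distance. The function name and the toy distance matrix are hypothetical, and a full implementation would repeat this step on a reduced matrix until the tree is complete.

```python
def neighbor_joining_step(dist):
    """Pick the pair of taxa to join first and the branch lengths to their new node.

    dist is a symmetric list of lists with a zero diagonal and at least 3 taxa.
    Returns ((i, j), (length_i, length_j)).
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Q penalizes pairs that are also close to everything else.
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    length_i = 0.5 * dist[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return best_pair, (length_i, length_j)


# Hypothetical evolutionary distances among five taxa (indices 0-4).
toy_distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths = neighbor_joining_step(toy_distances)
print("join taxa", pair, "with branch lengths", lengths)
```

In this toy matrix taxa 3 and 4 have the smallest raw distance, yet taxa 0 and 1 are joined first, which is the point of using Q rather than the raw distances.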

Trees can be arranged as dendrograms or phenograms.

Putting a root on the tree

By choosing an outgroup sequence to include with the analysis, you can "root" your tree. Rooting allows you to identify where the sequences analyzed diverged from their common ancestor. The outgroup is a distantly related sequence that is analyzed along with the other sequences, but is known to fall outside the group, such as a eukaryotic ortholog for looking at bacterial sequences.

Signature Analysis

Looking for key features of sequences from particular groups of organisms. Jim Brown's lecture on RNase P signatures.

Evaluation

Given the possibilities for error in data selection, the final tree must be evaluated for validity. Sometimes this can be as simple as visual inspection. If your tree says the closest relative of a mouse is Salmonella, then maybe you used a mitochondrial sequence instead of a nuclear sequence for the mouse. Or perhaps there was horizontal transfer of a gene. Or you might have ambiguous data.

In some cases a bootstrap analysis is used to look for biases in the data. The analysis is rerun numerous times on random resamplings of the data (alignment columns drawn with replacement). Parts of the tree that occur nearly every time are reliable; ones that occur only occasionally are suspect.
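
The bootstrap idea can be sketched as follows. Each replicate is built by drawing alignment columns at random with replacement, and support is the fraction of replicates in which a grouping of interest reappears. Everything named here is hypothetical: the toy alignment, the function names, and especially the stand-in grouping test, which just compares mismatch counts where a real analysis would rebuild the whole tree for every replicate.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][k] for k in picks) for name in names}


def mismatches(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def bootstrap_support(alignment, grouping_holds, replicates=200, seed=1):
    """Fraction of bootstrap replicates in which grouping_holds(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(replicates)
        if grouping_holds(bootstrap_replicate(alignment, rng))
    )
    return hits / replicates


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "seq1": "GAUCGAUCGA",
    "seq2": "GAUCGAUCAA",
    "seq3": "GUUCCAUGAU",
}


def seq1_groups_with_seq2(aln):
    # Stand-in for "the tree places seq1 and seq2 together".
    return mismatches(aln["seq1"], aln["seq2"]) < mismatches(aln["seq1"], aln["seq3"])


support = bootstrap_support(toy_alignment, seq1_groups_with_seq2)
print(f"grouping recovered in {support:.0%} of replicates")
```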

The Result:

The Big Tree of Life, from Norm Pace: The universal nature of biochemistry. Proc Natl Acad Sci U S A. 2001; 98:805-808. Also see his microbial diversity pages.

Phylogenetic Comparative Analysis of RNA structure:

Sequence homologous RNAs from a variety of organisms

Align sequences
Look for covariation of base sequences as evidence of interaction (i.e. W-C or other pairs)

     |-|
U<-- A-U -->A
     |-|
A<-- C-G -->U
     |-|

Complementarity maintained by compensatory base changes.
Can also detect some non-Watson-Crick tertiary interactions this way.
(Computer programs aid in aligning seqs. and detecting covariations)
Also identifies highly conserved sequences that may be required for special functions.
BUT: Cannot tell base-pairing of conserved residues that don't vary
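
A minimal sketch of the covariation search described above: for every pair of alignment columns, check whether the bases vary between sequences while always remaining Watson-Crick complementary. The function name and the three-sequence toy alignment are hypothetical, and real covariation programs use statistical scores rather than this all-or-nothing test.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def covarying_pairs(sequences, allowed=WATSON_CRICK):
    """Return column pairs (i, j) that vary across sequences but always stay complementary."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = list(zip(*sequences))
    hits = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            observed = set(zip(columns[i], columns[j]))
            # More than one kind of pair means compensatory changes occurred.
            if len(observed) > 1 and observed <= allowed:
                hits.append((i, j))
    return hits


# Hypothetical aligned hairpins: columns 0-3 pair with columns 11-8.
toy_sequences = [
    "GGACGAAAGUCC",
    "GGUCGAAAGACC",
    "GCACGAAAGUGC",
]
pairs = covarying_pairs(toy_sequences)
print(pairs)
```

Only the pairs at columns (1, 10) and (2, 9) are reported. The invariant pairs at (0, 11) and (3, 8) go undetected, which is exactly the limitation noted above for conserved residues.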

Examples of well-defined secondary structures derived from phylogenetic analysis:
tRNA
rRNA (16S, 23S, 5S)
Group I intron
Group II intron
RNase P

Comparison of predicted base pairs with those seen in crystal structure of 60S ribosomal subunit:

Pairings shown in red were predicted by phylogenetics and observed in the crystal structure.
Green pairs were predicted, but not observed.
Blue pairs were observed, but not predicted.

Computer prediction based on sequence

Not as straightforward as you would think
Look for regions of molecule that can form W-C pairs with a given region.
But one region of RNA may be complementary to many other short stretches of sequence.
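
The ambiguity described above is easy to demonstrate. The sketch below scans a sequence for every stretch that could form an antiparallel Watson-Crick helix with a chosen region; the function name and the toy sequence are hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def complementary_sites(rna, start, length):
    """Start positions of stretches able to pair, antiparallel, with rna[start:start+length]."""
    region = rna[start:start + length]
    sites = []
    for k in range(len(rna) - length + 1):
        if k < start + length and k + length > start:
            continue  # skip stretches that overlap the region itself
        candidate = rna[k:k + length]
        # Antiparallel pairing: first base of the region pairs with the last of the candidate.
        if all((region[i], candidate[length - 1 - i]) in WATSON_CRICK for i in range(length)):
            sites.append(k)
    return sites


# Hypothetical RNA in which the first four bases have two possible partners.
toy_rna = "GGACUUUUGUCCAAAGUCC"
sites = complementary_sites(toy_rna, start=0, length=4)
print(sites)
```

Even in this 19-nucleotide toy sequence the region has two equally good partners, so complementarity alone cannot decide which helix forms.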

Computer algorithms can help see which folding choices yield lowest free energy. (Zuker, 1989)
e.g. G-C pair more favorable than A-U
G•U (wobble) pair is almost as favorable as A-U
(makes for even more regions of complementarity)
Also takes into account "nearest neighbor" rules for stacking:

So identical nucleotide compositions can have different energies due to order of sequence.

Unpaired structures are unfavorable (loops, bulges).
But some are more unfavorable than others.
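
To give a feel for how a folding algorithm searches among competing pairings, here is a deliberately simplified sketch. It is not the Zuker free-energy method mentioned above: it uses a Nussinov-style dynamic program that maximizes the number of base pairs, treats G-C, A-U and G-U pairs as equally good, and ignores stacking and loop penalties. The function name, the minimum loop size of 3, and the toy sequence are assumptions for illustration.

```python
CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pair
}


def max_base_pairs(rna, min_loop=3):
    """Largest number of nested base pairs, with at least min_loop unpaired bases in each hairpin."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the maximum number of pairs within rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


# Hypothetical hairpin: three pairs (G-C, G-C, G-U) closing an AAA loop.
print(max_base_pairs("GGGAAAUCC"))
```

An energy-based program follows the same recursive pattern but replaces the count of pairs with measured nearest-neighbor energies, which is why two sequences with identical composition can fold differently.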

Even with all of this info, still not always very accurate:
Many regions of molecule not involved in W-C pairs, form "non-canonical" pairs instead: e.g. G-A, A-C, etc.

Non-canonical pairs can have very favorable energy, but many don't tend to fit A-form helix very well.
Phylogenetics can sometimes predict non-canonicals
Many predicted non-canonical pairs have been confirmed by crystal structures.
Also 2'OH is often involved in tertiary pairs, but it is difficult to predict, since it is both an H-bond donor and acceptor.


Self-complementarity	Computer prediction	Phylogenetic
