Multiple sequence alignments

Why do them?

Provides wealth of information about sequences being analyzed.


Structural information - protein alignment can reveal the regions that are most conserved and critical for function, e.g. active site residues


Chemical nature can be used to infer possible chemistry necessary for reaction.
Hidden Markov Model (HMM) sequence alignments can be used to train a program to identify the function of unknown ORFs.
Less strongly conserved residues may reveal what characteristics are important for their structural role, e.g. a conserved alternating pattern of hydrophilic and hydrophobic residues may indicate a beta sheet secondary structure.
Regions that have no insertions or deletions are likely to have specific structure, while regions of variable length and sequence might indicate variable loops or linkers between domains.
Conserved regions of nucleic acids may indicate important protein binding sites (transcriptional or translational control factors, for example) or active site residues of a ribozyme ;-)

Nucleic acid conserved regions reveal a consensus sequence important for protein binding.
They can also be used in designing PCR primers to amplify DNA from diverse species.


Since conserved regions tend to cluster at sites across alignment, may find conserved sites for PCR primers to amplify non-conserved region between them.
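A minimal sketch of how conserved primer sites flanking a variable region might be located in an alignment. The function names, the toy sequences, and the cutoffs (100% identity over at least 4 columns) are assumptions chosen for illustration; real primer design also has to consider melting temperature, degeneracy, and primer length.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # ignore gap characters
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_blocks(scores, min_score=1.0, min_length=4):
    """Return (start, end) half-open ranges of consecutive conserved columns."""
    blocks, start = [], None
    for i, s in enumerate(scores + [0.0]):  # trailing 0.0 closes a final block
        if s >= min_score and start is None:
            start = i
        elif s < min_score and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks

# Hypothetical aligned DNA sequences: conserved ends, variable middle.
toy_alignment = [
    "ACGTACGGATTCAGGCTAGC",
    "ACGTACGCTTACAGGCTAGC",
    "ACGTACGGAATGAGGCTAGC",
]
blocks = conserved_blocks(column_conservation(toy_alignment))
print(blocks)
```

The two blocks reported for this toy input are candidate primer sites; the columns between them are the variable region that would be amplified.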
Alignments can also be used to infer evolutionary relationships amongst the organisms from which they were obtained.
More related organisms tend to have more similar sequences for an orthologous protein or nucleic acid than would more distantly related ones.
Relative conservation of sequences can be used to construct a phylogenetic tree of the relatedness of those organisms.
Can be used to attempt to identify which variations in sequences occurred in which order.
Can also be used to identify residues in a sequence which appear to interact with other residues because their identities seem to have changed simultaneously with another to maintain function.
This is called covariation. It is particularly important in defining RNA structures: covariation of Watson-Crick base pairs is indicative of a double-helical RNA stem.
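One common way to quantify covariation between two alignment columns is mutual information. The sketch below assumes that measure (the notes do not name a specific statistic), and the RNA sequences are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_pair = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_pair.items():
        p_joint = count / n
        p_independent = (freq_a[a] / n) * (freq_b[b] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi

# Hypothetical RNA alignment: columns 1 and 6 change together and
# always form a Watson-Crick pair (C-G, G-C, A-U, U-A).
rna = [
    "GCAUUUGC",
    "GGAUUUCC",
    "GAAUUUUC",
    "GUAUUUAC",
]
columns = list(zip(*rna))
print(mutual_information(columns[1], columns[6]))  # covarying pair
print(mutual_information(columns[0], columns[7]))  # invariant pair
```

Invariant columns score zero even if they are base paired, so mutual information only detects pairs where compensating changes have actually occurred.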


The same principle can be applied to paralogous sequences (duplicated genes which have acquired new functions within an organism). Can be used to infer when the gene duplication arose and what residues were important for acquisition of new function.
Information from multiple alignment can help to refine a pairwise alignment of sequences. Remember the BLAST search exercises showed variations in alignments, depending upon the matrix used and gap penalties.
By looking at additional sequences, one might find intermediately related sequences which help in optimizing alignment.
Or multiple alignment may reveal which regions appear most prone to gaps, and so aid in aligning 2 particular sequences of interest.

Alignment accuracy


How can you tell if a multiple sequence alignment is meaningful?
Structural alignment is generally considered the standard for aligning molecules, particularly proteins.
In a few cases there are crystal structures of the same protein from multiple organisms with multiple sequences.
In this case, one can align the 3-D protein structures and identify residues from the structures which occupy the same region of space. This allows unequivocal alignment of the primary structures.
Usually, there are many more primary sequences available than tertiary structures, so other methods must be used, but they can be compared to test cases where structural alignments are known. In general, sequence alignment programs can provide alignments which agree well with structural alignments.

Alignment programs

In theory, you can perform optimal alignment of multiple sequences by extension of pairwise algorithms, but number of calculations needed is the sequence length raised to the power of the number of sequences, so it is generally impractical to calculate true optimal sequence alignment for more than 3 sequences.
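To get a feel for the numbers, the short sketch below counts the cells in the dynamic programming matrix for sequences of equal length. The length of 300 residues is an arbitrary example, and the function name is made up for illustration.

```python
def dp_cells(length, n_sequences):
    """Cells to fill when aligning n sequences of the same length optimally."""
    return length ** n_sequences

for n in (2, 3, 4, 5):
    print(f"{n} sequences of 300 residues: {dp_cells(300, n):,} cells")
```

Each added sequence multiplies the work by the sequence length, which is why the exact approach stops being practical so quickly.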
One method to work around the computational problem is progressive alignment based upon pairwise alignment of all sequences

CLUSTAL has a number of variations; the most commonly used is:
CLUSTALW
Hierarchical multiple alignment program.
Generates pairwise alignments of all input sequences, then ranks scores of identities among pairs of sequences.
High scoring pairs of sequences align most readily to each other.
More divergent (less related) pairs are then added to the alignment.
So it generates a phylogenetic tree of relationships to determine steps in constructing the alignment.
One can view the phylogenetic tree used to generate the alignment
Individual pairs in the alignment are aligned using either a FASTA-type algorithm (word-based, fast alignment) or a dynamic programming algorithm, which is slower but produces optimal pairwise alignments.
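The ranking step can be sketched as follows. This is not CLUSTALW's actual scoring: as a stand-in for its pairwise scores, the sketch uses the fraction of shared 2-residue words, which is a crude word-based similarity in the spirit of the fast method. The sequence names, the third sequence, and the function names are hypothetical.

```python
from itertools import combinations

def ktuple_similarity(a, b, k=2):
    """Fraction of distinct k-residue words shared by two sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(words_a & words_b) / min(len(words_a), len(words_b))

def rank_pairs(sequences, k=2):
    """Score every pair and list the most similar pairs first."""
    scored = [
        (ktuple_similarity(sequences[x], sequences[y], k), x, y)
        for x, y in combinations(sorted(sequences), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)

toy = {
    "seqA": "PVILEPMMKVTIEMP",
    "seqB": "PVILEPIMRVEVTTP",
    "seqC": "GGSWQNLLKRAHHCD",
}
ranking = rank_pairs(toy)
for score, x, y in ranking:
    print(f"{x} vs {y}: {score:.2f}")
```

The pair at the top of the list would be aligned first, and the more divergent sequence would be added to that alignment afterwards.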

FASTA-like algorithm
Slower, optimal algorithm

Trees based on alignment of RNaseP protein sequences. Note that some sequences are grouped with different organisms depending upon algorithm.
Could have big effect on tree in some cases, but in this case it did not. Compare the fast alignment pdf and the slow alignment pdf
Note that in both cases the secondary structure is only interrupted once in a helix (h) or sheet (s).

Dynamic Programming

Takes results of dot matrix and computes new matrix with penalties for gaps
Highest score at end of matrix represents optimal alignment end point.

Word Size = 1

  P V I L E P M M K V T I E M P
P 1         1                 1
V   1               1
I     1                 1
L       1
E         1               1
P 1         1                 1
I     1                 1
M             1 1           1
R
V   1               1
E         1               1
V   1               1
T                     1
T                     1
P 1         1                 1

Word Size = 3

  P V I L E P M M K V T I E M P
P 3         1     1 1         1
V   3               1 1
I     3               1 1   1 1
L       3               1 1   1
E 1       2         1     1 1
P 1 1     1 2         1 1     1
I     1     1 1         1 1
M             1 2           1
R 1   1           1   1     1 1
V   1   1       1   1   1     1
E 1       1       2       1
V   1             1 2
T       1           1 1   1
T     1   1           2     2 1
P 1   1 1   1         1 1   1 3

Word Size = 3, threshold=2

  P V I L E P M M K V T I E M P
P 3
V   3
I     3
L       3
E         2
P           2
I
M               2
R
V
E                 2
V                   2
T
T                     2     2
P                             3
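The effect of word size and threshold on the dot matrices can be sketched in a few lines. This is an illustration under assumptions, not the program that produced the figures: each cell counts identities in a window of `word` residues starting at that cell, and windows that run past the end of the displayed sequences are simply truncated, so cells near the right and bottom edges may differ from the figures.

```python
def dot_matrix(seq_top, seq_side, word=1, threshold=1):
    """Rows follow seq_side, columns follow seq_top.

    Each cell holds the number of identical residues in a window of
    `word` positions starting at that cell; cells below `threshold`
    are set to 0 (shown as blank in the figures).
    """
    matrix = []
    for i in range(len(seq_side)):
        row = []
        for j in range(len(seq_top)):
            score = sum(
                1
                for k in range(word)
                if i + k < len(seq_side)
                and j + k < len(seq_top)
                and seq_side[i + k] == seq_top[j + k]
            )
            row.append(score if score >= threshold else 0)
        matrix.append(row)
    return matrix

top = "PVILEPMMKVTIEMP"
side = "PVILEPIMRVEVTTP"
word1 = dot_matrix(top, side)
word3 = dot_matrix(top, side, word=3)
word3_t2 = dot_matrix(top, side, word=3, threshold=2)

for residue, row in zip(side, word3_t2):
    print(residue, " ".join(str(v) if v else " " for v in row))
```

Raising the word size and threshold removes most of the scattered single matches and leaves the main diagonal, which is the signal the alignment follows.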

 

Path is then traced back by shortest route to highest score diagonally above
Result is optimal pairwise alignment
Used for both local (Smith-Waterman) and global (Needleman-Wunsch) pairwise alignments
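A compact sketch of global (Needleman-Wunsch) alignment with traceback. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration; real programs use a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal, up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

total, top_row, bottom_row = needleman_wunsch("ACGT", "AGT")
print(total)
print(top_row)
print(bottom_row)
```

A local (Smith-Waterman) version differs mainly in that cell scores are not allowed to drop below zero and the traceback starts from the highest-scoring cell instead of the corner.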

Order of addition of sequences to multiple alignment is important, since it influences the number of gaps in the alignment.
So once pairwise alignments are produced (including gaps), gap penalties are reweighted so that they are lower in poorly conserved regions which already contain gaps.
Gaps are also weighted differently in hydrophilic vs. hydrophobic regions, since hydrophobic regions tend to be buried secondary structures, but hydrophilic regions are more exposed and variable.

All of the weighting features tend to help confine gaps to the regions of proteins where they would be expected, those not in alpha or beta structure.
If the secondary structure of one or more members of the alignment is known, this information can be used to further weight the alignment to minimize disruption of secondary structure
Sequences are also weighted so that those that diverged from the phylogenetic tree near the root have greater weight than those that diverged further out on the tree.

Progressive programs are only as good as the initial pairwise alignments, so errors can be propagated throughout the alignment.
Iterative programs recalculate initial pairwise alignments later to improve pairwise and overall alignments. MultAlin is an example.

Other programs

PSI-BLAST
Iterative version of BLAST.
Hits that pass a threshold E score are used to produce a position-specific scoring matrix (PSSM), which reflects signature conserved features of the proteins.
This PSSM is used to query a protein database to find additional proteins that fit the PSSM. New matches that pass the threshold are added to the PSSM for the next round. The database can be requeried until no additional hits passing the threshold are found.
The biggest advantage of PSI-BLAST is not necessarily the quality of the alignment, but its sensitivity for identifying essentially all matches in the database.

BLAST itself produces multiple alignments, but they are based on initial query and thus not necessarily optimal.

Hidden Markov models

Can produce excellent alignments, provided enough sequences are available.
Number of sequences required depends on diversity of sequence set, since more diverse sequences may make identification of patterns of conservation more difficult.

An HMM program (such as HMMER) is first trained on a set of sequences.
The training set need not be aligned, but it can be.
Other input data can be amino acid distribution data or substitution matrices.
Input data are used to identify statistical weights for amino acid identities, as well as insertions and deletions, for each position in the model.
Training sequences are then aligned to the model and the model readjusted to provide the best fit to the input sequences.
This step is repeated until no further improvement is made.

Additional sequences can then be fed into the system to align them to the training sequences. Model can be adjusted during this step as well.
Model can be used to produce a position-specific scoring matrix (PSSM).
PSSMs are log odds scores of finding a particular amino acid at a particular position in a protein sequence, much as the protein substitution matrices were log odds scores of substitution of one amino acid for another.
PSSM can then be used to identify new protein sequences which are related to the known aligned proteins
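The log odds idea behind a PSSM can be sketched directly from an alignment. This is a simplified illustration, not how HMMER or PSI-BLAST builds its matrices: it assumes a uniform background frequency of 1/20 for every amino acid and a pseudocount of 1 per residue, and the short aligned peptides are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; real tools use observed frequencies.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}

def build_pssm(alignment, background=BACKGROUND, pseudocount=1.0):
    """One dict of log2 odds scores per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in background)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(background)
        scores = {}
        for residue in background:
            freq = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(freq / background[residue])
        pssm.append(scores)
    return pssm

def score_sequence(pssm, seq):
    """Sum of the position-specific scores for an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))

toy_alignment = ["PVILE", "PVILE", "PVIME", "PAILE"]
pssm = build_pssm(toy_alignment)
print(round(score_sequence(pssm, "PVILE"), 2))
print(round(score_sequence(pssm, "GGGGG"), 2))
```

A sequence resembling the aligned family scores well above zero, while an unrelated one scores below zero, which is what lets a PSSM pick related proteins out of a database.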

Sequence Viewers

Need convenient means to manipulate and display multiple sequence alignments.
Most programs have basic text output, with symbols to flag conserved residues
Often convenient to rearrange order of sequences, color code them by residue type, or box conserved regions.
May also need to do some manual tweaking and editing of alignment.
A number of programs available for different types of computers

You can try looking at the fast clustal alignment or the slow clustal alignment by copying and pasting into one of these viewers.

Jalview is a Java-based editor available at the CLUSTALW web site. Since it is Java, it can be downloaded in your web browser and launched with your CLUSTAL alignment.
Can use to manipulate alignment and make PostScript or GIF files for presentation.
Can also save edited alignment in FASTA format for later use. Jalview can also be downloaded for stand-alone use.
Other available programs
CINEMA is also Java-based. You must upload your alignments to a server.

Sequence Manipulation Site has easy interface for a variety of sequence ... manipulations

ClustalX is a clustal viewer program with a nice interface, available for most computer platforms.

SeqPup
Initially Mac only, but has been ported to Java for all platforms. It is a stand-alone program (it doesn't require a browser).
BioEdit is a similar program for PCs only

Other types of viewers


SeqLogo will produce graph which displays conserved amino acids of a motif from a PSSM or alignment.
Boxshade displays alignments with conserved regions boxed for ease in identification and presentation.
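The height of each stack in a sequence logo reflects the information content of that alignment column. A minimal sketch of that calculation follows, assuming a 20-letter amino acid alphabet and omitting the small-sample correction that logo programs usually apply; the function names are made up for illustration.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Bits of information in one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy

def letter_heights(column, alphabet_size=20):
    """Height of each residue in the logo stack: frequency x information."""
    residues = [c for c in column if c != "-"]
    info = information_content(column, alphabet_size)
    return {
        residue: (count / len(residues)) * info
        for residue, count in Counter(residues).items()
    }

print(information_content("PPPP"))  # fully conserved column
print(information_content("PVPV"))  # two residues, equally frequent
print(letter_heights("PVPV"))
```

Fully conserved columns give the tallest stacks, and each additional residue seen at a position lowers the stack and splits it among the letters.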
