Multiple sequence alignments

Why do them?

Provides wealth of information about sequences being analyzed.


Structural information - protein alignment can reveal regions most conserved and critical for function, e.g. active site residues

Conserved regions in nucleic acid alignments can reveal consensus sequences important for protein binding.
They can also be used in designing PCR primers to amplify DNA from diverse species.

Alignments can also be used to infer evolutionary relationships amongst the organisms from which they were obtained.

Information from multiple alignment can help to refine a pairwise alignment of sequences. Remember the BLAST search exercises showed variations in alignments, depending upon matrix used and gap penalties.
By looking at additional sequences, one might find intermediately related sequences which help in optimizing alignment.
Or a multiple alignment may reveal which regions appear most prone to gaps, and so aid in aligning two particular sequences of interest.

Alignment accuracy


How can you tell if a multiple sequence alignment is meaningful?

Alignment programs

In theory, you can perform optimal alignment of multiple sequences by extending the pairwise algorithms, but the number of calculations needed grows as the sequence length raised to the power of the number of sequences, so it is generally impractical to calculate a true optimal alignment for more than 3 sequences.
One way to work around the computational problem is progressive alignment, which is based upon pairwise alignment of all sequences.
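To get a feel for how quickly the cost grows, the sketch below counts the cells in a full dynamic-programming lattice. It assumes every sequence has the same length and that each cell costs one unit of work. The length of 300 residues and the function name dp_cells are illustrative choices, not taken from any alignment program.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Number of lattice cells in a full dynamic-programming alignment.

    Each sequence contributes one axis with seq_len + 1 positions
    (the extra position stands for the empty prefix).
    """
    return (seq_len + 1) ** n_seqs


if __name__ == "__main__":
    length = 300  # assumed protein length, for illustration only
    for n in range(2, 6):
        print(f"{n} sequences: {dp_cells(length, n):.3e} cells")
```

Each added sequence multiplies the work by roughly the sequence length, which is why heuristic methods take over beyond a handful of sequences.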

CLUSTAL

Has a number of variations; the most commonly used is:

CLUSTALW

Hierarchical multiple alignment program.

Sample output from clustalw as it runs:
Start of Pairwise alignments
 Aligning...
 Sequences (1:2) Aligned. Score:  73
 Sequences (1:3) Aligned. Score:  43
 Sequences (1:4) Aligned. Score:  39
 Sequences (1:5) Aligned. Score:  41
 Sequences (1:6) Aligned. Score:  27
 Sequences (1:7) Aligned. Score:  22
 Sequences (2:3) Aligned. Score:  37
 Sequences (2:4) Aligned. Score:  39
 Sequences (2:5) Aligned. Score:  38
 Sequences (2:6) Aligned. Score:  28
 Sequences (2:7) Aligned. Score:  22
 Sequences (3:4) Aligned. Score:  44
 Sequences (3:5) Aligned. Score:  51
 Sequences (3:6) Aligned. Score:  33
 Sequences (3:7) Aligned. Score:  22
 Sequences (4:5) Aligned. Score:  50
 Sequences (4:6) Aligned. Score:  34
 Sequences (4:7) Aligned. Score:  24
 Sequences (5:6) Aligned. Score:  32
 Sequences (5:7) Aligned. Score:  20
 Sequences (6:7) Aligned. Score:  18
 Guide tree        file created:   [./Gene_9.dnd]
 Start of Multiple Alignment
 There are 6 groups
 Aligning...
 Group 1: Sequences:   2      Score:5430
 Group 2: Sequences:   2      Score:4776
 Group 3: Sequences:   3      Score:4578
 Group 4: Sequences:   5      Score:4138
 Group 5: Sequences:   6      Score:3100
 Group 6:                     Delayed
 Sequence:7     Score:2418
 Alignment Score 11280

 Consensus length = 344
 
 CLUSTAL W (1.82) multiple sequence alignment


 9_T4            -----MFIQEPKKLIDTGEIGNASTGDILFDGGNKINSDFNAIYNAFGDQRKMAVANG--
 9_RB69          -----MYIQTPKQLIDVGEIGNASTGDILFDGGVKINNDINAIYNAFGDQRKMATANG--
 9_RB49          ------MAQVPKKRIDTGTVGNPSTGDILYDGGNKINDNFDSIYDAFADQR-FKDVD---
 9_44RR          ------MLQDTKKRVDVGVIGNPSTGDILYDGGVKLNTVIDALYNTFGDYRLYSSAS---
 9_RB43          -----MAYQTGKKLIDVGQIGNPSTGDPLYDGGVKLNEIITNVYNAFGDVR-LLTAN---
 9_Aeh1          -------MQQTKILVDTGEEGVEGTGDTLYQGGTSLNRNLNSLYNVFGDYRQFKSDAG--
 9_KVP40         MFAPFLEIKMGRQRLETGTAGSVGTGDTIHNGGNKLRDNSDEFYHINADFELTEAGENYI
                         :  :  ::.*  *  .*** :.:** .:.     .*.  .* .

 9_T4            --TGADGQIIHATGYYQKHSITEYAT------------PVKVGTRHDIDTS----TVGVK
 9_RB69          --TGPNGQIIHATGYYQKGGPTDYFT------------PVPVGSRHDIDAS----TGGVI
 9_RB49          --HGVGRMVIHATGYYQKHPRQSYAGS-----------AVEIGSLHDIDTT----QGALI
 9_44RR          --EGADSQILHATGFYQKLSRQYYAGN-----------PVDIGSMHDADTT----SGALT
 9_RB43          --DGVGTMLLHATGYFQKLPRTYYSAN-----------PIELGSLHDMDTA----TGPIT
 9_Aeh1          --QGNQVMTLHAGGYYQKHSRAYYAGAEQPSG-----NPVEFGSLHDLSVTRDG-AGDLV
 9_KVP40         PAYNDTSPRPFAGGYWQRIPPSSTAGGSGGPNNTNLLTSIITGGKYDIDTSALGGGASMT
                    .      .* *::*:                    .:  *  :* ..:       :

Pairwise alignments in CLUSTALW can be made with either of two methods:
Fast, FASTA-like algorithm
Slower, optimal algorithm

Trees based on alignment of RNaseP protein sequences. Note that some sequences are grouped with different organisms depending upon the algorithm.
The choice of algorithm could have a big effect on the tree in some cases, but in this case it did not. Compare the fast alignment pdf and the slow alignment pdf.
Note that in both cases the secondary structure is only interrupted once in a helix (h) or sheet (s).

 

 

Order of addition of sequences to multiple alignment is important, since it influences the number of gaps in the alignment.

All of the weighting features tend to help confine gaps to the regions of proteins where they would be expected, i.e. those not in alpha-helix or beta-sheet structure.

Progressive programs are only as good as the initial pairwise alignments, so errors can be propagated throughout the alignment.
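The pairwise scores in the sample clustalw output above are what determine the order of addition. As a rough sketch, the code below clusters those 21 scores by simple average linkage, always joining the two most similar groups. This is a simplification for illustration: CLUSTALW itself builds its guide tree by neighbor joining, and the function names here are made up for this example.

```python
from itertools import combinations

# Pairwise scores copied from the sample clustalw output above.
SCORES = {
    (1, 2): 73, (1, 3): 43, (1, 4): 39, (1, 5): 41, (1, 6): 27, (1, 7): 22,
    (2, 3): 37, (2, 4): 39, (2, 5): 38, (2, 6): 28, (2, 7): 22,
    (3, 4): 44, (3, 5): 51, (3, 6): 33, (3, 7): 22,
    (4, 5): 50, (4, 6): 34, (4, 7): 24,
    (5, 6): 32, (5, 7): 20,
    (6, 7): 18,
}


def pair_score(a, b):
    """Look up the score for two sequence numbers given in either order."""
    return SCORES[(a, b)] if a < b else SCORES[(b, a)]


def average_score(group_a, group_b):
    """Mean pairwise score between all members of two groups."""
    total = sum(pair_score(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def merge_order(n_seqs):
    """Repeatedly join the two most similar groups (average linkage)."""
    groups = [(i,) for i in range(1, n_seqs + 1)]
    merges = []
    while len(groups) > 1:
        a, b = max(combinations(groups, 2), key=lambda p: average_score(*p))
        groups.remove(a)
        groups.remove(b)
        merged = tuple(sorted(a + b))
        groups.append(merged)
        merges.append(merged)
    return merges


if __name__ == "__main__":
    for step, group in enumerate(merge_order(7), start=1):
        print(f"Step {step}: sequences {group}")
```

With these scores the groups grow to 2, 2, 3, 5, 6 and finally 7 sequences, and sequence 7 (the most divergent) joins last. That is the same pattern of group sizes reported in the log, although the real program may not reach it by the same route.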

Other programs

PSI-BLAST
Iterative version of BLAST.
Hits that pass an E-value threshold are used to produce a position-specific scoring matrix (PSSM), which reflects signature conserved features of the proteins.
This PSSM is used to query a protein database to find additional proteins that fit the PSSM. New matches that pass the threshold are added to the PSSM for the next round. The database can be requeried until no additional hits passing the threshold are found.
Biggest advantage of PSI-BLAST is not necessarily the quality of the alignment, but its sensitivity in identifying distantly related matches in the database.
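The iterate-until-no-new-hits loop can be sketched with a toy example. This is not how PSI-BLAST is implemented: it assumes ungapped sequences of equal length, a uniform background frequency, and a raw score cutoff in place of an E-value. All names, sequences and the cutoff of 5.0 are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumed uniform background frequency


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds scores (bits) per position from equal-length sequences."""
    pssm = []
    for column in zip(*seqs):
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / BACKGROUND)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, seq):
    """Sum the PSSM entries for the residues of an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


def iterative_search(query, database, threshold, max_rounds=10):
    """Rebuild the PSSM from all hits so far until no new hits appear."""
    hits = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(hits)
        new_hits = [s for s in database
                    if s not in hits and score(pssm, s) >= threshold]
        if not new_hits:
            break
        hits.extend(new_hits)
    return hits


if __name__ == "__main__":
    # Hypothetical 8-residue "proteins"; the last one is unrelated.
    database = ["ACDEFGHK", "ACDEWGHK", "ACDRWGHK", "MNPQRSTV"]
    print(iterative_search("ACDEFGHI", database, threshold=5.0))
```

In this toy run the third database sequence scores below the cutoff against the query alone, but is picked up in the second round once the first two hits have been folded into the PSSM. That is the gain in sensitivity the iteration is meant to provide.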

BLAST itself produces multiple alignments, but they are based on initial query and thus not necessarily optimal.

Hidden Markov models

Can produce excellent alignments, provided enough sequences are available.
Number of sequences required depends on diversity of sequence set, since more diverse sequences may make identification of patterns of conservation more difficult.

An HMM program (such as HMMer) is first trained on a set of sequences.
The training set need not be aligned, but it can be.
Other input data can be amino acid distribution data or substitution matrices.
Input data are used to estimate statistical weights for amino acid identities, as well as for insertions and deletions, for each position in the model.
Training sequences are then aligned to the model, and the model is readjusted to provide the best fit to the input sequences.
This step is repeated until no further improvement is made.

Additional sequences can then be fed into the system to align them to the training sequences. The model can be adjusted during this step as well.
The model can be used to produce a position-specific scoring matrix (PSSM).
PSSMs hold log odds scores for finding a particular amino acid at a particular position in a protein sequence, much as the protein substitution matrices hold log odds scores for substitution of one amino acid for another.
The PSSM can then be used to identify new protein sequences that are related to the known aligned proteins.
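As a small illustration of using a PSSM to find a related sequence, the sketch below builds log odds scores from a gap-free, 9-column block (TGD...GG) taken from the sample CLUSTALW alignment above, then slides the matrix along a test sequence. The uniform background, the pseudocount of 1, the function names and the test sequence are all assumptions made for this example. It is a plain count-based PSSM, not one derived from an HMM.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Gap-free block from the sample alignment (one row per phage sequence).
MOTIF_BLOCK = [
    "TGDILFDGG",  # 9_T4
    "TGDILFDGG",  # 9_RB69
    "TGDILYDGG",  # 9_RB49
    "TGDILYDGG",  # 9_44RR
    "TGDPLYDGG",  # 9_RB43
    "TGDTLYQGG",  # 9_Aeh1
    "TGDTIHNGG",  # 9_KVP40
]


def log_odds_pssm(aligned, background=None, pseudocount=1.0):
    """Build a PSSM of log2 odds scores from gap-free aligned sequences."""
    if background is None:
        # Assumption: every amino acid is equally likely in the background.
        background = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / background[aa])
            for aa in ALPHABET
        })
    return pssm


def best_window(pssm, sequence):
    """Slide the PSSM along a sequence; return (best score, start index)."""
    width = len(pssm)
    scored = [
        (sum(col[aa] for col, aa in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    pssm = log_odds_pssm(MOTIF_BLOCK)
    hypothetical = "MKLAAVTGDTLYDGGKRS"  # made-up sequence containing the motif
    best, start = best_window(pssm, hypothetical)
    print(f"best score {best:.2f} bits at position {start}")
```

A positive score means the window looks more like the aligned family than like background; real searches use measured background frequencies and a statistical cutoff rather than the raw maximum.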

Sequence Viewers

Need convenient means to manipulate and display multiple sequence alignments.
Most programs have basic text output, with symbols to flag conserved residues
Often convenient to rearrange order of sequences, color code them by residue type, or box conserved regions.
May also need to do some manual tweaking and editing of alignment.
A number of programs are available for different types of computers.
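The symbol line under a text alignment is simple to compute. The sketch below marks only fully conserved columns with '*'. The ':' and '.' symbols that CLUSTAL prints for conservative substitutions are left out, because they depend on its residue group tables. The rows are a 13-column slice of the sample CLUSTALW alignment above, and the function name is illustrative.

```python
# A 13-column slice of the first block of the sample alignment.
BLOCK = {
    "9_T4":    "GNASTGDILFDGG",
    "9_RB69":  "GNASTGDILFDGG",
    "9_RB49":  "GNPSTGDILYDGG",
    "9_44RR":  "GNPSTGDILYDGG",
    "9_RB43":  "GNPSTGDPLYDGG",
    "9_Aeh1":  "GVEGTGDTLYQGG",
    "9_KVP40": "GSVGTGDTIHNGG",
}


def conservation_line(rows):
    """Return '*' under fully conserved columns and ' ' elsewhere."""
    marks = []
    for column in zip(*rows):
        # A column of gaps is identical in every row but is not conserved.
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    for name, row in BLOCK.items():
        print(f"{name:<10}{row}")
    print(" " * 10 + conservation_line(list(BLOCK.values())))
```

The stars fall under the same columns that carry '*' in the CLUSTALW output for this region.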

You can try looking at the fast clustal alignment or the slow clustal alignment by copying and pasting into one of these viewers.

Jalview is a Java-based editor available at the CLUSTALW web site. Since it is written in Java, it can be downloaded in your web browser and launched with your CLUSTAL alignment.
It can be used to manipulate the alignment and make PostScript or GIF files for presentation.
It can also save the edited alignment in FASTA format for later use. Jalview can also be downloaded for stand-alone use.

Other available programs
CINEMA is also Java-based. You must upload your alignments to a server.

Sequence Manipulation Site has easy interface for a variety of sequence ... manipulations

ClustalX is a clustal viewer program with a nice interface, available for most computer platforms.

SeqPup
Initially Mac only, but has been ported to Java for all platforms. It is a stand-alone program (it doesn't require a browser).
BioEdit is a similar program for PCs only.

Other types of viewers


SeqLogo produces a graph that displays the conserved amino acids of a motif from a PSSM or alignment.
Boxshade displays alignments with conserved regions boxed for ease in identification and presentation.
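The height of each stack in a sequence logo reflects how conserved that column is. A minimal sketch of the usual calculation follows: information content in bits is the maximum possible entropy minus the observed entropy, and each letter's height is its frequency times that value. It assumes gap-free columns and omits the small-sample correction that logo programs normally apply. The function names are illustrative and are not SeqLogo's own.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(column, alphabet_size=20):
    """Height of each letter in a logo column: frequency times information."""
    info = column_information(column, alphabet_size)
    n = len(column)
    return {aa: (c / n) * info for aa, c in Counter(column).items()}


if __name__ == "__main__":
    # Two columns from the sample alignment: an invariant G and a variable one.
    for column in ("GGGGGGG", "FFYYYYH"):
        heights = logo_heights(column)
        print(column, {aa: round(h, 2) for aa, h in heights.items()})
```

An invariant protein column reaches the maximum of log2(20), about 4.32 bits, while a variable column gives a shorter stack split among its residues.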

Alignment Assignment