Provides a wealth of information about the sequences being analyzed.
Structural information - protein alignment can reveal the regions that are most conserved
and critical for function, e.g. active site residues.
Conserved regions in nucleic acids reveal consensus sequences important for protein
binding.
Can also be used in designing PCR primers to amplify DNA from diverse species.
Alignments can also be used to infer evolutionary relationships amongst the organisms from which they were obtained.
Information from a multiple alignment can help to refine a pairwise alignment
of sequences. Remember that the BLAST search exercises showed variations in alignments,
depending upon the matrix used and the gap penalties.
By looking at additional sequences, one might find intermediately related sequences
which help in optimizing the alignment.
Or a multiple alignment may reveal which regions appear most prone to gaps, and
so aid in aligning two particular sequences of interest.
How can you tell if a multiple sequence alignment is meaningful?
In theory, you can perform optimal alignment of multiple sequences by extension
of the pairwise algorithms, but the number of calculations needed is roughly the sequence length
raised to the power of the number of sequences, so it is generally impractical
to calculate a true optimal alignment for more than 3 sequences.
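To get a feel for how quickly this grows, the estimate above can be turned into a few lines of Python. This is only a rough sketch: the function name `lattice_cells` and the example length of 300 residues are illustrative assumptions, and the count ignores constant factors.

```python
def lattice_cells(seq_length, n_sequences):
    """Rough count of dynamic-programming cells: length ** number of sequences."""
    return seq_length ** n_sequences


# Assumed example: sequences of about 300 residues each.
for n in (2, 3, 4, 7):
    print(f"{n} sequences: {lattice_cells(300, n):.2e} cells")
```

Each added sequence multiplies the work by the sequence length, which is why the exact method stops being practical so quickly.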
One method to work around the computational problem is progressive alignment,
based upon pairwise alignment of all sequences.
It has a number of variations; the most commonly used is CLUSTAL W,
a hierarchical multiple alignment program.
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 73
Sequences (1:3) Aligned. Score: 43
Sequences (1:4) Aligned. Score: 39
Sequences (1:5) Aligned. Score: 41
Sequences (1:6) Aligned. Score: 27
Sequences (1:7) Aligned. Score: 22
Sequences (2:3) Aligned. Score: 37
Sequences (2:4) Aligned. Score: 39
Sequences (2:5) Aligned. Score: 38
Sequences (2:6) Aligned. Score: 28
Sequences (2:7) Aligned. Score: 22
Sequences (3:4) Aligned. Score: 44
Sequences (3:5) Aligned. Score: 51
Sequences (3:6) Aligned. Score: 33
Sequences (3:7) Aligned. Score: 22
Sequences (4:5) Aligned. Score: 50
Sequences (4:6) Aligned. Score: 34
Sequences (4:7) Aligned. Score: 24
Sequences (5:6) Aligned. Score: 32
Sequences (5:7) Aligned. Score: 20
Sequences (6:7) Aligned. Score: 18
Guide tree file created: [./Gene_9.dnd]
Start of Multiple Alignment
There are 6 groups
Aligning...
Group 1: Sequences: 2 Score:5430
Group 2: Sequences: 2 Score:4776
Group 3: Sequences: 3 Score:4578
Group 4: Sequences: 5 Score:4138
Group 5: Sequences: 6 Score:3100
Group 6: Delayed Sequence:7 Score:2418
Alignment Score 11280
Consensus length = 344

CLUSTAL W (1.82) multiple sequence alignment

9_T4      -----MFIQEPKKLIDTGEIGNASTGDILFDGGNKINSDFNAIYNAFGDQRKMAVANG--
9_RB69    -----MYIQTPKQLIDVGEIGNASTGDILFDGGVKINNDINAIYNAFGDQRKMATANG--
9_RB49    ------MAQVPKKRIDTGTVGNPSTGDILYDGGNKINDNFDSIYDAFADQR-FKDVD---
9_44RR    ------MLQDTKKRVDVGVIGNPSTGDILYDGGVKLNTVIDALYNTFGDYRLYSSAS---
9_RB43    -----MAYQTGKKLIDVGQIGNPSTGDPLYDGGVKLNEIITNVYNAFGDVR-LLTAN---
9_Aeh1    -------MQQTKILVDTGEEGVEGTGDTLYQGGTSLNRNLNSLYNVFGDYRQFKSDAG--
9_KVP40   MFAPFLEIKMGRQRLETGTAGSVGTGDTIHNGGNKLRDNSDEFYHINADFELTEAGENYI
          : : ::.* * .*** :.:** .:. .*. .* .

9_T4      --TGADGQIIHATGYYQKHSITEYAT------------PVKVGTRHDIDTS----TVGVK
9_RB69    --TGPNGQIIHATGYYQKGGPTDYFT------------PVPVGSRHDIDAS----TGGVI
9_RB49    --HGVGRMVIHATGYYQKHPRQSYAGS-----------AVEIGSLHDIDTT----QGALI
9_44RR    --EGADSQILHATGFYQKLSRQYYAGN-----------PVDIGSMHDADTT----SGALT
9_RB43    --DGVGTMLLHATGYFQKLPRTYYSAN-----------PIELGSLHDMDTA----TGPIT
9_Aeh1    --QGNQVMTLHAGGYYQKHSRAYYAGAEQPSG-----NPVEFGSLHDLSVTRDG-AGDLV
9_KVP40   PAYNDTSPRPFAGGYWQRIPPSSTAGGSGGPNNTNLLTSIITGGKYDIDTSALGGGASMT
          . .* *::*: .: * :* ..: :
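The way pairwise scores drive the order of progressive alignment can be illustrated with the scores from the log above. CLUSTAL W itself builds its guide tree by neighbor-joining; the sketch below swaps in a simpler greedy average-linkage clustering, so it is an illustration under that assumption and not CLUSTAL's own procedure. The names `SCORES`, `average_score` and `merge_order` are hypothetical.

```python
from itertools import combinations

# Pairwise scores copied from the CLUSTAL W log (sequence numbers 1-7).
SCORES = {
    (1, 2): 73, (1, 3): 43, (1, 4): 39, (1, 5): 41, (1, 6): 27, (1, 7): 22,
    (2, 3): 37, (2, 4): 39, (2, 5): 38, (2, 6): 28, (2, 7): 22,
    (3, 4): 44, (3, 5): 51, (3, 6): 33, (3, 7): 22,
    (4, 5): 50, (4, 6): 34, (4, 7): 24,
    (5, 6): 32, (5, 7): 20,
    (6, 7): 18,
}


def average_score(group_a, group_b, scores):
    """Mean pairwise score between the members of two groups."""
    total = sum(scores[tuple(sorted((a, b)))] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def merge_order(scores):
    """Greedy average-linkage clustering: repeatedly join the two most similar groups."""
    members = sorted({seq for pair in scores for seq in pair})
    groups = [(m,) for m in members]
    order = []
    while len(groups) > 1:
        a, b = max(combinations(groups, 2),
                   key=lambda pair: average_score(pair[0], pair[1], scores))
        groups = [g for g in groups if g not in (a, b)]
        merged = tuple(sorted(a + b))
        groups.append(merged)
        order.append(merged)
    return order


for step, group in enumerate(merge_order(SCORES), start=1):
    print(f"Step {step}: sequences {group}")
```

On these scores the sketch joins sequences 1 and 2, then 3 and 5, then adds 4 to the second pair, merges the two groups, and finally adds 6 and then 7. The resulting group sizes (2, 2, 3, 5, 6, then sequence 7 last) are consistent with the log, although the log does not list which sequences are in each group.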
Figure: FASTA-like algorithm
Figure: Slower, optimal algorithm
Trees based on alignment of RNaseP protein sequences. Note that some sequences
are grouped with different organisms depending upon algorithm.
Could have a big effect on the tree in some cases, but in this case it did not.
Compare the fast alignment pdf and the slow alignment pdf.
Note that in both cases the secondary structure is only interrupted once in
a helix (h) or sheet (s).
Order of addition of sequences to multiple alignment is important, since it influences the number of gaps in the alignment.
All of the weighting features tend to help confine gaps to the regions of proteins where they would be expected, those not in alpha or beta structure.
Progressive programs are only as good as the initial pairwise alignments, so errors can be propagated throughout the alignment.
PSI-BLAST
Iterative version of BLAST.
Hits with E values better (lower) than a threshold are used to produce a position-specific scoring
matrix (PSSM), which reflects signature conserved features of the proteins.
This PSSM is used to query a protein database to find additional proteins that
fit the PSSM. New matches that pass the threshold are added to the PSSM for the next round.
The database can be requeried until no additional hits passing the threshold are found.
The biggest advantage of PSI-BLAST is not necessarily the quality of the alignment,
but its sensitivity for identifying essentially all matches in the database.
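The control flow of this iteration can be sketched with a toy example. Nothing here calls BLAST: `iterative_search`, `toy_search` and the `DETECTABLE` table are hypothetical stand-ins, and the "profile" is represented simply by the set of sequences found so far and not by a real PSSM.

```python
def iterative_search(query, search, max_rounds=10):
    """Requery until a round adds no new hits; returns (hits, rounds_run)."""
    hits = {query}
    rounds_run = 0
    for _ in range(max_rounds):
        rounds_run += 1
        new_hits = search(hits) - hits
        if not new_hits:
            break  # converged: nothing new passed the threshold
        hits |= new_hits
    return hits, rounds_run


# Toy "database": which sequences are detectably similar to which (assumed data).
DETECTABLE = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}


def toy_search(profile_members):
    """Return everything detectable from any sequence already in the profile."""
    found = set()
    for member in profile_members:
        found |= DETECTABLE[member]
    return found


hits, rounds = iterative_search("A", toy_search)
print(sorted(hits), rounds)
```

In the toy data, D is not detectable from the query A directly, but it is reached through the intermediates B and C, which is the source of the sensitivity gain from iterating.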
BLAST itself produces multiple
alignments, but they are based on initial query and thus not necessarily optimal.
Hidden Markov models (HMMs) can produce excellent alignments, provided enough sequences are available.
The number of sequences required depends on the diversity of the sequence set, since more
diverse sequences may make identification of patterns of conservation more difficult.
An HMM program (such as HMMer) is first
trained on a set of sequences.
The training set need not be aligned, but it can be.
Other input data can be amino acid distribution data or substitution matrices.
Input data are used to identify statistical weights for amino acid identities,
as well as insertions and deletions, for each position in the model.
Training sequences are then aligned to the model and the model readjusted to
provide the best fit to the input sequences.
This step is repeated until no further improvement is made.
Additional sequences can then be fed into the system to align them to the training
sequences. Model can be adjusted during this step as well.
The model can be used to produce a position-specific scoring matrix (PSSM).
PSSMs are log odds scores of finding a particular amino acid at a particular
position in a protein sequence, much as the protein substitution matrices were
log odds scores of substitution of one amino acid for another.
The PSSM can then be used to identify new protein sequences which are related to
the known aligned proteins.
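The log odds idea can be made concrete with a small sketch that builds a PSSM directly from aligned columns. It is a simplified illustration and not what any particular program does: the uniform background frequency (1/20 per amino acid), the pseudocount of 1, and the names `build_pssm` and `score_segment` are all assumptions. The rows are a short gap-free segment taken from the CLUSTAL W alignment shown earlier.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict per column: log2(observed frequency / background frequency)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    denominator = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (column.count(aa) + pseudocount) / denominator
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Gap-free segment from the seven aligned sequences (T4 ... KVP40).
ROWS = ["STGDIL", "STGDIL", "STGDIL", "STGDIL", "STGDPL", "GTGDTL", "GTGDTI"]

pssm = build_pssm(ROWS)
print(round(pssm[1]["T"], 2), round(pssm[1]["A"], 2))
print(score_segment(pssm, "STGDIL") > score_segment(pssm, "AAAAAA"))
```

A residue seen in every sequence at a position gets a positive score, and one never seen there gets a negative score, so segments resembling the aligned family score higher than unrelated ones.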
A convenient means is needed to manipulate and display multiple sequence alignments.
Most programs have basic text output, with symbols to flag conserved residues.
It is often convenient to rearrange the order of sequences, color code them by residue
type, or box conserved regions.
Some manual tweaking and editing of the alignment may also be needed.
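The symbols that text output uses to flag conserved residues can be sketched in a few lines. This minimal example only marks fully identical columns with `*`, as CLUSTAL output does; the `:` and `.` symbols for similar residues are not handled. `identity_line` is a hypothetical helper, and the rows are a short gap-free segment taken from the CLUSTAL W alignment shown earlier.

```python
def identity_line(aligned):
    """Return '*' under columns where all sequences share one residue, else a space."""
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


ROWS = ["STGDIL", "STGDIL", "STGDIL", "STGDIL", "STGDPL", "GTGDTL", "GTGDTI"]

for row in ROWS:
    print(row)
print(identity_line(ROWS))
```

Here only the T, G and D columns are identical in all seven sequences, so only those three positions are starred.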
A number of programs are available for different types of computers.
You can try looking at the fast clustal alignment or the slow clustal alignment by copying and pasting into one of these viewers.
Jalview is a Java-based editor available at the CLUSTALW web site. Since it is Java, it
can be downloaded in your web browser and launched with your CLUSTAL alignment.
It can be used to manipulate the alignment and make PostScript or GIF files for presentation.
It can also save the edited alignment in FASTA format for later use. Jalview can also
be downloaded for stand-alone use.
Other available programs
CINEMA is also Java-based.
You must upload your alignments to a server.
Sequence Manipulation Site has an easy interface for a variety of sequence ... manipulations.
ClustalX is a clustal viewer program with a nice interface, available for most computer platforms.
SeqPup was initially Mac only, but has been ported to Java for all platforms. It is a
stand-alone program (it doesn't require a browser).
BioEdit is a similar program for PCs only.
SeqLogo will produce a graph which displays the conserved amino acids of a motif from a PSSM or
alignment.
Boxshade displays alignments with conserved regions boxed for ease in identification and presentation.