Sequence alignment

BLAST sequence assignment


Alignment of DNA or protein sequences in bioinformatics serves a number of purposes:

DNA sequence data can be aligned to assemble larger contiguous DNA sequences
Protein sequences can be aligned to identify important regions, such as active sites
DNA sequences can be aligned to identify regulatory control regions

Alignment of any two sequences implies that those sequences are related.

Overlapping DNA sequence reads are derived from a single DNA molecule
Protein, RNA, or DNA sequences from a particular gene for different organisms are related by the ancestry of those organisms
Duplicated genes (and pseudogenes) within an organism are related to each other or a particular parental gene

Molecular sequences that are not related in some way do not produce statistically significant sequence alignments.

A key to answering many bioinformatics questions, then, lies in the ability to reliably align sequence information.

Sequences can be aligned:
Globally, in which as many characters as possible over the entire sequence are aligned
Or locally, in which regions of sequences which are highly similar are aligned


Sequences can be compared in pairs or as multiple alignments of related sequences
In practice, all these alignments are performed similarly, and can be understood in the context of a dot-matrix sequence comparison.

Dot Matrix Analysis

Dot matrix analysis displays the primary sequence of pairs of sequences on the X and Y axes of a graph.
Dots are plotted on the graph where the X and Y coordinate sequences are identical.
Regions of identical sequence are revealed as diagonal rows of dots.
Random matches are seen as isolated dots.


A comparison of DNA sequences for elongation factor EF-G from S. aureus and T. thermophilus is shown below left. A dot is placed every time a nucleotide in the S. aureus sequence matches that of T. thermophilus. Since there are only four types of nucleotides, random matches occur in many places throughout the sequence and obscure regions of significant similarity.
A similar analysis of the corresponding peptide sequences is shown at right. Random matches of the 20 amino acids also occur, but diagonal regions of sequence similarity are all still discernible.

Left panel: DNA alignment, single-nucleotide identity. Right panel: protein alignment, single-residue identity.

Random matches can be filtered by using a sliding window to compare blocks of sequence. In this case, a block of characters is compared between sequences. A dot is placed on the alignment only if a threshold value of matches is detected for that block.

An example is shown below on a small portion of the amino acid sequence. At left, a 1 (or a dot) is placed wherever the sequences match.
In the center panel, a score is placed for the number of matches between sequences for each block of 3 amino acids.
For 3 consecutive matches, the score is 3
For 2 out of 3 the score is 2
For 1 out of three the score is 1

Although the second matrix contains more 1's, ignoring all values below 2 (right panel) removes the random matches.

  P V I L E P M M K V T I E M P
P 1         1                 1
V   1               1          
I     1                 1      
L       1                      
E         1               1    
P 1         1                 1
I     1                 1      
M             1 1           1  
R                              
V   1               1          
E         1               1    
V   1               1          
T                     1        
T                     1        
P 1         1                 1
  P V I L E P M M K V T I E M P
P 3         1     1 1         1
V   3               1 1        
I     3               1 1   1 1
L       3               1 1   1
E 1       2         1     1 1  
P 1 1     1 2         1 1     1
I     1     1 1         1 1    
M             1 2           1  
R 1   1           1   1     1 1
V   1   1       1   1   1     1
E 1       1       2       1    
V   1             1 2          
T       1           1 1   1    
T     1   1           2     2 1
P 1   1 1   1         1 1   1 3
  P V I L E P M M K V T I E M P
P 3                            
V   3                          
I     3                        
L       3                      
E         2                    
P           2                  
I                              
M               2              
R                              
V                              
E                 2            
V                   2          
T                              
T                     2     2  
P                             3
First matrix (left panel): Word Size = 1
Second matrix (center panel): Word Size = 3
Third matrix (right panel): Word Size = 3, threshold = 2
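
The sliding-window filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, and each window is anchored at its starting position, which may not be where the panels above place the score, so dots can appear shifted relative to those panels.

```python
def dot_matrix(seq_x, seq_y, word_size=1, threshold=1):
    """Return a grid of window scores; scores below threshold become 0.

    grid[j][i] counts the identical characters between the window of
    seq_y starting at j and the window of seq_x starting at i.
    """
    n_cols = len(seq_x) - word_size + 1
    n_rows = len(seq_y) - word_size + 1
    grid = []
    for j in range(n_rows):
        row = []
        for i in range(n_cols):
            matches = sum(
                seq_x[i + k] == seq_y[j + k] for k in range(word_size)
            )
            row.append(matches if matches >= threshold else 0)
        grid.append(row)
    return grid


def show(grid):
    """Print the grid, using '.' for empty cells."""
    for row in grid:
        print(" ".join(str(v) if v else "." for v in row))


seq_x = "PVILEPMMKVTIEMP"  # column sequence from the example above
seq_y = "PVILEPIMRVEVTTP"  # row sequence from the example above

single = dot_matrix(seq_x, seq_y, word_size=1, threshold=1)
filtered = dot_matrix(seq_x, seq_y, word_size=3, threshold=2)
show(filtered)
```

Raising word_size and threshold together is what removes the isolated dots: a random single-residue match rarely has a second match next to it on the same diagonal.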

The full matrix for the proteins is shown below:

Figure panels: protein, single-residue identity; protein, identity at 5 residues of 23.

Even a low-stringency window greatly improves the signal/noise ratio. One can readily see a diagonal stretching across the matrix, indicating sequence similarity throughout the length of both sequences.

This also works well for DNA sequences, but higher stringencies are needed; longer windows help too:

Figure panels: DNA alignment, single-nucleotide identity; DNA alignment, 15 of 23 identical.


Going back to proteins, this one is Dr. Timpte's favorite!

You can also compare a sequence with itself to look for repeated sequences.

If you compare a typical protein sequence to itself, you only see a diagonal line going from upper left to lower right.

But if the sequence contains multiple repeats of the same sequence you see something much more complicated:

Protein sequence of porcine submaxillary mucin compared to itself looking for exact matches of 39 residues.

You can see diagonals of unique sequence at the ends.
In between are 135.6 repeats of an 81 residue sequence, extending a total of 10991 amino acids. This is an extreme case.
Repeats are often detectable in comparing related sequences to each other.
Dot matrix analysis makes them painfully obvious.


The alignment can be improved by raising the score for a match to +6 and penalizing a mismatch with -1:

Figure panels: protein identity, 5 residues of 23 (match=1, mismatch=0); protein identity, 5 residues of 23 (match=6, mismatch=-1).

Virtually all the background matches have been eliminated. Even better results can be obtained using weighted scoring matrices.

Substitution matrices

 

Comparison of related proteins revealed that substitution of chemically similar amino acid residues was fairly common in many positions of the protein sequences.

Other substitutions were found to be less common
Some were extremely rare.

Substitution matrices quantitate these differences in substitution frequencies for use in evaluation of alignments

Figure panels: protein identity, 5 residues of 23 (match=6, mismatch=-1); protein BLOSUM62 matrix, 5 residues of 23 match.

 

Where do scoring matrices come from?

BLOSUM Matrix (Henikoff and Henikoff, 1992)

BLOSUM matrix is based upon the BLOCKS database of protein motifs

e.g. RNase P BLOCK AxxRNR...

what are the frequencies of substitutions at the xx positions?
Allows alignment of short regions of sequence from very divergent proteins

2000 blocks of conserved patterns from 500 different families of proteins
Non-conserved amino acids counted and frequency of substitution pairs counted

BLOSUM62: sequences within a block that are 62% or more identical are clustered and counted as one sequence when substitutions are tallied

BLOSUM62 Matrix

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   6
D   -3   0  -1  -1  -2  -1   1   6
E   -4   0  -1  -1  -1  -2   0   2   5
Q   -3   0  -1  -1  -1  -2   0   0   2   5
H   -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R   -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K   -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M   -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I   -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L   -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V   -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F   -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y   -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W   -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

What the scores mean:
Score=0: the substitution of one amino acid for the other occurs as often as expected by random chance
Score>0: the substitution occurs more often than expected by chance (i.e. evolutionarily selected for)
Score<0: the substitution occurs less often than expected by chance (i.e. evolutionarily selected against)
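
The sign convention above follows from the scores being log-odds ratios. The sketch below shows that relationship, assuming the half-bit scaling (2 x log2) in which BLOSUM62 is commonly reported; the function name and the frequencies are made up for illustration and are not taken from the BLOCKS data.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    observed_freq: how often the pair is seen in aligned blocks.
    expected_freq: how often the pair would occur by chance.
    scale=2.0 gives half-bit units (an assumption of this sketch).
    """
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, chosen only to show the three cases.
as_expected = log_odds_score(0.010, 0.010)    # seen as often as chance
more_frequent = log_odds_score(0.040, 0.010)  # seen 4x more than chance
less_frequent = log_odds_score(0.0025, 0.010) # seen 4x less than chance
print(as_expected, more_frequent, less_frequent)
```

Because the scores are logarithms, adding the scores along an alignment corresponds to multiplying the odds for each aligned pair.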

Gap penalties

Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any two sequences. Gaps in a protein sequence also imply insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap and a separate penalty for extending it.
e.g.

    AGGTTCGACT
    --GT-C-A-T

Usually the penalty is high for opening a new gap but smaller for extending an existing one. (If a gap is actually allowed in a protein sequence, such as variation in the size of a loop, the size difference is probably not critical.) Gap and extension penalties vary with the matrix used. Usually the default values for a given scoring matrix work best.
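
A small sketch of how open and extension penalties add up for the gapped example above. It assumes that a gap of n residues costs gap_open + gap_extend x (n - 1); some programs instead charge gap_open + gap_extend x n, so treat the convention as an assumption. The values 10 and 2 are borrowed from the BLOSUM50 FASTA settings quoted later in these notes, and the function names are hypothetical.

```python
import re


def gap_lengths(aligned_seq):
    """Lengths of each run of '-' characters in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def affine_gap_cost(aligned_seq, gap_open=10, gap_extend=2):
    """Total gap cost, assuming open + extend * (n - 1) per gap of length n."""
    return sum(
        gap_open + gap_extend * (n - 1) for n in gap_lengths(aligned_seq)
    )


scattered = "--GT-C-A-T"   # the gapped sequence from the example above
one_block = "GTCAT-----"   # same five gap positions, in a single gap

print(gap_lengths(scattered), affine_gap_cost(scattered))
print(gap_lengths(one_block), affine_gap_cost(one_block))
```

Under this convention the four separate gaps cost 42 while one gap of the same total length costs 18, which is why alignments that scatter gaps freely are heavily penalized.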

So how about the actual alignments?

Most commonly used algorithms are FASTA and BLAST

Both are useful for searching for matches of your sequence to a large database, such as GenBank

To do so, both use simplifying shortcuts to minimize computational time without sacrificing much sensitivity

BLAST (we will do the BLAST Tutorial Page together for lab next week.)

First, regions of sequence that are low complexity (like ST-rich regions, which are easily substituted) are filtered out
Then a list of overlapping 3 or 4 letter words is derived from the sequence
Substitution score is used to identify high scoring words above a threshold value

-> small number of words (~50) to compare to database

Database already indexed for high-scoring words
Matches within given distance of other matches on same diagonal are used to align sequences
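
The word-list steps above can be sketched as follows. To keep the example short it uses only four amino acids (I, L, V, M) with their BLOSUM62 values from the table earlier in these notes; a real search uses all 20 residues and an indexed database, and the function names and threshold here are assumptions made for illustration.

```python
from itertools import product

# BLOSUM62 values for I, L, V and M only (a deliberately reduced alphabet).
SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("M", "M"): 5,
    ("I", "L"): 2, ("I", "V"): 3, ("I", "M"): 1,
    ("L", "V"): 1, ("L", "M"): 2, ("V", "M"): 1,
}
ALPHABET = "ILVM"


def pair_score(a, b):
    """Look up a substitution score in either order."""
    return SCORES.get((a, b), SCORES.get((b, a)))


def overlapping_words(seq, size=3):
    """All overlapping words of the given size, in order."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def neighborhood(word, threshold):
    """Words over ALPHABET scoring at least threshold against word."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda hit: -hit[1])


words = overlapping_words("MILVI")
near = neighborhood("ILV", threshold=11)
print(words)
print(near)
```

The threshold controls the trade-off mentioned above: a higher value gives fewer words to look up (faster), a lower value gives more words (more sensitive).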

BLAST alignment of S. aureus (Query) vs T. thermophilus (Subject)


Query:     4   EFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGXXX 63
               E+ L++ RNIGI AHIDAGKTTTTERILYYTGRIHKIGE HEGA+ MD+MEQE++RG   
Sbjct:     6   EYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITI 65
np-binding 19               ********

Query:     64  XXXXXXXXWEGHRVNIIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQA 123
                       W+ HR+NIIDTPGHVDFT+EVERS+RVLDGA+ V+D+  GVEPQ+ETVWRQA
Sbjct:     66  TAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQA 125
np-binding 83                   *****

Query:     124 TTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKC 183
                 Y VPRI F NKMDK GA++   + T+++RL A    +QLPIG ED F  IID++ MK 
Sbjct:     126 EKYKVPRIAFANKMDKTGADLWLVIRTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKA 185
np-binding 137            ****

Query:     184 FKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEA 243
               + Y NDLGT+I EI IPE++LD+A E    L+E  A+  + +M KYL  EE +  EL  A
Sbjct:     186 YTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAA 245

Query:     244 IRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-K 302
               IR+ T +++  PV+ G+A+KNKGVQL+LDAV+DYLPSPLD+ PI G   + PE EV+   
Sbjct:     246 IRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIH 302

Query:     303 ADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANS 362
                D +  +AALAFK+M DPYVG+LTF RVYSGT+TSGSYV N+TKG++ERV RLL+MHAN 
Sbjct:     303 PDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANH 362

Query:     363 RQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-IILESMEFPEPVIHLSVEPKSKADQD 421
               R+E++ + +GD+ A VGLK+T TGDTL GE    +ILES+E PEPVI +++EPK+KADQ+
Sbjct:     363 REEVEELKAGDLGAVVGLKETITGDTLVGEDAPRVILESIEVPEPVIDVAIEPKTKADQE 422

Query:     422 KMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVS 481
               K++QAL +L EEDPTF   T  ETGQ II GMGELHL+I+VDR+K+EF V+ NVG P V+
Sbjct:     423 KLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVGKPQVA 482

Query:     482 YRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSV 541
               YRET      V+GKF RQ+GGRGQYG V I+  P   G+GFEF NAIVGGV+P+EYIP+V
Sbjct:     483 YRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAV 542

Query:     542 EAGLKDAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFXXXXXXXXXXXXXXCDPVIL 601
               + G+++AM++G L G+P++D+K  LYDGSYH+VDSSEMAF               DPVIL
Sbjct:     543 QKGIEEAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVIL 602

Query:     602 EPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYVPLSEMFGYATSLRSN 661
               EP+M+V +  PEEYMGD++GD+ +RRG++ GMEPRGNAQV+ A+VPL+EMFGYAT LRS 
Sbjct:     603 EPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFVPLAEMFGYATDLRSK 662

Query:     662 TQGRGTYTMYFDHYAEVPKSIAEDIIK 688
               TQGRG++ M+FDHY EVPK + E +IK
Sbjct:     663 TQGRGSFVMFFDHYQEVPKQVQEKLIK 689

Results similar to FASTA using same matrix. XXXXX indicates regions masked for low complexity.

 

Selecting the BLAST Program

You need to pick the right version of BLAST to suit your data:
 
Program: blastp
Description: Compares an amino acid query sequence against a protein sequence database.
Key uses: Identifying known proteins of similar function in databases. Most sensitive detection of similarity.

Program: blastn
Description: Compares a nucleotide query sequence against a nucleotide sequence database. NOT necessarily the best way to compare 2 DNAs!
Key uses: Regions of DNA similarity between close relatives (megablast). Identifying promoter, splice, UTR elements.

Program: blastx
Description: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. Good way to find protein sequences related to your DNA.
Key uses: Identifying potential open reading frames with similarity to known proteins. Identifying exons in eukaryotes.

Program: tblastn
Description: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
Key uses: Looking for an ortholog of a known protein in a new DNA sequence.

Program: tblastx
Description: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Some sites do not allow use of this program because it is computationally intensive.
Key uses: Sensitive identification of similarities between distantly related genomes.



Selecting the BLAST Database (from the BLAST tutorial)

You can select several NCBI databases to compare your query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example a blastn search against swissprot).


Proteins

Database Description
nr All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF 
month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. 
swissprot The last major release of the SWISS-PROT protein sequence database.
patents Protein sequences derived from the Patent division of GenBank
yeast Protein translations of the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic CDS translations.
pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
kabat Kabat's database of sequences of immunological interest.
alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences


Nucleotides

Database Description
nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest Non-redundant database of GenBank+EMBL+DDBJ EST Divisions. Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources.
dbsts Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. Sequence Tagged Sites: random physical genomic map markers.

mouse ests The non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.
human ests The Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.
other ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions all organisms except mouse and human.
yeast Sequence fragments from the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic nucleotide sequences.
pdb Sequences derived from the 3-dimensional structure of proteins.
kabat [kabatnuc] Kabat's database of sequences of immunological interest.
patents Nucleotide sequences derived from the Patent division of GenBank
vector Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs High Throughput Genomic Sequences.

Evaluation of alignments (from the BLAST tutorial)

The Raw Score (S) is the sum of all the matches for a given alignment, minus mismatch and gap penalties. This value tends to be larger for larger proteins, so how can you tell if it is significant?

Expected (E) value is the number of alignments with scores equal to or exceeding the actual score S that would be expected by chance:

E = K m n e^(-lambda S)

where m and n are sequence lengths and K and lambda are scaling factors dependent upon the size of the search and the scoring system.

Probability (P) of finding a higher score:

P = 1 - e^(-E)

When E< 0.01, P~E

So the E value is a good measure of the significance of any alignment. Numbers approaching 1 are not likely significant.

Numbers much less than 1 are highly significant.
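
The relationship between S, E and P can be tried out numerically. This sketch assumes the usual Karlin-Altschul form E = K m n e^(-lambda S) together with the P formula above. The K and lambda values and the sequence lengths are illustrative placeholders, not values taken from these notes or from any particular BLAST run.

```python
import math


def expect_value(raw_score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e_value)


# Placeholder parameters (assumptions for illustration only).
K, LAM = 0.041, 0.267
m, n = 688, 1_000_000  # hypothetical query length and database size

for score in (40, 60, 80):
    e = expect_value(score, m, n, K, LAM)
    print(score, e, p_value(e))
```

Because E falls exponentially with S, a modest increase in raw score moves an alignment from "expected by chance" to highly significant, and for small E the printed P and E values are nearly identical.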

FASTA

Works similarly to BLAST, with some differences
Sequences are broken into words (also called k-tuples)
Matches between word characters are detected, offset of positions calculated
Series of character matches with same offset = a hit
Additional matches on same diagonal are identified -> diagonal matches
Diagonal matches within certain distance of other matches are joined using gaps
Joined regions are evaluated using substitution matrix scores
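
The first steps of this list (k-tuple lookup and offset counting) can be sketched as below. It is a simplified illustration, not the FASTA program itself: it only counts exact k-tuple matches per diagonal and leaves out the joining and rescoring steps. The function names are hypothetical, and the two sequences are the N-terminal ends of the EF-G proteins used throughout these notes.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k=2):
    """Count k-tuple matches per offset (subject position - query position)."""
    index = ktuple_index(subject, k)
    counts = Counter()
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            counts[s_pos - q_pos] += 1
    return counts


query = "MAREFSLEKTRNIGIMAHIDAGKTTTTERILYY"      # S. aureus EF-G, start
subject = "MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYY"  # T. thermophilus EF-G, start

counts = diagonal_hits(query, subject, k=2)
best_offset, best_count = counts.most_common(1)[0]
print(best_offset, best_count)
```

The offset with the most hits marks the diagonal on which the two sequences line up; scattered hits on other offsets correspond to the isolated dots in a dot matrix.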

FASTA alignment of EF-G sequences using BLOSUM50 matrix (for very divergent sequences)

>>gi|119190|sp|P13551|EFG_THETH ELONGATION FACTOR G (EF-G)                    (691 aa)
 initn: 2949 init1: 1359 opt: 2943  Z-score: 3250.5  bits: 611.9 E(): 3.1e-174
Smith-Waterman score: 2943;  62.591% identity (63.050% ungapped) in 687 aa overlap (4-688:6-689)
>gi|119    4- 688:---------------------------------------------------------------------:

                 10        20        30        40        50        60        70        
gi|692   MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGITITSAATTAAWEGHRVN
            :..:.. ::::: ::::::::::::::::::::::::::.::::. ::.::::..::::::.:.::  :. ::.:
gi|119 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITITAAVTTCFWKDHRIN
               10        20        30        40        50        60        70        80

       80        90       100       110       120       130       140       150        
gi|692 IIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQAN
       :::::::::::.:::::.::::::..:.:...:::::.:::::::  : ::::.:.::::: ::..   . :...:: : 
gi|119 IIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVIRTMQERLGAR
               90       100       110       120       130       140       150       160

      160       170       180       190       200       210       220       230        
gi|692 AAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVS
        . .::::: :: : .:::...:: . : :::::.:.:: :::..::.:.: . .:.:..:. ....: :::  :: .  
gi|119 PVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEE
              170       180       190       200       210       220       230       240

      240       250       260       270       280       290       300        310       
gi|692 ELKEAIRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
       ::  :::..: .... ::. :.:.:::::::.::::.::::::::. :: :   ..:: ::.  . : .. .::::::.:
gi|119 ELVAAIRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
              250       260       270       280       290          300       310       

       320       330       340       350       360       370       380       390       
gi|692 TDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I
       .:::::.:::.::::::.:::::: :.:::..:::.:::.:::: :.:.. . .::..:.::::.: ::::: ::    .
gi|119 ADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV
       320       330       340       350       360       370       380       390       

        400       410       420       430       440       450       460       470      
gi|692 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVG
       ::::.: ::::: ...:::.::::.:..:::..: ::::::.. :  ::::.::.:::::::.:.:::.:.::.:. :::
gi|119 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVG
       400       410       420       430       440       450       460       470       

        480       490       500       510       520       530       540       550      
gi|692 APMVSYRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLKDAMENGVLAG
        :.:.::::. . ..:.::: ::.::::::: :.:.  :   :.:::: :::::::.:.::::.:. :...::..: : :
gi|119 KPQVAYRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIEEAMQSGPLIG
       480       490       500       510       520       530       540       550       

        560       570       580       590       600       610       620       630      
gi|692 YPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKCDPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPR
       .:..:.:. :::::::.:::::::::::.:.:.:::..: :::::::.:.: .  :::::::..::...:::.. :::::
gi|119 FPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPR
       560       570       580       590       600       610       620       630       

        640       650       660       670       680       690   
gi|692 GNAQVVNAYVPLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE
       :::::. :.:::.:::::::.:::.:::::...:.:::: ::::.. : .::     
gi|119 GNAQVIRAFVPLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ   
       640       650       660       670       680       690    

Different matrices can give different alignments and scores. Compare BLOSUM50 alignment:
Gap penalty: -10 extension: -2

280       290       300        310       
LPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
::::::. :: :   ..:: ::.  . : .. .::::::.:
LPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM

With PAM120 alignment (less divergent)
Gap penalty: -16 extension: -4

LPSPLDVKPIIGHRASNPEEEVIAKADDSAEFAALAFKVMT
::::::. :: :  ..   : :   .: .. .::::::.:.
LPSPLDIPPIKG--TTPEGEVVEIHPDPNGPLAALAFKIMA
      290         300       310        

Variations of the FASTA program are available for protein-protein and DNA-DNA comparisons, and for DNA-protein comparison by translating the DNA.

 

 

Other search types

Smith-Waterman and Needleman-Wunsch use dynamic programming algorithms to find optimal local and global alignments, respectively

Dynamic programming

Identifies highest scoring regions of alignment and extends them to end of alignment.
Highest score for a given region must be made from optimal subscore from step before.
Steps are then traced back from highest score at end of alignment to identify optimal path to that score.

Dynamic programming is computationally intensive and is not typically used to search databases, but our DNA sequence assembly program uses this strategy for assembly.
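
A minimal sketch of the local (Smith-Waterman) form of dynamic programming described above. It makes simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, and a single linear gap penalty instead of separate open and extension penalties. The scoring values are arbitrary choices for illustration.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned seq_a part, aligned seq_b part)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is built from the optimal subscore of a neighbor.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest score until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(seq_a[i - 1])
            bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXGATTACAYY", "ZZGATTACAWW"))
```

The fill step visits every pair of positions, so the work grows with the product of the two sequence lengths. That is the cost that makes this approach slow against a whole database.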

 




Jim Nolan - Tulane Department of Biochemistry, room 6014
Phone: 988-2453 jnolan@tulane.edu
My informatics course web address: http://www.tulane.edu/~biochem/lecture/723/