Most commonly used algorithms are FASTA and BLAST
Both are useful for searching for matches of your sequence to a large database, such as Genbank
To do so, both use simplifying shortcuts to minimize computational time without sacrificing much sensitivity
Global: calculates a similarity score between entire lengths of sequences
Local: calculates score based on smaller, local regions of similarity
BLAST works similarly to FASTA, with some differences
First, regions of sequence that are low complexity (like ST-rich regions, which
are easily substituted) are filtered out
Then a list of overlapping 3- or 4-letter words is derived from the sequence
Substitution score is used to identify high scoring words above a threshold
value
-> small number of words (~50) to compare to database
Database already indexed for high-scoring words
Matches within given distance of other matches on same diagonal are used to
align sequences
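The word-seeding steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BLAST's actual implementation: `toy_score` is a hypothetical stand-in for a real substitution matrix, the threshold values are arbitrary, and all function names are invented for this example.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix (not BLOSUM/PAM)."""
    return 5 if a == b else -1


def overlapping_words(seq, k=3):
    """Return (position, word) for every overlapping k-letter word in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def neighborhood(word, threshold, score=toy_score):
    """All k-letter words whose summed score against `word` is >= threshold."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


def index_database(db_seq, k=3):
    """Pre-index a database sequence: word -> list of start positions."""
    index = {}
    for j, word in overlapping_words(db_seq, k):
        index.setdefault(word, []).append(j)
    return index


def seed_hits(query, db_seq, k=3, threshold=9):
    """Return (query_pos, db_pos, diagonal) for every high-scoring word match."""
    index = index_database(db_seq, k)
    hits = []
    for i, word in overlapping_words(query, k):
        for neighbor in neighborhood(word, threshold):
            for j in index.get(neighbor, []):
                hits.append((i, j, j - i))  # same diagonal = same j - i
    return hits


if __name__ == "__main__":
    for hit in seed_hits("GKTT", "AAGKTTAA", threshold=15):
        print(hit)
```

Hits that share the same diagonal value and lie close together are the candidates that would then be extended into an alignment.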
BLAST alignment of S. aureus (Query) vs T. thermophilus (Subject)
Query: 4   EFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGXXX 63
           E+ L++ RNIGI AHIDAGKTTTTERILYYTGRIHKIGE HEGA+ MD+MEQE++RG
Sbjct: 6   EYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITI 65
np-binding 19 ********
Query: 64  XXXXXXXXWEGHRVNIIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQA 123
           W+ HR+NIIDTPGHVDFT+EVERS+RVLDGA+ V+D+ GVEPQ+ETVWRQA
Sbjct: 66  TAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQA 125
np-binding 83 *****
Query: 124 TTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKC 183
           Y VPRI F NKMDK GA++ + T+++RL A +QLPIG ED F IID++ MK
Sbjct: 126 EKYKVPRIAFANKMDKTGADLWLVIRTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKA 185
np-binding 137 ****
Query: 184 FKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEA 243
           + Y NDLGT+I EI IPE++LD+A E L+E A+ + +M KYL EE + EL A
Sbjct: 186 YTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAA 245
Query: 244 IRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-K 302
           IR+ T +++ PV+ G+A+KNKGVQL+LDAV+DYLPSPLD+ PI G + PE EV+
Sbjct: 246 IRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIH 302
Query: 303 ADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANS 362
           D + +AALAFK+M DPYVG+LTF RVYSGT+TSGSYV N+TKG++ERV RLL+MHAN
Sbjct: 303 PDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANH 362
Query: 363 RQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-IILESMEFPEPVIHLSVEPKSKADQD 421
           R+E++ + +GD+ A VGLK+T TGDTL GE +ILES+E PEPVI +++EPK+KADQ+
Sbjct: 363 REEVEELKAGDLGAVVGLKETITGDTLVGEDAPRVILESIEVPEPVIDVAIEPKTKADQE 422
Query: 422 KMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVS 481
           K++QAL +L EEDPTF T ETGQ II GMGELHL+I+VDR+K+EF V+ NVG P V+
Sbjct: 423 KLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVGKPQVA 482
Query: 482 YRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSV 541
           YRET V+GKF RQ+GGRGQYG V I+ P G+GFEF NAIVGGV+P+EYIP+V
Sbjct: 483 YRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAV 542
Query: 542 EAGLKDAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFXXXXXXXXXXXXXXCDPVIL 601
           G+++AM++G L G+P++D+K LYDGSYH+VDSSEMAF DPVIL
Sbjct: 543 QKGIEEAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVIL 602
Query: 602 EPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYVPLSEMFGYATSLRSN 661
           EP+M+V + PEEYMGD++GD+ +RRG++ GMEPRGNAQV+ A+VPL+EMFGYAT LRS
Sbjct: 603 EPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFVPLAEMFGYATDLRSK 662
Query: 662 TQGRGTYTMYFDHYAEVPKSIAEDIIK 688
           TQGRG++ M+FDHY EVPK + E +IK
Sbjct: 663 TQGRGSFVMFFDHYQEVPKQVQEKLIK 689
Results similar to FASTA using same matrix. XXXXX indicates regions masked for low complexity.
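One simple way to picture low-complexity masking is a sliding-window entropy filter. The sketch below is an assumption-laden illustration, not the filter BLAST itself applies: the window size, the entropy cutoff, and the function names are all chosen here for demonstration only.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues with 'X' in every window whose entropy is below the cutoff.

    `window` and `min_entropy` are illustrative values, not BLAST defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    example = "ACDEFGHIKLMN" + "S" * 12 + "ACDEFGHIKLMN"
    print(mask_low_complexity(example))
```

Because whole windows are masked, a few residues flanking a repetitive run can also be replaced with X.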
Selecting the BLAST Program
You need to pick the right version of BLAST to suit your data:
Program | Description |
blastp | Compares a protein sequence against a protein sequence database. |
blastn | Compares a nucleotide query sequence against a nucleotide sequence database. |
blastx | Translates your nucleotide query sequence in all six reading frames and compares these translations against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. |
tblastn | Compares a protein query sequence against a nucleotide sequence database which has been translated in all reading frames. |
tblastx | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. |
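The "translated in all reading frames" step used by blastx, tblastn and tblastx can be illustrated with a short six-frame translation sketch. It assumes the standard genetic code and an unambiguous A/C/G/T input; the function names and frame labels (`+1` to `-3`) are choices made for this example, not BLAST's own code.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame label ('+1'..'+3', '-1'..'-3') to protein."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translated strings would then be compared against the protein database (blastx) or against the other side's translations (tblastx).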
Selecting the BLAST Database (from the BLAST tutorial)
Protein sequence databases:
Database | Description |
nr | All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF |
month | similar to nr for sequences released in the last 30 days. |
swissprot | The SWISS-PROT protein sequence database. |
patents | Protein sequences from the GenBank Patent database. |
yeast | Protein translations of the Yeast genome. |
E. coli | Protein translations of the Escherichia coli genome. |
pdb | Sequences from the 3-dimensional structure Protein Data Bank. |
kabat | Kabat's Database of Sequences of Immunological interest. (e.g. IgG's) |
alu | Translations of Alu repeats from REPBASE |
Nucleotide sequence databases:
Database | Description |
nr | All non-redundant GenBank+EMBL+DDBJ+PDB sequences (excluding EST, STS, GSS, and HTGS sequences). |
month | GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. |
dbest | Non-redundant database of GenBank+EMBL+DDBJ ESTs (Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources.) |
dbsts | Non-redundant database of GenBank+EMBL+DDBJ STSs (Sequence Tagged Site: random physical genomic map marker) |
mouse ests | The non-redundant Database of mouse GenBank+EMBL+DDBJ ESTs. |
human ests | The non-redundant Database of human GenBank+EMBL+DDBJ ESTs. |
other ests | The non-redundant database of GenBank+EMBL+DDBJ ESTs from all organisms except mouse and human. |
yeast | Sequence fragments from the Yeast complete genome. |
E. coli | Escherichia coli genomic nucleotide sequences. |
pdb | Sequences from the 3-dimensional structure Protein Data Bank. |
kabat [kabatnuc] | Kabat's Database of Sequences of Immunological interest. (e.g. IgG's) |
patents | Protein sequences from the GenBank Patent database. |
vector | Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory). |
mito | Database of mitochondrial sequences (Rel. 1.0, July 1995). |
alu | Translations of Alu repeats from REPBASE |
gss | Genome Survey Sequences. |
htgs | High Throughput Genomic Sequences. |
The Raw Score (S) is the sum of all the matches for a given alignment, minus mismatch and gap penalties. This value tends to be larger for larger proteins, so how can you tell if it is significant?
Expected (E) value is the number of alignments with a score at least as high as the actual score S that would be expected to occur by chance:
$E = K m n \, e^{-\lambda S}$
where m and n are sequence lengths and K and lambda are scaling factors for the size of the search and the scoring system, respectively.
Probability (P) of finding a higher score:
$P = 1 - e^{-E}$
When E < 0.01, P ≈ E
So the E value is a good measure of the significance of any alignment. Numbers approaching 1 are not likely to be significant.
Numbers much less than 1 are highly significant.
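The relationship between S, E and P can be checked numerically with a small sketch. The parameter values in the usage example (K, lambda, and the sequence and database lengths) are illustrative assumptions, not values taken from these notes or from any particular BLAST run.

```python
import math


def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S): expected number of chance hits scoring >= S."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # Illustrative, assumed parameters only.
    K, lam = 0.041, 0.267
    query_len, db_len = 690, 50_000_000
    for raw_score in (50, 100, 200):
        e = expect_value(raw_score, query_len, db_len, K, lam)
        print(raw_score, e, p_value(e))
```

Note that doubling the database length doubles E for the same raw score, which is why the same alignment looks less significant in a larger database.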
Sequences are broken into words (also called k-tuples)
Matches between word characters are detected, offset of positions calculated
Series of character matches with same offset = a hit
Additional matches on same diagonal are identified -> diagonal matches
Diagonal matches within certain distance of other matches are joined using gaps
Joined regions are evaluated using substitution matrix scores
FASTA uses a dotplot based method, using words for comparison.
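The k-tuple lookup and diagonal-counting idea described above can be sketched as follows. This is a simplified illustration, not the FASTA program's code: it only counts word matches per diagonal and skips the later rescoring and joining steps, and the function names are invented for the example.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Lookup table: each k-letter word -> list of start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, subject, k=2):
    """Count word matches per diagonal, where diagonal = query_pos - subject_pos."""
    table = ktuple_positions(subject, k)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals


def best_diagonals(query, subject, k=2, top=10):
    """Return the `top` diagonals with the most word matches."""
    return diagonal_hits(query, subject, k).most_common(top)


if __name__ == "__main__":
    print(best_diagonals("HIDAGKTTT", "AAHIDAGKTTT"))
```

A diagonal with many word matches corresponds to a visible line on a dotplot, which is the region that would be rescored with the substitution matrix.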
FASTA alignment of EF-G sequences using BLOSUM50 matrix (for very divergent
sequences)
>>gi|119190|sp|P13551|EFG_THETH ELONGATION FACTOR G (EF-G) (691 aa)
initn: 2949 init1: 1359 opt: 2943 Z-score: 3250.5 bits: 611.9 E(): 3.1e-174
Smith-Waterman score: 2943; 62.591% identity (63.050% ungapped) in 687 aa overlap (4-688:6-689)
>gi|119 4- 688:---------------------------------------------------------------------:
10 20 30 40 50 60 70
gi|692 MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGITITSAATTAAWEGHRVN
:..:.. ::::: ::::::::::::::::::::::::::.::::. ::.::::..::::::.:.:: :. ::.:
gi|119 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITITAAVTTCFWKDHRIN
10 20 30 40 50 60 70 80
80 90 100 110 120 130 140 150
gi|692 IIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQAN
:::::::::::.:::::.::::::..:.:...:::::.::::::: : ::::.:.::::: ::.. . :...:: :
gi|119 IIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVIRTMQERLGAR
90 100 110 120 130 140 150 160
160 170 180 190 200 210 220 230
gi|692 AAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVS
. .::::: :: : .:::...:: . : :::::.:.:: :::..::.:.: . .:.:..:. ....: ::: :: .
gi|119 PVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEE
170 180 190 200 210 220 230 240
240 250 260 270 280 290 300 310
gi|692 ELKEAIRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
:: :::..: .... ::. :.:.:::::::.::::.::::::::. :: : ..:: ::. . : .. .::::::.:
gi|119 ELVAAIRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
250 260 270 280 290 300 310
320 330 340 350 360 370 380 390
gi|692 TDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I
.:::::.:::.::::::.:::::: :.:::..:::.:::.:::: :.:.. . .::..:.::::.: ::::: :: .
gi|119 ADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV
320 330 340 350 360 370 380 390
400 410 420 430 440 450 460 470
gi|692 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVG
::::.: ::::: ...:::.::::.:..:::..: ::::::.. : ::::.::.:::::::.:.:::.:.::.:. :::
gi|119 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVG
400 410 420 430 440 450 460 470
480 490 500 510 520 530 540 550
gi|692 APMVSYRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLKDAMENGVLAG
:.:.::::. . ..:.::: ::.::::::: :.:. : :.:::: :::::::.:.::::.:. :...::..: : :
gi|119 KPQVAYRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIEEAMQSGPLIG
480 490 500 510 520 530 540 550
560 570 580 590 600 610 620 630
gi|692 YPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKCDPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPR
.:..:.:. :::::::.:::::::::::.:.:.:::..: :::::::.:.: . :::::::..::...:::.. :::::
gi|119 FPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPR
560 570 580 590 600 610 620 630
640 650 660 670 680 690
gi|692 GNAQVVNAYVPLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE
:::::. :.:::.:::::::.:::.:::::...:.:::: ::::.. : .::
gi|119 GNAQVIRAFVPLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ
640 650 660 670 680 690
A different matrix can give a different alignment and score. Compare the BLOSUM50 alignment:
Gap penalty: -10 extension: -2
280 290 300 310
LPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
::::::. :: : ..:: ::. . : .. .::::::.:
LPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
with the PAM120 alignment (for less divergent sequences):
Gap penalty: -16 extension: -4
LPSPLDVKPIIGHRASNPEEEVIAKADDSAEFAALAFKVMT
::::::. :: : .. : : .: .. .::::::.:.
LPSPLDIPPIKG--TTPEGEVVEIHPDPNGPLAALAFKIMA
290 300 310
Variations of the FASTA program are available for protein-protein and DNA-DNA comparisons, and for DNA-protein comparison by translating the DNA.
Smith-Waterman and Needleman-Wunsch use a dynamic programming algorithm to find optimal local and global alignments, respectively.
Dynamic programming
Identifies highest scoring regions of alignment and extends them to end of alignment.
Highest score for a given region must be made from the optimal subscore from the step before.
Steps are then traced back from highest score at end of alignment to identify optimal path to that score.
Computationally intensive. Not typically used against databases.
But gives best alignment of two related sequences.
Searches for optimal positioning of gaps to align stretches of similar sequence.
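The fill-then-trace-back idea can be sketched with one small function covering both the global (Needleman-Wunsch style) and local (Smith-Waterman style) cases. It is a minimal illustration under simplifying assumptions: a flat match/mismatch score replaces a real substitution matrix, gaps carry a single linear penalty rather than separate open and extension penalties, and the default values are arbitrary.

```python
def align(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Return (score, aligned_a, aligned_b) using dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Each cell is built from the optimal score of a neighbouring cell.
            H[i][j] = max(H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if local:
                H[i][j] = max(H[i][j], 0)  # local scores never drop below 0
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
    if not local:
        best, best_pos = H[-1][-1], (rows - 1, cols - 1)

    # Trace back from the best cell to recover one optimal path.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 or j > 0:
        if local and H[i][j] == 0:
            break
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(align("GATTACA", "GATACA"))
    print(align("TTTGATTACATTT", "CCCGATTACACCC", local=True))
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is why this approach is rarely run against whole databases.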
Alignment matrix | Summed using dynamic programming |
# Aligned_sequences: 2
# 1: EFG_THETH
# 2: EFG_STAAM
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 697
# Identity: 430/697 (61.7%)
# Similarity: 542/697 (77.8%)
# Gaps: 10/697 ( 1.4%)
# Score: 2290.0
#
#
#=======================================
EFG_THETH 1   MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAA 50
                :..|:.|::.|||||.||||||||||||||||||||||||||.||||:
EFG_STAAM 1     MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGAS 48
EFG_THETH 51  TMDFMEQERERGITITAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRV 100
              .||:||||::||||||:|.||..|:.||:||||||||||||:|||||:||
EFG_STAAM 49  QMDWMEQEQDRGITITSAATTAAWEGHRVNIIDTPGHVDFTVEVERSLRV 98
EFG_THETH 101 LDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVI 150
              ||||:.|.|:..|||||:|||||||..|.||||.|.|||||.||:....:
EFG_STAAM 99  LDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSV 148
EFG_THETH 151 RTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIP 200
              .|:.:||.|....:|||||.||.|..|||::.||.:.|.|||||:|.||.
EFG_STAAM 149 STLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIE 198
EFG_THETH 201 IPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAAIRKGT 250
              |||::||:|.|....|:|..|:..:.:|.|||..||.:..||..|||:.|
EFG_STAAM 199 IPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEAIRQAT 248
EFG_THETH 251 IDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGE 297
              .:::..||..|:|.|||||||:||||:||||||||:.||.|   :.||.|
EFG_STAAM 249 TNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEE 298
EFG_THETH 298 VVEIHPDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKG 347
              |: ...|.:...||||||:|.|||||:|||.||||||:||||||.|:|||
EFG_STAAM 299 VI-AKADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKG 347
EFG_THETH 348 RKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV 397
              ::|||.|||:||||.|:|::.:.:||:.|.||||:|.|||||.||... :
EFG_STAAM 348 KRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I 396
EFG_THETH 398 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETG 447
              ||||:|.|||||.:::|||:||||:|::|||.:|.||||||...|..|||
EFG_STAAM 397 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETG 446
EFG_THETH 448 QTIISGMGELHLEIIVDRLKREFKVDANVGKPQVAYRETITKPVDVEGKF 497
              |.||.|||||||:|:|||:|:||.|:.|||.|.|:||||......|:|||
EFG_STAAM 447 QVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVSYRETFKSSAQVQGKF 496
EFG_THETH 498 IRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIE 547
              .||:|||||||.|.|:..|...|:||||.|||||||:|:||||:|:.|::
EFG_STAAM 497 SRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLK 546
EFG_THETH 548 EAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKG 597
              :||::|.|.|:|::|:|..|||||||:|||||||||||.|:|:|||.:|.
EFG_STAAM 547 DAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKC 596
EFG_THETH 598 DPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFV 647
              |||||||:|:|.:..|||||||::||:.:|||::.||||||||||:.|:|
EFG_STAAM 597 DPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYV 646
EFG_THETH 648 PLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ 691
              ||:|||||||.|||.|||||::.|:||||.||||.:.|.:||..
EFG_STAAM 647 PLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE 693