Sequence alignment

So how about the actual alignments?

The most commonly used algorithms are FASTA and BLAST.

Both are useful for searching for matches of your sequence against a large database, such as GenBank.

To do so, both use simplifying shortcuts to minimize computational time without sacrificing much sensitivity

BLAST (see the BLAST Tutorial Page)

Databases contain more than just sequences:

May include nucleotide sequences (genes, genomes, ESTs), proteins, protein structures, and Medline.

One powerful means of searching among these is Entrez:

http://www3.ncbi.nlm.nih.gov/Entrez/
We will return to this later.

So, which program should you use to search a database?

How do these programs search?

Global: calculates a similarity score between entire lengths of sequences
Local: calculates score based on smaller, local regions of similarity

There are two main search types for comparing your sequence of interest to a large database. They are FASTA (short for Fast-all) and BLAST (Basic Local Alignment Search Tool). Each has its own strengths and weaknesses.

BLAST is a similarity search program developed by the research staff at NCBI/GenBank. It is available as a free service over the internet which provides very fast, accurate, and sensitive database searching.

BLAST algorithm:

Step 1:

Choose a word size, usually 11 for DNA and 3 for protein.
Filters out low complexity or repetitive regions within the query sequence, then locates all words in the query sequence and compiles the list of similar words that form high scoring word pairs.

Step 2:

Scans the database for exact matches to the list of words compiled in step 1.
e.g. with a word size of 2: qlnfsagw -> (ql, ln, nf, fs, sa, ag, gw)
The list is extended with similar words that score at least some threshold T against a query word.
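The word list in the example above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy scoring function (2 for identical letters, -1 otherwise), and the four-letter alphabet are assumptions made for the example. BLAST itself scores candidate words with a substitution matrix.

```python
from itertools import product

def overlapping_words(seq, w):
    """Return every overlapping word of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def neighborhood(word, alphabet, score, T):
    """Return all words of the same length whose summed score against `word` is at least T."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= T
    ]

def toy_score(a, b):
    # Assumed toy scoring, not a real substitution matrix.
    return 2 if a == b else -1

print(overlapping_words("qlnfsagw", 2))
# ['ql', 'ln', 'nf', 'fs', 'sa', 'ag', 'gw']
print(neighborhood("ql", "qlnf", toy_score, T=1))
# ['qq', 'ql', 'qn', 'qf', 'll', 'nl', 'fl']
```

Raising T shrinks the neighborhood (fewer words, faster search); lowering it grows the list and makes the search more sensitive.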

Step 3:

Scans through the string and, whenever a word in the list is found, tries to extend it in both directions (no gaps) to reach a score beyond a threshold S. While extending, it uses a parameter L that defines how long an extension will be tried to raise the score over S.


Modification of step 3:

- Original BLAST: extension continues as long as the score continues to increase.
- BLAST2 (gapped BLAST):
  - A lower value of T is used.
  - After extension, tries to combine segments (allowing gaps).
  - Finds the maximal scoring segment.
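The ungapped extension of step 3 might look roughly like the sketch below. It is a simplification under stated assumptions: scoring is +1 for a match and -1 for a mismatch rather than a substitution matrix, and L is interpreted here as the number of positions the extension may continue without improving on the best score seen so far. The function name and return format are invented for the example.

```python
def extend_ungapped(query, subject, q_start, s_start, w, L=5, match=1, mismatch=-1):
    """Extend a seed word hit in both directions without gaps.

    Returns (score, q_begin, q_end): the best total score and the
    half-open range of the query covered by the extended segment.
    """
    def walk(q_pos, s_pos, step):
        best = score = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif length - best_len >= L:
                break  # no improvement within L positions: give up
            q_pos += step
            s_pos += step
        return best, best_len

    seed_score = sum(
        match if query[q_start + i] == subject[s_start + i] else mismatch
        for i in range(w)
    )
    right_gain, right_len = walk(q_start + w, s_start + w, +1)
    left_gain, left_len = walk(q_start - 1, s_start - 1, -1)
    return seed_score + left_gain + right_gain, q_start - left_len, q_start + w + right_len

# Seed "HID" found at position 6 in both sequences.
print(extend_ungapped("NIGIMAHIDAGKTT", "NIGIAAHIDAGKTT", 6, 6, 3))
# (12, 0, 14)
```

The single mismatch (M vs A) lowers the running score, but the extension continues through it because the score recovers within L positions.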


Works similarly to FASTA, with some differences:
First, regions of the sequence that are of low complexity (like ST-rich regions, which are easily substituted) are filtered out.
Then a list of overlapping 3- or 4-letter words is derived from the sequence.
Substitution scores are used to identify high-scoring words above a threshold value
-> small number of words (~50) to compare to the database.
The database is already indexed for high-scoring words.
Matches within a given distance of other matches on the same diagonal are used to align the sequences.

BLAST alignment of S. aureus (Query) vs T. thermophilus (Subject)


Query:     4   EFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGXXX 63
               E+ L++ RNIGI AHIDAGKTTTTERILYYTGRIHKIGE HEGA+ MD+MEQE++RG   
Sbjct:     6   EYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITI 65
np-binding 19               ********

Query:     64  XXXXXXXXWEGHRVNIIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQA 123
                       W+ HR+NIIDTPGHVDFT+EVERS+RVLDGA+ V+D+  GVEPQ+ETVWRQA
Sbjct:     66  TAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQA 125
np-binding 83                   *****

Query:     124 TTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKC 183
                 Y VPRI F NKMDK GA++   + T+++RL A    +QLPIG ED F  IID++ MK 
Sbjct:     126 EKYKVPRIAFANKMDKTGADLWLVIRTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKA 185
np-binding 137            ****

Query:     184 FKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEA 243
               + Y NDLGT+I EI IPE++LD+A E    L+E  A+  + +M KYL  EE +  EL  A
Sbjct:     186 YTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAA 245

Query:     244 IRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-K 302
               IR+ T +++  PV+ G+A+KNKGVQL+LDAV+DYLPSPLD+ PI G   + PE EV+   
Sbjct:     246 IRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIH 302

Query:     303 ADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANS 362
                D +  +AALAFK+M DPYVG+LTF RVYSGT+TSGSYV N+TKG++ERV RLL+MHAN 
Sbjct:     303 PDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANH 362

Query:     363 RQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-IILESMEFPEPVIHLSVEPKSKADQD 421
               R+E++ + +GD+ A VGLK+T TGDTL GE    +ILES+E PEPVI +++EPK+KADQ+
Sbjct:     363 REEVEELKAGDLGAVVGLKETITGDTLVGEDAPRVILESIEVPEPVIDVAIEPKTKADQE 422

Query:     422 KMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVS 481
               K++QAL +L EEDPTF   T  ETGQ II GMGELHL+I+VDR+K+EF V+ NVG P V+
Sbjct:     423 KLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVGKPQVA 482

Query:     482 YRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSV 541
               YRET      V+GKF RQ+GGRGQYG V I+  P   G+GFEF NAIVGGV+P+EYIP+V
Sbjct:     483 YRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAV 542

Query:     542 EAGLKDAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFXXXXXXXXXXXXXXCDPVIL 601
               + G+++AM++G L G+P++D+K  LYDGSYH+VDSSEMAF               DPVIL
Sbjct:     543 QKGIEEAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVIL 602

Query:     602 EPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYVPLSEMFGYATSLRSN 661
               EP+M+V +  PEEYMGD++GD+ +RRG++ GMEPRGNAQV+ A+VPL+EMFGYAT LRS 
Sbjct:     603 EPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFVPLAEMFGYATDLRSK 662

Query:     662 TQGRGTYTMYFDHYAEVPKSIAEDIIK 688
               TQGRG++ M+FDHY EVPK + E +IK
Sbjct:     663 TQGRGSFVMFFDHYQEVPKQVQEKLIK 689

Results similar to FASTA using same matrix. XXXXX indicates regions masked for low complexity.
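To give a feel for how low-complexity masking produces runs of X, here is a minimal sketch that masks any window whose Shannon entropy falls below a cutoff. This is not the filter BLAST actually uses; the window size, the entropy cutoff, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)

diverse = "ACDEFGHIKLMN"              # 12 different residues: high entropy
print(mask_low_complexity(diverse))   # unchanged
print(mask_low_complexity(diverse + "S" * 12 + diverse))  # the serine run is masked
```

Because windows overlap, a few residues flanking a repetitive run are masked along with it.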

 

Selecting the BLAST Program

You need to pick the right version of BLAST to suit your data:
 
Program   Description
blastp    Compares a protein query sequence against a protein sequence database.
blastn    Compares a nucleotide query sequence against a nucleotide sequence database.
blastx    Translates your nucleotide query sequence in all reading frames and compares these translations against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn   Compares a protein query sequence against a nucleotide sequence database which has been translated in all reading frames.
tblastx   Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
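The table above amounts to a lookup on the type of the query and the type of the database. A small sketch of that decision, with a hypothetical helper name and argument names invented for the example:

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database types.

    query_type and database_type are "protein" or "nucleotide".
    translate_both=True asks for a translated comparison of two
    nucleotide inputs (tblastx).
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "blastn",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    return table[(query_type, database_type)]

print(choose_blast_program("nucleotide", "protein"))  # blastx
```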



Selecting the BLAST Database (from the BLAST tutorial)

Proteins

Database    Description
nr          All non-redundant GenBank CDS translations (open reading frames of nucleotide sequences) + PDB + SwissProt + PIR + PRF.
month       Similar to nr, for sequences released in the last 30 days.
swissprot   The SWISS-PROT protein sequence database.
patents     Protein sequences from the GenBank Patent database.
yeast       Protein translations of the Yeast genome.
E. coli     Protein translations of the Escherichia coli genome.
pdb         Sequences from the 3-dimensional structure Protein Data Bank.
kabat       Kabat's Database of Sequences of Immunological interest (e.g. IgGs).
alu         Translations of Alu repeats from REPBASE.


Nucleotides

Database    Description
nr          All non-redundant GenBank+EMBL+DDBJ+PDB sequences (excluding EST, STS, GSS, and HTGS sequences).
month       GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest       Non-redundant database of GenBank+EMBL+DDBJ ESTs (Expressed Sequence Tags: sequences of random cDNAs obtained from specific tissue sources).
dbsts       Non-redundant database of GenBank+EMBL+DDBJ STSs (Sequence Tagged Sites: random physical genomic map markers).
mouse ests  The non-redundant database of mouse GenBank+EMBL+DDBJ ESTs.
human ests  The non-redundant database of human GenBank+EMBL+DDBJ ESTs.
other ests  The non-redundant database of GenBank+EMBL+DDBJ ESTs from all organisms except mouse and human.
yeast       Sequence fragments from the Yeast complete genome.
E. coli     Escherichia coli genomic nucleotide sequences.
pdb         Sequences from the 3-dimensional structure Protein Data Bank.
kabat       [kabatnuc] Kabat's Database of Sequences of Immunological interest (e.g. IgGs).
patents     Nucleotide sequences from the GenBank Patent database.
vector      Vector subset of GenBank(R), NCBI (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito        Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu         Select Alu repeats from REPBASE.
gss         Genome Survey Sequences.
htgs        High Throughput Genomic Sequences.

Evaluation of alignments (from the BLAST tutorial)

The Raw Score (S) is the sum of all the matches for a given alignment, minus mismatch and gap penalties. This value tends to be larger for larger proteins, so how can you tell if it is significant?

The Expected (E) value is the number of alignments with a score of at least S that would be expected to occur by chance:

E = K m n e^(-lambda S)

where m and n are the sequence lengths, and K and lambda are scaling factors dependent upon the size of the search and the scoring system.

Probability (P) of finding a higher score:

P = 1 - e^(-E)

When E < 0.01, P ~ E

So the E value is a good measure of the significance of any alignment. Values approaching 1 are not likely to be significant.

Numbers much less than 1 are highly significant.
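The relationship between S, E and P can be checked numerically. The sketch below assumes the usual form E = K m n e^(-lambda S) for the expectation; the values of K and lambda used in the example are purely illustrative placeholders, not the parameters of any real scoring system.

```python
import math

def expect_value(S, m, n, K, lam):
    """Expected number of chance alignments scoring at least S."""
    return K * m * n * math.exp(-lam * S)

def p_value(E):
    """Probability of at least one chance alignment: P = 1 - e^(-E)."""
    return -math.expm1(-E)  # numerically stable form of 1 - exp(-E)

# Illustrative numbers only: K and lam are assumed, not taken from a real matrix.
K, lam = 0.1, 0.3
m, n = 688, 1_000_000   # query length and total database length (assumed)

for S in (50, 75, 100):
    E = expect_value(S, m, n, K, lam)
    print(f"S={S:4d}  E={E:.3e}  P={p_value(E):.3e}")
```

For small E the two columns agree closely, which is why E is usually reported on its own.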

 

FASTA

Sequences are broken into words (also called k-tuples)
Matches between word characters are detected, offset of positions calculated
Series of character matches with same offset = a hit
Additional matches on same diagonal are identified -> diagonal matches
Diagonal matches within certain distance of other matches are joined using gaps
Joined regions are evaluated using substitution matrix scores.

FASTA uses a dotplot-based method, using words for comparison.
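The word-and-offset idea can be sketched as follows: index the words of one sequence, look up each word of the other, and count matches per offset (diagonal). The function names are invented for this example, and no scoring or joining of diagonals is attempted here.

```python
from collections import Counter, defaultdict

def word_positions(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, subject, k):
    """Count k-tuple matches per diagonal (offset = query position - subject position)."""
    index = word_positions(subject, k)
    hits = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[i - j] += 1
    return hits

hits = diagonal_hits("RNIGIMAHIDAGK", "LKRLRNIGIAAHIDAGK", k=2)
print(hits.most_common(1))
# [(-4, 10)]
```

All ten word matches fall on the same diagonal (offset -4), which is the signature of a gap-free region of similarity in a dotplot.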

FASTA algorithm:

Step 1:

Set a word size, usually 6 for DNA and 2 for protein.
FASTA locates regions of the query sequence and matching regions in the database sequences that have high densities of exact word matches (without gaps) and makes lists of each. The k-tuple (K-tup) parameter is the length of the word.

Step 2:

It matches words from each list, creating a diagonal, and tries to extend each diagonal. The highest scoring regions are rescored using (for instance) the PAM250 scoring matrix. The score for such a pair of regions is saved as "init1".

Step 3:

It then determines if any of the initial regions from different diagonals can join together to form an approximate alignment with gaps. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region, at the end of this step, is saved as "initn".

Step 4:

After computing the initial scores, it makes an optimal local alignment using the Smith-Waterman algorithm. The score for this alignment is the "opt" score.
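A bare-bones Smith-Waterman score calculation is sketched below. It is a simplification under stated assumptions: a flat match/mismatch score stands in for a substitution matrix, a single linear gap penalty stands in for separate gap-open and gap-extension penalties, and the whole matrix is filled rather than only a region around the initial diagonals.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXHIDAGKXX", "YYYHIDAGKYY"))
# 12
```

The shared segment HIDAGK contributes 6 matches at 2 points each; the unrelated flanking residues are simply left out of the local alignment.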

Step 5:

Random Sequence Simulation: In order to evaluate the significance of any alignment, FASTA empirically estimates the score distribution from the alignment of many random pairs of sequences. The characters of the query sequence are reshuffled (to maintain length and character composition bias) and searched against a random subset of the database.
This empirical distribution is extrapolated, assuming it is an extreme value distribution, and each alignment to the real query is assigned a Z-score and an E-score (expectation of significance; 0.02 is good).
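The shuffling idea can be illustrated with a small simulation. This sketch makes several simplifying assumptions: the similarity score is a toy one (the largest number of identical positions on any ungapped diagonal), the shuffled query is compared against the same subject rather than a random subset of a database, and the Z-score is computed directly from the mean and standard deviation of the shuffled scores without fitting an extreme value distribution.

```python
import random
import statistics

def best_diagonal_score(a, b):
    """Toy score: most identical positions found on any ungapped diagonal."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i in range(max(0, offset), min(len(a), len(b) + offset))
            if a[i] == b[i - offset]
        )
        best = max(best, matches)
    return best

def shuffle_z_score(query, subject, n_shuffles=200, seed=0):
    """Z-score of the real score against scores of shuffled queries."""
    rng = random.Random(seed)
    real = best_diagonal_score(query, subject)
    letters = list(query)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps length and composition, destroys order
        null_scores.append(best_diagonal_score("".join(letters), subject))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    return (real - mean) / sd if sd > 0 else float("inf")

query = "RNIGIMAHIDAGKTTTTERILYYTGRIHKIGE"
subject = "RNIGIAAHIDAGKTTTTERILYYTGRIHKIGE"
print(round(shuffle_z_score(query, subject), 1))
```

A large positive Z-score means the real alignment scores far above what the same residues achieve in random order.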


FASTA alignment of EF-G sequences using BLOSUM50 matrix (for very divergent sequences)

>>gi|119190|sp|P13551|EFG_THETH ELONGATION FACTOR G (EF-G)                    (691 aa)
 initn: 2949 init1: 1359 opt: 2943  Z-score: 3250.5  bits: 611.9 E(): 3.1e-174
Smith-Waterman score: 2943;  62.591% identity (63.050% ungapped) in 687 aa overlap (4-688:6-689)
>gi|119    4- 688:---------------------------------------------------------------------:

                 10        20        30        40        50        60        70        
gi|692   MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGITITSAATTAAWEGHRVN
            :..:.. ::::: ::::::::::::::::::::::::::.::::. ::.::::..::::::.:.::  :. ::.:
gi|119 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITITAAVTTCFWKDHRIN
               10        20        30        40        50        60        70        80

       80        90       100       110       120       130       140       150        
gi|692 IIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQAN
       :::::::::::.:::::.::::::..:.:...:::::.:::::::  : ::::.:.::::: ::..   . :...:: : 
gi|119 IIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVIRTMQERLGAR
               90       100       110       120       130       140       150       160

      160       170       180       190       200       210       220       230        
gi|692 AAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVS
        . .::::: :: : .:::...:: . : :::::.:.:: :::..::.:.: . .:.:..:. ....: :::  :: .  
gi|119 PVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEE
              170       180       190       200       210       220       230       240

      240       250       260       270       280       290       300        310       
gi|692 ELKEAIRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
       ::  :::..: .... ::. :.:.:::::::.::::.::::::::. :: :   ..:: ::.  . : .. .::::::.:
gi|119 ELVAAIRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
              250       260       270       280       290          300       310       

       320       330       340       350       360       370       380       390       
gi|692 TDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I
       .:::::.:::.::::::.:::::: :.:::..:::.:::.:::: :.:.. . .::..:.::::.: ::::: ::    .
gi|119 ADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV
       320       330       340       350       360       370       380       390       

        400       410       420       430       440       450       460       470      
gi|692 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVG
       ::::.: ::::: ...:::.::::.:..:::..: ::::::.. :  ::::.::.:::::::.:.:::.:.::.:. :::
gi|119 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVG
       400       410       420       430       440       450       460       470       

        480       490       500       510       520       530       540       550      
gi|692 APMVSYRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLKDAMENGVLAG
        :.:.::::. . ..:.::: ::.::::::: :.:.  :   :.:::: :::::::.:.::::.:. :...::..: : :
gi|119 KPQVAYRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIEEAMQSGPLIG
       480       490       500       510       520       530       540       550       

        560       570       580       590       600       610       620       630      
gi|692 YPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKCDPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPR
       .:..:.:. :::::::.:::::::::::.:.:.:::..: :::::::.:.: .  :::::::..::...:::.. :::::
gi|119 FPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPR
       560       570       580       590       600       610       620       630       

        640       650       660       670       680       690   
gi|692 GNAQVVNAYVPLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE
       :::::. :.:::.:::::::.:::.:::::...:.:::: ::::.. : .::     
gi|119 GNAQVIRAFVPLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ   
       640       650       660       670       680       690    

Different matrices can give different alignments and scores. Compare the BLOSUM50 alignment:
Gap penalty: -10 extension: -2

280       290       300        310       
LPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
::::::. :: :   ..:: ::.  . : .. .::::::.:
LPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM

With the PAM120 alignment (for less divergent sequences):
Gap penalty: -16 extension: -4

LPSPLDVKPIIGHRASNPEEEVIAKADDSAEFAALAFKVMT
::::::. :: :  ..   : :   .: .. .::::::.:.
LPSPLDIPPIKG--TTPEGEVVEIHPDPNGPLAALAFKIMA
      290         300       310        

Variations of the FASTA program are available for protein-protein and DNA-DNA comparisons, and for DNA-protein comparison by translating the DNA.

Other search types

Smith-Waterman and Needleman-Wunsch use a dynamic programming algorithm to find optimal local and global alignments, respectively.

Dynamic programming

Identifies the highest scoring regions of alignment and extends them to the end of the alignment.
The highest score for a given region must be made from the optimal subscore from the step before.
Steps are then traced back from the highest score at the end of the alignment to identify the optimal path to that score.

Computationally intensive. Not typically used against databases.
But gives best alignment of two related sequences.
Searches for optimal positioning of gaps to align stretches of similar sequence.
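The fill-then-trace-back procedure can be sketched as a small Needleman-Wunsch (global) aligner. The scoring values here are assumptions for illustration: +1 for a match, -1 for a mismatch, and a flat -2 per gap position instead of a substitution matrix with separate gap-open and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell is built from the optimal score of a neighbouring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the final cell to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

print(needleman_wunsch("ACGT", "AGT"))
# (1, 'ACGT', 'A-GT')
```

The work grows with the product of the two sequence lengths, which is why this approach is rarely run against a whole database.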

  P V I L E P M M K V T I E M P
P 3         1     1 1         1
V   3               1 1        
I     3               1 1   1 1
L       3               1 1   1
E 1       2         1     1 1  
P 1 1     1 2         1 1     1
I     1     1 1         1 1    
M             1 2           1  
R 1   1           1   1     1 1
V   1   1       1   1   1     1
E 1       1       2       1    
V   1             1 2          
T       1           1 1   1    
T     1   1           2     2 1
P 1   1 1   1         1 1   1 3
  P V I L E P M M K V T I E M P
P 3 3 4     1 1         1
V 3 6             1 1        
I 9             1 1   1 1
L   12               1 1   1
E 4       14 14       1     1 1  
P 1 5     15 16         1 1     1
I     6     1 17         1 1    
M             1 19           1  
R 1   1         19 20 20 21 21 21 22 22
V   1   1       20 20 21 21 22 22 23 23
E 1       1       22 22 22 22 23 23 23
V   1             23 24 24 24 24 24 24
T       1         23 25 25 25 26 26 26
T     1   1       23 25 27 27 27 28 29
P 1   1 1   1     23 25 27 28 28 29 31
First matrix: alignment matrix. Second matrix: summed using dynamic programming.

Needleman-Wunsch alignment of EFG sequences

# Aligned_sequences: 2
# 1: EFG_THETH
# 2: EFG_STAAM
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 697
# Identity:     430/697 (61.7%)
# Similarity:   542/697 (77.8%)
# Gaps:          10/697 ( 1.4%)
# Score: 2290.0
# 
#
#=======================================

EFG_THETH          1 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAA     50
                       :..|:.|::.|||||.||||||||||||||||||||||||||.||||:
EFG_STAAM          1   MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGAS     48

EFG_THETH         51 TMDFMEQERERGITITAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRV    100
                     .||:||||::||||||:|.||..|:.||:||||||||||||:|||||:||
EFG_STAAM         49 QMDWMEQEQDRGITITSAATTAAWEGHRVNIIDTPGHVDFTVEVERSLRV     98

EFG_THETH        101 LDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVI    150
                     ||||:.|.|:..|||||:|||||||..|.||||.|.|||||.||:....:
EFG_STAAM         99 LDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSV    148

EFG_THETH        151 RTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIP    200
                     .|:.:||.|....:|||||.||.|..|||::.||.:.|.|||||:|.||.
EFG_STAAM        149 STLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIE    198

EFG_THETH        201 IPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAAIRKGT    250
                     |||::||:|.|....|:|..|:..:.:|.|||..||.:..||..|||:.|
EFG_STAAM        199 IPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEAIRQAT    248

EFG_THETH        251 IDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGE    297
                     .:::..||..|:|.|||||||:||||:||||||||:.||.|   :.||.|
EFG_STAAM        249 TNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEE    298

EFG_THETH        298 VVEIHPDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKG    347
                     |: ...|.:...||||||:|.|||||:|||.||||||:||||||.|:|||
EFG_STAAM        299 VI-AKADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKG    347

EFG_THETH        348 RKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV    397
                     ::|||.|||:||||.|:|::.:.:||:.|.||||:|.|||||.||... :
EFG_STAAM        348 KRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I    396

EFG_THETH        398 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETG    447
                     ||||:|.|||||.:::|||:||||:|::|||.:|.||||||...|..|||
EFG_STAAM        397 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETG    446

EFG_THETH        448 QTIISGMGELHLEIIVDRLKREFKVDANVGKPQVAYRETITKPVDVEGKF    497
                     |.||.|||||||:|:|||:|:||.|:.|||.|.|:||||......|:|||
EFG_STAAM        447 QVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVSYRETFKSSAQVQGKF    496

EFG_THETH        498 IRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIE    547
                     .||:|||||||.|.|:..|...|:||||.|||||||:|:||||:|:.|::
EFG_STAAM        497 SRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLK    546

EFG_THETH        548 EAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKG    597
                     :||::|.|.|:|::|:|..|||||||:|||||||||||.|:|:|||.:|.
EFG_STAAM        547 DAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKC    596

EFG_THETH        598 DPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFV    647
                     |||||||:|:|.:..|||||||::||:.:|||::.||||||||||:.|:|
EFG_STAAM        597 DPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYV    646

EFG_THETH        648 PLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ       691
                     ||:|||||||.|||.|||||::.|:||||.||||.:.|.:||..   
EFG_STAAM        647 PLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE    693