Internet, Pubmed, and BLAST

BLAST Homework Assignment

Internet


Because bioinformatics relies so heavily on computers, and usually on the internet, a brief introduction to how the internet itself works may be helpful. Much of this information was obtained from the UNH InterOperability Lab and PC Lube & Tune.

 

Let's start by clicking on a web page link in your internet browser,
for example:

PubMed citation database
contains the link: http://www.ncbi.nlm.nih.gov/PubMed/
This means "use the hypertext transfer protocol" to ask the computer named "www.ncbi.nlm.nih.gov" for the file named "/PubMed/" (actually for some default file found in the directory "PubMed") and send it back to me.
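The breakdown of the link into protocol, computer name, and file can be sketched with Python's standard `urllib.parse` module. This is only an illustration of the three parts named above; the variable names are my own.

```python
from urllib.parse import urlsplit

url = "http://www.ncbi.nlm.nih.gov/PubMed/"
parts = urlsplit(url)

protocol = parts.scheme    # which protocol to speak (hypertext transfer protocol)
host = parts.hostname      # which computer to ask
path = parts.path          # which file (here a directory, so a default file is sent)

print(protocol, host, path)
```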

 

The most important parts of this process are identifying computers named "www.ncbi.nlm.nih.gov" and "me" and negotiating  the transfer.

All computers connected to the internet must have an internet protocol (IP) address.
You must have one assigned to you by an internet provider. If you install a computer at Tulane, you can ask TIS for a number, which you then enter into your computer's network settings:

in this case, my IP address is 129.81.38.94.

All Tulane (tulane.edu) computers have addresses beginning with 129.81; all UNO (uno.edu) computers begin with 137.30. The network within Tulane is subdivided into smaller networks (subnets) interconnected by routers. My computer can connect with any computer with address 129.81.38.### without going through the router; to reach the outside world, I need to use the router gateway.
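The decision "same subnet or go through the gateway" can be sketched with Python's standard `ipaddress` module. The `/24` and `/16` prefix lengths are assumptions read off the patterns 129.81.38.### and 129.81 in the text, and `needs_gateway` is a hypothetical helper name.

```python
import ipaddress

my_address = ipaddress.ip_address("129.81.38.94")
subnet = ipaddress.ip_network("129.81.38.0/24")   # assumed: 129.81.38.###
campus = ipaddress.ip_network("129.81.0.0/16")    # assumed: all 129.81 addresses

def needs_gateway(destination: str) -> bool:
    """Return True if the destination lies outside my own subnet."""
    return ipaddress.ip_address(destination) not in subnet

print(needs_gateway("129.81.38.10"))    # a neighbor on the same subnet
print(needs_gateway("130.14.29.110"))   # a computer outside Tulane
```

An address such as 129.81.224.50 is still on campus but on a different subnet, so it would also be reached through the router.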

First of all, how do we find www.ncbi.nlm.nih.gov? It doesn't look much like my address. This is accomplished by Domain Name Servers (DNS), computers which keep lists of IP address numbers and corresponding names like "www.tulane.edu," which are easier to remember.

Each institution is responsible for listing all the computers within its domain and the corresponding name, if it has one. The DNS here can query other DNS to see if they have a "www.ncbi.nlm.nih.gov" and if so, what its real number is so we can contact it.
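A toy model of this lookup is sketched below. It is not how real DNS software is written (real servers use a hierarchy of queries and caching); it only illustrates "check my own list, otherwise ask the server responsible for that domain". The table contents use names and numbers quoted in these notes, and the function and variable names are hypothetical.

```python
# Names our own (local) name server knows about.
LOCAL_TABLE = {"ns1.tcs.tulane.edu": "129.81.224.50"}

# Other name servers we can query, keyed by the domain they are responsible for.
REMOTE_SERVERS = {
    "nih.gov": {"www.ncbi.nlm.nih.gov": "130.14.29.110"},
}

def resolve(name):
    """Return the IP number for a name, or None if nobody lists it."""
    if name in LOCAL_TABLE:
        return LOCAL_TABLE[name]
    for domain, table in REMOTE_SERVERS.items():
        # Only ask a remote server about names inside its own domain.
        if name.endswith("." + domain) and name in table:
            return table[name]
    return None

print(resolve("www.ncbi.nlm.nih.gov"))
```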

In this case the Tulane DNS is 129.81.224.50 (ns1.tcs.tulane.edu). You may get a "domain name server error" when you can't get through on the network. This could mean that the DNS is down, in which case you might be able to get through to your destination if you know the IP number. But usually this means the connection between you and the network is down, and the first place your computer checks is the DNS.

So the Tulane DNS queries the NIH DNS (ns2.nih.gov) to find the IP number for www.ncbi.nlm.nih.gov (130.14.29.110). Now the actual request for data can begin between our computers. My computer asks 130.14.29.110 for the file in question. If it supports http (i.e. it is a web server) and if you asked for a real file in the right place on the web server, it will start sending you back data. It does so by sending little packets of data with its address, your address, the data, and some bookkeeping data bits, which tell what part of the file it is and a key to tell you whether the data packet might have been corrupted (mangled in the transfer). If the packet arrives intact, the next one is sent. This transfer is relayed between many routing computers. In this case, it takes about 11 steps:
 1  129.81.133.1 (129.81.133.1)
 2  tidewater-et-4-1.net.tulane.edu (129.81.255.93)
 3  newsouth-atm-1-0-0.net.tulane.edu (129.81.255.70)
 4  abilene-houston-pos-oc3.tis.tulane.edu (129.81.255.2)
 5  atla-hstn.abilene.ucaid.edu (198.32.8.34)  University Corporation for Advanced Internet Development
 6  wash-atla.abilene.ucaid.edu (198.32.8.66)
 7  wash-abilene-oc48.maxgigapop.net (206.196.177.1)  Mid-Atlantic Crossroads (MAX)
 8  clpk-so3-1-0.maxgigapop.net (206.196.178.46)
 9  wash-nlm.maxgigapop.net (206.196.177.34)
10  130.14.38.185 (130.14.38.185)
11  micasaweb.nlm.nih.gov (130.14.22.106)
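The packet idea described above (addresses, a piece of the data, its position in the file, and a check value) can be sketched as follows. This is a simplified model, not the real TCP/IP packet layout: the field names are invented, and CRC-32 from `zlib` stands in for whatever check the real protocols use.

```python
import zlib

def make_packets(data: bytes, src: str, dst: str, size: int = 8):
    """Split data into small packets carrying addresses and bookkeeping."""
    packets = []
    for seq, start in enumerate(range(0, len(data), size)):
        chunk = data[start:start + size]
        packets.append({
            "src": src,                   # sender's address
            "dst": dst,                   # recipient's address
            "seq": seq,                   # which part of the file this is
            "data": chunk,
            "check": zlib.crc32(chunk),   # key used to detect corruption
        })
    return packets

def is_intact(packet) -> bool:
    """Recompute the check value and compare it with the one that was sent."""
    return zlib.crc32(packet["data"]) == packet["check"]

def reassemble(packets) -> bytes:
    """Put intact packets back in order and join their data."""
    good = sorted((p for p in packets if is_intact(p)), key=lambda p: p["seq"])
    return b"".join(p["data"] for p in good)

message = b"<html>PubMed citation database</html>"
packets = make_packets(message, "130.14.29.110", "129.81.38.94")
print(reassemble(packets) == message)
```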

 

There is the possibility that unscrupulous people may pretend to be other computers and intercept private data, like credit card numbers. This is why some connections use secure, encrypted transfers (https instead of http), which prevent others from deciphering what is being sent.

Once the file is sent, your browser determines what kind of file it is (picture, text, or html text file with instructions for downloading other files embedded in it) and displays the file. The server can tell your computer what kind of file it is sending, like an audio file or spreadsheet, which might be used by another program on your computer.

 

Another note on your IP address: If you are dialing in by modem to get internet access, you use the PPP protocol to connect with a UNO computer. In this case the server to which you dialed assigns you a temporary IP number for the duration of the connection. The next time you dial, you will probably get a different number. An analogous assignment, called DHCP, is made to some computers connected directly to the local ethernet. In this case a DHCP server on the network assigns you a temporary IP number, which you keep until you unhook or restart your computer.

 

Important Databases:

Genbank and EMBL DNA sequence databases

Both contain virtually all known sequences, including complete genomes

Genbank and SWISSPROT protein sequence databases

Mostly translated coding sequences from the DNA database
Important file formats for both protein and DNA databases are:


GenBank: protein example - DNA example

PDB: Protein Data Bank 3-D structural database

Genome databases, most accessible through Entrez

Currently there are:
96 complete Bacterial genomes
16 complete Archaeal genomes
19 complete Eukaryal genomes, including Human
and hundreds of viral genomes

Last year there were:
53 complete Bacterial genomes
11 complete Archaeal genomes
10 complete Eukaryal genomes
and hundreds of viral genomes

PubMed citation database

Thousands of titles and abstracts from medically relevant journals dating back to the 1960s. Some older citations are also available. Powerful searching capabilities are essential for identifying articles of interest. Similar databases are available for other disciplines (e.g. agriculture).

PubMed introduction and tutorial


This page is condensed from the NCBI PubMed Tutorial Pages. You may find the full tutorial quite useful.

When you enter search terms on the main PubMed search page,  the PubMed server processes your request to attempt to identify what type of search you are attempting: are you looking up an author name, journal title, subject area, or phrase from the article abstract?  It accomplishes this by filtering your search terms through successive lists to identify the types of terms you provide and use them effectively. This process is called:
Automatic Term Mapping

PubMed compares your search terms against several lists of search terms to determine what you are looking for. It checks four lists in order  and stops looking once it finds a match:
 

  1. MeSH (Medical Subject Heading) Translation Table
  2. Journals Translation Table
  3. Phrase List
  4. Author Index
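The "check four lists in order and stop at the first match" behaviour can be sketched as below. The list contents are made-up placeholders, not PubMed's real tables, and `map_term` is a hypothetical name; only the ordering logic is taken from the description above.

```python
# Hypothetical, tiny stand-ins for the four lists, in the order they are checked.
SEARCH_LISTS = [
    ("MeSH", {"cell", "rna", "enzymes"}),
    ("Journal", {"cell", "rna", "j biol chem"}),
    ("Phrase", {"signal transduction"}),
    ("Author", {"smith ja"}),
]

def map_term(term: str) -> str:
    """Return the name of the first list containing the term."""
    key = term.strip().lower()
    for list_name, entries in SEARCH_LISTS:
        if key in entries:
            return list_name      # stop looking once a match is found
    return "All Fields"           # no list matched

print(map_term("Cell"))          # in both MeSH and Journal lists; MeSH wins
print(map_term("J Biol Chem"))
```

Because the lists are consulted in a fixed order, a term that appears in more than one list is always interpreted according to the earliest list.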


The MeSH Translation Table contains:

  • MeSH terms and Subheadings
  • Synonyms for MeSH terms
  • Chemical Names of Substances

The Journals Translation Table contains:

  • Full journal titles
  • MEDLINE title abbreviations
  • International Standard Serial Numbers (ISSN)

    Since MeSH terms are searched before journal titles, if you want to look up a journal whose name is also a MeSH term, like RNA or Cell, the search will stop with the MeSH term and the search for your journal will not be done.

    The Phrase List contains several hundred thousand phrases generated from:

  • MeSH
  • Unified Medical Language System (UMLS)
  • Chemical Names of Substances

    These are frequently used phrases that are not a part of the MeSH translation table.

    Author Searching

    The format for author searching is last name plus initials.
    PubMed will automatically truncate the author's name to account for varying initials.

    If the term is not found, PubMed will then search the individual words in All Fields.

    You can also try putting a phrase in double quotes if the results returned are not what you expected. This will force PubMed to look for the words as a phrase, but it bypasses the Automatic Term Mapping, so you might want to try doing some searches both with and without double quotes.

    Truncation

    You can truncate a word with the asterisk (*) wildcard. This causes PubMed to return all matches that begin with the truncated string of text (e.g. enzym* will match enzyme, enzymes, enzymology, enzymatic, etc.). Truncation also turns off Automatic Term Mapping, so the results will differ from those of non-truncated searches.
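As a small sketch of what truncation means, the function below treats a trailing asterisk as "match any word beginning with this stem". The vocabulary is a made-up word list and `truncation_matches` is a hypothetical name; PubMed's own index is of course far larger.

```python
def truncation_matches(pattern, vocabulary):
    """Return the words matched by a search term, honoring a trailing *."""
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return [word for word in vocabulary if word.startswith(stem)]
    return [word for word in vocabulary if word == pattern]

vocabulary = ["energy", "enzymatic", "enzyme", "enzymes", "enzymology"]
print(truncation_matches("enzym*", vocabulary))
```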

    Stopwords

    PubMed also refers to a list of commonly found words called "stopwords." These are very common words that would match almost every citation, so they are skipped.

    The list of stopwords is from PubMed's Help Page.

     
    a did it perhaps these
    about do its quite they
    again does itself rather this
    all done just really those
    almost due kg regarding through
    also during km seem thus
    although each made seen to
    always either mainly several upon
    among enough make should use
    an especially may show used
    and etc. mg showed using
    another for might shown various
    any found ml shows very
    are from mm significantly was
    as further most since we
    at had mostly so were
    be has must sum what
    because have nearly such when
    been having neither than which
    before here no that while
    being how nor the with
    between however obtained their within
    both I of theirs without
    but if often them would
    by in on then
    can into our there
    could is overall therefore

    Operators

    You can use Boolean operators (AND, OR, NOT) to direct your search. These must be entered in UPPERCASE. Operators are processed left-to-right unless you use parentheses to specify the order.
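Left-to-right processing can be illustrated with sets of citation numbers. The index below is invented for the example (the numbers are not real PMIDs), and `evaluate` is a hypothetical helper that handles a flat list of terms and operators without parentheses.

```python
# Hypothetical index: search term -> set of matching citation numbers.
index = {
    "mucin": {1, 2, 3},
    "porcine": {3, 4},
    "review": {2, 4, 5},
}

def evaluate(tokens, index):
    """Evaluate [term, OP, term, OP, term, ...] strictly left to right."""
    result = set(index[tokens[0]])
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        hits = index[term]
        if operator == "AND":
            result &= hits
        elif operator == "OR":
            result |= hits
        elif operator == "NOT":
            result -= hits
        else:
            raise ValueError("operators must be AND, OR, or NOT in uppercase")
    return result

# mucin OR porcine AND review, processed left to right: (mucin OR porcine) AND review
left_to_right = evaluate(["mucin", "OR", "porcine", "AND", "review"], index)

# With parentheses: mucin OR (porcine AND review)
with_parentheses = index["mucin"] | (index["porcine"] & index["review"])

print(left_to_right, with_parentheses)
```

The two results differ, which is why parentheses matter whenever AND and OR are mixed in one search.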

    Once you click the "Go" button, your search is performed and the first 20 hits are displayed in a Summary format:

  • Author name(s)
  • Title of the article: brackets indicate a title translated from a foreign language.
  • Source: a brief journal citation.
  • Identification number: A PubMed Unique Identifier (PMID) is included on each record.
  • Links: Includes links to Related Articles and databases, when available.

    You can easily scan this first page of citations and see how many of them are really related to what you were trying to find. Though only the first 20 citations are displayed by default (in reverse chronological order), you can see how many total articles matched your search. If you got a surprisingly small or large number of hits, or if there seems to be a high percentage of extraneous hits, you might want to click on the "Details" button in the upper gray box.

    Details Button

    Clicking Details displays:

     

  • The PubMed query box shows exactly how PubMed performed your search using the Automatic Term Mapping. It may have found a synonym in the MeSH headings and used that instead of one of your original terms.
  • You can edit the search used and run the edited search by clicking "Search".
  • If the search worked really well, you can save it as a web link by clicking "URL". This formats your search as a URL link your web browser can save as a bookmark to repeat the search at a later date. You can also use the "Cubby" system described below.
  • The "Result" section shows how many hits you got, and links you back to your hits. The translations section describes how each term of your search was interpreted.
  • The database is PubMed, and the User Query is what you typed in to begin with.

    Limits Button

    If your search was not specific enough, you can use the "Limits" button in the Features bar to manually limit your search based upon specific fields. The default setting is "All Fields".
  • You can select Publication types (like reviews) from another menu. You can limit searches to specific dates or trials involving subjects in specific age groups, gender, or human/non-human.
  • You can require that hits have Abstracts, though some reviews do not have abstracts, nor do articles indexed before 1975.

    Preview/Index Button


    You can have even more control over limits by using the Preview/Index feature. You can add search terms limited to specific fields, and you can preview the number of results by clicking on the preview button.

  • By clicking on index, you can also look up search terms in the index (for example the index of MeSH terms). Items can be added to the search window using the AND, OR, or NOT buttons.
  • Different searches can be combined using their Query number found in the Preview/Index page (e.g. #4 AND #5); a more extensive list is found on the History page. Note that these query numbers disappear after 1 hour of inactivity, so you can't use yesterday's Query number tomorrow and get the same result.
  • You also cannot use these numbers to save your results as a URL in the details window, but you can manually cut and paste the query lines together to save them.

    Results

    Now that you have constructed the perfect search, you can select the perfect format for displaying results. The default is 20 results in Summary format, but other citation display formats can be chosen from the list of choices under "Summary":

     
    Brief format includes:
    Abstract format provides the summary information in addition to:
    Citation format is similar to abstract, but also includes:
    MEDLINE format is a text file with identifying letters before each field. It is most useful for importing into bibliography programs like EndNote and ProCite.

    Selecting Citations and Display Format

    You can select a subset of the hits to display by clicking the box before each item. If you don't click any boxes, then all are displayed.

    Add to Clipboard

    You can select individual citations to save in a clipboard on the server. This is not the clipboard on your computer. After selecting items by clicking their checkbox, click on the "Add to clipboard" link.

    Save Button

    You can save citations to a file on your computer by clicking the "Save" link. There is a limit of 10,000 hits. To save selected citations, pick a display format and press "Save". You will be prompted for where to save the downloaded file.

    Text Button

    You can have the selected items displayed as plain text by clicking the "Text" button. This may be useful for printing if your browser doesn't print the hypertext files well.

    Cubby

    If you set up a "Cubby", you can save your favorite searches indefinitely on the PubMed server. You have to get a username and password. You can then save your search and rerun it at a later date. Or you can run the search for new articles published since the last time you searched.

    LinkOut Preferences

    The LinkOut service enables publishers, libraries (like LSU Medical School), biological databases, sequence centers, and other Web resources to display links to their sites on records in PubMed.

    You can use Cubby to set which links are displayed by

    When you are logged into Cubby, PubMed displays LinkOut providers according to your preferences.
    Related Articles - Compares words from the title, abstract, and MeSH headings to identify articles similar to the selected article.

    Related Articles

    NCBI Databases

    These are the NCBI databases that may be linked to from individual PubMed citations:

     

    Sequence alignment


    Alignment of DNA or protein sequences in bioinformatics serves a number of purposes:

    DNA sequence data can be aligned to assemble larger contiguous DNA sequences
    Protein sequences can be aligned to identify important regions, such as active sites
    DNA sequences can be aligned to identify regulatory control regions

    Alignment of any two sequences implies that those sequences are related.

    Overlapping DNA sequence reads are derived from a single DNA molecule
    Protein, RNA, or DNA sequences from a particular gene for different organisms are related by the ancestry of those organisms
    Duplicated genes (and pseudogenes) within an organism are related to each other or a particular parental gene

    Molecular sequences that are not related in some way do not produce statistically significant sequence alignments.

    A key, then, to answering many bioinformatics questions lies in the ability to reliably align sequence information.

    Sequences can be aligned:
    Globally, in which as many characters as possible over the entire sequence are aligned
    Or locally, in which regions of sequences which are highly similar are aligned


    Sequences can be compared in pairs or as multiple alignments of related sequences
    In practice, these alignments are performed similarly, and can be understood in the context of a dot-matrix sequence comparison.

    Dot Matrix Analysis

    Dot matrix analysis displays the primary sequence of pairs of proteins on the X and Y axes of a graph.
    Dots are plotted on the graph where the X and Y coordinate sequences are identical.
    Regions of identical sequence are revealed as diagonal rows of dots.
    Random matches are seen as isolated dots.
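A dot matrix is simple to compute. The sketch below builds one for two short peptide fragments (the same fragments used in the small worked example in these notes) and prints it as text; the function names are my own, and real dot-plot programs draw a graph rather than text.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; 1 marks identical residues."""
    return [[1 if x == y else 0 for x in seq_x] for y in seq_y]

def render(seq_x, seq_y):
    """Format the matrix as text, with seq_x across the top."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("1" if v else " " for v in row))
    return "\n".join(lines)

seq_x = "PVILEPMMKVTIEMP"   # across the top (X axis)
seq_y = "PVILEPIMRVEVTTP"   # down the side (Y axis)
matrix = dot_matrix(seq_x, seq_y)
print(render(seq_x, seq_y))
```

The run of 1's along the main diagonal is the region of identical sequence; the scattered 1's elsewhere are random matches.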


    A comparison of DNA sequences for elongation factor EF-G from S. aureus and T. thermophilus is shown below left. A dot is placed every time a nucleotide in the S. aureus sequence matches that of T. thermophilus. Since there are only four types of nucleotides, random matches occur in many places throughout the sequence and obscure regions of significant similarity.
    A similar analysis of the corresponding peptide sequences is shown at right. Random matches of the 20 amino acids also occur, but diagonal regions of sequence similarity are all still discernible.

    Left: DNA alignment - single nucleotide identity. Right: Protein alignment - single residue identity.

     

    Random matches can be filtered by using a sliding window to compare sequences. In this case, a block of characters is compared between sequences. A dot is placed on the alignment only if a threshold value of matches is detected.

    An example is shown below on a small portion of the amino acid sequence. At left, a 1 (a dot) is placed wherever the sequences match.
    In the center panel, a score is placed for the number of matches between sequences for each block of 3 amino acids.
    For 3 consecutive matches, the score is 3
    For 2 out of 3 the score is 2
    For 1 out of 3 the score is 1

    Although there are more 1's in the second matrix, ignoring all values below 2 (right panel) removes the random matches.

      P V I L E P M M K V T I E M P
    P 1         1                 1
    V   1               1          
    I     1                 1      
    L       1                      
    E         1               1    
    P 1         1                 1
    I     1                 1      
    M             1 1           1  
    R                              
    V   1               1          
    E         1               1    
    V   1               1          
    T                     1        
    T                     1        
    P 1         1                 1
      P V I L E P M M K V T I E M P
    P 3         1     1 1         1
    V   3               1 1        
    I     3               1 1   1 1
    L       3               1 1   1
    E 1       2         1     1 1  
    P 1 1     1 2         1 1     1
    I     1     1 1         1 1    
    M             1 2           1  
    R 1   1           1   1     1 1
    V   1   1       1   1   1     1
    E 1       1       2       1    
    V   1             1 2          
    T       1           1 1   1    
    T     1   1           2     2 1
    P 1   1 1   1         1 1   1 3
      P V I L E P M M K V T I E M P
    P 3                            
    V   3                          
    I     3                        
    L       3                      
    E         2                    
    P           2                  
    I                              
    M               2              
    R                              
    V                              
    E                 2            
    V                   2          
    T                              
    T                     2     2  
    P                             3
    The three matrices above are, in order: Word Size = 1 (left panel); Word Size = 3 (center panel); Word Size = 3, threshold = 2 (right panel).
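The sliding-window scoring can be sketched as below. The assumption here is that the score at each position counts matches in a block of 3 residues starting at that position and running down the diagonal. Cells near the end of the fragment can score lower here than in the figure, because the window runs off the end of this short excerpt while the figure was presumably computed from the full sequences.

```python
def window_scores(seq_x, seq_y, word=3):
    """Score each cell by the number of matches in a diagonal block of `word`."""
    scores = []
    for j in range(len(seq_y)):
        row = []
        for i in range(len(seq_x)):
            score = 0
            for k in range(word):
                inside = i + k < len(seq_x) and j + k < len(seq_y)
                if inside and seq_x[i + k] == seq_y[j + k]:
                    score += 1
            row.append(score)
        scores.append(row)
    return scores

def apply_threshold(scores, threshold=2):
    """Blank out (set to 0) every cell scoring below the threshold."""
    return [[s if s >= threshold else 0 for s in row] for row in scores]

seq_x = "PVILEPMMKVTIEMP"
seq_y = "PVILEPIMRVEVTTP"
scores = window_scores(seq_x, seq_y)
filtered = apply_threshold(scores)
print(scores[0][0], scores[4][4], filtered[6][6])
```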

    The full matrix for the proteins is shown below:

    Left: Protein - single residue identity. Right: Protein identity, 5 residues of 23.

    Even using a low stringency window greatly improves the signal/noise ratio. One can readily see a diagonal stretching across the matrix indicating sequence similarity throughout the length of both sequences. This works well for DNA sequences too, but you must use higher stringencies; longer lengths help too:

    Left: DNA alignment - single nucleotide identity. Right: DNA alignment - 15 of 23 identical.

     


    Going back to proteins, this one is Dr. Timpte's favorite!

    You can also compare a sequence with itself to look for repeated sequences:

    Protein sequence of porcine submaxillary mucin compared to itself looking for exact matches of 39 residues.

    You can see a diagonal of unique sequence at the termini.
    In between are 135.6 repeats of an 81 residue sequence, extending 10991 amino acids. This is an extreme case.
    Repeats are often detectable in comparing related sequences to each other.
    Dot matrix analysis makes them painfully obvious.


    The alignment can be improved by raising the score for a match to +6 and penalizing a mismatch with -1:

    Left: Protein identity, 5 residues of 23 (match=1, mismatch=0). Right: Protein identity, 5 residues of 23 (match=6, mismatch=-1).

    Virtually all the background matches have been eliminated. Even better results can be obtained using weighted scoring matrices.

    Substitution matrices

     

    Comparison of related proteins revealed that substitution of chemically similar amino acid residues was fairly common in many positions of the protein sequences.

    Other substitutions were found to be less common
    Some were extremely rare.

    Substitution matrices quantitate these differences in substitution frequencies for use in evaluation of alignments

    Left: Protein identity, 5 residues of 23 (match=6, mismatch=-1). Right: Protein BLOSUM62 matrix, 5 residues of 23 match.

     

    Where do scoring matrices come from?

    BLOSUM Matrix (Henikoff and Henikoff, 1992)

    BLOSUM matrix is based upon the BLOCKS database of protein motifs

    e.g. RNase P BLOCK AxxRNR...

    what are the frequencies of substitutions at the xx positions?
    Allows alignment of short regions of sequence from very divergent proteins

    2000 blocks of conserved patterns from 500 different families of proteins
    Non-conserved amino acids counted and frequency of substitution pairs counted
    Closely related proteins can cause overrepresentation of sequences, so these sequences were merged into a single consensus

    BLOSUM62: sequences more than 62% identical in the data set were merged before substitutions were counted

    BLOSUM62 Matrix

       C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
    C  9
    S -1  4
    T -1  1  5
    P -3 -1 -1  7
    A  0  1  0 -1  4
    G -3  0 -2 -2  0  6
    N -3  1  0 -2 -2  0  6
    D -3  0 -1 -1 -2 -1  1  6
    E -4  0 -1 -1 -1 -2  0  2  5
    Q -3  0 -1 -1 -1 -2  0  0  2  5
    H -3 -1 -2 -2 -2 -2  1 -1  0  0  8
    R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5
    K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5
    M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5
    I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4
    L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4
    V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4
    F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6
    Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7
    W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11

    What the scores mean:
    Score=0: the substitution of one amino acid with the other occurs as often as expected by random chance
    Score>0: the substitution occurs more often than expected by chance (i.e. evolutionarily selected for)
    Score<0: the substitution occurs less often than expected by chance (i.e. evolutionarily selected against)
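The sign of a score follows from a log-odds calculation: the observed frequency of a substitution pair is divided by the frequency expected by chance, and the logarithm is taken. The sketch below assumes a base-2 logarithm scaled by 2 and rounded to an integer, which is the scaling usually quoted for BLOSUM62; the frequencies are made-up numbers chosen only to show the three cases.

```python
import math

def log_odds_score(observed, expected, scale=2.0):
    """Integer log-odds score: scale * log2(observed / expected), rounded."""
    return round(scale * math.log2(observed / expected))

# Hypothetical pair frequencies (not taken from the BLOCKS database):
print(log_odds_score(0.01, 0.01))     # seen as often as chance predicts
print(log_odds_score(0.04, 0.01))     # seen four times more often than chance
print(log_odds_score(0.0025, 0.01))   # seen four times less often than chance
```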

    Gap penalties

    Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any 2 sequences. Also, gaps in a protein sequence mean insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap, and a separate penalty for extending it.
    e.g.

        AGGTTCGACT
        --GT-C-A-T

    Usually the penalty is high for opening a new gap but smaller for extending one. (If a gap is actually allowed in a protein sequence, such as variation in the size of a loop, the size difference is probably not critical.) Gap and extension penalties vary with the matrix used. Usually the default values for a given scoring matrix work best.
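The effect of separate opening and extension penalties can be sketched as below. The convention used (each gap costs the opening penalty for its first position plus the extension penalty for every additional position) and the values 11 and 1 are assumptions for illustration; programs differ in exactly how they count.

```python
import re

def gap_penalty(aligned, gap_open=11, gap_extend=1):
    """Total penalty for all runs of '-' in one aligned sequence."""
    total = 0
    for run in re.findall(r"-+", aligned):
        total += gap_open + gap_extend * (len(run) - 1)
    return total

scattered = gap_penalty("--GT-C-A-T")   # four separate gaps, 5 positions in all
single = gap_penalty("-----GTCAT")      # one gap covering the same 5 positions
print(scattered, single)
```

Four scattered gaps cost far more than one long gap of the same total length, which is what discourages alignments like the example above.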

    So how about the actual alignments?

    Most commonly used algorithms are FASTA and BLAST

    Both are useful for searching for matches of your sequence to a large database, such as Genbank

    To do so, both use simplifying shortcuts to minimize computational time without sacrificing much sensitivity

    BLAST (see the BLAST Tutorial Page)

    First, regions of the sequence that are low complexity (like ST-rich regions, which are easily substituted) are filtered out
    Then a list of overlapping 3- or 4-letter words is derived from the sequence
    Substitution scores are used to identify high-scoring words above a threshold value
    -> small number of words (~50) to compare to the database
    The database is already indexed for high-scoring words
    Matches within a given distance of other matches on the same diagonal are used to align the sequences
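A much-simplified sketch of the word-matching step is shown below. It is not the real BLAST implementation: it uses exact word matches only (no substitution scoring or threshold), skips the low-complexity filter, and indexes a single subject sequence rather than a database. The two fragments are short excerpts of the S. aureus and T. thermophilus EF-G sequences compared in these notes, and all function names are my own.

```python
def words(seq, size=3):
    """All overlapping words of the given size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

def build_index(subject, size=3):
    """Map each word to the positions where it occurs in the subject."""
    index = {}
    for pos, word in enumerate(words(subject, size)):
        index.setdefault(word, []).append(pos)
    return index

def seed_hits(query, subject, size=3):
    """Return (query position, subject position, diagonal) for each word match."""
    index = build_index(subject, size)
    hits = []
    for qpos, word in enumerate(words(query, size)):
        for spos in index.get(word, []):
            hits.append((qpos, spos, spos - qpos))
    return hits

query = "EPMMKVTIEMPEEYMGD"
subject = "EPIMRVEVTTPEEYMGD"
hits = seed_hits(query, subject)
print(hits)
```

Hits that share the same diagonal value and lie close together are the ones that would be joined and extended into an alignment.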

    BLAST alignment of S. aureus (Query) vs T. thermophilus (Subject)

    
    Query:     4   EFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGXXX 63
                   E+ L++ RNIGI AHIDAGKTTTTERILYYTGRIHKIGE HEGA+ MD+MEQE++RG   
    Sbjct:     6   EYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITI 65
    np-binding 19               ********
    
    Query:     64  XXXXXXXXWEGHRVNIIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQA 123
                           W+ HR+NIIDTPGHVDFT+EVERS+RVLDGA+ V+D+  GVEPQ+ETVWRQA
    Sbjct:     66  TAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQA 125
    np-binding 83                   *****
    
    Query:     124 TTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKC 183
                     Y VPRI F NKMDK GA++   + T+++RL A    +QLPIG ED F  IID++ MK 
    Sbjct:     126 EKYKVPRIAFANKMDKTGADLWLVIRTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKA 185
    np-binding 137            ****
    
    Query:     184 FKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEA 243
                   + Y NDLGT+I EI IPE++LD+A E    L+E  A+  + +M KYL  EE +  EL  A
    Sbjct:     186 YTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAA 245
    
    Query:     244 IRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-K 302
                   IR+ T +++  PV+ G+A+KNKGVQL+LDAV+DYLPSPLD+ PI G   + PE EV+   
    Sbjct:     246 IRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIH 302
    
    Query:     303 ADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANS 362
                    D +  +AALAFK+M DPYVG+LTF RVYSGT+TSGSYV N+TKG++ERV RLL+MHAN 
    Sbjct:     303 PDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANH 362
    
    Query:     363 RQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-IILESMEFPEPVIHLSVEPKSKADQD 421
                   R+E++ + +GD+ A VGLK+T TGDTL GE    +ILES+E PEPVI +++EPK+KADQ+
    Sbjct:     363 REEVEELKAGDLGAVVGLKETITGDTLVGEDAPRVILESIEVPEPVIDVAIEPKTKADQE 422
    
    Query:     422 KMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVS 481
                   K++QAL +L EEDPTF   T  ETGQ II GMGELHL+I+VDR+K+EF V+ NVG P V+
    Sbjct:     423 KLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVGKPQVA 482
    
    Query:     482 YRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSV 541
                   YRET      V+GKF RQ+GGRGQYG V I+  P   G+GFEF NAIVGGV+P+EYIP+V
    Sbjct:     483 YRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAV 542
    
    Query:     542 EAGLKDAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFXXXXXXXXXXXXXXCDPVIL 601
                   + G+++AM++G L G+P++D+K  LYDGSYH+VDSSEMAF               DPVIL
    Sbjct:     543 QKGIEEAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVIL 602
    
    Query:     602 EPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYVPLSEMFGYATSLRSN 661
                   EP+M+V +  PEEYMGD++GD+ +RRG++ GMEPRGNAQV+ A+VPL+EMFGYAT LRS 
    Sbjct:     603 EPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFVPLAEMFGYATDLRSK 662
    
    Query:     662 TQGRGTYTMYFDHYAEVPKSIAEDIIK 688
                   TQGRG++ M+FDHY EVPK + E +IK
    Sbjct:     663 TQGRGSFVMFFDHYQEVPKQVQEKLIK 689
    

    Results are similar to FASTA using the same matrix. XXXXX indicates regions masked for low complexity.

     

    Selecting the BLAST Program

    You need to pick the right version of BLAST to suit your data:
     
    Program  Description
    blastp Compares an amino acid query sequence against a protein sequence database.
    blastn Compares a nucleotide query sequence against a nucleotide sequence database. NOT necessarily the best way to compare 2 DNAs!
    blastx Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. Good way to find protein sequences related to your DNA.
    tblastn Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
    tblastx Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Some sites do not allow use of this program because it is computationally intensive.
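The table above amounts to a lookup on the type of query and the type of database. The sketch below encodes it; `choose_program` and its `translate_both` argument are hypothetical names, not part of any BLAST software.

```python
# (query type, database type) -> program, as described in the table above
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

def choose_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program; translate_both requests a translated DNA-vs-DNA search."""
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, database_type)]

print(choose_program("nucleotide", "protein"))
```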



    Selecting the BLAST Database (from the BLAST tutorial)

    You can select several NCBI databases to compare your query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example a blastn search against swissprot).


    Proteins

    Database Description
    nr All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF 
    month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. 
    swissprot The last major release of the SWISS-PROT protein sequence database.
    patents Protein sequences derived from the Patent division of GenBank.
    yeast Protein translations of the Yeast complete genome.
    E. coli E. coli (Escherichia coli) genomic CDS translations.
    pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
    kabat Kabat's database of sequences of immunological interest.
    alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences


    Nucleotides

    Database Description
    nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).
    month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
    dbest Non-redundant database of GenBank+EMBL+DDBJ EST Divisions. Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources.
    dbsts Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. Sequence Tagged Site: random physical genomic map marker.

    mouse ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.
    human ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.
    other ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human.
    yeast Sequence fragments from the Yeast complete genome.
    E. coli E. coli (Escherichia coli) genomic nucleotide sequences.
    pdb Sequences derived from the 3-dimensional structure of proteins.
    kabat [kabatnuc] Kabat's database of sequences of immunological interest.
    patents Nucleotide sequences derived from the Patent division of GenBank.
    vector Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
    mito Database of mitochondrial sequences (Rel. 1.0, July 1995).
    alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
    gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
    htgs High Throughput Genomic Sequences.

    Evaluation of alignments (from the BLAST tutorial)

    The Raw Score (S) is the sum of all the matches for a given alignment, minus mismatch and gap penalties. This value tends to be larger for larger proteins, so how can you tell if it is significant?

    Expected (E) value is the number of alignments with a score of at least S that would be expected to occur by chance:

    E = K m n e^(-lambda S)

    where m and n are sequence lengths and K and lambda are scaling factors dependent upon the size of the search and the scoring system.

    Probability (P) of finding a higher score:

    P = 1 - e^(-E)

    When E< 0.01, P~E

    So the E value is a good measure of the significance of any alignment. Numbers approaching 1 are not likely to be significant.

    Numbers much less than 1 are highly significant.
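
    A short numerical sketch shows how E and P behave. It assumes the usual form of the statistic, E = K m n e^(-lambda S), with m and n the query and database lengths. The values of K and lambda below are illustrative placeholders, not parameters taken from this lecture or from any particular scoring matrix.

```python
import math


def expected_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def p_value(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    K, lam = 0.13, 0.32      # placeholder scaling factors (assumed)
    m, n = 300, 1_000_000    # hypothetical query and database lengths
    for score in (40, 60, 80, 100):
        e = expected_value(score, m, n, K, lam)
        print(f"S={score:4d}  E={e:.3e}  P={p_value(e):.3e}")
```

    Running this shows E dropping exponentially as the raw score rises, and P becoming practically equal to E once E falls below about 0.01.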

    FASTA

    Works similarly to BLAST, with some differences
    Sequences are broken into words (also called k-tuples)
    Matches between word characters are detected, offset of positions calculated
    Series of character matches with same offset = a hit
    Additional matches on same diagonal are identified -> diagonal matches
    Diagonal matches within certain distance of other matches are joined using gaps
    Joined regions are evaluated using substitution matrix scores
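
    The word lookup and offset steps above can be sketched in a few lines. This is a simplified illustration of the idea, not the FASTA program: it only counts exact word matches per diagonal and does no rescoring or joining, and the function names and the example sequences are invented for the sketch.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Index every k-letter word (k-tuple) in seq by its start position."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count word matches per diagonal, where diagonal = subject pos - query pos."""
    index = word_positions(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


if __name__ == "__main__":
    # Hypothetical short sequences; the shared segment sits at offset 2.
    hits = diagonal_hits("GHVDFT", "AAGHVDFT", k=2)
    for offset, count in hits.most_common():
        print(f"offset {offset}: {count} word matches")
```

    Many word matches sharing one offset mean the two sequences line up without gaps in that region, which is the signal FASTA then rescores with a substitution matrix.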

    FASTA alignment of EF-G sequences using BLOSUM50 matrix (for very divergent sequences)

    >>gi|119190|sp|P13551|EFG_THETH ELONGATION FACTOR G (EF-G)                    (691 aa)
     initn: 2949 init1: 1359 opt: 2943  Z-score: 3250.5  bits: 611.9 E(): 3.1e-174
    Smith-Waterman score: 2943;  62.591% identity (63.050% ungapped) in 687 aa overlap (4-688:6-689)
    >gi|119    4- 688:---------------------------------------------------------------------:
    
                     10        20        30        40        50        60        70        
    gi|692   MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGITITSAATTAAWEGHRVN
                :..:.. ::::: ::::::::::::::::::::::::::.::::. ::.::::..::::::.:.::  :. ::.:
    gi|119 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITITAAVTTCFWKDHRIN
                   10        20        30        40        50        60        70        80
    
           80        90       100       110       120       130       140       150        
    gi|692 IIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQAN
           :::::::::::.:::::.::::::..:.:...:::::.:::::::  : ::::.:.::::: ::..   . :...:: : 
    gi|119 IIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVIRTMQERLGAR
                   90       100       110       120       130       140       150       160
    
          160       170       180       190       200       210       220       230        
    gi|692 AAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVS
            . .::::: :: : .:::...:: . : :::::.:.:: :::..::.:.: . .:.:..:. ....: :::  :: .  
    gi|119 PVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEE
                  170       180       190       200       210       220       230       240
    
          240       250       260       270       280       290       300        310       
    gi|692 ELKEAIRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
           ::  :::..: .... ::. :.:.:::::::.::::.::::::::. :: :   ..:: ::.  . : .. .::::::.:
    gi|119 ELVAAIRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
                  250       260       270       280       290          300       310       
    
           320       330       340       350       360       370       380       390       
    gi|692 TDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I
           .:::::.:::.::::::.:::::: :.:::..:::.:::.:::: :.:.. . .::..:.::::.: ::::: ::    .
    gi|119 ADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV
           320       330       340       350       360       370       380       390       
    
            400       410       420       430       440       450       460       470      
    gi|692 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVG
           ::::.: ::::: ...:::.::::.:..:::..: ::::::.. :  ::::.::.:::::::.:.:::.:.::.:. :::
    gi|119 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVG
           400       410       420       430       440       450       460       470       
    
            480       490       500       510       520       530       540       550      
    gi|692 APMVSYRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLKDAMENGVLAG
            :.:.::::. . ..:.::: ::.::::::: :.:.  :   :.:::: :::::::.:.::::.:. :...::..: : :
    gi|119 KPQVAYRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIEEAMQSGPLIG
           480       490       500       510       520       530       540       550       
    
            560       570       580       590       600       610       620       630      
    gi|692 YPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKCDPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPR
           .:..:.:. :::::::.:::::::::::.:.:.:::..: :::::::.:.: .  :::::::..::...:::.. :::::
    gi|119 FPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPR
           560       570       580       590       600       610       620       630       
    
            640       650       660       670       680       690   
    gi|692 GNAQVVNAYVPLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE
           :::::. :.:::.:::::::.:::.:::::...:.:::: ::::.. : .::     
    gi|119 GNAQVIRAFVPLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ   
           640       650       660       670       680       690    
    
    
    Different matrices can give different alignments and scores. Compare the BLOSUM50 alignment:
    Gap penalty: -10, extension: -2

    280       290       300        310       
    LPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
    ::::::. :: :   ..:: ::.  . : .. .::::::.:
    LPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM

    with the PAM120 alignment (for less divergent sequences):
    Gap penalty: -16, extension: -4

    LPSPLDVKPIIGHRASNPEEEVIAKADDSAEFAALAFKVMT
    ::::::. :: :  ..   : :   .: .. .::::::.:.
    LPSPLDIPPIKG--TTPEGEVVEIHPDPNGPLAALAFKIMA
          290         300       310        

    Variations of the FASTA program are available for protein-protein and DNA-DNA comparisons, and for DNA-protein comparisons in which the DNA is translated.


    Other search types

    Smith-Waterman and Needleman-Wunsch use a dynamic programming algorithm to find optimal local and global alignments, respectively.

    Dynamic programming

    Identifies the highest scoring regions of the alignment and extends them to the end of the alignment.
    The highest score for a given region must be built from the optimal subscore of the step before.
    Steps are then traced back from the highest score at the end of the alignment to identify the optimal path to that score.

    Computationally intensive. Not typically used against databases.
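
    A minimal sketch of the local (Smith-Waterman) case illustrates the fill and traceback steps. It assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix such as BLOSUM50 with separate gap opening and extension penalties; the scoring values and example sequences are arbitrary choices for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is built from the optimal scores of its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: walk from the highest score back until the score reaches 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("AAAGGG", "AAATGGG")
    print(score)
    print(top)
    print(bottom)
```

    The table has one cell for every pair of positions, so the work grows with the product of the two sequence lengths, which is why this approach is slow against a whole database.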


    BLAST Assignment


    Jim Nolan - Tulane Department of Biochemistry, room 6053
    Phone: 584-2453 jnolan@tulane.edu
    My informatics course web address: http://www.tulane.edu/~biochem/lecture/723/