Since computers, and usually the internet, are so heavily involved in bioinformatics, a brief introduction to how the internet itself works may be beneficial. Much of this information was obtained from the UNH InterOperability Lab and PC Lube & Tune.
Let's start by clicking on a web page link in your internet browser, for example a link to www.ncbi.nlm.nih.gov.
The most important parts of this process are identifying the computers named "www.ncbi.nlm.nih.gov" and "me," and negotiating the transfer.
In this case, my IP address is 129.81.38.94.
All Tulane (tulane.edu) computers have addresses beginning with 129.81; all UNO (uno.edu) computers begin with 137.30. The network within Tulane is subdivided into smaller networks (subnets) interconnected by routers. My computer can connect with any computer with address 129.81.38.### without going through the router; to reach the outside world I need to use the router gateway.
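The subnet rule above can be sketched with Python's standard `ipaddress` module. This is only an illustration: the /16 and /24 prefix lengths are assumptions inferred from "129.81" and "129.81.38.###", and the function names are made up for this example.

```python
import ipaddress

# Addresses come from the text; the prefix lengths (/16, /24) are assumptions.
CAMPUS_NETWORK = ipaddress.ip_network("129.81.0.0/16")  # all Tulane addresses
MY_SUBNET = ipaddress.ip_network("129.81.38.0/24")      # 129.81.38.###


def on_campus(destination: str) -> bool:
    """True if the destination address begins with 129.81."""
    return ipaddress.ip_address(destination) in CAMPUS_NETWORK


def needs_gateway(destination: str) -> bool:
    """True if the destination lies outside my subnet, so traffic goes via the router."""
    return ipaddress.ip_address(destination) not in MY_SUBNET


print(needs_gateway("129.81.38.7"))    # neighbor on the same subnet
print(needs_gateway("129.81.224.50"))  # the campus DNS, on a different subnet
print(on_campus("130.14.22.106"))      # the NCBI web server
```

A real machine makes the same decision by comparing the destination against its own address and subnet mask.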
First of all, how do we find www.ncbi.nlm.nih.gov? It doesn't look much like my address. This is accomplished by Domain Name Servers (DNS), computers which keep lists of IP address numbers and corresponding names like "www.tulane.edu," which are easier to remember.
Each institution is responsible for listing all the computers within its domain and the corresponding name, if it has one. The DNS here can query other DNS to see if they have a "www.ncbi.nlm.nih.gov" and if so, what its real number is so we can contact it.
In this case the Tulane DNS is 129.81.224.50 (ns1.tcs.tulane.edu). You may get a "domain name server error" when you can't get through on the network. This could mean that the DNS is down, in which case you might be able to get through to your destination if you know the IP number. But usually this means the connection between you and the network is down, and the first place your computer checks is the DNS.
1 129.81.133.1 (129.81.133.1)
2 tidewater-et-4-1.net.tulane.edu (129.81.255.93)
3 newsouth-atm-1-0-0.net.tulane.edu (129.81.255.70)
4 abilene-houston-pos-oc3.tis.tulane.edu (129.81.255.2)
5 atla-hstn.abilene.ucaid.edu (198.32.8.34) University Corporation for Advanced Internet Development
6 wash-atla.abilene.ucaid.edu (198.32.8.66)
7 wash-abilene-oc48.maxgigapop.net (206.196.177.1) Mid-Atlantic Crossroads (MAX)
8 clpk-so3-1-0.maxgigapop.net (206.196.178.46)
9 wash-nlm.maxgigapop.net (206.196.177.34)
10 130.14.38.185 (130.14.38.185)
11 micasaweb.nlm.nih.gov (130.14.22.106)
There is the possibility that unscrupulous people may pretend to be other computers and intercept private data, like credit card numbers. This is why some transfers use secure, encrypted transfers (https instead of http) which prevent others from deciphering what is being sent.
Once the file is sent, your browser determines what kind of file it is (picture, text, or HTML text file with instructions for downloading other files embedded in it) and displays the file. The server can tell your computer what kind of file it is sending, like an audio file or spreadsheet, which might be used by another program on your computer.
Both contain virtually all known sequences, including complete genomes
Mostly translated coding sequences from the DNA database
Important file formats for both protein and DNA databases are:
GenBank: protein example - DNA example
Currently there are:
96 complete Bacterial genomes
16 complete Archaeal genomes
19 complete Eukaryal genomes, including Human
and hundreds of viral genomes
Last year there were:
53 complete Bacterial genomes
11 complete Archaeal genomes
10 complete Eukaryal genomes
and hundreds of viral genomes
Thousands of titles and abstracts from medically relevant journals dating back to the 1960s. Some older citations are also available. Powerful searching capabilities are essential for identifying articles of interest. Similar databases are available for other disciplines (e.g., agriculture).
This page is condensed from the NCBI PubMed Tutorial Pages. You may find the full tutorial quite useful.
When you enter search terms on the main PubMed search page, the PubMed server tries to identify what type of search you are attempting: are you looking up an author name, journal title, subject area, or phrase from the article abstract? It does this by filtering your search terms through successive lists to identify the types of terms you provided and use them effectively. This process is called:
Automatic Term Mapping
PubMed compares your search terms against several lists to determine what you are looking for. It checks four lists in order and stops looking once it finds a match:
The MeSH Translation Table contains:
- MeSH terms and Subheadings (searching synonyms for MeSH terms)
- Chemical Names of Substances
Since MeSH terms are searched before journal titles, if you want to look up a journal whose name is also a MeSH term, like RNA or Cell, the search will stop with the MeSH term and the search for your journal will not be done.
The Journals Translation Table contains:
- Full journal titles
- MEDLINE title abbreviations
- International Standard Serial Numbers (ISSN)
The Phrase List contains several hundred thousand phrases generated from:
- MeSH
- Unified Medical Language System (UMLS)
- Chemical Names of Substances
These are frequently used phrases that are not a part of the MeSH translation table.
Author Searching
The format for author searching is last name plus
initials.
PubMed will automatically truncate the author's
name to account for varying initials.
If the term is not found, PubMed will then search the individual words in All Fields.
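The "check each list in order and stop at the first match" behavior can be sketched as below. The tables are tiny hypothetical stand-ins, not PubMed's real data, and placing the author index fourth is an inference from the order in which the lists are described here.

```python
# Hypothetical miniature translation tables, for illustration only.
MESH_TABLE = {"cell", "rna", "neoplasms"}
JOURNAL_TABLE = {"cell", "rna", "science"}
PHRASE_LIST = {"gene expression"}
AUTHOR_INDEX = {"smith ja"}

# Lists are consulted in this order; the first one containing the term wins.
LOOKUP_ORDER = [
    ("MeSH", MESH_TABLE),
    ("Journal", JOURNAL_TABLE),
    ("Phrase", PHRASE_LIST),
    ("Author", AUTHOR_INDEX),
]


def map_term(term: str) -> str:
    """Return the name of the first list that contains the term."""
    key = term.lower()
    for label, table in LOOKUP_ORDER:
        if key in table:
            return label
    # No list matched: fall back to searching the individual words everywhere.
    return "All Fields"


print(map_term("Cell"))      # a MeSH term that is also a journal title
print(map_term("Science"))
print(map_term("Smith JA"))
```

Because "Cell" is found in the MeSH table first, the journal table is never consulted for it, which is the behavior described above for journals named RNA or Cell.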
You can also try putting a phrase in double quotes if the results returned are not what you expected. This will force PubMed to look for the words as a phrase, but it bypasses the Automatic Term Mapping, so you might want to try doing some searches both with and without double quotes.
Truncation
You can truncate a word with the asterisk (*) wildcard. This causes PubMed to return all matches that begin with the truncated string of text (e.g., enzym* will match enzyme, enzymes, enzymology, enzymatic, etc.). Truncation also turns off Automatic Term Mapping, so the results will differ from those of non-truncated searches.
Stopwords
PubMed also refers to a list of commonly found words called "stopwords." These are very common words that would match almost every citation, so they are skipped.
The list of stopwords is from PubMed's Help Page.
Stopwords
a did it perhaps these about do its quite they again does itself rather this all done just really those almost due kg regarding through also during km seem thus although each made seen to always either mainly several upon among enough make should use an especially may show used and etc. mg showed using another for might shown various any found ml shows very are from mm significantly was as further most since we at had mostly so were be has must sum what because have nearly such when been having neither than which before here no that while being how nor the with between however obtained their within both I of theirs without but if often them would by in on then can into our there could is overall therefore
Operators
You can use Boolean operators (AND, OR, NOT) to direct your search. These must be entered in UPPERCASE. Operators are processed left-to-right unless you use parentheses to specify the order.
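A minimal sketch of left-to-right Boolean evaluation, using Python sets of citation numbers. The index and its contents are invented for illustration; this is not how PubMed is implemented.

```python
# Hypothetical index: search term -> set of matching citation numbers.
INDEX = {
    "dna": {1, 2},
    "rna": {2, 3},
    "protein": {3, 4},
}


def evaluate(tokens, index=INDEX):
    """Evaluate a flat query such as ["dna", "OR", "rna", "AND", "protein"]
    strictly left to right, with no operator precedence."""
    result = set(index.get(tokens[0], set()))
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        hits = index.get(term, set())
        if operator == "AND":
            result &= hits
        elif operator == "OR":
            result |= hits
        elif operator == "NOT":
            result -= hits
        else:
            raise ValueError("operators must be uppercase AND, OR, or NOT")
    return result


# Left to right: (dna OR rna) AND protein
left_to_right = evaluate(["dna", "OR", "rna", "AND", "protein"])
# With parentheses: dna OR (rna AND protein)
parenthesized = INDEX["dna"] | evaluate(["rna", "AND", "protein"])
print(left_to_right, parenthesized)
```

The two results differ, which is why parentheses matter whenever AND and OR are mixed in one query.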
Once you click the "Go" button, your search is performed and the first 20 hits are displayed in a Summary format:
- Author name(s)
- Title of the article: Brackets indicate a title translated from a foreign language.
- Source: A brief journal citation.
- Identification number: A PubMed Unique Identifier (PMID) is included on each record.
- Links: Includes links to Related Articles and databases, when available.
You can easily scan this first page of citations and see how many of them are really related to what you were trying to find. Though only the first 20 citations are displayed by default (in reverse chronological order) you can see how many total articles matched your search. If you got a surprisingly small or large number of hits, or if there seem to be a high percentage of extraneous hits, you might want to click on the "Details" button in the upper gray box.
Details Button
Clicking Details displays:
- The PubMed query box, which shows exactly how PubMed performed your search using the Automatic Term Mapping. It may have found a synonym in the MeSH headings and used that instead of one of your original terms. You can edit the search used and run the edited search by clicking "Search". If the search worked really well, you can save it as a web link by clicking "URL". This formats your search as a URL link your web browser can save as a bookmark to repeat the search at a later date. You can also use the "Cubby" system described below.
- The "Result" section, which shows how many hits you got and links you back to your hits.
- The Translations section, which describes how each term of your search was interpreted.
- The Database, which is PubMed, and the User Query, which is what you typed in to begin with.
Limits Button
You can select Publication types (like reviews) from another menu. You can limit searches to specific dates or trials involving subjects in specific age groups, gender, or human/non-human. You can require that hits have Abstracts, though some reviews do not have abstracts, nor do articles indexed before 1975.
Preview/Index Button
You can have even more control over limits by using the Preview/Index feature. You can add search terms limited to specific fields, and you can preview the number of results by clicking on the Preview button.
By clicking on Index, you can also look up search terms in the index (for example the index of MeSH terms). Items can be added to the search window using the AND, OR, or NOT buttons. Different searches can be combined using their Query numbers (e.g., #4 AND #5), which are found on the Preview/Index page; a more extensive list is found on the History page. Note that these query numbers disappear after 1 hour of inactivity, so you can't use yesterday's Query number tomorrow and get the same result. You also cannot use these numbers to save your results as a URL in the Details window, but you can manually cut and paste the query lines together to save them.
Results
Now that you have constructed the perfect search, you can select the perfect format for displaying results. The default is 20 results in Summary format, but other citation display formats can be chosen from the list of choices under "Summary":
Brief format includes:
- First author
- First thirty characters of the title.
- PMID #
- Links
Abstract format provides the summary information in addition to:
- First Author affiliation
- Abstract, if one is present.
- Links to full-text of the article at provider's Web site, if available.
- Links to Related Articles, Books, LinkOut, and databases.
Citation format is similar to abstract, but also includes:
- MeSH terms.
- Chemical Names of Substances, if any are present.
- Grant numbers, if any are present.
MEDLINE format is a text file with identifying letters before each field. It is most useful for importing into bibliography programs like EndNote and ProCite.
Selecting Citations and Display Format
You can select a subset of the hits to display by clicking the box before each item. If you don't click any boxes, then all are displayed.
- Or you can click on the individual links to see the abstract format for a given citation.
Add to Clipboard
You can select individual citations to save in a clipboard on the server. This is not the clipboard on your computer. After selecting items by clicking their checkbox, click on the "Add to clipboard" link.
- The color of item numbers of the hits changes when added to the clipboard.
- If you did not click any boxes, the entire search gets loaded to the clipboard (up to the limit of 500 hits).
- You can view the clipboard by clicking the "Clipboard" link in the features bar. The Clipboard disappears after one hour of inactivity.
Save Button
You can save citations to a file on your computer by clicking the "Save" link. There is a limit of 10,000 hits. To save selected citations, pick a display format and press "Save". You will be prompted for where to save the downloaded file.
Text Button
You can have the selected items displayed as plain text by clicking the "Text" button. This may be useful for printing if your browser doesn't print the hypertext files well.
Cubby
If you set up a "Cubby", you can save your favorite searches indefinitely on the PubMed server. You have to get a username and password. You can then save your search and rerun it at a later date. Or you can run the search for new articles published since the last time you searched.
LinkOut Preferences
The LinkOut service enables publishers, libraries (like LSU Medical School), biological databases, sequence centers, and other Web resources to display links to their sites on records in PubMed.
You can use Cubby to set which links are displayed by:
- Adding icons to the Abstract and Citation formats
- Hiding providers from the LinkOut format
When you are logged into Cubby, PubMed displays LinkOut providers according to your preferences.
Related Articles
Compares words from the title, abstract, and MeSH headings to identify articles similar to the selected article.
NCBI Databases
These are the NCBI databases that may be linked to from individual PubMed citations:
- Protein: Protein sequences from SWISSPROT, Protein Information Resource (PIR), Protein Research Foundation (PRF), Protein Data Bank (PDB), and translated protein sequences from the DNA sequences database.
- Nucleotide: DNA sequences from GenBank , European Molecular Biology Laboratory (EMBL), and DNA Data Bank of Japan (DDBJ).
- PopSet: Sequences submitted as a set from a population study.
- Structure: experimentally-determined, three-dimensional structures.
- Genome: Records and graphic displays of genomes.
- Taxonomy: Index of organisms represented in the sequence databases.
- OMIM: A catalog of human genes and genetic disorders.
- Books: Links to terms described in selected molecular biology textbooks.
Alignment of DNA or protein sequences in bioinformatics serves a number of purposes:
DNA sequence data can be aligned to assemble larger contiguous DNA sequences
Protein sequences can be aligned to identify important regions, such as active sites
DNA sequences can be aligned to identify regulatory control regions
Alignment of any two sequences implies that those sequences are related.
Overlapping DNA sequence reads are derived from a single DNA molecule
Protein, RNA, or DNA sequences from a particular gene for different organisms are related by the ancestry of those organisms
Duplicated genes (and pseudogenes) within an organism are related to each other or a particular parental gene
Molecular sequences that are not related in some way do not produce statistically significant sequence alignments
A key, then, to answering many bioinformatics questions lies in the ability to reliably align sequence information
Sequences can be aligned:
Globally, in which as many characters as possible over the entire sequence are aligned
Or locally, in which regions of sequences which are highly similar are aligned
Sequences can be compared in pairs or as multiple alignments of related sequences
In practice, these alignments are performed similarly, and can be understood
in the context of a dot-matrix sequence comparison.
Dot matrix analysis displays the primary sequence of pairs of proteins on the X and Y axes of a graph.
Dots are plotted on the graph where the X and Y coordinate sequences are identical.
Regions of identical sequence are revealed as diagonal rows of dots.
Random matches are seen as isolated dots.
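The dot-matrix idea above can be sketched in a few lines of Python. The function names and the short example sequences are made up for illustration.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid where cell [row][col] is True when seq_y[row] == seq_x[col]."""
    return [[x_char == y_char for x_char in seq_x] for y_char in seq_y]


def render(matrix):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# seq_x runs along the X axis (columns), seq_y along the Y axis (rows).
print(render(dot_matrix("GATTACA", "GATCACA")))
```

Identical stretches show up as diagonal runs of `*`, while chance matches appear as isolated marks off the diagonal.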
A comparison of DNA sequences for elongation factor EF-G from S. aureus and T. thermophilus is shown below left. A dot is placed every time a nucleotide in the S. aureus sequence matches that of T. thermophilus. Since there are only four types of nucleotides, random matches occur in many places throughout the sequence and obscure regions of significant similarity.
A similar analysis of the corresponding peptide sequences is shown at right. Random matches of the 20 amino acids also occur, but diagonal regions of sequence similarity are still discernible.
Figure panels: DNA alignment - single nucleotide identity (left); Protein alignment - single residue identity (right)
Random matches can be filtered by using a sliding window to compare sequences. In this case, a block of characters is compared between sequences. A dot is placed on the alignment only if a threshold value of matches is detected.
An example is shown below on a small portion of the amino acid sequence. At left, a 1 (or a dot) is placed wherever the sequences match.
In the center panel, each position holds the number of matches between the sequences within a block of 3 amino acids.
For 3 consecutive matches, the score is 3
For 2 out of 3, the score is 2
For 1 out of 3, the score is 1
Although the second matrix contains more 1's, discarding all values below 2 (right panel) removes the random matches.
Figure panels: single-residue matches (left); Word Size = 3 scores (center); scores below 2 removed (right)
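The sliding-window filter described above can be sketched as follows. The window size of 3 and threshold of 2 follow the example in the text. The function names are assumptions, and real dot-plot programs may anchor the window differently (for example, centered on each position).

```python
def window_scores(seq_x, seq_y, window=3):
    """scores[j][i] = number of identical positions between the blocks
    seq_x[i:i+window] and seq_y[j:j+window]."""
    n_x = len(seq_x) - window + 1
    n_y = len(seq_y) - window + 1
    return [
        [
            sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            for i in range(n_x)
        ]
        for j in range(n_y)
    ]


def filter_scores(scores, threshold=2):
    """Zero out every score below the threshold to suppress random matches."""
    return [[s if s >= threshold else 0 for s in row] for row in scores]


scores = window_scores("ABCA", "ABDA", window=3)
print(scores)                    # block-by-block match counts
print(filter_scores(scores, 2))  # only blocks with at least 2 of 3 matches remain
```

Raising the threshold or lengthening the window makes the filter more stringent, which is what the larger "5 of 23" and "15 of 23" plots in this section do.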
The full matrix for the proteins is shown below:
Figure panels: Protein - single residue identity (left); Protein identity 5 residues of 23 (right)
Even using a low stringency window greatly improves the signal/noise ratio. One can readily see a diagonal stretching across the matrix, indicating sequence similarity throughout the length of both sequences. This also works well for DNA sequences, but you must use higher stringencies; longer lengths help too:
Figure panels: DNA alignment - single nucleotide identity (left); DNA alignment - 15 of 23 identical (right)
Going back to proteins, this one is Dr. Timpte's favorite!
You can also compare a sequence with itself to look for repeated sequences:
Protein sequence of porcine submaxillary mucin compared to itself looking for exact matches of 39 residues.
You can see a diagonal of unique sequence at the termini.
In between are 135.6 repeats of an 81-residue sequence, extending 10991 amino acids. This is an extreme case.
Repeats are often detectable in comparing related sequences to each other.
Dot matrix analysis makes them painfully obvious.
The alignment can be improved by raising the score for a match to +6 and penalizing
a mismatch with -1:
Figure panels: Protein identity 5 residues of 23 (match=1, mismatch=0) (left); Protein identity 5 residues of 23 (match=6, mismatch=-1) (right)
Virtually all the background matches have been eliminated. Even better results can be obtained using weighted scoring matrices.
Comparison of related proteins revealed that substitution of chemically similar amino acid residues was fairly common in many positions of the protein sequences.
Other substitutions were found to be less common
Some were extremely rare.
Substitution matrices quantitate these differences in substitution frequencies for use in evaluation of alignments
Figure panels: Protein identity 5 residues of 23 (match=6, mismatch=-1) (left); Protein BLOSUM62 matrix 5 residues of 23 match (right)
Where do scoring matrices come from?
BLOSUM matrix is based upon the BLOCKS database of protein motifs
e.g. RNase P BLOCK AxxRNR...
what are the frequencies of substitutions
at the xx positions?
Allows alignment of short regions of sequence from very divergent proteins
2000 blocks of conserved patterns
from 500 different families of proteins
Non-conserved amino acids counted and frequency of substitution pairs counted
Closely related proteins can cause over-representation of some sequences, so these sequences were merged into a single consensus
BLOSUM62: sequences at least 62% identical were merged in this way before substitutions were counted
BLOSUM62 Matrix
 | C | S | T | P | A | G | N | D | E | Q | H | R | K | M | I | L | V | F | Y | W |
C | 9 | |||||||||||||||||||
S | -1 | 4 | ||||||||||||||||||
T | -1 | 1 | 5 | |||||||||||||||||
P | -3 | -1 | -1 | 7 | ||||||||||||||||
A | 0 | 1 | 0 | -1 | 4 | |||||||||||||||
G | -3 | 0 | -2 | -2 | 0 | 6 | ||||||||||||||
N | -3 | 1 | 0 | -2 | -2 | 0 | 6 | |||||||||||||
D | -3 | 0 | -1 | -1 | -2 | -1 | 1 | 6 | ||||||||||||
E | -4 | 0 | -1 | -1 | -1 | -2 | 0 | 2 | 5 | |||||||||||
Q | -3 | 0 | -1 | -1 | -1 | -2 | 0 | 0 | 2 | 5 | ||||||||||
H | -3 | -1 | -2 | -2 | -2 | -2 | 1 | -1 | 0 | 0 | 8 | |||||||||
R | -3 | -1 | -1 | -2 | -1 | -2 | 0 | -2 | 0 | 1 | 0 | 5 | ||||||||
K | -3 | 0 | -1 | -1 | -1 | -2 | 0 | -1 | 1 | 1 | -1 | 2 | 5 | |||||||
M | -1 | -1 | -1 | -2 | -1 | -3 | -2 | -3 | -2 | 0 | -2 | -1 | -1 | 5 | ||||||
I | -1 | -2 | -1 | -3 | -1 | -4 | -3 | -3 | -3 | -3 | -3 | -3 | -3 | 1 | 4 | |||||
L | -1 | -2 | -1 | -3 | -1 | -4 | -3 | -4 | -3 | -2 | -3 | -2 | -2 | 2 | 2 | 4 | ||||
V | -1 | -2 | 0 | -2 | 0 | -3 | -3 | -3 | -2 | -2 | -3 | -3 | -2 | 1 | 3 | 1 | 4 | |||
F | -2 | -2 | -2 | -4 | -2 | -3 | -3 | -3 | -3 | -3 | -1 | -3 | -3 | 0 | 0 | 0 | -1 | 6 | ||
Y | -2 | -2 | -2 | -3 | -2 | -3 | -2 | -3 | -2 | -1 | 2 | -2 | -2 | -1 | -1 | -1 | -1 | 3 | 7 | |
W | -2 | -3 | -2 | -4 | -3 | -2 | -4 | -4 | -3 | -2 | -2 | -3 | -3 | -1 | -3 | -2 | -3 | 1 | 2 | 11 |
What the scores mean:
Score=0: substitution of one amino acid with the other occurs as often as expected by random chance
Score>0: substitution occurs more often than expected by chance (i.e. evolutionarily selected for)
Score<0: substitution occurs less often than expected by chance (i.e. evolutionarily selected against)
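The meaning of the scores can be made concrete with a log-odds calculation. This is a sketch rather than the exact BLOSUM procedure: the frequencies below are invented, and the scale factor of 2 (half-bit units) is stated here as an assumption about how BLOSUM62 scores are scaled.

```python
import math


def log_odds_score(observed_pair_freq, expected_pair_freq, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    observed_pair_freq: how often the pair is seen aligned in the blocks.
    expected_pair_freq: how often it would be aligned by chance alone.
    scale=2.0 gives half-bit units (an assumption, see the lead-in).
    """
    return round(scale * math.log2(observed_pair_freq / expected_pair_freq))


# Hypothetical frequencies chosen only to show the three cases.
print(log_odds_score(0.02, 0.02))   # seen as often as chance predicts
print(log_odds_score(0.04, 0.01))   # seen more often than chance
print(log_odds_score(0.005, 0.02))  # seen less often than chance
```

A pair observed exactly as often as chance predicts scores 0, a favored substitution scores positive, and a disfavored one scores negative.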
Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any 2 sequences. Also, gaps in a protein sequence mean insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap, and a separate penalty for extending it.
e.g.
AGGTTCGACT
--GT-C-A-T
Usually the penalty is high for opening a new gap but smaller for extending one. (If a gap is actually allowed in a protein sequence, such as variation in the size of a loop, the size difference is probably not critical.) Gap and extension penalties vary with the matrix used. Usually the default values for a given scoring matrix work best.
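A small sketch of how separate opening and extension penalties add up. The penalty values (-10 to open, -1 to extend) are hypothetical, and charging the opening penalty for the first gap position and the extension penalty for each additional position is one common convention among several.

```python
from itertools import groupby


def gap_penalty(aligned, gap_open=-10, gap_extend=-1):
    """Total gap cost of one aligned sequence in which '-' marks a gap.

    Each run of gaps costs gap_open for its first position plus
    gap_extend for every additional position (assumed convention).
    """
    total = 0
    for is_gap, run in groupby(aligned, key=lambda ch: ch == "-"):
        if is_gap:
            length = len(list(run))
            total += gap_open + gap_extend * (length - 1)
    return total


# Four separate gaps (five gap positions), as in the example above.
print(gap_penalty("--GT-C-A-T"))
# The same five gap positions collected into a single gap.
print(gap_penalty("-----GTCAT"))
```

Under these assumed penalties, one long gap is much cheaper than several short ones, which discourages alignments that scatter gaps freely.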
Most commonly used algorithms are FASTA and BLAST
Both are useful for searching for matches of your sequence to a large database, such as GenBank
To do so, both use simplifying shortcuts to minimize computational time without sacrificing much sensitivity
First, regions of sequence that are low complexity (like ST-rich regions, which are easily substituted) are filtered out
Then a list of overlapping 3- or 4-letter words is derived from the sequence
Substitution scores are used to identify high-scoring words above a threshold value
-> small number of words (~50) to compare to database
Database already indexed for high-scoring words
Matches within a given distance of other matches on the same diagonal are used to align sequences
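The word-based shortcut can be sketched as below. This simplified version only finds exact word matches. The step that scores similar words with a substitution matrix is deliberately left out, and all function names are made up for this example.

```python
from collections import defaultdict


def overlapping_words(sequence, size=3):
    """List every overlapping word of the given size, in order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def index_words(sequence, size=3):
    """Map each word to the positions where it starts (a tiny 'database index')."""
    index = defaultdict(list)
    for position, word in enumerate(overlapping_words(sequence, size)):
        index[word].append(position)
    return dict(index)


def word_hits(query, subject, size=3):
    """Return (query_position, subject_position) pairs for exact word matches."""
    subject_index = index_words(subject, size)
    return [
        (q_pos, s_pos)
        for q_pos, word in enumerate(overlapping_words(query, size))
        for s_pos in subject_index.get(word, [])
    ]


print(overlapping_words("AHIDAG"))
print(word_hits("GKTTT", "AGKTT"))
```

Hits whose query and subject positions differ by the same amount lie on the same diagonal of a dot matrix, which is how nearby hits are grouped before alignment.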
BLAST alignment of S. aureus (Query) vs T. thermophilus (Subject)
Query: 4 EFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGXXX 63 E+ L++ RNIGI AHIDAGKTTTTERILYYTGRIHKIGE HEGA+ MD+MEQE++RG Sbjct: 6 EYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITI 65 np-binding 19 ******** Query: 64 XXXXXXXXWEGHRVNIIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQA 123 W+ HR+NIIDTPGHVDFT+EVERS+RVLDGA+ V+D+ GVEPQ+ETVWRQA Sbjct: 66 TAAVTTCFWKDHRINIIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQA 125 np-binding 83 ***** Query: 124 TTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQANAAPIQLPIGAEDEFEAIIDLVEMKC 183 Y VPRI F NKMDK GA++ + T+++RL A +QLPIG ED F IID++ MK Sbjct: 126 EKYKVPRIAFANKMDKTGADLWLVIRTMQERLGARPVVMQLPIGREDTFSGIIDVLRMKA 185 np-binding 137 **** Query: 184 FKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVSELKEA 243 + Y NDLGT+I EI IPE++LD+A E L+E A+ + +M KYL EE + EL A Sbjct: 186 YTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEEELVAA 245 Query: 244 IRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-K 302 IR+ T +++ PV+ G+A+KNKGVQL+LDAV+DYLPSPLD+ PI G + PE EV+ Sbjct: 246 IRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIH 302 Query: 303 ADDSAEFAALAFKVMTDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANS 362 D + +AALAFK+M DPYVG+LTF RVYSGT+TSGSYV N+TKG++ERV RLL+MHAN Sbjct: 303 PDPNGPLAALAFKIMADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANH 362 Query: 363 RQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-IILESMEFPEPVIHLSVEPKSKADQD 421 R+E++ + +GD+ A VGLK+T TGDTL GE +ILES+E PEPVI +++EPK+KADQ+ Sbjct: 363 REEVEELKAGDLGAVVGLKETITGDTLVGEDAPRVILESIEVPEPVIDVAIEPKTKADQE 422 Query: 422 KMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVGAPMVS 481 K++QAL +L EEDPTF T ETGQ II GMGELHL+I+VDR+K+EF V+ NVG P V+ Sbjct: 423 KLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVGKPQVA 482 Query: 482 YRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSV 541 YRET V+GKF RQ+GGRGQYG V I+ P G+GFEF NAIVGGV+P+EYIP+V Sbjct: 483 YRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAV 542 Query: 542 EAGLKDAMENGVLAGYPLIDVKAKLYDGSYHDVDSSEMAFXXXXXXXXXXXXXXCDPVIL 601 + 
G+++AM++G L G+P++D+K LYDGSYH+VDSSEMAF DPVIL Sbjct: 543 QKGIEEAMQSGPLIGFPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVIL 602 Query: 602 EPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPRGNAQVVNAYVPLSEMFGYATSLRSN 661 EP+M+V + PEEYMGD++GD+ +RRG++ GMEPRGNAQV+ A+VPL+EMFGYAT LRS Sbjct: 603 EPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPRGNAQVIRAFVPLAEMFGYATDLRSK 662 Query: 662 TQGRGTYTMYFDHYAEVPKSIAEDIIK 688 TQGRG++ M+FDHY EVPK + E +IK Sbjct: 663 TQGRGSFVMFFDHYQEVPKQVQEKLIK 689
Results similar to FASTA using same matrix. XXXXX indicates regions masked for low complexity.
Selecting the BLAST Program
You need to pick the
right version of BLAST to suit your data:
Program | Description |
blastp | Compares an amino acid query sequence against a protein sequence database. |
blastn | Compares a nucleotide query sequence against a nucleotide sequence database. NOT necessarily the best way to compare 2 DNAs! |
blastx | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. Good way to find protein sequences related to your DNA. |
tblastn | Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. |
tblastx | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Some sites do not allow use of this program because it is computationally intensive. |
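The table above amounts to a lookup on the type of the query and the type of the database. A minimal sketch, where the function name and the `translate_both` flag are hypothetical:

```python
# (query type, database type) -> program, following the table above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database types.

    translate_both=True requests a comparison of six-frame translations
    of a nucleotide query against a nucleotide database (tblastx).
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```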
Selecting the BLAST Database (from the BLAST
tutorial)
You can select several
NCBI databases to compare your query sequences against. Note that some databases
are specific to proteins or nucleotides and cannot be used in combination with
certain BLAST programs (for example a blastn search against swissprot).
Protein sequence databases:
Database | Description |
nr | All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF |
month | All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. |
swissprot | The last major release of the SWISS-PROT protein sequence database. |
patents | Protein sequences derived from the Patent division of GenBank. |
yeast | Protein translations of the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic CDS translations. |
pdb | Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank. |
kabat | Kabat's database of sequences of immunological interest. |
alu | Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences |
Nucleotide sequence databases:
Database | Description |
nr | All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences). |
month | All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. |
dbest | Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources. |
dbsts | Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. |
mouse ests | The non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse. |
human ests | The Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human. |
other ests | The non-redundant database of GenBank+EMBL+DDBJ EST Divisions all organisms except mouse and human. |
yeast | Sequence fragments from the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic nucleotide sequences. |
pdb | Sequences derived from the 3-dimensional structure of proteins. |
kabat [kabatnuc] | Kabat's database of sequences of immunological interest. |
patents | Nucleotide sequences derived from the Patent division of GenBank. |
vector | Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory). |
mito | Database of mitochondrial sequences (Rel. 1.0, July 1995). |
alu | Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. |
gss | Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. |
htgs | High Throughput Genomic Sequences. |
The Raw Score (S) is the sum of all the matches for a given alignment, minus mismatch and gap penalties. This value tends to be larger for larger proteins, so how can you tell if it is significant?
The Expected (E) value is the number of alignments with a score of at least S that would be expected to occur by chance:
$E = K m n e^{-\lambda S}$
where m and n are the sequence lengths and K and $\lambda$ are scaling factors dependent upon the size of the search and the scoring system.
Probability (P) of finding a higher score:
$P = 1 - e^{-E}$
When E < 0.01, P ≈ E
So the E value is a good measure of the significance of any alignment. Values approaching 1 are not likely to be significant.
Values much less than 1 are highly significant.
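The relationship between S, E, and P can be checked numerically. This sketch assumes the standard Karlin-Altschul form E = K·m·n·e^(-λS), which matches the variables named above. The values used for K, λ, and the lengths are invented for illustration and are not real BLAST parameters.

```python
import math


def expect_value(score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S), the assumed form noted in the lead-in."""
    return k * m * n * math.exp(-lam * score)


def p_value(e_value):
    """P = 1 - exp(-E), computed with expm1 to stay accurate for tiny E."""
    return -math.expm1(-e_value)


# Hypothetical parameters: a 300-residue query against 1,000,000 database residues.
for raw_score in (30, 60, 120):
    e = expect_value(raw_score, m=300, n=1_000_000, k=0.1, lam=0.3)
    print(raw_score, e, p_value(e))
```

The loop shows E falling steeply as the raw score rises, and P becoming nearly equal to E once E is small.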
FASTA works similarly to BLAST, with some differences
Sequences are broken into words (also called k-tuples)
Matches between word characters are detected, offset of positions calculated
Series of character matches with same offset = a hit
Additional matches on same diagonal are identified -> diagonal matches
Diagonal matches within certain distance of other matches are joined using gaps
Joined regions are evaluated using substitution matrix scores
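The first steps listed above (breaking sequences into k-tuples, computing offsets, and counting matches that share an offset) can be sketched as follows. The joining and rescoring steps are omitted, and the function name is an assumption.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count k-tuple matches per diagonal.

    The diagonal is identified by the offset (query position minus
    subject position); many matches with one offset suggest a similar region.
    """
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("ABCDE", "XABCD", ktup=2)
print(hits.most_common(1))  # the offset with the most k-tuple matches
```

In this toy case every match falls on offset -1, the kind of dense diagonal that would be carried forward to the joining and matrix-scoring steps.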
FASTA alignment of EF-G sequences
using BLOSUM50 matrix (for very divergent sequences)
>>gi|119190|sp|P13551|EFG_THETH ELONGATION FACTOR G (EF-G) (691 aa) initn: 2949 init1: 1359 opt: 2943 Z-score: 3250.5 bits: 611.9 E(): 3.1e-174 Smith-Waterman score: 2943; 62.591% identity (63.050% ungapped) in 687 aa overlap (4-688:6-689) Entrez lookup Re-search database General re-search >gi|119 4- 688:---------------------------------------------------------------------: 10 20 30 40 50 60 70 gi|692 MAREFSLEKTRNIGIMAHIDAGKTTTTERILYYTGRIHKIGETHEGASQMDWMEQEQDRGITITSAATTAAWEGHRVN :..:.. ::::: ::::::::::::::::::::::::::.::::. ::.::::..::::::.:.:: :. ::.: gi|119 MAVKVEYDLKRLRNIGIAAHIDAGKTTTTERILYYTGRIHKIGEVHEGAATMDFMEQERERGITITAAVTTCFWKDHRIN 10 20 30 40 50 60 70 80 80 90 100 110 120 130 140 150 gi|692 IIDTPGHVDFTVEVERSLRVLDGAVTVLDAQSGVEPQTETVWRQATTYGVPRIVFVNKMDKLGANFEYSVSTLHDRLQAN :::::::::::.:::::.::::::..:.:...:::::.::::::: : ::::.:.::::: ::.. . :...:: : gi|119 IIDTPGHVDFTIEVERSMRVLDGAIVVFDSSQGVEPQSETVWRQAEKYKVPRIAFANKMDKTGADLWLVIRTMQERLGAR 90 100 110 120 130 140 150 160 160 170 180 190 200 210 220 230 gi|692 AAPIQLPIGAEDEFEAIIDLVEMKCFKYTNDLGTEIEEIEIPEDHLDRAEEARASLIEAVAETSDELMEKYLGDEEISVS . .::::: :: : .:::...:: . : :::::.:.:: :::..::.:.: . .:.:..:. ....: ::: :: . gi|119 PVVMQLPIGREDTFSGIIDVLRMKAYTYGNDLGTDIREIPIPEEYLDQAREYHEKLVEVAADFDENIMLKYLEGEEPTEE 170 180 190 200 210 220 230 240 240 250 260 270 280 290 300 310 gi|692 ELKEAIRQATTNVEFYPVLCGTAFKNKGVQLMLDAVIDYLPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM :: :::..: .... ::. :.:.:::::::.::::.::::::::. :: : ..:: ::. . : .. .::::::.: gi|119 ELVAAIRKGTIDLKITPVFLGSALKNKGVQLLLDAVVDYLPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 gi|692 TDPYVGKLTFFRVYSGTMTSGSYVKNSTKGKRERVGRLLQMHANSRQEIDTVYSGDIAAAVGLKDTGTGDTLCGEKND-I .:::::.:::.::::::.:::::: :.:::..:::.:::.:::: :.:.. . .::..:.::::.: ::::: :: . 
gi|119 ADPYVGRLTFIRVYSGTLTSGSYVYNTTKGRKERVARLLRMHANHREEVEELKAGDLGAVVGLKETITGDTLVGEDAPRV 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 gi|692 ILESMEFPEPVIHLSVEPKSKADQDKMTQALVKLQEEDPTFHAHTDEETGQVIIGGMGELHLDILVDRMKKEFNVECNVG ::::.: ::::: ...:::.::::.:..:::..: ::::::.. : ::::.::.:::::::.:.:::.:.::.:. ::: gi|119 ILESIEVPEPVIDVAIEPKTKADQEKLSQALARLAEEDPTFRVSTHPETGQTIISGMGELHLEIIVDRLKREFKVDANVG 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 gi|692 APMVSYRETFKSSAQVQGKFSRQSGGRGQYGDVHIEFTPNETGAGFEFENAIVGGVVPREYIPSVEAGLKDAMENGVLAG :.:.::::. . ..:.::: ::.::::::: :.:. : :.:::: :::::::.:.::::.:. :...::..: : : gi|119 KPQVAYRETITKPVDVEGKFIRQTGGRGQYGHVKIKVEPLPRGSGFEFVNAIVGGVIPKEYIPAVQKGIEEAMQSGPLIG 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 gi|692 YPLIDVKAKLYDGSYHDVDSSEMAFKIAASLALKEAAKKCDPVILEPMMKVTIEMPEEYMGDIMGDVTSRRGRVDGMEPR .:..:.:. :::::::.:::::::::::.:.:.:::..: :::::::.:.: . :::::::..::...:::.. ::::: gi|119 FPVVDIKVTLYDGSYHEVDSSEMAFKIAGSMAIKEAVQKGDPVILEPIMRVEVTTPEEYMGDVIGDLNARRGQILGMEPR 560 570 580 590 600 610 620 630 640 650 660 670 680 690 gi|692 GNAQVVNAYVPLSEMFGYATSLRSNTQGRGTYTMYFDHYAEVPKSIAEDIIKKNKGE :::::. :.:::.:::::::.:::.:::::...:.:::: ::::.. : .:: gi|119 GNAQVIRAFVPLAEMFGYATDLRSKTQGRGSFVMFFDHYQEVPKQVQEKLIKGQ 640 650 660 670 680 690Different matrix can give different alignment and score. Compare BLOSUM50 alignment:
280 290 300 310
LPSPLDVKPIIGHRASNPEEEVIA-KADDSAEFAALAFKVM
::::::. :: : ..:: ::. . : .. .::::::.:
LPSPLDIPPIKG---TTPEGEVVEIHPDPNGPLAALAFKIM
With PAM120 alignment (less divergent)
Gap penalty: -16 extension: -4
LPSPLDVKPIIGHRASNPEEEVIAKADDSAEFAALAFKVMT
::::::. :: : .. : : .: .. .::::::.:.
LPSPLDIPPIKG--TTPEGEVVEIHPDPNGPLAALAFKIMA
290 300 310
Variations of FASTA program are available for protein-protein, DNA-DNA comparisons. Also DNA-protein comparison by translating DNA.
Smith-Waterman and Needleman-Wunsch use dynamic programming algorithms to find optimal local and global alignments, respectively
Dynamic programming
Identifies highest scoring regions of alignment and extends them to end of alignment.
Highest score for a given region must be made from the optimal subscore from the step before.
Steps are then traced back from the highest score at the end of the alignment to identify the optimal path to that score.
Computationally intensive. Not typically used against databases.
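A compact sketch of the dynamic programming recurrence, covering both the local (Smith-Waterman style) and global (Needleman-Wunsch style) cases. To stay short it makes several simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, a single linear gap penalty instead of separate opening and extension penalties, and it returns only the optimal score without the traceback step.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=True):
    """Optimal alignment score of a and b by dynamic programming.

    local=True  -> best-scoring local alignment (scores never drop below 0)
    local=False -> global alignment over the full length of both sequences
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Each cell is built from the optimal scores of its three neighbors.
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(align_score("TTACGTT", "GGACGGG", local=True))  # shared core "ACG"
print(align_score("ACGT", "AGT", local=False))        # one gap is required
```

The table has one cell for every pair of positions, so the work grows with the product of the two sequence lengths, which is why this approach is described above as computationally intensive.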