REMINDER: EXAM 3 (Tou, Karam, Nolan) Thursday Dec. 13
Room 6065 1:00 - 3:00 PM
Both contain virtually all known sequences, including complete genomes
Mostly translated coding sequences from the DNA database
Important file formats for both protein and DNA databases are:
GenBank: protein example - DNA example
FASTA: protein example - DNA example
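As a rough illustration of the FASTA format: each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below reads such text into (header, sequence) pairs. The function name and the two example records are made up for illustration and are not taken from any database.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, for illustration only.
example = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical DNA fragment
ATGGCGTACG
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The same reader works for protein and DNA records, since the format itself does not distinguish them.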
Currently there are:
53 complete Bacterial genomes
11 complete Archaeal genomes
10 complete Eukaryal genomes, including Human
and hundreds of viral genomes
Thousands of titles and abstracts from medically relevant journals dating back to the 1960s. Some older citations are also available. Powerful searching capabilities are essential for identifying articles of interest. Similar databases are available for other disciplines (e.g., agriculture).
Selecting the BLAST Program
You need to pick the right version of BLAST to suit your data:
Program | Description |
blastp | Compares an amino acid query sequence against a protein sequence database. |
blastn | Compares a nucleotide query sequence against a nucleotide sequence database. |
blastx | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. |
tblastn | Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. |
tblastx | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive. |
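The table above amounts to a lookup on the type of the query and the type of the database. A minimal sketch of that decision follows; the function name and the "protein"/"nucleotide" labels are assumptions made for this example and are not part of BLAST itself.

```python
# (query type, database type) -> program, following the table above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for a query/database combination.

    translate_both=True asks for tblastx, which translates both a
    nucleotide query and a nucleotide database in six frames.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type}, {db_type}")


print(choose_blast_program("nucleotide", "protein"))  # blastx
```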
Selecting the BLAST Database
You can select several NCBI databases to compare your
query sequences against. Note that some databases are specific to
proteins or nucleotides and cannot be used in combination with certain
BLAST programs (for example a blastn search against swissprot).
Database | Description |
nr | All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF |
month | All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. |
swissprot | The last major release of the SWISS-PROT protein sequence database. |
patents | Protein sequences derived from the Patent division of GenBank. |
yeast | Protein translations of the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic CDS translations. |
pdb | Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank. |
kabat | Kabat's database of sequences of immunological interest. |
alu | Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences |
Database | Description |
nr | All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences). |
month | All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. |
dbest | Non-redundant database of GenBank+EMBL+DDBJ EST Divisions. Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources. |
dbsts | Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. |
mouse ests | The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse. |
human ests | The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human. |
other ests | The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human. |
yeast | Sequence fragments from the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic nucleotide sequences. |
pdb | Sequences derived from the 3-dimensional structure of proteins. |
kabat [kabatnuc] | Kabat's database of sequences of immunological interest. |
patents | Nucleotide sequences derived from the Patent division of GenBank. |
vector | Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory). |
mito | Database of mitochondrial sequences (Rel. 1.0, July 1995). |
alu | Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. |
gss | Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. |
htgs | High Throughput Genomic Sequences. |
Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence
format, NCBI Accession numbers, or GenBank GI identifiers.
Accession or GI number
If you know the Accession number or the GI of a sequence in GenBank, you can use this as the query sequence in a BLAST search.
Set search parameters
The search parameters can greatly affect the success of your search, although the defaults often work well. One key parameter for protein similarity searches is the similarity scoring matrix.
Since amino acids vary in abundance, the odds of two sequences sharing a stretch of similar amino acids depend on how commonly those amino acids appear in proteins overall.
Also, some amino acids have very similar properties to others and can substitute readily for certain residues, but would cause problems if substituted for other residues.
To account for these observations, Sequence Scoring Matrices have been designed for use in similarity searching. Scores are based upon observed substitution rates among proteins known to be similar.
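As a rough illustration of how such a matrix is applied, the sketch below scores an ungapped alignment by adding up the matrix value for each aligned pair of residues. Only a handful of entries are included; they are copied from the BLOSUM62 matrix given in these notes. The function names and the two short peptides are made up for the example.

```python
# A few BLOSUM62 entries, keyed by the residue pair in alphabetical order.
SUBSET = {
    ("I", "I"): 4, ("I", "V"): 3, ("L", "L"): 4,
    ("K", "K"): 5, ("K", "R"): 2,
    ("G", "W"): -2, ("W", "W"): 11,
}


def pair_score(a, b):
    """Look up the score for aligning residue a with residue b."""
    # The matrix is symmetric, so sort the pair before the lookup.
    return SUBSET[tuple(sorted((a, b)))]


def ungapped_score(seq1, seq2):
    """Sum the pair scores of two equal-length, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("ILKW", "VLRG"))  # 3 + 4 + 2 - 2 = 7
print(ungapped_score("ILKW", "ILKW"))  # 4 + 4 + 5 + 11 = 24
```

Note that the conservative substitutions (I for V, K for R) still score positively, while W aligned with G is penalized.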
The BLOSUM62 scoring matrix for proteins.
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
Other parameters:
Word size: the window of amino acids scored at a time. Small blocks of high-scoring regions are stitched together to form the complete alignment. Words found in database proteins are indexed to speed up the search.
Filtering: ignores regions of the protein that contain many uninformative residues. This helps avoid artifactual matches.
Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any two sequences. Also, a gap in an alignment implies the insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap, and a separate penalty for extending it.
Nucleotide searches are similar, but the scoring matrix is simpler, or only identities are used. "Words" are longer.
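Two of the parameters above, word size and gap costs, can be sketched in a few lines. This is only an illustration: the function names, the example sequence, and the penalty values 11 and 1 are assumptions chosen for the example, and the gap formula shown (opening cost plus extension cost times gap length) is one common convention, not necessarily the one a given program uses.

```python
from collections import defaultdict


def index_words(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return dict(index)


def gap_penalty(length, gap_open, gap_extend):
    """Cost of one gap: a fixed opening cost plus a cost per gapped position."""
    if length == 0:
        return 0
    return gap_open + gap_extend * length


# Words of size 3 in a made-up peptide; "KVL" occurs twice.
words = index_words("MKVLKVL", 3)
print(words["KVL"])  # [1, 4]

# One gap of length 3 is cheaper than three separate gaps of length 1.
print(gap_penalty(3, gap_open=11, gap_extend=1))      # 14
print(3 * gap_penalty(1, gap_open=11, gap_extend=1))  # 36
```

Indexing words once lets a search jump straight to candidate positions instead of scanning every sequence, and charging more to open a gap than to extend one favors a few long gaps over many scattered ones.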
Assessing results: Each match is reported with the total score of its alignment under the scoring matrix. But longer sequences score better than short ones, so the raw score can be deceiving.
The "E-value" is more reliable: it estimates how many matches with a score at least this high would be expected purely by chance when searching a database of this size. Significant E-values are usually much less than 1.
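For a sense of scale, the E-value can be approximated from a normalized "bit" score S' as E = m * n * 2^(-S'), where m is the query length and n is the total length of the database. The sketch below applies that relation directly. The function name and the example numbers are made up, and the length corrections that real BLAST applies are ignored here.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: 250-residue query, database of 100 million residues.
for score in (30, 40, 50):
    print(score, expected_hits(score, 250, 10**8))
```

Each additional bit of score halves the E-value, while doubling the database size doubles it. This is why the same alignment can look less significant against a larger database.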
Sequence similarity is also used to align raw DNA sequence reads when determining the complete sequence of most genomes.