REMINDER: EXAM 3 (Tou, Karam, Nolan) Thursday Dec. 13

Room 6065 1:00 - 3:00 PM

 

Important Databases:

GenBank and EMBL DNA sequence databases

Both contain virtually all known sequences, including complete genomes

GenBank and SWISSPROT protein sequence databases

Mostly translated coding sequences from the DNA database

Important file formats for both protein and DNA databases are:
GenBank: protein example - DNA example
FASTA: protein example - DNA example
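
As a rough illustration of the FASTA format: each record starts with a ">" header line, followed by one or more lines of sequence. The sketch below parses FASTA text into (header, sequence) pairs using only the Python standard library. The function name parse_fasta and the two example records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, not real database entries.
example = """>seq1 hypothetical DNA record
ATGGCGTAA
GGCTTA
>seq2 hypothetical protein record
MKVLE
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The same parser works for protein and DNA records, since FASTA itself does not mark which alphabet is in use.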

PDB: Protein Data Bank 3-D structural database

Genome databases, most accessible through Entrez

Currently there are:
53 complete Bacterial genomes
11 complete Archaeal genomes
10 complete Eukaryal genomes, including Human
and hundreds of viral genomes

PubMed citation database

Thousands of titles and abstracts from medically relevant journals dating back to the 1960s. Some older citations are also available. Powerful searching capabilities are essential for identifying articles of interest. Similar databases are available for other disciplines (e.g., agriculture).

Many other specialized databases

Key to constructing and accessing most biological sequence databases:

Sequence similarity searches

Most commonly used program: BLAST (Basic Local Alignment Search Tool)
Annotated from the BLAST tutorial pages

Selecting the BLAST Program

You need to pick the right version of BLAST to suit your data:
 
Program  Description
blastp Compares an amino acid query sequence against a protein sequence database.
blastn Compares a nucleotide query sequence against a nucleotide sequence database.
blastx Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
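
To make "translated in all reading frames" concrete, here is a minimal sketch of a six-frame translation: three frames on the given strand and three on its reverse complement. It uses the standard genetic code and is only an illustration of the idea, not how BLAST itself is implemented. All function names are invented for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognized codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

blastx does this to the query, tblastn to the database, and tblastx to both, which is why tblastx is the most computationally intensive.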



Selecting the BLAST Database

You can select several NCBI databases to compare your query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example a blastn search against swissprot).
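
The program/database restriction can be written down as a small lookup. The sketch below encodes the query and database types from the program descriptions earlier in these notes. The names PROGRAM_TYPES, is_compatible and candidate_programs are hypothetical helpers, not part of any BLAST software.

```python
# (query type, database type) for each BLAST program, per the descriptions above.
PROGRAM_TYPES = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def is_compatible(program, database_type):
    """True if the program searches databases of the given type."""
    return PROGRAM_TYPES[program][1] == database_type


def candidate_programs(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return [
        name
        for name, (q, d) in PROGRAM_TYPES.items()
        if q == query_type and d == database_type
    ]


# swissprot is a protein database, so a blastn search against it is rejected.
print(is_compatible("blastn", "protein"))
print(candidate_programs("nucleotide", "protein"))
```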


Proteins

Database Description
nr All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF 
month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. 
swissprot The last major release of the SWISS-PROT protein sequence database.
patents Protein sequences derived from the Patent division of GenBank.
yeast Protein translations of the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic CDS translations.
pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
kabat Kabat's database of sequences of immunological interest.
alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences


Nucleotides

Database Description
nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest Non-redundant database of GenBank+EMBL+DDBJ EST Divisions. Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources.
dbsts Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. Sequence Tagged Site: random physical genomic map marker.

mouse ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.
human ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.
other ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human.
yeast Sequence fragments from the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic nucleotide sequences.
pdb Sequences derived from the 3-dimensional structure of proteins.
kabat [kabatnuc] Kabat's database of sequences of immunological interest.
patents Nucleotide sequences derived from the Patent division of GenBank.
vector Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs High Throughput Genomic Sequences.


Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GenBank GI identifiers.



Accession or GI number

If you know the Accession number or the GI of a sequence in GenBank, you can use this as the query sequence in a BLAST search.
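
A rough sketch of how the three kinds of input might be told apart. This is a guess at a reasonable heuristic, not the rule the BLAST pages actually apply: the accession pattern below (a few letters followed by digits, with an optional version suffix) is an assumption, and the example identifiers are hypothetical.

```python
import re

# Assumed pattern: 1-6 letters, optional underscore, 5+ digits, optional ".version".
ACCESSION_LIKE = re.compile(r"^[A-Za-z]{1,6}_?\d{5,}(\.\d+)?$")


def classify_query(text):
    """Guess whether the input is FASTA, a GI number, an accession, or bare sequence."""
    text = text.strip()
    if text.startswith(">"):
        return "fasta"  # FASTA records begin with a '>' header line
    if text.isdigit():
        return "gi"  # GI identifiers are plain integers
    if ACCESSION_LIKE.match(text):
        return "accession"
    return "raw sequence"


# Hypothetical inputs.
for query in [">seq1\nACGT", "1234567", "AB012345", "ACGTACGT"]:
    print(repr(query), "->", classify_query(query))
```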


Set search parameters

This can greatly affect the success of your search, although the defaults often work well. One key parameter for protein similarity searches is the similarity scoring matrix.

Since amino acids vary in abundance, the odds of two sequences sharing a stretch of similar amino acids differ depending on how commonly those amino acids appear across all proteins.

Also, some amino acids have very similar properties to others and can substitute readily for certain residues, but would cause problems if substituted for other residues.

To account for these observations, Sequence Scoring Matrices have been designed for use in similarity searching. Scores are based upon observed substitution rates among proteins known to be similar.

The BLOSUM62 scoring matrix for proteins.

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
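
To show how the matrix is used, the sketch below scores two ungapped alignments by summing the matrix entry for each aligned pair. Only the handful of BLOSUM62 entries needed for the example are copied in from the matrix above. The peptide sequences and function names are made up.

```python
# A few BLOSUM62 entries copied from the matrix above (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("K", "K"): 5, ("V", "V"): 4, ("L", "L"): 4, ("E", "E"): 5,
    ("K", "R"): 2, ("V", "I"): 3, ("E", "D"): 2,
    ("W", "W"): 11,
}


def pair_score(a, b):
    """Look up a residue pair in either order, since score(a, b) == score(b, a)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("KVLE", "KVLE"))  # identical residues: 5 + 4 + 4 + 5 = 18
print(ungapped_score("KVLE", "RILD"))  # conservative substitutions: 2 + 3 + 4 + 2 = 11
print(ungapped_score("W", "W"))        # a single rare residue: 11
```

Note that one matched tryptophan (W, a rare amino acid) scores as much as four conservative substitutions, which reflects the abundance argument above.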

Other parameters:

Word size: the window of amino acids scored at a time. Small blocks of high-scoring regions are stitched together to form the complete alignment. Words found in database proteins are indexed to speed up the search.
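
A minimal sketch of the word-indexing idea, assuming a word size of 3 and exact word matches only (the real program also considers similar, high-scoring words, which is left out here). The toy database and all names are hypothetical.

```python
from collections import defaultdict


def words(sequence, size=3):
    """Yield (position, word) for every overlapping window of the given size."""
    for i in range(len(sequence) - size + 1):
        yield i, sequence[i:i + size]


def build_index(database, size=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, size):
            index[word].append((name, pos))
    return index


def find_seeds(query, index, size=3):
    """Return (query position, sequence name, database position) for exact word hits."""
    hits = []
    for qpos, word in words(query, size):
        for name, dpos in index.get(word, []):
            hits.append((qpos, name, dpos))
    return hits


# Hypothetical toy database of two short proteins.
database = {"protA": "MKVLEAGW", "protB": "GGKVLDAA"}
index = build_index(database)
print(find_seeds("AKVLE", index))
```

The index is built once per database, so each query only needs dictionary lookups to find places worth extending into a full alignment.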

Filtering: ignores regions of the protein that have lots of uninformative residues (low-complexity stretches dominated by one or a few amino acids). This helps avoid artifactual matches.
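
One simple way to picture filtering: slide a window along the sequence and mask any window whose residue composition has low Shannon entropy. This is a simplified stand-in for illustration, not the filter BLAST actually uses, and the window size and threshold below are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'.

    Sequences shorter than the window are returned unchanged.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical sequence: a varied region followed by a run of serines.
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "S" * 12))
```

Masked residues are skipped when scoring, so a run of serines in the query cannot pull in every serine-rich protein in the database.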

Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any two sequences. Also, a gap in a protein alignment implies the insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap and a separate penalty for extending it.
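
The two-part gap penalty can be sketched as below. The values 11 (opening) and 1 (per residue of extension) are used here as illustrative assumptions, and conventions differ on whether the first gapped residue is also charged an extension; this sketch charges it.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for one gap spanning `length` residues (0 means no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = gap_penalty(4)         # 11 + 4 * 1 = 15
four_short_gaps = 4 * gap_penalty(1)  # 4 * (11 + 1) = 48
print(one_long_gap, four_short_gaps)
```

With these numbers a single four-residue gap is much cheaper than four separate one-residue gaps, which matches the idea that one insertion or deletion event is more plausible than several scattered ones.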

Nucleotide searches are similar, but the scoring matrix is simpler, or only identities are used. "Words" are longer.
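
For nucleotides the "matrix" can be reduced to two numbers: a reward for an identity and a penalty for a mismatch. A minimal sketch follows; the values +1 and -3 are illustrative assumptions, and the sequences are made up.

```python
def nucleotide_score(seq1, seq2, match=1, mismatch=-3):
    """Score an ungapped nucleotide alignment using identities only."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(
        match if a == b else mismatch
        for a, b in zip(seq1.upper(), seq2.upper())
    )


print(nucleotide_score("ACGTACGT", "ACGTACGT"))  # 8 identities
print(nucleotide_score("ACGTACGT", "ACGAACGT"))  # 7 identities, 1 mismatch
```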

Assessing results: Each hit is returned with a score, the total over the alignment computed using the scoring matrix. But longer sequences tend to score better than short ones, so the raw score can be deceiving.

The "E-value" is more reliable: it is an estimate of how many hits with a score at least this high would be expected by chance when searching random sequences of the same length and composition. Significant E-values are usually <<1.
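
The usual statistical model behind the E-value (Karlin-Altschul statistics) is E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below simply evaluates that formula. The lambda and K values are assumed, illustrative constants (the real ones depend on the scoring matrix and gap costs), and the actual program also corrects the lengths, which is ignored here.

```python
import math


def expect_value(raw_score, query_length, database_length, lam=0.267, k=0.041):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical search: a 300-residue query against 100 million database residues.
for score in (40, 80, 160):
    print(score, expect_value(score, 300, 1e8))
```

Two things follow from the formula: E drops exponentially as the score rises, and the same score becomes less significant as the database grows.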

Sequence similarity is also used to align raw DNA sequence reads in order to determine the complete sequence of most genomes.
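
A toy version of the read-alignment step: find where the end of one read matches the start of another, then join them. This sketch assumes exact overlaps and error-free reads, which real sequencing data do not provide. The reads and function names are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads at their best exact overlap, or return None if none is found."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical reads sharing the four bases GTAC.
print(merge_reads("ATGGCGTAC", "GTACCTTA"))
print(merge_reads("AAAA", "CCCC"))
```

Repeating this over many reads builds up longer and longer contiguous stretches of sequence.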