REMINDER: EXAM 3 (Hamori,Nolan) Friday Dec. 5
Room 5001 8:30 - 10:30 AM
Both contain virtually all known sequences, including complete genomes
Mostly translated coding sequences from the DNA database
Important file formats for both protein and DNA databases are:
GenBank: protein example - DNA example
FASTA: protein example - DNA example
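To make the FASTA layout concrete, here is a minimal sketch of a reader, assuming the usual convention that each record starts with a `>` header line followed by one or more lines of sequence. The function name `parse_fasta` and the two example records are made up for illustration; they are not taken from GenBank.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records: identifiers and sequences are invented.
example = """>seq1 made-up protein
MKTAYIAK
QRQISFVK
>seq2 made-up DNA
ATGGCCATTG
"""
records = parse_fasta(example)
```

The same reader works for protein and DNA records, since FASTA itself does not say which alphabet a sequence uses.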
Last year there were:
89 complete Bacterial genomes
15 complete Archaeal genomes
18 complete Eukaryal genomes, including Human
and hundreds of viral genomes
The previous year there were:
53 complete Bacterial genomes
11 complete Archaeal genomes
10 complete Eukaryal genomes
and hundreds of viral genomes
Thousands of titles and abstracts from medically relevant journals dating back to the 1960s. Some older citations are also available. Powerful searching capabilities are essential for identifying articles of interest. Similar databases are available for other disciplines (e.g., agricultural).
Selecting the BLAST Program
You need to pick the right version of BLAST to suit your data:
Program | Description |
blastp | Compares an amino acid "query" sequence against a protein sequence database. |
blastn | Compares a nucleotide sequence against a nucleotide sequence database. |
blastx | Compares a nucleotide sequence translated in all 6 reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. |
tblastn | Compares a protein sequence against a nucleotide sequence database translated in all reading frames. |
tblastx | Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database. |
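The "6 reading frames" used by blastx, tblastn, and tblastx can be sketched as follows. This is only an illustration of the idea, assuming the standard genetic code; it is not how BLAST itself is implemented, and the names `six_frame`, `translate`, and `reverse_complement` are chosen here for the example.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' a codon with a non-ACGT base."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame("ATGGCC")  # toy input, not a real gene
```

For the toy input, frame +1 reads ATG GCC and gives "MA", while the reverse strand GGCCAT gives "GH" in frame -1. Each of the six protein strings would then be compared against the protein database.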
Selecting the BLAST Database
You can select several NCBI databases to compare your query sequences
against. Note that some databases are specific to proteins or nucleotides and
cannot be used in combination with certain BLAST programs (for example a blastn
search against swissprot).
Database | Description |
nr | All non-redundant GenBank CDS translations (open reading frames of nucleotide seq)+PDB+SwissProt+PIR+PRF |
month | All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. |
swissprot | The last major release of the SWISS-PROT protein sequence database. |
patents | Protein sequences derived from the Patent division of GenBank. |
yeast | Protein translations of the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic CDS translations. |
pdb | Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank. |
Database | Description |
nr | All non-redundant GenBank+EMBL+PDB sequences (but no EST, STS, GSS, or HTGS sequences). |
dbest | Non-redundant database of GenBank+EMBL+DDBJ EST Divisions. Expressed sequence tags: sequences of random cDNAs obtained from specific tissue sources. |
dbsts | Non-redundant database of GenBank+EMBL+DDBJ STS Divisions. |
yeast | Sequence fragments from the Yeast complete genome. |
E. coli | E. coli (Escherichia coli) genomic nucleotide sequences. |
pdb | Sequences derived from the 3-dimensional structure of proteins. |
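The pairing rules between programs and databases can be written down as a small lookup. This is a sketch based only on the program descriptions in these notes; the dictionary and function names are hypothetical and are not part of any NCBI tool.

```python
# (query type, database type) expected by each program, per the program table.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def pick_programs(query_type, database_type):
    """List the programs that accept this query type against this database type."""
    return sorted(
        name
        for name, kinds in BLAST_PROGRAMS.items()
        if kinds == (query_type, database_type)
    )


def is_compatible(program, database_type):
    """True if the program searches databases of the given type."""
    return BLAST_PROGRAMS[program][1] == database_type


# swissprot holds protein sequences, so blastn cannot search it.
blastn_vs_protein_db = is_compatible("blastn", "protein")
```

For a nucleotide query against a protein database such as swissprot, the lookup returns only blastx, which matches the table above.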
Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence
format, NCBI Accession numbers, or GenBank GI identifiers.
Accession or GI number
If you know the Accession number or the GI of a sequence in GenBank,
you can use this as the query sequence in a BLAST search.
Set search parameters
The search parameters can greatly affect the success of your search, although the defaults often work well. One key parameter for protein similarity searches is the similarity scoring matrix.
Since amino acids vary in abundance, the odds of two sequences sharing a stretch of similar amino acids differ depending upon how commonly those amino acids appear in all proteins.
Also, some amino acids have very similar properties to others and can substitute readily for certain residues, but would cause problems if substituted for other residues.
To account for these observations, Sequence Scoring Matrices have been designed for use in similarity searching. Scores are based upon observed substitution rates among proteins known to be similar.
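As a rough illustration of how a scoring matrix is applied, the sketch below sums substitution scores over an ungapped alignment. The handful of values are BLOSUM62 entries; the dictionary holds only the pairs needed for the example, and the peptide strings are invented.

```python
# A few BLOSUM62 entries; only the pairs needed for the example below.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("K", "K"): 5, ("L", "L"): 4, ("V", "V"): 4,
    ("K", "R"): 2,   # similar basic residues: mild positive score
    ("L", "I"): 2,   # similar hydrophobic residues
    ("L", "D"): -4,  # hydrophobic replaced by acidic: strongly penalized
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum substitution scores over two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


identical = ungapped_score("MKLV", "MKLV")      # 5 + 5 + 4 + 4
conservative = ungapped_score("MKLV", "MRIV")   # 5 + 2 + 2 + 4
disruptive = ungapped_score("MKLV", "MKDV")     # 5 + 5 - 4 + 4
```

Two conservative substitutions still leave a positive contribution at each position, whereas the single L to D change contributes a negative score at its position.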
The BLOSUM62 scoring matrix for proteins.
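Residue | C | S | T | P | A | G | N | D | E | Q | H | R | K | M | I | L | V | F | Y | W |
C | 9 |
S | -1 | 4 |
T | -1 | 1 | 5 |
P | -3 | -1 | -1 | 7 |
A | 0 | 1 | 0 | -1 | 4 |
G | -3 | 0 | -2 | -2 | 0 | 6 |
N | -3 | 1 | 0 | -2 | -2 | 0 | 6 |
D | -3 | 0 | -1 | -1 | -2 | -1 | 1 | 6 |
E | -4 | 0 | -1 | -1 | -1 | -2 | 0 | 2 | 5 |
Q | -3 | 0 | -1 | -1 | -1 | -2 | 0 | 0 | 2 | 5 |
H | -3 | -1 | -2 | -2 | -2 | -2 | 1 | -1 | 0 | 0 | 8 |
R | -3 | -1 | -1 | -2 | -1 | -2 | 0 | -2 | 0 | 1 | 0 | 5 |
K | -3 | 0 | -1 | -1 | -1 | -2 | 0 | -1 | 1 | 1 | -1 | 2 | 5 |
M | -1 | -1 | -1 | -2 | -1 | -3 | -2 | -3 | -2 | 0 | -2 | -1 | -1 | 5 |
I | -1 | -2 | -1 | -3 | -1 | -4 | -3 | -3 | -3 | -3 | -3 | -3 | -3 | 1 | 4 |
L | -1 | -2 | -1 | -3 | -1 | -4 | -3 | -4 | -3 | -2 | -3 | -2 | -2 | 2 | 2 | 4 |
V | -1 | -2 | 0 | -2 | 0 | -3 | -3 | -3 | -2 | -2 | -3 | -3 | -2 | 1 | 3 | 1 | 4 |
F | -2 | -2 | -2 | -4 | -2 | -3 | -3 | -3 | -3 | -3 | -1 | -3 | -3 | 0 | 0 | 0 | -1 | 6 |
Y | -2 | -2 | -2 | -3 | -2 | -3 | -2 | -3 | -2 | -1 | 2 | -2 | -2 | -1 | -1 | -1 | -1 | 3 | 7 |
W | -2 | -3 | -2 | -4 | -3 | -2 | -4 | -4 | -3 | -2 | -2 | -3 | -3 | -1 | -3 | -2 | -3 | 1 | 2 | 11 |
Other parameters: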
Word size: the window of amino acids scored at a time. Small blocks of high-scoring regions are stitched together to form the complete alignment. Words found in database proteins are indexed to speed up the search.
Filtering: ignores regions of the protein that have lots of uninformative residues. Helps avoid artifactual matches.
Gap costs: Introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any two sequences. Also, gaps in a protein alignment imply insertion or deletion of structural elements, which can be far more deleterious to protein structure than an amino acid substitution. So there is a penalty for opening a gap, and a separate penalty for extending it.
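The two-part gap penalty can be sketched as below. The default values (11 to open, 1 per gapped position) are an assumption chosen to resemble commonly used protein settings, and the convention that a gap of length k costs open + k × extend is likewise assumed here; check the parameters of the BLAST page you are using. The aligned strings are toy examples.

```python
from itertools import groupby


def gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of one gap: the opening penalty once, plus an extension penalty per position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def total_gap_penalty(row1, row2, gap_open=11, gap_extend=1):
    """Sum gap costs over every run of '-' in either row of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    # Label each column: 1 = gap in row1, 2 = gap in row2, 0 = two residues aligned.
    labels = [1 if a == "-" else 2 if b == "-" else 0 for a, b in zip(row1, row2)]
    penalty = 0
    for label, run in groupby(labels):
        if label:
            penalty += gap_cost(len(list(run)), gap_open, gap_extend)
    return penalty


one_long_gap = total_gap_penalty("MK---LV", "MKAAALV")    # one gap of length 3
two_short_gaps = total_gap_penalty("MK-A-LV", "MKAAALV")  # two gaps of length 1
```

With these settings one gap of three positions costs less than two separate one-position gaps, which reflects the idea that a single insertion or deletion event is more plausible than several scattered ones.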
Nucleotide searches are similar, but the scoring matrix is simpler, or only identities are used. "Words" are longer.
Assessing results: Each hit is returned with a total score computed from the alignment using the scoring matrix. But longer sequences score better than short ones, so the raw score can be deceiving.
The "E-value" is more reliable: it is an estimate of how many random sequences of the same length and composition would score as high or higher. Significant E-values are usually much smaller than 1 (<<1).
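One common way to express the E-value is the Karlin-Altschul relation E = K · m · n · e^(−λS), where S is the raw score, m the query length, n the database length, and λ and K are statistical constants that depend on the scoring matrix and gap costs. The sketch below assumes that relation; the values λ = 0.267 and K = 0.041 are illustrative assumptions, and the query and database sizes are invented.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score (E = K*m*n*exp(-lambda*S))."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda*S - ln K) / ln 2, so that E = m*n*2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative, assumed constants and sizes; not taken from a real search.
LAM, K = 0.267, 0.041
weak_hit = e_value(40, query_len=250, db_len=1e8, lam=LAM, k=K)
strong_hit = e_value(120, query_len=250, db_len=1e8, lam=LAM, k=K)
```

Because E scales with the database length, the same raw score becomes less significant as the database grows, which is one reason E-values are preferred over raw scores when comparing searches.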
Sequence similarity is also used to align raw DNA sequence reads to determine the complete sequence of most genomes
Human genome sequence
Celera genome sequence
Science v. 291, pp 1304-1351.
Also released the same week:
Draft sequence of the publicly funded Human Genome Project
Nature 409, 860-921 (15 February 2001)
(Links are to abstracts, but you can download PDF or HTML versions of papers from there).
The sequences generally agree, though the public database is more accurate.
Link to lecture on Public Human Genome Project if you are interested.
Genome was sequenced using a shotgun approach
DNA sequence obtained from random clones of sheared genomic DNAs
The size of the genome (3 billion base pairs) makes assembling short stretches of sequence (500 to 600 base pairs) a computational challenge
27 million sequencing reactions were performed, yielding 15 billion base pairs of sequence
Sequence was obtained from five individuals of different ethnic backgrounds
Key to assembling sequence: pairs of sequencing reads obtained from opposite ends of cloned insert ("mate-pairs").
Clones were obtained from three different size-specific libraries: 2 kilobases, 10 kilobases, and 50 kilobases
Knowing the distance between pairs of reads from individual clones helps in assembling contiguous sequences
Raw sequence was trimmed of known vector sequence, E. coli sequence, and mitochondrial sequence
Additional sequence data was obtained by shredding sequence from the public Human Genome Project
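The mate-pair constraint described above can be sketched as a simple distance check: if two reads come from opposite ends of the same clone, their positions in a correct assembly should be about one insert length apart. The function name, the 10% tolerance, and the coordinates are assumptions made for this illustration; the actual assembler used far more elaborate rules.

```python
def mate_pair_consistent(fwd_start, rev_end, insert_size, tolerance=0.1):
    """True if the assembled span between two mates matches the clone insert size.

    fwd_start and rev_end are assembly coordinates of the outer ends of the two
    reads; tolerance is the allowed fractional deviation from insert_size.
    """
    span = rev_end - fwd_start
    return abs(span - insert_size) <= tolerance * insert_size


# Hypothetical placements for mates from a 50-kilobase library.
plausible = mate_pair_consistent(0, 48_000, insert_size=50_000)
suspicious = mate_pair_consistent(0, 20_000, insert_size=50_000)
```

A pair that fails the check suggests that the reads were joined incorrectly, for example through a repeated sequence or a chimeric clone.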
Genome sequences were assembled in 2 ways:
A whole genome assembly in which the entire sequence was assembled into the chromosomes
An alternative strategy sorted data based upon homology to known chromosomes, allowing assembly of individual chromosomes
A key hurdle in assembling sequence is avoiding misalignments caused by repeated sequences in the genome and chimeric clones.
Either of these problems can result in sequences from different regions of the genome being joined together.
Repeated sequences were screened by removing known repeated sequences
Also, unknown repeats were indicated by overrepresentation of particular sequences
Unique sequences were then assembled and oriented using mate-pair information
Sequence scaffolds could be assembled based upon distances between pairs of reads
The initial scaffold covered 73% of the genome
Additional gaps were filled using reads which showed homology to scaffolds at one end
Additional gaps were filled with known sequence from the public database
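Flagging unknown repeats by overrepresentation, as mentioned above, can be illustrated with k-mer counting: short subsequences that occur far more often than is typical are likely to come from repeated regions. This is a toy sketch, not the method used in the Celera assembly; the function name, the use of the median as the "typical" count, and the threshold factor are all assumptions.

```python
from collections import Counter
from statistics import median


def overrepresented_kmers(reads, k, factor=3.0):
    """Return k-mers whose count exceeds factor times the median k-mer count."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    if not counts:
        return {}
    typical = median(counts.values())
    return {kmer: n for kmer, n in counts.items() if n > factor * typical}


# Toy reads: the first one is a simple repeat, the others are unique.
toy_reads = ["AAAAAAAA", "ACGT", "CGTA", "GTAC"]
flagged = overrepresented_kmers(toy_reads, k=4)
```

Reads dominated by flagged k-mers would be set aside so that only unique sequence is used for the initial assembly.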
Sequence scaffolds were mapped to chromosomes by homology to known chromosome markers
The assembled sequence was assessed for completeness and correctness
Completeness was assessed by comparison to known finished sequences for chromosome 21 and chromosome 22; completeness was better than 95%
Accuracy was determined by comparison to known chromosome 21 and chromosome 22 sequences and estimated at 99.96%
Gene prediction
Genes were detected by homology searches of protein databases and expressed sequence tags
The total number of genes is estimated at between 25 and 30 thousand
Nearly 50% of the predicted genes are of unknown function.
Genes are found associated with GC rich sequences
Genes tend to cluster, and "desert" regions of more than 500 kilobases without genes have been found
CpG islands also correlate with gene locations
A number of gene duplications caused by retrotransposition were detected
These are genes which have only a single exon flanked by repeated sequence and homologous to known multi-exon genes
Duplication
Chromosome duplications were detected by concatenation of predicted protein sequences for each of the 24 chromosomes and searching by homology between chromosomes
Clusters of related genes on different chromosomes were indicative of ancient chromosomal duplications
Most appeared to have occurred before the divergence of primate and rodent lineages
Duplicated genes which are related to genes of known function are interesting candidates for further study
The draft sequence of the mouse genome has also been released. Comparison of mouse and human sequences will aid in the discovery of new genes.