        ____promoter                                     term
DNA  =======================================================
RNA  ~~~~~~~~~~~~~~~---------------~~~~~~~~~~~~~~~~~    transcribe
RNA  ~~~/~~~~~~~~~~~~|~~~~~~~~~~~~~~~\~~AAAAAA           processing
protein  dsafkldfklkldfkdsdlkfprltsqwre                  translation
Signals for transcription into mRNA
mRNA encodes triplet codons to specify protein sequence
mRNA translated into protein at ribosome
tRNAs carry amino acids specified by mRNA
64 codons -> 20 amino acids
Prokaryotic gene prediction is easier, since there is little RNA processing to interrupt coding regions
Translation coding signals
First look for a long open reading frame (ORF: an uninterrupted string of codons between start and stop codons)
Often use a threshold value for the shortest ORF you will evaluate.
Prokaryotes usually have a purine-rich ribosome binding site immediately upstream of the initiator AUG codon
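The ORF search with a length threshold can be sketched as follows. This is a minimal illustration, not the method of any particular program: it scans only the three forward-strand frames of a DNA string, treats ATG as the only start codon, and the function name and the `min_codons` parameter are assumptions made for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (frame, start, end) for each forward-strand ORF.

    start/end are 0-based, end exclusive, and the span includes the stop codon.
    min_codons is the threshold (an assumed parameter): ORFs with fewer codons
    between start (included) and stop (excluded) are discarded.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None                   # look for the next ORF
    return orfs

example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))                # [(2, 2, 17)]
print(find_orfs(example, min_codons=5))  # []
```

A real search would also scan the reverse complement and, for prokaryotes, allow alternative start codons such as GTG and TTG.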
Can usually identify transcriptional coding signals, too:
TATAAT consensus at -10 and TTGACA at -35 from start of transcription.
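A hedged sketch of how the two promoter hexamers might be searched for together. It uses the commonly cited -10 consensus TATAAT and the -35 consensus TTGACA. The allowed spacer of 16 to 18 bp between the hexamers, the mismatch tolerance, and all function names are assumptions chosen for illustration. Real promoters match the consensus only loosely, so practical tools use weight matrices rather than exact strings.

```python
MINUS35 = "TTGACA"
MINUS10 = "TATAAT"

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(dna, max_mismatch=1, spacers=(16, 17, 18)):
    """Return (pos35, pos10) index pairs for candidate promoters.

    pos35 / pos10 are 0-based starts of the -35 and -10 hexamers.
    max_mismatch and spacers are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], MINUS35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            box = dna[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS10) <= max_mismatch:
                hits.append((i, j))
    return hits

toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"  # made-up sequence
print(find_promoters(toy))  # [(2, 25)]
```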
Content:
The Genetic Code
First                     Second                      Third
           U          C          A          G
U       UUU Phe    UCU Ser    UAU Tyr    UGU Cys        U
        UUC Phe    UCC Ser    UAC Tyr    UGC Cys        C
        UUA Leu    UCA Ser    UAA Stop   UGA Stop       A
        UUG Leu    UCG Ser    UAG Stop   UGG Trp        G
C       CUU Leu    CCU Pro    CAU His    CGU Arg        U
        CUC Leu    CCC Pro    CAC His    CGC Arg        C
        CUA Leu    CCA Pro    CAA Gln    CGA Arg        A
        CUG Leu    CCG Pro    CAG Gln    CGG Arg        G
A       AUU Ile    ACU Thr    AAU Asn    AGU Ser        U
        AUC Ile    ACC Thr    AAC Asn    AGC Ser        C
        AUA Ile    ACA Thr    AAA Lys    AGA Arg        A
        AUG Met    ACG Thr    AAG Lys    AGG Arg        G
G       GUU Val    GCU Ala    GAU Asp    GGU Gly        U
        GUC Val    GCC Ala    GAC Asp    GGC Gly        C
        GUA Val    GCA Ala    GAA Glu    GGA Gly        A
        GUG Val    GCG Ala    GAG Glu    GGG Gly        G
Multiple codons for most amino acids.
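The standard genetic code can be written compactly and used to translate an mRNA, which also makes the degeneracy (64 codons for 20 amino acids plus stop) easy to count. This is a small sketch: the one-letter amino acid string follows the U, C, A, G ordering of the standard table, and the function and variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids in table order (first, second, third base = U, C, A, G);
# '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' for ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# How many codons specify each amino acid (or stop)?
DEGENERACY = Counter(CODON_TABLE.values())

print(translate("GGAUGUUUGGCUAAUU"))                           # MFG
print(DEGENERACY["L"], DEGENERACY["M"], DEGENERACY["*"])      # 6 1 3
```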
codon bias - some codons preferred by given organism over others to code for a particular amino acid
due to relative abundance of tRNAs
Key: O = optimal codons, S = suboptimal, R = rare, and U = unfavorable codons.
Note how few rare and unfavorable codons are used. In contrast, the same sequence in a different reading frame (>1 above) yields a much poorer codon bias:
Many more rare codons and fewer optimal codons than the real reading frame.
This bias is species specific. The example above used the E. coli codon bias table. If you look at the correct, translated reading frame using the S. cerevisiae codon bias table, the codon bias is not favorable.
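The comparison of reading frames against a species-specific codon table can be sketched as below. The reference genes here are made-up toy sequences, not real E. coli or yeast data, and the scoring rule (mean log-odds of each codon against uniform usage of 1/64) is one simple choice assumed for illustration. A real analysis would load a published codon usage table for the organism of interest.

```python
from collections import Counter
from itertools import product
from math import log

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codons(seq, frame=0):
    """Split a DNA string into complete codons, starting at the given frame."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def usage_table(genes, pseudocount=1.0):
    """Codon frequencies estimated from in-frame reference genes."""
    counts = Counter(c for g in genes for c in codons(g))
    total = sum(counts[c] for c in ALL_CODONS) + 64 * pseudocount
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def frame_score(seq, table, frame):
    """Mean log-odds of the frame's codons versus uniform usage (1/64)."""
    cs = [c for c in codons(seq, frame) if c in table]
    if not cs:
        return 0.0
    return sum(log(64 * table[c]) for c in cs) / len(cs)

# Hypothetical "known genes" standing in for one species' training set.
TOY_GENES = ["ATGGCTGCTAAAGCTTAA", "ATGAAAGCTGCTAAATAA"]
table = usage_table(TOY_GENES)

query = "ATGGCTAAAGCTGCTTAA"
scores = [frame_score(query, table, f) for f in range(3)]
best_frame = scores.index(max(scores))
print(best_frame)  # 0
```

Swapping in a table built from a different organism's genes changes the scores, which is the species specificity described above.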
Some pairs of codons tend to be found together more than others. May
be due to neighbor effects on translation. May also be due to constraints of
protein structure.
3rd base of codon (wobble) tends to fit G-C bias of rest of genome
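One way to look at the third-position effect is to compare the G+C fraction at third codon positions (often called GC3) with the overall G+C fraction. A minimal sketch follows; the function names are assumptions, and the input is assumed to be an in-frame ORF.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0

def gc3(orf):
    """G+C fraction at third codon positions of an in-frame ORF."""
    return gc_fraction(orf.upper()[2::3])

orf = "ATGGCCAAGTAA"   # codons: ATG GCC AAG TAA
print(gc3(orf))        # 0.75
print(gc_fraction(orf))
```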
These more subtle rules may be hard to identify and quantitate, but Hidden Markov Model programs can be used to address the problem
HMM programs must be trained on known sequences to establish statistical rules
to evaluate unknowns.
So take known genes encoding known proteins to use as input into program.
Model genes will then provide statistics on codon bias, codon pairs, etc.
Depends on accuracy of model genes (poor data can obscure important rules) and relatedness to unknown genes (i.e., need E. coli sequences to establish rules for identifying E. coli genes)
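The idea of training on known genes and then scoring unknowns can be illustrated with a much simpler stand-in for an HMM: a fixed-order Markov chain (no hidden states) trained separately on coding and non-coding sequence, with the unknown scored by log-odds. The training sequences below are toy examples, and the model order, pseudocount, and function names are all assumptions made for this sketch.

```python
from collections import defaultdict
from math import log

def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            if nxt in "ACGT":
                counts[ctx][nxt] += 1
    return {ctx: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for ctx, nxt in counts.items()}

def log_odds(seq, coding, background, order=2):
    """Sum of log(P_coding / P_background); contexts never seen fall back to 0.25."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        p = coding.get(ctx, {}).get(nxt, 0.25)
        q = background.get(ctx, {}).get(nxt, 0.25)
        score += log(p / q)
    return score

# Toy training data (hypothetical): "known genes" vs. non-coding sequence.
coding_model = train_markov(["ATGGCTGCTGCTGCTTAA"])
background_model = train_markov(["ATATATATATATATATAT"])

print(log_odds("GCTGCTGCT", coding_model, background_model) > 0)  # True
print(log_odds("TATATATA", coding_model, background_model) < 0)   # True
```

The dependence on the training set is visible here: a model trained on one organism's genes only rewards sequence that resembles those genes.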
Can try to identify coding regions and exons in genomic DNA using programs which utilize hidden Markov models (HMM), statistical models combining information on splice sites, codon bias, and lengths of introns and exons
For the phage genome projects, we have been using 2 different HMM programs:
GeneMarkS and Glimmer.
Both use long ORFs in the new genome to identify sequence rules for ORFs.
These rules are then used to identify shorter ORFs.
So the programs can identify genes in a phage even if it has a different sequence bias
than its host.
(Most T4-like phages encode some tRNAs of their own, influencing codon bias.)
BLAST search can identify close or distant orthologs or paralogs
Can identify new variants of ancestral genes, or domains that have been swapped
around to make novel genes
similarity of translated reading frame to known protein is suggestive that new
reading frame is real.
Can also be used to identify genes to train HMM programs.
Eukaryotic gene prediction can be based upon prokaryotic prediction programs, but requires additional complexity to reflect the complexity of eukaryotic transcription, processing, and translation
Most eukaryotic splicing is performed by the spliceosomal complex:
splice sites are determined by sequences at the ends of introns (the "GT-AG rule": introns begin with GT and end with AG)
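A naive sketch of the GT-AG rule: list every span of genomic DNA that begins with GT and ends with AG and falls within a length window. The length limits and function name are assumptions for illustration. On real sequence this produces far too many candidates, which is why prediction programs score the surrounding consensus with matrices instead of relying on the dinucleotides alone.

```python
def candidate_introns(dna, min_len=20, max_len=10000):
    """Return (start, end) spans (0-based, end exclusive) that begin with GT
    and end with AG. min_len and max_len are assumed, illustrative limits."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                spans.append((d, a + 2))
    return spans

# Toy gene: exon 1, a 24-bp GT...AG intron, exon 2 (made-up sequence).
gene = "ATGGCC" + "GT" + "C" * 20 + "AG" + "GCCTAA"
spans = candidate_introns(gene)
print(spans)                       # [(6, 30)]
start, end = spans[0]
print(gene[:start] + gene[end:])   # ATGGCCGCCTAA
```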
In many cases this requires use of neural network programs:
suites of programs that evaluate different aspects of sequences and compare
results to identify best candidates
i.e., an HMM to evaluate reading frames, AND a similarity search program to
compare to database AND a matrix program to identify consensus splice sites,
polyA sites, etc.
Neural network program which compares GC composition of putative gene
to flanking regions
scores splice donor and acceptor sites
evaluates ORF
scores polyA sites
Compares to EST mRNAs
Regions that score highly by one criterion are fed through other analyses (Grail-EXP BRCA prediction benchmark)
Pattern discrimination program
Plots functions such as exon preference vs 3' splice site score on X-Y graph.
FGENESH adds HMM analysis (FGENESHGC BRCA prediction
benchmark)
Similar type of analysis to FGENES, but uses a quadratic (curved) boundary rather than a straight line to separate winners from losers. Only tries to pick individual exons; does not try to assemble them into a model gene.
Figure shows an example of how a parameter discrimination program might distinguish "hits" from "misses". Anything above the blue diagonal would be a hit for a linear discrimination program like FGENES; anything above the green arc might be a hit for a quadratic parameter program like MZEF. (MZEF BRCA prediction benchmark)
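The difference between a straight-line and a curved decision boundary can be shown with two toy classifiers on a pair of scores, for example exon preference (x) and 3' splice site score (y). The weights, threshold, and radius below are hypothetical values chosen only to make the picture concrete. They are not the parameters used by FGENES or MZEF.

```python
def linear_hit(x, y, w=(1.0, 1.0), threshold=1.0):
    """Hit if the point lies above the straight line w[0]*x + w[1]*y = threshold."""
    return w[0] * x + w[1] * y > threshold

def quadratic_hit(x, y, radius=0.8):
    """Hit if the point lies outside the arc x^2 + y^2 = radius^2."""
    return x * x + y * y > radius ** 2

# (x, y) score pairs for three hypothetical candidate exons.
candidates = [(0.6, 0.6), (0.9, 0.05), (0.3, 0.3)]
for x, y in candidates:
    print((x, y), linear_hit(x, y), quadratic_hit(x, y))
# (0.6, 0.6)  True  True
# (0.9, 0.05) False True
# (0.3, 0.3)  False False
```

The second candidate, strong on one score and weak on the other, is the kind of case where the two boundary shapes disagree.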
Looks for a match of query DNA to a model of genome composition and gene structure,
using an HMM. Predicts optimal exons, but also suboptimal ones. Accuracy of a prediction
roughly tracks its P value, so some useful info can be found down to P = 0.5.
GenomeScan adds blastx info to increase accuracy of hits. (GENSCAN
BRCA prediction benchmark)
In order to assess the relative strengths and weaknesses of the various prediction programs, a benchmark region of human genomic DNA encoding the BRCA2 mRNA has been used as a tester sequence for all of the major prediction programs. The results of many of these comparisons are found at the Banbury Cross Web site.
Graphic comparison of predictions using UCSC Genome Browser. Note: the nucleotide position numbers have changed with the revised draft of the genome.
Ensembl uses predictions of Genscan (an HMM program), then checks these predictions
against ESTs, mRNAs and protein motifs in known databases
Merged these predictions (35,500) with those from Genie, which tries to match
5' end ESTs with 3' end ESTs to make full-length predictions
Also merged with known sequences from RefSeq
Came up with a total of 31,778 predicted genes:
14,882 from known genes
12,839 from Ensembl
4,057 from Ensembl-Genie
Matches to mouse cDNAs which do not match known proteins may indicate proteins of novel function previously not identified
Comparison of protein coding capacity of sequenced genomes (proteomes)