Describe the eukaryotic genome structure.
What is TRY4?
What are V28 and V29-1?
What does gene finding aim to find?
Describe prokaryotic gene finding.
Prokaryotes have small genomes (0.5-5 Mbp) with a high coding density (>90%) and no introns. This makes gene identification relatively easy (~99% success rate). Problems include overlapping ORFs (because the genome is so compact), short genes that fall below the minimum length cutoff (roughly 50 aa), and finding promoters.
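A minimal sketch of how an ORF scan with a length cutoff might look. The function names, the ATG-only start rule, and the default cutoff of 50 codons are assumptions made for illustration; real prokaryotic gene finders also handle alternative start codons and overlapping genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=50):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end, length_aa) tuples, where start/end
    are 0-based coordinates on the scanned strand and length_aa counts the
    codons before the stop. ORFs shorter than min_aa are discarded, which
    is exactly why genuinely short genes get missed.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, frame, start, i + 3, length_aa))
                    start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("ATG" + "GCT" * 60 + "TAA")` reports one ORF of 61 codons on the forward strand, while the same construct with only 10 GCT codons is dropped by the cutoff.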
Describe eukaryotic gene finding.
Eukaryotes have large genomes (10-120,000 Mbp) with a low coding density (<50%, 2-3% in humans), and they contain introns. This makes gene identification relatively difficult (~50% success rate), and eukaryotic gene finding runs into many more problems than prokaryotic gene finding.
Methods of gene finding
What are Ab initio methods?
Making predictions based on typical gene features such as splice signals and sequence composition. Regions to look for include initial 5’ exons, internal exons, and final 3’ exons.
What are similarity-based methods?
Problems with similarity-based methods
ORF scanning in prokaryotes
ORF scanning in eukaryotes
Codon usage in genomes
ORF scanning and moving windows
Only sequence information is used: coding exons are identified by integrating coding statistics over a moving window. We calculate the likelihood that each triplet lies in a coding region and plot it along the sequence (a score above zero means coding is likely; below zero means it is unlikely).
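One way to sketch this score is a log-odds ratio per codon, averaged over a moving window. Everything here is an assumption for illustration: the function names, the uniform background of 1/64 per codon, the pseudocount, and the window size. Real tools use organism-specific codon usage tables and richer statistics.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_log_odds(coding_seqs, pseudocount=1.0):
    """Estimate log(P(codon | coding) / P(codon | background)) per codon.

    Coding frequencies come from in-frame training sequences; the
    background is assumed uniform (1/64 for every codon).
    """
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    background = 1.0 / len(CODONS)
    return {
        c: math.log(((counts[c] + pseudocount) / total) / background)
        for c in CODONS
    }


def window_scores(seq, log_odds, window=10, frame=0):
    """Mean log-odds over each window of `window` consecutive codons.

    Positive values suggest coding sequence, negative values non-coding.
    Codons with unexpected characters score 0.
    """
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    per_codon = [log_odds.get(c, 0.0) for c in codons]
    return [
        sum(per_codon[i:i + window]) / window
        for i in range(len(per_codon) - window + 1)
    ]
```

Plotting the list returned by `window_scores` against window position gives the kind of graph described above, with the zero line separating likely coding from unlikely coding stretches.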
ORF scanning: exon/intron boundaries
Exon-intron boundaries have distinctive sequence features.
- Upstream boundary: invariant GT within a consensus sequence
- Downstream boundary: T or C, any nucleotide, then CAG
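The two boundary signals can be sketched as simple pattern matches. This is an illustrative simplification, not how production gene finders work: the regular expressions and the function name are assumptions, and a bare GT or CAG match occurs by chance very often, so real methods score the surrounding consensus with weight matrices or HMMs.

```python
import re

# Lookaheads let overlapping candidates all be reported.
DONOR = re.compile(r"(?=GT)")                 # intron starts with invariant GT
ACCEPTOR = re.compile(r"(?=[CT][ACGT]CAG)")   # T or C, any nucleotide, then CAG


def candidate_splice_sites(seq):
    """Return (donors, acceptors) as lists of 0-based positions.

    donors:    index of the G in each GT (first base of a candidate intron)
    acceptors: index just after each matched CAG (first base of the
               candidate downstream exon)
    """
    seq = seq.upper()
    donors = [m.start() for m in DONOR.finditer(seq)]
    acceptors = [m.start() + 5 for m in ACCEPTOR.finditer(seq)]
    return donors, acceptors
```

Pairing each donor with a downstream acceptor gives candidate introns, which then have to be filtered using coding statistics for the flanking exons.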
ORF scanning: upstream regulatory sequences
Locate where genes begin using distinct sequence features (e.g. recognition signals for DNA-binding proteins). Regulatory sequences are variable and difficult to incorporate into gene prediction algorithms.
Best Ab initio methods
Based on hidden Markov models (HMMs), a machine learning approach that takes sequences and encodes them in a statistical framework
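To make the idea concrete, here is a toy two-state HMM (non-coding vs. coding) decoded with the Viterbi algorithm. All probabilities below are invented for illustration, and the model is far simpler than anything a real gene finder uses; it only shows how a sequence can be labelled by the most probable path through hidden states.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Made-up emissions: the coding state is assumed to be GC-rich.
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of N/C."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best previous state for state s at position i+1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these toy parameters, an AT-rich stretch followed by a GC-rich stretch is labelled as non-coding followed by coding, because a single state switch costs less than emitting ten unlikely bases.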
Examples of Ab initio methods
GENSCAN, HMMgene, GeneMark
What is GenScan?
GenScan identifies complete intron/exon structures of genes in genomic DNA and can predict multiple genes in one sequence, including both partial and complete genes. It uses an HMM to model gene structure, with separate HMMs for exons, introns, and intergenic regions, and different parameter sets for regions with different GC content.
P values in GenScan
P is the probability that the exon is correct.
P > 0.99: the exon is almost always exactly correct.
0.50 <= P <= 0.99: the exon is correct most of the time.
P < 0.50: the exon is not reliable.
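These thresholds can be applied to a list of predicted exons with a small helper. The function name and the returned labels are hypothetical and only mirror the three bands above; they are not part of GenScan itself.

```python
def exon_reliability(p):
    """Map a predicted exon's probability P to one of the three bands."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be a probability between 0 and 1")
    if p > 0.99:
        return "almost always correct"
    if p >= 0.50:
        return "correct most of the time"
    return "not reliable"
```

For instance, `[exon_reliability(p) for p in (0.998, 0.73, 0.21)]` sorts three exons into the three bands in order.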
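Sensitivity (exon level accuracy)
no. of correct exons/no. of actual exons
Specificity (exon level accuracy)
no. of correct exons/no. of predicted exons
Sensitivity (nucleotide level accuracy)
true positives/(true positives + false negatives), i.e. correctly predicted coding nucleotides/(correctly predicted + missed coding nucleotides)
Specificity (nucleotide level accuracy)
true positives/(true positives + false positives), i.e. correctly predicted coding nucleotides/(correctly predicted + wrongly predicted coding nucleotides)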
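A short sketch of both accuracy levels, following the usual convention in which exon-level accuracy counts exactly matching exons and nucleotide-level accuracy counts individual coding positions. The function names and the representation of exons as half-open `(start, end)` tuples are assumptions, and both inputs are assumed to be non-empty.

```python
def exon_level(predicted, actual):
    """Return (sensitivity, specificity) counting only exact exon matches."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    return correct / len(actual), correct / len(predicted)


def nucleotide_level(predicted, actual):
    """Return (sensitivity, specificity) counting coding nucleotides."""
    pred_nt = {pos for start, end in predicted for pos in range(start, end)}
    act_nt = {pos for start, end in actual for pos in range(start, end)}
    tp = len(pred_nt & act_nt)   # coding and predicted coding
    fn = len(act_nt - pred_nt)   # coding but missed
    fp = len(pred_nt - act_nt)   # predicted coding but not coding
    return tp / (tp + fn), tp / (tp + fp)


actual = [(0, 100), (200, 300), (400, 500)]
predicted = [(0, 100), (210, 300), (600, 650)]
exon_sn, exon_sp = exon_level(predicted, actual)
nt_sn, nt_sp = nucleotide_level(predicted, actual)
```

In this example only one of three exons matches exactly, so exon-level sensitivity and specificity are both 1/3, while the nucleotide-level values are higher because the second prediction overlaps most of its true exon.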