  • State-of-the-art statistical and remote-homology gene-prediction methods are successful at identifying the location of exons in unannotated genomic DNA, but are often quite poor at predicting the details of gene structure, necessitating human curation [14].

  • Gene model splits were often necessary because gene-prediction programs such as Genie and GENSCAN can string together genes that lie close to each other and do not resolve nested genes.

  • This pipeline aligns Drosophila ESTs, cDNA sequences, and other sequences in GenBank using sim4 [53], performs DNA and protein sequence similarity searches of the GenBank and SwissProt/TrEMBL databases using BLASTX and TBLASTX, and executes the gene-prediction algorithms Genie [68] and GENSCAN [69].

  • In general, the accuracy and scalability of gene-prediction and similarity-search programs are such that computing on 20-megabase (Mb) chromosome arms is ill-advised; we therefore cut the finished genomic sequence into smaller segments.

  • The annotated gene models that are supported by cDNA and/or high-quality TBLASTX matches provide a useful dataset for training and testing gene-prediction algorithms on heterochromatic sequence.
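
Cutting a finished chromosome-arm sequence into smaller segments before running gene-prediction and similarity-search programs can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names `split_into_segments` and `to_arm_coordinate`, the segment length, and the use of overlapping windows are all hypothetical, since the text does not say how segments were sized or whether they overlapped.

```python
from typing import Iterator, Tuple


def split_into_segments(
    sequence: str, segment_len: int, overlap: int = 0
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) windows over `sequence`.

    Coordinates are 0-based and half-open, relative to the full sequence.
    `segment_len` and `overlap` are illustrative parameters (assumptions),
    not values taken from the source text.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    step = segment_len - overlap
    start = 0
    while start < len(sequence):
        end = min(start + segment_len, len(sequence))
        yield start, end, sequence[start:end]
        if end == len(sequence):
            break  # the last window already reaches the end of the sequence
        start += step


def to_arm_coordinate(segment_start: int, local_pos: int) -> int:
    """Map a position within a segment back to full-sequence coordinates."""
    return segment_start + local_pos


if __name__ == "__main__":
    toy_arm = "ACGT" * 5  # 20-base stand-in for a chromosome arm
    for start, end, seg in split_into_segments(toy_arm, segment_len=8, overlap=2):
        print(start, end, seg)
```

Overlapping windows are one way to avoid losing a gene that straddles a segment boundary; results computed on each segment would then be mapped back to arm coordinates with something like `to_arm_coordinate`.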

