Comparative genome analysis
My laboratory will be working on a variety of problems that can be formulated and solved within the framework of comparative genomics. Below I list some of the projects. These projects represent only a small sample of fascinating questions that can be answered by comparing genomic sequences. There will be more to come…
Comparative approach to gene annotation. Applying comparative genomics to gene prediction is a simple and intuitive way to increase its accuracy and reliability. We have already demonstrated the utility of a simple KA/KS test for identifying protein-coding exons using human-mouse sequence comparisons (Nekrutenko, Makova, Li, 2002). The test takes advantage of the fact that in the majority of coding regions synonymous substitutions (which do not change the encoded amino acid) are much more frequent than non-synonymous ones (which alter the encoded amino acid; for review see Li, 1997). My lab will be pursuing the following directions:
- Establish an alternative estimate of protein-coding capacity of the human genome using a comparative approach to challenge the current estimate of gene number in the human genome;
- Develop a comparative approach for the assembly of the entire gene structure including identification of exon/intron boundaries and untranslated regions.
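The intuition behind the KA/KS test can be illustrated with a small Python sketch. This is not the published method: a real KA/KS estimate normalizes by the number of synonymous and non-synonymous sites and corrects for multiple hits, whereas the sketch below only counts differing codons in an ungapped, in-frame pairwise alignment. The function name and the two short sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a, seq_b):
    """Count codon pairs that differ synonymously vs. non-synonymously.

    Assumes both sequences are aligned, in frame, and of equal length.
    This is a crude proxy for the KA/KS idea, not a KA/KS estimator.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        # Skip codons with gaps or ambiguous bases.
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical aligned fragments (not real human or mouse sequences).
fragment_1 = "ATGGCTAAAGGT"
fragment_2 = "ATGGCCAAGGAT"
syn, nonsyn = count_codon_changes(fragment_1, fragment_2)
print(f"synonymous: {syn}, non-synonymous: {nonsyn}")
```

In a protein-coding region one would expect the synonymous count to dominate; a region where the two counts are comparable is less likely to be coding.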
Annotation of promoters. Gene number does not appear to reflect the complexity of an organism: humans have only between 30,000 and 40,000 genes, which is only a two-fold increase over C. elegans. We therefore think that the study of regulatory regions will provide answers to many fundamental evolutionary and genetic questions. Dubchak et al. (2000) demonstrated that cross-species comparisons are useful for identifying conserved regions within non-coding sequences. Almost all currently used promoter-prediction algorithms suffer from overprediction (a very high rate of false positives). We will initially approach this problem by identifying groups of conserved regions and studying their combinations.
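One simple way to locate conserved regions in a pairwise alignment is to slide a fixed-size window along it and keep the windows whose percent identity exceeds a threshold. The Python sketch below illustrates that general idea only; it is not the algorithm of Dubchak et al., and the function name, window size, threshold, and toy alignment are all assumptions chosen for illustration.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return merged (start, end) intervals of conserved alignment columns.

    aln_a and aln_b are two rows of a pairwise alignment (same length,
    '-' for gaps). A window counts as conserved when the fraction of
    identical, non-gap columns is at least min_identity. Overlapping or
    adjacent conserved windows are merged. Intervals are half-open.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have the same length")
    matches = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    for start in range(0, len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Extend the previous region instead of starting a new one.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: two identical blocks separated by a divergent block.
row_1 = "ACGTACGTAC" + "TTTTTTTTTT" + "GGCCGGCCGG"
row_2 = "ACGTACGTAC" + "AAAAAAAAAA" + "GGCCGGCCGG"
print(conserved_windows(row_1, row_2, window=5, min_identity=1.0))
```

The resulting intervals could then serve as the units whose groupings and combinations are studied, rather than scoring each position in isolation.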
Evo-bio project. With new genomes coming out of sequencing centers every week, comparative analysis is becoming the most powerful approach to unraveling the biological meaning of genomic sequences. Comparative genomics relies heavily on algorithms developed by evolutionary biologists. However, there is a problem: implementation. The idea of the Evo-bio project is to develop a set of Perl modules implementing the numerous sequence analysis algorithms developed by evolutionary biologists. Once this project is complete, complex analyses, such as analyzing patterns of nucleotide substitutions in a sample of sequences, will be as easy as typing use EvoBio.
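To give a sense of the kind of analysis meant here, the following sketch tallies transitions and transversions across all pairs in a small sample of aligned sequences. It is written in Python purely for illustration and does not represent the EvoBio modules or their interface, which are not described here; the function name and the sample sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_pattern(sequences):
    """Tally transitions and transversions over all sequence pairs.

    Assumes the sequences are aligned and of equal length. Columns with
    gaps or ambiguous bases are ignored. Counts are raw pairwise
    differences, with no correction for multiple hits or shared ancestry.
    """
    counts = Counter()
    for seq_1, seq_2 in combinations(sequences, 2):
        if len(seq_1) != len(seq_2):
            raise ValueError("sequences must be aligned to the same length")
        for a, b in zip(seq_1.upper(), seq_2.upper()):
            if a == b or a not in "ACGT" or b not in "ACGT":
                continue
            # A transition stays within purines or within pyrimidines.
            same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
            counts["transition" if same_class else "transversion"] += 1
    return counts


# Hypothetical aligned sample.
sample = ["ACGT", "GCGT", "ACTT"]
pattern = substitution_pattern(sample)
print(pattern["transition"], pattern["transversion"])
```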