Bioinformatics Main Page
Department of Computer Science
Computer Learning Research Centre
Royal Holloway, University of London

Bioinformatics is an active research area in the Department of Computer Science, Royal Holloway, University of London (RHUL). The work applies machine learning techniques such as Confidence Machines (CM) and Support Vector Machines (SVM) to a wide range of tasks in structural, comparative and functional genomics. It is supported by grant 111/BIO14428, "Pattern Recognition Techniques for Gene Identification in Plant Genomic Sequences", from the UK Biotechnology and Biological Sciences Research Council (BBSRC), and is conducted in collaboration with the School of Biological Sciences, RHUL, and the software development company Softberry Inc. (USA).

 

Improving plant gene and promoter prediction

We are developing a plant promoter database and promoter prediction algorithms.

  • We collected RNA Pol II plant promoter sequences with experimentally verified transcription start sites (TSS) for more than 300 genes from various species.
  • We explored and created weight matrices for the major transcription regulatory signals, such as the TATA box, the CCAAT element and the sequence around the TSS.
  • Plant promoter prediction programs developed using the promoter database and the Softberry Plant Regulatory motifs database are available on this server.
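To illustrate what a weight matrix for a regulatory signal looks like in practice, the following is a minimal sketch that builds a log-odds position weight matrix from aligned sites and uses it to find the best-scoring window in a sequence. The example sites, the uniform background of 0.25 and the pseudocount are assumptions made for illustration; they are not taken from the promoter database or from the matrices used on this server.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds weights for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical TATA-like sites, invented for this example only.
TOY_SITES = ["TATAAA", "TATATA", "TATAAA", "TTTAAA", "TATAAG"]

pwm = build_pwm(TOY_SITES)
hit_score, hit_pos = best_hit(pwm, "GCGCGTATAAAGGCC")
print(f"best window starts at {hit_pos} with score {hit_score:.2f}")
```

A real matrix would be estimated from many more sites and would typically use a background composition measured on the genome in question.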

Plant genome annotation

We annotate genes and promoters of the model organism Arabidopsis thaliana (presented in the Softberry Genome Explorer) and of the rice genome. Using the FGENESH++ program from Softberry together with known full-length cDNA information, we predicted 27,509 genes (proteins) on the five Arabidopsis chromosome sequences; more than 90% of the predicted proteins show significant similarity to entries in the NR database. For comparison, NCBI in collaboration with TIGR predicted 25,795 Arabidopsis genes. About 38% of our predicted genes and the NCBI-predicted genes are the same in the sense that they encode identical proteins, and more than 78% encode identical or virtually identical (>95%) proteins. In approximately 4% of cases, either the FGENESH++ or the NCBI/TIGR prediction contained insertions or deletions with respect to the other, possibly corresponding to additional internal exons. More than 88% of genes predicted by the two methods occupy overlapping chromosome regions, and less than 2% of these do not show significant sequence similarity (BLAST E-value < 10^-32). Less than 2% of genes predicted by either method did not overlap with a gene predicted by the other method.
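As a rough illustration of how an overlap statistic between two gene sets can be computed, the sketch below counts the fraction of genes in one set that share at least one base with a gene in the other. The (chromosome, start, end) tuple layout and the coordinates are invented for the example; this is not the procedure or the data behind the figures quoted above.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) genes share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def fraction_overlapping(genes_a, genes_b):
    """Fraction of genes in genes_a that overlap some gene in genes_b."""
    hits = sum(any(overlaps(a, b) for b in genes_b) for a in genes_a)
    return hits / len(genes_a)

# Hypothetical predictions from two methods (inclusive coordinates).
set_a = [("chr1", 100, 500), ("chr1", 900, 1500), ("chr2", 50, 300), ("chr2", 1000, 1200)]
set_b = [("chr1", 450, 800), ("chr1", 1600, 2000), ("chr2", 100, 250)]

print(fraction_overlapping(set_a, set_b))
print(fraction_overlapping(set_b, set_a))
```

The all-against-all comparison is quadratic in the number of genes; for whole-genome gene sets, sorting the genes by position and sweeping through them once would be the more practical choice.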

Prokaryotic promoter prediction

We are developing a method for recognising prokaryotic promoter regions together with their transcription start points. The method is based on the Sequence Alignment Kernel (SAK), a function that gives a quantitative measure of similarity between two sequences. This kernel function is then used in a dual Support Vector Machine, which finds the discrimination surface.
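The role a kernel plays in a dual-form classifier can be sketched as follows. Two substitutions are made deliberately so that the example stays short and self-contained: a k-mer spectrum kernel stands in for the Sequence Alignment Kernel, and a kernel perceptron stands in for SVM training. Both learners share the same dual decision function, f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, which is the point being illustrated. The training sequences are invented.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Stand-in similarity: dot product of k-mer count vectors (not the SAK)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[m] * cy[m] for m in cx)

def decision_value(x, train, labels, alphas, bias, kernel=spectrum_kernel):
    """Dual-form decision function: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, labels, train)) + bias

def train_kernel_perceptron(train, labels, kernel=spectrum_kernel, epochs=20):
    """Kernel perceptron (a simple stand-in for SVM training in the dual)."""
    alphas = [0] * len(train)
    bias = 0
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(train, labels)):
            if y * decision_value(x, train, labels, alphas, bias, kernel) <= 0:
                alphas[i] += 1      # misclassified: give this example more weight
                bias += y
                mistakes += 1
        if mistakes == 0:           # every training example is on the correct side
            break
    return alphas, bias

# Hypothetical toy data: +1 for promoter-like, -1 for non-promoter.
train = ["TTGACATATAAT", "TTGACTTATAAT", "GCGCGGCCGCGG", "CCGGCGCGCCGG"]
labels = [1, 1, -1, -1]

alphas, bias = train_kernel_perceptron(train, labels)
print(decision_value("TTGACAGATAAT", train, labels, alphas, bias))
print(decision_value("GGCGCGCCGCGC", train, labels, alphas, bias))
```

Because the classifier touches the sequences only through the kernel, swapping in a different similarity function changes nothing else in the code.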

The Sequence Alignment Kernel method has been tested on a set of 669 known sigma-70 promoters of Escherichia coli alongside several other methods and has shown comparable results, while being a rather general technique that can be applied to other problems (see L. Gordon, A. Ya. Chervonenkis, A. J. Gammerman, I. A. Shahmuradov and V. Solovyev (2003), Sequence Alignment Kernel for recognition of promoter regions, to appear in Bioinformatics).

However, when it comes to searching for a promoter in a long sequence, we cannot use the SVM binary decision rule as it is, because doing so would yield too many false positives. So we take one step back and work with the signed distance to the discrimination surface, which can be regarded as a measure of how likely the example is to be a promoter. Moving the threshold up and down allows us to trade off the false positive rate against the false negative rate.
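The trade-off described here can be made concrete with a small sketch. Given signed distances for a set of examples whose true labels are known, raising the threshold reduces false positives at the cost of more false negatives. The scores and labels below are invented for illustration.

```python
def error_rates(scores, is_promoter, threshold):
    """Return (false_positive_rate, false_negative_rate) at a given threshold."""
    fp = sum(s >= threshold and not p for s, p in zip(scores, is_promoter))
    fn = sum(s < threshold and p for s, p in zip(scores, is_promoter))
    negatives = sum(not p for p in is_promoter)
    positives = sum(is_promoter)
    return fp / negatives, fn / positives

# Hypothetical signed distances and true labels.
scores = [-1.8, -0.9, -0.2, 0.1, 0.4, 0.7, 1.3, 2.1]
is_promoter = [False, False, False, False, True, False, True, True]

for threshold in (0.0, 0.5, 1.0):
    fpr, fnr = error_rates(scores, is_promoter, threshold)
    print(f"threshold={threshold:+.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

A threshold of 0 corresponds to the plain binary SVM rule; any other value shifts the decision boundary parallel to the discrimination surface.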

  • A demo version of the SAK-based method is available online. It scans the given sequence and reports, for every base, a score indicating how likely that base is to be a TSS.
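A scan of this kind can be sketched as a sliding window that treats each base in turn as a candidate TSS and scores the surrounding context. The window sizes and the `toy_scorer` function are hypothetical placeholders; in the actual method the score would be the SVM signed distance computed with the Sequence Alignment Kernel, and the demo's real window layout is not stated here.

```python
def scan(sequence, scorer, upstream=6, downstream=2):
    """Score every position that has a full window around it as a candidate TSS."""
    profile = {}
    for tss in range(upstream, len(sequence) - downstream):
        window = sequence[tss - upstream:tss + downstream + 1]
        profile[tss] = scorer(window)
    return profile

def toy_scorer(window):
    """Hypothetical stand-in for the SVM signed distance: AT fraction minus 0.5."""
    return sum(base in "AT" for base in window) / len(window) - 0.5

profile = scan("GCGCGCTATAATGCGCGC", toy_scorer)
for position, value in profile.items():
    print(position, round(value, 3))
```

Positions too close to either end of the sequence are skipped because they lack a complete window.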

Running our program along the whole genome of E. coli, we obtained a "promoter map" that can be used to locate as yet undiscovered promoters in this organism.

  • An online viewer for this map plots the prediction curve together with other information that is known about the genome (known and putative genes, known sigma-70 promoters, known promoters of other kinds), which might help in decision making.