Bioinformatics is an active research area in the Department of Computer Science, Royal Holloway, University of London (RHUL). This work applies
machine learning techniques such as Confidence Machines (CM) and Support Vector Machines (SVM)
to a wide range of tasks in structural, comparative and functional
genomics. It is supported by grant 111/BIO14428 "Pattern Recognition Techniques
for Gene Identification in Plant Genomic Sequences" from the UK Biotechnology
and Biological Sciences Research Council (BBSRC) and is conducted in collaboration with the
School of Biological Sciences, RHUL, and the software development company
Softberry Inc. (USA).
Improving plant gene and promoter prediction
We are developing a plant promoter database and promoter prediction algorithms.
- We collected RNA Pol II plant promoter sequences with experimentally
verified transcription start sites (TSS) for more than 300 genes from
various species.
- We explored and created weight matrices for major transcription
regulatory signals such as TATA and CCAAT elements
and the sequence around the transcription start site (TSS).
- Plant promoter prediction programs developed using
the promoter database and the
Softberry Plant Regulatory motifs database are available on this server.
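
A weight matrix of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the matrices or programs described here: the function names, the pseudocount, the uniform background frequency of 0.25 and the four toy TATA-like sites are all assumptions made for the example. It expects uppercase sequences containing only A, C, G and T.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned sites of equal length."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(width):
        # Start from a pseudocount so unseen bases do not give log(0).
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds weights for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, position) of the best-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned sites, for illustration only.
toy_sites = ["TATAAA", "TATATA", "TATAAA", "TATAAT"]
toy_pwm = build_pwm(toy_sites)
print(best_hit(toy_pwm, "GCGCTATAAAGGC"))
```

Log-odds weights make the window score additive over positions, so a positive total means the window looks more like the collected sites than like the assumed background.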
Plant genome annotation
This covers the annotation of genes and promoters of the model organism Arabidopsis thaliana
(presented in Softberry Genome Explorer) and the annotation of the rice genome.
Using the FGENESH++ program from Softberry and information from known full-length cDNAs,
we predicted 27,509 genes (proteins) on the five Arabidopsis chromosome sequences;
more than 90% of our predicted proteins show significant similarity to the NR database
(for comparison, NCBI in collaboration with TIGR predicted 25,795 Arabidopsis genes).
About 38% of our predicted genes and the NCBI predicted genes are the same in the sense that
they encode identical proteins. More than 78% of predicted genes encode identical or
virtually identical (>95%) proteins. In approximately 4% of cases either the FGENESH++
or the NCBI/TIGR prediction contained insertions/deletions with respect to the other,
possibly corresponding to additional internal exons. More than 88% of genes predicted
by the two methods occupy overlapping chromosome regions, and less than 2% of these do
not show significant sequence similarity (BLAST E-value < 10^-32). Less than
2% of genes predicted by either method did not overlap with a gene predicted by the other
method.
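
The notion of two predictions "occupying overlapping chromosome regions" can be made concrete with a small sketch. This is not the procedure used to obtain the figures above; the function names, the inclusive (chromosome, start, end) tuple layout and the toy coordinates are assumptions for illustration.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) regions share at least one base.

    Coordinates are assumed to be inclusive and on the same strand-agnostic scale.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def fraction_overlapping(genes_a, genes_b):
    """Fraction of genes in genes_a that overlap at least one gene in genes_b."""
    hits = sum(1 for a in genes_a if any(overlaps(a, b) for b in genes_b))
    return hits / len(genes_a)

# Hypothetical coordinates, for illustration only.
set_a = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 300)]
set_b = [("chr1", 450, 800), ("chr2", 400, 700)]

print(fraction_overlapping(set_a, set_b))  # 1 of 3 genes in set_a overlaps set_b
print(fraction_overlapping(set_b, set_a))  # 1 of 2 genes in set_b overlaps set_a
```

The all-pairs comparison is quadratic in the number of genes; for genome-scale sets a sorted sweep or an interval tree would be the usual choice.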
Prokaryotic promoter prediction
We are developing a method for the recognition of prokaryotic promoter regions
together with their transcription start points. The method is based on the Sequence Alignment
Kernel (SAK), a function that gives a quantitative measure of similarity between
two sequences. This kernel function is then used in a Dual Support Vector
Machine, which finds the discrimination surface.
The Sequence Alignment Kernel method has been tested on a set of 669
known sigma-70 promoters of Escherichia coli, together
with several other methods, and has shown comparable results
while being a rather general technique that can be applied to different problems
(see L. Gordon, A. Ya. Chervonenkis, A. J. Gammerman, I. A. Shahmuradov
and V. Solovyev (2003), Sequence Alignment Kernel for recognition
of promoter regions, to appear in Bioinformatics).
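
In the dual form of an SVM, a sequence x is scored as f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, where K is the kernel and the x_i are the support sequences. The sketch below shows only this scoring step. It does not implement the Sequence Alignment Kernel: a simple normalised k-mer (spectrum) kernel is swapped in as a stand-in, and the support sequences, coefficients and bias are hypothetical values rather than the result of training.

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Stand-in kernel (NOT the SAK): cosine-normalised k-mer count product."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(cs[m] * ct[m] for m in cs)  # Counter returns 0 for missing k-mers
    norm = math.sqrt(sum(v * v for v in cs.values()) *
                     sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0

def decision_value(x, support, kernel, bias=0.0):
    """Dual-form SVM score: sum of alpha * y * K(support_seq, x), plus bias.

    support is a list of (sequence, label, alpha) with label in {+1, -1}.
    """
    return sum(alpha * y * kernel(sv, x) for sv, y, alpha in support) + bias

# Hypothetical support set and coefficients, for illustration only.
toy_support = [
    ("TTGACATATAAT", +1, 1.0),
    ("GCGCGCGCGCGC", -1, 1.0),
]
print(decision_value("TTGACATATAAT", toy_support, spectrum_kernel))
print(decision_value("GCGCGCGCGCGC", toy_support, spectrum_kernel))
```

Because the dual form touches the data only through K, any similarity function that is a valid kernel can be substituted for the stand-in without changing `decision_value`.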
However, when it comes to searching for a promoter in a long sequence,
we cannot apply the SVM binary rule as it is, because doing so would yield
too many false positives. Instead we take one step back and work with
the signed distance to the discrimination surface, which can be
regarded as a likelihood measure of the example being a promoter.
Moving the threshold up or down allows us to trade off the false positive
rate against the false negative rate.
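
The effect of moving the threshold can be illustrated with a short sketch. The scores and labels below are made-up values standing in for signed distances and known annotations; the function name and the convention "predict promoter when score >= threshold" are assumptions for the example.

```python
def error_rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a given threshold.

    labels are True for promoters; a sequence is called a promoter
    when its score is greater than or equal to the threshold.
    """
    pairs = list(zip(scores, labels))
    false_pos = sum(1 for s, y in pairs if s >= threshold and not y)
    false_neg = sum(1 for s, y in pairs if s < threshold and y)
    negatives = sum(1 for _, y in pairs if not y)
    positives = sum(1 for _, y in pairs if y)
    return false_pos / negatives, false_neg / positives

# Hypothetical signed distances and labels, for illustration only.
toy_scores = [-1.5, -0.2, 0.1, 0.4, 0.9, 1.7]
toy_labels = [False, False, False, True, False, True]

print(error_rates(toy_scores, toy_labels, 0.0))  # (0.5, 0.0)
print(error_rates(toy_scores, toy_labels, 1.0))  # (0.0, 0.5)
```

In this toy data, the default threshold of 0 finds every promoter at the cost of many false positives, while raising it to 1.0 removes the false positives but misses one promoter.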
- A demo version of the SAK-based method is
available online. It scans the given sequence and reports, for
every base, the likelihood that it is a TSS.
Running our program along the whole genome of E. coli,
we obtained a "promoter map" that can be used for locating
as yet undiscovered promoters in this organism.
- An online viewer
for this map plots the prediction curve together with other information
that is known about the genome (known and putative genes,
known sigma-70 promoters, known promoters of other kinds),
which may help in decision making.
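
A scan that assigns a score to every candidate TSS position can be sketched as a sliding window. This is only an outline of the idea, not the demo program: the window sizes, the function names and the toy A/T-fraction scorer are assumptions, and in practice the scorer would be the signed distance produced by the trained classifier.

```python
def scan(sequence, score_fn, upstream=60, downstream=20):
    """Score every position that has a full window around it.

    For a candidate TSS at index tss, the window covers
    sequence[tss - upstream : tss + downstream].
    Returns a list of (tss, score) pairs.
    """
    profile = []
    for tss in range(upstream, len(sequence) - downstream + 1):
        window = sequence[tss - upstream:tss + downstream]
        profile.append((tss, score_fn(window)))
    return profile

def at_fraction(window):
    """Toy scorer (a placeholder, not a promoter model): fraction of A/T bases."""
    return sum(1 for base in window if base in "AT") / len(window)

# Short hypothetical sequence and small windows, for illustration only.
toy_sequence = "GCGCGCTATAATGCGCGC"
toy_profile = scan(toy_sequence, at_fraction, upstream=4, downstream=2)
best_position, best_score = max(toy_profile, key=lambda item: item[1])
print(best_position, best_score)
```

Plotting the second element of each pair against the first gives a prediction curve of the kind a map viewer would display.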