9th Annual "Humies" Awards
2012 — Philadelphia, Pennsylvania
Genetic Programming Based Feature Generation for Automated DNA Sequence Analysis
Uday Kamath, Amarda Shehu, Kenneth A. De Jong
Department of Computer Science
George Mason University
Fairfax, VA 22030
{ukamath, amarda, kdejong}@gmu.edu
Bioinformatics and Molecular Biology
Larrañaga P et al. Brief Bioinform 2006;7:86-112
Promoter Site Identification
Background
• Promoters signal the beginning of a coding region
• They are important signals for the initiation of DNA->RNA transcription
Challenges
• Complex
• Gene-specific
• Many decoys
Copyright 2012 the British Journal of Anaesthesia
DNA Splice Site Identification
Background
• Splice sites mark boundaries between exons and introns in a gene
Challenges
• No known sequence pattern
i. Diverse sequence length
ii. Diverse exon lengths
iii. Diverse number and lengths of introns
• 0.1 to 1% true splice sites, rest decoys
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. Tutorial: Support Vector Machines and Kernels for Computational Biology [2008]
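A small worked example may help show why the 0.1 to 1% figure makes precision the metric to watch. The counts below are hypothetical and chosen only for illustration; they are not taken from the authors' data sets.

```python
def precision(tp, fp):
    """Fraction of predicted splice sites that are true splice sites."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all candidate sites that are labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical data set: 100,000 candidate sites, 1% of them true splice sites.
n_true, n_decoy = 1_000, 99_000

# Classifier A labels every candidate as a decoy.
acc_a = accuracy(tp=0, fp=0, tn=n_decoy, fn=n_true)
prec_a = precision(tp=0, fp=0)

# Classifier B finds 900 of the true sites but also flags 1,980 decoys (2%).
tp_b, fp_b = 900, 1_980
acc_b = accuracy(tp=tp_b, fp=fp_b, tn=n_decoy - fp_b, fn=n_true - tp_b)
prec_b = precision(tp=tp_b, fp=fp_b)

print(f"A: accuracy={acc_a:.4f} precision={prec_a:.4f}")
print(f"B: accuracy={acc_b:.4f} precision={prec_b:.4f}")
```

Under these assumed counts, the classifier that never predicts a splice site has the higher accuracy, and even a 2% false-positive rate on decoys pulls precision far below recall. This is one way to read the slide's emphasis on decoys.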
Evolutionary (GP) Approach
Finding Functional Features
• GP Functional Features
Terminals
– A, C, T, G
– Integers for position/region
Basic Non Terminals
– Motif (combination of ACTG)
– Position based Motifs
– Correlation based Motifs
– Region based Motifs
– Composition based Motifs
Complex Non Terminals
– Conjunctions
– Disjunctions
– Negations
• Features evolved by combining accuracy and precision
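The terminals and non-terminals listed above can be read as the node types of an expression tree that is evaluated against a DNA sequence. The following is a minimal sketch of that idea, not the authors' implementation: the class names, the 0-based indexing, the half-open region convention, and the example feature are all assumptions made for illustration, and the correlation-based and composition-based motifs are omitted.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Motif:
    """Basic non-terminal: the motif occurs anywhere in the sequence."""
    pattern: str

    def matches(self, seq: str) -> bool:
        return self.pattern in seq


@dataclass(frozen=True)
class PositionMotif:
    """The motif starts exactly at a given 0-based position (assumed convention)."""
    pattern: str
    position: int

    def matches(self, seq: str) -> bool:
        return seq.startswith(self.pattern, self.position)


@dataclass(frozen=True)
class RegionMotif:
    """The motif occurs inside the half-open region [start, end)."""
    pattern: str
    start: int
    end: int

    def matches(self, seq: str) -> bool:
        return self.pattern in seq[self.start:self.end]


@dataclass(frozen=True)
class And:
    """Complex non-terminal: conjunction of two sub-features."""
    left: "Feature"
    right: "Feature"

    def matches(self, seq: str) -> bool:
        return self.left.matches(seq) and self.right.matches(seq)


@dataclass(frozen=True)
class Or:
    """Complex non-terminal: disjunction of two sub-features."""
    left: "Feature"
    right: "Feature"

    def matches(self, seq: str) -> bool:
        return self.left.matches(seq) or self.right.matches(seq)


@dataclass(frozen=True)
class Not:
    """Complex non-terminal: negation of a sub-feature."""
    child: "Feature"

    def matches(self, seq: str) -> bool:
        return not self.child.matches(seq)


Feature = Union[Motif, PositionMotif, RegionMotif, And, Or, Not]

# Hypothetical feature: "GT" at position 3, and either "TATA" anywhere
# or "AAG" somewhere in positions 4..8.
feature = And(PositionMotif("GT", 3), Or(Motif("TATA"), RegionMotif("AAG", 4, 9)))

print(feature.matches("CAGGTAAGT"))  # "GT" at 3 and "AAG" in region
print(feature.matches("CAGCCAAGT"))  # no "GT" at position 3
```

In a GP run, trees like `feature` would be the individuals that crossover and mutation operate on; each tree yields one boolean value per sequence, which a downstream classifier can consume.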
Why Human Competitive?
B) The result is equal to or better than a result that was accepted as a new scientific result
E) The result is equal to or better than the most recent human-created solution to a long-standing problem
F) The result is equal to or better than a result that was considered an achievement when it was first discovered
G) The result solves a problem of indisputable difficulty in its field
Why Human Competitive?
B) The result is equal to or better than a result that was accepted as a new scientific result
F) The result is equal to or better than a result that was considered an achievement when it was first discovered
Splice Site Prediction
• Research compares against state-of-the-art enumeration, iterative, probabilistic, and kernel methods, among others
• Best precision, with statistically significant improvements on most datasets
Promoter Prediction
• Research compares results with 7 state-of-the-art algorithms, ranging from enumeration and iterative methods to neural networks and kernel-based methods
• Best precision, with statistically significant improvements on different datasets
Why Human Competitive?
F) The result is equal to or better than a result that was considered an achievement when it was first discovered
On the Promoter Identification Problem
What was considered an achievement
Where we stand
Uday Kamath, Kenneth A. De Jong, and Amarda Shehu. "An Evolutionary-based Approach for Feature Generation: Eukaryotic Promoter Recognition." IEEE Congress on Evolutionary Computation (IEEE CEC), New Orleans, LA, pp. 277-284, 2011
Why Human Competitive?
F) The result is equal to or better than a result that was considered an achievement when it was first discovered
On the Splice Site Identification Problem
What was considered an achievement
Where we stand
Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction." Trans Comp Biol and Bioinf, 2012
Why Human Competitive?
E) The result is equal to or better than the most recent human-created solution to a long-standing problem
Long-Standing Problem(s)
Genome sequence prediction and annotation of splice sites and promoters
Computational Results Equal or Better
Around 7 datasets and 10 algorithms compared
Advancing Understanding in Genomics
• Our top features contain signals painstakingly determined by biologists through decades of wet-lab research
• More importantly, new features are found that may help biologists further advance their understanding of DNA architecture
• All our features are available online for experts to analyze and to spur further wet-lab research
Why Human Competitive?
G) The result solves a problem of indisputable difficulty in its field
• Estimated 10-25K human protein-coding genes (only 1.5% of the entire genome)
• Wet-lab models of discovery are costly and prone to errors
• Cannot keep pace with growing genomic sequences
• Computational models are good complements, but
– Black-box models: little or no help to biologists
– White-box models: lower precision/accuracy and reliant on manual steps
• Decades of research into DNA function and architecture
• "Gene finding" on PubMed returns > 80,000 research articles
• Progress is crucial to speed up our understanding of disease and the development of targeted treatments
Why Is This the Best Entry?
• Addresses problems central to molecular biology and health research
• Finding functional signals in genome sequences is complex and NP-hard
• Improvements over state of the art are statistically significant
• Extensive statistical analysis validates usefulness of GP features
– F-score and Information gain techniques
• Advances understanding to motivate further research
– Features found by GP reproduce results of decades of research by biologists
– Novel interesting features also reported
– Features, data sets, and software publicly available for community
• Far reaching implications, spurring research beyond genomics
– Example: finding what features determine anti-microbial activity for the
purpose of generating novel peptides to combat drug resistance.
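The slide names information gain as one of the techniques used to validate the GP features. As a rough illustration of what that measure computes for a single boolean feature, here is a minimal sketch; the function names and the toy labels are hypothetical, and the authors' actual analysis pipeline is not described in the slides.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a boolean feature."""
    n = len(labels)
    on = [y for x, y in zip(feature_values, labels) if x]
    off = [y for x, y in zip(feature_values, labels) if not x]
    conditional = len(on) / n * entropy(on) + len(off) / n * entropy(off)
    return entropy(labels) - conditional


# Toy labels: 1 = true site, 0 = decoy (hypothetical data).
labels = [1, 1, 0, 0]
perfect_feature = [True, True, False, False]   # fires exactly on true sites
useless_feature = [True, False, True, False]   # fires independently of the label

print(information_gain(perfect_feature, labels))
print(information_gain(useless_feature, labels))
```

A feature that separates the classes perfectly recovers the full entropy of the labels, while a feature unrelated to the label gains nothing; ranking evolved features by this score is one plausible way to decide which ones deserve a biologist's attention.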