Data to Biology
Shankar Subramaniam
University of California at San Diego
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
• KNOWLEDGE EXTRACTION FROM DATA
– DEALING WITH THE COFFEE DRINKERS PROBLEM
– HOW CAN BIOLOGICAL DATA BE INTEGRATED?
– DEFINING THE GRANULARITY OF DATA
– UNBIASED STATISTICAL METHODS
– BIOLOGY-CONSTRAINED METHODS
– INFORMATION METRICS
– HOW DO WE DEAL WITH CONTEXT?
• NOISY DATA
– CAN WE DEFINE HOW MUCH NOISE AND WHAT TYPE OF NOISE CAN BE TOLERATED IN EXTRACTING KNOWLEDGE?
– IS MISSING DATA TANTAMOUNT TO NOISE? IF NOT, HOW DO WE DEAL WITH IT?
• CLASSIFICATION OF MODULARITY FROM DATA
– HOW CAN WE DEFINE MODULES (FUNCTIONAL, SPATIAL, TEMPORAL, ETC.) FROM DATA?
– WHAT IS THE INFORMATION CONTENT IN THE MODULES?
– CAN WE COMPARE MODULES QUANTITATIVELY?
• DEALING WITH DYNAMICAL DATA
– HOW DO WE DEAL WITH TIME SERIES DATA?
– HOW IS INFORMATION PROCESSED IN TIME SERIES DATA?
– WHAT GRANULARITY AND CONTEXT ARE NECESSARY TO ANALYZE THIS DATA?
• The coffee drinkers problem [highly skewed distributions]:
– 90% of people are coffee drinkers
• What does this say about making drink predictions that are 90% accurate? (Always guessing "coffee" already achieves that.)
• Biology is all about highly skewed distributions, which pose significant challenges for methods, measures, and validation
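The point about 90% accuracy can be made concrete with a small Python sketch. The population of 100 people, the "tea" label for the minority, and all function names are assumptions made for illustration; none of them come from the talk. The sketch always predicts the majority drink and then compares plain accuracy with per-class recall.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label; always predicting it is the trivial baseline."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the true `positive` cases that were predicted as `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# Hypothetical population: 90 coffee drinkers and 10 tea drinkers.
labels = ["coffee"] * 90 + ["tea"] * 10

# The "classifier" ignores every feature and always answers with the majority drink.
guess = majority_baseline(labels)
predictions = [guess] * len(labels)

baseline_accuracy = accuracy(labels, predictions)
coffee_recall = recall(labels, predictions, "coffee")
tea_recall = recall(labels, predictions, "tea")
balanced_accuracy = (coffee_recall + tea_recall) / 2

print(f"accuracy={baseline_accuracy:.2f}  tea recall={tea_recall:.2f}  "
      f"balanced accuracy={balanced_accuracy:.2f}")
```

Under these assumptions the baseline is 90% accurate yet never identifies a tea drinker, which is why a skew-aware measure (here balanced accuracy, chosen only as one possible example) is more informative than raw accuracy.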
• The coffee drinkers problem – examples:
– 99% of us likely do not have the disease one might be looking for
– 99% of protein interactions are accounted for by 5% of the proteins
– 99% of the known disease-implicated mutations occur in less than 5% of the people
– (all estimates, but largely realistic)
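To see what a 99% / 1% split does to a prediction, the following Python sketch applies Bayes' rule to a hypothetical screening test. The 1% prevalence mirrors the first example above; the 99% sensitivity and specificity are invented numbers chosen for illustration, and the function name is an assumption.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 99% sensitive and 99% specific, applied where 1% have the disease.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(f"Chance that a positive result is a true case: {ppv:.2f}")
```

With these assumed numbers, roughly half of all positive calls are false, even though the test is "99% accurate" on each class taken separately.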
• The coffee drinkers problem:
– Most current techniques in data analysis are rendered useless because of this.
– Statistical significance with a meaningful null hypothesis is critical (information content is one of the most commonly used measures even today)
– Simulation-based methods often do not work, so analytical methods are required
– Methods must optimize for these analytical measures of quality
– Validation in the absence of complete data is hard
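One way to read the remark about simulation versus analytics is that rare events fall below the resolution of a sampling-based estimate. The Python sketch below illustrates this under assumptions that are not in the talk: a hypergeometric null (drawing n genes at random from N, of which K are annotated), made-up values for N, K, n and k, and hypothetical function names.

```python
import random
from math import comb


def hypergeom_tail(N, K, n, k):
    """Exact P(X >= k) when n items are drawn without replacement
    from N items, K of which are marked."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return favourable / total


def monte_carlo_tail(N, K, n, k, trials, seed=0):
    """Estimate the same tail probability by repeated random draws."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        draw = rng.sample(range(N), n)
        overlap = sum(1 for g in draw if g < K)  # items 0..K-1 are the marked ones
        if overlap >= k:
            count += 1
    return count / trials


# Hypothetical setting: 1000 genes, 50 annotated, a list of 20 genes, 8 of them annotated.
N, K, n, k = 1000, 50, 20, 8
trials = 2000

exact_p = hypergeom_tail(N, K, n, k)
estimated_p = monte_carlo_tail(N, K, n, k, trials)

print(f"exact p-value     : {exact_p:.3e}")
print(f"simulated p-value : {estimated_p:.3e} (resolution {1 / trials:.1e})")
```

The simulated estimate can only take values that are multiples of 1/trials, so a p-value far below that resolution cannot be distinguished from zero without an impractical number of draws, whereas the closed-form tail gives it directly.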
• The coffee drinkers problem (real examples)
– When is a module in a network significant?
– When is an observed mutation in a sequenced phenotype-implicated genome significant?
– When is an alignment of two networks significant?
– When is correlation in time-course microarray data significant?
Conversely:
– How do we detect the most significant modules in a network?
– How do we identify all phenotype-implicated mutations from a large number of sequenced diseased and normal genomes?
– How do we align networks to obtain the most statistically significant alignments?
– How do we find the most correlated signals and their associated groups of genes?
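As a sketch of what "when is a module significant?" could mean in practice, the Python example below scores a candidate module by how many internal edges it has compared with a simple null model. The null model is an assumption made here, not the speaker's method: every pair of nodes is linked independently with probability equal to the overall edge density. That null ignores the skewed degree distributions the slides warn about, so it should be read only as a starting point. The toy graph and function names are hypothetical.

```python
from itertools import combinations
from math import comb


def binomial_tail(trials, successes, p):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
               for i in range(successes, trials + 1))


def module_p_value(nodes, edges, module):
    """Count edges inside `module` and return (count, p-value) under the
    assumed null of independent edges at the graph's overall density."""
    edge_set = {frozenset(e) for e in edges}
    density = len(edge_set) / comb(len(nodes), 2)
    pairs = list(combinations(sorted(module), 2))
    observed = sum(1 for pair in pairs if frozenset(pair) in edge_set)
    return observed, binomial_tail(len(pairs), observed, density)


# Toy graph: a 4-node clique plus three isolated edges (10 nodes, 9 edges).
nodes = list(range(10))
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (6, 7), (8, 9)]

dense_edges, dense_p = module_p_value(nodes, edges, {0, 1, 2, 3})
sparse_edges, sparse_p = module_p_value(nodes, edges, {4, 6, 8})

print(f"clique module : {dense_edges} internal edges, p = {dense_p:.2e}")
print(f"sparse module : {sparse_edges} internal edges, p = {sparse_p:.2f}")
```

The converse question on the slide (finding the most significant modules) would then amount to searching over node sets for the smallest such p-value, ideally under a null that respects the degree distribution.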
• The Hidden Terminal Problem:
– Consider a phenotype as reflected in its genetic variants (e.g., which nucleotide-level variations are associated with a disease).
– Often, these variations are not consistent (e.g., liver cancer manifests itself in gene mutations that are not all at the same place).
– However, these variations correspond to significantly aligned pathways in the underlying networks (i.e., they disrupt the same function, albeit by altering different genes).
– How do we go from an observable (phenotype/disease) to an abstraction (where the observable has little information content) to other abstractions (where the observable might have significant information content)?
– More importantly, how do we go backwards (predict observables)?
• The Hidden Terminal Problem: Specific Instance
– Start from observed mutations in a specific disease (liver and breast cancer both have significant genomic data available)
– The mutations result from noise, other phenotypes, and the specific disease. A simple intersection yields no signal.
– Cross-reference against synthetic lethality data.
– Redefine intersection over pathways.
– Reassess mutations under this definition and quantify the significance of these mutations w.r.t. the observed phenotype.
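The step from a gene-level intersection to a pathway-level one can be sketched in Python as follows. Everything in the example is hypothetical: the patient identifiers, the gene and pathway names, the gene-to-pathway map, and the function names are placeholders rather than real data. The synthetic-lethality cross-reference and the significance calculation from the list above are not covered.

```python
def common_items(collections):
    """Intersection of an iterable of sets (empty set if there are none)."""
    collections = list(collections)
    return set.intersection(*collections) if collections else set()


def disrupted_pathways(genes, gene_to_pathways):
    """All pathways touched by at least one mutated gene."""
    hit = set()
    for gene in genes:
        hit |= gene_to_pathways.get(gene, set())
    return hit


# Hypothetical mutated genes per patient with the same disease.
patient_mutations = {
    "patient_1": {"GENE_A", "GENE_X"},
    "patient_2": {"GENE_B", "GENE_Y"},
    "patient_3": {"GENE_C"},
}

# Hypothetical annotation: which pathways each gene takes part in.
gene_to_pathways = {
    "GENE_A": {"pathway_repair"},
    "GENE_B": {"pathway_repair"},
    "GENE_C": {"pathway_repair", "pathway_growth"},
    "GENE_X": {"pathway_metabolism"},
    "GENE_Y": {"pathway_growth"},
}

# Simple intersection over genes: no gene is mutated in every patient.
gene_level = common_items(patient_mutations.values())

# Intersection redefined over pathways: the same function is hit via different genes.
pathway_level = common_items(
    disrupted_pathways(genes, gene_to_pathways)
    for genes in patient_mutations.values()
)

print("shared genes   :", sorted(gene_level))
print("shared pathways:", sorted(pathway_level))
```

In this toy case the gene-level intersection is empty while one pathway is disrupted in every patient, which is the pattern the slide describes; whether such a shared pathway is significant would still need a null model for how often pathways are hit by chance.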