Overview of Biomedical Informatics
Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar
Team Members: Michael Steinbach, Rohit Gupta, Gowtham Atluri, Gang Fang,
Gaurav Pandey, Sanjoy Dey, Vanja Paunic
Collaborators: Brian Van Ness, Bill Oetting, Gary L. Nelsestuen, Christine Wendt,
Piet C. de Groen, Michael Wilson
Research Supported by NSF, IBM, BICB-UMR, Pfizer
Nov 12th, 2009
Understanding Biotechnology – The Science of the ‘Omics’
Biomedical Informatics
• Recent technological advances are helping to generate large amounts of biomedical data
  – Data from high-throughput experimental techniques
    - Gene expression data
    - Biological networks
    - Proteomics and metabolomics data
    - Single Nucleotide Polymorphism (SNP) data
  – Electronic Medical Records
    - IBM-Mayo Clinic partnership has created a DB of 5 million patients
• Great potential benefits from the analysis of these large-scale data sets:
  – Automated analysis of patients' histories for customized treatment
  – Discovery of biomarkers for complex diseases and other phenotypes
  – Cheminformatics and drug discovery
Large-scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
• New mantra
  – Gather whatever data you can whenever and wherever possible.
• Expectations
  – Gathered data will have value either for the purpose collected or for a purpose not envisioned.
[Figure panels: Homeland Security, Scientific Data, Geo-spatial data, Sensor Networks, Business Data, Computational Simulations]
Data Mining
• Automated techniques for analyzing large data sets.
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems.
Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of the other attributes
Tid  Employed  Level of Education  # years at present address  Credit Worthy (Class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
Model for predicting credit worthiness
Employed
  No: Credit Worthy = No
  Yes: Education
    Graduate: Number of years
      > 3 yr: Yes
      < 3 yr: No
    { High school, Undergrad }: Number of years
      > 7 yrs: Yes
      < 7 yrs: No
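The credit-worthiness model on this slide can be read as a set of nested rules: test Employed first, then Education, then Number of years (against 3 years for Graduate, 7 years for High school or Undergrad). The sketch below is one possible transcription of that reading, not the authors' code. The function name `credit_worthy` and the handling of exactly 3 or 7 years (which the slide leaves unspecified) are assumptions.

```python
def credit_worthy(employed: bool, education: str, years_at_address: float) -> bool:
    """Walk the tree: Employed, then Education, then Number of years."""
    if not employed:
        return False
    # Graduates are tested against 3 years, all other levels against 7 years.
    threshold = 3 if education == "Graduate" else 7
    # The slide only gives "> t" and "< t"; treating exactly t years as
    # "not credit worthy" is an assumption made for this sketch.
    return years_at_address > threshold


# The four records shown on the slide: (Employed, Level of Education, # years).
rows = [
    (True, "Graduate", 5),
    (True, "High School", 2),
    (False, "Undergrad", 1),
    (True, "High School", 10),
]
predictions = [credit_worthy(*row) for row in rows]
```

Applied to the four records, the rules return worthy, not worthy, not worthy, worthy, in that order.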
Discovering biomarkers
• Gene Expression Data
  – Given: n labeled subjects, each with expression levels of p genes
  – Objective: build a predictive model to identify cancer subtypes
[Figure: classical study of cancer subtypes, Golub et al. (1999), identification of diagnostic genes; axis label: Genes]
• SNP Data
  – Given: n labeled subjects, each with genotypes of p SNPs
  – Objective: build a model using genotypes to predict labels
           SNP 1  SNP 2  SNP 3  …  Class
Patient 1  AC     GT     AA     …  1
Patient 2  AA     GG     GG     …  0
…
Patient n  CC     GG     AG     …  1
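Before a model can be built from genotypes, the two-letter genotype strings have to be turned into numeric features. The sketch below shows one common choice, one-hot encoding of unordered genotypes, applied to the three example patients on this slide. The slide does not say which encoding was used, so the encoding itself and the helper name `encode_genotypes` are assumptions.

```python
def encode_genotypes(rows):
    """One-hot encode genotypes: one 0/1 indicator per (SNP index, genotype)."""
    # Sort the two alleles so that "CA" and "AC" count as the same genotype.
    normalized = [["".join(sorted(g)) for g in row] for row in rows]
    n_snps = len(normalized[0])
    # Collect the genotypes observed at each SNP, in a stable order.
    columns = [
        (j, g)
        for j in range(n_snps)
        for g in sorted({row[j] for row in normalized})
    ]
    features = [
        [1 if row[j] == g else 0 for (j, g) in columns] for row in normalized
    ]
    return columns, features


# Genotypes for SNP 1..3 of Patient 1, Patient 2 and Patient n from the slide.
genotypes = [["AC", "GT", "AA"], ["AA", "GG", "GG"], ["CC", "GG", "AG"]]
labels = [1, 0, 1]
columns, features = encode_genotypes(genotypes)
```

Each patient becomes a fixed-length 0/1 vector with exactly one active indicator per SNP, which any standard classifier can take together with `labels`.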
Predicting short-term vs. long-term survivors among myeloma subjects
• 3404 SNPs (selected according to potential relevance to myeloma)
• Cases: 70 patients who survived shorter than 1 year
• Controls: 73 patients who survived longer than 3 years
Brian Van Ness et al., Genomic Variation in Myeloma: Design, content and initial application of the Bank On A Cure SNP Panel to detect associations with progression free survival, BMC Medicine, 6:26, 2008.
[Figure labels: controls, cases, SNPs]
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Applications:
  – Finding groups of similar genes or proteins based upon their expression profiles
  – Clustering of patients based on phenotypic and genotypic factors for efficient disease diagnosis
  – Market segmentation
  – Document clustering
Courtesy: Michael Eisen
Michael Eisen et al, 1999
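As a rough illustration of grouping genes by their expression profiles, the sketch below links any two profiles whose Pearson correlation reaches a threshold and reports the connected groups as clusters (single-linkage clustering cut at a fixed similarity). The gene names, expression values, and the 0.9 threshold are made up for illustration, and this is not necessarily the method behind the figure on this slide. It assumes no profile is constant, since a constant profile has no defined correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_by_correlation(profiles, threshold=0.9):
    """Merge profiles into groups whenever their correlation >= threshold."""
    names = list(profiles)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical expression profiles (four conditions per gene).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
}
clusters = cluster_by_correlation(profiles)
```

With these toy values, the two rising profiles end up in one group and the two falling profiles in another.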
Association Pattern Discovery
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
• Biological applications
  – Identifying functional modules in protein interaction networks
  – Identifying transcription modules in gene expression data
  – Identifying biological entities associated with disease phenotypes
    - Biomarker discovery from genomic data, e.g. gene expression, single-nucleotide polymorphism (SNP), metabolite data, etc.
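To make the two discovered rules concrete, the sketch below computes their support and confidence on the five transactions listed on this slide. Support and confidence are the standard measures for association rules; the slide does not state them, so the choice of measures and the helper names `count`, `support`, and `confidence` are assumptions of this example.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
stats = [
    (support(a, c, transactions), confidence(a, c, transactions))
    for a, c in rules
]
```

Here {Milk} --> {Coke} holds in 3 of 5 transactions and in 3 of the 4 that contain Milk; {Diaper, Milk} --> {Beer} holds in 2 of 5 transactions and in 2 of the 3 that contain both Diaper and Milk.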
Discovery of Discriminative Patterns from Lung Cancer Gene Expression Data
• 67 normal samples, 102 cancer patients, 8787 genes [Stearman et al. 2005], [Su et al. 2007], [Bhattacharjee et al. 2001]
• Visualization of a size-10 pattern using a new discriminative pattern finding technique
  – Enriched with the TNF/NFkB signaling pathway, which is well known to be related to lung cancer
  – P-value: 1.4 × 10^-5 (6/10 overlap with the pathway)
Gang Fang, Rui Kuang, Gaurav Pandey, Michael Steinbach, Chad L. Myers and Vipin Kumar, Subspace Differential Coexpression Analysis: Problem Definition and A General Approach, In the Proceedings of the 15th Pacific Symposium on Biocomputing (PSB), pp. 145-156, 2010.
Discriminative Metabolite Patterns from Liver Cirrhosis Data
• 41 alcoholic liver cirrhosis cases (rows 1-41), 19 controls (rows 42-60), 3610 metabolites
  – Data from Gary Nelsestuen et al.
• A sample group of five metabolites having very similar (in relative terms) intensity values in cases, but mostly absent in controls.
  – (a) The rank values (black is 10, white is 0)
  – (b) Original intensity values
Gaurav Pandey, Gowtham Atluri, Michael Steinbach, Chad L. Myers and Vipin Kumar, An Association Analysis Approach to Biclustering, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 677-686, 2009.
Summary
• Data mining techniques hold great
promise for data-driven hypothesis
generation in the biomedical domain.
• Ample scope exists for the development
and application of novel techniques for the
analysis of different types of biomedical
data.
For further information…
• Visit www.cs.umn.edu/~kumar/dmbio.
• Send email to [email protected].
Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005.