Doing Machine Learning at the Wellcome Trust Sanger Institute
Manuela Zanda
Wellcome Trust Sanger Institute,
Cambridge, UK
The Sanger Institute
The Wellcome Trust Sanger Institute
Research at The Sanger Institute
Four main research areas ...
... but they all require informatics tools to analyse the data
The Genome
Genome ⇒ all the DNA in an organism or a cell
DNA is organised in chromosomes
Two copies of each chromosome
Genes ⇒ proteins ⇒ do all the work in a cell
The Concept of Genomic Variation
Humans share ∼ 99.9% of their genome.
What makes us different is the remaining 0.1%!
But then DNA is made of 3 billion base pairs, so 0.1% is still about 3 million differences...
Differences can be either:
Small changes, single DNA letter
Single Nucleotide Polymorphisms (SNPs)
...accgattaagcgaa...
...accgaataagcgaa...
Large structural changes:
Copy Number Variants – CNVs
Deletions
...accgattaagcgaa...
...accgacgaa...
Duplications
...accgattaagcgaa...
...accgattaagttaagcgaa...
More complex rearrangements
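The three sequence pairs on this slide can be told apart mechanically. The following is only a minimal sketch for short toy strings like these; the function name `classify_variant` and its labels are illustrative assumptions, not a tool used at the Sanger Institute, and real variant calling works on aligned sequencing or array data rather than on direct string comparison.

```python
def classify_variant(reference: str, variant: str) -> str:
    """Label a toy variant as SNP, deletion, duplication, insertion or complex."""
    if reference == variant:
        return "identical"
    if len(reference) == len(variant):
        mismatches = sum(r != v for r, v in zip(reference, variant))
        return "SNP" if mismatches == 1 else "complex"

    shorter = min(len(reference), len(variant))
    # Length of the prefix shared by both sequences.
    prefix = 0
    while prefix < shorter and reference[prefix] == variant[prefix]:
        prefix += 1
    # Length of the shared suffix, not allowed to overlap the prefix.
    suffix = 0
    while (suffix < shorter - prefix
           and reference[-1 - suffix] == variant[-1 - suffix]):
        suffix += 1

    if len(variant) < len(reference):
        # A simple deletion: the variant is the reference minus one block.
        return "deletion" if prefix + suffix == len(variant) else "complex"

    # The variant is longer: it must be the reference plus one inserted block.
    if prefix + suffix != len(reference):
        return "complex"
    inserted = variant[prefix:len(variant) - suffix]
    # A tandem duplication repeats the reference bases just before the insert.
    start = prefix - len(inserted)
    if start >= 0 and reference[start:prefix] == inserted:
        return "duplication"
    return "insertion"


reference = "accgattaagcgaa"
print(classify_variant(reference, "accgaataagcgaa"))       # SNP
print(classify_variant(reference, "accgacgaa"))            # deletion
print(classify_variant(reference, "accgattaagttaagcgaa"))  # duplication
```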
The Concept of Copy Number Variant (CNV)
[Courtesy of the Wellcome Trust Sanger Institute]
Why are we interested in Copy Number Variants?
Certain (familial) diseases are caused by CNVs
Certain CNVs protect against diseases (e.g. malaria)
These CNVs are present in healthy individuals
Impact on Disease
Changes in copy number can alter gene regulation
Not all of them! Which ones?
Identify mutations associated with disease
My Projects
My Role at the Sanger
Genomic mutation and genetic disease group [Matt Hurles]
Structural Variation
Copy Number Variation
in human populations
its role in diseases
Characterise mutation mechanisms
NextGen sequencing
Current focus on:
Developmental disorders (DDD)
Congenital heart disease
AIM: To improve clinical diagnosis
Projects Portfolio
NIDDK T1D DP3 CNV Project
In collaboration with Cambridge, UCL and UVa
Investigate the role of CNVs in Type I Diabetes
1000 Genomes Project
International Project
In-depth genotyping of difficult SVs with 400K microarray CGH
The Objective of the T1D DP3 CNV Project
To investigate the role of rare and common CNVs in disease susceptibility
What we already know
Diabetes is a complex disorder
Different genomic locations involved
Few (weak) associations with common CNVs in previous GWAS (WTCCC)
The Novelties of the T1D DP3 CNV Project
Target specific regions of the genome
Focus on difficult common CNVs (VNTRs)
Focus on rare CNVs
Use of different types of probes
internal probes
breakpoint probes
Family data
The Data: aCGH microarrays
[Figure: aCGH microarray. © 2008 Nature Education. All rights reserved.]
Each dot ≡ probe intensity → difference in copy number (CN) relative to the reference sample
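As a rough illustration of how an intensity relates to a copy-number difference, the sketch below computes a log2 ratio of test against reference intensity, the quantity plotted on the later CNV discovery slide. It assumes an idealised array where intensity is proportional to copy number and the reference sample carries two copies; both function names are hypothetical, and real aCGH ratios are noisier and usually compressed towards zero.

```python
import math


def log2_ratio(test_intensity: float, reference_intensity: float) -> float:
    """Log2 ratio of a test probe intensity against the reference intensity."""
    return math.log2(test_intensity / reference_intensity)


def expected_log2_ratio(copy_number: int, reference_copy_number: int = 2) -> float:
    """Idealised log2 ratio if intensity were exactly proportional to copy number."""
    return math.log2(copy_number / reference_copy_number)


for copy_number in (1, 2, 3, 4):
    print(copy_number, round(expected_log2_ratio(copy_number), 3))
# 1 -1.0
# 2 0.0
# 3 0.585
# 4 1.0
```

Under these assumptions a heterozygous deletion sits near -1 and a single-copy gain near +0.58, which is why gains are harder to separate from noise than losses.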
ML Problems
How Does Machine Learning Fit in All This?
Large Scale Data Analysis
Data denoising/transformations
Dimensionality reduction (60,000 features × 10,000 samples)
Data modelling
CNV discovery
Data Analysis Automation
Analysis made of multiple steps → automation?
Different approaches → Which one to choose?
Data Denoising/Transformation
Noise in the data
Amount of DNA per sample differs
8x60k aCGH array
Data converted to image → data loss!
Data has to be scaled
Quantile Normalisation
PCA
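Quantile normalisation, named above, forces every sample to share the same intensity distribution, which is one way to correct for differing amounts of DNA per sample. The following is a minimal pure-Python sketch of the textbook procedure, not the pipeline used in the project; the function name is an assumption, ties are broken by position rather than averaged, and a real analysis at this scale would use an array library.

```python
def quantile_normalise(samples):
    """Quantile-normalise a list of samples (each a list of probe intensities).

    Every sample is mapped onto the same reference distribution: the mean,
    across samples, of the values found at each rank.
    """
    n_probes = len(samples[0])
    if any(len(sample) != n_probes for sample in samples):
        raise ValueError("all samples must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value over all samples.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(sample[rank] for sample in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]

    normalised = []
    for sample in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = rank_means[rank]
        normalised.append(result)
    return normalised


print(quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```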
Dimensionality Reduction
Example: Summarising Probe Intensities
Input: ≈ 3-10 probes per CNV (12,000 CNVs) per sample (10,000 samples)
Output: estimate of the number of DNA copies in each sample
Two main ways:
Summarise probes: PCA, mean, median
no one size fits all!
Multidimensional problem
which model?
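The three summaries named on this slide (mean, median, PCA) can be sketched for a single CNV as follows. This is an illustration under assumptions: the input is a samples × probes list of intensities, all function names are hypothetical, and the first principal component is found by plain power iteration, which is adequate for 3-10 probes but can fail if the starting vector happens to be orthogonal to the leading component.

```python
import statistics


def summarise_mean(intensities):
    """One value per sample: the mean over its probes."""
    return [statistics.fmean(row) for row in intensities]


def summarise_median(intensities):
    """One value per sample: the median over its probes."""
    return [statistics.median(row) for row in intensities]


def summarise_pca(intensities, n_iter=200):
    """One value per sample: its score on the first principal component."""
    n_samples = len(intensities)
    n_probes = len(intensities[0])
    # Centre each probe (column) on its mean across samples.
    col_means = [sum(row[j] for row in intensities) / n_samples
                 for j in range(n_probes)]
    centred = [[row[j] - col_means[j] for j in range(n_probes)]
               for row in intensities]
    # Probe-by-probe covariance matrix (small: one row per probe).
    cov = [[sum(r[a] * r[b] for r in centred) / (n_samples - 1)
            for b in range(n_probes)] for a in range(n_probes)]
    # Power iteration for the leading eigenvector of the covariance matrix.
    vec = [1.0] * n_probes
    for _ in range(n_iter):
        new = [sum(cov[a][b] * vec[b] for b in range(n_probes))
               for a in range(n_probes)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0.0:
            break  # no variance left to explain
        vec = [x / norm for x in new]
    # Fix the arbitrary sign so that higher intensities give higher scores.
    if sum(vec) < 0:
        vec = [-x for x in vec]
    return [sum(r[j] * vec[j] for j in range(n_probes)) for r in centred]


probes = [[-1.0, -1.1, -0.9], [0.0, 0.1, -0.1], [0.5, 0.6, 0.4]]
print(summarise_median(probes))  # [-1.0, 0.0, 0.5]
```

The mean and median treat every probe equally, whereas the PCA score weights probes by how strongly they co-vary, which is one reason no single summary suits every CNV.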
Data Modelling: Clustering
[Figure: density plot for CNVR1877.2, Chr4:34916594−34919278. Model = G, ncomp = 3, 12 probes, Clustering Status = C. x-axis: ldf − pca; y-axis: Density]
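The figure caption reports a model with three components fitted to a one-dimensional summary. Assuming that "G" denotes a Gaussian mixture (the slide does not say so explicitly), a clustering of this kind can be sketched with a small expectation-maximisation loop. The function names, the quantile-based initialisation and the variance floor are all illustrative choices, not the method used in the project.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, n_components=3, n_iter=50, min_var=1e-4):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initialise the means at evenly spaced quantiles of the data.
    means = [ordered[int((k + 0.5) * n / n_components)]
             for k in range(n_components)]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    # Heuristic start: each cluster is assumed narrower than the whole data set.
    variances = [max(overall_var / n_components ** 2, min_var)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s)
                    for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means and variances from responsibilities.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2
                        for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, min_var)
    return weights, means, variances


def assign_clusters(values, weights, means, variances):
    """Assign each value to the component with the highest weighted density."""
    labels = []
    for v in values:
        dens = [w * normal_pdf(v, m, s)
                for w, m, s in zip(weights, means, variances)]
        labels.append(max(range(len(dens)), key=dens.__getitem__))
    return labels
```

With three well-separated groups of summary values, each fitted component would correspond to one copy-number class; deciding which class is which, and whether three components is the right number, is a separate modelling question.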
CNV Discovery of Rare CNVs
[Figure: Log2 Ratio against genomic position for T1D_21q22.3 : 44581997−44582498, x-axis from 44581000 to 44583500. Called samples: 16860604]
Ensemble method: CNSolidate algorithm
weighted combination of CN detection algorithms
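The idea of a weighted combination of detection algorithms can be illustrated with a simple weighted vote. This is not the CNSolidate scoring scheme, which the slide does not describe; the caller names, weights and threshold below are hypothetical, and each caller is assumed to report -1 (loss), 0 (no call) or +1 (gain) for a candidate region.

```python
def weighted_consensus(calls, weights, threshold=0.5):
    """Combine per-algorithm calls (-1 loss, 0 none, +1 gain) into one call.

    Returns the weighted score in [-1, 1] and a label derived from it.
    """
    total_weight = sum(weights[name] for name in calls)
    if total_weight <= 0:
        raise ValueError("weights of the participating callers must sum to > 0")
    score = sum(weights[name] * call for name, call in calls.items()) / total_weight
    if score >= threshold:
        label = "gain"
    elif score <= -threshold:
        label = "loss"
    else:
        label = "no call"
    return score, label


# Hypothetical callers and weights, for illustration only.
weights = {"caller_a": 0.5, "caller_b": 0.3, "caller_c": 0.2}
print(weighted_consensus({"caller_a": -1, "caller_b": -1, "caller_c": 0}, weights)[1])
# loss
```

In practice the weights would have to be learned or calibrated, for example from regions with known copy number, so that more reliable algorithms count for more.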
Open Problem: How to Choose Best Feature Scaling?
[Figure: CNVR1877.2, pca_summary − breakpoint_probes. Six histograms, panels normalised1 to normalised6. x-axis: summary_data_std; y-axis: Frequency]
Conclusion
Other Machine Learning Applications
Interactions/Pathways Analysis
Comparative genomics (HMMs)
Sequence Alignment
Haplotype inference (HMMs)
Networks of interactions (non-genetic factors)
SNP genotyping (HMMs/clustering)
Metabolomics
Gene expression and interaction networks
Classification of proteins and RNAs
... and many more ...
Considerations and Final Remarks
Genomics is an exciting field to work in
Feeling for real data
Early stages of analysis/discovery
Growing interest in Machine Learning
Lack of expertise in the field
How to get in
Minimum requirement: PhD
Apply for jobs at https://jobs.sanger.ac.uk
Jobs?