Doing Machine Learning at the Wellcome Trust Sanger Institute
Manuela Zanda
Wellcome Trust Sanger Institute, Cambridge, UK

Section: The Sanger Institute

The Wellcome Trust Sanger Institute

Research at The Sanger Institute
  - Four main research areas ...
  - ... but they all require informatics tools to analyse the data

The Genome
  - Genome ⇒ all the DNA in an organism or a cell
  - DNA is organised in chromosomes
  - Two copies of each chromosome
  - Genes ⇒ proteins ⇒ do all the work in a cell

The Concept of Genomic Variation
  - Humans share ∼99.9% of their genome. What makes us different is the remaining 0.1%!
  - But then DNA is made of 3 billion base pairs...
  - Differences can be either:
      - Small changes, single DNA letter: Single Nucleotide Polymorphisms (SNPs)
      - Large structural changes: Copy Number Variants (CNVs)
          - Deletions
          - Duplications
          - More complex rearrangements
  - Examples (reference sequence, then variant):
      SNP:          ...accgattaagcgaa...
                    ...accgaataagcgaa...
      Deletion:     ...accgattaagcgaa...
                    ...accgacgaa...
      Duplication:  ...accgattaagcgaa...
                    ...accgattaagttaagcgaa...

The Concept of Copy Number Variant (CNV)
  [Figure. Courtesy of the Wellcome Trust Sanger Institute]

Why are we interested in Copy Number Variants?
  - Certain (family) diseases are caused by CNVs
  - Certain CNVs protect from diseases (e.g. malaria)
  - These CNVs are present in healthy individuals
  Impact on Disease
  - Changes in copy number can alter gene regulation
  - Not all of them! Which ones?
  - Identify mutations associated with disease

Section: My Projects

My Role at the Sanger
  - Genomic mutation and genetic disease group [Matt Hurles]
  - Structural Variation
  - Copy Number Variation in human populations
  - Its role in diseases
  - Characterise mutation mechanisms
  - NextGen sequencing
  - Current focus on:
      - Developmental disorders (DDD)
      - Congenital heart disease
  - AIM: To improve clinical diagnosis

Projects Portfolio
  - NIDDK T1D DP3 CNV Project
      - In collaboration with Cambridge, UCL and UVa
      - Investigate the role of CNVs in Type I Diabetes
  - 1000 Genomes Project
      - International Project
      - In-depth genotyping of difficult SVs with 400K microarray CGH

The Objective of the T1D DP3 CNV Project
  - To investigate the role of rare and common CNVs in disease susceptibility
  What we already know
  - Diabetes is a complex disorder
  - Different genomic locations involved
  - Few (weak) associations with common CNVs in previous GWAS studies (WTCCC)

The Novelties of the T1D DP3 CNV Project
  - Target specific regions of the genome
  - Focus on difficult common CNVs (VNTRs)
  - Focus on rare CNVs
  - Use of different types of probes
      - internal probes
      - breakpoint probes
  - Family data

The Data: aCGH microarrays
  [Figure: aCGH microarray. © 2008 Nature Education. All rights reserved]
  - Each dot ≡ intensity → difference in CN with ref sample

Section: ML Problems

How Does Machine Learning Fit in All This?
  Large Scale Data Analysis
  - Data denoising/transformations
  - Dimensionality reduction (60,000 features x 10,000 samples)
  - Data modelling
  - CNV discovery
  Data Analysis Automation
  - Analysis made of multiple steps → automation?
  - Different approaches → Which one to choose?
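As an illustration of the "data denoising/transformations" step listed above, here is a minimal sketch of quantile normalisation, a standard way of making intensity distributions comparable across arrays. The function name and the toy matrix are hypothetical, chosen for illustration only; this is not claimed to be the pipeline used in the project.

```python
from statistics import mean


def quantile_normalise(samples):
    """Map every sample (a list of probe intensities) onto one shared
    distribution: the rank-by-rank mean of the sorted samples.
    Ties are broken by probe order, which is a simplification."""
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [mean(column) for column in zip(*sorted_samples)]

    normalised = []
    for s in samples:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalised.append(out)
    return normalised


# Toy data (assumed): 3 samples x 4 probes.
toy = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
result = quantile_normalise(toy)
```

After the call, every sample contains exactly the same set of values; only the assignment of values to probes differs, following each probe's rank within its own sample.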
Data Denoising/Transformation
  - Noise in the data
  - Amount of DNA per sample differs
  - 8x60k aCGH array
  - Data converted to image → data loss!
  - Data has to be scaled
  - Quantile Normalisation
  - PCA

Dimensionality Reduction Example: Summarising Probe Intensities
  - Input: ≈ 3-10 probes per (12,000) CNV per (10,000) samples
  - Output: an estimate of how many DNA copies each sample carries
  - Two main ways:
      - Summarise probes: PCA, mean, median (no one fits all!)
      - Multidimensional problem (which model?)

Data Modelling: Clustering
  [Figure: density plot for CNVR1877.2, Chr4:34916594−34919278; Model = G, ncomp = 3, 12 probes, Clustering Status = C; x-axis: ldf − pca; y-axis: Density]

CNV Discovery of Rare CNVs
  [Figure: Log2 Ratio against genomic position for T1D_21q22.3 : 44581997−44582498; Called samples: 16860604]
  - Ensemble method: CNSolidate algorithm
  - Weighted combination of CN detection algorithms

Open Problem: How to Choose Best Feature Scaling?
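One reason the choice of scaling is not obvious: the samples that carry a rare CNV are also the ones that distort the scaling statistics. The sketch below compares two common scalings, z-score and median/MAD, on made-up log2 ratios for a single probe. The data and function names are hypothetical, and neither scaling is claimed to be one of those evaluated in the talk.

```python
from statistics import mean, median, pstdev


def zscore(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def robust_scale(values):
    """Centre on the median and divide by the median absolute deviation.
    Assumes the MAD is non-zero, which holds for the toy data below."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [(v - med) / mad for v in values]


# Hypothetical log2 ratios for one probe: eight samples near 0 and
# two samples carrying a rare deletion (the last two values).
probe = [0.05, -0.02, 0.01, -0.04, 0.03, 0.00, -0.01, 0.02, -1.00, -0.95]

z = zscore(probe)
r = robust_scale(probe)

# The deletion carriers inflate the standard deviation, so their own
# z-scores stay modest; the median and MAD are barely affected by them.
print("z-score of first carrier:     ", round(z[-2], 2))
print("robust score of first carrier:", round(r[-2], 2))
```

On this toy probe the carriers stand out far more under the median/MAD scaling, but that advantage would shrink for a common CNV where carriers are no longer a small minority, which is one way of seeing why no single scaling fits every region.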
  [Figure: CNVR1877.2, pca_summary, breakpoint_probes; six histogram panels (normalised1 to normalised6); x-axis: summary_data_std; y-axis: Frequency]

Section: Conclusion

Other Machine Learning Applications
  - Interactions/Pathways Analysis
  - Comparative genomics (HMMs)
  - Sequence Alignment
  - Haplotype inference (HMMs)
  - Networks of interactions (non-genetic factors)
  - SNP genotyping (HMMs/clustering)
  - Metabolomics
  - Gene expression and interaction networks
  - Classification of proteins and RNAs
  - ... and many more ...

Considerations and Final Remarks
  - Genomics is an exciting field to work in
      - Feeling for real data
      - Early stages of analysis/discovery
  - More and more interest in Machine Learning
      - Lack of expertise in the field
  - How to get in
      - Minimum requirement: PhD
      - Apply for jobs at https://jobs.sanger.ac.uk

Jobs?