Bios 540 Introduction to Bioinformatics

Lecture 1
Course outline
The biological system
Omics and its impact
Big data
The statistician/bioinformatician’s role

Course Outline
Instructor: Tianwei Yu
Office: GCR Room 334
Email: [email protected]
Office Hours: by appointment.
Teaching Assistant: Mr. Qingpo Cai
Office Hours: TBA
Course Website: http://web1.sph.emory.edu/users/tyu8/540/index.htm

Evaluation
Class participation (5%)
Three homeworks (15% × 3)
Final report based on a research article (50%).

Course Outline
[Diagram: Bioinformatics in relation to Statistics and other disciplines (CS, Biology, Genetics, ……); Bios 540 shown alongside machine learning and other courses]

Course Outline
Biological sequence analysis: Pairwise alignment; multiple alignment; sequence models; motifs; fast alignment; phylogenetic trees
High-throughput data generation and preprocessing: Next generation sequencing; Microarray RNA/DNA profiling; LC/MS based Proteomics/Metabolomics. (Technique; popular models)
General statistical techniques in high-throughput data: Multiple testing & FDR; clustering; classification
Data Interpretation & Integration: Ontology; Some Important Databases; Networks

Related course
Bios 740 (Bios/CS 534 from 2017): Machine Learning.
Supervised learning:
Classification: Bayesian decision theory, LDA, classification tree, random forest, SVM, boosting, bump hunting, neural networks, deep learning.
Model generalization: variance/bias, training/testing error, cross validation.
Unsupervised learning:
Dimension reduction: PCA, factor analysis, ICA, NCA, SIR
Clustering: similarity measures, hierarchical, k-means, model-based clustering …

Tentative schedule
Lecture 1: Introduction
Lecture 2: Sequencing; Dynamic programming sequence alignment
Lecture 3: BLAST; Hidden Markov Models in alignment (1)
Lecture 4: Hidden Markov Models (2); Multiple Alignment
Lecture 5: Motif discovery; Phylogeny
Lecture 6: Gene expression: microarray and deep sequencing
Lecture 7: Supervised and Unsupervised Learning (1)
Lecture 8: Supervised and Unsupervised Learning (2)
Lecture 9: Multiple Testing
Lecture 10: Analyzing the DNA by deep sequencing (1)
Lecture 11: Analyzing the DNA by deep sequencing (2); MS-based Proteomics & Metabolomics (1)
Lecture 12: MS-based Proteomics & Metabolomics (2)
Lecture 13: Networks and Ontology
Lecture 14: Data integration

Course Outline
Recommended Readings for the basics:
Richard Durbin et al. (2005) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Michael Waterman (1996) Introduction to Computational Biology – Maps, Sequences & Genomes.

The complex biological system
[Figure: Red: central dogma. Blue line: interactions. Metabolites. Picture edited from http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/]

The complex biological systems
[Figure: Organism; Tissue architectures; Cell; Cell interactions; Signaling; Genome; Transcriptome; Proteome; Metabolome; Interactome; ……; Environment: Chemicals, Microorganisms]

The complex biological systems
How many players are there in the human system?
30,000~70,000 genes
One or more regulatory sequences per gene
~70% of the genes are alternatively spliced to generate >1 transcripts and >1 proteins per gene
>40,000 different metabolites (Human Metabolome Database)
Hundreds of signaling molecules
Different cellular architectures
The items listed above are just species. The amount of each species also matters!
The complex biological systems
Our goal – the comprehensive understanding of diseases
We face the “big data” challenge
[Figure: Comprehensive studies of a disease. Li et al. Seminars in Immunology 25:209]

The complex biological system – our goals
“Omics” includes: Genomics, Transcriptomics, Proteomics, Metabolomics, Interactomics, ……
Measured by: High-throughput Sequencing, Microarrays, LC/MS, NMR, Two hybrid, ……
Their data are: High-noise
Advanced preprocessing techniques (to reduce noise) -> Reliable high-throughput information -> Techniques to analyze high-dimensional data and knowledge bases -> Biological knowledge, Medical knowledge -> Improved health

The complex biological systems --- the genome
[Figures: http://content.answers.com/main/content/wp/en/f/f0/DNA_Overview.png ; http://www.insectscience.org/2.10/ref/fig5a.gif]

The complex biological systems --- the genome
The human genome is a book with 3 billion characters. 5% are words (protein coding sequences) and 95% are not.
The mouse genome contains about 2.5 billion characters. It is very similar to the human genome (85% identical in protein coding regions). That is one of the reasons why mice are suitable for elucidation of biological mechanisms and drug discovery. The similarity results from a common ancestor 80 million years ago.
How many genomes are sequenced? The number increases rapidly.
[Figure: http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html]

The complex biological systems --- the genome
Small variations in the genome can cause huge differences in phenotypes – disease susceptibility, drug response etc.
The sequence variations in the genome can be measured by PCR (low-throughput), microarray and deep sequencing (high-throughput) – individual genome.
[Figure: http://archive.hpcwire.com/hpcwire/2013-06-05/dell_boxes_up_hpc_for_life_sciences.html]

The complex biological systems --- the epigenome
DNA is structured.
Modifications to relevant proteins (methylation/acetylation/…) and to DNA itself can change its structure and control gene expression (DNA -> RNA).
[Figure: http://www.roadmapepigenomics.org/]

The complex biological systems --- transcriptome and proteome
The cell is a complex machine. The active parts of it are the proteins.
The DNA records how each protein should be made, but not the quantity at a given moment.
To understand the operation of the machinery, we want to know how much of each protein is present under certain conditions.
There are potentially >10,000 species of proteins in the cell. Protein modifications further complicate things.
The proteome can be directly measured by methods like LC/MS/MS, which is costly.
A much easier way is to measure the transcriptome. The messenger RNAs serve as the molds in the making of the parts. Normally, the more molds, the more parts made.
mRNA doesn’t have tertiary structures – much easier to quantify by microarrays.
[Figure: http://www.katiephd.com/a-whole-new-rna-world/]

The complex biological systems --- metabolome
Small molecules – not coded by DNA.
Substrates of enzymes (proteins).
Reflects activities of the regulatory systems and the environment.
Directly reflects: Metabolic regulation; Nutrition; Environmental response; Drug response
Indirectly reflects system changes (redox potential…)
Measured by NMR, GC/MS, LC/MS, ……

The interactome
[Figure: The Scientist 2004, 18(12):18]

The reactome
KEGG network – proteins (enzymes) & metabolites.

The relevance of omics experiments in medicine
Biomarker discovery
To find non-invasive methods to:
Predict disease risk; early detection
Disease classification
Predict response to treatment
Monitor disease progression
Before the era of high-throughput experiments, what did the doctors do?
Age, gender, ethnicity, behavioral measures, …
Disease stage, dissection of disease tissue …
Use one-at-a-time methods to analyze proteins/metabolites in disease tissue or biological fluids

The relevance of omics experiments in medicine
To study the disease mechanism to find a cure:
(1) Diseases with pathogen
Interaction of the human system with the pathogen. Protein interaction, regulation of gene expression, change in metabolite concentration …
What can we block to stop the disease progression?
(2) Diseases without pathogen
What goes wrong in the human system? Is it a genetic disorder? Is it a disturbance of the regulation of the system?

The relevance of omics experiments in medicine
Genomics – a few examples

Medical question: Is there a (set of) special mutation causing a disease?
Experimental techniques: Deep sequencing; Single Nucleotide Polymorphism (SNP) arrays
Computational techniques: Association Analysis; Linkage Analysis; Multiple testing; ……

Medical question: How to find gene products that aid/suppress the development of a certain type of cancer?
Experimental techniques: Array comparative genomic hybridization; SNP array CGH/LOH; Deep sequencing.
Computational techniques: Segmentation; Multiple testing; Clustering; Classification ……

Medical question: How to find a region of DNA whose folding structure affects disease status?
Experimental techniques: Deep sequencing.
Computational techniques: Alignment; Peak modeling; Segmentation ……

The relevance of omics experiments in medicine
Transcriptomics – a few examples

Medical question: Are certain gene products associated with the incidence/progression of a disease?
Experimental techniques: Expression microarrays; Whole Transcriptome Shotgun Sequencing
Computational techniques: Alignment; Multiple testing; Dimension reduction; Clustering; Classification ……

Medical question: Are there subtypes of disease undetected by regular medical examination?
Experimental techniques: Gene expression (and potentially all other “omics” methods)
Computational techniques: (same as above)

The relevance of omics experiments in medicine
Proteomics – a few examples

Medical question: Are certain proteins associated with the incidence/progression of a disease?
Experimental techniques: Mass spectrometry (2D gel -> MS, tandem MS, LC/MS/MS, ……)
Computational techniques: Sequence matching; Multiple testing; Dimension reduction; Clustering; Classification ……

Medical question: How do proteins change their modification patterns in a disease state?
Experimental techniques: (targeted) Mass spectrometry
Computational techniques: (same as above)

Medical question: How do proteins of pathogens work and interact with human proteins?
Experimental techniques: Protein structure analysis; Mass spectrometry; Immunological methods
Computational techniques: (same as above); Large-scale structural study

The relevance of omics experiments in medicine
Metabolomics – a few examples

Medical question: How are bodily metabolic networks disrupted in metabolic diseases?
Experimental techniques: Mass spectrometry; NMR; etc.
Computational techniques: Data alignment; Metabolite mapping; Multiple testing; Dimension reduction; Functional data analysis ……

Medical question: How do some drugs interfere with the human metabolome? How are they transformed/degraded?
Experimental techniques: (same as above)
Computational techniques: (same as above)

Medical question: Do pollutants accumulate in the human body and cause diseases?
Experimental techniques: Mass Spectrometry
Computational techniques: (same as above)

The relevance of omics experiments in medicine
“Omics” is revolutionizing medicine.
Personalized medicine
Understand each patient’s system and match them with treatments. (Success example: the Oncotype DX breast cancer test from Genomic Health, used to tailor treatment.)
Predictive medicine & preventive medicine
Find the increased risk, even before the disease onset. Predict the progression of disease after it occurs.
Systems biology: better understanding of diseases
How are all the “omics” measurements related? How do they interact? What does it say about possible treatments and development of drugs?

The relevance of omics experiments in medicine
What is Personalized medicine?
Each person is different by:
Different DNA sequence (tens of millions of sequence variations)
Different DNA structures
Different gene expression levels
Different protein modification/degradation patterns
Different metabolite levels in the blood
Different exposure history
……
[Figure: Number of SNPs. Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748]

The relevance of omics experiments in medicine
Fig. 1. Personalized medicine. Personal genomics connect genotype to phenotype and provide insight into disease. Pharmacogenomics connects genotype to patient-specific treatment. Traditional medicine defines the pathologic states and clinical observations to evaluate and adjust treatments. (Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748)

The relevance of omics experiments in medicine
[Figures: nature medicine volume 17 | number 3 | 297 – 303.]

The challenges to statisticians/bioinformaticians
Luckily, or unluckily, we are part of the “big data” game.
[Figure: http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/]

The challenges to statisticians/bioinformaticians
All Omics experiments share one characteristic, the “totality”: there are many features!
We are measuring hundreds of thousands of features from one single person.
We are overwhelmed by data --- even eyeballing the data becomes impossible.
The task:
Reduce the data into a more useful form.
Make use of the data in medicine and biological research!
[Figure: Nature Methods 6, S2 - S5 (2009)]

The challenges to statisticians/bioinformaticians
The sample size issue.
Up to now, most genome-wide association studies (GWAS) yielded very weak biomarkers. Biomarkers found by microarray are often unreliable. Why?
Diseases are complicated!
The human population is diverse!
We are limited by sample size!
If a disease is caused by the combinatorial effect of 3 genes located in different regions of the genome, high-throughput technology will have difficulty finding them, even with 1000 samples!

The challenges to statisticians/bioinformaticians
Many medical questions using Omics can be generalized into these forms:

Processing the data to find the features.
Pre-processing, sequence comparison, data modeling…

Identifying features (SNPs, genes, proteins etc.) associated with a disease (or disease state).
Find if a feature is significantly different between normal/disease samples.
Statistical models, Multiple Testing, model validation, generalization…

Finding previously unknown subtypes of a disease.
Group samples based on their feature measurements.
Dimension Reduction, Clustering, …

Predicting disease/normal status or different disease subtypes/states.
Based on the measurements of some features, predict a new case.
Predictive Model Building…

The challenges specific to statistical bioinformaticians
Compromise: the models may be too complex; assumptions may not hold; theoretical rigor may not be achieved
Too much background knowledge
Computing needs
Work with others
Different data types - integration
“Dirty” data
Speed: the first few methods (not the best method) dominate, and data evolve
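
A rough illustration of the sample size issue raised earlier (a disease driven by the combined effect of 3 genes): the sketch below counts how many 3-gene combinations a genome-wide search would have to consider and what a Bonferroni correction would then demand of each test. The gene count of 30,000 is the lower end of the range quoted in this lecture; the 0.05 error level and the use of Bonferroni are assumptions made only for illustration, not a method prescribed by the course.

```python
from math import comb

# Assumption: 30,000 genes, the lower end of the range quoted in the lecture.
n_genes = 30_000
# Assumption: a conventional family-wise error level, chosen for illustration.
alpha = 0.05

# Number of hypotheses when testing genes one at a time
# versus testing every unordered combination of 3 genes.
n_single = n_genes
n_triples = comb(n_genes, 3)

# Bonferroni correction: each test must pass alpha / (number of tests).
threshold_single = alpha / n_single
threshold_triple = alpha / n_triples

print(f"single-gene tests:   {n_single:,}  per-test threshold {threshold_single:.2e}")
print(f"three-gene tests:    {n_triples:,}  per-test threshold {threshold_triple:.2e}")
```

Going from single genes to triples multiplies the number of hypotheses by roughly a hundred million, so the per-test threshold becomes far stricter while the number of samples stays the same. This is one way to see why a three-gene combinatorial effect is hard to detect even with 1000 samples.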