* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download PPT
Genetic engineering wikipedia , lookup
Gene therapy wikipedia , lookup
Biochemical cascade wikipedia , lookup
Chemical biology wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Oncogenomics wikipedia , lookup
History of molecular biology wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Expression vector wikipedia , lookup
Genome evolution wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Evolutionary developmental biology wikipedia , lookup
Community fingerprinting wikipedia , lookup
Gene regulatory network wikipedia , lookup
Neurogenetics wikipedia , lookup
Symbiogenesis wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Computational Biology Algorithmic Techniques & Medical Applications CSE 590YA August 15, 2001 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions 2 Overview Human Genome Project Why is it important? Sequence functionality Prevention & treatment of disease Where is there computation in it? Lab hardware/software Analysis: assembly, element discovery Could not accomplish w/o computers 3 Bigger Picture Biology of the (not so) past Isolated Low level (one X at a time) Slow accumulation of knowledge Biology of the present Global High level (organismal/theoretical) Rapid accumulation of knowledge Rapid generation of open questions 4 Example: S. cerevisiae (yeast) Yeast: before expression arrays Model organism for experiments Easy to grow, modify, and study Genetics similar to higher organisms Yeast: after expression arrays Immensely more useful Now know most gene functions New results every month that used to take five years Results are directly applicable to higher organisms 5 A good beginning … The genome is not the end Code to be deciphered Human road map Greater need for computational tools and power Example: dbSNP Data exists Need help finding and relating it all 6 Computers – not just for analysis Role reversal Before: Biologists generate data, computers analyze it Now: Computers generate experiments, biologists perform them Cycle New future for CMBists Biotech has greatest opportunity for real science to be done, and CS is crucial! 7 CB is good for CS Old research revisited and applied Clustering Expired in the 70s, reborn 3 years ago New papers reacceptance as research topic Data mining, web statistics, e-commerce Machine learning Well-studied over the past couple decades New needs in CB new research on tuning 8 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions 9 Biochemistry 101 Cells Basic building blocks of life * Proteins Key to functionality Catalyze reactions * Store and release energy Build cells and cell components Process-specific, yet resource-efficient 10 The genetics of proteins DNA Four-base alphabet * Genes are instructions for building proteins Cell cycle * Extensive regulatory mechanism Construct proteins at right time and place Break down proteins and reuse components Incredibly complex series of steps 11 Transcription & translation DNA RNA RNA protein Transcription factors * RNA polymerase Translation at ribosome * Amino acid chains Protein degradation 12 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions 13 Technology DNA microarrays Consensus RNAs adhered to slide Test and control cDNAs produced * Fluorescently labeled Hybridized with RNAs on slide Scan fluorescence with computer Results: how much RNA present! * What does this signify? 14 Example uses Timepoints in the cell cycle Which genes are always “on”? Which genes are responsible for certain events in the cycle? Differential expression in experiment Which genes are responsible for a particular cell response? What is the response pattern over time? 15 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions 16 “Low tech” algorithms 90s: DNA is just a bunch of strings Questions became answerable! Are there gross similarities in the genome? What do they imply? Are there smaller recurring elements in the genome? What is their function? I know what Gene A does? Can I use that to figure out what Gene B does? 17 String and sequence matching String matching Sequence matching Find exact replicas of DNA sequence elsewhere in the genome Are they statistically unlikely? Regions of DNA that look similar: allows for evolution Also applied to proteins In reality, sequences are more important 18 Computer tools Biological questions could be answered better by a computer than by a biologist GenBank, FASTA, BLAST, GAP Not trivial developments, even for CS Required novel approaches to NP-hard problems Web proliferation (ongoing) www.cs.jhu.edu/~salzberg/appendixa.html 19 High tech: expression arrays Use active gene data to classify a cell Example: Cancer type prediction Subtypes appear very similar histologically Very different clinical courses Diagnoses: biologists’ insight rather than systematic/unbiased approaches 20 Classifying cancer ALL vs. AML Two kinds of leukemia (only recently separated) Must be treated very differently Distinguishable in clinic, but not 100% reliable Golub (1999) Goal: Determine cancer type by overall gene expression; build an automated classifier By-product: One of earliest quantitative uses of DNA microarrays 21 Strategy Get expression data for 6800 genes from 27 ALL and 11 AML patients Clustering: Find genes with expression levels that are strongly correlated with the ALL-AML class distinction Give each such gene a weighted predictive vote for its class Let important genes vote on test cases 22 Determining correlation w/ class Idealized expression patterns Neighborhood analysis * Correlation metric Significance Euclidean distance, regression, TNOM Q: Is gene more highly correlated with IEP than would be expected by chance? A: Examine correlation w/ random IEP permutations Results: 1100 genes more highly correlated with ALL-AML class distinction than expected by chance 23 Making a class predictor Subset of informative genes will elect the class of a new sample Each casts weighted vote for its class: * Expression level of gene in test sample Original correlation of gene w/ class distinction Prediction strength (PS) Margin of victory after all genes vote If less than threshold, then uncertain 24 Validation of the model (a) Initial data set: cross-validation For each patient sample: Build a classifier without it (i.e. w/ 37 others) Predict class of left-out sample Calculate cumulative error rate Results Used top 50 genes 36/38 samples classified correctly, 2 uncertain 25 Validation of the model (b) Independent data set: test validation 34 samples from diverse tissues 29/34 “strong” predictions; 100% accuracy PS values quite high for both .77 in cross-validation; .73 in independent Mean PS lower for samples from one particular laboratory: importance of standardization in clinical setting 26 Further results of clinical importance 10 200 voting gene set had same accuracy Voter gene function: not just lineage markers Surface receptors, anti-apoptotic agents, cell cycle regulators, DNA manipulators, known oncogenes These genes provide insight into cancer causes New biological knowledge as a result of computational methods! Other applications of CP & feature selection Response to chemotherapy Eventual outcome of disease 27 Other array-based classifiers (a) k-means clustering Select “high-scoring” features like before Pick k points as initial cluster centroids Add each new data point to nearest cluster Move that cluster centroid to new mean Use these centroids to classify test cases 28 Other array-based classifiers (b) Support Vector Machines Goal: find a plane that separates data points If not separable Boost the data points into a higher dimensional space using some well-behaved kernel function Try to find a separating hyperplane there Key benefits of SVM version Kernel avoids explicit representation of higher-dim space Finding the maximum margin separating classes avoids overfitting 29 Class discovery What if we don’t know how many clusters we want? The discovery of finer-grained subtypes of cancer has been arduous and slow How can microarrays help here? Golub (1999) again … Automatic class discovery based solely on gene expression 30 Self-organizing maps (SOMs) Very much like k-means clustering Results for 27 ALL/11 AML data set However, we don’t know the discriminating features in advance Cluster based on all gene expression levels Class A: 24/25 samples were ALL Class B: 10/13 samples were AML Quite effective, but not perfect 31 SOMs (cont’d) How can we evaluate the “learned” clusters w/o knowing the true classes? Test by class prediction – accuracy should be high if classes reflect true structure Results Predictors w/ variety of genes did well in cross-validation Exception: the one AML in class A was often predicted to be in class B This suggests an iterative method for class discovery: discover, predict, refine 32 Independent model validation Cannot assess “accuracy” on test data Instead, assess prediction strength High PS indicates that structure in initial data is also present in test data Results Median PS=.61, 74% of samples above threshold Compared w/ random clusters, PS’s were highly statistically significant We have discovered ALL-AML distinction! Even lower-level distinctions also discovered 33 Other CS w/ expression arrays Regulatory element detection Correlate expression data with frequency of DNA motifs Taxing even for fastest processors today Discovery of regulatory pathways Treat expression arrays over time as a graph Establish a Bayesian network model for regulatory pathways over the array graph structure Infer network parameters pathway structure 34 Problems with DNA arrays Different companies, different types Even within one company Much time spent on normalization Different products over time Different binding efficiencies Even then, different groups’ results are hard to compare Biggest worry: RNA levels in cells do not accurately reflect current protein content Perhaps limits our discovery potential 35 Proteonomics If protein is most important, why not study it directly? Much work is done on proteins already But difficult to purify, prepare, quantify Results are very coarse Emerging technologies More efficient protein purification and protein arrays are being developed! Lots of discoveries to come 36 Outline Overview Biology Technology Algorithms & Applications Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes Conclusions 37 Looking to the future Biology is becoming a more theoretical, unified science The problem w/ biology has always been that there are too many layers Work has always been somewhere in the middle Now research is beginning to focus on processes and pathways and networks in general This is the proper path to developing theories Along the way … Lots of hard computational problems to be solved! 38