* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Multimedia Information Gathering
Gene therapy wikipedia , lookup
Metagenomics wikipedia , lookup
Genome evolution wikipedia , lookup
Gene desert wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
History of genetic engineering wikipedia , lookup
Genome (book) wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Point mutation wikipedia , lookup
Gene expression programming wikipedia , lookup
Protein moonlighting wikipedia , lookup
Gene nomenclature wikipedia , lookup
Helitron (biology) wikipedia , lookup
Microevolution wikipedia , lookup
Designer baby wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Two Classifiers for Bioinformatics Course: CSI7162 Present to Dr. Stan Matwin Presented by Jun Ouyang 1 Road Map Introduction Discovery of regulatory connections in Microarray data (Gene level) Classifying protein fingerprints (Protein Level) Conclusion Research Challenges 2 Introduction to Bioinformatics An emerging interdisciplinary research area Interface between biological and computational sciences Computational management of all kinds of biological information Research scope of bioinformatics: DNA mRNA protein protein interactions informational pathways informational networks cells tissues or networks of cells an organism populations ecologies (hierarchical biological information) www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt 3 Introduction to Bioinformatics DNA level: DNA sequence alignment; gene prediction; gene evolution;… RNA level: Study of gene expression; transcription mechanism; posttranscription modification;… Protein level: protein 2D and 3D structure prediction; protein active site prediction; protein-protein interactions; protein-DNA interactions;… System level: (pathways, networks) Genome (gene-to-gene interactions) Ex: use gene chips to study gene regulatory network Proteome (protein-protein interactions) Ex: use protein chips to study protein interaction network www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt 4 Discovery of Regulatory Connections in Microarray Data (Gene level) What is a microarray? Obtaining microarray data Definition of regulatory relations between sets of genes (class labels) Reversible Jump Markov Chain Monte Carlo (RJMCMC) Algorithm to learn dynamic Bayesian network Use dynamic Bayesian network classifiers to predict regulatory relations Conclusion (1) 5 What is a Microarray? A kind of gene chip used to discover gene function or gene expression patterns Allow these patterns to be studied in parallel Example: In each location, a known probe (cDNA) is placed with cDNA from a certain sample For example, cDNA from cancerous and healthy cells with different probes (known strands of cDNA) Colour indicates the relative abundance of a labeled cDNA, meaning the gene has been activated 6 Obtaining Microarray Data What are the steps? [1] Choose cell population [or sample for diagnosis] [2] mRNA extraction, purify [3] Fluorescent label cDNA [4] Combine different strands of cDNA on microarray 6 time [5] Scan data over time [6] Interpret time series to study gene regulation over time 7 Regulatory Relations between Sets of Genes Goal of this work is to identify which genes regulate each other We are interested in two types of gene regulation: Co-regulation Two genes are perfectly co-regulated when their relative abundance functions w.r.t time have the same first-order derivatives Control-regulation Two genes are inversely co-regulated, or control-regulated, when their relative abundance functions w.r.t time have first-order derivatives which are inverses of one another 8 Microarray Data Representation From the microarray, we obtain a time series describing the gene interactions At different moments in time the microarray would show a different colour depending on which gene is active 9 Discretization and Classifier Construction We must discretize the time signal in order to facilitate learning with a Markov model For each point in time, the sample value is set to change, local minimum or local maximum. These features are used to learn a dynamic Bayesian classifier using a variant of the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique The classifier then can identify gene interactions as co-regulatory and controlled-regulatory Details provided later… 10 Discretization of Continuous Measurements Re-encoding of data using 2 binary variables for each of the 3 possible values Time series represented as change, local minima, local maxima 11 Monte Carlo Principle If we take a sample every 1/100th of a second and we measure for 10 minutes, we get 60000 samples per gene We need a method for reducing the number of samples without destroying the pertinent details Given a very large set X and a distribution p(x) over it We draw an i.i.d. set of N samples We can then approximate the distribution using these samples p(x) 1 p N ( x) N N 1( x i 1 (i ) x) X p(x) N An Introduction to Markov Chain Monte Carlo, Teg Grenager 12 Dynamic Bayesian Network Classifiers A Bayesian network: a statistical model for capturing the direct dependencies between discrete stochastic variables Train dynamic Bayesian networks to discover relations between sets of genes Co-regulated Control-regulated time time The Bayesian networks are trained using a variant of the RJMCMC sampling algorithm 13 Learning with Probabilistic Network Classifiers Objective function of network classifier P( L(C ) | X , G, * , D) l ( P(c | x, G, * , d ); , ) d D D: the learning database X: a set of predictor variables L(C): a score function l ( y, , ) : a modif. step function : the span of indeterminacy : the bounding probability G: a DAG structure. C: a classification variable D=(C,X): the learning database is separated into X and C Goal: sample models from the above target distribution P(L(C)|D). 14 Experiments Result Controlledregulated Co-regulated Lag(-1,-2) Valid SWI4 CLB2 Y CDC6 CLB2 Y AGA1 CLB2 Y ASH1 CLB2(SWI5) Y CDC45 CLB2(MCM1) Y Target gene CLB1 Lag(0) Valid Close CLB2 Y Y BUD4 CLB2 Y Y CDC47 CDC28 N N CTS1 FUS1 MFA2 SWI5(MCM1) N (?) SWI5 Y CLN3(CLB2) N(Y) N Testing result of yeast cell-cycle expression dataset 15 Conclusion (1) A new approach to discovering supposed regulatory relations between genes Processing techniques capture dynamic relations between sets of genes The results obtained from the microarray data are promising 16 Classifying Protein Fingerprints Motivation Task and Data Representation Data Preprocessing and Classification Methods Experimentation and Results Conclusion (2) 17 Motivation The need for automated protein fingerprint labeling Protein fingerprint: group of amino acid motifs used to depict protein families These fingerprints may be useful in grouping proteins together Improve on PRECIS (an annotation tool) This tool performs poorly: 40% error rate Classifies fingerprints using simple heuristics 18 Task and Data Representation Goal: replace PRECIS’s handcrafted heuristics with classification models extracted from data. Three distinct kinds of fingerprints Fingerprint itself Its component motifs (motif is a common sequence of amino acids) Protein 19 Task and Data Representation Fingerprint Number of motifs (nmt) Number of proteins (npr) True positive rate Partial positive rate Motif Motif length (average, std, etc.) Motif coverage ( average, stdev, etc.) Motif entropy ( average, stdev, etc.) Motif entropy ( average, stdev, etc.) Intermotif distance (average, stdev, etc.) 20 Task and Data Representation Protein sequence SWISS-PROT ID: fraction of proteins with ID LHS: frac of proteins whose length>=3|4 chars frac of proteins with common first 1|2|3|4 chars in LHS entropy of LHS averaged over first 1|2|3|4chars RHS: frac of proteins with a common RHS (species) entropy of RHS taken as a unit CC-belongs: sequence belongs to family CC-contains: sequence contains domain 21 Data Preprocessing Dealt with missing values using a technique based on KNN Considered several feature selection algorithms Ranking based on information gain I(X,Y)=H(X)-H(X|Y)= H(Y)-H(Y|X) Ranking based on mutual information H ( X ) H (Y ) H ( X , Y ) U ( X , Y ) 2 H ( X ) H (Y ) 22 Classification Methods This work compares the performance of several machine learning algorithms when combined with a feature selection method ML Algorithms considered: Logic-based learning algorithm Decision trees and rules (J48 and C5.0, etc) Density-estimation based learners NBayes, IBL, Lindisc, MLPS, SVM-RBF 23 Experimentation and Results Method Defaults PRECIS SVM-RBF RandomForest C5.0boost MLP IBL Lindisc LTree J48 Part Nbayes Parameters G=0.05,C=50 I=100,K=6 B=10,C=0.1 H=10 K=10 C=0.05 C=0.01 C=0.05 K CV error 45.60. 39.55 14.06 14.59 15.13 15.13 15.47 15.80 16.27 16.48 19.97 23.20 HO error 46.19 40.28 14.65 17.46 18.59 16.62 19.44 17.18 17.46 19.15 21.69 27.07 Best performance Error rates on the full 45-feature set CV: Cross-validation, HO: Holdout 24 Experimentation and Results Method Parameters SVM-RBF RandomForest C5.0boost MLP IBL Lindisc LTree J48 Part Nbayes G=0.05,C=50 I=100,K=6 B=10,C=0.1 H=10 K=10 C=0.05 C=0.01 C=0.05 K Feature selector ReliefF InfoGain ReliefF ReliefF SymmU ReliefF SymmU SymmU CFS CFS #features CV error HO error 36 40 32 40 32 40 32 32 7-10 7-10 14.08 16.61 16.90 16.90 17.46 17.18 18.59 19.72 18.03 23.66 14.09 14.19 14.79 14.86 14.93 15.40 15.53 15.53 17.35 18.02 Cross-validation and Holdout error rates after feature selection 25 Conclusion (2) SVM does not seem to benefit from the feature selection process (feature selection only removed 9 features!) Using a SVM-RBF learned classifier achieves a 26% improvement in accuracy over PRECIS 26 Research Challenges First paper Validate the new approach on real data sets (only simulated data was used) Second paper Correcting data imbalance to increase accuracy Incorporate available data from other databases 27 References M. Egmont-Petersen. W. de Jonge, A. Siebes. "Discovery of regulatory connections in microarray data," In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) 2004:149-160 Melanie Hilario, Alex Mitchell, Jee-Hyub Kim, Paul Bradley, Terri K. Attwood: Classifying Protein Fingerprints. PKDD 2004: 197-208 M. Egmont-Petersen. "Discovering possible co-relations and control-regulations between gene pairs in time series microarray data using salient dynamic features", Presented at the working group Bioinformatics, Symposium 2004. M. Egmont-Petersen. "Feature selection by Markov Chain Monte Carlo Sampling - a Bayesian approach," In Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR Workshops SSPR 2004 and SPR 2004, Lecture Notes in Computer Science 3138, Eds. A. Fred et al., pp. 1034-1042, 2004. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, Vol. 9, No. 12, pp. 3273-3297, 1998. Green PJ. “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, Vol. 82, No. 4, pp. 711-732, 1995. An Introduction to Markov Chain Monte Carlo. Teg Grenager. July 1, 2004. Agenda www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt 28 Q&A Thank you! 29