
Data Mining: Functional Statistical Genetics & Bioinformatics

Laura Kelly Vaughan, Ph.D.
Assistant Professor
Department of Biostatistics, Section on Statistical Genetics
[email protected]

NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

Integrative Data Analysis
- Genetic studies tend to focus on one data source
  - Genetic variation
  - RNA levels
  - Blood biochemistry
- This fails to utilize the information contained in the connections among these variables…

Central Dogma of Molecular Biology
[Figure: DNA (replication) to RNA by transcription (TXN), RNA to protein by translation (TSN), protein to phenotype. Field labels: Genetics, Structural Genomics, Functional Genomics (Transcriptomics), Proteomics, PTM, Metabolomics, Phenomics.]

Different sources of annotation data
- Gene Ontology
- Pathways/Networks
- Protein/protein interactions
- Literature
- Functional annotations
- Expression
- Cross species
- Cellular localization
- Methylation
- ChIP
- Sequence similarity
- Promoter & Regulatory Network
- Protein domains

Gene Ontology
- www.geneontology.org
- The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in a species-independent manner.
  - Biological processes: series of events accomplished by one or more ordered assemblies of molecular functions
  - Cellular components: parts of the cell
  - Molecular functions: activities, such as catalytic or binding activities, that occur at the molecular level

Example of a GO annotation
http://www.yeastgenome.org/help/images/cytokinesisDAGrels.jpg

What is a Pathway?
- Physical and functional interactions between genes and gene products
  - Metabolic pathways
  - Kinase-based signaling cascades
  - Transcriptional signaling pathways

TNF Signaling
[Figure: TNF signaling pathway diagram (TNFα/β, TNFR1, TNFR2, TRADD, TRAF2, caspases, JNK, p38, ERKs, IκBs, NF-κB) leading to apoptosis, gene expression, and cell survival. © 2007-2009 SABiosciences.com]

What is a Network?
- Graphical representation of the relationships between genes, gene products, or other objects
- Formed with information such as
  - Genes in interacting pathways
  - Gene products that share protein-protein interactions
  - Gene products with protein-nucleotide relationships
  - Regulatory relationships
  - Metabolic interactions

Metabolic Disease Network
[Figure: © 2008 by National Academy of Sciences. Lee D. et al. PNAS 2008;105:9880-9885]

Analysis tools
- Numerous methods have been developed to aid in the interpretation of biological experiments
- 2 basic categories
  - Pre-analysis methods, where the raw data are grouped together and the groups are tested
    - Dimension reduction
  - Post-analysis methods, where significant or interesting results are grouped together to identify trends

Before you start…
- There are many methods available for integrative data analysis
- Before you choose one, you must properly define the questions you are trying to answer…
- What is your hypothesis?
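The network definition above (genes as nodes, linked when they take part in the same pathway) can be sketched in a few lines. This is only an illustration: the pathway names, gene names, and the `build_network` helper are hypothetical placeholders, not real annotation data or any tool's API.

```python
from collections import defaultdict
from itertools import combinations

# Toy pathway memberships (made up for illustration; not real annotations).
pathways = {
    "pathway_A": {"gene1", "gene2", "gene3"},
    "pathway_B": {"gene3", "gene4"},
    "pathway_C": {"gene5"},
}

def build_network(pathways):
    """Return an adjacency map: gene -> set of genes sharing at least one pathway."""
    network = defaultdict(set)
    for members in pathways.values():
        for gene in members:
            network[gene]  # touch the key so isolated genes still appear as nodes
        for a, b in combinations(sorted(members), 2):
            network[a].add(b)
            network[b].add(a)
    return dict(network)

network = build_network(pathways)
for gene in sorted(network):
    print(gene, "->", sorted(network[gene]))
```

The same structure would accept other edge sources listed above (protein-protein interactions, regulatory relationships) by swapping the membership table for a list of interacting pairs.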
DBA ~10 mins

Methods
- Unsupervised, or data-based, methods
  - Utilize all the data to identify trends
  - Hypothesis generating
- Supervised, or prior-information-based, methods
  - Require the user to provide a 'training set' of genes
  - Hypothesis testing

Gene Set Analysis
- A test statistic is calculated to measure the deviation of gene-set expression measurements from the null hypothesis of no association with the phenotype
- The statistical significance (P-value) for each gene set is calculated based on permutation of samples

Types of enrichment methods
- Class 1: Singular enrichment analysis (SEA)
  - P-value calculated on each term from a pre-selected gene list, and the enriched terms are listed
- Class 2: Gene set enrichment analysis (GSEA)
  - All genes (without pre-selection) are included
  - No need to select a list
  - Experimental values integrated into P-value calculations
  - Pairwise comparisons (e.g., disease vs. control)
  - Most appropriate for expression data
- Class 3: Modular enrichment analysis (MEA)
  - Predetermined list, with term-term or gene-gene relationships included in the enrichment P-value calculation
  - Closest to the nature of biological data structure

DAVID
- Provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes
- Extensive annotation database
- Includes both pathways and GO
- SEA and MEA algorithms
- Visualization tools
- http://david.abcc.ncifcrf.gov/

DAVID and LVH gene expression
- GO clustering of significant genes between different mouse treatment groups
Stansfield et al 2009 Cardiopulmonary Support and Physiology

Babelomics Suite
- Suite of web tools for the functional profiling of genome-scale experiments
- Multiple annotation sources
  - Pathways, GO, regulation, text mining, interactions
- Allows for functional enrichment
- Several gene set methods
- Mostly SEA methods
- http://babelomics.bioinfo.cipf.es/

Babelomics and thyroid carcinoma
- Identified 1031 genes with differential expression
- Enriched pathways
  included:
  - MAP kinase
  - TGF-β
  - Focal adhesion
  - Cell motility
  - Activation of actin polymerization
  - Cell cycle
- Identified 30 genes that predict prognosis with 95% accuracy
Montero-Conde et al 2008 Oncogene

GSEA
- Computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
- http://www.broad.mit.edu/gsea/

GSEA: Steps in the Methodology
- Define a gene set from prior knowledge
- Order the genes by correlation with phenotype
- Estimate the gene set's Enrichment Score
- Assess statistical significance using permutation tests
- Adjust for multiple hypothesis testing
Subramanian et al., PNAS, 2005

Biological pathways involved in chemotherapy response in breast cancer
- GSEA comparing chemotherapy responders and non-responders among ER+ breast cancer tumors
- Of >850 gene sets, 4 were significant
Tordai et al 2008 Breast Cancer Research

Significance Analysis of Function and Expression (SAFE)
- Generalization and extension of the GSEA method
- 2-stage permutation-based approach to assess significant changes in gene expression across experimental conditions
  - First computes gene-specific local statistics to test for association between gene expression and the phenotype.
  - The gene-specific statistics are then used to estimate a global statistic that detects shifts in the local statistics within a gene category.
- The significance of the global statistic is assessed by repeatedly permuting the response values.
- SAFE implements a rank-based global statistic that enables better use of marginally significant genes than statistics based on a p-value cutoff.
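The two-stage idea described for SAFE can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Bioconductor `safe` package or its API: the local statistic is assumed to be a Welch-style t statistic, the global statistic a rank sum of absolute local statistics within the category, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def local_stats(expr, labels):
    """Stage 1: per-gene Welch-style t statistic between the two phenotype groups."""
    a, b = expr[:, labels == 1], expr[:, labels == 0]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def global_stat(local, in_category):
    """Stage 2: rank sum of |local statistic| over the genes in the category."""
    ranks = np.abs(local).argsort().argsort() + 1  # ranks 1..n_genes
    return ranks[in_category].sum()

def safe_like_pvalue(expr, labels, in_category, n_perm=1000, seed=0):
    """Permute the response labels and recompute both stages each time."""
    rng = np.random.default_rng(seed)
    observed = global_stat(local_stats(expr, labels), in_category)
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(labels)
        if global_stat(local_stats(expr, permuted), in_category) >= observed:
            hits += 1
    # Add-one correction keeps the empirical p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Simulated example: 50 genes, 10 cases and 10 controls; the first 10 genes
# form the category and are shifted upward in cases (assumed effect size).
rng = np.random.default_rng(1)
n_genes, n_per_group = 50, 10
labels = np.array([1] * n_per_group + [0] * n_per_group)
expr = rng.normal(size=(n_genes, 2 * n_per_group))
in_category = np.zeros(n_genes, dtype=bool)
in_category[:10] = True
expr[:10, labels == 1] += 3.0

observed, p_value = safe_like_pvalue(expr, labels, in_category, n_perm=200)
print("global statistic:", observed, "empirical p-value:", p_value)
```

Because whole label vectors are permuted and every gene is rescored together, the correlation between genes is preserved under the null, which is the usual argument for sample permutation over gene permutation.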
http://www.bioconductor.org/packages/bioc/1.6/src/contrib/html/safe.html

Dietary resveratrol and aging in mice
- SAFE analysis based on GO annotations
- Overlap of classes significantly affected by the caloric restriction response and by low-dose resveratrol
Barger et al 2008 PLoS One

Supervised Analysis

Endeavour
- Web-based prioritization of candidate genes
- Infers models for the training set in each data source
- Applies each model to the candidate genes to rank them against the profiles of the training set
- Merges the rankings from each data source to give a global ranking of genes
http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php

ENDEAVOUR: the algorithm behind the wizard
[Figure: Tranchevent, L.-C. et al. Nucl. Acids Res. 2008 36:W377-W384; doi:10.1093/nar/gkn325]

Genetic disorder prioritization using Endeavour

Network Analysis
- Dynamic representation of cellular processes through the incorporation of annotation and experimental data
- Structures are not fixed and change with context
- Many methods available…
Suderman & Hallett 2007 Bioinformatics

Ingenuity IPA

Pathway Analysis of WTCCC Type 2 Hypertension GWAs
- No single SNP was significant at the genome-wide level
- High degree of relationship between pathways suggests multiple related mechanisms
- Large number of low-penetrance risk alleles
- Pathway analysis with MetaCore
Torkamani et al. 2008 Genomics

The next step
Translational Science
- Integration of 49 genome-wide experiments for the prediction of previously unknown obesity-related genes
- Greatly outperforms individual experiments
English, S. B. et al. Bioinformatics 2007 23:2910-2917; doi:10.1093/bioinformatics/btm483

References
Song & Black 2008. BMC Bioinformatics. 9:502
Huang et al 2009.
NAR 37(1):1-13
Chen et al 2008 Nature 452(27)429-435
Dinu et al 2007 Journal of Biomedical Info 40:75-760
Al-Shahrour et al NAR 36:W341-346
Barry et al 2005 Bioinformatics 21(9)1943-1949
Huang et al Nature Protocols 4(1)44-57
Tranchevent et al 2008 NAR 36:W377-384
Mehta et al 2006 Physiol Genomics 28:24-32
Suderman & Hallett Bioinformatics 23(20)2651-2659
Dinu et al 2008 Briefings in Bioinformatics
Curtis et al 2005 Trends in Biotech 23(8)
Price and Shmulevich 2007 Current Op in Biotech 18:365-370
Zhang et al 2008 BMC Systems Bio 2:5
Werner 2008 Current Op in Biotech 19:50-54
Lui et al 2007 BMC Bioinformatics 8:431
Goeman & Bühlmann 2007 Bioinformatics 23(8)980-987
Rivals et al 2007 Bioinformatics 23(4)401-407
Nam & Kim 2008 Briefings in Bioinformatics 9(3) 89-97
