Large-scale mining of gene expression patterns
Paul Pavlidis
[email protected]
VanBUG, September 2007

Students: Leon French, Meeta Mistry, Vaneet Lotay
Postdoc: Jesse Gillis
Undergraduates: Raymond Lim, Suzanne Lane
Programmers: Kelsey Hamer, Luke McCarthy

[Figure labels: Injury, Stress, Disease, Aging, Development, Signal transduction, Synapse, Genome, Synaptic modulation]

Topics
• Connectivity database and analysis
• Gene expression data re-use system
• Scaling up gene coexpression analysis
• Applications and ongoing work

Another ‘ome
Leon French, Suzanne Lane

Growth of GEO
[Figure: Submissions (0 to 120,000) against Date (Dec-99 to Feb-08)]

[Figure: two panels, each labeled Age, Genes, and Samples. With JJ Mann, V Arango, E Sibille et al. Data from http://national_databank.mclean.harvard.edu/; GEO]

Goals for a system
• Researchers should be able to put their new expression data in a wider context of previous studies without extraordinary effort.
• Move analyzing multiple microarray data sets from a niche activity to the mainstream.
• Integration of other data types, domain-specific information.

[Diagram: Public data sources; Coexpression; Differential expression]

Challenges to comparing data sets
• Need to match genes/transcripts across platforms
• Data from third parties not always easy to handle
• Varying scales, normalization, etc.
• Varying data quality
• Varying levels of “raw data” available
• Selecting appropriate data to compare

With Cincinnati Children’s Hospital (D. Glass, M. Barnes et al.)

Probe specificity (or lack thereof)
[Figure: two histograms of Frequency, one against “Fraction non-specific probes” and one against “Fraction of probes with alignments” (both 0.0 to 1.0)]

Which data sets are reasonable to compare?
Too general, but lots of power
• All mouse data sets
• Mouse brain data sets
• Mouse neocortex data sets
• Mouse neocortex data sets examining stress
• Mouse neocortex data sets examining hypoxic stress
• Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia
Very specific, low power

Expression experiments: 519
    Mus musculus: 254
    Homo sapiens: 203
    Rattus norvegicus: 62
Assays (i.e., chips): 20837
Array Designs: 178
Coexpression links (probe-level): >100 million

Scaling up analysis of gene coexpression
Eisen et al., 1998 PNAS

Genes that are coexpressed tend to have related function
• Needed at the same place at the same time
• “Guilt by association”
• Reasonable to compare across studies

[Figure: Two ribosomal protein genes. Expression against Samples]

Biological noise
• Induced gene expression effects are often small.
• Gene expression varies between “replicates” in biologically meaningful ways.
• Allows us to repurpose data
[Figure label: Sample type]

Functional coexpression should be (somewhat) generalized
• If two genes are coexpressed under one condition, they will probably be coexpressed under at least some other conditions (or data sets).
• Coexpression seen “only once” needs special care in interpretation.
• We shouldn’t expect coexpression to be perfectly reproducible (for biological and technical reasons).
[Figure: two panels, each labeled Correlation]

A simple approach: count recurring patterns
Genome Research, June 2004

Pipeline for one dataset

Proof of concept analysis
• 60 human data sets, 15700 RefSeq genes.
• 70% cancer data
• 11 million “links”
• About 9.7 million different links

Many links are replicated across studies
[Figure: Number of links (log scale, 1.E+00 to 1.E+07) against minimum number of data sets a link is seen in (1 to 100); series: Observed, Shuffled database (mean)]

Evaluation on biological grounds

Cluster involving NMDAR1 (GRIN1)
[Figure: GRIN1, ATP6V0A1, PLD3; Allen Brain Institute]

Application: analysis of imprinted genes
Laurent Journot, INSERM – Universités Montpellier
[Figure labels: Correlation, p-value]

LYAR interacting proteins
[Figure label: LYAR-interactors]
Ewing et al., 2007 Molecular Systems Biology

Vote counting limitations
• Weak evidence distributed across data sets will not be picked up.
• This example meets strict “vote counting” criteria in only 2/23 data sets.
[Figure label: Correlation]

[Figure: Support (# of datasets), 2 to 14, against (Global) Correlation / Global effect size, -1.0 to 1.0; heat map of Gene pairs by Datasets]
Related work: Zhou XJ et al., Nat. Biotech 2005

Summary
• Reuse of public data: ‘adding value’
• Meta-analysis of coexpression
• Some applications
    • Functional prediction
    • Candidate identification
    • Platform evaluation

Ongoing and future work
• Applications and analyses
• Protein interactions and hubs
• Prediction of gene function at the synapse
• Differential expression analysis
• Regionalization
• Mouse models of brain injury
• Mouse models of psychosis
• Expanding our public database and software: http://www.bioinformatics.ubc.ca/Gemma. Web-based tools for biologists; web services coming soon.
• Integration with other information sources

Thanks
Gemma
Xiang Wan, Kelsey Hamer, Luke McCarthy, Kiran Keshav, Suzanne Lane, Meeta Mistry, Jesse Gillis
And to:
• NCBI GEO team
• Groups who made data available
• Collaborators who provided data prior to publication
Conrad Gilliam, Abraham Palmer, Joseph Santos, Gozde Cozen, David Quigley, Anshu Sinha, Spiro Pantazatos, Wei-Keat Lim
Tmm
Homin Lee, Amy Hsu, Jon Sajdak, Jie Qin, Tzu-Lin Hsaio
Andreas Kottmann, Etienne Sibille
Collaborators
Barclay Morrison, Joseph Gogos
Michael Hayden, Blair Leavitt, Tony Blau, Panos Papapanou

Answers to FAQs
• No, they don’t have to be time course experiments.
• Yes, we’re using cDNA as well as Affymetrix etc.
• Yes, we see reproducible negative correlations.
• Yes, we’re interested in finding differences as well as similarities between data sets.
• No, we aren’t necessarily inferring regulatory relationships.
• Yes, we know that RNA is just one way of measuring cell state.
• No, we don’t have {worm,fly,yeast…} data, but we’d like to.
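
Illustration: counting recurring coexpression links

The "count recurring patterns" idea from the talk (call a coexpression "link" in each data set, then count how many data sets each link is seen in) can be sketched in a few lines. This is a minimal sketch and not the pipeline used in the work presented. The fixed correlation cutoff, the function names, and the toy expression values are all assumptions made for illustration; the gene names are borrowed from the slides only as labels.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def links_in_dataset(expr, threshold):
    """Return the links called in one data set.

    expr maps gene name -> list of expression values (one per sample).
    A link is (gene_a, gene_b, sign); positive and negative correlations
    are kept apart so that opposite evidence is never pooled.
    ASSUMPTION: a fixed |r| cutoff stands in for whatever selection rule
    a real analysis would use.
    """
    links = set()
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            links.add((g1, g2, "+" if r > 0 else "-"))
    return links


def count_support(datasets, threshold=0.9):
    """Vote counting: number of data sets in which each link is seen."""
    support = Counter()
    for expr in datasets:
        support.update(links_in_dataset(expr, threshold))
    return support


def recurring_links(support, min_datasets):
    """Links seen in at least min_datasets data sets."""
    return {link for link, n in support.items() if n >= min_datasets}


# Toy data (made-up values): three small "data sets" with different genes/samples.
datasets = [
    {"RPL3": [1, 2, 3, 4, 5], "RPS2": [2, 4, 6, 8, 10.5], "GRIN1": [5, 1, 4, 2, 3]},
    {"RPL3": [3, 1, 2, 5], "RPS2": [6, 2.5, 4, 10], "GRIN1": [2, 1, 3, 2]},
    {"RPL3": [1, 2, 3], "GRIN1": [3, 1, 2]},  # RPS2 not measured here
]

support = count_support(datasets, threshold=0.9)
print(recurring_links(support, min_datasets=2))  # {('RPL3', 'RPS2', '+')}
```

Because a link only gets a vote when it clears the cutoff within a single data set, a pair that is weakly but consistently correlated everywhere receives no support, which is the limitation noted on the "Vote counting limitations" slide.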