Data integration
Brixen 2008
Wolfgang Huber
EMBL-EBI

Overview
• Along genomic coordinates
• By gene
• (by pairs of genes)
• (by sets of genes)
• Here, "gene" is used in a loose sense, to be defined as appropriate for the application; the concept encompasses:
– Loci on the DNA
– Transcripts (RNA molecules)
– Proteins

Integration of data along genomic coordinates
An example: We measured the frequency of recombination events (crossovers, and gene conversions not associated with crossover) throughout the genome of S. cerevisiae. Is this pattern ("hotspots") associated with:
• GC content
• promoters
• across- or within-species conservation
[Figure: panels labelled CO and NCO]

Testing for association
You can consider the different sets of features along the genome as continuous-valued, or binary, "time" series $X_1(t), \ldots, X_n(t)$.
Consider, e.g., the case where $X_i(t)$ and $X_j(t)$ are {0,1} indicators. A simple (but, as we will see, inadequate) approach would be to compute an overlap statistic such as

$$S_{ij} = \frac{\sum_t X_i(t)\, X_j(t)}{\sum_t X_i(t)}$$

or

$$S_{ij} = \frac{\sum_t X_i(t)\, X_j(t)}{\sum_t X_i(t)\, X_i(t)}$$

and estimate its null distribution through random permutation in t.

Testing for association
Alternatively, one could also compute, for each feature in series i, the distance to the closest feature in j, and then take a summary of the distribution of that statistic (e.g. the median).
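The two naive statistics described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the indicator series `xi` and `xj` are simulated, and the helper names `overlap`, `perm_null` and `median_nearest` as well as the number of permutations are hypothetical choices, not part of the original analysis.

```r
set.seed(1)                  # fixed seed so the sketch is reproducible
n  <- 1000                   # number of positions t (assumed)
xi <- rbinom(n, 1, 0.05)     # simulated {0,1} indicator series X_i(t)
xj <- rbinom(n, 1, 0.05)     # simulated {0,1} indicator series X_j(t)

# Overlap statistic: sum_t X_i(t) X_j(t) / sum_t X_i(t)
overlap <- function(a, b) sum(a * b) / sum(a)

# Null distribution of the overlap statistic, obtained by randomly
# permuting the positions of series b (the naive scheme)
perm_null <- function(a, b, nperm = 999) {
  replicate(nperm, overlap(a, sample(b)))
}

# Median distance from each feature in a to the closest feature in b
median_nearest <- function(a, b) {
  pa <- which(a == 1)
  pb <- which(b == 1)
  median(vapply(pa, function(p) min(abs(pb - p)), numeric(1)))
}

s_obs  <- overlap(xi, xj)
s_null <- perm_null(xi, xj)
# One-sided permutation p-value, counting the observed value itself
p_val  <- (1 + sum(s_null >= s_obs)) / (1 + length(s_null))
d_med  <- median_nearest(xi, xj)
```

Note that `sample(b)` shuffles all positions uniformly, so the null distribution here assumes a homogeneous genome.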
"Boring" association of features in inhomogeneous time series
[Figure: feature positions sampled with weights w1 and w2 (two upper panels, positions 0 to 10000); lower panel "multidensity": density of nearest-neighbour distances (x axis: Distances, y axis: Density) for a uniform genome (w1) and a genome with blocks (w2)]

    ## Flawed testing for association along the genome
    library("Biobase")       # provides matchpt()
    library("geneplotter")   # provides multidensity()
    library("RColorBrewer")  # provides brewer.pal()

    n = 10000   # number of positions along the simulated genome

    ## Draw two independent sets of s features with sampling weights 'weights',
    ## plot both sets, and return the distance from each feature in the first
    ## set to its nearest neighbour in the second set
    oneplot = function(weights, s=200) {
      e1 = sample(n, s, prob=weights)
      e2 = sample(n, s, prob=weights)
      d = matchpt(e1, e2)$distance
      plot(x=e1, y=rep(1, length(e1)), type="p", pch=16, col="#A6CEE3",
           ylim=c(-0.1, 1.1), xlab="", ylab="")
      points(x=e2, y=rep(0.9, length(e2)), pch=16, col="#B2DF8A")
      lines(weights/sum(weights)*0.3, col="grey")   # rescaled sampling weights
      return(d)
    }

    w1 = rep(1, n)                        # uniform genome
    w2 = rep(rep(c(0, 1), each=n/8), 4)   # genome with blocks
    par(mfrow=c(3,1))
    dists = list(
      w1=oneplot(w1),
      w2=oneplot(w2))
    multidensity(dists, xlab="Distances", xlim=c(0, 120))
    legend("topright", names(dists), lwd=2, lty=1, col=brewer.pal(9, "Set1"))

Testing for association
"Everything is correlated with GC content", etc. Hence everything is correlated with everything else. That is not very interesting.
Are two sets of genomic features correlated more than expected? To be interesting, this "expectation" is not just a uniform random distribution along the genome, but includes some "background model".
When setting up such a test, we need to define what an interesting background model (null hypothesis) is, then set up an appropriate randomization scheme to try to reject it.
For example, we could say that we know that there are long-range structures in the genome, in which we are not interested, and we want to test whether two features that we mapped at fine scale show local correlation above the coarse-scale correlation.

Data integration via "genes"
A common and intuitive method for data integration is to compare the data from different experiments (assays) by mapping them all to the same set of genes.
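As a minimal sketch of such a mapping, the base R code below joins two assays on a common gene identifier. All tables, identifiers and values are made up for illustration; in practice the mapping tables would come from an annotation resource rather than being typed in by hand.

```r
# Hypothetical assay 1: expression values keyed by microarray probe ID
expr <- data.frame(probe    = c("p1", "p2", "p3", "p4"),
                   log_expr = c(7.2, 5.1, 9.8, 6.4),
                   stringsAsFactors = FALSE)

# Hypothetical assay 2: protein abundance keyed by protein ID
prot <- data.frame(protein   = c("P_a", "P_b", "P_c"),
                   abundance = c(120, 45, 300),
                   stringsAsFactors = FALSE)

# Hypothetical mapping tables from reagent identifiers to gene IDs
probe2gene <- data.frame(probe = c("p1", "p2", "p3", "p4"),
                         gene  = c("g1", "g1", "g2", "g3"),
                         stringsAsFactors = FALSE)
prot2gene  <- data.frame(protein = c("P_a", "P_b", "P_c"),
                         gene    = c("g1", "g2", "g4"),
                         stringsAsFactors = FALSE)

# Attach gene IDs, keeping the primary identifiers as columns
expr_g <- merge(expr, probe2gene, by = "probe")
prot_g <- merge(prot, prot2gene, by = "protein")

# Full outer join on gene: genes seen in only one assay get NA in the other
combined <- merge(expr_g, prot_g, by = "gene", all = TRUE)
print(combined)
```

In this toy data, two probes map to gene g1, and genes g3 and g4 each appear in only one assay, so the joined table has duplicated and incomplete rows.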
This sounds easier than it is: different assays investigate different aspects of a gene
• transcript(s): level
• protein product(s): level, localization, structure, ...
• chromatin state
• promoter
• UTRs
• antisense transcript
and our understanding of how these aspects are organised together in a gene may be subtle, controversial, and changeable over time.

Data integration via "genes"
The reagents and target molecule identifiers used in different experiments may be different:
• RefSeq ID
• Entrez ID
• Ensembl Gene ID
• Ensembl Transcript ID
• Uniprot ID
• Gene coordinate on the chromosome
• Microarray probe sequence
• siRNA sequence
• Peptide sequence identified in MS
• Short read sequence
Bioconductor offers tools to map these to each other (annotation packages; biomaRt; Biostrings).

Data integration via "genes"
Bioconductor offers tools to map these to each other, so
• others can reproduce your mapping
• you can redo the mapping as the biological databases get updated
• you can try out different ways to do the mapping and see how they affect the subsequent data analysis
Think about this early:
– keep the primary reagent identifiers around
– use versioning (annotation packages!)
– make the mapping process part of your reproducible, documented, automated workflow

Acknowledgement
Robert Gentleman
Richard Bourgon
Jörn Tödling
Greg Pau

Report Generation
hwriter package

References
• R. Gentleman, F. Hahne, W. Huber (2006). Visualizing Genomic Data. Bioconductor Project Working Papers, Paper 10.
• A. Zeileis, K. Hornik (2006). Choosing Color Palettes for Statistical Graphics. Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Research Report Series, Report 41.