Data integration
Brixen 2008
Wolfgang Huber
EMBL-EBI
Overview
• Along genomic coordinates
• By gene
• (by pairs of genes)
• (by sets of genes)
• Here, "gene" is used in a loose sense, to be defined as
appropriate for the application; the concept encompasses:
– Loci on the DNA
– Transcripts (RNA molecules)
– Proteins
Integration of data along genomic coordinates
An example:
We measured the frequency of recombination events (crossovers, and gene conversions not associated with a crossover)
throughout the genome of S. cerevisiae.
Is this pattern ('hotspots') associated with:
• GC content
• promoters
• across- or within-species conservation
[Figure: recombination events along the genome; tracks labelled CO (crossover) and NCO (non-crossover)]
Testing for association
• You can consider the different sets of features along the
genome as continuous-valued, or binary, "time" series
$X_1(t), \ldots, X_n(t)$
• Consider, e.g., the case where $X_i(t)$ and $X_j(t)$ are {0,1}
indicators. A simple (but as we will see, inadequate)
approach would be to compute an overlap statistic such as
$$S_{ij} = \frac{\sum_t X_i(t)\, X_j(t)}{\sum_t X_i(t)}$$
• or
$$S_{ij} = \frac{\sum_t X_i(t) \wedge X_j(t)}{\sum_t X_i(t) \vee X_j(t)}$$
• and estimate its null distribution through random
permutation in t.
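A minimal sketch of this overlap test in base R, on made-up data. The series `xi` and `xj`, the feature density of 0.02, the number of permutations `B`, and the helper names `overlap` and `jaccard` are all assumptions for illustration; the Jaccard-type variant is one plausible reading of the second statistic, not necessarily the exact form used in the talk.

```r
set.seed(1)
n  <- 10000                  # number of genomic positions (assumed)
xi <- rbinom(n, 1, 0.02)     # toy {0,1} indicator series (hypothetical data)
xj <- rbinom(n, 1, 0.02)

# Fraction of the features in series i that coincide with a feature in series j
overlap <- function(xi, xj) sum(xi * xj) / sum(xi)

# Jaccard-type variant (assumed form): intersection over union
jaccard <- function(xi, xj) sum(xi & xj) / sum(xi | xj)

s_obs <- overlap(xi, xj)

# Null distribution: permute the positions t of series j, keep series i fixed
B      <- 1000
s_null <- replicate(B, overlap(xi, sample(xj)))

# One-sided permutation p-value, with the usual +1 correction
p_value <- (1 + sum(s_null >= s_obs)) / (B + 1)
```

Note that permuting all positions treats the genome as homogeneous, which is exactly why this simple approach is called inadequate here.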
Testing for association
• Alternatively, one could also compute, for each
feature in series i, the distance to the closest feature
in j, and then take a summary of the distribution of
that statistic (e.g. median).
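The distance-based summary could be sketched in base R as follows. The helper `nearest_dist` is a hypothetical name, and the feature positions are simulated; this is an illustration, not the implementation used for the talk (whose own code relies on `matchpt`).

```r
# Distance from each position in a to the closest position in b
nearest_dist <- function(a, b) {
  b <- sort(b)
  k <- findInterval(a, b)    # number of elements of b that are <= a
  # closest feature on the left, if there is one
  left  <- ifelse(k >= 1, a - b[pmax(k, 1)], Inf)
  # closest feature on the right, if there is one
  right <- ifelse(k < length(b), b[pmin(k + 1, length(b))] - a, Inf)
  pmin(left, right)
}

set.seed(1)
e1 <- sample(10000, 200)     # positions of features in series i (simulated)
e2 <- sample(10000, 200)     # positions of features in series j (simulated)

d      <- nearest_dist(e1, e2)
d_med  <- median(d)          # summary of the distance distribution
```

Sorting once and using `findInterval` avoids computing all pairwise distances, which matters when the feature sets are large.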
"Boring" association of features in
inhomogeneous time series
[Figure: two panels showing the positions of both feature sets along a genome of length 10000, for sampling weights w1 and w2; third panel: density of nearest neighbour distances (x-axis: Distances, y-axis: Density) for w1 (uniform genome) and w2 (genome with blocks)]



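## Flawed testing for association along the genome
library("Biobase")        # matchpt
library("geneplotter")    # multidensity
library("RColorBrewer")   # brewer.pal

n = 10000   # genome length (number of positions)

## Draw two independent feature sets of size s using position-dependent
## sampling weights, plot them, and return the nearest-neighbour distances.
oneplot = function(weights, s=200) {
  e1 = sample(n, s, prob=weights)
  e2 = sample(n, s, prob=weights)
  ## distance from each feature in e1 to the closest feature in e2
  d = matchpt(e1, e2)$distance
  plot(x=e1, y=rep(1, length(e1)), type="p", pch=16, col="#A6CEE3",
       ylim=c(-0.1, 1.1), xlab="", ylab="")
  points(x=e2, y=rep(0.9, length(e2)), pch=16, col="#B2DF8A")
  ## sampling weights, rescaled to a maximum height of 0.3 so they are visible
  lines(weights/max(weights)*0.3, col="grey")
  return(d)
}

w1 = rep(1, n)                        # uniform genome
w2 = rep(rep(c(0, 1), each=n/8), 4)   # genome with blocks
par(mfrow=c(3,1))

dists = list(
  w1=oneplot(w1),
  w2=oneplot(w2))
multidensity(dists, xlab="Distances", xlim=c(0, 120))
legend("topright", names(dists), lwd=2, lty=1,
       col=brewer.pal(9, "Set1")[seq_along(dists)])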
Testing for association
• "Everything is correlated with GC content", etc.
• Hence everything is correlated with everything else. That is not
very interesting.
• Are two sets of genomic features correlated more than
expected?
• To be interesting, this "expectation" is not just uniform random
distribution along the genome, but includes some "background
model". When setting up such a test, we need to define what an
interesting background model (null hypothesis) is, then set up
an appropriate randomization scheme to try to reject it.
• For example, we could say that we know that there are long
range structures in the genome, in which we are not interested,
and we want to test whether two features that we mapped at fine
scale show local correlation above the coarse-scale correlation.
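One possible randomization scheme for such a background model is to permute positions only within coarse blocks, so that the long-range structure is preserved under the null. The sketch below is an illustration under assumptions: the block length of 1250, the simulated data, and the helper `permute_within_blocks` are made up and are not taken from the talk.

```r
set.seed(2)
n     <- 10000
block <- 1250                                 # coarse scale (assumed)
w     <- rep(rep(c(0, 1), each = block), 4)   # alternating empty and dense blocks

# Two feature series that share the block structure but are otherwise independent
xi <- rbinom(n, 1, 0.04 * w)
xj <- rbinom(n, 1, 0.04 * w)

overlap <- function(xi, xj) sum(xi * xj) / sum(xi)

# Shuffle a series within each block; block-level feature counts are unchanged
permute_within_blocks <- function(x, block) {
  g <- ceiling(seq_along(x) / block)
  ave(x, g, FUN = function(v) v[sample.int(length(v))])
}

B     <- 500
s_obs <- overlap(xi, xj)

# Null 1: uniform permutation along the whole genome (ignores the blocks)
null_global <- replicate(B, overlap(xi, sample(xj)))
# Null 2: permutation restricted to blocks (keeps coarse-scale structure)
null_block  <- replicate(B, overlap(xi, permute_within_blocks(xj, block)))

p_global <- (1 + sum(null_global >= s_obs)) / (B + 1)
p_block  <- (1 + sum(null_block  >= s_obs)) / (B + 1)
```

Comparing `p_global` with `p_block` shows how much of the apparent association is explained by the block structure alone. The choice of block length is itself part of the null hypothesis and needs to be justified for the application.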
Data integration via "genes"
• A common and intuitive method for data integration is to
compare the data from different experiments (assays) by
mapping them all to the same set of genes.
• This sounds easier than it is: different assays investigate
different aspects of a gene
– transcript(s) level
– protein product(s) level, localization, structure, ...
– chromatin state
– promoter
– UTRs
– antisense transcript
• and our understanding of how these aspects are organised
together in a gene may be subtle, controversial, and
changeable over time.
Data integration via "genes"
• The reagents and target molecule identifiers used in
different experiments may be different:
– RefSeq ID
– Entrez ID
– Ensembl Gene ID
– Ensembl Transcript ID
– Uniprot ID
– Gene coordinate on the chromosome
– Microarray probe sequence
– siRNA sequence
– Peptide sequence identified in MS
– Short read sequence
• Bioconductor offers tools to map these to each other
(annotation packages; biomaRt; Biostrings).
Data integration via "genes"
• Bioconductor offers tools to map these to each other, so
– others can reproduce your mapping
– you can redo the mapping as the biological databases get
updated
– you can try out different ways to do the mapping and see
how they affect the subsequent data analysis
Think about this early:
- keep the primary reagent identifiers around
- use versioning (annotation packages!)
- make the mapping process part of your reproducible,
documented, automated workflow
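As a rough illustration of these three points, the base R sketch below maps a toy assay table to genes while keeping the primary identifiers and recording the mapping version. All identifiers, the table `probe2gene`, and the version string are hypothetical; in a real workflow the mapping would come from an annotation package or biomaRt rather than from a hand-written table.

```r
# Hypothetical measurements, keyed by the primary reagent identifier
assay <- data.frame(
  probe_id = c("probe_1", "probe_2", "probe_3", "probe_4"),
  value    = c(0.3, 1.2, -0.7, 2.1),
  stringsAsFactors = FALSE
)

# Hypothetical probe-to-gene mapping; probe_2 hits two genes, probe_4 none
probe2gene <- data.frame(
  probe_id = c("probe_1", "probe_2", "probe_2", "probe_3"),
  gene_id  = c("gene_A", "gene_B", "gene_C", "gene_A"),
  stringsAsFactors = FALSE
)
mapping_version <- "toy-mapping 0.1"   # placeholder for a real annotation version

# Left join: every probe is kept, unmapped probes get gene_id = NA
mapped <- merge(assay, probe2gene, by = "probe_id", all.x = TRUE)

# Flag ambiguous (n_genes > 1) and unmapped (n_genes == 0) probes
mapped$n_genes <- vapply(mapped$probe_id,
                         function(p) sum(probe2gene$probe_id == p),
                         integer(1), USE.NAMES = FALSE)

# Record which version of the mapping produced this table
mapped$mapping_version <- mapping_version
```

Because the primary identifier stays in the table, the mapping can be redone with an updated or alternative table, and the effect on downstream results can be compared directly.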
Acknowledgement
• Robert Gentleman
• Richard Bourgon
• Jörn Tödling
• Greg Pau
Report Generation
hwriter package