Intro to Comp Genomics
Lecture 7: Using large-scale functional genomics datasets
Your Task
Preparations:
• Get your hands on the ChIP-seq
profiles of CTCF and PolII in hg
chr17, bin-size = 50bp
• Cut the data into segments of 50,000
data points
Modeling:
• Use EM to build a probabilistic model
for the peak signals and the
background.
• Use heuristics for peak finding to
initialize the EM
Analysis:
• Test if your model for single peak
structure is as good as the model for
two peak structures.
• Compute the distribution of peaks
relative to transcription start sites
Modeling
[State diagram: S, P1, P2, P3, P.., B, F]
$P(x \mid P_1) = N(x; \mu_1, \sigma_1)$
$P(x \mid P_2) = N(x; \mu_2, \sigma_2)$
$P(x \mid P_3) = N(x; \mu_3, \sigma_3)$
$P(x \mid P_4) = N(x; \mu_4, \sigma_4)$
$P(x \mid B) = N(x; \mu, \sigma)$
The model uses K states for the peak
and one state for the background.
Use K=40.
Modeling
Implement HMM inference: forward-backward
Make sure your total probability is the
same in the forward and the
backward forms!
Implement the EM update rules
Run EM from multiple random points
and record the likelihoods you derive
Implement smarter initialization: take
the average values around all probes
with value over a threshold.
Compute posterior peak probabilities:
report all loci with P(Peak)>0.8
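The forward-backward step and its consistency check can be sketched as follows. This is a minimal illustration, not the course's reference solution: the function names, the toy two-state model in the usage note, and the choice to treat the second Gaussian parameter as a standard deviation are all assumptions made here. It uses per-position scaling so that 50,000-point segments do not underflow.

```python
import math

def gauss(x, mu, sigma):
    """Density of a normal distribution at x (sigma taken as the standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def forward_backward(obs, init, trans, mus, sigmas):
    """Scaled forward-backward for an HMM with Gaussian emissions.

    Returns (log-likelihood from the forward pass,
             log-likelihood from the backward pass,
             per-position posterior state probabilities).
    """
    n, T = len(init), len(obs)
    emit = [[gauss(x, mus[s], sigmas[s]) for s in range(n)] for x in obs]

    # Forward pass: normalize each row and remember the scaling constants.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [init[s] * emit[0][s] for s in range(n)]
        else:
            row = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(n))
                   for s in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([v / c for v in row])
    loglik_fwd = sum(math.log(c) for c in scale)

    # Backward pass, divided by the same scaling constants.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r]
                             for r in range(n)) / scale[t + 1]

    # Total probability recomputed from the backward variables at position 0.
    back0 = sum(init[s] * emit[0][s] * beta[0][s] for s in range(n))
    loglik_bwd = math.log(back0) + sum(math.log(c) for c in scale[1:])

    # With this scaling, alpha * beta is directly the posterior P(state | data).
    post = [[alpha[t][s] * beta[t][s] for s in range(n)] for t in range(T)]
    return loglik_fwd, loglik_bwd, post
```

For example, with a hypothetical background state (index 0) and a single peak state (index 1), `forward_backward([0.1, -0.2, 3.1, 2.9, 0.0], [0.9, 0.1], [[0.9, 0.1], [0.2, 0.8]], [0.0, 3.0], [1.0, 1.0])` returns two log-likelihoods that should agree to numerical precision, and positions whose peak posterior exceeds 0.8 are the reported loci.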
Analysis
Compare the two peak structures you
get (from CTCF and PolII)
Retrain a model together on the two
datasets
Compute the log-likelihood of the
unified model and compare to the
sum of the log-likelihoods of the two models
Optional: test if the difference is
significant by:
- sampling data from the unified model
- training two models on the synthetic
data and computing the likelihood delta
as for the real data
Use a set of known TSSs to compute
the distribution of peaks relative to
genes
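One way to compute the distribution of peaks relative to TSSs is sketched below. The function names and the 50bp histogram bin are illustrative assumptions, and the sketch ignores gene strand, so "upstream" and "downstream" are only in genome coordinates; a strand-aware version would flip the sign for minus-strand genes.

```python
from bisect import bisect_left
from collections import Counter

def signed_distance_to_nearest_tss(peak_pos, tss_sorted):
    """Return peak_pos minus the nearest TSS coordinate.

    tss_sorted must be a non-empty sorted list; ties go to the
    lower-coordinate TSS.
    """
    i = bisect_left(tss_sorted, peak_pos)
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda t: abs(peak_pos - t))
    return peak_pos - nearest

def distance_histogram(peaks, tss, bin_size=50):
    """Count peaks per distance bin; keys are the lower edge of each bin."""
    tss_sorted = sorted(tss)
    counts = Counter()
    for p in peaks:
        d = signed_distance_to_nearest_tss(p, tss_sorted)
        counts[(d // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))
```

Here `peaks` would be the coordinates of loci with P(Peak)>0.8 and `tss` a list of known TSS coordinates on chr17.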
Functional genomics
• 10 years after the appearance of microarrays, thousands of experiments have been performed on different cells and conditions
• One of the original promises of the technology was that it would form a vast body of data that can serve future modeling and analysis purposes
• Standards have been established, and it is mandatory to deposit high-throughput datasets when publishing papers describing them
• Unlike PubMed for literature or BLAST/BLAT for sequence, the functional genomics database is not usable through a single simple tool
• We will discuss and practice some strategies for utilizing this powerful resource
NCBI - GEO
Platform
Sample
Series
Data availability
GEO:
268,611 experiments (!!)
5343 platforms
(Any species, condition, experiment)
Mandatory submission for all published papers
Also: EBI ArrayExpress
Challenge: find what you need
Specific databases are curated and organized:
Species: e.g., SGD for yeast
Disease: e.g., Oncomine for cancer – 28,800
arrays organized around specific cancer types
Phenotype?
Other specific assays?
Gene expression:
Still most of the data
Different sets of genes or gene model!
Conditions are critical
Comparative genomic hybridization (aCGH):
Important for disease with genomic aberrations
TF binding profiles
Old type: gene arrays
Currently: Tiling array or ChIP-seq
Gene expression data uses
different platforms (old cDNA, Affy,
new long oligo arrays)
Vastly different gene sets and gene
models
RNA genes are now on most arrays
Understanding the experimental
conditions for each array is a
challenge
Avoiding replicates or using them
smartly
Beware of systematic pre-normalization of the original data:
subtracting the median/mean from a
specific dataset introduces a strong
bias for all the arrays in it when
compared to other datasets!
Transcription factor interactions, histone modifications maps:
Histone modifications
Genes bound by certain TFs
Genes (or regions) enriched for specific histone
modifications
Hundreds of factors and modifications
Different experimental conditions
Abundant data for yeast, flies, mouse and human
Knock-down/knock-out library phenotype
Library of mutants lacking
each of the non-essential
yeast genes is available
(knockout)
Essential genes can be
knocked down using a
specialized promoter
Libraries can be
automatically screened for
viability and/or growth rate
in different conditions using
robotics and 96/384 well
plate formats
Libraries of RNAi constructs
allow similar screens for
worms and flies.
Mammalian screens are
becoming possible as well
Genetic interactions
Testing the phenotype of
multi-gene knockouts
provides key insights into the
genetic network
A gene may be essential for
growth under some
condition, but become
dispensable when another
gene is knocked down
A mutation can be lethal
only in the presence of
another knockout (synthetic
lethality)
In yeast, systematic
screens for synthetic
lethality have been practical for
over 5 years.
Genetic interactions
Improved technology
provides more quantitative
measurement of the growth
phenotype of double knockdowns
Matching all pairs of
genes in a large subset of
the genome is practical,
and the resulting EMAP
provides a quantitative
estimate of the epistasis in
the group (e.g., Schuldiner
lab here at WIS)
f ( AB)  f ( A)  f ( B)  X ?
Protein interactions
Physical interactions between proteins highlight
post-translational regulatory networks and
structural organization of key organelles
Data comes from several technologies:
most reliably, techniques involving mass
spectrometry and isolation of protein complexes.
Indirect techniques involving transcriptional
assays (yeast two-hybrid)
And more..
Data is partial and sometimes difficult to interpret
(what do we mean by interaction?)
A large body of literature deals with
speculation on protein networks; its relevance to
actual biology is questionable…
Array CGH/genetic aberrations
Data on deletion/insertion and copy number
variation is generated by hybridization to arrays
or more recently through sequencing
Data is critical for studies of cancer.
Databases also include lists of genomic loci that
are known to be unstable in (specific types of)
cancer.
Gene ontology
Hierarchical vocabulary (GO terms)
Annotations: association of term with gene in a
specific species
Unifying different research communities
Also associating all super-terms
Process-…
Function-…
Component-..
GO-Slim is a flat version of the ontologies
Z-scores, T-test – the basics
You want to test if the mean (RNA expression) of a gene set A is significantly different from that of a gene set B.
If you assume the variance of A and B is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{(n_A - 1) S_A^2 + (n_B - 1) S_B^2}{n_A + n_B - 2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$
t is distributed like T with nA+nB-2 degrees of freedom
If you don’t assume the variance is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$
$$\text{d.o.f.} = \left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2 \Bigg/ \left[\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}\right]$$
But in this case the whole test becomes rather flaky!
In a common scenario, you have a small set of genes, and you screen a large set of conditions for interesting biases.
You need a quick way to quantify deviation of the mean
For a set of K genes, sampled from a standard normal distribution, how would the mean be distributed?
The mean is normal with mean 0 and variance $1/K$: $N(0, 1/K)$
So if your conditions are normally distributed and pre-standardized to mean 0, std 1,
you can quickly compute the sum of values over your set and generate a z-score:
$$Z = \frac{\sum X_A}{\sqrt{|A|}}$$
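The two t statistics and the set z-score can be sketched with the standard library alone. The function names are illustrative; turning t into a p-value needs the Student t distribution, which is left out here, while the z-score p-value uses the normal tail via `math.erfc`.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Equal-variance t statistic and its degrees of freedom (nA + nB - 2)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Unequal-variance t statistic with the approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

def set_z_score(values):
    """Z-score of a gene set in one condition: sum of values over sqrt(set size).

    Assumes the condition was pre-standardized to mean 0, std 1.
    """
    return sum(values) / math.sqrt(len(values))

def z_to_two_sided_p(z):
    """Two-sided normal p-value for a z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

When screening many conditions, `set_z_score` is the cheap statistic: one sum per condition, with no per-condition variance estimate.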
Kolmogorov-Smirnov statistics
$D = \max_{-\infty < x < \infty} |S_N(x) - P(x)|$
$D = \max_{-\infty < x < \infty} |S_{N_1}(x) - S_{N_2}(x)|$
The D-statistic is non-parametric: you can transform x
arbitrarily (e.g. log x) without changing it
The distribution of the D statistic is given by the form:
$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}$
$N_e = \frac{N_1 N_2}{N_1 + N_2}$
$P(D > \text{observed}) = Q_{KS}\left(\left[\sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e}\right] D\right)$
A non-parametric variant on the T-test theme
is the Mann-Whitney test.
You take your two sets and rank them
together. You sum the ranks of one of your
sets ($R_1$)
$U = R_1 - \frac{n_1 (n_1 + 1)}{2}$
$U \sim N(\mu_U, \sigma_U)$
$\mu_U = n_1 n_2 / 2$
$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$
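Both rank-based tests fit in a few lines of plain Python. This is a sketch under stated assumptions: the function names are made up for illustration, the $Q_{KS}$ series is truncated at 100 terms and replaced by 1.0 for small arguments where it converges slowly, and ties in the Mann-Whitney ranking get mid-ranks without any tie correction of the variance.

```python
import math

def q_ks(lam, terms=100):
    """Truncated series for Q_KS(lambda); returns 1.0 for small lambda (approximation)."""
    if lam < 0.2:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample KS test."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        v = min(xs[i], ys[j])
        while i < n1 and xs[i] == v:   # step both empirical CDFs past v
            i += 1
        while j < n2 and ys[j] == v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    return d, q_ks(lam)

def mann_whitney(x, y):
    """Return (U, z) where z uses the normal approximation for U."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    r1 = 0.0
    k = 0
    while k < len(pooled):
        m = k
        while m < len(pooled) and pooled[m][0] == pooled[k][0]:
            m += 1
        midrank = (k + 1 + m) / 2.0    # average of ranks k+1 .. m
        r1 += midrank * sum(1 for t in range(k, m) if pooled[t][1] == 0)
        k = m
    u = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, (u - mu) / sigma
```

Because both statistics depend only on the ordering of the values, applying a monotone transform such as a logarithm to all inputs leaves D and U unchanged.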
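Hyper-geometric and chi-square test
Contingency table of counts for two classifications A and B, with row totals, column totals and grand total N:
n11  n12  n13 | n1.
n21  n22  n23 | n2.
n31  n32  n33 | n3.
n.1  n.2  n.3 | N
$\chi^2 = \sum_{i,j} \frac{\left(n_{i,j} - \frac{n_{i,\cdot}\, n_{\cdot,j}}{N}\right)^2}{n_{i,\cdot}\, n_{\cdot,j} / N}$
Chi-square distributed with m*n-m-n+1 = (m-1)(n-1) d.o.f.
$P(|A \cap B| = k) = \frac{\binom{n_A}{k} \binom{N - n_A}{n_B - k}}{\binom{N}{n_B}}$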
Testing hypotheses on interaction graphs
Given your gene set and a set of gene-gene or protein-protein interactions.
How can you test if your set is enriched in
intra-set interactions?
Criterion for an additional gene that is
strongly interacting with your set?
Do complexes tend to be split by your set or
do they tend to be contained in the set?
Node’s degree in the graph?
Overall network density?
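One simple answer to the first question reuses the hypergeometric test: treat every possible gene pair as an item, the observed interactions as the "marked" items, and the pairs inside your set as the draw. This null model is an assumption chosen for illustration; it accounts for overall network density but not for node degree, so hub-rich sets will look more significant than they should. The function names are hypothetical.

```python
from math import comb

def hypergeom_tail(k, N, nA, nB):
    """P(|A ∩ B| >= k) for a random B of size nB among N items, nA of them in A."""
    upper = min(nA, nB)
    total = comb(N, nB)
    return sum(comb(nA, i) * comb(N - nA, nB - i)
               for i in range(k, upper + 1)) / total

def intra_set_enrichment(edges, n_nodes, gene_set):
    """Return (number of interactions inside gene_set, hypergeometric p-value).

    edges: iterable of (u, v) pairs over nodes 0..n_nodes-1, undirected;
    self-loops and duplicate edges are ignored.
    """
    gene_set = set(gene_set)
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    observed = sum(1 for e in edge_set if e <= gene_set)
    all_pairs = comb(n_nodes, 2)          # population: every possible pair
    set_pairs = comb(len(gene_set), 2)    # draw: pairs inside the set
    p = hypergeom_tail(observed, all_pairs, len(edge_set), set_pairs)
    return observed, p
```

A degree-aware alternative is to estimate the null empirically, by counting intra-set edges for random gene sets matched in size and degree distribution.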
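The iterative signature algorithm
[Figure: expression matrix, normalized for conditions. A gene set $A^{G,0}$ gives condition scores $e_{A,1}, \ldots, e_{A,j}, \ldots, e_{A,m}$, which define a condition set $A^{C,1}$; the condition set gives gene scores $e_{1,A}, \ldots, e_{j,A}, \ldots, e_{n,A}$, which define the next gene set $A^{G,1}$.]
Conditions, simple statistics:
$A^{C,1} = \{ j \mid e_{A,j} / k > T_G \}$
Plug in your favorite:
$A^{C,1} = \{ j \mid \mathrm{pval}(j) < \mathrm{thres} \}$
Genes, simple statistics:
$A^{G,1} = \{ i \mid e_{i,A} / |A^{C,1}| > T_C \}$
Plug in your favorite:
$A^{G,1} = \{ i \mid \mathrm{pval}(i) < \mathrm{thres} \}$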
The iterative signature algorithm
[Figure: the same scheme iterated, with gene set $A^{G,iter}$ and condition set $A^{C,iter}$]
Iterate until convergence (small changes
in gene/condition sets)
Convergence is not guaranteed.
Try starting from your target gene set or
from random sets.
Thresholds are critical
Variants: use a weighted average instead
of plain average
Allow signs for conditions
Different statistics for thresholding (non-parametric KS/MW? Parametric non-normal?)
Can you think of a probabilistic version?
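A deterministic version of the iteration can be sketched as below. The statistic is one "plug in your favorite" choice made for illustration: the sum over the current set divided by the square root of its size, which is a z-score if the matrix is pre-standardized. The function name, default thresholds, and stopping rule (exactly unchanged sets) are assumptions, not the published algorithm's settings.

```python
import math

def isa(E, genes, t_cond=2.0, t_gene=2.0, max_iter=50):
    """Iterate between condition and gene selection on matrix E (genes x conditions).

    Returns (gene set, condition set, converged flag).
    """
    genes, conds = set(genes), set()
    n_genes, n_conds = len(E), len(E[0])
    for _ in range(max_iter):
        if not genes:
            return set(), set(), True      # bicluster collapsed
        # Condition step: keep conditions where the gene set scores high.
        new_conds = {j for j in range(n_conds)
                     if sum(E[i][j] for i in genes) / math.sqrt(len(genes)) > t_cond}
        if not new_conds:
            return set(), set(), True      # bicluster collapsed
        # Gene step: keep genes that score high over the selected conditions.
        new_genes = {i for i in range(n_genes)
                     if sum(E[i][j] for j in new_conds) / math.sqrt(len(new_conds)) > t_gene}
        if new_genes == genes and new_conds == conds:
            return genes, conds, True
        genes, conds = new_genes, new_conds
    return genes, conds, False             # no fixed point within max_iter
```

Raising `t_gene` or `t_cond` shrinks the resulting bicluster and lowering them grows it, which is the parameter sweep to plot when studying bicluster size.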
A Probabilistic formulation
$\Pr(e_{ij}) = d_i c_j N(\mu; \sigma) + (1 - d_i c_j) N(0; 1)$
[Figure: matrix normalized for conditions, rows labeled with gene indicators, e.g. $d_1 = 0$, $d_{i1} = 1$]
$\Pr(c_j = 1 \mid e, d^0) = \frac{\prod_i d_i c_j N(\mu; \sigma)}{\prod_i d_i c_j N(\mu; \sigma) + \prod_i (1 - d_i c_j) N(0; 1)}$
Pros and cons?
$\Pr(d_i = 1 \mid e, c) = \frac{\prod_j d_i c_j N(\mu; \sigma)}{\prod_j d_i c_j N(\mu; \sigma) + \prod_j (1 - d_i c_j) N(0; 1)}$
Playing with the condition/gene means?
Convergence?
Multiple-testing
Testing for high mean of your gene set in
100,000 conditions in the database.
You expect to get one case with p<0.00001 !
Stringent correction: multiply the p-value by the
number of tests
A rational alternative: control the false-discovery rate (FDR):
[Figure: P-value cutoff chosen so that there are 10 times more “hits” than expected errors; example label: Go term 1]
In many cases, your tests are not really
independent
For example, testing enrichment for functional
annotations that are hierarchical
Another example is multiple gene expression
conditions that are very similar (same tumor
type)
You can estimate the empirical distribution of
your statistics on random sets of the same size
and use this as your p-value
This should be done with care: making sure
your sampled sets are really similar in nature
to your true sets and controlling for effects you
want to factor out.
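The three corrections discussed here can be sketched as follows. The slide does not name a specific FDR procedure; the Benjamini-Hochberg step-up rule is used below as one standard choice, and all function names and defaults are illustrative. The empirical p-value samples uniformly from a gene universe, so it does not by itself control for the confounders mentioned above.

```python
import random

def bonferroni(pvals):
    """Stringent correction: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """Return indices of tests accepted at FDR level q (Benjamini-Hochberg step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= q * rank / m:
            k_max = rank                    # largest rank passing its threshold
    return sorted(order[:k_max])

def empirical_pvalue(observed, statistic, universe, set_size, n_samples=1000, seed=0):
    """Fraction of random same-size sets scoring at least as high as observed.

    Uses the (hits + 1) / (n_samples + 1) estimate so the p-value is never 0.
    """
    rng = random.Random(seed)
    hits = sum(statistic(rng.sample(universe, set_size)) >= observed
               for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)
```

With 100,000 conditions, Bonferroni requires a raw p-value below 5e-7 for a corrected 0.05, whereas the FDR rule accepts a growing threshold as more tests pass.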
Your Task
• Download the GNF human expression atlas from the UCSC genome browser or GEO
• Find 1-5 datasets on breast cancer in GEO
• Combine IDs, merge the datasets
• Download the gene ontology human associations. Extract gene set(s) related to apoptosis and to cell cycle.
• Use your previous analysis of chromosome 17 to generate the set of 40 genes for which the 20k window containing their promoter had the lowest correlation to the overall k-mer spectrum
• Also generate a set of 40 chr17 genes with the highest G+C content in the 1kb upstream of their promoter (you can use the Genome browser tools for that)
• Implement your version of the iterative signature algorithm (you are free to select the statistics you are using). You can implement the deterministic or probabilistic version.
• Starting from the above gene sets, see if and how your algorithm converges. Compute the intersection of the converged set with the original sets and report the conditions you found
• Change your algorithm parameters to get smaller or larger biclusters, and plot the size of the resulting sets as a function of the parameter you are changing