AWG Members:
Srinka Ghosh, Sandra Michelsen, David MacAlpine,
Matt Eaton, Steve Henikoff, Ann Hammonds, Gos Micklem, Manolis
Kellis, Peter Park, Xiaole Liu, Mark Gerstein,
Sue Celniker, Eric Lai, Kris Gunsalus,
Bob Waterston, David Miller, Lincoln Stein
modENCODE Consortium meeting
Rockville, MD
2008.06.17, 9:15-10:30
(10' near beginning of session)
Slides downloadable from Lectures.GersteinLab.org.
(Please read permissions statement.)
Paper references mostly from Papers.GersteinLab.org.
(Quick overview of the ENCODE pilot results, focusing on G&T results, pgenes + DART.
[I:ENCODE], fit into time )
Do not reproduce without permission
(c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
modENCODE.org
Thoughts on Integrative Genome
Analysis -- as Stimulus for a
Discussion towards Consortium
Publication(s)
Purpose of the Session
• Our charge
  - "NHGRI would like the Consortium to think early on about what would be involved in an integrative paper and what would be the steps forward to accomplish this goal."
  - What integrative analyses do we want to do?
  - What have we done, a year in?
  - What do we need to do?
• Data freezes? Analysis discussions?
• Bringing in new types of data or analysts?
• Fast-track particular experiments or analyses?
• Specific things to think about
• What were integrative analyses in the framework of pilot ENCODE? (case study)
• Where is modENCODE with respect to these?
  - Two case studies:
    - microRNAs in fly (E Lai)
    - the 3' UTRome in worm (K Gunsalus)
What is Integrative Analysis?
Brief presentations
to give us "data" for our deliberations
• Case study in the annotation of unannotated transcription and relating it to other annotation
A Tale of TARs (& TxFrags)
• Vignette drawn from activity in pilot ENCODE Genes & Transcripts group (Gingeras TR, Guigó R, Snyder M, Birney E, Zhang ZD, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S...)
Production
Integrated
Different types of analyses carried out in the modENCODE/ENCODE consortia
• Development of Sequence (and Array) Technology
• Output of Production Pipelines and Surveying of Single Type of Annotation
• Integrated Analysis Connecting Different Types of Annotation
Where are we now in modENCODE?
Tech
Production
Li et al., PLOS one (2007)
A Starting Point: Noisy Raw Signal from Tiling Arrays
Johnson et al. (2005) TIG, 21, 93-102.
Tech
Tech
Production
• Array data can be normalized by mean, median, quantile, etc. How to do this consistently?
• Tile scoring using a (smoothing) sliding window generates the signal map and the P-value map.
Zhang et al. (2007) Genome Biology
Source: Bolstad, B.M., et al (2003), Bioinformatics,
19, 185-93.
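The two processing steps named above (quantile normalization across arrays, then sliding-window smoothing of tile intensities) can be sketched as follows. This is a minimal illustration, not the pipeline used in the cited papers: the function names are hypothetical, ties are broken by probe position rather than averaged, a plain window median stands in for whatever window statistic a real pipeline uses, and the P-value map is not computed.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    `arrays` is a list of equal-length lists, one list per array (assumed layout).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference at its rank
        normalized.append(out)
    return normalized


def sliding_median(signal, half_window=1):
    """Smooth a tile signal with a centered median window, truncated at the ends."""
    return [
        median(signal[max(0, i - half_window): i + half_window + 1])
        for i in range(len(signal))
    ]
```

Applying the same normalization and the same window width to every experiment is one way to address the "how to do this consistently" question, since the resulting signal maps then share a scale.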
Signal Processing to
Normalize and Standardize Signals to Get a Useable
and Cross-experiment comparable Signal Map
Tech
• Sens. v Spec. for different platforms
• How does this extrapolate genome-wide, across samples?
• Attempting to score experiments in uniform fashion
• Understand "lab effects" vs real ones
• Where is NextGen seq. technology in rel. to this?
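Scoring platforms in a uniform fashion amounts to comparing each platform's calls against one shared reference set. The sketch below is an illustration only: the function name is hypothetical, and it assumes calls and reference are given as sets of tile positions drawn from a universe of known size.

```python
def sens_spec(called, truth, universe_size):
    """Sensitivity and specificity of `called` positions against `truth` positions."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)               # called and truly positive
    fp = len(called - truth)               # called but not in the reference
    fn = len(truth - called)               # in the reference but missed
    tn = universe_size - tp - fp - fn      # everything else
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Running this for each platform or lab against the same reference gives numbers that can be compared directly, which is a first step toward separating lab effects from real differences.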
Emanuelsson et al. ('07) Gen. Res.
Calibrating Error Rates for Each
Platform
Segmentation: Finding Discrete Annotation Blocks (TARs/TxFrags) from Processed Signal
• Iterative process of building a model, segmenting signal into discrete, easily useable "hits" (TARs/TxFrags), and validating some of them
• Agreeing on consistent definitions of "hits" and TARs (e.g. point sources)
• Defining consistent thresholds
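A simple threshold-based segmentation of a processed signal into discrete blocks can be sketched as below. This is not the model-based method of the cited papers; the function and its parameters (`threshold`, `max_gap`, `min_run`) are illustrative assumptions, and block length here counts all tiles spanned, including bridged gaps.

```python
def segment(signal, threshold, max_gap=1, min_run=2):
    """Return half-open (start, end) tile-index blocks where signal >= threshold.

    Hits separated by at most `max_gap` sub-threshold tiles are merged;
    blocks spanning fewer than `min_run` tiles are dropped.
    """
    blocks = []
    start = last_hit = None
    for i, value in enumerate(signal):
        if value < threshold:
            continue
        if start is not None and i - last_hit - 1 > max_gap:
            # Gap too long: close the current block if it is long enough.
            if last_hit - start + 1 >= min_run:
                blocks.append((start, last_hit + 1))
            start = None
        if start is None:
            start = i
        last_hit = i
    # Close the final open block, if any.
    if start is not None and last_hit - start + 1 >= min_run:
        blocks.append((start, last_hit + 1))
    return blocks
```

Fixing the three parameters once and reusing them across experiments is what "consistent thresholds" would mean in a scheme like this; changing any of them changes which blocks are reported.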
Production
[Du et al. (2006) Bioinformatics; Fig. from Gerstein et al., Gen. Res. (’07)]
Tech
Production
• Annotated and unannotated TxFrags detected in different cell lines.
[ENCODE Consortium, Nature 447, 2007]
Statistics on the TxFrags: Surveys
of a Single Type of Annotation
Production
Integrated
Rozowsky et al. Gen. Res. (2007)
Phylogenetic Profiles
or
More Developed Annotation: Clustering and
Classifying Blocks of Un-annotated
Transcription into larger units
Vast Amounts of Different Data Types to Integrate in pilot ENCODE
• Determining experimental signals for biochemical activity across each base of genome
• Large-scale sequence comparison in relation to the human genome
[ENCODE Consortium, Nature 447, 2007]
Feature Class | Expt. Tech. | Numb. Expt. | Data Pts.
Transcription | Tiling array, Integrated annotation | | 63,348,656
5′ Ends of transcripts | Tag sequencing | | 864,964
Histone modifications | Tiling array | | 4,401,291
Chromatin structure | QT-PCR, Tiling array | | 15,318,324
Sequence-specific factors | Tiling array, tag sequencing, Promoter assays | | 324,846,018
Replication | Tiling array | | 14,735,740
Computational analysis | Computational methods | | NA
Comparative sequence analysis | Genomic sequencing, multi-sequence alignments, computational analyses | | NA
Polymorphisms | Resequencing, copy number variation | | NA
Integrated
[Figure: special ψG (pseudogene) tracks in browser; labeled tracks: composite ChIP hit, diTAG, CAGE, TARs, ChIP-chip]
Connecting TARs (TxFrags) in Integrative fashion to different types of Annotation
• Single Ex. of Pseudogene Intersecting with Transcriptional and Regulatory Evidence
• Are integrated experiments comparable -- i.e. done on consistent cell lines, on same coordinate sys., etc.?
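Intersecting TARs with another annotation track only makes sense when both use the same coordinate system, so a sketch of the intersection can check that explicitly. Everything here is illustrative: the `Feature` record, its fields, and the 0-based half-open convention are assumptions, and the all-pairs loop is the simplest possible approach rather than an efficient one.

```python
from collections import namedtuple

# Assumed record layout: 0-based, half-open [start, end) coordinates.
Feature = namedtuple("Feature", "assembly chrom start end label")


def overlapping(tars, annotations):
    """Return (tar_label, annotation_label, overlap_bp) for every overlapping pair."""
    pairs = []
    for t in tars:
        for a in annotations:
            if t.assembly != a.assembly:
                # Different genome builds: coordinates are not comparable.
                raise ValueError(f"assembly mismatch: {t.assembly} vs {a.assembly}")
            if t.chrom == a.chrom and t.start < a.end and a.start < t.end:
                overlap = min(t.end, a.end) - max(t.start, a.start)
                pairs.append((t.label, a.label, overlap))
    return pairs
```

A check for consistent cell lines would need an extra field on each record; it is left out here because the slides do not say how that metadata is stored.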
Zheng et al. (2007) Gen. Res.
Integrated
Integrated
Integrating Transcriptional Evidence with
Gene Annotation and Sequence Constraints
Ka/Ks
Avg. Integration
over many
instances
• No Greater Tendency for Transcribed Pseudogenes to be under Selective Constraint
• Need a way of easily defining degree of constraint on sequence (not so easy for non-coding)
Measurement of Short-time variation (pN+pS)
Zheng et al. (2007) Gen. Res.
Processed pseudogene
Non-processed pseudogene
Gene
Transcribed
Integrated
• Integrating & averaging results over larger and larger sets
• Comparison of integrated quantities
[ENCODE Consortium, Nature 447, 2007]
Biochemically Active Regions Don't all Appear to be Under Constraint
Integrated
• Not all constrained sequence annotated in some fashion
• Exactly how things are defined in terms of overlap?
• "At the outset of the ENCODE Project, many believed that the broad collection of experimental data would nicely dovetail with the detailed evolutionary information derived from comparing multiple mammalian sequences to provide a neat ‘dictionary’ of conserved genomic elements, each with a growing annotation about their biochemical function(s). In one sense, this was achieved; the majority of constrained bases in the ENCODE regions are now associated with at least some experimentally-derived information about function. However, we have also encountered a remarkable excess of unconstrained experimentally-identified functional elements, and these cannot be dismissed for technical reasons. This is perhaps the biggest surprise of the pilot phase of the ENCODE Project, and suggests that we take a more ‘neutral’ view of many of the functions
[ENCODE Consortium, Nature 447, 2007]
Grand Summary: Biochemical
Activity vs. Sequence Constraints
Integrated
++
• Making thresholds and statistics comparable across organisms (so we can really say whether or not worm has more or less novel transcription than fly)
• Can we relate tissues and developmental states betw. organisms?
• Can we deal with seq. constraint in a uniform fashion?
• Defining orthologs betw. organisms and lineage-specific genes
• Proper liaison with ENCODE
An Added Challenge for modENCODE: Comparing Results among Evolutionarily Distant Organisms
Production
Integrated
++
?
Tale of TARs:
Where is modENCODE?
• Where are we on Tech, Production, Integrated Spectrum?
  - More action on seq. tech call than AWG
  - Tech stuff is important, towards getting useable, comparable error-parameterized annotation blocks
• Importance of comparative genomics issues
  - Constraint, orthologs
Tech