Integrated Data Systems for
Genomic Analysis
Genomics and Bioinformatics for the
Advancement of Clinical Sciences
Thomas Jefferson University, Oct. 14, 2002
Chris Stoeckert, Ph.D.
Dept. of Genetics & Center for Bioinformatics
University of Pennsylvania
Plasmodium genomics: Genomics and
proteomics pave the way for controlling malaria
Nature, October 3, 2002
Thinking Genomically
Genome
•Genome structure
•Genes and function
•Pathways
•Expression patterns
•(Complex) diseases
Phenotype
Using a Genomics Unified Schema
(GUS) to ask genomic questions
The Genomics Unified Schema (GUS) is a relational
database that warehouses and integrates
biological sequence, sequence annotation, and
gene expression data from a number of
heterogeneous sources. User-friendly web
interfaces present slices of the GUS database
and allow researchers to execute structured
queries for information concerning gene
structure, function, and expression.
GUS Powers Multiple Genomics Projects
AllGenes
AllGenes is based on a comprehensive
mouse and human gene index. The genes
are approximated by transcripts predicted
from EST and mRNA clustering.
PlasmoDB
PlasmoDB is the official database of the
Plasmodium falciparum genome project. It
provides an integrated view of genome
sequence data, including expression data from
EST, SAGE, and microarray projects.
EPConDB
EPConDB is an index of genes expressed in
endocrine pancreas. Expression is defined
either through microarray experiments or
sequence annotation.
allgenes.org query
"Is my cDNA similar to any mouse genes that are
predicted to encode transcription factors and have
been localized to mouse chromosome 5?"
This query illustrates several aspects of the GUS database including:
Data Integration
Data Analysis
Tools
•RHMap
•GOFunction
•Sequence
•GOFunction assignments
•Boolean function
•History function
•BLAST
http://www.allgenes.org/
Steve Fischer, Debbie Pinney, Brian Brunk, Joan Mazzarelli, Jonathan
Crabtree, Yongchang Gan, Sharon Diskin
Nikolay Kolchanov, Alexey Katokhin
Select the allgenes.org boolean query page
Click on the "AND" button
Choose the RH map and GO function queries
Select mouse chromosome 5 and "transcription factor"
There are 26 mouse RNAs (assemblies) that meet these criteria:
This query result set now appears on the query "history" page:
Now use the BLAST page to identify RNAs similar to my cDNA
The results of the BLAST search appear in the query history
Intersect ("AND") the BLAST search with the previous query:
And we have our answer (the third row on the query history page):
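The walkthrough above amounts to set operations on stored result sets. A minimal sketch in Python, assuming a query history can be modeled as a list of labeled ID sets. The class QueryHistory, its methods, and the assembly identifiers are hypothetical and are not the allgenes.org implementation. Only "AND" appears in the slides; "OR" and "NOT" are included as assumed companions.

```python
from dataclasses import dataclass, field


@dataclass
class QueryHistory:
    """Stores each query's result set so later queries can be combined with it."""

    rows: list = field(default_factory=list)

    def add(self, label, ids):
        """Record a result set and return its 1-based row number in the history."""
        self.rows.append((label, frozenset(ids)))
        return len(self.rows)

    def ids(self, row):
        """Return the result set stored in a given history row."""
        return self.rows[row - 1][1]

    def combine(self, row_a, row_b, op="AND"):
        """Combine two earlier rows and record the outcome as a new row."""
        label_a, ids_a = self.rows[row_a - 1]
        label_b, ids_b = self.rows[row_b - 1]
        if op == "AND":    # intersect, as in the walkthrough
            ids = ids_a & ids_b
        elif op == "OR":   # assumed operator: union
            ids = ids_a | ids_b
        elif op == "NOT":  # assumed operator: in A but not in B
            ids = ids_a - ids_b
        else:
            raise ValueError(f"unsupported operator: {op}")
        return self.add(f"({label_a}) {op} ({label_b})", ids)


# Hypothetical assembly identifiers, made up for this example.
on_chr5 = {"asm1", "asm2", "asm3", "asm4"}
transcription_factors = {"asm2", "asm3", "asm7"}
blast_hits = {"asm3", "asm9"}

history = QueryHistory()
row1 = history.add("chr 5 AND transcription factor", on_chr5 & transcription_factors)
row2 = history.add("BLAST: similar to my cDNA", blast_hits)
row3 = history.combine(row1, row2, "AND")
print(row3, sorted(history.ids(row3)))
```

Keeping every result set in the history is what lets the BLAST search be run independently and combined afterwards, so the answer lands in the third row just as in the slides.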
Other transcripts from
the same gene
Predicted GO function(s)
(some manually reviewed)
External links
Mapping information
Gene trap insertions
Protein/motifs, etc.
predicted protein
CAP4 assembly; EST expression profile; UCSC BLAT
PlasmoDB query:
"List all genes whose proteins are predicted to
contain a signal peptide and for which there is
evidence that they are expressed in Plasmodium
falciparum's merozoite stage."
This query illustrates several aspects of the GUS database including:
Data Integration
•Genome annotation
•Mass spec
Data Analysis
•Sequence analysis
Tools
•History function
http://plasmodb.org/
David Roos, Jessie Kissinger, Bindu Gajria, Martin Fraunholz, Jules Milgram, Phil
Labo, Amit Bahl, Dave Pearson, Dinesh Gupta, Hagai Ginsburg
Jonathan Crabtree, Jonathan Schug, Brian Brunk, Greg Grant, Trish Whetzel, Matt
Mailman, Li Li
Select Queries from the PlasmoDB homepage
Choose signal
peptide
Choose chromosome and Gene/prediction type-submit
There are 651 genes
with predicted signal
peptides
Choose Gene Expression from the queries page, then Proteomics
Then choose chromosome, lifecycle stage, evidence - submit
There are 828 gene
predictions that
satisfy this query
Go to the history page and choose which simple
queries to combine. Select intersect.
We have an answer.
There are 86
predicted genes that
satisfy our complex
query
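In a relational warehouse such as GUS, the intersect step could plausibly be expressed as a SQL INTERSECT over two simple queries. The sketch below uses Python's sqlite3 module with an in-memory database rather than Oracle. The table names, column names, gene IDs, and stage values are invented for illustration and are not the GUS schema or PlasmoDB data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables standing in for two simple PlasmoDB queries.
    CREATE TABLE signal_peptide (gene_id TEXT PRIMARY KEY);
    CREATE TABLE proteomics_evidence (gene_id TEXT, stage TEXT);
    """
)
conn.executemany(
    "INSERT INTO signal_peptide VALUES (?)",
    [("g1",), ("g2",), ("g3",)],
)
conn.executemany(
    "INSERT INTO proteomics_evidence VALUES (?, ?)",
    [("g2", "merozoite"), ("g3", "sporozoite"), ("g4", "merozoite")],
)

# Genes with a predicted signal peptide AND mass spec evidence in one stage.
query = """
    SELECT gene_id FROM signal_peptide
    INTERSECT
    SELECT gene_id FROM proteomics_evidence WHERE stage = ?
    ORDER BY gene_id
"""
merozoite_signal_genes = [row[0] for row in conn.execute(query, ("merozoite",))]
print(merozoite_signal_genes)
conn.close()
```

The same pattern would apply to any pair of simple queries that return gene identifiers, which is why the history page can offer intersect as a generic operation.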
Click on a gene
to get a full
report
There is a variety of
information available
from the report page
including:
Predicted protein features
and gene models
EPConDB
query:
"Which DOTS assemblies (RNA) represented on
the Endocrine Pancreas Consortium’s chip 2.0 are
constituents of the insulin-initiated signal
transduction pathway?"
Data Integration
•Sequence
•Microarray
experiment
•Transduction
pathway
Data Analysis
•BLAST
Tools
•History function
http://www.cbil.upenn.edu/EPConDB
Klaus Kaestner, Marie Scearce, John Brestelli, Phillip Le
Elisabetta Manduchi, Angel Pizarro, Debbie Pinney, Greg Grant, Joan
Mazzarelli, Jonathan Crabtree, Hongxian He, Shannon McWeeney, Matt
Mailman
Go to the gene information query page and click on
“DOTS assemblies involved in a pathway”
Choose the insulin pathway, a p-value, pancreas, the species, and
whether an assembly must include an mRNA - submit
There are 59 DoTS
assemblies that are
constituents of the
insulin pathway
Return to the gene information query page and select
clone sets. Choose chip 2.0 - submit
There are 3242 assemblies represented on chip 2
Go to the history page, select the queries to
combine and select intersect – view the results
There are 8
assemblies
that satisfy
the complex
query.
Clicking on an
RNA retrieves
an allgenes
report.
Using Databases to Think Genomically
• Draw attention to these resources
• Show how different data sources and
approaches can be used to ask
powerful questions
• This can be done for different
organisms, different systems
How GUS Works
AllGenes
PlasmoDB
EPConDB
Java Servlets
DoTS RAD TESS SRES Core
Oracle RDBMS
Object Layer for Data Loading
Other sites,
Other projects,
e.g. GeneDB
Goals of GUS
• Generic platform for model organism or disease specific
databases
• Integration of genome, transcript and protein data, including:
– Sequence
– Function
– Expression
– Interaction
– Regulation
– Orthologs and paralogs
• Support for:
– automated annotation and integration
– manual curation
– data mining/analysis and sophisticated queries
– web access
http://www.gusdb.org
Jonathan Crabtree, Jonathan Schug, Steve Fischer, Elisabetta Manduchi, Angel
Pizarro, Junmin Liu, Debbie Pinney, Greg Grant, Trish Whetzel, Li Li, Sharon
Diskin, Hongxian He
Architecture of GUS
[Architecture diagram]
Data sources: GenBank, InterPro, GO, etc.; Genomic Sequence; GSSs & ESTs; Mapping Data; microarray & SAGE Experiments; QTL, POP, SNP, Clinical
Automated Analysis & Integration; Annotation; Annotator’s Interface
Object Layer
Oracle/SQL: Core, SRes, DoTS, RAD, TESS
Java Servlets & Perl CGI; Mining Applications
WWW queries, browsing, & download
Five domains
GUS is divided into 5 domains* (separate name spaces)
Namespace | Domain | Highlights
Core | Data Provenance | Evidence
SRes (Shared Resources) | Shared Resources | Ontologies
DoTS (DB of Transcribed Seqs) | Sequence and annotation | Central dogma
RAD (RNA Abundance DB) | Gene expression | MIAME/MAGE
TESS (Trans Elem Search Site) | Gene regulation | Grammars
* Protein Abundance DB domain underway
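Because each domain is a separate name space, table names elsewhere in these slides are written in qualified form, for example DoTS.NaFeature or TESS.Moiety. A small Python sketch of how such names might be split and checked against the five domains. The dictionary GUS_DOMAINS and the function split_qualified_name are hypothetical helpers written for this example and are not part of GUS.

```python
# Namespace -> (domain, highlight), taken from the five-domains slide.
GUS_DOMAINS = {
    "Core": ("Data Provenance", "Evidence"),
    "SRes": ("Shared Resources", "Ontologies"),
    "DoTS": ("Sequence and annotation", "Central dogma"),
    "RAD": ("Gene expression", "MIAME/MAGE"),
    "TESS": ("Gene regulation", "Grammars"),
}


def split_qualified_name(name):
    """Split 'Namespace.Table' and verify that the namespace is a known domain."""
    namespace, sep, table = name.partition(".")
    if not sep or not table:
        raise ValueError(f"expected 'Namespace.Table', got {name!r}")
    if namespace not in GUS_DOMAINS:
        raise ValueError(f"unknown namespace: {namespace!r}")
    return namespace, table


for qualified in ("DoTS.NaFeature", "TESS.Moiety"):
    namespace, table = split_qualified_name(qualified)
    domain, highlight = GUS_DOMAINS[namespace]
    print(f"{table}: namespace={namespace}, domain={domain}, highlight={highlight}")
```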
DoTS central dogma schema
Gene | RNA | Protein
Gene Instance | RNA Instance | Protein Instance
Gene Feature (isa NA Feature) | RNA Feature (isa NA Feature) | Protein Feature (isa NA Feature)
Genomic Sequence (isa NA Sequence) | RNA Sequence (isa NA Sequence) | Protein Sequence (isa AA Sequence)
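The "isa" labels on this slide describe subtype relationships, which can be sketched as a small class hierarchy. This is an illustration in Python, not the GUS object layer: the class names follow the slide, while the fields (residues, start, end) are assumptions added so the example runs. Protein Feature is omitted because its parent type is not clear from the transcript.

```python
from dataclasses import dataclass


@dataclass
class NASequence:
    """Nucleic acid sequence (hypothetical field: residues)."""
    residues: str


@dataclass
class AASequence:
    """Amino acid sequence (hypothetical field: residues)."""
    residues: str


class GenomicSequence(NASequence):
    """Genomic Sequence isa NA Sequence."""


class RNASequence(NASequence):
    """RNA Sequence isa NA Sequence."""


class ProteinSequence(AASequence):
    """Protein Sequence isa AA Sequence."""


@dataclass
class NAFeature:
    """A region located on a nucleic acid sequence (coordinates are assumed)."""
    sequence: NASequence
    start: int
    end: int


class GeneFeature(NAFeature):
    """Gene Feature isa NA Feature."""


class RNAFeature(NAFeature):
    """RNA Feature isa NA Feature."""


# Toy data, made up for this example.
contig = GenomicSequence("ACGTACGTAC")
gene_feature = GeneFeature(sequence=contig, start=2, end=8)
protein = ProteinSequence("MKT")
print(isinstance(gene_feature, NAFeature), isinstance(protein, NASequence))
```

Anything written against NAFeature or NASequence then works for gene and RNA rows alike, which is one plausible reason for organizing the schema around shared parent types.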
RAD schema uses MAGE/MIAME
[Schema diagram: tables joined by 1, 0..1, and 0..* associations]
MAGE: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol, Descr., HigherLevelAnalysis
MIAME: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization
RAD tables: StudyAssay, Array, Assay, Study, StudyDesignAssay, ArrayAnnotation, StudyDesign, Control, ElementAnnotation, BioMaterialCharacteristic, BioMaterialImp, ElementImp, StudyFactor, StudyDesignDescription, StudyFactorValue, AssayLabeledExtract, Channel, CompositeElementImp, BioMaterialMeasurement, Acquisition, LabelMethod, RelatedAcquisition, CompositeElementAnnotation, OntologyEntry, Treatment, AcquisitionParam, ElementResultImp, CompositeElementResultImp, ProcessResult, Quantification, MAGEDocumentation, RelatedQuantification, ProtocolParam, ProcessIO, MAGE_ML, QuantificationParam, Protocol, AnalysisInput, ProcessInvocation, ProcessInvocationParam, ProcessImplementationParam, AnalysisInvocation, AnalysisInvocationParam, AnalysisOutput, ProcessImplementation, Analysis, AnalysisImplementation, AnalysisImplementationParam
http://www.mged.org
Journals are Adopting the MGED Standards
Use of Minimum Information About a Microarray Experiment (MIAME)
TESS Schema
TESS.Moiety
Moiety
MoietyHeterodimer
MoietyMultimer
MoietyComplex
TESS.Activity
ActivityProteinDnaBinding
TESS.FootprintInstance
DoTS.NaFeature
ActivityTissueSpecificity
BindingSite
TESS.TrainingSet
TESS.Model
ModelString
Promoter
...
TESS.ParameterGroup
ModelConsensusString
DoTS.NaSequence
ModelPositionalWeightMatrix
TESS.Note
ModelGrammar
RAD
DoTS
EST clustering
and assembly
Identify shared
TF binding sites
Genomic alignment
and comparative
Sequence analysis
TESS
Using GUS for Genomic Research
• Annotating mouse chromosome 5
– Maja Bucan
• Identifying novel genes expressed in the
endocrine pancreas
– Klaus Kaestner, Alan Permutt, Doug Melton
• Identifying genes regulated by CREB
– Allan Pack, Mirek Mackiewicz
Annotation of Mouse
Chromosome 5
• What are all the genes?
• What is their structure and function?
• Where are they expressed and how is this
regulated?
Maja Bucan, Otto Valladares, Kyle Gaulton
Jonathan Crabtree, Yongchang Gan, Joan Mazzarelli, Jonathan Schug
Areas of Focus on Mouse Chromosome 5
[Chromosome map figure]
Position markers: 8, 12, 15, 20, 23, 30, 40, 43
Genes: Reln; Sema3a,c,d,e; Nos3; Dpp6; Htr5a; Hdh, Adra2c; Drd5; Qdpr; Gabrb1, a2, g1; Pdgfra, Kit, Flk1; Clock
Human cytogenetic bands: 7q21-22, 7q36, 4p16.3, 4p15.31, 4q12
Rw as a balancer
Approach to Annotating Mouse Chromosome 5
• Genomic sequence
– Public release: chromosome 5 has many gaps
– Celera
– Combine to eliminate gaps where possible
• Gene models
– Ensembl predictions
– Celera predictions
– BLAT alignment of DoTS
– Comparison to human regions
Known RefSeq Genes in (72-76Mb) Region as Viewed
in UCSC Genome Browser
Only 14 RefSeq Genes plus an additional 7 from Ensembl
Known Genes on Mouse Chromosome 5
MGI approved symbols
~72Mb
~76Mb
5033405K12Rik
6030432N09Rik
1810027I20Rik
AI836376
Sgcb
1700067I02Rik
C78283
2700023E23Rik
1190017B18Rik
6720475M21Rik
1300019H17Rik
Lnx1
Chic2
Gsh2
Pdgfra
Kit
Kdr
Gabarapl2 (homolog)
Srd5a2l
Tparl
Clock
Pdcl2
Nmu
Gene symbol synonyms
KIAA1458
KIAA0826
LOC231293
KIAA0276
FLJ12552
Identified 28 known genes
15 genes have assigned GO Functions
5 enzyme
4 signal transducer
4 ligand binding or carrier
3 nucleic acid binding
2 transporter
Example of Known Mouse Chromosome 5 Gene - Chic2
* Alignment reveals exon differences between RNAs belonging to the gene (alternative forms)
Putative Genes on Mouse Chromosome 5
putative gene mouse chr5
Note: multi-exon alignment; single IMAGE clone 583253; polyA signal suggests 3’ end of gene
putative gene mouse chr5
Note: singleton ESTs from IMAGE clone 551428 align
putative gene mouse chr5
Note: multi-exon alignment; ESTs from single IMAGE clone 515319; possible polyA signal in 3' sequence
putative gene mouse chr5
Note: multiple span alignment; 9/02 - RNAs also aligning to another region of mouse chr5
putative gene mouse chr5
Note: 3 ESTs in assembly from embryo
…….
…….
Total 21 (some putative genes may later be merged)
Example of a Putative Mouse Gene
Example: DT.40155293, IMAGE clone sequences (5’ and 3’ in same assembly)
Genes on Mouse Chromosome 5
• 72-76 Mb region
– 65 genes from automated DoTS analysis
– 49 genes after manual evaluation
– 21 Ensembl genes
– 14 RefSeq genes
• Whole chromosome 5 (151 Mb)
– 2157 genes from automated DoTS analysis
– 1275 Ensembl genes
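These counts can be compared directly by taking each one as a ratio to the Ensembl count for the same span. A short Python sketch of that arithmetic, using only the numbers on this slide. The dictionary layout and the function name ratio_to_ensembl are choices made for this example.

```python
# Gene counts as listed on the slide.
counts = {
    "72-76 Mb region": {
        "DoTS automated": 65,
        "manual evaluation": 49,
        "Ensembl": 21,
        "RefSeq": 14,
    },
    "whole chromosome 5": {
        "DoTS automated": 2157,
        "Ensembl": 1275,
    },
}


def ratio_to_ensembl(region, source):
    """Return how many genes a source reports per Ensembl gene in a region."""
    return counts[region][source] / counts[region]["Ensembl"]


for region, by_source in counts.items():
    for source in by_source:
        if source != "Ensembl":
            ratio = ratio_to_ensembl(region, source)
            print(f"{region}: {source} / Ensembl = {ratio:.2f}")
```

In the 72-76 Mb region the manually evaluated set is a little over twice the Ensembl count, while the automated whole-chromosome figure is roughly 1.7 times the Ensembl count and has not been manually evaluated.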
Summary
• To make links between genotype and phenotype, the output
of technologies such as genomic sequencing, microarrays,
mass spec, etc., must be integrated
• Our solution is GUS, Genomics Unified Schema, used for
multiple systems: AllGenes, PlasmoDB, EPConDB
– GUS is freely available as a system for use and development
– RAD is part of GUS and uses the microarray standards now available
• Using GUS for genomic research such as annotating
mouse chromosome 5.
– Possibly doubling the number of genes in annotated regions!
http://www.cbil.upenn.edu