ArrayExpress and Expression Atlas:
Mining Functional Genomics data
Gabriella Rustici, PhD
Functional Genomics Team
EBI-EMBL
[email protected]
What is functional genomics (FG)?
• The aim of FG is to understand the function of genes and
other parts of the genome
• FG experiments typically utilize genome-wide assays to
measure and track many genes (or proteins) in parallel
under different conditions
• High-throughput technologies such as microarrays and
high-throughput sequencing (HTS) are frequently used in
this field to interrogate the transcriptome
ArrayExpress
What biological questions is FG addressing?
• When and where are genes expressed?
• How do gene expression levels differ in various cell types
and states?
• What are the functional roles of different genes and in
what cellular processes do they participate?
• How are genes regulated?
• How do genes and gene products interact?
• How is gene expression changed in various diseases or
following a treatment?
Components of a FG experiment
FG public repositories: ArrayExpress
• Is a public repository for FG data, which provides easy access to
well-annotated data in a structured and standardized format
• Serves the scientific community as an archive for data supporting
publications, together with GEO at NCBI and CIBEX at DDBJ
• Facilitates the sharing of experimental information associated with
the data, such as microarray designs and experimental protocols
• Based on community standards: MIAME guidelines & MAGE-TAB
format for microarray, MINSEQE guidelines for HTS data
(http://www.mged.org/minseqe/)
Community standards for data requirements
• MIAME = Minimum Information About a Microarray Experiment
• MINSEQE = Minimum Information about a high-throughput Nucleotide
SEQuencing Experiment
• The checklist:

Requirements                                              MIAME   MINSEQE
1. Experiment design / background description             ✓       ✓
2. Sample annotation and experimental factor              ✓       ✓
3. Array design annotation (e.g. probe sequence)          ✓       -
4. All protocols (wet-lab bench and data processing)      ✓       ✓
5. Raw data files (from scanner or sequencing machine)    ✓       ✓
6. Processed data files (normalised and/or transformed)   ✓       ✓
Standards for microarray & sequencing
MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of
different files to capture information about a microarray or HTS
experiment:
• IDF: Investigation Description Format file. Contains top-level information about the
experiment, including title, description, submitter contact details and protocols.
• SDRF: Sample and Data Relationship Format file. Contains the relationships between
samples and arrays, as well as sample properties and experimental factors, as
provided by the data submitter.
• ADF: Array Design Format file. Describes probes on an array, e.g. sequence,
genomic mapping location (for array data only).
• Data files: Raw and processed data files. The ‘raw’ data files are the trace data files (.srf or
.sff). Fastq format files are also accepted, but SRF format files are preferred. The
trace data files that you submit to ArrayExpress will be stored in the European
Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing
processed values, e.g. files in which the expression values are linked to genome
coordinates.
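As a rough illustration of how an SDRF ties samples to data files, the sketch below reads a small tab-delimited table with Python's csv module and groups data files by the value of one experimental factor. The column names, sample names and file names are illustrative assumptions, not taken from a real submission.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SDRF fragment; column names and values are illustrative only.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample1\tHomo sapiens\tnormal\ts1.CEL\n"
    "sample2\tHomo sapiens\tsarcoma\ts2.CEL\n"
    "sample3\tHomo sapiens\tsarcoma\ts3.CEL\n"
)


def files_by_factor(sdrf_text, factor_column, file_column):
    """Group data file names by the value of one experimental factor column."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for row in reader:
        groups[row[factor_column]].append(row[file_column])
    return dict(groups)


groups = files_by_factor(SDRF_TEXT, "Factor Value[disease]", "Array Data File")
print(groups)
```

Reading the file as plain tab-delimited text keeps the example independent of any dedicated MAGE-TAB parser.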
ArrayExpress – two databases
What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Query to retrieve experimental information and
associated data
Expression Atlas
• Central object: gene/condition
• Query for gene expression changes across
experiments and across platforms
ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your
research
• Download data and re-analyze it.
Often data deposited in public repositories can be used to
answer different biological questions from the one asked in
the original experiments.
• Submit microarray or HTS data that you want to publish.
Major journals will require data to be submitted to a public
repository like ArrayExpress as part of the peer-review
process.
How much data in AE Archive?
(as of mid-September 2012)
HTS data in AE Archive
(as of mid-September 2012)
Microarray vs HTS
RNA-, DNA-, ChIPseq breakdown
ArrayExpress
www.ebi.ac.uk/arrayexpress/
Browsing the AE Archive
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• Species investigated
• The date when the data were loaded in the Archive
• ‘Loaded in Atlas’ flag
• Raw sequencing data available in ENA
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
Browsing the AE Archive
• The total number of experiments and assays retrieved
• The direct link to raw and processed data. An icon indicates that this type of data is available.
Searching AE with the Experimental Factor Ontology (EFO)
• An application-focused ontology modeling the relationships between
experimental factors (EFs) in AE
• Developed to:
  • increase the richness of annotations that are currently made in AE Archive
  • promote consistency
  • facilitate automatic annotation and integrate external data
• EFs are transformed into an ontological representation, forming
classes and relationships between those classes
• Combines terms from a subset of well-maintained and compatible
ontologies, e.g. Gene Ontology, NCBI Taxonomy
Building EFO
An example
1. Take all experimental factors: disease, neoplasm, cancer, sarcoma, Kaposi’s sarcoma
2. Find the logical connection between them:
   • disease is the parent term
   • neoplasm is a type of disease
   • cancer is synonym of neoplasm
   • sarcoma is a type of cancer
   • Kaposi’s sarcoma is a type of sarcoma
3. Organize them in an ontology (shown on the slide as an expandable tree, with disease at the root and Kaposi’s sarcoma as the deepest term)
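The relationships in this example can be sketched as two small lookup tables, one for "is a type of" links and one for synonyms. This is only a toy model of the idea, not how EFO itself is stored; the names `IS_A`, `SYNONYMS`, `canonical`, `ancestors` and `descendants` are hypothetical.

```python
# Child -> parent ("is a type of") links from the example.
IS_A = {
    "neoplasm": "disease",
    "sarcoma": "cancer",
    "Kaposi's sarcoma": "sarcoma",
}
# Synonym -> preferred term ("cancer is synonym of neoplasm").
SYNONYMS = {"cancer": "neoplasm"}


def canonical(term):
    """Map a synonym to its preferred term; other terms are returned unchanged."""
    return SYNONYMS.get(term, term)


def ancestors(term):
    """Walk up the is-a chain, resolving synonyms at every step."""
    chain = []
    current = canonical(term)
    while current in IS_A:
        current = canonical(IS_A[current])
        chain.append(current)
    return chain


def descendants(term):
    """All terms that have `term` (or its preferred form) among their ancestors."""
    target = canonical(term)
    return sorted(t for t in IS_A if target in ancestors(t))


print(ancestors("Kaposi's sarcoma"))
print(descendants("cancer"))
```

With such a structure, a query for "cancer" can be expanded to its synonym and to every child term, which is what makes an ontology-driven search broader than plain text matching.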
Exploring EFO
An example
More information at: http://www.ebi.ac.uk/efo
Searching AE Archive
Simple query
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
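The three highlight colours correspond to three kinds of match. A minimal sketch of that classification follows, assuming a hypothetical miniature index (`TERM_SYNONYMS`, `TERM_CHILDREN`) rather than the real EFO or the actual ArrayExpress search code.

```python
# Hypothetical miniature index: query term -> synonyms, query term -> child terms.
TERM_SYNONYMS = {"neoplasm": {"cancer"}}
TERM_CHILDREN = {"neoplasm": {"sarcoma", "Kaposi's sarcoma"}}


def highlight(query, annotation):
    """Return the highlight colour for an annotation matched by a query, or None."""
    if annotation == query:
        return "yellow"  # exact term
    if annotation in TERM_SYNONYMS.get(query, set()):
        return "green"   # synonym
    if annotation in TERM_CHILDREN.get(query, set()):
        return "pink"    # child term in the ontology
    return None


for text in ["neoplasm", "cancer", "sarcoma", "kidney"]:
    print(text, highlight("neoplasm", text))
```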
AE Archive – experiment view
SDRF file – sample & data relationship
Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes
with a common gene attribute, e.g. GO term) change(s)
across all the experiments available in the Expression
Atlas;
• Discover which genes are differentially expressed in a
particular biological condition that you are interested in.
Expression Atlas construction
Experiment selection criteria during curation
• Array (platform) designs relating to the experiment must be
provided. Probe annotation must be adequate to enable re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
• At least 3 replicates for each value of the experimental factor
• Maximum 4 experimental factors
• Adequate sample annotation using EFO terms
• Presence of data files: CEL raw data files for Affymetrix assays,
processed data files for non-Affymetrix ones
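Two of these criteria, the replicate count and the limit on experimental factors, are simple enough to check mechanically. The sketch below assumes each assay is described by a dictionary of factor name to factor value; this data layout and the function name `passes_design_criteria` are assumptions made for illustration, and the other criteria (array design, EFO annotation, data files) are not covered.

```python
from collections import Counter

MIN_REPLICATES = 3  # at least 3 replicates for each factor value
MAX_FACTORS = 4     # maximum 4 experimental factors


def passes_design_criteria(samples):
    """samples: one dict per assay, mapping experimental factor name -> value."""
    factors = {name for sample in samples for name in sample}
    if not factors or len(factors) > MAX_FACTORS:
        return False
    for name in factors:
        # Count how many assays carry each value of this factor.
        counts = Counter(sample.get(name) for sample in samples)
        if any(n < MIN_REPLICATES for n in counts.values()):
            return False
    return True


balanced = [{"disease": "normal"}] * 3 + [{"disease": "sarcoma"}] * 3
print(passes_design_criteria(balanced))      # three replicates per value
print(passes_design_criteria(balanced[:5]))  # only two sarcoma replicates
```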
Expression Atlas construction
Analysis pipeline
Input data (Affy CEL, non-Affy processed): genes × conditions (Cond.1, Cond.2, Cond.3)
→ Linear model* (Bioconductor limma)
→ Output: 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3)
1 = differentially expressed
0 = not differentially expressed
* More information about the statistical methodology:
http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
Expression Atlas construction
Analysis pipeline
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X. The Cond.1, Cond.2 and Cond.3 means are plotted alongside the mean of all samples. Compare and calculate statistic.]
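To make the comparison concrete, the sketch below computes a plain t-like statistic for each condition mean against the mean of all samples, for one gene in one experiment. It is a deliberately simplified stand-in: the Atlas pipeline uses a linear model fitted with limma, and this is not that statistic. The expression values are made up, and the function assumes at least two samples per condition with non-zero variance.

```python
import math
import statistics


def condition_vs_overall(values_by_condition):
    """Return {condition: t-like statistic} for one gene in one experiment."""
    all_values = [v for values in values_by_condition.values() for v in values]
    overall_mean = statistics.mean(all_values)
    result = {}
    for condition, values in values_by_condition.items():
        # Standard error of the condition mean (sample standard deviation / sqrt(n)).
        standard_error = statistics.stdev(values) / math.sqrt(len(values))
        result[condition] = (statistics.mean(values) - overall_mean) / standard_error
    return result


# Made-up expression values for gene X.
gene_x = {
    "Cond.1": [8.0, 9.0, 10.0],
    "Cond.2": [4.0, 5.0, 6.0],
    "Cond.3": [4.0, 5.0, 6.0],
}
stats = condition_vs_overall(gene_x)
for condition, value in stats.items():
    print(condition, round(value, 2))
```

A large positive or negative value suggests the condition mean sits well away from the overall mean; turning that into a 1/0 call would additionally require a significance threshold and multiple-testing correction, which are omitted here.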
Expression Atlas construction
Exp. 1: genes × (Cond.1, Cond.2, Cond.3) → statistical test
Exp. 2: genes × (Cond.4, Cond.5, Cond.6) → statistical test
Exp. n: genes × (Cond.X, Cond.Y, Cond.Z) → statistical test
Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition
Expression Atlas construction
Summary of the “verdicts” from different experiments
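Summarising the verdicts amounts to counting votes per gene and condition across experiments. The sketch below assumes each experiment reports +1 (up), -1 (down) or 0 (not differentially expressed); this signed encoding and the function name `summarize_verdicts` are assumptions for illustration.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Count per-experiment verdicts for one gene under one condition.

    verdicts: iterable of +1 (up), -1 (down) or 0 (not differentially expressed).
    """
    counts = Counter(verdicts)
    return {"up": counts[1], "down": counts[-1], "non_de": counts[0]}


# Hypothetical verdicts from five experiments.
summary = summarize_verdicts([1, 1, 0, -1, 1])
print(summary)
```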
Expression Atlas
Atlas home page
http://www.ebi.ac.uk/gxa/
• Query for genes
• Query for conditions
• Restrict query by direction of differential expression
• The ‘advanced query’ option allows building more complex queries
Atlas home page
The ‘Genes’ and ‘Conditions’ search boxes
Atlas home page
A single gene query
Atlas gene summary page
Atlas experiment page
Atlas experiment page – HTS data
Atlas home page
A ‘Conditions’ only query
Atlas heatmap view
Atlas gene-condition query
Atlas advanced search
A glimpse of what’s coming…
“Differential atlas”
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X. The Cond.1, Cond.2 and Cond.3 means are plotted alongside the mean of all samples. Compare and calculate statistic.]
A glimpse of what’s coming…
“Differential atlas” mock-up (1)
A glimpse of what’s coming…
“Differential atlas” mock-up (2)
A glimpse of what’s coming…
“Baseline atlas”
• Gene expression in normal tissues, not looking for differentially expressed
genes based on different conditions
• E.g. “Give me all the genes expressed in normal human kidney”
• Can also filter genes by expression level (e.g. FPKM values)
• Start with Illumina Body Map 2.0 RNA-seq data
• 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung,
lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells
• We are working on something similar for mouse
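A baseline query such as "all genes expressed in normal kidney" comes down to filtering a gene-by-tissue table by expression level. The sketch below uses a hypothetical FPKM table with made-up gene names and values; the 1.0 FPKM default cutoff is an arbitrary illustration, not a threshold stated in the slides.

```python
# Hypothetical FPKM values; gene names and numbers are made up for illustration.
FPKM = {
    "geneA": {"kidney": 35.2, "liver": 0.4},
    "geneB": {"kidney": 0.1, "liver": 12.0},
    "geneC": {"kidney": 2.5, "liver": 2.0},
}


def expressed_in(tissue, min_fpkm=1.0, table=FPKM):
    """Return the genes whose FPKM in `tissue` is at least `min_fpkm`."""
    return sorted(
        gene for gene, levels in table.items()
        if levels.get(tissue, 0.0) >= min_fpkm
    )


print(expressed_in("kidney"))
print(expressed_in("kidney", min_fpkm=10.0))
```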
A glimpse of what’s coming…
“Baseline atlas” mock-up display
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
Submission of HTS data
ArrayExpress acts as a “broker” for the submitter.
• Meta-data and processed data: ArrayExpress
• Raw sequence reads* (e.g. fastq, bam): ENA
*See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats
Find out more
• Visit our eLearning portal, Train online, at
http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and
Atlas
• Watch this short YouTube video on how to navigate the MAGE-TAB
submission tool: http://youtu.be/KVpCVGpjw2Y
• Email us at: [email protected]
• Atlas mailing list: [email protected]