Data Mining in Ensembl with BioMart
www.ensembl.org/biomart/martview
www.biomart.org/biomart/martview
Nov, 2009
BioMart: Data Mining
• BioMart is a search engine that retrieves several kinds of data at once and returns them as a table.
• Example: mouse gene IDs with their chromosome and base-pair positions.
• No programming required!
General or Specific Data Tables
• All the genes for one species
• Or… only the genes in one specific region of a chromosome
• Or… the genes in one region of a chromosome that are associated with an InterPro domain
The First Step: Choose the Dataset
Dataset: Current Ensembl, Human genes
The Second Step: Filters
Filters: Define a gene set
Attributes attach information
Attributes: Determine output columns
Results
Tables or sequences
Query:
• For the human CFTR gene, can I export the EntrezGene ID, and also the probes that match this gene sequence on the “Affy HG U133 Plus 2” microarray platform?
• In the query:
Filters: what we know
Attributes: what we want to know (columns in the result table)
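The split between filters and attributes can also be written down as a BioMart-style XML query. The sketch below only builds the query text and does not contact any server. The dataset name `hsapiens_gene_ensembl`, the filter name `hgnc_symbol`, and the three attribute names are assumptions; internal names differ between Ensembl releases, so check them in the interface before relying on them. `build_query` is a hypothetical helper, not part of BioMart.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string.

    filters    -- dict of what we know (filter name -> value)
    attributes -- list of what we want to know (output columns)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated result table
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are assumed internal names, not taken from the slides.
cftr_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=["ensembl_gene_id", "entrezgene_id", "affy_hg_u133_plus_2"],
)
print(cftr_query)
```

A string like this could be submitted to a BioMart web service instead of clicking through the panels; the web interface covered in these slides needs none of it.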
A Brief Example
Use the current Ensembl (archives are also available).
Select Homo sapiens.
Select the genes with Filters
Click Filters.
Expand the ‘REGION’ panel.
Expand the GENE panel to enter the gene ID(s).
Change this to HGNC symbol. Enter “CFTR” in the box.
Click “Count” to check whether any genes passed your filters.
Attributes (Output Options)
Click on ‘Attributes’.
Expand the “GENE” section.
Select “Description” and “Associated Gene Name”.
Expand the ‘EXTERNAL’ panel for non-Ensembl IDs.
External IDs include EntrezGene IDs and also microarray probe IDs.
The Results Table - Preview
For the full result table: click “Go” or view “ALL” rows.
“Results” show Description, Name, EntrezGene and Probe matches from the Affy HG U133-Plus-2 platform.
Full Result Table
Columns: Ensembl Gene and Transcript IDs, Description, Gene Name, Affy HG probe, EntrezGene ID
Other Export Options (Attributes)
• Sequences: UTRs, flanking sequences, cDNA and peptides, etc.
• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
• Microarray data
• Protein functions/descriptions (InterPro, GO)
• Orthologous gene sets
• SNP/variation data
BioMart Data Sets
• Ensembl genes
• Vega genes
• Variations
BioMart around the world…
BioMart started at Ensembl…
To where has it travelled?
Central Portal
www.biomart.org
WormBase
HapMap: population frequencies, interpopulation comparisons, gene annotation
DictyBase
GRAMENE
www.gramene.org
The Potato Center
How to Get There
http://www.biomart.org/biomart/martview
http://www.ensembl.org/biomart/martview
• Or click on ‘BioMart’ from Ensembl
The Flow
• Choose Dataset (All genes for a species)
• Choose Filters (narrows the gene set)
• Choose Attributes (output options)
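The three steps can be pictured as row selection followed by column selection. The sketch below is a toy illustration of that idea, not how BioMart is implemented: the dataset is just the five SMAD rows from the de-normalised example table in these slides, and `run_mart` is a hypothetical helper.

```python
# Dataset: toy rows copied from the de-normalised SMAD table in these slides.
DATASET = [
    {"gene_id": "ENSG00000170365", "transcript_id": "ENST00000302085", "symbol": "SMAD1"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000262160", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000356825", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000166949", "transcript_id": "ENST00000327367", "symbol": "SMAD3"},
    {"gene_id": "ENSG00000141646", "transcript_id": "ENST00000342988", "symbol": "SMAD4"},
]


def run_mart(dataset, filters, attributes):
    """Keep the rows that match every filter, then keep only the requested columns."""
    selected = [
        row for row in dataset
        if all(row[name] == value for name, value in filters.items())
    ]
    return [tuple(row[name] for name in attributes) for row in selected]


# Filters narrow the gene set; attributes decide the output columns.
result = run_mart(
    DATASET,
    filters={"symbol": "SMAD2"},
    attributes=["gene_id", "transcript_id"],
)
for line in result:
    print("\t".join(line))
```

With an empty `filters` dict every row passes, which corresponds to exporting all genes in the dataset.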
Now Try the Worked Example on Page 23!
Ensembl Core Databases
Relational Database
• Normalised
• Each data point stored only once
Therefore:
• Quick updates
• Minimal storage requirements
But:
• Many tables
• Many joins for complicated queries
• Slow for data mining applications
Normalised Schema
gene_id   stable_id
9970      ENSG00000170365
1712      ENSG00000175387
8240      ENSG00000166949
1967      ENSG00000141646
…         …

gene_id   gene.symbol
9970      SMAD1
1712      SMAD2
8240      SMAD3
1967      SMAD4
…         …

gene_id   transcript
9970      ENST00000302085
1712      ENST00000262160
1712      ENST00000356825
8240      ENST00000327367
1967      ENST00000342988
…         …
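To see why a normalised layout needs joins, the three small tables can be loaded into an in-memory SQLite database. This is a sketch only: the table and column names (`gene`, `gene_symbol`, `transcript`, `transcript_id`) are chosen for illustration and are not the real Ensembl core schema; the rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene        (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
    CREATE TABLE gene_symbol (gene_id INTEGER, symbol TEXT);
    CREATE TABLE transcript  (gene_id INTEGER, transcript_id TEXT);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [
    (9970, "ENSG00000170365"), (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"), (1967, "ENSG00000141646"),
])
conn.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"), (1712, "ENST00000262160"),
    (1712, "ENST00000356825"), (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])

# Each fact is stored once, so one question already touches three tables.
rows = conn.execute("""
    SELECT g.stable_id, t.transcript_id, s.symbol
    FROM gene AS g
    JOIN gene_symbol AS s ON s.gene_id = g.gene_id
    JOIN transcript  AS t ON t.gene_id = g.gene_id
    WHERE s.symbol = ?
    ORDER BY t.transcript_id
""", ("SMAD2",)).fetchall()

for row in rows:
    print(row)
```

Two joins are needed just to list the transcripts of one named gene; richer questions add more tables and more joins.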
BioMart Database
Data warehouse
• De-normalised
• Query-optimised
Therefore:
• Fast and flexible
• Ideal for data mining
But:
• Tables with apparent “redundancy”
• Must be rebuilt from scratch from the normalised core databases for every release
De-Normalised Schema
gene_id           transcript_id     gene.symbol
ENSG00000170365   ENST00000302085   SMAD1
ENSG00000175387   ENST00000262160   SMAD2
ENSG00000175387   ENST00000356825   SMAD2
ENSG00000166949   ENST00000327367   SMAD3
ENSG00000141646   ENST00000342988   SMAD4
…                 …                 …
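The trade-off in the de-normalised layout can be sketched with the same rows in a single SQLite table. The table name `gene_main` and the column names are assumptions made for this illustration, not actual BioMart table names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_main (gene_id TEXT, transcript_id TEXT, symbol TEXT)")
conn.executemany("INSERT INTO gene_main VALUES (?, ?, ?)", [
    ("ENSG00000170365", "ENST00000302085", "SMAD1"),
    ("ENSG00000175387", "ENST00000262160", "SMAD2"),
    ("ENSG00000175387", "ENST00000356825", "SMAD2"),
    ("ENSG00000166949", "ENST00000327367", "SMAD3"),
    ("ENSG00000141646", "ENST00000342988", "SMAD4"),
])

# The query needs no joins: every column is already on the same row.
hits = conn.execute(
    "SELECT gene_id, transcript_id FROM gene_main "
    "WHERE symbol = ? ORDER BY transcript_id",
    ("SMAD2",),
).fetchall()

# The price is apparent redundancy: gene-level values repeat per transcript.
total_rows, distinct_genes = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT gene_id) FROM gene_main"
).fetchone()

print(hits)
print(total_rows, distinct_genes)
```

Five rows describe four genes because the SMAD2 gene ID and symbol are stored once per transcript.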
Information Flow
DATASET: SPECIES, FOCUS
FILTER: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP
ATTRIBUTES: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP
Data sources: SWISSPROT, EMBL, REFSEQ, GO, INTERPRO, AFFYMETRIX
Output: FASTA, GTF, HTML, TEXT, EXCEL, FILE