Data Mining in Ensembl with BioMart
Data Mining with BioMart
www.ensembl.org/biomart/martview
www.biomart.org/biomart/martview

What is BioMart?
• A data export tool
• A quick table generator
• A web interface to mine Ensembl data

BioMart: Data Mining
• BioMart is a search engine that can find multiple terms and put them into a table format.
• For example: mouse gene IDs, chromosome and base-pair position
• No programming required!

General or Specific Data Tables
• All the genes for one species
• Or only the genes in one specific region of a chromosome
• Or let BioMart select the genes (e.g. all transcripts that match a microarray probe set, GO term, or InterPro domain)

Results
• Tables or sequences

The First Step: Choose the Dataset
• Dataset: current Ensembl, human genes

The Second Step: Filters
• Filters: define a gene set

Attributes Attach Information
• Attributes: determine output columns

Query
For the human CFTR gene, export the Entrez Gene ID(s) and matching Affy HG U133-PLUS-2 probeset(s).
• In the query:
  Filters: what we know
  Attributes: what we want to know

A Brief Example
• Use the current Ensembl (archives are also available)
• Select Homo sapiens genes

Select the Genes with Filters
• Click ‘Filters’
• Expand the ‘GENE’ panel to enter the gene ID(s)

Filters (and Count)
• Change the ID type to HGNC curated name.
• Enter “CFTR” in the box.
• Click “Count” to check that genes passed through your filters.

Attributes (Output Options)
• Click on ‘Attributes’. ‘Attributes’ lets you choose the information to output.
• Select ‘EntrezGene ID’.
• Select the Affy platform ‘HG U133-PLUS-2’ in the ‘Microarray’ section.

The Results Table: Preview
• For the full result table, click “Go” or view “ALL” rows.

Full Result Table
• Ensembl Gene ID for CFTR
• Ensembl Transcript IDs
• EntrezGene ID
• Affy HG probeset

Other Export Options (Attributes)
• Sequences: UTRs, flanking sequences, cDNA, peptides, etc.
• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
• Microarray data
• Protein functions/descriptions (InterPro, GO)
• Orthologous gene sets
• SNP/variation data

BioMart Around the World
• BioMart started at Ensembl. Where has it travelled to?
• Central Portal: www.biomart.org
• WormBase
• HapMap: population frequencies, interpopulation comparisons, gene annotation
• DictyBase
• GRAMENE: www.gramene.org
• The Potato Center

How to Get There
• http://www.biomart.org/biomart/martview
• http://www.ensembl.org/biomart/martview
• Or click on ‘BioMart’ from Ensembl

Worked Example
• Follow the worked example on pg 26
• Then do the exercises on pg 34 (answers on pg 37)

This module should do the following:
• Show you how to export multiple data types from Ensembl for gene IDs or chromosomal regions.
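
The CFTR query walked through in the slides (dataset, then filters, then attributes) can also be written down as a BioMart-style XML query. The sketch below only builds the XML text and does not contact any server. The helper `build_biomart_query` is hypothetical, and the internal names used here (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `ensembl_transcript_id`, `entrezgene_id`, `affy_hg_u133_plus_2`) are assumptions that are not taken from the slides; they can differ between Ensembl releases and should be checked against the live interface.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Hypothetical helper: assemble a BioMart-style XML query string.

    filters    -- dict of {filter_name: value}: "what we know"
    attributes -- list of attribute names: "what we want to know"
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "1",        # include a header row
        "uniqueRows": "1",    # collapse duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; verify them for the release you are using.
cftr_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=[
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "entrezgene_id",
        "affy_hg_u133_plus_2",
    ],
)
print(cftr_query)
```

The single filter carries what is already known (the gene symbol), and each attribute becomes one column of the result table, mirroring the Filters and Attributes panels in the web interface.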

Ensembl Core Databases
Relational database
• Normalised
• Each data point stored only once
Therefore:
• Quick updates
• Minimal storage requirements
But:
• Many tables
• Many joins for complicated queries
• Slow for data mining applications

Normalised Schema

gene_id   stable_id
9970      ENSG00000170365
1712      ENSG00000175387
8240      ENSG00000166949
1967      ENSG00000141646
…         …

gene_id   gene.symbol
9970      SMAD1
1712      SMAD2
8240      SMAD3
1967      SMAD4
…         …

gene_id   transcript
9970      ENST00000302085
1712      ENST00000262160
1712      ENST00000356825
8240      ENST00000327367
1967      ENST00000342988
…         …

BioMart Database
Data warehouse
• De-normalised
• Query-optimised
Therefore:
• Fast and flexible
• Ideal for data mining
But:
• Tables with apparent “redundancy”
• Needs rebuilding from scratch from the normalised core databases for every release

De-Normalised Schema

gene_id           transcript_id     gene.symbol
ENSG00000170365   ENST00000302085   SMAD1
ENSG00000175387   ENST00000262160   SMAD2
ENSG00000175387   ENST00000356825   SMAD2
ENSG00000166949   ENST00000327367   SMAD3
ENSG00000141646   ENST00000342988   SMAD4
…                 …                 …

Information Flow
• Dataset: species, focus
• Filter: region, gene, expression, homology, protein, SNP
• Attributes: region, gene, expression, homology, protein, SNP
• External sources: SwissProt, EMBL, RefSeq, GO, InterPro, Affymetrix
• Output formats: FASTA, GTF, HTML, text, Excel, file
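
The trade-off between the normalised core schema and the de-normalised mart table can be tried out on the SMAD rows from the slides. This is a minimal sketch using an in-memory SQLite database, not the actual Ensembl or BioMart schema: the table names `gene`, `gene_symbol`, `transcript` and `mart`, and the helper `transcripts_for`, are hypothetical. The joins are paid once, when the mart table is built; after that, a query reads a single wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalised layout: three small tables linked by an internal gene_id.
conn.executescript(
    """
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
    CREATE TABLE gene_symbol (gene_id INTEGER, symbol TEXT);
    CREATE TABLE transcript (gene_id INTEGER, transcript_id TEXT);
    """
)
conn.executemany("INSERT INTO gene VALUES (?, ?)", [
    (9970, "ENSG00000170365"),
    (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"),
    (1967, "ENSG00000141646"),
])
conn.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"),
    (1712, "ENST00000262160"),
    (1712, "ENST00000356825"),
    (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])

# De-normalised layout: do the joins once and store the wide result.
conn.execute(
    """
    CREATE TABLE mart AS
    SELECT g.stable_id AS gene_id,
           t.transcript_id AS transcript_id,
           s.symbol AS symbol
    FROM gene AS g
    JOIN gene_symbol AS s ON s.gene_id = g.gene_id
    JOIN transcript AS t ON t.gene_id = g.gene_id
    """
)


def transcripts_for(symbol):
    """Look up rows in the wide table; no joins are needed at query time."""
    return conn.execute(
        "SELECT gene_id, transcript_id, symbol FROM mart "
        "WHERE symbol = ? ORDER BY transcript_id",
        (symbol,),
    ).fetchall()


smad2_rows = transcripts_for("SMAD2")
for row in smad2_rows:
    print(row)
```

SMAD2 appears twice in the wide table, once per transcript, which is the apparent “redundancy” the slide mentions. Because the wide table is derived from the normalised tables, it has to be rebuilt whenever they change.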