Ensembl Compara
Perl API
Stephen Fitzgerald
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/
EBI - Wellcome Trust Genome Campus, UK
What is Ensembl Compara?
A single database which contains precalculated
comparative genomics data
Access via the Perl API and MySQL
A production system for generating that database
(not in this presentation)
Compara data
Raw genomic sequence

Whole genome alignments
(tBLAT, BlastZ-net, PECAN)

Syntenic regions (based on BlastZ-net)
Protein sequences
Raw protein alignments
Protein family clusters
Protein trees
Gene orthology / paralogy predictions
46 species in Ensembl release-52
Compara database & the Ensembl
core databases
Compara itself holds very little primary data, so to gain full access
to the data the links to the core DBs must be re-established
Example: compara_52 must be linked with the
Ensembl core_52 databases
Proper Registry configuration is critical
load_registry_from_db is probably the best choice here
The Compara Perl API




Written in object-oriented Perl
Used to retrieve data from, and store data into, an
ensembl-compara database
Generalized to extend to non-Ensembl genomic data
(Uniprot)
Follows the same ‘Data Object’ & ‘Object Adaptor’
DBAdaptor design as the other Ensembl APIs
Compara object model overview
Primary data: NCBITaxon, GenomeDB, DnaFrag, Member
Analysis: MethodLinkSpeciesSet
Results: GenomicAlignBlock, GenomicAlign, SyntenyRegion, DnaFragRegion,
ProteinTree, AlignedMember, Homology, Family, Attribute
Primary data



GenomeDB: relates to a particular Ensembl core DB

name(), assembly(), genebuild(), taxon()

fetch_by_name_assembly(), fetch_by_registry_name(),
fetch_by_Slice(), fetch_all()
DnaFrag: represents a “top level” SeqRegion

name(), length(), genome_db(), slice(), coord_system_name()

fetch_by_Slice(), fetch_by_GenomeDB_and_name()
Member: lists all Ensembl genes plus SwissProt and SPTrEMBL proteins

source_name(), stable_id(), genome_db(), taxon(), sequence(),
get_all_peptide_Members(), get_longest_peptide_Member(),
gene_member()

fetch_by_source_stable_id()
Analysis

MethodLinkSpeciesSet provides a handle to isolate
specific data from the shared tables (homology,
genomic_align_block)

MethodLink: each individual analysis in Compara is tagged
with a unique name called a method_link_type
BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY,
ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
SpeciesSet: the set of species, as a reference to an array of
GenomeDBs
fetch_by_method_link_type_GenomeDBs(),
fetch_by_method_link_type_registry_aliases()
name(), method_link_type(), species_set(), source()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomeDB
1. Find out the versions of human and mouse genomes in the database
2. Print the name of all the GenomeDBs in the database
DnaFrag
1. Get the DnaFrag for chromosome 1 of the macaque genome
(using a genome_db object as an argument)
2. Get the DnaFrag for chromosome X of the mouse genome
(using a core slice object as an argument)
MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the
BlastZ-net analysis for human and mouse
3. Get the names of all the species in the mlss corresponding to
the Pecan analyses
GenomeDB example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Load every database on the public Ensembl server into the Registry
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# "Multi" is the registry name under which the Compara database is stored
my $genome_db_adaptor = $reg->get_adaptor(
    "Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->
    fetch_by_registry_name("human");

print "Name      :", $genome_db->name, "\n";
print "Assembly  :", $genome_db->assembly, "\n";
print "GeneBuild :", $genome_db->genebuild, "\n";
GenomeDB example code
$> perl genome_db1.pl
Homo sapiens NCBI36 2006-08-Ensembl
Mus musculus NCBIM36 2006-04-Ensembl
DnaFrag example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# The GenomeDB says which genome the DnaFrag belongs to
my $genome_db_adaptor = $reg->get_adaptor(
    "Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->
    fetch_by_registry_name("human");

# Fetch the DnaFrag for human chromosome 13
my $dnafrag_adaptor = $reg->get_adaptor(
    "Multi", "compara", "DnaFrag");
my $dnafrag = $dnafrag_adaptor->
    fetch_by_GenomeDB_and_name($genome_db, "13");

print "Name        :", $dnafrag->name, "\n";
print "Length      :", $dnafrag->length, "\n";
print "CoordSystem :", $dnafrag->coord_system_name, "\n";
DnaFrag example code
$> perl test1.pl
Name        :13
Length      :114142980
CoordSystem :chromosome
MethodLinkSpeciesSet example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# Fetch the human-mouse BLASTZ_NET analysis by registry aliases
my $mlssa = $reg->get_adaptor("Multi", "compara",
    "MethodLinkSpeciesSet");
my $mlss = $mlssa->
    fetch_by_method_link_type_registry_aliases(
        "BLASTZ_NET", ["human", "mouse"]);

print $mlss->name, "\n";
print "type: ", $mlss->method_link_type, "\n";

# species_set() returns a reference to an array of GenomeDBs
my $species_set = $mlss->species_set();
foreach my $this_genome_db (@$species_set) {
    print $this_genome_db->name(), "\n";
}
MethodLinkSpeciesSet example code
$> perl method_link_species_set.pl
H.sap-M.mus blastz-net (on H.sap)
Genomic Alignments

BlastZ-Net: used to compare closely related pairs of species
BlastZ-raw -> BlastZ-chain -> BlastZ-net
Translated BLAT: used to compare more distantly related pairs of species
Pecan: multiple global alignments
all-vs-all coding exons wublastp -> Mercator ->
Pecan on each syntenic block
GenomicAlignBlock

GenomicAlignBlock: represents a genomic alignment
contains 1 GenomicAlign per sequence
fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
Methods:
method_link_species_set(), score(), length(), perc_id(),
get_all_GenomicAligns(), get_SimpleAlign()
GenomicAlign
dnafrag(), genome_db(), get_Slice(), dnafrag_start(),
dnafrag_end(), dnafrag_strand(), aligned_sequence()
GenomicAlignBlock
$all_GAlign  = $GABlock->get_all_GenomicAligns()    ($arrayref)
$Simplealign = $GABlock->get_SimpleAlign()          ($object)
$Simplealign: a BioPerl object which contains the whole
alignment; it can be printed in various formats using BioPerl
modules
$Galign: an object which represents only one of the sequences
in the alignment
Hsap.X.1223-1230: ACCTTC-A    <- $ga
Cfam.X.1390-1395: ACC--CGA    <- $ga
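To make the relation between an aligned sequence and its original sequence concrete, here is a minimal sketch in plain Perl. It does not use the Compara API: the two strings are copied from the toy alignment above and stand in for what aligned_sequence() would return for each GenomicAlign. The helper names ungapped() and identical_columns() are hypothetical and are not part of the API.

```perl
use strict;
use warnings;

# Toy aligned sequences copied from the slide; plain strings stand in
# for the aligned_sequence() of two GenomicAlign objects (assumption).
my %aligned = (
    Hsap => "ACCTTC-A",
    Cfam => "ACC--CGA",
);

# Hypothetical helper: deleting the gap characters recovers the
# original (ungapped) sequence.
sub ungapped {
    my ($aligned_seq) = @_;
    (my $seq = $aligned_seq) =~ tr/-//d;
    return $seq;
}

# Hypothetical helper: count the alignment columns in which both
# sequences carry the same base. A gap never counts as a match.
sub identical_columns {
    my ($seq_a, $seq_b) = @_;
    my $count = 0;
    for my $i (0 .. length($seq_a) - 1) {
        my $base_a = substr($seq_a, $i, 1);
        my $base_b = substr($seq_b, $i, 1);
        $count++ if $base_a ne '-' && $base_a eq $base_b;
    }
    return $count;
}

for my $name (sort keys %aligned) {
    printf "%s aligned: %s  original: %s\n",
        $name, $aligned{$name}, ungapped($aligned{$name});
}
print "identical columns: ",
    identical_columns($aligned{Hsap}, $aligned{Cfam}), "\n";
```

With real data the same comparison could be made between aligned_sequence() and the sequence of the Slice returned by get_Slice(), which is what exercise 3 of the GenomicAlignBlock exercises asks for.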
Synteny


Based on BlastZ-net alignments
SyntenyRegionAdaptor

fetch_all_by_MethodLinkSpeciesSet_Slice(),
fetch_all_by_MethodLinkSpeciesSet_DnaFrag()

Methods:


get_all_DnaFragRegions(), method_link_species_set()
DnaFragRegion

slice(), dnafrag(), dnafrag_start(), dnafrag_end(),
dnafrag_strand()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K
nucleotides of the human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between human gene BRCA2
and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and
the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT
and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus
(use the constrained elements generated from the 12-way alignments).
Synteny
1. Get the human-mouse syntenic map for human chromosome X.
GenomicAlignBlock example code
[...]
# Human chromosome 12, positions 10,000 to 20,000
my $slice_adaptor = $reg->get_adaptor(
    "human", "core", "Slice");
my $slice = $slice_adaptor->
    fetch_by_region("chromosome", "12", 1e4, 2e4);

my $gaba = $reg->get_adaptor("Multi", "compara",
    "GenomicAlignBlock");
my $genomic_align_blocks = $gaba->
    fetch_all_by_MethodLinkSpeciesSet_Slice(
        $method_link_species_set, $slice);

# Each block holds one GenomicAlign per aligned sequence
foreach my $this_gab (@$genomic_align_blocks) {
    my $all_gas = $this_gab->get_all_GenomicAligns();
    foreach my $this_ga (@$all_gas) {
        print $this_ga->genome_db->name(),
            ":", $this_ga->get_Slice()->name(), "\n";
        print $this_ga->aligned_sequence(), "\n";
    }
    print "\n";
}
GenomicAlignBlock example code
$>perl gab.pl
Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
CCTCTTAATAAACATTATTGTCAA[…]
Homo sapiens:chromosome:NCBI36:12:19128:19507:1
CCTCTTAATAAGCACACATATCCT[..]
Synteny example code
[...]
my $synteny_region_adaptor = $reg->get_adaptor(
    "Multi", "compara", "SyntenyRegion");
my $synteny_regions = $synteny_region_adaptor->
    fetch_all_by_MethodLinkSpeciesSet_Slice(
        $human_mouse_synteny_method_link_species_set,
        $human_slice);

# Each syntenic region holds one DnaFragRegion per genome
foreach my $this_synteny_region (@$synteny_regions) {
    my $these_dnafrag_regions =
        $this_synteny_region->get_all_DnaFragRegions();
    foreach my $this_dnafrag_region
        (@$these_dnafrag_regions) {
        print $this_dnafrag_region->dnafrag->
            genome_db->name, ": ",
            $this_dnafrag_region->slice->name, "\n";
    }
    print "\n";
}
Homology

(e! 38):
Orthologue predictions based on ‘best reciprocal
blast hits’
Paralogues for a selected set of species
No global view of the evolutionary history of the
gene considered
e! 39+:
Orthologues and paralogues are inferred from
protein trees
Phylogeny: Orthology/Paralogy in one go
BSR: Blast Score Ratio. When two proteins P1 and P2 are compared,
BSR = score(P1,P2) / max(self-score(P1), self-score(P2)). The default threshold used in the
initial clustering step is 0.33.
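The BSR definition can be sketched in a few lines of plain Perl. This is an illustration only: the function name blast_score_ratio() is hypothetical, the scores are invented, and treating "greater than or equal to 0.33" as passing the threshold is an assumption, since the slide does not say how the threshold is compared.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper implementing the definition on the slide:
# BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
sub blast_score_ratio {
    my ($score_p1_p2, $self_score_p1, $self_score_p2) = @_;
    my $denominator = max($self_score_p1, $self_score_p2);
    die "self-scores must be positive\n" unless $denominator > 0;
    return $score_p1_p2 / $denominator;
}

# Default threshold quoted on the slide
my $threshold = 0.33;

# Invented scores, for illustration only
my $bsr = blast_score_ratio(150, 400, 300);

# Assumption: a pair is kept when its BSR is >= the threshold
printf "BSR = %.3f (%s the %.2f threshold)\n",
    $bsr, ($bsr >= $threshold ? "meets" : "is below"), $threshold;
```

Dividing by the larger of the two self-scores keeps the ratio between 0 and 1 in the usual case and penalises hits that cover only a small part of the longer protein.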
Homology types
Homology

Homology object



contains 1 pair of Member/Attribute per gene/protein
fetch_all_by_Member(),
fetch_all_by_MethodLinkSpeciesSet(),
fetch_all_by_Member_MethodLinkSpeciesSet()
Methods:

method_link_species_set(), description(),
subtype(), perc_id(), get_all_Member_Attribute(),
get_SimpleAlign()
Family



Compara computes gene family clusters
Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT
and Uniprot/SPTREMBL metazoan proteins
The algorithm is based on:
All vs all blastp
MCL clustering
Muscle multiple aligner
Results stored in the family and family_member tables
Family

Family object

contains 1 pair of Member/Attribute per gene/protein

fetch_all_by_Member()

Methods:

method_link_species_set(), description(),
description_score(), get_all_Member_Attribute(),
get_SimpleAlign()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
Members
1. Find the Member corresponding to SwissProt protein O93279
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene
CTDP1
Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1
Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene
HBEGF
Member example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => "ensembldb.ensembl.org",
    -user => "anonymous");

# Fetch the gene Member by its source name and stable ID
my $member_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Member");
my $member = $member_adaptor->
    fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

# List every peptide Member attached to that gene
print "All proteins:\n";
my $all_peptide_members = $member->
    get_all_peptide_Members();
foreach my $this_peptide (@$all_peptide_members) {
    print $this_peptide->stable_id(), "\n";
}
Member example code
$> perl test2.pl
All proteins:
ENSP00000356399
ENSP00000356398
ENSP00000352658
Homology example code
[...]
my $ma = $reg->get_adaptor(
    "Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
    "ENSEMBLGENE", "ENSG00000000971");

my $homology_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Homology");
my $homologies = $homology_adaptor->
    fetch_all_by_Member($member);

# Each Homology holds one Member/Attribute pair per gene
foreach my $this_homology (@$homologies) {
    print $this_homology->description, "\n";
    my $member_attributes = $this_homology->
        get_all_Member_Attribute();
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) =
            @$this_mem_attr;
        print $this_member->genome_db->name, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}
Family example code
[...]
my $ma = $reg->get_adaptor(
    "Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
    "ENSEMBLGENE", "ENSG00000000971");

my $family_adaptor = $reg->get_adaptor(
    "Multi", "compara", "Family");
my $families = $family_adaptor->
    fetch_all_by_Member($member);

# Each Family holds one Member/Attribute pair per gene/protein
foreach my $this_family (@$families) {
    print $this_family->description, "\n";
    my $member_attributes = $this_family->
        get_all_Member_Attribute();
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) =
            @$this_mem_attr;
        print $this_member->taxon->binomial, " ",
            $this_member->source_name, " ",
            $this_member->stable_id, "\n";
    }
    print "\n";
}
Getting More Information

perldoc: viewer for inline API documentation
shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
online at: http://www.ensembl.org/
Tutorial document:
cvs: ensembl-compara/docs/ComparaTutorial.pdf
ensembl-dev mailing list:
[email protected]
Exercise solutions:
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html
Ensembl-dev mailing list and HelpDesk

ensembl-dev mailing list is great for questions around
the API and the DB

HelpDesk is very helpful

Give detailed info on what you are trying to do

Check that you have the modules installed
($PERL5LIB pointing to them)
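One quick way to check the installation before asking for help is to test whether Perl can load the modules at all. The sketch below is an assumption about how such a check could look, not an official Ensembl tool; module_available() is a hypothetical helper, and the three module names are only examples of one Ensembl core module, one Compara module and one BioPerl module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded
# from the directories currently in @INC (which includes $PERL5LIB),
# and 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Example modules: Ensembl core, Ensembl Compara and BioPerl
my @modules = qw(
    Bio::EnsEMBL::Registry
    Bio::EnsEMBL::Compara::GenomeDB
    Bio::SimpleAlign
);

for my $module (@modules) {
    printf "%-35s %s\n", $module,
        module_available($module) ? "found" : "NOT found (check PERL5LIB)";
}
```

Any module reported as not found points to a directory missing from $PERL5LIB, which is useful detail to include in a question to the mailing list or HelpDesk.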
Ensembl Team
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Distributed Annotation System (DAS): Eugene Kulesha
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel,
Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino