The Ensembl Database
www.ensembl.org
Lecture 3.1 Bioinformatics

What is Ensembl?
• Public annotation of the mammalian genomes
• Open source software
• Relational database system
• The future of genomic bioinformatics?

The Ensembl Project
“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust.”

The Ensembl Genome Annotation
Ensembl:
• Utilises raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl” database)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Provides a publicly accessible web-based interface to the database

Ensembl Software System
• Makes extensive use of BioPerl (www.bioperl.org)
• Uses the free MySQL database
• The entire Ensembl code base is freely available under an Apache open source license.
• Mainly written in Perl, with extensions in C. Some viewers have been written in Java (e.g. Apollo).
• The core design feature is the “virtual contig” (VC) object.
• It allows the genome sequence to be accessed as a single large contiguous sequence, although it is stored in the database as a collection of fragments.
• The VC object handles reading and writing of the DNA sequence data.
• Storing the genome as single large sequences would be impractical: for example, a whole database entry would need to be changed if a single base changed. Because the constituent DNA is stored as fragments, it is easy to move between database versions and assemblies.
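
The virtual contig idea can be illustrated with a small, self-contained Python sketch. This is not Ensembl code: the names `Fragment`, `VirtualContig`, `subseq` and `set_base` are hypothetical. It also assumes that fragments abut exactly (no gaps or overlaps), that the first fragment starts at position 1, and that coordinates are 1-based and inclusive.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One stored piece of sequence (hypothetical stand-in for a database row)."""
    start: int  # 1-based start coordinate on the assembled sequence
    seq: str    # bases stored for this fragment


class VirtualContig:
    """Presents abutting fragments as one contiguous sequence (illustrative only)."""

    def __init__(self, fragments):
        # Sort by assembly position so lookups can use binary search.
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self.starts = [f.start for f in self.fragments]

    def length(self):
        last = self.fragments[-1]
        return last.start + len(last.seq) - 1

    def _index_of(self, pos):
        """Return the index of the fragment that contains assembly position pos."""
        if not 1 <= pos <= self.length():
            raise ValueError(f"position {pos} is outside the assembly")
        return bisect_right(self.starts, pos) - 1

    def subseq(self, start, end):
        """Return bases start..end, stitching across fragment boundaries."""
        if start > end:
            raise ValueError("start must not exceed end")
        self._index_of(end)  # validates the end coordinate
        i = self._index_of(start)
        pos, pieces = start, []
        while pos <= end:
            frag = self.fragments[i]
            offset = pos - frag.start
            take = min(end - pos + 1, len(frag.seq) - offset)
            pieces.append(frag.seq[offset:offset + take])
            pos += take
            i += 1
        return "".join(pieces)

    def set_base(self, pos, base):
        """Change one base; only the fragment containing it is rewritten."""
        i = self._index_of(pos)
        frag = self.fragments[i]
        offset = pos - frag.start
        new_seq = frag.seq[:offset] + base + frag.seq[offset + 1:]
        self.fragments[i] = Fragment(frag.start, new_seq)
        return i  # index of the single fragment that was touched


vc = VirtualContig([Fragment(1, "ACGT"), Fragment(5, "TTGA"), Fragment(9, "CC")])
print(vc.length())         # 10
print(vc.subseq(3, 6))     # GTTT (spans the first two fragments)
print(vc.set_base(6, "A")) # 1 (only the second fragment changes)
print(vc.subseq(1, 10))    # ACGTTAGACC
```

The caller sees a single coordinate system, while an edit to one base replaces just one stored fragment. That is the practical argument the slides make for fragment storage.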
Ensembl Software System
• Software can be accessed by FTP
• Can also be accessed through CVS (Concurrent Versions System)
• Possible to set up a mirror of the entire Ensembl system.

Ensembl Now Supports a Number of Organisms

The Chromosome Overview

Entering through Disease Genes via the OMIM database

The Ensembl Gene Report

Expanding Annotation Features, e.g. Unigene and Human mRNA similarities

Ensembl links out to other databases to access individual entries

Direct Access to Ensembl
We will use the underlying Ensembl data. For reasons of simplicity and space, we will use the lite version of the Ensembl database.

{xhost01}/home/pubseq/acedb> mysql -u anonymous -h kaka.sanger.ac.uk
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11928 to server version: 3.23.32
Type 'help' for help.
mysql> use homo_sapiens_lite_8_30b ;
Database changed
mysql> show tables ;
+-----------------------------------+
| Tables_in_homo_sapiens_lite_8_30b |
+-----------------------------------+
| cpg                               |
| eponine                           |
| gene                              |
| gene_xref                         |
| genscan                           |
| repeat                            |
| repeat_types                      |
| snp                               |
| transcript                        |
| trna                              |
+-----------------------------------+
10 rows in set (0.15 sec)

Explore the Database

mysql> describe gene ;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned |      | PRI | NULL    | auto_increment |
| db            | varchar(40)      |      | MUL |         |                |
| type          | varchar(40)      |      |     |         |                |
| gene_id       | int(10) unsigned |      | MUL | 0       |                |
| gene_name     | varchar(40)      |      | MUL | unknown |                |
| chr_name      | varchar(20)      |      | MUL |         |                |
| chr_start     | int(10) unsigned |      |     | 0       |                |
| chr_end       | int(10) unsigned |      |     | 0       |                |
| chr_strand    | tinyint(4)       |      |     | 0       |                |
| description   | varchar(255)     |      |     |         |                |
| external_db   | varchar(40)      |      |     |         |                |
| external_name | varchar(40)      |      |     |         |                |
+---------------+------------------+------+-----+---------+----------------+
12 rows in set (0.14 sec)

mysql> select * from gene limit 10 ;
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
| id | db   | type     | gene_id | gene_name       | chr_name | chr_start | chr_end   | chr_strand | description | external_db | external_name |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
|  1 | embl | standard |       1 | AB001523.1.1.1  | 21       |  41942232 |  42033272 |          1 |             | protein_id  | BAA21099.1    |
|  2 | embl | standard |       2 | AB001523.1.1.2  | 21       |  42037154 |  42038833 |          1 |             | protein_id  | BAA21100.1    |
|  3 | embl | standard |       3 | AB015355.1.1.1  | 12       |  51605465 |  51627823 |         -1 |             | SPTREMBL    | O94801        |
|  4 | embl | pseudo   |       4 | AB019437.1.1.31 | 14       | 104129078 | 104129346 |         -1 |             | EMBL        | AB019437.1.1  |
|  5 | embl | standard |       5 | AB019437.1.1.23 | 14       | 104166399 | 104166849 |         -1 |             | protein_id  | BAA75024.1    |
|  6 | embl | standard |       6 | AB019437.1.1.15 | 14       | 104214187 | 104214630 |         -1 |             | protein_id  | BAA75022.1    |
|  7 | embl | pseudo   |       7 | AB019437.1.1.24 | 14       | 104163156 | 104163428 |         -1 |             |             |               |
|  8 | embl | standard |       8 | AB019437.1.1.16 | 14       | 104205297 | 104205735 |         -1 |             | protein_id  | BAA75023.1    |
|  9 | embl | pseudo   |       9 | AB019437.1.1.1  | 14       | 104323128 | 104323419 |         -1 |             |             |               |
| 10 | embl | pseudo   |      10 | AB019437.1.1.25 | 14       | 104157422 | 104157916 |         -1 |             |             |               |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
10 rows in set (0.16 sec)

mysql> select gene_name, chr_start, (chr_end-chr_start) AS Length, description from gene where chr_name = "2" and chr_start > 20000000 and chr_end < 20500000 limit 10 ;
+--------------------+-----------+--------+-----------------------------------------------------+
| gene_name          | chr_start | Length | description                                         |
+--------------------+-----------+--------+-----------------------------------------------------+
| ENSG00000118965    |  20001311 |  79861 |                                                     |
| ENSestG00000028348 |  20004285 |  18142 |                                                     |
| ENSestG00000028342 |  20023444 |  14129 |                                                     |
| ENSestG00000028189 |  20024182 |  33984 |                                                     |
| ENSestG00000028339 |  20044953 |  36215 |                                                     |
| ENSestG00000028192 |  20081242 |  14009 |                                                     |
| ENSestG00000028199 |  20083093 |   2338 |                                                     |
| ENSG00000132031    |  20083093 |  20642 | MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232] |
| ENSestG00000028338 |  20085340 |   6263 |                                                     |
| ENSestG00000028337 |  20093048 |  10687 |                                                     |
+--------------------+-----------+--------+-----------------------------------------------------+
10 rows in set (0.15 sec)

Using the Perl API
• API means “application programming interface”.
• It gives us a seamless way to access the MySQL database from Perl code.
• A Java API also exists for Ensembl.

A typical API session would involve:

1) Connecting to the database

This line has to be in all your Ensembl scripts:

    use Bio::EnsEMBL::DBSQL::DBAdaptor;

The all-important variables telling Perl where and what your database is:

    my $host   = 'kaka.sanger.ac.uk';
    my $user   = 'anonymous';
    my $dbname = 'current';

And now we connect:

    my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(-host => $host, -user => $user, -dbname => $dbname);

2) Using function calls to access the underlying data

    my $clone = $db->get_Clone('AC005663');

The function returns a clone object (a Perl object, typically built on an associative array), whose methods give access to its data:

    print "Clone is " . $clone->id . "\n";

Which functions you can use, and what data they return, is outlined in the documentation.

3) Getting the API to return a large amount of data in an array

    my @contigs = $clone->get_all_Contigs;

We now have an array of contig objects, which are very useful for obtaining information. Say we want to get the sequence for each contig:

    foreach my $contig (@contigs) {
        my $seqobj = $contig->primary_seq;   # sequence object for this contig
        my $length = $contig->length;        # contig length in bases
        my $id     = $contig->id;            # contig identifier
        print $seqobj->seq . "\n";           # print the raw sequence string
    }

Note: These specific examples are taken from the Ensembl tutorial: http://www.ensembl.org/Docs/ensembl_tutorial.pdf

Further Information
• The Ensembl Project: www.ensembl.org
• Ensembl Trace Server: trace.ensembl.org
• Ensembl Distributed Annotation Server: servlet.sanger.ac.uk/das
• Human Genome Central Resources: www.ensembl.org/genome/central
• Distributed Annotation System: www.biodas.org
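
The chromosome-region query from the direct-access slides can also be tried offline. The following is a minimal Python sketch under several assumptions. An in-memory SQLite database stands in for the MySQL server. The `gene` table is cut down to five of its columns. The three rows are copied from the outputs shown earlier, with `chr_end` for the chromosome 2 genes derived from the printed start and length. The helper name `genes_in_region` is hypothetical, and the `ORDER BY` clause is an addition the original query did not have.

```python
import sqlite3

# In-memory stand-in for the lite database (assumption: reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene (
           gene_name   TEXT,
           chr_name    TEXT,
           chr_start   INTEGER,
           chr_end     INTEGER,
           description TEXT
       )"""
)

# Rows taken from the slide output; chr_end = chr_start + Length for chromosome 2.
rows = [
    ("ENSG00000118965", "2", 20001311, 20081172, ""),
    ("ENSG00000132031", "2", 20083093, 20103735,
     "MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232]"),
    ("AB001523.1.1.1", "21", 41942232, 42033272, ""),
]
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", rows)


def genes_in_region(conn, chr_name, start, end, limit=10):
    """Return (name, start, length, description) for genes lying fully inside a region."""
    query = """SELECT gene_name, chr_start, (chr_end - chr_start) AS length, description
               FROM gene
               WHERE chr_name = ? AND chr_start > ? AND chr_end < ?
               ORDER BY chr_start
               LIMIT ?"""
    # Placeholders keep the values separate from the SQL text.
    return conn.execute(query, (chr_name, start, end, limit)).fetchall()


hits = genes_in_region(conn, "2", 20000000, 20500000)
for name, start, length, desc in hits:
    print(name, start, length, desc)
```

Note that `chr_name` is a text column, so the chromosome is passed as the string `"2"`, matching the quoted value in the original query.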