Download PPT - Bioinformatics.ca

The Ensembl Database
www.ensembl.org
Lecture 3.1
Bioinformatics
1
What is Ensembl?
• Public annotation of the mammalian genomes
• Open source software
• Relational database system
• The future of genomic bioinformatics?
The Ensembl Project
“Ensembl is a joint project between EMBL
European Bioinformatics Institute and the
Sanger Institute to develop a software
system which produces and maintains
automatic annotation on eukaryotic
genomes. Ensembl is primarily funded by
the Wellcome Trust”
The Ensembl Genome Annotation
Ensembl:
• Utilises raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl” database)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Provides a publicly accessible web-based interface to the database
Ensembl Software System
• Makes extensive use of BioPerl (www.bioperl.org)
• Uses the free MySQL database
• The entire Ensembl code base is freely available
under an Apache open source license.
• Mainly written in Perl, with extensions in C. Some
viewers have been written in Java (e.g.
Apollo).
Ensembl Software System
• Core design feature is the “virtual contig”
object.
• Allows genome sequence to be accessed as
a single large contiguous sequence, but is
stored in the database as a collection of
fragments.
• VC object handles reading and writing to the
DNA sequence data.
Ensembl Software System
• Storing each sequence as a single large entry
would be impractical: the whole database entry
would need to be changed if a single base
changed. Because the constituent DNA is stored
as fragments, it is easy to move between
database versions and assemblies.
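The virtual contig idea described above can be illustrated with a small sketch. This is not Ensembl code: the class names `Fragment` and `VirtualContig`, the methods `subseq` and `set_base`, and the toy sequences are all hypothetical, chosen only to show how a sequence stored as fragments can be read as one contiguous sequence, and how changing a single base touches only one stored fragment.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored piece of sequence, placed at a 1-based start coordinate."""
    name: str
    start: int
    seq: str

    @property
    def end(self) -> int:
        return self.start + len(self.seq) - 1


class VirtualContig:
    """Hypothetical sketch: presents abutting fragments as one sequence."""

    def __init__(self, fragments):
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self._starts = [f.start for f in self.fragments]

    def _index_for(self, pos):
        # Find the fragment whose coordinate range contains pos.
        i = bisect_right(self._starts, pos) - 1
        if i < 0 or pos > self.fragments[i].end:
            raise IndexError(f"position {pos} is not covered by any fragment")
        return i

    def subseq(self, start, end):
        """Return bases start..end (inclusive), stitching across fragments."""
        pieces = []
        pos = start
        while pos <= end:
            frag = self.fragments[self._index_for(pos)]
            stop = min(end, frag.end)
            pieces.append(frag.seq[pos - frag.start: stop - frag.start + 1])
            pos = stop + 1
        return "".join(pieces)

    def set_base(self, pos, base):
        """Change one base; only the fragment containing it is rewritten."""
        frag = self.fragments[self._index_for(pos)]
        offset = pos - frag.start
        frag.seq = frag.seq[:offset] + base + frag.seq[offset + 1:]
        return frag.name


# Toy data (made up for illustration): three abutting fragments.
vc = VirtualContig([
    Fragment("frag_a", 1, "ACGTAC"),
    Fragment("frag_b", 7, "GGTTAA"),
    Fragment("frag_c", 13, "CCCGGG"),
])
region = vc.subseq(5, 14)      # spans all three fragments
touched = vc.set_base(8, "C")  # rewrites a single fragment
after = vc.subseq(5, 14)
print(region, touched, after)
```

The caller only ever works in coordinates of the assembled sequence; which fragment holds a given base is an internal detail, which is the property the slides attribute to the virtual contig object.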
Ensembl Software System
• Software can be accessed by FTP
• Can also be accessed through CVS
(Concurrent Versions System)
• Possible to set up a mirror of the entire
Ensembl system.
Ensembl Now Supports a Number of Organisms
The Chromosome Overview
Entering through Disease Genes via the OMIM database
The Ensembl Gene Report
Expanding Annotation Features, e.g. Unigene and Human mRNA similarities
Ensembl links out to other databases to access individual entries
Direct Access to Ensembl
We will query the underlying Ensembl data directly. For simplicity and to save
space, we will use the lite version of the Ensembl database.
{xhost01}/home/pubseq/acedb> mysql -u anonymous -h kaka.sanger.ac.uk
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11928 to server version: 3.23.32
Type 'help' for help.
mysql> use homo_sapiens_lite_8_30b ;
Database changed
mysql> show tables ;
+-----------------------------------+
| Tables_in_homo_sapiens_lite_8_30b |
+-----------------------------------+
| cpg                               |
| eponine                           |
| gene                              |
| gene_xref                         |
| genscan                           |
| repeat                            |
| repeat_types                      |
| snp                               |
| transcript                        |
| trna                              |
+-----------------------------------+
10 rows in set (0.15 sec)
Explore the Database
mysql> describe gene ;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned |      | PRI | NULL    | auto_increment |
| db            | varchar(40)      |      | MUL |         |                |
| type          | varchar(40)      |      |     |         |                |
| gene_id       | int(10) unsigned |      | MUL | 0       |                |
| gene_name     | varchar(40)      |      | MUL | unknown |                |
| chr_name      | varchar(20)      |      | MUL |         |                |
| chr_start     | int(10) unsigned |      |     | 0       |                |
| chr_end       | int(10) unsigned |      |     | 0       |                |
| chr_strand    | tinyint(4)       |      |     | 0       |                |
| description   | varchar(255)     |      |     |         |                |
| external_db   | varchar(40)      |      |     |         |                |
| external_name | varchar(40)      |      |     |         |                |
+---------------+------------------+------+-----+---------+----------------+
12 rows in set (0.14 sec)
mysql> select * from gene limit 10 ;
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
| id | db   | type     | gene_id | gene_name       | chr_name | chr_start | chr_end   | chr_strand | description | external_db | external_name |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
|  1 | embl | standard |       1 | AB001523.1.1.1  | 21       |  41942232 |  42033272 |          1 |             | protein_id  | BAA21099.1    |
|  2 | embl | standard |       2 | AB001523.1.1.2  | 21       |  42037154 |  42038833 |          1 |             | protein_id  | BAA21100.1    |
|  3 | embl | standard |       3 | AB015355.1.1.1  | 12       |  51605465 |  51627823 |         -1 |             | SPTREMBL    | O94801        |
|  4 | embl | pseudo   |       4 | AB019437.1.1.31 | 14       | 104129078 | 104129346 |         -1 |             | EMBL        | AB019437.1.1  |
|  5 | embl | standard |       5 | AB019437.1.1.23 | 14       | 104166399 | 104166849 |         -1 |             | protein_id  | BAA75024.1    |
|  6 | embl | standard |       6 | AB019437.1.1.15 | 14       | 104214187 | 104214630 |         -1 |             | protein_id  | BAA75022.1    |
|  7 | embl | pseudo   |       7 | AB019437.1.1.24 | 14       | 104163156 | 104163428 |         -1 |             |             |               |
|  8 | embl | standard |       8 | AB019437.1.1.16 | 14       | 104205297 | 104205735 |         -1 |             | protein_id  | BAA75023.1    |
|  9 | embl | pseudo   |       9 | AB019437.1.1.1  | 14       | 104323128 | 104323419 |         -1 |             |             |               |
| 10 | embl | pseudo   |      10 | AB019437.1.1.25 | 14       | 104157422 | 104157916 |         -1 |             |             |               |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
10 rows in set (0.16 sec)
mysql> select gene_name, chr_start, (chr_end-chr_start) AS Length,
description from gene where chr_name = "2" and chr_start > 20000000 and
chr_end < 20500000 limit 10 ;
+--------------------+-----------+--------+-----------------------------------------------------+
| gene_name          | chr_start | Length | description                                         |
+--------------------+-----------+--------+-----------------------------------------------------+
| ENSG00000118965    |  20001311 |  79861 |                                                     |
| ENSestG00000028348 |  20004285 |  18142 |                                                     |
| ENSestG00000028342 |  20023444 |  14129 |                                                     |
| ENSestG00000028189 |  20024182 |  33984 |                                                     |
| ENSestG00000028339 |  20044953 |  36215 |                                                     |
| ENSestG00000028192 |  20081242 |  14009 |                                                     |
| ENSestG00000028199 |  20083093 |   2338 |                                                     |
| ENSG00000132031    |  20083093 |  20642 | MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232] |
| ENSestG00000028338 |  20085340 |   6263 |                                                     |
| ENSestG00000028337 |  20093048 |  10687 |                                                     |
+--------------------+-----------+--------+-----------------------------------------------------+
10 rows in set (0.15 sec)
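The same kind of region query can be issued from a program rather than the MySQL monitor. The sketch below is only an illustration under several assumptions: it uses Python's built-in `sqlite3` module with an in-memory table as a stand-in for the MySQL server, keeps only five of the columns of the gene table, and loads three rows taken from the output shown above. The `chr_end` values for the two chromosome 2 genes are not shown in the slides; they are derived here as `chr_start + Length`. An `ORDER BY` clause is added so that the result order is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the Ensembl lite MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gene (
        gene_name   TEXT,
        chr_name    TEXT,
        chr_start   INTEGER,
        chr_end     INTEGER,
        description TEXT
    )
""")

# Rows from the slides; chr_end on chromosome 2 is chr_start + Length.
rows = [
    ("ENSG00000118965", "2", 20001311, 20081172, ""),
    ("ENSG00000132031", "2", 20083093, 20103735,
     "MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232]"),
    ("AB001523.1.1.1", "21", 41942232, 42033272, ""),
]
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", rows)

# Same query shape as on the slide, with placeholders instead of literals.
query = """
    SELECT gene_name, chr_start, (chr_end - chr_start) AS Length, description
    FROM gene
    WHERE chr_name = ? AND chr_start > ? AND chr_end < ?
    ORDER BY chr_start
    LIMIT 10
"""
result = conn.execute(query, ("2", 20000000, 20500000)).fetchall()
for row in result:
    print(row)
```

Passing the chromosome name and coordinates as parameters keeps the query text fixed, so the same statement can be reused for any region.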
Using the Perl API
• API means “application programming
interface”
• This gives us a seamless way to access
the MySQL database from Perl code.
• A Java API also exists for Ensembl.
A typical use of the API involves:
1) Connecting to the database
use Bio::EnsEMBL::DBSQL::DBAdaptor;
This line has to be in all your Ensembl scripts.
my $host = 'kaka.sanger.ac.uk';
my $user = 'anonymous';
my $dbname = 'current';
These all-important variables tell Perl where your database is and which one to use.
And now we connect:
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(-host   => $host,
                                             -user   => $user,
                                             -dbname => $dbname);
2) Use function calls to access the underlying data
my $clone = $db->get_Clone('AC005663');
The function returns a clone object, whose methods give access to its data:
print "Clone is " . $clone->id . "\n";
Which functions you can use, and what data they return, is
outlined in the documentation.
3) We can get the API to return a large amount of data in an array
my @contigs = $clone->get_all_Contigs;
We now have an array of contig objects which are very useful for
obtaining information.
Say we want to get the sequence for each contig:
foreach my $contig (@contigs) {
    my $seqobj = $contig->primary_seq;
    my $length = $contig->length;
    my $id     = $contig->id;
    print $seqobj->seq . "\n";
}
Note: These specific examples are taken from the Ensembl tutorial
http://www.ensembl.org/Docs/ensembl_tutorial.pdf
Further Information
• The Ensembl Project www.ensembl.org
• Ensembl Trace Server trace.ensembl.org
• Ensembl Distributed Annotation Server
servlet.sanger.ac.uk/das
• Human Genome Central Resources
www.ensembl.org/genome/central
• Distributed Annotation System
www.biodas.org