Genome Databases and Open Access Bibliographic Resources
Sonia Abdelhak
Institut Pasteur Tunis
Ahmed Rebaï
Centre of Biotechnology Sfax
Fredj Tekaia
Institut Pasteur Paris
Outline
• General introduction and overview of
complete genome sequences
• Genomes databases and where to
find them
• Comparative Genomics Databases
• Other Omics resources
• Bibliographic/Open access resources
Why databases?
• In the genomic era we have billions of data points
that need to be stored, curated and made
accessible for analysis and knowledge
discovery
• Databases are essential resources for
both experimental and computational
biologists
• We have crossed the Terabyte threshold of
genomic data (Huge, massive, explosion!)
Chronology of completely sequenced genomes
• 1977: first viral genome (5386 base pairs;
encoding 11 genes). Sanger et al. sequence
bacteriophage φX174.
• 1981: Human mitochondrial genome. 16,500 base
pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
• 1986: Chloroplast genome. 156,000 base pairs
(most are 120 kb to 200 kb)
• 1995: first genome of a free-living organism, the
bacterium Haemophilus influenzae, by TIGR; 1830 kb;
1713 genes.
• 1996: first archaeal genome:
Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb;
1773 genes.
• 1997: first eukaryotic genome: Saccharomyces cerevisiae
S288C; international collaboration; 16 chromosomes;
12,057 kb; ~6000 genes.
• 1998: first multicellular organism: the nematode
Caenorhabditis elegans; 97 Mb; ~19,000 genes.
• 1999: first human chromosome: chromosome
22 (49 Mb; 673 genes)
• 2000: fruit fly Drosophila melanogaster (137 Mb;
~13,000 genes)
• 2000: first plant genome: Arabidopsis thaliana (115.428
Mb; 22,670 genes)
• 2001: draft sequence of the human genome (3300 Mb;
~28,000 genes)
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes)
• 2002: mouse genome (2700 Mb; ~28,000 genes)
• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb;
~28,000 genes)
• 2005: dog (~2,400 Mb; 33,651 genes) and chicken
(18,031 genes) genomes
Complete genomes
• 2467 projects (as of 03-17-07)
• 524 published
• 1091 Bacteria
• 59 Archaea
• 720 eukaryotes
• 3 phylogenetic domains (tree of life)
• Lifestyles: mesophiles; (hyper)thermophiles;
psychrophiles; extreme conditions, ...
http://www.genomesonline.org/
Genome sequencing projects
There are several web-based resources that
document the progress of completely
sequenced genomes and their reference
publication, including:
GOLD
Genomes Online Database
http://www.genomesonline.org/gold.cgi
How big are genome sizes?
Viral genomes: 1 kb to 360 kb (Canarypox virus)
Note: Mimivirus: 1.2 Mb
http://www.giantvirus.org/top.html
(Top 100 largest viral genome sequences)
Bacterial genomes: 0.5 Mb to 13 Mb;
Eukaryotic genomes: 8 Mb to 670 Gb;
Database of Genome sizes:
http://www.cbs.dtu.dk/databases/DOGS/index.php
Genome Sizes (MegaBases)
[Figure: two bar charts of genome size. First panel (axis 0 to 3500 Mb): E. coli, yeast, worm, fly, Fugu, human. Second panel (axis 0 to 600,000 Mb): fly, Fugu, human, wheat, amoeba.]
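The size ranges quoted above span several orders of magnitude, which is why a single linear axis cannot show them all. A minimal Python sketch, using only sizes quoted in these slides; the names `GENOME_SIZES_BP`, `format_size` and `orders_of_magnitude` are illustrative and do not come from any database API:

```python
import math

# Genome sizes in base pairs, copied from the figures quoted in these slides.
GENOME_SIZES_BP = {
    "Bacteriophage phiX174": 5_386,
    "Human mitochondrion": 16_500,
    "Haemophilus influenzae": 1_830_000,
    "Saccharomyces cerevisiae": 12_057_000,
    "Caenorhabditis elegans": 97_000_000,
    "Drosophila melanogaster": 137_000_000,
    "Homo sapiens (draft)": 3_300_000_000,
}


def format_size(bp):
    """Format a size in base pairs with the largest unit (Gb, Mb, kb) that fits."""
    for unit, factor in (("Gb", 1e9), ("Mb", 1e6), ("kb", 1e3)):
        if bp >= factor:
            return f"{bp / factor:.1f} {unit}"
    return f"{bp} bp"


def orders_of_magnitude(small_bp, large_bp):
    """Return how many powers of ten separate two genome sizes."""
    return math.log10(large_bp / small_bp)


# Print the genomes from smallest to largest.
for name, bp in sorted(GENOME_SIZES_BP.items(), key=lambda item: item[1]):
    print(f"{name:28s} {format_size(bp):>10s}")

span = orders_of_magnitude(
    GENOME_SIZES_BP["Bacteriophage phiX174"],
    GENOME_SIZES_BP["Homo sapiens (draft)"],
)
print(f"phiX174 to human: about {span:.1f} orders of magnitude")
```

Plotting such values on a logarithmic axis is one way to keep viral, bacterial and eukaryotic genomes on the same chart.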
BIOLOGICAL DATABASE CATEGORIES
•Databases of nucleic acid sequences (RNA, DNA)
•Databases of protein sequences
•Databases of protein motifs and protein domains
•Databases of structures
•Databases of genomes
•Databases of genes
•Databases of expression profiles
•Databases of SNPs and mutations
•Databases of metabolic pathways and protein
associations
•Databases of taxonomy
•…
Can we find a list of ‘clean’ databases?
The NAR Database issue
• The 2007 update includes 968 databases, 110
more than the previous one.
• 68 new databases
• updates of 106 existing databases
• The complete database list and summaries are
available online on the Nucleic Acids Research
web site http://nar.oxfordjournals.org/
NAR Database Category List
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Genomics Databases (non-vertebrate)
– MGD - Mouse Genome Database ?????
– TIGR Gene Indices ?????
– Genome annotation terms, ontologies and
nomenclature
– Taxonomy and identification
– General genomics databases
– Viral genome databases
– Prokaryotic genome databases
– Unicellular eukaryotes genome databases
– Fungal genome databases
– Invertebrate genome databases
Three types of genome databases
• Databases which collect data of all
sequenced genomes (Entrez_Genomes;
EBI_genomes)
• Databases which collect data of a
category of organisms with sequenced
genomes (Microbial Genomes at TIGR)
• Databases specific for one organism with
sequenced genomes (Flybase, MGD,
Ensembl)
What kind of information do you find
there?
• Genome databases contain genomic information
collected from many sources.
– Genome assembly
– Gene predictions
– Known genes, mRNA, ESTs, proteins
– Genetic maps, markers and polymorphisms
– Gene expression and phenotypes
– Annotations
– Interspecies homologues
Resources for genomes
There are two main resources for genomes:
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/Genomes/
There are also many other resources from sequencing institutions:
Sanger
The Wellcome Trust Sanger Institute
http://www.sanger.ac.uk/
TIGR
The Institute for Genomic Research
http://cmr.tigr.org/tigr-scripts/CMR/shared/Genomes.cgi
Genolevures
http://cbi.labri.fr/Genolevures/index.php
Databases by phylogenetic groups
Eukaryotic genomes:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
Bacteria, fungi genomes:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Fungi&taxgroup=11:Fungi|12:
Insects:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=12:Insects&taxgroup=11:|12:Insects
Plant genomes:
http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
...
The (ever expanding) Entrez System
[Diagram: Entrez at the centre, linked to OMIM, PubMed, PubMed Central, 3D Domains, Journals, Structure, Books, CDD/CDART, Protein, Taxonomy, Genome, GEO/GDS, UniSTS, UniGene, Nucleotide, SNP, PopSet]
[Figure labels: Mouse; Assembly; WGS; Other; GenBank; RefSeq; Contig; BAC; RefSeq Transcript; UniGene Transcript; Maps and Options]
Common features of genomic
databases
• Possibility to download all the sequences
of the genome or part of them
(chromosomes, clones, genes, CDS,..)
• Most of them have a corresponding
protein resource (the set of proteins
obtained by translating all CDS)
• Example: Entrez-Genome of the NCBI
Genpept
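The protein resource mentioned above is described as the set of proteins obtained by translating all CDS. A minimal Python sketch of that translation step, assuming the standard genetic code and a CDS that is already in frame; the function name `translate_cds` is illustrative, and real resources may handle alternative genetic codes and exceptions that this sketch ignores:

```python
# Standard genetic code, written in the conventional TCAG order:
# the first base selects a block of 16, the second a block of 4, the third one letter.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are translated as 'X';
    a trailing incomplete codon is ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))
```

Applying such a function to every annotated CDS of a genome gives the kind of protein set that accompanies a genome entry.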
Comparative Genomics
databases
Comparative genomics
Analyses of the genetic material of different species help us
understand the similarities and differences between genomes,
their evolution and the evolution of their genes.
•Intra-genomic comparisons help us understand the degree of
duplication (genome regions; genes), gene organization, ...
•Inter-genomic comparisons help us understand the degree of
similarity between genomes and the degree of conservation between genes
•They also help us understand gene and genome evolution
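The two kinds of comparison can be illustrated with a toy Python sketch. The data below are hypothetical gene-family labels, not real annotations, and the Jaccard index is only one possible way to express "degree of similarity"; the function names are illustrative:

```python
from collections import Counter

# Hypothetical toy data: each genome is a list of gene-family labels, one per gene.
genome_a = ["kinase", "kinase", "transporter", "ribosomal", "histone"]
genome_b = ["kinase", "transporter", "transporter", "ribosomal", "chitinase"]


def duplicated_families(genome):
    """Intra-genomic comparison: families present in more than one copy."""
    counts = Counter(genome)
    return {family: n for family, n in counts.items() if n > 1}


def shared_fraction(genome_x, genome_y):
    """Inter-genomic comparison: Jaccard index of the two sets of families."""
    families_x, families_y = set(genome_x), set(genome_y)
    union = families_x | families_y
    if not union:
        return 0.0
    return len(families_x & families_y) / len(union)


print(duplicated_families(genome_a))
print(duplicated_families(genome_b))
print(shared_fraction(genome_a, genome_b))
```

In practice, family assignments would come from a resource such as COG rather than from hand-written labels.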
Internet resources for whole-genome comparative
analysis and associated tools
Resource                                 URL
UCSC Genome Bioinformatics               http://genome.ucsc.edu/
Ensembl                                  http://www.ensembl.org/
MapViewer                                http://www.ncbi.nlm.nih.gov/mapview/
VISTA Genome Browser                     http://pipeline.lbl.gov/
K-BROWSER                                http://hanuman.math.berkeley.edu/cgi-bin/kbrowser2
Comparative Regulatory Genomics          http://corg.molgen.mpg.de/
GALA                                     http://www.bx.psu.edu/
EnsMart                                  http://www.ensembl.org/EnsMart/
ETOPE                                    http://www.bx.psu.edu/
PipMaker and MultiPipMaker               http://www.bx.psu.edu/
VISTA server                             http://www-gsd.lbl.gov/vista/
MAVID server                             http://baboon.math.berkeley.edu/mavid/
zPicture server                          http://zpicture.dcode.org/
rVISTA server                            http://rvista.dcode.org/
COGs (Clusters of Orthologous Groups)    http://www.ncbi.nlm.nih.gov/COG/
UCSC Comparative Genomics
NCBI
Homo sapiens Genome:
Statistics -- Build 36 version 2
Genes 28,961
Some considerations
• Organism-specific databases can be more
up-to-date than general databases
• Genome databases are not a one-stop
shop for all information; other databases
like UniProt are still needed!
Bibliographic Databases
and Open Access resources
PubMed
http://www.pubmed.org/
• Access to more than 12 million papers
since 1950 (3790 journals)
• Simple and advanced literature search
with keywords, author names, MeSH terms,
journals, single citations, ...
• Some papers are free from the journal
website or through the publishers
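The search options listed above (keywords, author names, MeSH terms, journals) can also be combined in a scripted query. A minimal Python sketch that only builds a query URL and makes no network call. It assumes NCBI's E-utilities `esearch` endpoint and PubMed field tags, neither of which is mentioned in these slides; the function name `pubmed_search_url` and its parameters are illustrative:

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keywords=None, author=None, mesh=None, journal=None, retmax=20):
    """Build a PubMed esearch URL from optional search criteria.

    Each criterion becomes one clause; clauses are joined with AND.
    """
    clauses = []
    if keywords:
        clauses.append(keywords)
    if author:
        clauses.append(f"{author}[Author]")
    if mesh:
        clauses.append(f"{mesh}[MeSH Terms]")
    if journal:
        clauses.append(f"{journal}[Journal]")
    if not clauses:
        raise ValueError("at least one search criterion is required")
    params = {"db": "pubmed", "term": " AND ".join(clauses), "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(pubmed_search_url(keywords="comparative genomics", author="Tekaia F"))
```

The same clauses can be typed directly into the PubMed search box; the script form is only convenient when many queries have to be repeated.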
PubMed Central
http://www.pubmedcentral.com/
Free access journals
• Authors pay to allow readers to get the
papers free
• The BMC initiative
• The PLoS initiative
• Other initiatives: some journals give
immediate free online access and others
a few (1-12) months after publication
BioMed Central (BMC)
http://www.biomedcentral.com/
The PLOS initiative
http://www.plos.org/
Highwire
http://highwire.stanford.edu/
The HINARI initiative
• The Health InterNetwork Access to Research
Initiative (HINARI) provides free or very low cost
online access to the major journals in biomedical and
related social sciences to local, not-for-profit
institutions in developing countries.
• HINARI was launched in January 2002, with some
1500 journals from 6 major publishers. 22 additional
publishers joined in May 2002, bringing the total
number of journals to over 2000.
• Today more than 70 publishers are offering their
content in HINARI and others will soon be joining the
programme.
And also books!
If you want to learn
Just try and RTM
• General genomics databases
– Animal Genome Size Database
– BacMap
– COG - Clusters of Orthologous Groups of proteins
– CoGenT++
– DEG - Database of Essential Genes
– EBI Genomes
– Entrez Gene
– Entrez Genomes
– ERGO-Light
– GenDiS
– GeneNest
– Genome information broker
– Genome Project Database
– Genome Reviews
– GOLD
– GtRDB - Genomic tRNA Database
– Inparanoid
– Integr8 (formerly Proteome Analysis Database)
– INVHOGEN
– KaryotypeDB
– KEGG - Kyoto Encyclopedia of Genes and Genomes
– MBGD - Microbial Genome Database
– MeGX
– MetaCyc
– NegProt - Negative Proteome database