High throughput biology projects
Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11

The new biology
Traditional biology:
  - Small team working on a specialized topic
  - Well-defined experiment to answer precise questions
New « high-throughput » biology:
  - Large international teams using cutting-edge technology defining the project
  - Results are given raw to the scientific community without any underlying hypothesis

Examples of « high-throughput » biology
  - Complete genome sequencing
  - Large-scale sampling of the transcriptome
  - Simultaneous gene expression analysis of thousands of genes (DNA chips)
  - Large-scale sampling of the proteome
  - Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
  - Large-scale 3D structure production (yeast)
  - Metabolism modelling
  - Biodiversity

Role of bioinformatics
  - Control and management of the data
  - Analysis of primary data, e.g.:
      - Base calling from chromatograms
      - Mass spectra analysis
      - DNA chip image analysis
  - Statistics
  - Analysis of results in a biological context

Genomes in numbers
Sizes:
  virus:     10^3 to 10^5 nt
  bacteria:  10^5 to 10^7 nt
  yeast:     1.35 x 10^7 nt
  mammals:   10^8 to 10^10 nt
  plants:    10^10 to 10^11 nt
Gene number:
  virus:     3 to 100
  bacteria:  ~ 1000
  yeast:     ~ 7000
  mammals:   ~ 30’000

Sequencing projects
« Small » genomes (< 10^7 nt): bacteria, viruses
  - Many already sequenced (industry excluded)
  - More than 60 bacterial genomes already in the public domain
  - More to come! (one every two weeks…)
« Large » genomes (10^7 to 10^10 nt): eukaryotes
  - 5 finished (S. cerevisiae, C. elegans, D. melanogaster, A. thaliana, Homo sapiens)
  - Many more to come: mouse, rat, rice (and other plants), fishes, many pathogenic parasites
EST sequencing
  - Partial mRNA sequences
  - ~ 8.5 x 10^6 sequences in the public domain

Human genome
  - Size: 3 x 10^9 nt for a haploid genome
  - Highly repetitive sequences: 25%; moderately repetitive sequences: 25-30%
  - Size of a gene: from 900 to > 2’000’000 bases (introns included)
  - Proportion of the genome coding for proteins: 5-7%
  - Number of chromosomes (haploid set): 22 autosomes and 1 sex chromosome
  - Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases

How to sequence the human genome?
« International consortium » approach:
  - Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences
  - Generate a physical map based on large clones (BAC or PAC)
  - Sequence enough large clones to cover the genome
« Commercial » approach (Celera):
  - Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
  - Sequence both ends of enough clones to obtain a 10x coverage
  - Use computer techniques to reconstitute the chromosomal sequences, and check the result against the physical map of the public project

Mapping resources
  - Genetic and physical maps: Genethon, GDB, NCBI
  - Radiation hybrid map: Sanger
  - BAC production & mapping: Oakland, Caltech, others
  - Clone information and retrieval: RZPD (Germany)
  - Physical maps in ACEDB format from chromosome coordinators

Sequencing
  - Create a shotgun library from each BAC/PAC
  - Sequence individual clones to get a ten-fold coverage
Phases:
  0 = single sequence (like STS)
  1 = unordered contigs
  2 = ordered, oriented contigs
  3 = finished, annotated sequence

Chromosome size sequences
  - Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications
  - Solution: split the sequence into overlapping “chunks”
  - New problem: the chunks have to be reassembled if you want to analyze the whole sequence
  - GenBank provides “meta-entries” (CON division) with assembly instructions

Interpretation of the human draft
  - Many gaps and unordered small pieces
  - A genomic sequence does not tell you where the genes are encoded
  - The genome is far from being « decoded »
  - One must combine genome and transcriptome to get a better idea

The transcriptome
  - The set of all functional RNAs (tRNA, rRNA, mRNA, etc.) that can potentially be transcribed from the genome
  - The documentation of the localization (cell type) and conditions under which these RNAs are expressed
  - The documentation of the biological function(s) of each RNA species

Public draft transcriptome
  - Information about the expression specificity and the function of mRNAs:
      - « full » cDNA sequences of known function
      - « full » but « anonymous » cDNA sequences (e.g. KIAA or DKFZ collections)
  - EST sequences:
      - cDNA libraries derived from many different tissues
      - Rapid random sequencing of the ends of all clones
  - ORESTES sequences
  - Limited set of expression data

How to organise EST collections?
  - Clustering: associate individual EST sequences with unique transcripts or genes
  - Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster
  - Mapping: associate ESTs (or EST contigs) with exons in genomic sequences
  - Interpreting: find and correct coding regions

Example mapping of ESTs and mRNAs
[Figure: aligned tracks labelled mRNAs, ESTs and Computer prediction]

How to cope with the amount of data?
  - Enormous increase in the number of sequences
  - Data that keep changing (phases…)
  - Automatic annotation projects:
      - RefSeq (NCBI)
      - ENSEMBL (EBI)
      - HAMAP (SIB)

RefSeq: NCBI Reference sequences
mRNAs and proteins:
  NM_123456  Reference mRNA
  NP_123456  Reference Protein
  XM_123456  Predicted Transcript
  XP_123456  Predicted Protein
  XR_123456  Predicted non-coding Transcript
Gene records:
  NG_123456  Reference Genomic Sequence
Assemblies:
  NT_123456  Reference Contig (Mouse and Human Genomes)
  NC_123455  Reference Chromosome, Microbial Genomes, Plasmid

Status codes
Each RefSeq record carries a status code that indicates the level of review the record has undergone.
  REVIEWED
    The RefSeq record has been reviewed by NCBI staff. The review process includes reviewing available sequence data and frequently also includes a review of the literature.
  PROVISIONAL
    The RefSeq record has not yet been subject to individual review.
  PREDICTED
    Some aspect of the RefSeq record is predicted and there is supporting evidence that the locus is valid.
  GENOME ANNOTATION
    This identifies the contig (NT_ accessions), mRNA (XM_), non-coding transcript (XR_), and protein (XP_) RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing.
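
The accession prefixes on the RefSeq slide lend themselves to a simple lookup. The following is a minimal sketch in Python, not NCBI code: the prefix meanings are copied from the slide, while the names REFSEQ_PREFIXES and describe_accession are hypothetical and chosen only for this illustration. It also assumes that an optional version suffix such as ".2" may follow the number.

```python
# Prefix meanings are taken from the RefSeq slide above.
# The dictionary and function names are hypothetical, for illustration only.
REFSEQ_PREFIXES = {
    "NM": "Reference mRNA",
    "NP": "Reference Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding Transcript",
    "NG": "Reference Genomic Sequence",
    "NT": "Reference Contig",
    "NC": "Reference Chromosome, Microbial Genome or Plasmid",
}


def describe_accession(accession):
    """Return (prefix, description) for an accession such as 'NM_123456'."""
    prefix, separator, number = accession.partition("_")
    # Assumption: an optional version suffix (e.g. '.2') may follow the digits.
    digits = number.split(".")[0]
    if not separator or not digits.isdigit():
        raise ValueError("not a RefSeq-style accession: %r" % accession)
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError("unknown RefSeq prefix: %r" % prefix)
    return prefix, REFSEQ_PREFIXES[prefix]


if __name__ == "__main__":
    for acc in ("NM_123456", "XP_123456", "NT_123456"):
        print(acc, "->", describe_accession(acc)[1])
```

A lookup like this only decodes the record type; the review status (REVIEWED, PROVISIONAL, etc.) is stored in the record itself and cannot be derived from the accession.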

Map view of RefSeq
[Figure: map view with tracks labelled NT_, XM_ and NM_]

ENSEMBL
Goals of Ensembl:
  - Accurate, automatic analysis of genome data
  - Analysis maintained on the current data
  - Presentation of the analysis to biologists via the Web
  - Distribution of the analysis to other bioinformatics laboratories
The Ensembl project will be a foundation for a next-generation sequence database that provides a curated, distributed, non-redundant view of the genomes of model organisms.
Commitments of the Ensembl project:
  - Public release of data. All the data and analysis will be put into the public domain immediately.
  - Open, collaborative software development. The software which forms the automated pipeline will be available to everyone under an open license, modelled after the Apache license.
  - Collaboration on agreed standards for distribution. We hope to provide the data in as many useful forms as is practical, including the EMBL flat file formats and new data distribution channels such as XML and CORBA.

ENSEMBL

ENSEMBL views

High-quality Automated Microbial Annotation of Proteomes (HAMAP)
Aim: automatically annotate, with the highest level of quality, a significant percentage of the proteins originating from microbial genome sequencing projects.
The programs being developed are specifically designed to track down "eccentric" proteins. Among the peculiarities recognized by the programs are:
  - size discrepancy
  - absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc.)
  - presence of paralogs
  - contradiction with the biological context (e.g. a protein that belongs to a pathway supposed to be absent in a particular organism)
Such "problematic" proteins will not be automatically annotated. This project should allow annotators in the SWISS-PROT groups at SIB and EBI to concentrate on the proteins that really need careful manual annotation.

HAMAP origin
  - About 60 microbial genomes are available today
  - > 1000 in a few years; > 1 million microbial proteins!
  - Functional analysis and detailed biochemical characterization will only be available:
      - For « all » proteins in a handful of model organisms (e.g. E. coli, B. subtilis)
      - For proteins involved in pathogenicity (medical and pharmaceutical interests)
      - For proteins involved in specific biosynthetic or catabolic pathways (biotechnological and food industry interests)

HAMAP overview

HAMAP flow chart

HAMAP study case
The case of the Escherichia coli proteome
According to the original analysis in 1997: 4286 protein-coding genes
  - 60 were missed (almost all < 100 residues)
  - 120 are most probably « bogus »
  - 50 pairs or triplets of ORFs had to be fused
  - 719 have proven or probable wrong start sites
  - ~ 1800 are still not biochemically characterized; only one new « functionalisation » per week…

Unix reminder
  - General: man, pwd, cd, ls, mkdir, rmdir, passwd, exit
  - File manipulation: cat, more, cp, mv, rm, grep, find, diff, head, tail, chmod
  - Editing: vi, pico, emacs
  - Compression: tar, (un)compress, gzip
  - Various: redirection (<, >, >>) and piping (|)
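
Several of the commands listed above are typically combined to inspect sequence files, for example counting the records of a FASTA file with something like `grep -c '^>' sequences.fasta` (the file name is hypothetical). As a rough illustration of the same filtering idea, here is a minimal Python sketch; the function name count_fasta_records and the in-memory example data are made up for this purpose and are not part of the course material.

```python
import io


def count_fasta_records(lines):
    """Count header lines (those starting with '>'), one per FASTA record."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__":
    # Made-up data standing in for a FASTA file on disk.
    example = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\nTTAA\n")
    print(count_fasta_records(example))  # prints 2
```

The function accepts any iterable of lines, so an open file handle can be passed in place of the in-memory example.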