Large-scale genome projects
• Sequencing DNA molecules in the Mb size range
• All strategies employ the same underlying principles: Random Shotgun sequencing
Strategy / Libraries / Sequencing / Assembly / Closure / Annotation / Release

[Figure: shotgun sequencing workflow. Labels: Genomic DNA; Shearing/Sonication; Subclone and Sequence; Shotgun reads; Assembly; Contigs; Finishing read; Finishing; Complete sequence]

Nucleotide Database Growth

EMBL breakdown by organism
EMBL Release 65

Progress on Large Sequencing Projects

Strategies for sequencing
• How big can you go?
  • Large-insert clones
    • cosmids 30-40 kb
    • BACs/PACs 50-100 kb
  • Whole chromosomes
  • Whole genomes

Genome size and sequencing strategies
[Figure: genomes placed on a genome size (log Mb) axis, 0 to 4, against the strategies used]
Genomes: H. sapiens (3000 Mb), D. melanogaster (170 Mb), C. elegans (100 Mb), P. falciparum (30 Mb), S. cerevisiae (14 Mb), E. coli (4 Mb)
Strategies: Whole Genome Shotgun (WGS), Clone-by-clone, Whole Chromosome Shotgun (WCS), Whole Genome Shotgun (WGS) with clone ‘skims’

Strategies for sequencing
• Size and GC composition of genome
  • Volume of data
  • Ease of cloning
  • Ease of sequencing
• Genome complexity
  • dispersed repetitive sequence
  • telomeres & centromeres
• Politics/Funding

Strategies: Clone by Clone
• Simple (0.5-2 K reads)
  • Few problems with repeats
  • Relatively simple informatics
• Scalability
• Quality of physical map
  • Fingerprint / STS maps
  • End sequencing

Strategies: Whole Chromosome shotgun (WCS)
• Requires chromosome isolation
• Moderate complexity (10’s K reads)
  • Problems with repeats
  • Complex informatics
• Inefficient in isolation
• Quality of physical map
  • Skims of mapped clones

Strategies: Whole Genome shotgun (WGS)
• Moderate to High complexity (10-100’s K reads)
  • Problems with repeats
  • Complex informatics
• Quality of physical map
  • Fingerprint map
  • STS markers
  • End-sequences
  • Skims of mapped clones

Sequencing my genome
[Figure: phases Politics, Production, Finishing, Annotation; axes TIME and MONEY]

What do you get?
DATA!!, DATA!!, and more DATA!!
• Sequence
  • incomplete v complete
• First-pass annotation
  • Gene discovery
• Full annotation
  • A starting point for research

Genome annotation is central to functional genomics
[Figure labels: ORFeome based functional genomics; RNAi phenotypes; Gene Knockout; Expression Microarray]

Sequencing
• Library construction
• Colony picking
• DNA preparation
• Sequencing reactions
• Electrophoresis
• Tracking/Base calling

Libraries
• Essentially sub-cloning
• Generation of small-insert libraries in a well-characterised vector
  • Ease of propagation
  • Ease of DNA purification
  • e.g. pUC18, M13

Libraries - testing
• Simple concepts
• Insert/Vector ratio
• Real data
• Insert size
• Sequence ….
• Simple analysis

Sequence generation
• Pick colonies
• Template preparation
• Sequence reactions
  • Standard terminator chemistry
  • pUC libraries sequenced with forward and reverse primers

Sequence generation
• Electrophoresis of products
  • Old style - slab gels, 32 > 64 > 96 lanes
  • New style - capillary gels, 96 lanes
• Transfer of gel image to UNIX
  • Sequencing machines use a slave Mac/PC
  • Move data to centralised storage area for processing

Gel image processing
• Light-to-Dye estimation
• Lane tracking
• Lane editing
• Trace extraction
• Trace standardisation
• Mobility correction
• Background subtraction

Pre-processing
• Base calling using Phred
  • modifies SCF file
• Quality clipping
• Vector clipping
  • Sequencing vector
  • Cloning vector
• Screen for contaminants
• Feature mark up (repeats/transposons)

Finishing
• Assembly: the process of assembling raw single-pass reads into a contiguous consensus sequence
• Closure: the process of ordering and merging consensus sequences into a single contiguous sequence
• Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries.
  The overall error rate is estimated at less than 1 error per 10 kb.

Genome Assembly
• Pre-assembly
• Assembly
• Automated appraisal
• Manual review

Pre-Assembly
• Convert to CAF format
  • flatfile text format
  • choice of assembler
  • choice of post-assembly modules
  • choice of assembly editor
www.sanger.ac.uk/Software/CAF

Assembly
• Assemble using Phrap
  • Read fasta & quality scores from CAF file
  • Merge existing Phrap .ace file as necessary
  • Adjust clipping

Assembly appraisal
• auto-edit
  • removes 70% of read discrepancies
• Remove cloning vector
• Mark up sequence features
• finish
  • Identify low-quality regions
  • Cover using ‘re-runs’ and ‘long-runs’
• Compare with current databases
  • plate contamination

Manual Assembly appraisal
• Use a sequence editor (GAP/consed)
  • Tools to identify internal joins
  • Tools to identify and import data from overlapping projects
  • Tools to check failed or mis-assembled reads for inclusion in project

Manual editing
• Sanger uses 100% edit strategy
• Where additional data is required:
  • Check clipping
  • Additional sequencing
    • Template / Primer / Chemistry
  • Assemble new data into project
    • GAP4 Auto-assemble
  • Repeat whole process

Manual Quality Checks
• Force annotation tag consistency
• All unedited data is re-assembled using Phrap
  • All high-quality discrepancies are reviewed
• Confirm restriction digest (clones)
• Check for inverted repeats
• Manually check:
  • Areas of high-density edits
  • Areas with no supporting unedited data
  • Areas of low read coverage
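
The quality targets in the assembly slides (Phred quality scores, "identify low-quality regions", and a finished error rate below 1 error per 10 kb) can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the standard Phred definition Q = -10 log10(P), while the function names, the threshold of 30 and the minimum run length are hypothetical choices and not the logic of the `finish` program or of Phrap.

```python
def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_errors_per_10kb(quals):
    """Mean per-base error probability, scaled to 10,000 bases."""
    if not quals:
        raise ValueError("need at least one quality value")
    mean_p = sum(phred_to_error_prob(q) for q in quals) / len(quals)
    return mean_p * 10_000


def low_quality_regions(quals, threshold=30, min_len=3):
    """Return half-open (start, end) runs where every base is below threshold.

    threshold and min_len are hypothetical defaults chosen for illustration.
    """
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # open a new low-quality run
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions


if __name__ == "__main__":
    consensus_quals = [40, 40, 10, 12, 15, 40, 20, 40, 5, 5, 5]
    print(low_quality_regions(consensus_quals))
    print(expected_errors_per_10kb([40] * 100))
```

Under this definition, an average consensus quality of 40 corresponds to an error probability of 1 in 10,000, which is the "1 error per 10 kb" boundary quoted for finished sequence. Regions flagged as low quality are the ones that would be covered with re-runs or long-runs.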
Gap closure
• Read pairs
• PCR reactions (long-range / combinatorial)
• Small-insert libraries
• Transposon-insertion libraries

Gap closure - contig ordering
• Read pair consistency
• STS mapping
• Physical mapping
• Genetic mapping
• Optical mapping
• Large-insert clone
  • skims
  • end-sequencing

Annotation
• DNA features (repeats/similarities)
• Gene finding
• Peptide features
• Initial role assignment
• Others - regulatory regions

Annotation of eukaryotic genomes
[Figure: Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (capped, poly-A tail) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B). Annotation approaches shown alongside: ab initio gene prediction; Comparative gene prediction; Functional identification]

Genome analysis overview: C. elegans

DNA features
• Similarity features
  • mapping repeats
    • simple tandem and inverted
    • repeat families
  • mapping DNA similarities
    • EST/mRNAs in eukaryotes
    • Duplications
    • RNAs
  • mapping peptide similarities
    • protein similarities

Gene finding
• ORF finding (simple but messy)
• ab initio prediction
  • Measures of codon bias
  • Simple statistical frequencies
• Comparative prediction
  • Using similarity data
  • Using cross-species similarities

Peptide features
• Peptide features
  • low-complexity regions
  • trans-membrane regions
  • structural information (coiled-coil)
• Similarities and alignments
• Protein families (InterPro/COGS)

Initial role assignment
• Simple attempt to describe the functional identity of a peptide
• Uses data from:
  • peptide similarities
  • protein families
• Vital for data mining
• Large number of predicted genes remain
  hypothetical or unknown

Other regulatory features
• Ribosomal binding sites
• Promoter regions

Data Release
• DNA release
  • Unfinished
  • Finished
• Nucleotide databases
  • GENBANK/EMBL/DDBJ
• Peptide databases
  • SWISSPROT/TREMBL/GENPEPT
• Others
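
As a footnote to the Gene finding slide, the "ORF finding (simple but messy)" step can be sketched in a few lines. This is a minimal illustration, not the method used by any particular annotation pipeline: it scans all six reading frames for an ATG followed by an in-frame stop codon, assuming the standard genetic code. The function names and the `min_codons` cutoff are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG..stop ORFs in six frames.

    start and end are 0-based, half-open coordinates on the strand scanned
    (the reverse strand is reported in its own coordinates). min_codons counts
    the start and stop codons and is a hypothetical cutoff for illustration.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

The "messy" part shows up quickly on real data: short spurious ORFs are common, overlapping ORFs on both strands must be resolved, and in eukaryotes introns break the simple start-to-stop model, which is why the slide lists ab initio and comparative prediction as the next steps.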