Spring 2007 Bioinformatics
Ch. 6 - Genomics

Spring 2009 Bioinformatics

Completed genomes
• http://www.genomesonline.org
• Avg. genome = 5 Mb
• Typical sequence coverage = 20X, therefore approx. 100 Mb of DNA
• Avg. English word size = 5 letters
• Avg. words per page = 250, therefore 1250 letters per page
• Avg. book size = 200 pages, therefore 250,000 letters per book
• Approximately 400 books per genome
• 958 completed genomes as of January 1, 2009
• Approximately 383,200 books' worth of genomic information
• MSU library holdings: 182,000

Approaches to Genome Sequencing
• Whole Genome Sequencing
• Shotgun Sequencing
• Expressed Sequence Tags
• Comparative Genomics
• Metagenomics

Overview of Genome Sequencing
Isolate Genomic DNA → Genomic DNA → Create Genomic Library → BAC Clones → Construction of Genome Map → DNA Sequencing and Assembly

Isolating Genomic DNA (à la Qiagen’s DNeasy kit)
Lysis:
• Proteinase K digestion
• Lysis by chaotropic salt
Purification:
• DNA negatively charged
• Bind positively charged column
• Wash (EtOH) away impurities
Elution:
• Removal of DNA
• Disrupt ionic interaction with high salt buffer
Preservation:
• Store at -20°C to -160°C
• Tris-EDTA buffer [pH 8.0]

Sephadex Structure

Creating a Genomic Library
Cut Genomic DNA:
• Partial Restriction Digest
  – EcoRI & EcoRI methylase
• Mechanical Shearing
• Determine avg. fragment size
Clone Fragments into BAC vectors:
• Properties of BACs
BAC Clones
Transform E. coli:
• Electroporation

Pulsed-Field Gel Electrophoresis

Average Insert Size by Pulsed-Field Gel Electrophoresis

Average Insert Size in Human BACs

Bacterial Artificial Chromosome
• Derived from F plasmids
• Multiple cloning site
• Selectable Marker
  – Antibiotic Resistance Gene, e.g., cm (chloramphenicol)
• Ori S - unidirectional
• Par genes
  – partitioning genes
  – maintain single copy of BAC

Construction of Genome Map
Transformed E. coli:
• Plasmid Miniprep
BAC Clones
Construction of Genome Map:
• BAC end sequencing
• Identify overlapping BACs
• Subclone BACs into plasmids

DNA Sequencing and Assembly

Genome Assembly and Annotation

Overview of Shotgun Sequencing
Isolate Genomic DNA → Genomic DNA → Create Genomic Library → Plasmid Clones → DNA Sequencing and Assembly → Construction of Genome Map

Overview of EST Sequencing
Isolate mRNA → Create cDNA → Create Genomic Library → DNA Sequencing

Comparative Genomics
Isolate mRNA and create cDNA → Create Genomic Library → BAC Clones → Construction of Genome Map → DNA Sequencing and Assembly
• Synteny - same gene order preserved between species

Comparative Genomics
• BAC array
• Comparative Genomic Hybridization
• Bordetella phylogeny

Metagenomic analysis
• What is metagenomics?
  – Metagenomics is the genomic analysis of the collective genomes of an assemblage of organisms from a defined environment. (Handelsman et al., 2002)
  – a.k.a. community genomics, environmental genomics
  – Derived from tools, techniques and models used in genomics.
• Why do metagenomic analysis?
  – Genomic content of all eukaryotes, bacteria, archaea and viruses in an environment.
  – Provides a picture of genetic/functional potential of the community.
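
The books-per-genome estimate on the Completed genomes slide is plain arithmetic and can be reproduced in a few lines. The sketch below is only an illustration in Python: the constant and function names are made up for this example, and every input is one of the slide's own round numbers.

```python
# Back-of-the-envelope estimate from the "Completed genomes" slide.
# All inputs are the slide's round numbers; the names are illustrative only.

AVG_GENOME_BP = 5_000_000    # avg. genome = 5 Mb
COVERAGE = 20                # typical sequence coverage = 20X
LETTERS_PER_WORD = 5         # avg. English word size
WORDS_PER_PAGE = 250         # avg. words per page
PAGES_PER_BOOK = 200         # avg. book size
COMPLETED_GENOMES = 958      # as of January 1, 2009


def letters_per_book() -> int:
    """Letters in one average book (one letter stands in for one base)."""
    return LETTERS_PER_WORD * WORDS_PER_PAGE * PAGES_PER_BOOK


def books_per_genome() -> int:
    """Books needed to print all sequenced bases for one average genome."""
    sequenced_bases = AVG_GENOME_BP * COVERAGE
    return sequenced_bases // letters_per_book()


if __name__ == "__main__":
    per_genome = books_per_genome()
    print(per_genome)                      # 400
    print(per_genome * COMPLETED_GENOMES)  # 383200
```

Changing the coverage or the average genome size shows how sensitive the comparison with the library holdings is to those two assumptions.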
Metagenomics

Venter’s Trip
Yooseph et al., PLoS Biology, 2007

Creation of Fosmid Libraries

Preliminary Categorization of 263 ORFs from a Fosmid Library of Subgingival Plaque
Category         Percentage of library
Eukaryotic       34%
Bacterial        21%
Archaeal         1.1%
Viral (1)        0.8%
Bacteriophage    2%
Unidentified     41%
(1) not bacteriophage

Genome Annotation

Genome Assembly and Annotation

RefSeq db

Caveats
• Finding genes involves computational methods as well as experimental validation
• Computational methods are often inadequate, and often generate erroneous ‘gene’ (false positive) sequences which:
  – Are missing exons
  – Have incorrect exons
  – Over-predict genes
  – Are missing the 5' and 3' UTRs

Things we are looking to annotate
• CDS
• mRNA
• Alternative RNA
• Promoter and Poly-A Signal
• Pseudogenes
• ncRNA
• Repeat elements
• G+C content

Pseudogenes
• As many as 20-30% of all genomic sequence predictions could be pseudogenes
• Non-functional copy of a gene
  – Processed pseudogene
    • Retro-transposon derived
    • No 5' promoters
    • No introns
    • Often includes poly-A tail
  – Non-processed pseudogene
    • Gene duplication derived
  – Both include events that make the gene non-functional
    • Frameshift
    • Stop codons
• We assume pseudogenes have no function, but we really don’t know!
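
The two disabling events named on the Pseudogenes slide, frameshifts and stop codons, can be screened for in a predicted coding sequence. The following Python sketch is a crude illustration under stated assumptions: the function names are hypothetical, the standard genetic code is assumed, and a length that is not a multiple of three is treated only as a hint of a frameshift (confirming one would require alignment against the functional parent gene).

```python
from typing import Dict, List

# Illustrative only: flags two of the disabling events listed on the
# Pseudogenes slide. Function names are hypothetical.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def premature_stops(cds: str) -> List[int]:
    """Return 0-based codon indexes of in-frame stops before the final codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def pseudogene_hints(cds: str) -> Dict[str, object]:
    """Collect simple hints that a predicted CDS may be non-functional."""
    return {
        # Length not divisible by 3 is only a rough frameshift indicator.
        "possible_frameshift": len(cds) % 3 != 0,
        "premature_stops": premature_stops(cds),
    }


if __name__ == "__main__":
    print(pseudogene_hints("ATGGCCAAATAA"))     # intact toy ORF
    print(pseudogene_hints("ATGTAAGCCAAATAA"))  # in-frame stop at codon 1
    print(pseudogene_hints("ATGGCAAATAA"))      # one base deleted
```

A screen like this only raises flags; as the slide notes, whether a flagged sequence truly lacks function is a separate question.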
Noncoding RNA (ncRNA)
• tRNA – transfer RNA: involved in translation
• rRNA – ribosomal RNA: structural component of ribosome, where translation takes place
• snRNA – small nuclear RNA: functional/catalytic in RNA maturation
• Antisense RNA – gene regulation
• siRNA – gene silencing

Noncoding RNA (ncRNA), continued
• ncRNA represent 80-98% of all transcripts in the cell
• ncRNA have not been taken into account in gene counts
  – cDNA
  – ORF computational prediction
  – Comparative genomics looking at ORFs
• ncRNA can be:
  – Structural
  – Catalytic
  – Regulatory

GenBank Features
-10_signal -35_signal 3'clip 3'UTR 5'clip 5'UTR attenuator CAAT_signal CDS
conflict C_region D-loop D_segment enhancer exon GC_signal gene iDNA intron
J_segment LTR mat_peptide misc_binding misc_difference misc_feature
misc_recomb misc_RNA misc_signal misc_structure modified_base mRNA N_region
old_sequence polyA_signal polyA_site precursor_RNA primer_bind
prim_transcript promoter protein_bind RBS repeat_region repeat_unit
rep_origin rRNA satellite scRNA sig_peptide snoRNA snRNA S_region stem_loop
STS TATA_signal terminator transit_peptide tRNA unsure variation V_region
V_segment

LOCUS       NG_005487               1850 bp    DNA     linear   ROD 14-FEB-2006
DEFINITION  Mus musculus ubiquitin-conjugating enzyme E2 variant 2 pseudogene
            (LOC625221) on chromosome 6.
ACCESSION   NG_005487
VERSION     NG_005487.1  GI:87239965
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1850)
  AUTHORS   Wilson,R.
  TITLE     Mus musculus BAC clone RP24-201D17 from 6
  JOURNAL   Unpublished (2003)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AC121925.2.
FEATURES             Location/Qualifiers
     source          1..1850
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
                     /chromosome="6"
                     /note="AC121925.2 32277..34126"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
                     /db_xref="GeneID:625221"
     repeat_region   1792..1827
                     /rpt_family="ID"
ORIGIN
        1 tcttctgcct caattcctca agtgctagta tcatatgccc atgccattat ttttaactcc
       61 cctttttcat gctaagaatt gaacacacgg ccctgcgtgc ggtggtgcgt ctggtagcag
      121 gagaagatgg cggtctccac aggagttaaa gttcctcgta attttcgctt gttggaagaa

The ideal annotation of “MyGene”
• All clones
• All SNPs
• Promoter(s)
• All mRNAs
• All proteins
• All structures
• All protein modifications
• Ontologies
• Interactions (complexes, pathways, networks)
• Expression (where and when, and how much)
• Evolutionary relationships
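
The feature table of the NG_005487 record shown earlier is regular enough to read programmatically. The Python sketch below is a hand-rolled illustration, not a full GenBank parser: `parse_features` and the regular expression are hypothetical names written for this example, only simple `start..end` locations are handled, and the embedded text is an excerpt of the record from the slides. For real records an established parser (for example Biopython's `SeqIO` with the `"genbank"` format) would be the usual choice.

```python
import re

# Excerpt of the feature table from the NG_005487 record in the slides.
FEATURE_TABLE = """\
     source          1..1850
                     /organism="Mus musculus"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
     repeat_region   1792..1827
                     /rpt_family="ID"
"""

# A feature line is a key followed by a simple start..end location.
# Qualifier lines start with "/" and therefore never match this pattern.
FEATURE_LINE = re.compile(r"^\s*(\w+)\s+(\d+)\.\.(\d+)\s*$")


def parse_features(table: str):
    """Return a list of dicts with key, start, end and raw qualifiers."""
    features = []
    for line in table.splitlines():
        match = FEATURE_LINE.match(line)
        if match:
            features.append({
                "key": match.group(1),
                "start": int(match.group(2)),
                "end": int(match.group(3)),
                "qualifiers": [],
            })
        elif line.strip().startswith("/") and features:
            # Attach the qualifier to the most recent feature.
            features[-1]["qualifiers"].append(line.strip())
    return features


if __name__ == "__main__":
    for feature in parse_features(FEATURE_TABLE):
        length = feature["end"] - feature["start"] + 1  # coordinates are inclusive
        is_pseudo = "/pseudo" in feature["qualifiers"]
        print(feature["key"], length, "pseudo" if is_pseudo else "")
```

The `/pseudo` qualifier on the `gene` feature is how this record marks LOC625221 as a pseudogene, which ties the flat-file example back to the annotation targets listed earlier.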