Download bchm6280_lect1_16

Course Expectations
Sequencing technology and (very) large datasets
5/17/2016

Goals for the course
• Understand how next-generation sequencing technologies are used in biomedical research
• Learn how to use publicly available databases/websites to find specific information about genes
• Learn how to analyze gene lists to form hypotheses that can be tested experimentally
• Learn to write a results section for a manuscript

Logistics
• Course website:
  – http://biochem.slu.edu/bchm628/
• Contact:
  – Phone: 977-8858
  – Email: [email protected]
• Office
  – DRC 611
  – Call or email.
  – Usually at WashU on Thursday afternoons
• Lab
  – DRC 654

Exercise format
• There will be 6 exercises, each consisting of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks.
• You’ll provide the answer in the same format as you would write for the results section of a paper
  1. Why did you do this experiment or analysis?
  2. What did you actually do?
  3. What did you observe?
  4. What does it mean?
• Include supporting data
  – Figures with figure legends
  – Correctly formatted tables of data.

Exercises, cont.
• You will hand in your exercise via email
  – Exercise in Word or PDF format
  – Supplemental data in Excel, Word or PDF format.
• The exercise should print in portrait orientation.
• The exercise should include a header with your name at the top and the file should be named:
  – Your Name-Ex #.
• There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on-time or not.

Final project
• This will be a project summary of the analyses that you will do over the course of the 4 weeks.
• You will be asked to choose 3 genes from your gene lists that you would follow up on at the bench.
  – You will be asked to give a rationale for making the choices that you did.
• You will analyze the three genes virtually using some of the tools from weeks 1-4.
• You will be asked to propose hypothetical bench experiments for the genes
• Final project will be due June 21st at 3:00 pm.

Data tables
In general, columns describe attributes and rows contain the individual data. The first row contains a header. If you have lots of data, it is generally formatted to have more rows than columns.

Table 1: Gene expression for WT cells under conditions X, Y, Z.
Gene name    Log2 (Cond. X/untreated)    Log2 (Cond. Y/untreated)    Log2 (Cond. Z/untreated)
NM_00522     2.56                        3.12                        2.75
NM_06588     -1.25                       -1.02                       -0.98

Table 2: Comparison of clinical parameters for groups 1 and 2.
Clinical parameter    Group 1 (avg ± mean)    Group 2 (avg ± mean)    P-value
ALT/AST ratio         25 ± 1                  35 ± 2                  0.0021
Leukocyte count       1200 ± 32               950 ± 65                0.0512
1 Statistical significance was determined by a Mann-Whitney test
2 Statistical significance was determined by 2-tailed t-test

Data tables, cont.
• For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation.
• If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise. If the table extends past 1 page, then include it as a supplemental attachment.
• Refer to supplemental tables in your write-up, number them, and name the files Name_SuppTable1, etc.
• Supplemental tables can be in Excel format.

Figures
• If you can export the figure from whatever program in jpeg or png format, those can be inserted into a Word document easily.
• PDFs can be converted to other formats using Illustrator
• There are some online converters
  – http://www.wikihow.com/Convert-PDF-to-JPEG
• Screen capture and placement may also work.
• Talk to me if you have issues.
• I won’t be very picky about high resolution.

Figures, cont.
• Figures should have figure legends.
The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used.
• Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data.
• Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary.
• Again, talk to me if you have issues.

Grading
• Grading:
  – Exercises: 65%
  – Final exam: 25%
  – Class attendance: 10%
• Grading policy handout
  – Details about late assignments and tests

Lecture outline
• Overview of sequencing a genome
• Next generation sequencing
• High-throughput experiments by sequencing
• Genome browsers

Genome sequencing
Approach depends on the source, size, complexity and goal for the data for a given organism
• Goal?
  – De novo sequencing
  – Re-sequencing for annotation
  – Sequencing to identify variations
• Size and complexity
  – Virus, bacterial, single-celled eukaryote, mammal, plant
  – Quasi-species or repetitive sequences
• Sample prep
  – Can it be cultured?
  – Tissue source: unlimited or limited quantities?
  – Virus levels, RNA or DNA

Genome sizes
Organism                       Genome size (base pairs)    Number of genes
Hepatitis C virus              0.01 × 10^6                 10
Epstein-Barr virus             0.172 × 10^6                37
Bacterium (E. coli)            4.6 × 10^6                  4406
Yeast (S. cerevisiae)          12.5 × 10^6                 6172
Nematode worm (C. elegans)     100.3 × 10^6                19,099
Thale cress (A. thaliana)      115.4 × 10^6                25,498
Fruit fly (D. melanogaster)    128.3 × 10^6                13,601
Corn (Z. mays)                 2500 × 10^6                 39,469
Human (H. sapiens)             3223 × 10^6                 20,500
Wheat (T. aestivum)            5500 × 10^6 (x 3)           ~95,000

Types of questions
• How many genes?
  – How many functional genetic elements
  – miRNAs, ncRNAs
• What’s different about this genome compared to another one?
  – Virulence differences in pathogenic organisms
  – What is the cause of this particular phenotype?
• What taxonomic groups are represented in this population of bacteria, viruses or fungi?
• How do the gene expression patterns change between samples (and across time)?
• Where does this transcription factor bind in the genome?

DNA sequencing – Overview
• Gel electrophoresis
  – Predominant in 1980s
• Whole genome strategies
  – Physical mapping (BAC clones)
  – Walking
  – Shotgun sequencing
  – Capillary sequencing machines
• Computational fragment assembly
• Next generation technologies
  – Polony based sequencing
  – Novel assembly techniques
[Figure: Cost/base for DNA sequence; log-scale axis from 1.0E+02 down to 1.0E-07]

Traditional approach
1. Shear the very large genome into smaller chunks
2. Clone in vectors that can support large inserts
3. Digest and separate on high resolution gel to determine the clone overlap
4. Pick minimum number of clones
5. Shotgun sequence each clone
6. Read the traces and assemble
7. Make the gene calls
8. Load it into a genome viewer

BAC library in DNA sequencing
[Figure: Shotgun sequencing. Sequence each clone; individual sequence reads; contig assembly (Contig A, gap, Contig B)]

Paired reads vs single reads
Single reads
• M13 clones
• Robotic template prep
• Result: Contig A, gap, Contig B
Paired reads
• Plasmids, cosmids, BACs
• Result: Contig A, gap, Contig B, with gap closure!!
• Prefer 3-10 mate pairs per gap
• Inserts of different, but known sizes

Some Terminology
• read: a 500-900 long word that comes out of sequencer
• mate pair: a pair of reads from two ends of the same insert fragment
• contig: a contiguous sequence formed by several overlapping reads with no gaps
• supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
• consensus sequence: sequence derived from the multiple sequence alignment of reads in contig

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
   ..ACGATTACAATAGGTT..
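Steps 1 and 2 of the assembly procedure above (find overlapping reads, merge them into contigs) can be sketched with a toy greedy merge. This is a minimal illustration only, not how production assemblers work: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical, and the sketch assumes error-free reads taken from a single strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining sequences (the contigs of this toy model).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

With the hypothetical reads `["ACGATTAC", "TTACAATA", "CAATAGGTT"]`, `greedy_assemble` returns the single contig `["ACGATTACAATAGGTT"]`. A read with no sufficient overlap stays as its own contig, which is the situation where mate pairs are needed to order contigs and close gaps.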
Target: 30X coverage or >30 high quality reads per base

Assembled into chromosomes
• RefSeq nomenclature:
  – NT: genomic sequence of complete gene
  – NC: chromosome
  – NM: mRNA sequence
  – NP: protein sequence
• Assembly: completed genome, multiple assemblies

Calling the genes
• De novo computer algorithms
  – Identify coding sequences by GC content
  – Start and stop sites
  – Intron/exon boundaries
• Comparison with other known genes
• EST libraries

Sanger method

Sanger sequencing reached its technical limits
• Only modestly parallel (394 lanes/machine)
• Long read lengths (500-900 bp) & >99.9% correct
• Need to clone the DNA to obtain enough for sequencing reaction
• At SLU: cost for typical Sanger sequencing is $56/sample with reliable 500 bp of sequence

DNA sequencing timeline

How many sequenced genomes?
• NCBI: >16,000 genomes deposited
• JGI (Joint Genome Institute): >8000 complete, >28,000 draft genomes

NGS sequencing
• Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads
• Polony-based or cluster-based sequencing is the basis of most second generation sequencers
Typical NGS workflow:
1. Library construction to add adapters to sequence
2. Template CLONAL amplification (on a bead or chip)
3. Massively PARALLEL sequencing

Illumina NGS
Library prep: ~6 hours
  A) Fragment DNA
  B) Repair ends/Add A overhang
  C) Ligate adapters
  D) Select ligated DNA
Cluster generation: ~6 hours
  E) Attach DNA to flow cell
  F) Bridge amplification
  G) Generate clusters
  H) Anneal sequencing primer
Sequencing: 2-6 days
  I) Extend 1st base, read & deblock
  J) Repeat to extend strand
  K) Generate base calls

Illumina HiSeq and MiSeq
• 100-200 bp read lengths
• Available locally with MoGene and Cofactor Genomics
• GTAC (Wash U) has HiSeq 2500, HiSeq 3000 and MiSeq. They offer read lengths from 50 bp to 250 bp (single- and paired-end)
• Why not use this for all sequencing?
  – Cost is ~$300-400/library and ~$1100/lane of sequencing
  – Generate Tb of data per run
  – Gb per lane
  – Sample prep limitations

Ion Torrent – measures pH changes
Done on a semiconductor chip

Ion Torrent workflow

Illumina vs Ion Torrent
• Illumina has greater capacity but longer run times
• Latest versions of both have read lengths ~200 bp
• SLU has an Ion Torrent machine
• Cost is ~$270/sample, including the sequencing
• Can do single- or paired-end reads
• Paired-end reads are 2X the cost for library construction, but necessary for de novo genome assembly

Bioinformatics challenges
• Each flow cell in the Illumina HiSeq 2500 can generate a billion bases of sequence
  – Raw read files are Tb in size
  – Processed read files are 700-800 Mb
  – Alignment files are 150-300 Mb
• Assembly of millions of short (75-100 bp) reads into a vertebrate genome
  – Need a high-performance compute (HPC) cluster for vertebrate sized genomes*
• What biomolecular species to interrogate
  – 25,000 genes
  – 160,000 transcripts
  – miRNA, non-coding RNA

Sequencing has become a standard technique
• RNA sequencing for expression
• ChIP sequencing for TF site identification
• DNA sequencing for variants
• Identification of populations/genetic changes in highly variable viruses and bacteria
• Metagenomics
  – Identification of unknown/non-culturable communities of bacteria/viruses/fungi

Where is all this data deposited?
• NCBI: National Center for Biotechnology Information
  – Databases are well integrated
  – Well integrated with literature (PubMed)
• EBI: European Bioinformatics Institute
  – Same base data as NCBI, but offers different front-end
  – Much better list-based searching
  – More protein-based information (domains, complexes & interactions)
  – Not as well integrated with literature
  – Transcript variants differ from NCBI because of different annotation pipelines

NCBI

Ensembl main page

Genome viewers
• Provide chromosomal context to the gene(s) of interest
• See transcript variants in graphical view
• Have “tracks” of additional information:
  – Variants (SNPs)
  – Expression data
  – Repetitive sequences
  – Comparative data (with other species)
  – Download genomic sequence
• Ensembl genome viewer (useast.ensembl.org)
• UCSC genome viewer (genome.ucsc.edu)

Genetic maps
• Chromosomal banding patterns
  – Stain with Giemsa (G-banding pattern)
• Chromosomes are numbered based on size
• Giemsa binds to phosphate groups & attaches to regions that are AT rich
• Dark regions: heterochromatic, late replicating and AT rich
• Lighter regions: euchromatic, early replicating and GC rich

Chromosome nomenclature
• p (petite) = short arm
• q (queue) = long arm
• Bands are numbered going away from the centromere
• 4q21.1 represents chromosome 4, long arm, region 2, band 1, sub-band 1

Today in computer lab
• Finding genes and transcripts using NCBI and EBI
• Visualization of genes and transcripts with genome browsers
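As a small illustration of the chromosome nomenclature covered above, a band label such as 4q21.1 can be split into its parts programmatically. This is a minimal sketch under stated assumptions: the `Band` type, the `parse_band` function and the field names (region, band, sub-band, read left to right as successively finer subdivisions) are hypothetical choices for this example, and the pattern only accepts human chromosomes 1-22, X and Y with a two-digit band and an optional sub-band.

```python
import re
from typing import NamedTuple, Optional


class Band(NamedTuple):
    chromosome: str          # "1".."22", "X" or "Y" (human only, an assumption)
    arm: str                 # "short" for p, "long" for q
    region: int              # first digit after the arm
    band: int                # second digit after the arm
    sub_band: Optional[str]  # digits after the dot, if present


# Chromosome, arm letter, two digits, then an optional ".digits" suffix.
_BAND_RE = re.compile(r"^([1-9]|1[0-9]|2[0-2]|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_band(label: str) -> Band:
    """Split a cytogenetic band label such as '4q21.1' into its components."""
    match = _BAND_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognized band label: {label!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return Band(
        chromosome=chromosome,
        arm="short" if arm == "p" else "long",
        region=int(region),
        band=int(band),
        sub_band=sub_band,
    )
```

For example, `parse_band("4q21.1")` gives chromosome "4", the long arm, region 2, band 1 and sub-band "1". Labels that do not fit the assumed pattern raise a `ValueError` rather than being guessed at.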
