10/09/2015
Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week 13: Next Generation Sequencing (NGS) Analysis

Adam Cornish
Graduate Student, Guda lab
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
__________________________________________________________________________________________________
Fall 2015 GCBA 815

Introduction
- Vector NTI is an integrated suite of sequence analysis and design tools that help you manage, view, analyze, transform, share, and publicize diverse types of molecular biology data, in a graphically rich analysis environment.
Eisenstein. Nature. 2015

Sources of NGS data
- PacBio
- Illumina
- Ion Torrent

Single Cell Sequencing

Applications of NGS
- Genome
  - Targeted sequencing panels (cancer, newborns, autism, etc.)
  - Whole exome sequencing
  - Whole genome sequencing
  - Copy number analysis
  - Reconstruction of extinct species' genomes
- Transcriptome
  - Whole transcriptome (poly-A selection)
  - Small RNA analysis (siRNA, snoRNA, lincRNA, etc.)
  - Gene expression profiling for selected target genes
  - Rare cell identification
- Metagenome
  - Bulk sequencing of many types of bacteria
  - Examples: human gut microbiome, pollen composition, bacteria composition, viral studies
- Epigenome
  - Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
  - Methylation Sequencing (Methyl-Seq)

Variant calling using NGS data

Important file types
The big three:
- Fastq
  - Raw sequencing data, usually directly from the sequencer
- SAM/BAM
  - Sequence data that has usually been aligned to a specific genome
- VCF
  - Tab-delimited text file that contains a list of possible variants:
    - SNV
    - Insertion and deletion (indel)
    - Duplication
    - Copy number variation
    - Inversion
    - Tandem duplication

Fastq
@SRR098401.11403008/1
GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG
+
>>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB

Row 1: Read identifier (begins with @), with information from the sequencer about the location of this read on the plate
Row 2: The sequence
Row 3: A separator line beginning with +, optionally followed by metadata provided by the sequencing team
Row 4: Quality scores pertaining to each nucleotide in the sequence

Fastq continued
Quality scores are phred-scaled:
Seq:   TCAGCCCTCAGCTACTGCTCT
Score: A@@A@@@A@@AAB@@BBABAB

Phred quality score   Probability that the base is called wrong   Accuracy of the base call
20                    1 in 100                                    99%
30                    1 in 1,000                                  99.9%
40                    1 in 10,000                                 99.99%
50                    1 in 100,000                                99.999%

Phred-33 is the most common encoding, and is based on ASCII values.
The quality score of a base call is the ASCII value of the character minus 33. Example: the ASCII value for ‘A’ is 65, and 65 - 33 = 32. That means the base call corresponding to this score has roughly a 1 in 1,600 chance of being wrong (10^(-32/10) ≈ 0.00063).

Sequence Alignment / Map (SAM / BAM)
SRR098401.104031357 83 chr22 17445857 60 76M = 17445512 -421
ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC
@@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>>
BD:Z:NOJKPQQQQMONOMKKKLNOMNLLLJLMINLJLMLMLKKKKJLJJJMKCKLINJMMLJKKKMOOMNNOLPQSNMKK
PG:Z:MarkDuplicates RG:Z:NA12878
BI:Z:OOMLRRPPRPPQQONOLOPOONOOOKLNMONJKMNONMMMMLMKKKMLGMNLNMMNNJMJLNOMLNMPNONONNMM
NM:i:0 MQ:i:60 AS:i:76 XS:i:0

Similar to the Fastq file in that it contains the raw sequence and its quality scores. It also tells you where the sequence aligned to the genome, and how well (this score is also phred-scaled). In this case, this read aligned to chromosome 22, position 17445857, and has a mapping quality score of 60 (or a 1 in 1,000,000 chance of being placed incorrectly).

Variant Call Format (VCF)

ExAC Browser
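The Phred-33 decoding described on the Fastq slides can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names `phred33_scores`, `error_probability`, and `per_base_report` are made up for this sketch, and the only data used is the Seq/Score pair shown on the "Fastq continued" slide.

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a Phred-33 quality string into one integer score per base."""
    # Each character encodes a score as (ASCII value - 33).
    return [ord(char) - 33 for char in quality]


def error_probability(score: float) -> float:
    """Probability that a call with this Phred score is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


def per_base_report(seq: str, quality: str) -> list[tuple[str, int, float]]:
    """Pair each base with its Phred score and its error probability."""
    return [
        (base, score, error_probability(score))
        for base, score in zip(seq, phred33_scores(quality))
    ]


# Sequence and quality string copied from the "Fastq continued" slide.
SEQ = "TCAGCCCTCAGCTACTGCTCT"
QUAL = "A@@A@@@A@@AAB@@BBABAB"

if __name__ == "__main__":
    for base, score, p_wrong in per_base_report(SEQ, QUAL):
        print(f"{base}  Q{score}  P(wrong) = {p_wrong:.5f}")
```

Every character in this particular quality string is '@', 'A', or 'B', so all scores fall between 31 and 33, which is consistent with the worked example for 'A' on the slide.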
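To make the SAM example concrete, here is a minimal sketch that splits one alignment line into the eleven mandatory SAM columns and interprets the flag and mapping quality. It assumes a tab-separated line (as in a real SAM file; the slide shows the fields separated by spaces), and the names `SamRecord`, `parse_sam_line`, `decode_flag`, and `mapq_error_probability` are hypothetical helpers, not part of any library. The BD, BI, and PG tags from the slide are left out for brevity.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost alignment position
    mapq: int    # phred-scaled mapping quality
    cigar: str
    rnext: str   # reference name of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str
    qual: str
    tags: dict   # optional TAG:TYPE:VALUE fields


# Subset of the flag bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


def decode_flag(flag: int) -> list[str]:
    """List the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_error_probability(mapq: int) -> float:
    """Probability that the read is placed incorrectly: 10^(-MAPQ/10)."""
    return 10 ** (-mapq / 10)


# Field values copied from the slide's example read.
SAM_LINE = "\t".join([
    "SRR098401.104031357", "83", "chr22", "17445857", "60", "76M", "=",
    "17445512", "-421",
    "ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC",
    "@@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>>",
    "RG:Z:NA12878", "NM:i:0", "MQ:i:60", "AS:i:76", "XS:i:0",
])

if __name__ == "__main__":
    record = parse_sam_line(SAM_LINE)
    print(record.rname, record.pos, record.mapq)
    print(decode_flag(record.flag))
    print(mapq_error_probability(record.mapq))
```

For the slide's read, flag 83 breaks down into 1 + 2 + 16 + 64: a paired read, in a proper pair, aligned to the reverse strand, and first in its pair.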
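The VCF slide itself did not survive as text, so the following is only a rough sketch of how a tab-delimited VCF data line could be read and how a simple SNV or indel could be told apart by comparing the REF and ALT alleles. The eight fixed column names come from the VCF specification; the two example lines are invented for illustration and are not real calls, and `parse_vcf_line` and `classify_allele` are hypothetical helper names. Structural variants written as symbolic alleles (for example duplications or inversions) are only flagged, not interpreted.

```python
# The eight fixed columns of a VCF data line.
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line (not a '#' header line) into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # ALT may list several alternate alleles separated by commas.
    record["ALT"] = record["ALT"].split(",")
    return record


def classify_allele(ref: str, alt: str) -> str:
    """Very simplified classification based on allele lengths."""
    if alt in (".", "*") or any(char in alt for char in "<>[]"):
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


# Invented example lines, for illustration only.
EXAMPLE_LINES = [
    "chr22\t17445857\t.\tA\tG\t50\tPASS\tDP=30",
    "chr22\t17445900\t.\tAT\tA\t50\tPASS\tDP=28",
]

if __name__ == "__main__":
    for line in EXAMPLE_LINES:
        record = parse_vcf_line(line)
        for alt in record["ALT"]:
            kind = classify_allele(record["REF"], alt)
            print(record["CHROM"], record["POS"], record["REF"], alt, kind)
```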