* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download File formats for NGS data - Bioinformatics Training Materials
Comparative genomic hybridization wikipedia , lookup
Y chromosome wikipedia , lookup
Gene desert wikipedia , lookup
Minimal genome wikipedia , lookup
Segmental Duplication on the Human Y Chromosome wikipedia , lookup
Copy-number variation wikipedia , lookup
Transposable element wikipedia , lookup
Designer baby wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Microevolution wikipedia , lookup
Genome (book) wikipedia , lookup
Point mutation wikipedia , lookup
X-inactivation wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Human Genome Project wikipedia , lookup
Public health genomics wikipedia , lookup
Neocentromere wikipedia , lookup
Genomic imprinting wikipedia , lookup
Non-coding DNA wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Molecular Inversion Probe wikipedia , lookup
Human genome wikipedia , lookup
Sequence alignment wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome evolution wikipedia , lookup
Genome editing wikipedia , lookup
Metagenomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Helitron (biology) wikipedia , lookup
Pathogenomics wikipedia , lookup
Reference genomes and common file formats Overview ● ● ● ● Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features ○ BED (genomic intervals) ○ GFF/GTF (gene annotation) ○ Wiggle files, BEDgraphs, BigWigs (genomic scores) Why do we need to know about reference genomes? ● Allows for genes and genomic features to be evaluated in their genomic context. ○ Gene A is close to gene B ○ Gene A and gene B are within feature C ● Can be used to align shallow targeted high-throughput sequencing to a pre-built map of an organism Genome Reference Consortium (GRC) ● Most model organism reference genomes are being regularly updated ● Reference genomes consist of a mixture of known chromosomes and unplaced contigs called as Genome Reference Assembly ● Genome Reference Consortium: ○ A collaboration of institutes which curate and maintain the reference genomes of 4 model organisms: ■ Human - GRCh38.p9 (26 Sept 2016) ■ Mouse - GRCm38.p5 (29 June 2016) ■ Zebrafish - GRCz10 (12 Sept 2014) ■ Chicken - Gallus_gallus-5.0 (16 Dec 2015) ○ Latest human assembly is GRCh38, patches add information to the assembly without disrupting the chromosome coordinates ● Other model organisms are maintained separately, like: ○ Drosophila - Berkeley Drosophila Genome Project Overview ● ● ● ● Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features ○ ○ ○ BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores) The reference genome ● ● ● A reference genome is a collection of contigs A contig is a stretch of DNA sequence encoded as A, G, C, T or N Typically comes in FASTA format: ○ ○ ">" line contains information on contig Following lines contain contig sequences Unaligned sequences - FastQ ● Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence FastQ (unaligned sequences) → SAM (aligned sequences) ● FastQ: FASTA with quality ● ● ● ● "@" followed by identifier Sequence information "+" Quality scores encodes as ASCI Unaligned sequences - FastQ header ● Header for each read can contain additional information ○ ○ ○ HS2000-887_89 - Machine name 5 - Flowcell lane /1 - Read 1 or 2 of pair Unaligned sequences - FastQ qualities ● ● ● ● Qualities come after the "+" line -log10 probability of sequence base being wrong Encoded in ASCII to save space Used in quality assessment and downstream analysis Overview ● ● ● ● Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features ○ ○ ○ BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores) Aligned sequences - SAM format ● ● ● SAM - Sequence Alignment Map Standard format for sequence data Recognised by majority of software and browsers SAM header ● ● ● SAM header contains information on alignment and contigs used @HD - Version number and sorting information @SQ - Contig/Chromosome name and length of sequence Aligned sequences - SAM format SAM aligned reads ● Contains read and alignment information and location ○ ○ ○ Read name Sequence of read Encoded sequence quality Aligned sequences - SAM format SAM aligned reads ● ● ● Chromosome to which the read aligns Position in chromosome to which 5' end of the read aligns Alignment information - "Cigar string" ○ ○ 100M - Continuous match of 100 bases 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match Aligned sequences - SAM format SAM aligned reads ● Bit flag - TRUE/FALSE for pre-defined read criteria, like: is it paired? duplicate? ○ ● ● https://broadinstitute.github.io/picard/explain-flags.html Paired read position and insert size User defined flags Li H et al.,The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. Overview ● ● ● ● Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features ○ ○ ○ BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores) Summarised genomic features formats ● After alignment, sequence reads are typically summarised into scores over/within genomic intervals ○ ○ ○ BED - genomic intervals with additional information Wiggle files, BEDgraphs, BigWigs - genomic intervals with scores GFF/GTF - genomic annotation with information and scores BED format - genomic intervals ● BED3 - 3 tab separated columns ○ ○ ○ ● Chromosome Start End Simplest format ● BED6 - 6 tab separated columns ○ ○ ○ ○ Chromosome, start, end Identifier Score Strand ("." stands for strandless) Wiggle format - genomic scores Variable step Wiggle format ● Information line ○ ○ ○ ● Chromosome Step size (Span - default=1, to describe contiguous positions with same value) Fixed step Wiggle format ● ○ ○ ○ ○ Each line contains: ○ ○ Start position of the step Score Information line ● Chromosome Start position of first step Step size (Span - default=1, to describe contiguous positions with same value) Each line contains: ○ Score bedGraph format - genomic scores ● ● ● BED-like format Starts as a 3 column BED file (chromosome, start, end) 4th column: score value GFF - genomic annotation ● Stores position, feature (exon) and meta-feature (transcript/gene) information ● Columns: ○ ○ ○ ○ ○ ○ ○ ○ ○ Chromosome Source Feature type Start position End position Score Strand Frame - 0, 1 or 2 indicating which base of the feature is the first base of the codon Semicolon separated attribute: ID (feature name);PARENT (meta-feature name) Saving time and space - compressed file formats ● Many programs and browsers deal better with compressed, indexed versions of genomic files ○ ○ ○ ○ SAM -> BAM (.bam and index file of .bai) BED -> bigBed (.bb) Wiggle and bedGraph -> bigWig (.bw/.bigWig) BED and GFF -> (.gz and index file of .tbi) Getting help and more information ● UCSC file formats ○ ● IGV file formats ○ ● https://genome.ucsc.edu/FAQ/FAQformat.html http://software.broadinstitute.org/software/igv/FileFormats Sanger file formats ○ http://gmod.org/wiki/GFF3 Acknowledgement ● Tom Carroll http://mrccsc.github.io/genomic_formats/genomicFileFormats.html#/