Integrative Genomics Viewer
HMS Countway Library, June 15, 2010
Jim Robinson, Helga Thorvaldsdóttir
Broad Institute of MIT and Harvard

Agenda
• Introduction
• User Interface Basics
• Data Integration
• Hands-on Exercise
• File formats
• Viewing Next-Generation Sequence (NGS) Data
• Hands-on Exercise
• IGVTools
Slides and handouts: ftp.broadinstitute.org/pub/genepattern/igv/tutorials/June2010

Introduction

What is IGV
A desktop application for integrated visualization of multiple data types and annotations in the context of the genome
• Microarrays
• Epigenomics
• RNA-Seq
• NGS alignments
• Comparative genomics

Motivation
• Easily view investigator-generated datasets alongside publicly available data
• Support integration of diverse data types and sample attribute information
• Handle large datasets

IGV goals
• Meet the needs of diverse projects, including
  • The Cancer Genome Atlas (TCGA)
  • Epigenetic & lincRNA studies
  • 1000 Genomes Project
  • Single-investigator projects
• Meet the needs of diverse users: biologists and bioinformatics specialists
• Scale to very large datasets on standard desktop systems
• Intuitive and easy to use

IGV distribution
• First public release in August 2008
• Current release: 1.4.2
• Early access versions updated frequently
• More than 5500 registered users
• Open source and freely available
• http://www.broadinstitute.org/igv
• Contact us: [email protected]

Installing IGV
• Register at http://www.broadinstitute.org/igv
• Click “Downloads”
• Click a Launch button (Mac or PC), or
• Download and unzip the binary distribution (Linux)

IGV Web site
http://www.broadinstitute.org/igv

IGV Web site: Downloads
• PC and Mac
• Linux

User Interface Basics

IGV layout: expression and copy number data
• Cytoband
• Track Names
• Genomic Coordinates
• Data Panel
• Annotation Heatmap
• Genome Features

IGV layout: NGS data

UI basics
• Selecting a reference genome
• Loading data
• Navigating through the data
• Setting track attributes
Selecting a reference genome
• Select one of the hosted genomes from the pull-down menu
• For more information see www.broadinstitute.org/igv/Genomes
• You can import other genomes if you have the sequence data

Loading data
Types of data
• Any data tied to genomic coordinates
• Genome annotations
• Sample attributes/annotations
File formats
• Many different file formats supported
• See www.broadinstitute.org/igv/FileFormats

Tracks
• Two generic types:
  • data (continuous valued data)
  • annotation (features)
• Specialized types include
  • alignments
  • mutations
  • multiple alignments
• Type is defined by file format, and can be overridden by the user
• IGV uses type to determine
  • initial placement in a panel
  • display options and options for other track attributes

Loading data
#1: Load local file
#2: Load from URL
#3: Load from server (Broad IGV data server, other data server)

“Load from server” menu
What you see depends on:
(1) which server you selected (default is the Broad server)
(2) which reference genome you’ve selected
• Click on the [icon] for more information about the data source
• Click on the [icon] to expand the sub-menus
• Click on the [icon] to select datasets
• Note that all nested datasets are also selected. Make sure you know what you’ve selected.
• One last thing: you cannot unload using the checkboxes

Navigating through the data

Whole genome view

Zooming in to the chromosome level
• Select chromosome from menu
• Click on chromosome number

Chromosome view

Zooming further in
• Use the railroad track
• Double-click in data panel
• Shift-click to go faster
• Alt-click to zoom back out

Zooming further in
• Specify range in the search box
• Red box on cytoband shows where we are
• Ruler shows the extent of the region

Scroll or jump to location at same zoom level
• Click on cytoband
• Click on ruler
• Click and drag: up/down, left/right
• Use scroll bar
• Use keyboard
  (1) arrow keys
  (2) Page Up, Page Down, Home, End

Zoomed in to base pair view
• Reference genome bases
• Protein residues

Jump to feature
• Enter name of feature in search box
  • With or without zoom (View > Preferences > General)
• Click on a feature track (e.g. gene track, BED, GFF)
  • Ctrl+F = jump forward to next feature
  • Ctrl+B = jump backward to previous feature

Setting track attributes

Right-click popup menu

Multiple tracks
• Select multiple tracks by clicking on track names: Shift-click / Ctrl-click
• Select multiple tracks by clicking on color in annotation heatmap

Global attributes
• Tracks > Fit Data to Window
• Tracks > Set Track Height

Annotation track
Gene representation
• 5’ UTR
• Intron
• Exons
• 3’ UTR
Zoomed in views

Annotation display mode
1. Features are drawn in a single row, by default
2. Expand the track using the popup menu

Sessions
• Save current state of IGV to a named session file.
• Use to
  • restore the same state
  • share session with colleagues

Data Integration

Data integration
• Load different types of data
• Use sample annotations to manipulate tracks

Sample annotations
• Default annotations for all sample tracks:
  • data file, data type, track name
• Custom annotations:
  • use sample information file
• Show / hide annotation panel (View > Show Attribute Display)
• Show / hide selected annotations (View > Select Attributes to Show)

Sort tracks by attribute value
• Click on the annotation name
• Use the menu Tracks > Sort Tracks

Sort tracks by data value in a region
• Region Tool
• Popup menu

Group tracks

Filter tracks

Hands-on Exercise
UI basics and data integration

File Formats

File formats
• Sample Info File
• Annotation File Formats
• Data File Formats
• Track Line
• Genomes and FASTA Files

Sample info file
A sample information file (also called an attribute file) is a tab-delimited text file that includes descriptive information (attributes) for track identifiers.
Uses:
• Annotation heatmap
• Sorting
• Filtering
• Grouping
The first column of a sample information file contains track identifiers. Subsequent columns may contain any attribute values and may be given any arbitrary label.
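
The layout described above (track identifier in the first column, arbitrary attribute columns after it) can be sketched in a few lines of Python. This is only an illustration of the file structure, not how IGV itself reads the file. The function names `read_sample_info` and `group_tracks` and the inline sample text are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Illustrative tab-delimited content; a real file would be opened from disk.
SAMPLE_INFO = (
    "TRACK_ID\tData_Type\tGENDER\n"
    "EX-01-001\tExpression\tM\n"
    "CN-01-002\tCopyNumber\tM\n"
    "EX-01-008\tExpression\tF\n"
)


def read_sample_info(handle):
    """Map each track identifier (first column) to a dict of its attributes."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    attribute_names = header[1:]  # every column after the first is an attribute
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        table[row[0]] = dict(zip(attribute_names, row[1:]))
    return table


def group_tracks(table, attribute):
    """Group track identifiers by the value of one attribute."""
    groups = defaultdict(list)
    for track_id, attributes in table.items():
        groups[attributes.get(attribute, "")].append(track_id)
    return dict(groups)


sample_info = read_sample_info(io.StringIO(SAMPLE_INFO))
groups = group_tracks(sample_info, "Data_Type")
print(groups)
```

Grouping by an attribute value is the same idea as the "Group tracks" operation in the UI: tracks that share a value end up together.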
Sample info file
Example

TRACK_ID    Data_Type    LINKING_ID  SAMPLE_ID   GENDER  T/N      Tumor_type  Treated  Primary/Secondary  Hypermutated
EX-01-001   Expression   P-01-P001   P-01-S001   M       Tumor    GBM         Y        Primary            Y
CN-01-002   CopyNumber   P-01-P001   P-01-S001   M       Tumor    GBM         Y        Primary            Y
MU-01-003   Mutation     P-01-P001   P-01-S002   M       Tumor    GBM         Y        Primary            Y
EX-01-004   Expression   P-01-P002   P-01-S003   M       Normal   GBM         Y        Secondary          Y
CN-01-005   CopyNumber   P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
EX-01-006   Expression   P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
ME-01-007   Methylation  P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
EX-01-008   Expression   P-01-P003   P-01-S006   F       Tumor    GBM         N        Primary            Y
EX-01-009   Expression   P-01-P004   P-01-S009   F       Tumor    GBM         N        Primary            Y
EX-01-0010  Expression   P-01-P005   P-01-S0011  M       Control

Annotation File formats
• BED: UCSC standard format. Useful for displaying any feature type from simple blocks to genes. http://genome.ucsc.edu/FAQ/FAQformat.html#format1
• GFF: Two variants, GFF2 and GFF3. Can also be used for all feature types, but tends to be more verbose and slower to parse than BED. File sizes can be significantly larger. http://www.sequenceontology.org/gff3.shtml
• Note: BED file coordinates are “zero-based half-open”. This means an interval spanning the first base is represented as 0-1. GFF file coordinates are “one-based closed” (both ends inclusive). An interval spanning the first base is represented as 1-1. This difference is responsible for many off-by-one bugs.

Data File formats
• Single Track Formats
  • WIG: for fixed or variable step data with fixed spans http://genome.ucsc.edu/goldenPath/help/wiggle.html
  • BEDGraph: similar to BED format http://genome.ucsc.edu/goldenPath/help/bedgraph.html
• Multi-track (array) formats
  • IGV: general array-based data format.
  • CN: GenePattern format designed for SNP copy number data. Can be used for other data that spans a single base.
  • SEG: Specialized format for segmented copy number data.
  • GCT: GenePattern format for expression data.
    The only format that uses probe names instead of genomic coordinates.

GCT format
GCT rows are keyed by probe identifier. For display in IGV these rows must be mapped to genomic coordinates with one of the following options:
• Probe to locus. IGV can automatically map probes for many common chips directly to a genomic location.
• Probe to gene. Optionally the user can specify that probes be mapped to genes. When this is chosen the expression value is applied to the entire region spanned by the gene.
• User-supplied. The automatic mappings can be overridden by inserting a locus string in the description column delimited by the symbols |@ and |, for example |@chr6:1950428-1950681|.

UCSC track line
• A track line can be used to control many aspects of the track display such as graph type, color, and scale.
• Can be used with wig, bed, gff, igv, cn, and gct files.
• Line begins with “track” for wig and bed, “#track” for other formats.
• Track line consists of key=value pairs, separated by a single space.
Example:
track name="my custom track" graphType=bar color=255,0,0

Importing a genome
Custom genome assemblies can be defined using “import genome”. The imported genome will be available from the drop-down menu.
Prerequisites:
• A FASTA file, directory of FASTA files, or zip of FASTA files that contains the sequence data for each chromosome in the genome. (Required)
• A cytoband file, which IGV uses to display the chromosome ideogram. (Optional)
• An annotation file in BED file format, the GFF file format, or any variation of the genePred table format. (Optional)

Importing a genome
1. Click File > Import Genome. IGV displays the Import Genome window.
2. Enter a name for the genome.
3. For Sequence File, click the ellipsis button and select the FASTA file (or zip of FASTA files) that contains the sequence data.
4. Optionally, specify the cytoband file and the gene track annotation (Gene File) file.
5. Click Save. IGV displays the Genome Archive window.
6.
Select the directory in which to save the genome archive (*.genome) file and click Save. IGV saves the genome and loads it into IGV.

Viewing NGS Data

Next-generation sequencing
The size of NGS datasets presents many challenges, including:
• Implementation
  • Managing terabyte size files with modest compute resources (desktop computers).
• Visual design
  • Highlight events of interest
  • Deemphasize irrelevant details
  • Avoid information overload

Aligned reads: all bases

Aligned reads: mismatches

Aligned reads: base quality

Vary view by resolution scale
• Whole chromosome: calculated summary data, e.g. coverage.
• ~ 50-100 kb: putative rearrangements, SNPs
• ~ 500 bp: bases

Viewing NGS data
• Double click here
• Click here

Paired end data
• Pairs with unexpected insert sizes are color coded by chromosome.
• Useful for visualizing possible rearrangements.

Alignment preferences
View > Preferences > Alignments

Hands-on Exercise
Viewing NGS data

IGVTools

IGVTools
IGVTools is a set of utilities for preparing large files for efficient display.
• tile: converts a sorted data input file to a binary tiled data (.tdf) file. Supported input file formats: .wig, .cn, .snp, .igv, .gct
• count: computes average alignment or feature density over a specified window size across the genome. Supported input file formats: .sam, .bam, .aligned, .sorted.txt, .bed
• sort: sorts the input file by start position. Supported input file formats: .cn, .igv, .sam, .aligned, and .bed.
• index: creates an index file for an input ascii alignment file. Supported input file formats: .sam, .aligned, .sorted.txt

IGVTools tile
The tile utility converts large ascii data files into tiled data format (.tdf) files.
TDF files have the following advantages:
1. Data is indexed for efficient retrieval.
2. Data for zoomed out views are preprocessed.
3. TDF files are web friendly: large data files can be shared over the web. Only small slices of the file are actually transferred as needed.

IGVTools count
The count command is used to transform alignment files to read density TDF files, e.g. for ChIP-Seq, RNA-Seq, and similar alignment counting experiments.
• Input: Alignments. Alignments in bam/sam, .aligned, or bed format.
• Tool: igvtools
• Output: Read density. “Tiled Data File” indexed and optimized for fast retrieval at multiple resolution scales.

IGVTools sort
This utility sorts IGV supported genomic formats by start position.
Example:
igvtools sort -m 1000000 -t ~/myTmpDir inputFile.sam outputFile.sorted.sam
The sort command uses a combination of memory and disk to handle large files.
-m = maximum # of lines to hold in memory. When this number is exceeded a temporary file is created.
-t = directory used to create temporary files during sorting.

IGVTools index
Used to create an index file for viewing SAM (not BAM) files.
Note: not to be confused with samtools index, which is used to create an index for BAM files.
SAM => igvtools
BAM => samtools
Example: igvtools index inputFile.sam
Result: inputFile.sam.sai
The index file must remain in the same location as the SAM file to be found. IGV just appends .sai to the SAM file name.

Creating Web links
Use HTML hyperlinks to launch IGV and share datasets over the web. Two types of links are supported:
(1) Launch IGV on a specified session file.
Example: http://www.broadinstitute.org/igv/dynsession/igv.jnlp?sessionURL=http://www.broadinstitute.org/tumorscape/textReader/IGV/all_tumors_session.xml&locus=chr7:55054218-55206232
(2) Load sessions or data files into a running IGV.
Example: http://localhost:60151/load?file=http://www.broadinstitute.org/igvdata/annotations/hg18/conservation/pi.12mer.wig.tdf&locus=egfr&genome=hg18
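
Links of the second type can be assembled programmatically. Below is a minimal Python sketch that rebuilds the example link from its parts. The helper name `igv_load_link` is hypothetical. The endpoint and the `file`, `locus`, and `genome` parameters are taken from the slide's example. Leaving ":" and "/" unescaped is a choice made here so the output matches that example, on the assumption that IGV accepts the link in this form.

```python
from urllib.parse import urlencode

# Host, port, and path as they appear in the slide's example link.
IGV_LOAD_ENDPOINT = "http://localhost:60151/load"


def igv_load_link(file_url, locus=None, genome=None):
    """Build a link that asks a running IGV to load a data or session file."""
    params = {"file": file_url}
    if locus:
        params["locus"] = locus
    if genome:
        params["genome"] = genome
    # safe=":/" keeps the nested URL readable, as in the slide's example.
    return IGV_LOAD_ENDPOINT + "?" + urlencode(params, safe=":/")


link = igv_load_link(
    "http://www.broadinstitute.org/igvdata/annotations/hg18/conservation/pi.12mer.wig.tdf",
    locus="egfr",
    genome="hg18",
)
print(link)
```

The resulting string can be placed in the href of an HTML hyperlink.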