ParSNP Hash
Pipeline to parse SNP data and output
summary statistics across sliding
windows
Objective
• Parse VCF files
• Calculate summary statistics across sliding windows
throughout the genome
• Implement NTFreq module to calculate nucleotide
frequencies for each population and combined
population
• Implement TajimasD module to calculate Tajima’s D
• Implement GO module to annotate identified SNPs
Data set
• Simulated data set for chromosome 2R in
Drosophila melanogaster
• 1.4 Mbp
– 2 populations
• Pooled individuals per population
– 75bp reads, error rate 1%
– 10,000 simulated SNPs
• 100x coverage per variant
• At least 100bp apart
• Allele frequencies ranging from 0.1 to 0.9 per population
Data to Variant Call Format
FastQ -> sai -> SAM -> BAM -> BCF -> VCF
Index Reference Genome
Only chromosome 2R of D. melanogaster
– Genome build Dmel 3 from FlyBase
Use BWA to Align FastQ to Reference Genome
Gap open penalty = 1
Gap extension max = 12
Disallowing long deletions within 12 bp of the 3’ end of a read
Maximum level of gap extensions = 12
Use SAMTools to Remove Ambiguously Mapped Reads
(keep reads with MAPQ >= 20)
Use BCFTools mpileup to Generate a Binary Call Format (BCF) File
BCF -> VCF
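The MAPQ filtering step can be illustrated with a minimal Python sketch. This is not the pipeline's own code (the slides use SAMTools for this step); the function name filter_sam_by_mapq and the example reads are made up for illustration. It relies only on the SAM convention that MAPQ is the fifth tab-separated column.

```python
def filter_sam_by_mapq(lines, min_mapq=20):
    """Yield SAM header lines and alignments whose MAPQ is >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            # Header lines pass through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # MAPQ is the fifth mandatory SAM column
        if mapq >= min_mapq:
            yield line


# Made-up example records (not from the real data set).
sam = [
    "@SQ\tSN:2R\tLN:1400000",
    "read1\t0\t2R\t100\t37\t75M\t*\t0\t0\t*\t*",
    "read2\t0\t2R\t250\t0\t75M\t*\t0\t0\t*\t*",
    "read3\t16\t2R\t900\t20\t75M\t*\t0\t0\t*\t*",
]
kept = list(filter_sam_by_mapq(sam))
```

Here read2 (MAPQ 0) is dropped, while the header and the two reads with MAPQ >= 20 are kept.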
Formatting data: Parse VCF
For each window:
• Fetch the VCF rows from each BCF file
• Convert the VCF rows into hashes of arrays
• Compute the Theta, Pi, Tajima’s D for each
population
• Compute Fst for each window between each
population
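As a rough illustration of the "hashes of arrays" step, the sketch below uses a Python dictionary of lists in place of a Perl hash of arrays. The function names parse_vcf_rows and rows_in_window and the two example records are assumptions for illustration; only the standard eight fixed VCF columns are relied on.

```python
def parse_vcf_rows(lines):
    """Map each SNP position to the list of its eight fixed VCF fields."""
    rows = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, blank lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])   # POS is the second VCF column
        rows[pos] = fields[:8]  # CHROM POS ID REF ALT QUAL FILTER INFO
    return rows


def rows_in_window(rows, start, end):
    """Return the rows whose position falls inside [start, end]."""
    return {pos: f for pos, f in rows.items() if start <= pos <= end}


# Made-up example records.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "2R\t1080100\t.\tA\tG\t50\t.\tDP=100",
    "2R\t1101000\t.\tC\tT\t60\t.\tDP=100",
]
rows = parse_vcf_rows(vcf)
window_rows = rows_in_window(rows, 1080001, 1100000)
```

Only the SNP at position 1080100 falls inside the window 1080001 - 1100000.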
Sliding windows
• Sliding window size is specified, and called
modules are calculated across specified
window size
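A minimal sketch of window generation, assuming 1-based inclusive coordinates. The 20,000 bp window size is inferred from the example window 1080001 - 1100000 shown in the results, not stated in the slides; the optional step parameter for overlapping windows is a hypothetical addition.

```python
def sliding_windows(chrom_length, window_size, step=None):
    """Yield (start, end) pairs, 1-based and inclusive.

    With step omitted the windows do not overlap; a smaller step
    gives overlapping windows. The last window is clipped to the
    chromosome length.
    """
    step = step or window_size
    start = 1
    while start <= chrom_length:
        end = min(start + window_size - 1, chrom_length)
        yield start, end
        start += step


windows = list(sliding_windows(1_400_000, 20_000))
overlapping = list(sliding_windows(100, 50, step=25))
```

Each statistic module can then be called once per (start, end) pair.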
Module 1: Calculate allele frequencies
• Input is taken from parsed VCF file
• Hashes are created for each population with
the following structure
– {SNP_location} {nucleotide} -> frequency;
• Hashes created for full dataset
– {SNP_location}{Population} -> {nucleotide} ->frequency
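The two hash structures above might look like this in Python, with nested dictionaries in place of Perl hashes. The sketch assumes frequencies are estimated from per-nucleotide read counts at each site, which the slides do not state; the function names and counts are made up.

```python
def allele_frequencies(counts):
    """{position: {nucleotide: read_count}} -> {position: {nucleotide: frequency}}"""
    freqs = {}
    for pos, by_nt in counts.items():
        total = sum(by_nt.values())
        if total == 0:
            continue  # no coverage at this site
        freqs[pos] = {nt: c / total for nt, c in by_nt.items()}
    return freqs


def combined_frequencies(counts_by_pop):
    """{population: counts} -> {position: {population: {nucleotide: frequency}}}"""
    combined = {}
    for pop, counts in counts_by_pop.items():
        for pos, f in allele_frequencies(counts).items():
            combined.setdefault(pos, {})[pop] = f
    return combined


# Made-up read counts for one SNP in two populations.
counts_by_pop = {
    "pop1": {1080100: {"A": 70, "G": 30}},
    "pop2": {1080100: {"A": 20, "G": 80}},
}
pop1 = allele_frequencies(counts_by_pop["pop1"])
full = combined_frequencies(counts_by_pop)
```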
Output site frequency spectra
• Site frequency spectrum (SFS) output as the
following hash:
– {nonref_allele}{frequency}->count;
• Allows us to calculate a histogram for the nonreference allele frequencies
• Send output to R to generate SFS graphs
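One way to build the {nonref_allele}{frequency} -> count hash is sketched below. Rounding frequencies to one decimal place to form histogram bins is an assumption, as are the function name and the example values.

```python
from collections import defaultdict


def site_frequency_spectrum(freqs, ref_alleles, digits=1):
    """Count sites per non-reference allele and binned frequency.

    freqs:       {position: {nucleotide: frequency}}
    ref_alleles: {position: reference nucleotide}
    Returns      {nonref_allele: {binned_frequency: count}}
    """
    sfs = defaultdict(lambda: defaultdict(int))
    for pos, by_nt in freqs.items():
        for nt, freq in by_nt.items():
            if nt == ref_alleles[pos] or freq == 0:
                continue  # skip the reference allele and absent alleles
            sfs[nt][round(freq, digits)] += 1
    return {nt: dict(bins) for nt, bins in sfs.items()}


# Made-up frequencies for three SNPs.
freqs = {
    100: {"A": 0.7, "G": 0.3},
    250: {"C": 0.72, "T": 0.28},
    400: {"A": 0.1, "G": 0.9},
}
ref_alleles = {100: "A", 250: "C", 400: "A"}
sfs = site_frequency_spectrum(freqs, ref_alleles)
```

The resulting counts can be written out as a table for plotting in R.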
Module 2: Calculate Summary
Statistics and Tajima’s D
• theta_pi (index of diversity)
• theta_watterson (index of diversity)
• Tajima’s D (index of selection/population
expansion)
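The three statistics can be sketched with the classical formulas from Tajima (1989) for biallelic sites. This is a textbook version, not necessarily what the TajimasD module does: it assumes n individually sampled chromosomes, whereas pooled data generally call for corrected estimators. The function name and inputs are hypothetical.

```python
import math


def diversity_stats(nonref_freqs, n):
    """Return (theta_pi, theta_watterson, tajimas_d) for one window.

    nonref_freqs: non-reference allele frequency at each biallelic site
    n:            number of sampled chromosomes (assumed, see lead-in)
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    seg = [p for p in nonref_freqs if 0.0 < p < 1.0]  # segregating sites
    s = len(seg)
    if s == 0:
        return 0.0, 0.0, None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    # Average pairwise differences, with the n/(n-1) sample correction.
    theta_pi = sum(2.0 * p * (1.0 - p) for p in seg) * n / (n - 1)
    # Watterson's estimator: segregating sites scaled by a1.
    theta_w = s / a1
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    d = (theta_pi - theta_w) / math.sqrt(variance) if variance > 0 else None
    return theta_pi, theta_w, d


# Made-up windows: intermediate-frequency alleles vs. rare alleles.
pi_mid, w_mid, d_mid = diversity_stats([0.5, 0.5, 0.5], n=4)
pi_rare, w_rare, d_rare = diversity_stats([0.25, 0.25, 0.25], n=4)
```

An excess of intermediate-frequency alleles pushes D above zero, while an excess of rare alleles (as after a population expansion) pushes it below zero.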
Module 3: FST for DNA sequence
• Calculate FST (index of differentiation) according
to Hudson et al. 1992
FST = 1 – Hw/Hb
Hw: average number of differences within
each population
Hb: average number of differences between
the 2 populations
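A minimal sketch of FST = 1 – Hw/Hb computed from allele frequencies at biallelic sites. It treats the expected number of differences per site as 2p(1-p) within a population and p1(1-p2) + p2(1-p1) between populations, and it omits any finite-sample correction; both simplifications are assumptions, and the function name is hypothetical.

```python
def hudson_fst(freqs_pop1, freqs_pop2):
    """FST = 1 - Hw/Hb over the sites of one window.

    freqs_pop1, freqs_pop2: non-reference allele frequencies at the
    same sites in population 1 and population 2.
    """
    hw = 0.0  # differences within populations, averaged over the two
    hb = 0.0  # differences between the two populations
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        hw += (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
        hb += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    if hb == 0:
        return None  # no variation between populations: FST undefined
    return 1.0 - hw / hb


fst_same = hudson_fst([0.5], [0.5])      # identical frequencies
fst_fixed = hudson_fst([1.0], [0.0])     # fixed difference
fst_partial = hudson_fst([0.7], [0.2])   # made-up intermediate case
```

Identical frequencies give 0, a fixed difference gives 1, and the intermediate case gives roughly 0.40.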
Module 4: GO annotations
• Module takes SNP list as input
• Outputs the following:
– List of genes that have overlap with SNP positions
– Gene Ontology (GO) IDs and terms associated
with each SNP matched gene
– List of genes for a selected window
• Visualization using GOSlim
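The overlap and lookup steps could be sketched as follows. Gene names and coordinates are hypothetical placeholders; the GO IDs are taken from the results table in this deck, but their assignment to these placeholder genes is made up. Function names are assumptions.

```python
def annotate_snps(snp_positions, genes, go_terms):
    """Match SNPs to overlapping genes and their GO terms.

    genes:    list of (gene_id, start, end), 1-based inclusive
    go_terms: {gene_id: [(go_id, term), ...]}
    Returns   {snp_position: [(gene_id, [(go_id, term), ...]), ...]}
    """
    hits = {}
    for pos in snp_positions:
        for gene_id, start, end in genes:
            if start <= pos <= end:
                hits.setdefault(pos, []).append((gene_id, go_terms.get(gene_id, [])))
    return hits


def genes_in_window(genes, win_start, win_end):
    """List the genes that overlap a selected window."""
    return [g for g, s, e in genes if s <= win_end and e >= win_start]


# Hypothetical genes and annotations.
genes = [("geneA", 1080050, 1085000), ("geneB", 1099000, 1101500), ("geneC", 1200000, 1210000)]
go_terms = {
    "geneA": [("GO:0005634", "Nucleus")],
    "geneB": [("GO:0004672", "Protamine kinase activity")],
}
hits = annotate_snps([1080100, 1150000], genes, go_terms)
in_window = genes_in_window(genes, 1080001, 1100000)
```

A linear scan over genes is fine for a sketch; a sorted list or interval index would be the natural choice for a whole chromosome.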
Data visualization
• Integrative Genomics Viewer (IGV)
• Broad Institute
• http://www.broadinstitute.org/igv/
SFS for population 1 and 2
Sliding window for summary statistics
Phist (Fst) greater than 0.1 in window 1080001 - 1100000
GO accession ID   Ontology             Specific
GO:0000124        Cellular Component   Spt-Ada-Gcn5-acetyltransferase complex
GO:0005703        Cellular Component   (Thought to be a site of active transcription)
GO:0005634        Cellular Component   (Nucleus)
GO:0006911        Biological Process   Phagosome biosynthesis/formation
GO:0045747        Biological Process   Up regulation of Notch signaling pathway
GO:0006355        Biological Process   Regulation of cellular transcription, DNA-dependent
GO:0000910        Biological Process   (Cytoplasm division)
GO:0016773        Molecular Function   (Intermolecular transfer of phosphorus group to an alcohol group)
GO:0005700        Cellular Component   (Polytene associated)
GO:0005488        Molecular Function   (Ligand, non-covalent partner)
GO:0005737        Cellular Component   (Ambiguous)
GO:0035222        Biological Process   (Patterning in wing imaginal disc)
GO:0005875        Cellular Component   (Microtubule associated)
GO:0004672        Molecular Function   Protamine kinase activity
GO:0000123        Cellular Component   Histone acetylase complex
Identify differentiated genomic regions
• For each window with a Fst > 0.1, print the
name of the SNP and associated GO term
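The reporting rule above can be sketched as a simple filter. All inputs here are hypothetical: the Fst values, the SNP positions, and the pairing of SNPs with GO terms are made up for illustration, and the function name is an assumption.

```python
def report_differentiated(window_fst, snps_by_window, go_by_snp, threshold=0.1):
    """Return one tab-separated line per SNP and GO term in windows with Fst > threshold.

    window_fst:     {(start, end): fst or None}
    snps_by_window: {(start, end): [snp_position, ...]}
    go_by_snp:      {snp_position: [(go_id, term), ...]}
    """
    lines = []
    for window in sorted(window_fst):
        fst = window_fst[window]
        if fst is None or fst <= threshold:
            continue  # undefined or weakly differentiated window
        for snp in snps_by_window.get(window, []):
            for go_id, term in go_by_snp.get(snp, []):
                lines.append(f"{window[0]}-{window[1]}\t{snp}\t{go_id}\t{term}")
    return lines


# Made-up values for two windows.
window_fst = {(1060001, 1080000): 0.03, (1080001, 1100000): 0.15}
snps_by_window = {(1060001, 1080000): [1070000], (1080001, 1100000): [1080100]}
go_by_snp = {1070000: [("GO:0005737", "Ambiguous")], 1080100: [("GO:0005634", "Nucleus")]}
report = report_differentiated(window_fst, snps_by_window, go_by_snp)
```

Only the second window passes the 0.1 threshold, so the report contains a single line for SNP 1080100.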
Thank You
Use PERL or die, print “(X_x)”;
##Hashes to Hashes##
Print “ % 2 %”;