Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Molecular Inversion Probe wikipedia , lookup
Genome evolution wikipedia , lookup
Human genetic variation wikipedia , lookup
Genetic drift wikipedia , lookup
Hardy–Weinberg principle wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
SNP genotyping wikipedia , lookup
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows Objective • Parse VCF files • Calculate summary statistics across sliding windows throughout the genome • Implement NTFreq module to calculate nucleotide frequencies for each population and combined population • Implement TajimasD module to calculate Tajima’s D • Implement GO module to annotate identified SNPs Data set • Simulated data set for chromosome 2R in Drosophila melanogaster • 1.4 Mbp – 2 populations • Pooled individuals per population – 75bp reads, error rate 1% – 10,000 simulated SNPs • 100x coverage per variant • At least 100bp apart • Allelic Frequencies ranging from .1 to .9 per population FastQ -> sai SAM -> BAM - > .bcf -> VCF Data to->Variant Call Format Index Reference Genome Only chromosome 2R of D. melanogaster -Genome build Dmel 3 from Flybase Use BWA to Align FastQ to Reference Genome Gap open penalty = 1 Gap extension max = 12 Disallowing deletion within 12 bp of 3’UTR Maximum level of gap extensions = 12 Use SAMTools to Remove Ambiguously mapped Regions (MAPQ >= 20) Use BCFTools mpileup to Generate a Binary Code Format (BCF) BCF -> VCF Formatting data: Parse VCF For each window: • Fetch the VCF rows from each BCF file • Convert the VCF rows into hashes of arrays • Compute the Theta, Pi, Tajima’s D for each population • Compute Fst for each window between each population Sliding windows • Sliding window size is specified, and called modules are calculated across specified window size Module 1: Calculate allele frequencies • Input is taken from parsed VCF file • Hashes are created for each population with the following structure – {SNP_location} {nucleotide} -> frequency; • Hashes created for full dataset – {SNP_location}{Population} -> {nucleotide} ->frequency Output site frequency spectra • Site frequency spectrum (SFS) output as the following hash: – {nonref_allele}{frequency}->count; • Allows us to calculate a histogram for the nonreference allele frequencies • Send output to R to generate SFS graphs Module 2: Calculate Summary Statistics and Tajima’s D • theta_pi (index of diversity) • theta_watterson (index of diversity) Module 2: Calculate Summary Statistics and Tajima’s D • Tajima’s D (index of selection/population expansion) Module 3: FST for DNA sequence • Calculate FST (index of differentiation) according to Hudson et al. 1992 1 – Hw/Hb Hw: average number of differences within each population Hb: average number of differences between the 2 populations Module 4: GO annotations • Module takes SNP list as input • Outputs the following: – List of genes that have overlap with SNP positions – Gene Ontology (GO) IDs and terms associated with each SNP matched gene – List of genes for a selected window • Visualization using GOSlim Data visualization • Integrated Genomics Viewer (IGV) • Broad Institute • http://www.broadinstitute.org/igv/ SFS for population 1 and 2 Sliding window for summary statistics Phist greater than 0.1 in window 1080001 - 1100000 Go Accession ID Ontology Specific GO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complex GO:0005703 Cellular Component (Thought to be a site of active transcription) GO:0005634 Cellular Component (Nucleus) GO:0006911 Biological Process Phagosome biosynthesis/formation GO:0045747 Biological Process Up regulation of Notch signaling pathway GO:0006355 Biological Process Regulation of cellular transcription, DNA-dependent GO:0000910 Biological Process (Cytoplasm division) GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group) GO:0005700 Cellular Component (Polytene associated) GO:0005488 Molecular Function (Ligand, non-covalent partner) GO:0005737 Cellular Component (Ambiguous) GO:0035222 Biological Process (Patterning in wing imaginal disc) GO:0005875 Cellular Component (Microtubule associated) GO:0004672 Molecular Function Protamine kinase activity GO:0000123 Cellular Component Histone acetylase complex Identify differentiated genomic regions • For each window with a Fst > 0.1, print the name of the SNP and associated GO term Phist (Fst) greater than 0.1 in window 1080001 - 1100000 Go Accession ID Ontology Specific GO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complex GO:0005703 Cellular Component (Thought to be a site of active transcription) GO:0005634 Cellular Component (Nucleus) GO:0006911 Biological Process Phagosome biosynthesis/formation GO:0045747 Biological Process Regulation of cellular transcription, DNA-dependent GO:0000910 Biological Process (Cytoplasm division) GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group)GO:0005700 Cellular Component (Polytene associated) GO:0005488 Molecular Function (Ligand, non-covalent partner) GO:0005737 Cellular Component (Ambiguous) GO:0035222 Biological Process (Patterning in wing imaginal disc) GO:0005875 Cellular Component (Microtubule associated) GO:0004672 Molecular Function Protamine kinase activity GO:0000123 Cellular Component Histone acetylase complex Thank You Use PERL or die , print “ (X_x)”; ##Hashes to Hashes## Print “ % 2 %”;