Next Generation Sequence Analysis

Jeff Chou, PhD
Section of Statistical Genetics, Department of Biostatistical Sciences
Biostatistics Core and Cancer Genomics Core, Comprehensive Cancer Center
Center for Public Health Genomics

Acknowledgment: I work closely with Drs. Carl Langefeld, Lance Miller, Greg Hawkins, and David McWilliams. Through core support, I work with many researchers at the institute, mostly on their microarray data.

DNA-seq Analysis Pipeline

DNA sequence data / alignment statistics
• FastQC
• Samtools
• Picard
Sequence data statistics: read length, number of reads, base quality, sequence quality, sequence duplication levels, GC content, etc.
Alignment statistics: alignment quality, number of high-quality aligned reads, base mismatch rate, strand balance, duplicate reads, depth, etc.

DNA sequence data aligners
• BWA, Bowtie, Lifescope - indexing the genome with a suffix array
• BFAST, Novoalign - indexing the genome with hash tables
• SHRiMP2 - indexing the reads with hash tables

DNA sequence data variant callers
• GATK - indel realignment, base recalibration, variant calling (UnifiedGenotyper and HaplotypeCaller), and filtering
• FreeBayes - haplotype-based Bayesian algorithm designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events
• Samtools mpileup - scans every position, computes all the possible genotypes, and the probability of these genotypes

Consistency Analysis of Called Variants from the APOL1 Gene Region

Pipeline (aligner / caller)  17 calls  16 calls  15 calls  14 calls  13 calls   Sum  Percent  Total variants  Percent of total
Number of variants called         861       993       358       264       209  2685      100
BFAST / FreeBayes                 861       929       269       185        94  2338     87.1            5638              41.5
BFAST / GATK                      861       971       330       216       146  2524     94.0            8996              28.1
BFAST / Samtools                  861       985       300       194        94  2434     90.7            3022              80.5
Bowtie / FreeBayes                861       993       255       131       143  2383     88.8            3382              70.5
Bowtie / Samtools                 861       933       145        94       101  2134     79.5            2421              88.1
BWA / FreeBayes                   861       993       357       258       186  2655     98.9            5061              52.5
BWA / GATK                        861       983       334       229       186  2593     96.6            9355              27.7
BWA / Samtools                    861       993       357       259       195  2665     99.3            3882              68.7
Lifescope / FreeBayes             861       181       205        79        57  1383     51.5            1842              75.1
Lifescope / GATK                  861       986       334       249       181  2611     97.2            7038              38.1
Lifescope / Samtools              861       990       352       249       156  2608     97.1            4255              61.3
Novoalign / FreeBayes             861       992       355       258       186  2652     98.8            3441              77.1
Novoalign / GATK                  861       990       350       251       196  2648     98.6            6153              43.0
Novoalign / Samtools              861       993       358       262       198  2672     99.5            4181              63.9
SHRiMP2 / FreeBayes               861       992       358       260       198  2669     99.4            3753              71.1
SHRiMP2 / GATK                    861       991       355       259       195  2661     99.1            6091              43.7
SHRiMP2 / Samtools                861       993       358       263       205  2680     99.8            5356              50.0

"N calls" = variants called by exactly N of the 17 pipelines. Sum = row total across the five columns. Percent = Sum / 2685. Percent of total = Sum / Total variants called by that pipeline.

Manuscript "APOL1-APOL3 gene interactions in non-diabetic nephropathy: functional assessment and results of deep sequencing" submitted to Kidney International.

RNA-seq Analysis Pipeline

RNA sequence data / alignment statistics
• FastQC
• Samtools
• Picard
Sequence data statistics: read length, number of reads, base quality, sequence quality, sequence duplication levels, etc.
Alignment statistics: alignment quality, number of high-quality aligned reads, base mismatch rate, strand balance, duplicate reads, etc.

RNA sequence data aligners / Transcriptome assemblers
• TopHat/Bowtie/Cufflinks - identifies exon-exon splice junctions, assembles transcripts, estimates their abundances, and tests for differential expression and regulation (2009)
• Subread/Subjunc/featureCounts - alignment, exon-exon junction detection, and read summarization (2013)
• Bioconductor (DESeq, DEXSeq, limma, etc.)

Expression data statistical analysis
• Differential expression analysis
• High dimensional analysis
• Enrichment/pathways analysis
• Analysis software developed in house

Study of RNA-seq Breast Cancer Cell Lines and Primary Tumors
• 28 breast cancer cell lines
• 42 triple negative breast cancer (TNBC) primary tumors
• 21 uninvolved breast tissue samples adjacent to TNBC primary tumors
• 42 estrogen receptor positive (ER+), HER2 negative breast cancer primary tumors
• 30 uninvolved breast tissue samples adjacent to ER+ primary tumors
• 5 breast tissue samples from patients with no known cancer
• 168 samples in total
• Illumina HiSeq 2000, paired-end reads

Statistical Analysis
• Differential expression analysis
• High dimensional analysis
• Classification and prediction
• Survival analysis
• Enrichment analysis and pathways analysis

RNA-seq Breast Cancer data
ER+ versus uninvolved breast tissue
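
Looking back at the APOL1 consistency table earlier in this deck, the Sum and Percent columns can be reproduced with a short sketch. It assumes each pipeline's calls are available as a set of variant keys, and that the consensus set is every variant called by at least 13 of the 17 pipelines, as the column headers suggest. The function name `concordance`, the helper `pct`, and the toy call sets are hypothetical illustrations, not part of the original analysis.

```python
from collections import Counter


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1) if whole else 0.0


def concordance(call_sets, min_calls):
    """Summarise agreement between variant-calling pipelines.

    call_sets: dict mapping a pipeline name to a set of variant keys.
    min_calls: minimum number of pipelines that must call a variant
               for it to count as part of the consensus set.
    """
    # Each pipeline contributes a variant at most once (sets), so the
    # counter value is the number of pipelines that called that variant.
    n_callers = Counter(v for calls in call_sets.values() for v in calls)
    consensus = {v for v, n in n_callers.items() if n >= min_calls}

    summary = {}
    for name, calls in call_sets.items():
        shared = len(calls & consensus)
        summary[name] = {
            "shared": shared,                                  # "Sum" column
            "pct_of_consensus": pct(shared, len(consensus)),   # first Percent
            "pct_of_total": pct(shared, len(calls)),           # second Percent
        }
    return consensus, summary


# Hypothetical toy data: three pipelines, consensus = called by at least two.
toy_calls = {
    "A": {"v1", "v2", "v3"},
    "B": {"v1", "v2"},
    "C": {"v1", "v4", "v5", "v6"},
}

if __name__ == "__main__":
    consensus, summary = concordance(toy_calls, min_calls=2)
    for name, row in summary.items():
        print(name, row)

    # Arithmetic check against the BFAST / FreeBayes row of the table:
    # 2338 consensus variants out of 2685, and out of 5638 total calls.
    print(pct(2338, 2685), pct(2338, 5638))
```

A pipeline with a high first percentage and a low second percentage (for example the GATK rows) recovers most of the consensus variants but also reports many calls that few other pipelines support.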