Next Generation Sequence Analysis
Jeff Chou, PhD
Section of Statistical Genetics, Department of Biostatistical Sciences,
Biostatistics Core and Cancer Genomics Core, Comprehensive Cancer
Center
Center for Public Health Genomics
Acknowledgment:
I work closely with Drs. Carl Langefeld, Lance Miller, Greg Hawkins, and
David McWilliams.
Through the core support, I work with many researchers in the institute,
mostly on their microarray data.
DNA-seq Analysis Pipeline
DNA sequence data / alignment statistics
• Fastqc
• Samtools
• Picard
Sequence data statistics: read length, read number, base quality, sequence quality, sequence duplication levels, GC content, etc.
Alignment statistics: alignment quality, number of high-quality aligned reads, base mismatch rate, strand balance, duplicate reads, depth, etc.
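As a rough illustration of the sequence-level statistics listed above, the following sketch computes read count, mean read length, GC fraction and mean base quality from FASTQ text. It is not how FastQC is implemented; the function name `fastq_stats`, the Phred+33 quality encoding and the two toy reads are assumptions made for this example.

```python
def fastq_stats(lines, phred_offset=33):
    """Summarize FASTQ records supplied as an iterable of text lines.

    Assumes 4-line records and Phred+33 qualities (both assumptions
    of this sketch, not statements about any particular data set).
    """
    records = [line.strip() for line in lines if line.strip()]
    n_reads = n_bases = n_gc = qual_sum = 0
    for i in range(0, len(records), 4):
        seq, qual = records[i + 1].upper(), records[i + 3]
        n_reads += 1
        n_bases += len(seq)
        n_gc += seq.count("G") + seq.count("C")
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
    return {
        "read_count": n_reads,
        "mean_read_length": n_bases / n_reads,
        "gc_fraction": n_gc / n_bases,
        "mean_base_quality": qual_sum / n_bases,
    }


# Two toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
toy_fastq = """@read1
ACGT
+
IIII
@read2
GGCCAA
+
555555
""".splitlines()

stats = fastq_stats(toy_fastq)
print(stats)
```

Alignment-level statistics such as mismatch rate, strand balance and depth need the aligned reads (BAM/SAM), which is where Samtools and Picard come in.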
DNA sequence data aligners
• BWA, Bowtie, Lifescope - Indexing Genome with Suffix Array
• BFAST, Novoalign - Indexing Genome with Hash Tables
• SHRiMP2 - Indexing Reads with Hash Tables
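To make the two genome-indexing strategies concrete, here is a toy sketch of exact-match lookup with a suffix array and with a k-mer hash table. It is only an illustration: production aligners use compressed indexes and tolerate mismatches and gaps, and all function names and the example genome below are hypothetical.

```python
from collections import defaultdict


def build_suffix_array(genome):
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def suffix_array_find(genome, sa, read):
    """Binary-search the suffix array for exact occurrences of `read`."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix whose prefix is >= read
        mid = (lo + hi) // 2
        if genome[sa[mid]:sa[mid] + len(read)] < read:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and genome[sa[lo]:sa[lo] + len(read)] == read:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def build_kmer_index(genome, k):
    """Hash table mapping every k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def hash_find(genome, index, k, read):
    """Seed with the first k bases of the read, then verify the full read."""
    return [p for p in index.get(read[:k], [])
            if genome[p:p + len(read)] == read]


genome = "ACGTACGTTAGC"
sa = build_suffix_array(genome)
kmers = build_kmer_index(genome, k=3)

print(suffix_array_find(genome, sa, "ACGT"))   # [0, 4]
print(hash_find(genome, kmers, 3, "ACGT"))     # [0, 4]
print(hash_find(genome, kmers, 3, "TAGC"))     # [8]
```

Indexing the reads instead of the genome, as listed for SHRiMP2, inverts the roles: the k-mer table is built from the reads and the genome is scanned against it.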
DNA sequence data variant callers
• GATK - indel realignment, base recalibration, variant calling (UnifiedGenotyper and HaplotypeCaller) and filtering
• FreeBayes - haplotype-based, Bayesian algorithm designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events
• Samtools mpileup - scans every position, computes all the possible genotypes, and the probability of these genotypes
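The idea of computing every possible genotype and its probability at one position can be sketched as below. This is a simplified textbook-style diploid model with a uniform prior, not the actual Samtools, GATK or FreeBayes model; the function name `genotype_posteriors` and the example pileup are assumptions for illustration.

```python
import math
from itertools import combinations_with_replacement


def genotype_posteriors(bases, quals):
    """Posterior probability of each diploid genotype at a single site.

    bases: observed base calls covering the site, e.g. "AAAAGGGG"
    quals: Phred quality of each base call
    Assumes independent reads and a uniform genotype prior.
    """
    log_like = {}
    for a, b in combinations_with_replacement("ACGT", 2):
        ll = 0.0
        for base, q in zip(bases, quals):
            # Error probability, capped at 0.75 (an uninformative base call).
            e = min(10 ** (-q / 10), 0.75)
            p_a = 1 - e if base == a else e / 3
            p_b = 1 - e if base == b else e / 3
            ll += math.log(0.5 * p_a + 0.5 * p_b)
        log_like[a + b] = ll
    # Normalize in log space to avoid underflow at high depth.
    best = max(log_like.values())
    weights = {g: math.exp(ll - best) for g, ll in log_like.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical pileup: four A and four G calls, all at Q30.
post = genotype_posteriors("AAAAGGGG", [30] * 8)
call = max(post, key=post.get)
print(call, round(post[call], 4))
```

With an even split of high-quality A and G calls, the heterozygous genotype AG dominates the posterior.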
Consistency Analysis of Called Variants from APOL1 Gene Region

Aligner    Caller     17 calls  16 calls  15 calls  14 calls  13 calls   Sum  % of 2685  Total variants  % of total
Number variants called     861       993       358       264       209  2685        100
BFAST      FreeBayes       861       929       269       185        94  2338       87.1            5638        41.5
BFAST      GATK            861       971       330       216       146  2524       94.0            8996        28.1
BFAST      Samtools        861       985       300       194        94  2434       90.7            3022        80.5
Bowtie     FreeBayes       861       993       255       131       143  2383       88.8            3382        70.5
Bowtie     Samtools        861       933       145        94       101  2134       79.5            2421        88.1
BWA        FreeBayes       861       993       357       258       186  2655       98.9            5061        52.5
BWA        GATK            861       983       334       229       186  2593       96.6            9355        27.7
BWA        Samtools        861       993       357       259       195  2665       99.3            3882        68.7
Lifescope  FreeBayes       861       181       205        79        57  1383       51.5            1842        75.1
Lifescope  GATK            861       986       334       249       181  2611       97.2            7038        38.1
Lifescope  Samtools        861       990       352       249       156  2608       97.1            4255        61.3
Novoalign  FreeBayes       861       992       355       258       186  2652       98.8            3441        77.1
Novoalign  GATK            861       990       350       251       196  2648       98.6            6153        43.0
Novoalign  Samtools        861       993       358       262       198  2672       99.5            4181        63.9
SHRiMP2    FreeBayes       861       992       358       260       198  2669       99.4            3753        71.1
SHRiMP2    GATK            861       991       355       259       195  2661       99.1            6091        43.7
SHRiMP2    Samtools        861       993       358       263       205  2680       99.8            5356        50.0
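The summary columns of the table can be reproduced with simple arithmetic. The sketch below reads "N calls" as the number of variants called by exactly N of the 17 aligner/caller pipelines (an interpretation, though the row and column totals are consistent with it) and recomputes the sum and the two percentages for the first pipeline row (FreeBayes: 861, 929, 269, 185, 94, with 5638 total variants). The function name `summarize_pipeline` is hypothetical.

```python
# "Number variants called" row: counts for 17, 16, 15, 14 and 13 calls.
consensus_counts = [861, 993, 358, 264, 209]
consensus_total = sum(consensus_counts)  # 2685


def summarize_pipeline(shared_counts, total_variants):
    """Return (sum, percent of consensus set, percent of pipeline's own calls)."""
    shared = sum(shared_counts)
    pct_of_consensus = round(100 * shared / consensus_total, 1)
    pct_of_total = round(100 * shared / total_variants, 1)
    return shared, pct_of_consensus, pct_of_total


# First pipeline row of the table.
result = summarize_pipeline([861, 929, 269, 185, 94], 5638)
print(result)  # (2338, 87.1, 41.5)
```

Read this way, the first percentage behaves like sensitivity against the consensus set, and the second shows how much of a pipeline's own output falls inside that set.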
Manuscript "APOL1-APOL3 gene interactions in non-diabetic nephropathy: functional assessment and results of deep sequencing" submitted to Kidney International.
RNA-seq Analysis Pipeline
RNA sequence data / alignment statistics
• Fastqc
• Samtools
• Picard
Sequence data statistics: read length, read number, base quality, sequence quality, sequence duplication levels, etc.
Alignment statistics: alignment quality, number of high-quality aligned reads, base mismatch rate, strand balance, duplicate reads, etc.
RNA sequence data aligners / Transcriptome assemblers
• Tophat/Bowtie/Cufflinks - identifies exon-exon splice junctions, assembles transcripts, estimates their abundances, and tests for differential expression and regulation (2009)
• Subread/Subjunc/featureCounts - alignment, exon-exon junction detection and read summarization (2013)
• Bioconductor (DESeq, DEXSeq, limma, etc.)
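"Read summarization" means turning aligned reads into per-feature counts, which are then the input to the Bioconductor packages. A minimal sketch of the idea follows; it is not featureCounts itself, and the feature names, coordinates (0-based, half-open) and the choice to skip reads that overlap several features are assumptions of this example.

```python
def summarize_reads(features, reads):
    """Count aligned reads per feature.

    features: {name: (chrom, start, end)}
    reads:    list of (chrom, start, end) alignments
    Reads overlapping no feature or more than one feature are skipped.
    """
    counts = {name: 0 for name in features}
    for chrom, r_start, r_end in reads:
        hits = [name for name, (f_chrom, f_start, f_end) in features.items()
                if f_chrom == chrom and r_start < f_end and f_start < r_end]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Hypothetical annotation with two overlapping genes.
features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 160),  # geneA only
    ("chr1", 190, 240),  # overlaps both genes: ambiguous, skipped
    ("chr1", 250, 300),  # geneB only
    ("chr2", 110, 160),  # no feature on chr2
]
counts = summarize_reads(features, reads)
print(counts)  # {'geneA': 1, 'geneB': 1}
```

A real summarizer also has to deal with strandedness, paired reads and exon-level annotation.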
Expression data statistical analysis
• Differential expression analysis
• High dimensional analysis
• Enrichment/pathways analysis
• Analysis software developed in house
Study of RNA-seq Breast Cancer Cell Lines and Primary Tumors
• 28 breast cancer cell lines
• 42 triple-negative breast cancer (TNBC) primary tumors
• 21 uninvolved breast tissue samples adjacent to TNBC primary tumors
• 42 estrogen receptor-positive (ER+), HER2-negative breast cancer primary tumors
• 30 uninvolved breast tissue samples adjacent to ER+ primary tumors
• 5 breast tissue samples from patients with no known cancer
• 168 samples in total
• Illumina HiSeq 2000, paired reads
Statistical Analysis
• Differential expression analysis
• High dimensional analysis
• Classification and prediction
• Survival analysis
• Enrichment analysis and pathways analysis
RNA-seq Breast Cancer data: ER+ versus uninvolved breast tissue