Bioinformatics for Stem Cell
Lecture 2
Debashis Sahoo, PhD

Outline
• Lecture 1 Recap
• Multivariate analysis
• Microarray data analysis
• Boolean analysis
• Sequencing data analysis

MULTIVARIATE ANALYSIS

Identify Markers of Human Colon Cancer and Normal Colon
Piero Dalerba, Tomer Kalisky

Single Cell Analysis of Normal Human Colon Epithelium

Hierarchical Clustering
• Cluster 3.0 – http://bonsai.hgc.jp/~mdehoon/software/cluster/
• Distance metric – Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation
• Linkage – single, complete, average, median, centroid

Multivariate Analysis - PCA

Principal Component Analysis
X = data matrix
V = loading matrix
U = scores matrix

Fundamentals of PCA
• Reduces dimensions of the data
• PCA uses orthogonal linear transformation
• First principal component has the largest possible variance.
• Exploratory tool to uncover unknown trends in the data

PCA Analysis

HIGH-THROUGHPUT DATA ANALYSIS

MICROARRAY ANALYSIS

Microarray
• Spotted vs. in situ
• Two channel vs. one channel
• Probe vs. probeset vs. gene

Quantile Normalization
[Figure: arrays #1, #2, #3 are sorted, then averaged across arrays to give SortedAvg]
Val(Probe_i) = SortedAvg[Rank(Probe_i)]

Invariant Set Normalization
[Figure panels: Before Normalization, Invariant set, After Normalization]

Good to Check the Image

SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Gene 1 to Gene 6 by Exp 1 to Exp 6, regrouped into Group A (Exp 1, Exp 2, Exp 5) and Group B (Exp 3, Exp 4, Exp 6)]
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?

SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
[Figure: Gene 1 under the original grouping (Group A: Exp 1, Exp 2, Exp 5; Group B: Exp 3, Exp 4, Exp 6) and under a randomized grouping of Exp 1 to Exp 6]

SAM Two-Class Unpaired
iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
[Figure: observed vs. expected d-values with the “Observed d = expected d” line]

SAM Two-Class Unpaired
• Significant negative genes (i.e., mean expression of group A > mean expression of group B)
• Significant positive genes (i.e., mean expression of group B > mean expression of group A)
• The more a gene deviates from the “observed = expected” line, the more likely it is to be significant.
• Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.

GenePattern
http://genepattern.broadinstitute.org/

AutoSOME
http://jimcooperlab.mcdb.ucsb.edu/autosome/
Aaron Newman
Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117

Gene Set Analysis
[Figure labels: Your Gene Set; Cell Cycle; Transcription factor; TGF-beta Signaling Pathway; Wnt-signaling Pathway; Protein-protein interaction network]
• Compute enrichment in pathways and networks
• Tools: GSEA, DAVID, Toppfun, MSigDB, and STRING

BOOLEAN ANALYSIS

Boolean Implication [Sahoo et al. Genome Biology 08]
[Figure: scatter plot of GABRB1 vs. ACPP across 45,000 Affymetrix microarrays]
• Analyze pairs of genes.
• Analyze the four different quadrants.
• Identify sparse quadrants.
• Record the Boolean relationships.
– If ACPP high, then GABRB1 low
– If GABRB1 high, then ACPP low

Threshold Calculation [Sahoo et al. 07]
[Figure labels: High, Intermediate, Low, Threshold, Sorted arrays]
• A threshold is determined for each gene.
• The arrays are sorted by gene expression.
• StepMiner is used to determine the threshold.

BooleanNet Statistics [Sahoo et al. Genome Biology 08]
[Figure: quadrant counts for gene A (x-axis) vs. gene B (y-axis): a01, a11 (top row); a00, a10 (bottom row)]
nAlow = (a00 + a01), nBlow = (a00 + a10)
total = a00 + a01 + a10 + a11, observed = a00
expected = (nAlow / total * nBlow / total) * total
statistic = (expected – observed) / √expected
error rate = 1/2 * (a00 / (a00 + a01) + a00 / (a00 + a10))
Boolean Implication = (statistic > 3, error rate < 0.1)

Six Boolean Implications [Sahoo et al. Genome Biology 08]

MiDReG Algorithm
MiDReG = (Mining Developmentally Regulated Genes) [Sahoo et al. PNAS 2010]

B Cell Genes
[Figure labels: KIT, CD19, Boolean Implications]
[Sahoo et al. PNAS 2010]

Jun Seita
http://gexc.stanford.edu
[Seita, Sahoo et al. PLoS ONE, 2012]

SEQUENCING DATA ANALYSIS

Sequencing Data Format

FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

FASTQ
@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba

S - Sanger: Phred+33, (0, 40)
X - Solexa: Solexa+64, (-5, 40)
I - Illumina 1.3+: Phred+64, (0, 40)
J - Illumina 1.5+: Phred+64, (3, 40)
L - Illumina 1.8+: Phred+33, (0, 41)

Mapping

Mapping Software
• Long reads
– BLAST, HMMER, SSEARCH
• Short reads
– BLAT
– Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

Visualizations
• UCSC Genome Browser
• GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2
• Integrative Genomics Viewer (IGV)

Quantification
• Peak calling
– QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
• Expression quantification
– Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
• SNP calling
– samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

Peak Discovery [Pepke et al. Nature Methods 2009]

Transcript Quantification
RPKM, FPKM [Pepke et al. Nature Methods 2009]

SNP Calling

Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]
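The Quantile Normalization slide gives the rule Val(Probe_i) = SortedAvg[Rank(Probe_i)]. A minimal sketch of that rule in plain Python follows. The function name, the list-of-lists layout (one list of probe values per array), and the toy numbers are assumptions made for illustration, and ties are broken by probe position, which the slide does not specify.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of probe values per array)."""
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array on its own, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    sorted_avg = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest value in this array.
        # Ties are broken by probe position (an assumption, not from the slide).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]  # Val(Probe_i) = SortedAvg[Rank(Probe_i)]
        normalized.append(out)
    return normalized


# Toy data: three arrays (#1, #2, #3) with four probes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After normalization every array holds the same set of values (the rank averages), only arranged according to each array's own probe ranking.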
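Steps i) to vi) of the SAM two-class unpaired permutation test can be sketched roughly as below. This is a simplified illustration, not the SAM software itself: the d-value here is a t-like statistic with a small constant `s0` added to the denominator, the value of `s0`, the number of permutations, the seed, and the toy expression matrix are all assumptions, and the delta-based significance call from the slides is left out. The sign is chosen so that positive d means group B > group A, as on the slides.

```python
import math
import random


def d_value(values, group_a, group_b, s0=0.1):
    """t-like statistic for one gene; s0 is an assumed stabilizing constant."""
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)


def sam_observed_expected(matrix, group_a, group_b, n_perm=200, s0=0.1, seed=0):
    """Return observed d-values (per gene) and expected d-values (per rank)."""
    rng = random.Random(seed)
    observed = [d_value(row, group_a, group_b, s0) for row in matrix]
    samples = list(group_a) + list(group_b)
    rank_sums = [0.0] * len(matrix)
    for _ in range(n_perm):
        # Step iii: reshuffle experiments, keeping the original group sizes.
        rng.shuffle(samples)
        perm_a = samples[:len(group_a)]
        perm_b = samples[len(group_a):]
        # Step iv: rank the permuted d-values in ascending order.
        permuted = sorted(d_value(row, perm_a, perm_b, s0) for row in matrix)
        for rank, d in enumerate(permuted):
            rank_sums[rank] += d
    # Step v: average the permuted d-values at each rank.
    expected = [total / n_perm for total in rank_sums]
    return observed, expected


def pair_by_rank(observed, expected):
    """Step vi input: (gene index, observed d, expected d) in ascending observed order."""
    order = sorted(range(len(observed)), key=lambda g: observed[g])
    return [(g, observed[g], expected[rank]) for rank, g in enumerate(order)]


# Made-up matrix: rows are genes, columns are Exp 1..Exp 6 (indices 0..5).
matrix = [
    [1.0, 1.2, 5.0, 5.3, 0.9, 4.8],  # higher in group B
    [4.0, 4.2, 1.0, 1.1, 3.9, 0.8],  # higher in group A
    [2.0, 2.1, 2.2, 1.9, 2.0, 2.1],  # roughly flat
]
group_a = [0, 1, 4]  # Exp 1, 2, 5
group_b = [2, 3, 5]  # Exp 3, 4, 6
observed, expected = sam_observed_expected(matrix, group_a, group_b)
for gene, obs, exp in pair_by_rank(observed, expected):
    print(gene, round(obs, 2), round(exp, 2))
```

Plotting the printed pairs gives the observed vs. expected plot from step vi; genes far from the diagonal are the candidates for significance.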
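The BooleanNet Statistics slide can be turned into a few lines of code. The sketch below follows the slide's definitions for the low-low quadrant (a00) only, with the statistic read as (expected – observed) / √expected and the error rate as the mean of a00 / (a00 + a01) and a00 / (a00 + a10). The function names and the example counts are hypothetical.

```python
import math


def boolean_implication_stats(a00, a01, a10, a11):
    """Sparseness statistic and error rate for the low-low quadrant (a00)."""
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    total = a00 + a01 + a10 + a11
    if n_a_low == 0 or n_b_low == 0:
        raise ValueError("need at least one array with A low and one with B low")
    observed = a00
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / (a00 + a01) + a00 / (a00 + a10))
    return statistic, error_rate


def quadrant_is_sparse(a00, a01, a10, a11):
    """Apply the slide's thresholds: statistic > 3 and error rate < 0.1."""
    statistic, error_rate = boolean_implication_stats(a00, a01, a10, a11)
    return statistic > 3 and error_rate < 0.1


# Hypothetical counts: almost no arrays have both genes low.
sparse_stats = boolean_implication_stats(2, 400, 300, 298)
print(sparse_stats, quadrant_is_sparse(2, 400, 300, 298))

# Hypothetical counts: arrays spread evenly over the four quadrants.
even_stats = boolean_implication_stats(250, 250, 250, 250)
print(even_stats, quadrant_is_sparse(250, 250, 250, 250))
```

The other three quadrants can be tested the same way by permuting which count plays the role of a00.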
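To make the FASTQ quality encodings concrete, here is a small sketch that decodes the quality string of the example read. It assumes the read uses a Phred+64 encoding; that is an inference, since decoding with +33 would give scores well above 41, and it is not stated on the slide. The function names are hypothetical, the conversion p = 10^(-Q/10) is the standard Phred definition, and the Solexa+64 scale (which is defined differently) is not handled.

```python
def decode_quality(quality, offset):
    """Convert a FASTQ quality string to integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: probability that the base call is wrong."""
    return 10 ** (-q / 10)


sequence = "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT"
quality = "efcfffffcfeefffcffffffddf`feed]`]_Ba"

# Assumption: Phred+64 (Illumina 1.3+ / 1.5+ style) for this read.
scores = decode_quality(quality, 64)

for base, q in zip(sequence, scores):
    print(base, q, error_probability(q))
```

The low score near the end of the read lines up with the ambiguous base N in the sequence.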
