Microarray-based Disease Prognosis using Gene Annotation Signatures
Michael Kovshilovsky
Swapna Annavarapu
SoCalBSI 2005
• Internship site: BioDiscovery, Inc.
• Mentor: Dr. Bruce Hoff
• Source of Funding: BioDiscovery, Inc.

Motivation
• Microarray gene-expression profiling studies are used to predict disease outcomes.
  – e.g., cancer outcome
• The aim is to improve the treatment of patients based on knowledge of their gene-expression profile (molecular signature).

Lancet Paper
“Prediction of cancer outcome with microarrays: a multiple random validation strategy”
Findings of Stefan Michiels et al.: gene-expression microarray-based predictors of clinical outcome have been overly optimistic, and careful review shows that their performance is poor and variable.
  – Analyzed data from the 7 largest published studies that have attempted to predict the prognosis of cancer patients based on DNA microarray analysis.
  – Random sampling approach

Goal
• Reproduce the Lancet paper.
• Compare classification based on expression levels of microarray probes with classification based on GSEA scores of biological pathways.
• Validate our hypothesis:
  – By abstracting away from the gene-expression domain to that of biological properties, performance should stabilize and improve.

Phase I: Reproduce the Lancet Paper (Gene-Expression-based classification)

Methodology
• Data loading
• Data preprocessing
• Data selection
• Correlating with clinical outcome
• Determine the molecular signature
• Classification of data

Data Loading
• Read Affymetrix chip expression data.
Sample data (descriptions are truncated as in the source file):

Probe set         T-ALL-C1 Avg Diff  T-ALL-C1 Abs Call  T-ALL-C2 Avg Diff  T-ALL-C2 Abs Call  Description
AFFX-MurIL2_at    5803.2             P                  -968.6             A                  M16762 Mouse interleukin 2 (IL-2) gene, exon 4
AFFX-MurIL10_at   -1626.9            A                  -929.1             A                  M37897 Mouse interleukin 10 mRNA, complete cds
AFFX-MurIL4_at    -2599.6            A                  254.3              A                  M25892 Mus musculus interleukin 4 (Il-4) mRNA, complete cds
AFFX-MurFAS_at    2353.9             A                  1430.1             A                  M83649 Mus musculus Fas antigen mRNA, complete cds
AFFX-BioB-5_at    124288             P                  77263              P                  J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra
AFFX-BioB-M_at    177215             P                  113251             P                  J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra
AFFX-BioB-3_at    105651             P                  78284              P                  J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra
AFFX-BioC-5_at    107134             P                  70346              P                  J04423 E coli bioC protein (-5 and -3 represent transcript regions
AFFX-BioC-3_at    96543              P                  84231              P                  J04423 E coli bioC protein (-5 and -3 represent transcript regions
AFFX-BioDn-5_at   145965             P                  84271              P                  J04423 E coli bioD gene dethiobiotin synthetase (-5 and -3 repre
AFFX-BioDn-3_at   431822             P                  328727             P                  J04423 E coli bioD gene dethiobiotin synthetase (-5 and -3 repre

Data Preprocessing
• Scaling
  – Identify the present, absent and marginal expression levels.
  – Scale the average of the fluorescent intensities of all genes to a constant target intensity of 2500.
  – Cap expression values above 45000 at 45000 and set those below 100 to 1.
• Filtration
  – Eliminate the genes with low or no variance.
• Log transformation
  – log2(values)

Preprocessed Data
[Figure: preprocessed data, before and after]

Data Selection
• Training-Validation approach:
  – Training set for identifying the molecular signature.
  – Validation set for estimating the proportion of misclassifications.
• Therefore: Dataset (N) → random selection → Training (n) + Validation (N − n), such that
  – each set includes half the patients with and half without a favorable outcome.
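The Data Preprocessing and Data Selection steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the project's actual code: the function names `preprocess` and `balanced_split`, their parameters, and the variance threshold are hypothetical, scaling is assumed to be applied per chip (per column), and "half the patients with and half without a favorable outcome" is read as drawing half of each outcome class into the training set. The "below 100 to 1" rule is taken literally from the slide.

```python
import numpy as np


def preprocess(raw, target=2500.0, low=100.0, low_value=1.0,
               high=45000.0, min_variance=0.0):
    """Scale, clip, filter and log-transform a genes x samples matrix.

    All parameter names are illustrative; the defaults follow the slide.
    Returns the log2 matrix of retained genes and the boolean keep-mask.
    """
    raw = np.asarray(raw, dtype=float)
    # Scaling: bring the mean intensity of every chip (column) to `target`.
    scaled = raw * (target / raw.mean(axis=0, keepdims=True))
    # Cap high values, then replace values below `low` by `low_value`.
    scaled = np.minimum(scaled, high)
    scaled = np.where(scaled < low, low_value, scaled)
    # Filtration: drop genes whose variance across samples is not above
    # the threshold (the threshold itself is an assumption).
    keep = scaled.var(axis=1) > min_variance
    # Log transformation of the retained genes.
    return np.log2(scaled[keep]), keep


def balanced_split(outcome, train_fraction=0.5, rng=None):
    """Randomly split sample indices, drawing the same fraction of each
    outcome class (e.g. 1 = favorable, -1 = unfavorable) for training."""
    rng = np.random.default_rng(rng)
    outcome = np.asarray(outcome)
    train = []
    for label in np.unique(outcome):
        idx = np.flatnonzero(outcome == label)
        rng.shuffle(idx)
        n_train = int(round(len(idx) * train_fraction))
        train.extend(idx[:n_train].tolist())
    train = np.sort(np.array(train, dtype=int))
    valid = np.setdiff1d(np.arange(outcome.size), train)
    return train, valid
```

Calling `balanced_split` repeatedly with different seeds would give the kind of repeated random training/validation splits the slides describe; each training split then yields its own molecular signature.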
Correlation
• Clinical outcome
  – Favorable = 1 (continuous complete remission)
  – Unfavorable = -1 (relapse)
• Correlate the expression values of each gene with the clinical outcome
  – Pearson’s correlation coefficient
• Determine the molecular signature
  – Defined by the 50 genes most highly correlated with the outcome.

Data Classification (Nearest-Centroid Prediction Rule)
• A new point is classified according to which centroid is nearest.
• The data is 50-dimensional.
• A PCA plot is used to display the data.
• Principal component analysis (PCA) is a powerful tool for analysing data by identifying patterns in it.
[Figure: PCA plot showing the unfavorable centroid and the favorable centroid]

Results (cont’d.)
• Each of the 500 training sets provided a different molecular signature.
• Plot of genes that occurred most frequently in the molecular signature.
[Figure: Top 250 genes included in 500 molecular signatures; axes: probe IDs vs. number of signatures]

Analysis
• The frequency with which individual genes participate in defining the signature is quite low.
• This suggests that the molecular signature is selected almost randomly and is unstable.

Phase II: Analysis of Microarray Data using GSEA (Gene Set Enrichment Analysis)
http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Methodology
• Data loading
• Data preprocessing
• Data selection
• GSEA
  – Determine enrichment scores
• Correlating with clinical outcome
• Classification of data

Preliminary steps
• Data loading
• Data preprocessing
• Data selection
Same as in Phase I.

GSEA
• Gene Set Enrichment Analysis
  – A microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets.
  – GSEA provides an enrichment score that measures the degree of enrichment of the gene set within a rank-ordered gene list derived from the data set.

GSEA (cont’d)
• GSEA inputs:
  – List of genes ranked according to the expression difference between two classes.
  – A priori defined gene sets (e.g., pathways), each consisting of members drawn from the list of genes.
• Ranking of genes is done using a distance metric, the signal-to-noise ratio (SNR).
http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf

Signal-to-Noise Ratio
• The signal-to-noise ratio method looks at the difference of the means in each of the classes scaled by the sum of the standard deviations:
  SNR = (α · √n) / σ
  where α (signal) is the difference in mean expression between the two classes and σ (noise) is the standard deviation.

Implementation
• Determine the SNR for each gene.
• Sort the gene list based on the SNR values.
• The degree of enrichment of the gene set is measured by comparing the SNR-ordered gene list with the gene set (pathway).
http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Enrichment Score (ES)
• If a gene is in the gene set, increment the running sum by Y.
• If a gene is not in the gene set, decrement the running sum by X.
  X = √(G / (N − G))
  Y = √((N − G) / G)
  G = number of genes in the set
  N = size of the data (total number of genes)
http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
• ES = greatest positive deviation of this running sum across all genes

Correlation & Classification
• Similar to Phase I
  – First, the top 50 pathways are selected to create the favorable and unfavorable centroids.
  – Next, the training and validation sets are classified using the nearest-centroid prediction rule.

Results (cont’d.)
• Each of the 500 training sets provided a different molecular signature.
• Plot of pathways that occurred in over 150 of the molecular signatures.
[Figure: Pathways included in over 150 signatures; axes: pathways vs. number of signatures]

Results
Gene Expression Average % = 93.77%
Gene Set Based Average % = 97.88%

Results (cont’d)
Gene Expression Average % = 96.45%
Gene Set Based Average % = 93.80%

Results (cont’d)
Gene Expression Average % = 75.17%
Gene Set Based Average % = 52.91%

Results (cont’d)
Gene Expression Average % = 26.48%
Gene Set Based Average % = 47.76%

Three significant pathways
• Iron ion homeostasis
  – Reduces tumor angiogenesis by protecting cells from oxidative stress
• Unfolded protein response, positive regulation of target gene transcription
  – A stress-signaling pathway in tumor cells
• Tryptophan catabolism
  – Has an antiproliferative effect on many tumor cells

Conclusion
• Our results have shown that:
  – The centroid classification based on gene expression performs poorly with the validation set.
  – The GSEA method does not perform any better than the gene-expression method.

Future Work
• Analysis with a different classification approach.
• Using much larger data sets from different samples.

Acknowledgements
• Dr. Bruce Hoff
• Dr. Soheil Shams
• SoCalBSI

References
1. Stefan Michiels, Serge Koscielny, Catherine Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, Vol. 365, 488–92 (2005).
2. Mootha, V. K., et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, Vol. 34, 267–273 (2003).
3. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
4. http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf
5. http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
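
Supplementary sketch: Enrichment Score
As a worked illustration of the Enrichment Score (ES) slide, the running-sum calculation can be written in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the project: the function name `enrichment_score` and its arguments are hypothetical, the input list is assumed to be already sorted by SNR, and G is counted as the number of gene-set members actually present in the ranked list.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum described on the
    Enrichment Score slide.

    ranked_genes: gene identifiers, already ordered by SNR (assumption).
    gene_set:     identifiers belonging to one pathway / gene set.
    """
    members = set(gene_set) & set(ranked_genes)
    n_total = len(ranked_genes)   # N on the slide
    n_in_set = len(members)       # G on the slide
    if n_in_set == 0 or n_in_set == n_total:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    miss_step = math.sqrt(n_in_set / (n_total - n_in_set))  # X
    hit_step = math.sqrt((n_total - n_in_set) / n_in_set)   # Y

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Walk down the ranked list: step up on a set member, down otherwise.
        running += hit_step if gene in members else -miss_step
        best = max(best, running)
    return best
```

With these step sizes the running sum returns to zero at the end of the list, since G·Y equals (N − G)·X. A gene set concentrated at the top of the ranking therefore produces a large positive peak, while one concentrated at the bottom never rises above zero.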