Patrick Kemmeren
Using EP:NG
http://www.ebi.ac.uk/expressionprofiler

Aims
• Demonstrate the use of the following components in EP:NG:
  – Data Selection
  – Data Transformation
  – Missing Value Imputation
  – Principal Components Analysis
  – Hierarchical Clustering
  – Clustering Comparison
• … for the purposes of microarray data analysis (here, tumor line classification) …
• … by examining the following paper: M. Crescenzi and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data, FEBS Letters 507 (2001) 114-118.

Overview
• PCA: Principal Component Analysis
  – Unsupervised approach
  – Reduces the complexity of the data by reducing its dimensionality
  – Computes a new, smaller set of uncorrelated variables that best represent the original data
• Orientation by Genes
  – Genes are statistical variables, samples are statistical samples
  – Covariance matrix: records the covariance of one gene vs. another, over samples

PCA (cont.)
• Principal Components
  – A principal component is a mathematical entity, computed from the data, equivalent to a characteristic vector (eigenvector) of the covariance matrix
  – In other words, PCA rotates the original coordinate axes to find the directions of maximum variance in the scatter of points
• Summary
  – Eigenvalues are the characteristic values of the principal components; the higher the eigenvalue, the more variability in the dataset the component describes
  – The first few components can thus describe a large proportion of the variance in the data

Materials and Methods
• Data
  – http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
• cDNA from 60 cancer cell lines, hybridized to ~8000 individual gene cDNAs
• T-matrix: 1416 variables, corresponding to selected genes of highest variance (1375)
  – log ratios between the gene expression level and a reference mixture
• Strategy
  – Perform PCA
  – Select the top explaining components
  – Project + cluster the cell lines into component space
  – Choose K (number of clusters) by visual observation + clustering comparison

Component: Data Upload
• “Provide URL”
  – The data matrix URL: http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
  – Data Format: Nr of columns after 1 for annotation => 3
  – Species: Homo sapiens
• >> Data Selection

Component: Data Selection
• Select columns:
  – Only the cell-line columns (format XX:Cell Line)
    Filter: .*:.*
• >> Data Transformation

Component: Data Transformation
• KNN imputation
  – 10 neighbours
• >> Data Transformation - transpose
• >> Data Selection
• >> Side menu: Ordination

Component: Ordination
• Analysis Options:
  – Principal Components
  – Save 5 eigenvalues
  – Output: Graphs of Eigenvalues
  – Output: Summary and Eigenvalues, Arrays and Genes Co-ordinates
• >> Output Display
  – Examine outputs…
• Save the row (cell line) co-ordinates (keep the top 5 eigenvalues) to the local hard drive (the original column annotations serve as row annotations here).
• Import the file into Excel (paste in the original column annotations, i.e. the cell lines).
• Save this file as tab-delimited and upload it again.

Component: Hierarchical Clustering
• Cluster the cell lines (now in the 5-component space)
  – Euclidean Distance
  – Average Linkage
• >> Output Display
  – How many clusters can you see?
  – Try to zoom in

Components: K-Groups Clustering, Clustering Comparison
• Cluster the tumour cell lines in the component space with the K-means algorithm several times… try K=10, K=6, K=5, K=4
• Run Clustering Comparison several times
  – Which K seems most fitting?
  – An automated method for this process is being developed

Obtaining genes strongly correlated with components
• From the PCA results screen, import the column (gene) co-ordinates into Excel
  – Sort (Ascending) on the first component column (Comp1)
  – What are the top genes there?
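
Sketch: the PCA steps outside EP:NG
The PCA described in these slides (covariance matrix over genes, eigenvectors as components, eigenvalues as explained variability, projection of cell lines, sorting genes on Comp1) can be illustrated with a few lines of Python. This is only a minimal sketch and not EP:NG's own implementation: it assumes NumPy is available, the function name pca_by_covariance is hypothetical, and randomly generated toy data stand in for the transposed t_matrix1375.txt matrix (rows = cell lines, columns = genes).

```python
import numpy as np


def pca_by_covariance(data, n_components=5):
    """PCA via eigendecomposition of the gene-by-gene covariance matrix.

    data: 2-D array, rows = cell lines (samples), columns = genes (variables).
    Returns (top eigenvalues, gene loadings, cell-line scores, explained fraction).
    """
    centered = data - data.mean(axis=0)        # centre each gene on zero
    cov = np.cov(centered, rowvar=False)       # genes x genes covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = eigvals[order]
    loadings = eigvecs[:, order]               # gene co-ordinates per component
    scores = centered @ loadings               # cell-line co-ordinates per component
    explained = top_vals / eigvals.sum()       # share of total variance per component
    return top_vals, loadings, scores, explained


# Toy stand-in for the real matrix: 60 "cell lines" x 40 "genes",
# built from 3 hypothetical groups plus noise (not real expression data).
rng = np.random.default_rng(0)
n_lines, n_genes = 60, 40
group = rng.integers(0, 3, size=n_lines)
signal = rng.normal(size=(3, n_genes))
data = signal[group] + 0.3 * rng.normal(size=(n_lines, n_genes))

eigvals_top, loadings, scores, explained = pca_by_covariance(data, n_components=5)

# Genes sorted ascending on the first component, as in the Excel step above.
gene_order = np.argsort(loadings[:, 0])

print("explained fraction per component:", np.round(explained, 3))
print("first five gene indices on Comp1 (ascending):", gene_order[:5])
```

The scores array plays the role of the saved cell-line co-ordinates that are later clustered, and gene_order corresponds to sorting the gene co-ordinates on Comp1. Note that the sign of each component is arbitrary, so the genes at the two ends of the sorted list may swap between runs or implementations.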

Acknowledgements
Original EP Development: Clustering Comparison:
• Jaak Vilo (Tartu)
• Aurora Torrente
• Patrick Kemmeren (Utrecht)
• Christine Körner (Leipzig)
• Misha Kapushesky
PCA/COA/BGA: EP:NG Framework Development:
• Patrick Kemmeren (Utrecht)
• Misha Kapushesky
Visualization Components (under development):
• Steffen Durinck (Leuven)
• Aedín Culhane (Cork)
Gene Ordering:
• Karlis Freivalds (Riga)
Normalization (under development):
• Tom Bogaert (Leuven)
Discussions:
• EBI Microarray Informatics Team
• Contributors from the open source community

EP:NG is an open source project – if you are interested in contributing, testing or just discussing ideas, let us know!