Patrick Kemmeren
Using EP:NG
http://www.ebi.ac.uk/expressionprofiler
Aims
• Demonstrate the use of the following
components in EP:NG:
– Data Selection
– Data Transformation
– Missing Value Imputation
– Principal Components Analysis
– Hierarchical Clustering
– Clustering Comparison …
• … for the purposes of microarray data
analysis (here, tumor line classification) …
• … by examining the following paper:
M. Crescenzi and A. Giuliani, The main biological determinants
of tumor line taxonomy elucidated by a principal component
analysis of microarray data, FEBS Letters 507(2001) 114-118.
Overview
• PCA: Principal Component Analysis
– Unsupervised approach
– Reduces complexity of data by reducing its
dimensionality
– Computes a new, smaller set of uncorrelated
variables that best represent the original data
• Orientation by Genes
– Genes are statistical variables, samples are
statistical samples
– Covariance matrix: records covariance of one
gene vs. another, over samples.
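As a minimal sketch of this orientation, assuming a small made-up expression matrix with samples as rows and genes as columns (the values and variable names are illustrative and not taken from the tutorial data), the gene-by-gene covariance matrix could be computed like this:

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) x 3 genes (columns) of log ratios.
expression = np.array([
    [1.0, 2.0, -1.0],
    [2.0, 4.1, -0.5],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.0],
])

# rowvar=False tells NumPy that the columns (genes) are the variables
# and the rows (samples) are the observations.
gene_cov = np.cov(expression, rowvar=False)

print(gene_cov.shape)  # (3, 3): one row and one column per gene
```

Entry (i, j) of the result is the covariance of gene i with gene j over the samples, so the diagonal holds the per-gene variances.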
PCA (cont.)
• Principal Components
– A principal component is a mathematical entity,
computed from the data, equivalent to a
characteristic vector of the covariance matrix
– In other words, PCA rotates the original
coordinate axes to find the directions of
maximum variance of the scatter of points
• Summary
– Eigenvalues are characteristic values of the
principal components – the higher the eigenvalue,
the more variability in the dataset it describes
– The first few components can thus describe a
large proportion of the variance in the data
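The relationship between eigenvalues and explained variability can be sketched as follows. This is an illustration on made-up data, not EP:NG's implementation; the function name `pca_eigen` and the toy matrix are assumptions introduced here.

```python
import numpy as np

def pca_eigen(data):
    """PCA via eigendecomposition of the covariance matrix.

    data: 2-D array, rows = observations, columns = variables.
    Returns (eigenvalues, eigenvectors, explained_fraction), with
    components sorted from largest to smallest eigenvalue.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues
    # in ascending order, so reverse them afterwards.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()
    return eigenvalues, eigenvectors, explained

# Hypothetical toy data: two strongly correlated variables plus a noisy one.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
toy = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(50, 1)),
    0.2 * rng.normal(size=(50, 1)),
])

eigenvalues, eigenvectors, explained = pca_eigen(toy)
print(np.round(explained, 3))  # fraction of total variance per component
```

Because two of the three toy variables are nearly collinear, the first component carries most of the variance, which is the situation that lets a few components stand in for the full dataset.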
Materials and Methods
• Data
– http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
• cDNA from 60 cancer cell lines, hybridized to ~8000 individual
gene cDNAs
• T-matrix: 1416 variables, corresponding to selected
genes of highest variance (1375) – log ratios between
the gene expression level and a reference mixture
• Strategy
– Perform PCA analysis
– Select top explaining components
– Project + cluster cell lines into component space
– Choose K (number of clusters) by visual
observation + clustering comparison
Component: Data Upload
• “Provide URL”
– The data matrix URL:
http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
– Data Format: Nr of columns after 1 for
annotation => 3
– Species: Homo sapiens
• >> Data Selection
Component: Data Selection
• Select columns:
– Only the cell-line columns (format XX:Cell Line)
– Filter: .*:.*
• >> Data Transformation
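To illustrate what the filter `.*:.*` selects, here is a small sketch using Python's `re` module. The column names are hypothetical examples of the XX:Cell Line format, and whether EP:NG applies the pattern as a full match or a search is not stated in the tutorial; for this pattern on single-line names the two give the same result.

```python
import re

# Hypothetical header: annotation columns followed by cell-line columns.
columns = ["gene_id", "description", "cluster",
           "BR:MCF7", "CNS:SF_268", "LE:K_562"]

# ".*:.*" matches any name that contains a colon.
pattern = re.compile(r".*:.*")
selected = [name for name in columns if pattern.fullmatch(name)]

print(selected)  # ['BR:MCF7', 'CNS:SF_268', 'LE:K_562']
```

Only the names containing a colon survive, which drops the annotation columns and keeps the cell lines.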
Component: Data Transformation
• KNN imputation
– 10 neighbours
• >> Data Transformation
- transpose
• >> Data Selection
• >> Side menu: Ordination
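The imputation and transpose steps can be sketched as below. This is a simplified K-nearest-neighbour imputation written for illustration; EP:NG's exact algorithm is not described in the tutorial, and the function name `knn_impute`, the distance choice, and the toy matrix are assumptions. The tutorial uses 10 neighbours; the toy example uses 2 because it has only four rows.

```python
import numpy as np

def knn_impute(matrix, k=10):
    """Fill NaNs in each row with the mean of the k nearest rows.

    Distance between two rows is the root-mean-square difference
    over the columns observed in both rows.
    """
    matrix = np.asarray(matrix, dtype=float)
    filled = matrix.copy()
    missing = np.isnan(matrix)
    for i in np.where(missing.any(axis=1))[0]:
        distances = np.full(matrix.shape[0], np.inf)
        for j in range(matrix.shape[0]):
            if j == i:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                diff = matrix[i, shared] - matrix[j, shared]
                distances[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(distances)
        for col in np.where(missing[i])[0]:
            # Donors: nearest rows that actually have this column observed.
            donors = [j for j in order
                      if np.isfinite(distances[j]) and not missing[j, col]][:k]
            if donors:
                filled[i, col] = matrix[donors, col].mean()
    return filled

# Hypothetical toy matrix: rows = genes, columns = cell lines.
genes_by_lines = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [1.0, 2.0, 3.2],
    [9.0, 9.0, 9.0],
])

imputed = knn_impute(genes_by_lines, k=2)
# Transpose so that cell lines become rows and genes become columns.
lines_by_genes = imputed.T
print(imputed[1, 2])
```

After the transpose, each cell line is one observation and each gene is one variable, which matches the "orientation by genes" used for the PCA.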
Component: Ordination
• Analysis Options:
– Principal Components
– Save 5 eigenvalues
– Output: Graphs of Eigenvalues
– Output: Summary and Eigenvalues, Arrays and
Genes Co-ordinates
• >> Output Display
– Examine outputs…
• Save the rows (cell lines) co-ordinates (keep top 5
eigenvalues) on the local hard drive (using original
column annotations as row annotations here).
• Import it into Excel (paste in the original column annotations
(cell lines)).
• Save this file as tab-delimited and upload it again.
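For readers who prefer to script this step instead of going through a spreadsheet, the sketch below projects rows onto the top 5 principal components and writes them as tab-delimited text with the cell-line names as row annotations. The data, the cell-line names, and the helper `project_top_components` are hypothetical stand-ins; an in-memory buffer is used in place of a file on disk.

```python
import csv
import io
import numpy as np

def project_top_components(data, n_components=5):
    """Return each row's co-ordinates on the top principal components."""
    centered = data - data.mean(axis=0)
    # SVD of the centred data gives the principal axes as rows of vt,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for the transposed matrix: 8 cell lines x 20 genes.
rng = np.random.default_rng(1)
cell_lines = [f"XX:Line{i}" for i in range(8)]
data = rng.normal(size=(8, 20))

coords = project_top_components(data, n_components=5)

buffer = io.StringIO()  # stands in for a file opened for writing
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["CellLine"] + [f"Comp{i + 1}" for i in range(coords.shape[1])])
for name, row in zip(cell_lines, coords):
    writer.writerow([name] + [f"{value:.4f}" for value in row])

print(buffer.getvalue().splitlines()[0])
```

The resulting table has one row per cell line and one column per retained component, which is the shape expected by the clustering steps that follow.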
Component: Hierarchical Clustering
• Cluster the cell lines (in the 5
component space now)
– Euclidean Distance
– Average Linkage
• >> Output Display
– How many clusters can you see?
– Try to zoom in
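What average linkage with Euclidean distance does can be sketched with a deliberately naive implementation. It is written for clarity on a handful of made-up points, not for speed, and is not EP:NG's code; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would normally be used.

```python
import numpy as np

def average_linkage(points, n_clusters):
    """Naive agglomerative clustering: Euclidean distance, average linkage.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance between their members until n_clusters remain.
    Returns a list of clusters, each a sorted list of row indices.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all rows.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical points standing in for cell lines in component space.
toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.1, 5.0],
    [9.0, 0.0],
])

clusters = average_linkage(toy, 3)
print(sorted(clusters))  # [[0, 1, 2], [3, 4], [5]]
```

Stopping the merging at different cluster counts corresponds to cutting the dendrogram at different heights, which is what the visual inspection above asks you to judge.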
Components: K-Groups Clustering, Clustering Comparison
• Cluster the tumour cell lines in the
component space with the K-means algorithm
several times… try K=10, K=6, K=5, K=4
• Run Clustering Comparison several times
– Which K seems most fitting?
– An automated method for this process is being
developed
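Trying several values of K and comparing the resulting partitions can be sketched as follows. Two assumptions are worth stating loudly: the K-means here is a plain Lloyd's algorithm written for illustration, and the Rand index is swapped in as a generic agreement measure; it is not the method used by EP:NG's Clustering Comparison component. The points and the reference partition are made up, with the reference standing in for a second clustering such as a cut of the hierarchical tree.

```python
import numpy as np
from itertools import combinations

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wss = float(((points - centroids[labels]) ** 2).sum())
    return labels, wss

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical data: 4 groups of 10 points in a 2-D component space.
rng = np.random.default_rng(42)
centres = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
points = np.vstack([c + 0.3 * rng.normal(size=(10, 2)) for c in centres])
reference = np.repeat(np.arange(4), 10)  # stand-in for a second clustering

results = {}
for k in (10, 6, 5, 4):
    # Several restarts per K; keep the run with the lowest sum of squares.
    results[k] = min((kmeans(points, k, seed=s) for s in range(5)),
                     key=lambda run: run[1])
    labels, wss = results[k]
    print(k, round(wss, 2), round(rand_index(labels, reference), 3))
```

A K whose partition agrees well with the reference clustering, without a large jump in within-cluster sum of squares, is a reasonable candidate; the final choice still benefits from the visual inspection described above.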
Obtaining genes strongly correlated with components
• From the PCA results screen, import the
columns (genes) co-ordinates into Excel
– Sort (Ascending) on the first component
column (Comp1)
– What are the top genes there?
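The same sort can be done outside a spreadsheet. The sketch below uses made-up gene names and co-ordinates; with an ascending sort, the most negative Comp1 values come first and the most positive come last, so both ends of the list are worth inspecting.

```python
# Hypothetical gene co-ordinates: (gene name, Comp1, Comp2).
gene_coords = [
    ("geneA", 0.42, -0.10),
    ("geneB", -0.87, 0.33),
    ("geneC", 0.05, 0.91),
    ("geneD", -0.31, -0.58),
]

# Ascending sort on the first component column (Comp1).
by_comp1 = sorted(gene_coords, key=lambda row: row[1])

most_negative = [name for name, *_ in by_comp1[:2]]
most_positive = [name for name, *_ in by_comp1[-2:]]

print(most_negative)  # ['geneB', 'geneD']
print(most_positive)  # ['geneC', 'geneA']
```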
Acknowledgements
Original EP Development:
Clustering Comparison:
• Jaak Vilo (Tartu)
• Aurora Torrente
• Patrick Kemmeren (Utrecht)
• Christine Körner (Leipzig)
• Misha Kapushesky
PCA/COA/BGA:
EP:NG Framework Development:
• Patrick Kemmeren (Utrecht)
• Misha Kapushesky
Visualization Components
(under development):
• Steffen Durinck (Leuven)
• Aedín Culhane (Cork)
Gene Ordering:
• Karlis Freivalds (Riga)
Normalization
(under development):
• Tom Bogaert (Leuven)
Discussions:
• EBI Microarray Informatics Team
• Contributors from the open source community
EP:NG is an open source project – if you are interested in
contributing, testing or just discussing ideas, let us know!