Gene Expression Analysis Using Microarrays
Dr Mushtaq Ahmed
Technology Incubation Division
Persistent Systems Private Ltd
Pune
Persistent Systems Pvt. Ltd.
http://www.persistent.co.in
Topics
1. Introduction
2. Data Storage and Exchange Standards
3. Analysis (Clustering)
4. Conclusion and References
1. Introduction
• Structure Activity Relationship
• Structural vs. Functional Genomics
• Principles of a Microarray Experiment
• Applications
Structure Activity Relationship
[Diagram. Labels: GENES (finite); PROTEINS; FUNCTIONS (infinite); EXPERIMENTAL SETUP; Structural Genomics or Prediction Work; Functional Genomics or Confirmation Work]
Source: Yale Bioinformatics
Principles of a Microarray Experiment: Hybridization
1. Environment → Functions → Proteins → mRNA → cDNA
2. Different incubations of cells result in up- or down-regulation of different sets of genes.
3. A microarray provides a medium for matching known and unknown DNA samples based on base-pairing rules, and for automating the identification of the unknowns.
4. The set of expressed genes (at the mRNA stage) is isolated and identified using hybridization on a microarray chip.
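The base-pairing match in point 3 can be illustrated with a minimal sketch. It assumes an idealized model in which a probe captures a target only on perfect Watson-Crick complementarity; real hybridization tolerates mismatches and depends on temperature and sequence length. The gene names and sequences are made up for illustration.

```python
# Watson-Crick base-pairing rules for DNA
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the strand that pairs with `seq`, read in the 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def hybridizes(probe, target):
    """Idealized check: perfect complementarity only (an assumption)."""
    return target == reverse_complement(probe)

# Hypothetical probes fixed on the chip, keyed by made-up gene names
probes = {"geneA": "ATGCGT", "geneB": "TTGACC"}
# Hypothetical labeled target derived from the sample
target = "ACGCAT"

# The unknown target is identified by the known probe it binds to
detected = [name for name, probe in probes.items() if hybridizes(probe, target)]
print(detected)
```

Because every probe position on the chip is known, reading which positions captured labeled target identifies the unknowns in one pass.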
HTS Using Hybridization
• Microarray chip
– Probe: oligos/cDNA (gene templates)
– Target: cDNA from samples (variables to be detected)
• Probe + Target: Hybridization
• Analysis of outcome
– Pathways
– Targets/Leads
– Disease Class.
– Functional Annotation
– Physiological states
Timeline for drug discovery
• Discovery (5 yrs): 5000
– Gene expression study
• Pre-Clinical (1 yr): 50
• Clinical (6 yrs): 5
• Review (2 yrs): 1
• Marketed
2. Data Storage and Exchange Standards
• Raw and Processed Data
• Conceptual View of Database
• Example of ArrayExpress
• Issues
• Standardization for Exchange
Raw data – images
[Figure: cDNA plotted microarray]
• Red (Cy5) dot
– overexpressed or up-regulated
• Green (Cy3) dot
– underexpressed or down-regulated
• Yellow dot
– equally expressed
• Intensity - “absolute” level
• red/green - ratio of expression
– 2: 2x overexpressed
– 0.5: 2x underexpressed
• log2( red/green ) - “log ratio”
– 1: 2x overexpressed
– -1: 2x underexpressed
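The ratio and log-ratio values above can be reproduced with a small sketch. The spot names and intensities are hypothetical; background correction, which real pipelines apply before taking ratios, is left out.

```python
import math

def expression_ratio(red, green):
    """Cy5/Cy3 intensity ratio for one spot."""
    return red / green

def log_ratio(red, green):
    """log2 of the ratio: symmetric around 0 for over- and underexpression."""
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots
spots = {
    "spot1": (2000.0, 1000.0),  # red dominates
    "spot2": (500.0, 1000.0),   # green dominates
    "spot3": (800.0, 800.0),    # equal, appears yellow
}

for name, (red, green) in spots.items():
    print(name, expression_ratio(red, green), log_ratio(red, green))
```

The log ratio is often preferred because 2x over- and 2x underexpression sit at equal distance from zero, whereas the plain ratio squeezes all underexpression into the interval between 0 and 1.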
Microarray Expression Value Representation
[Diagram. Labels: expression value types; composite spots; primary spots; primary measurements; primary images; derived values; composite images; e.g., green/red ratios]
Source: MGED
Gene expression database – a conceptual view
[Diagram. A gene expression matrix of gene expression levels, with Genes along one axis and Samples along the other, linked to Gene annotations and Sample annotations]
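The conceptual view can be sketched as a small data structure. This is only an illustration, not the schema of any actual database: the class name, the field names, the choice of genes as rows, and all example values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ExpressionDataset:
    genes: list      # row labels
    samples: list    # column labels
    matrix: list     # matrix[i][j] = level of genes[i] in samples[j]
    gene_annotations: dict = field(default_factory=dict)
    sample_annotations: dict = field(default_factory=dict)

    def level(self, gene, sample):
        """Expression level of one gene in one sample."""
        return self.matrix[self.genes.index(gene)][self.samples.index(sample)]

    def profile(self, gene):
        """Expression of one gene across all samples."""
        return dict(zip(self.samples, self.matrix[self.genes.index(gene)]))

# Hypothetical log-ratio values for two genes in three samples
data = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["s1", "s2", "s3"],
    matrix=[[1.0, 0.2, -0.5],
            [-1.0, 0.0, 0.7]],
    gene_annotations={"geneA": "hypothetical kinase"},
    sample_annotations={"s1": "hypothetical treated culture"},
)
print(data.level("geneA", "s1"))
print(data.profile("geneB"))
```

Keeping the annotations apart from the numeric matrix lets the matrix go straight into analysis, while the annotations are what make a resulting cluster interpretable.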
DAG Representation of Biomaterials
[Diagram. Nodes: Sample source; A new state of sample source; Primary sample 1; Primary sample 2; Derived sample 1; Derived sample 2; Extract 1; Extract 2; Labeled extract 1; Labeled extract 2; Hybridization. Edge labels: treatment; extraction; labeling]
Source: MGED
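A sample history of this kind can be sketched as a directed acyclic graph in code. The node names come from the slide, but the edges chosen here are an assumption made for illustration, since the extracted slide text does not preserve which node links to which. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# node -> set of nodes it was derived from (edges are illustrative assumptions)
derived_from = {
    "Primary sample 1": {"Sample source"},
    "Primary sample 2": {"Sample source"},
    "Extract 1": {"Primary sample 1"},          # extraction
    "Extract 2": {"Primary sample 2"},          # extraction
    "Labeled extract 1": {"Extract 1"},         # labeling
    "Labeled extract 2": {"Extract 2"},         # labeling
    "Hybridization": {"Labeled extract 1", "Labeled extract 2"},
}

# A valid processing order: every node appears after everything it depends on
order = list(TopologicalSorter(derived_from).static_order())

def ancestors(node):
    """All upstream materials a node was derived from, directly or indirectly."""
    seen = set()
    stack = list(derived_from.get(node, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(derived_from.get(parent, ()))
    return seen

print(order)
print(sorted(ancestors("Hybridization")))
```

A graph is needed rather than a tree because one hybridization combines two labeled extracts, so a node can have more than one parent.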
ArrayExpress (MGED) Design
[Diagram. ArrayExpress entities: Experiment; Hybridization; Array; Sample; ExpressionValue. External links: Reference (e.g., publication, web resource); Database (e.g., gene in SWISS-PROT); Ontology (e.g., organism taxonomy)]
Source: MGED
ArrayExpress (MGED) Architecture
[Diagram. Labels: application server; Web server; MAML data; ArrayExpress; data submission & Curation database; Curation pipeline; image server?; data warehouse]
Source: MGED
Issues in Storage
• Size of Data
– Experiments
• 100 000 genes, 320 cell types
• 2000 compounds, 3 time points, 2 concentrations, 2 replicates
– Data
• 8 x 10^11 data-points
• 1 x 10^15 bytes = 1 petabyte of data
• Others
– Raw data are images
– lack of standard measurement units for gene expression
– lack of standards for sample annotation
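The data-point estimate follows from multiplying the experiment dimensions listed above, as the sketch below shows. The slide does not say how the petabyte figure is derived, so the bytes-per-point value computed here is only what the two quoted numbers imply together.

```python
# Experiment dimensions as quoted on the slide
genes = 100_000
cell_types = 320
compounds = 2000
time_points = 3
concentrations = 2
replicates = 2

data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)
print(data_points)  # 768000000000, which the slide rounds to 8 x 10^11

# Storage per data point implied by the quoted 1 petabyte (10^15 bytes)
implied_bytes_per_point = 1e15 / data_points
print(round(implied_bytes_per_point))
```

Roughly 1.3 kB per data point is far more than a single number requires, which fits the point that raw data are images, although that reading is an inference and is not stated in the source.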
Standardization
• MIAME (Minimum Information About a Microarray Experiment)
– Experimental design, Array design
– Samples, Hybridisations
– Measurements, Controls
• OMG-LSR-DTF
– Object Management Group, Life Sciences Research Domain Task Force: Gene Expression RFP
– Submitters: EBI (MAML), Rosetta (GEML), NetGenics
• Proposed MAGEML (MAML + GEML)
– Annotations + data; data stored as a set of external 2D matrices
– Data format independent of any particular scanner or image analysis software
– Sample and treatment can be represented as a Directed Acyclic Graph
– Concept of composite images and composite spots
3. Data Analysis (Clustering)
• Normalization
• Hierarchical Clustering
• Divisive Clustering
• Other Methods
• Visual Tools
Normalization
• Assumptions
– Average expression ratio = 1
– The amount of mRNA from both samples is the same
• Total Intensity
– Calculate a factor to rescale the intensities of all the genes so that
• total Cy3 = total Cy5
• Regression Techniques
– Adjust the intensities so that
• slope of the scatter plot of Cy3 vs Cy5 = 1
• Using ratio statistics
– A probability density of the expression ratio is derived from the expression of ‘housekeeping genes’ and used for normalization
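The total-intensity and regression approaches above can be sketched as follows. The intensities are hypothetical, and the regression variant fits a line through the origin, which is a simplifying assumption made here; published methods often fit on log intensities or use local regression.

```python
def total_intensity_normalize(cy3, cy5):
    """Rescale Cy5 so that total Cy5 equals total Cy3."""
    factor = sum(cy3) / sum(cy5)
    return [v * factor for v in cy5], factor

def slope_through_origin(x, y):
    """Least-squares slope of y against x for a line through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def regression_normalize(cy3, cy5):
    """Rescale Cy5 so that the fitted slope of Cy5 vs Cy3 becomes 1."""
    slope = slope_through_origin(cy3, cy5)
    return [v / slope for v in cy5], slope

# Hypothetical intensities: the Cy5 channel reads 1.5x brighter overall
cy3 = [100.0, 200.0, 400.0, 800.0]
cy5 = [150.0, 300.0, 600.0, 1200.0]

cy5_total, factor = total_intensity_normalize(cy3, cy5)
cy5_reg, slope = regression_normalize(cy3, cy5)
print(factor, sum(cy5_total), sum(cy3))
print(slope, slope_through_origin(cy3, cy5_reg))
```

Both variants rely on the stated assumption that most genes are not differentially expressed, so a global brightness difference between channels is treated as a measurement artifact.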
Clustering
• Hierarchical
– Single, Complete and Average Linkage
• Divisive
– K-means
– Self Organizing Maps (SOM)
• Others
– Principal Component Analysis (PCA)
– Supervised Methods
Hierarchical clustering
• Distance metrics or Similarity Measures
– Euclidean, Pearson, distance of slopes, etc.
• Cost functions
– Single Linkage
• Min distance of any two members (one from each of the two clusters)
– Complete Linkage
• Max distance of any two members (one from each of the two clusters)
– Average Linkage
• UPGMA
• WPGMA
• Within Groups
– Ward’s Method
• Join the pair of clusters that produces the smallest possible increase in the sum of squared errors
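The distance measures and linkage rules above can be sketched as a small agglomerative procedure. The gene names and profiles are hypothetical, only single, complete and average (UPGMA) linkage are covered, and the brute-force search is written for readability, not for speed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles have the same shape."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

# How member-to-member distances are combined into a cluster distance
LINKAGES = {
    "single": min,                          # closest pair of members
    "complete": max,                        # farthest pair of members
    "average": lambda d: sum(d) / len(d),   # UPGMA: mean over all pairs
}

def agglomerate(profiles, k, linkage="average", dist=euclidean):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[name] for name in profiles]
    combine = LINKAGES[linkage]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_d = [dist(profiles[a], profiles[b])
                          for a in clusters[i] for b in clusters[j]]
                d = combine(pair_d)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles over three conditions
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.2],
    "g3": [3.0, 2.0, 1.0],
    "g4": [3.2, 1.9, 0.9],
}
print(agglomerate(profiles, 2, linkage="single"))
print(pearson_distance(profiles["g1"], profiles["g3"]))
```

Recording the order of the merges instead of stopping at k clusters would give the tree (dendrogram) that hierarchical clustering is usually drawn as.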
Divisive clustering
• K-means
– ‘k’ random (or specified) points are used to create clusters; the average vectors of the clusters are then used iteratively
– Knowledge of the probable number of clusters (k) is needed
– Used in combination with PCA and hierarchical clustering
• Self Organizing Maps
– User-defined geometric configurations as partitions
– Random vectors are generated for each partition and trained until convergence (ANN based)
• Visualization Methods
– Help in cluster visualization
• Scatter plot, web plot, histogram
– May help in clustering itself
• E.g., SuperGrouper utility of maxdView
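The K-means iteration described above can be sketched in a few lines. The starting points are specified rather than random so that the run is repeatable, and the data points are hypothetical two-dimensional profiles.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [list(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid becomes the average vector of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
          [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
# k = 2, with the starting points specified by hand
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

The result depends on k and on the starting points, which is one reason K-means is used together with PCA or hierarchical clustering to pick sensible values.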
Other Clustering Methods
• PCA (Principal Component Analysis)
– Closely related to SVD (Singular Value Decomposition), which is commonly used to compute it
– Reduces the dimensionality of the gene expression space
– Finds the best view that helps separate data into groups
• Supervised Methods
– SVM (Support Vector Machine)
– Previous knowledge of which genes are expected to cluster is used for training
– Binary classifier uses a ‘feature space’ and a ‘kernel function’ to define an optimal ‘hyperplane’
– Also used for classification of samples: ‘expression fingerprinting’ for disease classification
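The dimensionality reduction done by PCA can be sketched without any numerical library. This version finds only the first principal component, and it uses power iteration on the covariance matrix in place of an SVD routine; that substitution is made here for brevity. The data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the direction of greatest variance and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the component: d values become 1
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical measurements: two strongly correlated variables
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
component, scores = first_principal_component(rows)
print(component)
print(scores)
```

Here a single score per row keeps almost all of the variation, which is the "best view" idea: groups that separate along the leading components can be seen without looking at every dimension.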
4. Conclusion and References
• Microarrays make HTS with hybridization possible
• No single standard unit for measuring expression levels
• Handling and interpretation are not yet exact
• Assumption: elements in a cluster must share some commonality
• Classification depends on the methods used for clustering, normalization, and the distance function
• No “correct” way of classification; “biological understanding” is the ultimate guide
• Provides an extension to existing knowledge (e.g., classifying a novel gene into a known pathway)
Software
• Databases
– Public repositories
• GEO (NCBI), GeneX (NCGR), ArrayExpress (EBI)
– In-house databases
• Stanford, MIT, University of Pennsylvania
– Organism-specific databases
• Mouse Genome Informatics Database
– Proprietary databases
• Gene Logic, NCI, Synergy (NetGenics), Genomics Knowledge Platform (Incyte)
• Analysis Tools
– Public Domain
• maxdView (University of Manchester)
• CyberT, RCluster interfaces of GeneX
– Proprietary
• Spotfire, Xpression NTI (InforMax Inc)
References
• Microarray Gene Expression Database Group
– http://www.mged.org
• National Center for Genome Resources
– http://genex.ncgr.org
• University of Manchester, Bioinformatics Group
– http://bioinf.man.ac.uk/microarray/resources.html
• Nature Reviews Genetics
– http://www.nature.com/nrg/