Download R packages

Figure omitted for copyright reasons
A printed version can be found at
Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis.
Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.
“ I think you should be more explicit here in step two”
~ Normal science consists largely of mopping-up operations. Experimentalists carry out
modified versions of experiments that have
been carried out many times before ~
Thomas S. Kuhn
The biologist's FAQ:
What is the best microarray
analysis software?
Different kinds of microarray
software
 Image analysis software
 Data mining software
– Statistics software
• R packages for microarray analysis
 SNP analysis software
 Database/ LIMS software
 Public Expression Database
 Primer design
 Software for further data mining: annotation, promoter analysis & pathway reconstruction
Software not discussed today
 Hardware control software
– Arrayer control – ArrayMaker
– Scanner control/ image acquisition
Statistics on current microarray
software
Category                 28 Feb 2002   Jan 2001
Image analysis           17            17
Data mining, R packages  39
SNP analysis             14            1
Database/ LIMS           14            4
Public Database          16            8
Accessory                8             -
Further data mining      9             -
Total                    116           29
* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
Image analysis software
 Spot recognition
 Segmentation
– Foreground calculation
– Background calculation
 Spot quality measures
Major image analysis software
 AIDA array
 ArrayPro
 ArrayVision
 Dapple
 F-scan
 GenePix Pro 3.0.5
 ImaGene 4.0
 Iconoclust
 Iplab
 Lucidea Automated Spotfinder
 Phoretix Array3
 P-scan
 QuantArray 3.0
 ScanAlyze 2
 Spot
 TIGR Spotfinder
 UCSF Spot
Examples of commonly used image
analysis software
 ScanAlyze 2 (Mike Eisen, LBNL)
 GenePix Pro 3.0.5 (Axon Instruments)
 QuantArray 3.0 (Packard Instrument)
 ImaGene 4.0 (Biodiscovery)
Spot recognition
 ArrayPro from Media Cybernetics
 Automated, fast algorithms for grid, subgrid and spot finding
Segmentation
 Purpose – classification between foreground and background
– Fixed circle
– Adaptive circle
– Adaptive shape
– Histogram method
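The simplest of these methods, the fixed circle, can be sketched in a few lines. This is only an illustration, assuming the spot centre and radius are already known from spot recognition; the function names, the toy 5x5 patch, and the choice of mean foreground minus median background are assumptions, not the method of any package listed here.

```python
import math
from statistics import mean, median


def fixed_circle_segment(image, cx, cy, radius):
    """Split the pixels of a spot patch into foreground (inside a circle of
    fixed radius around the spot centre) and background (everything else)."""
    foreground, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if math.hypot(x - cx, y - cy) <= radius:
                foreground.append(value)
            else:
                background.append(value)
    return foreground, background


def spot_signal(image, cx, cy, radius):
    """Background-corrected intensity: mean foreground minus median background
    (one common choice, assumed here for illustration)."""
    fg, bg = fixed_circle_segment(image, cx, cy, radius)
    return mean(fg) - median(bg)


# Hypothetical 5x5 patch with a bright spot in the middle.
patch = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 100, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
fg, bg = fixed_circle_segment(patch, cx=2, cy=2, radius=1)
signal = spot_signal(patch, cx=2, cy=2, radius=1)
```

Adaptive circle and adaptive shape methods differ mainly in how the foreground mask is chosen; the foreground and background summaries can stay the same.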
Segmentation
 Using an extra dye (DAPI) avoids assumptions about spot morphology
UCSF Spot
Spot quality measure
 E.g. QuantArray 3.0
– Diameter
– Spot Area
– Footprint
– Circularity
– Spot Signal/Noise
– Spot Uniformity
– Background Uniformity
– Replicate Uniformity
 Problem: spot quality lacks a rigorous definition
and experimental verification
Future Image analysis software
 Rigorous definition of quality measures
 Extra dye for better segmentation
 Automated analysis
Data mining software
 Main purposes
1. Filtering and normalization
2. Statistical inference of differentially
expressed genes
3. Identification of biologically meaningful
patterns, i.e. expression profile; expression
fingerprint/ signature
4. Visualization
5. Other analyses, such as pathway reconstruction, etc.
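As a rough illustration of purpose 1 (filtering and normalization), the sketch below filters out weak spots, takes log2 ratios of the two channels, and centres them on the array median. The intensity cutoff, the function names and the example values are all assumptions; real packages offer many more options.

```python
import math
from statistics import median


def log_ratios(red, green, min_intensity=1.0):
    """log2(red/green) per spot; spots below min_intensity in either channel
    are filtered out and reported as None."""
    ratios = []
    for r, g in zip(red, green):
        if r < min_intensity or g < min_intensity:
            ratios.append(None)
        else:
            ratios.append(math.log2(r / g))
    return ratios


def median_center(values):
    """Global normalization: subtract the median of the retained log ratios."""
    kept = [v for v in values if v is not None]
    centre = median(kept)
    return [None if v is None else v - centre for v in values]


# Hypothetical background-corrected intensities for four spots.
red = [200.0, 400.0, 800.0, 0.5]
green = [100.0, 200.0, 100.0, 50.0]

raw = log_ratios(red, green)
normalized = median_center(raw)
```

Median centring assumes that most genes are not differentially expressed, so the typical log ratio should be zero after normalization.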
Different categories
 Turnkey system
 Comprehensive software
 Specific analysis software
 Extension/ accessory of other software
Major data mining software
 AIDA Array
 AMADA
 ANOVA program for microarray data
 ArrayMiner
 arraySCOUT
 ArrayStat
 BRB ArrayTools
 CHIPSpace
 Cleaver
 CIT
 CLUSFAVOR
 Cluster
 Cyber T
 DNA-arrays analysis tools
 dchip
 Expression Profiler
 Expressionist
 Freeview & FreeOView
 Gene Cluster
 GeneLinker Gold
 GeneMaths
 GeneSight
 GeneSpring
 Genesis
 Genetraffic
 J-Express
 MAExplorer
 Partek
 R cluster
 Rosetta Resolver
 SAM
 SpotFire Decision Site
 SNOMAD
 TIGR ArrayViewer
 TIGR Multiple Experiment Viewer
 TreeView
 Xcluster
 Xpression NTI
Turnkey system
 Definition: A computer system that has been
customized for a particular application. The term
derives from the idea that the end user can just
turn a key and the system is ready to go.
 For microarray, this includes everything from OS,
server software, database, client software,
statistics software and even hardware
 Examples
– Genetraffic (Iobion)
• Uses open-source software: Linux, the R statistical
language, PostgreSQL, and the Apache web server
– Rosetta resolver (Rosetta Biosoftware)
• Sun Fire server and drive array, Oracle 8i, Rosetta server and
client side software
Turnkey system
 Advantages
– performance
– Security
– Support multiple users
– Incorporate the experiment and data standards in design
 Disadvantages
– Expensive
– Not suitable for small labs
– Require dedicated supporting staff
– Closed system
Comprehensive software
 Definition: Software that incorporates many
different analyses for different stages in a
single package.
 Examples
– Cluster (Mike Eisen, LBNL)
– GeneMaths (Applied Maths)
– GeneSight (Biodiscovery)
– GeneSpring (Silicon Genetics)
Comprehensive software
 Cluster
– Filter data
– Adjust data: normalization, log transform, etc.
– Clustering
– Self-Organizing Maps (SOMs)
– Principal Component Analysis (PCA)
 GeneSpring
– Plus promoter analysis
– Gene annotation with public database information
– Scripting tools
– Access to Open DataBase Connectivity (ODBC) databases
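The clustering step in tools of this kind usually starts from a similarity measure between expression profiles. The sketch below assumes a Pearson-correlation distance and finds the closest pair of genes, which would be the first merge of an agglomerative hierarchical clustering. The gene names and values are hypothetical, and building the full tree is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def closest_pair(profiles):
    """Return (distance, gene_a, gene_b) for the pair with the smallest
    correlation distance, defined here as 1 - Pearson correlation."""
    names = sorted(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1.0 - pearson(profiles[a], profiles[b])
            if best is None or distance < best[0]:
                best = (distance, a, b)
    return best


# Hypothetical log-ratio profiles over four arrays.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
first_merge = closest_pair(profiles)
```

With this distance, geneA and geneB (similar shape, different scale) are merged first, while geneC, which is anti-correlated with geneA, is the most distant.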
Comprehensive software
 GeneMaths
– Plus bootstrap analysis for clustering
– Fast clustering algorithms
– Access to Open DataBase Connectivity (ODBC) databases
 GeneSight
– Plus confidence analysis for replicated data
– Statistical analysis for significant genes
– Graphical data set builder
Comprehensive software
 Advantages
– Standardized operation
– Generate various analyses easily
– Shorter learning curve for biologists
– Script language for automated process control
– Some brilliant ideas or analysis within
particular software
– “False” Sense of security?
Comprehensive software
 Disadvantages
– Inflexible to the latest analysis developments
– Generate various analyses too easily
– Implicit data analysis/ statistics background and
definitions
– Proprietary script language
– Data compatibility with other software
– Necessity to design and maintain your own database
– Commercial software can be expensive!
– Particular analyses added for marketing
purposes, extra spending on unnecessary functions
– Sometimes only available on a few computing platforms
Specific analysis software
 Definition: Software that performs one or a few
specific analyses
 Examples
– GeneCluster (Whitehead Institute Center
for Genome Research)
– INCLUSive - INtegrated CLustering, Upstream
Sequence retrieval and motif Sampler
(Katholieke Universiteit Leuven)
– SAM – Significance Analysis of Microarrays
(Stanford University)
Specific analysis software
 GeneCluster – performs normalization,
filtering and SOM
Specific analysis software
 INCLUSive - INtegrated CLustering,
Upstream Sequence retrieval and motif
Sampler
 SAM – finds statistically significant
differentially expressed genes
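To give a feel for what a tool like SAM computes per gene, the sketch below assumes a moderated t-like score: the difference in group means divided by a pooled standard error plus a small constant s0. The function name, the s0 values and the data are illustrative assumptions, and the permutation step used to judge significance is not shown.

```python
import math
from statistics import mean


def relative_difference(group_a, group_b, s0=0.0):
    """Difference of group means divided by (pooled standard error + s0).
    A positive s0 damps large scores that arise only from tiny variances."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    sum_sq = (sum((x - mean_a) ** 2 for x in group_a)
              + sum((x - mean_b) ** 2 for x in group_b))
    pooled_se = math.sqrt((1.0 / n_a + 1.0 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (pooled_se + s0)


# Hypothetical log expression values for one gene in two conditions.
treated = [5.0, 6.0, 7.0]
control = [1.0, 2.0, 3.0]

d_plain = relative_difference(treated, control)             # ordinary t-like score
d_moderated = relative_difference(treated, control, s0=1.0)  # damped score
```

With s0 = 0 the score reduces to an ordinary two-sample t statistic; how s0 is chosen in practice is outside this sketch.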
Specific analysis software
 Advantages
– Better statistical background reference, usually
with literature support
 Disadvantages
– Non-standardized environment – Java, web,
Excel, etc.
– Data compatibility problem
– Data preprocessing problem
Extension/ accessory of other
software
 Definition: extension of other software’s
capability
 Examples:
– Freeview: Visualization and Optimization of
Gene Clustering Dendrograms for Cluster
– ArrayMiner: extension of GeneSpring
Statistics software
 Excel
 MATLAB
 Octave
 SAS
 SPSS
 S-PLUS
 Statistica
 R
Statistics software
 Advantages
– Highly flexible
– High level, multivariate analyses are either
standard or easily programmable
 Disadvantages
– Usually command line driven, impossible to
learn intuitively (a disadvantage??)
– Require a much better understanding of the
statistical data analysis to follow the steps (a
disadvantage??)
R-packages
 A language and environment for statistical
computing and graphics.
 Highly compatible with S/ S-PLUS
 Open source under the GNU General Public
Licence
 Runs on many UNIX, Linux, Windows
and MacOS platforms
 A growing number of microarray
analysis packages are written in R
R-packages
 Dedicated to microarray analysis
– affy
– Bioconductor
– SMA extension
– Cyber T
– GeneSOM
– Permax
– OOMAL (S-Plus)
– SMA
– YASMA
 General packages
– cclust
– cluster
– mclust
– multiv
– mva
– …etc!
R-packages
 SMA - Statistical Microarray Analysis
(Terry Speed, UC Berkeley)
 Bioconductor
R-packages
 SMA
– Performs intensity- and spatial-dependent
normalization
– Replicated array data analysis by an empirical
Bayes approach
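Intensity-dependent normalization is usually described in terms of M (log ratio) and A (average log intensity) values. The sketch below computes both and then removes the intensity trend with a binned median, which is a deliberately crude stand-in for the smooth curve fit a package such as SMA would use. Function names, the bin count and the data are assumptions for illustration only.

```python
import math
from statistics import median


def ma_values(red, green):
    """M = log2(R/G) is the log ratio; A = 0.5 * log2(R*G) is the average
    log intensity of the spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def binned_median_normalize(m, a, n_bins=2):
    """Sort spots by A, split them into n_bins groups of similar intensity,
    and subtract each group's median M. A simplification of a smooth fit."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    bin_size = math.ceil(len(order) / n_bins)
    normalized = [0.0] * len(m)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        centre = median(m[i] for i in members)
        for i in members:
            normalized[i] = m[i] - centre
    return normalized


# Hypothetical intensities in which brighter spots show a larger red bias.
red = [200.0, 100.0, 1600.0, 6400.0]
green = [100.0, 100.0, 400.0, 1600.0]

m, a = ma_values(red, green)
m_normalized = binned_median_normalize(m, a, n_bins=2)
```

Spatial-dependent normalization follows the same idea but groups spots by print-tip or grid position instead of by intensity.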
R-packages
 Result for replicated data: output as a B vs M plot
R-packages
 Bioconductor
– open source software project to provide infrastructure
in terms of design and software to assist biologists and
statisticians for analysing genomic data, with primary
emphasis on inference using DNA microarrays
– Most software produced by the Bioconductor project
will be in the form of R libraries
• Variation 1: provide basic infrastructure support that will help
other developers produce high quality software
• Variation 2: provide innovative methodology for analyzing
genomic data
– Provide some form of graphical user interface
for selected libraries
– A mechanism for linking together different
groups with common goals
Future Data mining software
 Standardized, open-source (free) platform?
– EMBOSS - European Molecular Biology Open
Software Suite.
 More supervised analysis packages and
pathway prediction packages?
 Plugin modules
– J-express
– GeneSpring
Mutation analysis software
 Chip based SNP or chromosomal aberration
analysis (arrayCGH)
 Various forms of protocols, e.g. primer
extension, ligase chain reaction, MALDI-TOF MS, hybridization, etc.
 Result is in the form of base calling or
allelic imbalance
 Example – Genorama
Database
 Definition: large collection of data organized
especially for rapid search and retrieval
 Two categories
– Within laboratory/ institute database; LIMS
– Public expression database
 Standardized definition of data
– Minimum Information About a Microarray Experiment
(MIAME)
• Experimental design
• Array design
• Samples
• Hybridizations
• Measurements
• Normalization controls
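The six MIAME components listed above can be treated as a checklist when records are entered into a laboratory database. The sketch below is a minimal, hypothetical data structure: the class name, the free-text fields and the example entries are assumptions, and the actual MIAME specification asks for far more detail in each section.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MiameRecord:
    """One free-text field per MIAME component (a simplification)."""
    experimental_design: Optional[str] = None
    array_design: Optional[str] = None
    samples: Optional[str] = None
    hybridizations: Optional[str] = None
    measurements: Optional[str] = None
    normalization_controls: Optional[str] = None


def missing_sections(record):
    """Names of the components that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical, partly filled-in record.
record = MiameRecord(
    experimental_design="dye-swap, two replicates",
    array_design="hypothetical 4k cDNA array",
)
missing = missing_sections(record)
```

A check like this can run before data leave the lab, so that gaps are caught before submission to a public repository.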
Database/ LIMS software
 The database within your lab/ institute
 The quality of in house data management
will affect the quality of final public data
repository
 Database structure may be relatively
simple
Major Database/ LIMS software
 AMAD
 ARGUS
 ArrayDB
 ArrayInformatics
 Clonetracker
 GeNet
 Genetraffic
 GeneX
 MAD
 Maxd
 NOMAD
 Partisan Array LIMS
 Phoretix Array2 Database
 Rosetta Resolver
 SMD
Public Expression Database
 Necessities
– Provide raw data to validate published array
result and develop new analysis tools
– Further understanding of your data
– Compare among different groups, meta-data
mining
– Source for specialty array design
 Different categories
– Generic
– Species specific
– Disease specific
 The importance of data standardization
Major public gene expression
databases
 3D-GeneExpression Database
 ArrayExpress
 BodyMap
 ChipDB
 ExpressDB
 Gene Expression Omnibus (GEO)
 Gene Expression Database (GXD)
 Gene Resource Locator
 GeneX
 Human Gene Expression Index (HuGE Index)
 RIKEN cDNA Expression Array Database (READ)
 RNA Abundance Database (RAD)
 Saccharomyces Genome Database (SGD)
 Stanford Microarray Database (SMD)
 TissueInfo
 yeast Microarray Global Viewer (yMGV)
Primer/ probe design
 Array designer
 GAP (Genome-wide Automated Primer
finder servers)
 OligoArray
 Primer3
 ProbeWiz Server
Other useful software for further
data mining
 Data annotation
– DRAGON
– Gene Ontology
– PubGene
– Resourcerer
 Promoter analysis
– AlignACE
– INCLUSive
– MEME
– Sequence Logo
 Pathway reconstruction
– GenMAPP
– PathFinder
 Data annotation
– Link GI to a particular name
– Literature mining to infer network
 Network reconstruction
– Cluster + promoter analysis
– statistical inference from experimental data
Some suggestions for biologists who
are serious about microarray studies
 Communicate or even collaborate with
statisticians, mathematicians and
bioinformaticians
 Learn a high-level statistical language, e.g. R
 Learn programming, e.g. C
 Learn databases, e.g. SQL
 Learn Linux
 Revise your statistics, probability and maybe even
calculus
 Lucky…?!
Picture omitted for copyright reasons
Conclusion – the future
 A unified open environment for standard
analysis and development
 The best microarray analysis software?
~ Exploratory data analysis can never be the
whole story, but nothing else can serve as
the foundation stone -- as the first step. ~
John W. Tukey