Figure omitted for copyright reasons. A printed version can be found in Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis. Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.
“I think you should be more explicit here in step two”

~ Normal science consists largely of mopping-up operations. Experimentalists carry out modified versions of experiments that have been carried out many times before ~ Thomas S. Kuhn

The FAQ of biologists: What is the best microarray analysis software?

Different kinds of microarray software
Image analysis software
Data mining software
– Statistics software
• R packages for microarray analysis
SNPs analysis software
Database/ LIMS software
Public Expression Database
Primer design
Software for further data mining: annotation, promoter analysis & pathway reconstruction

Software not discussed today
Hardware control software
– Arrayer controlling
– ArrayMaker
– Scanner controlling/ Image acquisition

Statistics on current microarray software
28 Feb 2002 Jan 2001 Image analysis 17 17 Data mining 39 R packages SNP analysis 14 1 Database/ LIMS 14 4 Public Database 16 8 Accessory 8 - Further data mining 9 - 116 29 Total
* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Image analysis software
Spot recognition
Segmentation
– Foreground calculation
– Background calculation
Spot quality measures

Major image analysis software
AIDA array, ArrayPro, ArrayVision, Dapple, F-scan, GenePix Pro 3.0.5, ImaGene 4.0, Iconoclust, Iplab, Lucidea Automated Spotfinder, Phoretix Array3, P-scan, QuantArray 3.0, ScanAlyze 2, Spot, TIGR Spotfinder, UCSF Spot

Examples of commonly used image analysis software
ScanAlyze 2 (Mike Eisen, LBNL)
GenePix Pro 3.0.5 (Axon Instruments)
QuantArray 3.0 (Packard Instrument)
ImaGene 4.0 (Biodiscovery)

Spot recognition
ArrayPro from Media Cybernetics
Automated and fast grid, subgrid and spot finding algorithms

Segmentation
Purpose: classification between foreground and background
– Fixed circle
– Adaptive circle
– Adaptive shape
– Histogram method

Segmentation
Using extra dye
– DAPI, avoid morphology assumption
UCSF Spot

Spot quality measure
E.g. QuantArray 3.0
– Diameter
– Spot Area
– Footprint
– Circularity
– Spot Signal/Noise
– Spot Uniformity
– Background Uniformity
– Replicate Uniformity
Problem: lacking rigorous spot quality definition and experimental verification

Future image analysis software
Rigorous definition of quality measures
Extra dye for better segmentation
Automated analysis

Data mining software
Main purposes
1. Filtering and normalization
2. Statistical inference of differentially expressed genes
3. Identification of biologically meaningful patterns, i.e. expression profile; expression fingerprint/ signature
4. Visualization
5. Other analyses such as pathway reconstruction, etc.
Different categories
Turnkey system
Comprehensive software
Specific analysis software
Extension/ accessory of other software

Major data mining software
AIDA Array, AMADA, ANOVA program for microarray data, ArrayMiner, arraySCOUT, ArrayStat, BRB ArrayTools, CHIPSpace, Cleaver, CIT, CLUSFAVOR, Cluster, Cyber T, DNA-arrays analysis tools, dchip, Expression Profiler, Expressionist, Freeview & FreeOView, Gene Cluster, GeneLinker Gold, GeneMaths, GeneSight, GeneSpring, Genesis, Genetraffic, J-Express, MAExplorer, Partek, R cluster, Rosetta Resolver, SAM, SpotFire Decision Site, SNOMAD, TIGR ArrayViewer, TIGR Multiple Experiment Viewer, TreeView, Xcluster, Xpression NTI

Turnkey system
Definition: A computer system that has been customized for a particular application. The term derives from the idea that the end user can just turn a key and the system is ready to go.
For microarrays, this includes everything from the OS, server software, database, client software and statistics software to the hardware itself
Examples
– Genetraffic (Iobion)
• Uses open source software: Linux, the R statistical language, PostgreSQL, and the Apache web server
– Rosetta Resolver (Rosetta Biosoftware)
• Sun Fire server and drive array, Oracle 8i, Rosetta server and client side software

Turnkey system
Advantages
– Performance
– Security
– Supports multiple users
– Incorporates the experiment and data standards in the design
Disadvantages
– Expensive
– Not suitable for small labs
– Requires dedicated support staff
– Closed system

Comprehensive software
Definition: Software that incorporates many different analyses for different stages in a single package.
Examples
– Cluster (Mike Eisen, LBNL)
– GeneMaths (Applied Maths)
– GeneSight (Biodiscovery)
– GeneSpring (Silicon Genetics)

Comprehensive software
Cluster
– Filter data
– Adjust data: normalization, log transform, etc.
– Clustering
– Self-Organizing Maps (SOMs)
– Principal Component Analysis (PCA)
GeneSpring
– & Promoter analysis
– Gene annotation with public database information
– Scripting tools
– Access Open DataBase Connectivity (ODBC) databases

Comprehensive software
GeneMaths
– & Bootstrap analysis for clustering
– Fast clustering algorithms
– Access Open DataBase Connectivity (ODBC) databases
GeneSight
– & Confidence analysis for replicated data
– Statistical analysis for significant genes
– Graphical data set builder

Comprehensive software
Advantages
– Standardized operation
– Generates various analyses easily
– Shorter learning curve for biologists
– Script language for automated process control
– Some brilliant ideas or analyses within particular software
– “False” sense of security?
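
The data preparation steps listed for Cluster (filter data, log transform, normalization) can be illustrated with a minimal sketch. It is not the code of Cluster or of any other package named here. The intensity values, the gene names, the filter threshold of 1.0 and the choice of global median centering are all assumptions made for illustration. Pearson correlation is shown as one similarity measure commonly used when clustering expression profiles.

```python
import math
from statistics import median


def log2_ratio(red, green):
    """Log2 transform of a two-channel intensity ratio (red / green)."""
    return math.log2(red / green)


def median_center(values):
    """Global normalization: subtract the array-wide median log ratio."""
    m = median(values)
    return [v - m for v in values]


def keep_gene(profile, min_abs=1.0):
    """Filter: keep a gene whose |log2 ratio| reaches min_abs on any array."""
    return max(abs(v) for v in profile) >= min_abs


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one (red, green) intensity pair per gene, per array.
arrays = [
    {"geneA": (800, 200), "geneB": (210, 200), "geneC": (100, 400)},
    {"geneA": (1600, 200), "geneB": (190, 200), "geneC": (50, 400)},
    {"geneA": (400, 200), "geneB": (205, 200), "geneC": (200, 400)},
]
genes = sorted(arrays[0])

# Log transform and normalize each array separately.
normalized = []
for array in arrays:
    logs = [log2_ratio(*array[g]) for g in genes]
    normalized.append(dict(zip(genes, median_center(logs))))

# Build one profile per gene across arrays, then drop flat genes.
profiles = {g: [arr[g] for arr in normalized] for g in genes}
kept = {g: p for g, p in profiles.items() if keep_gene(p)}

# Similarity between the two remaining profiles (strongly anti-correlated).
similarity = pearson(kept["geneA"], kept["geneC"])
```

A clustering step would then group genes by this similarity. The filter is applied before computing correlations because a completely flat profile has zero variance and no defined correlation.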
Comprehensive software
Disadvantages
– Inflexible to the latest analysis developments
– Generates various analyses too easily
– Implicit data analysis/ statistics background and definitions
– Proprietary script language
– Data compatibility with other software
– Necessity to design and maintain your own database
– Commercial software can be expensive!
– Particular analyses added for marketing purposes; extra spending on unnecessary functions
– Sometimes only available on a few computing platforms

Specific analysis software
Definition: Software performing one or a few specific analyses
Examples
– GeneCluster (Whitehead Institute Centre for Genome Research)
– INCLUSive - INtegrated CLustering, Upstream Sequence retrieval and motif Sampler (Katholieke Universiteit Leuven)
– SAM – Significance Analysis of Microarrays (Stanford University)

Specific analysis software
GeneCluster
– performing normalization, filtering and SOM

Specific analysis software
INCLUSive - INtegrated CLustering, Upstream Sequence retrieval and motif Sampler
SAM
– finding statistically significant differentially expressed genes

Specific analysis software
Advantages
– Better statistical background reference, usually with literature support
Disadvantages
– Non-standardized environment: Java, web, Excel, etc.
– Data compatibility problem
– Data preprocessing problem

Extension/ accessory of other software
Definition: extension of other software’s capability
Examples:
– Freeview: Visualization and Optimization of Gene Clustering Dendrograms for Cluster
– ArrayMiner: extension of GeneSpring

Statistics software
Excel, MATLAB, Octave, SAS, SPSS, S-PLUS, Statistica, R

Statistics software
Advantages
– Highly flexible
– High level, multivariate analyses are either standard or easily programmable
Disadvantages
– Usually command line driven, impossible to learn intuitively (a disadvantage??)
– Requires a much better understanding of statistical data analysis to follow the steps (a disadvantage??)
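
As an illustration of what "easily programmable" can mean in a general-purpose environment, the sketch below ranks genes by a per-gene Welch t-statistic between two groups of replicates. It is an assumption-laden example, not the method used by SAM or any other package listed here: the expression values and gene names are hypothetical, and no multiple-testing correction is applied.

```python
import math
from statistics import mean, variance


def welch_t(treated, control):
    """Welch's t-statistic for one gene (unequal variances allowed)."""
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    if se == 0:
        raise ValueError("zero variance in both groups")
    return (mean(treated) - mean(control)) / se


def rank_genes(treated, control):
    """Return gene names sorted by |t| (largest first) and the t-scores."""
    scores = {g: welch_t(treated[g], control[g]) for g in treated}
    order = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return order, scores


# Hypothetical log2 expression values, three replicates per group.
control = {
    "geneA": [0.1, -0.2, 0.0],
    "geneB": [1.0, 1.2, 0.9],
    "geneC": [0.3, 0.1, 0.2],
}
treated = {
    "geneA": [2.1, 1.8, 2.4],
    "geneB": [1.1, 0.9, 1.0],
    "geneC": [-0.9, -1.1, -0.7],
}

order, scores = rank_genes(treated, control)
for gene in order:
    print(f"{gene}: t = {scores[gene]:.2f}")
```

With only a few replicates per gene the variance estimates are unstable, which is one reason dedicated packages use moderated statistics rather than a plain t-test.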
R-packages
A language and environment for statistical computing and graphics.
Highly compatible with S/ S-PLUS
Open source under the GNU General Public Licence
Runs on many UNIX/ Linux platforms, the Windows family and MacOS
There is a growing number of microarray analysis software packages written in R

R-packages
Dedicated for microarray analysis
– affy
– Bioconductor
– SMA extension
– Cyber T
– GeneSOM
– Permax
– OOMAL (S-Plus)
– SMA
– YASMA
General packages
– cclust
– cluster
– mclust
– multiv
– mva
– …etc!

R-packages
SMA - Statistical Microarray Analysis (Terry Speed, UC Berkeley)
Bioconductor

R-packages
SMA
– performs intensity and spatial dependent normalization
– Replicated array data analysis by an empirical Bayes approach

R-packages
Result of replicated data output
B vs M plot

R-packages
Bioconductor
– open source software project to provide infrastructure in terms of design and software to assist biologists and statisticians in analysing genomic data, with primary emphasis on inference using DNA microarrays
– Most software produced by the Bioconductor project will be in the form of R libraries
• Variation 1: provide basic infrastructure support that will help other developers produce high quality software
• Variation 2: provide innovative methodology for analyzing genomic data
– Provide some form of graphical user interface for selected libraries
– A mechanism for linking together different groups with common goals

Future data mining software
Standardized, open-source (free) platform?
– EMBOSS - European Molecular Biology Open Software Suite.
More supervised analysis packages and pathway prediction packages?
Plugin modules
– J-express
– GeneSpring

Mutation analysis software
Chip based SNP or chromosomal aberration analysis (arrayCGH)
Various forms of protocols, e.g. primer extension, ligase chain reaction, MALDI-TOF MS, hybridization, etc.
Result is in the form of base calling or allelic imbalance
Example
– genorama

Database
Definition: large collection of data organized especially for rapid search and retrieval
Two categories
– Within laboratory/ institute database; LIMS
– Public expression database
Standardized definition of data
– Minimum Information About a Microarray Experiment (MIAME)
• Experimental design
• Array design
• Samples
• Hybridizations
• Measurements
• Normalization controls

Database/ LIMS software
The database within your lab/ institute
The quality of in-house data management will affect the quality of the final public data repository
Database structure may be relatively simple

Major Database/ LIMS software
AMAD, ARGUS, ArrayDB, ArrayInformatics, Clonetracker, GeNet, Genetraffic, GeneX, MAD, Maxd, NOMAD, Partisan Array LIMS, Phoretix Array2 Database, Rosetta Resolver, SMD

Public Expression Database
Necessities
– Provide raw data to validate published array results and develop new analysis tools
– Further understanding of your data
– Compare among different groups, meta-data mining
– Source for specialty array design
Different categories
– Generic
– Species specific
– Disease specific
The importance of data standardization

Major public gene expression databases
3D-GeneExpression Database
ArrayExpress
BodyMap
ChipDB
ExpressDB
Gene Expression Omnibus (GEO)
Gene Expression Database (GXD)
Gene Resource Locator
GeneX
Human Gene Expression Index (HuGE Index)
RIKEN cDNA Expression Array Database (READ)
RNA Abundance Database (RAD)
Saccharomyces Genome Database (SGD)
Stanford Microarray Database (SMD)
TissueInfo
yeast Microarray Global Viewer (yMGV)

Primer/ probe design
Array designer
GAP (Genome-wide Automated Primer finder servers)
OligoArray
Primer3
ProbeWiz Server

Other useful software for further data mining
Data annotation
– DRAGON
– Gene Ontology
– PubGene
– Resourcerer
Promoter analysis
– AlignACE
– INCLUSive
– MEME
– Sequence Logo
Pathway reconstruction
– GenMAPP
– PathFinder

Data annotation
– Link GI to a particular name
– Literature mining to infer network
Network reconstruction
– Cluster + promoter analysis
– Statistical inference from experimental data

Some suggestions for biologists who are serious about microarray study
Communicate or even collaborate with statisticians, mathematicians and bioinformaticians
Learn a high level statistical language, e.g. R
Learn programming, e.g. C
Learn databases, e.g. SQL
Learn Linux
Revise your statistics, probability and maybe even calculus

Lucky…?!
Picture omitted for copyright reasons

Conclusion: the future
A unified open environment for standard analysis and development
The best microarray analysis software?

~ Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone -- as the first step. ~ John W. Tukey