
Affymetrix® Data Mining Tool User’s Guide
Version 3.0

For Research Use Only. Not for use in diagnostic procedures.

Affymetrix Confidential
700233 Rev. 3

Trademarks
HuSNP™, Affymetrix®, GeneChip®, EASI™, GenFlex™, Jaguar™, MicroDB™, 417™, 418™, 427™, 428™, Pin-and-Ring™, Flying Objective™, NetAffx™ and CustomExpress™ are trademarks owned or used by Affymetrix, Inc. Microsoft® is a registered trademark of Microsoft Corporation. Oracle® is a registered trademark of Oracle Corporation.

Limited License
PROBE ARRAYS, INSTRUMENTS, SOFTWARE AND REAGENTS ARE LICENSED FOR RESEARCH USE ONLY AND NOT FOR USE IN DIAGNOSTIC PROCEDURES. NO RIGHT TO MAKE, HAVE MADE, OFFER TO SELL, SELL, OR IMPORT OLIGONUCLEOTIDE PROBE ARRAYS OR ANY OTHER PRODUCT IN WHICH AFFYMETRIX HAS PATENT RIGHTS IS CONVEYED BY THE SALE OF PROBE ARRAYS, INSTRUMENTS, SOFTWARE, OR REAGENTS HEREUNDER. THIS LIMITED LICENSE PERMITS ONLY THE USE OF THE PARTICULAR PRODUCT(S) THAT THE USER HAS PURCHASED FROM AFFYMETRIX.

Patents
Software products may be covered by one or more of the following patents: U.S. Patent Nos. 5,733,729; 5,795,716; 5,974,164; 6,066,454; 6,090,555; 6,185,561 and 6,188,783; and other U.S. or foreign patents.

Copyright
©1999, 2001 Affymetrix, Inc. All rights reserved.
Contents

CHAPTER 1 Welcome
  Data Mining Tool User’s Guide
  What’s New in DMT 3.0
  Conventions Used
  On-line Documentation
  Technical Support
  Your Feedback is Welcome

CHAPTER 2 Installing Data Mining Tool 3.0
  Before You Begin
    Microsoft® SQL Server LIMS Users
    Oracle® LIMS Users
    MicroDB™ Users
  Installing Data Mining Tool
  Creating an Oracle® Alias
    Oracle 8.1.7 Alias Configuration

CHAPTER 3 Affymetrix® Data Mining Tool Overview
  Access Data
    Affymetrix Publish Database
    Affymetrix® Analysis Data Model
  DMT Windowpanes
  Query Data
    Building and Running a Query
  Viewing Query Results
    Tables
    Graphs
  Analyze Query Results
    Statistical Analyses
    Cluster Analysis
    Matrix Analysis

CHAPTER 4 Getting Started
  Starting DMT
  Managing Database Connections
    Registering a Database
    Unregistering a Database
    Selecting a Database
  Specifying the Default Directory

CHAPTER 5 Building and Running a Query
  Building a Query
    Starting a New Query
    Specifying the Filters
    Query Builder
    Selecting Analyses for the Query
    Specifying Analysis Filters
  Running a Query
  Normalizing GeneChip® Signal Data
    Choosing Normalization Before a Query or Pivot
    Choosing Normalization After a Query or Pivot
    Normalization Options

CHAPTER 6 Managing Queries
  Saving a Query
    Using the Save As Command
  Opening a Previously Saved Query
  Deleting a Query

CHAPTER 7 Query Results Tables
  Experiment Information Table
    GeneChip® Data Mode
    Spot Data Mode
  Query Table
  Pivot Data Table
    Selecting Results for the Pivot Table
    Running the Pivot Operation
    Including Probe Descriptions in the Pivot Table
    Including Annotations in the Pivot Table
    Sorting Pivot Table Columns
    Pivot Options
  Working with Tables
    Finding Probes
    Viewing Descriptions & Obtaining Further Gene Information
    Annotating Probes
    Adding Probes to the Filter Grid
    Copying Tables
    Exporting Data
    Expanding the Results Pane
    Clearing the Results Pane

CHAPTER 8 Annotations
  Annotating Probes
  Loading Annotations
  Querying Annotations
  Adding Probes to the Filter Grid
  Deleting Annotations

CHAPTER 9 Probe Lists
  Creating Probe Lists
    Creating a Probe List from the Query or Pivot Table
    Creating a Probe List from Cluster Analysis
    Creating a Probe List from Search Array Descriptions
    Creating a Probe List from Filter
    Creating a Probe List by Combining Existing Lists
  Loading a Probe List
    Specifying Probe List Members
    Specifying an Input File
  Using Probe Lists
    Adding a Probe List to the Filter Grid
    Displaying Selected Probe List Members
  Managing Probe Lists
    Viewing and Editing Probe List Members
    Combining Probe Lists
    Exporting a Probe List
    Deleting a Probe List

CHAPTER 10 Array Sets
  Creating an Array Set
  Working with Array Sets
    Viewing Array Sets
  Managing Array Sets
    Editing an Array Set
    Deleting an Array Set

CHAPTER 11 Graphing Results
  Scatter Graph
    Plotting the Scatter Graph
    Working with the Scatter Graph
    Scatter Graph Options
  Fold Change Graph
    Plotting the Fold Change Graph
    Working with the Fold Change Graph
    Fold Change Graph Options
  Series Graph
    Plotting the Series Graph
    Working with the Series Graph
    Series Graph Options
  Histogram
    Plotting the Histogram
    Working with the Histogram
    Histogram Options
  Other Graphing Features
    Enlarging the Graph Pane
    Changing Graph Colors
    Copying and Clearing Graphs
    Printing Graphs

CHAPTER 12 Statistical Analyses
  Selecting an Operator
  Average, Median, Standard Deviation or Inter-Quartile Range
  Fold Change
  T-Test
  Mann-Whitney Test
  Count & Percentage

CHAPTER 13 Matrix Analysis
  Overview
    Population Size
  Running a Matrix Analysis

CHAPTER 14 Cluster Analysis
  Self Organizing Map (SOM) Algorithm
    Running a SOM Cluster Analysis
    Saving a Probe List
    SOM Filters
    SOM Parameters
  Correlation Coefficient Clustering Algorithm
    Running the Correlation Coefficient Cluster
    Correlation Coefficient Clustering Options
    Effect of Changing Algorithm Parameters
    Saving and Importing Seed Patterns
    Saving a Probe List

CHAPTER 15 DMT Tutorial
  Introduction
    Step 1: Restoring the MicroDB™ Database
    Step 2: Starting DMT
    Step 3: Registering the Database
    Step 4: Selecting the Tutorial Database
    Step 5: Opening the DMT Session
  Lesson 1: Identifying Highly Expressed Genes
    Step 1: Specifying a Filter
    Step 2: Selecting Analyses for the Query
    Step 3: Pivoting on Signal & Detection Call
    Step 4: Querying and Pivoting the Data
    Step 5: Sorting the Pivot Table by Signal
    Step 6: Saving a Probe List
    Step 7: Plotting the Series Line Graph
    Lesson 1 Summary
    Suggested Exercise
  Lesson 2: Calculating Averages of Replicates
    Step 1: Specifying a Probe List for the Filter
    Step 2: Selecting Analyses for the Query
    Step 3: Pivoting on Signal
    Step 4: Query and Pivot the Data
    Step 5: Selecting Average & Standard Deviation Operators
    Step 6: Sorting the Pivot Table
    Step 7: Displaying Probe Set Descriptions
    Lesson 2 Summary
    Suggested Exercise
  Lesson 3: Summarizing Qualitative Data
    Step 1: Pivoting on Detection Call
    Step 2: Performing Count & Percentage Analysis
    Step 3: Sorting Pivot Table Results
    Step 4: Saving a Probe List
    Step 5: Annotating Probe List Members
    Lesson 3 Summary
    Suggested Exercise
  Lesson 4: Evaluating Difference Between Two Tissues
    Step 1: Pivoting on Signal
    Step 2: Mann-Whitney Test
    Step 3: Annotating Probe Sets
    Step 4: Saving a Probe List
    Lesson 4 Summary
    Suggested Exercise
  Lesson 5: Evaluating Change Call Consistency
    Step 1: Clearing the Filter Grid & Selecting Comparison Analyses
    Step 2: Pivoting on Difference Call
    Step 3: Comparison Ranking
    Step 4: Annotating Probe Sets
    Step 5: Saving a Probe List
    Lesson 5 Summary
    Suggested Exercise
  Lesson 6: Self Organizing Map (SOM) Cluster Analysis
    Step 1: Clearing the Filter Grid & Selecting Analyses
    Step 2: Pivoting on Signal
    Step 3: Computing Average Signal
    Step 4: SOM Cluster Analysis
    Step 5: Saving & Annotating a Probe List
    Lesson 6 Summary

APPENDIX A Filter Grid
  GeneChip Data Mode
    Statistical Expression Algorithm
    Empirical Expression Algorithm
  Spot Data Mode

APPENDIX B Working with Windows & Tables
  Query Windowpanes
    Expanding a Windowpane
    Resizing a Windowpane
    Clearing the Results or Graph Pane
  Tables
    Selecting the Entire Table
    Selecting Rows
    Resizing Columns
    Hiding Columns
    Reordering Columns

APPENDIX C Query Table Data
  GeneChip® Data Mode
    Statistical Expression Algorithm Metrics
    Empirical Expression Algorithm Metrics
  Spot Data Mode

APPENDIX D DMT Algorithms
  The SOM Algorithm
    Neighborhood
    Learning Rate
  The Correlation Coefficient Clustering Algorithm
  The Matrix Algorithm

APPENDIX E Toolbars & Shortcuts
  DMT Main Toolbar
  Session Toolbar
  Shortcut Descriptions

Chapter 1 Welcome

Welcome to the Affymetrix® Data Mining Tool (DMT) User’s Guide. DMT filters, queries and analyzes publish databases of GeneChip® or spotted array expression data.

Data Mining Tool User’s Guide

This manual explains how to use DMT to:
■ Build a query.
■ Display the query results in table or graph format.
■ Evaluate and compare replicate data using statistical analyses.
■ Calculate the overlap significance between two lists of GeneChip® probe sets or spot probes.
■ Apply cluster analysis to experimental results to help identify gene expression patterns.

This manual also includes a tutorial that demonstrates: 1) a data mining strategy to identify genes that significantly change expression level, 2) statistical analyses of replicate data, and 3) cluster analysis.

What’s New in DMT 3.0

Compatible with Microarray Suite Statistical or Empirical Expression Algorithm
DMT can query and analyze experimental results generated by the Statistical Expression algorithm (in Microarray Suite 5.0) as well as the Empirical Expression algorithm (in versions of Microarray Suite prior to 5.0). The filter includes both Statistical and Empirical metrics so that a query may specify (in “OR” fashion) both types of metrics.

Publish Database Security
Each publish database requires a login password to prevent unauthorized database access.

Conventions Used

This manual provides a detailed outline for all tasks associated with Affymetrix® Data Mining Tool. Various conventions are used throughout the manual to help illustrate the procedures described. Explanations of these conventions are provided below.

Steps
Instructions for procedures are written in a step format. Immediately following the step number is the action to be performed. On the line below the step there may be the following symbol: ⇒. This symbol marks the system response to, or the consequence of, the user action: what you see, and what has happened that you may not see. Additional information pertaining to the step may follow the response and is presented in paragraph format. For example:

9. Click Yes to continue.
⇒ The Delete task proceeds. In the lower right pane the status is displayed.
To view more information pertaining to the delete task, right-click Delete and select View Task Log from the shortcut menu.

Font Styles
Bold fonts indicate names of commands, buttons, options or titles within a dialog box. When you are asked to enter specific information, that input appears in italics within the procedure being outlined. For example:

1. To select another server, enter the server name in the Oracle Alias box.
2. Enter DMT_3_Tutorial in the Publish Database box, then click Register.
⇒ The tutorial database is available to DMT.

Screen Captures
The steps outlining procedures are frequently supplemented with screen captures to further illustrate the instructions given. The screen captures depicted in this manual may not exactly match the windows displayed on your screen.

Additional Comments
Throughout the manual, text and procedures are occasionally accompanied by special notes. These additional comments and their meanings are described below.

Information presented in tips provides helpful advice or shortcuts for completing a task.

The Note format presents important information pertaining to the text or procedure being outlined.

Caution notes advise you that the consequence(s) of an action may be irreversible and/or result in lost data.

Warnings alert you to situations where physical harm to person or damage to hardware is possible.

On-line Documentation

The CD with DMT includes an electronic version of this user’s guide. The on-line documentation is in Adobe Acrobat format (a *.pdf file) and is readable with the Adobe Acrobat® Reader software, available at no charge from Adobe at http://www.adobe.com. The electronic user’s guide is printable, searchable and fully indexed. You can have it open and minimized on screen while using the DMT software.

Technical Support

Affymetrix provides technical support to all licensed users via phone or e-mail.
To contact Affymetrix Technical Support:

Affymetrix Inc.
3380 Central Expressway
Santa Clara, CA 95051 USA
Tel: 1-888-362-2447 (1-888-DNA-CHIP)
Fax: 1-408-731-5441
E-mail: [email protected]

Affymetrix UK Ltd.
Voyager, Mercury Park, Wycombe Lane, Wooburn Green
High Wycombe HP10 0HH
United Kingdom
Tel: +44 (0) 1628 552550
Fax: +44 (0) 1628 552585
E-mail: [email protected]

www.affymetrix.com

Your Feedback is Welcome

Affymetrix Technical Publications is dedicated to continually improving the quality of our documentation and helping you get the information that you need. We welcome any comments or suggestions you may have regarding this manual. Please contact us at: [email protected]

Chapter 2 Installing Data Mining Tool 3.0

Installing Data Mining Tool 3.0 will uninstall any previous version of DMT. You will no longer be able to use your previous version of DMT after installing Data Mining Tool 3.0.

Before You Begin

This section guides you through the installation of Data Mining Tool 3.0. Listed below is an overview of the steps needed to complete the installation.

Microsoft® SQL Server LIMS Users
1. Obtain the name of the LIMS Server from your IT personnel if not known (this is needed during installation).
2. Install Data Mining Tool 3.0.

Oracle® LIMS Users
1. Install Oracle Client Utilities on the workstation (Oracle Client Utilities must be the same version installed on the LIMS Server).
2. Install SQL*Loader (for better performance).
3. Create an Oracle Alias. (Refer to the section Creating an Oracle® Alias.)
4. Install Data Mining Tool 3.0.

MicroDB™ Users
Install Data Mining Tool 3.0.

Installing Data Mining Tool

The following are detailed instructions for installing DMT. Please note that the screen captures depicted in this section may not exactly match the windows displayed on your screen.

You must be logged in as administrator to install the DMT 3.0 software.

1. Log in as an administrator.
2. Insert the Affymetrix® DMT 3.0 CD-ROM.
3. If the autorun feature does not start the program:
   a. Click Start → Run.
   b. Type <cd drive letter>:\setup.exe.
   c. Click OK.
   ⇒ The Affymetrix Software Setup window appears.
4. Click DMT 3.0 Setup.
   ⇒ The Welcome window appears (Figure 2.1).

Figure 2.1 Welcome window

5. Click Next.
6. Several consecutive Software License Agreement windows appear. Review the contents in each and click Yes to accept the terms of the agreement.
   ⇒ The Customer Information window appears (Figure 2.2).

Figure 2.2 Customer Information window

7. Enter your Name, Company and Serial Number.
   The serial number is located on the Affymetrix® Software Product Registration card. If you do not have a serial number, contact Affymetrix Technical Support. If you are upgrading from a previous version, the Serial Number field populates automatically.
8. Click Next.
   ⇒ The Choose Destination Location window appears (Figure 2.3).

Figure 2.3 Choose Destination Location window

9. Select the destination where Data Mining Tool will be installed.
10. Click Next.
    ⇒ The Select Database Compatibility window appears (Figure 2.4).

Figure 2.4 Select Database Compatibility window

11. Select the type of database that DMT will connect with.
    ■ Affymetrix® LIMS - if connecting to a LIMS Server.
    ■ Affymetrix® MicroDB - if connecting to a local publish database using MicroDB™.
12. Click Next.
    ⇒ If connecting to a LIMS server, the Select Database Type window appears (Figure 2.5). If using MicroDB™, go to step 16.

Figure 2.5 Select Database Type window

13. Select the type of database used on the LIMS server, either SQL Server or Oracle.
    If you do not know the type of database you are using with the LIMS Server, please contact your IT personnel or DBA.
14. Click Next.
    ⇒ The Enter Information window appears (Figure 2.6).

Figure 2.6 Enter Information windows for the SQL Server database (left) or the Oracle® database (right)

15. In the Enter Information window, complete one of the following:
    ■ If SQL Server is selected, enter the SQL Server Name (usually the name of the LIMS Server).
    ■ If Oracle® is selected, enter the Oracle Alias Name.
16. Click Next.
    ⇒ Database connectivity is verified and the Start Copying Files window appears.
17. In the Start Copying Files window, verify the information and click Next.
    ⇒ Program files are copied and the system configures the registry. The Setup Complete window appears.
    ⇒ For Oracle systems: If a warning message regarding SQL Loader appears, continue the DMT install until complete. Then, install SQL Loader (part of Oracle) for better DMT performance. After SQL Loader is installed, re-install DMT.
18. Select Yes, I want to restart my computer now and click Finish.

Creating an Oracle® Alias

To create an Oracle alias, use the Net8 Assistant. The following steps guide you through creating an alias.

Oracle 8.1.7 Alias Configuration

1. Click Start → Programs → <Oracle directory> → Network Administration → Net8 Assistant.
   ⇒ The Oracle Net8 Assistant window appears (Figure 2.7).

Figure 2.7 Oracle® Net8 Assistant window

2. Expand Local.
3. Highlight Service Naming, then from the menu bar click Edit → Create.
   ⇒ The Net Service Name Wizard Welcome window appears (Figure 2.8).

Figure 2.8 Net Service Name Welcome window

4. Enter the Net Service Name (which is the alias name). The name must be the same name as the local LIMS server.
   If creating a remote publish server alias, Host Name must be the same as the computer name of the remote publish server.
5. Click Next.
   ⇒ The Networking Protocol window appears (Figure 2.9).

Figure 2.9 Networking Protocol window

6. Select TCP/IP (Internet Protocol).
7. Click Next.
   ⇒ The Host Name window appears (Figure 2.10).

Figure 2.10 Host Name window

   The Host Name is the name of the local LIMS Server. The Port Number is left as the default value 1521, unless it has been changed.
   If creating a remote publish server alias, the Host Name must be the name of the remote publish server.
8. Click Next.
   ⇒ The Database SID window appears (Figure 2.11).

Figure 2.11 Database SID window

9. Select the (Oracle8i) Service Name option. Enter the name of the Oracle database instance on the local LIMS server.
   If creating a remote publish database server alias, the Database SID name should be the instance created on the remote publish server.
10. Click Next.
    ⇒ The Test Service window appears (Figure 2.12).

Figure 2.12 Test Service window

11. Click Test... to test the alias created.
    ⇒ A Connection Test Information window appears (Figure 2.13).

Figure 2.13 Connection Test Information window

12. If the connection was successful, go to step 13. If the connection was unsuccessful, follow the instructions below.
    a. If the test fails, click Change Login....

Figure 2.14 Change Login window

    b. Enter Username and Password, then click OK.
    c. Repeat step 11.
13. Click Close.
14. Click Finish.
15. Repeat the above steps to create and test the second alias if using a remote publish database server.
16. Save the configuration settings.

If your test was unsuccessful, verify that your listener is listening for your alias.
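
When the connection test fails, it can help to first rule out a basic network problem between the workstation and the server. The following is a minimal sketch in Python, not part of DMT or Oracle: it only checks whether a TCP connection can be opened to the host and port entered in the Host Name window. The function name `listener_reachable` and the host name `lims-server` are hypothetical, and a successful result does not prove that the listener knows your service name.

```python
import socket


def listener_reachable(host: str, port: int = 1521, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This checks network reachability only (an assumption-level sanity
    check); it does not log in or validate the Oracle service name.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


if __name__ == "__main__":
    # "lims-server" is a placeholder; substitute the Host Name used in the alias.
    print(listener_reachable("lims-server"))
```

If this returns False, the host name, the port number (1521 unless it has been changed) or a firewall is a more likely cause than the alias definition itself.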
Chapter 3 Affymetrix® Data Mining Tool Overview

Affymetrix® Data Mining Tool (DMT) provides a flexible and intuitive query interface to a large data warehouse of published expression databases and helps you sift through hundreds or thousands of experimental results.

This chapter provides an overview of DMT and how it interacts with publish databases. It explains the steps involved in running a query and the options available to you for viewing and analyzing results.

Access Data

DMT operates in GeneChip® data or spot data mode. It enables you to access, query and analyze data found in a publish database populated with Affymetrix GeneChip® probe array expression analysis results (*.chp) or spotted probe array intensity results (*.spt). The data mode and location of the publish database determine the DMT features available.

Affymetrix Publish Database
An Affymetrix publish database is created by an Affymetrix publishing application (see Table 3.1). These applications import or publish analysis data (*.chp or *.spt) to a publish database located on the LIMS server or a local workstation (MicroDB™). Published data are available to DMT or other third party analysis tools, as well as database management tools such as Microsoft Access® 2000.

Table 3.1 Affymetrix® publishing applications
Publishing Application | Data Published | Publish Database Location
Affymetrix® LIMS | GeneChip® probe array expression analysis data (*.exp, *.cel, *.chp) | LIMS server
Affymetrix® MicroDB™ | GeneChip® probe array expression analysis data (*.chp) | Local workstation
Affymetrix® MicroDB™ | Affymetrix® Jaguar™ spotted array intensity data (*.spt) | Local workstation

You can also use DMT to query other appropriately formatted databases populated with Affymetrix® GeneChip® expression analysis results (*.chp) or spotted probe array intensity results (*.spt).
Affymetrix® Analysis Data Model
DMT is compatible with any Affymetrix® Analysis Data Model (AADM) compliant database populated with Affymetrix GeneChip® probe array expression analysis results (*.chp) or AADM-derived database populated with spotted probe array intensity results (*.spt). AADM is available at www.affymetrix.com.

DMT Windowpanes

The DMT session appears when you start a new query or open a previously saved query. DMT has four different panes for filtering and displaying expression data (Figure 3.1, Figure 3.2). The panes are:

Filter grid: Enables you to specify the filters and the limits the data must meet to be returned by the query.
Data tree: Displays analyses, array sets and probe lists. You can select analyses or array sets from the data tree for the query.
Graph pane: Displays graphs (scatter, fold change, series, or histogram graph) and cluster analysis results.
Results pane: Displays the experiment information, query and pivot tables.

Use the filter grid and data tree to specify query conditions. The graph and results panes display query and analysis results.

Figure 3.1 DMT display in GeneChip® data mode

Figure 3.2 DMT display in spot data mode

Query Data

A query searches a publish database to find vital experimental and expression data. User-defined filters specify the search criteria. The query returns only those records that meet the criteria or limits specified by the query filters.

Building and Running a Query
To build a query:
■ Specify the filter conditions that the expression data must satisfy (Table 3.2 lists the type of filters for the various DMT data modes).
■ Select the analyses for the query from the data tree (Figure 3.1, Figure 3.2).

Analysis filters are available in GeneChip LIMS data mode. You can filter the current database so that the data tree displays only the analyses that meet user-specified criteria.

Table 3.2 DMT Filters
DMT Data Mode | Publishing Application | Analysis Filters (a) | Expression Filters (b)
GeneChip® probe array | Affymetrix® LIMS | Sample template; Experiment template; Attribute; Attribute value; Sample project; Probe Array; Sample type; Operator; Sample name | Absolute or comparison expression metrics (See Appendix A)
GeneChip® probe array | MicroDB™ | Not available | Absolute or comparison expression metrics (See Appendix A)
Spotted probe array | MicroDB™ | Not available | Intensity result; Standard deviation intensity; Pixel intensity; Background; Standard deviation Background; Ratio (See Appendix A)
a. In GeneChip LIMS mode, analysis filters interrogate the publish database and determine the analyses displayed in the data tree.
b. Filters interrogate the analyses selected in the data tree.

You can specify analysis filters in the Filter Analysis dialog box (LIMS data mode only) (Figure 3.3). These filters interrogate the current database and determine the analyses displayed in the data tree. To filter the analyses, select View → Analysis Filters from the menu bar.

Figure 3.3 Filter Analysis dialog box (available when connected to a publish database on the LIMS server in GeneChip® data mode)

Viewing Query Results

You can view the data retrieved from the database in both tables and graphs. This section describes the various types of tables and graphs available in DMT.

Tables
The results pane (Figure 3.1) contains three tables:
■ Experiment Information table
■ Query table
■ Pivot table

The query and pivot tables provide two different views of expression data.

Experiment Information Table
The experiment information table contains information about analyses or array sets selected in the data tree.
In GeneChip® data mode, the experiment information table (Figure 3.4) displays:
■ User-specified information such as project and experiment name.
■ Information automatically captured by Affymetrix® LIMS during hybridization, scanning and analysis of GeneChip probe arrays including experiment template parameters.
■ Values for user-modifiable expression algorithm parameters (used to calculate the expression metrics).

In spot data mode, the experiment information table (Figure 3.5) displays:
■ Probe array and operator name.
■ Parameters associated with the analysis.

To populate the experiment information table:
■ Select analyses or array sets in the data tree.
■ Click the Info toolbar button, or select Query → Experiment Information.

Figure 3.4 Experiment information table, GeneChip® data mode

Figure 3.5 Experiment information table, spot data mode

Query Table
The query table displays the expression data for probes (probe sets or spot probes) that met the query criteria (Figure 3.6 and Figure 3.7). Each row displays the probe name, the analysis name and the expression data (for example, signal and detection) from that analysis. If the query results include the same probe from different analyses, the query table displays a separate row for each probe/analysis pair. For example, if a query returned the same probe set from four different analyses, the query table would display four rows of results for the same probe set (one row per analysis).

To populate the query table:
■ Select analyses or array sets in the data tree.
■ Specify filters (optional).
■ Click the Query button or select Query → Run Query from the menu bar.

Figure 3.6 Query table, GeneChip® data mode

Figure 3.7 Query table, spot data mode

Pivot Table
When a query returns the same probe from several different analyses, it is often more convenient to view the probe data (probe set or spot probe) from each analysis side by side in the same row. DMT can retrieve analyses from the database and organize the data in the pivot table so that all analysis results for the same probe are displayed in one row (Figure 3.8 and Figure 3.9). In the pivot table, the column headers display the analysis names; the columns display the expression data. The pivot table columns are available for graphing, statistical analysis, or cluster analysis.

To populate the pivot table:
■ Select analyses or array sets in the data tree.
■ Specify filters (optional).
■ Click the Run Pivot button or select Query → Retrieve Data from the menu bar.

Figure 3.8 Pivot table, GeneChip® data mode

Figure 3.9 Pivot table, spot data mode

Graphs
DMT can display the pivot table data in graphical formats. The types of graphs available include:
■ Scatter graph
■ Fold Change graph
■ Series graph
■ Histogram graph

Each type of graph is displayed in a separate tab of the graph pane (Figure 3.1 and Figure 3.2). The graphing functions are only available for the analyses displayed in the pivot table.

Scatter Graph
The scatter graph plots multiple pairs of user-specified numeric columns from the pivot table using a traditional scatter plot (Figure 3.10). Each point represents a probe (probe set or spot probe) common to both columns in the comparison. A point is defined by the intersection of the value on the x and y axes for the common probe.
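
The relationship between the query table (one row per probe/analysis pair), the pivot table (one row per probe) and the scatter graph (one point per probe common to two columns) can be sketched in a few lines of Python. This is an illustration only and does not reflect how DMT stores or retrieves data; the probe names, analysis names and signal values are made up.

```python
from collections import defaultdict

# Hypothetical query-table rows: (probe, analysis, signal).
query_rows = [
    ("probe_A", "analysis_1", 120.0),
    ("probe_A", "analysis_2", 150.0),
    ("probe_B", "analysis_1", 40.0),
    ("probe_B", "analysis_2", 35.0),
    ("probe_C", "analysis_1", 300.0),  # present in only one analysis
]


def pivot(rows):
    """Regroup rows so that each probe maps to {analysis: value}."""
    table = defaultdict(dict)
    for probe, analysis, value in rows:
        table[probe][analysis] = value
    return dict(table)


def scatter_points(table, x_col, y_col):
    """Return (probe, x, y) for every probe that has a value in both columns."""
    return [
        (probe, cols[x_col], cols[y_col])
        for probe, cols in sorted(table.items())
        if x_col in cols and y_col in cols
    ]


pivot_table = pivot(query_rows)
points = scatter_points(pivot_table, "analysis_1", "analysis_2")
for probe, x, y in points:
    print(probe, x, y)
```

In this sketch probe_C produces no point because it has no value in the second column, which mirrors the statement that each point represents a probe common to both columns.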

Figure 3.10 Scatter graph, GeneChip® data mode

Fold Change Graph
The fold change graph (Figure 3.11) compares multiple pairs of user-specified numeric pivot table columns (base and comparison columns). It displays a scatter plot of the fold change of the comparison column compared to the base column. (See Appendix A for the fold change calculation.) Each point represents a probe (probe set or spot probe) that is common to the base and comparison columns. The y-axis coordinate is the average fold change for all of the base-comparison pairs that contain the probe. The x-axis coordinate is the average of the comparison column value for all of the comparison analyses that contain the probe.

Figure 3.11 Fold change graph, GeneChip® data mode

Series Graph
The series graph plots any numeric pivot table column in a line or bar graph format (Figure 3.12, Figure 3.13). The series graph is a useful way to monitor gene expression across different experiments or over a time course.

Figure 3.12 Series line graph, GeneChip® data mode

Figure 3.13 Series bar graph, GeneChip® data mode

Histogram
The histogram plots a frequency distribution of any numeric pivot table column (Figure 3.14). The histogram sorts the metric values into groups or bins (x-axis coordinate) and plots the number of probes (probe sets or spot probes) in each bin (y-axis coordinate). For example, a histogram of probe set expression signal values can help evaluate the proportion of genes expressed at different levels.

Figure 3.14 Histogram, expression signal data

Analyze Query Results

You can apply statistical and cluster analyses to the results displayed in the pivot table. This section describes the various types of statistical and cluster analysis available in DMT.
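
As a rough illustration of what applying an analysis to pivot table columns means, the sketch below computes a per-probe summary across replicate columns and stores the results as new columns. It is written in Python with made-up probe names, column names and values. The exact formulas DMT uses are not given in this chapter, so the use of the sample standard deviation here is an assumption.

```python
from statistics import mean, median, stdev

# Hypothetical pivot table: one row per probe, one column per replicate analysis.
pivot_table = {
    "probe_A": {"rep_1": 100.0, "rep_2": 110.0, "rep_3": 120.0},
    "probe_B": {"rep_1": 40.0, "rep_2": 50.0, "rep_3": 45.0},
}


def add_summary_columns(table, columns):
    """Return a copy of the table with Average, Median and StdDev columns added."""
    result = {}
    for probe, row in table.items():
        values = [row[c] for c in columns]
        new_row = dict(row)
        new_row["Average"] = mean(values)
        new_row["Median"] = median(values)
        # Sample standard deviation (n - 1 denominator); an assumption.
        new_row["StdDev"] = stdev(values)
        result[probe] = new_row
    return result


summary = add_summary_columns(pivot_table, ["rep_1", "rep_2", "rep_3"])
for probe, row in summary.items():
    print(probe, row["Average"], row["Median"], row["StdDev"])
```

The original table is left unchanged and the summary values appear as additional columns, which is the general shape of the behavior described in this section.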
Statistical Analyses
DMT can apply the following statistical analyses to numeric pivot table columns:
■ Average
■ Median
■ Standard deviation
■ Inter-Quartile range
■ Fold change
■ T-Test
■ Mann-Whitney test
■ Count & Percentage
The pivot table displays the resulting data in new columns that are available for graphing (scatter, fold change, or series graph), clustering, or further statistical analysis.

Cluster Analysis
Cluster analysis finds expression profiles that have similar shapes. DMT provides two different algorithms for finding those clusters: the Self Organizing Map (SOM) algorithm and the Correlation Coefficient algorithm.

Self Organizing Map Algorithm
The self organizing map (SOM) algorithm is designed to identify patterns in expression signals. However, any numeric pivot table column may be selected for cluster analysis. The algorithm represents the selected data of probe sets in n experiments as points in n-dimensional space. Initially, the algorithm randomly maps a grid of nodes in space, then iteratively adjusts the node positions toward collections of points until the nodes reflect clusters of probe sets with similar expression patterns. (See Appendix D for more information about the SOM algorithm.) Figure 3.15 shows the patterns and probe set members of clusters found by the SOM algorithm.
Figure 3.15 SOM cluster results

Correlation Coefficient Algorithm
The correlation coefficient algorithm uses a nearest neighbor approach to find groups of probe sets with similar patterns. The average pattern of a group defines a cluster seed. Probe sets whose patterns are closely matched to the seed pattern are assigned to the seed's cluster.
Figure 3.16 Correlation coefficient cluster results

Matrix Analysis
Matrix analysis enables you to compare probe lists and determine the overlap between two lists (Figure 3.17).
The matrix algorithm computes the probability (P-value) that the observed overlap is expected due to random chance. The algorithm converts the P-value to an overlap significance value that is displayed in the matrix. The overlap significance value = -logP, and may range from near zero to a large number. Appendix D provides further information on the Matrix algorithm. The matrix highlights values that exceed the overlap significance threshold (pink) and values that exceed the non-overlap significance threshold (yellow).
Figure 3.17 Matrix displays the overlap significance values for two probe lists

Chapter 4 Getting Started
This chapter provides step-by-step instructions for completing the basic tasks that are necessary to start and run Affymetrix® Data Mining Tool (DMT).

Starting DMT
1. Click the Windows Start button, then select Start → Programs → Affymetrix → Data Mining Tool.
⇒ The Publish Database Login dialog box appears (Figure 4.1). This dialog box does not appear in MicroDB mode.
Figure 4.1 Publish Database login for LIMS mode
2. Enter the password for the publish database and click Login.
⇒ The main window appears (Figure 4.2).
Figure 4.2 DMT main window, Database02 selected
In the DMT main window, you can:
■ Register or unregister a publish database.
■ Select a database for the query.
■ Start a new DMT session.
■ Open or delete a previously saved query.

Managing Database Connections
DMT connects with databases created using the Affymetrix® LIMS or Affymetrix® MicroDB applications (or other appropriately formatted databases). The tasks involved in managing these database connections include registering a database, selecting a database for use with DMT, and unregistering a database.

Registering a Database
You must register a publish database to make it available to DMT.
To register a database, use the procedure below that suits your system.

Publish Database on Windows Workstation (MicroDB™ System)
1. Select Edit → Register Database from the menu bar.
⇒ The Register Database dialog box appears (Figure 4.3).
Figure 4.3 Register Database dialog box, publish database on Windows NT workstation
2. Select a database from the Publish Database drop-down list, then click Register.
⇒ The publish database is now available to DMT.

Publish Database on LIMS Server (Affymetrix® LIMS)
1. Select Edit → Register Database from the menu bar.
⇒ The Register Database dialog box appears (Figure 4.4). The Publish Database box contains a list of publish databases on the server.
Figure 4.4 Register Database dialog box, publish database on LIMS server
2. Enter the SQL server or Oracle Alias name in the Server Name box.
3. Click List Databases to display the publish databases for the server in the Publish Database drop-down list.
4. Select a database from the Publish Database drop-down list.
5. Click Register.
⇒ The Publish Database login dialog box appears (Figure 4.5).
Figure 4.5 Publish Database Login
6. Enter the database password and click Login.
⇒ The database is available to DMT.

Unregistering a Database
Unregistering a database removes it from the list of available (registered) databases that may be queried.
1. Select Edit → Unregister Database from the menu bar.
⇒ The Unregister Database dialog box appears (Figure 4.6).
Figure 4.6 Unregister Database dialog box
2. Select a database, then click Unregister.
⇒ The database is no longer available to DMT.

Selecting a Database
DMT connects to a single publish database at a time. By default, DMT connects to the most recently registered database. Select the database of interest before opening a DMT session.
1. Close all the DMT sessions and return to the main DMT window (Figure 4.2).
2. Select Edit → Select Database from the menu bar, then select a database.
⇒ The Publish Database Login dialog box appears (Figure 4.7).
Figure 4.7 Publish Database Login
3. Enter the password, then click Login.
⇒ The status bar at the bottom of the main window displays the current database name (Figure 4.2).
If the status bar is not displayed, select View → Status Bar from the menu bar.

Specifying the Default Directory
You must specify a default directory that identifies the location of files for import (for example, when loading probe lists or annotations) or export when the data export option is selected.
1. Open a DMT session.
2. Click the Options button. Alternatively, select View → Options from the menu bar.
⇒ The Data Mining Options dialog box appears (Figure 4.8).
Figure 4.8 Data Mining Options dialog box, Default Directory tab
3. Click the Default Directory tab.
4. Click the Browse button.
⇒ The Browse for Folder dialog box appears (Figure 4.9).
Figure 4.9 Browse for Folder dialog box
5. Locate the default directory, then click OK.

Chapter 5 Building and Running a Query
A query is the key to obtaining interesting data for subsequent analysis using Affymetrix® Data Mining Tool. This chapter explains how to define a query (specify the conditions that the data must meet to be retrieved from the database) and select analyses for the query from the current database.

Building a Query
The three main steps for building a query are:
■ Open a DMT session.
■ Specify the filters.
■ Select analyses or array sets for the query.
Affymetrix® Data Mining Tool operates in GeneChip® data mode or spot data mode. The data mode and location of the publish database (LIMS server or Windows NT workstation) determine the DMT features available.

Starting a New Query
To start a new query in GeneChip® data mode, select Data → New → GeneChip Mining from the menu bar.
⇒ A new DMT session starts (Figure 5.1).
To start a new query in spot data mode, select Data → New → Spotted Array Mining from the menu bar.
⇒ A new DMT session starts (Figure 5.2).
You can open more than one DMT session at a time. Select Window → Cascade or Window → Tile from the menu bar to organize the open windows.
Figure 5.1 DMT session, GeneChip® data mode (graph pane not displayed until a graph or cluster result is generated)
Figure 5.2 DMT session, spot data mode (graph pane not displayed until a graph or cluster result is generated)

DMT Session Components
Session toolbar: Provides access to additional functions specific to the DMT session. See Appendix E for detailed toolbar information.
Filter grid: Provides a flexible interface for selecting expression metrics for filtering and entering the limits the data must meet to be returned by the query.
Data tree: Displays the analyses in the current publish database. When the database is on the LIMS server, the Filter Analysis dialog box can be used to filter the analyses displayed in the data tree. The data tree also displays array sets and probe lists. Select the analyses for the query from the data tree.
Results pane: Displays the experiment information, query, and pivot tables that contain information about the analyses or array sets selected in the data tree, and query results.
Graph pane: Displays graphs or cluster analysis results. This pane is not displayed until a graph or cluster analysis is generated.

Specifying the Filters
The filter grid (Figure 5.3, Figure 5.4) enables you to select expression metrics for filtering and specify the limits that the data must meet to be returned by the query.
Figure 5.3 Filter grid, GeneChip® data mode
Figure 5.4 Filter grid, spot data mode

Filter Grid Components
Column headers: Display the probe set or spot probe name and the expression metrics available for the filter.
GeneChip® data mode: Any absolute or comparison expression analysis metric generated by the Statistical Expression algorithm or the Empirical Expression algorithm (in versions of Microarray Suite lower than 5.0). (See the Affymetrix Microarray Suite User's Guide for more information about the expression algorithms and metrics.)
Spot data mode: Intensity, intensity standard deviation, pixel count, background, background standard deviation, ratio.
Sort: Specifies a sort order (ascending, descending, or none) in the query table for a results column.
Note: This sort specification does not affect the pivot table. To sort a pivot table column, right-click the column header and select a sort option from the shortcut menu.
Line 1 through n: Accommodates the entries that specify metric limits. Limits entered in two or more cells of the same row are combined in AND fashion (intersection). Limits entered in subsequent rows are combined in OR fashion (union).

Entering Limits
1. Double-click the cell of interest in Line 1 of the grid (not the Sort row).
⇒ The blinking cursor in the cell indicates DMT is ready to accept typed input (Figure 5.5). Table 5.1 and Table 5.2 describe query operators and statements and provide example limits.
Figure 5.5 Filter grid, GeneChip® data mode
If you double-click a cell in the last row of the filter grid, DMT automatically adds another row to the grid.
2. Enter the limit, then do one of the following:
■ Double-click the next cell where you want to enter a limit.
■ Press the ENTER key to complete the entry and move the cursor to the grid cell below in line 2.
■ Press the TAB key to complete the entry and move the cursor to the right to the next cell in the row.
Limits may be entered in all columns and many rows of the filter grid. Limits in two or more cells in the same row are logically connected with an AND (intersection) statement. Limits entered in subsequent rows are logically connected with an OR (union) statement.
Enter limits for Statistical algorithm metrics and Empirical algorithm metrics on separate lines in the filter grid.
Figure 5.6 shows an example filter that specifies probe sets with a signal greater than 400 AND detection p-value < 0.1.
Figure 5.6 Filter grid, GeneChip® data mode
Use Probe Lists to quickly add a group of associated probes to the filter. Right-click the cell in the Probe Set Name column, and select Probe List from the shortcut menu. See page 137 for more information.

Entering Multiple Limits in a Single Cell
Limits containing AND (intersection) or OR (union) operators may be entered in a single cell. For example, the limit in Figure 5.7 defines the range between 500 and 1000 (the intersection of the range > 500 and the range < 1000). The query returns probe sets where: 500 < signal < 1000. Probe sets with signal < 500 or > 1000 are not returned.
Figure 5.7 Filter grid, GeneChip® data mode
DMT automatically adds a blank row to the bottom of the grid to accommodate another OR entry. The last row of the grid may remain blank with no effect on the query.

Editing Limits
Double-click the cell to highlight the entire limit, then do one of the following:
■ Enter a new limit (overwrites the old limit).
■ Right-click the mouse and make a selection from the shortcut menu of edit commands.
■ Use the mouse to select part of the limit, then enter new text.
An Oracle® database is case sensitive.
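The way the filter grid combines limits (AND within a row, OR between rows) can be illustrated with a short Python sketch. This is not DMT code: the function name matches_grid, the representation of a grid row as a dictionary of predicates, and the sample records are assumptions made for illustration. The sketch also assumes that a grid with no limits at all returns every record, since filters are optional.

```python
def matches_grid(record, grid):
    """Return True if a record satisfies the filter grid.

    grid is a list of rows; each row maps a metric name to a predicate
    (hypothetical representation). Cells in one row are ANDed, rows are ORed.
    Blank rows are ignored.
    """
    active_rows = [row for row in grid if row]
    if not active_rows:
        # Assumption: with no limits entered, nothing is filtered out.
        return True
    return any(
        all(limit(record[metric]) for metric, limit in row.items())
        for row in active_rows
    )


grid = [
    # Line 1: signal > 400 AND detection p-value < 0.1 (as in Figure 5.6)
    {"Signal": lambda s: s > 400, "Detection p-value": lambda p: p < 0.1},
    # Line 2: OR signal > 5000 regardless of p-value (made-up limit)
    {"Signal": lambda s: s > 5000},
    # Blank last row, no effect on the query
    {},
]

records = [
    {"Probe": "probeA", "Signal": 450.0, "Detection p-value": 0.01},
    {"Probe": "probeB", "Signal": 450.0, "Detection p-value": 0.50},
    {"Probe": "probeC", "Signal": 9000.0, "Detection p-value": 0.50},
]

returned = [r["Probe"] for r in records if matches_grid(r, grid)]
```

Here probeA passes Line 1, probeC passes Line 2, and probeB fails both rows.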
Specifying a Sort Order for the Query Table
You may specify a sort order (ascending, descending, or not sorted) for the query table.
1. In the filter grid, click the cell in the Sort row (first row) for the metric you want to sort (for example, Signal in Figure 5.8).
⇒ An arrow button appears.
Figure 5.8 Filter grid, GeneChip® data mode
2. Click the arrow, and select a sort order from the drop-down list that appears (Figure 5.9).
Figure 5.9 Filter grid, sort options (GeneChip® data mode)
3. Repeat steps 1 - 2 for additional metrics you want to sort.
If a sort order is specified for two or more metrics, the sort is prioritized from left to right. For example, the limits in Figure 5.10 sort the query results first by descending signal, then by ascending detection p-value.
Figure 5.10 Filter grid, multiple column sort (GeneChip® data mode)

Table 5.1 Query operators and example query statements

Comparison Operators | Definition | Example Limit | Returns the Record for Metric Data...
= | Equal (number or character field) | =3 | Equal to 3
= | Equal (number or character field) | ='P' | Called present
> | Greater than | >5 | Greater than 5
< | Less than | <20 | Less than 20
>= | Greater than or equal to | >=6 | Greater than or equal to 6
<= | Less than or equal to | <=19 | Less than or equal to 19
!= | Not equal to (number or character field) | !=25 | Not equal to 25

Ranges | Definition | Example Limit | Returns the Record for Metric Data...
BETWEEN | Returns records with the metric value between the user-specified limits | BETWEEN 2 AND 5 | Between 2 and 5
NOT BETWEEN | Returns records where the metric value is not between the user-specified limits | NOT BETWEEN 1 AND 1.5 | Not between 1 and 1.5

Lists | Definition | Example Limit | Returns the Record for Metric Data...
IN | Returns records that match any one of the values in the list | IN ('cre', 'bioB') | cre or bioB
NOT IN | Returns records that do not match any one of the values in the list | NOT IN ('cre', 'bioB') | Not cre or bioB
LIKE | Searches character fields such as probe name and returns records that match the pattern in the LIKE statement | LIKE 'cre' | cre
LIKE | (as above) | LIKE 'cr_' | cr followed by any single character (the underscore symbol (_) is the wild card for a single character)
LIKE | (as above) | LIKE 'cr%' | cr followed by any string of zero or more characters (the % symbol is the wild card for any string of zero or more characters)
NOT LIKE | Searches character fields such as probe name and returns records that do not match the pattern in the NOT LIKE statement | NOT LIKE 'cr%' | Not cr followed by any string of zero or more characters (the % symbol is the wild card for any string of zero or more characters)

Logical Operators & Complex Statements | Definition | Example Statement | Returns the Record for Metric Data...
AND | Connects two conditions and only returns results when both conditions are true | >5 AND <6 | Greater than 5 and less than 6
OR | Connects two conditions and returns results when either condition is true | <5 OR >9 | Less than 5 or greater than 9
NOT | Negates a condition when combined with various operators (for example, NOT LIKE, NOT IN) | NOT < 5000 | Not less than 5000
() | Used to force the order of evaluation of two or more combined conditions | (>5 AND <10) OR (>200 AND <500) | Greater than 5 and less than 10, or greater than 200 and less than 500

Table 5.2 Expression call search strings (GeneChip® data mode)

Absolute Call | Limit
Present | ='P'
Marginal | ='M'
Absent | ='A'
No call | ='No Call'

Difference Call | Limit
Increased | ='I'
Marginally increased | ='MI'
No change | ='NC'
Marginally decreased | ='MD'
Decreased | ='D'

An Oracle database is case-sensitive. Use upper case letters to specify the call, except for 'No Call'.
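The LIKE wild cards in Table 5.1 can be tried out with a small Python sketch that translates a LIKE pattern into a regular expression. This only illustrates the wild-card semantics described in the table; the database itself evaluates LIKE, and the helper names like_to_regex and like are hypothetical. The sketch matches case-sensitively, in line with the note about Oracle; other database configurations may behave differently.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression.

    '%' matches any string of zero or more characters,
    '_' matches exactly one character; everything else is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


probe_names = ["cre", "cr", "creX", "bioB"]

exact = [p for p in probe_names if like(p, "cre")]
one_char = [p for p in probe_names if like(p, "cr_")]
any_tail = [p for p in probe_names if like(p, "cr%")]
not_like = [p for p in probe_names if not like(p, "cr%")]
```

With these made-up probe names, 'cr_' matches only cre, while 'cr%' also matches cr and creX.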
Query Builder
The Query Builder helps you input complex limits in the filter grid without prior knowledge of correct syntax for operators such as BETWEEN and LIKE. You need only specify text or numbers where appropriate. The Query Builder inserts the logical operators and syntactically correct limit into the user-specified cell of the filter grid.

Entering Limits
1. Right-click the cell of interest in the filter grid (do not click the Sort row).
2. Select Show Query Builder from the shortcut menu that appears.
⇒ The Build Filter dialog box appears for the chosen type of result (Figure 5.11).
Figure 5.11 Build Avg Diff Filter dialog box
3. Click an operator or statement button. See Table 5.1, on page 66 for information on operators and statements.
4. Enter appropriate text to complete the limit. Lower case text in the query builder is a place holder that must be replaced with your input. A text search string must be enclosed in single quotation marks (for example, LIKE 'YDR154C/').
5. Click OK or press the ENTER key to place the limit in the filter grid.

Editing Limits
1. Click Undo in the Build Filter dialog box.
⇒ The last entry is deleted.
2. Alternatively, select the text you want to edit, then make a new entry or right-click to open a shortcut menu of edit commands.
The BACKSPACE, DELETE, and arrow keys are supported during editing in the Build Filter dialog box.

Selecting Analyses for the Query
An analysis includes the GeneChip® expression analysis results (*.chp) or spotted array intensity results (*.spt) derived from an experiment. An analysis is computed using particular values for user-modifiable algorithm parameters.

Selecting Analyses from the Data Tree
The data tree displays analyses in the current database as well as array sets. See Chapter 10 for more information about array sets.
■ To select analyses for the query, click the analyses or array sets in the data tree.
■ To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

Specifying Analysis Filters
If the publish database is on the LIMS server, you may specify analysis filters that determine the analyses displayed in the data tree.
■ Click the Display Analysis Filters button. Alternatively, select View Analysis → Filters from the menu bar.
⇒ The Filter Analysis dialog box appears (Figure 5.12).
The Filter Analysis dialog box contains an Attribute section (top) and a Sample section (bottom). Analysis filters can be specified in the Attribute section, Sample section, or both sections. Analysis filters specified in the Attribute and Sample sections are combined in OR fashion (union).
Figure 5.12 Filter Analysis dialog box, GeneChip® data mode (publish database on the LIMS server)

Attribute Section
The Attribute section includes the template tree, attribute list, and value list (Figure 5.13). Together these components comprise a hierarchy that enables you to specify particular attribute values as analysis filters (see Table 5.3). The data tree displays the analyses that contain the selected attribute values when the Apply button is clicked.
Figure 5.13 Filter Analysis dialog box, Attribute section

Table 5.3 Filter Analysis dialog box, Attribute section components
Component | Displays... | Select...
Template tree | sample and experiment templates in the current database | one or more templates from the template tree to display the associated attributes in the attribute list
Attribute list | attributes associated with the templates selected in the template tree | attributes from this list to display all values for the selected attributes in the value list
Value list | all the values in the current database for the attributes selected in the attribute list | particular attribute values from this list for use as analysis filters

Selecting Analysis Filters in the Attribute Section
To select adjacent items in the template tree, attribute list, or value list, press and hold the SHIFT key, then click the first and last row in the selection. To select non-adjacent items, press and hold the CTRL key, then click the desired rows.
1. Click the template names of interest in the template tree.
⇒ The Attribute list displays all attributes associated with the selected templates (Figure 5.14).
Figure 5.14 Template tree (top) and attribute list (bottom)
2. Click the attribute(s) of interest in the Attribute list (Figure 5.14).
⇒ The value list displays all values for the selected attribute(s) (Figure 5.15).
Figure 5.15 Value list displays all values for the selected attribute(s)
3. In the value list, click the attribute value(s) you want to use as analysis filter(s).
4. Click Clear to clear all selections from the Attribute section.
5. When finished specifying analysis filters, click Apply.
⇒ The data tree in the Query window displays the analyses selected by the filters. If analysis filters are specified in both the Attribute and Sample sections, DMT combines the filters in OR fashion (union).
If no attribute values are highlighted in the value list, then all values are selected.
Finding Templates or Attributes
The Find function in the Filter Analysis dialog box searches for templates, attribute names, or attribute values. The Find button is located in the lower right corner of the Attribute section in the Filter Analysis dialog box.
1. To begin a search, click Find in the Filter Analysis dialog box (Figure 5.12).
⇒ The Find dialog box appears (Figure 5.16).
Figure 5.16 Find dialog box
2. Enter the text string for the search (up to 256 alphanumeric characters and spaces) in the Find what box.
3. Select Templates, Attribute Names, or Attribute Values from the Look in drop-down list.
4. Click Find Now.
⇒ Template search highlights templates in the template tree that contain the search text string.
⇒ Attribute name search highlights: 1) the attributes in the attribute list that contain the search text string, and 2) the corresponding attribute values in the value list.
⇒ Attribute value search highlights the attribute values in the value list that contain the search text string.
5. Click Close to close the Find dialog box.

Sample Section
The Sample section of the Filter Analysis dialog box (Figure 5.17) displays the attributes that LIMS requires during sample registration and experiment setup (see Table 5.4).
Figure 5.17 Sample section of the Filter Analysis dialog box

Table 5.4 Filter Analysis dialog box, Sample section
Component | Contents
Sample Project | Projects in the current database. You can assign a sample to a project before publishing data. Several samples can be assigned to the same project for faster selection in DMT. If samples have been assigned to multiple projects, select all pertinent projects from the sample project list.
Probe Array | GeneChip® probe array types in the current database.
Sample Type | Sample types in the current database. You can organize experiments according to sample type before publishing analysis results data. The sample type may be used to create groups of results for a project. Many experiments may be associated with one sample type for faster selection in DMT. For example, experiment results may be assigned to Treated Liver or Untreated Liver sample types in the Liver project.
Operator | Logon user names of operators who created experiments.
Sample Name | Identifies the RNA source of the target hybridized to the GeneChip® probe array. You can assign the same sample name to different GeneChip probe arrays or experiments, then select the name to conveniently obtain all results for the sample from different experiments.

Selecting Filters in the Sample Section
The Sample section of the Filter Analysis dialog box (Figure 5.17) organizes attributes with increasing specificity from left to right.
1. Starting at the left in the Sample section, click the items of interest in each component list.
Selected items in the same component list are combined in OR fashion (union). Selections from different component lists are combined in AND fashion (intersection).

Table 5.5 Filter Analysis dialog box, Sample section
Select This From Component List | To Display...
Sample Project | in the data tree, the analyses associated with the projects
Probe Array | in the data tree, the analyses associated with the probe arrays
Sample Type | operators and sample names associated with the sample types
Operator | sample names associated with the selected sample types AND operators
Sample Name | in the data tree, the analyses associated with the selected sample types AND operators AND sample names

If no items in a list have been highlighted, then all of the items in the list are selected by default.
2. When finished specifying analysis filters, click Apply.
⇒ The data tree in the Query window displays the analyses selected by the filters. If analysis filters are specified in both the Attribute and Sample sections, DMT combines the filters in OR fashion (union).
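The Sample section selection rules above (OR within one component list, AND across lists, and an unhighlighted list meaning "all items") can be sketched as follows. This is an illustration only, not DMT's implementation; the function name passes_sample_filters, the dictionary layout, and the sample analyses are hypothetical.

```python
def passes_sample_filters(analysis, selections):
    """Return True if an analysis satisfies the Sample section selections.

    selections maps a component list name to the set of highlighted items.
    An empty set means nothing is highlighted, which selects all items.
    Items within one list are ORed; different lists are ANDed.
    """
    return all(
        not chosen or analysis[component] in chosen
        for component, chosen in selections.items()
    )


# Made-up analyses described by their sample attributes.
analyses = [
    {"Name": "A1", "Sample Project": "Liver", "Sample Type": "Treated Liver", "Operator": "jsmith"},
    {"Name": "A2", "Sample Project": "Liver", "Sample Type": "Untreated Liver", "Operator": "jsmith"},
    {"Name": "A3", "Sample Project": "Liver", "Sample Type": "Treated Liver", "Operator": "mlee"},
    {"Name": "A4", "Sample Project": "Kidney", "Sample Type": "Treated Kidney", "Operator": "mlee"},
]

selections = {
    "Sample Project": {"Liver"},
    "Sample Type": {"Treated Liver", "Untreated Liver"},  # OR within the list
    "Operator": set(),  # nothing highlighted: all operators selected
}

shown = [a["Name"] for a in analyses if passes_sample_filters(a, selections)]
```

Highlighting an operator as well narrows the result further, because selections from different lists are intersected.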
Running a Query
After specifying the filter and selecting the analyses from the data tree, the query is ready to run. To run the query, do one of the following:
■ Click the Query button or select Query → Run Query from the menu bar.
⇒ The query table displays the query results.
■ Click the Pivot button or select Query → Pivot Data from the menu bar.
⇒ The pivot table displays the query results.
For more information about results tables, see Chapter 7, Query Results Tables.

Normalizing GeneChip® Signal Data
Normalization is a mathematical technique that minimizes discrepancies in results data from different experiments due to non-biological variables such as sample preparation, hybridization conditions, staining, amount of spotted probe, or GeneChip® probe array lot. Results data may be normalized prior to publishing in Affymetrix® Microarray Suite (GeneChip data) or Affymetrix® Jaguar™ (spotted array data). If GeneChip signal data were not normalized or were not normalized consistently, normalization can be performed in DMT.
In DMT, you may normalize the data before or after a query or pivot operation.
The normalization option is only available in GeneChip® data mode.

Choosing Normalization Before a Query or Pivot
1. Click the Options button.
⇒ The Data Mining Options dialog box appears.
2. Click the Normalization tab.
⇒ The normalization options are displayed (Figure 5.18).
Figure 5.18 Data Mining Options dialog box, Normalization tab
3. Select the Compute Normalization option and confirm the All Probe Set Normalization algorithm is selected.
4. Click OK.
⇒ After a query, the query and pivot table display normalized signal values for each probe set.
If the pivot table does not display the normalized data column, verify that the pivot data includes Norm Signal or Norm Avg Diff (select Query → Pivot Data from the menu bar).
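As a rough illustration of what scaling normalization does, the following Python sketch computes a normalization factor from the relation given under Normalization Options (Target Intensity = NF × average signal) and applies it to one analysis. It is a sketch under stated assumptions, not DMT's algorithm: the function names are hypothetical, and the way the intensity threshold and the low and high percentages are applied here (drop values below the threshold, then drop the lowest and highest fractions by count) is an assumed reading of those options.

```python
def trimmed_average(signals, threshold=None, low_pct=2.0, high_pct=2.0):
    """Average signal after optional thresholding and percentage trimming.

    Assumption: values below the threshold are omitted first, then the
    lowest low_pct percent and highest high_pct percent (by count) are dropped.
    """
    values = sorted(s for s in signals if threshold is None or s >= threshold)
    n = len(values)
    low_cut = int(n * low_pct / 100.0)
    high_cut = n - int(n * high_pct / 100.0)
    kept = values[low_cut:high_cut]
    if not kept:
        raise ValueError("no signal values left to average")
    return sum(kept) / len(kept)


def normalization_factor(signals, target_intensity, **options):
    """NF such that target_intensity = NF x average signal."""
    return target_intensity / trimmed_average(signals, **options)


# Made-up signals for one analysis: 1, 2, ..., 100.
signals = [float(v) for v in range(1, 101)]

avg = trimmed_average(signals)                # drops 1, 2, 99, 100
nf = normalization_factor(signals, 5000.0)    # 5000 is the target intensity
normalized = [nf * s for s in signals]
```

After scaling, the trimmed average of the normalized signals equals the target intensity, which is what makes analyses with different overall brightness comparable.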
Choosing Normalization After a Query or Pivot
1. After a query or pivot operation is run, select Query → Normalize from the menu bar.
2. To display the normalized signal data in the query table, click the Query tab in the results pane (displays the query table), then select Query → Normalize from the menu bar.
If the Query → Normalize menu item is not available, verify that the All Probe Set Normalization algorithm is selected in the Data Mining Options dialog box (click the Options button, then click the Normalization tab).
3. To display the normalized signal data in the pivot table, click the Pivot tab in the results pane (displays the pivot table), then select Query → Normalize from the menu bar.
If the pivot table does not display the normalized data values, check to make sure the pivot data includes Norm Signal or Norm Avg Diff (select Query → Pivot Data from the menu bar).

Normalization Options
1. Click the Options button.
⇒ The Data Mining Options dialog box appears.
2. Click the Normalization tab.
⇒ The normalization options are displayed (Figure 5.19).
Figure 5.19 Data Mining Options dialog box, Normalization tab
3. Click Settings.
⇒ The All Probe Set Normalization Settings dialog box appears (Figure 5.20).
Figure 5.20 All Probe Set Normalization Settings

Target Intensity
Select this option to normalize the signal data to a user-specified target intensity (default = 5000). When selected, DMT computes the Normalization Factor (NF) for an analysis n so that:
Target Intensity = NF_n × (average signal)_n
If the user-specified Target Intensity option is not selected, DMT sets the Target Intensity equal to the average signal of all analyses queried, not just the analyses returned by the query.

Intensity Threshold
Select the Intensity Threshold option to specify a threshold for the signal values used to compute the average signal.
When the signal of a probe set is less than the intensity threshold, DMT omits the probe set from the average signal calculation.

Low and High Percentage
DMT does not include a signal value in the average signal calculation when it falls in the Low Percentage or High Percentage range. (The default values are the bottom 2% and the top 2%.) If an Intensity Threshold is specified, the Low and High Percentage range is applied to the signal values above the threshold.

Chapter 6 Managing Queries
Saving and reopening filters saves time when one or more complex filters are used on a regular basis. This chapter outlines the tasks of saving a query, opening previously saved queries, and deleting queries.

Saving a Query
You may save the filter parameters you specify. You can apply the saved filter parameters to subsequent experimental results or use them to regenerate the current query results in a future session.
1. When a DMT session is open, select Data → Save from the menu bar.
⇒ The Save dialog box appears (Figure 6.1).
Figure 6.1 Save dialog box
2. Enter a name for the query in the Name box, then click Save.
⇒ This saves the filter parameters.

Using the Save As Command
Queries created by other users may be opened as read-only. Changes to read-only queries cannot be saved unless the query is renamed. This prevents users from modifying queries created by other users. You can also use the Save As command if you want to modify, but not overwrite, one of your own queries.
1. Select Data → Save As from the menu bar.
⇒ The Save dialog box appears (Figure 6.2).
Figure 6.2 Save dialog box
2. Enter a new name for the modified query, then click Save.
⇒ This saves the filter parameters.

Opening a Previously Saved Query
1. When a DMT session is open, select Data → Open from the menu bar.
⇒ The Open dialog box appears (Figure 6.3).
The Open dialog box displays all saved queries in the default directory, unless the Only show my queries option box is selected. You may open any saved query.
Figure 6.3 Open dialog box
2. Select a query, then click Open.
⇒ The DMT session starts.

Deleting a Query
1. Select Data → Delete Query from the menu bar.
⇒ The Delete dialog box appears (Figure 6.4).
Figure 6.4 Delete dialog box
2. Select a query, then click Delete.
⇒ The selected query is permanently removed from the system.
Users (identified by the logon name) cannot delete queries created by other users.

Chapter 7 Query Results Tables
The results tables display experimental information and expression data that satisfy the query filter conditions. This chapter explains how to use these results tables. The results tables are generated independently. Therefore, you can change the analyses displayed in one table without affecting the contents of the other tables.

Experiment Information Table
The experiment information table displays information about the analyses or array sets selected in the data tree.
1. To view experiment information for several analyses or array sets:
a. In the data tree, select the analyses or array sets you want to view.
To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.
b. Click the Info button or select Query → Experiment Information from the menu bar.
⇒ The selected analyses are displayed in the experiment information table (Figure 7.1, Figure 7.2).
2. To view experiment information for one analysis, right-click the analysis in the data tree and select Experiment Info from the shortcut menu.
3. If necessary, click the Experiment Info tab to view the table.
The experiment information table displays each analysis in a separate column.
You can resize, reorder, or hide columns as desired (see Appendix B).

GeneChip® Data Mode
The experiment information table for GeneChip® data (Figure 7.1) displays information about analyses or array sets selected in the data tree, including:
■ Information entered during GeneChip® probe array experiment setup.
■ Data and experiment attributes automatically captured during hybridization, scanning and analysis.
Figure 7.1 Experiment information table, GeneChip® data mode

Spot Data Mode
The experiment information table for spot data (Figure 7.2) displays information about selected analyses, including:
■ Probe array and operator name.
■ Parameters associated with the analysis.
Figure 7.2 Experiment information table, spot data mode

Query Table
The query table presents query data in rows that identify an analysis and probe name (probe set or spot probe), followed by the expression metrics that met the limits specified by the filter. Appendix C describes the metrics and other types of data included in the query table.
To populate the query table:
1. In the data tree, click analyses or array sets of interest.
2. Specify filters (optional).
3. Click the Query button.
⇒ The query table displays the query results (Figure 7.3, Figure 7.4).
If you specified a sort order for a particular metric(s) in the filter grid, the corresponding query table rows are arranged accordingly (ascending, descending, or no sort). The columns may be resized, reordered, or hidden as desired (see Appendix B).
Figure 7.3 Query table, GeneChip® data mode
Figure 7.4 Query table, spot data mode
In the Spot column, the spot coordinates (in parentheses) follow the probe name.

Pivot Data Table
The results of a query frequently include the same probe (probe set or spot probe) from different analyses.
The pivot operation organizes the query results so that all analysis results for a particular probe are displayed side by side in the same row of the pivot data table (Figure 7.5). The pivot table is blank until the pivot operation is run. The pivot table makes it easier to review and compare the query results from different analyses that are associated with a particular probe.
Figure 7.5 Pivot table, GeneChip® data mode
Figure 7.6 Pivot table, spot data mode

Selecting Results for the Pivot Table
Before running the pivot operation, specify the type of expression metrics you want to view in the pivot table.
1. Click the Options button.
⇒ The Data Mining Options dialog box appears.
2. Click the Pivot tab.
⇒ This tab displays the results available for the pivot operation (Figure 7.7, Figure 7.8).
Figure 7.7 Data Mining Options dialog box, Pivot tab (GeneChip® data mode)
Figure 7.8 Data Mining Options dialog box, Pivot tab (spot data mode)
3. Place (or remove) a check mark next to the result you want to include in (or exclude from) the pivot operation.
DMT applies the data selections to the next pivot operation. The pivot table displays only the types of results selected for the pivot operation.
4. Click OK to close the Data Mining Options dialog box.

Viewing Results Selected for the Pivot Table
The menu bar also shows the metrics selected for the pivot table.
1. To view Statistical algorithm metrics, select Query → Select Pivot Data → Statistical Algorithm Results from the menu bar.
2. To view Empirical algorithm results, select Query → Select Pivot Data → Empirical Algorithm Results from the menu bar.
⇒ This displays a drop-down list of metrics (Figure 7.9). Check marks indicate items selected for the pivot table.
To include (or exclude) a result in the pivot table, click the result to add (or remove) a check mark.
Figure 7.9 Pivot data drop-down list (GeneChip® data mode), Empirical algorithm (left) and Statistical algorithm (right)

Running the Pivot Operation
You can query analyses or array sets selected in the data tree and display the results in the pivot table.
1. In the data tree, click the analyses you want to query and pivot.
To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.
2. To view the query results in the pivot table, do one of the following:
■ Click the Pivot button.
■ Right-click a highlighted analysis or array set in the data tree and select Pivot Data from the shortcut menu.
■ Select Query → Pivot from the menu bar.
⇒ This displays the pivot table (Figure 7.5, Figure 7.6).
The pivot table must be populated before the scatter, fold change, or bar graph can be plotted.

Including Probe Descriptions in the Pivot Table
The pivot table can include probe (probe set or spot probe) descriptions. The descriptions are derived from public databases (except for custom probe arrays).
■ To display probe descriptions, select Query → Pivot Descriptions from the menu bar.
⇒ This adds the Description column to the pivot table.

Including Annotations in the Pivot Table
The pivot table can include several columns of annotations. For more information about annotating probes, see Chapter 8, Annotations.
1. Select Query → Pivot Annotations from the menu bar.
⇒ If there is more than one annotation type, the Select Pivot Annotation Type dialog box appears (Figure 7.10).
Figure 7.10 Select Pivot Annotation Type dialog box
2. Select an annotation type, then click OK.
⇒ This adds a column of annotations (one annotation type per column) to the pivot table (far right).
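As a rough illustration of what the pivot operation does to query results, the following Python sketch groups query rows by probe so that the values from each analysis sit side by side in one row. It is not DMT code: the function name pivot_rows, the row layout, the analysis names A1 and A2, the probe name probe_X, and all values are assumptions made for the example.

```python
def pivot_rows(query_rows, metrics):
    """Arrange query rows so each probe has one row with a cell per
    (analysis, metric) pair. Illustrative only; not DMT's implementation."""
    analyses = []   # analyses in the order they are first seen
    table = {}      # probe name -> {(analysis, metric): value}
    for row in query_rows:
        if row["analysis"] not in analyses:
            analyses.append(row["analysis"])
        cells = table.setdefault(row["probe"], {})
        for metric in metrics:
            if metric in row:
                cells[(row["analysis"], metric)] = row[metric]
    # One pivot column per analysis and selected metric.
    columns = [(a, m) for a in analyses for m in metrics]
    return columns, table


# Hypothetical query results: one row per analysis and probe.
rows = [
    {"analysis": "A1", "probe": "AFFX-BioB-5", "Signal": 120.0},
    {"analysis": "A2", "probe": "AFFX-BioB-5", "Signal": 95.5},
    {"analysis": "A1", "probe": "probe_X", "Signal": 40.0},
]
columns, table = pivot_rows(rows, metrics=["Signal"])
for probe, cells in table.items():
    # A probe missing from an analysis simply has an empty cell (None).
    print(probe, [cells.get(col) for col in columns])
```

Only the metrics passed in appear as columns, which mirrors the way the pivot table shows only the result types selected for the pivot operation.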
Sorting Pivot Table Columns
You can specify a sort order for up to four columns in the pivot table.
1. Click the Pivot tab in the results pane.
2. Select Edit → Sort from the menu bar. Alternatively, right-click the pivot table and select Sort from the shortcut menu.
⇒ The Sort dialog box appears (Figure 7.11).
Figure 7.11 Sort dialog box
3. Click the Sort By drop-down arrow.
4. Select a pivot column from the drop-down list (Figure 7.12).
Figure 7.12 Sort dialog box
5. Select the Ascending or Descending sort order option.
6. To specify another sort order, click the next drop-down arrow in the Then By box, and repeat steps 4 and 5.
7. Click OK when finished.

Pivot Options
The Data Mining Options dialog box displays the pivot options (Figure 7.13).
1. To open the Data Mining Options dialog box, do one of the following:
■ Click the Options button.
■ Right-click the pivot table and select Options from the shortcut menu.
■ Select View → Options from the menu bar, then click the Pivot tab.
Figure 7.13 Data Mining Options dialog box, GeneChip® data mode (left) and spot data mode (right)

Show Order Analyses Dialog
Select this option to display the Order Pivot Analyses dialog box (Figure 7.14) before the pivot operation begins. This dialog box enables you to specify an order for the analyses (columns) in the pivot table. The analysis order in the pivot table determines the order of the analyses in the series bar and histogram graphs.
Figure 7.14 Order Pivot Analyses dialog box
1. Use the drag-and-drop method to change the order of the analyses in the Order Analyses dialog box.
2. Click OK to pivot the data.
You can also reorder columns in the pivot table using the drag-and-drop method (see Appendix B).
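The Sort By and Then By behavior described under Sorting Pivot Table Columns can be sketched in Python as a multi-key sort. This is only a model of the idea, not DMT's implementation: the function name sort_pivot_rows, the row dictionaries, and the column names are hypothetical. The sketch relies on Python's sort being stable, so applying the keys from last to first gives the first key the highest priority.

```python
def sort_pivot_rows(rows, sort_keys):
    """Sort rows by up to four (column, ascending) pairs.

    The first pair plays the role of Sort By; later pairs play the
    role of Then By. Illustrative only.
    """
    if len(sort_keys) > 4:
        raise ValueError("the Sort dialog accepts at most four columns")
    result = list(rows)
    # Stable sort: apply the lowest-priority key first.
    for column, ascending in reversed(sort_keys):
        result.sort(key=lambda r: r[column], reverse=not ascending)
    return result


# Hypothetical pivot rows.
rows = [
    {"probe": "b", "signal": 2.0},
    {"probe": "a", "signal": 2.0},
    {"probe": "c", "signal": 5.0},
]
ordered = sort_pivot_rows(rows, [("signal", False), ("probe", True)])
print([r["probe"] for r in ordered])
```

Here the rows are ordered by descending signal first, and rows with equal signal are then ordered by ascending probe name.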
Working with Tables
Working with results tables is the same in GeneChip® data mode (shown in this section) or spot data mode.

Finding Probes
DMT can perform a text search in the query or pivot table.
1. To specify the text string for the search, do one of the following:
■ Click the Find button.
■ Right-click the query or pivot table and select Find In Results from the shortcut menu.
■ Select Edit → Find In Results from the menu bar.
⇒ The Find Probe dialog box appears (Figure 7.15).
Figure 7.15 Find Probe dialog box, GeneChip® data mode
2. Enter the text string for the search in the Find What box, or select a previously entered text string from the Find What drop-down list.
3. Select the Match and Direction search options, then click Find Next.
If the pivot table includes descriptions, the find function searches the probe set or spot probe name and description columns.
4. Click Find Next again to continue the search.
The Find Next command finds all strings that match the search text string. For example, using the Find Next command to search for the text string biob would find AFFX-BioB-5 as well as other occurrences of BioB (unless either the Match whole word only or Match case option is selected).

Viewing Descriptions & Obtaining Further Gene Information
The Description dialog box (Figure 7.16) is available in the query or pivot table. It enables you to:
■ View descriptions.
■ View or enter annotations for the selected probe (probe set or spot probe).
■ Access an Internet website for further gene information.
1. Double-click the query or pivot table row that contains the probe of interest.
⇒ The Description dialog box appears (Figure 7.16).
The Description dialog box displays:
■ The probe name and a brief description.
■ The target sequence the probe set is designed to interrogate.
■ Annotations.
The Description dialog box is automatically updated when you click another probe in the query or pivot table.
Figure 7.16 Description dialog box, GeneChip® data mode
2. To obtain further gene information, select an Internet website from the drop-down list, then click Information.
⇒ The default Internet browser is started and automatically opens the selected website.
3. To annotate the selected probe, click Annotate.
See the following section for information about annotating probes.

Annotating Probes
You can annotate probes (probe sets or spot probes) displayed in the query or pivot table.
1. Select one or more probe names in the query or pivot table.
To select adjacent names, press and hold the SHIFT key while you click the first and last name in the selection. To select non-adjacent names, press and hold the CTRL key while you click the names.
2. Right-click the query or pivot table and select Annotate Probes from the shortcut menu. Alternatively, select Annotations → Annotate Probes from the menu bar.
⇒ The Annotate dialog box appears and displays the selected probe names in the Probe Set(s) box (Figure 7.17).
3. Enter an Annotation Type or make a selection from the drop-down list.
4. Enter comments in the Annotation box.
Figure 7.17 Annotate dialog box
5. Click OK to add the annotation and close the Annotate dialog box.

Adding Probes to the Filter Grid
You can add all or selected probes (probe sets or spot probes) in the query or pivot table to the current filter. DMT saves the selected probes as a probe list, then adds the probe list to the filter. (See Chapter 9 for more information about probe lists.)

Adding Selected Probes
1. Select one or more probe names in the query or pivot table.
To select adjacent names, press and hold the SHIFT key while you click the first and last name in the selection.
To select non-adjacent names, press and hold the CTRL key while you click the names.
2. Right-click the table and select Add Selected Rows to Filter from the shortcut menu. Alternatively, select Edit → Add Selected Rows to Filter from the menu bar.
⇒ The selected probe set names are added to the Probe Set Name column (or the selected spot probe names to the Spot column) in the filter grid.

Adding All Probes
1. Right-click the query or pivot table and select Add All Rows to Filter from the shortcut menu. Alternatively, select Edit → Add All Rows to Filter from the menu bar.
⇒ All probe set names are added to the Probe Set Name column (or all spot probe names to the Spot column) in the filter grid.
If the option Always Prompt to Create List is chosen (select Edit → Lists → Always Prompt to Create List from the menu bar), DMT prompts you to create a list of the probe sets (or spot probes) you want to add to the filter. DMT adds the list name to the filter instead of the probe names. See Chapter 9 for more information about lists.

Copying Tables
All or a selected portion of a results table can be copied to the system clipboard, then pasted into other applications.
1. To select the entire results table, click the upper left corner of the table (Figure 7.18).
Figure 7.18 Query table (GeneChip® data mode), all rows selected
2. To select part of a results table, do one of the following:
■ Click and drag the mouse to select the desired rows.
■ Click a row header to select the entire row. To select adjacent rows, press and hold the SHIFT key while you click the first and last row in the selection. To select non-adjacent rows, press and hold the CTRL key while you click the rows.
3. To copy the selection to the system clipboard, do one of the following:
■ Click the Copy Cells button.
■ Right-click the table and select Copy Cells from the shortcut menu.
■ Select Edit → Copy Cells from the menu bar.
⇒ The selected table cells are copied to the system clipboard.
4. To copy the selection to Excel, select Edit → Copy to Excel from the menu bar.
⇒ Microsoft® Excel opens and the selection is pasted into a new spreadsheet.

Exporting Data
The experiment information, query, or pivot table data may be exported (saved) to a tab-delimited text file (*.txt), then imported into other applications. Hidden table columns are not exported.
1. Select Data → Export As from the menu bar.
⇒ The Export As dialog box appears (Figure 7.19).
Figure 7.19 Export As dialog box
2. Select a directory from the Save in drop-down box.
3. Enter a File name, then click Save.

Expanding the Results Pane
When the Query window displays both the graph and results pane, you can enlarge the results pane.
1. Right-click a table in the results pane and select Expand Results from the shortcut menu. Alternatively, select View → Expand Results from the menu bar.
⇒ The graph pane is hidden and the results pane is enlarged.
2. To return the results pane to its original size, repeat step 1.

Clearing the Results Pane
To clear the results pane, select Edit → Clear Results from the menu bar.
⇒ All tables from the results pane are cleared.

Chapter 8 Annotations
You can annotate probes (probe sets or spot probes) and view the annotations in the pivot table. The annotations may be queried and the query results may be added to the filter. Creating and working with annotations is the same in GeneChip® data mode (shown in this chapter) or spot data mode.

Annotating Probes
1. Select one or more probes in the query or pivot table.
2. Right-click the query or pivot table and select Annotate Probes from the shortcut menu. Alternatively, select Annotations → Annotate Probes from the menu bar.
⇒ The Annotate dialog box appears and displays the selected probes in the Probe Set(s) box (Figure 8.1).
Figure 8.1 Annotate dialog box, GeneChip® data mode
3. Enter the Annotation Type, or select from the drop-down list.
4. Enter comments in the Annotation box, then click OK.

Loading Annotations
You can add or load annotations previously saved in a text file (*.txt) to the system.

Creating a Text File
Use the following procedure to create an annotation text file.
1. Create a text file (*.txt) following the tab-delimited format shown in Figure 8.2.
2. In the first line, enter the column names (as defined in Table 8.1) delimited by tabs (Figure 8.2).
Table 8.1 Annotation text file, column names
Column Number    Column Name
1                Probe name
2                Type
3                Annotation
3. In the next line, enter the probe name, annotation type and the annotation delimited by tabs (Figure 8.2).
Enter only one annotation per line. Each annotation can include up to 2000 characters.
Figure 8.2 Annotations text file (*.txt), GeneChip® data mode

Loading the Annotations
Use the following procedure to add an annotations text file to the system.
1. Select Annotations → Load Annotations from the menu bar.
⇒ The Open dialog box appears (Figure 8.3) and displays the contents of the default directory specified in the Data Mining Options dialog box (Default Directory tab).
Figure 8.3 Open dialog box
2. Select the text file that contains the annotations, then click Open.
⇒ The annotations are added to the GeneInfo database and are available to DMT.

Querying Annotations
The Query Annotations window (Figure 8.4) enables you to build an annotation query (top pane) and view the returned results (bottom pane).
1. Select Annotations → Query Annotations from the menu bar.
⇒ The Query Annotations window appears (Figure 8.4).
Figure 8.4 Query Annotations window
2. Click the drop-down arrow in the Field column to display a drop-down list of field types.
Select <None> if you want to clear a previously selected field type.
3. Use the scroll bar to view the list and select a field type (or none) from the drop-down list.
4. Enter the search text string in the Search For box.
⇒ DMT combines the field type with the text string in AND fashion (intersection) (see Table 8.2).
5. To edit the text string, highlight the entry and right-click the cell.
⇒ A shortcut menu of edit commands is displayed.
Table 8.2 Field types
Field Type         Returns Annotations...
Probe              for probe set or spot probe names that contain the text string
User               created by a user whose name contains the text string
Description        for probe sets with descriptions that contain the text string
Annotation Type    of the specified type that contain the text string
6. To enter another row of criteria, click the Operation column.
7. Click the drop-down arrow, then select the AND (intersection) or OR (union) operator.
⇒ DMT automatically adds another row to the Query Annotations filter grid.
8. To specify additional query criteria, repeat step 2 through step 6.
9. To run the query, do one of the following:
■ Click the Query Annotations button.
■ Right-click the top pane and select Run Query from the shortcut menu.
■ Select Annotations → Run Query from the menu bar.
⇒ The bottom pane of the Query Annotations window displays the returned results (Figure 8.5).
Figure 8.5 Query Annotations window, query criteria (top) and query results (bottom)

Annotation Query Results
Type           Annotation type selected when the annotation was created.
Annotation     Text entered by the user who created the annotation.
User           Windows NT name of the user who logged onto the workstation when the annotation was created.
Date           Date when the annotation was created or last updated.
Description    Probe description (derived from a public database).
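The way annotation query rows combine can be pictured with a small Python sketch. It is a model built on stated assumptions, not DMT's query engine: the Annotation record, the function names, the sample data, the case-insensitive matching, and the left-to-right evaluation of AND and OR are all hypothetical choices made for illustration. Each criterion pairs a field type with a search string (Table 8.2); a field name that is not Probe, User, or Description is treated as an annotation type.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    probe: str
    ann_type: str
    text: str
    user: str
    description: str = ""


# Field types from Table 8.2 mapped to record attributes (assumed layout).
BUILT_IN_FIELDS = {"Probe": "probe", "User": "user", "Description": "description"}


def matches(ann, field, search):
    """True if the annotation satisfies one row of query criteria."""
    needle = search.lower()  # case-insensitive matching is an assumption
    if field in BUILT_IN_FIELDS:
        return needle in getattr(ann, BUILT_IN_FIELDS[field]).lower()
    # Any other field name is taken to be an annotation type.
    return ann.ann_type == field and needle in ann.text.lower()


def run_query(annotations, criteria):
    """criteria: list of (operator, field, search); the operator of the
    first row is ignored. Rows are combined left to right (assumption)."""
    results = []
    for ann in annotations:
        keep = None
        for operator, field, search in criteria:
            hit = matches(ann, field, search)
            if keep is None:
                keep = hit
            elif operator == "AND":
                keep = keep and hit
            elif operator == "OR":
                keep = keep or hit
            else:
                raise ValueError(f"unknown operator: {operator}")
        if keep:
            results.append(ann)
    return results


# Hypothetical annotations.
annotations = [
    Annotation("AFFX-BioB-5", "Control", "spike-in control", "jsmith"),
    Annotation("probe_X", "Function", "putative kinase", "jsmith"),
    Annotation("probe_Y", "Function", "putative kinase", "alee"),
]
found = run_query(annotations, [(None, "User", "smith"), ("AND", "Function", "kinase")])
print([a.probe for a in found])
```

Replacing AND with OR in the second row would return every annotation that satisfies either row, which corresponds to the union described in step 7.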
Copying Annotation Query Results
Annotation query results may be copied to the system clipboard and pasted into other applications. The row numbers are also copied with the selected cells for reference.
1. Click the row number in the query results to select the entire row.
2. Select Annotations → Copy Cells from the menu bar.
⇒ The selection is copied to the system clipboard.

Clearing the Annotation Query or Query Results
1. To clear the annotation filter grid (top pane of the Query Annotations window), select Annotations → Clear Query from the menu bar.
2. To clear the annotation query results (bottom pane of the Query Annotations window), select Annotations → Clear Results from the menu bar.

Adding Probes to the Filter Grid
Probes (probe sets or spot probes) returned by an annotation query may be added to the current filter.
1. Select one or more probes in the bottom pane of the Query Annotations window.
2. Right-click the selection, then select Add Selected Results To Filter from the shortcut menu. Alternatively, select Annotations → Add Selected Results To Filter from the menu bar.
⇒ The selected probe set names are added to the Probe Set Name column (or the selected spot probe names to the Spot column) in the filter grid.
If the option Always Prompt to Create List is selected (Edit → Lists → Always Prompt to Create List from the menu bar), DMT prompts you to create a list of the probe sets (or spot probes) you want to add to the filter. DMT adds the list name to the filter instead of the probe names. See Chapter 9 for more information about lists.

Deleting Annotations
An annotation may only be removed from the database by the user who created it. The delete command permanently removes an annotation from the system.
1. Select Annotations → Query Annotations from the menu bar.
⇒ The Query Annotations window appears (Figure 8.6).
Figure 8.6 Query Annotations window, specifying search for annotations created by the user
2. Select User from the Field drop-down list.
3. Enter your user name in the Search For box.
4. Click the Query Annotations button.
⇒ All of the annotations that meet the criteria are displayed.
5. To select a row, click the row number. To select all rows, click the upper left corner of the query results pane (Figure 8.7).
Figure 8.7 Query Annotations window
6. Select Annotations → Delete Selected Annotations from the menu bar. Alternatively, right-click a selected annotation, then select Delete Selected Annotations from the shortcut menu.
⇒ The selected annotations are permanently removed.

Chapter 9 Probe Lists
A user-specified group of probes (probe sets or spot probes) can be saved as a probe list. Probe lists are displayed in the data tree and may be added to the filter grid (probe set name or spot column), or used to view specific query results. A text file (comma delimited *.txt) that specifies a probe list may also be added to the system. This chapter covers the methods for creating or loading probe lists and how to use and manage probe lists in Affymetrix® Data Mining Tool. Creating and working with probe lists is the same in GeneChip® data mode (shown in this chapter) or spot data mode.

Creating Probe Lists
A probe list may be generated from probes selected from:
■ The query or pivot table.
■ Cluster analysis results.
■ Search array descriptions.
■ The filter grid.
Additionally, existing probe lists may be combined to create new lists. This section outlines the various procedures for creating probe lists.

Creating a Probe List from the Query or Pivot Table
1. Select one or more probes in the query or pivot table.
2. Right-click the table and select Create Probe List from the shortcut menu.
⇒ The Save Probe List dialog box appears (Figure 9.1).
Figure 9.1 Save Probe List dialog box
3. Enter a name for the list in the Name box, then click Save.
⇒ The Probe List Members dialog box appears and displays the members in the saved list (Figure 9.2).
Figure 9.2 Probe List Members dialog box
4. Click Close when finished viewing the probe list members.
⇒ The data tree displays the probe list in the Probe Lists directory (Figure 9.3).
5. In the data tree, click the plus sign (+) next to the probe list name to display the probe list members. For example, in Figure 9.3, the probe list L1 contains five members.
Figure 9.3 Data tree, Probe Lists Directory

Creating a Probe List from Cluster Analysis
The cluster members identified by cluster analysis may be saved as a probe list. (See Chapter 14 for more information about cluster analysis.)
1. After the cluster analysis results are returned, click the cluster of interest in the Clusters tab of the graph pane.
⇒ The cluster members (probe sets or spot probes) are displayed in the Probes box (Figure 9.4).
Figure 9.4 Graph pane, Clusters tab (GeneChip® data mode)
2. Enter a Probe List Name, then click Save Selected.
⇒ A probe list is created that includes the members of the selected cluster and is displayed in the data tree Probe List directory.

Creating a Probe List from Search Array Descriptions
1. Select Edit → Search Array Descriptions from the menu bar.
⇒ The Search Array Descriptions dialog box appears (Figure 9.5).
Figure 9.5 Search Array Descriptions dialog box
2. In the Search for box, enter the description, or partial description, then click Find.
⇒ Results for the search are displayed in the list box (Figure 9.6).
Figure 9.6 Search Array Descriptions dialog box with search results
3. Press and hold the CTRL key while you click to select the desired probe set names.
4. Click Add to Filter.
⇒ The Save Probe List dialog box appears.
5. Enter a Name for the probe list, then click Save.
⇒ The Probe List Members dialog box appears.
6. Click Close when finished.
⇒ The data tree displays the probe list.

Creating a Probe List from Filter
1. Enter the probe set names in the Probe Set Name column of the filter grid.
2. Select Edit → Probe Lists → Create Probe List from Filter.
⇒ The Save Probe List dialog box appears.
3. Enter a Name for the probe list, then click Save.
⇒ The Probe List Members dialog box appears.
4. Click Close when finished.
⇒ The data tree displays the probe list.

Creating a Probe List by Combining Existing Lists
1. Select Edit → Probe Lists → Combine Probe Lists.
⇒ The Combine Probe Lists dialog box appears (Figure 9.7).
Figure 9.7 Combine Probe Lists dialog box
2. Select or clear the Only show my probe lists option, as desired.
3. In the Combine Probe List drop-down box, select a probe list.
4. Select a second probe list from the lower drop-down box.
5. Select either the And or Or option.
■ And specifies that probe names must belong to both lists to be included in the new list.
■ Or specifies that probe names belonging to either one or both lists will be included in the new list.
6. Enter a new probe list name.
Figure 9.8 Combining all probes belonging to lists Hu6800 or Like_Affx IN_cre
7. Click OK.
⇒ The Probe List Members dialog box appears, displaying all probes in the new probe list.
8. Click Close when finished.
⇒ The data tree displays the new probe list.

Loading a Probe List
In addition to creating a probe list (described in the preceding section), a probe list may be loaded or added to the system. There are two methods available for loading a probe list:
■ Specify members. Select this option to manually enter the probe list members.
■ Specify input file. Select this option to load a previously saved text file (*.txt) that specifies the probe list members.
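Both loading methods use the comma-delimited form in which every entry, including the last, is terminated by a comma. The Python sketch below shows one way to produce and read text in that form. The function names and the placeholder probe names probe_X and probe_Y are assumptions made for the example, and tolerating spaces or line breaks between entries is a convenience of the sketch rather than documented DMT behavior.

```python
def format_probe_list(names):
    """Return the names in comma-delimited form, each entry
    terminated by a comma (including the last one)."""
    return "".join(f"{name}," for name in names)


def parse_probe_list(text):
    """Split comma-delimited text into probe names, ignoring empty
    entries and surrounding whitespace (an assumption of this sketch)."""
    return [part.strip() for part in text.split(",") if part.strip()]


members = ["AFFX-BioB-5", "probe_X", "probe_Y"]  # placeholder names
text = format_probe_list(members)
print(text)
print(parse_probe_list(text))
```

The formatted string can be typed into the Specify members box or saved to a *.txt file for the Specify input file option.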
Specifying Probe List Members
1. Select Edit → Probe Lists → Load Probe List from the menu bar.
⇒ The Load List dialog box appears (Figure 9.9).
2. Enter a Probe List name.
3. Select the Specify members (comma delimited) option and enter the probe set or spot probe names using a comma-delimited format (terminate the entry with a comma) (Figure 9.9).
Figure 9.9 Load List dialog box, Specify members option
4. Click OK.
⇒ The list is created and displayed in the data tree Probe Lists directory.

Specifying an Input File
To load a probe list using the Specify input file option, you must first create the list (*.txt) so that you can select it from the Load List dialog box.

Creating the Input File
1. Create a text file (*.txt) following the comma-delimited format shown in Figure 9.10.
2. Enter the probe names (probe set or spot probe) in comma-delimited format (terminate the entry with a comma) (Figure 9.10).
Figure 9.10 Comma delimited probe list entries
3. Save the text file.

Selecting the Input File
1. Select Edit → Probe Lists → Load Probe List from the menu bar.
⇒ The Load List dialog box appears (Figure 9.11).
Figure 9.11 Load List dialog box
2. Enter a Probe List name.
3. Select the Specify input file option and enter the name of the text file (*.txt) that contains the list members. Alternatively,
a. Click the Browse button.
⇒ The Select List dialog box appears (Figure 9.12).
Figure 9.12 Select List dialog box
b. Select a text file, then click Open.
⇒ The Load List dialog box displays the selected input file (Figure 9.13).
Figure 9.13 Load List dialog box
4. Click OK.
⇒ The probe list is created and displayed in the data tree Probe Lists directory.

Using Probe Lists
Probe lists provide a convenient way to quickly add a group of associated probes (probe sets or spot probes) to the filter, or to highlight and view results for only selected probes.
Adding a Probe List to the Filter Grid
You can add an existing probe list to the filter.
1. In the filter grid, right-click a cell in the Probe Set Name or Spot column and select Probe List... from the shortcut menu (Figure 9.14).
⇒ The Open Probe List dialog box appears (Figure 9.14).
Figure 9.14 Shortcut menu and Open Probe List dialog box
The Open Probe List dialog box displays all probe lists contained on the server, unless the Only show my probe lists option is selected.
2. From the Open dialog, select the probe list that you want to add to the filter.
3. Click Open.
⇒ The probe list is added to the filter.

Displaying Selected Probe List Members
Use probe lists to highlight probe list members in the scatter or fold change graph, or exclusively display members in the pivot table or a series line graph. Pivot the analyses of interest and plot the scatter, fold change and series line graph before highlighting a probe list(s).
1. Select one or more probe lists in the data tree.
To select adjacent probe lists, press and hold the SHIFT key while you click the first and last list in the selection. To select non-adjacent probe lists, press and hold the CTRL key while you click the lists.
2. Right-click a selected probe list and select Display Selected Probes from the shortcut menu.
⇒ If the scatter or fold change graph is the active (selected) graph, the corresponding points are highlighted. If the series graph is active, only the data for the selected probe list members is displayed (Figure 9.15).
⇒ The pivot table displays only the rows for the probe list members (Figure 9.16).
3. To restore all rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.
Figure 9.15 Series line graph of probe list L5

Figure 9.16 Pivot table displaying probe list L5

Managing Probe Lists

List management is the same in GeneChip® data mode (shown in this section) or spot data mode.

Viewing and Editing Probe List Members

1. Select Edit → Lists → View Members from the menu bar.
⇒ The Probe List Members dialog box appears (Figure 9.17).

Figure 9.17 Probe List Members dialog box

2. Select a Probe List from the drop-down list.
⇒ The Probe List Members box displays the list members. If the Only show my probe lists option is selected (Figure 9.17), the Probe Lists drop-down list only displays lists created by you (identified by the logon name).
3. To add a probe set to the list, enter the probe set name in the bottom box (Figure 9.19), then click Add Member.

Figure 9.18 Probe List Members dialog box

4. To remove a probe from the list, highlight it in the Probe List Members box (Figure 9.19), then click Remove Member.

Figure 9.19 Probe List Members dialog box

5. Click Close when finished viewing or editing the list.

Combining Probe Lists

1. Select Edit → Lists → Combine Lists from the menu bar.
⇒ The Combine Probe Lists dialog box appears (Figure 9.20).

Figure 9.20 Combine Probe Set Lists dialog box

2. Make a selection from the upper and lower Combine Probe List drop-down list boxes (Figure 9.21).

Figure 9.21 Combine Probe Lists dialog box

3. Select the And (intersection) or Or (union) combination option for the lists.
4. Enter a New probe list name for the new list, then click OK.
⇒ If the Show members after saving option is selected (Figure 9.21), the Probe List Members dialog box appears and displays the new list members (Figure 9.22).

Figure 9.22 Probe List Members dialog box

5. Click Close when finished viewing or editing list members.
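The And and Or options in Combining Probe Lists correspond to set intersection and set union. As a minimal sketch in Python (the function name and the "and"/"or" mode strings are assumptions made for illustration, not DMT identifiers):

```python
def combine_probe_lists(first, second, mode):
    """Combine two probe lists by intersection ("and") or union ("or")."""
    a, b = set(first), set(second)
    if mode == "and":
        combined = a & b   # probes present in both lists
    elif mode == "or":
        combined = a | b   # probes present in either list
    else:
        raise ValueError("mode must be 'and' or 'or'")
    return sorted(combined)  # sorted only to make the result reproducible
```

With lists ["p1", "p2", "p3"] and ["p2", "p3", "p4"], the And option keeps p2 and p3, while the Or option keeps p1 through p4.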
Exporting a Probe List

A probe list may be exported as a text file (*.txt).

1. Right-click the probe list for export and select Export Probe List from the shortcut menu. Alternatively, select Probe Lists → Export Probe Lists from the menu bar.
⇒ The Save As dialog box appears (Figure 9.23).

Figure 9.23 Save As dialog box

2. Choose a directory from the Save in drop-down list.
3. Enter a name for the text file and click Save.

Deleting a Probe List

Using the Shortcut Menu

1. Right-click the probe list you want to delete and select Delete Probe List from the shortcut menu.
⇒ DMT prompts you to confirm the probe list to be deleted (Figure 9.24).

Figure 9.24 Delete probe list prompt

2. Click OK to delete the probe list.

Using the Menu Bar

1. Select Edit → Probe Lists → Delete Saved Probe List from the menu bar.
⇒ The Delete Probe List dialog box appears (Figure 9.25).

Figure 9.25 Delete Probe List dialog box

2. Select the list you want to delete and click Delete.

Chapter 10 Array Sets

An array set is a user-specified group of GeneChip® probe array analyses. An array set provides a convenient way to select a group of analyses for a query, the pivot operation, graphing, statistical analyses, or clustering. Array sets are only available for GeneChip® probe array analyses.

Creating an Array Set

1. In the data tree, click the analyses you want to include in an array set (Figure 10.1). To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

Figure 10.1 Right-click selected analyses in data tree for the shortcut menu

2. Right-click a selected analysis, then select Create Set from the shortcut menu (Figure 10.1).
Alternatively, select Edit → Sets → Create Set from the menu bar.
⇒ The Save Array Set dialog box appears (Figure 10.2).

Figure 10.2 Save Array Set dialog box

The Virtual Set option is available if the analyses selected for the array set are derived from different GeneChip® probe array types. When a virtual set is pivoted, DMT merges the analyses and displays them in a single column of the pivot table. A virtual set is a convenient way to manage the analyses from a set of multiple GeneChip probe arrays. If the same probe set occurs in more than one analysis, the pivot table displays each probe set-analysis combination in a separate row to ensure no data are lost. For example, a control probe that is found across a set of four probe arrays will generate four pivot table rows. Each row is distinguished by the probe set-analysis name in the row header.

3. Enter a Name for the array set.
4. Select the Virtual Set option if you want to merge the analyses into a single column in the pivot table.
5. Click Save.
⇒ The array set is saved and displayed in the data tree under the My Array Sets directory (Figure 10.3).

Figure 10.3 Data tree, My Array Sets directory

Saved array sets are stored in the registry on the computer and are only available when using that specific computer.

Working with Array Sets

An array set is available for graphing (see Chapter 11), statistical analysis (see Chapter 12) and cluster analysis (see Chapter 14).

Viewing Array Sets

The results tables are displayed independently. Therefore, changing the analyses displayed in the experiment information table does not affect the query or pivot table contents.

Experiment Information Table

1. Click one or more array sets in the data tree. To select adjacent array sets, press and hold the SHIFT key while you click the first and last array set in the selection. To select non-adjacent array sets, press and hold the CTRL key while you click the array sets.
2.
Right-click a highlighted array set and select Experiment Info from the shortcut menu.
⇒ The experiment information table displays information for the analyses in the array set(s).

Pivot Table

1. Select an array set(s) in the data tree.
2. Right-click a highlighted array set in the data tree and select Pivot Data from the shortcut menu.
⇒ The pivot table displays the analysis results from the selected array set(s). The pivot table displays a single column of results for a virtual array set.

Managing Array Sets

Array sets that you have created can be edited or deleted from DMT. Only array sets created by you, as identified by the logon name, are displayed in the data tree.

Editing an Array Set

1. Select an array set in the data tree.
2. Select Edit → Sets → Edit Set from the menu bar.
⇒ The Array Set Members dialog box appears and displays the selected array set and its members (Figure 10.4).

Figure 10.4 Array Set Members dialog box

3. Do one or both of the following:
■ Add a member to the array set: Enter the analysis name (from the current database) in the bottom box, then click Add member.
■ Remove a member from the array set: Select the analysis in the Array Set Members box, then click Remove member.
4. Click Close when finished editing the array set.

Deleting an Array Set

1. In the data tree, select the array set(s) you want to delete.
2. Right-click a selected array set and select Delete Sets from the shortcut menu. Alternatively, select Edit → Sets → Delete Sets from the menu bar.
⇒ DMT prompts you to confirm the array set(s) to be deleted.
3. Click OK to delete the array set(s).

Chapter 11 Graphing Results

DMT can plot user-specified columns of numeric pivot table data in a scatter, fold change, series, or histogram graph. This includes:
■ Analysis results
■ Statistical data generated using the analysis function (see Chapter 12).
The graph pane of the DMT session displays each type of graph in a separate tab (Figure 11.1). The pivot operation must be run before the graphs can be plotted.

Figure 11.1 Graph pane, Scatter graph tab

Scatter Graph

The scatter graph (Figure 11.1) is an x-y graph that compares numeric pivot table data (from user-specified columns) using a traditional scatter plot. Multiple pivot table columns may be assigned to each axis. This enables quick comparison of the results from different experiments.

Each point in the scatter graph represents a probe common to the two pivot table columns in the comparison. A point is defined by the intersection of the result value on the x and y axes for the common probe.

The scatter graph displays up to eight fold change lines (four pairs) to help identify results that have changed significantly. The fold change lines are defined in pairs: y = mx and y = x/m, where m = 2, 3, 5 and 10 by default.

In GeneChip® data mode, average difference and fold change metrics are generally the most informative because probe sets with significant changes in expression levels can be easily identified.

Plotting the Scatter Graph

Plotting the scatter graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Scatter Graph button. Alternatively, select Graph → Scatter from the menu bar.
⇒ The Scatter Graph dialog box appears and displays the pivot table columns available for the scatter graph (Figure 11.2).

Figure 11.2 Scatter Graph dialog box (GeneChip® data mode), pivot table columns available for the scatter graph

2. Use the drag-and-drop method to select each x-axis column in the Available Columns box and place it in the Select X-Axis Column(s) box (Figure 11.3). Alternatively, select one or more columns in the Available Columns box, then click the down arrow above the Select X-Axis Column(s) box.
To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.
3. Use the drag-and-drop method to select each y-axis column in the Available Columns box and place it in the Select Y-Axis Column(s) box (Figure 11.3). Alternatively, select one or more columns in the Available Columns box, then click the down arrow above the Select Y-Axis Column(s) box.

The analysis results for a GeneChip® probe array set must be ordered identically because the scatter graph compares the first analysis on the x-axis to the first analysis on the y-axis (and so forth). If the analyses are not identically ordered, many probe sets will not be compared and plotted (only the common probe sets, such as the controls, will be).

Figure 11.3 Scatter Graph dialog box, GeneChip® probe data mode

4. To change the order of a column in the Select X-Axis (or Select Y-Axis) Column(s) box, use the drag-and-drop method to move the column to a new position in the list. Alternatively, select the column, then click the up or down arrow located at the inside of the Select X-Axis or Select Y-Axis Column(s) box.
5. To change the scatter graph axes from log scale (default) to linear scale, click the Log Scale option to remove the check mark.
6. Click OK.
⇒ The graph pane displays the scatter graph (Figure 11.4). The points are color-coded using the display option colors in the Scatter Graph tab of the Data Mining Options dialog box (click the Options button).

Figure 11.4 Scatter graph, GeneChip® data mode, signal metric

Working with the Scatter Graph

Working with the scatter graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Magnifying the Graph

1.
Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the graph area of interest (Figure 11.5).
2. Release the mouse button.
⇒ The area selected by the rectangle is magnified (Figure 11.6).

Figure 11.5 Scatter graph, rectangle selects an area to magnify (GeneChip® data mode)

Figure 11.6 Magnified area in the scatter graph

3. To zoom out and restore the graph, right-click the graph and select Full Out Zoom from the shortcut menu.

Locating a Probe

Select a probe in the pivot table to quickly locate it in the scatter graph.

1. Click and hold the probe name in the pivot table.
⇒ The corresponding point in the scatter graph is highlighted (Figure 11.7). The highlighting is removed when the mouse button is released.

Figure 11.7 Scatter graph highlights the probe selected in the pivot table (GeneChip® data mode)

Viewing Probe Information & Annotating Probes

1. To display probe and corresponding gene information, click a point in the scatter graph.
⇒ The probe name, analyses names, metrics from the pivot table and a brief description of the gene are displayed to the right of the graph (Figure 11.8).

Figure 11.8 Scatter graph displaying probe information (GeneChip® data mode)

2. To obtain further gene information, select an Internet website from the drop-down list, then click Information.
⇒ The default Internet browser is started and automatically opens the selected website.
3. Double-click a point in the graph or a pivot table row to display the Description dialog box (Figure 11.9).

Figure 11.9 Description dialog box

⇒ The Description dialog box displays a brief description of the probe (probe set or spot probe), the sequence that it is designed to interrogate and any annotations associated with the probe.
4. To enter an annotation, click Annotate.
⇒ The Annotate dialog box appears (Figure 11.10).

Figure 11.10 Annotate dialog box

5. Enter comments in the Annotation box, then click OK.
⇒ The annotation is added to the Description dialog box.

Selecting Points in the Graph

The lasso feature enables you to quickly select and focus on points of interest in the scatter graph by drawing a line around them (roping). The pivot table displays only the rows that correspond to the roped points (all other rows are hidden). Probes selected by roping may be conveniently annotated as a group or saved in a probe list that can be applied to the filter grid of a subsequent query.

1. Click the Lasso Points button. Alternatively, select Graph → Lasso Points.
⇒ The mouse pointer changes to a pair of cross hairs (+) when it is positioned over the scatter graph.
2. To rope points of interest, position the cross hairs near the group of points, then do one of the following:
■ Click and hold the mouse button while you draw a complete circle around the points (Figure 11.11); or
■ Click the mouse, move it to draw a line segment, then click the mouse again to start drawing a new line segment. Repeat until you return the cross hairs to the starting point and the line segments enclose the points of interest (Figure 11.12).

Figure 11.11 Roped points in the scatter graph

Figure 11.12 Roped points in the scatter graph

3. To terminate the roping operation, double-click the mouse or press the ESC key.
⇒ The scatter graph displays the selected points in orange (the default selected point color, which can be changed in the Options dialog box; see Changing Graph Colors on page 202).
⇒ The pivot table displays only the rows that correspond to the roped points (all other rows are hidden).
4. To restore the hidden rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.
5.
To clear the selection from the graph, right-click the graph and select Clear Selection from the shortcut menu.
⇒ The roped points are deselected and all rows (probes) are restored to the pivot table.

Scatter Graph Options

Preferences for the scatter graph display may be set in the Data Mining Options dialog box (Figure 11.13). Newly selected options are applied immediately to an existing graph and to your subsequent sessions.

1. Click the Options button, then click the Scatter Graph tab. Alternatively:
■ Right-click the scatter graph and select Options from the shortcut menu; or
■ Select View → Options from the menu bar, then click the Scatter Graph tab.
⇒ The Data Mining Options dialog box appears and displays the scatter graph options (Figure 11.13).

Figure 11.13 Data Mining Options dialog box, Scatter graph tab, GeneChip® data mode (left) and spot data mode (right)

Point Options

Point size
The point size number determines the dot size for a graph point. Enter a larger point size for easier viewing. Use a smaller point size for higher resolution graphs.

Color by Absolute Call
In GeneChip® data mode, select this option to color-code the points according to the colors assigned to the absolute or detection call combination of the x- and y-axis analyses (as displayed in the Scatter Graph tab of the Data Mining Options dialog box, see Table 11.1). Note: You must pivot the absolute or detection call data.

Color by Difference Call
In GeneChip data mode, select this option to color-code the points according to the colors assigned to the difference or change call for the x-axis analysis (as displayed in the Colors section of the Data Mining Options dialog box). There are five possible difference calls: decrease (D), marginal decrease (MD), no change (NC), marginal increase (MI) and increase (I).
If the x-axis analysis does not have a difference or detection call, then the difference or detection call for the y-axis analysis is used. If neither the x-axis nor the y-axis analysis has a difference or detection call, Point Color is used.

Use Point Color
Select this option to display all graph points using the Point Color (default is black) in the Colors section of the Data Mining Options dialog box.

Table 11.1 Absolute or detection call combinations in the scatter graph (GeneChip® data mode)

                 Absent in Y   Marginal in Y   Present in Y
Absent in X      A-A           A-M             A-P
Marginal in X    M-A           M-M             M-P
Present in X     P-A           P-M             P-P

Colors

The colors of the absolute and difference call categories as well as other scatter graph items (graph points, graph background, selected or roped points, fold change lines) may be changed. (For further information, see Changing Graph Colors on page 202.)

Fold Change Lines

The default fold change lines are defined in four pairs: y = 2x and y = x/2, y = 3x and y = x/3, y = 10x and y = x/10, y = 30x and y = x/30.

1. To redraw the fold change lines, enter new values in the edit boxes. Only integer values may be entered.
2. Remove the check mark to turn off the display of that pair of fold change lines.

Fold Change Graph

The fold change graph is a scatter plot that displays the fold change for a user-specified set of base and comparison columns. (Appendix A describes the fold change calculation.) Numeric pivot table columns are available for the fold change graph.

Each point in the graph represents a probe (probe set or spot probe) that is common to the base and comparison column. The y-axis coordinate of a point is the average fold change for all of the base-comparison column pairs that contain the probe. The x-axis coordinate is the average result value for all of the comparison columns that contain the probe.

The fold change graph supports calculations with replicates.
All pairs of replicate comparison and base columns contribute to the fold change graph. The fold change is averaged when the probe is repeated (for example, when the query returns analysis results from several different GeneChip® probe or spot arrays of the same type, or analysis results from the same probe found on different types of GeneChip probe or spot arrays). For the example replicate data in Table 11.2, DMT calculates the average fold change values from rows 1 and 2, 3 and 4, 5 and 6, and 7 and 8 (excluding the control probes).

Table 11.2 Sample replicate data for the fold change calculation

     Base Column     Comparison Column
1    rep1base000A    rep1samp030A
2    rep2base000A    rep2samp030A
3    rep1base000B    rep1samp030B
4    rep2base000B    rep2samp030B
5    rep1base000C    rep1samp030C
6    rep2base000C    rep2samp030C
7    rep1base000D    rep1samp030D
8    rep2base000D    rep2samp030D

Multiple pivot table columns may be assigned to each axis. For example, Figure 11.14 displays the fold change for two sets of base and comparison columns. N002AS-Avg Diff and N004AS-Avg Diff are the base columns. N006AS-Avg Diff and N008AS-Avg Diff are the comparison columns.

Figure 11.14 Fold change graph

Plotting the Fold Change Graph

Plotting the fold change graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Fold Change Graph button. Alternatively, select Graph → Fold Change from the menu bar.
⇒ The Fold Change Graph dialog box (Figure 11.15) appears and displays the pivot table columns available for the fold change graph.

Figure 11.15 Fold Change Graph dialog box (GeneChip® data mode), pivot table columns available for the fold change graph

2. Use the drag-and-drop method to select each base column in the Available Columns box and place it in the Select Base Column(s) box (Figure 11.16).
Alternatively, select one or more base columns in the Available Columns box, then click the down arrow above the Select Base Column(s) box. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.
3. Use the drag-and-drop method to select each comparison column in the Available Columns box and place it in the Select Comparison Column(s) box (Figure 11.16). Alternatively, select one or more comparison columns in the Available Columns box, then click the down arrow above the Select Comparison Column(s) box.

Figure 11.16 Fold Change Graph dialog box, GeneChip® probe data mode

4. To change the order of a column in the Select Base (or Select Comparison) Column(s) box, use the drag-and-drop method to move the column to a new position in the list. Alternatively, select the column, then click the up or down arrow located at the inside of the Select Base (or Comparison) Column(s) box.
5. To change the fold change graph axes from log scale (default) to linear scale, click the Log Scale option to remove the check mark.
6. Click OK.
⇒ The graph pane displays the fold change graph (Figure 11.17).

Figure 11.17 Fold change graph (GeneChip® data mode)

Working with the Fold Change Graph

Working with the fold change graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Magnifying the Graph

1. Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the area of interest in the graph (Figure 11.18).
2. Release the mouse button.
⇒ The area selected by the rectangle is magnified (Figure 11.19).
Figure 11.18 Fold change graph, rectangle selects area to magnify

Figure 11.19 Magnified area in the fold change graph

3. To zoom out and restore the graph, right-click the graph and select Full Out Zoom from the shortcut menu.

Locating Probes in the Graph

Select a probe in the pivot table to quickly locate it in the fold change graph.

1. Click and hold the probe name in the pivot table.
⇒ The corresponding point is highlighted in the fold change graph (Figure 11.20). The highlighting is removed when the mouse button is released.

Figure 11.20 Click a probe name in the pivot table to highlight the corresponding point in the fold change graph

Viewing Probe Information & Annotating Probes

1. To display probe and corresponding gene information, click a point in the fold change graph.
⇒ The probe name, analyses names, results from the pivot table and a brief description of the gene are displayed to the right of the graph (Figure 11.21).

Figure 11.21 Fold change graph displaying probe information (GeneChip® data mode)

2. To obtain further gene information, select an Internet website from the drop-down list, then click Information.
⇒ The default Internet browser is started and automatically opens the selected website.
3. Double-click a point in the graph or a pivot table row to display the Description dialog box (Figure 11.22).

Figure 11.22 Description dialog box

⇒ The Description dialog box appears and displays a brief description of the probe (probe set or spot probe), the sequence that it is designed to interrogate and any annotations associated with the probe.
4. To enter an annotation, click Annotate.
⇒ The Annotate dialog box appears (Figure 11.23).

Figure 11.23 Annotate dialog box

5. Enter comments in the Annotation box, then click OK.
⇒ The annotation is added to the Description dialog box.
Selecting Points in the Graph

The lasso feature enables you to quickly select and focus on points of interest in the fold change graph by drawing a line around them (roping). The pivot table displays only rows that correspond to the roped probes (all other rows are hidden). Probes selected by roping may be conveniently annotated as a group or included in a probe list that can be applied to the filter grid of a subsequent query.

1. Click the Lasso Points button. Alternatively, select Graph → Lasso Points from the menu bar.
⇒ The mouse pointer changes to a pair of cross hairs (+) when positioned over the fold change graph.
2. To rope points of interest, position the cross hairs near the group of points, then do one of the following:
■ Click and hold the mouse button while you draw a complete circle around the points (Figure 11.24); or
■ Click the mouse, move it to draw a line segment, then click the mouse again to start drawing a new line. Repeat until the cross hairs return to the starting point and the line segments enclose the points of interest (Figure 11.25).

Figure 11.24 Roped points in the fold change graph

Figure 11.25 Roped points in the fold change graph

3. To terminate the roping operation, double-click the mouse or press the ESC key.
⇒ The fold change graph displays the selected points in orange (the default selected point color, which can be changed in the Options dialog box; see Changing Graph Colors on page 202). The pivot table displays only the rows that correspond to the roped points (all other rows are hidden).
4. To restore the hidden rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.
5. To clear the selection from the graph, right-click the graph and select Clear Selection from the shortcut menu.
⇒ The roped graph points are deselected and all probes (rows) are restored to the pivot table.
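Roping amounts to asking which graph points fall inside the closed outline drawn with the lasso. This guide does not describe how DMT performs that test, so the Python sketch below uses a standard ray-casting point-in-polygon check as a stand-in. The function names and the data layout (probe name mapped to plotted x-y coordinates) are assumptions, and the coordinates are taken to be in the same space as the drawn outline.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the closed polygon.

    polygon: list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def rope_points(points, polygon):
    """Return the probe names whose plotted point lies inside the outline."""
    return [name for name, (x, y) in points.items()
            if point_in_polygon(x, y, polygon)]
```

The names returned by rope_points play the role of the rows that remain visible in the pivot table after roping; clearing the selection corresponds to discarding that list.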
Fold Change Graph Options

Preferences for the fold change graph display may be set in the Data Mining Options dialog box (Figure 11.26). Newly selected options are applied immediately to an existing graph and to your subsequent sessions.

1. Click the Options button, then click the Fold Change tab. Alternatively, do either of the following:
■ Right-click the fold change graph and select Options from the shortcut menu; or
■ Select View → Options from the menu bar, then click the Fold Change tab.
⇒ The Data Mining Options dialog box appears and displays the fold change options (Figure 11.26).

Figure 11.26 Data Mining Options dialog box, Fold Change tab

Point Options

Point size
The point size number determines the dot size for a graph point. Enter a larger point size for easier viewing, but use a smaller point size for higher resolution graphs.

Fold Change Calculation

Default Threshold
The intensity threshold value used to calculate the fold change is a function of the noise, scaling or normalization factor and the noise multiplier of the two analyses. The intensity threshold value is calculated by the expression algorithm and is stored in the Mining database. If the intensity threshold value is not found in the database, then DMT uses the default threshold value you entered for the intensity threshold. (Appendix A describes the fold change calculation.)

Note: In spot data mode, set the default threshold to zero (Data Mining Options dialog box, Fold Change tab).

Y-Axis Gridlines
The fold change graph displays major and minor y-axis gridlines as horizontal lines with y-intercepts specified in the edit boxes (only integer values may be entered). A solid line labeled with the y-intercept value represents the major y-axis gridline and a dotted line represents the minor gridline.
Colors
The colors assigned to the fold change graph points, background, or selected (roped) points may be changed. (See Changing Graph Colors on page 202.)

Series Graph

The series graph displays numeric pivot table columns in a line (default) or bar graph format (Figure 11.27 and Figure 11.28). Both graph formats plot numeric pivot columns on the x-axis and the data associated with each probe (probe set or spot probe) in the column on the y-axis. The series graph is an extremely useful way to:
■ Monitor gene expression across different experiments or over a time course.
■ View probes roped in the scatter or fold change graph.
■ View individual data for cluster members (saved in a probe list).

Figure 11.27 Series line graph (GeneChip® data mode)

Figure 11.28 Series bar graph (GeneChip® data mode)

Plotting the Series Graph

Plotting the series graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Series Graph button. Alternatively, select Graph → Series from the menu bar.
⇒ The Series Graph dialog box appears and displays the pivot table columns available for the series graph (Figure 11.29).

Figure 11.29 Series Graph dialog box, pivot table columns available for the series graph

2. Select the pivot table columns for the series graph. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.
3. Click OK.
⇒ The graph pane displays the series graph (Figure 11.30). The line graph format is the default. (See Series Graph Options on page 191 to specify the bar graph format.)

The series graph does not include probe sets that are hidden in the pivot table.
Working with the Series Graph

Working with the series graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Locating Probes in the Graph

Select a probe in the pivot table to quickly locate it in the series graph.

1. Click and hold the probe name in the pivot table.
⇒ In the series line graph, the corresponding line in the graph is highlighted.
⇒ In the series bar graph, the portion of the graph that contains the probe is displayed. The highlighting is removed when the mouse button is released.

Figure 11.30 Series line graph, highlighted line (top) corresponds to the selected pivot table row

Viewing Probe Information & Annotating Probes

1. To display information about a probe, move the pointer over a point (or bar) in the series graph.
⇒ A pop-up tool tip displays the probe name and associated data (Figure 11.31).
2. To view sequence information, double-click a point or bar in the series graph, or a pivot table row.
⇒ The Description dialog box appears (Figure 11.32) and displays a brief description of the gene, its sequence or the portion of the gene sequence the probe is designed to interrogate.

Figure 11.31 Series graph

Figure 11.32 Description dialog box

3. To view further gene information, select an Internet website from the drop-down list, then click Information.
⇒ The default Internet browser is started and automatically opens the selected website.
4. To enter an annotation, click Annotate.
⇒ The Annotate dialog box appears (Figure 11.33).

Figure 11.33 Annotate dialog box

5. Enter comments in the Annotation box, then click OK.
⇒ The annotation is added to the Description dialog box.

Series Graph Options

Preferences for the series graph display may be set in the Data Mining Options dialog box (Figure 11.26).
Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.
1. Click the Options button, then click the Series Graph tab. Alternatively:
■ Right-click the series graph and select Options from the shortcut menu; or
■ Select View → Options from the menu bar, then click the Series Graph tab.
⇒ The Data Mining Options dialog box appears and displays the Series Graph options (Figure 11.34).
Figure 11.34 Data Mining Options dialog box, Series Graph tab

Graph Type
Bar Graph or Line Graph Option
Select a format option to display information in the series graph as described in Table 11.3.

Table 11.3 Series graph formats
          Bar                                                   Line
X-axis    Probe set or spot probe names                         Pivot table column or probe
Y-axis    User-specified data for each probe in the analysis    User-specified data for each probe in the column

Series Bar Graph Options
Probe Set Width (%)
Determines the width of the graph bar.
X-Axis Parameters Visible
The number of probes displayed on the x-axis in the viewable portion of the graph pane.

Series Line Graph Options
X-Axis
Select probe or pivot table columns for display on the x-axis.
Point Size
Determines the dot size for a graph point. Enter a larger point size for easier viewing, but use a smaller point size for higher resolution graphs.
X-Axis Parameters Visible
Specifies the number of probes or columns displayed on the x-axis in the viewable portion of the graph pane.

Colors
Up to 25 different colors are applied to the bars or lines. If there are more than 25 bars or lines, the colors are re-used. The color of the series graph points, background, lines, or bars may be changed. (See Changing Graph Colors on page 202.)

Histogram
The histogram plots a frequency distribution of data from numeric pivot table columns.
DMT sorts the data into groups or bins (x-axis coordinate) and plots the number of probes (probe sets or spot probes) per bin (y-axis coordinate) for each analysis. The resulting data distribution helps evaluate the proportion of genes expressed at a particular level.

Plotting the Histogram
Plotting the histogram is the same in GeneChip® data mode (shown in the following section) or spot data mode.
1. Click the Histogram button. Alternatively, select Graph → Histogram from the menu bar.
⇒ The Histogram dialog box (Figure 11.35) appears and displays the numeric pivot table columns available for the histogram.
Figure 11.35 Histogram dialog box
2. Select the desired columns for the histogram. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.
3. Click OK.
⇒ The graph pane displays the histogram (Figure 11.36).
Figure 11.36 Histogram of average difference data (GeneChip® data mode)

Working with the Histogram
Working with the histogram is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Viewing Histogram Information & Annotating Probes
1. To view information for a particular histogram bar, place the mouse pointer over that area of the histogram.
⇒ A pop-up tool tip displays the minimum and maximum value for the bin and the number of probe sets from the corresponding column in the bin (Figure 11.36).
2. To view sequence information, double-click a row in the pivot table.
⇒ The Description dialog box appears (Figure 11.37) and displays a brief description of the gene, its sequence or the portion of the gene sequence the probe is designed to interrogate.
Figure 11.37 Description dialog box
3. To view further gene information, select an Internet website from the drop-down list, then click Information.
⇒ The default Internet browser is started and automatically opens the selected website.
4. To enter an annotation, click Annotate.
⇒ The Annotate dialog box appears (Figure 11.33).
Figure 11.38 Annotate dialog box
5. Enter comments in the Annotation box, then click OK.
⇒ The annotation is added to the Description dialog box.

Adding Landmarks
One or more landmarks (Figure 11.40) may be added to the histogram to identify where a user-specified probe falls in the distribution.
1. Right-click the histogram and select Add Landmark from the shortcut menu.
⇒ The Landmarks dialog box appears (Figure 11.39).
Figure 11.39 Landmarks dialog box
2. Select one or more columns, then enter a probe name.
3. Click OK.
⇒ The histogram displays the landmark labeled with the column and probe name (Figure 11.40).
Figure 11.40 Histogram with landmark for average difference value of probe set M95787_at in analysis N004AS
4. To hide the landmark(s), right-click the histogram and select Hide Landmarks from the shortcut menu.
5. To display the hidden landmark(s), right-click the histogram and select Show Landmarks from the shortcut menu.
6. To clear all landmarks, right-click the histogram and select Remove Landmarks from the shortcut menu.

Magnifying the Histogram
1. Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the graph area of interest (Figure 11.41).
2. Release the mouse button.
⇒ The area selected by the rectangle is magnified (Figure 11.42).
Figure 11.41 Histogram, rectangle selects area to magnify
Figure 11.42 Magnified area of the histogram
3. To zoom out and restore the graph, right-click the histogram and select Full Out Zoom from the shortcut menu.

Histogram Options
Preferences for the histogram display may be set in the Data Mining Options dialog box (Figure 11.43).
Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.
1. Click the Options button, then click the Histogram tab. Alternatively, do either of the following:
■ Right-click the histogram and select Options from the shortcut menu; or
■ Select View → Options from the menu bar, then click the Histogram tab.
⇒ The Data Mining Options dialog box appears (Figure 11.43).
Figure 11.43 Data Mining Options dialog box, Histogram tab

Graph Options
Combined Histogram
All of the pivot table columns in a single bin are combined into one bar (Figure 11.44). If a single column was selected for the histogram, the Combined Histogram and Separate Histograms options are identical.
Separate Histograms
Each bar in the histogram represents one pivot table column and is color-coded according to the legend at the right of the histogram (Figure 11.45). Select the Separate Histograms option to plot a separate frequency distribution for each column.
Figure 11.44 Histogram, combined histogram option
Figure 11.45 Histogram, separate histograms option

Bin Options
Range
Select this option to specify the range of values for the histogram.
Fixed Bin Size
Select this option to define the range of data values for each bin. Each bin is set to the user-specified Bin Size. The first bin begins at the low value set in the Range option if a Range is specified; otherwise it begins at the lowest data value. The histogram creates sufficient bins to plot all of the data using the user-specified Bin Size. If a Range is specified, the number of bins = Range / Bin Size.
Variable Bin Size
Select this option to define the number of histogram bins. Number of Bins is the number of bins plotted. First Bin Upper Limit defines the boundary value between the first and second bin.
The first bin includes all values less than or equal to the First Bin Upper Limit. The user-specified Range and Number of Bins determine the size of the remaining bins, which increases exponentially. Use the Variable Bin Size and Range options to compare the distribution of values from one or more analyses. For example, set the Number of Bins = 10, First Bin Upper Limit = 40 and Range = 0 to 10,000. The histogram plots 10 bins, each covering a progressively larger range of values.

X-Axis Options
Ticks per label
Defines the number of graph markers or tick marks on the x-axis between the numeric labels. The numeric label shows the range for a bin. The histogram displays a tick mark for each bin.
Note: Set the Ticks per label option to 1/2 or 1/4 the Number of Bins in the Variable Bin Size option. This displays enough labels to view the histogram ranges without overloading the graph.

Color Options
The color of the histogram background, landmarks, or bars may be changed. (See Changing Graph Colors on page 202.)

Other Graphing Features

Enlarging the Graph Pane
1. Right-click the graph pane and select Expand Graph from the shortcut menu. Alternatively, select View → Expand Graph from the menu bar.
2. Repeat step 1 to restore the graph pane to its original size.

Changing Graph Colors
1. Click the Options button, then click the graph tab of interest. Alternatively, do one of the following, then click the graph tab of interest:
■ Right-click the graph and select Options from the shortcut menu; or
■ Select View → Options from the menu bar.
⇒ The Data Mining Options dialog box appears and displays the selected graph options (Figure 11.46).
Figure 11.46 Data Mining Options dialog box, Scatter Graph tab
2. To change the color of an item (for example, Selected Point Color in Figure 11.46), click the associated color square in the Data Mining Options dialog box.
⇒ The Color palette appears (Figure 11.47).
Figure 11.47 Color palette (expanded palette, right)
3. Click a new basic color in the palette or click Define Custom Colors to define a custom color.
⇒ The color palette expands to display the custom color field (Figure 11.47).
4. To define a custom color, use the click-and-drag method to position the cross hairs in the custom color field. In the luminosity scale to the right, adjust the color brightness by moving the arrow up or down the scale.
⇒ The Color|Solid swatch displays the custom color.
5. When finished, click Add to Custom Colors to apply the color. Color selections are saved on a per-user basis.

Copying and Clearing Graphs
1. To copy a graph to the system clipboard, right-click the graph and select Copy Graph from the shortcut menu. Alternatively, select Edit → Copy Graph from the menu bar.
2. To clear a graph from the graph pane, right-click the graph and select Clear Graph from the shortcut menu.
3. To clear all graphs from the graph pane, select Edit → Clear Graphs from the menu bar.

Printing Graphs
1. In the graph pane, click the graph tab you want to print.
2. Click the Print button in the toolbar.
⇒ The Print dialog box appears (Figure 11.48).
Figure 11.48 Print dialog box
3. Confirm that the Graph option is selected.
4. Click OK.

Chapter 12 Statistical Analyses
DMT offers several types of statistical analyses to help evaluate and compare replicate data. Statistical operators can be applied to numeric pivot table columns. The resulting data are displayed in the pivot table and are available for graphing and further statistical analysis.

Selecting an Operator
Open the Analysis Function dialog box to select one or more statistical operators.
■ Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 12.1).
Figure 12.1 Analysis Function dialog box

Average, Median, Standard Deviation or Inter-Quartile Range
One or more of the following operators can be applied to user-specified numeric columns in the pivot table:
Average
Computes the average for the selected pivot table column(s).
Median
Computes the median (50th percentile) for the selected pivot table column(s).
Standard Deviation
Calculates the standard deviation for the selected pivot table column(s).
Inter-Quartile Range
Computes the 75th and the 25th percentile values for the selected pivot table column(s). The inter-quartile range is the 75th percentile minus the 25th percentile.
1. In the Analysis Function dialog box (Figure 12.2), select one or more of the operators: Average, Median, Standard Deviation, or Inter-Quartile Range.
Figure 12.2 Analysis Function dialog box
2. Enter a name for the new column(s) of data that will be generated, then click Next.
⇒ The column selection dialog box appears (Figure 12.3).
Figure 12.3 Column selection dialog box, average and standard deviation analysis
3. Select one or more pivot table columns for the operators selected in step 1, then click Finish.
⇒ The pivot table (right side) displays the new column(s) of statistical results (Figure 12.4). The column header displays the user-specified name, followed by the type of operator. For example, in Figure 12.4, the new column names are Tumor-Average and Tumor-Stdev.
Figure 12.4 Pivot table displaying average and standard deviation results

Fold Change
The fold change (FC) operator compares user-specified pivot table columns (base and comparison analysis) and computes the fold change for each probe in the comparison. (See Appendix A for more information about the fold change calculation.)
1. In the Analysis Function dialog box, select the Fold Change operator (Figure 12.5).
Figure 12.5 Analysis Function dialog box
2. Enter a column name for the fold change results, then click Next.
⇒ The Column selection dialog box appears (Figure 12.6).
Figure 12.6 Column selection dialog box, fold change analysis
3. Select a base and comparison column, then click Finish.
⇒ The pivot table (right side) displays the fold change results (Figure 12.7).
Figure 12.7 Pivot table, fold change results

T-Test
The T-Test analyzes two groups of pivot table columns (control and experiment) and determines the significance of change of the means of the two groups as well as the direction of the change. It computes a p-value for each comparison. The p-value is the probability that a difference at least as large as the one observed would occur by chance if the true mean difference were zero. A small p-value (for example, 0.01) means it is unlikely (only a one in 100 chance) that such a mean difference would occur by chance under that assumption.
The T-Test assumes two samples of unequal variances and a normal distribution of the data. DMT uses an unpaired, one-sided T-Test and converts the p-value to a two-sided p-value. It shows the direction of change in a separate pivot table column.
1. In the Analysis Function dialog box, select the T-Test operator (Figure 12.8).
Figure 12.8 Analysis Function dialog box
2. Enter a column name for the T-Test results.
3. Confirm the default P Cutoff, or enter a new value.
If the computed p-value for a call is greater than the P Cutoff, the Change Direction call is None (no change).
4. Click Next.
⇒ The column selection dialog box appears (Figure 12.9).
Figure 12.9 Column selection dialog box, T-Test
5. Select two or more pivot columns for the Control and two or more pivot columns for the Experiment, then click Finish.
⇒ The pivot table displays two columns of T-Test results: the computed P Value and the Change Direction call (Figure 12.10).
Figure 12.10 Pivot table displaying T-Test results

Mann-Whitney Test
The Mann-Whitney test compares two groups of pivot table columns (control and experiment) to determine the significance of change as well as the direction of change. It computes a p-value for each comparison. The Mann-Whitney test is a nonparametric method for comparing two unpaired groups. It does not assume a particular distribution of the data.
1. In the Analysis Function dialog box, select the Mann-Whitney operator (Figure 12.11).
Figure 12.11 Analysis Function dialog box
2. Enter a column name for the Mann-Whitney test results.
3. Confirm the default P Cutoff, or enter a new value.
If the computed p-value for a call is greater than the P Cutoff, the Change Direction call is None.
4. Click Next.
⇒ The column selection dialog box appears (Figure 12.12).
Figure 12.12 Column selection dialog box, Mann-Whitney test
5. Select two or more pivot columns for the Control and two or more pivot columns for the Experiment, then click Finish.
⇒ The pivot table displays the Mann-Whitney test results (Figure 12.13).
Figure 12.13 Pivot table, Mann-Whitney results

Count & Percentage
The Count & Percentage operator is only available in GeneChip® data mode. For each probe set in user-specified pivot table columns, it counts the number and computes the percentage of:
■ Absolute or detection calls (P, M, or A)
■ Difference or change calls (I, MI, NC, MD, D), or
■ Calls within a user-specified numeric range
For the Count & Percentage operator, you can specify any combination of:
■ Absolute call, difference call and numeric range
■ Detection call, change call and numeric range
A probe set must meet all conditions to be counted.
1. In the Analysis Function dialog box, select the Count & Percentage operator (Figure 12.14).
Figure 12.14 Analysis Function dialog box
2. Specify the conditions (absolute call, difference call, or numeric thresholds) a probe set must meet to be counted.
If numeric thresholds are specified, DMT counts only the values within the threshold limits. If both > and < threshold options are selected, they are combined in AND (intersection) fashion. The example in Figure 12.14 specifies that a probe set must have an absolute call = P, difference call = I and expression metric > 200 to be counted.
3. Enter a column name for the Count & Percentage results.
4. Click Next.
⇒ The column selection dialog box appears (Figure 12.15).
Figure 12.15 Column selection dialog box, Count & Percentage operator
5. Select the pivot columns and parameter(s) for the count and percentage analysis, then click Finish. (The Parameters box is only displayed if a numeric threshold was specified in the Analysis Function dialog box.)
⇒ The pivot table displays the Count & Percentage results (Figure 12.16). For example, in Figure 12.16, the count for probe set AB002533_at is 16 because the probe set met all conditions (absolute call = P, difference call = I, average difference > 200) in 16 of 16 columns, resulting in percentage = 100%.
Figure 12.16 Pivot table, Count & Percentage results

Chapter 13 Matrix Analysis
Matrix analysis compares two probe lists, determines the probes in common (probe sets or spot probes) and computes an overlap or non-overlap significance score for the two probe lists. The matrix provides a spreadsheet framework for comparing probe lists and displays the overlap or non-overlap significance score for probe lists in the matrix. See Appendix D for further information about the matrix algorithm.
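
As a rough illustration of the binomial reasoning behind an overlap score, the following Python sketch computes the expected overlap of two probe lists and a binomial tail probability for the observed overlap. It is only a sketch under stated assumptions: the function names and the example numbers are hypothetical, and the tail probability shown here is not necessarily the significance score that DMT reports (Appendix D describes the actual matrix algorithm).

```python
from math import comb


def expected_overlap(n_a: int, n_q: int, t: int) -> float:
    """Expected number of shared probes: n_a picks, each hitting list Q
    with frequency w = n_q / t."""
    return n_a * (n_q / t)


def overlap_probability(x: int, n_a: int, n_q: int, t: int) -> tuple[str, float]:
    """Return the direction of the deviation and a binomial tail probability.

    Assumption for illustration: the overlap count is modeled as
    Binomial(n_a, w). If the observed overlap x is at or above the expected
    value, the upper tail P(X >= x) is returned; otherwise the lower tail
    P(X <= x). DMT's exact scoring may differ.
    """
    w = n_q / t
    # Probability of exactly k shared probes, for k = 0 .. n_a.
    pmf = [comb(n_a, k) * w**k * (1 - w) ** (n_a - k) for k in range(n_a + 1)]
    if x >= n_a * w:
        return "over", sum(pmf[x:])
    return "under", sum(pmf[: x + 1])


# Hypothetical example: list A has 20 probes, list Q has 10 probes,
# the total population is 100 probes, and 8 probes are shared.
direction, p = overlap_probability(8, 20, 10, 100)
print(expected_overlap(20, 10, 100), direction, p)
```

With these example numbers the expected overlap is 2 probes, so an observed overlap of 8 falls in the upper tail. Changing the population size t changes w and therefore the result, which is why the Population Size setting matters.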

Overview
Matrix analysis uses the binomial distribution to calculate the probability that an overlap between two lists occurs by chance. (See Appendix D for more information about the binomial distribution.) The analysis compares two separate lists and calculates the significance of the overlap between them.
To illustrate how the significance is determined, consider two independent sets, probe list A and probe list Q. Probe list A has n_a members and probe list Q has n_q members. These sets were generated from a total population of size t. (Note: the total population usually includes additional members besides those in sets A and Q.)
The expected overlap between the lists based on random chance is important in determining the overlap significance. The chance (or frequency, w) of picking a member of set Q at random from the total population is w = n_q / t. For example, if there are 10 members of Q in a total population of 100 members, then there is a ten percent chance of picking a member of Q. If we make n_a random picks (the number of members in set A) from this distribution, we would expect to pick a member of set Q ten percent of the time. The expected overlap between Q and A is n_a × w.
What we actually observe is that x members belong to both classification A and Q. How close the observed overlap, x, is to the expected overlap, n_a × w, determines the overlap significance. If these two values are close, then there is a high probability that the overlap is due to random chance. The algorithm uses the binomial distribution to determine this significance.
The observed overlap could be larger or smaller than the expected overlap. If the observed overlap is larger than the expected, then set A is over-represented. If the observed overlap is smaller than the expected, then set A is under-represented.

Population Size
The total population is an important parameter in calculating overlap significance.
This is the total population from which the lists were generated. It is defined as the number of members in common between the two independent classification schemes. The total population for clustering sets is the number of probe sets used when generating the clusters. For the SOM clustering, this value is the total number of probe sets contained in all of the clusters. For correlation coefficient clustering, the population size is either the maximum number of seeds used or the total number of probe sets in the pivot table, depending on whether all probe sets or only the seed set were used to generate the final clusters. (See Chapter 14 for a description of the clustering algorithms and parameters.)
Matrix analysis initially sets the population size as the number of unique probe sets in the row and column probe lists of the matrix. If you are using only a subset of the classification lists, or the lists do not include all of the members that were used to generate the classification, then the calculated population size is too small. In this case, change the total population to the total number of members used to generate the classification.

Running a Matrix Analysis
1. Select Analyze → Matrix from the menu bar.
⇒ The Matrix opens (Figure 13.1).
Figure 13.1 Matrix
2. Click Select Rows.
⇒ The Select dialog box appears (Figure 13.2).
Figure 13.2 Select Probe Sets dialog box
3. Select the probe lists you want to include in the matrix rows, then click OK.
⇒ The matrix displays the probe list names in the row headers (Figure 13.3).
Figure 13.3 Matrix, rows specified
4. Click Select Columns.
⇒ The Select dialog box appears (Figure 13.2).
5. Select the probe lists you want to include in the matrix columns, then click OK.
⇒ The matrix displays the probe list names in the column headers (Figure 13.4).
Figure 13.4 Matrix, probe lists selected for the rows and columns
6. Confirm the Population Size, or enter a new value.
The default population value is equal to the number of unique probes in the row and column probe lists. (See Population Size on page 224 for information on how to set this value.)
7. Click Calculate.
⇒ The algorithm computes the overlap (over-represented probe sets) or non-overlap (under-represented probe sets) significance score for each pair of probe lists in the matrix (Figure 13.5).
The significance score increases as the overlap or lack of overlap increases between two lists (see Appendix D). To distinguish between overlap and non-overlap, the matrix highlights overlap scores that exceed the Overlap threshold in pink and non-overlap scores that exceed the Non-overlap threshold in yellow. The threshold values in the Overlap and Non-overlap boxes can be changed.
Figure 13.5 Matrix displaying overlap significance scores
8. Click Print to print the matrix.
9. Click Close when finished to close the matrix.

Chapter 14 Cluster Analysis
Cluster analysis helps identify gene expression patterns (profiles) in the data and groups together probe sets or spot probes with similar gene expression patterns. DMT offers two clustering algorithms: self organizing map (SOM) and correlation coefficient clustering.

Self Organizing Map (SOM) Algorithm
The self organizing map (SOM) algorithm is designed to cluster GeneChip® average difference data (shown in this chapter). However, any numeric column in the pivot table may be selected for cluster analysis. (Appendix D describes the SOM algorithm and its user-modifiable parameters.)
The algorithm considers the expression levels of n probe sets in k experiments as n points in k-dimensional space. Initially, the algorithm randomly places a grid of nodes or centroids onto the k-dimensional space.
The algorithm iteratively adjusts the positions of the nodes to identify clusters in the data.

Running a SOM Cluster Analysis
Prior to cluster analysis, normalize GeneChip® signal data in Affymetrix® Microarray Suite or DMT. Normalize spot probe intensity data in Affymetrix® Jaguar™. (For more information, see Chapter 5.)
1. Select Analyze → SOM Clustering from the menu bar.
⇒ The Select Columns for Clustering dialog box appears (Figure 14.1).
Figure 14.1 Select Columns for Clustering dialog box
2. Select more than one pivot table column for SOM clustering, then click OK.
⇒ The SOM Clustering dialog box appears (Figure 14.2).
Figure 14.2 SOM Clustering dialog box
The section SOM Filters on page 238 provides a description of the thresholds, row variation and row normalization filters. Filtering the data is optional, but recommended. See SOM Parameters on page 239 for a description of the user-modifiable algorithm parameters.
3. To apply threshold filtering, confirm the Thresholds values, MinVal and MaxVal, or enter new values, then click Add>.
⇒ The threshold filter is displayed in the box to the right (Figure 14.3).
4. To apply Row Variation filtering, confirm the row variation Max/Min and Max-Min defaults or enter new values, then click Add>.
⇒ The row variation filter is displayed in the box to the right (Figure 14.3).
5. Click Compute to display the number of probe sets (or spot probes) remaining after the row variation filter is applied to the data.
The Compute button is a tool for quickly confirming the row variation Max/Min and Max-Min parameters. The number of rows (probe sets) that remain in the dataset after filtering is displayed as New Rows next to the Compute button (Figure 14.3).
When you click Compute, any values entered in the Row Variation edit boxes are also applied to the filters in the box on the upper right, even when the Row Variation values do not appear in the filter box.
6. To apply Row Normalization, confirm the Mean and Variance defaults, or enter new values, then click Add>.
⇒ The row normalization filter is displayed in the box to the right (Figure 14.3).
Figure 14.3 SOM Clustering dialog box, data filtering
7. To change the order of a filter, highlight the filter, then click Down or Up to move the filter to the desired position.
8. To delete a filter, highlight the filter, then click Del. To delete all filters, click Del All.
9. Confirm the defaults for Parameters, or enter new values.
See SOM Parameters on page 239 for a description of the user-modifiable algorithm parameters.
10. Click Run to filter the data and perform SOM cluster analysis.
⇒ The graph pane displays the results of the cluster analysis (Figure 14.4).
Figure 14.4 SOM clusters
The rows and columns parameters generate the nodes that identify clusters. For example, in Figure 14.4 the default rows and columns (6 x 3) generate 18 clusters (click the down arrow to scroll the cluster view). The SOM algorithm maps clusters that have similar gene expression patterns near one another. As a result, in Figure 14.4, the average gene expression patterns in Cluster 1 and Cluster 2 show the greatest similarity and those in Cluster 1 and Cluster 18 are the most dissimilar.
Each cluster plot displays the cluster number followed by the number of cluster members (in parentheses). The middle (red) graph line represents the average gene expression pattern for the cluster. The two outer (blue) graph lines represent the standard deviation of expression (Figure 14.5). Cluster plot axes are not scaled identically.
SOM cluster results may show run-to-run variability due to the inherent nature of the algorithm (for example, the random initialization process).
Figure 14.5 SOM cluster plot (4 pivot columns selected for clustering)
11. Click a cluster plot to view the members in the Probes box (Figure 14.4).

Saving a Probe List

Saving a Selected Cluster as a Probe List
1. Click the cluster you want to save.
⇒ The cluster members are displayed in the Probes box of the Cluster tab (Figure 14.4).
2. Enter a Probe List Name.
3. Click Save Selected.
⇒ The data tree displays the probe list name.

Saving All Clusters as a Probe List
1. Click Save All.
⇒ The Save All Clusters dialog box appears (Figure 14.6).
Figure 14.6 Save All Clusters dialog box
2. Enter a cluster root name and click Save All.
⇒ The data file tree displays the probe lists (Figure 14.7). Each probe list is named using the cluster root name followed by the cluster number.
Figure 14.7 Data file tree, Probe Lists directory
To quickly view data for the cluster members in a probe list, right-click the probe list in the data tree, then select Highlight Pivot and Graph from the shortcut menu. The pivot table displays only the rows for the probe list (cluster members). If the scatter, fold change and series line graphs were previously plotted for the clustered columns, the scatter and fold change graphs highlight the points from the probe list. The series line graph displays only the probe list.

SOM Filters
The SOM filter values are user-modifiable. The default values are intended for probe set average difference data.

Thresholds
The minimum and maximum thresholds are designed to exclude outlier data. Data that exceed the maximum threshold value are changed to the maximum threshold value. Data less than the minimum threshold value are changed to the minimum threshold value.
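
The threshold rule above amounts to clamping each value into the range between the minimum and maximum thresholds. A minimal Python sketch, assuming the values for one probe set are held in a plain list; the function name and the example limits (20 and 20,000) are illustrative only and are not DMT defaults:

```python
def apply_thresholds(values, min_val, max_val):
    """Clamp every value into the range [min_val, max_val].

    Values above max_val become max_val and values below min_val become
    min_val; values already inside the range are left unchanged.
    """
    return [min(max(v, min_val), max_val) for v in values]


# Hypothetical expression values for one probe set across three experiments.
row = [5, 150, 30000]
print(apply_thresholds(row, min_val=20, max_val=20000))  # [20, 150, 20000]
```

Note that clamping keeps every probe set in the dataset and only limits extreme values.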

Row Variation
The row variation filters are designed to exclude probe sets or spot probes that do not significantly change expression level across the experiments. DMT evaluates each probe set or spot probe across all selected columns and includes it in the analysis if both of the following conditions are met:
1) maximum value / minimum value > 3 (default), and
2) maximum value - minimum value > 100 (default)
The maximum and minimum row variation values are user-modifiable.

Row Normalization
This normalizes the data to a mean of zero and a variance of one. Row normalization helps the algorithm identify clusters based on the shape of expression patterns rather than absolute expression levels.

SOM Parameters
See Appendix D for further description of the SOM algorithm.
Rows & Columns
Specifies the rows and columns of nodes that identify clusters in the data. The number of nodes (rows x columns) determines the number of clusters generated.
Epochs
Determines the number of iterations the algorithm runs. Iterations = Epochs x Number of probe sets
Seeds
The number of times the algorithm runs through a set of iterations. The algorithm selects the result that minimizes the sum of the distances from the data points to the nodes.
Initialization
Initial placement of the nodes in k-dimensional space. Random Vectors method randomly places the nodes in k-dimensional space. Random Datapoints method places the nodes on randomly-selected points.
Neighborhood
Defines a distance from the target node (the node closest to the point being considered). At each iteration, nodes in the neighborhood are moved toward the point being considered (updated). Bubble neighborhood = a radial distance from the target node. All nodes in the bubble neighborhood are updated the same amount. Nodes outside the bubble neighborhood are not updated. In the Gaussian neighborhood, all nodes are updated.
The distance a node moves is a function of the distance of the node from the target node. The greater the distance between the node and the target node, the smaller the distance the node is moved.

Initial neighborhood size: Initial width of the bubble neighborhood (default = 5).

Final neighborhood size: Final width of the bubble neighborhood at the last iteration.

Initial learning rate: Initial distance (learning rate) a node is updated.

Final learning rate: Final learning rate at the last iteration.

Correlation Coefficient Clustering Algorithm

The correlation coefficient clustering algorithm finds probe set patterns that have similar shape. The process for finding clusters of similar probe set patterns is accomplished in three steps:

■ Filtering - Removes patterns due mostly to noise.
■ Seeding - Defines the expression patterns of the clusters.
■ Clustering - Groups patterns which are close to the cluster shape.

First, the data set is filtered to remove probe sets with low or relatively constant expression levels across the samples (low standard deviation). The entire data set need not be included to obtain a diverse set of clusters. To the contrary, including noisy data tends to make the discovery of unique expression patterns more difficult. Filtering reduces the number of expression patterns used in the seeding step that follows. It has been empirically determined that 3,000 or fewer genes should be included in the seeding step.

Next, a nearest neighbor approach is used to calculate seeds with unique patterns in the data set. All probe sets whose expression patterns are correlated above the user-defined correlation coefficient (CC) threshold are grouped to define a seed. The expression level for each of the genes in the seed is normalized relative to its standard deviation, and the mean of the normalized expression levels is calculated and defined as the seed pattern.
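The seed pattern calculation described above can be illustrated with a short Python sketch. This is not DMT's implementation; see Appendix D for the algorithm itself. It assumes that the correlation coefficient is the Pearson correlation, and that "normalized relative to its standard deviation" means centering each profile on its mean and dividing by its sample standard deviation. The function names and the example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def normalize(profile):
    """Center a profile on its mean and scale by its sample standard deviation.

    Assumption: the manual does not state whether the mean is subtracted.
    """
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (n - 1))
    return [(v - mean) / sd for v in profile]


def seed_pattern(profiles):
    """Average the normalized profiles of the probe sets in one seed."""
    normalized = [normalize(p) for p in profiles]
    return [sum(column) / len(column) for column in zip(*normalized)]


# Hypothetical profiles: two probe sets with the same shape at different
# expression levels, and one with the opposite trend.
rising_high = [100.0, 200.0, 300.0, 400.0]
rising_low = [10.0, 20.0, 30.0, 40.0]
falling = [400.0, 300.0, 200.0, 100.0]

cc_same_shape = pearson(rising_high, rising_low)
cc_opposite = pearson(rising_high, falling)
pattern = seed_pattern([rising_high, rising_low])
```

In this sketch the two rising profiles correlate at essentially +1 despite a tenfold difference in level, so they would fall in the same seed at any threshold, while the falling profile correlates at essentially -1 and would not.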
In the final step, the pattern of each gene is compared to the seed patterns. Those patterns that closely match the seed pattern are assigned to the seed cluster. Depending on the way the clustering parameters are defined, either all genes or just those that survived the filtering step are assigned to seed clusters. Genes may match more than one seed. Assignment to more than one cluster is allowed, or assignment to only the cluster with the highest CC may be forced.

Unlike the SOM clustering, the correlation coefficient algorithm does not pre-define the number of clusters. The seeding operation determines the final number of clusters.

The correlation coefficient clustering algorithm is designed to cluster GeneChip® expression data such as signal or average difference. In general it is best to use normalized expression values. This removes some types of sample preparation artifacts which can create spurious patterns that tend to mask the true patterns in the data. However, any column in the pivot table may be selected for cluster analysis. See Appendix D for more information about the algorithm.

Running the Correlation Coefficient Cluster

To run the correlation coefficient clustering, you must specify the data to cluster and various parameters for filtering, seeding and final clustering.

1. Select Analyze → Correlation Coefficient Clustering from the menu bar.
⇒ The Select Columns for Clustering dialog box appears (Figure 14.8).

Figure 14.8 Select Columns for Clustering dialog box

2. Select the samples for clustering.
3. Click OK when finished.
⇒ The Correlation Coefficient Clustering dialog box appears (Figure 14.9).

Figure 14.9 Correlation Coefficient Clustering dialog box

See Correlation Coefficient Clustering Options on page 244 for a description of the Filter, Seed Patterns and Cluster options and settings.

4.
In GeneChip® data mode, confirm the default or enter a new value for the Maximum number of probe sets to include in seeding. The Filter options are only available in GeneChip data mode if the absolute call or detection values have been retrieved from the database.
5. To generate seeds, choose the Generate Seeds option. The Import Seed Patterns option is described later.
6. Confirm the defaults or enter new values for the Correlation coefficient threshold and Minimum number of probe sets per seed.
7. Confirm the defaults for the Cluster options (Unique assignments to one cluster and Cluster filtered probe sets only) or choose new Cluster options.
8. Click Run to start the cluster analysis.
⇒ The Cluster tab in the graph pane displays the clusters (Figure 14.10). The pane displays the cluster number followed by the number of cluster members (in parentheses). The cluster plot axes are not scaled identically.

Figure 14.10 Correlation coefficient cluster plot

Correlation Coefficient Clustering Options

The parameters for the filtering, seed generation and clustering steps of the clustering algorithm are specified in the Correlation Coefficient Clustering dialog box (Figure 14.9). The following discusses the parameters for each step.

Filter

Many genes in a data set may not be expressed and therefore have low expression values. The noise in these expression values can produce spurious patterns, which the filtering step removes. The detection call (Statistical Expression algorithm) and absolute call (Empirical Expression algorithm) are used to determine whether a probe set is expressed or not. An absent call (A) indicates the gene is not expressed in the sample. Probe sets with these calls may be excluded from the seeding process. To filter based on the expression call, choose the Exclude probe sets with less than _% Present calls across all analyses when generating seeds option.
The filter slider sets the percentage of P (present) or M (marginal) calls that are required for a given probe set to be included in the seeding step. A higher filter percentage excludes more probe sets with low expression values; a lower percentage includes more genes with low signal. The default is 75%. Depending on the experiment, the filtering parameter may be set to either a high or low number. For example, suppose an experiment looks at several different tissues and only those probe sets expressed in a single tissue are of interest. In this case, lowering the filtering percentage and tolerating the noise in the rest of the sample is required to detect the one rare gene that may be expressed.

Also in the filter step, specify the Maximum number of probe sets to include in seeding. For this parameter, the genes are ranked according to the relative standard deviation of their expression intensities across the samples, so that the genes with the greatest fluctuations in expression rank highest. Top-ranked probe sets fluctuate the most, low-ranked genes the least. The value determines the number of top-ranked genes that will be included in seeding. The default is 1,000, but sometimes this value may be as small as several hundred in order to obtain meaningful clusters.

Seed

Seeding is a pre-clustering process by which cluster patterns are first determined. A seed is usually a small group of genes whose expression patterns are very similar to each other. The seed’s expression pattern is calculated from the average expression pattern of this small group. Two separate parameters are used in the seeding step:

■ Correlation coefficient threshold
■ Minimum number of probe sets per seed

The correlation coefficient is a numerical way of representing the relatedness of expression patterns. It is the normalized covariance between expression patterns for two probe sets across a series of biological samples.
The value of the correlation coefficient ranges from -1 to +1, where +1 represents complete correspondence. The higher the threshold, the more similar the probe sets must be to belong to the same seed. The default value is 0.98, but can be as large as 0.999 or as small as 0.8 in order to obtain meaningful clusters.

Set the Minimum number of probe sets per seed to specify the number of genes that must be present in a seed for it to be used in clustering. A higher number is more restrictive and reduces the number of allowed patterns. A lower number allows rarer expression patterns to define seeds and then later clusters. The default is 3.

If a file name is entered into the Save Seed Patterns box, the patterns will be saved as a text file.

Cluster

There are three parts to the final clustering step.

The correlation coefficient threshold is the same parameter as in the seeding step. It specifies how closely a probe set pattern must match the seed’s pattern in order to join the cluster. A lower number allows a less stringent expression relationship between the probe sets which are permitted to join the cluster. A higher number forces a more stringent relationship. Generally, it is best to use a less stringent threshold than in the seeding step in order to incorporate more unseeded probe sets into the cluster. The default is 0.90.

If the Cluster Filtered Probe Sets Only option is chosen, only those probe sets that passed the filtering steps are allowed to join the cluster. The choice will depend on factors such as the quality of the data and whether a rare expression pattern is being sought.

It is possible that a probe set correlation coefficient exceeds the threshold for two or more seeds. If the Unique assignments to one cluster option is chosen, a probe set is assigned to the cluster with the highest correlation coefficient.
If this option is not chosen, the probe set is assigned to every cluster whose correlation coefficient exceeds the threshold.

Effect of Changing Algorithm Parameters

Table 14.1 describes how changing a parameter value affects seeding and clustering.

Table 14.1 User-modifiable Correlation Coefficient algorithm parameters

Parameter: Filter
Description: Specifies the percentage of present and marginal detection or absolute calls that a probe set must have across all analyses in order to be considered for the seeding step, and optionally, the clustering step (default = 75%).
Increase: Decreases the number of probe sets (with the highest relative standard deviation) used in the seeding process. An excessive number of probe sets in the seeding process generates large, less distinct clusters.
Decrease: Increases the number of probe sets (with the highest relative standard deviation) used in the seeding process. Increases the number of seeds (representative expression profiles).

Parameter: Maximum number of probe sets to include in seeding
Description: The algorithm ranks the probe sets not excluded by the filter in order of highest standard deviation. Probe sets with the highest standard deviation are included in the seeding procedure until the Maximum number of probe sets to include in seeding is reached.
Increase: Increases the number of probe sets (with the highest relative standard deviation) used in the seeding process. An excessive number of probe sets in the seeding process generates large, less distinct clusters.
Decrease: Decreases the number of probe sets (with the highest relative standard deviation) used in the seeding process. Decreases the number of seeds (representative expression profile for a cluster).

Parameter: Seed correlation coefficient threshold
Description: Expression patterns of probe sets that pass the filter are compared to one another. If the correlation coefficient between two probe set profiles exceeds the threshold, they are included in the same seed.
Increase: Increases the similarity required between the expression profiles of two probe sets in order to be included in the same seed. If the seed correlation coefficient threshold is excessively high, this prevents identification of any seeds.
Decrease: Lowers the similarity required between the expression profiles of two probe sets in order to be included in the same seed. If the seed correlation coefficient threshold is too low, expression profiles merge and can result in a new profile that is unlike either merged profile.

Parameter: Minimum number of probe sets per seed
Description: Minimum number of probe sets (that exceed the seed correlation coefficient threshold) required to define a cluster and generate a seed.
Increase: Decreases the number of clusters generated.
Decrease: Increases the number of clusters generated.

Parameter: Cluster correlation coefficient threshold
Description: The expression profile of each probe set is compared to each seed. If the correlation coefficient exceeds the threshold, the probe set is assigned to the cluster.
Increase: Decreases the number of probe sets in a cluster.
Decrease: Increases the number of probe sets in a cluster.

Saving and Importing Seed Patterns

The seeding process described above is useful for finding interesting, but unknown patterns in the data set. In cases where the pattern is known, the seeding process can be omitted and the known patterns imported instead. An example pattern would be a gene that is expressed in one tissue type, but in none of the others.

To import the seed patterns:

1. Choose the Import Seed Patterns option.
2. Enter the name of the seeds data file (*.txt) that contains the patterns (Figure 14.11). Alternatively, click the Browse button and select a *.txt file from the Read Seeds Data dialog box that appears.
The *.txt file can be a file saved from a previous clustering run (see Saving Seeds Data on page 249) or manually created (see Seed Pattern (*.txt) Format on page 250).

Figure 14.11 Correlation Coefficient Clustering dialog box

Saving Seeds Data

The seeds generated by a cluster analysis may be saved in a seeds data file (*.txt).

1. In the Correlation Coefficient Clustering dialog box (Figure 14.12), select the Generate Seeds option.

Figure 14.12 Correlation Coefficient Clustering dialog box

2. Click the upper Browse button.
⇒ The Save Seeds Data dialog box appears (Figure 14.13).

Figure 14.13 Save Seeds Data dialog box

3. Select a directory for the saved file.
4. Enter a File Name for the seeds data file (*.txt), then click Save.

The seed patterns are saved when the clustering algorithm is executed.

Seed Pattern (*.txt) Format

In the seed pattern text file (Figure 14.14), the first row contains the column headings. The following rows contain the patterns. Each row contains the label of the pattern and the expression values. Figure 14.14 shows a representative pattern file. In this example, the first three rows are patterns of individual probe sets. The last three rows are user-specified patterns.

Figure 14.14 Import seed text file

Saving a Probe List

Cluster members may be saved as a probe list.

1. Click the cluster you want to save.
⇒ The cluster members are displayed in the Probes box of the Clusters tab (Figure 14.15).
2. Enter a name for the list in the Probe List Name box.

Figure 14.15 Correlation coefficient cluster plot

3. Click Save Selected.
⇒ The data tree displays the probe list name.

To quickly view data for the cluster members in a probe list, right-click the probe list in the data tree, then select Highlight Pivot and Graphs from the shortcut menu.
The pivot table displays only the rows for the probe list (cluster members). If the scatter, fold change and series line graphs were previously plotted for the clustered columns, the scatter and fold change graphs highlight the points from the probe list. The series line graph displays only the probe list.

Chapter 15 DMT Tutorial

Introduction

This tutorial includes six lessons that demonstrate (in GeneChip® data mode) how to use DMT.

■ Lesson 1: Identify highly expressed genes
■ Lesson 2: Calculate summary statistics of replicates
■ Lesson 3: Summarize qualitative data
■ Lesson 4: Evaluate difference between two tissues
■ Lesson 5: Use comparison ranking to evaluate difference call consistency
■ Lesson 6: Perform cluster analysis using the self organizing map (SOM) algorithm

The tutorial lessons use the demonstration database DMT_3_Tutorial that is provided on the Affymetrix Data Mining Tutorial and Demo Data CD (P/N 610050 Rev. 2). The database includes absolute and comparison analyses of tissues T1, T2 and T3.

LIMS users: The tutorial database name may be different from that used in this manual. Please contact your Database Administrator for the correct name.

There are six replicate absolute analyses of each tissue type (a total of 18 absolute analyses). For example, the replicates for tissue T1 are T1_r1, T1_r2, ... T1_r6 (Figure 15.1). There are 36 comparison analyses that compare tissue T1 and T2 replicates for use in Lesson 5. The number of replicates needed in your own experiments will depend on how much variability you expect to see in your system. The signal intensity data were scaled to a target intensity (TGT) of 500 using the All Probe Sets option in the Affymetrix® Microarray Suite software.

Figure 15.1 DMT_3_Tutorial database, 18 absolute analyses

Before we can start to analyze the data, we must first register and connect the database to DMT.
Step 1: Restoring the MicroDB™ Database

Refer to the Affymetrix® MicroDB™ User’s Guide, the SQL Server manual, or the Oracle® manual for instructions on how to restore the tutorial database to a workstation or server.

Step 2: Starting DMT

Refer to Chapter 2 on page 10 for more information on installing and registering DMT.

Press the Windows Start menu button, then select Programs → Affymetrix → Data Mining Tool.
⇒ The DMT main window appears.

Step 3: Registering the Database

Tutorial Database on Windows NT® Workstation (MicroDB™ System)

1. Select Edit → Register Database from the menu bar.
⇒ The Register Database dialog box appears (Figure 15.2).

Figure 15.2 Register Database dialog box, publish database on Windows NT workstation

2. Select the DMT_3_Tutorial database from the Publish Database drop-down list, then click Register.
⇒ The tutorial database is now available to DMT.

Tutorial Database on LIMS Server (Affymetrix® LIMS)

Select Edit → Register Database from the menu bar.
⇒ The Register Database dialog box appears (Figure 15.3).

Figure 15.3 Register Database dialog box Oracle® Database

1. To select another server, enter the server name or Oracle alias, then click List Databases to display the publish databases for the server in the Publish Database drop-down list.
2. Select the DMT_3_Tutorial database from the Publish Database drop-down list, then click Register.
⇒ The tutorial database is available to DMT.

Step 4: Selecting the Tutorial Database

1. Select Edit → Select Database from the menu bar.
2. Select the DMT_3_Tutorial database.
⇒ The status bar at the bottom of the main window displays the name of the current database. If the status bar is not displayed, select View → Status Bar from the menu bar.

Step 5: Opening the DMT Session

A DMT session must be opened to begin data analysis.

■ Select Data → New → GeneChip Mining from the menu bar.
⇒ The DMT session opens (Figure 15.4).

Figure 15.4 DMT session, DMT_3_Tutorial database selected

Lesson 1: Identifying Highly Expressed Genes

Identifying genes that significantly change expression level can give insight into the major functional and structural cell changes that occur between two experimental conditions (for example, normal cells and cells treated with a drug).

This lesson shows how to identify genes that are highly expressed in tissue T1. It then examines the expression of these same genes in tissue T2 and T3. We will use only one replicate of each tissue in this lesson.

Lesson 1 includes:

■ Step 1: Specifying a Filter
■ Step 2: Selecting Analyses for the Query
■ Step 3: Pivoting on Signal & Detection Call
■ Step 4: Querying and Pivoting the Data
■ Step 5: Sorting the Pivot Table by Signal
■ Step 6: Saving a Probe List
■ Step 7: Plotting the Series Line Graph

Step 1: Specifying a Filter

The filter is a useful tool for selecting transcripts that exceed a certain limit or transcripts within a given expression range. For example, to find highly expressed genes, we can specify (in the filter grid) genes that are called Present with a Signal > 1000.

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.
2. In Line 1 of the filter grid, double-click the Signal cell, then enter >1000.
3. Enter ='P' in the Detection cell (Figure 15.5).

Figure 15.5 Filter grid

The query interrogates the absolute analyses selected from the data tree and returns probe sets that have a Signal greater than 1000 and a Present (P) Detection call.

To specify more complex queries, right-click a cell in the filter grid, then select Show Query Builder from the shortcut menu that appears. This opens the Build Filter dialog box for the selected cell. The Build Filter dialog box enables you to enter complex limits in the filter grid without prior knowledge of correct syntax for operators such as BETWEEN and LIKE.
You need only specify text or numbers where appropriate.

Step 2: Selecting Analyses for the Query

In the data tree, select the absolute analyses: T1_r1, T2_r1 and T3_r1.

Step 3: Pivoting on Signal & Detection Call

1. To select results for the pivot operation, click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.6).
2. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.6).

Figure 15.6 Data Mining Options dialog box, Pivot tab

3. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal and Detection (Figure 15.6).
4. Clear the check mark from the Show order analyses dialog option. When this option is chosen, the software prompts you to confirm the order of the columns (analyses) in the pivot table prior to the pivot operation.
5. Click OK to close the Data Mining Options dialog box.

Step 4: Querying and Pivoting the Data

1. Click the Pivot toolbar button.
⇒ The data are queried using the filter specified in step 1. The pivot table displays the signal and detection call for each probe set returned by the query (Figure 15.7). You can reorder the pivot table columns using the click-and-drag method.

Figure 15.7 Pivot table

Some fields in the pivot table are blank because the probe sets in these analyses did not satisfy the filter criteria.

Step 5: Sorting the Pivot Table by Signal

In the pivot table, right-click the Signal column heading for T1_r1 and select Sort Descending from the shortcut menu that appears.
⇒ The pivot table columns are sorted in descending order of the signal values for T1_r1 (Figure 15.8).

Step 6: Saving a Probe List

1. Select the ten pivot table rows with the highest signal values for T1_r1 (Figure 15.8).

Figure 15.8 Pivot table

2.
Right-click a highlighted cell and select Create Probe List from the shortcut menu that appears.
⇒ The Save Probe List dialog box appears (Figure 15.9).

Figure 15.9 Save Probe List dialog box

3. In the Name box, enter the probe list name Highly Expressed.
4. Clear the check mark from the Show members after saving option.
5. Click Save.
⇒ The probe list is saved and displayed in the data tree. To view the probe list members, click the plus sign (+) next to the probe list in the data tree.

Step 7: Plotting the Series Line Graph

Now that we have identified genes that are highly expressed in T1_r1 and saved them in a probe list, we can plot the series line graph to examine the expression levels of these genes in T2_r1 and T3_r1 as well.

1. Right-click the filter grid and select Clear Query from the shortcut menu that appears.
⇒ The criteria in the filter grid are cleared.
2. Click the Pivot toolbar button. Verify that analyses T1_r1, T2_r1 and T3_r1 remain selected in the data tree before running the pivot operation.
3. Right-click the Highly Expressed probe list in the data tree and select Display Selected Probes from the shortcut menu that appears.
⇒ The pivot table displays only the members of the Highly Expressed probe list (Figure 15.10).

Figure 15.10 Pivot table

4. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.11).

Figure 15.11 Data Mining Options dialog box, Series Graph tab

5. Click the Series Graph tab and verify the Line Graph option is selected.
6. Click OK.
7. Click the Series Graph toolbar button.
⇒ The Series Graph dialog box appears (Figure 15.12).

Figure 15.12 Series Graph dialog box

8. Select all three columns (T1_r1-Signal, T2_r1-Signal and T3_r1-Signal), then click OK.
⇒ The signal series line graph is plotted for the probe sets in the Highly Expressed probe list (Figure 15.13). If necessary, use the scroll bar at the bottom of the graph pane to view the entire graph.

Figure 15.13 Series line graph displaying the Highly Expressed probe list

Lesson 1 Summary

We used filters (Detection = P and Signal > 1000) to query the database and select transcripts in a given expression range. We then sorted the pivot table by signal in descending order to quickly identify the most highly expressed genes returned by the query.

We saved probe sets (genes) of interest as a probe list. The probe list is a useful way to organize probe sets of interest. In the data tree, the Display Selected Probes function provided a convenient way to view pivot table results and plot graphs for the probe list members only.

You can use the probe list to look at gene expression for list members across other experiments. For example, in this lesson we saved ten probe sets (with the highest signal value in T1_r1) as a probe list, then used the Display Selected Probes function to update the pivot table and plot the series line graph for analyses T1_r1, T2_r1 and T3_r1.

Suggested Exercise

Repeat lesson 1, filtering for genes that are called present and have a signal between 1000 and 2000. Generate a short probe list (five to ten members) and plot the series line graph (select Columns for the X-Axis option) for the probe list across three replicate analyses.

Lesson 2: Calculating Averages of Replicates

The analysis of replicates allows us to measure the variability in a data set and determine confidence values for these measurements. This enables us to measure small, consistent changes even when the variability in a data set is relatively high. Small changes in gene expression can be biologically very important.
Using a larger number of replicates increases the probability that small changes are statistically significant.

Lesson 2 shows how to compute the mean and standard deviation for the members of the Highly Expressed probe list (generated in lesson 1) across replicate analyses. This lesson includes:

■ Step 1: Specifying a Probe List for the Filter
■ Step 2: Selecting Analyses for the Query
■ Step 3: Pivoting on Signal
■ Step 4: Querying and Pivoting the Data
■ Step 5: Selecting the Average and Standard Deviation Operators
■ Step 6: Sorting the Pivot Table
■ Step 7: Displaying Probe Set Descriptions

Step 1: Specifying a Probe List for the Filter

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.
2. In Line 1 of the filter grid, right-click the Probe Set Name column, then select Probe List from the shortcut menu that appears.
⇒ The Open Probe List dialog box appears (Figure 15.14).

Figure 15.14 Open Probe List dialog box

3. Select the Highly Expressed probe list (generated in lesson 1), then click Open.
⇒ The selected probe list is placed in the Probe Set Name column of the filter grid (Figure 15.15). The probe list contains the probe sets we want to analyze. By loading the list we limit our analysis to these probe sets only.

Figure 15.15 Filter grid

Step 2: Selecting Analyses for the Query

1. In the data tree, select all replicate absolute analyses for tissue T1, T2, and T3 (T1_r1 through T1_r6, T2_r1 through T2_r6, and T3_r1 through T3_r6) (Figure 15.16).

Figure 15.16 Data tree, all absolute analyses (18) selected

Step 3: Pivoting on Signal

1. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.17).
2. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.17).

Figure 15.17 Data Mining Options dialog box, Pivot tab

3. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.
4. Click OK to close the Data Mining Options dialog box.

Step 4: Query and Pivot the Data

1. Click the Pivot toolbar button.
⇒ The pivot table displays probe sets that are members of the Highly Expressed probe list (Figure 15.18).

Figure 15.18 Pivot table

Average and Standard Deviation

The average and standard deviation statistics or the median and inter-quartile range statistics can be used to summarize the expression level for each probe set across a number of replicate analyses.

Select the average and standard deviation statistics if you assume a normal distribution for the data (Figure 15.19). The standard deviation provides an estimate of how much the expression level changes from one replicate to the next.

Select the median and inter-quartile range statistics if you assume the data do not have a normal distribution (Figure 15.20). The inter-quartile range is the 75th percentile minus the 25th percentile.

If you are not sure whether your data have a normal distribution, calculate both the mean and median values. If the values vary significantly, the data probably do not have a normal distribution and it may be better to use the median value.

Figure 15.19 Normal data distribution

Figure 15.20 Skewed data distribution

Step 5: Selecting Average & Standard Deviation Operators

1. Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 15.21).

Figure 15.21 Analysis Function dialog box

2. Enter T1 in the Column Name box.
3. Select the Average and Standard Deviation operators (Figure 15.21), then click Next.
⇒ The column selection dialog box displays the available pivot table columns (Figure 15.22).

Figure 15.22 Analysis Function dialog box

4. Select all replicate T1 signal columns (T1_r1-Signal, T1_r2-Signal,... T1_r6-Signal), then click Finish.
⇒ The pivot table (far right) displays the new columns T1-Average and T1-Stdev (Figure 15.23). Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Figure 15.23 Pivot table, average and standard deviation for replicate T1 signal data

5. Repeat items 1 through 4 of Step 5 for the replicate T2 signal columns (enter T2 in the Column Name box of the Analysis Function dialog box).
6. Repeat items 1 through 4 of Step 5 for the replicate T3 signal columns (enter T3 in the Column Name box of the Analysis Function dialog box).
⇒ The pivot table (right side) displays six new columns:
■ T1-Average and T1-Stdev,
■ T2-Average and T2-Stdev, and
■ T3-Average and T3-Stdev (Figure 15.24).
Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Figure 15.24 Pivot table, average and standard deviation for replicate T1, T2 and T3 signal data

Step 6: Sorting the Pivot Table

We are interested in probe sets with large signal values. We can sort the pivot table to help identify these probe sets.

1. Select Edit → Sort from the menu bar.
⇒ The Sort dialog box appears (Figure 15.25).

Figure 15.25 Sort dialog box

2. Select T1-Average from the top Sort By drop-down list and select the Descending sort option.
3. Click OK.
⇒ The pivot table is sorted by descending average T1-Signal value (Figure 15.26).

Figure 15.26 Pivot table sorted by descending T1-Average

Step 7: Displaying Probe Set Descriptions

1.
Select Query → Pivot Descriptions from the menu bar.
⇒ The pivot table displays a column of probe set descriptions (Figure 15.27).
Figure 15.27 Pivot table, probe set descriptions displayed

Lesson 2 Summary
We used a probe list as a filter to focus on genes of interest across different analyses. Here we included the Highly Expressed probe list (generated in lesson 1) in the filter and queried all replicate analyses of tissue T1, T2 and T3.
We computed the mean and standard deviation to summarize the replicate signal data for tissue T1, T2 and T3, and to provide a confidence measure for the data.
We sorted the pivot table by T1-Average to help us identify probe sets with large signal values. Descriptions were displayed in the pivot table for more information about the probe sets.

Suggested Exercise
Repeat lesson 2, computing the median and inter-quartile range for the replicate analyses of tissue T1, T2 and T3.

Lesson 3: Summarizing Qualitative Data
Some transcripts may be expressed at the limit of assay detection. The more often a weakly expressed transcript is called present across multiple analyses, the more confident we are that it is actually present. (Think of this as a jury where each experiment is a juror that votes whether or not a transcript is present.)
This lesson shows how to:
■ Use the Count & Percentage analysis to evaluate the consistency of detection calls across all replicate data.
■ Identify the transcripts that are present in all replicates of tissue T1, T2 and T3.
■ Annotate the genes that are present and generate a corresponding probe list representing potential genes of interest.
Lesson 3 includes:
■ Step 1: Pivoting on Detection Call
■ Step 2: Performing Count & Percentage Analysis
■ Step 3: Sorting the Pivot Table Results
■ Step 4: Saving a Probe List
■ Step 5: Annotating Probe List Members

Step 1: Pivoting on Detection Call
1.
Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.
2. In the data tree, select all 18 absolute analyses for tissue T1, T2 and T3.
3. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.28).
4. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.28).
Figure 15.28 Data Mining Options dialog box, Pivot tab
5. From the list of Absolute Expression Data for the Statistical Algorithm, select Detection. Verify that all other options are cleared.
6. Click OK to close the Data Mining Options dialog box.
7. Click the Pivot toolbar button.
⇒ The pivot table displays the detection call for each probe set in the selected analyses (Figure 15.29).
Figure 15.29 Pivot table displaying detection calls

Step 2: Performing Count & Percentage Analysis
1. Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 15.30).
Figure 15.30 Analysis Function dialog box
2. Enter T1 Present in the Column Name box.
3. Select the Count & Percentage operator, then select the P (present) option.
4. Click Next.
⇒ The column selection dialog box displays the available pivot table columns (Figure 15.31).
Figure 15.31 Analysis Function dialog box
5. Select the six replicates for tissue T1 (T1_r1, T1_r2, ... T1_r6), then click Finish.
⇒ This generates the columns T1 Present-Count and T1 Present-Percent in the pivot table (Figure 15.32).
6. Repeat items 1 through 5 of Step 2 for the replicate T2 Detection columns (enter T2 Present in the Column Name box of the Analysis Function dialog box).
⇒ The pivot table (right side) displays two new columns: T2 Present-Count and T2 Present-Percent (Figure 15.32).
7.
Repeat items 1 through 5 of Step 2 for the replicate T3 Detection columns (enter T3 Present in the Column Name box of the Analysis Function dialog box).
⇒ The pivot table (right side) displays two new columns: T3 Present-Count and T3 Present-Percent (Figure 15.32).
Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Step 3: Sorting Pivot Table Results
1. In the pivot table, right-click the T1 Present-Count column heading and select Sort Descending from the shortcut menu that appears.
Figure 15.32 Pivot table, count and percentage columns
For each probe set, the:
■ Count column displays the number of columns (analyses) in which the detection call = Present.
■ Percent column shows the corresponding percentage of columns (analyses) in which the probe set was called present.
For example, in Figure 15.32, probe set Z70759_at was called present in all replicates of tissue T1, T2 and T3, or 100% of the analyses.
Sorting the T1 Present-Count column in descending order ranks the probe sets so that those with the most consistent detection calls in tissue T1 are displayed at the top of the pivot table.

Step 4: Saving a Probe List
Save all probe sets with T1 Present-Percent = 100% as a probe list called T1 Present 100%. (See lesson 1, step 6.)

Step 5: Annotating Probe List Members
1. In the data tree, right-click the probe list T1 Present 100% and select Display Selected Probes from the shortcut menu that appears.
⇒ The pivot table displays all the members of the T1 Present 100% probe list.
2. Select all pivot table rows, right-click a pivot table row, then select Annotate Probes from the shortcut menu.
⇒ The Annotate dialog box appears (Figure 15.33).
Figure 15.33 Annotate dialog box
3. Enter Tutorial in the Annotation Type box.
4. In the Annotation box enter: T1: Present count = 6, Percent = 100%.
5. Click OK.
⇒ The probe sets that were called present across all six T1 replicates are annotated.

Lesson 3 Summary
We used the count and percentage analysis to summarize detection calls for the tissues T1, T2 and T3. By sorting the pivot table T1 count column, we were able to identify the most consistent results. This makes it easy to annotate all probe sets that are present in all analyses (or a user-specified percentage of analyses).
We saved the probe sets that were present in all six T1 replicates as a probe list and annotated the members of the probe list. In future sessions we can query the annotations (see Chapter 8, Annotations).

Suggested Exercise
Repeat lesson 3 using the count and percentage analysis to identify all genes called absent in all replicates of tissue T1, T2 and T3.

Lesson 4: Evaluating Difference Between Two Tissues
The T-Test and Mann-Whitney test are statistical tests that enable you to determine the direction and significance of change in a transcript's expression level between two experimental conditions with one or more replicates. These analyses are very good strategies to use if you are looking for small, consistent changes in expression levels. The use of replicates helps distinguish real change from biological and experimental noise.
The T-Test assumes the expression levels for a given transcript are normally distributed across experiments. The Mann-Whitney test makes no assumptions about the data distribution.
DMT computes a p-value for each comparison. The p-value is the probability that a difference in expression level at least as large as the one observed would occur by chance if there were no real difference between the two conditions. A small p-value (for example, 0.01) means it is unlikely (only a one in 100 chance) that such a difference would occur by chance. If the computed p-value > p-value cutoff, the change call is no change.
In this lesson, we use the Mann-Whitney test to compare the signal replicates for tissues T1 and T2 and determine whether the signal data for these two tissues show a statistically significant difference. The lesson shows how to generate change calls for signal data so we can determine the direction of change and associated p-values to estimate confidence.
Lesson 4 includes:
■ Step 1: Pivoting on Signal
■ Step 2: Performing a Mann-Whitney Test
■ Step 3: Annotating Probe Sets
■ Step 4: Saving a Probe List

Step 1: Pivoting on Signal
1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.
2. In the data tree, select all absolute analysis replicates for T1 and T2.
3. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.34).
4. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.34).
Figure 15.34 Data Mining Options dialog box, Pivot tab
5. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.
6. Click OK to close the Data Mining Options dialog box.
7. Click the Pivot toolbar button.
⇒ The pivot table displays the signal for each probe set returned by the query (Figure 15.35).
Figure 15.35 Pivot table displaying signal

Step 2: Mann-Whitney Test
1. Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 15.36).
Figure 15.36 Analysis Function dialog box
2. Enter T1vsT2 in the Column Name box.
3. Select the Mann-Whitney test option.
4. Click Next.
⇒ The column selection dialog box appears (Figure 15.37).
Figure 15.37 Analysis Function dialog box, select analyses for the Mann-Whitney test
5.
Select the six replicate T1 signal columns in the Control Columns box (Figure 15.37).
6. Select the six replicate T2 signal columns in the Experiment Columns box (Figure 15.37).
The pivot table columns selected in the Control and Experiment Columns lists define the two populations being compared.
7. Click Finish.
⇒ The pivot table displays two columns of T1vsT2-Mann-Whitney test results (Figure 15.38).
The pivot table also displays the computed p-value and the direction of change (up, down, or none) for each probe set in the comparison. An Up or Down change direction call is associated with a probe set if the p-value < 0.05. If the p-value is > 0.05, the change direction call is None.
An Up call for a transcript indicates the signal is higher in the Experiment group than the Control group. A Down call indicates the signal is lower in the Experiment group compared to the Control group.
8. Right-click the P Value column header and select Sort Ascending from the shortcut menu that appears.
⇒ The pivot table displays the p-values in ascending order (Figure 15.38).
Figure 15.38 Pivot table

Step 3: Annotating Probe Sets
1. In the pivot table, select probe sets with an Up call and p-value < 0.001.
2. Right-click a selected row and select Annotate Probes from the shortcut menu that appears.
⇒ The Annotate dialog box appears (Figure 15.39).
Figure 15.39 Annotate dialog box
3. Enter or select Tutorial in the Annotation Type box.
4. In the Annotation box, enter Signal higher in T2 than T1 with p<= 0.001, then click OK.
⇒ The probe sets that showed a higher expression level in T2 compared to T1 with significance of p-value < 0.001 are annotated.

Step 4: Saving a Probe List
Save all probe sets with an Up direction call as a probe list named T2_T1_MW_T2UP. (See lesson 1, step 6.)
You can now inspect or further filter the probe list as in lessons 1 and 2.
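The change call logic in Step 2 can be sketched outside DMT for a single probe set. The following is a minimal illustration, not DMT's implementation: it computes an exact two-sided Mann-Whitney p-value by enumerating every way of relabeling the pooled replicates, then applies the 0.05 cutoff described above. The function names, the sample signal values, and the treatment of a p-value exactly equal to the cutoff are assumptions. DMT may compute its p-values differently (for example, with an approximation), so its numbers need not match this sketch.

```python
from itertools import combinations


def u_statistic(experiment, control):
    """Count (experiment, control) pairs where the experiment value is larger; ties count 0.5."""
    return sum((e > c) + 0.5 * (e == c) for e in experiment for c in control)


def mann_whitney_exact(control, experiment):
    """Return (U, two-sided exact p-value) by enumerating every relabeling of the pooled values."""
    pooled = list(control) + list(experiment)
    n_exp = len(experiment)
    expected_u = len(control) * n_exp / 2  # value of U when neither group tends to be larger
    observed_u = u_statistic(experiment, control)
    positions = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(positions, n_exp):
        picked = set(chosen)
        exp = [pooled[i] for i in chosen]
        ctl = [pooled[i] for i in positions if i not in picked]
        total += 1
        # Count relabelings at least as far from the expected U as the observed one.
        if abs(u_statistic(exp, ctl) - expected_u) >= abs(observed_u - expected_u):
            extreme += 1
    return observed_u, extreme / total


def change_call(control, experiment, cutoff=0.05):
    """Return ("Up" | "Down" | "None", p-value) for one probe set."""
    u, p = mann_whitney_exact(control, experiment)
    if p >= cutoff:  # assumption: a p-value equal to the cutoff is treated as no change
        return "None", p
    expected_u = len(control) * len(experiment) / 2
    return ("Up" if u > expected_u else "Down"), p


# Hypothetical signal values for one probe set (six replicates per tissue).
t1_signal = [100, 110, 120, 130, 140, 150]  # Control columns
t2_signal = [200, 210, 220, 230, 240, 250]  # Experiment columns
call, p_value = change_call(t1_signal, t2_signal)
```

With these hypothetical values every T2 replicate exceeds every T1 replicate, so the call is Up. Swapping the control and experiment arguments reverses the direction of the call, which mirrors the role of the Control Columns and Experiment Columns boxes.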
Lesson 4 Summary
When replicate analyses are available, the Mann-Whitney test helps determine whether differences in expression levels between two different groups of samples are statistically significant. The Mann-Whitney test generates change calls (Up, Down, None) based on comparisons of one numeric metric (typically, signal). Lesson 5 shows a more stringent comparison between two sets of replicates using comparison replicates.

Suggested Exercise
Repeat lesson 4 and apply the T-Test to tissue T1 and T2 replicate signal data.

Lesson 5: Evaluating Change Call Consistency
Comparison ranking is a useful method for assessing the consistency of change calls when comparing two data sets that include replicate analyses. It is a ranking strategy that uses the change call from Microarray Suite analysis to perform the ranking. The results are typically more conservative than a standard Mann-Whitney or T-Test.
To comparison rank two data sets:
■ Generate all possible combinations of comparison analyses for the two sets of replicate data in Affymetrix® Microarray Suite.
■ Pivot the change call result for all of the comparison analyses.
■ Run a count and percentage analysis of the change call data.
■ In the pivot table, sort the change call, count and percentage columns in descending order.
This arranges or ranks the probe sets with the highest count and percentage of a call at the top of the pivot table. Those with the lowest count and percentage of the call are displayed at the bottom of the table. In this format, you can conveniently evaluate the consistency of the data and the significance of a change call.
Lesson 5 shows how to comparison rank T1 and T2 change call data. The tutorial database includes comparison analyses for all possible combinations of the T1 and T2 replicates (36 total, generated in Affymetrix® Microarray Suite, see Figure 15.40 and Table 15.1).
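The count of 36 comparison analyses follows from pairing each of the six T1 replicates with each of the six T2 replicates. As a small sketch (the list names are illustrative; the analyses themselves are generated in Microarray Suite, not by this code), the pairings can be enumerated as:

```python
from itertools import product

# Replicate names as they appear in the tutorial database.
t1_replicates = [f"T1_r{i}" for i in range(1, 7)]
t2_replicates = [f"T2_r{i}" for i in range(1, 7)]

# One comparison analysis per (T1 replicate, T2 replicate) pair: 6 x 6 = 36.
comparisons = [f"{a} v {b}" for a, b in product(t1_replicates, t2_replicates)]
```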
This lesson includes:
■ Step 1: Clearing the Filter Grid & Selecting Comparison Analyses
■ Step 2: Pivoting on Change Call
■ Step 3: Comparison Ranking
■ Step 4: Annotating Probe Sets
■ Step 5: Saving a Probe List
Figure 15.40 DMT_2_Tutorial database, 36 comparison analyses of tissue T1 and T2 replicates

Table 15.1 Comparison analyses of T1 and T2 replicate data (generated in Affymetrix® Microarray Suite)

T1 Replicate  T2 Replicate Analyses
Analyses      T2_r1          T2_r2          T2_r3          T2_r4          T2_r5          T2_r6
              Comparison Analyses
T1_r1         T1_r1 v T2_r1  T1_r1 v T2_r2  T1_r1 v T2_r3  T1_r1 v T2_r4  T1_r1 v T2_r5  T1_r1 v T2_r6
T1_r2         T1_r2 v T2_r1  T1_r2 v T2_r2  T1_r2 v T2_r3  T1_r2 v T2_r4  T1_r2 v T2_r5  T1_r2 v T2_r6
T1_r3         T1_r3 v T2_r1  T1_r3 v T2_r2  T1_r3 v T2_r3  T1_r3 v T2_r4  T1_r3 v T2_r5  T1_r3 v T2_r6
T1_r4         T1_r4 v T2_r1  T1_r4 v T2_r2  T1_r4 v T2_r3  T1_r4 v T2_r4  T1_r4 v T2_r5  T1_r4 v T2_r6
T1_r5         T1_r5 v T2_r1  T1_r5 v T2_r2  T1_r5 v T2_r3  T1_r5 v T2_r4  T1_r5 v T2_r5  T1_r5 v T2_r6
T1_r6         T1_r6 v T2_r1  T1_r6 v T2_r2  T1_r6 v T2_r3  T1_r6 v T2_r4  T1_r6 v T2_r5  T1_r6 v T2_r6

Step 1: Clearing the Filter Grid & Selecting Comparison Analyses
1. To clear the filter grid, right-click the grid and select Clear Query from the shortcut menu that appears.
2. In the data tree, select all of the comparison analyses for the T1 and T2 replicate data (36 total) (Figure 15.41). (Press and hold the CTRL key while you click the analyses.)
Figure 15.41 Data tree, comparison analyses selected

Step 2: Pivoting on Change Call
1. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.42).
2. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.42).
Figure 15.42 Data Mining Options dialog box, Pivot tab
3. From the list of Relative Expression Data for the Statistical Algorithm, select Change.
4.
Click OK to close the Data Mining Options dialog box.
5. Click the Pivot toolbar button.
⇒ The pivot table displays the change call for each probe set in the selected analyses (Figure 15.43).
Figure 15.43 Pivot table displaying change calls

Step 3: Comparison Ranking
1. Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 15.44).
Figure 15.44 Analysis Function dialog box
2. Enter Rank T1vsT2 in the Column Name box.
3. Select the Count & Percentage analysis option, choose the I (increase) difference call option and click Next.
⇒ The column selection dialog box displays the pivot table columns available for the Count & Percentage analysis (Figure 15.45).
Figure 15.45 Analysis Function dialog box
4. Select all of the columns (comparison analyses) and click Finish.
⇒ The new pivot table columns Rank T1vsT2-Count and Rank T1vsT2-Percent are generated (Figure 15.46).
Figure 15.46 Pivot table
5. Right-click the Rank T1vsT2-Count column header and select Sort Descending from the shortcut menu that appears.
⇒ The probe sets with the highest count and percentage are arranged, or ranked, at the top of the pivot table (Figure 15.46). Those with the lowest count and percentage (least consistent data) are located at the bottom of the table.

Step 4: Annotating Probe Sets
1. Select the pivot table rows with Rank T1vsT2-Percent = 100%.
100% concordance is very high stringency or confidence. You can select a lower percentage, depending on your requirements.
2. Right-click a highlighted row and select Annotate Probes from the shortcut menu that appears.
⇒ The Annotate dialog box appears (Figure 15.47).
Figure 15.47 Annotate dialog box
3. Enter T1vT2: Increase with 100% concordance in the Annotation box.
4. Enter or select Tutorial in the Annotation Type box.
5. Click OK.
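The comparison ranking performed in Step 3 amounts to counting, for each probe set, how many of the comparison analyses returned the chosen change call, then sorting by that count. A minimal sketch follows, assuming a pivot table held as a dictionary from probe set name to its list of change calls. The probe set names are hypothetical, and the call codes other than I (increase) are assumptions about how calls are abbreviated.

```python
def count_and_percentage(calls, target="I"):
    """Return (count, percent) of entries in `calls` equal to `target`."""
    count = sum(1 for call in calls if call == target)
    percent = 100.0 * count / len(calls) if calls else 0.0
    return count, percent


def comparison_rank(pivot, target="I"):
    """Rank probe sets by how often `target` was called, highest count first.

    `pivot` maps a probe set name to its change calls, one per comparison analysis.
    """
    rows = [(probe, *count_and_percentage(calls, target)) for probe, calls in pivot.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical change calls across the 36 comparison analyses.
pivot = {
    "probe_b": ["I"] * 18 + ["NC"] * 18,  # increase called in half of the comparisons
    "probe_c": ["D"] * 36,                # never called increase
    "probe_a": ["I"] * 36,                # increase called in every comparison
}
ranked = comparison_rank(pivot)
concordant = [probe for probe, count, percent in ranked if percent == 100.0]
```

Filtering on a percentage below 100 in the last line corresponds to relaxing the stringency, as noted in Step 4.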
Step 5: Saving a Probe List
In the pivot table, select the probe sets with Rank T1vsT2-Percent = 100% and save them as a probe list. (See lesson 1, step 6.)

Lesson 5 Summary
The comparison ranking method uses the count and percentage operator to rank the increase or decrease change calls of comparison analyses between two groups of replicate samples. The method enables you to assess the consistency or concordance of change calls between the two groups.
In this lesson we identified the genes that show concordance of the increase change call between T1 and T2. We annotated these genes and saved them as a probe list.

Suggested Exercise
Perform a comparison ranking using count and percentage analysis on tissue T1 and T2 Decrease and Marginal Decrease change calls.

Lesson 6: Self Organizing Map (SOM) Cluster Analysis
Cluster analysis groups probe sets with similar gene expression patterns. For example, cluster analysis can help identify transcripts that are increased after a treatment or over a period of time. Clustering can be applied to any numeric output; however, the SOM algorithm is optimized for expression signals and the algorithm defaults are set accordingly.¹
This lesson demonstrates how to:
■ Compute the average signal values of tissue T1, T2 and T3
■ Apply SOM cluster analysis to the average signal values of T1, T2 and T3 (See Appendix D for more information about the SOM algorithm.)
■ Save a cluster result as a probe list
Lesson 6 includes:
■ Step 1: Clearing the Filter Grid & Selecting Analyses
■ Step 2: Pivoting on Signal
■ Step 3: Computing Average Signal
■ Step 4: SOM Cluster Analysis
■ Step 5: Saving & Annotating a Probe List

1. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA. 96:2907-2912.
Step 1: Clearing the Filter Grid & Selecting Analyses
1. To clear the filter grid, right-click the grid and select Clear Query from the shortcut menu that appears.
2. In the data tree, select the absolute analysis replicates for tissue T1, T2 and T3 (18 analyses).
3. Click the Options toolbar button.
⇒ The Data Mining Options dialog box appears (Figure 15.48).
4. Click the Pivot tab.
⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.48).
Figure 15.48 Data Mining Options dialog box, Pivot tab
5. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.
6. Click OK to close the Data Mining Options dialog box.

Step 2: Pivoting on Signal
1. To pivot the data, click the Pivot toolbar button.
⇒ The pivot table displays the signal for each probe in the selected analyses (Figure 15.49).
Figure 15.49 Pivot table

Step 3: Computing Average Signal
1. Select Analyze → Analysis Function from the menu bar.
⇒ The Analysis Function dialog box appears (Figure 15.50).
Figure 15.50 Analysis Function dialog box
2. Enter T1 in the Column Name box.
3. Select the Average operator, then click Next.
⇒ The column selection dialog box appears (Figure 15.51).
Figure 15.51 Analysis Function dialog box
4. Select the six replicate T1 Signal columns (absolute analyses) in the Analysis Function dialog box, then click Finish.
⇒ The new column T1-Average is generated in the pivot table (Figure 15.52).
Figure 15.52 Pivot table
5. Repeat items 1 through 4 in Step 3 for the replicate T2 signal columns (enter T2 in the Column Name box of the Analysis Function dialog box) to generate the T2-Average column in the pivot table.
6.
Repeat items 1 through 4 in Step 3 for the replicate T3 signal columns (enter T3 in the Column Name box of the Analysis Function dialog box) to generate the T3-Average column in the pivot table.

Step 4: SOM Cluster Analysis
1. Select Analyze → SOM Clustering from the menu bar.
⇒ The Select Columns for Clustering dialog box appears (Figure 15.53).
Figure 15.53 Select Columns for Clustering dialog box
2. Select the T1-Average, T2-Average and T3-Average columns, then click OK.
⇒ The SOM Clustering dialog box appears (Figure 15.54).
Figure 15.54 SOM Clustering dialog box
The SOM Clustering dialog box contains two sections: SOM Filtering (top) and Parameters (bottom). The filter and parameters settings significantly affect the analysis. Click Defaults to reset the algorithm parameters to the default settings.

SOM Filtering
There are three types of SOM filters: thresholds, row variation and row normalization. The default values are appropriate for most data sets when clustering on signal, but the optimum values may differ depending on the data set and the type of information you want to extract from your data.

Thresholds
This sets the maximum and minimum values for the data set (signal in this example). The minimum and maximum threshold settings control the outliers in the data. Figure 15.55 shows three expression profiles. In the raw data, the two outliers prevent effective normalization. Normalization is much more effective after filtering. However, filtering removes information from the data set, so filter as little as possible to obtain the optimum results.
Figure 15.55 Example expression profiles showing effects of normalization and filtering

Row Variation
When comparing expression patterns between different biological conditions, most genes do not significantly change expression level and are uninformative.
Keeping uninformative genes in the data set in effect forms a single, large cluster that may affect our ability to cluster the expression patterns that do change. The row variation filters define the genes that are considered changed.
The Max/Min setting defines the minimum expression ratio value (maximum expression level/minimum expression level) a probe set must have across all experiments to be included in the cluster analysis. The Max/Min setting is very useful at moderate to high expression levels, but is subject to noise at low expression levels. For example, in Figure 15.56, the max/min ratio (b) for the bottom profile is much higher than max/min ratio (a) for the top profile. We need a second parameter to filter for changes at low expression levels.
Figure 15.56 Max/Min
The Max-Min setting can distinguish between changes in low expression levels and high expression levels, and can be used to filter out noise. As Figure 15.57 shows, the max-min can be set to eliminate the bottom profile if desired. It is important to filter out noisy, low-level expression carefully, because the noise will be amplified after normalization (Figure 15.57).
To see how many probe sets remain after filtering, click Compute in the SOM Clustering dialog box.
The Max-Min value sets a threshold for the absolute numerical difference of the clustering values. For example, if the Max/Min is set at three, changes of 30/10 and 300/100 will be included in the cluster analysis. By setting the Max-Min to 100, we eliminate the inherently noisy, low numerical change values.
Figure 15.57 Max-min filter setting

Row Normalization
Normalization is a technique that helps to answer the question: Which probe sets have similar expression patterns?
For example, we may be interested in finding all genes that increase expression under certain experimental conditions regardless of the actual level of increase. Asking the question in this way allows us to find small as well as large changes in expression levels. It also makes the technique less sensitive to experimental variation in the absolute expression levels, such as difficulty normalizing between controls and experiments.
Filtering and normalization usually reduce the number of clusters in a data set because we ignore the actual expression levels. For example, Figure 15.58 shows three different transcripts that are expressed at very different levels. Without normalization, the three probe sets may group into three different clusters according to their absolute expression levels. However, after normalization, it is clear that their relative expression levels are the same and they should cluster together.
Figure 15.58 Raw and normalized expression profiles

Order of Filtering
The Up and Down buttons in the SOM Clustering dialog box can be used to change the order in which the filters are applied to the data. Be very careful if you intend to change the default order. In particular, the row normalization filter changes the values of the data to which the filters are applied and will change the filter functions significantly.

Parameters
The rows and columns parameters should be considered carefully. These settings specify the grid of centroids or nodes that is applied to the data. In general, try to keep the grid square or almost square to ensure good coverage of the whole data set. For the same reason, it also helps, but is not imperative, to make one of the settings (row or column) an odd number.
The exact number of clusters (rows x columns) you select depends on the size and complexity of the data set, and the type of analysis you want to perform.
The default of 18 clusters (6 rows x 3 columns) is a good place to start. If you find that the analysis generates empty clusters (clusters with zero members), reduce the number of clusters.
Reducing the cluster number increases the variability of the shapes of curves grouped together in a cluster. This is indicated by an increase in the distance between the two (blue) error bars. Generating a small number of clusters summarizes the data, but may obscure rare, interesting patterns.
Increasing the cluster number reduces the variability of the patterns that are grouped together. This is indicated by a decrease in the distance between the two (blue) error bars. If the cluster number is increased too much, the algorithm generates clusters with no members and many clusters will look the same.
The optimum number of clusters for a particular dataset displays the narrowest possible error bars with lowest number of empty clusters.
The remaining parameters affect the functioning of the cluster program and are intended for expert users. Do not change these parameters unless you understand their function and the effect of changing them on your data.
3. In the SOM Clustering dialog box, click the Add> button for the Thresholds, Row Variation and Row Normalization variables.
⇒ The default values for these algorithm variables are displayed in the box in the upper right corner (Figure 15.59).
Figure 15.59 SOM Clustering dialog box
4. Enter 3 rows and 2 columns in the Parameters section.
Other values may be more appropriate for your data. These are suggested values for this data set.
5. Click Run.
⇒ The SOM algorithm generates 6 clusters (3 rows x 2 columns specified in the SOM parameters) (Figure 15.60).
Your results may not be identical to the clusters in Figure 15.60. Run-to-run cluster results may vary slightly because the nodes are randomly initialized (see Appendix D).
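The threshold and row variation filters described under SOM Filtering can be illustrated with a short sketch. It assumes that the thresholds clamp each value into the range between the minimum and maximum settings, and that a row is kept only if it passes both the Max/Min and the Max-Min settings; DMT's actual behavior may differ. The function names and the threshold values are hypothetical and are not DMT defaults. The row variation example reuses the 30/10 and 300/100 changes from the text.

```python
def apply_thresholds(row, minimum, maximum):
    """Clamp every value in the row into [minimum, maximum] (assumed threshold behavior)."""
    return [min(max(value, minimum), maximum) for value in row]


def passes_row_variation(row, min_ratio, min_difference):
    """Keep a row only if max/min >= min_ratio and max - min >= min_difference."""
    low, high = min(row), max(row)
    if low <= 0:  # a ratio is undefined for non-positive values; such rows are dropped here
        return False
    return high / low >= min_ratio and high - low >= min_difference


# Hypothetical threshold settings: values below 20 or above 16000 are pulled to the limits.
clamped = apply_thresholds([5, 50, 50000], minimum=20, maximum=16000)

low_profile = [10, 30, 20]      # change of 30/10, absolute difference 20
high_profile = [100, 300, 200]  # change of 300/100, absolute difference 200

# With Max/Min = 3 alone, both profiles are included in the cluster analysis.
ratio_only = [passes_row_variation(p, 3, 0) for p in (low_profile, high_profile)]
# Adding Max-Min = 100 removes the low-level profile.
ratio_and_diff = [passes_row_variation(p, 3, 100) for p in (low_profile, high_profile)]
```

Applying the threshold clamp before the row variation test matters, because clamping changes the minimum and maximum of a row; this is one reason the order of filtering affects the result.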
Figure 15.60 SOM cluster results
To expand the cluster graph view, right-click the graph and select Expand Graph from the shortcut menu that appears.
Figure 15.60 shows the results of clustering the mean signal values for three tissues (six replicates each). T1 is the first point, T2 is the second point and T3 is the third point.

Step 5: Saving & Annotating a Probe List
After genes of interest are identified, we can save them as a probe list.
1. Click Cluster 3.
⇒ The probe sets in cluster 3 are displayed at the right in the Probes box (Figure 15.60).
2. Enter the name Cluster 3 in the Probe List Name box.
3. Click Save Selected.
⇒ A probe list is generated that includes the probe sets in cluster 3 and the probe list is displayed in the data tree.
4. Annotate the probe list members (see lesson 3, step 5).
In future sessions the annotations may be queried and sorted (see Chapter 8, Annotations).

Lesson 6 Summary
SOM cluster analysis identifies gene expression patterns in the data. The threshold and row variation filters help focus the analysis on probe sets that have the same expression pattern. The cluster results display patterns of gene expression rather than absolute expression levels because the Row Normalization filters normalize the signal data to a mean of zero and variance of one (see Appendix D).
Adjusting the number of nodes or centroids (rows x columns) affects the cluster number and the variability of expression patterns grouped together in a cluster. The optimum number of clusters for a particular data set displays the narrowest possible error bars with the lowest number of empty or similar clusters.
We computed the average signal for the T1, T2 and T3 replicates. We applied SOM cluster analysis to the average signal values. The SOM cluster results organize the expression data into groups of genes with similar expression patterns.
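The effect of row normalization to a mean of zero and variance of one can be seen in a small sketch. The function name and the signal values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption. Three transcripts expressed at very different levels, but with the same relative pattern across T1, T2 and T3, end up with the same normalized profile.

```python
from statistics import mean, pstdev


def normalize_row(row):
    """Shift and scale a row to mean 0 and variance 1 (population variance assumed)."""
    center = mean(row)
    spread = pstdev(row)
    if spread == 0:  # a flat profile has no pattern to scale
        return [0.0 for _ in row]
    return [(value - center) / spread for value in row]


# Hypothetical average signal values (T1, T2, T3) for three transcripts.
profiles = {
    "low": [10, 20, 30],
    "medium": [100, 200, 300],
    "high": [1000, 2000, 3000],
}
normalized = {name: normalize_row(row) for name, row in profiles.items()}
```

After normalization, only the shape of each profile remains, which is why probe sets with very different absolute signal levels can fall into the same cluster.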
Appendix A Filter Grid

This Appendix explains the column headings in the filter grid for both GeneChip® data mode and spot data mode.

GeneChip Data Mode
The filter grid includes expression metrics generated by the Statistical Expression algorithm (in Microarray Suite 5.0) or the Empirical Expression algorithm (in Microarray Suite 4.0 or lower).

Statistical Expression Algorithm

Probe Set Name         Identifier for the probe set on a GeneChip® probe array.
Signal                 A measure of the abundance of a transcript.
Detection              The call that indicates whether the transcript was present (P), absent (A), marginal (M), or no call (NC).
Detection p-value      p-value that indicates the significance of the detection call.
Stat Pairs             The number of probe pairs for a particular probe set on the array.
Stat Pairs Used        Stat Pairs Used = Pairs - Masked probe pairs - Saturated MM probe pairs. This is the number of pairs used by the Statistical Expression algorithm to make the detection call in an absolute analysis.
Signal Log Ratio       The change in expression level for a transcript between a baseline and an experiment array. This change is expressed as the log2 ratio.
Signal Log Ratio Low   The lower limit of the log ratio within a 95% confidence interval.
Signal Log Ratio High  The upper limit of the log ratio within a 95% confidence interval.
Change                 The call that indicates the change in the transcript level between a baseline and an experiment array.
Change p-value         p-value that indicates the significance of the change call.
Stat Common Pairs      The intersection of the probe pairs from the baseline and experiment that are used by the Statistical Expression algorithm to make the change call in a comparison analysis.

Empirical Expression Algorithm

Probe Set Name         Identifier for the probe set on a GeneChip® probe array.
Positive               The number of probe pairs scored positive.
A probe pair is positive if:

PM - MM > Statistical Difference Threshold (SDT)
and
PM/MM > Statistical Ratio Threshold (SRT)

where PM = perfect match intensity and MM = mismatch intensity. The SDT is a function of the noise (Q) and is calculated as:

SDT = Q * SDTmultiplier

The SDTmultiplier and the SRT are user-modifiable parameters (see Affymetrix® Microarray Suite User’s Guide). The SDTmultiplier is set at 2.0 for the standard staining protocol or 4.0 for the antibody amplification protocol. (Refer to the Affymetrix Expression Analysis Technical Manual.) The default SRT value is 1.5.

Note: Increasing the SDTmultiplier and SRT increases analysis stringency; reducing these thresholds decreases analysis stringency.

Negative: The number of probe pairs scored negative. A probe pair is negative if: MM - PM > SDT and MM/PM > SRT.

Pairs: Number of probe pairs for a particular probe set on a GeneChip® probe array.

Pairs Used: Number of probe pairs per probe set used in the analysis (Empirical Expression algorithm). This may be the total number of probes per probe set on the probe array or the number of probe pairs in a pre-designated subset (for example, probe pairs specified by a probe mask file and/or a masked image).

Pairs Used = total probe pairs per probe set – (probe pairs masked in a mask file) – (probe pairs masked in the image)

Pairs in Average: A trimmed probe set that excludes probes with extremely intense or weak signal from the analysis. If 8 or fewer probe pairs are used, Pairs in Avg = Pairs Used (or the number of probe pairs per probe set minus any that are masked). Superscoring is performed if more than 8 probe pairs are used. Superscoring is a process that excludes probe pairs from calculation of the Avg Diff and Log Avg Ratio if they are outside a given intensity range.
Microarray Suite software calculates the mean and standard deviation of the intensity differences (PM – MM) for an entire probe set (excluding the highest and lowest values). Values that lie more than a set number of standard deviations (STP) from the mean are not included in the calculation of the Avg Diff or Log Avg Ratio. The STP is a user-modifiable parameter with a default value of 3.

Log Avg (Log Avg Ratio): Describes the hybridization performance of a probe set and is determined by calculating the ratio of the PM/MM intensities for each probe pair in a probe set, taking the logs of the resulting values and averaging them for the probe set:

Log Avg = 10 x [Σ log (PM/MM)] / Pairs in Avg

Note: Log Avg = 0 indicates random cross hybridization. The higher the Log Avg, the greater the confidence that the gene transcript is present.

Average Difference: Serves as a relative indicator of the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments. The Avg Diff is calculated by taking the difference between the PM and MM of every probe pair in a probe set (excluding the probe pairs where PM – MM lies more than STP standard deviations from the mean of PM – MM) and averaging the differences for the entire probe set:

Avg Diff = Σ (PM – MM) / Pairs in Avg

Note: Avg Diff cannot be used to compare the hybridization intensity levels of two different probe sets on the same array.

Absolute Call: Each transcript in an absolute analysis has three possible Absolute Call outcomes: Present (P), Absent (A), or Marginal (M). The absolute call is derived from the Pos/Neg, Positive Fraction and Log Avg absolute call metrics. Each absolute call metric is weighted and entered into a decision matrix to determine the status of the transcript.

Increase (Inc): Number of probe pairs that increased.
A probe pair is considered to increase if the intensity difference between the PM and MM probe cells in the experimental sample is significantly higher than in the baseline sample. Two criteria must be met for a probe pair to show a significant increase:

(PM – MM)exp – (PM – MM)baseline > Change Threshold (CT)
and
[(PM – MM)exp – (PM – MM)baseline] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Affymetrix Microarray Suite computes the Change Threshold (CT) using the Statistical Difference Threshold of the experiment and baseline data. Alternatively, you can specify a value for the CT multiplier, which is multiplied by the noise of the baseline or experiment data (whichever is greater) to define CT. Percent Change Threshold is a user-specified value (default = 80).

Decrease (Dec): Number of probe pairs that decreased. A probe pair is considered to decrease if the intensity difference between the PM and MM probe cells in the experimental sample is significantly lower than in the baseline sample.
Two criteria must be met for a probe pair to show a significant decrease:

(PM – MM)baseline – (PM – MM)exp > Change Threshold (CT)
and
[(PM – MM)baseline – (PM – MM)exp] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Increase Ratio: For each transcript: # Increased probe pairs / # probe Pairs Used

Decrease Ratio: For each transcript: # Decreased probe pairs / # probe Pairs Used

Positive Change: # Positive probe pairs(exp) - # Positive probe pairs(baseline)

Negative Change: # Negative probe pairs(exp) - # Negative probe pairs(baseline)

Difference Positive Difference Negative Ratio (DPos - DNeg Ratio): (Positive Change – Negative Change) / # probe Pairs Used

Log Avg Ratio Change: The difference between the Log Avg Ratio of the baseline and experimental probe array data (in a comparison analysis) for each transcript. The Log Avg Ratios are recomputed for each probe set based on probe pairs used in both the baseline and experimental probe arrays (the recomputed values are not displayed by DMT).

Log Avg Ratio Change = Log Avg(exp) – Log Avg(base)

The DPos – DNeg Ratio and Log Avg Ratio Change are usually positive when a transcript changes from a very low to a relatively high expression level and are typically negative when the expression level changes from a high to a very low or undetectable level. Both metrics may have values close to zero if the transcript is present in both the baseline and experimental samples despite an increase or decrease in the level of the transcript.

Difference Call: Each transcript in a comparison analysis has five possible Difference Call outcomes: (1) Increase (I), (2) Marginally Increase (MI), (3) Decrease (D), (4) Marginally Decrease (MD), and (5) No Change (NC). The difference call is derived from the comparison metrics: Max [Increase/Total, Decrease/Total], Increase/Decrease Ratio, Log Average Ratio Change, and DPos – DNeg Ratio.
Each comparison metric is weighted and entered into a decision matrix to determine the status of the transcript (see Affymetrix Microarray Suite User’s Guide).

Average Difference Change: Avg Diff(experiment) - Avg Diff(baseline)

Fold Change: Indicates the relative change in the expression levels between the experiment and baseline targets. The Fold Change (FC) for a transcript is a positive number when the expression level in the experiment increases compared to the baseline and is a negative number when the expression level in the experiment declines.

$$
\mathrm{FC} = \frac{\mathrm{AvgDiffChange}}{\max\left[\min\left(\mathrm{AvgDiff}_{exp},\ \mathrm{AvgDiff}_{base}\right),\ Q_M \times Q_C\right]} +
\begin{cases}
+1 & \text{if } \mathrm{AvgDiff}_{exp} > \mathrm{AvgDiff}_{base} \\
-1 & \text{if } \mathrm{AvgDiff}_{exp} < \mathrm{AvgDiff}_{base}
\end{cases}
$$

where $Q_C = \max(Q_{exp}, Q_{base})$, $Q_M = 2.1$ for a 50 µm feature and $Q_M = 2.8$ for a 24 µm or smaller feature.

Microarray Suite recomputes the normalized or scaled Avg Diff values in both the experimental and baseline data sets to include only probe pairs used in both the baseline and experiment arrays. This recomputation is not done in the DMT calculation. If the noise (Q) of the experiment or baseline is greater than the Avg Diff of the transcript (baseline or experiment data), the Fold Change is calculated over the noise and is an approximation (a tilde character (~) precedes the approximated Fold Change value in the *.chp file).

Sort Score: A ranking based on the Fold Change and the Avg Diff Change. The higher the Fold Change and the Avg Diff Change, the higher the Sort Score.

Spot Data Mode

Spot: Identifier for the spotted probe.

Intensity: Background-subtracted spot intensity.

Standard deviation (SD): Standard deviation of the intensity signal.

Pixels: Number of pixels in the image data file (*.tif) used to calculate the intensity for a spot.

Background: Background calculated for a spot.

Background SD: Standard deviation of the spot background.
Ratio: If channel 1 > channel 2, ratio = - channel 1/channel 2; otherwise ratio = channel 2/channel 1.

Appendix B Working with Windows & Tables

The windows and tables found in Affymetrix® Data Mining Tool can be modified to suit the individual needs of the user or data. This appendix explains the options available.

Query Windowpanes

Expanding a Windowpane

1. To expand the results pane (or the graph pane), right-click the pane and select Expand Results (or Expand Graph) from the shortcut menu. Alternatively, select View → Expand Results (or View → Expand Graph) from the main menu.
   ⇒ The results (or graph) pane is enlarged and the graph (or results) pane is hidden.
2. Repeat step 1 to return the pane to its original size.

Resizing a Windowpane

You may resize a windowpane using the click-and-drag method to move a border.

1. Place the mouse pointer over a border so that it changes from a single arrow to a double arrow.
2. Use the click-and-drag method to move a border in the horizontal or vertical direction and resize the windowpane.

Clearing the Results or Graph Pane

Right-click the results (or graph) pane and select Clear Results (or Clear Graph) from the shortcut menu. Alternatively, select Edit → Clear Results (or Edit → Clear Graphs) from the main menu.
⇒ All graphs are cleared and the graph pane is hidden.

Tables

Selecting the Entire Table

Click the upper left corner of a table.
⇒ All rows in the table are selected (Figure B.1).

Figure B.1 Query table

Selecting Rows

To select adjacent rows, press and hold the SHIFT key while you click the first and last row in the selection. To select non-adjacent rows, press and hold the CTRL key while you click the rows.

Resizing Columns

1. Place the mouse pointer over the border of a column header.
   ⇒ The mouse pointer changes from a single arrow to a double arrow.

Figure B.2 Query table, adjusting width of Analysis Name column

2.
Use the click-and-drag method to adjust the column width.

Hiding Columns

1. Right-click an analysis column header in the experiment information table or a metric column header in the query or pivot table (Figure B.3).

Figure B.3 Pivot table, shortcut menu of column commands

2. Select Hide Column from the shortcut menu.
   ⇒ The selected column is hidden.
3. To show hidden columns, right-click an analysis column header in the experiment information table or a metric column header in the query or pivot table, then select Show All Columns from the shortcut menu (Figure B.3).

Reordering Columns

1. Click a column header and use the click-and-drag method to move the column.
2. In the pivot table, click a column header (analysis) and use the click-and-drag method to reorder the column and its subordinate results columns.
3. In the pivot table, click a subordinate column header (results), then use the click-and-drag method to reorder the results column within the analysis.

DMT retains the column order of the results table in saved queries and as a user preference. If you open a previously saved query, DMT: 1) displays the results tables using the saved column order, and 2) unhides any hidden columns of results data. If you create a new query, DMT applies the column settings used in the previous session.

Appendix C Query Table Data

After running a query, results are presented in the Query Table. This appendix defines the column headings and explains the information found there, for both GeneChip Data Mode and Spot Data Mode.

GeneChip® Data Mode

Statistical Expression Algorithm Metrics

Probe Set Name: Identifier for the probe set on a GeneChip® probe array.

Signal: A measure of the abundance of a transcript.

Detection: The call that indicates if the transcript was detected (P) or undetected (A).

Detection p-value: p-value that indicates the significance of the detection call.
Stat Pairs: The number of probe pairs for a particular probe set on the array.

Stat Pairs Used: Pairs - Masked probe pairs - Saturated MM probe pairs. This is the number of pairs used by the Statistical Expression algorithm to make the detection call in an absolute analysis.

Signal Log Ratio: The change in expression level for a transcript between a baseline and an experiment array. This change is expressed as the log2 ratio.

Signal Log Ratio Low: The lower limit of the log ratio within a 95% confidence interval.

Signal Log Ratio High: The upper limit of the log ratio within a 95% confidence interval.

Change: The call that indicates the change in the transcript level between a baseline and an experiment array.

Change p-value: p-value that indicates the significance of the change call.

Stat Common Pairs: The intersection of the probe pairs from the baseline and experiment that are used by the Statistical Expression algorithm to make the change call in a comparison analysis.

Empirical Expression Algorithm Metrics

Analysis Name: Name of the experiment entered during experiment set up.

Probe Set Name: Identifier for the probe set on the array.

Positive: Number of probe pairs scored positive. A probe pair is called positive if the intensity of the PM probe cell is significantly greater than that of the corresponding MM probe cell. To evaluate intensity, the Empirical Expression algorithm calculates the ratio and difference associated with each probe pair and compares these values to the Statistical Difference Threshold (SDT) and the Statistical Ratio Threshold (SRT). A probe pair is positive if: PM - MM > SDT and PM/MM > SRT.

Negative: Number of probe pairs scored negative. A probe pair is called negative if the intensity of the MM probe cell is significantly greater than that of the corresponding PM probe cell.
To evaluate intensity, the expression algorithm calculates the ratio and difference associated with each probe pair and compares these values to the Statistical Difference Threshold (SDT) and the Statistical Ratio Threshold (SRT). A probe pair is negative if: MM - PM > SDT and MM/PM > SRT. (See Affymetrix® Microarray Suite User’s Guide for further information.)

Pairs: Number of probe pairs for a particular probe set on the probe array.

Pairs Used: Number of probe pairs per probe set used in the analysis. This may be the total number of probes per probe set on the probe array or the number of probe pairs in a pre-designated subset (for example, probe pairs specified by a probe mask file or a masked image).

Pairs Used = total probe pairs per probe set - (probe pairs masked in a mask file) - (probe pairs masked in the image)

Pairs in Avg: A trimmed probe set that excludes probes with extremely intense or weak signal from the analysis. If 8 or fewer probe pairs are used, Pairs in Avg = Pairs Used (or the number of probe pairs per probe set minus any that are masked). Superscoring is performed if more than 8 probe pairs are used. Superscoring is a process that excludes probe pairs from calculation of the Avg Diff and Log Avg Ratio if they are outside a given intensity range. Microarray Suite calculates the mean and standard deviation of the intensity differences (PM – MM) for an entire probe set (excluding the highest and lowest values). Values that lie more than a set number of standard deviations (STP) from the mean are not included in the calculation of the Avg Diff or Log Avg Ratio. The STP is a user-modifiable parameter with a default value of 3.

Pos Fraction: Number of positive probe pairs divided by the number of probe pairs used.
Log Avg: Describes the hybridization performance of a probe set and is determined by calculating the ratio of the PM/MM intensities for each probe pair in a probe set, taking the logs of the resulting values, and averaging them for the probe set:

Log Avg = 10 x [Σ log (PM/MM)] / Pairs in Avg

Pos/Neg: Ratio of positive probe pairs to negative probe pairs in a probe set (# Positive probe pairs / # Negative probe pairs).

Avg Diff: This parameter serves as a relative indicator of the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments.

Note: Avg Diff cannot be used to compare the hybridization intensity levels of two different probe sets on the same array.

Avg Diff is calculated by taking the difference between the PM and MM of every probe pair in a probe set (excluding the probe pairs where PM – MM lies more than STP standard deviations from the mean of PM – MM) and averaging the differences for the entire probe set:

Avg Diff = Σ (PM – MM) / Pairs in Avg

Norm Avg Diff: Avg Diff x Normalization Factor. DMT computes the normalization factor (NF) using all probe sets on the array in an analysis, then applies any specified filters. All intensities in an analysis are multiplied by the NF.

Absolute Call: Each transcript in an absolute analysis has three possible Absolute Call outcomes: Present (P), Absent (A), or Marginal (M). The absolute call is derived from the Pos/Neg, Positive Fraction, and Log Avg absolute call metrics. Each absolute call metric is weighted and entered into a decision matrix to determine the status of the transcript. (See Affymetrix® Microarray Suite User’s Guide for further information.)

Increase (Inc): Number of probe pairs that increased.
A probe pair is considered to increase if the intensity difference between the PM and MM probe cells in the experimental sample is significantly higher than in the baseline sample. Two criteria must be met for a probe pair to show a significant increase:

(PM – MM)exp – (PM – MM)baseline > Change Threshold (CT)
and
[(PM – MM)exp – (PM – MM)baseline] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Affymetrix Microarray Suite computes the Change Threshold (CT) using the Statistical Difference Threshold of the experiment and baseline data. Alternatively, you can specify a value for the CT multiplier, which is multiplied by the noise of the baseline or experiment data (whichever is greater) to define CT. Percent Change Threshold is a user-specified value (default = 80).

Decrease (Dec): Number of probe pairs that decreased. A probe pair is considered to decrease if the intensity difference between the PM and MM probe cells in the experimental sample is significantly lower than in the baseline sample. Two criteria must be met for a probe pair to show a significant decrease:

(PM – MM)baseline – (PM – MM)exp > Change Threshold (CT)
and
[(PM – MM)baseline – (PM – MM)exp] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

(See Affymetrix® Microarray Suite User’s Guide for further information.)

Inc Ratio: For each transcript: # Increased probe pairs / # probe Pairs Used

Dec Ratio: For each transcript: # Decreased probe pairs / # probe Pairs Used

Pos Change: # Positive probe pairs(experiment) - # Positive probe pairs(baseline)

Neg Change: # Negative probe pairs(experiment) - # Negative probe pairs(baseline)

Inc/Dec: For each transcript: # Increased probe pairs / # Decreased probe pairs

Dpos-Dneg Ratio: (Positive Change – Negative Change) / # probe Pairs Used

Log Avg Ratio Change: The difference between the Log Avg Ratio of the baseline and experimental probe array data (in a comparison analysis) for each transcript.
The Log Avg Ratios are recomputed for each probe set based on probe pairs used in both the baseline and experimental probe arrays (the recomputed values are not displayed by DMT). The Dpos – Dneg Ratio and Log Avg Ratio Change are usually positive when a transcript changes from a very low to a relatively high expression level and are typically negative when the expression level changes from a high to a very low or undetectable level. Both metrics may have values close to zero if the transcript is present in both the baseline and experimental samples despite an increase or decrease in the level of the transcript.

Log Avg Ratio Change = Log Avg(exp) – Log Avg(base)

Diff Call: Each transcript in a comparison analysis has five possible Difference Call outcomes: (1) Increase (I), (2) Marginally Increase (MI), (3) Decrease (D), (4) Marginally Decrease (MD), and (5) No Change (NC). The difference call is derived from the comparison metrics: Max [Increase/Total, Decrease/Total], Increase/Decrease Ratio, Log Average Ratio Change, and Dpos – Dneg Ratio. Each comparison metric is weighted and entered into a decision matrix to determine the status of the transcript. (See Affymetrix® Microarray Suite User’s Guide for further information.)

Avg Diff Change: Serves as a relative indicator of the change in the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments. The Avg Diff Change is calculated as:

Avg Diff Change = Avg Diff(exp) – Avg Diff(baseline)

B=A (Baseline = Absent): An asterisk (*) in this column indicates the transcript is called absent (A) in the baseline.

Fold Change: The Fold Change indicates the relative change in the expression levels between the experiment and baseline targets.
The Fold Change for a transcript is a positive number when the expression level in the experiment increases compared to the baseline and is a negative number when the expression level in the experiment declines. The Fold Change (FC) is calculated as:

$$
\mathrm{FC} = \frac{\mathrm{AvgDiffChange}}{\max\left[\min\left(\mathrm{AvgDiff}_{exp},\ \mathrm{AvgDiff}_{base}\right),\ Q_M \times Q_C\right]} +
\begin{cases}
+1 & \text{if } \mathrm{AvgDiff}_{exp} > \mathrm{AvgDiff}_{base} \\
-1 & \text{if } \mathrm{AvgDiff}_{exp} < \mathrm{AvgDiff}_{base}
\end{cases}
$$

(See Affymetrix® Microarray Suite User’s Guide for further information.)

Approx: If the noise (Q) of the experiment or baseline array is greater than the Avg Diff of the transcript (the baseline or experimental data), the Fold Change is calculated over the noise and is an approximation (a tilde character (~) precedes the approximated Fold Change value in the *.chp file).

Sort Score: The Sort Score is a ranking based on the Fold Change and the Avg Diff Change. The higher the Fold Change and the Avg Diff Change, the higher the Sort Score.

Spot Data Mode

Analysis name: Name of the experiment associated with an intensity results file (*.spt).

Spot: Identifier for the spotted probe.

Intensity: Background-subtracted intensity for the selected spot.

Standard Deviation: Standard deviation of the spot intensity.

Pixels: Number of pixels in the image data file (*.tif) used to calculate the channel signal (intensity).

Background: Background calculated for the spot.

Background SD: Standard deviation of the background for a spot.

Ratio: Ratio of channel 1/channel 2 intensity data.

Appendix D DMT Algorithms

This appendix provides further information on the three algorithms used in Affymetrix® Data Mining Tool: the SOM clustering algorithm, Correlation Coefficient clustering algorithm and the Matrix algorithm.

The SOM Algorithm

The self organizing map (SOM) algorithm applies cluster analysis to GeneChip® metric data to help identify gene expression patterns.
The algorithm considers the expression levels of n probe sets (or the intensities of n probes) in k experiments as n points in k-dimensional space. Initially, the algorithm randomly places a grid of nodes or centroids onto the k-dimensional space. The rows and columns of nodes determine the number of clusters identified by the algorithm (rows x columns = number of clusters). Figure D.1 shows a 3 x 2 arrangement of nodes that can identify six clusters of gene expression patterns.

Figure D.1 3 x 2 arrangement of nodes

The user specifies the rows and columns of nodes as well as the initial placement of the nodes (initialization). The random vectors method randomly places the nodes in k-dimensional space. The random datapoints method places the nodes on randomly-selected points.

Next, the algorithm iteratively adjusts the node positions toward clusters of points. At each iteration, it selects a data point (P) and moves (updates) the node closest to P (the target node, Np) toward P. (The data points are randomly ordered for selection and recycled as needed through the iterations.) Other nodes may also move toward P, depending on their distance from Np, the type of neighborhood selected (discussed in the following section) and time (iteration). The algorithm updates a node using the formula:

$$
f_{i+1}(N) = f_i(N) + \alpha\big(d(N, N_p),\ i\big)\,\big(P - f_i(N)\big)
$$

where:

N = the node being updated
P = the data point being considered
fi(N) = the position of N at iteration i
Np = the target node (the node closest to P)
α = distance N moves toward P in iteration i (learning rate), which is a function of:
   ❥ d(N, Np), the distance between N (the node being considered) and Np (the target node) in two-dimensional space
   ❥ i, iteration
P - fi(N) = distance between P and N in k-dimensional space
T = maximum number of iterations

Neighborhood

The neighborhood describes an area around the target node, Np.
At each iteration, Np and all nodes in the neighborhood move toward P, the point being considered. There are two types of neighborhoods: bubble or Gaussian.

Bubble Neighborhood

The bubble neighborhood specifies a radial distance from Np (default = 5). At an iteration, all nodes in the bubble neighborhood are updated by the same amount. Nodes outside the bubble neighborhood are not updated. Neighborhood size is a user-modifiable parameter that specifies the width of the bubble neighborhood. Neighborhood size decays with time (iterations) as described by the following equation:

$$
\text{neighborhood size}_{i} = \text{neighborhood size\_i} \times \left(\frac{\text{neighborhood size\_f}}{\text{neighborhood size\_i}}\right)^{i/T}
$$

where:

neighborhood size (subscript i) = width of bubble neighborhood at iteration i
neighborhood size_i = initial width of bubble neighborhood at the first iteration
neighborhood size_f = final width of bubble neighborhood at the last iteration
T = the maximum number of iterations (iterations = epochs x number of probe sets (or probes))

Gaussian Neighborhood

In the Gaussian neighborhood, all nodes are updated at each iteration. The distance a node moves is a function of its distance from the target node (Np). The greater the distance between N and Np, the less N moves toward P.

Learning Rate

The learning rate is a user-modifiable parameter that specifies the distance a node moves toward P at each iteration.
The learning rate decays with time (iteration) as described by the following equation:

$$
\text{learning rate}_{i} = \text{learning rate\_i} \times \left(\frac{\text{learning rate\_f}}{\text{learning rate\_i}}\right)^{i/T}
$$

where:

learning rate (subscript i) = learning rate at iteration i
learning rate_i = initial learning rate at the first iteration
learning rate_f = final value of the learning rate at the last iteration
T = the maximum number of iterations (iterations = epochs x number of probe sets (or probes))

The Correlation Coefficient Clustering Algorithm

The correlation coefficient (ρ) between two probe set expression patterns (X and Y) across all analyses is determined by the equation:

$$
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} (X_i - X_m)(Y_i - Y_m)}{\sigma_X \sigma_Y}
$$

where:

Cov(X,Y) is the covariance between X and Y
σX = standard deviation of X, σY = standard deviation of Y
Xm = mean Avg Diff (or normalized Avg Diff) for probe set X across all analyses
Ym = mean Avg Diff (or normalized Avg Diff) for probe set Y across all analyses
Xi = Avg Diff (or normalized Avg Diff) for probe set X from analysis i
Yi = Avg Diff (or normalized Avg Diff) for probe set Y from analysis i
N = number of analyses

Covariance increases when (Xi - Xm) and (Yi - Ym) are both positive or both negative. The covariance decreases when (Xi - Xm) is positive and (Yi - Ym) is negative, or vice versa. Each analysis is weighted equally. The order in which the analyses are used to compute ρ(X,Y) is not important because all data are compared to the mean.

The value of ρ(X,Y) can range from -1 to +1: ρ = 1 indicates perfect positive correlation, ρ = 0 indicates no correlation and ρ = -1 indicates perfect inverse correlation. The correlation coefficient clustering algorithm is designed to identify positive correlations, not negative (inverse) correlations.
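The correlation coefficient equation can be sketched in a few lines. This is a minimal illustration rather than the DMT implementation: it assumes the standard deviations use the same 1/N form as the covariance, and the function name `correlation` and the sample values are hypothetical.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient between two expression patterns across N analyses."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and the same length")
    xm = sum(x) / n  # Xm: mean of X across all analyses
    ym = sum(y) / n  # Ym: mean of Y across all analyses
    cov = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / n
    # Assumption: population (1/N) standard deviations, matching the covariance.
    sx = sqrt(sum((xi - xm) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - ym) ** 2 for yi in y) / n)
    if sx == 0 or sy == 0:
        raise ValueError("a flat expression pattern has no defined correlation")
    return cov / (sx * sy)


# Hypothetical Avg Diff values for three probe sets across four analyses.
probe_x = [100.0, 200.0, 400.0, 300.0]
probe_y = [50.0, 100.0, 200.0, 150.0]   # same pattern at half the level
probe_z = [400.0, 300.0, 100.0, 200.0]  # mirror image of probe_x

print(correlation(probe_x, probe_y))
print(correlation(probe_x, probe_z))
```

Because every value is compared with the row mean, reordering the analyses (applying the same permutation to both patterns) leaves the result unchanged, as the text above states.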
The Matrix Algorithm

The matrix algorithm determines the overlap significance between two lists (the probe sets or spotted probes common to both lists). The matrix displays the overlap significance value and highlights values that exceed the overlap significance threshold (pink) or the non-overlap significance threshold (yellow).

The algorithm uses the binomial distribution equation to calculate the probability (p-value) that the observed overlap between two lists is expected due to random chance. The classification algorithm computes a p-value for each overlap significance value:

$$
P = \frac{n!}{(n - x)!\,x!} \cdot w^{x} \cdot (1 - w)^{n - x}
$$

where:

P = probability that the observed overlap is due to random chance
n = number of probe sets (or spotted probes) in the first list (rows)
x = observed number of probe sets (or spotted probes) that overlap in the two lists
w = frequency of probe sets (or spotted probes) in the second list, where w = b/t, b = number of probe sets (or spotted probes) in the second list and t = total population

The p-value may range from zero to one. A score of one indicates there is no relationship (overlap) between the lists and that the observed distribution of probe sets or spotted probes in the two lists is expected to occur due to random chance. A score close to zero indicates the observed overlap between the two lists is not expected to occur due to random chance. The algorithm computes the overlap significance score from the p-value:

Overlap significance = -log P

As a result, higher values in the matrix indicate greater overlap or non-overlap significance between two lists. The algorithm uses the following rules to distinguish between these two possibilities:

■ x > wn, there is greater overlap than expected by random chance
■ x < wn, there is less overlap than expected by random chance

The matrix displays the overlap significance value.
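The binomial calculation and the overlap rules can be sketched as follows. This is an illustration only: the base of the logarithm is not stated in this appendix, so base 10 is assumed, and the function names, the handling of x = wn and the sample numbers are hypothetical.

```python
from math import comb, log10


def overlap_p_value(n, x, b, t):
    """Binomial probability of observing exactly x overlapping probe sets."""
    w = b / t  # frequency of second-list members in the total population
    return comb(n, x) * w ** x * (1 - w) ** (n - x)


def overlap_significance(n, x, b, t):
    """Overlap significance = -log P (base 10 is an assumption)."""
    p = overlap_p_value(n, x, b, t)
    if p <= 0:
        return float("inf")  # the observed overlap is impossible by chance
    return -log10(p)


def overlap_direction(n, x, b, t):
    """Apply the x > wn / x < wn rules to classify the overlap."""
    expected = n * b / t
    if x > expected:
        return "overlap"
    if x < expected:
        return "non-overlap"
    return "as expected"  # x = wn is not covered by the rules; labeled here for completeness


# Hypothetical lists: 10 probe sets in the first list, 50 in the second,
# 100 in the total population, 8 in common.
print(overlap_p_value(10, 8, 50, 100))
print(overlap_significance(10, 8, 50, 100))
print(overlap_direction(10, 8, 50, 100))
```

With these numbers the expected overlap is wn = 5, so an observed overlap of 8 falls under the first rule, and its significance score would be compared with the overlap significance threshold.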
The matrix highlights values that exceed the overlap significance threshold (pink) or values that exceed the non-overlap significance threshold value (yellow).

Appendix E Toolbars & Shortcuts

You can display toolbars with text labels. To display the toolbar button labels, select View → Toolbar → Text labels from the menu bar.

DMT Main Toolbar

Figure E.1 DMT main toolbar

Table E.1 DMT main toolbar button descriptions

Menu Command: Function
Data → Open: Displays the Open dialog box. Select and open a previously saved query from the Open dialog box.
Data → Save: Displays the Save dialog box so that a query may be named and saved.
Data → Print: Displays the Print dialog box.
Help → Contents: Displays DMT help contents.

Session Toolbar

Figure E.1 Session toolbar

Table E.2 Session Toolbar Button Descriptions

Menu Command: Function
Edit → Copy Cells: Copies the cells selected in a results table to the system clipboard.
Edit → Find in Results: Displays the Find Probe Set dialog box that enables a text search of probe sets or spotted probe names in the query or pivot table. The search includes probe or probe set descriptions when these are displayed in the pivot table.
Query → Experiment Information: Displays experiment information for the analyses selected in the data tree.
Query → Run Query: Executes the query for the analyses selected in the data tree and populates the query table.
Query → Pivot: Executes the query for the analyses selected in the data tree and populates the pivot table.
Annotations → Annotate Probe Sets: Displays the Annotate dialog box.
Annotations → Query Annotations: Runs the annotation query.
Graph → Scatter: Displays the Scatter Graph dialog box.
Graph → Fold Change: Displays the Fold Change Graph dialog box.
Graph → Series: Displays the Series Graph dialog box.
Graph → Histogram: Displays the Histogram dialog box.
Graph → Lasso Points                Changes the cursor to a drawing tool that can circle (lasso) points in the scatter graph.
View → Options                      Displays the Data Mining Options dialog box.
View → Analysis Filters             Displays the Filter Analysis dialog box.
View → Data Tree                    Displays or hides the data tree in the Query window.
View → Results Filters              Displays or hides the filter grid in the DMT session.

Shortcut Descriptions

Menu Bar Command                    Shortcut Key
Data → Save                         CTRL + S
Data → Print                        CTRL + P
Edit → Copy Cells                   CTRL + C
Edit → Copy Graph                   CTRL + G
Edit → Find in Results              CTRL + F
Query → Run Query                   CTRL + Q
Annotations → Annotate Probes       CTRL + A
