* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Clustering Techniques
List of types of proteins wikipedia , lookup
Messenger RNA wikipedia , lookup
Genomic imprinting wikipedia , lookup
Secreted frizzled-related protein 1 wikipedia , lookup
Transcriptional regulation wikipedia , lookup
Molecular evolution wikipedia , lookup
Ridge (biology) wikipedia , lookup
Genome evolution wikipedia , lookup
Gene desert wikipedia , lookup
Epitranscriptome wikipedia , lookup
Gene nomenclature wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Expression vector wikipedia , lookup
Real-time polymerase chain reaction wikipedia , lookup
Gene expression wikipedia , lookup
Gene regulatory network wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Community fingerprinting wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Microarrays • • • • • Short Introduction to Microarray Technology The data matrix Distance metrics Clustering techniques Relevance of networks What is a Microarray? • “Microarray” has become a general term, there are many types now – DNA microarrays – Protein microarrays – Transfection microarrays – Tissue microarray –… • We’ll be discussing cDNA microarrays Gene Expression Measurement • mRNA expression represents dynamic aspects of cell • mRNA expression can be measured with latest technology • mRNA is isolated and labeled with fluorescent protein • mRNA is hybridized to the target; level of hybridization corresponds to light emission which is measured with a laser 3 Gene Expression Microarrays The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) – –11-20 probes per gene, –probes for perfect match vs mismatch; • cDNA or spotted arrays (Brown/Botstein) –two colors – experiment vs control. • ... 4 Affymetrix Microarrays 1.28cm 50um ~107 oligonucleotides, some perfectly match mRNA (PM), some have one Mismatch (MM) Gene expression computed from PM and MM 5 Affymetrix Microarray Raw Image Gene D26528_at D26561_cds1_at D26561_cds2_at D26561_cds3_at D26579_at D26598_at D26599_at D26600_at D28114_at Scanner enlarged section of raw image 6 raw data Value 193 -70 144 33 318 1764 1537 1204 707 What is a DNA Microarray (very generally) • A grid of DNA spots (probes) on a substrate used to detect complementary sequences • The DNA spots can be deposited by – piezolectric (ink jet style) – Pen – Photolithography (Affymetrix) • The substrate can be plastic, glass, silicon (Affymetrix) • RNA/DNA of interest is labelled & hybridizes with the array • Hybridization with probes is detected optically. Types of DNA microarrays and their uses • What is measured depends on the chip design and the laboratory protocol: – Expression • Measure mRNA expression levels (usually polyadenylated mRNA) – Resequencing • Detect changes in genomic regions of interest – Tiling • Tiles probes over an entire genome for various applications (novel transcripts, ChIP, epigenetic modifications) – SNP • Detect which known SNPs are in the tested DNA – ?... What do Expression Arrays really measure? • Gene Expression • mRNA levels in a cell • mRNA levels averaged over a population of cells in a sample • relative mRNA levels averaged over populations of cells in multiple samples • relative mRNA hybridization readings averaged over populations of cells in multiple samples • some relative mRNA hybridization readings averaged over populations of cells in multiple samples Why “some” & “multiple samples” • “some” – In a comparison of Affymetrix vs spotted arrays, 10% of probesets yielded very different results. – “In the small number of cases in which platforms yielded discrepant results, qRTPCR generally did not confirm either set of data, suggesting that sequence-specific effects may make expression predictions difficult to make using any technique.”* – It appears that some transcripts just can’t be detected accurately by these techniques. * Independence and reproducibility across microarray platforms., Quackenbush et al. Nat Methods. 2005 May;2(5):337-44 Why “multiple samples” • “multiple samples” – We can only really depend on betweensample fold change for Microarrays not absolute values or within sample comparisons (>1.3-2.0 fold change, in general) Central “Assumption” of Gene Expression Microarrays • The level of a given mRNA is positively correlated with the expression of the associated protein. – Higher mRNA levels mean higher protein expression, lower mRNA means lower protein expression • Other factors: – Protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag… • This is relatively obvious, but worth emphasizing • How Does it Work ? DNA strands corresponding to specific genes, are arrayed on a small glass, silicon or nylon slide (1x1 to 2x2 cm). Each contains many copies of single stranded DNA probe. • mRNA is extracted from the sample, converted to cDNA and attached to a fluorescent label (usually Cy3 or Cy5). • Single vs double labeled arrays: Absolute analysis: determine whether transcripts are present or not within the sample. The array Comparative analysis: determine relative change in abundance for each transcript. probe How Does it Work ? - (cont.) • The chip is exposed to a solution containing extracted labeled mRNA (hybridization, similar to Northern and Southern). • If the probe sequence finds a complementary base sequence in a DNA sample, it will bind to it. • Excess unbound sample is washed away. RNA fragments with fluorescent tags from sample Hybridization of labeled RNA fragments with DNA on chip How Does it Work ? - (cont.) The array is scanned to measure fluorescent label. The location of the bound sample is detected using the fluorescent reporter, and gene expression is determined. Detection of labeled (hybridized) DNA fragments Non-hybridized DNA hybridized DNA Microarrays The conentration of labels in the cDNA is measured! Assumptions for interpretation of Microarray data: •mRNA is stable over time until cDNA is transcribed •The concentration of cDNA equals the concentraton of mRNA •The labeled nucleotides are inserted statistically The labels are often flourophors with a cyanine structure: cy3 Microarrays * * * * * * * ** * * ** * ** * * labeled cDNA as probe target cDNA Microarrays •Immobilisation of whole cDNA •cDNA is obtained via becteria •Spotted by inkjet technology •cDNA libraries are needed oligonucleotide Microarrays •Immobilisation of characteristc sequences •Sequences are designed •Sequences are synthesized on the chip •Several commercial products are available Comparing two situations cDNA Microarrays * * * * * ** * * Situation A – treated probe * * ** * * ** * Situation B – wild type A reference is needed to compare data from different arrays.*) Mixture of A and B is spotted •Confocal microscope scans the array •Extraction of intensities with picture analysis software Comparing two situations Int(treated) Int(wild type) ratio Gene A 0.22 0.24 0.917 Gene B 0.67 1.21 0.598 Gene C 1.13 0.43 2.630 Gene D 2.45 2.44 1.01 0 < ratio < Inf. -Inf. < log2(ratio) < + Inf. where log2(ratio) > 0: increase log2(ratio) < 0: decrease Absolute Intensities •Absolute itensities without reference signal can be used if two microarray chips are exactly equal. •Commercial products (Affymetrix, Agilent, ...) guarantee an interchip variation below 2% •This technology works well. Target Sequences and Probes Example: • 1415771_at: – Description: Mus musculus nucleolin mRNA, complete cds – LocusLink: AF318184.1 (NT sequence is 2412 bp long) – Target Sequence is 129 bp long 11 probe pairs tiling the target sequence gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt Perfect Match and Mismatch Target ttccagacagactcctatggtgacttctctggaat Perfect match ctgtctgaggataccactgaagaga ctgtctgaggattccactgaagaga Probe pair Mismatch Affymetrix Chip Pseudo-image *image created using dChip software 1415771_at on MOE430A *image created using dChip software 1415771_at on MOE430A PM MM *Note that PM, MM are always adjacent *image created using dChip software Data Matrix Exp. 1 Exp. 2 Exp. 3 Exp. 4 Exp. 5 Type I Type II Type III Type II Type III Gene 1 foo 0.78 0.22 0.32 0.87 0.50 Gene 2 bar 0.73 0.92 0.06 0.67 0.09 Gene 3 blee 0.99 0.60 0.48 0.10 0.56 Gene 4 bas 0.60 0.26 0.11 0.21 0.41 Gene 5 groo 0.44 0.84 0.42 0.84 0.86 Gene 6 gar 0.07 0.18 0.49 0.30 0.05 Gene 7 glee 0.28 0.49 0.47 0.93 0.89 Gene 8 glas 0.91 0.95 0.76 0.22 0.00 Gene 9 gree 0.72 0.51 0.35 0.54 0.81 Gene 10 goo 0.34 0.83 0.62 0.15 0.71 … Data Normalisation Background noise •The background is determined around each spot •Some spots are left empty to determine background •There are several different procedures to substract background: •The background around a point is subtracted for that point •The average background is subtracted from each point •A background profile is determined for the whole array – can also be used to determine systematic errors of array MAS5: Background & Noise Background •Divide chip into zones •Select lowest 2% intensity values •stdev of those values is zone variability •Background at any location is the sum of all zones background, weighted by 1/((distance^2) + fudge factor) Noise •Using same zones as above •Select lowest 2% background •stedev of those values is zone noise •Noise at any location is the sum of all zone noise as above •From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf MAS5 Model • Measured Value = N + P + S – N = Noise – P = Probe effects (non-specific hybridization) – S = Signal MAS5: Adjusted Intensity A = Intensity minus background, the final value should be > noise. A: adjusted intensity I: measured intensity b: background NoiseFrac: default 0.5 (another fudge factor) And the value should always be >=0.5 (log issues) (fudge factor) •From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf MAS5: Ideal Mismatch Because Sometimes MM > PM •From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf MAS5: Signal Value for each probe: Modified mean of probe values: Scaling Factor (Sc default 500) Signal ReportedValue(i) = nf * sf * 2 (SignalLogValuei) (nf=1) Tbi = Tukey Biweight (mean estimate, resistant to outliers) TrimMean = Mean less top and bottom 2% •From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf MAS5: p-value and calls • First calculate discriminant for each probe pair: R=(PM-MM)/(PM+MM) • Wilcoxon one sided ranked test used to compare R vs tau value and determine p-value • Present/Marginal/Absent calls are thresholded from p=value above and – Present =< alpha1 – alpha1 < Marginal < alpha2 – Alpha2 <= Absent • Default: alpha1=0.04, alpha2=0.06, tau=0.015 Data Normalisation Correct for: •Differences in labelling and detection efficiencies •Differences in quantity of initial mRNA Some widely used techniques: •Total intensity normalisation •Assumption: Quantity of initial mRNA is similar for both labelled samples •Total integrated intensities are equal for both samples •Re-scale intensities for each gene •Normalisation using regression techniques •Normalisation using ratio statistics Data Normalisation Some widely used techniques (continued): •Normalisation using regression techniques •Assumption: mRNA derived from closely related samples is used •Scatterplot of int(Cy5) vs. int(Cy3) has a diagonal part with slope 1 •Re-scale intensities for each gene •Local regression techniques also available for non-linear scatterplots •Normalisation using ratio statistics •Confidence limits are calculated via statistics Data Normalisation Data Normalisation Undefined values Gene expression datasets often contain undefined values. •The background and the signal give similar intensity •Surface of the chip is not planar (cDNA chips) •The probe is not properly fixed on the chip • hybridistion step didn‘t work properly •Probe was not washed away properly Undefined values can be •Discarded -> leads to problems in data analysis •Replaced with averages of rows/columns •Replaced by zero Similarity of Genes Co-regulated genes show a similar behaviour •There exists no a priori definition of similarity •Different similarity measures are used •Different patterns are seen with different similarity measures Similarity of Genes Example: •An expression vector is defined for each gene •The expression values are the dimensions (x1,x2, ..., xn) in expression x3 space •Each gene is represented as one n-dimensional point in expression space •Two genes with similar expression behaviour are spatially nearby •Similarity is inversely proportional to spatial distance x2 x1 Similarity of Genes Example (continued): •Different clustering algorithms can be applied to find clusters of genes x3 x2 x1 Similarity of Experiments •Instead of clustering genes it is also possible to cluster experiments •Each experiment is regarded as experiment vector in the experiment space Exp. 1 Exp. 2 Exp. 3 Exp. 4 Exp. 5 Type I Type II Type III Type II Type III Gene 1 foo 0.78 0.22 0.32 0.87 0.50 Gene 2 bar 0.73 0.92 0.06 0.67 0.09 Gene 3 blee 0.99 0.60 0.48 0.10 0.56 Gene 4 bas 0.60 0.26 0.11 0.21 0.41 Gene 5 groo 0.44 0.84 0.42 0.84 0.86 Gene 6 gar 0.07 0.18 0.49 0.30 0.05 Gene 7 glee 0.28 0.49 0.47 0.93 0.89 Gene 8 glas 0.91 0.95 0.76 0.22 0.00 Gene 9 gree 0.72 0.51 0.35 0.54 0.81 Gene 10 goo 0.34 0.83 0.62 0.15 0.71 … Visualisation of Data Matrix •The data matrix can be visualised to get an impression of gene expression behaviour. •The order of genes in data matrix can be reorganised for better recognition of expression patterns. Distance Metrics • • • • For clustering algorithms the calculation of a distance between gene vectors or experiment vectors is a necessary step Distances metrics can be classified as • Metric distances • Semi-metric distances Metric distances: 1. dab >= 0 2. dab = dba 3. daa = 0 4. dab <= dac + dcb Semi-metric distances: obey 1) to 3), fail in 4) Distance Metrics Minkowski distance d (i, j) q (| x x |q | x x |q ... | x x |q ) i1 j1 i2 j2 ip jp If q = 1, d is Manhattan distance (semi-metric distance) d (i, j) | x x | | x x | ... | x x | i1 j1 i2 j 2 i p jp If q = 2, d is Euclidean distance (metric distance) d (i, j) (| x x |2 | x x |2 ... | x x |2 ) i1 j1 i2 j 2 ip jp Distance Metrics Pearson correlation coefficient (semi-metric distance) n ( x x )(x x ) i1 1 i2 2 i 1 d (i, j) n n 2 x )2 (x x ) (x i 1 i1 1 i 1 i 2 2 x2 (x x ) 1 1 (x x ) 2 2 -1 <= d(i,j) <= +1 x1 (x , x ) 1 2 Distance Metrics Rank-ordered Pearson correlation coefficient -> Spearman (semi-metric distance) Distance Metrics Other variations of Pearsons correlation coefficient: Uncentered Pearson correlation (semi-metric distance) n x x i 1 i1 i2 d (i, j) n n 2 x )2 (x x ) (x i 1 i1 1 i 1 i 2 2 Clustering Techniques Several classification criteria of clustering algorithms exist with regard to •Clustering result: hierarchical – not hierarchical •Clustering process: divisive - agglomerative •Clustering criterion: sequential - global Clustering Techniques Hierarchical clustering: Clustering Techniques Hierarchical clustering: phylogenetic tree Euclidean distance 20 15 10 5 3 4 2 1 5 Clustering Techniques Hierarchical clustering: Various hierarchical clustering algorithms exist •Single-linkage clustering, nearest-neighbour •Complete-linkage, furthest-neighbour •Average-linkage, unweighted pair-group method average (UPGMA) •Weighted-pair group average, UPGMA weighted by cluster sizes •Within-groups clustering •Ward‘s method •... Clustering Techniques k-means clustering: 1. 2. 3. 4. 5. 6. The number of clusters k has to be chosen in advance*) x2 Initial position of cluster centers is random**) For each data point the (euclidean) distance to each cluster center is calculated Each data point is assigned to its nearest cluster. Cluster centers are shifted to the center of data points assigned to this cluster center Step 3. – 5. is iterated until cluster centers are not shifted anymore x1 Clustering Techniques k-means clustering (continued): A reasonable number of cluster centers k can be estimated: weighted squared distance of data points to their cluster center k Clustering Techniques k-means clustering (continued): The initial position of cluster centers can be estimated by the distribution of the data vectors: x 2 x1 Clustering Techniques Principal Component Analysis (PCA) Singular value decomposition (SVD) •Reduction of effective dimensionality of gene-expression space •by (linear) combination of initial dimensions •Mathematically complex Clustering Techniques •No clustering technique is better than another •Different clustering techniques have shown to lead to reasonable results depending on the measured data: •Organism •Tissue •Experimental condition, e.g. mutants or time course experiments •Distance metric used Visualisation Visualisation in clusters: Euclidean distance 20 15 10 5 3 4 2 1 5 What have we learned from microarrays? • Evolution – Most of gene expression differences between chimpanzees human have been detected in their brain. • Development – Associating gene expression with metamorphosis stages in Drosophila ila • Regulation – Finding novel regulatory motifs by coupling motif search with co-expression • Behaviour – Can predict the behavior of honey-bees workers by their brain gene expression. • Functional annotation – Annotating unknown genes based on co expression (guilty by association) • Tissue – Molecular signature specific to subtype of cancer tissues Uses of Microarrays • Sample classification – What are the set of genes that differentiate between two or more groups of Treatments – What is the set of samples that have the same expression profile in the detected cell(s)? • Gene classification – What is the set of genes that have the same expression profile along a set of treatments? Why are Microarrays Important ? Primary goal: Generate expression information for every gene in the array (detect global changes in whole genome transcription, under similar set of conditions). • Infer probable function of new genes (functional genomics; based on similarities in expression patterns with those of known genes). Explore gene function and regulation in relation to structure. • Provide knowledge of inter-relationships between genes, revealing how multiple gene products work together. Microarrays Uses 1. • Measure differential gene expression in samples: Response to environmental factors (e.g. treatment, cell stimulation in-vitro, response to environmental factors, effect of drugs, time after treatment...). • Diseased vs. normal tissues. • Profiling tumors. • Gene regulation (e.g. during development). 2. Sequence identification: • Comparative genomics. • Genotyping: mutation detection & SNPs. • Direct sequencing (sequencing by hybridization; SBH), determine a sequence from its “signature” on a chip (the original objective for DNA arrays, back at the beginning of the 90’s). Gene Expression Matrix When the array surface is scanned with a laser, fluorescent labels attached to the complementary DNA reveal which probes are bound and their quantity. www.ebi.ac.uk/microarray/biology_intro.htm Outline: Microarrays (“DNA Chips”) What is functional genomics ? Microarray landmarks. Basic principles. Applications of DNA microarrays. Working with microarrays. Clustering analysis, example. Conclusions. Clustering analysis Clustering - Algorithmic Challenge A a powerful set of tools which partition samples into well-separated and homogeneous groups, based on their behaviors or patterns. Example: cluster genes that have common expression patterns in certain experimental conditions. Analyze the vast amount of data in gene expression matrices, and discover meaningful common biological functions, regulatory elements and relationships among genes. Clustering Methods 1 2 3 4 5 6 7 8 9 10 11 How will you cluster the faces ? (“raw” state cases). Cluster analysis is the tool used to group these faces in an objective manner. The same data could be presented in numeric format: 0/1. http://149.170.199.144/multivar/ca.htm 12 Clustering Methods Challenge: to select data and apply algorithm appropriately to get sensible partition of the data. Choosing the “right” algorithm: 1. Hierarchical clustering Based on pair-wise distances, the result is a tree (similar to phylogenetic tree). 2. Non-hierarchical clustering Partitions the n genes into dis-joint clusters. E.g. k-means clustering (k is a pre-defined parameter), where the variability within each cluster is minimized, and variability between clusters is maximal. Hierarchical classification (tree) Non-hierarchical classification Simultaneous (Traditional global correlation) Timeshifted Clustering AlgorithmIdentifies further (reasonable) types of expression relationships Inverted Mark Gerstein, Yale University, 2002. Clustering Example Advantages: • Intuitive and quick. • Visualization of results. • Useful as first data mining tool for microarray data. Pitfalls: • In any type of data, clusters could always be found. Future Model for Cancer Treatment A patient arrives with any kind of cancer… A biopsy of the tumor will be taken and undergo micro-array analysis for expression pattern. “Personal” drug cocktail will be formulated for the specific tumor type according to individual expression pattern. Microarrays and Drug “Tailoring” Pharmacogenomics: The entire spectrum of genes that interact with drugs. • Genes that determine drug sensitivity and toxicity. • “Tailoring” personal drugs relies on patient’s genetic constitution. How is individual genetic polymorphism linked to differences in the efficacy and toxicity of medications ? • • • • Drug-metabolizing enzymes. Transporters. Receptors. Other drug targets. Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics. W.E. Evans and M.V. Relling. Science vol 286: 487-491, 15 October 1999. Micro-Array Market <$50 million >$500 million Average annual rate of growth of 150%. What Do DNA Chips Miss ? • Cellular protein production: there are 3 major “check points” to determine quantities of protein and function: • Transcription from genes to mRNA. • Translation of proteins from mRNA. • Post-translation modifications. DNA chips detect only the levels of transcription. Other proteome technologies: • Mass spectrometry. • 2-D electrophoresis. • Protein microarrays. • Tissue microarrays. Conclusion •Microarrays measure the concentration of mRNA •Several assumptions are made in the progress of analysis •Biological •Numerical •Experimental conditions are important! •Different distances matrics combined with different clustering algorithms can leed to completely different results • SUMMARY Gene Expression Measurement • mRNA expression represents dynamic aspects of cell • mRNA expression can be measured with latest technology • mRNA is isolated and labeled with fluorescent protein • mRNA is hybridized to the target; level of hybridization corresponds to light emission which is measured with a laser 78 Gene Expression Microarrays The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) – –11-20 probes per gene, –probes for perfect match vs mismatch; • cDNA or spotted arrays (Brown/Botstein) –two colors – experiment vs control. • ... 79 Affymetrix Microarrays 1.28cm 50um ~107 oligonucleotides, some perfectly match mRNA (PM), some have one Mismatch (MM) Gene expression computed from PM and MM 80 Affymetrix Microarray Raw Image Gene D26528_at D26561_cds1_at D26561_cds2_at D26561_cds3_at D26579_at D26598_at D26599_at D26600_at D28114_at Scanner enlarged section of raw image 81 raw data Value 193 -70 144 33 318 1764 1537 1204 707 Microarray Potential Applications • Earlier and more accurate diagnostics • New molecular targets for therapy • Improved and individualized treatments • fundamental biological discovery (e.g. finding and refining biological pathways) • Recent examples – molecular diagnosis of leukemia, breast cancer, ... – discovery that genetic signature strongly predicts outcome – a few new drugs, many new promising drug targets 82 Microarray Data Analysis Types • Gene Selection – Find genes for therapeutic targets (new drugs) • Classification (Supervised) – Identify disease – Predict outcome / select best treatment • Clustering (Unsupervised) – Find new biological classes / refine existing ones – Exploration 83 Microarray Data Analysis Challenges • Few records (samples), usually < 100 • Many columns (genes), usually > 1,000 • This is very likely to result in false positives, “discoveries” due to random noise • Model needs to be explainable to biologists • Good methodology is essential for minimizing and controlling false positives 84 Microarray Classification Overview Train data Gene data Data Cleaning & Preparation Class data Feature and Parameter Selection Model Building Test data Evaluation 85 Global Feature (Gene) Selection “Leaks” Information Class Gene Data data Train data Gene Selection Model Building Evaluation Test data is wrong, because the information is “leaked” via gene selection. When #Features >> # samples, leads to overly “optimistic” results. 86 Data Preparation Issues • Cleaning: inherent measurement noise • Thresholding: – min 20, max 16,000 for MAS-4 – MAS-5 does not generate negative numbers • Filtering - remove genes with low variation (for biological and efficiency reasons) – e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5 – or Std. Dev across samples in the bottom 1/3 – or MaxVal - MinVal < 200 and MaxVal/MinVal 87