SOM and SOTA: Clustering methods in the analysis of massive biological data
Joaquín Dopazo. CNIO.

From genotype to phenotype (only the genetic component)
Genes in the DNA...
...whose final effect can be different because of the variability...
...code for the structure of proteins...
...which accounts for the function...
...providing they are expressed in the proper moment and place...
...in cooperation with other proteins...
...forming complex interaction networks...
• Between 30,000 and 100,000 genes. 40-60% display alternative splicing.
• More than 3 million SNPs have been mapped.
• Proteins undergo post-translational modifications.
• A typical tissue expresses between 5,000 and 10,000 genes.
• Each protein has an average of 8 interactions.
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....

Pre-genomics scenario in the lab
>protein kinase
acctgttgatggcgacagggactgtatgctga
tctatgctgatgcatgcatgctgactactgatg
tgggggctattgacttgatgtctatc....

Bioinformatics tools for pre-genomic sequence data analysis
[Figure: a sequence is searched against molecular databases and motif databases; the search results (motif, conserved region, alignment) yield information such as a phylogenetic tree and the secondary and tertiary protein structure.]
The aim: extracting as much information as possible from one single piece of data.

Post-genomic vision
Questions: Who? What do we know? And who else? Where, when and how much? In what way?
Data sources: genome sequencing; literature and databases; 2-hybrid systems and mass spectrometry for protein complexes; SNPs; expression arrays.
Post-genomic vision
[Figure: genes, interactions, polymorphisms and gene expression data, stored in databases, are turned into information.]
The new tools:
• Clustering
• Feature selection
• Multiple correlation
• Data mining

Neural Networks

Brain and computers
The brain computes in a different way from digital computers.
• Structural components. Brain: neurons (Ramón y Cajal, 1911). Computers: chips.
• Speed. Brain: slow (10^-3 s). Computers: fast (10^-9 s).
• Processing units. Brain: 10 billion neurons, massively interconnected (60 trillion synapses). Computers: one or few.
The brain is a highly complex, nonlinear, and parallel computer. Neurons are organized to perform complex computations many times faster than the fastest computers.

What is a neural network?
A neural network is a massively parallel distributed processor able to store experiential knowledge and to make it available for use. It resembles the brain in two respects:
• Knowledge is acquired by the network through a learning process.
• Interneuron connection strengths (synaptic weights) are used to store the knowledge.

Neural Net classifiers
• Supervised: Perceptrons
• Unsupervised: Kohonen SOM, Growing cell structures, SOTA

Supervised learning: the perceptron
[Figure: input signals x1, x2, ..., xp are multiplied by weights w1, w2, ..., wp and added at the summing junction Σ to give u_k; the activation function φ(·), with threshold θ_k, produces the output.]

Supervised learning: training
Training set (inputs feeding w1 and w2, one column per example):
11111110000000
00000001111111
Classes: up = 1, down = 0
u = x1*w1 + x2*w2
w1 = 1, w2 = 0
φ(u) = 1 if u ≥ 1; 0 if u < 1

Supervised learning: application
Input X = (1, 0)
u = 1*1 + 0*0 = 1
φ(u) = 1 if u ≥ 1; 0 if u < 1, so the output is 1 (up).

Supervised vs. Unsupervised learning
Supervised: the structure of the data is known beforehand. After a training process in which the network learns how to distinguish among classes, you use the network to assign new items to the predefined classes.
Unsupervised: the structure of the data is not known beforehand.
The network learns how data are distributed among classes, based on a distance function.

Unsupervised learning: Kohonen self-organizing maps
The basis
Sensory pathways in the brain are organised in such a way that their arrangement reflects some physical characteristic of the external stimulus being sensed. The brain of higher animals seems to contain many kinds of "maps" in the cortex.
• In visual areas there are orientation and color maps.
• In the auditory cortex there exist the so-called tonotopic maps.
• The somatotopic maps represent the skin surface.

Kohonen SOM
The causes of self-organisation
Kohonen SOM mimics two-dimensional arrangements of neurons in the brain. The effects leading to spatially organized maps are:
• Spatial concentration of the network activity on the neuron best tuned to the present input.
• Further sensitization of the best matching neuron and its topological neighborhood.

Kohonen SOM
The topology
Two-dimensional network of cells with a hexagonal or rectangular (or other) arrangement. The inputs x1, x2, ..., xn feed the output nodes. The neighborhood of a cell is defined as a time-dependent function.

Kohonen SOM
The algorithm
Step 1. Initialize nodes to random values. Set the initial radius of the neighborhood.
Step 2. Present new input: compute distances to all nodes. Euclidean distances are commonly used.
Step 3. Select the output node j* with minimum distance d_j. Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are updated as:
w_ij(t+1) = w_ij(t) + η(t)(x_i(t) - w_ij(t)); for j ∈ NE_j*(t)
η(t) is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.

Kohonen SOM
Limitations
• Arbitrary number of clusters: the number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.
• Non-proportional clustering: clusters are made based on the number of items, so distances among them are not proportional.
• Lack of tree structure: the use of a two-dimensional structure for the net makes it impossible to recover a tree structure that relates the clusters and subclusters to one another.

Growing cell structures
Kohonen SOM produces a topology-preserving mapping. That is, the topology of the network and the number of clusters are fixed before the training of the network.
Growing cell structures produce a distribution-preserving mapping. The number of clusters and the connections among them are dynamically assigned during the training of the network.

Insertion and deletion of neurons
• After a fixed number of adaptations, every neuron q with a signal counter value h_q > h_c (a threshold) is used to create a new neuron.
• The direct neighbor f of the neuron q having the greatest signal counter value is used to insert a new neuron between them.
• The new neuron is connected so as to preserve the topology of the network.
• Signal counter values are adjusted in the neighborhood.
• Similarly, neurons with signal counter values below a threshold can be removed.

Growing cell structures
Network dynamics
Similar to that used by Kohonen SOM, but with several important differences:
• Adaptation strength is constant over time (ε_b and ε_n for the best-matching cell and its neighborhood).
• Only the best-matching cell and its neighborhood are adapted.
• Adaptation implies incrementing the signal counter of the best-matching cell and decrementing it in the remaining cells.
• New cells can be inserted and existing cells can be removed in order to adapt the output map to the distribution of the input vectors.

Growing cell structures
Limitations
• Arbitrary number of clusters: the number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.
• Non-proportional clustering: clusters are made based on the number of items, so distances among them are not proportional.
• Lack of tree structure: the use of a two-dimensional structure for the net makes it impossible to recover a tree structure that relates the clusters and subclusters to one another.

But sometimes behind the real world there is some hierarchy...
[Figure: 20 items arranged in a hierarchy with groups A, B, C and D.]
Many molecular data have different levels of structured information, e.g. phylogenies, molecular population data, DNA expression data (to some extent), etc.

Simulation
Mapping a hierarchical structure using a non-hierarchical method (SOM)
[Figure: SOM map with nodes labelled A,B; C,D; E,F; G; H.]

Self Organising Tree Algorithm (SOTA)
A new neural network designed to deal with data that are related to one another by means of a binary tree topology.
Dopazo & Carazo, 1997, J. Mol. Evol. 44:226-233
Derived from the Kohonen SOM and the growing cell structures, but with several key differences:
• The topology of the network is a binary tree.
• Only growing of the network is allowed.
• The growing mimics a speciation event, producing two new neurons from the most heterogeneous neuron.
• Only terminal neurons are directly adapted by the input data; internal neurons are adapted through terminal neurons.

SOTA: The algorithm
Step 1. Initialize nodes to random values.
Step 2. Present new input: compute distances to all terminal nodes.
Step 3. Select the output node j* with minimum distance d_j. Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are updated as:
w_ij(t+1) = w_ij(t) + η(t)(x_i(t) - w_ij(t)); for j ∈ NE_j*(t)
η(t) is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.
Step 5. Reproduce the node with the highest variability.
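The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation. It assumes Euclidean distance and a fixed number of epochs per cycle instead of a convergence test. It starts the cells near the data mean rather than at fully random values. The neighborhood is taken to be the winner, its mother and its sister, and only when the sister is terminal. Variability is measured as the mean distance of the assigned items to their node. The function names (`sota`, `nudge`, `dist`), the gain values and the toy data are all hypothetical.

```python
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nudge(node, x, gain):
    """Update rule w(t+1) = w(t) + gain * (x - w(t))."""
    node["w"] = [w + gain * (xi - w) for w, xi in zip(node["w"], x)]

def sota(data, n_leaves, epochs=50, gains=(0.1, 0.05, 0.01), seed=0):
    """Grow a binary tree until it has n_leaves terminal nodes.

    gains = (winner, mother, sister) are assumed values.
    Returns (root, groups), with one list of items per terminal node.
    """
    rng = random.Random(seed)

    def new_node(center, parent):
        # A new cell is a slightly perturbed copy of `center`.
        w = [c + rng.uniform(-0.01, 0.01) for c in center]
        return {"w": w, "parent": parent, "kids": []}

    def grow(mother):
        # Step 5: the node reproduces, giving two terminal daughters.
        mother["kids"] = [new_node(mother["w"], mother) for _ in range(2)]

    def leaves(node):
        if not node["kids"]:
            return [node]
        return [leaf for kid in node["kids"] for leaf in leaves(kid)]

    def winner(x, terminal):
        # Steps 2-3: closest terminal node to the input.
        return min(terminal, key=lambda n: dist(x, n["w"]))

    dim = len(data[0])
    mean = [sum(v[i] for v in data) / len(data) for i in range(dim)]
    root = new_node(mean, None)  # Step 1 (assumption: start near the mean)
    grow(root)
    while True:
        terminal = leaves(root)
        for epoch in range(epochs):  # Step 4, with a fixed epoch budget
            decay = 1 - epoch / epochs  # gain term decreasing in time
            for x in data:
                win = winner(x, terminal)
                mother = win["parent"]
                sister = next(k for k in mother["kids"] if k is not win)
                nudge(win, x, gains[0] * decay)
                if not sister["kids"]:  # neighborhood only if sister is terminal
                    nudge(mother, x, gains[1] * decay)
                    nudge(sister, x, gains[2] * decay)
        groups = {}
        for x in data:
            groups.setdefault(id(winner(x, terminal)), []).append(x)
        if len(terminal) >= n_leaves:
            return root, [groups.get(id(n), []) for n in terminal]

        def variability(n):
            members = groups.get(id(n), [])
            if not members:
                return 0.0
            return sum(dist(x, n["w"]) for x in members) / len(members)

        grow(max(terminal, key=variability))

# Toy data: two well separated groups of two-dimensional profiles.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
group_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
root, groups = sota(group_a + group_b, n_leaves=2)
```

Because a node that has reproduced is no longer in the list of terminal nodes, it stops receiving inputs directly and is only adapted through its daughters, which matches the description of internal neurons given above.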
The Self Organising Tree Algorithm (SOTA) is a hierarchical divisive method based on a neural network.
SOTA, unlike other hierarchical methods, grows from top to bottom until an appropriate level of variability is reached.
Dopazo, Carazo (1997)
Herrero, Valencia, Dopazo (2001)

SOTA algorithm (neighborhood)
[Figure: initial state with nodes a, s and w; update; growing and different neighborhoods.]

SOTA algorithm
1. Initialise system.
2. Cycle: repeat as many epochs as necessary to get convergence in the present state of the network. Convergence: the relative error of the network falls below a threshold.
3. Cycle convergence? NO: run another epoch. YES: go on.
4. Network convergence? NO: add a cell and start a new cycle. YES: end.
When a cycle finishes, the network size increases: two new neurons are attached to the neuron with the highest resources. This neuron becomes a mother neuron and does not receive direct inputs any more.
[Figure: within an epoch, the winner, its sister and its mother.]

Applications
• Sequence analysis
• Microarray data analysis
• Population data analysis

Sequence analysis in the genomics era
• Massive data
• Information
• Redundancy

Codification
Ambiguities: R = {A or G}; N = {A or G or C or T}
Vectors of N x 4 (nucleotides) or N x 20 (amino acids), plus one component to represent deletions.
Other possible encodings: frequencies of dipeptides or dinucleotides.

Updating the neurons
[Figure: updated and missing components.]

Classifying proteins with SOM
Ferrán, Pflugfelder and Ferrara (1994) Self-organized neural maps of human protein sequences. Prot. Sci. 3:507-521.

Gene expression analysis using DNA microarrays
[Figure: Cy5 and Cy3 labelling; cDNA arrays; oligonucleotide arrays.]

Research paradigm is shifting
• Hypothesis driven: one PhD per gene.
• Ignorance driven: parallelized automated approach.
[Figure: data volume scale from sequences to DNA arrays: Kb, Mb, Gb, Tb - Pb.]

Expression patterns
[Figure: different DNA arrays 1, 2, 3, 4.]
Patterns can be:
• time series
• dosage series
• different patients
• different tissues
• etc.

The data
[Figure: matrix of genes (thousands) by experimental conditions, with conditions grouped into classes A, B, C.]
Different classes of experimental conditions, e.g. cancer types, tissues, drug treatments, survival time, etc.
• Expression profile of all the genes for an experimental condition (array).
• Expression profile of a gene across the experimental conditions.
• Experimental conditions: from tens up to no more than a few hundred.
Characteristics of the data:
• Low signal-to-noise ratio.
• High redundancy and intra-gene correlations.
• Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.).
• Many genes have no annotation!

Study of many conditions.
Can we find groups of experiments with similar gene expression profiles?

Types of problems
• Unsupervised
• Supervised
• Reverse engineering
[Figure labels: Different phenotypes... Molecular classification of samples. Co-expressing genes... What profile(s) do they display? What do they have in common? Genes of a class: are there more genes? What genes are responsible for? Genes interacting in a network (A, B, C, D, E...): how is the network?]

What are we measuring?
A (background, green), B (expression, red).
Differential expression: B/A.
Problem: the ratio is asymmetrical. Solution: log-transformation.
Ratio and its log-transformed value:
100/1 = 100 gives 2
10/1 = 10 gives 1
1/1 = 1 gives 0
1/10 = 0.1 gives -1
1/100 = 0.01 gives -2

Distance
[Figure: three profiles A, B and C.]
• Differences: B <=> C
• Correlation: A <=> B

Clustering methods
• Non-hierarchical. Deterministic: K-means, PCA. Neural network (NN): SOM.
• Hierarchical. Deterministic: UPGMA. Neural network (NN): SOTA.
Properties: hierarchical methods provide different levels of information; neural networks are robust.

Aggregative hierarchical clustering
Relationships among profiles are represented by branch lengths.
It recursively links the closest pair of profiles until the complete hierarchy is reconstructed.
It allows exploring the relationships among groups of related genes at higher levels.
[Figure: output of CLUSTER.]

Aggregative hierarchical clustering
Problems
• Lack of robustness
• The solution may not be unique
• Dependent on the data order
What level would you consider for defining a cluster?
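The recursive linking described above (join the closest pair of profiles until one hierarchy remains) can be sketched as follows. This is a minimal pure-Python illustration, assuming Euclidean distance and average linkage (UPGMA). The function names `average_linkage` and `cut` and the toy profiles are hypothetical and are not taken from the CLUSTER software.

```python
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(profiles):
    """Aggregative clustering with average linkage (UPGMA).

    profiles maps a gene name to its expression vector.
    Returns the merges as (cluster_a, cluster_b, height), in the order made.
    """
    def between(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Link the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

def cut(profiles, merges, height):
    """Clusters obtained by keeping only the merges made at or below `height`."""
    clusters = [(name,) for name in profiles]
    for a, b, h in merges:
        if h > height:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters

# Toy profiles (hypothetical values).
profiles = {
    "g1": [0.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [5.0, 0.0],
    "g4": [5.0, 2.0],
}
merges = average_linkage(profiles)
two_groups = cut(profiles, merges, height=3.0)
all_single = cut(profiles, merges, height=0.5)
```

The `height` argument of `cut` is exactly the level asked about in the last question: the same tree yields four singletons at 0.5 and two clusters at 3.0, and nothing in the method itself says which is right.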
Subjective cluster definition

Properties of neural networks for molecular data classification
• Robust
• Manage real-world data sets containing noisy, ill-defined items with irrelevant variables and outliers
• Statistical distributions do not need to be parametric
• Fast and scalable to big data sets

Kohonen SOM
Applied to microarray data
[Figure: data matrix with rows sample1..samplen and columns t1..tp (values a11..anp); each node of the map (e.g. node24, node34, node44) holds a vector of p values, and each group (Group11, Group12, Group13, Group14, ...) lists the samples assigned to it (samplea, sampleb, ...).]

Kohonen SOM
microarray patterns
[Figure: data matrix with rows sample1..samplen and columns gen1..genp (values a11..anp).]

Kohonen SOM
Example
Response of human fibroblasts to serum
Iyer et al., 1999 Science 283:83-87

The Self Organising Tree Algorithm (SOTA)
The Self Organising Tree Algorithm (SOTA) is a divisive hierarchical method based on a neural network.
SOTA, unlike other clustering methods, grows from top to bottom: growing can be stopped at the desired level of variability.
SOTA nodes are weighted averages of every item under the node.

SOTA
Advantages of SOTA
• Robustness against noise.
• Divisive algorithm: SOTA grows from top to bottom, so growing can be stopped at any desired level of variability.
• Cluster patterns: each node of the tree has an associated pattern, which corresponds to the cluster under it.
• Distribution preserving: the number of clusters depends on the variability of the data.

Where to stop growing?
From low resolution... to high resolution.
[Figure: a test decides where growth stops.]

Permutation test for cluster size definition
[Figure: the original matrix of genes gen1..genn by experiments exp1..expp (values a11..anp) and a matrix with permuted values; test at the 95% level: are d_ij > 0.4?]

SOTA/SOM vs classical clustering (UPGMA)

SOTA vs SOM

Accuracy: the silhouette
Is the object closer to its own cluster or to the nearest other cluster?
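For an object $x_i$ assigned to cluster $A$:

$$a(i) = \frac{1}{|A| - 1} \sum_{x_j \in A,\ x_j \neq x_i} d(x_i, x_j), \qquad x_i \in A$$

$$d(x_i, C) = \frac{1}{|C|} \sum_{x_j \in C} d(x_i, x_j), \qquad C \neq A$$

$$b(i) = \min_{C \neq A} d(x_i, C)$$

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

If $x_i$ really belongs to $A$, then $a(i) \ll b(i)$ and

$$s(i) = \frac{b(i) - a(i)}{b(i)} = 1 - \frac{a(i)}{b(i)} \approx 1 \quad \text{(OK)}$$

If $x_i$ really belongs to another cluster $B$, then $a(i) \gg b(i)$ and

$$s(i) = \frac{b(i) - a(i)}{a(i)} = \frac{b(i)}{a(i)} - 1 \approx -1 \quad \text{(Wrong)}$$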