BioInformatics (3): Computational Issues

• Data Warehousing:
  – Organising biological information into a structured entity (the world's largest distributed database).
• Function Analysis (numerical analysis):
  – Gene expression analysis: applying sophisticated data mining and visualisation to understand gene activities within an environment (clustering).
  – Integrated genomic study: relating structural analysis to functional analysis.
• Structure Analysis (symbolic analysis):
  – Sequence alignment: analysing a sequence with comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (dynamic programming and HMMs).
  – Structure prediction: predicting a protein's 3D structure from its sequence (inductive logic programming).

Data Warehousing: Mapping Biologic into Data Logic

Structure Analysis: Alignments & Scores

Global (e.g. haplotype):
  ACCACACA
  ::xx::x:
  ACACCATA
  Score = 5(+1) + 3(-1) = 2

Local (motif):
  ACCACACA
  ::::
  ACACCATA
  Score = 4(+1) = 4

Suffix (shotgun assembly):
  ACCACACA
  :::
  ACACCATA
  Score = 3(+1) = 3

(A small scoring sketch in code follows at the end of this section.)

Homology Search vs. Motif Search

A comparison of the homology search and the motif search for functional interpretation of sequence information:
• Homology search: a new sequence is used to retrieve similar sequences from a sequence database (primary data); the similar sequences, combined with expert knowledge, yield a sequence interpretation.
• Motif search: expert knowledge is first distilled (knowledge acquisition) into a motif library (empirical rules); a new sequence is then interpreted by inference against that library.

Search and Learning Problems in Sequence Analysis

Problems in biological science:
• Similarity search: pairwise sequence alignment; database search for similar sequences; multiple sequence alignment; phylogenetic tree reconstruction; protein 3D structure alignment.
• Structure/function prediction:
  – Ab initio prediction: RNA secondary structure prediction; RNA 3D structure prediction; protein 3D structure prediction.
  – Knowledge-based prediction: motif extraction; functional site prediction; cellular localization prediction; coding region prediction; transmembrane domain prediction; protein secondary structure prediction; protein 3D structure prediction.
• Molecular classification: superfamily classification; ortholog/paralog grouping of genes; 3D fold classification.

Math/Stat/CompSci methods:
• Optimization algorithms: dynamic programming (DP); simulated annealing (SA); genetic algorithms (GA); Markov chain Monte Carlo (MCMC: Metropolis and Gibbs samplers); Hopfield neural networks.
• Pattern recognition and learning algorithms: discriminant analysis; neural networks; support vector machines; hidden Markov models (HMM); formal grammars; CART.
• Clustering algorithms: hierarchical, k-means, etc.; PCA, MDS, etc.; self-organizing maps, etc.

(Whole-Genome) Gene Expression Analysis

• Quantitative analysis of gene activities (transcription profiles).

Gene Expression

[Figure: GeneChip expression analysis. Biotinylated RNA from the experiment is hybridized to a GeneChip probe array and stained with a streptavidin-phycoerythrin conjugate; the hybridized array is then imaged. Each probe cell contains millions of copies of a specific oligonucleotide probe.]

(Sub)cellular inhomogeneity: expression differs across the cell cycle; for example, XIST RNA is localized on the inactive X-chromosome (see figure).

Cluster Analysis

• General purpose: to divide samples into homogeneous groups based on a set of features.
• In gene expression analysis: to find co-regulated genes.

Functional Analysis via Gene Expression

[Figure: protein/protein complexes, genes, and DNA regulatory elements.]

Pipeline: Gene Expression Data → Pairwise Measures → Distance/Similarity Matrix → Clustering → Gene Clusters → Motif Searching → Regulatory Elements / Gene Functions.
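To make the per-column alignment scoring above concrete, here is a minimal Python sketch under the same +1/-1 scheme. It assumes two already-aligned, ungapped sequences and simply counts match and mismatch columns; the function name is illustrative, and real alignment additionally needs gap penalties and a dynamic-programming search.

```python
def column_score(seq1: str, seq2: str, match: int = 1, mismatch: int = -1) -> int:
    """Score two pre-aligned, ungapped sequences column by column."""
    assert len(seq1) == len(seq2), "sequences must be aligned to equal length"
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))

# Global example from the slide: 5 matches and 3 mismatches give 5 - 3 = 2.
print(column_score("ACCACACA", "ACACCATA"))  # 2
```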
Clustering Algorithms

A clustering algorithm attempts to find natural groups of components (or data) based on some measure of similarity, and also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids together with the number of components in each cluster.

[Figure: clusters of two-dimensional data.]

Key Terms in Cluster Analysis

• Distance & similarity measures
• Hierarchical & non-hierarchical clustering
• Single/complete/average linkage
• Dendrograms & ordering

Distance Measures: Minkowski Metric

Suppose two objects x and y both have p features:
$x = (x_1, x_2, \ldots, x_p)$, $y = (y_1, y_2, \ldots, y_p)$.
The Minkowski metric is defined by
$d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}$.

Most Common Minkowski Metrics

1. $r = 2$ (Euclidean distance): $d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^2 \right)^{1/2}$
2. $r = 1$ (Manhattan distance): $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
3. $r = \infty$ ("sup" distance): $d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$

An example: take two points whose coordinate differences are 4 and 3.
1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.
2. Manhattan distance: $4 + 3 = 7$.
3. "sup" distance: $\max\{4, 3\} = 4$.

Manhattan distance is called Hamming distance when all features are binary.

Gene expression levels under 17 conditions (1 = high, 0 = low):

Condition  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA      0 1 1 0 0 1 0 0 1 0  0  1  1  1  0  0  1
GeneB      0 1 1 1 0 0 0 0 1 1  1  1  1  1  0  1  1

Hamming distance = #(0→1) + #(1→0) = 4 + 1 = 5.

Similarity Measures: Correlation Coefficient

$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$,
with averages $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$, and $-1 \le s(x, y) \le 1$.
(These measures are sketched in code at the end of this section.)

[Figure: three time-course panels plotting the expression levels of Gene A and Gene B.]

Distance-Based Clustering

• Assign a distance measure between data points.
• Find a partition such that:
  – the distance between objects within a partition (i.e. the same cluster) is minimized;
  – the distance between objects from different clusters is maximized.
• Issues:
  – It requires defining a distance (similarity) measure in situations where it is unclear how to assign one.
  – What relative weighting should one attribute get versus another?
  – The number of possible partitions is super-exponential.

Clustering may be hierarchical or non-hierarchical.

[Figure: normalized expression data for items a, b, c, d.]

Hierarchical Clustering Techniques

At the beginning, each object (gene) is its own cluster. In each subsequent step, the two closest clusters are merged into one, until only one cluster is left.

Hierarchical Clustering

Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is this (a code sketch of the loop follows at the end of this section):
1. Start by assigning each item to its own cluster, so that with N items you have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 depends on how the distance between two clusters is defined:
• single-link / nearest neighbour: the distance between their closest members;
• complete-link / furthest neighbour: the distance between their furthest members;
• centroid: the distance between their centroids;
• average-link: the average over all cross-cluster pairs.
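The distance and similarity measures above map directly onto a few lines of NumPy. The following sketch (the function names are ours, not from any particular library) reproduces the worked examples: the 3-4-5 point pair and the GeneA/GeneB Hamming distance.

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski metric d(x, y) = (sum_i |x_i - y_i|^r)^(1/r)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def sup_distance(x, y):
    """The r -> infinity limit: max_i |x_i - y_i|."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.max(np.abs(x - y))

def correlation(x, y):
    """Pearson correlation coefficient s(x, y), always in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

# Worked example: two points whose coordinate differences are 4 and 3.
x, y = np.array([4.0, 3.0]), np.array([0.0, 0.0])
print(minkowski(x, y, 2))   # Euclidean: sqrt(16 + 9) = 5.0
print(minkowski(x, y, 1))   # Manhattan: 4 + 3 = 7.0
print(sup_distance(x, y))   # sup: max(4, 3) = 4.0

# Hamming distance = Manhattan distance on binary vectors (GeneA vs. GeneB).
gene_a = [0,1,1,0,0,1,0,0,1,0,0,1,1,1,0,0,1]
gene_b = [0,1,1,1,0,0,0,0,1,1,1,1,1,1,0,1,1]
print(int(minkowski(gene_a, gene_b, 1)))  # 5
```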
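And here is a minimal sketch of the four-step agglomerative loop itself, using single linkage (the shortest cross-cluster distance) as the cluster distance of step 3. It is written for clarity rather than efficiency; the distance matrix anticipates the a, b, c, d example worked by hand in the next section.

```python
import numpy as np

def agglomerate(D, labels):
    """Single-linkage agglomerative clustering; prints each merge (steps 1-4 above)."""
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(len(labels))]       # step 1: one item per cluster

    def dist(ci, cj):                                  # single link: shortest cross-pair
        return min(D[p, q] for p in ci for q in cj)

    while len(clusters) > 1:
        # step 2: find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        d = dist(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        print("merge", [labels[k] for k in merged], "at distance", d)
        # step 3 happens implicitly: dist() recomputes cross-cluster distances from D
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]                                 # step 4: loop until one cluster

# Distance matrix of the a, b, c, d example below: merges at distances 2, 3, 4.
D = [[0, 2, 5, 6],
     [2, 0, 3, 5],
     [5, 3, 0, 4],
     [6, 5, 4, 0]]
agglomerate(D, list("abcd"))
```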
Computing Distances

• Single-link clustering (also called the connectedness or minimum method): the distance between one cluster and another is taken to be the shortest distance from any member of one cluster to any member of the other. If the data consist of similarities, the similarity between one cluster and another is taken to be the greatest similarity from any member of one cluster to any member of the other.
• Complete-link clustering (also called the diameter or maximum method): the distance between one cluster and another is taken to be the longest distance from any member of one cluster to any member of the other.
• Average-link clustering: the distance between one cluster and another is taken to be the average distance from any member of one cluster to any member of the other.

Single-Link Method

Euclidean distance matrix:

        b   c   d
    a   2   5   6
    b       3   5
    c           4

(1) Merge a and b at distance 2:

          c   d
    a,b   3   5
    c         4

(2) Merge {a,b} and c at distance 3:

            d
    a,b,c   4

(3) Merge {a,b,c} and d at distance 4, leaving the single cluster {a,b,c,d}.

Complete-Link Method

Starting from the same distance matrix:

(1) Merge a and b at distance 2:

          c   d
    a,b   5   6
    c         4

(2) Merge c and d at distance 4:

          c,d
    a,b   6

(3) Merge {a,b} and {c,d} at distance 6, leaving the single cluster {a,b,c,d}.

Compare Dendrograms

[Figure: single-link and complete-link dendrograms over a, b, c, d on a distance scale of 0-6. Single-link merges at heights 2, 3, 4; complete-link merges at heights 2, 4, 6.] (The SciPy sketch at the end of this section reproduces both.)

Ordered Dendrograms

There are $2^{n-1}$ linear orderings of the n leaves of a dendrogram (n = number of genes or conditions). Maximizing adjacent similarity is impractical, so order instead by:
• average expression level,
• time of maximal induction, or
• chromosome position. (Eisen 1998)

[Figure: example two-dimensional data sets from Nadler and Smith, Pattern Recognition Engineering, 1993.] Which clustering methods would you suggest for these data?

Problems of Hierarchical Clustering

• It is concerned more with the complete tree structure than with the optimal number of clusters.
• There is no possibility of correcting a poor initial partition (early merges cannot be undone).
• Similarity and distance measures rarely have strict numerical significance.

Non-Hierarchical Clustering

[Figure: normalized expression data, Tavazoie et al. 1999 (http://arep.med.harvard.edu).]

Clustering by K-Means

• Given a set S of N p-dimensional vectors, without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint non-empty subsets such that each subset locally minimizes some measure of dissimilarity; the resulting partition is locally, not necessarily globally, optimal.
• The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations to convergence.
• Using the Euclidean distance between the coordinates of two genes in expression space ignores possibly more biologically relevant measures of distance. K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.
• One initialization: the first cluster center is the centroid of the entire data set, and each subsequent center is the data point farthest from the centers already chosen. Typically 200-400 iterations are run.

K-Means Clustering Algorithm

1) Select an initial partition of k clusters.
2) Assign each object to the cluster with the closest center.
3) Compute the new centers of the clusters: $C(S) = \frac{1}{n} \sum_{i=1}^{n} X_i$, where $S = \{X_1, \ldots, X_n\}$ is the set of objects in a cluster.
4) Repeat steps 2 and 3 until no object changes cluster.

(A NumPy sketch of this loop follows below.)

Representation of Expression Data

[Figure: normalized expression data from microarrays as a matrix of genes 1…N by time points T1-T3; each gene is a point in expression space, with pairwise distances d_ij.]
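Assuming SciPy is available, both worked linkage examples above can be reproduced with its hierarchical clustering routines; the heights in the returned linkage matrices match the merge distances derived by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Square distance matrix for a, b, c, d, converted to the condensed form
# (d(a,b), d(a,c), d(a,d), d(b,c), d(b,d), d(c,d)) that linkage() expects.
D = squareform(np.array([[0, 2, 5, 6],
                         [2, 0, 3, 5],
                         [5, 3, 0, 4],
                         [6, 5, 4, 0]], dtype=float))

print(linkage(D, method="single"))    # merge heights 2, 3, 4
print(linkage(D, method="complete"))  # merge heights 2, 4, 6
# dendrogram(linkage(D, method="single"), labels=list("abcd")) draws the tree.
```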
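And a minimal NumPy sketch of the four K-means steps above. It initializes with random data points rather than the farthest-point rule described earlier (either choice fits the algorithm), and all names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Four-step K-means loop: initialize, assign, re-center, repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: initial centers
    for _ in range(n_iter):
        # step 2: assign each object to the cluster with the closest center
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # step 3: recompute each center as the mean of its members, C(S) = (1/n) sum X_i
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 4: stop once no assignment (hence no center) changes
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs of expression-like vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
```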
Identifying Prevalent Expression Patterns (Gene Clusters)

[Figure: three cluster-average profiles of normalized expression (roughly -2 to +1.5) across time-points 1-3.]

Evaluate Cluster Contents

Genes                                       MIPS functional category
gpm1                                        Glycolysis
HTB1                                        Nuclear organization
RPL11A, RPL12B, RPL13A, RPL14A,             Ribosome
RPL15A, RPL17A, RPL23A
TEF2                                        Translation
YDL228c, YDR133C, YDR134C, YDR327W,         Unknown
YDR417C, YKL153W, YPL142C