Proposal for FINAL YEAR PROJECT IN CS
Umm Al Qura University
Determine times of the crowds at the Haram in Mecca

Team Name: < >
Team Logo: < >

Team Members:
- مشاري محمد المطرفي (42907549) – [email protected]
- طارق عطيه المالكي (42905998) – [email protected]
- حسن محمد الشريف (43009242) – [email protected]
- حاتم عبد العزيز متولى (42805860) – [email protected]
- عبد الله هندي الصاعدي (42905336) – [email protected]

Project Leader: مشاري محمد المطرفي (42907549)
Project Supervisor: د. محمد نور ([email protected])
Start – End: < DD.MM.YYYY > – < DD.MM.YYYY >
Credit Hrs: < >

Background :
Data mining is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. Data mining involves several tasks, including clustering, classification, and prediction.

Clustering :
Definition: Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering techniques include k-means clustering and expectation maximization (EM) clustering. It is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Clustering algorithms :
K-means
k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed in a cunning way, because a different location causes a different result; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding is done between the same data set points and the nearest new center. A loop has been generated. As a result of this loop we may notice that the k centers change their location step by step until no more changes are made, in other words until the centers do not move any more. Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by:

J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \| x_i - v_j \| \right)^2

where ||x_i - v_j|| is the Euclidean distance between x_i and v_j, c_i is the number of data points in the i-th cluster, and c is the number of cluster centers.

Algorithmic steps for k-means clustering
Let X = {x_1, x_2, x_3, ..., x_n} be the set of data points and V = {v_1, v_2, ..., v_c} be the set of centers.
1) Randomly select c cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers using

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j

where c_i represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3).
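As a rough illustration of steps 1)–6), the sketch below implements them with NumPy. It is only an illustrative sketch, not the project's final implementation; the function name, the random initialisation, and the toy data at the end are assumptions made for the example.

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=0):
    """Cluster the rows of X into c groups, following the steps listed above."""
    rng = np.random.default_rng(seed)
    # 1) Randomly select c cluster centers from the data points.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2) Euclidean distance between every data point and every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # 3) Assign each point to the nearest cluster center.
        new_labels = dists.argmin(axis=1)
        # 6) Stop when no data point was reassigned.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 4) Recalculate each center as the barycenter (mean) of its points;
        #    an empty cluster keeps its previous center.
        for i in range(c):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels

# Example on made-up 2-D data with two obvious groups (hypothetical values).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])
centers, labels = k_means(X, c=2)
```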
Advantages
1) Fast, robust, and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally k, t, d << n.
3) Gives the best result when the data sets are distinct or well separated from each other.

Disadvantages
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) The use of exclusive assignment: if there are two highly overlapping groups of data, k-means will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e. with different representations of the data we get different results (data represented in Cartesian coordinates and in polar coordinates will give different results).
4) Euclidean distance measures can unequally weight underlying factors.
5) The learning algorithm provides only a local optimum of the squared error function.
6) Randomly choosing the initial cluster centers may not lead to a fruitful result.
7) Applicable only when the mean is defined, i.e. it fails for categorical data.
8) Unable to handle noisy data and outliers.
9) The algorithm fails for non-linear data sets. (1)

Fuzzy C-Means
The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \| x_i - c_j \|^2 , \quad 1 \le m < \infty

where m is any real number greater than 1, u_ij is the degree of membership of x_i in cluster j, x_i is the i-th of the d-dimensional measured data, c_j is the d-dimensional center of the cluster, and ||*|| is any norm expressing the similarity between the measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships u_ij and the cluster centers c_j updated by:

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}}} , \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}}

This iteration stops when \max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon, where \varepsilon is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m. (2)
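The FCM iteration above can be sketched as follows. This is a simplified, NumPy-based illustration only; the random initialisation of the membership matrix, the default fuzzifier m = 2, and the tolerance value are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means: each point belongs to every cluster with a membership degree."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Random fuzzy membership matrix u (N x C); each row sums to 1.
    u = rng.random((N, C))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Update centers: c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # Distances from every point to every center (guard against zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)
        # Update memberships: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)
        # Termination criterion: max |u^(k+1) - u^(k)| < eps.
        if np.max(np.abs(u_new - u)) < eps:
            u = u_new
            break
        u = u_new
    return centers, u
```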
Hierarchical
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S. C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster. A variation on average-link clustering is the UCLUS method of R. D'Andrade (1978), which uses the median distance; the median is much more outlier-proof than the average distance.
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also a divisive hierarchical clustering which does the reverse, starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied.
(*) Of course there is no point in having all N items grouped in a single cluster, but once you have the complete hierarchical tree, if you want k clusters you just have to cut the k−1 longest links. (3)

Mixture of Gaussians
There is another way to deal with clustering problems: a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model. In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution. A mixture model with high likelihood tends to have the following traits: component distributions have high "peaks" (data in one cluster are tight), and the mixture model "covers" the data well (dominant patterns in the data are captured by component distributions).
Main advantages of model-based clustering: well-studied statistical inference techniques are available; flexibility in choosing the component distribution; a density estimate is obtained for each cluster; a "soft" classification is available.
The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can consider clusters as Gaussian distributions centred on their barycentres (in the original figure, a grey circle represents the first variance of the distribution). The algorithm works in this way: it chooses a component (a Gaussian) at random with probability equal to its mixing weight, and then it samples a point from that Gaussian; the mixture parameters are estimated with the EM algorithm. (4)
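To make the mixture-of-Gaussians idea concrete, here is a minimal EM sketch in the spirit of reference (4). It is a simplified illustration only; the initialisation from random data points, the small regularisation term added to the covariances, and the fixed iteration count are assumptions, and the data are assumed to have at least two dimensions.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('na,ab,nb->n', diff, inv, diff)) / norm

def gmm_em(X, k, n_iter=50, seed=0):
    """Fit a mixture of k Gaussians with the EM algorithm."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    weights = np.full(k, 1.0 / k)                    # mixing probabilities
    means = X[rng.choice(N, size=k, replace=False)]  # initial barycentres
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = np.column_stack([w * gaussian_pdf(X, m, c)
                                for w, m, c in zip(weights, means, covs)])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        Nk = resp.sum(axis=0)
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    # resp gives the "soft" classification: a probability per cluster per point.
    return weights, means, covs, resp
```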
Classification :

Prediction :
Prediction is a way to predict the value of something based on old data about it. Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability (how likely is this prediction to be true?); prediction probabilities are also known as confidence (how confident can I be of this prediction?). Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighborhood is likely to have an income greater than the regional average. Rules have an associated support (what percentage of the population satisfies the rule?).
Sometimes we would like to predict a continuous value rather than a categorical label. Numeric prediction is the task of predicting continuous (or ordered) values for a given input. For example, we may wish to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price. By far the most widely used approach for numeric prediction (hereafter referred to as prediction) is regression; in fact, many texts use the terms "regression" and "numeric prediction" synonymously. We therefore explain some methods of regression (a small worked sketch of both models is given after the reference list):
- Linear regression: straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression and models y as a linear function of x:
  y = b + wx
- Non-linear regression: a form of regression analysis in which observational data are modeled by a function that is a non-linear combination of the model parameters and depends on one or more independent variables, for example the polynomial model
  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

Project Scope :
A model used to improve the handling of crowds at the Haram using data mining.

Project Description :
Table, algorithm, picture, video.

Expected Outcome :
- Reports and predictive modeling
- Test results of different algorithms
- Data mining model
- Web front end for the data mining model and crowd prediction dashboard

Method/Approach :
Several methods.

Relevant references :
(1) Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases, Chapter 8: Data Clustering". http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu: "An Efficient k-means Clustering Algorithm: Analysis and Implementation".
Joaquin Perez Ortega, Ma. Del Rocio Boone Rojas and Maria J. Somodevilla Garcia: "Research issues on K-means Algorithm: An Experimental Trial Using Matlab".
Tan, Steinbach, Kumar, Ghosh: "The k-means algorithm" (notes). http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
Ke Chen: "k-means clustering".
(2) J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57.
J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York.
(3) S. C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika, 2:241-254.
R. D'Andrade (1978): "U-Statistic Hierarchical Clustering", Psychometrika, 4:58-67.
(4) A. P. Dempster, N. M. Laird, and D. B. Rubin (1977): "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, 1:1-38.
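As the sketch referred to in the Prediction section, the snippet below fits the straight-line model y = b + wx and the polynomial model y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 by least squares. The data (years of experience versus salary) are made-up, hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical data: years of work experience (x) versus salary in thousands (y).
x = np.array([1, 3, 5, 7, 10], dtype=float)
y = np.array([45, 55, 68, 80, 95], dtype=float)

# Linear model y = b + w*x, fitted by least squares.
A = np.column_stack([np.ones_like(x), x])
b, w = np.linalg.lstsq(A, y, rcond=None)[0]

# Polynomial (non-linear) model y = w0 + w1*x + w2*x^2 + w3*x^3.
w0, w1, w2, w3 = np.polyfit(x, y, deg=3)[::-1]

# Numeric prediction for a graduate with 10 years of experience.
print(f"linear prediction at x = 10: {b + w * 10:.1f}")
```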