Proposal for
FINAL YEAR PROJECT IN CS
Umm Al Qura University
Determine times of the crowds at the Haram in Mecca
Team Name
Team Logo
Team Members
<مشاري محمد المطرفي> <42907549> <[email protected]>
<طارق عطيه المالكي> <42905998> <[email protected]>
<حسن محمد الشريف> <43009242> <[email protected]>
<حاتم عبد العزيز متولى> <42805860> <[email protected]>
<عبد الله هندي الصاعدي> <42905336> <[email protected]>
Project Leader
<مشاري محمد المطرفي> <42907549>
Project Supervisor
<د. محمد نور>
<[email protected]>
Start – end: <DD.MM.YYYY> – <DD.MM.YYYY>
Credit Hrs:
Background:
Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
Data mining involves the following tasks:
• Clustering:
Definition: Clustering is a data mining (machine learning) technique used to place data elements
into related groups without advance knowledge of the group definitions. Popular clustering
techniques include k-means clustering and expectation maximization (EM) clustering.
It is a main task of exploratory data mining, and a common technique for statistical data analysis
used in many fields, including machine learning, pattern recognition, image analysis, information
retrieval, and bioinformatics.
Clustering algorithms:
K-means
k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated. As a result of this loop we may notice that the k centers change their location step by step until no more changes occur, or in other words the centers do not move any more. Finally, this algorithm aims at minimizing an objective function known as the squared-error function, given by:
J(V) = \sum_{j=1}^{c} \sum_{i=1}^{c_j} \| x_i - v_j \|^2

where \| x_i - v_j \| is the Euclidean distance between data point x_i and center v_j, c_j is the number of data points in the jth cluster, and c is the number of cluster centers.
Algorithmic steps for k-means clustering
Let X = {x_1, x_2, ..., x_n} be the set of data points and V = {v_1, v_2, ..., v_c} be the set of centers.
1) Randomly select 'c' cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate each new cluster center using:

v_j = \frac{1}{c_j} \sum_{i=1}^{c_j} x_i

where c_j represents the number of data points in the jth cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop; otherwise repeat from step 3).
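To make these steps concrete, here is a minimal k-means sketch in Python/NumPy; the names (kmeans, X, k) and the example data are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Randomly select k cluster centers from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Distance between each data point and every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # 3) Assign each point to its nearest center.
        labels = dists.argmin(axis=1)
        # 4) Recalculate each center as the barycenter (mean) of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 5)-6) Stop when the centers no longer move (no reassignment).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: three well-separated 2-D blobs.
X = np.vstack([np.random.randn(50, 2) + offset
               for offset in ([0, 0], [6, 6], [0, 6])])
centers, labels = kmeans(X, k=3)
```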
Advantages
1) Fast, robust, and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally, k, t, d << n.
3) Gives the best results when the data sets are distinct or well separated from each other.
Disadvantages
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) Exclusive assignment: if there are two highly overlapping data sets, k-means will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e. with different representations of the data we get different results (data represented in the form of Cartesian coordinates and polar coordinates will give different results).
4) Euclidean distance measures can unequally weight underlying factors.
5) The learning algorithm provides only a local optimum of the squared-error function.
6) Randomly choosing the initial cluster centers may not lead to a fruitful result.
7) Applicable only when the mean is defined, i.e. it fails for categorical data.
8) Unable to handle noisy data and outliers.
9) The algorithm fails for non-linear data sets.(1)
Fuzzy C-Means
The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to
belong to two or more clusters. This method (developed by Dunn in 1973 and
improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on
minimization of the following objective function:
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \| x_i - c_j \|^2 , \quad 1 \le m < \infty

where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the ith of the d-dimensional measured data, c_j is the d-dimensional center of the cluster, and ||·|| is any norm expressing the similarity between any measured data and the center.
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u_{ij} and the cluster centers c_j updated by:

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \| x_i - c_j \| / \| x_i - c_k \| \right)^{2/(m-1)}} , \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}
This iteration will stop when \max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \varepsilon, where \varepsilon is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m.(2)
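A minimal fuzzy c-means sketch in NumPy, implementing the update equations above; the function and parameter names (fuzzy_c_means, m, eps) are illustrative assumptions:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix U (each row sums to 1).
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iters):
        Um = U ** m
        # Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances ||x_i - c_j||, floored to avoid division by zero.
        d = np.fmax(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), 1e-12)
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        # Termination: max |u_ij^(k+1) - u_ij^(k)| < eps.
        if np.abs(U_new - U).max() < eps:
            return centers, U_new
        U = U_new
    return centers, U
```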
Hierarchical
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the
basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we
consider the distance between one cluster and another cluster to be equal to
the shortest distance from any member of one cluster to any member of the other
cluster. If the data consist of similarities, we consider the similarity between one
cluster and another cluster to be equal to the greatest similarity from any member of
one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we
consider the distance between one cluster and another cluster to be equal to
the greatest distance from any member of one cluster to any member of the other
cluster.
In average-linkage clustering, we consider the distance between one cluster and
another cluster to be equal to the average distance from any member of one cluster to
any member of the other cluster.
A variation on average-link clustering is the UCLUS method of R. D'Andrade
(1978) which uses the median distance, which is much more outlier-proof than the
average distance.
This kind of hierarchical clustering is called agglomerative because it merges clusters
iteratively. There is also a divisive hierarchical clustering which does the reverse by
starting with all objects in one cluster and subdividing them into smaller pieces.
Divisive methods are not generally available, and rarely have been applied.
(*) Of course there is no point in having all the N items grouped in a single cluster
but, once you have got the complete hierarchical tree, if you want k clusters you just
have to cut the k-1 longest links.(3)
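As a sketch, agglomerative clustering with the three linkage variants above is available in SciPy (assuming SciPy is used; this is not the proposal's own code). Cutting the tree into k clusters corresponds to the (*) note:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# 20 illustrative items in 2-D; linkage() builds the full hierarchical tree.
X = np.random.rand(20, 2)
Z = linkage(X, method='single')   # or 'complete', or 'average'

# Cut the tree to obtain k = 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
```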
Mixture of Gaussians
There’s another way to deal with clustering problems: a model-based approach, which
consists in using certain models for clusters and attempting to optimize the fit between
the data and the model.
In practice, each cluster can be mathematically represented by a parametric
distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is
therefore modelled by a mixture of these distributions. An individual distribution used
to model a specific cluster is often referred to as a component distribution.
A mixture model with high likelihood tends to have the following traits:
• component distributions have high "peaks" (data in one cluster are tight);
• the mixture model "covers" the data well (dominant patterns in the data are captured by component distributions).
Main advantages of model-based clustering:
• well-studied statistical inference techniques available;
• flexibility in choosing the component distribution;
• a density estimation is obtained for each cluster;
• a "soft" classification is available.
Mixture of Gaussians
The most widely used clustering method of this kind is the one based on learning
a mixture of Gaussians: we can actually consider clusters as Gaussian distributions
centred on their barycentres (in the source tutorial's figure, a grey circle represents the first variance of the distribution).
The algorithm works in this way:
• it chooses the component (the Gaussian) ω_i at random with probability P(ω_i);
• it samples a point from the corresponding distribution N(μ_i, σ²I).(4)
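As an illustrative sketch, a mixture of Gaussians can be fitted with scikit-learn (assumed available; fit() runs the EM algorithm of reference (4) internally), and the generative view above corresponds to sampling from the fitted model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs.
X = np.vstack([np.random.randn(100, 2),
               np.random.randn(100, 2) + 4.0])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard = gmm.predict(X)          # hard cluster assignment
soft = gmm.predict_proba(X)    # "soft" classification: p(component | x)

# Generative view: pick a component with its mixing probability,
# then sample a point from that Gaussian.
samples, components = gmm.sample(10)
```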
• Classification:

• Prediction:
Prediction is a way to estimate the value of something based on existing data about it. Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors.
Predictions have an associated probability (how likely is this prediction to be true?). Prediction probabilities are also known as confidence (how confident can I be of this prediction?).
Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighborhood is likely to have an income greater than the regional average. Rules have an associated support (what percentage of the population satisfies the rule?).
Sometimes we would like to predict a continuous value rather than a categorical label. Numeric prediction is the task of predicting continuous (or ordered) values for a given input. For example, we may wish to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price.
By far the most widely used approach for numeric prediction (hereafter referred to as prediction) is regression. In fact, many texts use the terms "regression" and "numeric prediction" synonymously.
We now explain some methods of regression:
- Linear Regression:
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as a linear function of x.
y = b + wx
- Non-Linear Regression:
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables, for example:
y = w0 + w1x + w2x^2 + w3x^3
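A small sketch of both regression forms using NumPy's polyfit on synthetic data; the variable names and the data are illustrative assumptions:

```python
import numpy as np

# Synthetic data scattered around the straight line y = 3 + 2x.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + np.random.randn(50)

# Linear regression y = b + w*x (polyfit returns highest degree first).
w, b = np.polyfit(x, y, 1)

# Cubic model y = w0 + w1*x + w2*x^2 + w3*x^3.
w3, w2, w1, w0 = np.polyfit(x, y, 3)

y_pred = b + w * x  # predictions from the fitted straight line
```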
Project Scope:
A data mining model to determine and predict the times of crowding at the Haram in Mecca.
Project Description:
• Table
• Algorithm
• Picture
• Video
Expected Outcome:
• Reports and predictive modeling
• Test results of different algorithms
• Data mining model
• Web front end for the data mining model and crowd prediction dashboard
Method/Approach:
Several methods, drawing on the clustering and regression techniques described above.
Relevant references:
(1) Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases, Chapter 8: Data Clustering". http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu: "An Efficient k-means Clustering Algorithm: Analysis and Implementation".
Joaquin Perez Ortega, Ma. Del Rocio Boone Rojas and Maria J. Somodevilla Garcia: "Research issues on K-means Algorithm: An Experimental Trial Using Matlab".
Tan, Steinbach, Kumar, Ghosh: "The k-means algorithm - Notes". http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
Ke Chen: "k-means clustering".
(2) J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57.
J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York.
(3) S. C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika, 2:241-254.
R. D'Andrade (1978): "U-Statistic Hierarchical Clustering", Psychometrika, 4:58-67.
(4) A. P. Dempster, N. M. Laird and D. B. Rubin (1977): "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, 1:1-38.