International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number 3 – December 2014
An Empirical Model of High Dimensional Data Clustering Technique with Intra Cluster Mean
Priyanka Addada (1), P. Rajasekhar (2)
(1) Final M.Tech Student, (2) Assistant Professor
Department of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam
Abstract: Grouping similar objects has always been an interesting research issue in the field of clustering in data mining and knowledge and data engineering, but raw high dimensional data cannot give optimal results in terms of accuracy, so feature extraction is very important before clustering. In this paper we propose an efficient architecture with pre-processing of documents, relevance matrix construction and an enhanced k-means clustering algorithm in which, except for the initial iteration, centroids are computed with the intra cluster mean during clustering of the documents.
I. INTRODUCTION
In recent years, personalized information services have come to play an important role in people's lives. There are two important problems worth researching in this field. One is how to obtain and describe a user's personal information, i.e. building the user model; the other is how to organize the information resources, i.e. document clustering. Personal information can be described exactly only if user behaviour and the resources users look for or search have been accurately analyzed. The effectiveness of a personalized service depends on the completeness and accuracy of the user model. The basic operation is organizing the information resources, and in this paper we focus on document clustering.
Feature subset selection chooses a subset of good features with respect to the target concepts. It is an effective way of reducing dimensionality, removing irrelevant information, increasing learning accuracy and improving the comprehensibility of the output. Based on how the subset is evaluated, feature selection methods are usually classified into four groups: embedded, wrapper, filter and hybrid approaches. Embedded methods incorporate feature selection as part of the training process of the learning algorithm. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to judge the goodness of the identified subsets. Filter methods are independent of the learning algorithm, generalize well and have low computational cost, so high expectations are placed on them. Hybrid methods combine the filter and wrapper approaches: the filter stage reduces the search space and the wrapper stage selects the best feature subset of the cluster over the database.
II. RELATED WORK
In this paper we propose an algorithm that identifies a subset of useful features from the original set of attributes. The algorithm is judged on two criteria: efficiency, which concerns the time required to find a subset of features, and effectiveness, which concerns the quality of the subset that is obtained. Based on these criteria a new algorithm is proposed and evaluated by conducting experiments. It works in two steps: first, features are divided into clusters using graph-theoretic clustering; second, from each cluster the feature that is most strongly associated with the target classes is selected to form the feature subset. Features in different clusters are relatively independent, so this clustering based strategy has a high probability of producing a subset of useful and independent features. For the clustering step we adopt the minimum spanning tree based clustering method, and we evaluate the effectiveness and efficiency of the algorithm through an experimental approach.
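To make the two-step idea concrete, the following Python sketch shows one possible realisation of such minimum spanning tree based feature clustering; the absolute Pearson correlation used as the relevance measure, the edge threshold and all function names are our own assumptions, not the exact algorithm evaluated in this work:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_feature_subset(X, y, edge_threshold=0.7):
    """X: (n_samples, n_features) data, y: targets; returns indices of selected features."""
    n_features = X.shape[1]
    # Feature-feature and feature-target relevance (assumed: absolute correlation).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    relevance = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)]))

    # Complete graph with "distance" weights, reduced to its minimum spanning tree.
    distance = np.maximum(1.0 - corr, 1e-12)
    np.fill_diagonal(distance, 0.0)
    mst = minimum_spanning_tree(distance).toarray()

    # Cut weak edges (large distance = low similarity) to split the tree into
    # clusters of mutually similar features.
    mst[mst > 1.0 - edge_threshold] = 0.0
    n_clusters, labels = connected_components((mst + mst.T) > 0, directed=False)

    # Keep, from each cluster, the feature most strongly related to the target.
    return sorted(max(np.where(labels == c)[0], key=lambda j: relevance[j])
                  for c in range(n_clusters))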
Extensive experiments compare the proposed algorithm with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist and FOCUS-SF, using well known classifiers such as the probability-based Naive Bayes, the tree-based C4.5 and the instance-based IB1. The results obtained on real-world high dimensional image, microarray and text data show that our algorithm improves the performance of these classifiers.
Our current research can be further enhanced with an efficient privacy preserving mechanism through the implementation of an efficient key exchange protocol, because direct transmission of keys may make the system vulnerable; a group protocol such as Shamir's secret sharing technique gives more security and privacy to the data. We conclude our current research work with an efficient data clustering technique in which the quality of clustering is enhanced with pre-processing, relevance matrix construction and centroid computation in the k-means algorithm; here the relevant features of the dataset are extracted instead of using the raw datasets. Our experimental results show optimized clusters with the evolutionary approach.
In our approach the similarity between a centroid and a document is measured with the cosine similarity:

Cos(dm, dn) = (dm · dn) / (||dm|| · ||dn||)

where dm is the centroid (file relevance score or document weight) and dn is the document weight or file weight.
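A minimal Python sketch of this measure, assuming each document or centroid is represented as a numeric weight vector, could look like the following (names are illustrative):

import math

def cosine_similarity(dm, dn):
    """Cosine similarity between two weight vectors dm (centroid) and dn (document)."""
    dot = sum(a * b for a, b in zip(dm, dn))
    norm_m = math.sqrt(sum(a * a for a in dm))
    norm_n = math.sqrt(sum(b * b for b in dn))
    return dot / (norm_m * norm_n) if norm_m and norm_n else 0.0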
III. PROPOSED SYSTEM
We propose a novel clustering mechanism with an enhanced k-means algorithm and similarity matrix construction; our approach gives the best performance results in terms of accuracy, efficiency and reliability. In this paper a set of documents is initially loaded and pre-processed for feature extraction, which removes articles, prepositions, conjunctions and other stop words.
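A minimal sketch of this pre-processing step, in Python and with a deliberately small, hypothetical stop-word list, could be:

import re

# Hypothetical, deliberately incomplete stop-word list of articles,
# prepositions and conjunctions.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of", "for", "to",
              "and", "or", "but", "with", "by"}

def preprocess(document):
    """Lower-case, tokenize and drop stop words from one raw document."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [token for token in tokens if token not in STOP_WORDS]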
Fig 1: Architecture (Documents → Pre-process → Pre-processed documents → Compute weights → Similarity matrix → Compute centroids and similarity → Evolutionary approach → Clusters)
High dimensional text data is categorical, so it must first be converted to numerical data using the term frequency and inverse document frequency factors; from these we compute the file relevance score, or document weight. A relevance matrix is then constructed for all documents with respect to their document weights. This similarity matrix reduces the computational complexity by eliminating the iterative re-computation of similarities.
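The paper does not give the exact weighting formula, so the sketch below assumes the standard tf-idf definition and, as a further assumption, takes the sum of a document's tf-idf weights as its scalar document weight (file relevance score):

import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists; returns one tf-idf dictionary per document."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))  # document frequency
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def document_weight(vector):
    """Assumed scalar file relevance score: the sum of the tf-idf weights."""
    return sum(vector.values())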
The following figure shows a simple relevance or similarity matrix. At every iteration of cluster generation a similarity value is required to place a document; since it is already available in the similarity matrix, this obviously reduces the space and time complexity. We use the most widely used similarity measure, i.e. cosine similarity.
      d1     d2     d3     d4     d5
d1    1.0    0.48   0.66   0.89   0.45
d2    0.77   1.0    0.88   0.67   0.88
d3    0.45   0.9    1.0    0.67   0.34
d4    0.32   0.47   0.77   1.0    0.34
d5    0.67   0.55   0.79   0.89   1.0

Fig 2: Relevance matrix
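Such a relevance matrix can be pre-computed once before clustering begins. A short numpy sketch, assuming the documents are already available as rows of a tf-idf matrix, is:

import numpy as np

def relevance_matrix(doc_vectors):
    """doc_vectors: (n_docs, n_terms) tf-idf matrix.
    Returns the n_docs x n_docs matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    normalized = doc_vectors / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

Every similarity needed while placing a document can then be looked up in constant time instead of being recomputed in each iteration.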
In our approach we enhance the K-Means algorithm with re-centroid computation instead of a single random selection at every iteration. The optimized k-means algorithm is as follows.

Algorithm:
Step 1: Select K objects as centroids for the first iteration of clusters.
Step 2: while (iterations <= maximum number of user specified iterations)
Step 3: get_relevance_matrix(di, dj), where di is the document weight of document i from the relevance matrix and dj is the document weight of document j from the relevance matrix.
Step 4: Place the selected data point in the cluster with maximum similarity.
Step 5: Regenerate each centroid as the intra cluster average of the points of that cluster, e.g. (P11 + P12 + ... + P1k) / K, where all points belong to the same cluster.
Step 6: Compute the intra cluster based similarity for the next iteration.
The traditional k-means algorithm randomly selects a new centroid at every iteration; in our approach we enhance this by constructing the relevance matrix in advance and by taking the average of the document weights within each cluster for the new centroid calculation.
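A compact sketch of this enhanced loop is given below. It follows the steps above: random centroids only for the first iteration, assignment by maximum cosine similarity, and re-computation of every centroid as the intra cluster mean. Function and parameter names are illustrative, not the authors' implementation:

import numpy as np

def enhanced_kmeans(doc_vectors, k, max_iterations=20, seed=0):
    """doc_vectors: (n_docs, n_terms) tf-idf matrix; returns a cluster label per document."""
    rng = np.random.default_rng(seed)
    n_docs = doc_vectors.shape[0]

    # Normalise rows once so that a dot product equals cosine similarity.
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    docs = doc_vectors / np.where(norms == 0, 1.0, norms)

    # Step 1: random centroids only for the first iteration.
    centroids = docs[rng.choice(n_docs, size=k, replace=False)]

    labels = np.zeros(n_docs, dtype=int)
    for _ in range(max_iterations):              # Step 2
        similarity = docs @ centroids.T          # Step 3: cosine similarities
        labels = similarity.argmax(axis=1)       # Step 4: most similar cluster
        for c in range(k):                       # Step 5: intra cluster mean
            members = docs[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
        # Step 6: re-normalise so the next iteration's dot products stay cosines.
        cnorm = np.linalg.norm(centroids, axis=1, keepdims=True)
        centroids = centroids / np.where(cnorm == 0, 1.0, cnorm)
    return labels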
IV. CONCLUSION
We conclude our current research work with an efficient and improved data clustering technique over the feature set of high dimensional text document data. Raw data is pre-processed to remove inconsistent data, document weights are computed from frequency values, and documents are clustered around centroids computed with respect to the intra cluster mean of the documents.
BIOGRAPHIES
Priyanka Addada completed her B.Tech (2008-2012) at Devineni Venkata Ramana & Dr. Hima Sekhar MIC College of Technology, Kanchikacherla, Krishna district. She is pursuing her M.Tech at Avanthi Institute of Engineering and Technology, Tamaram, Makavarapalem. Her areas of interest are data mining and network security.

P. Rajasekhar is working as Assistant Professor at Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam. His areas of interest are data mining, network security and cloud computing.