International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number 3 - Dec 2014, ISSN: 2231-5381

An Empirical Model of a High-Dimensional Data Clustering Technique with Intra-Cluster Mean

Priyanka Addada (1), P. Rajasekhar (2)
(1) Final M.Tech Student, (2) Assistant Professor
Department of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam

Abstract: Grouping similar objects is a long-standing research problem in clustering within data mining and knowledge and data engineering. Raw high-dimensional data, however, does not give optimal results in terms of accuracy, so feature extraction is essential before clustering. In this paper we propose an efficient architecture with preprocessing of documents, relevance matrix construction, and an enhanced k-means clustering algorithm in which, after the initial iteration, centroids are computed as the intra-cluster mean of the documents in each cluster.

I. INTRODUCTION

In recent years, personalized information services have come to play an important role in people's lives. Two problems are worth researching in this field: how to capture and describe a user's personal information, i.e. building a user model, and how to organize the information resources, i.e. document clustering. Personal information can be described exactly only if user behavior and the resources users look for or search have been accurately analyzed. The effectiveness of a personalized service depends on the completeness and accuracy of the user model, and its basic operation is organizing the information resources. In this paper we focus on document clustering.
Feature subset selection chooses a subset of features that are relevant to the target concepts. It is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. Based on how they interact with the learning algorithm, feature subset selection methods fall into four groups: embedded, wrapper, filter, and hybrid approaches. Embedded methods incorporate feature selection as part of the training process of the learning algorithm. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to evaluate the goodness of the selected subsets. Filter methods are independent of the learning algorithm; they offer good generality and low computational cost, though usually with lower expected accuracy. Hybrid methods combine the filter and wrapper approaches: the filter step reduces the search space, and the wrapper step then finds the best feature subset over the database.

II. RELATED WORK

Prior work proposes an algorithm that identifies a subset of useful features from the original set of attributes. Such an algorithm is evaluated on two criteria: efficiency, the time required to find the feature subset, and effectiveness, the quality of the subset it finds. Based on these criteria, a feature selection algorithm was proposed and evaluated experimentally. It operates in two steps: first, graph-theoretic clustering groups the features into clusters; second, from each cluster, the feature most strongly related to the target class is selected.
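To make the filter approach concrete, the following is a minimal sketch (not the paper's implementation; the function name and the use of Pearson correlation as the relevance score are illustrative assumptions): each feature is scored independently of any learning algorithm, and the top-k features are kept.

```python
import math

def filter_select(features, labels, k):
    """Filter approach sketch: rank each feature by a relevance score
    computed independently of any learning algorithm, keep the top k.
    `features` is a list of rows; `labels` is the class value per row."""
    def relevance(column):
        # Absolute Pearson correlation between one feature column and the labels.
        n = len(column)
        mx = sum(column) / n
        my = sum(labels) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
        vx = math.sqrt(sum((x - mx) ** 2 for x in column))
        vy = math.sqrt(sum((y - my) ** 2 for y in labels))
        return abs(cov / (vx * vy)) if vx and vy else 0.0

    columns = list(zip(*features))  # transpose: rows -> feature columns
    scores = [relevance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]  # indices of the k most relevant features
```

Because each feature is scored in isolation, this runs in time linear in the number of features, which is the low computational cost the filter approach is known for.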
Features in different clusters are relatively independent, so this clustering-based strategy has a high probability of producing a subset of useful and independent features; for the clustering step, the minimum-spanning-tree method is adopted. The efficiency and effectiveness of the algorithm were evaluated experimentally, comparing it with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, using well-known classifiers: the probability-based Naive Bayes, the tree-based C4.5, and the instance-based IB1. The results on real-world high-dimensional image, microarray, and text data show improved performance for all four types of classifiers. The current research work can be further enhanced with an efficient privacy-preserving mechanism through an efficient key exchange protocol, because direct transmission of a key may make the system vulnerable; a group protocol such as Shamir's secret sharing scheme gives more security and privacy to the data. We conclude our current research work with an efficient data clustering technique whose quality is enhanced by preprocessing, a relevance matrix, and centroid computation in the k-means algorithm; here the relevant features of the dataset are extracted instead of clustering the raw dataset. Our experimental results show optimized clusters with the evolutionary approach. Similarity is measured with the cosine measure:

Cos(dm, dn) = (dm · dn) / (|dm| × |dn|)

where dm is the centroid (file relevance score or document weight) and dn is the document weight (file weight).
III. PROPOSED SYSTEM

We propose a novel clustering approach with enhanced k-means and similarity matrix construction; it gives the best performance results in terms of accuracy, efficiency, and reliability. Initially, the set of documents is loaded and preprocessed for feature extraction, which removes articles, prepositions, conjunctions, and other stop words.

Fig. 1: Architecture (Documents → Pre-process → Pre-processed documents → Compute weights → Similarity matrix → Compute centroids and similarity → Evolutionary approach → Clusters)

High-dimensional document data is categorical, so we convert it to numerical data using term frequency and inverse document frequency, from which we compute the file relevance score, or document weight. A relevance matrix is then constructed for all documents with respect to their document weights. The similarity matrix reduces computational complexity by eliminating the iterative recomputation of similarity: at every iteration of cluster generation, a similarity value is needed to place each document, and that value is now available in the similarity matrix, which reduces both space and time complexity.
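The weighting and matrix-construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are invented, documents are assumed to be already tokenized, and a basic tf × idf product is used for the document weights.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Convert tokenized documents to numerical TF-IDF weight vectors.
    Returns the sorted vocabulary and one weight vector per document."""
    n = len(documents)
    vocab = sorted({t for doc in documents for t in doc})
    # Document frequency: in how many documents each term appears.
    df = {t: sum(1 for doc in documents if t in doc) for t in vocab}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vec = [(counts[t] / len(doc)) * math.log(n / df[t]) if counts[t] else 0.0
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors

def relevance_matrix(vectors):
    """Precompute all pairwise cosine similarities once, so clustering
    iterations can look similarities up instead of recomputing them."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(u, v) for v in vectors] for u in vectors]
```

Building the matrix costs one pass over all document pairs up front; afterwards, every similarity lookup during clustering is a constant-time table access, which is the complexity saving the architecture relies on.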
We use the most widely used similarity measure, cosine similarity. The following figure shows a simple relevance (similarity) matrix:

      d1    d2    d3    d4    d5
d1   1.0   0.48  0.66  0.89  0.45
d2   0.77  1.0   0.88  0.67  0.88
d3   0.45  0.9   1.0   0.67  0.34
d4   0.32  0.47  0.77  1.0   0.34
d5   0.67  0.55  0.79  0.89  1.0

Fig. 2: Relevance matrix

In our approach we enhance the k-means algorithm with re-centroid computation instead of a single random selection at every iteration. The optimized k-means algorithm is as follows:

Algorithm:
Step 1: Select K objects as centroids for the first iteration of clusters.
Step 2: While (iterations <= maximum number of user-specified iterations):
Step 3: get_relevance_matrix(di, dj), where di is the document weight of document i from the relevance matrix and dj is the document weight of document j from the relevance matrix.
Step 4: Place the selected data point in the maximum-similarity cluster.
Step 5: Regenerate each centroid as the intra-cluster average of the individual cluster, e.g. (P11 + P12 + … + P1k) / K over all points of the same cluster.
Step 6: Compute the intra-cluster-based similarity for the next iteration.

The traditional k-means algorithm randomly selects a new centroid at each iteration; our enhancement constructs the relevance matrix in advance and computes the new centroid from the average of the K document weights within each cluster.
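The six steps above can be sketched as a short program. This is a hedged illustration, not the paper's implementation: the function names are invented, documents are assumed to be already represented as weight vectors, and cosine similarity is computed directly rather than looked up from a precomputed matrix.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two document weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def enhanced_kmeans(vectors, k, max_iterations=20, seed=0):
    """K-means variant where, after the first iteration, each centroid is
    the intra-cluster mean of its members instead of a random re-selection."""
    rng = random.Random(seed)
    # Step 1: select K documents as centroids for the first iteration only.
    centroids = [vectors[i][:] for i in rng.sample(range(len(vectors)), k)]
    assignments = [0] * len(vectors)
    # Step 2: iterate up to the user-specified maximum.
    for _ in range(max_iterations):
        # Steps 3-4: place each document in its maximum-similarity cluster.
        assignments = [max(range(k), key=lambda c: cosine(vec, centroids[c]))
                       for vec in vectors]
        # Step 5: regenerate each centroid as the intra-cluster mean
        # (P11 + P12 + ... + P1k) / K over the points of that cluster.
        new_centroids = []
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors))
                       if assignments[i] == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # Step 6: converged, similarities stable
            break
        centroids = new_centroids
    return assignments, centroids
```

Randomness is used only once, to seed the first iteration; every later centroid is deterministic given the assignments, which is the stabilizing effect the enhancement aims for.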
IV. CONCLUSION

We conclude our current research work with an efficient and improved data clustering technique over the feature set of high-dimensional text document data. Raw data is pre-processed to remove inconsistent data, document weights are computed from frequency values, and documents are clustered around centroids computed as the intra-cluster mean of the documents.

BIOGRAPHIES

Priyanka Addada completed her B.Tech (2008-2012) at Devineni Venkata Ramana & Dr. Hima Sekhar MIC College of Technology, Kanchikacherla, Krishna district. She is pursuing her M.Tech at Avanthi Institute of Engineering and Technology, Tamaram, Makavarapalem. Her areas of interest are data mining and network security.

P. Rajasekhar is working as an Assistant Professor at Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam. His areas of interest are data mining, network security, and cloud computing.