Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection Abstract • In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is dif¬ficult to be performed. • In this context, automated methods of anal¬ysis are of great interest. • In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowl¬edge from the documents under analysis. • We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. • We illustrate the pro¬posed approach by carrying out extensive experimentation with six wellknown clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Abstract con… • Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. • In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. • Our experiments show that the Average Link and Complete Link algo¬rithms provide the best results for our application domain. • If suit-ably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and dis¬cuss several practical results that can be useful for researchers and practitioners of forensic computing. Existing system • IT IS estimated that the volume of data in the digital world increased from 161 hexabytes in 2006 to 988 hexabytes in 2010 [1]—about 18 times the amount of information present in all the books ever written—and it continues to grow exponen¬tially. • This large amount of data has a direct impact in Computer Forensics, which can be broadly defined as the discipline that combines elements of law and computer science to collect and analyze data from computer systems in a way that is admissible as evidence in a court of law. • In our particular application do¬main, it usually involves examining hundreds of thousands of files per computer. • This activity exceeds the expert’s ability of analysis and interpretation of data. • Therefore, methods for auto¬mated data analysis, like those widely used for machine learning and data mining, are of paramount importance. • In particular, al¬gorithms for pattern recognition from the information present in text documents are promising, as it will hopefully become evi¬dent later in the paper. Architecture Diagram System specification HARDWARE REQUIREMENTS Processor : intel Pentium IV Ram : 512 MB Hard Disk : 80 GB HDD SOFTWARE REQUIREMENTS Operating System : windows XP / Windows 7 FrontEnd : Java BackEnd : MySQL 5 CONCLUSION • We presented an approach that applies document clustering methods to forensic analysis of computers seized in police in¬vestigations. • Also, we reported and discussed several practical results that can be very useful for researchers and practitioners of forensic computing. • More specifically, in our experiments the hierarchical algorithms known as Average Link and Com¬plete Link presented the best results. • Despite their usually high computational costs, we have shown that they are particularly suitable for the studied application domain because the dendro¬grams that they provide offer summarized views of the docu¬ments being inspected, thus being helpful tools for forensic ex¬aminers that analyze textual documents from seized computers. • As already observed in other application domains, dendrograms provide very informative descriptions and visualization capabil¬ities of data clustering structures [5]. THANK YOU