A Preliminary Design for Digital Forensics Analysis of Terabyte Size Data Sets

L.M. Liebrock, N. Marrero, D.P. Burton, R. Prine, E. Cornelius, M. Shakamuri, V. Urias
New Mexico Institute of Mining and Technology (NMT), 801 Leroy Place, Socorro, NM 87801
[email protected], marrero, dburton, rprine, eea, mayuri, [email protected]

ABSTRACT
Digital forensics is computationally intensive, and current analysis systems do not handle the multiple terabyte size data sets that are now becoming a major issue for analysis. For these data sets, RAID file system analysis, parallel computing, collaboration, and visualization will be essential. Here we outline the preliminary design for a parallel digital forensics framework that is being developed to handle multiple terabyte size data set analysis.

Categories and Subject Descriptors
K.6.5 Security and Protection, H.3.2 Information Storage, H.2.8 Database Applications, D.1.3 Concurrent Programming

General Terms
Algorithms, Performance, Security.

Keywords
Digital Forensics, Terabyte Size Data Sets, Visualization, Visual Analytics, Parallel Computing, RAID File Systems.

1. INTRODUCTION
Digital forensic analysts are challenged by the increasing variety and complexity of analyses, communication bandwidth, and storage size. Roussev and Golden [7] show that analysts will need to use distributed processing for digital forensics on large datasets. They found that opening an 80 GB target with FTK took over 4 days. At about $1k per terabyte, companies can build stores that are effectively not analyzable.

We propose a parallel digital forensics system (PDF) for terabyte size data sets. PDF will image RAID drives for parallel analysis, use a parallel back-end to speed analysis, use a database to store analysis results, and support visual analytics and collaboration.

2. SYSTEM ARCHITECTURE
The essential components of PDF are the visualization / collaboration engine and the parallel processing analysis engine. To decouple the front-end of the system from the back-end, we are employing a database. The database communications wrapper talks with both the front-end and the back-end. The database stores all back-end analysis results. If a user requests an analysis that is not running, the middleware alerts the back-end. The middleware alerts the front-end as analysis completes, more data becomes available, the database changes, or other users connect to the database. The PDF architecture is shown in Figure 1.

Figure 1: PDF Architecture

3. RAID BITSTREAM EXTRACTION
Creating a forensically sound copy is not trivial due to the data size involved, the number of disks, and the RAID encoding. We propose to use one of two different parallelization techniques to image the drives: 1) image the whole device (which should not be an issue because the live RAID controller sees it as a single disk) or 2) image constant sized blocks of data that correspond to the size of each disk drive within the RAID set. Both approaches allow imaging to proceed independently in parallel.

Reconstruction of the RAID is common to both techniques and is accomplished by taking the imaged disk data and determining a blocking size. A RAID map is then constructed based on data blocks and parity. Using the parity blocks, the period of the RAID set can be found. A non-trivial process for RAID reconstruction must be developed in order to extract necessary data [3]. For speed, reconstruction will be done in parallel.

4. PARALLEL FORENSIC ANALYSIS
Parallelization is the natural response to the size and complexity of forensic analysis for terabyte size data sets. Both network traffic analysis and file system analysis require parallel performance, but they require separate treatment.

4.1 Network Analysis
Network-Centric Forensics determines compromised systems, intruder activity, and incident scope [1], providing the investigator with initial clues for further analysis. To enable analysis of traffic from high bandwidth networks in close to real time, each newly identified session is assigned to a specific processor for the collection and reconstruction of related traffic. By integrating other network source data with session network traffic, a detailed, coherent picture of network activity and associated entities can be built.

4.2 File System Analysis
After RAID analysis, we have parallel independent files for analysis. Standard search and analysis mechanisms will be used.

For unallocated storage, deleted files must be reconstructed across the parallel processing system. This can be accomplished using file carving techniques. File carving utilizes header and footer values to determine which segments of data represent a file. Entropy calculations are performed on the remaining data. Blocks with similar scores will be treated as potential fragments of a single file. Next, the selected blocks are taken, the file fragments are reassembled, and the recovered files are validated using mature and widely accepted tools such as "file".

Slack space analysis will mimic unallocated space analysis. However, if a complete file is found, it has been specifically hidden and will be noted for the analyst. After file reconstruction has been performed, analysis of slack space will follow standard procedures, e.g., text searches.

5. FORENSIC DATABASE
All analysis results will be housed in the database, which additional users can access with approval of the analysis initiator. Users are authenticated by the database communications wrapper, not the database itself. All transactions with the database will be logged, indicating the time, date, user, and transaction. Communication between the database and the front-end and back-end will be encrypted to maintain the security of the analysis.

6. VISUALIZATION FRONT-END
The analysis will begin with the user indicating the dataset to be analyzed and which default analyses they want done. This will populate the database for the analysis. The user can then allow others access, if the analysis will be collaborative.

6.1 Visualization
Keyword searches (10-20 keywords in 200 GB) take days to process, and the analyst is overwhelmed with 'hits' [2]. For terabyte size data sets, hits will increase dramatically.

The Department of Homeland Security established visual analytics for "analyzing terrorist threats, safeguarding borders and ports, and preparing for and responding to emergencies" [9]; the issues are similar for digital forensics. PDF's visual analytics will address visualization, analysis, and understanding of terabyte size data sets, where the analyst will not be able to look at all data. Data mining will prioritize the data for analyst consideration [2].

Visual analytics will extend visualizations, including Tree-maps [8], Conetrees [6], TileBars [5], Hypertrees [4], scatterplots, timelines, histograms, and three-dimensional volumes. Brushing and linking between views will enable discovery of relationships between data perspectives.

6.2 Collaboration
Here we consider two forms of collaboration. In static collaboration, the analysts are primarily working independently and simply wish to occasionally bring the attention of their colleague(s) to specific information. In interactive collaboration, the analysts are working much more closely via shared views. Shared views could also be used to train new analysts.

7. FUTURE WORK
This paper has outlined the design of PDF. The components described here are in various stages of implementation, which are not detailed here due to space restrictions. Extension of the algorithms will continue alongside implementation to improve PDF's handling of RAID systems, parallelization, collaboration, and visual analytics. Future papers will detail individual components as well as PDF performance.

8. CONCLUSIONS
We have described our proposed system for improving support for digital forensics of terabyte size data sets. We have specifically dealt with issues related to RAID analysis targets, parallelized analysis, collaboration, and visual analytics to reduce the analyst burden for these large data sets.

9. REFERENCES
[1] Bair, A., Monroe, K., and Smith, J., Digital Forensic Research Workshop 2006 File Carving Challenge Entry, viewed Sept. 15, 2006, http://www.dfrws.org/2006/challenge/submissions/bair/.
[2] Beebe, N.L. and Clark, J.G., Approaching the Terabyte Dataset Problem in Digital Investigations, Advances in Digital Forensics, Eds. Pollitt, M. and Shenoi, S., Springer, 2006.
[3] Brown, C.L.T., Computer Evidence: Collection and Preservation, Charles River Media, Hingham, MA, 2005.
[4] De Souza, K.X.S., Dos Santos, A.D., and Evangelista, S.R.M., Visualization of Ontologies through Hypertrees, Proceedings of the Latin American Conference on Human-Computer Interaction, 2003.
[5] Hearst, M., TileBars: Visualization of Term Distribution Information in Full Text Information Access, Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), Denver, CO, May 1995.
[6] Robertson, G.G., Mackinlay, J.D., and Card, S.K., Conetrees: Animated 3D Visualizations of Hierarchical Information, Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, ACM Press, 1992.
[7] Roussev, V. and Golden, R.G., Breaking the Performance Wall: The Case of Distributed Digital Forensics, Digital Forensics Research Workshop 2004, Linthicum, MD, 2004.
[8] Shneiderman, B., Tree Visualization with Tree-maps: 2-d Space-filling Approach, ACM Transactions on Graphics, V11, I1, 1992.
[9] Thomas, J.J. and Cook, K.A., Illuminating the Path: The Research and Development Agenda for Visual Analytics, IEEE Press, 2005.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'07, March 11-15, 2007, Seoul, Korea. Copyright 2007 ACM 1-59593-480-4/07/0003…$5.00.
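As an illustration of the RAID reconstruction step described under RAID bitstream extraction, the following minimal Python sketch shows two building blocks such a process might rely on. It assumes a RAID-5 style layout in which the parity block of each stripe is the XOR of that stripe's data blocks; the paper does not specify the RAID level or the algorithm, and the function names (xor_blocks, stripe_is_consistent, smallest_period) are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks):
    """Assumed RAID-5 property: data blocks XOR parity block is all zeros."""
    return not any(xor_blocks(blocks))


def smallest_period(sequence):
    """Return the smallest p such that sequence[i] == sequence[i % p] for all i.

    Applied to the per-stripe index of the disk holding parity, this gives
    the period of the parity rotation across the RAID set.
    """
    n = len(sequence)
    for p in range(1, n + 1):
        if all(sequence[i] == sequence[i % p] for i in range(n)):
            return p
    return n
```

The same XOR helper can rebuild one missing block of a stripe from the remaining blocks, which is why a parity map is useful both for locating the rotation period and for recovering data.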
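The assignment of each newly identified network session to a specific processor, described under network analysis, could be sketched as follows. The paper does not say how sessions are mapped to processors; hashing a direction-independent session key is an assumption made here for illustration, and session_key and assign_worker are hypothetical names.

```python
import hashlib


def session_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a key that is identical for both directions of a session."""
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{proto}|{first[0]}:{first[1]}|{second[0]}:{second[1]}"


def assign_worker(key, num_workers):
    """Map a session key to a worker index using a stable hash (assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Because the key sorts the two endpoints, packets travelling in either direction of a session reach the same worker, which keeps the reconstruction of related traffic local to one processor.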
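The header and footer based file carving described under file system analysis can be illustrated with a short Python sketch. It uses the well-known JPEG start-of-image (FF D8 FF) and end-of-image (FF D9) markers as an example signature pair; the choice of file type, the max_size limit, and the function name carve are assumptions for illustration, not details given in the paper.

```python
JPEG_HEADER = b"\xff\xd8\xff"  # JPEG start-of-image marker plus next marker byte
JPEG_FOOTER = b"\xff\xd9"      # JPEG end-of-image marker


def carve(data, header=JPEG_HEADER, footer=JPEG_FOOTER, max_size=10 * 1024 * 1024):
    """Return (start, end) offsets of candidate files found in a raw buffer.

    A candidate starts at a header signature and ends just after the next
    footer signature, provided it is no larger than max_size bytes.
    """
    candidates = []
    start = data.find(header)
    while start != -1:
        end = data.find(footer, start + len(header))
        if end == -1:
            break  # no footer remains anywhere after this header
        end += len(footer)
        if end - start <= max_size:
            candidates.append((start, end))
            start = data.find(header, end)
        else:
            # Candidate too large: try the next header instead.
            start = data.find(header, start + 1)
    return candidates
```

Each candidate range would still need validation, for example with the "file" utility mentioned in the paper, since signature matches alone can produce false positives.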
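The entropy scoring of remaining blocks, also described under file system analysis, might look like the sketch below. It assumes Shannon entropy over byte values and groups blocks whose scores fall in the same fixed-width bucket; the block size, the bucket width, and the function names are hypothetical, since the paper does not state how "similar scores" are determined.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy of a byte block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def group_blocks_by_entropy(data, block_size=512, bucket_width=0.5):
    """Group block offsets by entropy bucket (bucketing is an assumption).

    Blocks in the same bucket are treated as potential fragments of one file.
    """
    groups = {}
    for offset in range(0, len(data), block_size):
        score = shannon_entropy(data[offset:offset + block_size])
        groups.setdefault(int(score / bucket_width), []).append(offset)
    return groups
```

Compressed or encrypted fragments tend to score near 8 bits per byte while text and sparse data score much lower, so bucketing gives a coarse first pass before fragments are reassembled and validated.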