A Preliminary Design for Digital Forensics Analysis of
Terabyte Size Data Sets
L.M. Liebrock, N. Marrero, D.P. Burton, R. Prine, E. Cornelius, M. Shakamuri, V. Urias
New Mexico Institute of Mining and Technology (NMT), 801 Leroy Place, Socorro, NM 87801
[email protected], marrero, dburton, rprine, eea, mayuri, [email protected]
ABSTRACT
Digital forensics is computationally intensive, and current analysis
systems do not handle the multiple terabyte size data sets that are
now becoming a major issue for analysis. For these data sets,
RAID file system analysis, parallel computing, collaboration, and
visualization will be essential. Here we outline the preliminary
design for a parallel digital forensics framework that is being
developed to handle multiple terabyte size data set analysis.
Categories and Subject Descriptors
K.6.5 Security and Protection, H.3.2 Information Storage, H.2.8
Database Applications, D.1.3 Concurrent Programming
General Terms
Algorithms, Performance, Security.
Keywords
Digital Forensics, Terabyte Size Data Sets, Visualization, Visual
Analytics, Parallel Computing, RAID File Systems.
1. INTRODUCTION
Digital forensic analysts are challenged by the increasing variety and
complexity of analyses, communication bandwidth, and storage
size. Roussev and Golden [7] show that analysts will need
distributed processing to handle large data sets in digital forensics.
They found that opening an 80 GB target with FTK took over 4
days. At about $1k per terabyte, companies can build stores that
are effectively not analyzable.

We propose a parallel digital forensics system (PDF) for terabyte
size data sets. PDF will image RAID drives for parallel analysis,
use a parallel back-end to speed analysis, use a database to store
analysis results, and support visual analytics and collaboration.
2. SYSTEM ARCHITECTURE
The essential components of PDF are the visualization /
collaboration engine and the parallel processing analysis engine.
To decouple the front-end of the system from the back-end, we
are employing a database. The database communications wrapper
talks with both the front-end and the back-end. The database
stores all back-end analysis results. If a user requests an analysis
that is not running, the middleware alerts the back-end. The
middleware alerts the front-end as analysis completes, more data
becomes available, the database changes, or other users connect to
the database. The PDF architecture is shown in Figure 1.
Figure 1: PDF Architecture
3. RAID BITSTREAM EXTRACTION
Creating a forensically sound copy is not trivial due to the data
size involved, the number of disks, and the RAID encoding. We
propose to use one of two different parallelization techniques to
image the drives: 1) image the whole device (which should not be
an issue because the live RAID controller presents it as a single disk)
or 2) image constant sized blocks of data that correspond to the
size of each disk drive within the RAID set. With either approach,
the pieces can be imaged independently and in parallel.

Reconstruction of the RAID is common to both techniques and is
accomplished by taking the imaged disk data and determining a
blocking size. A RAID map is then constructed based on data
blocks and parity. Using the parity blocks, the period of the
RAID set can be found. A non-trivial process for RAID
reconstruction must be developed in order to extract necessary
data [3]. For speed, reconstruction will be done in parallel.
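The RAID reconstruction idea from Section 3 can be illustrated with a small sketch. It is not the PDF implementation. It assumes a RAID-5 style set with XOR parity, a block size that is already known, parity that rotates from the last disk toward the first, and data blocks placed in disk order within each stripe. Real controllers use other layouts, and the names `make_raid5` and `reconstruct` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_raid5(data, n_disks, block):
    """Build toy RAID-5 disk images (assumed layout, for illustration only)."""
    per_stripe = (n_disks - 1) * block
    # Pad the data to a whole number of stripes.
    data = data.ljust(-(-len(data) // per_stripe) * per_stripe, b"\0")
    disks = [bytearray() for _ in range(n_disks)]
    for s in range(len(data) // per_stripe):
        chunk = data[s * per_stripe:(s + 1) * per_stripe]
        blocks = [chunk[i * block:(i + 1) * block] for i in range(n_disks - 1)]
        parity = xor_blocks(blocks)
        parity_disk = (n_disks - 1 - s) % n_disks  # assumed rotation
        remaining = iter(blocks)
        for d in range(n_disks):
            disks[d] += parity if d == parity_disk else next(remaining)
    return [bytes(d) for d in disks]


def reconstruct(disks, block, missing=None):
    """Rebuild the logical bitstream from per-disk images.

    `missing` is the index of an unreadable disk (its entry is None);
    its blocks are recovered as the XOR of the other blocks in each stripe.
    """
    n = len(disks)
    size = len(next(d for d in disks if d is not None))
    out = bytearray()
    for s in range(size // block):
        stripe = [None if d is None else d[s * block:(s + 1) * block]
                  for d in disks]
        if missing is not None:
            stripe[missing] = xor_blocks([b for b in stripe if b is not None])
        parity_disk = (n - 1 - s) % n  # must match the assumed rotation
        for d in range(n):
            if d != parity_disk:
                out += stripe[d]
    return bytes(out)
```

Because each stripe is handled independently, the loop over stripes in `reconstruct` is the natural unit to distribute across processors.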
4. PARALLEL FORENSIC ANALYSIS
Parallelization is the natural response to the size and complexity
of forensic analysis for terabyte size data sets. Both network
traffic and file system analysis require parallel performance, but
require separate treatment.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SAC’07, March 11-15, 2007, Seoul, Korea.
Copyright 2007 ACM 1-59593-480-4 /07/0003…$5.00.
4.1 Network Analysis
Network-centric forensics determines compromised systems,
intruder activity, and incident scope [1], providing the
investigator with initial clues for further analysis. To enable network
traffic analysis from high bandwidth networks, in close to real
time, each newly identified session is assigned to a specific
processor for the collection and reconstruction of related traffic.
By integrating other network source data with session network
traffic, a detailed coherent picture of network activity and
associated entities can be built.
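As a minimal sketch of assigning each newly identified session to a specific processor, the following assumes packets arrive as dictionaries with `proto`, `src`, `sport`, `dst`, and `dport` fields, and that new sessions are handed out round-robin. The class name, the field names, and the round-robin policy are illustrative assumptions, not details taken from PDF.

```python
from collections import defaultdict
from itertools import count


def session_key(pkt):
    """Direction-independent key so both directions of a flow share a session."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    return (pkt["proto"],) + tuple(sorted((a, b)))


class SessionDispatcher:
    """Pins every session to one worker so its traffic is reconstructed together."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.table = {}                   # session key -> worker index
        self._next = count()              # round-robin counter for new sessions
        self.queues = defaultdict(list)   # worker index -> packets to process

    def dispatch(self, pkt):
        key = session_key(pkt)
        if key not in self.table:
            # Newly identified session: assign it to the next worker in turn.
            self.table[key] = next(self._next) % self.n_workers
        worker = self.table[key]
        self.queues[worker].append(pkt)
        return worker
```

In a deployed system the per-worker lists would be replaced by queues feeding separate processes or nodes. The point of the sketch is only that all packets of a session reach the same processor.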
4.2 File System Analysis
After RAID analysis, we have parallel independent files for
analysis. Standard search and analysis mechanisms will be used.

For unallocated storage, deleted files must be reconstructed across
the parallel processing system. This can be accomplished using
file carving techniques. File carving utilizes header and footer
values to determine which segments of data represent a file.
Entropy calculations are performed on the remaining data. Blocks
with similar scores will be treated as potential fragments of a
single file. Next, the selected blocks are taken, the file
fragments are reassembled, and the recovered files are validated
using mature and widely accepted tools such as “file”.

Slack space analysis will mimic unallocated space analysis.
However, if a complete file is found, it has been specifically
hidden and will be noted for the analyst. After file reconstruction
has been performed, analysis of slack space will follow standard
procedures, e.g., text searches.
5. FORENSIC DATABASE
All analysis results will be housed in the database, which
additional users can access with approval of the analysis initiator.
Users are authenticated by the database communications wrapper,
not the database itself. All transactions with the database will be
logged indicating the time, date, user, and transaction.
Communication between the database and the front-end and
back-end will be encrypted to maintain the security of the analysis.
6. VISUALIZATION FRONT-END
The analysis will begin with the user indicating the dataset to be
analyzed and what default analyses they want done. This will
populate the database for the analysis. The user can then allow
others access, if the analysis will be collaborative.
6.1 Visualization
Processing times for keyword searches (10-20 keywords in 200
GB) take days, and the analyst is overwhelmed with ‘hits’ [2].
For terabyte size data sets, hits will increase dramatically.

The Department of Homeland Security established visual
analytics for “analyzing terrorist threats, safeguarding borders and
ports, and preparing for and responding to emergencies” [9]; the
issues are similar for digital forensics. PDF’s visual analytics will
address visualization, analysis, and understanding of terabyte size
data sets, where the analyst will not be able to look at all data.
Data mining will prioritize the data for analyst consideration [2].

Visual analytics will extend visualizations, including Tree-maps
[8], Conetrees [6], Tilebars [5], Hypertrees [4], scatterplots,
timelines, histograms, and three-dimensional volumes. Brushing
and linking between views will enable discovery of relationships
between data perspectives.
6.2 Collaboration
Here we consider two forms of collaboration. In static
collaboration, the analysts are primarily working independently
and simply wish to occasionally bring the attention of their
colleague(s) to specific information. In interactive collaboration,
the analysts are working much more closely via shared views.
Shared views could also be used to train new analysts.
7. FUTURE WORK
This paper has outlined the design of PDF. The components
described here are in various stages of implementation, which are
not detailed here due to space restrictions. Extensions of
algorithms will continue with implementation as well to improve
PDF’s handling of RAID systems, parallelization, collaboration,
and visual analytics. Future papers will detail individual
components as well as PDF performance.
8. CONCLUSIONS
We have described our proposed system for improving support
for digital forensics of terabyte size data sets. We have
specifically dealt with issues related to RAID analysis targets,
parallelized analysis, collaboration, and visual analytics to reduce
the analyst burden for these large data sets.
9. REFERENCES
[1] Bair, A., Monroe, K., and Smith, J., Digital Forensic
Research Workshop 2006 File Carving Challenge Entry,
viewed Sept. 15, 2006,
http://www.dfrws.org/2006/challenge/submissions/bair/.
[2] Beebe, N.L. and Clark, J.G., Approaching the Terabyte
Dataset Problem in Digital Investigations, Advances in
Digital Forensics, Eds. Pollitt, M. and Shenoi, S., Springer,
2006.
[3] Brown, C.L.T., Computer Evidence: Collection and
Preservation, Charles River Media, Hingham, MA, 2005.
[4] De Souza, K.X.S., Dos Santos, A.D., and Evangelista,
S.R.M., Visualization of Ontologies through Hypertrees,
Proceedings of the Latin American Conference on
Human-Computer Interaction, 2003.
[5] Hearst, M., TileBars: Visualization of Term Distribution
Information in Full Text Information Access, Proceedings
of the ACM SIGCHI Conference on Human Factors in
Computing Systems (CHI), Denver, CO, May 1995.
[6] Robertson, G.G., Mackinlay, J.D., and Card, S.K.,
Conetrees: Animated 3D Visualizations of Hierarchical
Information, Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems, ACM Press, 1992.
[7] Roussev, V. and Golden, R.G., Breaking the Performance
Wall: The Case of Distributed Digital Forensics, Digital
Forensics Research Workshop 2004, Linthicum, MD, 2004.
[8] Shneiderman, B., Tree Visualization with Tree-maps: 2-d
Space-filling Approach, ACM Transactions on Graphics,
V11, I1, 1992.
[9] Thomas, J.J. and Cook, K.A., Illuminating the Path: The
Research and Development Agenda for Visual Analytics,
IEEE Press, 2005.