Semantically-enabled Digital Investigations
A method for semantic integration and correlation of digital evidence using a hypothesis-based approach.
Spyridon Dossis
Department of Computer and Systems Sciences
Degree project 30 HE credits
Degree subject (Computer and Systems Sciences)
Degree project at the master level
Autumn/Spring term 2012
Supervisor: Prof. Oliver Popov
Reviewer: Prof. Iskra Popova
Swedish title: Semantiskt digitala undersökningar
Abstract
Due to the continuous rise of security threats and increased sophistication and professionalism of
exploitation techniques, digital investigations are becoming an essential part of most information and
communication security processes and workflows. Digital investigations commonly combine a wide
span of different areas of expertise such as forensic analysis of storage devices, network
communications, OS artifacts, and logs from various host and network security appliances etc. The
complexity of analyzing all these data demands both time and considerable expertise from practitioners, while the bodies of knowledge of the individual fields remain disparate and disjoint. Although there is a plethora of tools and techniques employed during a digital investigation, the lack of integration and interoperability between them, as well as between the formats of their source and resulting data, hinders the analysis process. The sheer amount of data encountered in most cases requires a new (semi-)automated approach that can reduce the demands on both time and expertise and enhance manual analytical skills by enabling easier and more expressive integration and correlation of digital evidence.
The Semantic Web initiative is a framework composed of a number of standards and languages, conceived as a better way to automate machine-to-machine communication and to integrate heterogeneous sources of data, with a focus on the Web environment. The technology stack builds upon well-established markup languages such as XML and unique identification schemes such as URIs, extends them with expressive, schema-less data models such as RDF, and enables well-defined semantic description of a domain’s knowledge as well as automated inference of implicit knowledge encapsulated in any dataset. The stack is further complemented by rule and query systems that allow even richer manipulation and retrieval of information.
The thesis attempts to bridge these two disciplines and claims that distinctive advantages of the latter can solve challenges of the former. The thesis’s main research question is how a method based on existing Semantic Web technologies can be designed and implemented so as to automate processing tasks applied in the context of digital investigations, as well as assist the investigator in integrating, correlating and querying disparate sources of forensically relevant data and events, with the goal of a faster and more accurate reconstruction of a case’s events. The thesis follows the design science paradigm and reports in a structured manner the steps followed in order to design, implement and evaluate the proposed method.
Besides a detailed description of the proposed method, the thesis presents a prototype implementation of such a system, which is further demonstrated through experiments that simulate realistic cases of security compromises in a networked environment. Applying the method to these experiments demonstrates its feasibility as well as its considerable efficiency and its capacity to enable the investigator to formulate the hypotheses that arise during the analysis phase as queries that can span the multitude of collected data.
Keywords
Digital Investigation, Semantic Web, Data Integration, Evidence Correlation, Hypothesis-based
approach.
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Introduction
  1.1 Problem Description
  1.2 Justification, Motivation and Benefits
  1.3 Research Questions
  1.4 Audience
  1.5 Limitations
  1.6 Thesis Structure
Digital Evidence & Digital Investigations
  2.1 Digital Evidence
    2.1.1 Chain of Custody
    2.1.2 Order of Volatility
  2.2 Digital Investigations
    2.2.1 The Event-based Digital Forensic Investigation Framework
    2.2.2 The Digital Investigation Process
    2.2.3 The Scientific Method and the Hypothesis-Based Approach
  2.3 Forensic Tools & Integration Issues
    2.3.1 Tool Integration
    2.3.2 Data Representation
    2.3.3 Correlation-based Analysis Techniques
Semantic Web Technologies
  3.1 Semantic Web Foundations
  3.2 Semantic Web Architecture
    3.2.1 Uniform Resource Identifier (URI)
    3.2.2 XML / Namespaces
    3.2.3 XML Schema / XML Query
    3.2.4 RDF / RDF Schema
    3.2.5 Ontologies
    3.2.6 Rules / Query
    3.2.7 Top Layers
Semantic Web & Digital Investigations
  4.1 XML-based Approaches
  4.2 RDF-based Approaches
  4.3 Ontological Approaches
Research Methodology
A Framework for Semantically-Enabled Digital Investigations
  6.1 An approach for digital evidence integration, correlation and hypothesis evaluation based on Semantic Web technologies
  6.2 Relation to Digital Investigation Reference Models
  6.3 Evaluation Criteria
A Semantically enabled Method for Digital Evidence Integration, Correlation and Hypothesis Evaluation
  7.1 Description of the Method
  7.2 Ontological Representation of Digital Evidence
    7.2.1 Network Packet Capture Ontology
    7.2.2 Forensic Disk Image Ontology
    7.2.3 Windows Firewall Log Ontology
    7.2.4 WHOIS Ontology
    7.2.5 Malicious Networks Ontology
    7.2.6 Malware Detection Ontology
  7.3 Semantic Integration and Correlation of Forensic Evidence
    7.3.1 Semantic Integration
    7.3.2 Evidence Correlation
  7.4 Query Formulation and Evaluation
  7.5 A reference method implementation
    7.5.1 Overview of the tools used
    7.5.2 Architecture of the PoC system
Demonstration of the Method
  8.1 Description of the Experiments
  8.2 Integration and Correlation of Digital Artifacts
  8.3 Hypothesis formulation and evaluation
  8.4 Evaluation of the Method
Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
List of References
List of Figures
Figure 1: Overview of the Event-based Digital Forensic Investigation Framework, adapted from (B. D. Carrier & E. H. Spafford 2004)
Figure 2: Semantic Web Architecture (Antoniou & Van Harmelen 2004)
Figure 3: Overview of the Design Science Method (Johannesson & Perjons 2012)
Figure 4: Data Integration based on shared resource URI
Figure 5: Data Integration based on owl:sameAs
Figure 6: Semantic inconsistency as reported by the reasoning engine
Figure 7: Class membership entailment based on value restrictions
Figure 8: Conceptual relation between forensic frameworks and the Semantic Web stack
Figure 9: The abstracted method's structure
Figure 10: Ontological modeling of network packet captures
Figure 11: Ontological modeling of a forensic disk image
Figure 12: Ontological modeling of Windows firewall logs
Figure 13: Ontological modeling of WHOIS data as provided by RIPE
Figure 14: Ontological modeling of FIRE's blacklist of malicious networks/hosts
Figure 15: Ontological modeling of VirusTotal's anti-malware detection service
Figure 16: Transformation process of raw data to their semantic representation
Figure 17: De-duplication of data by semantic integration using URIs
Figure 18: Semantic Integration of related individuals represented in different ontologies
Figure 19: Integration of IP addresses/MD5 hash signatures
Figure 20: Conversion process according to the SWRL Temporal Ontology
Figure 21: Temporal relations of Allen's Interval Algebra
Figure 22: Mereological correlation between IP addresses and Autonomous Systems
Figure 23: SPARQL graph pattern matching
Figure 24: Proof-of-concept system architecture
Figure 25: Attack scenario of a ‘bind_tcp’ shellcode triggered by a malicious Word document downloaded from the Web
Figure 26: Attack scenario of a ‘reverse_tcp’ shellcode triggered by a malicious Word document downloaded from the Web
Figure 27: Visualization of the semantic representation of the evidence files
List of Tables
Table 1: List of Generic Criteria in terms of the GQM methodology
Table 2: List of Forensic related criteria in terms of the GQM methodology
Table 3: List of criteria with regard to the Semantic Web principles in terms of the GQM methodology
Table 4: Entities of the Network Packet Capture Ontology
Table 5: Entities of the Disk Image Ontology
Table 6: List of fields of Windows Firewall log entries
Table 7: Entities of the Windows Firewall Log Ontology
Table 8: Entities of the RIPE WHOIS Ontology
Table 9: Entities of the MalicousNetworks ontology
Table 10: Entities of the VirusTotal ontology
Table 11: Integration semantic mappings between ontologies
Table 12: Semantic Representation of the Experiment 1 Evidence Files
Table 13: SWRL Rule Evaluation Results for Experiment 1
Table 14: Semantic Representation of the Experiment 2 Evidence Files
Table 15: SWRL Rule Evaluation Results for Experiment 2
List of Abbreviations
AFF = Advanced Forensic Format
AS = Autonomous System
DLL = Dynamic Link Library
DNS = Domain Name System
FAT = File Allocation Table
FIRE = Finding Rogue Networks Project
FIWALK = File Inode Walk
GUI = Graphical User Interface
HTTP = Hypertext Transfer Protocol
IDS = Intrusion Detection System
ISP = Internet Service Provider
MACE = Last Modified, Last Access, Created and Entry Modified Timestamps
Malware = Malicious Software
MD5 = Message Digest Algorithm (version 5)
NIST = National Institute of Standards and Technology
NSRL = National Software Reference Library
OWL = Web Ontology Language
RDF = Resource Description Framework
RIPE NCC = Réseaux IP Européens, Network Coordination Center
SPARQL = SPARQL Protocol and RDF Query Language
SWRL = Semantic Web Rule Language
TCP = Transmission Control Protocol
W3C = World Wide Web Consortium
XML = Extensible Markup Language
Introduction
1.1 Problem Description
The field of Digital Forensics is facing an increasing number of challenges. Modern complex and
networked IT systems are under the constant threat of various and increasingly sophisticated types of
attacks ranging from reconnaissance attempts up to APT targeted attacks. Moreover, the constant
evolution of the technological landscape with the introduction of new products and technologies as
well as the large volumes of data present Digital Forensic practitioners with almost unmanageable
complexity in the timely accomplishment of their tasks. DF cases may require an integration of
evidentiary data from disparate sources such as hard disks, volatile memory, security appliances logs
and network communication. Additional problems may arise in the case of multiple parties being
involved in the analysis of a security event due to different levels of expertise or communication
problems that may arise from a lack of an agreed set of terms and definitions. A final side-problem concerns the presentation level, which is an important component of almost all DF investigation models. The results of a DF investigation have to be communicated to a court’s jury or an organization’s decision-making board in an understandable manner, avoiding, unless required, technical jargon or intricate terminology, without of course affecting the admissibility or the probative value of the presented evidence.
1.2 Justification, Motivation and Benefits
Existing tools and practices in the DF field are usually restricted, by architectural limitations, to a specialized subset of digital forensic collection and analysis needs, such as file carving, log analysis or network forensic analysis. These tools can also become quickly outdated by newly
introduced technologies or data formats. All the above can lead the forensic examiner to demanding
and lengthy periods of manual analysis for the purposes of integrating the outcomes of the various
tools as well as constant training in order to be able to deal with new problems.
The motivation of the present endeavor is the ability to automate and streamline the analysis and
reasoning process of any digital investigation. The addition of semantically expressed assertions
during the examination, analysis and presentation phases can improve the current DF investigation
models and techniques as well as promote new ideas in regards to how artifacts produced during
investigation can be represented, integrated and linked in novel and meaningful ways.
Automation of DF tasks can lead to multifaceted benefits both for DF practitioners and other relevant
parties. Lowering the barrier of entrance to less-experienced forensic examiners and a reduction of the
needed resources and time for manual analysis can improve considerably the efficiency and
capabilities of the various forensic units or CERTs. A commonly agreed representation layer of the
various artifacts and assertions produced during the investigation process can lower the dependence on
specific tools or vendors, while enhancing the ability to integrate sources of data of different natures into a
common analysis framework armed with advanced correlation and reasoning features. The addition of
a more expressive representation layer can enable the active involvement of other interested but less
technically educated parties (from the legal, academic or business areas) during the whole process.
Finally, a semantically enriched digital investigation can introduce new methods and techniques into various aspects of case handling, ranging from new forms of analysis such as conceptual searching, to refined data retention and case archiving policies, a provenance model for tracing the lineage of artifacts produced during the investigation process, and more accessible and compelling presentation forms for the reporting of findings.
1.3 Research Questions
In order to better clarify and delimit the research area, as well as provide a range of criteria based upon which the results can be evaluated, the following research questions have been defined.
 How can the Semantic Web technologies and the Linked Data initiative be applied to Digital
Forensics? A method that combines both areas has to respect the limitations and constraints of each, but also suggest advantages that can be attained by introducing and applying the concept of publishing and linking structured data both in the theoretical models and in the practice of the Digital Forensics field.
 How can a common ontology-based knowledge representation layer improve the level of integration of currently disjoint specialized areas of DF such as storage, network, mobile, live memory and others? A method that utilizes ontologies to represent commonly used concepts and entities pertinent to the area of digital forensic investigations may enable a computer-understandable and formalized expression format, integrate and aggregate data originating from different sources into higher abstraction levels, enable automated reasoning able to infer new knowledge, and promote reusability of existing knowledge bases.
 How may such a new method improve the efficiency and capabilities of existing DF investigation models, techniques and tools? A method proposing automated handling of specific parts of a DF investigation process should reduce the complexity and requirements of existing semi-automated methods, as well as propose new advanced capabilities that could improve the effectiveness of existing tools and processes.
1.4 Audience
The thesis attempts to connect two seemingly disparate fields, those of Digital Forensics and Semantic
Web. This thesis builds on top of some existing attempts to take advantage of key features of Semantic
Web technologies with the goal to advance automation and data integration in the area of digital
forensics. As such the thesis is expected to be of interest for both practitioners and researchers in the
areas of digital forensics and information security in general as it demonstrates a new efficient and
flexible method for improving the current status of analytical techniques and skills employed during
most types of digital investigations. The thesis should also be relevant to the research community of
the Semantic Web and Linked Data initiative as it demonstrates a practical implementation of such
technologies in a new field and discusses various challenges and possible solutions that such a system
may face and deploy.
1.5 Limitations
The Semantic Web is a union of various standards, languages and technologies extending well-known data encoding languages such as XML and introducing new techniques originating from fields such as Artificial Intelligence and Knowledge Representation. All these technologies are continuously improved and extended with new features, and they are increasingly supported by relevant tools and programming libraries. Due to the scope of this thesis, only a subset of all the available features is discussed and utilized. The method described in this thesis can, however, be further extended to take advantage of relevant advancements in all these technologies. Additionally, these technologies are not fully described in the respective sections of the thesis, but the references to the standards can be followed for deeper coverage.
Moreover, a digital investigation, especially one concerning system and network security compromises, can be quite complex in cases of large networks, advanced types of malware, sophisticated penetration techniques, etc. Additionally, a digital investigator may have to face a plethora of data sources that must be processed and analyzed in order to find traces and reconstruct such events. This thesis has selected only a subset of such data sources, ones that are commonly important in most digital investigations and cover a representative part of the spectrum. The experiments conducted in the thesis have been selected to resemble common scenarios of actual security compromises, although their complexity and the number of involved entities have certainly been reduced compared to real large-scale events.
1.6 Thesis Structure
The thesis has been divided into several chapters, following a top-down approach due to its interdisciplinary nature.
 Chapter 2 presents the theoretical background in the area of Digital Investigations. The Chapter is
divided into three parts, the first discussing the basic principles of digital evidence, the second
presenting the conceptual frameworks that guide digital investigations and the final one discussing
current problems and the challenges that the field faces.
 Chapter 3 is a short description of the Semantic Web initiative as well as the stack of technologies and languages it comprises.
 Chapter 4 presents related work that has also merged these two areas, partially or fully. The chapter is subdivided into three sections according to the layer of the Semantic Web Stack up to which this merging reaches.
 Chapter 5 discusses the methodology followed during this thesis with respect to the research
methods applied and the steps followed for the design, implementation and evaluation of the
method.
 Chapter 6 presents the framework of the proposed method. The chapter provides a high-level view
of how these two disciplines can be combined and with what advantages. The chapter ends with the
specification of a number of evaluation criteria for the proposed method.
 Chapter 7 presents the proposed method. The first part discusses its overall structure while the
second part presents the ontologies developed for this thesis. The latter parts discuss in more detail
the parts of the method that deal with the integration, correlation and querying of the source data.
The chapter finishes with the presentation of a proof-of-concept system developed to evaluate the
proposed method.
 Chapter 8 describes the experiments that were conducted in order to examine both the practical
value of the implemented system as well as the feasibility and potential of the proposed method.
 Chapter 9 concludes with an overall discussion of the proposed method and the final outcome of the
thesis as well as discusses some of the possible extensions and improvements that may be worthy of
further research and experimentation.
Digital Evidence & Digital Investigations
The subject of this chapter is a brief but concise definition and description of the digital forensics area,
the role of digital evidence as well as a presentation of prominent digital forensic process frameworks.
A short presentation of the different sub-topics of the field and various methods of analysis, as well as a discussion of tools and techniques, follows.
2.1 Digital Evidence
The central point of reference of every type of digital investigation is irrefutably the concept of digital
evidence. A multitude of different definitions can be found in the literature such as:
 ‘any data stored or transmitted using a computer that support or refute a theory of how an offence occurred or
that address critical elements of the offence such as intent or alibi ’ (E Casey 2002)
 ‘information stored or transmitted in binary form that may be relied upon in court’ (IOCE 2002)
 ‘digital evidence of an incident is any digital data that contain reliable information that supports or refutes a
hypothesis about the incident’ (B. D. Carrier & E. H. Spafford 2004)
As stated in (B. L. Schatz 2007), subtle differences exist between the definitions which are mainly
pertinent to the perspective of the claiming author or body, namely a legal or an investigative one.
Thus, some definitions focus on the investigative process and the support that evidence can provide to
hypothesis validation while others deal with the probative value of electronic data in the legal context.
Another important aspect of a definition of digital evidence is its scope. Due to the continuous emergence of new and innovative digital technologies, digital evidence nowadays can no longer be restricted solely to computers in their traditional form, but must also include new types of digital devices such as mobile phones, portable tablets, digital cameras, GPS devices, etc. Even the notion of digital evidence in its traditional computer-related form must be updated due to the introduction of new paradigms, such as virtualization and cloud computing, that promote the physical and logical abstraction and separation between the execution, computation and storage parts of a computer system.
Digital evidence is the central point of any digital investigation or forensics process and thus its
rigorous and authenticated handling is of paramount importance. Digital evidence can surely be
identified and interpreted from different abstraction layers, such as electrical charges in memory
transistors and electromagnetic waves in wireless networks from a physical point of view, or as bits &
bytes in registers and memory addresses from a computing perspective or as files & directories from
an OS perspective. Despite the different properties that digital evidence carries in each of these layers, it is the interpretation of this piece of evidence as information, and its relevance to the context of the current investigative process, that adds value to it and promotes it from mere data to information of evidentiary value.
Schatz (B. L. Schatz 2007) has identified three basic properties of digital evidence, namely latency, fidelity and volatility. Latency refers to the fact that a digital encoding in the form of binary data needs additional contextual information on how it should be interpreted. Fidelity is a property of digital data that allows a copy of it, assuming the integrity of the copying process is verified, to be treated as equal to the original. This is especially important in the DF area, where access to the original data must be restricted to exceptional circumstances only and be performed by competent personnel (ACPO 2007). Finally, the volatile nature of digital evidence considerably affects the practice of its acquisition and further processing, since its authenticity can easily be disputed unless proper and up-to-date procedures are applied.
Although the focus of the current thesis is on the latent nature of evidence and how it can be enriched with semantic content, fidelity and volatility are important enough to be encapsulated in two commonly cited forensic principles, Chain of Custody and Order of Volatility, which for completeness are briefly discussed below.
2.1.1 Chain of Custody
According to the U.S. National Institute of Justice, chain of custody is defined as “a process used to
maintain and document the chronological history of the evidence” (National Institute of Justice
2011). This involves documenting the names of all individuals involved in the collection, preservation and analysis of evidence, timestamps of any process applied to the data, as well as contextual information such as case numbering, involved agencies and laboratories, additional data regarding the individuals or entities involved in the case, and a brief description of each item. Another definition
provided by the Scientific Working Group on Digital Evidence and Imaging Technology defines
Chain of Custody as “the chronological documentation of the movement, location and possession of
evidence”.
Thus, anyone involved in the forensic examination process can be held responsible for failing to maintain the Chain of Custody due to bad practices or missing documentation. Maintaining the Chain of Custody, though, is confronted with a number of challenges due to the continuous growth in the complexity and diversity of digital systems, while at the same time most of the tools and techniques on which the investigator relies for preservation and analysis have not been assured with regard to the correctness and validity of their results (Guo et al. 2009). Such problems can affect the acceptability of digital evidence as genuine and reliable by society (Turner 2005), and especially in the business world, where data retention rules and relevant legislation imposed by governments require stringent procedures of data preservation and maintenance so that authenticity and integrity can be verified (Patzakis 2003).
In order to ensure the Chain of Custody, different techniques have been suggested and applied. The most common approach is the use of hash functions for performing integrity checks on the data, accompanied by time stamping of the performed forensic activities. A cryptographic hash of an entire disk image can be used to ensure that no modifications have been applied to the image between the acquisition and examination phases. In order to enable higher-level documentation of such images with metadata such as time of acquisition, serial numbers of the device, names of those performing the acquisition, etc., tool vendors such as EnCase, iLook and ProDiscover have introduced proprietary image formats to be used with their tools, with varying levels of inter-compatibility and openness. Open formats with advanced capabilities have also been suggested, such as the Digital Evidence Bag (DEB) (Turner 2005), which supports additional metadata on the history of operations performed on the image, and the Advanced Forensic Format (Cohen et al. 2009), which can support different types of evidence such as disk drives, network packets, memory images and extracted files, and also supports cryptographic operations and image signing.
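As a minimal, hedged sketch (not the thesis's prototype) of the integrity-checking and time-stamping practice just described, the following Python fragment hashes an image file in fixed-size chunks and appends a timestamped custody record; the file names, record fields and examiner identifier are purely illustrative.

```python
# A minimal sketch (not the thesis's prototype) of the practice described above:
# hash a disk image in chunks and append a timestamped chain-of-custody record.
# File names, field names and the examiner identifier are illustrative only.
import hashlib
import json
from datetime import datetime, timezone


def hash_image(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a cryptographic digest of an image file without loading it whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_custody_event(log_path, item, action, examiner, digest):
    """Append one custody entry: who did what to which item, when, and its hash."""
    entry = {
        "item": item,
        "action": action,            # e.g. "acquisition" or "verification"
        "examiner": examiner,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "digest": digest,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    image_digest = hash_image("suspect_disk.dd")        # hypothetical evidence file
    record_custody_event("custody_log.jsonl", "suspect_disk.dd",
                         "acquisition", "Examiner A", image_digest)
```

Re-running the hash at examination time and comparing it with the recorded digest is what allows any modification between acquisition and examination to be detected.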
2.1.2 Order of Volatility
As good forensic practice suggests, a complete acquisition of a copy of the entire target system is the
goal, in order for the examiner to have as accurate a picture of the system as possible. However, even the process of collecting the data itself, direct modifications caused by other users connected to the system, or “traps” that an attacker may leave behind in a compromised system may introduce unwanted
running processes, disks, and backup media) that comprise a modern computer, special attention has to
be given on how to prioritize and perform the data collection steps for each.
Order of Volatility is a forensic principle promoting the concept that data must be collected in an order
based on their volatile nature, proceeding from the most volatile to the least one. (Brezinski & Killalea
2002) suggests collection of data in the following order: registers, memory, process table, temporary
file systems, disk, remote logging and monitoring data, physical configuration and network topology
and finally archival media. Especially with the latest advancements in the area of live memory
forensics (Schuster 2010) and the capabilities that it provides for extraction of new types of evidence
previously missed such as command prompt history (Stevens & Eoghan Casey 2010), order of
volatility is becoming even more important.
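As a hedged illustration of this principle, the short Python sketch below encodes the collection ordering suggested by (Brezinski & Killalea 2002) and sorts whatever sources are available on a target system from most to least volatile; the source names and the planning function are illustrative only, and a real toolkit would invoke an acquisition tool for each selected source.

```python
# A hedged illustration of prioritising acquisition by volatility, following the
# ordering suggested by (Brezinski & Killalea 2002). The source names and the
# planning function are illustrative; a real toolkit would invoke an acquisition
# tool for each selected source.
ORDER_OF_VOLATILITY = [
    "registers",
    "memory",
    "process table",
    "temporary file systems",
    "disk",
    "remote logging and monitoring data",
    "physical configuration and network topology",
    "archival media",
]


def plan_acquisition(available_sources):
    """Return the available sources sorted from most to least volatile."""
    rank = {name: position for position, name in enumerate(ORDER_OF_VOLATILITY)}
    return sorted((s for s in available_sources if s in rank), key=rank.__getitem__)


if __name__ == "__main__":
    print(plan_acquisition(["disk", "archival media", "memory", "process table"]))
    # -> ['memory', 'process table', 'disk', 'archival media']
```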
2.2 Digital Investigations
The set of principles and procedures that are followed during the lifecycle of digital evidence, from
acquisition and preservation, to analysis and reporting are encompassed under the term of digital
investigation. (Kruse & Heiser 2002) has defined digital forensics as the “preservation, identification,
extraction, documentation and interpretation of computer media for evidentiary and/or root cause
analysis” while the Scientific Working Group on Digital Evidence (SWGDE) has defined computer
forensics as “a sub-discipline of digital & multimedia evidence which involves the scientific
examination, analysis, and/or evaluation of digital evidence in legal matters” (Scientific Working
Group On Digital Evidence 2011). Attendees of the first Digital Forensic Research Workshop have defined digital forensic science as:
“the use of scientifically derived and proven methods toward the preservation, collection, validation,
identification, analysis, interpretation, documentation and presentation of digital evidence derived
from digital sources for the purpose of facilitating or furthering the reconstruction of events found to
be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned
operations”. (Palmer 2001)
As seen, although both definitions cover the various steps of the forensic process, they vary with regard to the source of the evidence and its purpose. Terms such as digital forensics and computer forensics are commonly used interchangeably, although, as mentioned before, the introduction of new types of digital devices such as mobile phones, digital cameras, etc. may make the term computer forensics sound too restrictive or specialized. (B. L. Schatz 2007) presents an interesting description of how these terms have historically been used and evolved, and concludes that the practice of forensics in different contexts, such as the judicial, military or corporate sector, with their differing requirements regarding accuracy of results, performance in terms of time, the overall rigor of the process and its primary objective, has caused slight divergences between the definitions.
On the other hand, security incident response is a closely related term that is mostly used in the business domain, covering a vast and diverse range of security incidents. A security incident
has been defined as “any unintended activity that results in a negative impact on information security”
(Tan et al. 2003) or “a violation or imminent threat of violation of computer security policies,
acceptable use policies, or standard security practices” (Scarfone & Masone 2004). Incident response
thus can be defined as “the process that intends to minimize the incident’s impact, and investigates
and learns from such security breaches”. (Lamis 2010)
Commonly, the purpose of the security incident response is the reconstruction of security incidents,
eradication of any remaining vulnerabilities and recovery of the system to its normal operating status.
Despite the differences in the context and objectives between the two procedures, integration of proper
forensic techniques into the incident response field is gaining importance (Kent et al. 2006). Proper
documentation and evidence handling is equally important, especially in cases where attacker attribution and further legal prosecution are sought.
The generic term digital investigation can reflect the differences of focus and context between the
different fields and is used in the current thesis as encompassing both fields. The rest of the thesis focuses on cases of forensic analysis of networked computer intrusions and compromises in general, so research and practice from both fields are relevant.
2.2.1 The Event-based Digital Forensic Investigation Framework
The need to standardize and formalize digital investigations has led to the formulation of the conceptual tasks and actions that are performed in the context of a digital investigation. A useful framework, named the Event-based Digital Forensic Investigation Framework, has been introduced by Carrier and Spafford (B. D. Carrier & E. H. Spafford 2004). Such a conceptual framework can improve the understanding of the different phases of a digital investigation.
The aforementioned framework has been influenced by procedures followed in investigations of
physical crime scenes and extended so as to cover digital ones as well. A graphical overview of the
framework is presented in Figure 1.
The main phases of the framework are briefly presented below:
 The Readiness Phase covers operations readiness (training of people, methodology) and
infrastructure readiness (configuration of the infrastructure in a forensics appropriate preparation
manner, forensic readiness).
 The Deployment Phase includes the “detection, notification, confirmation and authorization phases”. The first two sub-phases deal with the detection and acknowledgment of the incident, while the latter two deal with the investigators being granted permission (e.g. a search warrant) to conduct the investigation.
 The Physical Crime Scene Investigation phase involves the examination of the physical scene of the
crime or incident. If any physical device containing digital data is identified and seized, this leads to the digital crime scene investigation phase.
 The Digital Crime Scene Investigation phase is comprised of three sub-phases, namely: System
Preservation & Documentation Phase, Evidence Searching & Documentation Phase and Event
Reconstruction & Documentation Phase. These sub-phases describe at a high level the main activities involved during the examination of the digital data and are presented in more detail below. It is important to underline the significance of the Documentation phase, since it appears in each one of them. It is also worth noting that the sub-phases of evidence searching and event reconstruction are conducted iteratively in an attempt to prove or disprove the hypotheses related to the events. Additionally, analyzed physical or digital evidence can lead to new instances of these phases for further acquisition and analysis.
 Finally, the Presentation phase is where the results of the analysis are presented, along with the documentation of all the actions performed throughout the process.
Figure 1: Overview of the Event-based Digital Forensic Investigation Framework, adapted from (B. D.
Carrier & E. H. Spafford 2004)
The phase that is of the most relevance to the present thesis is of course the Digital CSI one. Each one
of its three sub-phases is an essential part of a proper digital investigative process.
 In the System Preservation phase, an investigator is confronted with the challenge of not affecting the state of the analyzed system and thus not altering the stored digital data in undocumented ways. The state of the system is preserved by copying the data to other digital media. This process is quite different from those applied in the physical phase, and thus integrity checking of the end result is necessary (e.g. use of hashing functions). The principles of chain of custody, and especially the order of volatility as discussed above, should guide the process of preservation; otherwise the validity of the following phases can be challenged.
 During the Evidence Searching phase, the investigator searches through the preserved data for any
data of evidentiary value. This process is highly contextual since the target of interest is dependent
on the actual case. Different types of searching can be applied during this phase, such as keyword
searching, and the results of it can also be dependent on the investigator’s knowledge and previous
experience or the accuracy of the tools used.
 Finally, the Event Reconstruction phase is the process where the evidence detected during the searching phase is aggregated and correlated so as to improve the understanding of the incident and provide support for the proof or falsification of the initially formulated hypotheses.
2.2.2 The Digital Investigation Process
Although the previously presented framework can provide a conceptual foundation for most types of digital investigation, it is considered too abstract from a practical perspective. Various authors have
proposed process models of a digital investigation with subtle differences between them in the
terminology used or the granularity. A prominent process model is discussed below and its main
categories briefly explained.
(Casey 2004) describes a staircase-like investigative process model that can provide a methodical and
practical approach. The model consists of the following steps:
 Incident alerts or accusation: This step involves the initial reporting of a crime or a policy violation.
 Assessment of worth: The worth of investigating is estimated and in case of multiple cases in
parallel, a prioritization of them is performed.
 Incident/crime scene protocols: It includes the procedures and methodical steps that must be
followed by the investigator when accessing the incident/crime scene. The protocols may differ
depending on the real or virtual nature of the incident/crime scene.
 Identification or seizure: Any relevant object that could be of evidentiary value is recognized and seized. Proper packaging procedures for identification and linking to the specific incident/crime instance are applied.
 Preservation: This step involves all necessary case management tasks to protect the integrity of the original media and prevent any inadvertent modifications. It is the beginning of the chain of custody, established through documentation that will allow the object's origin to be traced to the final evidence. The step relies mostly on imaging technologies in order to acquire copies of the original object that are as exact as possible. A variety of solutions, such as specialized imaging hardware, write-blocking software, etc., is used to fulfill the task. The Order of Volatility has to be considered regarding what must be collected and in which order.
 Recovery: Prior to the analysis step, a recovery of any resident but not directly observable data has
to be performed. The most prominent example is that of data resident in a storage device which can
include deleted, hidden, camouflaged or fragmented parts. The completeness of the recovery step may allow later access not only to active data but potentially also to hidden and deleted data, thus providing access to the maximum possible amount of content and enabling the investigator to perform a much more complete analysis.
 Harvesting: The investigator identifies categories of data that, based on knowledge or experience, are most relevant to the case in focus. The results of the previous phase are organized in such a
manner so as to allow access to specific categories of data which are known to be relevant to
specific types of cases, e.g. pictures and videos in the case of contraband material or executable
files and scripts in the case of computer compromises.
 Reduction: During this step, filtering is performed based on relevant criteria in order to reduce the amount of data needed for the analysis. A common technique is the automated removal of known files that are part of operating systems or other applications. The signatures of these files, commonly in the form of a hash, can be stored in a database and used in combination with the forensic tools in order to remove unnecessary data (a hedged sketch of this kind of filtering follows after this list).
 Organization and Search: In order to facilitate a more thorough and complete analysis, groupings
of certain files, and of data in general, can be performed. This enables the investigator to access the data more easily, perform search operations that identify interesting data or events faster, and finally cross-reference data as well as the final reports.
 Analysis: This is the main task where the products of the previous steps are further evaluated for
their significance and probative value to the case. The main focus in this step is the content of the
data selected from the previous steps, as well as its interpretation in relation to the case at hand. The analysis part is usually loosely defined in the majority of digital forensics process models; a more detailed description of it and its subcategories follows below.
 Reporting: The final report should contain all the necessary documentation of actions and results
attained throughout all the previous steps. The report also contains the results of the analysis phase
with data of evidentiary value along with any conclusions drawn by the examiner. The examiner
should remain objective by presenting only the supporting evidence as well as other scenarios that
cannot be supported by the current evidence.
 Persuasion and testing: Often, the final result of an investigative process must be communicated to decision makers, outlining the incident and the conclusions of the investigation in a clear and understandable manner. Technical and other intricate details need to be transformed into an understandable narrative of the incident.
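As a concrete but hedged illustration of the Reduction step mentioned above, the following Python sketch filters out files whose hashes appear in a known-file reference set, in the spirit of the NIST NSRL; the directory name and the single reference hash are illustrative, and real deployments query large hash databases rather than an in-memory set.

```python
# A hedged sketch of the Reduction step: drop files whose hashes appear in a
# known-file reference set, in the spirit of the NIST NSRL. The directory name
# and the single reference hash are illustrative; real deployments query large
# hash databases rather than an in-memory set.
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reduce_dataset(root, known_hashes):
    """Return only the files that are NOT in the known-file reference set."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and md5_of(p) not in known_hashes]


if __name__ == "__main__":
    known = {"d41d8cd98f00b204e9800998ecf8427e"}  # MD5 of the empty file, as an example
    for remaining in reduce_dataset("./extracted_files", known):
        print(remaining)
```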
The Analysis part of the aforementioned process model can be further subdivided into five distinct tasks, which are discussed below:
 Assessment: Digital data can include human readable content that can be directly accessible and
interpretable by the investigator. The content can be evaluated in regards to its relation to the
context of the case such as the means and motivation of the attacker.
 Experiment: Due to the unique nature of each specific case and the explicit combination of
technologies involved both in the actual incident and in the investigation process, the investigator may be called upon to employ previously untested and untried methods and techniques. Detailed and rigorous documentation of such actions is of paramount importance, so as to enable their reproducibility and testability by, e.g., the courts in order to assess the admissibility of their results.
 Fusion: Data collected during an investigation comes in a multitude of different formats and from a
variety of sources. Each piece of information can provide an insight to a part of the incident. Data
have to be fused in order for the investigator to connect the different pieces of the puzzle and have a
better insight of the whole incident and its various parts. As an example, the various reported events
or actions can be used for the construction of a timeline of activities that were performed during the
investigated incident.
 Correlation: Correlation attempts to link the various events into causal relationships, such that an action A can be identified as the cause that leads to an event B as its effect. Correlation can be temporal, based on the chronological ordering of events, but can also rely on other contextual information such as connections between persons involved in the case (a minimal sketch of temporal correlation follows after this list).
 Validation: This includes the results of the analysis phase where the findings along with their
backing reasoning are collected and further submitted to the jury or other decision makers for
further actions such as prosecution.
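The following is a minimal, hedged Python sketch of the temporal side of the Correlation task described above: events from two different sources are paired when one occurs within a short window after the other. The event records and the 30-second window are invented for illustration; the thesis itself performs this kind of correlation later over semantically represented data.

```python
# A minimal, hedged sketch of temporal correlation between events from two
# sources: pair events whose timestamps fall within a small window of each
# other. The event records and the 30-second window are invented for
# illustration; the thesis later performs such correlation over semantic data.
from datetime import datetime, timedelta


def correlate(events_a, events_b, window=timedelta(seconds=30)):
    """Return (a, b) pairs where event b occurred within `window` after event a."""
    pairs = []
    for a in events_a:
        for b in events_b:
            if timedelta(0) <= b["time"] - a["time"] <= window:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    firewall_events = [{"event": "allowed outbound TCP connection to port 4444",
                        "time": datetime(2012, 3, 1, 10, 15, 2)}]
    disk_events = [{"event": "executable created under a temporary directory",
                    "time": datetime(2012, 3, 1, 10, 15, 20)}]
    for cause, effect in correlate(firewall_events, disk_events):
        print(cause["event"], "->", effect["event"])
```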
2.2.3 The Scientific Method and the Hypothesis-Based Approach
Although, in theory, a digital investigation process model such as the one presented above covers all aspects of the investigative process and can provide guidelines to the investigator, its implementation introduces new types of problems and challenges. Although such models appear linear, with each step following the previous one, in practice the steps are intertwined and not clearly separated. In order for the results of such processes to be assessed in terms of their validity, and therefore their admissibility as evidence especially in the legal context, evaluation criteria have been developed to judge them in scientific terms.
The most common criteria, followed by U.S. courts among others, are known as the Daubert standard (Wikipedia 2011). Briefly, these criteria evaluate the scientific value of the investigative process followed, along with its results in the form of evidence, according to the following:
 The theory or technique used can be (and has been) tested. In more general terms the theory or
technique must be falsifiable, refutable and testable.
 The technique has a known or potential rate of error, and standards and controls covering how the technique should be operated must exist and be maintained.
 The theory or technique should be subjected to peer-review and publication.
 The theory or technique should be generally accepted within the relevant scientific community.
Based on the above criteria, it can be seen that process models such as the above do not deal directly with important aspects of each step of an investigation, such as completeness, repeatability and reliability. The details provided for each part of the investigation lack consistency, and parts such as the analysis step in particular are quite ambiguous and abstract. The scientific method has been introduced as a simpler and more flexible methodology that promotes repeatability and testability in order to increase the reliability of the results. The main parts of the scientific method as described in (Casey 2004) are the gathering of facts and their initial validation, hypothesis formation and experimentation/testing, searching for evidence that supports or disproves the hypothesis and, finally, revising the conclusions in the light of new evidence.
(Carrier 2006) has described a hypothesis-based approach to digital forensic investigations where the
general scientific method is bridged to digital investigations as a process of formulating and testing
hypotheses about previous states and events. (Casey 2004) provides a more detailed description of the
various steps of the scientific method and how they can be applied.
 Observation: An event is observed either directly (a system does not perform as it should) or indirectly (a sensor has produced an alert of a possible security incident).
 Hypothesis: The investigator uses current facts about the incident along with experience and
knowledge in order to formulate a theory of what may have happened.
 Prediction: Based on the hypothesis and on previous knowledge, the investigator may predict where artifacts relevant to that event are likely to be located.
 Experimentation/Testing: The investigator analyzes the available evidence in order to test the hypothesis. The goal of the scientific method is not only to support the hypothesis but also to use the available evidence to falsify and eliminate other possible alternative explanations (a toy sketch of a hypothesis expressed as a query follows after this list).
 Conclusion: The investigator forms a conclusion based upon the result of the previous tests and the
conclusions are further communicated to the interested parties.
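Anticipating the thesis's later chapters, the following toy Python sketch (using the rdflib library) shows how a hypothesis such as "some connection was made to a blacklisted IP address" can be expressed as a machine-evaluable SPARQL ASK query over a small evidence graph; the namespace, class and property names are invented for this sketch and are not the ontologies defined later in the thesis.

```python
# A toy, hedged illustration of expressing a hypothesis as a machine-evaluable
# query with rdflib: "some connection was made to a blacklisted IP address".
# The namespace, classes and properties are invented for this sketch and are
# not the ontologies defined later in the thesis.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/forensics#")

g = Graph()
g.bind("ex", EX)

# Assertions that could (hypothetically) be extracted from a packet capture
# and from a blacklist feed.
g.add((EX.conn1, RDF.type, EX.TcpConnection))
g.add((EX.conn1, EX.sourceAddress, Literal("192.168.1.10")))
g.add((EX.conn1, EX.destinationAddress, Literal("203.0.113.7")))
g.add((EX.badHost1, RDF.type, EX.BlacklistedHost))
g.add((EX.badHost1, EX.ipAddress, Literal("203.0.113.7")))

# The hypothesis as a SPARQL ASK query: does any connection target a blacklisted IP?
hypothesis = """
ASK {
    ?conn a ex:TcpConnection ;
          ex:destinationAddress ?ip .
    ?bad  a ex:BlacklistedHost ;
          ex:ipAddress ?ip .
}
"""

result = g.query(hypothesis, initNs={"ex": EX})
print("Hypothesis supported by the evidence graph:", result.askAnswer)
```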
However, due to its empirical nature and its reliance on the investigator's knowledge for formulating better hypotheses or making more accurate predictions, the above method has to deal with various challenges. (Rekhis & Boudriga 2011) have enumerated some of them in the form of the following requirements:
 Formalization and proof automation is needed for the purpose of accuracy and practicality. In order
to reduce the analysis complexity and enable even less experienced or knowledgeable investigators
to deal with complex scenarios, a formalized and explicit representation of the investigator’s
knowledge and observations is deemed as necessary. Automation has the potential to decrease the
needed amount of analysis time.
 Integration of the entire investigator’s knowledge about the investigated systems. In order to enable
the reconstruction of the attack scenario and due to evidence coming from a variety of systems and
tools of different nature, integration is needed to allow the formulation of more complete
hypotheses and more accurate testing.
 Attack scenarios should also be represented in an expressive enough format so as to allow the
hypothesizing of complex or even novel and unknown types of attacks.
 Due to the fact that a digital investigation quite often has to deal with uncertainty, reduced visibility of the system's vulnerabilities, or collected evidence that cannot support a full understanding of the case, it is necessary to be able to reason under uncertainty and even automatically promote or filter hypotheses.
Various approaches have been suggested for (semi-)automated hypothesis generation and evaluation for the purpose of digital investigations. One of the proposed systems was introduced by (Stallard & Levitt 2003), where an expert system utilizing a decision tree attempts to identify data redundancies in traces of security incidents. The concept is to promote searching through the collected evidence for any possible contradictions. The authors claim that such redundancies can be found in various places such as file system metadata, log files and application-specific file formats. The expert system allows the digital investigator to hypothesize potential scenarios and rule out those that are contradicted by the evidence. In order to perform the integrity checking, the system's normal state must be specified a priori, something that is not easy in practice.
(Gladyshev & Patel 2004) proposed a finite state machine approach for formalizing the reconstruction of potential attack scenarios by discarding those that do not match the collected evidence. However, FSMs are not very expressive and thus have problems dealing with more complex cases. In a similar approach, (Carrier & Spafford 2006) proposed a computation model based on a finite state machine and a computer’s history. The model suggests that a computer has a history which is not fully known, and that the digital investigation process has to formulate and test hypotheses about the previous states of the system and the events that occurred. The model, though, has practicality issues due to its reliance on quite primitive states, such as the enumeration of all storage locations, and it is not clearly shown how more complex cases such as network intrusions can be handled.
(Willassen 2008a) focused on the evidentiary value of timestamp evidence and on how modifications or errors can affect the forensic analysis. The author suggests a solution of formulating hypotheses about skewed clock measurements and testing them against the observed evidence. In a following paper, (Willassen 2008b) improved the previous idea by modeling a set of actions and their effects on timestamps. Testing of the formulated hypotheses is thus performed by attempting to find a possible sequence of actions such that the order of the observed timestamp values matches.
Finally, in (Rekhis & Boudriga 2011) a logic-based approach is presented as an extension to Lamport’s Temporal Logic of Actions. The paper lists a set of requirements for a formal digital investigation of security incidents. According to it, a simplified model for defining attack scenarios is needed so as to be able to describe the underlying patterns of complex attacks and also cope with new, unknown ones. Secondly, a method for hypothetical reasoning is needed so as to be able to deal with incomplete knowledge of attack techniques or missing evidence. Thirdly, the authors suggest that the existence of a library of attacks can promote knowledge reuse and collaboration between digital investigators. Furthermore, modeling the evidence and integrating it across its different sources is of paramount importance, since each type of evidence is usually insufficient by itself to provide a full understanding of the attack scenario. Finally, hypotheses should be dynamically generated and prioritized based on their suitability.
2.3 Forensic Tools & Integration Issues
Forensic tools are playing a major role in modern digital investigation practice. Especially in the case
of complex security incidents, a multitude of different sources can provide evidentiary data such as
those collected from file system analysis on media devices (e.g. hard disks, USB flash drives),
network sources (IDS & Firewall logs, packet or flow captures) as well as new types of sources such
as live memory and mobile devices. The distributed nature of digital evidence calls for advanced
methods of tool interoperability and correlation of evidence. In (Garfinkel 2010), various challenges
that the area of digital forensics encounters today are enumerated and most of them are pertinent to
tools.
One of the fundamental issues is that the majority of modern tools have an evidence-oriented design, being specialized for acquiring or analyzing specific pieces of evidence, while at the same time focusing more on solving crimes where the computer is used as a repository of evidence for crimes targeting persons rather than on crimes committed with or against computers. These problems require considerable manual effort by the investigator, which becomes even harder considering the size of today’s data volumes. (Ayers 2009) considers modern storage media analysis tools such as EnCase and FTK to be first-generation tools, mostly suited for manual analysis and heavily limited in the analysis of large volumes of data, and concludes by proposing requirements for second-generation tools, among which efficient data acquisition and data representation are rated highly.
(Carrier 2003) has proposed a series of generic requirements for forensic tools with regard to their usability, comprehensibility, accuracy and verifiability. Carrier has suggested that a forensic tool can be seen as an interpreter of data between layers of abstraction. The tool handles a set of inputs and, based on a specified rule set, produces a set of outputs. During this process, two types of errors can be introduced. Tool implementation errors are flaws that prevent the tool from functioning as specified. Secondly, abstraction errors can be introduced when the tool’s representation of the system is inaccurate.
2.3.1 Tool Integration
Over the last years, the commercial forensic tools market has been marked by the emergence of
integrated forensic platforms such as EnCase and FTK. These vendors strive to integrate a variety of
tools and features in their platforms such as Navigation, Search and Presentation (Schatz 2007).
Navigation allows the visualization and exploration of the structure of the collected evidence, while
search allows easier identification of relevant data. Various searching techniques are offered with
keyword search being the most prominent but also more advanced ones such as regular expressions,
date ranges and by data type. Furthermore, these suites incorporate a variety of viewers in order to be
able to present different types of objects or even the same data object in multiple formats.
However, the proprietary and undocumented formats that these tools use impair further integration between tools, and especially with specialized open-source tools. On the other hand, the usage of open-source tools can promote the reliability of their results, since their code is open for review. Still, open-source tools are quite often focused on specific tasks and do not provide sophisticated solutions for documentation and for further blending into the digital investigation process.
In (S. L. Garfinkel 2010), a number of future directions that research in the area of forensics should pursue are mentioned. Some of them, related to integration issues, are the need for research on meta-carving techniques that can combine the results of multiple individual carvers, the need for a modularized, standardized architecture that can support additional features dynamically through plug-ins, and the need for multi-threaded, multi-server distributed processing as initiated by projects such as the “Distributed Environment for Large-scale inVestigations” (DELV) and the Open Computer Forensics Architecture (OCFA) (Vermaas et al. 2010).
2.3.2 Data Representation
How data is represented can be a key factor in improving the effectiveness of forensic tools and their capacity for integration. (Garfinkel 2010) prioritized forensic data abstraction as one of the most important future research directions, highlighting the need for a standardized set of abstractions and data formats covering the broad scope of possible data types encountered during a digital investigation.
One major initiative towards the goal of a unified, common data representation format has been promoted by the Digital Forensics Research Workshop (DFRWS). The goal was the definition of a standardized Common Digital Evidence Storage Format (CDESF), an open data format that can store both a copy of the digital evidence and related metadata. Unfortunately, due to resource limitations, the working group disbanded and the project halted. Other suggestions have been proposed, mostly by academia, without achieving widespread acceptance and adoption. Most of these suggestions are XML-based and will therefore be further discussed in Chapter 4 of the thesis.
2.3.3 Correlation-based Analysis Techniques
Although correlation has been described above as one of the most important steps of the analysis phase, current support for correlation analysis in most forensic tools is poor. (Case et al. 2008) have proposed a framework for supporting automated evidence discovery and correlation under the name of FACE. The goal of the framework is to automatically integrate and correlate data objects as contained in memory dumps, network traces, disk images, log files and configuration files. By correlating all of the above, the investigator is able to acquire a more complete view in the form of the involved users, groups, processes, files and networks.
(Abbott et al. 2006) improved on a previously described Event Correlation for Forensics (ECF) framework, with the goal of cross-correlating event information from log files collected from heterogeneous sources and identifying event scenarios from queries posed by the investigator. The log events, after canonicalization, can be stored in a relational database, allowing interactive or automated scenario identification based on these events.
In another attempt, (Garfinkel 2006a) performed automated correlation analysis on a large number of secondary-market hard drives for the purpose of detecting interesting information such as credit card numbers, social security numbers (SSN), email addresses etc. The author introduced the term Cross-Drive Analysis (CDA) for correlating information extracted from the drives and identified by pseudo-unique identifiers.
Event correlation has also been studied quite extensively in the area of Intrusion Detection Systems, with the objective of producing intrusion reports that capture a high-level view of the network activity by taking advantage of spatial and temporal properties of the alerts. Alerts are usually combined into meta-alerts, and false positives may be identified and discarded throughout the correlation process (Kruegel et al. 2005).
The main problem with the correlation techniques and approaches presented above is that they usually focus on single domains and utilize very specific characteristics. A digital investigator, however, has to deal with a variety of different domains, which makes event correlation even more difficult. Another challenge is the extensibility of such solutions, which usually employ their own devised event description languages that are weak on semantics, make it cumbersome to introduce new terms, and lack contextual information such as configuration information of the network and/or the system. As a conclusion drawn from this section, we can see that a more expressive and extensible data representation format for representing evidence, contextual information and attack scenarios can improve the analysis part of the investigation process by promoting tool integration and high-level correlation of events from disparate domains for the purposes of hypothesis evaluation of complex attack scenarios.
Semantic Web Technologies
This chapter gives a brief overview of the basic principles and mechanisms on which the Semantic Web is built. Its purpose is to provide a background to semantics and to show how the relevant technologies can provide solutions to knowledge representation and integration issues.
3.1 Semantic Web Foundations
Undoubtedly, the World Wide Web (WWW) is nowadays the most extensive distributed platform that has ever existed, realizing most of the goals initially envisioned by its creators for efficient information representation and inter-linking: publishing and accessing data at every scale, from personal notes up to enormous databases, advanced searching capabilities, and dynamic and personalized generation and presentation of data (Tim Berners-Lee et al. 1992).
The Web, though, confronts a variety of challenges today. The vast amount of information on the Web is designed for human consumption, with limited support for automated machine processing. Even when data are well structured and organized under database schemas, variations in the terminology used prevent machines from automatically ‘understanding’ the structure, giving rise to research fields such as information integration and schema matching. Another major problem concerns the quality of search engine results and of the retrieved content. Despite advancements in this area with a variety of heuristics, the keyword-based form of search has restrictions. The same word may convey different meanings in different contexts, or relevant content may have been expressed with different terms and thus not retrieved, reducing the accuracy of the search engine (Tim Berners-Lee 1998).
The aforementioned challenges and shortcomings of the current Web gave rise to the idea of the Semantic Web, which has been defined as follows:
“The Semantic Web is not a separate Web but an extension to the current one, in which information
is given well-defined meaning, better enabling computers and people to work in cooperation.” (Tim
Berners-Lee et al. 2001)
The first observation is that the term “information” is used rather than “data”. Data should be considered as mere symbols, whereas information objects are considered as collections of data along with the semantics that enable their correct interpretation (Haslhofer & Neuhold 2011). In order to better explain what a ‘well-defined meaning’ is, we need to introduce two important concepts, fundamental to the Semantic Web, namely metadata and ontologies.
Metadata is usually defined as “data about data”. A common example of metadata used over the Web is HTML tags. HTML tags such as <p> and <div> are associated with the content and describe how it should be presented to the user. In general, tags such as these have been used for the annotation of electronic documents in order to allow data-, presentation- or process-related semantics to be attached to the data itself. However, metadata can be of a dynamic nature, describing contextual or domain-specific information about the content. This dynamicity, though, may create serious challenges for automated processing, since the metadata may not have been explicitly defined and their semantic meaning has to be extracted with other approaches, such as linguistic ones (Dhamankar et al. 2004), with questionable results. (Haslhofer & Neuhold 2011) have categorized these interoperability issues into three main categories, namely technical, syntactic and semantic ones.
The term ontology can be defined as “an explicit and formal specification of a conceptualization” (Gruber 1993). Ontologies can be used to capture domain-specific knowledge and express it in the form of entities and their relationships. Knowledge representation can take place at both the conceptual and the assertional (instance) level and can express in a semantic manner aspects such as entities, attributes, domain vocabularies, factual knowledge and their interrelationships. An ontology can be considered similar to a database schema describing the structure and semantics of data, although with important differences, since ontologies enable automated reasoning over a set of given facts (Noy & Klein 2004). It is important to highlight that the Semantic Web relies on an open-world assumption, in which the absence of a specific statement does not imply that the statement is false, as it would under the closed-world assumption; this better reflects the partial or incomplete knowledge that is typical in the field of digital investigations.
This is the advantage of the Semantic Web approach over previous integration attempts: it allows reasoning over data by combining data and metadata along with their references to ontologies. This allows conclusions to be inferred from a given set of facts, thus making implicit knowledge explicit. The vision behind the Semantic Web is that intelligent agents will be able to utilize knowledge and reasoning in order to better understand their context, work autonomously and share information (Zhang et al. 2011).
3.2 Semantic Web Architecture
The Semantic Web has evolved over the last decade into a complex aggregation of different technologies, each one responsible for different aspects of the Semantic Web framework. (Antoniou & Van Harmelen 2004) have depicted the Semantic Web architecture using a layered approach, as shown in Figure 2. Such an approach enables a better understanding of the main functions of the different technologies, as well as of how the layers relate to each other and share results.
Figure 2: Semantic Web Architecture (Antoniou & Van Harmelen 2004)
A brief presentation of the various layers follows.
3.2.1 Uniform Resource Identifier (URI)
Since the Semantic Web is built on top of the existing Web architecture, URIs provide the foundation upon which the other layers are built. A URI is a “compact sequence of characters that identifies an abstract or physical resource” (Berners-Lee et al. 2005). URIs are used extensively in the Semantic Web so that each resource described can be uniquely identified and referenced. URIs enable unique identification of a resource under a global scope and a consistent interpretation, thus alleviating the interoperability problem of different resources being represented by the same name in different contexts. It is important to note that a URI does not necessarily imply access to the resource, as in the more familiar Uniform Resource Locator (URL) scheme, but can simply be used for denoting a resource.
Unicode is a standard promoting a consistent encoding and representation of text that supports most modern writing systems, thus enabling support for multi-lingual environments. An Internationalized Resource Identifier (IRI) is a form of URI that can contain characters outside the ASCII character set and is thus more useful in the modern Web’s internationalized context.
3.2.2 XML / Namespaces
XML stands for Extensible Markup Language. XML is a markup language that allows the encoding of arbitrary documents in a consistent tagged format so as to be both human- and machine-consumable. XML gains its extensibility from the fact that there is no predefined set of tags, thus allowing the construction and use of custom ‘tag’ sets per context. These tags take the form of either elements or attributes. XML actually only imposes specific rules regarding well-formed syntax and the validity of the resulting encoded document. XML’s contribution over the last decade has been tremendous and almost impossible to enumerate here: it has promoted shared domain-specific ‘tag’ sets between communities and the sharing of data in a consistent way in almost every imaginable field.
XML namespaces enable the qualification of element and attribute names used in XML documents by tying them to namespaces identified by URI references. This allows elements and attributes from different XML ‘tag’ sets to be included and used together, resolving possible syntactic conflicts where the same ‘tag’ names are used to represent different concepts, as the sketch below illustrates.
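To make the mechanism concrete, the following minimal Python sketch (using the standard xml.etree.ElementTree module) parses a small document that mixes two hypothetical vocabularies; the namespace URIs and element names are illustrative assumptions, not taken from any real schema.

import xml.etree.ElementTree as ET

# A hypothetical fragment in which two vocabularies both define a 'source' element.
doc = """
<case xmlns:fs="http://example.org/filesystem#"
      xmlns:net="http://example.org/network#">
  <fs:source type="disk image"/>
  <net:source address="10.0.0.1"/>
</case>
"""
root = ET.fromstring(doc)
# The parser expands each prefixed tag into {namespace-URI}localname, so the two
# 'source' elements remain distinct despite sharing the same local name.
for child in root:
    print(child.tag, child.attrib)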
3.2.3 XML Schema / XML Query
XML Schema is a markup language used to define the shared vocabularies or ‘tag’ sets discussed before. XML Schema is used to define the rules to which an XML document using the specified vocabulary must conform. XML Schema enables the definition of custom elements and attributes, along with support for describing a variety of primitive data types (e.g. string, date and numeric) and simple restrictions on the acceptable values.
Although XML Schema has been used extensively for the definition of shared vocabularies and for the automated exchange and parsing of XML documents, it lacks the expressive power to define the semantics of statements and the interrelationships of the various elements. Thus, although automated parsing of XML documents is well supported by most programming languages, the system does not really understand the concepts expressed, nor can it infer new information. XML mostly addresses the technical and syntactic interoperability problems discussed before, but provides very limited support for conveying semantic information.
XML Query (XQuery) is another technology that allows the extraction and manipulation of data from XML documents. As XML documents employ a tree-structured information model, XQuery provides the means to address specific parts of it. New standards under development will provide full-text search functionality as well as the capability to update XML documents.
3.2.4 RDF / RDF Schema
RDF stands for Resource Description Framework and has been introduced by the W3C as a standard for encoding arbitrary metadata. The motivation behind RDF was that the amount of data present on the Web has reached such a large scale that its management, and even more so its automated processing, has become extremely difficult. The concept of using metadata for describing resources could promote automated processing such as searching, referencing and clustering based on criteria such as content relevance. The problem with XML is that, although it is flexible in allowing the creation of custom sets of elements and attributes and the definition of syntactic rules that apply to them, the semantic relationships between the different XML elements cannot be expressed. Thus, differences in how elements are organized in the XML tree structure cannot be resolved automatically based on semantic relationships, but need custom transformations, such as XSLT, to be defined by the developer.
Whereas XML introduced a common data serialization format by promoting an interoperable and internationalized text encoding, RDF introduces a common data model that allows a consistent, simple, flexible and structured manner of encoding metadata concerning arbitrary resources. RDF is an assertional language with three fundamental components, namely resources, properties and statements. A resource can be literally any kind of concept, in any domain, about which assertions are to be made. In order to promote the integration of data over the Web, RDF requires that the name of a resource be global and uniquely identified by a URI. The same applies to a property, which must be identified by a URI and be further specified as an element of a formal vocabulary, which is the purpose of RDF Schema and ontologies as discussed later. A statement consists of just three components: a subject, a predicate and an object. The subject must be a resource and the predicate a property, as explained before. The object, though, can be either a resource or a constant value such as a literal. Statements of this form are commonly referred to as triples and can be represented by graph structures, with subjects and objects mapped to nodes and predicates mapped to the arcs connecting the nodes.
This data model is conceptually simple but flexible: new metadata can be bound to existing nodes and graphs can be merged, thus allowing easier integration of data. RDF provides additional features such as support for resources acting as containers (open sets) or collections (closed sets) of other resources, reification, where statements can be made about other statements, blank nodes for supporting anonymous resources when a specific resource does not need to be directly referenced, and support for typed literals so that constant values for objects can be associated with their types, such as those offered by the XML Schema datatype definitions.
RDF statements can be serialized in a variety of formats. The most supported and recommended syntax for applications is RDF/XML, due to the good tool support for XML processing: RDF statements are mapped into XML elements, attributes, element content and attribute values. Another popular format is Notation 3 (N3), which is a non-XML language but with enhanced readability and a more compact form. N3 provides support for more advanced features such as rule expression using variables and quantification. Finally, another serialization format, introduced by a collaboration of HP and Nokia, is TriX, which is an alternative XML-based language for RDF graph representation. All these formats allow the usage of XML namespaces, thus enabling the intermixing of resources from different vocabularies in the same graph or document. Although there are subtle differences between them, these formats permit the serialization of complex RDF graphs in a uniform and consistent manner so as to allow the interchange of data between applications.
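As a minimal illustration of the triple model and of the fact that one graph can be written out in several concrete syntaxes, the following Python sketch uses the rdflib library; the namespace and resource names are purely illustrative assumptions rather than part of any standard vocabulary.

from rdflib import Graph, Literal, Namespace, RDF

# Illustrative namespace; any URI scheme that uniquely names resources would do.
EX = Namespace("http://example.org/evidence#")

g = Graph()
g.bind("ex", EX)

# Each statement is a (subject, predicate, object) triple: subjects and predicates
# are URIs, while objects may be URIs or literal values.
g.add((EX.file42, RDF.type, EX.FileObject))
g.add((EX.file42, EX.filename, Literal("report.doc")))
g.add((EX.file42, EX.foundOn, EX.disk1))

# The same graph can be serialized in different formats without changing its meaning.
print(g.serialize(format="turtle"))
print(g.serialize(format="xml"))  # RDF/XML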
RDF Schema is a language for constructing simple RDF vocabularies and thus provides the
framework for the introduction of domain-specific classes and properties. RDF Schema provides the
following constructs:
• Class: A class is used to specify a domain-specific type of entity, similar to the concept of classes in object-oriented programming. rdfs:Resource is the root class and rdfs:Class is the class representing all classes.
• Property: Properties are the concepts that link instances of classes between them or with literal values; they are the predicates of the triple statements. Properties are instances of the rdf:Property class and their use can be restricted with the rdfs:domain and rdfs:range properties. The former defines instances of which classes can participate as subjects of this predicate in statements, while the latter defines instances of which classes, or which types of literals, can be the objects of this predicate in statements. The special property rdf:type is used to link resources with their class membership.
• Class and Property Relationships: RDF Schema allows the creation of simple hierarchies between classes as well as between properties. The special properties rdfs:subClassOf and rdfs:subPropertyOf allow the description of relationships between classes or properties. These relationships are quite similar to the ones found in the inheritance concept of object-oriented programming, but with an important difference: properties are defined under a global scope and are not encapsulated as members of a class, thus allowing the definition of new properties for an existing class without the class needing to be modified.
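A small sketch of how the RDF Schema constructs listed above might be used, again with rdflib; the class and property names form an illustrative, assumed vocabulary rather than an established ontology.

from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/evidence#")  # illustrative vocabulary
g = Graph()
g.bind("ex", EX)

# A small class hierarchy: every ExecutableFile is also a FileObject.
g.add((EX.FileObject, RDF.type, RDFS.Class))
g.add((EX.ExecutableFile, RDF.type, RDFS.Class))
g.add((EX.ExecutableFile, RDFS.subClassOf, EX.FileObject))

# A property whose domain and range restrict how it may be used in statements.
g.add((EX.filename, RDF.type, RDF.Property))
g.add((EX.filename, RDFS.domain, EX.FileObject))
g.add((EX.filename, RDFS.range, RDFS.Literal))

# An instance asserted only as an ExecutableFile; an RDFS reasoner could infer
# from the rdfs:subClassOf statement above that it is also a FileObject.
g.add((EX.malware_exe, RDF.type, EX.ExecutableFile))
g.add((EX.malware_exe, EX.filename, Literal("svch0st.exe")))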
3.2.5 Ontologies
As defined before, an ontology describes a domain of interest in a formal way: a formal specification of a conceptualization. An ontology allows the description of the various classes and objects pertinent to the domain, as well as the relationships between them. Although RDF Schema allows the expression of basic relationships in the form of properties and parent-child hierarchies, an ontology supports much more expressivity and complex relationships, such as value restrictions, class property restrictions, cardinality constraints, union and disjointness relationships, equality and more. These capabilities support ontological modeling and automated reasoning, which are the cornerstones of the Semantic Web initiative.
OWL stands for Web Ontology Language and is the most prominent ontology definition language used in the Semantic Web framework. The most recent version is OWL 2, which became a W3C standard in 2009 and has been defined as:
“The W3C OWL 2 Web Ontology Language (OWL) is a Semantic Web Language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL is a computational logic-based language such that knowledge expressed in OWL can be reasoned with by computer programs either to verify the consistency of that knowledge or to make implicit knowledge explicit.” (Hitzler et al. 2009)
Some of the basic constructs and capabilities of the language are briefly discussed below:
• Class: owl:Thing is the root of all classes and the base class of rdfs:Resource. owl:Class is used to define classes and is a sub-class of rdfs:Class. OWL classes should be considered as sets of individuals and describe formally the requirements for an individual to claim class membership. The set of individuals associated with a class is called the class extension. owl:Nothing signifies the empty set.
• Class Description
• Enumeration: owl:oneOf supports the description of a class by explicitly enumerating the allowed individuals. The class extension contains exactly the enumerated individuals, like a closed set.
• Property Restriction: Describes an anonymous class containing the individuals satisfying the restriction.
• Value Constraint: Defines constraints on the range of a property when it is applied to this particular class description, in contrast to the global scope of rdfs:range. There are two types of constraints that add requirements between properties and classes, namely owl:allValuesFrom, where all of an individual’s property values must be of the specified class type, and owl:someValuesFrom, where at least one of the property values must be of the specified class type. owl:hasValue describes the individuals that have at least once the specified property value, either in the form of an individual or a data value.
• Cardinality Constraint: Defines constraints on the number of semantically distinct values (individuals or data values) that an individual can have for a specific property. owl:maxCardinality sets an upper limit while owl:minCardinality sets a lower limit.
• Intersection, union and complement: These constructs can be seen as similar to the logical AND, OR and NOT operators, where a class extension can be defined as a combination of other class extensions. owl:intersectionOf includes only those individuals that are members of all the class descriptions in the list, while owl:unionOf contains those that are members of at least one of the class descriptions in the list. owl:complementOf includes the individuals that are not members of the class extension of the specified class.
• Class axioms: Class descriptions can be combined into class axioms in different ways, such as:
• Subclass: rdfs:subClassOf defines that the class extension of a class description is a subset of the class extension of another class description.
• Equivalency: owl:equivalentClass defines that two class descriptions have the same class extension.
• Disjoint: owl:disjointWith defines that the class extensions of two class descriptions do not have any common members.
• Properties: There are two main types of properties: object properties, which link individuals to individuals, and datatype properties, which link individuals to data values.
• Property Relationships: owl:equivalentProperty denotes that two properties share the same property extension (the subject-object pairs of property statements) without this meaning that they are the same property. owl:inverseOf denotes that a property has as its domain the range of another and vice versa.
• Global Cardinality Constraints: owl:FunctionalProperty defines that a property can have only one unique value for each instance, e.g. a person can have only one biological mother. Conversely, owl:InverseFunctionalProperty signifies that the object of a property statement uniquely determines the subject, e.g. an individual person uniquely identifies another individual as being the biological mother of the former.
• Logical Characteristics: owl:TransitiveProperty defines that if the object of one statement and the subject of another use the same transitive property, then the subject of the first can also be associated with the object of the second over the same property. A symmetric property defines that, from a statement using this property, a new statement can be inferred with the subject and object roles swapped.
• Individual Identity: It is important to highlight that OWL does not follow the so-called “unique names” assumption, allowing e.g. multiple URI references to refer to the same individual. The owl:sameAs property then allows a link to be established between two individuals having the same identity but different references. owl:differentFrom explicitly states that two references refer to different individuals, while owl:AllDifferent allows the description of a list of individuals that are pairwise different from each other.
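The following sketch shows how a few of the OWL constructs above are asserted as ordinary RDF triples, using the same illustrative vocabulary as before; the reasoning step itself would be carried out by a separate OWL reasoner and is only indicated in the comments.

from rdflib import Graph, Namespace, RDF
from rdflib.namespace import OWL

EX = Namespace("http://example.org/evidence#")  # illustrative vocabulary
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Two classes declared equivalent: their class extensions are the same.
g.add((EX.FileObject, RDF.type, OWL.Class))
g.add((EX.File, RDF.type, OWL.Class))
g.add((EX.FileObject, OWL.equivalentClass, EX.File))

# A functional property: each file has at most one md5 value.
g.add((EX.md5, RDF.type, OWL.FunctionalProperty))

# No unique-names assumption: two URIs may denote the same individual,
# which has to be stated explicitly with owl:sameAs.
g.add((EX.userAlice, OWL.sameAs, EX.account_jdoe))

# An OWL reasoner (for example one based on the owlrl package) could then
# propagate statements across the equivalent classes and sameAs-linked individuals.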
As can be seen, OWL can be quite expressive, with the side-effect, though, of increased complexity of the reasoning process. Indeed, usage of all the features of OWL can make the reasoning process so complex as to be computationally infeasible. In order to deal with this, the designers proposed two different ways of assigning meaning to an ontology: the Direct Model-Theoretic and the RDF-based Semantics. The former leads to the OWL DL dialect, which adds restrictions to the expressiveness of the language so as to be compatible with Description Logic and its proven computational completeness and decidability, while the latter leads to OWL Full, with maximum expressiveness but the risk of undecidability in its reasoning process. OWL 2 further enhanced this idea with the introduction of three sublanguages (profiles), EL, QL and RL, which are syntactic subsets of the OWL 2 language with varying tradeoffs between computational complexity (e.g. logarithmic or polynomial) and expressiveness, in order to better accommodate different needs with regard to the size of ontologies and reasoning power.
3.2.6 Rules / Query
With the support of RDF Schema and OWL, semantics of concepts relevant to the domain of interest
can be expressed and based on these semantics, reasoning within ontologies and knowledge bases can
be performed. In order to provide support for the description of rules that cannot be expressed with the
available constructs of these languages, rule languages are also being standardized by the Semantic
Web community. SWRL stands for Semantic Web Rule Language and is intended to become the main
rule language of the Semantic Web framework. SWRL is based on OWL-DL and its rules are
expressed in terms of OWL concepts (classes, properties and individuals) in the form of an antecedent
and a consequent.
SWRL can support some more advanced use cases that reasoning over OWL alone either cannot support or supports only in a cumbersome way, such as:
• Reclassification: if an individual is a member of a specific class extension then it must also be a member of another class’s extension.
• Property value assignment: if specific conditions hold then a new statement can be inserted about a subject with a specific predicate and object.
• Provision of built-in expressions: checks on literal values such as comparisons (equal, lessThan, greaterThan), math operations (add, subtract, multiply, divide), string operations (substring, contains, startsWith), date and time operations and list operations (listIntersection, length).
The main problem that may arise with the use of SWRL is that the added expressivity can affect computational decidability. It is worth noting that, as OWL follows the open-world assumption, the absence of assertions about an individual cannot, for example, preclude it from being a potential member of a class extension. As such, SWRL cannot support negation in rules, since negation cannot be interpreted as failure but only as a potentially temporary lack of knowledge. Furthermore, the unique-names assumption is also not adopted, so individuals must be explicitly stated to be different. Finally, due to the above, some features are not supported, such as retraction (where the consequent modifies a resource present in the antecedent so that the rule can be applied again) and counting.
In order to provide support for the retrieval of information from RDF data as well as from RDFS and OWL ontologies, a query language was needed. SPARQL, which stands for SPARQL Protocol and RDF Query Language, is an SQL-like language that uses RDF triple patterns to express queries and return results. SPARQL is not only a query language but also defines a protocol on how to access RDF data.
SPARQL supports four main variations of queries as follows:
• SELECT Query: Extracts raw values from a SPARQL endpoint (a SPARQL protocol service that enables users to query a knowledge base using SPARQL) and presents the results in a tabular format.
• CONSTRUCT Query: Same as SELECT, but the results are transformed into valid RDF.
• ASK Query: Supports Boolean-type queries, e.g. about the existence of an individual in the knowledge base.
• DESCRIBE Query: Returns an RDF graph as a result; the query processor is responsible for deciding how to structure this RDF data in case the client does not define a query result pattern.
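As a brief illustration, the following sketch runs a SELECT query over a small in-memory rdflib graph; the vocabulary is again an illustrative assumption and, in a real deployment, the same query text could instead be sent to a remote SPARQL endpoint.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/evidence#")  # illustrative data
g = Graph()
g.add((EX.file42, RDF.type, EX.FileObject))
g.add((EX.file42, EX.filename, Literal("report.doc")))
g.add((EX.file43, RDF.type, EX.FileObject))
g.add((EX.file43, EX.filename, Literal("payload.exe")))

# A SELECT query returning tabular results; the triple patterns bind the
# variables ?f and ?name against matching statements in the graph.
query = """
PREFIX ex: <http://example.org/evidence#>
SELECT ?f ?name
WHERE {
    ?f a ex:FileObject ;
       ex:filename ?name .
}
"""
for row in g.query(query):
    print(row.f, row.name)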
Recent research is being done in the area of providing support to SPARQL for federated queries where
the query author can direct specific portions of a query to particular SPARQL endpoints and the
results will later be combined with the other parts by the federated query processor. Such features will
allow information aggregation from disparate sources in a single query thus improving greatly both the
quality and the amount of results.
3.2.7 Top Layers
The top layers are still under research and no technology specification or languages have been
standardized so far. According to (Al-Feel et al. 2009) the logic layer is supposed to provide “the
answer for the question of why this piece of information is taken or appear to the user” while the proof
layer should be able to provide deductive reasoning on why the results should be accepted. Formal
proofs along with the vertical layers of security features for authentication of the origin of data and
protection of the confidentiality and integrity of the data will enable software agents or users to
actually trust the results.
Semantic Web & Digital Investigations
The purpose of this chapter is to present in more detail the existing work that is most relevant to the present thesis. Although the previous two chapters presented the two different areas in isolation from each other, various solutions combining them have also been proposed and implemented. Most of this work is motivated by the fact that the information aggregation and automated reasoning capabilities offered by Semantic Web technologies can provide a solution to the problems that digital investigations confront with respect to the large volume of data and their disparate origins.
4.1 XML-based Approaches
One of the major current attempts to introduce the XML markup language to the area of digital investigations is DFXML (Digital Forensics XML) (S. L. Garfinkel 2009). The project intends to provide a standardized XML vocabulary for representing metadata of a disk image (e.g. filename,
acquisition date, device info) as well as information about the content of the disk image such as
addresses and lengths of the “byte runs” (file fragments) of the resident files along with their
cryptographic hash values as well as operating system specific information such as Microsoft
Windows Registry entries. The DFXML provides various intuitive forensics-related XML tags such as
<source>, <imagefile>, <acquisition_date>, <volume>, <fileobject>, <byte_run>, <hashdigest>,
<filesize> etc. Garfinkel promotes the DFXML format by introducing a variety of tools that can either
produce or consume DFXML for various purposes. Fiwalk is a tool that produces DFXML describing
files in a disk image, depending on the Sleuth Kit for the actual interaction with the disk image.
Fiwalk provides, though, an abstraction layer over the complexity of the Sleuth Kit, thus allowing new features such as reporting differences between two disk images based on their DFXML representations, easy removal of known files based on a “redaction file”, or even removal of personally-identifiable information for privacy reasons. The format is gaining support from other tool authors, such as the PhotoRec and Scalpel carvers, as well as from existing digital evidence container formats such as AFF4 and EWF, which can output disk-image-related metadata in DFXML format.
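To give a flavour of how such output can be consumed programmatically, the following Python sketch parses a small XML fragment loosely modelled on the DFXML element names mentioned above; the fragment is hand-written for illustration and does not reproduce the exact structure or namespace of real fiwalk output.

import xml.etree.ElementTree as ET

# Illustrative fragment only; element names follow the DFXML-style tags listed above.
doc = """
<volume>
  <fileobject>
    <filename>report.doc</filename>
    <filesize>20480</filesize>
    <hashdigest type="md5">9e107d9d372bb6826bd81d3542a419d6</hashdigest>
  </fileobject>
</volume>
"""
root = ET.fromstring(doc)
for fo in root.iter("fileobject"):
    name = fo.findtext("filename")
    size = int(fo.findtext("filesize"))
    md5 = fo.findtext("hashdigest")
    print(name, size, md5)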
Other approaches have focused on specific operating system or application artifacts that are quite important in the context of a digital investigation. The hivexml project allows the extraction of Registry entries from hive files in an XML format, using a simplistic tag set with <hive>, <node> and <value> elements and key, type and default attributes. The Electronic Discovery Reference Model (EDRM) XML project is an XML-based format with the goal of improved data interchange regarding metadata relevant to e-discovery cases. This XML schema focuses on metadata fields of documents, such as Microsoft Office files, and of email messages, thus defining elements and attributes such as Email Author, Email Subject, Doc Identifier, Full Folder Path etc. The TrID project provides a utility that can identify file types based on their binary signatures. The utility can also output its results in an XML format, where the respective XML schema defines elements such as <FileType>, <Ext>, <Pattern> etc.
A generic XML-based framework by the name of XIRAF (XML Information Retrieval Approach to Digital Forensics) has been suggested by (Alink et al. 2006). The authors have underlined the
importance of having “a clean separation between feature extraction and analysis” and have proposed
a single XML-based output format for forensic analysis tools. The concept behind XIRAF is that most
forensic analysis tools operate on large binary objects (BLOB) and based on their specific
functionality they can either extract specific features, such as log files from a disk image or also
generate new BLOB content such as unzipping a compressed archive. The outputs of these tools are
then wrapped in an XML format and stored in an XML-aware database. The investigator can then use
the XQuery language to submit queries to this database and extract information about the case and the evidence. In order to evaluate the proposed framework, the authors have provided concrete examples of advanced forensic capabilities, such as a timeline browser over XML fragments produced by different tools (file-system metadata, chat logs, EXIF picture metadata etc.), as well as advanced types of search, such as searching on specific metadata criteria or comparing against a hash set of known contraband material.
(Levine & Liberatore 2009) proposed the DEX (Digital Evidence Exchange) format, which can also wrap the output from various tools in an XML-based representation, but with the additional capability of encoding the specific instructions given and the sequential order of the tools used. This can be used to provide provenance-related metadata on the sequence of actions that led to the specific extracted artifacts, and even enables comparisons between different investigation processes and their results. DEX defines a set of XML elements similar to those present in DFXML, such as <DiskImage>, <Volume>, <File> and <CommandLine>.
In a similar approach, (Lee et al. 2010) have proposed an XML-based data collection framework for
live forensics purposes. The authors have defined three main information object schemas using XML
Schema about live data, target software and Windows related objects. The live data schema contains
information about the system (e.g. running processes), the user (e.g. Windows account) and the
network (e.g. IP address, active connections, executables behind open ports). The target software schema can describe application-specific information such as email accounts, web browsing history, instant messaging logs etc. The Windows information schema focuses on Windows-related artifacts such as
installed software, hardware configuration, user accounts and more. The authors also propose an architecture consisting of four main components: the scenario type analyzer, the data collection block, the report manager and the case database. The scenario type analyzer’s purpose is to specify the types of forensic artifacts most relevant to the case at hand. The data collection block is responsible for the actual use of various tools for collecting data and for representing them in the XML format following the previously discussed XML schemas. The report manager presents the results of the previous step and can support multiple different views, depending on the needs, by means of XSLT transformations. Finally, the case database can archive the XML documents produced from previous cases, which can then be used for mining case-handling patterns through data mining techniques.
XML approaches have also been followed in the network forensics field with PDML and PSML
languages being markup languages for describing packet details and packet summaries respectively.
The PDML language is an XML-based markup language for providing a detailed view of a packet and
thus including elements such as <packet>, <proto> and <field> where the field element can support
attributes such as name, value, size and pos. PSML, on the other hand, expresses in an XML format the summary view of a packet. The elements provided by the respective schema are <structure>, <packet> and <section>. PSML is quite flexible and mostly focused on the visual representation of the summary view; thus the provided constructs are quite abstract and unstructured.
A plethora of other XML formats have also been defined in the computer and network security area in general, such as MANDIANT’s Indicators of Compromise, an XML-based language for describing signatures of malware, such as the existence of specific files or registry entries, with support for conditional logic such as combining indicators with AND and OR clauses. MITRE has defined a plethora of XML languages for various purposes: the Open Vulnerability and Assessment Language (OVAL) for describing systems’ configuration information, the current machine state with regard to vulnerabilities and patch states, and the reporting of assessment results; the Common Event Expression (CEE), which attempts to standardize the representation of logs through the definition of a common dictionary of event fields and an event expression taxonomy for the classification of different event types; and the Malware Attribute Enumeration and Characterization (MAEC) language for encoding malware-related information such as artifacts, behaviors, payloads, propagation mechanisms and type of malware. The Common Attack Pattern Enumeration and Classification (CAPEC) language provides a schema for describing common attack patterns, such as their execution flow, related weaknesses (see CWE), related vulnerabilities (see CVE) and methods of attack.
As we can observe, there is a plethora of different XML formats covering different aspects of digital investigations. Although all these languages promote a standardized format and tool interoperability, they fail to convey the semantic content of what they represent. As such, real understanding and automated processing by a software agent cannot really be performed, since the semantic interrelations of all these elements are not expressed. As an example, a software agent, although able to parse all these different XML documents, is not able to infer that the <FileObject> concept found in a DFXML document is equivalent to the <File> concept found in DEX, or that the <proto> element in PDML is equivalent to the <Protocol> element defined in XLive. Therefore, although XML-based approaches can assist in establishing a common set of terms that can be both read and written by different tools, the support needed for an intelligent agent to perform automated reasoning based on relationships between these terms, such as packets targeting a specific port described in PDML and the process behind that listening port as expressed in XLive, is lacking.
4.2 RDF-based Approaches
Despite the ability of RDF to express arbitrary metadata, it has so far found limited usage in the area of digital investigations. The most prominent practical use of RDF is as the basic information model of the AFF4 forensic format. The Advanced Forensic Format (AFF) was introduced by (Garfinkel 2006b) as a file format and container for storing digital evidence. The container includes both a copy of the original data and arbitrary metadata. Metadata can include system-related information, such as device-related data, or user-specified information, such as the name of the examiner. New requirements posed by practitioners, though, such as support for distributed forensic processes, forced a revision of AFF to its latest version, AFF4 (Cohen et al. 2009). In a subsequent paper, (Cohen & B. Schatz 2010), the authors present in more detail the reasoning for their choice of RDF. The authors argue that RDF was an important improvement over the previous ad-hoc serialization protocol used for metadata, due to the ability to use standard libraries for generating and parsing RDF statements, as well as the additional support for creating attributes with standard or custom types instead of only string types as before.
(Giova 2011) builds on this idea, arguing that by introducing new RDF concepts such as evidence access information (e.g. date and geographical location of access), examiner information (e.g. name, institution and role), as well as information about the artifacts produced during the investigation process (e.g. name of artifact, date, action), the chain of custody can be significantly improved. Such additional metadata, with respect to who accessed the evidence and when, from where, and what actions were performed with it, can be considerably important in cases of remote or distributed investigations and can increase the admissibility of the evidence in court.
4.3 Ontological Approaches
One of the first attempts to bridge digital investigations with the semantic web has been the
introduction of the FORE (Forensics for Rich Events) architecture by (Schatz et al. 2004a). The
architecture was composed of a forensic ontology, a generic event log parser and a custom-defined
rule language, FR3. The forensic ontology was based on two main concepts, the Entity for
representing tangible objects and the Event for representing the entities’ state changes over time. In
order to represent causal linkage, an ObjectProperty linking individuals of the Event class was defined.
A set of regular expression based parsers has been developed that could wrap low-level events into
individuals of the specific classes of Event described in the ontology. These semantic representations
of the events are then added to the knowledge base. Rules expressed in the FR3 correlation rule
language specified by the authors are evaluated against the knowledge base in order to compose
higher-level events out of combinations of lower level ones and add new causality links between
events based on the existing domain knowledge. The authors further demonstrated the possibilities of
such an approach by example cases where the automatic inference of higher level events could lead to
more meaningful and important notification alerts sent to the investigator or allow the investigator to
formulate hypotheses by using the owl:sameAs equivalence property to connect different individuals
that represent the same “object”. Finally, navigation through the causal links between events could promote a search-oriented investigation that could accelerate the whole process significantly by making it easier to locate events of forensic interest.
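The hypothesis-linking mechanism can be illustrated with a small rdflib sketch; the individuals and the namespace are hypothetical, and the example only mirrors the general idea of asserting owl:sameAs between individuals, not the actual FORE ontology or FR3 rule syntax.

from rdflib import Graph, Namespace, RDF
from rdflib.namespace import OWL

EX = Namespace("http://example.org/fore#")  # hypothetical namespace
g = Graph()

# Two individuals extracted from different logs that may denote the same host.
g.add((EX.host_10_0_0_5, RDF.type, EX.Entity))
g.add((EX.workstation_LAB1, RDF.type, EX.Entity))

# The investigator's hypothesis: both identifiers refer to the same machine.
# Under OWL semantics a reasoner would then treat statements about either
# individual as statements about the other, merging their event histories.
g.add((EX.host_10_0_0_5, OWL.sameAs, EX.workstation_LAB1))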
In a subsequent paper, (Schatz, Mohay & Clark 2004b) demonstrated the ability to reuse concepts
defined in various disparate domain-specific ontologies such as the Network Entity Relationship
Diagram about networked hosts and the SOUPA ontology about context-aware pervasive computing
environments for a much more generalized event aggregation. By taking advantage of the OWL
capabilities of referencing external ontologies, the authors were able to express correlation rules that
combined concepts and events from disparate domains such as Windows login events, SAP login
events and physical door access logs.
Other approaches have focused on the ontological modeling of various aspects of digital investigations. (Brinson et al. 2006) presented a cyber-forensics ontology with a focus on the major participants in a digital investigation, along with their specific roles and specializations. As such, the ontology has the two main concepts of technology and profession at the top level, further analyzed into concepts such as hardware, software, law, academia, military, private sector and education. The ontology provides a conceptual mapping of all the parties involved in a digital investigation and could be used as a guideline for further curriculum development and the pursuit of certification.
(Kahvedzic & Kechadi 2009) have presented a Digital Investigation Ontology (DIALOG) that
conceptualized different aspects of the digital investigation process focusing more on the actual case
and the evidence. As such, the “DigitalInvestigationConcept” was further divided into the four high-level concepts of “CrimeCase”, “Information”, “InformationLocation” and “ForensicResource”. The “CrimeCase” was further sub-classed into concepts representing specific types of crimes, while the “ForensicResource” was further refined into the “ForensicServiceObject”, representing various supporting resources during the investigation, and the “ForensicSoftwareObject”, covering taxonomically the different forensic software and their roles in the investigation process. The
“Information” concept was further analyzed into different types of data that can be retrieved and be
relevant to an investigation such as “DataObject”, “FileObject”, “SoftwareObject” as well as
collective concepts relevant to investigations such as “UserActivityEvidence”,
“SystemConfigurationEvidence” etc. The authors further refined their ontology to model forensic knowledge related to the Windows Registry data structure and exemplified it by using the ontology along with SWRL-based rules for automatic aggregation over registry data, such as composing higher-level events like software installation from correlations between registry keys and registry snapshots.
In continuation to the former approach, (Kahvedžić & Kechadi 2011) have demonstrated concrete
examples of using a blank ontology for encoding results retrieved from forensic tools with the ability
to encode various forensically relevant types of data such as metadata, content and events and their
relationships such as the author of a document, the persons and location depicted in an image file or
the person that performed an event. The authors suggest automated inference of the category of every instance based on its properties, its mapped concept and its resulting place in the concept hierarchy. Finally, the ontology query language SQWRL can be used for evaluating queries against
the individuals for extracting additional information of evidentiary value.
In a similar approach, (Saad & Traore 2010) propose an ontology, represented in OWL, with network forensics as the target domain. The authors define an ontology with various concepts relevant to the area of network forensics and define taxonomic relations between concepts of the same taxonomy and ontological relations between concepts of different taxonomies. In order to overcome the restriction that relations represented in OWL are binary only, more complex concepts are introduced for representing N-ary ontological relations, or lists are used for representing sequences of arguments; a common modelling pattern for this is sketched below. Furthermore, the authors argue that ontology reasoning can support generalization, prediction and the drawing of conclusions from facts by supporting the main forms of reasoning, namely deductive, inductive and abductive reasoning. This is further supported through examples and a case-study analysis based on the previously defined ontology.
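The sketch below shows the general W3C modelling pattern for such N-ary relations, in which the relation itself is reified as an intermediate node carrying one binary property per argument; the vocabulary is hypothetical and does not reproduce the authors' actual ontology.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/netforensics#")  # hypothetical vocabulary
g = Graph()

# RDF properties are binary, so a relation with more than two participants
# (e.g. "host A connected to host B on port 80 at time T") is modelled by
# introducing an intermediate node with one binary property per argument.
conn = EX.connection_001
g.add((conn, RDF.type, EX.Connection))
g.add((conn, EX.sourceHost, EX.host_10_0_0_5))
g.add((conn, EX.destinationHost, EX.host_192_168_1_9))
g.add((conn, EX.destinationPort, Literal(80)))
g.add((conn, EX.timestamp, Literal("2012-03-01T10:15:00")))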
Research Methodology
According to (Hevner et al. 2004), the two main research paradigms in the IT research area are
behavioural and design science. Behavioural science’s goal is mostly to improve the knowledge about
existing IT systems and how they are used whereas design science’s goal is to promote novel ideas
and solutions for further development of IT. The main focus of design science is the design and
development of new artifacts that address practical problems where a problem is defined as a
difference between the current and the desired state of the researched topic. However, in comparison
to bare design processes, an additional goal of design science is the production of additional
knowledge with respect to the artifacts and their context that can be seen as a contribution to the
academic world. (Hevner et al. 2004) states that “effective design-science research must provide clear
and verifiable contributions in the areas of the design artifact, design foundations and/or design
methodologies”. These demands require design science research to apply rigorous research strategies and methods so as to enable critical evaluation and testing of the proposed design and artifact, as well as of the new knowledge produced.
Since the focus of the current thesis is the suggestion of applying the Semantic Web initiative and its
associated technologies in the context of digital investigations, with the goal of promoting data
integration and novel correlation techniques as a solution to current practical problems, the design
science paradigm has been deemed the more appropriate framework for conducting the research.
(Hevner et al. 2004) has identified four different types of artifacts that can be the outcome of a design
science process, namely constructs, models, methods and instantiations. Constructs are terms and
concepts that constitute the building blocks for asserting statements and representing the various
entities of the problem and solution domain. Models can be descriptions of possible solutions to
practical problems such as a proposed system architecture that can further assist in the production of
new artifacts. Methods describe processes for solving problems such as algorithms or best practices.
Finally instantiations are actual implemented systems that can be used in practice to solve practical
problems.
The main outcome of this thesis is the suggestion and evaluation of proposed methods on how
semantic web technologies can be incorporated into existing digital investigation processes and tools.
However, due to the nature of semantic web technologies and for evaluation purposes of the proposed
methods, models in the form of domain ontologies are also used providing an explicit definition of
constructs relevant to the area of digital investigations. Finally, instantiations in the form of
specialized tools, used for the evaluation of the suggested methods, are presented; these can provide
the basis for a more complex and complete semantically enabled system for digital investigations.
Due to the complexity that a design science project can entail, methods providing a structured plan
of the various research activities needed and specifying their interrelations have been suggested. Such
a method, providing a framework of activities to be performed throughout the project, is schematically
presented in Figure 3 and further discussed below.
Figure 3: Overview of the Design Science Method (Johannesson & Perjons 2012)
The goal of the first step is the clarification of the practical problem situation, its precise definition as
well as a well-supported motivation on why a solution to this problem is needed. Knowledge resources
originating from previous research, as well as stakeholders’ interests, can be considered mechanisms
that support this research activity. Research strategies such as case studies and action research and
research methods such as questionnaires and observations can be applied for controlling the activity
and its results.
In the current thesis, domain knowledge obtained through extensive literature review and empirical
experience has been the main resource for the formulation and elaboration of the practical problem. A
focus has been given to peer-reviewed scientific articles and journals, as accessed through digital
library facilities, and proceedings of reputable digital investigation relevant conferences and
workshops due to their increased validity and their relevance to current academic research and
practitioners’ problems. Case studies of digital investigations involving a variety of data sources
(network captures, event logs, file-system forensics), in the context of international forensic contests
and workshops, have been studied in order to better understand the problems arising due to the lack of
advanced data integration and correlation techniques. The qualitative analysis of documents produced
as output of the former digital investigations has been the main research method for identifying and
defining the practical problems of manual integration and correlation of acquired evidence.
The second step of the methodology has the goal to identify and outline an artifact as a possible
solution to the previously defined explicated problem and further define the main requirements of the
artifact to be developed. Similar previously proposed or implemented solutions, along with the
requirements that they have addressed, as well as the interests and opinions of stakeholders, can be
considered resources for this activity. As in the previous step, case studies have been the main research strategy for
identifying and specifying requirements for the proposed method with observations and analysis of
documents as the main research methods for gaining a better understanding of the restrictions of
current forensic practices and techniques, as well as of the usage environment and the analytical
capabilities required by the complexity of the cases and the investigators’ needs.
The subsequent activity is the actual design and development of the proposed artifact. The proposed
method as the output artifact of the research process is described in detail and instantiations in the
form of prototype tools are developed. The design and development of the artifacts follow best
practices and established techniques from the semantic web area and components originating from
applied semantic web technologies are combined with new ones in order to produce the desired
artifact.
In the next step, the developed artifact is used to demonstrate if and how it can solve aspects of the
previously stated problem in the context of an illustrative or real-life case. The proposed method,
along with the developed instantiation artifacts of this thesis, is validated by applying them to the
digital investigation of representative cases, such as those used as case studies in the previous research
steps when enough data are available, or through experiments attempting to resemble realistic and
probable cases where data and evidence are artificially generated.
The goal of the next phase is the evaluation of the proposed artifact and the solution it provides to the
original practical problem along with the level of fulfillment of the identified requirements. Due to the
limitations of this research, a full-scale evaluation, introducing the developed artifacts to an
organization performing digital investigations and identifying or measuring the impact of such a
solution, could not be performed. As such, the evaluation strategy followed in this research is the “ex ante
evaluation” where evaluation of the artifacts is carried out by the researcher in a theoretical manner
and supported by “informed arguments” with respect to the fulfillment of the requirements and the
generality of the solution.
The final step is the communication of artifact knowledge where information about the proposed
artifact is communicated to other researchers and practitioners. This report has exactly this purpose of
documenting the research activities and results attained from the research project.
A Framework for Semantically-Enabled Digital Investigations
In this chapter, the framework of the proposed method is presented by outlining its high level
characteristics, describing its relation to the reference digital investigation models presented before
and by elicitation of its main requirements that will act as the evaluation criteria of the later developed
method.
6.1 An approach for digital evidence integration,
correlation and hypothesis evaluation based
on Semantic Web technologies
As observed through the previous background studies, the problems that the Semantic Web initiative
attempts to solve, such as the integration and automated processing of a vast amount of information
distributed all over the Web, have much in common with the restrictions and needs of modern forensic
investigation processes. The complexity of modern digital investigation cases, involving a broad range
of concepts, technologies and entities, makes efforts towards a common universal evidence
representation schema too difficult to succeed. The need thus arises for an expressive but flexible
manner of representing both domain knowledge and collected evidence information, with the ability to
integrate and correlate them regardless of their different origin or format. This can support advanced
analytical capabilities and the formulation and testing of hypotheses posed by the investigator, without
being committed to particular conceptual schemas that may have limited expressivity or lack
reasoning capabilities.
The proposed approach of integrating Semantic Web technologies in modern digital investigation
processes and tools can promote, through extensible vocabularies with clearly defined semantics,
the cooperation between the investigator and the analysis platform, due to shared grounds of
communication and understanding, as well as the automation of parts of the analysis with respect to
evidence discovery and correlation. In order to further elaborate on the main characteristics and
potential advantages of such an approach, relations are drawn and described between strengths and
benefits of the Semantic Web identified in relevant previous studies (Reynolds et al. 2005)(Zhao &
Sandahl 2002) and the needs of digital investigations.
• Information Integration: The RDF data model promotes easy integration of heterogeneous data due
to its schema independence and its standardized statement representation in subject-predicate-object
form. RDF datasets containing such sets of statements can be merged into larger data sets. In
contrast to traditional approaches, where syntactic naming similarities may lead to unwanted
merges, the Semantic Web’s promotion of unique identifiers in the form of URIs and of shared
vocabularies of concepts alleviates the problem and enables automated data merging for statements
that reference the same resource. In the case that the same concept is represented with different
URIs in different ontologies, the absence of a unique name assumption, along with OWL constructs such as
owl:sameAs, can enable automated equality inference and integration.
The ability to associate each digital object collected, created or extracted during the digital investigation
process with a unique name enables unambiguous referencing to it and can promote better methods
for case management and archiving. The use of a common resource for representing a concept that
can be used in different domains of the digital investigation process can enable the automated
integration of data, as the example below presents. RDF statements are represented as Directed
Labeled Graphs (DLG) where circular nodes represent resources, arcs represent object or data type
properties, and rectangles represent literal values.
Figure 4: Data Integration based on shared resource URI
Figure 4 is the result of the merge of two simpler graphs where the URI reference identifies a
named individual that represents a specific IP address. The arc representing the ‘hasCountry’ data
type property originates from an RDF data set of IP to country mappings while the ‘hasHostname’
data type property originates from an RDF data set of IP to DNS mappings. The RDF engine is able
to automatically connect the two properties to the common URI reference thus achieving
integration of data originating from different sources.
In the case that a resource is identified by multiple URI references, even of different namespaces,
OWL provides the capability of either manually assigning or automatically inferring an equality
relationship between the multiple references as shown in Figure 5.
Figure 5: Data Integration based on owl:sameAs
Figure 5 depicts two file resources, identified by URIs from different namespaces (e.g. extracted
from network and disk forensic analysis respectively), that are declared as being semantically the
same by the owl:sameAs property. Using this property, a link can be established between different
instances of the same digital object, thus adding additional information with respect to its origin, the
processes in which it participated or the agents that acted upon it. A minimal sketch of both
mechanisms, expressed with a common RDF programming library, is given below.
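The following Python fragment is only a sketch of the two integration mechanisms of Figures 4 and 5; it uses the open-source rdflib library, and the namespaces and property names (e.g. 'hasCountry', 'hasHostname') are illustrative rather than part of any established vocabulary.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL

EX = Namespace("http://example.org/case1/")            # hypothetical case namespace
NET = Namespace("http://example.org/netforensics#")    # hypothetical network-analysis vocabulary
DISK = Namespace("http://example.org/diskforensics#")  # hypothetical disk-analysis vocabulary

g = Graph()

# Statement originating from an IP-to-country data set.
g.add((EX["ip/192.0.2.10"], NET.hasCountry, Literal("SE")))

# Statement originating from an IP-to-DNS data set; it reuses the same subject URI.
g.add((EX["ip/192.0.2.10"], NET.hasHostname, Literal("host.example.org")))

# Because both sources reference the same URI, the merged graph integrates them:
# the IP address resource now carries both properties (cf. Figure 4).
for p, o in g.predicate_objects(EX["ip/192.0.2.10"]):
    print(p, o)

# Two URIs minted by different tools for the same extracted file can be linked
# explicitly, as in Figure 5; a reasoner would then treat statements about either
# URI as holding for the other.
g.add((NET["file/abc123"], OWL.sameAs, DISK["file/0001"]))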
• Semi-structured data support: Semi-structured data present irregularities such as missing attributes,
mixed typing and variable numbers of occurrences of attributes (Suciu 1998). Semi-structured data
do not necessarily rely on an a priori schema, and this gives the flexibility to deal with schema-less
or incomplete information. This is quite a common case in the digital investigation context, where
uncertainty and partial knowledge due to deleted, fragmented or missing data can be manifested. As
such, RDF statements can be processed by RDF processors even without explicit schema
information, although schema information can enable consistency checking and additional
automated inferences.
Semantic constraints and type-related information and relationships can be defined in a schema and
used by a semantic reasoner for verifying the internal consistency of the RDF dataset. As an
example, presented in Figure 6, an instance of the File class that is associated with an IP address
through the ‘hasIPAddress’ property may be flagged as inconsistent, based on the schema-expressed
restriction that the former property has the Packet class as its domain and that Packet and File are
two disjoint classes. Such consistency checks can be quite important when merging data from
different sources, as URI referencing misuse or semantic inconsistencies may be automatically
detected and communicated to the examiner. A small sketch of such axioms is given after Figure 6.
Figure 6: Semantic inconsistency as reported by the reasoning engine
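The situation of Figure 6 can be expressed with a handful of OWL axioms, as in the sketch below; the class and property names mirror the example above and are not taken from any standard ontology, and the actual consistency check would be delegated to a DL reasoner (for instance the one attached to an ontology editor) rather than performed by the parsing code shown here.

from rdflib import Graph

schema_and_data = """
@prefix :     <http://example.org/netforensics#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Packet a owl:Class .
:File   a owl:Class ;
        owl:disjointWith :Packet .     # File and Packet may share no individuals

:hasIPAddress a owl:ObjectProperty ;
        rdfs:domain :Packet .          # only Packets are expected to carry this property

:IPAddress a owl:Class .
:ip1 a :IPAddress .

# Asserting the property on a File leads a reasoner to also classify :file1 as a
# Packet, which clashes with the disjointness axiom above and would be reported
# as an inconsistency to the examiner.
:file1 a :File ;
       :hasIPAddress :ip1 .
"""

g = Graph()
g.parse(data=schema_and_data, format="turtle")
print(len(g), "asserted triples loaded; consistency is checked by an external reasoner")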
• Classification and Inference: An important aspect is that the RDF/OWL combination, in contrast to
the object-oriented programming paradigm, can infer class membership and typing information from
data based on the ontologically specified definitions. Properties should not be considered similar to
attributes in object-oriented programming, since they are not part of a class definition; rather, their
usage defines the class membership of an individual, thus enabling multiple class
memberships as well. As an example, a binary stream present in a physical disk’s image is initially
missing explicit type information. Throughout the investigation process and the output of different
tools such as specialized carvers and parsers (e.g. NTFS file-system reconstruction, file signature analysis, file
metadata analysis), new properties and their values are introduced about the initial resource. A
semantic reasoner can use the definitions and restrictions defined in the ontology in order to
dynamically infer the class membership of the initial resource.
As an example, in Figure 7, a digital object has been extracted from an NTFS partition and the
NTFS file-system parser has inserted a data-type property that the file was deleted as per its MFT
entry. The digital object was examined by various parsers and an additional data-type property
about its MIME content type has been added implying that it is of an image type. According to the
ontological definitions, a class hierarchy of the File, ‘ImageFile’ and ‘DeletedImageFile’ concepts
has been specified. The ‘DeletedImageFile’ class’ extension has been specified as equivalent to the
intersection of the class extension of the individuals having the ‘image/jpeg’ as the value of the
‘hasMIMEType’ property and those having the true boolean value as the value of the ‘isDeleted’
property. The reasoner, upon finding an individual with these restrictions fulfilled, is able to infer
the type of the resource as being a member of the ‘DeletedImageFile’ class and subsequently of its
superclasses ‘File’ and ‘ImageFile’.
Figure 7: Class membership entailment based on value restrictions
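A sketch of the class definition behind Figure 7 is given below. It declares ‘DeletedImageFile’ as equivalent to the intersection of two owl:hasValue restrictions and then materializes the entailed class memberships with the owlrl Python package, an implementation of the OWL 2 RL rule set chosen here purely for illustration; any OWL reasoner could be substituted, and the namespace and property names are again illustrative.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF
from owlrl import DeductiveClosure, OWLRL_Semantics

EX = Namespace("http://example.org/diskforensics#")   # hypothetical namespace

ontology_and_data = """
@prefix :     <http://example.org/diskforensics#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:ImageFile        a owl:Class .
:DeletedImageFile a owl:Class ;
    rdfs:subClassOf :ImageFile ;
    owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf (
            [ a owl:Restriction ; owl:onProperty :hasMIMEType ; owl:hasValue "image/jpeg" ]
            [ a owl:Restriction ; owl:onProperty :isDeleted   ; owl:hasValue true ]
        )
    ] .

# A digital object annotated by an NTFS parser and a MIME-type detector.
:object42 :hasMIMEType "image/jpeg" ;
          :isDeleted   true .
"""

g = Graph()
g.parse(data=ontology_and_data, format="turtle")
DeductiveClosure(OWLRL_Semantics).expand(g)               # materialize OWL RL entailments
print((EX.object42, RDF.type, EX.DeletedImageFile) in g)  # expected to print True

Membership in the superclasses ‘ImageFile’ and ‘File’ follows in the same way from the subclass axioms, as described in the example above.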
• Extensibility and Flexibility: Due to continuous new advances and technologies pertinent to the
area of digital investigations, it is important that applications dealing with evidence demonstrate
backward and forward compatibility with respect to the data model of their input and output. As
such, a backwards-compatible application may be able to process input based on previous data
models. RDF/OWL inherently provides such support, where a reasoner can infer new axioms based
on additional ontological definitions when presented with information based on older data models.
The same applies to forward compatibility, where new concepts and axioms defined in an existing
ontology do not affect existing tools, which can still consume the remaining axioms.
Based on existing practice, it is difficult to expect tool vendors and other stakeholders to agree on
standardized ontological definitions of the domain area. This is manifested by the high variation of
existing formats and the lack of standardization among them. However, OWL provides the
flexibility that different ontologies can be specified according to each stakeholder’s interests and
scope, yet still be integrated through ontology mapping and alignment processes.
• Provenance: RDF/OWL provides good annotation capabilities, so that each RDF statement can be
associated with additional metadata regarding its originating source or other relevant information.
Reification is a feature of RDF whose idea can be useful for digital investigations due to the
requirements for establishing the chain of custody. The reification of a triple statement of the form
subject-predicate-object (e.g. <ex:a> <ex:b> <ex:c>) is another graph that represents a blank node
with the following properties:
_:x rdf:type rdf:Statement .
_:x rdf:subject <ex:a> .
_:x rdf:predicate <ex:b> .
_:x rdf:object <ex:c> .
Expressed in the latter way, a triple is represented as an RDF statement and thus additional metadata
can be attached to it by using the blank node as the subject of other statements. However, the latter
graph does not entail the first one and, due to its loose semantic clarity and weak query support, its
use is gradually being discouraged.
Another approach, addressing the weaknesses of the reification approach, is that of Named Graphs.
A named RDF graph is a set of RDF statements named with a URI reference. In this way,
RDF data sets can be referred to by the graph’s URI reference, and additional metadata about these
RDF data, such as provenance, trust or access control, can be asserted using the graph’s URI
reference as subject (Carroll et al. 2005). SPARQL, as the preferred query language, provides
support for named graphs and allows queries to be executed against explicit graphs through their
names. This can be quite useful in the context of digital investigations, where RDF data sets
originating from different tools, processes or systems, and other log data that may be archived daily,
can be uniquely identified, annotated and independently queried, as sketched below.
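The following minimal sketch illustrates the named-graph idea with rdflib's Dataset class: statements produced by two different tools are kept in separate named graphs, and a SPARQL query retrieves statements about a resource together with the graph, and therefore the source, that asserted them. The graph URIs and property names are illustrative.

from rdflib import Dataset, Namespace, Literal, URIRef

EX = Namespace("http://example.org/case1/")

ds = Dataset()

# Each tool's output goes into its own named graph, so that provenance metadata can
# be attached to the graph URI and queries can be restricted to a single source.
net_graph = ds.graph(URIRef("http://example.org/case1/graphs/network-capture"))
dns_graph = ds.graph(URIRef("http://example.org/case1/graphs/dns-lookups"))

net_graph.add((EX["ip/192.0.2.10"], EX.contactedBy, EX["host/workstation-07"]))
dns_graph.add((EX["ip/192.0.2.10"], EX.hasHostname, Literal("host.example.org")))

# Retrieve every statement about the IP address and report which graph asserted it.
results = ds.query("""
    SELECT ?g ?p ?o WHERE {
        GRAPH ?g { <http://example.org/case1/ip/192.0.2.10> ?p ?o }
    }
""")
for g, p, o in results:
    print(g, p, o)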
• Search: The Semantic Web advocates that ontology-based searching can provide significant
improvements over traditional keyword-based information retrieval (Shah et al. 2002). Concepts
and semantic relationships defined in the respective ontologies support the text extraction,
annotation and inference mechanisms. Documents are annotated with additional semantic markup
that is later used instead of, or alongside, traditional keywords for document indexing purposes. A
query can be extended through a reasoning engine and generate more meaningful results by utilizing
semantic relationships between the searched concept and the semantically annotated concepts of the
documents.
Keyword-based information retrieval is a prominent method in all types of modern digital forensics,
employed when examining media devices, mobile phones, security logs etc. The use of
ontologies providing semantic relationships between different terms and concepts makes it possible to
retrieve data that may not be directly asked for by an examiner but is semantically related to her query. In
a similar approach, although focused only on semantic linguistic relations, (Du et al. 2008) utilize
WordNet, a general-purpose lexical dictionary with additional labeling of the semantic
relationships between words, for query expansion, taking advantage of synonym and antonym
relationships between terms for improved precision in the information retrieval process.
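As one small illustration of such ontology-assisted retrieval, the SPARQL 1.1 property path in the query below expands a search over the class hierarchy at query time, so that asking for instances of 'ImageFile' also returns resources typed only as a subclass such as 'DeletedImageFile'; the class names repeat the illustrative disk-related examples used earlier.

from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix :     <http://example.org/diskforensics#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    :DeletedImageFile rdfs:subClassOf :ImageFile .
    :photo1 a :DeletedImageFile .
    :photo2 a :ImageFile .
""", format="turtle")

# rdfs:subClassOf* walks the class hierarchy at query time, so no prior reasoning
# step is needed for this particular form of query expansion.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/diskforensics#>

    SELECT ?resource WHERE {
        ?resource a ?cls .
        ?cls rdfs:subClassOf* ex:ImageFile .
    }
""")
for (resource,) in results:
    print(resource)        # both :photo1 and :photo2 are returned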
In this section, an outline has been presented of concrete advantages that a Semantic Web based
approach may offer over current practices and needs of the area of digital investigation. Based on the
former argumentation and examples along with the successful application of Semantic Web
technologies in relevant fields such as information retrieval, it can be claimed that such an approach
can indeed constitute a possible solution to the digital investigations’ practical problems discussed in
Chapter 2.
6.2 Relation to Digital Investigation Reference
Models
The proposed approach of semantically enabled integration and correlation of digital evidence relies
upon and refers to the previously described digital investigation frameworks, processes and
approaches. As such, a short description of the correspondence of the proposed approach to the above
is presented, with the intention of providing better insight into how the proposed approach can be
effectively combined with existing reference models.
The event based digital investigation framework has been selected as the most relevant proposed
framework due to its focus on events and the forensic steps needed for the collection and
interpretation of digital evidence as reconstructed past events. The framework covers both the physical
and digital parts of the investigation but our proposed approach focuses only on the digital aspect of it.
However, the phases of Readiness, Deployment and Physical Crime Investigation are considered as
necessary predecessors of any digital investigation and their proper execution determines the quality
and trustworthiness of any input data to our proposed approach.
The digital investigation phase consists of three main sub-phases, the System Preservation, the
Evidence Searching and the Event Reconstruction. The System Preservation phase emphasizes the
necessity of proper acquisition and collection of the different sources that may contain data of
evidentiary value along with proper documentation of any actions performed on them. The results of
this phase are again considered as prerequisites for the quality of the collected evidence and any
subsequent analysis of them.
The Evidence Searching phase has the role of utilizing techniques and methods so as to separate
collected data with possible evidentiary value from unrelated and noisy data. Although our proposed
approach does not focus on this phase, the successful execution of this phase can be of paramount
importance due to the computational complexity that large volumes of data can bring to any data
integration and correlation approach. Depending on the granularity of the ontologies used as well as
the concept relations that can be inferred or asserted, the transformation of given input to semantic
data can lead to huge amounts of triples and render any analysis attempt extremely cumbersome if
not infeasible. It is, of course, important to mention that the evidence searching phase must maintain
the integrity of any extracted, filtered or transformed data along with detailed documentation of any
applied operation.
The Event Reconstruction phase is the most closely related step to the proposed approach. In this step,
evidence that have been properly collected and preserved during the first phase and extracted or
filtered during the second phase are analyzed, integrated and correlated with the goal of providing a
reconstruction of any performed past actions. In this phase, a semantically-enabled approach can
provide a semi-automated or even fully automated method for representing the input data using the
expressivity of Description Logic and OWL, data integration of disparate sources of information and
even intelligent inference of complex types of events and their interrelationships.
Finally, although the current thesis does not deal directly with the Presentation Phase, a semantically
enabled approach with its inherent graph-capable connection of data fragments can provide a basis for
advanced and more intuitive visual interfaces (e.g. timelines of events, data path trails etc.) that may
enable a more meaningful interpretation and presentation of case relevant evidence.
The phases of the Event based digital investigation framework, as discussed before, present a high-level view of the main activities involved in any digital investigation without providing a more
detailed description of how each of these phases has to be applied and what are the expected actions
that the examiner must perform. As such, the chosen Digital Investigation Process provides a much
more detailed description of the various steps and their expected outputs. Although, as previously
discussed, the steps of the presented Digital Investigation Process can be conducted in iterations and
not necessarily in a linear form, the steps that precede the Analysis part are not considered as of direct
relation to the proposed approach and not further discussed. However, as previously mentioned, the
successful performance of these steps according to forensic practices can significantly improve the
quality of the input data for the analysis step, considerably increase the efficiency, automation and
speed of the analysis and positively affect the results of the subsequent steps. As such, reduction steps,
such as removing known files and thus reducing the amount of data that needs to be processed, as well
as further organization of the evidence files based on the nature of the information they hold, are
performed by the author in this thesis before applying the proposed analysis method.
The main parts of the analysis phase that are of interest to the present thesis are the ‘Fusion’ and
‘Correlation’. The purpose of the ‘Fusion’ phase is to merge data of different types and nature so as to
allow the examiner to acquire a broader and more thorough view of past events. This step maps directly
to the data integration capabilities that Semantic Web technologies offer, in that data of
different forms can be represented using common and well-defined ontological concepts and
terminologies, and user-defined relations can easily be drawn between them. The proposed approach
along with the later presented method can fuse data of different types (network captures, disk images,
log files etc.) using an extensible set of ontologies and interconnect data with different types of
relations such as part-whole, temporal, comparative etc. In the ‘Correlation’ phase data can be
combined in complex ways so as to reveal additional patterns of activity and establish (causality,
temporal, contextual etc.) relationships between different events or involved entities. The Semantic
Web approach can support such advanced forms of correlation through its automated inference
capabilities, along with its support for rule engines, which offer even greater expressivity
and can establish correlations between different events. The ‘Validation’ phase where the results of the
analysis along with their backing reasoning are collected and further communicated can also be
enhanced with current and future research on the top ‘Logic’ and ‘Proof’ layers of the Semantic Web
architectural stack.
Finally and based upon the discussion regarding the scientific value of evidence and the concerns
about the reliability of any digital investigation process, the hypothesis-based approach and its process
of formulating and testing hypotheses has been considered as an important principle that should be
followed when conducting digital investigations. In accordance with the additional requirements of such
a hypothesis-based approach, as expressed in (Rekhis & Boudriga 2011), the proposed
semantically enabled approach can provide an automated, accurate and replay-able evaluation of
hypotheses as well as provide formalized and expressive means for a conceptual representation of
existing domain knowledge along with examiner submitted hypotheses. In our approach, we consider
the querying capabilities that Semantic Web technologies offer along with the formalized
conceptualization of forensic knowledge and involved entities as a structured and powerful way for
expressing and evaluating hypotheses not only against raw data but also complex events that may have
resulted from the integration of disparate data sources.
Figure 8: Conceptual relation between forensic frameworks and the Semantic Web stack
In Figure 8, the conceptual relations between the different phases of the presented framework and
process and the proposed approach based on Semantic Web technologies are depicted as previously
discussed.
6.3 Evaluation Criteria
Due to the lack of a common and established approach on evaluating digital forensic methods and
tools along with the interdisciplinary nature of the proposed method, evaluation criteria for assessing
the quality of the former and its results have been specified mostly based on relevant work in both the
area of digital investigations as well as semantic web applications. In order to establish a better
structured definition of the criteria, the ‘Goal-Question-Metric’ (GQM) approach (Basili et al. 1994) is
followed.
According to the GQM methodology, a measurement model is defined that can be used to evaluate the
alignment of a proposed or implemented system to its purpose as well as the correspondence of it to its
operational goals. As such, the three levels that constitute the model are the goals that identify the
engineering goals behind the system under evaluation, the questions that provide a way of refining the
goals into a set of operationalized characterization of the level of goal achievement and the metrics
which are defined as the necessary data that need to be collected and provide a quantitative basis for
the evaluation. It should be noted, though, that due to the restricted scope of the current thesis and the
nature of the proposed method as an alternative form of analysis for digital investigations, the impact
of specific criteria was difficult to measure in the context of a single project or case study; instead,
qualitative arguments are given in their support, pointing in parallel to possible extensions of
the current work.
Based on relevant work regarding evaluation criteria and evaluation methodologies for forensic tools
(Hildebrandt et al. 2011) , Semantic Web based services and applications (Küster et al. 2010) and
software engineering methods in general (Kitchenham et al. 1997), a list of goals and criteria have
been collected and grouped into 3 main categories, namely the generic ones, the Forensic criteria and
the Semantic Web related ones . The goals are presented in an aggregated form in the tables below
along with their associated questions and metrics.
Table 1: List of Generic Criteria in terms of the GQM methodology
Goal: The proposed method should be appropriate for the task at hand.
Questions: What is the relationship of the proposed method with existing digital investigation practices and tools? What are the case context requirements for the method to be applied?
Metric: The ability of the method to handle different types of cases (network-related events, media device examinations etc.), measured by the number of different data types it can process.

Goal: The method should provide good support for decision-making by providing relevant and usable results.
Questions: What types of new knowledge can the method extract and what is their usefulness? How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric: The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.

Goal: The method should be cost effective in terms of storage and time needs.
Questions: How does the method accept and store input data, intermediate and final results, and what are the storage requirements of such an implementation? How much time is needed for applying the method on the input data, and how can it reduce the time that the investigation process takes?
Metrics: Storage size requirements for representing input and output data. Time needed for performing the analysis of data or evaluating user-submitted queries.

Goal: The method should be flexible and scalable.
Questions: Can the method deal with new sources of data or seamlessly integrate new forms of ontologically-expressed knowledge and rules? Can the method support large amounts of data, and what problems may such complexity incur?
Metrics: The ability of the method to process new data and accept additional ontologies or rules without the need for major (possibly even any) modifications of the existing steps, measured by the amount of configuration or code modification such changes require. The method's ability to handle large amounts of data, measured by the input size (e.g. number of captured network packets, firewall logs, disk image sizes etc.) in relation to the processing time or produced errors.
Table 2: List of Forensic-related criteria in terms of the GQM methodology

Goal: The method's results should be reproducible.
Questions: Do the results of the method behave in a deterministic manner when applied on the same input data, or are they inconsistent among multiple tests?
Metric: The method's results (e.g. inferred axioms, query results) should be the same given the same dataset and independently of other factors like the order of processing the evidence files. This can be measured by the number of errors or different results after multiple applications of the method on the same dataset.

Goal: The method's possible errors should be minimal and determined.
Questions: Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric: The method's results can be automatically checked by a reasoning engine for possible inconsistencies between asserted and inferred axioms and the given ontologies. The method's error rate can be measured by the error messages produced during its lifecycle.

Goal: The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Questions: Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms?
Metric: The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.

Goal: The method should protect the integrity of the collected data.
Questions: Can the method operate on forensic copies of the collected evidence? Does the method allow the logging of its various steps as they are applied and of the results they produce? Does the method use hashing algorithms in order to ensure the consistency and integrity of these forensic copies?
Metric: The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by being able to work on forensic copies instead of the originals and to verify any hash values that these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered as a metric.
Table 3: List of criteria with regard to the Semantic Web principles in terms of the GQM methodology

Goal: System heterogeneity – platform independence.
Questions: Can parts of the method be applied on different systems and the partial results later be recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric: The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.

Goal: Implementable with the current Semantic Web stack.
Questions: Can the method's steps that utilize Semantic Web concepts be implemented with current technology, or are other improvements/extensions needed?
Metric: The method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status. Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.

Goal: The method and its results should be semantically rich, allowing the description of high-level contexts and events along with their interrelationships.
Questions: Can the method describe arbitrary data? Can the method accept descriptions of high-level and user-defined concepts and associate sets of lower-level events with them? Can the method establish relationships between these higher-level descriptions?
Metric: The method should be able to accept user-defined high-level concepts and associate lower-level events to them using well-defined rules/restrictions. Errors produced or an inability to define custom events can be considered a metric of how semantically rich the method is.
The above defined goals and their associated metrics can be used in order to establish an evaluation
framework for the method proposed as well as the results obtained. It should be noted once more
though, that the main goals of the proposed approach, namely large-scale automation of the digital
investigation process, integration of and reasoning over all types of collected data, as well as a formalized
ontological representation of the existing body of knowledge, thus making digital investigation
more approachable to less technically adept people, are not fully covered by the former goals. A much
broader evaluation framework that would involve studies of the people involved in Digital
Investigations and the effect that such a semantically-enabled method can bring upon their existing
practices and perspectives would be required and involve research methods such as action research,
surveys and interviews.
A Semantically enabled Method
for Digital Evidence Integration,
Correlation and Hypothesis
Evaluation
In this chapter, the proposed semantically enabled method for digital evidence fusion, correlation and
hypothesis evaluation is discussed at both a theoretical and a practical level. The first part of the section
deals with the abstract building blocks that constitute the parts of the method and their relationships
while the second part deals with implementation details of the method along with the suggested
software architecture that has been used for a proof of concept practical application of the method.
7.1 Description of the Method
The basic design structure of the method is presented in Figure 9 and further discussed afterwards.
Figure 9: The abstracted method's structure
The Data Collection phase encompasses all current and future forensic acquisition techniques. The
goal of the data collection phase is to generate the appropriate input for the remaining parts of the
method. Forensic principles such as proper seizing and acquisition of involved data sources (e.g. hard
disks, network packet captures, OS and application logs, memory contents, mobile devices etc.) are
assumed to be maintained during the process. Although the current method does not impose strict
requirements or checks on the input data, the application of the method on forensic copies of the
original data with their integrity well protected and their chain of custody maintained is important for
the subsequent credibility and admissibility of the generated results. Due to computational complexity
issues, it is expected that a preprocessing and data reduction step can also be applied, using common
techniques such as KFF (Known File Filter), so as to reduce the amount of input data to the method.
The second step of the method involves the parsing of the input data with appropriate software or even
hardware tools and their representation in a common format using the RDF data model and respective
ontologies. As discussed before, ontologies, expressed in OWL in the Semantic Web context, allow
the modeling of the domain that each input source represents and the explicit definition of the
pertinent concepts, properties and interrelations. Unfortunately, so far little work has been done as
discussed before, on the specification of ontologies that conceptualize the various types of evidence
that are commonly used in digital investigations. Most ontological approaches have focused on
surrounding concepts and elements of digital investigations such as involved entities, procedures or
specialized sources of evidence such as the Windows Registry. In the context of the proof of concept
method implementation presented below, a number of lightweight ontologies are specified. However,
domain experts should cooperate in order to reach a consensus on more complete and accurate models
of the different evidence domains expressed in an ontological manner. Despite the above, the parsing
tools can be flexible enough so as to operate even with custom made ontologies specified by the
investigator or by other communities such as forensic tool vendors.
The output of the ‘Ontological Representation’ step should be the transformation of the input data and
their various formats (binary, XML, forensic disk image formats) into the triple-based RDF data model.
The RDF data model’s flexibility allows the representation of arbitrary information in the form
of subject, predicate and object. Subjects should be given a URI so as to be uniquely distinguishable,
while objects can be represented either as resources with their own URIs or as data values under
suitable namespaces. The predicates are specified as object or data properties in the respective
ontology and enable the representation of the interrelationships between resources or the specification
of the values of resources’ properties. The asserted axioms can be contained in a new blank ontology
that includes a reference to the ontology file describing the classes and the properties. The final
output can be stored using the various ontology formats such as RDF/XML or RDF/OWL, since most
semantic web frameworks support both equally well.
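A minimal sketch of this transformation is given below; the dictionary stands in for the output of an arbitrary parser, and the ontology namespace, class and property names are illustrative rather than part of any standardized vocabulary.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

DISK = Namespace("http://example.org/ontologies/diskimage#")   # hypothetical evidence ontology
CASE = Namespace("http://example.org/cases/2012-001/")         # hypothetical case namespace

# Output of a (hypothetical) parser describing one file extracted from a disk image.
record = {"name": "invoice.pdf", "size": 52133,
          "md5": "9e107d9d372bb6826bd81d3542a419d6", "deleted": True}

g = Graph()
file_uri = CASE["file/" + record["md5"]]                 # mint a unique URI for the file
g.add((file_uri, RDF.type, DISK.FileObject))
g.add((file_uri, DISK.hasFileName, Literal(record["name"])))
g.add((file_uri, DISK.hasSize, Literal(record["size"], datatype=XSD.integer)))
g.add((file_uri, DISK.hasContentMD5, Literal(record["md5"])))
g.add((file_uri, DISK.isDeleted, Literal(record["deleted"])))

# The asserted axioms can then be serialized next to a reference to the ontology file.
g.serialize(destination="diskimage-assertions.ttl", format="turtle")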
The next step is the ‘Automated Reasoning’ step. During this step, an OWL-based reasoner is invoked
on the previously generated ontological representation of the data in order to infer additional axioms
based on the resources’ ontological relations. Depending on the sophistication of the given ontology
and the detail of the given data, different types of new axioms can be inferred, such as the following
(a short sketch of this step is given after the list):
• Class Assertion Axioms: The inference engine takes into consideration defined property restrictions
in order to classify resources as members of specific class extensions. The advantage of this
inference is that a resource can belong to multiple classes in parallel. Additionally, a resource with
partial asserted knowledge can dynamically be classified in the best matching class descriptions
thus allowing the mapping of evidence data into higher order and custom defined concepts not
necessarily included in the original data.
• Property Assertion Axioms: The inference engine takes into consideration property relations such
as property hierarchies or transitivity in order to infer new property axioms that are attached to the
given individuals. New features such as chained properties defined in OWL2 allow the inference of
new property assertions based on more complex relationships of resources.
• Inverse Object Property Axioms: The inference engine takes into consideration the definition of
properties that have an inverse relationship in order to infer inverse object property
assertions based on the existing ones. This enables the resources that are connected by these
properties to establish bi-directional connections that further promote more advanced queries.
• Subclass Axioms: The inference engine is able to resolve a new inferred subclass hierarchy based
on previously asserted class descriptions when these extra hierarchical relations have not been
explicitly defined in the ontology.
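The short sketch below illustrates this step for the case of an inverse object property axiom; the owlrl package is used here only as a stand-in for an OWL reasoner, and the property names are illustrative.

from rdflib import Graph, Namespace
from owlrl import DeductiveClosure, OWLRL_Semantics

EX = Namespace("http://example.org/netforensics#")

g = Graph()
g.parse(data="""
    @prefix :    <http://example.org/netforensics#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    :hasSourceIP a owl:ObjectProperty ;
        owl:inverseOf :isSourceOf .        # enables bidirectional navigation

    :flow1 :hasSourceIP :ip1 .
""", format="turtle")

DeductiveClosure(OWLRL_Semantics).expand(g)      # run the OWL RL rule set over the graph
print((EX.ip1, EX.isSourceOf, EX.flow1) in g)    # expected to print True (inferred axiom)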
Following this, the ‘Rule Evaluation’ step is where SWRL rules are evaluated by a rule engine and
the newly asserted axioms are inserted back into the ontology. SWRL has been chosen as one of the
most promising rule languages for the Semantic Web stack. SWRL rules can assert additional
axioms that cannot be inferred due to the DL-safe expressivity limitations of OWL. To the best of our
knowledge there is no full support for SWRL rule evaluation in most OWL inference engines, and
thus some rules may have to be evaluated by external rule engines such as Jess. These external rule
engines may represent axioms in a different way than the RDF data model and thus require an
additional translation layer. The axioms newly inferred by the rule engine should then be translated
back to the RDF data model and integrated into the existing ontology. Depending on the ontologies
used, an additional round of inference may be applied if new axioms can be inferred based on the
axioms newly derived by the evaluation of the rules.
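Purely as an illustration of the shape such a rule can take (the class and property names follow the illustrative ontologies used in this thesis and the rule itself is hypothetical), the fragment below writes out, in the usual human-readable SWRL syntax, a rule that links a file recovered from a disk image with the HTTP response that carried content with the same hash; in the proposed method such a rule would be evaluated by an external rule engine and the resulting axioms merged back into the RDF dataset.

# A hypothetical SWRL rule, in its human-readable syntax, correlating a file on disk
# with an HTTP response transporting content with the same MD5 hash. It is shown
# here only as a string; evaluation is delegated to a SWRL-capable rule engine.
swrl_rule = """
HTTPResponse(?r) ^ hasContentMD5(?r, ?h) ^
FileObject(?f)   ^ hasContentMD5(?f, ?h)
  -> DownloadedFile(?f) ^ wasTransferredIn(?f, ?r)
"""

# After evaluation, the newly asserted axioms (e.g. the wasTransferredIn property
# assertions) are translated back into RDF triples and added to the working graph.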
The final step is the ‘Integrated Query’ step, where queries expressed in the SPARQL language are
submitted to a SPARQL endpoint hosting the sets of asserted and inferred axioms. The term
‘Integrated’ is used since, in the previous two steps, relations in the form of predicates can be
established, both by OWL inference and by SWRL evaluation, between resources of the same or of
different datasets.
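A sketch of such an integrated query follows. It joins statements that originated in a network capture with statements that originated in a disk image through a shared MD5 hash, asking for files observed in HTTP traffic that also exist, deleted, on the examined disk; the ontology namespaces and the tiny in-memory dataset are illustrative, and the same SPARQL string could equally be submitted to a standalone SPARQL endpoint.

from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix net:  <http://example.org/ontologies/netcapture#> .
    @prefix disk: <http://example.org/ontologies/diskimage#> .
    @prefix case: <http://example.org/cases/2012-001/> .

    case:resp77 a net:HTTPResponse ;
        net:hasContentMD5 "9e107d9d372bb6826bd81d3542a419d6" .

    case:file42 a disk:FileObject ;
        disk:hasContentMD5 "9e107d9d372bb6826bd81d3542a419d6" ;
        disk:isDeleted true .
""", format="turtle")

# The join variable ?md5 integrates evidence from the two sources in a single query.
results = g.query("""
    PREFIX net:  <http://example.org/ontologies/netcapture#>
    PREFIX disk: <http://example.org/ontologies/diskimage#>

    SELECT ?file ?response WHERE {
        ?response a net:HTTPResponse ;  net:hasContentMD5  ?md5 .
        ?file     a disk:FileObject ;   disk:hasContentMD5 ?md5 ;
                  disk:isDeleted true .
    }
""")
for file_uri, response_uri in results:
    print(file_uri, "appears to have been transferred in", response_uri)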
7.2 Ontological Representation of Digital
Evidence
As explained before, the ontology plays a central role in the proposed method. Currently, there are no
well-established ontologies for the area of digital investigations covering different types of evidence. It
would be too optimistic to expect that one commonly accepted ontology describing concepts and
relations of different types of digital evidence will soon be adopted by tool vendors and integrated into
their tools. Based on the above assumption, the method has been developed by considering its ability
to handle different ontologies from different namespaces. As such, a choice that has been adopted
during this research is that different types of evidence (network packets, forensic disk images, logs
etc.) are each associated with their own ontology under their respective namespace.
Conceptually, in the context of the current research, sources of data have been categorized into two
main types, namely the ‘Case Related Data’ (CRD) and the ‘Supportive Data’ (SD). CRD are data that
originate from main sources that are the result of the initial steps of the forensic process such as
collection and examination. CRD describe data sources which contain the data of direct evidentiary
value for the case at hand. CRD can include network packet captures in either raw (pcap) or
sessionized (NetFlow) format, forensic images of hard disks and logs produced by different applications or
network appliances (firewall, IDS etc.). SD, on the other hand, are additional information that
originates from further analysis of CRD or other types of metadata that can be provided by other
services. Examples of SD can include domain name information such as DNS reverse lookups,
WHOIS domain information, IP geo-location and IP to Autonomous System Number (ASN)
mappings, antivirus and antimalware engine outputs upon scanning extracted files etc. A clear
distinction between CRD and SD may not be feasible taking into consideration the procedures
followed by different investigation teams. For this research, the ability of data and metadata to be
collected directly from the collected sources versus the usage of external services for additional
metadata will be the main criterion for performing the distinction.
For the purposes of the implementation of a proof-of-concept system, a number of lightweight and
focused ontologies, expressed in OWL, have been defined and designed using the Protégé OWL
Editor. Based on the assumption that heavyweight ontologies that can better conceptualize the area of
digital investigations require large-scale cooperation and consensus among academia, industry and
individuals, a set of lightweight ontologies specializing in the ontological representation of different
types of data pertinent to a digital investigation, can provide a more pragmatic approach. Such an
approach can demonstrate the ability of semantic web solutions to still provide considerable
advantages, with the price of added complexity of course, in the areas of information integration and
automated reasoning.
Three main types of CRD have been chosen, namely network packet captures that provide an accurate
image of the network communications a system had in the past, forensic images of hard disks that
provide a forensic copy of the files and directories stored on a hard disk structured in the context of a
file system such as NTFS and finally firewall logs such as those created by Windows XP Firewall. For
SD, we have considered data sources such as malware detection services like ‘VirusTotal’
(www.virustotal.com) that can provide additional information about the potential malicious nature of a
file, network registration information such as that provided by RIPE (www.ripe.net/data-tools/db)
that include the IP range in which an IP address belongs and the Autonomous System (AS) to which it
is currently assigned and finally the results produced by other projects such as the FIRE project
(http://maliciousnetworks.org/) that provide real-time data from monitoring for reported malicious
networks that may contain hosts that are used to serve phishing sites or malware. In the sections
below, each information source and its respective proposed ontology are discussed in more detail.
7.2.1 Network Packet Capture Ontology
Network packet capturing can record data packets that are transferred via a computer network. Our
focus is of course on the most prominent type of modern computer networks, which are based on the
TCP/IP stack. The TCP/IP stack is structured in four layers, namely, from bottom to top, the link layer,
the internet layer, the transport layer and the application layer. A packet capture can monitor a network,
wired or wireless, and capture all data transmitted in the link layer. As such, a packet transmitted in the
link layer carries along the data from all the layers hierarchically above it, such as protocol headers from
the internet layer (e.g. source and destination IP addresses), the transport layer (e.g. TCP or UDP source
or destination port) and of course the application layer data (e.g. an HTTP request or response).
A Network Forensic Analysis Tool is able to analyze the protocols of each layer that each packet uses
and additionally assemble the packets into streams thus providing a higher level of abstraction such as
the complete file or request that an application sent to another over the network. Network Forensics
has the objective of analyzing such captures of network traffic in order to extract transmitted files (e.g.
HTTP downloaded files) or application messages (emails or instant messaging conversations) as well
as traces of attempted or successful intrusions such as web application attack vectors like SQL
injections.
A basic ontology that contains terms and concepts important in a packet capture analysis has been
designed and the hierarchical structure of a part of the defined classes is presented in Figure 10. The
basic approach followed during the design of this ontology is that a packet capture file contains a set
of network packets which can be further aggregated into IP conversations between pairs of IP
Addresses. An IP conversation between two distinct IP addresses may include a set of TCP or
UDP flows between different source and destination ports. Finally a TCP or UDP flow may contain a
set of application layer messages such as HTTP request and response messages. An application
protocol such as HTTP contains an internal structure with a set of different header fields and values
with different semantics such as the type of browser a user is using, the user credentials that are used
for authentication and the type of the returned resource (e.g. image or text or binary content). In this
ontology we have focused on the HTTP protocol, which is one of the most used protocols on the
Internet today. In order to further semantically annotate the HTTP requests and responses present in a
packet capture, the RDF vocabulary of HTTP as specified by the W3C ERT working group (Koch et
al. 2011) has been used. As such, the content that is exchanged via the HTTP protocol can be formally
annotated in a machine understandable manner. In a similar approach to the current thesis, although
only focused on HTTP based attacks, (Munir et al. 2011) further leverage the HTTP RDF vocabulary
by introducing semantic rules expressed in SWRL in order to detect malicious and malformed HTTP
requests or responses such as HTTP requests including HTTP response headers etc.
Figure 10: Ontological modeling of network packet captures
A brief description of the defined classes, object properties and data properties follows:
Table 4: Entities of the Network Packet Capture Ontology
CLASSES
PacketCaptureFile
A class that contains in its class extension all the resources that
represent packet capture files. In order to better manage network packet
captures in large networks that span long periods of time, various
policies can be followed, such as splitting the captured network traffic
into different files by a threshold file size or by an amount of captured
packets. An individual that belongs to this class can represent an
individual packet capture file and be properly annotated with additional
properties useful for documentation and chain of custody purposes.
IPAddress
A class that represents IP Addresses. Each different IP address is
identified by a different URI.
IPv4_Communication
A class that represents IP communications between two IP addresses.
An IP conversation has a source and a destination IP address. By source
we mean the client that initiates the connection and by destination
the server accepting the connection.
Port
A class representing the network ports that a computer has. A Port can
be either a TCP or a UDP port and is identified by its number.
TCPFlow
A class representing a data flow over the TCP transport layer protocol
from a source TCP port to a destination TCP port. The same applies for
the UDP respective classes.
ApplicationLayerProtocol A TCP or UDP flow is characterized by the application layer protocol
that is used such as HTTP or DNS. This class extension contains all the
resources that represent a communication between two network hosts
using a specific application layer protocol under a specific TCP or UDP
connection.
OBJECT PROPERTIES
hasCommunication
This object property is used to link a packet capture file as the subject
with a set of IP communications that it contains as the object. Subproperties of it are the ‘hasTCPCommunication’ and
‘hasUDPCommunication’.
hasSourceIP ,
hasDestinationIP
These object properties link IP communications with individuals of the
IPAddress class that are the source and destination endpoints.
hasSourcePort ,
hasDestinationPort
These object properties link a TCP or UDP flow with the source and
destination ports which are individuals that are members of the Port
class.
hasApplicationLayer
This object property connects an IP communication with the application
protocol that is used.
hasHTTPRequest
This object property links an individual that is a member of an HTTP
communication with a set of individuals, members of the class HTTP
Request.
DATA PROPERTIES
hasPortNumber
An integer data value that carries the port’s number.
hasStartTimeStamp ,
hasEndTimeStamp
A data value of type xsd:dateTime as specified in the XML Schema
Datatypes. These values indicate the temporal duration of a specific
TCP or UDP stream.
hasContentMD5
This data value contains a hex representation of the MD5 hash signature
of the content that was carried through an HTTP response message.
Further classes such as Request, Response, RequestHeader, ResponseHeader, Content are imported
from the HTTP RDF vocabulary and are used to further annotate the various parts of HTTP requests
and responses.
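To make the use of these entities concrete, the fragment below sketches how a single HTTP conversation extracted from a capture file could be asserted; all case URIs are illustrative, and the http: prefix is assumed to denote the W3C HTTP-in-RDF vocabulary mentioned above.

from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix pcap: <http://example.org/ontologies/netcapture#> .
    @prefix http: <http://www.w3.org/2011/http#> .
    @prefix case: <http://example.org/cases/2012-001/> .

    case:capture1 a pcap:PacketCaptureFile ;
        pcap:hasCommunication case:comm1 .

    case:comm1 a pcap:IPv4_Communication ;
        pcap:hasSourceIP         case:ip_192_0_2_10 ;
        pcap:hasDestinationIP    case:ip_198_51_100_7 ;
        pcap:hasApplicationLayer case:httpSession1 .

    case:ip_192_0_2_10   a pcap:IPAddress .
    case:ip_198_51_100_7 a pcap:IPAddress .

    case:httpSession1 a pcap:ApplicationLayerProtocol ;
        pcap:hasHTTPRequest case:request1 .

    case:request1 a http:Request .
""", format="turtle")

print(len(g), "triples describing one extracted HTTP conversation")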
7.2.2 Forensic Disk Image Ontology
Analysis of forensic images of hard disks, or storage devices in general, is unquestionably a central
part of almost every modern digital investigation. There are various types of storage devices that are
used such as hard drives (SATA, SCSI, SSD), USB flash disks, SD cards etc. There is also a plethora
of file systems that are used to organize and manage the contents of the device in the form of files,
directories and other associated data structures such as NTFS, FAT32, EXT3, HFS etc. Storage
devices, especially in the enterprise environment, can be combined using various techniques such as
RAID or Disk Spanning for added benefits such as fault-tolerance or dynamically increasing storage
sizes. There are also different techniques for acquiring forensic images of such storage devices
including specialized hardware devices or software solutions. A forensic image of a storage device can
also come in different formats such as raw (dd), EWF, AFF and more.
Most forensic suites, especially the commercial ones, are able to read most of these image formats and
also interpret and reconstruct different file systems. Depending on the capabilities of the tool and the
file-system used, files and directories can be extracted along with deleted or fragmented ones, various
metadata can be retrieved such as timestamps, file-type, hash signatures or even files hidden using
various anti-forensic techniques (e.g. using the slack space). The results of the former can be used for
determining the creation or possession of specific files by a user (e.g. contraband material),
reconstructing the historical usage of a system, or even analyzing malware-infected systems.
The DFXML approach (Simson Garfinkel 2011) attempts to insert a layer of abstraction over all this
variety of formats and types by introducing an XML-based vocabulary with XML entities and
attributes that capture important metadata about files and directories that are common amongst most
existing file system and storage types. DFXML is accompanied by a series of scripts/tools (e.g. fiwalk)
that can parse forensic images of storage devices and generate an XML document that provides a
listing of contained files and directories along with their most important metadata. As mentioned
before, although XML provides a common data exchange format that almost all systems can generate
or process, it lacks the additional semantic relationships between the different entities.
Based on the DFXML proposed XML vocabulary, an ontology has been designed with the goal to
semantically express commonly-used concepts and attributes in the area of digital forensics along with
their interrelationships. The ontology has been created using the Protégé Ontology Editor and is
graphically presented in Figure 11. The design approach followed is that, conceptually, the term binary
content can describe equally well an image of a large storage device, a small file, or even a
sequence of bytes that is part of a file. As such, the output of the fiwalk tool represents the contents of an
image of a storage device, which in turn may consist of multiple partitions formatted with different
file systems, which contain the various files. A file's content can also be described by its byte run, i.e.
the sequence of bytes and their location in the image, or multiple byte runs in the case of fragmentation.
Figure 11: Ontological modeling of a forensic disk image
A brief description of the defined classes, object properties and data properties follows:
Table 5: Entities of the Disk Image Ontology
CLASSES
BinaryContent
A generic concept that describes all types of binary content
independently of their size or format.
FiwalkReport
The class represents individual instances of the fiwalk tool output.
MediaDeviceImage
The class represents all types of storage devices.
Partition
The class represents blocks of the total storage area that are logically
separated and possibly differently formatted. Partitioning is used in the
case of systems with multiple operating systems, or to separate
large storage devices into smaller blocks.
FileSystem
This class represents different types of file systems that are used to
organize the data inside a partition as well as manage them (e.g. create
or delete files).
File
This class represents a logical entity of arbitrary information that
stores data formatted in a specific way.
ByteRun
This class represents a byte run which is a sequence of bytes that are
part of a file and are sequentially stored in a storage device.
OBJECT PROPERTIES
describes, isDescribedBy
These object properties are used to establish the relationship between an
individual that is a member of the FiwalkReport class and the
individual that is a member of the MediaDeviceImage class that the
former describes.
hasPartition, isPartitionOf
These object properties establish the relationship between an
individual that is a member of the MediaDeviceImage class and the
one or multiple individuals that are members of the Partition class that
it may contain.
hasFileSystem,
belongsToPartition
These object properties establish the relationship between an
individual that is a member of the Partition class and an individual that
is a member of the FileSystem class representing the file system with
which the partition is formatted.
containsFile,
isContainedInFileSystem
These object properties establish the relationship between an
individual that is a member of the FileSystem class and the multiple
individuals that are members of the File class.
hasByteRun,
belongsToFile
These object properties establish the relationship between an
individual that is a member of the File class and the single/multiple, in
the case of fragmentation, individuals that are members of the
ByteRun class.
DATA PROPERTIES
hasPathName
A string value of the path and the name of a specific file contained in a
storage device image or even the file of the image itself.
hasType
A string value that holds the type of the file after the file identification
process followed by fiwalk using the libmagic library.
hasFileModificationTime
The lexical representation of the timestamp of the last modification of
the file using the XML Schema DateTime datatype.
There is also a large set of datatype properties that describe different metadata about the file which,
due to space limitations, have not been included here.
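In the same spirit as for the packet capture ontology, the following Turtle fragment is a minimal
illustrative sketch of how a single file recovered by fiwalk could be represented along the chain of
Table 5; the prefix dm: stands for this disk image ontology, and all URIs, individual names and literal
values are purely illustrative.

@prefix dm:  <http://example.org/ontologies/DigitalMedia#> .
@prefix img: <http://example.org/cases/case01/laptop_hdd#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

img:fiwalk_report_1  a dm:FiwalkReport ;
                     dm:describes img:image_1 .

img:image_1          a dm:MediaDeviceImage ;
                     dm:hasPartition img:partition_1 .

img:partition_1      a dm:Partition ;
                     dm:hasFileSystem img:ntfs_volume_1 .

img:ntfs_volume_1    a dm:FileSystem ;
                     dm:containsFile img:file_42 .

img:file_42          a dm:File ;
                     dm:hasPathName "Documents/report.pdf" ;
                     dm:hasType "PDF document, version 1.4" ;
                     dm:hasFileModificationTime "2012-02-01T18:04:12"^^xsd:dateTime ;
                     dm:hasByteRun img:file_42_run_1 .

img:file_42_run_1    a dm:ByteRun .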
7.2.3 Windows Firewall Log Ontology
The main purpose of a firewall device is to inspect the flowing network traffic and accept or deny
network packets based on some predefined security policy. Firewalls can be implemented as
network devices positioned at the perimeter of a network, thus effectively protecting the set of
hosts that reside behind them, or they can operate on individual hosts. Without going into much
detail about the different types of firewalls and the ability of some to also inspect application layer
data, for the purpose of this thesis we focus on stateful host-based firewalls such as the Windows
Firewall.
The Windows Firewall was first integrated with the Windows operating system with the release of
Windows XP Service Pack 2. The Windows Firewall has a preconfigured set of rules describing
allowed or disallowed traffic based on packet characteristics such as source and destination ports and
source and destination IP addresses. The firewall is able to inspect both incoming as well as outgoing
traffic. The Windows Firewall enables the administrator to activate the secure logging capabilities of
the product through its configuration settings. The administrator has the option to log dropped
packets that the firewall rejected based on the specified rules, to log the successful connections
that were allowed to pass through the firewall, or both.
The security log uses the W3C Extended Log File Format. This format is based on a W3C
Working Draft that aims to provide a standardized and flexible format for keeping log files related
to Web activities. This format is actually used by a variety of applications such as the Microsoft IIS
Server as well as other web server software such as the Apache Web Server. Of course, different
vendors of either software or specialized network firewall devices such as Cisco may use their own
proprietary log formats. However, basic information such as those kept in the W3C Extended Log File
format is expected to be logged by most other solutions as well. The following table provides basic
information of the various fields that can be found in a Windows firewall log.
Table 6: List of fields of Windows Firewall log entries
Item
Description
Version
The Windows Firewall software version
Software
Name of the application producing the log
Time
The time format used for reporting timestamps
Date - Time
The timestamp of the log event in the form of
YYYY-MM-DD HH:MM:SS
Action
Describes the action that the firewall has taken
upon the packet. The values available are OPEN
for an allowed outgoing connection, OPEN55
INBOUND for an allowed incoming connection,
CLOSE for a normal closure of a TCP
connection, DROP when a packet violates a
firewall rule and is subsequently rejected and
INFO-EVENTS-LOST that describes a number of
occurred events not recorded in the log.
Protocol
The network protocol in use such as TCP, UDP,
ICMP or the protocol numbers in case of other
than the former protocols.
Src-ip – Dst-ip
The source and destination IP address
Src-port – Dst-port
The source and destination port number
Size
The packet size in bytes
TCPFlags
The first letter of each active TCP flag present in
any TCP packet.
Path
Indicates the direction of the communication. The
value SEND is used for an outgoing packet from
the host and RECEIVE for an incoming packet.
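For reference, a security log produced by the Windows Firewall looks roughly like the simplified
excerpt below. The excerpt is restricted to the fields listed in Table 6; the header lines and the exact
field order may differ slightly between Windows versions, so it should be treated as illustrative only.

#Version: 1.5
#Software: Microsoft Windows Firewall
#Time Format: Local
#Fields: date time action protocol src-ip dst-ip src-port dst-port size tcpflags path

2012-03-14 10:21:03 OPEN TCP 192.168.1.10 203.0.113.7 49152 80 - - SEND
2012-03-14 10:21:09 CLOSE TCP 192.168.1.10 203.0.113.7 49152 80 - - SEND
2012-03-14 10:22:41 DROP TCP 198.51.100.23 192.168.1.10 4899 445 48 S RECEIVE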
A firewall log can provide a wealth of information regarding connections that the system has
established with other networked systems or possible incoming connection attempts that have been
either accepted or rejected. A high number of rejected packets can be used as an indicator of a possible
intrusion attempt, while large sets of packets with specific characteristics, such as TCP SYN packets,
may indicate network port scanning or even denial of service attacks. As such, a firewall log can be
quite useful in the case of a digital investigation that pertains to network intrusions or other network-related
activities. A firewall log that is properly stored, maintained and archived can potentially be
used as evidence of previous network activity and further correlated with network packet captures.
A lightweight ontology has been designed for semantically representing important terms and concepts
used in a firewall log analysis as well as their interrelationships. The ontology has been also designed
using the Protégé Ontology Editor and its structure is graphically presented in Figure 12. The main
design approach followed is that a Firewall Log File can contain a set of Firewall Log Entries. Each
entry includes a source and destination host, a source and destination port, as well as the
protocol used. Besides that, the different actions that a firewall may take upon each packet have been
conceptualized as a Firewall Event, with its sub-concepts of Open Inbound or Outbound Session
event, Close Session event and Drop Data Event.
Figure 12: Ontological modeling of Windows firewall logs
A brief description of the defined classes, object properties and data properties follows:
Table 7: Entities of the Windows Firewall Log Ontology
CLASSES
Host
A class representing all the network hosts
Port
A class representing the network ports that a computer has. A Port can be
either a TCP or a UDP port and is identified by its number.
Protocol
A class representing the different network protocols that a connection
may use. For the case of a Windows Firewall Log the options TCP, UDP
and ICMP have been specified as individuals, members of this class.
FirewallLogContainer
A class representing an entity that acts as the container of a set of firewall
logs. Windows Firewall Logs are commonly stored on text files on the
local system or another remote storage. Advanced configuration options
enable log rotation to be used where logs are kept in different files
depending on the date or a threshold amount of already logged entries.
FirewallLogEntry
A class representing firewall log entries. Practically, each line in the W3C
Extended Log File Format is mapped to a distinct firewall log entry.
FirewallEvent
A class representing the action that the firewall took upon a network
packet or session as described by a log entry. The sub-classes
CloseSessionEvent, DropDataEvent, OpenInboundSessionEvent and
OpenOutboundSessionEvent describe the different actions that the
Windows firewall may take.
OBJECT PROPERTIES
hasLogEntry
This object property connects individuals that are members of the
FirewallLogContainer class with individuals that are members of the
FirewallLogEntry class. This is a useful property that allows the digital
investigator to track down the actual log file that contains a specific log
entry of interest.
represents
A mapping between a firewall log entry and the firewall event that
describes it. This is a basic abstraction from the raw format of a single
firewall entry to a more abstract and descriptive event. Events can thus
be more meaningfully aggregated later on into events of even higher
abstraction, as well as reasoned and queried about.
hasSourceHost ,
hasDestinationHost
These object properties connect individuals that are members of the
FirewallEvent class with individuals that are members of the Host class.
hasSourcePort ,
hasDestinationPort
These object properties connect individuals that are members of the
FirewallEvent class with individuals that are members of the Port class.
hasProtocol
This object property connects individuals that are members of the
FirewallEvent class with the specified individuals of the Protocol class.
DATA PROPERTIES
hasAction
This data property holds the lexical value of the action that the firewall
applied upon a packet or session such as DROP, CLOSE, OPEN etc.
hasAddress
This data property holds the lexical value of the IP address of a network
host.
hasNumber
This data property holds the numerical value of a network port.
hasDateTime
This data property holds the value of the date and timestamp when the log
entry was logged. It is formatted based on the DateTime datatype
specified in the XML Schema Datatypes specification.
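Putting the above together, a single DROP line from the log could be represented roughly as in the
following Turtle sketch; the namespace prefixes and individual names are illustrative, and the exact
domains of the data properties follow the structure of Figure 12.

@prefix fw:  <http://example.org/ontologies/WindowsXPFirewallLog#> .
@prefix log: <http://example.org/cases/case01/pfirewall_log#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

log:container_1  a fw:FirewallLogContainer ;
                 fw:hasLogEntry log:entry_305 .

log:entry_305    a fw:FirewallLogEntry ;
                 fw:hasAction   "DROP" ;
                 fw:hasDateTime "2012-03-14T10:22:41"^^xsd:dateTime ;
                 fw:represents  log:event_305 .

log:event_305    a fw:DropDataEvent ;
                 fw:hasSourceHost      log:host_198_51_100_23 ;
                 fw:hasDestinationHost log:host_192_168_1_10 ;
                 fw:hasDestinationPort log:port_445 ;
                 fw:hasProtocol        fw:TCP .

log:host_198_51_100_23  a fw:Host ; fw:hasAddress "198.51.100.23" .
log:host_192_168_1_10   a fw:Host ; fw:hasAddress "192.168.1.10" .
log:port_445            a fw:Port ; fw:hasNumber  445 .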
7.2.4 WHOIS Ontology
The modern Internet is without doubt quite different from the initial small research-oriented networks
from which it emerged. In order to deal with the ever-increasing complexity of managing the IP
address space and administering its allocation, as well as the associated domain name
infrastructure, a hierarchy of administrative organizations has emerged. Without going into much
detail, nowadays there are five Regional Internet Registries (RIRs) with the responsibility of managing
specific ranges of IP addresses. These RIRs are ARIN for North America, RIPE for Europe, the Middle
East, Russia and Central Asia, APNIC for Asia and Australia, AfriNIC for Africa and LACNIC for
South America. RIRs are responsible for further splitting and allocating IP ranges and accepting domain
registrations from their customers such as ISPs and organizations. The RIR stores information
about the entity to which an IP range or domain name is assigned in its own records.
On the other side, WHOIS is a query and response protocol for querying such databases maintained by
the RIRs or other subsidiaries such as companies providing domain registration services. The WHOIS
protocol is further described in RFC 3912. Information retrieved from a WHOIS server can include the
domain name assigned to a network, the IP address block assigned, or the Autonomous System. An
Autonomous System (AS) is characterized by its Autonomous System Number (ASN) and identifies a
set of routing prefixes under the control of a single entity. An Autonomous System publishes
a well-defined routing policy and is used for interconnecting large-scale networks on the
Internet using the BGP protocol.
In the case of digital investigations that involve network activity with remote IP addresses, it is of
significant value to acquire more information regarding the entity responsible for managing the network
to which an IP address belongs. Information acquired using the WHOIS protocol and the DNS
infrastructure is commonly used to monitor whole networks for possible malicious activity such as
spamming or malware distribution, and further to notify the responsible network operators or blacklist
them in order to isolate them from the rest of the Internet. There is a variety of tools for almost all
modern operating systems for submitting WHOIS queries. These tools include both command-line and
graphical ones as well as specialized web services provided by various web sites.
The RIPE Network Coordination Centre provides such a web interface to its database that can be used
to submit WHOIS queries over HTTP and display the results in the browser. Recently, RIPE has
integrated its system with those of the other RIRs, thus providing a unified query interface. The
online service can be accessed at https://apps.db.ripe.net/search/query.html. The results can be
returned either in XML or JSON format. This service, or others similar to it, can be used in the
context of a digital investigation in order to provide a mapping between IP addresses and the networks
to which they belong, as well as, based on the ASNs of their networks, to further integrate them with
other sets of information regarding malicious or blacklisted networks.
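For illustration, the same kind of information can also be retrieved with the standard command-line
whois client by pointing it at the RIPE database; the output below is trimmed and purely illustrative of
the fields of interest (network range, network name and country).

whois -h whois.ripe.net 193.0.6.139

inetnum:        193.0.0.0 - 193.0.7.255
netname:        RIPE-NCC
country:        NL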
A quite simple ontology has been defined in the context of this research, in order to semantically
represent information that can be retrieved from a WHOIS query. A graphical visualization of the
ontology is presented in Figure 13. The main approach followed during the design of this ontology is
that an IP address belongs to an IP address block (an IP range), which in turn is contained in an
Autonomous System (AS).
Figure 13: Ontological modeling of WHOIS data as provided by RIPE
A brief discussion of the defined classes, object and data properties follows:
Table 8: Entities of the RIPE WHOIS Ontology
CLASSES
ASSystem
A class that represents an Autonomous System.
IPRange
A class representing blocks of IP Addresses.
IPAddress
A class representing individual IP addresses.
OBJECT PROPERTIES
containsRange ,
isContainedInAS
These object properties connect individuals that are members of the ASSystem
class with individuals that are members of the IPRange class.
The inverse object property provides the reverse relationship. Using this
property an AS is connected with the IP address blocks that were
assigned to it.
containsIP ,
isContainedInRange
These object properties connect individuals that are members of the
IPRange class with individuals that are members of the IPAddress class.
The inverse object property establishes the reverse relationship. This
property is used to map an individual IP address to the IP address
block to which it belongs.
DATA PROPERTIES
hasASNumber
This data property provides the numerical value of an Autonomous
System.
hasNetName
A descriptive name of the network responsible for a specific IP Address
Range as provided by the RIR.
hasCountry
A two-letter country code that indicates the country in which the
specified network operator is located.
hasRange
A lexical representation of an IP Address Block in the form of starting
to end IP address.
hasAddress
The lexical value of an IP Address.
7.2.5 Malicious Networks Ontology
The modern internet threat landscape is continuously becoming more and more complex. Over the
last decade a shift has manifested from server-side attacks to client-side attacks. Modern attack vectors
include sophisticated techniques such as phishing sites and drive-by downloads where the user is
tricked or system vulnerabilities are exploited in order to infect a client machine with malware.
Compromised machines are often becoming members of botnets, large sets of compromised hosts
under the control of a criminal group that are further used for distributing spam e-mails or performing
other types of activities such as distributed denial of service attacks (DDoS) against specified targets
or acting as proxies for the main malware distributing servers for evading detection (fast-flux
networks). These techniques are quite complex and a lot of research has been conducted and is ongoing
on these topics. For the purposes of this research, though, we consider it quite important to know that an IP
address, or network in general, which appears in the communication logs that an investigator examines,
demonstrates such malicious behavior. Such information can be used by an
investigator, especially in cases of network-related incidents, to quickly identify suspicious traffic that
any of the examined systems may have had with such malicious Internet hosts.
There are a large number of projects with the purpose of actively or passively monitoring Internet
hosts in order to detect such malicious behavior. Most commonly such projects produce blacklists of
IPs that are observed to perform such malicious actions as sending spam emails, hosting malicious
web pages or performing intense scanning activities. In this project, we have considered the FIRE project
(Stone-Gross et al. 2009), which is part of the European FP7 Wombat project (http://www.wombat-project.eu/). The project aggregates and correlates security-related information from a variety of
sources such as the Anubis software, which monitors the actions performed by a malicious Windows
executable, Wepawet for analysis of malicious Javascript, PDF or Flash files that can be contained in a
web page, lists of spam and advertising URLs such as SpamCop, or phishing sites such as PhishTank.
The project correlates all these data sources and, along with information about the ASNs to which these
IP addresses belong, is able to identify networks that contain hosts that consistently exhibit
malicious behavior. The results are further communicated via the project's website at
http://maliciousnetworks.org.
In the context of this research, a lightweight ontology has been designed as graphically presented in
Figure 14. The main design approach has been that an Autonomous System has a set of Hosts that may
be further characterized as Malicious Hosts in the case that the FIRE Project based on its analysis
determines so. The FIRE project separates malicious hosts into three main categories, based on their
behavior, namely 'PhishingServer' for hosting phishing web sites, 'ExploitServer' for hosting
malware files such as Windows executables, and 'CCServer' in the case that they act as Command &
Control servers for managing botnets.
Figure 14: Ontological modeling of FIRE's blacklist of malicious networks/hosts
A brief description of the defined classes, object properties and data properties follows:
Table 9: Entities of the MaliciousNetworks ontology
CLASSES
AS
The class represents the concept of an Autonomous System.
Country
The class represents the concept of a Country.
Host
The class represents network hosts. A direct subclass is the ‘MaliciousHost’
class which further is sub-classed into the ‘PhishingServer’, ‘ExploitServer’
and ‘CCServer’ classes.
IPAddress
The class represents the concept of the IP Address.
OBJECT PROPERTIES
containsHost ,
isContainedInAS
This object property and its inverse connect individuals, members of the AS
class, with individuals, members of the Host class.
locatedIn
This object property connects members of the Host class with members of the
Country class. The FIRE project further correlates IP addresses with IP geolocation
databases in order to annotate a host with the country in which it is most
likely located.
DATA PROPERTIES
hasASName
A string value of a descriptive name of the Autonomous System.
hasASNumber
The integer value of the Autonomous System Number.
hasCountryName
The lexical value of a country’s name.
hasIPAddressString The lexical value of an IP address.
7.2.6 Malware Detection Ontology
A final supportive data source that has been used in this project is a malware detection service.
It is common practice, especially in cases where compromised systems are involved, to scan the files
found in a system's storage image against an antimalware engine in order to detect any traces of
malware resident on the system. Investigators may use either an antimalware product of their choice or
similar web services. One such web service is the site 'VirusTotal' (https://www.virustotal.com/).
This free online service provides access to a large set of commercial and free antimalware engines
such as AVG, Avast, McAfee, Symantec etc. and returns a summary of the results of each one of these
engines. The use of this service provides increased accuracy when analyzing a suspicious file since
hardly any of these engines can claim 100% detection rates. One limitation is that the web service
returns search results only in cases that a file with the queried hash value has been previously
submitted and analyzed by the service.
Unfortunately, there is no common terminology or naming convention followed by all these
different vendors, and as such the results of a file analysis are mostly of a descriptive form following
each vendor's own naming scheme. As such, a quite simple ontology has been defined in order to
semantically represent the results of such a file analysis, as graphically shown in Figure 15. The main
design approach has been that a File object is analyzed by the 'VirusTotal' service, which in return
provides an Antivirus Report with the description returned by each one of the engines.
Figure 15: Ontological modeling of VirusTotal's anti-malware detection service
A brief description of the defined classes, object properties and data properties follows:
Table 10: Entities of the VirusTotal ontology
CLASSES
File
The class represents File Object, commonly extracted from a forensic disk
image or a network packet capture that is submitted for analysis for
possible malware behavior.
AntivirusEngine
The class represents an antivirus/antimalware engine. Individuals of this
class are the engines that are currently supported by the 'VirusTotal'
service, which amount to over 30 at the time of writing.
AntivirusReport
The class represents a collection of the results of the different engines as
returned by the service.
OBJECT PROPERTIES
hasAVReport
This object property connects an individual, member of the File class,
with another individual, member of the AntivirusReport class.
hasResult
This object property connects an individual, member of the
AntivirusReport class with a blank node that is used to represent the result
attained from an antimalware engine.
generatedBy
This object property connects the blank node mentioned before that
represents an engine result with the individual that is a member of the
AntivirusEngine and represents the specific engine that provides the
result.
DATA PROPERTIES
hasAVName
The name of the engine formatted as string.
hasDate
The date on which the report has been produced.
hasMD5Hash
The MD5 hash value of the submitted file object.
hasPermanentLink
A URL link where the ‘VirusTotal’ service stores the generated report for
future reference.
hasResultDescription
The lexical representation of the output of the antimalware engine.
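A minimal Turtle sketch of how one such report could be represented follows; the hash value, engine
name and result string are illustrative, and the blank node corresponds to the per-engine result
described above.

@prefix vt:  <http://example.org/ontologies/VirusTotal#> .
@prefix c1:  <http://example.org/cases/case01/vt#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

c1:file_42    a vt:File ;
              vt:hasMD5Hash "44d88612fea8a8f36de82e1278abb02f" ;
              vt:hasAVReport c1:report_42 .

c1:report_42  a vt:AntivirusReport ;
              vt:hasDate "2012-03-15"^^xsd:date ;
              vt:hasPermanentLink "https://www.virustotal.com/..." ;
              vt:hasResult [ vt:generatedBy vt:ClamAV ;
                             vt:hasResultDescription "Eicar-Test-Signature" ] .

vt:ClamAV     a vt:AntivirusEngine ; vt:hasAVName "ClamAV" .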
7.3 Semantic Integration and Correlation of
Forensic Evidence
Upon the definition of the semantic descriptions that have been presented in the previous section, an
automated method can access evidence files that are pertinent to a case and structured in the supported
formats described before and represent them in a semantic manner. The data contained in these
formats can be transformed into semantic representations using the classes, object and data properties
defined in the respective ontology. The result of this operation is a set of axioms following the
underlying fundamental data model of ‘subject-predicate-object’. This set of axioms can also be
graphically represented in the form of a graph by specialized visualization platforms.
A conceptual representation of this approach is presented in Figure 16. Source data can come in
different forms such as native formats like ‘pcap’ for network packet captures, ‘ewf’, ‘dd’, ‘aff’ for
disk image formats and others. By native formats we consider the direct output of acquisition tools
such as disk imaging software or hardware or networking monitoring tools. Another category of
source data can be the output of specialized forensic tools that can perform a preprocessing of the
native evidence format and introduce a first level of abstraction. Tools that fit in this category can be
e.g. file system parsers that reconstruct the file system resident in a disk image and output the structure
of directories and files along with their content or metadata, or network stream assemblers that group
network packets into higher-layer streams and connections such as TCP/UDP streams, application
layer messages or transmitted files. The final category is the supportive data exchange formats, where
all different types of knowledge bases are considered that can provide additional information. These
knowledge bases can include additional information about IP addresses and networks or files. There is
a plethora of formats in which this data can be contained, including online HTML pages or XML-formatted
web service responses, data stored in relational databases, custom text formats and many more.
Figure 16: Transformation process of raw data to their semantic representation
The output of this transformation process is the set of axioms, which is graphically represented by a
graph structure, where individual resources, as members of the classes defined in the respective
ontology, represented by circles in the figure, are interconnected by object properties and also
associated with data values by data properties. Each resource is uniquely identified by its URI under a
namespace that follows a naming scheme that can be decided by the examiner. A simple naming
scheme that can be followed is to use the file name of the evidence file or the name of the source of
supportive data appended by a descriptive name of the resource e.g. an IP address, a filename etc. The
transformation process can be either natively supported by forensic acquisition or forensic analysis
tools where an export function to a set of ontological axioms is provided or through specialized
software components/parsers. In the next step, the generated sets of axioms can be merged in order to
provide a complete ontological representation of all the sources of data involved in the case. During
the merging some apparent issues that need further treatment are those of integration and correlation in
and between the different data sets which are discussed in the subsequent sections.
7.3.1 Semantic Integration
In order to reduce the complexity of the asserted axioms that are generated by the aforementioned
process as well as being able to establish semantic relationships between multiple identical or closely
related individuals, an integration process is required. Three issues have been identified that the
integration step can resolve: integrating identical individuals within the same set of axioms,
integrating identical individuals across different sets of axioms under the same ontology, and finally
integrating the same or similar individuals across different sets of axioms expressed under different ontologies.
The first issue can be tackled through de-duplication. As an example we can consider an IP address
which appears to have multiple network communications with different hosts in the same network
capture file. In order to reduce the complexity of subsequent reasoning, it is important that the
transformation process creates a single OWL Individual for each IP address. The RDF data model
allows the reuse of this resource either as the subject or the object of other axioms which promotes
integration of axioms that have shared members. An example is given in Figure 17 where two
different TCP sessions are connected with the same resource representing an individual that is member
of the IP address class. The two sessions are even connected with this resource via different object
properties as an IP address can act simultaneously both as a client of a remote service and as a server
providing other network services. The forensic tool or the respective parser can easily support such a
feature with an internal Set data structure that keeps a single URI resource for each distinct value.
Figure 17: De-duplication of data by semantic integration using URIs
The second issue is the integration of individuals representing the same entity when present under
different namespaces. Each forensic tool or parser can transform a specific type of evidence into an
ontological set of axioms under a unique per-source namespace. In the case of multiple evidence files
of the same type there is the possibility that multiple OWL individuals, one under each file's namespace,
are created representing the same entity, e.g. the same IP address being present in multiple network
captures. OWL 2 has introduced support for keys (Parsia et al. 2008), a DL-safe form of inverse
functional properties with support for data values as well. By defining a HasKey axiom, named instances
of a class that have the same values on specified object and/or data properties can be considered to be
the same, and the owl:sameAs relationship connecting them can be inferred by a reasoning engine.
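As a sketch of how such a key can be declared, and assuming the packet capture ontology exposes the
literal address through a data property (here called hasAddressValue, a hypothetical name), the
following Turtle axiom lets the reasoner infer owl:sameAs between individuals created under different
capture namespaces:

pcap:IPAddress  owl:hasKey  ( pcap:hasAddressValue ) .

# Given the two assertions below, a reasoner infers
# c1:ip_203_0_113_7 owl:sameAs c2:ip_203_0_113_7 .
c1:ip_203_0_113_7  a pcap:IPAddress ; pcap:hasAddressValue "203.0.113.7" .
c2:ip_203_0_113_7  a pcap:IPAddress ; pcap:hasAddressValue "203.0.113.7" .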
A last issue is the integration of individuals that may represent the same or similar concepts between
different ontologies. An example would be the concept of the IP address which is present in both
network captures as well as firewall logs and many more network-related sources of data. As
discussed before, an advantageous approach would be if well-accepted ontologies were specified that
covered such generic terms and were imported and reused in tool-specific ontologies. However, this thesis
follows a more realistic approach by accepting that each tool or data source is associated with its
respective ontology without any shared common base. This introduces the problem that classes
in different ontologies may represent the same or closely related concepts, thus hampering automated
integration of this data. Although research efforts exist on automated or semi-automated multi-ontology
assignment (Jean-Mary et al. 2009), in the context of this thesis a more manual approach is applied.
A solution followed in this thesis is the introduction of SWRL rules which can be used to infer
additional axioms that the expressivity limitations of OWL cannot support. SWRL rule evaluation by a
rule engine can establish interconnections between individuals that are members of classes belonging
to different ontologies but that represent related concepts. These new connections can be established in the
form of new object properties with a descriptive name and appropriately defined domain and range
restrictions. An example can be an object property by the name ‘FWHostToPacketCaptureIPAddress’
with individuals that are members of the class ‘Host’ of the WindowsXPFirewall ontology as the
domain restriction and individuals that are members of the class ‘IPAddress’ of the PacketCapture
ontology as the range restriction. The axioms defining these ‘bridging’ object properties can be either
included in the respective ontologies that need to be integrated or in a new dedicated ontology. This
thesis follows the second approach so as to decouple the different domain ontologies that can be
developed and expanded separately from each other. A new ontology is designed to import the various
domain ontologies that the provided sources of data may require and where such ‘bridging’ object
properties along with other concepts of combinatorial nature can be defined.
A graphical example is given in Figure 18 where the green colored arrow represents an object
property that ‘bridges’ two individuals that are members of different classes of different ontologies. In
this case, an individual of the ‘IPAddress’ class from the ‘PacketCapture’ ontology is connected with
an individual of the ‘Host’ class from the ‘WindowsXPFirewall’ ontology. This allows the integration
of different datasets thus enabling an automated manner to combine data sets from different sources
and further advanced reasoning and query capabilities.
Figure 18: Semantic Integration of related individuals represented in different ontologies
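Written in the human-readable SWRL syntax, such a bridging rule could look roughly as follows; the
data properties that carry the literal address values (pcap:hasAddressValue and fw:hasAddress) and the
bridge namespace are assumptions and may differ from the actual models:

pcap:IPAddress(?ip) ^ pcap:hasAddressValue(?ip, ?addr) ^
fw:Host(?host) ^ fw:hasAddress(?host, ?addr)
    ->  bridge:PcapIPToFWLogHost(?ip, ?host)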
Based on the types of data sources that this thesis focuses on and the ontologies specified in the
previous section, the following similar classes between different ontologies have been identified and
respective object properties have been specified for interlinking individuals that are members of the
former classes:
Table 11: Integration semantic mappings between ontologies
Ontology A : Class A | Ontology B : Class B | Linking Object Property
PacketCapture : IPAddress | WindowsXPFirewallLog : Host | PcapIPToFWLogHost
PacketCapture : IPAddress | WHOIS : IPAddress | PcapIPToWHOISIpAddr
PacketCapture : IPAddress | FIRE : IPAddress | PcapIPToFireIPAddr
WindowsXPFirewallLog : Host | WHOIS : IPAddress | FWLogHostToWHOISIpAddr
WindowsXPFirewallLog : Host | FIRE : Host | FWLogHostToFireHost
WHOIS : IPAddress | FIRE : IPAddress | WHOISIpAddrToFireIPAddr
PacketCapture : Port | WindowsXPFirewallLog : Port | PcapPortToFWLogPort
HTTP : Content | DigitalMedia : File | HTTPContentToMediaFile
HTTP : Content | VirusTotal : File | HTTPContentToVTFile
DigitalMedia : File | VirusTotal : File | MediaFileToVTFile
The linking properties that have been defined have focused on two main types of information, namely
IP addresses and hash values of files. The IP address is a piece of information that is present in almost any
type of network-related log and is quite fundamental in every network forensics investigation.
An IP address can be observed either as a source or destination endpoint of a network packet or
communication stream, or as the source or destination of a logged event from various network security
appliances such as network or host firewalls, IDS, VPN, network authentication logs etc. The ability to
link in an automated manner multiple observations of the same IP in different data sources enables
the reconstruction of the actions that an IP address has performed and its history of communications,
the acquisition of additional information about the network and the operator that owns it, as well as
cross-checking with IP reputation and ban lists for possible malicious behavior. On the other side, a file signature such as
an MD5 or SHA1 hash value can be used in order to reconstruct the trail of a file from its network
transmission up to its storage on the media device as well as cross-checking with antimalware or other
file analysis services that can provide more information about the content or the malicious nature of it.
These two types of integrate-able information are visually presented in Figure 19.
Figure 19: Integration of IP addresses/MD5 hash signatures
7.3.2 Evidence Correlation
Besides integration attempts that strive to interlink individual resources that represent the same or
closely related concepts, correlation can enable the investigator to connect resources of totally
different nature. As such, in the context of this thesis we will use the term correlation for the ability
to automatically establish relations between resources of a different nature, of the same or different
domains/ontologies, either by the reasoner or by rule evaluation. In order to correlate resources of
different types there has to be a common reference, some type of information to which these resources
are directly or indirectly connected. Different types of correlation techniques are explicitly or
implicitly used in digital investigations such as temporal, spatial, mereological, size, IP-to-user and
many more. In this thesis we focus on two types of correlation, namely temporal and mereological
which are further discussed below.
7.3.2.1
Temporal Correlation
Time is of paramount importance in almost every type of digital investigation. A variety of different
resources can carry time related metadata including transmission time of a network packet, duration of
a network communication, timestamps of file activities on a file-system like creation, modification and
last access time, timestamps of logged events from firewalls, IDS, operating systems and many more.
A correlation of different resources based on their reference to time can enable the investigator to
reconstruct a global timeline of events that incorporates events from heterogeneous sources such as
disk images, network activity, logs etc. One forensic tool worth mentioning is log2timeline
(Guðjónsson 2010), which supports a large number of different input formats and merges different types
of events into one combined timeline. One problem with this approach, though, is that although the output
format of the tool supports advanced visualization techniques, events are linked only in a
sequential manner based on their timestamps, without support for more advanced types of queries as
discussed below. One point that needs attention is the translation of the different timestamps to a
common format and locale. Different timestamps may be expressed in different time-zones depending on
the local time-zone settings of the data source. This can create inconsistent results when timestamps
from different sources are combined together. We follow the assumption that this translation is part of
the preprocessing phase and that the timestamps that are given as input to the method are expressed in
a common format and time zone.
The approach followed in this thesis for semantically describing time-related information, as well as
performing temporal correlation of resources, is based on the method discussed in (M. J. O. Connor & Das
2011). The proposed SWRL Temporal Ontology is based on the valid-time temporal model upon
which a fact or a proposition can be associated with time instants or time intervals during which the
fact is considered to be true or valid. The SWRL Temporal Ontology provides various OWL entities in
the form of classes, object and data properties that can be used in order to represent arbitrary
propositions, time instants and intervals as well as the granularity of the temporal information. The
ontology specifies the class ‘ExtendedProposition’ as a semantic representation of any type of fact or
proposition that can carry temporal information. An instance that is a member of this class can be
connected via the ‘hasValidTime’ object property with instances that are members of the
‘ValidInstant’ or ‘ValidPeriod’ classes that represent time instants or periods respectively.
There are two main ways to introduce semantically expressed temporal information to existing
ontologies. The first is to modify the existing ontologies by adding new properties to declared classes
with their range pointing to instances of the 'ValidInstant' or 'ValidPeriod' classes. The problem with
this approach is that modifications may be needed to a large number of ontologies and inconsistencies
may arise between modified and non-modified ontologies. The other approach is to use sub-classing
for specifying classes of the existing ontologies to be subclasses of the ‘ExtendedProposition’ class.
One advantage with the latter approach is that the original ontology does not have to be modified as
the sub-classing axioms can be specified in an external ontology that imports the original ones. The
only problem with this approach is that axioms that have been asserted by the transformation process
and include references to date and time values have to be re-expressed so as to introduce instances of
the new temporal classes as well as be properly linked to them. This process is graphically represented
68
in Figure 20 where an original axiom about a timestamp with a domain specific data property is
converted so as to use the SWRL Temporal Ontology.
Figure 20: Conversion process according to the SWRL Temporal Ontology
Specifically, the class to which the instance belongs is declared to be a subclass of the
'ExtendedProposition' class. An individual is created for every distinct timestamp that is a member of
the class 'ValidInstant', or of the class 'ValidPeriod' in case a time period needs to be represented. This
new individual is then linked by the data property 'hasTime' with the literal value of the XML Schema
DateTime type. In case the individual represents a time period, two data properties are used, by
the names 'hasStartTime' and 'hasFinishTime', to connect the former with the two time endpoints.
After this conversion operation finishes, all time-related information has been converted in a semantic
uniform representation that can be further leveraged into performing temporal correlations between
individuals of different types. In this thesis, we have used Allen's Interval Algebra (AIA) as
described in (Allen 1983) which provides a description of the different relationships that time intervals
can have between them. AIA has defined 13 relations between time intervals which are graphically
represented in Figure 21.
69
Figure 21: Temporal relations of Allen's Interval Algebra
The rules that define the relations between the start and end timestamps of two intervals and the
resulting predicate that links them have been encoded in corresponding SWRL rules. Besides the
above relations that pertain to time intervals, (Hobbs & Pan 2004) have expanded and defined
additional predicates for relations between time instants and time intervals. The predicate ‘inside’ can
be used to describe a time instant that is between the start and end time points of an interval while the
predicates ‘before’ and ‘after’ can also be used when a time instant is before the start or after the end
of an interval respectively. The research of (M. J. O. Connor & Das 2011) has provided a set of SWRL
built-ins implementing the Allen temporal operators on temporal entities that can be used in SWRL
rules which upon evaluation can assert new axioms that utilize these temporal predicates. The
predicates that have been described above have been defined in a separate ontology that imports all the
domain specific ones with the aim of minimizing the need of modifications to the original ontologies.
The domain and range of these object properties have been defined to be instances of the Event class, a
generic class that is used to represent any type of event that can be referred to with respect to time.
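As an example, a rule that asserts the 'before' predicate between two events based on their valid times
could be written roughly as follows; the name of the temporal built-in is given only indicatively here and
should be checked against the documentation of the SWRLTab temporal library:

Event(?e1) ^ Event(?e2) ^
hasValidTime(?e1, ?t1) ^ hasValidTime(?e2, ?t2) ^
temporal:before(?t1, ?t2)
    ->  before(?e1, ?e2)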
7.3.2.2
Mereological Correlation
The other type of correlation studied in this thesis is part-to-whole relations, which belong to the field
of 'mereology'. These relations commonly represent the connections between the parts of an
entity and the entity itself. A common characteristic of such relations is transitivity: if A is
part of B and B is part of C, then A is also part of C. The semantics of this type of relation must not be
confused with similar types of relations such as containment, membership and sub-classing, as the
semantics of being part of something, as well as transitivity, are lacking in the latter cases
(Keet & Artale 2007).
One such relation used in the selected types of data sources is the correlation between IP
networks and Autonomous Systems (AS). Autonomous systems are commonly a collection of IP
prefixes that are managed by the same network operator. By correlating instances of IP addresses
contained in either network packet captures or firewall logs with the data sets provided by Internet
registries such as RIPE, part-to-whole relations can be established between the IP addresses and the
Autonomous Systems they are parts of.
Another such relation is the connection between a disk image, a partition, a file and a
sequence of bytes in the same data stream. More specifically, given the manner in which a modern file
system like NTFS operates and the possibility of fragmentation, a logical file allocated in the file system
can be split into a number of byte runs at different physical locations on the disk. A file, though, is also
part of a partition, which is a division of the disk into distinct areas that can support different
file systems. Finally, the partition is also a part of the disk image, thus completing a chain of part-to-whole
relations.
Part-to-whole correlations between individuals of different types as those mentioned above can be
established either in the same ontology or between different ontologies. In the first case, the initial
transformation process from the raw input data to their ontological representation can establish these
relations along with appropriately defined predicates in the respective ontology. However, this method
does not rely on the domain-specific ontology always providing such expressivity, and thus allows
the establishment of such relations via SWRL rules as well, which can later be utilized by the reasoning
engine. This applies also in the second case where these 'partOf' types of relations can be
established between individuals originating from the ontological representation of different sources of
data and their respective ontologies. A graphical example of the establishment of these relations is
presented in Figure 22 where an individual representing an IP address is connected through a
‘partOfAS’ predicate with a resource representing an Autonomous System. This correlation can
provide additional information about different IP addresses that are under the same network operator
as well as the inverse relations of which IP addresses belong to an autonomous system.
Figure 22: Mereological correlation between IP addresses and Autonomous Systems
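A minimal sketch of such part-to-whole chains, with the part-of property declared as transitive so that
the reasoner can propagate the relation along the chain, could look as follows; the property and
individual names are illustrative.

:partOf  a owl:ObjectProperty , owl:TransitiveProperty .

img:file_42_run_1  :partOf  img:file_42 .
img:file_42        :partOf  img:ntfs_volume_1 .
img:ntfs_volume_1  :partOf  img:partition_1 .
img:partition_1    :partOf  img:image_1 .
# Through transitivity the reasoner infers, e.g., img:file_42 :partOf img:image_1 .
# The partOfAS relation between an IP address and its Autonomous System (Figure 22) is handled analogously.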
7.4 Query Formulation and Evaluation
After the semantic representations of the different evidence files and supportive data have been
interconnected through the inference and the rule engine, a single integrated dataset is available to the
investigator for further examination. SPARQL has been standardized as the recommended query
language for the Semantic Web. SPARQL can operate on RDF data such as those produced by the semantic
representation phase. The use of SPARQL allows the investigator to query the set of asserted
and inferred RDF triples via pattern matching. In this thesis, the SELECT type of query was the main
focus, since it is the primary method of information retrieval compared to the usage patterns of the other
types of queries.
The structure of a SELECT query can be divided in the following parts:
 PREFIX: The prefixes are used to establish a mapping between a URI and a shortcut that can be
used in the rest of the query for improved readability.
 FROM: This optional clause is used when multiple datasets exist. In this thesis all RDF triples were
aggregated in a single dataset instead.
 WHERE: This is the core part of the query where triple patterns are specified against which the
dataset is searched for possible matches.
 LIMIT, OFFSET, ORDER BY: These are optional query modifiers with similar use as in SQL.
 FILTER: This optional clause enables the query to include constraints that restrict the set of the
query results upon various criteria. There is a variety of constraints that can be specified such as
XPath tests of comparison between XSD typed literals, regular expression checks against literal
values, XPath arithmetic operations on numeric values etc.
The set of the triple patterns specified in the WHERE clause form a so called ‘basic graph pattern’.
The triple patterns are similar to RDF triples, with the only difference that any or all of the three
parts (subject, predicate, object) can be a variable, syntactically denoted by the prefix '?'. The
SPARQL processor searches the given RDF dataset attempting to find a sub-graph of it that matches
the given triple patterns. In the case that such sub-graphs are found, the values of the dataset replace
the corresponding variables in the 'basic graph pattern'. The variables that have been matched to their
corresponding values in the dataset are said to be bound, and all or a subset of them can be returned as the
result of the query. SPARQL also supports OPTIONAL triple patterns, in which case a graph pattern
that is partially matched to the dataset, leaving some variables unbound, can still be included in the
query results.
An example of such graph pattern matching is shown in Figure 23, where a small set of triple patterns
is queried against a larger dataset and its variables get bound to their respective values. This
simplistic example shows a query that searches for possible TCP connections whose destination IP
addresses belong to networks located in China. The graph pattern on the left has variables
in the place of the RDF resources that are searched for. The SPARQL engine scans the whole graph of the
given dataset in order to find existing graph patterns that match the queried one. On the right side, a
portion of the dataset is presented, where the resources colored green are the ones that
match the query's graph pattern. The variables of the query are bound to the values of these
resources, URIs or literals. As can be seen, unless the OPTIONAL clause is used, a graph
pattern has to be matched completely in order to be included in the results. As an example, the
individual pcap:UDP1 of Figure 23 is not included in the result since it is a UDP flow instead of a
TCP flow, although the remaining parts match the query pattern.
Figure 23: SPARQL graph pattern matching
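The query of Figure 23 could be written roughly as follows; the prefixes, the bridging property of
Table 11 and the WHOIS properties of Table 8 are used with illustrative namespace URIs, and the exact
property names may differ in the actual ontologies.

PREFIX pcap:   <http://example.org/ontologies/PacketCapture#>
PREFIX whois:  <http://example.org/ontologies/WHOIS#>
PREFIX bridge: <http://example.org/ontologies/Bridge#>

SELECT ?flow ?dstAddr ?net
WHERE {
    ?flow     a pcap:TCPFlow ;
              pcap:hasDestinationIP ?pcapIp .
    ?pcapIp   bridge:PcapIPToWHOISIpAddr ?whoisIp .
    ?whoisIp  whois:hasAddress ?dstAddr ;
              whois:isContainedInRange ?range .
    ?range    whois:hasCountry "CN" ;
              whois:hasNetName ?net .
}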
7.5 A reference method implementation
The method so far has been described on a generic level that elaborated on the different parts of it
along with a discussion on how to perform key functions such as integrating and correlating
individuals among the ontological representations of the various data sources. In this section, a
reference implementation of the method will be presented along with a brief discussion of the tools
and techniques used. Semantic Web technologies are continuously reshaping and gaining advanced
capabilities with the introduction of new standards and technologies, thus some of the
ones used in this thesis may change significantly in the future. The method is quite flexible and the
presented proof of concept (PoC) system is just one of the different ways in which the method can be
implemented. As discussed in the last section of this report, new advancements in distributed
SPARQL endpoints and queries or the Rule Interchange Format can allow the integration of different
technologies and even more distributed and decentralized architectures.
7.5.1 Overview of the tools used
The main tools used to build the reference implementation of the method are given below along with a
short description.
 Java 1.6.0 Update 25 and the Eclipse IDE – Most of the code that has been written for the proof-of-concept implementation has been in Java, and the Eclipse IDE has been used as a code editor along
with its project management capabilities.
 Protégé 4.1.0 – Protégé is an Ontology Development and Editing tool. Protégé has been used in order to
specify the various domain ontologies that have been used. Protégé provides also an embedded
reasoning engine that allows the manual creation of individuals and inference of new axioms based
on the defined restrictions and class hierarchy. The tool can assist in the discovery of
inconsistencies in the defined ontology as well as visualization of its parts for easier understanding.
 OWL API 3.2.4 – The OWL API is a Java API that provides a reference implementation for the
programmatic creation, modification and serialization of OWL ontologies. The project is open
source and already supports most of the features of OWL 2.0. The API provides a number of
different parsers thus can support different serialization formats for ontologies such as RDF/XML,
OWL/XML, Turtle, KRSS and others. The API provides also interfaces for integration with
reasoning engines for performing inference on given ontologies. The API has been used extensively
in order to transform the different raw input data into their respective ontological representations
according to the specified domain ontologies; a minimal usage sketch is given after this list.
 Pellet 2.3.0 – Pellet is an OWL reasoning engine that provides an API for programmatic reasoning
on OWL-based ontologies. Pellet is one of the most established reasoners for OWL, with support for
most of OWL 2. Pellet can be integrated with the OWL API thus providing reasoning
services to the application.
 Protégé-OWL API 3.4.8 – The Protégé editor has two major versions, the 3.X and the 4.X, which
are incompatible between them. Although Protégé 4.X uses the OWL API as the internal API for
editing of the ontologies, Protégé 3.X uses the Protégé-OWL API which is another Java library for
manipulation of OWL ontologies. This API has been used in this project in order to provide support
for integration with a rule engine as well as support for the SWRL temporal ontology and SWRL
built-ins since the latter have been developed on top of it.
 Jena 2.6.4 – Jena is a Java framework that contains an extensive set of libraries that deal with
Semantic Web technologies. The framework has mostly focused on the support of RDF
and RDF Schema, with support for different storage mechanisms for sets of triples, such as in-memory or in relational databases, and provision of basic inference capabilities. Jena can also be
integrated with DL reasoning engines such as Pellet for more advanced OWL 2 inference while
recent versions provide support for manipulation of OWL ontologies as well. Jena also provides
support for the SPARQL query language for the Semantic Web stack through the SPARQL API
that allows querying and updating of RDF knowledge bases. The SPARQL API is used in this
project for the querying part of the method, although restricted to a command-line environment
without using advanced data publishing capabilities over HTTP that the sub-project Fuseki
provides.
 Jess 7.1p2 – Jess is a rule engine based on Java that has the ability to reason upon a given
knowledge base in the form of declarative rules. Jess is based on the Rete algorithm for efficient
processing of rules (Forgy 1982). The research of (M. O. Connor et al. 2005) has led to the
‘SWRLTab’ which is a development environment both integrated with Protégé as a plugin as well
as an API that allows the definition and evaluation of SWRL rules. The ‘SWRLTab’ provides a
bridging interface for integration of its core APIs with different rule engines. Currently only the
Jess rule engine is supported; although Jess does not support SWRL itself, the 'SWRLTab' internally
handles the conversion between the OWL representation and the one that Jess requires. The API
allows the definition of SWRL rules in a convenient text-based format while also automatically
handles the import of the OWL model and the SWRL rules to the Jess rule engine, the evaluation of
the rules and the transfer of any inferred knowledge back from the Jess rule engine to the OWL
model.
 Kraken PCAP 1.3.0 – The Kraken PCAP is a Java API that allows the programmatic manipulation
of network packet captures in the ‘pcap’ format. This API provides a number of different network
protocol processors that can handle various technical aspects of network communications such as IP
fragmentation, TCP stream reassembling and even application layer protocols support such as
HTTP or SMTP decoding and extraction of transmitted files. This library has been slightly
modified in order to be better adjusted to the needs of this project and has been used for the
ontological representation of network packet captures.
 Apache HTTP Components, JSoup, JSON – Other supportive Java APIs that have been used
include the Apache HTTP Components, which provide a toolset for programmatic support of a basic
HTTP client, the JSoup library, which eases the parsing of received HTML content for the extraction
of various pieces of data, and a JSON library that allows the parsing of received data formatted in JSON
(JavaScript Object Notation).
7.5.2 Architecture of the PoC system
The PoC system has been designed so as to support the selected types of data sources which namely
are network packet captures, hard disk forensic images in the form of the 'fiwalk' tool DFXML-formatted output, Windows XP firewall logs, IP registration information from the RIPE registry
database, reputation lists of maliciously behaving autonomous systems and their hosts and finally
results of the online anti-malware VirusTotal service. A schematic overview of the implemented PoC
system is presented in Figure 24 and its various components are further discussed below.
Figure 24: Proof-of-concept system architecture
The first component of the system is the Evidence Manager. The evidence manager is tasked to parse
the arguments given to the application regarding the locations and names of the different evidence files and
load their byte contents for further processing. The Evidence Manager is also responsible for
identifying the type of the given source (e.g. pcap, disk image, firewall log), either manually based on a user-given
parameter or automatically based on the contents or the input file's signature, and it also properly
manages and categorizes all these evidence files. The evidence manager can also be responsible for
verifying the integrity of the given files in the case that hash values of the original files are also
provided. The evidence manager has been implemented to accept the locations of the evidence files in
the form of command line arguments although this could have been implemented also using a
graphical interface. The evidence manager can only load files from the local examiner’s system
although future enhancements could enable it to load evidence files via the network as well. The
evidence manager provides simple call interfaces that allow the other parts of the system to fetch files
of specific types or to iterate over all the input evidence files.
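A minimal sketch of such an Evidence Manager is given below; the class, enumeration and method names are illustrative assumptions and not the exact PoC code, and type identification is naively based on file extensions here, whereas the actual component may also inspect file signatures.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Evidence Manager component (names are assumptions).
public class EvidenceManager {

    // Evidence types supported by the PoC system.
    public enum EvidenceType { PCAP, DISK_IMAGE_DFXML, FIREWALL_LOG, UNKNOWN }

    private final List<File> evidenceFiles = new ArrayList<File>();

    // Evidence file locations are accepted as command-line arguments.
    public EvidenceManager(String[] paths) {
        for (String path : paths) {
            File f = new File(path);
            if (f.isFile()) {
                evidenceFiles.add(f);
            }
        }
    }

    // Naive type identification based on the file extension only.
    public EvidenceType identifyType(File f) {
        String name = f.getName().toLowerCase();
        if (name.endsWith(".pcap")) return EvidenceType.PCAP;
        if (name.endsWith(".xml")) return EvidenceType.DISK_IMAGE_DFXML;
        if (name.endsWith(".log")) return EvidenceType.FIREWALL_LOG;
        return EvidenceType.UNKNOWN;
    }

    // Simple call interface used by the semantic parsers.
    public List<File> getFilesOfType(EvidenceType type) {
        List<File> result = new ArrayList<File>();
        for (File f : evidenceFiles) {
            if (identifyType(f) == type) {
                result.add(f);
            }
        }
        return result;
    }
}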
The second component is the semantic parser. The semantic parser is a generic Java interface that
contains the definition of a 'parseToOntology' method. Two overloaded methods have been defined:
one accepts the path and name of an evidence file, and the other accepts a
collection of strings which, as discussed below, could be sets of IP addresses, MD5 hash values,
timestamps etc. The method returns an object of the class 'OWLOntology' which is part of the
OWLAPI and represents an OWL ontology along with the OWL Axioms and OWL Annotations it
consists of. Concrete Java classes have been created that implement the above abstract methods for the
different types of source data. As such 6 parsers have been implemented, one for each source type,
which ontologically represent the input data according to the specified ontologies. As seen in the
Figure, the parsers are also loading the respective domain ontologies, the ontologies for the different
formats of data or tool outputs as specified in §7.2. The ‘Semantic Parser’ is iteratively loading all the
files that the ‘Evidence Manager’ is handling along with the respective ontology and transforms the
input data into a set of OWL Axioms that contains the OWL Individuals, their class memberships,
links between them by object properties as well as their data values by data properties. ‘Semantic
Parsers’ are able to load the evidence file from the examiner’s local disk or fetch data from online
services such as VirusTotal and the RIPE database.
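The following sketch illustrates what such a parser interface could look like with the OWL API, together with a fragment showing how a concrete parser might assert an individual. The 'TCPSession' class name and the individual IRI are assumptions chosen for illustration and not necessarily the exact names used in the PoC ontologies.

import java.util.Collection;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLNamedIndividual;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;

// Generic parser interface: both overloads return a set of OWL axioms as an OWLOntology.
interface SemanticParser {
    OWLOntology parseToOntology(String evidenceFilePath) throws OWLOntologyCreationException;
    OWLOntology parseToOntology(Collection<String> values) throws OWLOntologyCreationException;
}

// Fragment of a concrete parser: assert one individual and its class membership.
class PcapSemanticParserSketch {
    // Namespace of the packet capture domain ontology (as declared in the query prefixes later on).
    private static final String PCAP_NS = "http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#";

    OWLOntology example() throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLOntology ontology = manager.createOntology(IRI.create("urn://case1#pcap"));

        OWLClass tcpSession = factory.getOWLClass(IRI.create(PCAP_NS + "TCPSession")); // assumed class name
        OWLNamedIndividual session = factory.getOWLNamedIndividual(IRI.create("urn://case1#tcpSession_6"));
        manager.addAxiom(ontology, factory.getOWLClassAssertionAxiom(tcpSession, session));
        return ontology;
    }
}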
One problem that our system had to address was the amount of axioms that a database like RIPE or
VirusTotal could lead to. RIPE provides an interface to their registry database that can theoretically
provide data for each one of the 4 billion IP addresses. Similarly, VirusTotal maintains a database of
over 100 million files. In order to reduce the complexity of the implemented system, and given the
practical limitation that specific online services such as RIPE or VirusTotal may block large amounts
of parallel requests, the idea of collectors has been implemented. The collectors are List data
structures that the different parsers have access to and fill with entries as
the input data are processed. As such, collectors are responsible for maintaining a list of entities
that are actually observed in the main evidence files which later are loaded in the ‘Semantic Parser’ for
further processing. Four collectors have been defined, namely the 'IPAddressCollector', the
‘MD5HashCollector’, the ‘ASNumberCollector’ and the ‘TimestampCollector’.
The ‘IPAddressCollector’ is filled with entries of IP addresses that the parsers of the packet captures
and the firewall logs are encountering. After finishing processing all the files of these types, the final
list of IP addresses is then fed to the RIPE-specific semantic parser that iteratively queries the RIPE
database in order to acquire additional information about these IP addresses and represent them
semantically based on the respective ontology. The ‘MD5HashCollector’ maintains entries of MD5
hash values of files that are either found to be in a disk image or transferred via a network protocol
like HTTP and extracted from a network packet capture. The list of MD5 hash values is then fed to the
VirusTotal-specific semantic parser that iteratively queries the VirusTotal service through its online
web service to check whether any file has already been examined and listed as malicious. The
‘ASNumberCollector’ is maintaining a list of the numbers of Autonomous Systems based on the
results retrieved from the RIPE database. The AS numbers of networks whose IP addresses have been
encountered during the processing of the evidence files are then used by the FIRE-specific semantic
parser in order to search for networks that are blacklisted as containing malicious hosts. Finally the
‘TimestampCollector’ collects timestamps of either time instants as those found in firewall logs or
MAC file timestamps from disk images or representing time intervals such as the beginning and end
timestamps of a TCP session. The timestamps are used by a specialized component which upon
iteration generates individuals of the ‘ValidInstant’ and ‘ValidPeriod’ classes as defined in the SWRL
Temporal ontology which are later connected with other resources upon evaluation of SWRL rules.
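A collector can be as simple as a shared, de-duplicating collection of string values; the sketch below, with assumed names, shows the general shape (the PoC uses plain List structures, so a Set is used here only to avoid duplicate entries). The final contents are later handed to the corresponding semantic parser via its parseToOntology(Collection<String>) overload.

import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative collector for IP addresses observed while parsing evidence files.
public class IPAddressCollector {

    private final Set<String> addresses = new LinkedHashSet<String>();

    // Called by the pcap and firewall log parsers for every address they encounter.
    public void collect(String ipAddress) {
        addresses.add(ipAddress);
    }

    // Handed to the RIPE-specific semantic parser once all files have been processed.
    public Collection<String> getCollected() {
        return addresses;
    }
}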
The collectors are not an essential part of the method as they can be implemented in different ways.
One solution could be if online services like RIPE and VirusTotal provided SPARQL endpoints that
would enable remote implementations of the method to query and fetch axioms relevant to the
encountered entities. In such a case, integration of individuals that were created from the domain
ontologies could be performed dynamically with the online sets of OWL axioms and their URI
references. Another point is that the semantic parsers have been implemented in a more involved
manner since they should be able to accept object references to the collectors’ data structures and
appropriately update them. A different, although more complex, technique that could have been
adopted is the iteration over the set of axioms that a semantic parser returns and the extraction
of all data-type properties of interest (e.g. all the data-type properties that are of type XML Schema
dateTime, or of custom-defined types representing, for instance, IP addresses).
In an additional effort to reduce the complexity of the resulting set of axioms, a technique commonly
used in digital forensics with respect to known file hashes has been adopted. Most vendors of forensic
suites such as Encase and FTK provide lists of hashes of known files that are commonly part of the
operating system or well-recognized applications and thus should be ignored during the investigation.
NIST maintains such a list, known as the National Software Reference Library (NSRL). The SANS
website provides a query interface to this library via an online HTML-based form
(https://isc.sans.edu/tools/hashsearch.html). The user may fill in the hash value of a file and retrieve
information in the case that the hash is included in the database and the name of the file it belongs to.
In order to promote automation of the removal of such files from forensic tools, the database can be
queried also by using the DNS protocol. A special DNS zone, ‘md5.dshield.org’ has been configured
where the tool may issue a DNS request for a hostname of the form 'hashvalue.md5.dshield.org',
substituting 'hashvalue' with the corresponding hash. In the case of a successful lookup, the tool can infer that
the hash value is contained in the database of known 'good' files, and the file can thus be removed or ignored in
further forensic analysis. The semantic parser that processes the output of the Fiwalk tool representing
in XML the contents of a disk image has been adapted with such functionality. Thus, the resulting set
of axioms can contain a considerably lower amount of files and their resulting axioms. For a relatively
fresh disk image of a Windows XP SP3 OS installation, approximately 40-50% of the files were
removed by such an approach.
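A possible sketch of this DNS-based lookup is shown below using the JNDI DNS provider; it is assumed here that the 'md5.dshield.org' zone answers with a TXT record, and a failed lookup is simply treated as the hash not being in the known-file list.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

// Illustrative known-file filter querying the NSRL-derived hash list over DNS.
public class KnownFileFilter {

    public boolean isKnownGoodFile(String md5Hash) {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        try {
            InitialDirContext ctx = new InitialDirContext(env);
            Attributes attrs = ctx.getAttributes(md5Hash + ".md5.dshield.org", new String[] { "TXT" });
            return attrs.get("TXT") != null; // a record exists: the file is a known 'good' one
        } catch (NamingException e) {
            return false; // no record found: keep the file for further analysis
        }
    }
}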
The next component is that of the ‘Inference Engine’. The OWL axioms that the semantic parsers have
asserted are aggregated together under a temporary namespace of an initially blank ontology. This
blank ontology can of course be named after some case identifier or any other type of identification
scheme the investigator may use. The inference engine imports all the referenced domain ontologies
that the semantic parsers have used and is now able to perform automated reasoning according to the
OWL/OWL2 specifications. Pellet is the reasoning engine used in the PoC system, and it offers
granular control over which types of OWL axioms the examiner wants to be inferred. One type is the
generation of class assertion axioms, such as in the case of class hierarchies where an individual that is
a member of a subclass can be inferred to be a member of the parent class as well. Another useful type
is the generation of inverse object property assertions, where two individuals are linked in the reverse
direction based on an asserted object property and its defined inverse, which can
improve the performance of query execution later on.
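A minimal sketch of this step with the OWL API and Pellet's OWL API v3 binding could look as follows; only the class assertion generator is registered here, but further generators can be added analogously for other inferred axiom types.

import java.util.ArrayList;
import java.util.List;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.util.InferredAxiomGenerator;
import org.semanticweb.owlapi.util.InferredClassAssertionAxiomGenerator;
import org.semanticweb.owlapi.util.InferredOntologyGenerator;
import com.clarkparsia.pellet.owlapiv3.PelletReasonerFactory;

// Sketch: materialize selected inferred axiom types back into the merged case ontology.
public class InferenceStep {

    public void materialize(OWLOntologyManager manager, OWLOntology caseOntology) {
        OWLReasoner reasoner = PelletReasonerFactory.getInstance().createReasoner(caseOntology);

        List<InferredAxiomGenerator<? extends OWLAxiom>> generators =
                new ArrayList<InferredAxiomGenerator<? extends OWLAxiom>>();
        generators.add(new InferredClassAssertionAxiomGenerator()); // e.g. parent class memberships

        new InferredOntologyGenerator(reasoner, generators).fillOntology(manager, caseOntology);
    }
}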
In a realistic environment, the expressivity of the ontologies provided by the different domain tools is not
guaranteed, and the investigator may desire to introduce new concepts that are combinations of
multiple concepts dispersed in different ontologies without being able to modify the original
ontologies. For these reasons, the method allows the creation of additional ontologies to support the definition of
additional ontological assertions. In the PoC system, an additional ontology called
‘IntegrationOntology’ has been created and it imports all the referenced domain ontologies as well as
their indirect references. The investigator is able to use an ontology editing tool such as Protégé to
define new classes, additional restrictions, new object properties etc. The inference engine is then able
to infer new axioms based not only on the domain ontologies but also on the investigator's custom-defined
one. This allows for a level of flexibility, since the domain ontologies may not be easily modifiable, but
the OWL specification allows, and even promotes, the creation of supplementary ontologies that may
reuse entities defined in other ontologies.
The next component is the SWRL Rule Engine. In the PoC system the Jess rule engine has been
adopted; although Jess does not directly support SWRL rules and OWL ontologies, a bridging API that is
part of the Protégé-OWL API provides the capability to load the SWRL rules and the set of
ontological axioms into the rule engine, evaluate them, and import back any newly inferred axioms. SWRL rules
have been used in order to establish relations between individuals belonging in different ontologies but
representing similar concepts like IP addresses as well as correlate different individuals based on
shared grounds like time. As such, SWRL rules play a major role in the automated integration and
correlation parts of the method. The SWRL rules can be kept in a separate text file, which promotes the
decoupling of the actual rules from the implementation of the method and enables sharing and
reuse of the rules across multiple cases.
The final component is responsible for accepting SPARQL queries from the user and evaluating them
against a SPARQL query engine. The Jena framework provides the ARQ query engine that
supports the SPARQL RDF query language. The set of RDF triples that have been either asserted
during the semantic parsing of the source data or inferred by the reasoning or the rule engine are
loaded in-memory and SPARQL queries can be evaluated against it. The queries can once more be
stored in separate files, thus promoting reuse and decoupling from the implemented system. The folder
that contains the text files of the SPARQL queries is given to the program in the form of a command-line argument; the program then iteratively loads and evaluates them. Currently the results are output to
the console with a simplistic table-based formatting.
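A sketch of this querying step with Jena's ARQ (using the Jena 2.x package names, and assuming the query string has been read from one of these text files) could look roughly as follows:

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

// Sketch: evaluate a SPARQL SELECT query over the merged RDF/XML serialization.
public class QueryStep {

    // rdfXmlPath: the serialized case ontology; queryString: contents of one query text file.
    public void run(String rdfXmlPath, String queryString) {
        Model model = ModelFactory.createDefaultModel();
        FileManager.get().readModel(model, rdfXmlPath); // loads the triples into memory

        Query query = QueryFactory.create(queryString);
        QueryExecution exec = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = exec.execSelect();
            ResultSetFormatter.out(System.out, results, query); // simple table on the console
        } finally {
            exec.close();
        }
    }
}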
The above PoC system provides a basic implementation of the proposed method. Each component
implements a part of the method as the latter has been described in §7.1. It needs to be emphasized
that the implemented system is far from optimal since all the processing is performed sequentially and
the whole set of triples is stored in memory. The system can be designed much more efficiently by
taking advantage of upcoming technologies such as SPARQL Federated Query where a single
SPARQL query can be evaluated over multiple and diverse data sources, and by utilizing dedicated
persistence layers for storing the triples/graphs such as Jena's TDB and SDB triple stores. One
other point of discussion is the order of the calls to the reasoning and the rule engine. In the case that
the call to the reasoning engine may need to take advantage of axioms that can only be inferred by the
SWRL rules, then the call order may need to be reversed. Another option can be that a call to the
reasoning engine is done twice, one before the rule engine and the other after it. An interesting point to
highlight though is that the implemented system is quite flexible since besides the code part of the
semantic parsers, the calls to the reasoning, the rule and the query engine can be performed in a
dynamic and non-hardcoded manner. The information needed to perform reasoning, rule and query
evaluation does not have to be part of the code but can be kept in separate files, such as RDF/XML
format for the ontologies and text files for the SWRL rules and the SPARQL queries.
8 Demonstration of the Method
In this section a demonstration of the method is shown using a common scenario that involves a host
getting infected by malware after visiting a malicious web-site. The experiment conducted has the
goal of evaluating the feasibility of the method as well as the qualitative advantages that it provides to
the investigator for the analysis of the case and the acquired evidence files. The infection of a machine
upon visiting a malicious website and its further exploitation via a remote terminal or even desktop
connection that the attacker may establish is a quite common method of intruding into a private network. In
the case of an advanced type of malware that includes antivirus evasion techniques, or when the
compromised machine is not protected by any local anti-malware engine or only by one that is not properly
updated, the system's compromise may go unnoticed. A forensic examination of the system's
drive and its further scanning against an updated antimalware engine by the investigator may reveal
the residence of malicious files (e.g. executables, DLL libraries) that the attacker may have
downloaded on the system after the exploitation.
A typical scenario of a remote compromise of a networked host involves three or four steps, namely
reconnaissance in order to identify running and possibly vulnerable services on the remote system,
exploitation where the actual vulnerability is exploited using some or multiple attack vectors, privilege
escalation where the attacker attempts to gain administrative privileges on the compromised system
and finally post-exploitation where the attacker may perform various tasks such as download
additional malicious software on the compromised system, leave backdoors for further connection,
exfiltrate important files resident on the victim, modify OS settings etc. The exploitation stage can be
performed either in a direct manner, where the attacker attempts to exploit a vulnerability of a
networked service running on the remote system, or by attempting to trick the victim into activating the
exploit code directly on the system. The first method commonly uses buffer overflow attacks as the
attack vector such as stack-based and heap-based techniques that may enable the attacker to execute
arbitrary code on the compromised system. The second method is usually carried out using social
engineering techniques, such as tricking the user into downloading and executing some malicious file sent as
an email attachment, hosted on a malicious website or shared through file sharing channels. Common attack
vectors employed in the second method are vulnerabilities that document-related software may have
such as Microsoft Word, Adobe Acrobat Reader, Picture Viewers or the support that such formats
provide for active content like JavaScript or Flash embedded code or vulnerabilities that web browsers
may have that enable specially crafted web content to bypass their security sandbox and execute code
on the victim’s machine without even the need of any user interaction (drive-by download &
execution).
After successful vulnerability exploitation, the attacker is able to execute code of his choice on the
compromised system. Commonly, this code is termed as ‘shell-code’ since it usually binds a command
shell process to a network port. The attacker, upon establishing a network connection to this port,
can remotely access the compromised system through this command shell, enabling him to perform a
vast array of tasks (e.g. create users, set up remote desktop services, disable local antimalware or
firewall services etc.). Even more advanced techniques exist that allow the attacker to inject the
compromised system with additional code libraries that enable the attacker to perform even more
advanced tasks (e.g. taking desktop screenshots, logging user keystrokes etc.). After a successful
exploitation, the connection can be set up in two different manners, the forward and the reverse one. In
the forward manner, the shell-code that has been executed on the system establishes a listening port
waiting for incoming connections from the attacker in order to serve the remote shell over it. Since most
network or host firewalls block incoming requests to most ports, and modern OSs prompt the user
when a process attempts to run a listening service, this kind of remote connection is highly likely
to be unsuccessful. The reverse connection forces the compromised system to initiate the connection
to a listening service on the attacker’s end since outgoing connections are less probable to be
disallowed by a firewall.
8.1 Description of the Experiments
In order to better evaluate the proposed method and proof-of-concept implementation, experiments
have been performed in a controlled local area network (LAN) environment between two connected
hosts, the attacking one and the victim. In order to avoid distributing malicious files over the normal
university network or other commercial ones (ISPs), a choice has been made to simulate such a system
infection in an isolated LAN environment. However, in order for the method to be able to integrate data
from the collected evidence with other data sources such as RIPE and FIRE, a manual process of IP
address modification has been followed. Since data sources like RIPE and FIRE do not obviously
maintain any data about local IP addresses, the IP address of the attacking machine has been changed
to a known malicious one, such as those reported by FIRE. This change does not affect the integrity
of the results of the method and it can be implemented in either a local properly configured routed
setup or by modifying the values of the generated triples after the semantic transformation of any
packet capture evidence file including the attacker’s IP address.
The experiments have been performed on two systems interconnected via a network hub in a LAN that
provided Internet Access as well. Both systems were the same model, namely HP Compaq 8000 Elite
with an Intel Core 2 Duo E8400 Processor and 4096MB of RAM. The first system was running
Microsoft Windows XP Professional with Service Pack 3. The other system was running BackTrack 5
Release 1, a Linux based distribution known for its collection of exploitation tools such as Metasploit.
The exploitation part of the experiment has been centered on the disclosed MS11-006 vulnerability in
Windows Shell Graphics Processing that affected multiple versions of Windows OS such as Windows
XP Service Pack 3, Windows Vista SP 1 and 2 as well as Windows Server 2008. The corresponding Metasploit module exploits
a buffer overflow vulnerability in how Windows handles thumbnails in Office documents. A specially
crafted Office document can trigger the vulnerable code when the user navigates to the folder that
contains it and display the folder’s contents in ‘Thumbnails’ view. Metasploit provides a plugin that
allows an attacker to generate such a malicious document combined with a payload of his choice. The
payload could be either code that will establish a listening service on a predefined network port on the victim or
code that will force the victim to initiate a connection to a predefined remote IP address and port. The
payload can also establish either a simple shell connection or additionally use the 'meterpreter' post-exploitation extension provided by Metasploit, which provides even more capabilities to the attacker.
After generating the malicious document, the attacker has to find a way to transfer the file on the
victim’s system. In the current setup, the attacker has setup a web server from which the user could
download it. As said before, the attacker simulates a known malicious host such as those listed by the
FIRE project under the ‘Exploit Server’ category. The user could be led to just a site serving malicious
files either directly through a link received in the body of an email, an IM message or in the content of
another webpage or social networking site. The user could be led also in a drive-by scenario by
visiting other sites initially and her browser being automatically redirected. After downloading of the
file, we assume that the user may navigate to the folder containing the received file at some point of
time and in the case that ‘thumbnail’ view is selected, the exploit code to be triggered.
In the first implemented scenario, the Windows XP system gets compromised by executing the exploit
code but a locally configured firewall disallows incoming connections to unknown ports. The attacker
has selected to create a listening port on the victim machine at port 4444. As such, although the port is
listening, the Windows XP Firewall is not allowing incoming connections to be established thus
preventing the attacker from continuing his attack. The Windows Firewall has been configured so as to
log both dropped packets as well as successfully established connections. The network traffic has been
captured by a third system connected to the hub and using Wireshark and saved as a file in the ‘pcap’
format. The scenario is schematically presented in Figure 25 below.
Figure 25: Attack scenario of a ‘bind_tcp’ shellcode triggered by a malicious Word document downloaded
from the Web.
Finally a forensic image of the victim’s system has been taken using FTK Imager and stored in the raw
format (dd). The fiwalk tool was used to get a DFXML-formatted XML file representing the files
present on the disk and their associated metadata. The firewall log file is extracted from the disk image
using common forensic techniques such as using FTK. The malicious file has not been deleted from
the victim’s system and the forensic image was taken approximately 5 minutes after the execution of
the malicious code. In a more realistic scenario though, there would be possibly a considerable amount
of time between the compromise of the system and the initiation of the digital investigation process.
However, provided that logs such as captured network traffic are properly retained and that the
malicious file is still allocated on the file system, the method should produce the same result,
although a much larger dataset would hinder its performance. In order to further reduce the
complexity of the analysis, the Windows XP system was used for the experiment just after the
completion of OS installation thus reducing the number of user-generated files to a minimum.
In the second experiment, a variation of the first scenario is simulated. In that case, the malicious
document that is generated and downloaded by the victim attempts, after exploitation succeeds, to
establish a TCP connection to a remote IP and port that is under the control of the attacker. It is quite
common for such outgoing connections to use well-known ports such as port 80 that is used for HTTP
traffic so as to minimize the risk of getting denied by firewalls. The attacker has properly set up the
attacking system to listen for incoming connections from freshly compromised systems. Upon the
establishment of the connection, and depending on the payload used, the attacker may be able to
establish a shell connection to the compromised system or utilize the ‘meterpreter’ plugin provided by
Metasploit for even more advanced post-exploitation capabilities. The ‘meterpreter’ plugin is injecting
a DLL in the memory of the compromised system providing advanced functions to the attacker such as
automated privilege escalation, download/upload files between the two systems, taking a screenshot of
the victim’s desktop, operating a keylogger etc. The injected DLL is not stored on the disk thus
minimizing any possible traces of the connection. In the conducted experiment, the meterpreter plugin
has been used and after the establishment of the connection, some sample operations have been
performed such as downloading to the compromised system an additional malicious file, uploading a
random file as well as dumping the hashed passwords of the user accounts from the system’s registry.
The firewall allowed by default outgoing connections towards port 80 although it was configured so as
to log allowed connections as well. The implemented scenario is graphically represented in Figure 26.
Figure 26: Attack scenario of a ‘reverse_tcp’ shellcode triggered by a malicious Word document
downloaded from the Web.
8.2 Integration and Correlation of Digital Artifacts
The first step of the experiment is the parsing of the collected evidence files (one packet capture in
'pcap' format, one DFXML-formatted file as the output of 'fiwalk' on the disk image, and a text file
that contains the Windows XP firewall logs). The respective semantic parsers are applied on the
evidence files and the generated triples are stored in individual ontologies which are serialized to
RDF/XML formatted text files.
In the table below some quantitative data about the collected evidence files are summarized.
Table 12: Semantic Representation of the Experiment 1 Evidence Files

CompromisedSystem.xml (Fiwalk output of the system's disk image)
Original Disk Size: 25GB
Original Fiwalk XML output File Size: 9,46MB
RDF/XML Serialization File Size: 7,08MB
Number of Allocated Files in the Disk: 6610
Number of Nodes in the Graph Representation: 34012
Number of Edges in the Graph Representation: 83032

Network Packet Capture (filtered for the system's IP address and TCP protocol only)
Original File Size: 454KB
RDF/XML Serialization File Size: 662KB
Number of TCP sessions: 40
Number of Nodes in the Graph Representation: 1616
Number of Edges in the Graph Representation: 5891

Windows XP Firewall Log of the compromised system
Original File Size: 38KB
RDF/XML Serialization File Size: 684KB
Number of Log Entries: 413
Number of Nodes in the Graph Representation: 1344
Number of Edges in the Graph Representation: 5866

RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 210KB
Number of Queried IP Addresses: 37
Number of Nodes in the Graph Representation: 137
Number of Edges in the Graph Representation: 395

FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 2,45MB
Number of Queried and Indexed by VT Files: 2304
Number of Nodes in the Graph Representation: 11519
Number of Edges in the Graph Representation: 18508
After parsing of the evidence files and their conversion to their respective semantic representations,
the set of triples generated by each parser are merged to a single ontology. The resulting set of triples
is the sum of all the nodes and edges present in the individual graphs, which leads to a quite complex
graph. Visualization software such as Gephi can be used to render the resulting graph, as
presented in Figure 27, where, although not very clearly visible, the sub-graphs of the evidence files are
disconnected from each other.
Figure 27: Visualization of the semantic representation of the evidence files.
Use of the inference engine can introduce new triples based on the specified ontologies. Despite the
ontologies specified for the evidence types used in this thesis being quite lightweight by restricting to
basic expressions such as parent-child class relationships and inverse object properties, the Pellet
inference engine introduced 72130 inferred axioms, which amounted to an increase of the RDF/XML
serialization of the ontology by approximately 6,1MB.
The next step is the establishment of interconnections between the separate sub-graphs through the use
of the ‘bridging’ ontology and its specified classes and object properties. As discussed before such
bridging can be performed by a variety of modeling approaches such as hierarchical relationships of
properties between different ontologies, SPARQL construct queries, SWRL rules etc. The main
approach followed in this thesis was through SWRL rules since it provided additional expressivity
than OWL as well better programmatic support for temporal related rules. The following table
contains the definition of the SWRL rules that have been specified as well as the number of axioms
that the Jess rule engine generated and later imported back to the main ontology.
Table 13: SWRL Rule Evaluation Results for Experiment 1

Rule 1
Definition: PacketCapture:hasIPValue(?x,?y) ^ WindowsXPFirewallLog:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToFWLogHost(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the Firewall ontologies.
Number of Generated Axioms: 26

Rule 2
Definition: PacketCapture:hasIPValue(?x,?y) ^ WHOIS:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToWHOISIpAddr(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the RIPE ontologies.
Number of Generated Axioms: 17

Rule 3
Definition: PacketCapture:hasIPValue(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToFireIPAddr(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the FIRE ontologies.
Number of Generated Axioms: 1

Rule 4
Definition: WindowsXPFirewallLog:hasAddress(?x,?y) ^ WHOIS:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:FWLogHostToWHOISIpAddr(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the Firewall and the RIPE ontologies.
Number of Generated Axioms: 37

Rule 5
Definition: WindowsXPFirewallLog:hasAddress(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:FWLogHostToFireHost(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the Firewall and the FIRE ontologies.
Number of Generated Axioms: 2

Rule 6
Definition: WHOIS:hasAddress(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:WHOISIpAddrToFireIPAddr(?x,?w)
Description: The rule 'bridges' individuals referring to the same IP address value between the RIPE and the FIRE ontologies.
Number of Generated Axioms: 2

Rule 7
Definition: PacketCapture:TCPPort(?x) ^ PacketCapture:hasNumericalValue(?x,?y) ^ WindowsXPFirewallLog:hasNumber(?w,?z) ^ swrlb:equal(?y,?z) -> IntegrationOntology:PcapPortToFWLogPort(?x,?w)
Description: The rule 'bridges' individuals referring to the same network port number between the 'PacketCapture' and the Firewall ontologies.
Number of Generated Axioms: 34

Rule 8
Definition: PacketCapture:hasContentMD5(?x,?y) ^ DigitalMedia:hasMD5(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:HTTPContentToMediaFile(?x,?w)
Description: The rule 'bridges' individuals referring to the same MD5 hash value between the 'PacketCapture' and the 'DigitalMedia' ontologies.
Number of Generated Axioms: 18

Rule 9
Definition: PacketCapture:hasContentMD5(?x,?y) ^ VirusTotal:hasMD5Hash(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:HTTPContentToVTFile(?x,?w)
Description: The rule 'bridges' individuals referring to the same MD5 hash value between the 'PacketCapture' and the 'VirusTotal' ontologies.
Number of Generated Axioms: 1

Rule 10
Definition: DigitalMedia:hasMD5(?x,?y) ^ VirusTotal:hasMD5Hash(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:MediaFileToVTFile(?x,?w)
Description: The rule 'bridges' individuals referring to the same MD5 hash value between the 'DigitalMedia' and the 'VirusTotal' ontologies.
Number of Generated Axioms: 22

Rule 11
Definition: WindowsXPFirewallLog:FirewallEvent(?x) ^ WindowsXPFirewallLog:hasDateTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) -> temporal:hasValidTime(?x,?z)
Description: The rule 'connects' individual firewall events to the individuals representing the temporal instants.
Number of Generated Axioms: 413

Rule 12
Definition: DigitalMedia:File(?x) ^ DigitalMedia:hasFileCreationTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filecreationevent,?x) -> IntegrationOntology:FileCreationEvent(?filecreationevent) ^ IntegrationOntology:hasFileCreationEvent(?x,?filecreationevent) ^ temporal:hasValidTime(?filecreationevent,?z)
Description: The rule 'connects' file individuals to temporal instant individuals based on their file creation timestamp. A new individual is also created that is a member of the 'FileCreationEvent' class.
Number of Generated Axioms: 9795

Rule 13
Definition: DigitalMedia:File(?x) ^ DigitalMedia:hasFileAccessTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?fileaccessevent,?x) -> IntegrationOntology:FileAccessEvent(?fileaccessevent) ^ IntegrationOntology:Event(?fileaccessevent) ^ temporal:hasValidTime(?fileaccessevent,?z)
Description: The rule 'connects' file individuals to temporal instant individuals based on their file last access timestamp. A new individual is also created that is a member of the 'FileAccessEvent' class.
Number of Generated Axioms: 9795

Rule 14
Definition: DigitalMedia:File(?x) ^ DigitalMedia:hasMetadataChangeTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filemetadatachangeevent,?x) -> IntegrationOntology:FileMetadataChangeEvent(?filemetadatachangeevent) ^ IntegrationOntology:Event(?filemetadatachangeevent) ^ temporal:hasValidTime(?filemetadatachangeevent,?z)
Description: The rule 'connects' file individuals to temporal instant individuals based on their file metadata change timestamp. A new individual is also created that is a member of the 'FileMetadataChangeEvent' class.
Number of Generated Axioms: 9795

Rule 15
Definition: DigitalMedia:File(?x) ^ DigitalMedia:hasFileModificationTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filemodificationevent,?x) -> IntegrationOntology:FileModificationEvent(?filemodificationevent) ^ IntegrationOntology:Event(?filemodificationevent) ^ temporal:hasValidTime(?filemodificationevent,?z)
Description: The rule 'connects' file individuals to temporal instant individuals based on their file modification timestamp. A new individual is also created that is a member of the 'FileModificationEvent' class.
Number of Generated Axioms: 9795

Rule 16
Definition: PacketCapture:hasStartTimeStamp(?x,?y1) ^ PacketCapture:hasEndTimeStamp(?x,?y2) ^ temporal:hasStartTime(?z,?z1) ^ temporal:hasFinishTime(?z,?z2) ^ temporal:equals(?y1,?z1) ^ temporal:equals(?y2,?z2) -> temporal:hasValidTime(?x,?z)
Description: The rule 'connects' TCP/UDP sessions to temporal period individuals based on their start and finish timestamps.
Number of Generated Axioms: 17
In the second experiment, the same approach has been followed. Some quantitative data describing the
evidence files of the second experiment are shown below.
Table 14: Semantic Representation of the Experiment 2 Evidence Files

CompromisedSystem.xml (Fiwalk output of the system's disk image)
Original Disk Size: 25GB
Original Fiwalk XML output File Size: 9,34MB
RDF/XML Serialization File Size: 6,44MB
Number of Allocated Files in the Disk: 3273
Number of Nodes in the Graph Representation: 16330
Number of Edges in the Graph Representation: 45039

Network Packet Capture (filtered for the system's IP address and TCP protocol only)
Original File Size: 2,63MB
RDF/XML Serialization File Size: 2MB
Number of TCP sessions: 57
Number of Nodes in the Graph Representation: 5419
Number of Edges in the Graph Representation: 21712

Windows XP Firewall Log of the compromised system
Original File Size: 46KB
RDF/XML Serialization File Size: 784KB
Number of Log Entries: 480
Number of Nodes in the Graph Representation: 1510
Number of Edges in the Graph Representation: 6794

RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 38KB
Number of Queried IP Addresses: 41
Number of Nodes in the Graph Representation: 181
Number of Edges in the Graph Representation: 326

FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 54KB
Number of Queried and Indexed by VT Files: 2540
Number of Nodes in the Graph Representation: 253
Number of Edges in the Graph Representation: 386
In addition to the evaluation of the aforementioned rules that establish relationships between similar or
identical individuals belonging to different ontologies, an emphasis on temporal-related rules has also
been given for the 2nd case. Taking advantage of the semantic parsers' work, which has generated
individuals representing time instants and time periods in accordance with the SWRL Temporal
Ontology, as well as the SWRL temporal built-ins provided by the Protégé-OWL API that implement
most of Allen's temporal operators, custom temporal rules can be specified and evaluated by the
rule engine. Case 2's evidence files resulted in 1024 individuals representing time
instants (file and firewall events) and 21 individuals representing time periods (TCP sessions).
Examples of such temporal rules are presented below along with their results for the 2nd case's
evidence files; these rules enable the investigator to establish temporal relations, such as before, after, or starting
at the same time, between the semantic representations of time events and their associated forensic
events.
Table 15: SWRL Rule Evaluation Results for Experiment 2

Rule 1
Definition: temporal:hasTime(?x,?t1) ^ temporal:hasTime(?y,?t2) ^ temporal:before(?t1,?t2) ^ temporal:add(?t1Plus,?t1,60,temporal:Seconds) ^ temporal:add(?t2Plus,?t2,0,temporal:Seconds) ^ temporal:before(?t2Plus,?t1Plus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)
Description: The rule connects the individuals of time instants with a custom property when one of the timestamps is positioned before the other, but by no more than one minute.
Number of Generated Axioms: 55770

Rule 2
Definition: temporal:hasStartTime(?x,?z) ^ temporal:hasStartTime(?y,?w) ^ temporal:hasFinishTime(?x,?z2) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?z,?w) ^ temporal:before(?z2,?w) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?wPlus,?w,0,temporal:Seconds) ^ temporal:before(?wPlus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)
Description: The rule connects the individuals representing time periods when one period begins before another by no more than one minute and also ends before the other has started.
Number of Generated Axioms: 136

Rule 3
Definition: temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?z,?w1) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?w1Plus,?w1,0,temporal:Seconds) ^ temporal:before(?w1Plus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)
Description: The rule connects individuals representing time instants and time periods when a time instant is positioned before the start of the period by no more than one minute.
Number of Generated Axioms: 1008

Rule 4
Definition: temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?w2,?z) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?w1Plus,?w1,0,temporal:Seconds) ^ temporal:before(?w1Plus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?y,?x)
Description: The rule connects individuals representing time instants and time periods when a period starts and ends before a time instant by no more than one minute.
Number of Generated Axioms: 1841

Rule 5
Definition: temporal:hasStartTime(?x,?t1) ^ temporal:hasStartTime(?y,?t3) ^ temporal:hasFinishTime(?x,?t2) ^ temporal:hasFinishTime(?y,?t4) ^ temporal:equals(?t1,?t3) ^ temporal:before(?t2,?t4) -> IntegrationOntology:temporalStarts(?x,?y)
Description: The rule connects individuals representing time periods when the two periods start at the same time but end at different timestamps.
Number of Generated Axioms: 33

Rule 6
Definition: temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:after(?z,?w1) ^ temporal:before(?z,?w2) -> IntegrationOntology:temporalInside(?x,?y)
Description: The rule connects individuals representing time instants and time periods when a time instant falls between the beginning and the end of the time period.
Number of Generated Axioms: 64
It should be obvious that the investigator has much more flexibility to create even more meaningful
temporal-related rules utilizing the full spectrum of the Allen operators. In the aforementioned
examples a time window of one minute has been chosen, since such exploitation events (download and
execution of a malicious file) can happen quite fast, but also in order to reduce the complexity of the
rules. Unrestricted temporal relations between timestamps, especially in cases with multiple and large
files, can lead to quite 'heavy' rules with high amounts of generated axioms. The context of the case,
along with the shared expertise of the forensic community, can lead to various heuristics for the
specification of more useful and lightweight rules.
8.3 Hypothesis formulation and evaluation
After the successful completion of all the previous steps, the investigator has at his/her disposal a large
set of triples which represent concepts commonly used in a digital investigation, individual entities of
these concepts corresponding to the observed and logged events of the case, as well as the semantic
relationships that interconnect all these resources. In the last step, the investigator can utilize the
powerful SPARQL language in order to address queries to the collected dataset and receive
meaningful results. The main benefit of the approach followed so far is that the dataset does not
consist of different formats that require separate tools and manual interpretation by the investigator, but is instead a
conceptual and logical representation of the evidence. In this section, we will follow the possible stages of a
digital investigator’s analysis process in order to verify the security incident that may have
compromised the system as well as reconstruct all the involved parties and sequence of events.
Assuming that the investigator has been provided with the aforementioned evidence files and does not
have access to any additional information, she is called to understand if the system has been
compromised and if so how this may have been accomplished. One of the first steps that the
investigator may follow is to gain familiarity about the provided dataset and the information it carries.
The investigator should of course have already an understanding of the ontologies used for the
semantic representation of the evidence, which we claim to be a much easier task to accomplish than
gaining technical expertise in all the different tools that evidence processing demands. One of the
largest advantages of the SPARQL language, and RDF in general, is that the actual data and
metadata such as information about the schema can both be present in the same set of triples. In such a
manner the investigator can get an overview of the provided dataset by querying the dataset
itself directly. Additionally, in order to reduce the extent of the reported queries in the document, the SPARQL
query part with the mappings between the prefixes used and the respective vocabulary namespaces is
shown once below and assumed in all the following queries.
PREFIX whois: <http://people.dsv.su.se/~dossis/ontologies/WHOIS.owl#>
PREFIX integration: <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#>
PREFIX xpfw: <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#>
PREFIX fire: <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#>
PREFIX packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#>
PREFIX digitalmedia: <http://people.dsv.su.se/~dossis/ontologies/DigitalMedia.owl#>
PREFIX virustotal: <http://people.dsv.su.se/~dossis/ontologies/VirusTotal.owl#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
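As a simple illustration of such an exploratory overview (an assumed query for illustration, not one of the case hypotheses below), the investigator could, for instance, list the distinct classes that have instances in the merged dataset, reusing the prefixes above:

SELECT DISTINCT ?class
WHERE {
  ?individual rdf:type ?class
}
ORDER BY ?class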
The investigator may formulate some hypotheses which can be expressed in the form of queries in
order to evaluate quickly the potential significance or relevance of the provided evidence to the case.
Some sample queries are provided below along with excerpts of their results that can demonstrate the
potential ability of the proposed method for fast and efficient queries upon aggregated data from
heterogeneous domains.
Hypothesis
The investigator hypothesizes that the compromised system may have had network
communications with external IP addresses that belong to autonomous systems that may be
listed as malicious networks.
Query
SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
?tcpflow packetcapture:hasDestinationIP ?destip .
?destip packetcapture:hasIPValue ?destipvalue .
?destip integration:PcapIPToWHOISIpAddr ?whoisip .
?whoisip whois:isContainedInRange ?range .
?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?range whois:hasRange ?rangeValue .
?range whois:isContainedInAS ?as .
?as whois:hasNetName ?netname .
?as whois:hasASNumber ?asnumber .
?as whois:hasRoute ?route
}
Results
tcpflow: <urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>
destipvalue: "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
netname: "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
asnumber: "24940"^^<http://www.w3.org/2001/XMLSchema#string>

tcpflow: <urn://bind_tcp_FWed_tcp.pcap#tcpSession_4>
destipvalue: "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
netname: "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
asnumber: "24940"^^<http://www.w3.org/2001/XMLSchema#string>

Interpretation
The results of the query support the hypothesis that the compromised system had indeed
network communications with IP addresses that belong to autonomous systems known to
demonstrate malicious behavior. The query is able to match a graph pattern in the provided
dataset thus retrieving additional information regarding the specific blacklisted AS.
Hypothesis
A common attack vector for compromising a system is through malware that may have been
downloaded and executed by the user from the Web. A hypothesis is that a file that has been
downloaded from the Web, as extracted from the packet capture, can be recognized as
malicious by the anti-malware engine.
Query
SELECT DISTINCT ?tcpflow ?http ?uri ?md5 ?link
WHERE {
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?vtreport .
?vtreport virustotal:hasPermanentLink ?link
}
Results
tcpflow: <urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>
http: <urn://bind_tcp_FWed_tcp.pcap#http_52>
uri: "/msf.doc"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "9E10A7844BA8BA4EFE1A514D27105735"^^<http://www.w3.org/2001/XMLSchema#string>
link: "http://www.virustotal.com/file/8bacecdc64d63b334bca23f46cb0723119bbaafa148479d4b917852c2ee44943/analysis/"^^<http://www.w3.org/2001/XMLSchema#string>

Interpretation
The SPARQL results show that upon evaluating the provided query on the integrated data of
the different types of evidence, a path could be found that connects one of the files that have
been extracted from the packet capture with an identified and analyzed known malware. The
result provides support to the investigator's hypothesis that the compromised system may indeed
have downloaded and potentially executed a malicious file. It should be obvious that the
SPARQL SELECT graph pattern can be expanded in order to retrieve more information such
as the involved IP addresses, TCP ports, HTTP request and result headers or the individual
comments by each antivirus engine that VirusTotal supports.
Hypothesis
The investigator hypothesizes that in the event of a successful compromise, traces of malicious
files may also be found in the hard disk’s image of the system. The query searches for any files
that may be listed as malicious by the anti-malware service.
Query
SELECT DISTINCT ?file ?pathName ?md5
WHERE {
?file rdf:type digitalmedia:File .
?file digitalmedia:hasPathName ?pathName .
?file digitalmedia:hasMD5 ?md5 .
?file integration:MediaFileToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?report .
?report virustotal:hasResult ?result .
?result virustotal:hasResultDescription ?description
}
Results
file: <urn://infectedHostNEW.xml#file_5578>
pathName: "WINDOWS/system32/drivers/beep.sys"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "da1f27d85e0d1525f6621372e7b685e9"^^<http://www.w3.org/2001/XMLSchema#string>

file: <urn://infectedHostNEW.xml#file_758>
pathName: "Documents and Settings/John/My Documents/msf.doc"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "9e10a7844ba8ba4efe1a514d27105735"^^<http://www.w3.org/2001/XMLSchema#string>

file: <urn://infectedHostNEW.xml#file_5686>
pathName: "WINDOWS/system32/drivers/vdmindvd.sys"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "55e01061c74a8cefff58dc36114a8d3f"^^<http://www.w3.org/2001/XMLSchema#string>

file: <urn://infectedHostNEW.xml#file_6139>
pathName: "WINDOWS/system32/services.exe"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "0e776ed5f7cc9f94299e70461b7b8185"^^<http://www.w3.org/2001/XMLSchema#string>

file: <urn://infectedHostNEW.xml#file_2847>
pathName: "WINDOWS/pchealth/helpctr/System/Remote Assistance/Interaction/Client/RAClient.htm"^^<http://www.w3.org/2001/XMLSchema#string>
md5: "cb4a33bd4fce7cc2eeccdc45d939e8b7"^^<http://www.w3.org/2001/XMLSchema#string>

Interpretation
The results support the hypothesis that the system was indeed infected with malicious files. The
above results are an excerpt of the complete results, which amounted to 22 in total. Some
of the reported files seem to be false positives; however, the structured format of these results
may allow further integration with additional datasets such as hash lists of malicious files or
further checks against online anti-malware services (e.g. online sandboxes and binary
analysis).
Hypothesis
The investigator hypothesizes that some of the malicious files identified on the image of the
system’s drive may have been downloaded from web communications the system may had.
If the downloaded malware was further stored in the disk without modifications, then hash
value equality provides a way to track the path of the file as it was downloaded and then
stored on the disk. To make the hypothesis even more explicit, the investigator may refine
the query so as to search for such files that may come from known malicious hosts and
networks.
Query
SELECT ?file ?uri ?destip ?host_fire ?type ?asname
WHERE {
?file rdf:type digitalmedia:File .
?file digitalmedia:hasMD5 ?md5 .
?httpbody integration:HTTPContentToMediaFile ?file .
?httpresp http:body ?httpbody .
?httpreq http:requestURI ?uri .
?httpreq http:resp ?httpresp .
?http packetcapture:hasHTTPRequest ?httpreq .
?http rdf:type packetcapture:HTTP .
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?tcpflow packetcapture:hasDestinationIP ?destip .
?destip integration:PcapIPToFireIPAddr ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?host_fire rdf:type ?type .
?host_fire fire:isContainedInAS ?as .
?as fire:hasASName ?asname
}
Results

file | uri | host_fire | type
<urn://infectedHostNEW.xml#file_758> | "/msf.doc" | <urn://firedb#host_78.46.173.193> | <http://www.w3.org/2002/07/owl#Thing>
<urn://infectedHostNEW.xml#file_758> | "/msf.doc" | <urn://firedb#host_78.46.173.193> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#ExploitServer>
<urn://infectedHostNEW.xml#file_758> | "/msf.doc" | <urn://firedb#host_78.46.173.193> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#Host>
<urn://infectedHostNEW.xml#file_758> | "/msf.doc" | <urn://firedb#host_78.46.173.193> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#MaliciousHost>
<urn://infectedHostNEW.xml#file_758> | "/msf.doc" | <urn://firedb#host_78.46.173.193> | <http://www.w3.org/2002/07/owl#NamedIndividual>
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
The results show that a malicious file named 'msf.doc' has indeed been downloaded and later
stored on the disk. The host from which the file was downloaded was included in the FIRE
blacklist as a malicious host and, more specifically, has been categorized as an Exploit Server.
Hypothesis
The investigator hypothesizes that after downloading and storing the malicious file on the system,
some user or automated action may have led to an actual exploitation of the system and the
execution of arbitrary malicious code. It is common that malicious code that manages to install
itself on the system attempts to communicate with other external systems (e.g. to exfiltrate data,
receive commands, or participate in DDoS attacks). The query searches for any firewall events
that indicate unsuccessful connection attempts to the host.
Query
PREFIX whois: <http://people.dsv.su.se/~dossis/ontologies/WHOIS.owl#>
PREFIX integration: <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#>
PREFIX xpfw: <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#>
PREFIX fire: <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#>
PREFIX packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#>
PREFIX digitalmedia: <http://people.dsv.su.se/~dossis/ontologies/DigitalMedia.owl#>
PREFIX virustotal: <http://people.dsv.su.se/~dossis/ontologies/VirusTotal.owl#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT DISTINCT ?event ?ripeAS ?type ?fireip ?host_type ?fireipB
WHERE {
?event xpfw:hasSourceHost ?host .
?event rdf:type xpfw:FirewallEvent .
?event rdf:type ?type .
?host integration:FWLogHostToFireHost ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?host_fire rdf:type ?host_type .
?host_fire fire:isContainedInAS ?as .
?host integration:FWLogHostToWHOISIpAddr ?ripeip .
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?tcpflow packetcapture:hasDestinationIP ?destip .
?destip integration:PcapIPToFireIPAddr ?fireipB .
?fireipB fire:IPbelongsToHost ?host_fireB .
?host_fireB fire:isContainedInAS ?as .
?ripeip whois:isContainedInRange ?riperange .
?riperange whois:isContainedInAS ?ripeAS .
?ripeAS whois:hasCountry ?country .
FILTER regex(str(?type),"Event","i")
FILTER regex(str(?host_type),"Server","i")
}
Results

event | ripeAS | type | fireip | host_type | fireipB
<urn://pfirewall.log#event_299> | <urn://ripedb#as_24940> | <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#DropDataEvent> | <urn://firedb#ipaddr_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer> | <urn://firedb#ipaddr_78.46.173.193>
<urn://pfirewall.log#event_315> | <urn://ripedb#as_24940> | <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#DropDataEvent> | <urn://firedb#ipaddr_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer> | <urn://firedb#ipaddr_78.46.173.193>
<urn://pfirewall.log#event_273> | <urn://ripedb#as_24940> | <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#DropDataEvent> | <urn://firedb#ipaddr_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer> | <urn://firedb#ipaddr_78.46.173.193>

Interpretation
The results provide support for the hypothesis that the system's firewall has indeed prohibited
and logged incoming connections from an IP address that belongs to the same Autonomous
System as some of the web communications that the system has been reported to have had with
malicious hosts. Mereological correlation of the part-to-whole relationship that connects IP
addresses with the AS to which they belong has allowed quick identification of further
communication that the compromised system may have had with other malicious hosts. In this case,
the firewall events were of the 'DropDataEvent' type, which represents unsuccessful incoming
connections. The originating IP address of these attempts not only belonged to the same
malicious network as other web communications the system had, but was also blacklisted as a
Command and Control Server (CCServer).
Based on the sample queries presented above, the investigator has gained considerable support for
some of the hypotheses that can be significant to the case at hand. The queries' results have indicated
that the system had web communications with known malicious hosts, that it downloaded and stored
malicious files on its disk, and that there were further attempts to communicate with other malicious
hosts that were probably related to the former ones.
Based on the temporal relations that have been established in the 2nd case, temporal queries can also be
evaluated against the aggregated and integrated data. Some examples of such temporal queries are
given below.
Hypothesis
After extracting and identifying a malicious file in one of the web communications that the
system had, the investigator hypothesizes that a successful execution of it can lead to the
infection of the machine with additional malicious files that the attacker may inject. The
investigator formulates a query to identify any possible malicious files that have been created
on the host’s disk shortly after the malicious web communication.
Query
SELECT DISTINCT ?uri ?file ?pathname ?descriptionDisk
WHERE {
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?report .
?report virustotal:hasResult ?result .
?tcpflow temporal:hasValidTime ?duration .
?duration integration:temporalBefore ?timestamp .
?fileevent temporal:hasValidTime ?timestamp .
?file integration:hasFileCreationEvent ?fileevent .
?file digitalmedia:hasPathName ?pathname .
?file integration:MediaFileToVTFile ?vtfileDisk .
?vtfileDisk virustotal:hasAVReport ?reportDisk .
?reportDisk virustotal:hasResult ?resultDisk .
?resultDisk virustotal:hasResultDescription ?descriptionDisk
}
Results

uri | file | pathname | descriptionDisk
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_754> | "Documents and Settings/John/MyFile.exe" | "Trojan.Win32.Generic!BT"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_754> | "Documents and Settings/John/MyFile.exe" | "BackDoor.Netguy.4"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_754> | "Documents and Settings/John/MyFile.exe" | "Malware.Kernelbot"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_754> | "Documents and Settings/John/MyFile.exe" | "TR/Rootkit.Gen"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_754> | "Documents and Settings/John/MyFile.exe" | "Riskware.WinNT.Boupke!IK"
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
The results verify the hypothesis that indeed, in a short time period (one minute, according to
the aforementioned custom rules) after downloading a malicious document from a Web
server, an additional malicious executable was created and stored on the system's disk.
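For reference, one way in which such a 'shortly after' bridging rule might be phrased in SWRL is sketched below. The class names (packetcapture:TCPFlow, integration:FileCreationEvent) and the exact signatures of the temporal built-ins are assumptions made for illustration and do not necessarily correspond to the custom rules actually used in the prototype.

packetcapture:TCPFlow(?flow) ^ temporal:hasValidTime(?flow, ?t1) ^
integration:FileCreationEvent(?event) ^ temporal:hasValidTime(?event, ?t2) ^
temporal:before(?t1, ?t2) ^ temporal:duration(?d, ?t1, ?t2, "Minutes") ^
swrlb:lessThanOrEqual(?d, 1) -> integration:temporalBefore(?t1, ?t2)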
Hypothesis
The investigator further hypothesizes that a successful compromise of the system by a
downloaded malicious file may have enabled an attacker to establish a shell connection to
the system and access various files on the system's disk. The investigator formulates a
query to identify which files have been accessed in a short time period after the download
of a malicious file from the web.
Query
SELECT DISTINCT ?uri ?file ?pathname
WHERE {
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?report .
?report virustotal:hasResult ?result .
?tcpflow temporal:hasValidTime ?duration .
?duration integration:temporalBefore ?timestamp .
?fileevent temporal:hasValidTime ?timestamp .
?file integration:hasFileLastAccessEvent ?fileevent .
?file digitalmedia:hasPathName ?pathname
}
Results

uri | file | pathname
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_4785> | "WINDOWS/system32/config/SAM"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_907> | "Documents and Settings/LocalService/Local Settings/Application Data/Microsoft/Windows/UsrClass.dat"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_4795> | "WINDOWS/system32/config/system.LOG"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_4788> | "WINDOWS/system32/config/SECURITY"
"/~dossis/Windows7Serial.doc" | <urn://reverseTCP.xml#file_4791> | "WINDOWS/system32/config/software.LOG"
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
The results (only an excerpt is shown above) list all the files whose last access time falls within a
short time window after the download of the malicious file. The results contain entries for registry
hive files such as the SAM, which stores the user accounts' password hashes, and may indicate
possible dumping of the hashed passwords by the attacker.
Hypothesis
The investigator hypothesizes that after a malicious file has been downloaded and possibly
executed, the compromised system may attempt to communicate with the attacker. As before, in
most cases such as botnets, the compromised system may attempt to communicate with the C&C
server controlled by the attacker. The formulated query searches for firewall-logged events, in a
short time window after the malicious download, towards hosts that may already be blacklisted.
Query
SELECT DISTINCT ?uri ?type ?host ?typeB
WHERE {
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?report .
?report virustotal:hasResult ?result .
?tcpflow temporal:hasValidTime ?duration .
?timestamp integration:temporalInside ?duration .
?event temporal:hasValidTime ?timestamp .
?event rdf:type xpfw:FirewallEvent .
?event rdf:type ?type .
?event xpfw:hasDestinationHost ?host .
?host integration:FWLogHostToFireHost ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?host_fire rdf:type ?typeB .
FILTER regex(str(?type),"Event","i")
FILTER regex(str(?typeB),"Server","i")
}
Results

uri | type | host | typeB
"/~dossis/Windows7Serial.doc" | <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#OpenOutboundSessionEvent> | <urn://pfirewall.log#ip_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer>
"/~dossis/Windows7Serial.doc" | <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#Event> | <urn://pfirewall.log#ip_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer>
"/~dossis/Windows7Serial.doc" | <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#FirewallEvent> | <urn://pfirewall.log#ip_78.46.104.43> | <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer>
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
The results show that indeed an outgoing connection has been established, in a short time
period after the malicious download, to a system that was already blacklisted as a known
Command and Control server.
Hypothesis
Upon identifying a suspicious connection to a malicious host as logged by the firewall,
the investigator wants to verify which files have been recently accessed before this
event. The formulated query searches for files whose last access timestamp is before the
suspicious connection, in a short time window.
Query
SELECT DISTINCT ?host ?pathname
WHERE {
?event xpfw:hasDestinationHost ?host .
?event rdf:type xpfw:FirewallEvent .
?event rdf:type ?type .
?host integration:FWLogHostToFireHost ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?event temporal:hasValidTime ?timestampA .
?timestampB integration:temporalBefore ?timestampA .
?fileevent temporal:hasValidTime ?timestampB .
?file integration:hasFileLastAccessEvent ?fileevent .
?file digitalmedia:hasPathName ?pathname
}
Results

host | pathname
<urn://pfirewall.log#ip_78.46.104.43> | "Documents and Settings/John/My Documents/Windows7Serial.doc"
<urn://pfirewall.log#ip_78.46.104.43> | "Documents and Settings/All Users/Documents/My Music/.."
<urn://pfirewall.log#ip_78.46.104.43> | "Documents and Settings/John/My Documents/My Music/.."
<urn://pfirewall.log#ip_78.46.104.43> | "Documents and Settings/John/My Documents/."
<urn://pfirewall.log#ip_78.46.104.43> | "Documents and Settings/John/Recent/Windows7Serial.lnk"
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
The results show that the malicious file that had been downloaded from the web was accessed
shortly before the communication to the C&C server was established. This further corroborates
the hypothesis that this file was indeed executed and was the one that caused the suspicious
connection.
Hypothesis
The investigator hypothesizes that the user may have been lured into downloading the malicious
file through a phishing attempt by a malicious site. The formulated query searches for the web
pages that were visited by the user shortly before, or even at the same time as, the malicious
file was downloaded.
Query
PREFIX whois: <http://people.dsv.su.se/~dossis/ontologies/WHOIS.owl#>
PREFIX integration: <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#>
PREFIX xpfw: <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#>
PREFIX fire: <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#>
PREFIX packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#>
PREFIX digitalmedia: <http://people.dsv.su.se/~dossis/ontologies/DigitalMedia.owl#>
PREFIX virustotal: <http://people.dsv.su.se/~dossis/ontologies/VirusTotal.owl#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX content: <http://www.w3.org/2011/content#>
SELECT DISTINCT ?uri ?fieldValue ?title
WHERE {
{
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?vtreport .
?vtreport virustotal:hasResult ?result .
?tcpflow temporal:hasValidTime ?durationA .
?durationB integration:temporalStarts ?durationA .
?tcpFlowB temporal:hasValidTime ?durationB .
?tcpFlowB packetcapture:hasApplicationLayerProtocol ?httpB .
?httpB rdf:type packetcapture:HTTP .
?httpB packetcapture:hasHTTPRequest ?httpreqB .
?httpreqB http:requestURI ?uriB .
?httpreqB http:headers ?header .
?httpreqB http:resp ?httprespB .
?httprespB http:body ?httpbodyB .
?httpbodyB content:title ?title .
?header http:fieldName ?fieldName .
?header http:fieldValue ?fieldValue .
FILTER (regex(str(?fieldName),"Host","i") || regex(str(?fieldName),"Referer","i"))
}
UNION
{
?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
?http rdf:type packetcapture:HTTP .
?http packetcapture:hasHTTPRequest ?httpreq .
?httpreq http:resp ?httpresp .
?httpreq http:requestURI ?uri .
?httpresp http:body ?httpbody .
?httpbody packetcapture:hasContentMD5 ?md5 .
?httpbody integration:HTTPContentToVTFile ?vtfile .
?vtfile virustotal:hasAVReport ?vtreport .
?vtreport virustotal:hasResult ?result .
?tcpflow temporal:hasValidTime ?durationA .
?durationB integration:temporalBefore ?durationA .
?tcpFlowB temporal:hasValidTime ?durationB .
?tcpFlowB packetcapture:hasApplicationLayerProtocol ?httpB .
?httpB rdf:type packetcapture:HTTP .
?httpB packetcapture:hasHTTPRequest ?httpreqB .
?httpreqB http:requestURI ?uriB .
?httpreqB http:headers ?header .
?httpreqB http:resp ?httprespB .
?httprespB http:body ?httpbodyB .
?httpbodyB content:title ?title .
?header http:fieldName ?fieldName .
?header http:fieldValue ?fieldValue .
FILTER (regex(str(?fieldName),"Host","i") || regex(str(?fieldName),"Referer","i"))
}
}
Results

uri | fieldName | fieldValue | title
"/~dossis/Windows7Serial.doc" | "Host" | "people.dsv.su.se" | "http://people.dsv.su.se/~dossis/"
"/~dossis/Windows7Serial.doc" | "Referer" | "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s" | "http://people.dsv.su.se/~dossis/"
"/~dossis/Windows7Serial.doc" | "Host" | "people.dsv.su.se" | "http://people.dsv.su.se/~dossis/coolsites.htm"
"/~dossis/Windows7Serial.doc" | "Referer" | "http://people.dsv.su.se/~dossis/" | "http://people.dsv.su.se/~dossis/coolsites.htm"
"/~dossis/Windows7Serial.doc" | "Host" | "http://www.google.se/" | "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s"
"/~dossis/Windows7Serial.doc" | "Referer" | "www.google.se" | "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s"
(All quoted literals carry the datatype ^^<http://www.w3.org/2001/XMLSchema#string>.)

Interpretation
An excerpt of the results shows some of the websites visited by the user before downloading the
malicious file. The results show that, before downloading the malicious file, the browser had
visited the main page of the malicious site (people.dsv.su.se/~dossis in our example) along with
a specific page under it (coolsites.htm). By also utilizing the Referer HTTP request header, the
examiner is able to identify a web search submitted on the Google search engine ("semantic
web dsv") whose results included the malicious site.
The ability of the above queries to quickly provide integrated and correlated results upon the multitude
of initially unstructured or semi-structured collected evidence should be apparent by now. Although no
specific measurement has been made of the time or expertise that an investigator would have needed to
provide such answers manually, it is reasonable to claim that such an approach is considerably more
flexible and scalable. It should be emphasized that such queries allowed the investigator to connect
events, such as incoming connections dropped by the firewall, with network communications and the
disk image while avoiding the complexity of the sheer amount of data. Besides that, a dropped
incoming connection from an external host would probably not have raised any suspicions or alerts on
its own, although there could have been logical connections with events logged by other tools and
techniques. The evaluation of the above hypotheses can greatly assist the investigation team in
reconstructing the sequence of events and providing a more complete and accurate narrative of the
investigated incident.
8.4 Evaluation of the Method
As a final part, a lightweight evaluation of the method and lessons learned are discussed and the
attained results are compared to the initially specified goals and criteria in section 6.3.
A generic goal that had been specified was that the proposed method should be appropriate and
relevant to the digital investigation in focus. Through the design and the prototype implementation of
the method, it has been shown that the method can be used in a variety of different contexts involving
different types of evidence files. The method should not be considered a replacement of existing
tools or techniques but rather an additional layer on top of them, with the ability to integrate and
correlate their respective results meaningfully. In the proof-of-concept implementation, six different
types of data have been used (disk images, packet captures, firewall logs, site blacklists, an
anti-malware engine, and network data), which are quite commonly used either wholly or partly in
digital investigations and especially in cases involving network intrusions. There are no restrictions
regarding the types of cases or data that the method can handle, besides the technical requirement that
a respective ontology must have been designed and an appropriate semantic parser implemented that
can operate on the semi-structured or unstructured data that an existing tool outputs.
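To make this parser contract more concrete, a minimal sketch of what such a semantic parser interface could look like is given below. The use of the Apache Jena Model class is an assumption made for illustration, and the interface and method names do not correspond one-to-one to the prototype's actual code.

import java.io.File;
import org.apache.jena.rdf.model.Model;

/**
 * A semantic parser takes one evidence file produced by an existing tool
 * (e.g. a disk image listing, a packet capture export, a firewall log)
 * and asserts the corresponding individuals and property values
 * according to the ontology that it implements.
 */
public interface SemanticParser {

    /** The namespace URI of the ontology whose terms this parser asserts. */
    String ontologyNamespace();

    /** Parse the evidence file and return the resulting RDF statements. */
    Model parse(File evidenceFile) throws Exception;
}

Concrete parsers, one per evidence type, would implement such an interface, which also makes it straightforward to plug new evidence types into the processing pipeline.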
One of the strongest points of this method, compared to traditional techniques, is that through the
power of SPARQL it gives the user access to an expressive query interface over the integrated data.
The queries can include numerous unbound variables and form quite complex graph patterns against
which results are obtained. The queries can use every term specified in the respective ontologies, and
evaluation over datasets spanning hundreds of thousands of triples, representing the complete body of
the collected evidence, was demonstrated.
The resulting semantic representations of the different evidence files, as expressed in the RDF/XML
format, have been shown to be of equal or smaller size than that of the original files. Of course, the
resulting size of the set of axioms depends on the abstraction level of the ontology used as well as on
the features of OWL that are used, since evaluation and storage of the inferred axioms leads to even
larger files. However, the proposed method provides an efficient way for the investigator to focus the
initial phase of the investigation on the metadata of evidence, such as timestamps, hash values, file
types and names, and then resort to the initial evidence files and retrieve specific content upon
identifying the items most relevant to the case, such as executables or digital images. In addition,
although no explicit time measurements have been made, query resolution has proven to be almost
instantaneous on commodity hardware for a case that resulted in approximately 300 thousand triples.
Inference and storage of inverse object properties has been shown to accelerate most of the queries
considerably.
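As an example of the latter point, declaring an inverse object property lets the reasoner materialize the reverse direction of every asserted link, so that graph patterns can be matched starting from either end of a relation. A minimal sketch in Turtle is shown below; packetcapture:isApplicationLayerProtocolOf is an assumed name for the inverse and may not exist under that name in the actual ontology.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#> .

packetcapture:hasApplicationLayerProtocol
    a owl:ObjectProperty ;
    owl:inverseOf packetcapture:isApplicationLayerProtocolOf .

packetcapture:isApplicationLayerProtocolOf
    a owl:ObjectProperty .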
An additional benefit of this method is its flexibility and its potential to fully decouple the
implementation code from the used ontologies, rules and queries. Indeed, as described before, all these
entities can be stored as separate external files or even downloaded dynamically from URLs. The
prototype code's main role is to organize and streamline the whole process, from calling the
appropriate parsers to later invoking the inference and rule engine. The various semantic parsers are the
only parts that are highly dependent on the various formats of their source data and the ones
responsible for extracting and asserting the semantic information according to the specified ontology.
A few hard-coded references to the locations of the ontology files, and the way in which the parsers
are invoked, are shortcomings of this prototype implementation and can easily be lifted with a more
pluggable architecture and a graphical interactive layer. The prototype implementation was capable of
dealing with the complexity and size of the evidence files used in the experiments in a timely manner.
Most parsing, inference and rule tasks did not take more than 5-10 minutes to complete. The only
tasks that needed a considerable amount of time were the live matching of all files against the online
antivirus engine as well as the evaluation of the temporal-related rules. More specifically, rules that
needed to cross-match around 1000 timestamps, e.g. for temporal 'before' relations, led the Jess rule
engine to memory exhaustion problems. Through trial and error, an optimum batch size of 500
timestamps was found to avoid such problems, needing approximately 7-8 hours to complete. The task
of splitting the set of time instant individuals into subgroups and later merging the inferred axioms has
been performed manually in this thesis. However, with more computing resources available and
possible automated ways or heuristics to 'divide and conquer' such problems, the ability of the
method to handle even larger amounts of data will increase.
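A possible automated form of this 'divide and conquer' step is sketched below in Java: the time instant individuals are split into fixed-size batches, the temporal rules are evaluated per batch, and the inferred statements are merged into a single model. The batch size of 500 follows the value found by trial and error above, the Jena Model class is an assumption about how the inferred statements are represented, and runTemporalRules is a placeholder for the actual Jess invocation. Note that such a simple split only compares instants within the same batch, so a complete solution would also have to cover pairs of instants falling into different batches.

import java.util.List;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class TemporalRuleBatcher {

    // Batch size found by trial and error to avoid memory exhaustion in the rule engine.
    private static final int BATCH_SIZE = 500;

    /** Evaluate the temporal rules batch by batch and merge the inferred axioms. */
    public Model inferInBatches(List<Resource> timeInstants) {
        Model merged = ModelFactory.createDefaultModel();
        for (int i = 0; i < timeInstants.size(); i += BATCH_SIZE) {
            List<Resource> batch =
                timeInstants.subList(i, Math.min(i + BATCH_SIZE, timeInstants.size()));
            merged.add(runTemporalRules(batch));
        }
        return merged;
    }

    /** Placeholder for invoking the SWRL/Jess temporal rules over one batch. */
    private Model runTemporalRules(List<Resource> batch) {
        // In the prototype this would hand the batch to the rule engine;
        // here it is left abstract and simply returns an empty model.
        return ModelFactory.createDefaultModel();
    }
}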
Regarding the forensic-related requirements that were initially specified, the method has proven to be
quite accurate and precise. The method has been applied to the same evidence files twice, providing
approximately the same results. Small differences were caused by minor network disruptions or a few
dropped API calls against the online services used, without significantly affecting the bulk of the
case's body. However, reliance on online sources can certainly affect the results of this method and
disrupt the automated processing by the semantic parsers. The ability, though, to employ the parsers
asynchronously and separately from each other and to later merge their generated statements can
alleviate such problems, given the implementation of some form of temporary storage for the collector
modules. The results of the method have been checked successfully for possible ontological
inconsistencies using Protégé's reasoning engines. The most probable causes of inconsistencies are
erroneous ontologies or multiple ontologies that cross-reference each other inconsistently. Such
problems, though, must be checked and corrected before the method is employed. The method is
capable of working on forensic copies of the various sources of evidence, since the various semantic
parsers can be implemented to do so without any major hindrances. Minor logging capabilities have
been added to the prototype implementation concerning the start and finish of the various steps of the
method. The resulting file is an ontology file which is fully compliant with the standards and can be
processed by ontology editors such as Protégé for further inclusion of annotation statements.
Finally, concerning the semantic-related criteria that were set, the prototype implementation has
proven to be quite flexible and able to operate successfully on at least two different testing laptops.
The implementation is based on Java and is thus platform- and OS-independent. Additionally, the
various steps of the method can easily be separated and run on different systems. As an example,
different parsers can work in parallel on different evidence files of the case and the results can then
easily be merged into a single dataset. Such techniques have been applied during the thesis in order to
deal with increased complexity in a manual manner; however, future improvements may take
advantage of contemporary distributed processing techniques such as WebPIE (Urbani et al. 2010), a
parallel inference engine based on the Hadoop framework. The proof-of-concept system has been
implemented using contemporary tools and libraries based on the current Semantic Web standards. To
the best of the author's knowledge, there is no standard yet regarding the semantic representation of
temporal information; however, the approach followed seems promising. This has led to the adoption
of the Jess rule engine, which is able to handle the temporal-related rules even though it is not a
Semantic Web-based tool. Finally, the method can accept arbitrary user-specified concepts, such as
ontology classes and properties, which can be asserted either as a custom extension of the various
specific ontologies or as part of an additional individual ontology, as done in this thesis.
Overall, the described method, along with its prototype implementation, was able to fulfill quite
successfully almost all of the specified criteria. During the evaluation the author was not able to gather
accurate timing information, mostly because the testing systems were, most of the time, in use in
parallel with the execution of the developed application, and because its performance varies with
factors such as the CPU speed and RAM size of the system or the network speed. Furthermore, the
size of the evidence files used and the amount and complexity of generated triples can vary greatly,
making such measurements more difficult to obtain accurately. However, approximate results
regarding the time- and complexity-related performance metrics of the method have been presented,
showing that the method is both feasible and relatively fast compared to a traditional manual analysis,
while exploiting most of the advantages of the Semantic Web technologies in integrating and
correlating heterogeneous data sources as well as maintaining the potential to adhere to the strict
requirements of a forensic process.
Conclusions and Future Work
9.1 Conclusions
The aim of this thesis was to identify and describe the potential benefit that Semantic Web based
technologies and ideas can bring to the area of digital investigations as well as tackle some of its most
prominent problems. Such problems pertain to the ever-increasing amount and complexity of data, the
heterogeneity and incompatibility of various disparate tools and techniques as well as the lack of
automation and advanced forms of analytical capabilities.
The thesis started with extensive background research on both fields: the state of the art of digital
investigations, regarding both their conceptual foundations and the concrete problems and limitations
that they currently face, and the Semantic Web with its stack of cross-complementary technologies and
standards along with its distinctive capabilities in automated reasoning, rule evaluation and expressive
querying. The thesis continued with a study and evaluation of recent approaches to merging these two
fields along with their promising results and possible shortcomings. Based on this background
knowledge, the thesis proposed a generic and adaptable method based on the semantic representation,
integration and correlation of digital evidence, describing both a generic conceptual framework
bridging the two fields and a proof-of-concept implementation of it.
The thesis continued with a demonstration of the method utilizing the implemented prototype system
upon two experiments that closely resemble a quite common contemporary method of compromising a
system with malicious payloads over the Internet. The demonstration showed various examples of how
sources of data of different origin and nature (disk images, network captures, firewall logs) can be
automatically semantically represented according to respective ontologies and similar or identical
entities be integrated and correlated upon various factors such as hash values, IP addresses, network
blocks and time. The ability of such integrated and correlated data to provide fast and meaningful
insight to the investigator has been showcased through a number of relevant queries of a combinatorial
nature, providing results in a more effortless and analytically rich manner. The thesis concluded with
an evaluation of the proposed method according to previously specified criteria and highlighted some
of its strong points, such as increased automation, improved analytical capabilities, a decoupled
implementation, and the ability to accept user-defined concepts, rules and queries. The evaluation also
pinpointed some shortcomings of the method, mostly with respect to its performance and scalability,
which can be further improved.
Overall, the outcome of the thesis, in accordance with the research questions posed in the beginning, is
that Semantic Web technologies can offer a lot to the field of digital investigations, and information
security overall, due to their distinctive abilities in automation and data integration. An ontological
representation of the various collected data can alleviate the problem of heterogeneous and
non-interoperable forensic and security tools and databases, while enabling a more abstract yet
expressive conceptualization of existing forensic knowledge, thus lowering the barrier for people with
less technical expertise. The thesis also showed that the (semi-)automated capabilities of an inference
and a rule engine can improve current forensic techniques and analytical skills by allowing easy and
fast ways to integrate and correlate forensic evidence, while encapsulating most of the complexity
behind a querying interface that better matches the thinking process of an investigation procedure.
9.2 Future Work
Both fields, the Semantic Web on the one hand and digital forensics on the other, are continuously
evolving and reshaping. The main contribution of the thesis was to propose the adoption of the
Semantic Web framework as an enabling technology that can alleviate some of the most challenging
problems that the field of digital forensics faces. The main objective was to take advantage of key
features of the Semantic Web in the areas of heterogeneous data integration, support for automation,
and improved analytical capabilities for the investigator in the form of an expressive and flexible
querying layer. The goal of such a proposal, as the PoC system showed, is to reduce the time and
expertise needed by the investigator to deal with an ever-increasing arsenal of specialized tools and
data formats as well as the respective corpus of forensic knowledge and techniques. As such, further
research should focus on the study of how real investigations are conducted and whether or how such
an approach can produce tangible benefits when introduced into the workflow. Better evaluation of the
method, in terms of both its time efficiency and its ability to deal with complex data, can be performed
through real-life usage by investigation teams or individuals and the application of relevant research
methods such as interviews and surveys.
On the technical side, the proposed system was merely a proof of concept and is still quite far from a
production-level implementation. There are numerous improvements that can be made to such a
system by taking advantage of recent advancements in both fields. The first and most important step
would be the engineering of ontologies that could cover even more fields such as OS artifacts, mobile
forensics, live memory etc. It would be beneficial if such ontologies could reach a consensus amongst
the digital forensics community, since the usage of common ontologies can avoid various difficulties
that arise in multi-ontology environments where there are overlaps between them. Even the ontologies
described in this thesis do not cover their respective knowledge areas exhaustively and can be further
extended.
One basic approach adopted during this thesis was that, since common domain ontologies have not yet
been developed and standardized, such a method can still be implemented even in a multi-ontology
environment. One of the main issues was the establishment of relations between individuals
representing the same concept, e.g. the same IP address. The technique followed throughout this thesis
was the use of SWRL rules for performing such 'bridging'. This part can be further improved and even
automated by taking advantage of recent developments in the area of automated linking of concepts
and individuals across ontologies. Furthermore, developments in the areas of distributed inference
engines, RDF triple storage systems and federated querying can significantly improve the robustness,
efficiency and computational feasibility of such a method, making it capable of dealing with massive
amounts of data while reducing the time needed. Eventually, such technologies can even enable
real-time integration and correlation of logs and captured events from various information security
appliances such as firewalls, IDS, antivirus, server and database systems etc. Finally, a graphical,
interactive and user-friendly layer positioned on top of such a method can greatly increase its usability
while reducing the training time needed for adoption. Such a method could also serve an educational
purpose, allowing persons with less technical background to develop the analytical skills needed
during a digital investigation.
List of References
ACPO, 2007. Good Practice Guide for Computer-Based Evidence. Available at:
http://www.7safe.com/electronic_evidence/ACPO_guidelines_computer_evidence_v4_web.pdf.
Abbott, J. et al., 2006. Automated recognition of event scenarios for digital forensics. Proceedings of the 2006
ACM symposium on Applied computing SAC 06, p.293. Available at:
http://portal.acm.org/citation.cfm?doid=1141277.1141346.
Al-Feel, H., Koutb, M.A. & Suoror, H., 2009. Toward An Agreement on Semantic Web Architecture. Europe,
49(384,633,765), pp.806-810. Available at:
http://www.akademik.unsri.ac.id/download/journal/files/waset/v49-142.pdf.
Alink, W. et al., 2006. XIRAF – XML-based indexing and querying for digital forensics. Digital Investigation,
3(Supplement-1), pp.50-58. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000776.
Allen, J.F., 1983. Maintaining knowledge about temporal intervals R. J. Brachman & H. J. Levesque, eds.
Communications of the ACM, 26(11), pp.832-843. Available at:
http://portal.acm.org/citation.cfm?doid=182.358434.
Antoniou, G. & Van Harmelen, F., 2004. A Semantic Web Primer M. P. Papazoglou & J. W. Schmidt, eds., The
MIT Press. Available at: http://doi.wiley.com/10.1002/asi.20368.
Ayers, D., 2009. A second generation computer forensic analysis system. Digital Investigation, 6(Supplement 1),
p.S34-S42. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000371.
Basili, V.R., Caldiera, G. & Rombach, H.D., 1994. The goal question metric approach. In J. J. Marciniak, ed.
Encyclopedia of Software Engineering. John Wiley & Sons, pp. 528-532. Available at:
http://www.citeulike.org/user/whazzup221/article/4186285.
Berners-Lee, T, Fielding, R. & Masinter, L., 2005. RFC 3986 - Uniform Resource Identifier (URI): Generic
Syntax. Technical report httptoolsietforghtmlrfc3986, pp.1-62. Available at: http://tools.ietf.org/html/rfc3986.
Berners-Lee, Tim, 1998. Semantic web road map. Design Issues for the World Wide Web, 2008(September
1998), pp.1-5. Available at: http://www.w3.org/DesignIssues/Semantic.html.
Berners-Lee, Tim et al., 1992. World-Wide Web: The Information Universe. Internet Research, 2(1), pp.52-58.
Available at: http://www.emeraldinsight.com/10.1108/eb047254.
Berners-Lee, Tim, Hendler, J. & Lassila, O., 2001. The Semantic Web A. Gómez-Pérez, Y. Yu, & Y. Ding, eds.
Scientific American, 284(5), pp.34-43. Available at:
http://www.nature.com/doifinder/10.1038/scientificamerican0501-34.
Brezinski, D. & Killalea, T., 2002. RFC3227: Guidelines for Evidence Collection and Archiving. RFC Editor
United States, 2010. Available at: http://portal.acm.org/citation.cfm?id=RFC3227.
Brinson, A., Robinson, A. & Rogers, M., 2006. A cyber forensics ontology: Creating a new approach to studying
cyber forensics. Digital Investigation, 3(2), pp.37-43. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S1742287606000703.
Carrier, B & Spafford, E., 2006. Categories of digital investigation analysis techniques based on the computer
history model. Digital Investigation, 3(Supplement 1), pp.121-130. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S1742287606000739.
Carrier, Brian, 2006. A Hypothesis-Based Approach to Digital Forensic Investigations. CERIAS Tech Report
2006-06, Center for Education and Research in Information Assurance and Security, Purdue University,
West Lafayette, IN 47907-2086. Available at:
https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2006-06.pdf.
Carrier, B.D. & Spafford, E.H., 2004. An Event-Based Digital Forensic Investigation Framework. Proceedings
of the 4th Digital Forensic Research Workshop DFRWS, pp.1-12. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.3355&rep=rep1&type=pdf.
Carrier, Brian, 2003. Defining Digital Forensic Examination and Analysis Tools Using Abstraction Layers.
International Journal, 1(4), pp.1-12. Available at:
http://www.utica.edu/academic/institutes/ecii/publications/articles/A04C3F91-AFBB-FC134A2E0F13203BA980.pdf.
Carroll, J.J. et al., 2005. Named graphs, provenance and trust. Proceedings of the 14th international conference
on World Wide Web WWW 05, 14, p.613. Available at:
http://portal.acm.org/citation.cfm?doid=1060745.1060835.
Case, A. et al., 2008. FACE: Automated digital evidence discovery and correlation. Digital Investigation,
5(Suppl. 1), p.S65-S75. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287608000340.
Casey, E, 2004. Digital evidence and computer crime: forensic science, computers and the Internet, Academic
Pr. Available at:
http://books.google.com/books?hl=en&amp;lr=&amp;id=Xo8GMt_AbQsC&amp;oi=fnd&amp;pg=PP1&amp;
dq=Digital+Evidence+and+Computer+Crime%E2%80%94Forensic+Science,+Computers+and+the+Internet,+
Second+Edition&amp;ots=-YN1HU71ME&amp;sig=N3K1I-XiNfljRB5gReQiK9B3_8s.
Casey, E., 2002. Handbook of computer crime investigation: forensic tools and technology. ISBN
0121631036, p.462. Available at: http://www.ncjrs.gov/App/abstractdb/AbstractDBDetails.aspx?id=195111.
Cohen, M., Garfinkel, Simson & Schatz, B., 2009. Extending the advanced forensic format to accommodate
multiple data sources, logical evidence, arbitrary information and forensic workflow. Digital Investigation,
6(9), p.S57-S68. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000401.
Cohen, M. & Schatz, B., 2010. Hash based disk imaging using AFF4. Digital Investigation, 7(Suppl. 1), p.S121-S128. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287610000423.
Connor, M.J.O. & Das, A.K., 2011. A Method for Representing and Querying Temporal Information in OWL A.
Fred, J. Filipe, & H. Gamboa, eds. Language, 127, pp.97-110. Available at:
http://books.google.com/books?hl=en&amp;lr=&amp;id=AdLf6o25F00C&amp;oi=fnd&amp;pg=PA97&amp;
dq=A+Method+for+Representing+and+Querying+Temporal+Information+in+OWL&amp;ots=IcNulJfrhR&a
mp;sig=rkQ1FTtx0z4AhMAmeBdzTB8WgAg.
Connor, M.O. et al., 2005. Supporting Rule System Interoperability on the Semantic Web with SWRL Y. Gil et
al., eds. The Semantic Web–ISWC 2005, 3729(2), pp.974–986. Available at:
http://www.springerlink.com/index/f16373n77h8p2181.pdf.
Du, L. et al., 2008. A Latent Semantic Indexing and WordNet based Information Retrieval Model for Digital
Forensics. In IEEE International Conference on Intelligence and Security Informatics. pp. 70-75.
Forgy, C., 1982. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial
Intelligence, 19(1), pp.17-37. Available at: http://linkinghub.elsevier.com/retrieve/pii/0004370282900200.
Garfinkel, S, 2006. Forensic feature extraction and cross-drive analysis. Digital Investigation, 3, pp.71-81.
Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000697.
Garfinkel, S.L., 2006. AFF: A New Format for Storing Hard Drive Images. Communications of the ACM,
49(2), pp.85-87.
Garfinkel, S.L., 2009. Automating Disk Forensic Processing with SleuthKit, XML and Python. 2009 Fourth
International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, pp.73-84. Available
at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5341559.
Garfinkel, S.L., 2010. Digital forensics research: The next 10 years. Digital Investigation, 7(Suppl. 1), p.S64-S73. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287610000368.
Garfinkel, Simson, 2011. Digital forensics XML and the DFXML toolset. Digital Investigation, pp.1-14.
Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287611000910.
Giova, G., 2011. Improving Chain of Custody in Forensic Investigation of Electronic Digital Systems. Journal of
Computer Science, 11(1).
Gladyshev, P. & Patel, A., 2004. Finite state machine approach to digital event reconstruction. Digital
Investigation, 1(2), pp.130-149. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287604000271.
Gruber, T.R., 1993. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2),
pp.199-220. Available at: http://linkinghub.elsevier.com/retrieve/doi/10.1006/knac.1993.1008.
Guo, Y., Slay, J. & Beckett, J., 2009. Validation and verification of computer forensic software tools—Searching
Function. Digital Investigation, 6(SUPPL.), p.S12-S22. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S1742287609000358.
Guðjónsson, K., 2010. Mastering the Super Timeline With log2timeline. SANS Institute.
Haslhofer, B. & Neuhold, E.J., 2011. A Retrospective on Semantics and Interoperability Research. In D. Fensel,
ed. Foundations for the Web of Information and Services. Springer Berlin Heidelberg, pp. 3-27. Available at:
http://cs.univie.ac.at/research/research-groups/multimedia-information-systems/publikation/infpub/2921/.
Hevner, A.R. et al., 2004. Design Science in Information Systems Research. MIS Quarterly, 28(1), pp.75-105.
Available at: http://www.jstor.org/stable/25148625.
Hildebrandt, M., Kiltz, S. & Dittmann, J., 2011. A Common Scheme for Evaluation of Forensic Software. In IT
Security Incident Management and IT Forensics (IMF), 2011 Sixth International Conference on. pp. 92-106.
Hitzler, P. et al., 2009. OWL 2 Web Ontology Language Primer, W3C. Available at:
http://www.w3.org/TR/2009/REC-owl2-primer-20091027/.
Hobbs, J.R. & Pan, F., 2004. An ontology of time for the semantic web. Acm Transactions On Asian Language
Information Processing, 3(1), pp.66-85. Available at:
http://portal.acm.org/citation.cfm?doid=1017068.1017073.
IOCE, 2002. G8 Proposed Principles for Forensic Evidence. Available at:
http://www.ioce.org/fileadmin/user_upload/2002/G8 Proposed principles for forensic evidence.pdf [Accessed
December 20, 2011].
Jean-Mary, Y.R., Shironoshita, E.P. & Kabuka, M.R., 2009. Ontology Matching with Semantic Verification.
Web semantics Online, 7(3), pp.235-251. Available at: http://www.ncbi.nlm.nih.gov/pubmed/20186256.
Kahvedzic, D. & Kechadi, T., 2009. DIALOG: A framework for modeling, analysis and reuse of digital forensic
knowledge. Digital Investigation, 6(Supplement 1), p.S23 - S33. Available at:
http://www.sciencedirect.com/science/article/pii/S174228760900036X.
Kahvedžić, D. & Kechadi, T., 2011. Semantic Modelling of Digital Forensic Evidence. In O. Akan et al., eds.
Digital Forensics and Cyber Crime. Springer Berlin Heidelberg, pp. 149-156. Available at:
http://dx.doi.org/10.1007/978-3-642-19513-6_13.
Keet, C.M. & Artale, A., 2007. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied
Ontology, 3(1), pp.91-110. Available at: http://portal.acm.org/citation.cfm?id=1412417.1412418.
Kent, K. et al., 2006. Guide to Integrating Forensic Techniques into Incident Response. Nist Special Publication,
August(SP 800-86), p.121. Available at: http://cybersd.com/sec2/800-86Summary.pdf.
Kitchenham, B., Linkman, S. & Law, D., 1997. DESMET: a methodology for evaluating software engineering
methods and tools H. Rombach, V. Basili, & R. Selby, eds. Computing Control Engineering Journal, 8(3),
p.120. Available at: http://link.aip.org/link/CCEJEL/v8/i3/p120/s1&Agg=doi.
Koch, J., Velasco, C.A. & Abou-Zahra, S., 2011. HTTP Vocabulary in RDF 1.0. W3C Working Draft. Available
at: http://www.w3.org/TR/2011/WD-HTTP-in-RDF10-20110510/.
Kruegel, C., Valeur, F. & Vigna, G., 2005. Intrusion Detection and Correlation, Springer. Available at:
http://www.netlibrary.com/Details.aspx.
Kruse, W. & Heiser, J., 2002. Computer Forensics: Incident Response Essentials, Addison-Wesley. Available at:
http://www.best-seller-books.com/computer-forensics-incident-response-essentials.pdf.
Küster, U., König-ries, B. & Klusch, M., 2010. Criteria , Approaches and Challenges Evaluating Semantic Web
Service Technologies M. L. A. Sheth, ed. Progressive Concepts for Semantic Web Evolution Application and
Developments, pp.1-24.
Lamis, T., 2010. A Forensic Approach to Incident Response. Human Factors, pp.177-185.
Lee, S. et al., 2010. A proposal for automating investigations in live forensics. Computer Standards Interfaces,
32(5-6), pp.246-255. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0920548909000762.
Levine, B.N. & Liberatore, M., 2009. DEX: Digital evidence provenance supporting reproducibility and
comparison. Digital Investigation, 6(9), p.S48-S56. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S1742287609000395.
Munir, R.F. et al., 2011. Detect HTTP Specification Attacks Using Ontology, IEEE.
National Institute of Justice, U., 2011. U.S. National Institute of Justice. Crimes Scene Guides. Available at:
http://www.nij.gov/topics/law-enforcement/investigations/crime-scene/guides/glossary.htm.
Noy, N.F. & Klein, M., 2004. Ontology Evolution: Not the Same as Schema Evolution. Knowledge and
Information Systems, 6(4), pp.428-440. Available at:
http://springerlink.metapress.com/openurl.asp?genre=article&id=doi:10.1007/s10115-003-0137-2.
Palmer, G., 2001. A Road Map for Digital Forensic Research. New York, 1, pp.27–30. Available at:
http://www.dfrws.org/2001/dfrws-rm-final.pdf.
Parsia, B., Sattler, U. & Schneider, T., 2008. Easy Keys for OWL. OWLed. Available at:
http://www.webont.org/owled/2008/papers/owled2008eu_submission_3.pdf.
Patzakis, B.J., 2003. Maintaining The Digital Chain of Custody. IFOSEC.
Rekhis, S. & Boudriga, N., 2011. Logic-based approach for digital forensic investigation in communication
Networks. Computers & Security, In Press. Available at: http://www.sciencedirect.com/science/article/B6V8G52BPJWH-1/2/1f69b962893b83cb7ceffcc14c4ee2e3.
Reynolds, D. et al., 2005. An assessment of RDF / OWLmodelling. October, 12, pp.2005–189. Available at:
http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf.
Saad, S. & Traore, I., 2010. Method ontology for intelligent network forensics analysis. In Privacy Security and
Trust {(PST)}, 2010 Eighth Annual International Conference on. pp. 7-14.
Scarfone, K. & Masone, K., 2004. Computer Security Incident Handling Guide Recommendations of the
National Institute of Standards and Technology. Nist Special Publication, 2(Revision 1), pp.800–61. Available
at:
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Computer+Security+Incident+Handling+Gu
ide+Recommendations+of+the+National+Institute+of+Standards+and+Technology#1.
Schatz, B., Mohay, G. & Clark, A., 2004a. Generalising Event Forensics Across Multiple Domains. Most.
Schatz, B., Mohay, G. & Clark, A., 2004b. Rich Event Representation for Computer Forensics. Asia Pacific
Industrial Engineering and Management Systems APIEMS 2004, pp.1-16. Available at:
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:RICH+EVENT+REPRESENTATION+FO
R+COMPUTER+FORENSICS#0.
Schatz, B.L., 2007. Digital evidence: representation and assurance. Queensland University of Technology.
Available at: http://eprints.qut.edu.au/16507/.
Schuster, A., 2010. Recent Advances in Memory Forensics. In ZISC.
Scientific Working Group on Digital Evidence (SWGDE), 2011. SWGDE Model Quality Assurance Manual for
Digital Evidence. Quality Assurance, 2011(Version 1), pp.1-117.
Shah, U., Finin, T. & Joshi, A., 2002. Information retrieval on the semantic web. In C. Nicholas et al., eds.
Proceedings of the eleventh international conference on Information and knowledge management. ACM New
York, NY, USA, pp. 461–468. Available at: http://portal.acm.org/citation.cfm?id=584868.
Stallard, T. & Levitt, K., 2003. Automated analysis for digital forensic science: semantic integrity checking L.
Karl, ed. 19th Annual Computer Security Applications Conference 2003 Proceedings, 0(Acsac), pp.160-167.
Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1254321.
Stevens, R.M. & Casey, Eoghan, 2010. Extracting Windows command line details from physical memory.
Digital Investigation, 7(Suppl. 1), p.S57-S63. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S1742287610000356.
Stone-Gross, B. et al., 2009. FIRE: FInding Rogue nEtworks. 2009 Annual Computer Security Applications
Conference, pp.231-240. Available at:
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5380682.
Suciu, D., 1998. An overview of semistructured data. SIGACT News, 29(4), pp.28-38. Available at:
http://portal.acm.org/citation.cfm?id=306198.306204.
Tan, T., Ruighaver, T. & Ahmad, A., 2003. Incident Handling : Where the need for planning is often not
recognised. Network, (November), pp.1-10.
Turner, P., 2005. Unification of digital evidence from disparate sources (Digital Evidence Bags). Digital
Investigation, 2(3), pp.223-228. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287605000575.
Urbani, J. et al., 2010. OWL reasoning with WebPIE: calculating the closure of 100 billion triples L. Aroyo et
al., eds. The Semantic Web Research and Applications, 6088, pp.213-227. Available at:
http://www.springerlink.com/index/2581664J64961667.pdf.
Vermaas, O., Simons, J. & Meijer, R., 2010. Open Computer Forensic Architecture a Way to Process Terabytes
of Forensic Disk Images Huebner E And Zanero S, ed. Architecture, pp.45-67.
Wikipedia, 2011. Daubert standard. Available at: http://en.wikipedia.org/wiki/Daubert_standard [Accessed
December 15, 2011].
Willassen, S.Y., Hypothesis Based Investigation of Digital Timestamps (Chapter 1). Available
at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.173.5585 [Accessed January 10, 2012].
Willassen, S.Y., 2008. Timestamp evidence correlation by model based clock hypothesis
testing. , p.15. Available at: http://dl.acm.org/citation.cfm?id=1363217.1363237 [Accessed
January 10, 2012].
Zhang, S. et al., 2011. An Ontology-Based Context-aware Approach for Behaviour Analysis. In L. Chen et al.,
eds. Activity Recognition in Pervasive Intelligent Environments. Atlantis Press, pp. 127-148. Available at:
http://dx.doi.org/10.2991/978-94-91216-05-3_6.
Zhao, Y. & Sandahl, K., 2002. Potential advantages of semantic web for internet commerce. Computer, (3).
Available at: www.ida.liu.se/~yuxzh/doc/iceis-030120.pdf.
Department of Computer and Systems Sciences
Stockholm University
Forum 100
SE-164 40 Kista
Phone: 08 – 16 20 00
www.su.se