Semantically-enabled Digital Investigations
A method for semantic integration and correlation of digital evidence using a hypothesis-based approach.

Spyridon Dossis
Department of Computer and Systems Sciences
Degree project 30 HE credits
Degree subject (Computer and Systems Sciences)
Degree project at the master level
Autumn/Spring term 2012
Supervisor: Prof. Oliver Popov
Reviewer: Prof. Iskra Popova
Swedish title: Semantiskt digitala undersökningar

Abstract

Due to the continuous rise of security threats and the increased sophistication and professionalism of exploitation techniques, digital investigations are becoming an essential part of most information and communication security processes and workflows. Digital investigations commonly combine a wide span of different areas of expertise, such as forensic analysis of storage devices, network communications, OS artifacts, and logs from various host and network security appliances. The complexity of analyzing all these data requires both time and considerable expertise from practitioners, while the bodies of knowledge of each field remain disparate and disjoint. Although there is a plethora of tools and techniques employed during a digital investigation, the lack of integration and interoperability between them, as well as between the formats of their source and resulting data, hinders the analysis process. The sheer amount of data encountered in most cases requires a new (semi-)automated approach that can reduce the requirements in both time and expertise, and enhance manual analytical skills by enabling easier and more expressive integration and correlation of digital evidence.

The Semantic Web initiative is a framework comprising a number of different standards and languages, conceived as a better way to automate machine-to-machine communication and integrate heterogeneous sources of data, with a focus on the Web environment. The stack of technologies builds upon well-established markup languages such as XML and unique identification schemes such as URIs, and extends them with expressive, schema-less data models such as RDF, enabling well-defined semantic description of a domain's knowledge and automated inference of implicit knowledge encapsulated in any dataset. The stack is further complemented by rule and query systems that allow even richer manipulation and retrieval of information.

The thesis attempts to bridge these two disciplines and claims that distinctive advantages of the latter can solve challenges of the former. The thesis's main research question is how a method based on existing Semantic Web technologies can be designed and implemented so as to automate processing tasks applied in the context of digital investigations, as well as assist the investigator in integrating, correlating and querying disparate sources of forensically relevant data and events, with the goal of faster and more accurate reconstruction of a case's events. The thesis follows the design science paradigm and reports in a structured manner the steps followed in order to design, implement and evaluate the proposed method. Besides a detailed description of the proposed method, the thesis presents a prototype implementation of such a system, which is further demonstrated through experiments that simulate realistic cases of security compromises in a networked environment. The method's employment in these experiments demonstrates its feasibility, as well as its considerable efficiency and its capacity to enable the investigator to formulate hypotheses that arise during the analysis phase into queries that can span the multitude of collected data.

Keywords

Digital Investigation, Semantic Web, Data Integration, Evidence Correlation, Hypothesis-based approach.

Table of Contents

List of Figures
List of Tables
List of Abbreviations
Introduction
1.1 Problem Description
1.2 Justification, Motivation and Benefits
1.3 Research Questions
1.4 Audience
1.5 Limitations
1.6 Thesis Structure
Digital Evidence & Digital Investigations
2.1 Digital Evidence
2.1.1 Chain of Custody
2.1.2 Order of Volatility
2.2 Digital Investigations
2.2.1 The Event-based Digital Forensic Investigation Framework
2.2.2 The Digital Investigation Process
2.2.3 The Scientific Method and the Hypothesis-Based Approach
2.3 Forensic Tools & Integration Issues
2.3.1 Tool Integration
2.3.2 Data Representation
2.3.3 Correlation-based Analysis Techniques
Semantic Web Technologies
3.1 Semantic Web Foundations
3.2 Semantic Web Architecture
3.2.1 Uniform Resource Identifier (URI)
3.2.2 XML / Namespaces
3.2.3 XML Schema / XML Query
3.2.4 RDF / RDF Schema
3.2.5 Ontologies
3.2.6 Rules / Query
3.2.7 Top Layers
Semantic Web & Digital Investigations
4.1 XML-based Approaches
4.2 RDF-based Approaches
4.3 Ontological Approaches
Research Methodology
A Framework for Semantically-Enabled Digital Investigations
6.1 An approach for digital evidence integration, correlation and hypothesis evaluation based on Semantic Web technologies
6.2 Relation to Digital Investigation Reference Models
6.3 Evaluation Criteria
A Semantically enabled Method for Digital Evidence Integration, Correlation and Hypothesis Evaluation
7.1 Description of the Method
7.2 Ontological Representation of Digital Evidence
7.2.1 Network Packet Capture Ontology
7.2.2 Forensic Disk Image Ontology
7.2.3 Windows Firewall Log Ontology
7.2.4 WHOIS Ontology
7.2.5 Malicious Networks Ontology
7.2.6 Malware Detection Ontology
7.3 Semantic Integration and Correlation of Forensic Evidence
7.3.1 Semantic Integration
7.3.2 Evidence Correlation
7.4 Query Formulation and Evaluation
7.5 A reference method implementation
7.5.1 Overview of the tools used
7.5.2 Architecture of the PoC system
Demonstration of the Method
8.1 Description of the Experiments
8.2 Integration and Correlation of Digital Artifacts
8.3 Hypothesis formulation and evaluation
8.4 Evaluation of the Method
Conclusions and Future Work
9.1 Conclusions
9.2 Future Work
List of References

List of Figures

Figure 1: Overview of the Event-based Digital Forensic Investigation Framework, adapted from (B. D. Carrier & E. H. Spafford 2004)
Figure 2: Semantic Web Architecture (Antoniou & Van Harmelen 2004)
Figure 3: Overview of the Design Science Method (Johannesson & Perjons 2012)
Figure 4: Data Integration based on shared resource URI
Figure 5: Data Integration based on owl:sameAs
Figure 6: Semantic inconsistency as reported by the reasoning engine
Figure 7: Class membership entailment based on value restrictions
Figure 8: Conceptual relation between forensic frameworks and the Semantic Web stack
Figure 9: The abstracted method's structure
Figure 10: Ontological modeling of network packet captures
Figure 11: Ontological modeling of a forensic disk image
Figure 12: Ontological modeling of Windows firewall logs
Figure 13: Ontological modeling of WHOIS data as provided by RIPE
Figure 14: Ontological modeling of FIRE's blacklist of malicious networks/hosts
Figure 15: Ontological modeling of VirusTotal's anti-malware detection service
Figure 16: Transformation process of raw data to their semantic representation
Figure 17: De-duplication of data by semantic integration using URIs
Figure 18: Semantic Integration of related individuals represented in different ontologies
Figure 19: Integration of IP addresses/MD5 hash signatures
Figure 20: Conversion process according to the SWRL Temporal Ontology
Figure 21: Temporal relations of Allen's Interval Algebra
Figure 22: Mereological correlation between IP addresses and Autonomous Systems
Figure 23: SPARQL graph pattern matching
Figure 24: Proof-of-concept system architecture
Figure 25: Attack scenario of a 'bind_tcp' shellcode triggered by a malicious Word document downloaded from the Web
Figure 26: Attack scenario of a 'reverse_tcp' shellcode triggered by a malicious Word document downloaded from the Web
Figure 27: Visualization of the semantic representation of the evidence files

List of Tables

Table 1: List of Generic Criteria in terms of the GQM methodology
Table 2: List of Forensic related criteria in terms of the GQM methodology
Table 3: List of criteria regarding the Semantic Web principles in terms of the GQM methodology
Table 4: Entities of the Network Packet Capture Ontology
Table 5: Entities of the Disk Image Ontology
Table 6: List of fields of Windows Firewall log entries
Table 7: Entities of the Windows Firewall Log Ontology
Table 8: Entities of the RIPE WHOIS Ontology
Table 9: Entities of the MaliciousNetworks ontology
Table 10: Entities of the VirusTotal ontology
Table 11: Integration semantic mappings between ontologies
Table 12: Semantic Representation of the Experiment 1 Evidence Files
Table 13: SWRL Rule Evaluation Results for Experiment 1
Table 14: Semantic Representation of the Experiment 2 Evidence Files
Table 15: SWRL Rule Evaluation Results for Experiment 2

List of Abbreviations

AFF = Advanced Forensic Format
AS = Autonomous System
DLL = Dynamic Link Library
DNS = Domain Name System
FAT = File Allocation Table
FIRE = Finding Rogue Networks Project
FIWALK = File Inode Walk
GUI = Graphical User Interface
HTTP = Hypertext Transfer Protocol
IDS = Intrusion Detection System
ISP = Internet Service Provider
MACE = Last Modified, Last Access, Created and Entry Modified Timestamps
Malware = Malicious Software
MD5 = Message Digest Algorithm (version 5)
NIST = National Institute of Standards and Technology
NSRL = National Software Reference Library
OWL = Web Ontology Language
RDF = Resource Description Framework
RIPE NCC = Réseaux IP Européens, Network Coordination Center
SPARQL = SPARQL Protocol and RDF Query Language
SWRL = Semantic Web Rule Language
TCP = Transmission Control Protocol
W3C = World Wide Web Consortium
XML = Extensible Markup Language

Introduction

1.1 Problem Description

The field of Digital Forensics is facing an increasing number of challenges. Modern, complex and networked IT systems are under constant threat from various and increasingly sophisticated types of attacks, ranging from reconnaissance attempts up to targeted APT attacks. Moreover, the constant evolution of the technological landscape, with the introduction of new products and technologies, as well as the large volumes of data involved, presents Digital Forensic practitioners with almost unmanageable complexity in the timely accomplishment of their tasks. DF cases may require the integration of evidentiary data from disparate sources such as hard disks, volatile memory, security appliance logs and network communications. Additional problems may arise when multiple parties are involved in the analysis of a security event, due to different levels of expertise or to communication problems stemming from the lack of an agreed set of terms and definitions. A final side-problem concerns the presentation level, which is an important component of almost all DF investigation models. The results of a DF investigation have to be communicated to a court's jury or an organization's decision-making board in an understandable manner, avoiding, unless required, technical jargon or intricate terminology, without of course affecting the admissibility or the probative value of the presented evidence.
1.2 Justification, Motivation and Benefits

Existing practices and tools in DF practice are usually restricted by architectural limitations to a specialized subset of digital forensic collection and analysis needs, such as file carving, log analysis or network forensic analysis tools. These tools can also become quickly outdated by newly introduced technologies or data formats. All the above can lead the forensic examiner to demanding and lengthy periods of manual analysis for the purpose of integrating the outcomes of the various tools, as well as to constant training in order to be able to deal with new problems. The motivation of the present endeavor is the ability to automate and streamline the analysis and reasoning process of any digital investigation. The addition of semantically expressed assertions during the examination, analysis and presentation phases can improve current DF investigation models and techniques, as well as promote new ideas about how artifacts produced during an investigation can be represented, integrated and linked in novel and meaningful ways.

Automation of DF tasks can lead to multifaceted benefits both for DF practitioners and for other relevant parties. Lowering the barrier of entry for less experienced forensic examiners and reducing the resources and time needed for manual analysis can considerably improve the efficiency and capabilities of the various forensic units or CERTs. A commonly agreed representation layer for the various artifacts and assertions produced during the investigation process can lower the dependence on specific tools or vendors, while enhancing the ability to integrate sources of data of different natures into a common analysis framework armed with advanced correlation and reasoning features.
The addition of a more expressive representation layer can enable the active involvement of other interested but less technically educated parties (from the legal, academic or business areas) during the whole process. Finally, a semantically enriched digital investigation can introduce new methods and techniques in various aspects of case handling, ranging from new forms of analysis such as conceptual searching, to the refinement of data retention and case archiving policies, a provenance model for tracing the lineage of artifacts produced during the investigation process, and more accessible and compelling presentation forms for the reporting of findings.

1.3 Research Questions

In order to better clarify and delimit the research area, as well as provide a range of criteria upon which the results can be evaluated, the following research questions have been defined.

How can Semantic Web technologies and the Linked Data initiative be applied to Digital Forensics?

A method that combines both areas has to respect the limitations and constraints of each, but also suggest advantages that can be attained by introducing and applying the concept of publishing and linking structured data in both the theoretical models and the practice of the Digital Forensics field.

How can a common ontology-based knowledge representation layer improve the level of integration of currently disjoint specialized areas of DF, such as storage, network, mobile, live memory and others?

A method that utilizes ontologies to represent commonly used concepts and entities pertinent to the area of digital forensic investigations may enable a computer-understandable and formalized expression format, integrate and aggregate data originating from different sources into higher abstraction levels, enable automated reasoning able to infer new knowledge, and promote reusability of existing knowledge bases.
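To make the idea of such a shared representation layer concrete, the following minimal sketch (plain Python; the `ex:` names and the IP-address URI are purely illustrative, not the thesis's actual ontologies) shows how two evidence sources that mint the same URI for an IP address become linked simply by merging their triple sets:

```python
# Two evidence sources expressed as (subject, predicate, object) triples.
# All identifiers here are hypothetical examples.
packets = [
    ("ex:pkt1", "rdf:type", "ex:NetworkPacket"),
    ("ex:pkt1", "ex:hasSourceIP", "ex:ip-10.0.0.5"),
]
firewall = [
    ("ex:log1", "rdf:type", "ex:FirewallLogEntry"),
    ("ex:log1", "ex:hasSourceIP", "ex:ip-10.0.0.5"),
]

# Because both sources use the same URI for the IP address, merging the
# graphs correlates the packet and the log entry with no extra mapping step.
graph = set(packets) | set(firewall)
related = sorted(s for s, p, o in graph
                 if p == "ex:hasSourceIP" and o == "ex:ip-10.0.0.5")
print(related)  # ['ex:log1', 'ex:pkt1']
```

In an RDF store the same effect is obtained automatically when graphs are merged, which is the integration mechanism the research question refers to.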
How may such a new method improve the efficiency and capabilities of existing DF investigation models, techniques and tools?

A method proposing automated handling of specific parts of a DF investigation process should reduce the complexity and requirements of existing semi-automated methods, as well as propose new advanced capabilities that could improve the effectiveness of existing tools and processes.

1.4 Audience

The thesis attempts to connect two seemingly disparate fields: Digital Forensics and the Semantic Web. It builds on top of some existing attempts to take advantage of key features of Semantic Web technologies with the goal of advancing automation and data integration in the area of digital forensics. As such, the thesis is expected to be of interest to both practitioners and researchers in the areas of digital forensics and information security in general, as it demonstrates a new, efficient and flexible method for improving the current status of analytical techniques and skills employed during most types of digital investigations. The thesis should also be relevant to the research community of the Semantic Web and the Linked Data initiative, as it demonstrates a practical implementation of such technologies in a new field and discusses various challenges and possible solutions that such a system may face and deploy.

1.5 Limitations

The Semantic Web is a union of various standards, languages and technologies extending well-known data encoding languages such as XML and introducing new techniques originating from fields such as Artificial Intelligence and Knowledge Representation. All these technologies are continuously improved and extended with new features, and are increasingly supported by relevant tools and programmatic libraries. Due to the scope of this thesis, only a subset of all the available features is discussed and utilized.
The method described in this thesis can, however, be further extended to take advantage of relevant advancements in all these technologies. Additionally, these technologies are not fully described in the respective sections of the thesis, but references to the standards can be followed for deeper coverage. Moreover, any current digital investigation, especially in cases relevant to system and network security compromises, can be quite complex when large networks, advanced types of malware or penetration techniques, etc. are involved. A digital investigator may also have to face a plethora of data sources that have to be processed and analyzed in order to find traces and reconstruct such events. This thesis has selected only a subset of such data sources, one that is commonly important in most digital investigations and covers a representative part of the spectrum. The experiments conducted in the thesis have been selected so as to resemble common scenarios of actual security compromises, although their complexity and the number of involved entities have certainly been simplified compared to real large-scale events.

1.6 Thesis Structure

The thesis has been divided into several chapters following a top-down approach due to its interdisciplinary nature. Chapter 2 presents the theoretical background in the area of Digital Investigations. The chapter is divided into three parts: the first discusses the basic principles of digital evidence, the second presents the conceptual frameworks that guide digital investigations, and the final one discusses current problems and the challenges that the field faces. Chapter 3 is a short description of the Semantic Web initiative as well as the stack of technologies and languages it comprises. Chapter 4 presents related work that has also merged, partially or fully, these two areas.
The chapter is subdivided into three sections with respect to the layer of the Semantic Web stack up to which this merge has reached. Chapter 5 discusses the methodology followed during this thesis with respect to the research methods applied and the steps followed for the design, implementation and evaluation of the method. Chapter 6 presents the framework of the proposed method. The chapter provides a high-level view of how these two disciplines can be combined and with what advantages, and ends with the specification of a number of evaluation criteria for the proposed method. Chapter 7 presents the proposed method. The first part discusses its overall structure, while the second part presents the ontologies developed for this thesis. The latter parts discuss in more detail the parts of the method that deal with the integration, correlation and querying of the source data. The chapter finishes with the presentation of a proof-of-concept system developed to evaluate the proposed method. Chapter 8 describes the experiments that were conducted in order to examine both the practical value of the implemented system and the feasibility and potential of the proposed method. Chapter 9 concludes with an overall discussion of the proposed method and the final outcome of the thesis, and discusses some possible extensions and improvements that may be worthy of further research and experimentation.

Digital Evidence & Digital Investigations

The subject of this chapter is a brief but concise definition and description of the digital forensics area and the role of digital evidence, as well as a presentation of prominent digital forensic process frameworks. A short presentation of the different sub-topics of the field and various methods of analysis, as well as a discussion of tools and techniques, follows.

2.1 Digital Evidence

The central point of reference of every type of digital investigation is irrefutably the concept of digital evidence.
A multitude of different definitions can be found in the literature, such as:

'any data stored or transmitted using a computer that support or refute a theory of how an offence occurred or that address critical elements of the offence such as intent or alibi' (E. Casey 2002)

'information stored or transmitted in binary form that may be relied upon in court' (IOCE 2002)

'digital evidence of an incident is any digital data that contain reliable information that supports or refutes a hypothesis about the incident' (B. D. Carrier & E. H. Spafford 2004)

As stated in (B. L. Schatz 2007), subtle differences exist between the definitions, mainly pertinent to the perspective of the claiming author or body, namely a legal or an investigative one. Thus, some definitions focus on the investigative process and the support that evidence can provide to hypothesis validation, while others deal with the probative value of electronic data in the legal context. Another important aspect of a definition of digital evidence is its scope. Due to the continuous emergence of new and innovative digital technologies, digital evidence nowadays can no longer be restricted solely to computers in their traditional form, but must also include new types of digital devices such as mobile phones, portable tablets, digital cameras, GPS devices, etc. Even the notion of digital evidence in its traditional computer-related form must be updated due to the introduction of new paradigms, such as virtualization and cloud computing, which promote the physical and logical abstraction and separation between the execution, computation and storage parts of a computer system. Digital evidence is the central point of any digital investigation or forensics process, and thus its rigorous and authenticated handling is of paramount importance.
Digital evidence can be identified and interpreted at different abstraction layers: as electrical charges in memory transistors and electromagnetic waves in wireless networks from a physical point of view, as bits and bytes in registers and memory addresses from a computing perspective, or as files and directories from an OS perspective. Despite the different properties that digital evidence carries at each layer, the interpretation of a piece of evidence as information, and its relevance to the context of the current investigative process, is what adds value to it and promotes it from mere data to information of evidentiary value. Schatz (B. L. Schatz 2007) has identified three basic properties of digital evidence, namely latency, fidelity and volatility. Latency refers to the fact that a digital encoding in the form of binary data needs additional contextual information on how it should be interpreted. Fidelity is a property of digital data that allows a copy of it, assuming the integrity of the copying process is verified, to be treated equally to the original. This is especially important in the DF area, where access to the original data must be restricted to exceptional circumstances only and be performed by competent personnel (ACPO 2007). Finally, the volatile nature of digital evidence considerably affects the practice of its acquisition and further processing, since its authenticity can easily be disputed unless proper and up-to-date procedures are always applied. Although the focus of the current thesis is on the latent nature of evidence and how it can be enriched with semantic content, the fidelity and volatility of evidence are important enough to be encapsulated in two commonly cited forensic principles, Chain of Custody and Order of Volatility, which for purposes of completeness are briefly discussed below.

2.1.1 Chain of Custody

According to the U.S.
National Institute of Justice, chain of custody is defined as "a process used to maintain and document the chronological history of the evidence" (National Institute of Justice 2011). This involves documentation of the names of all individuals involved in the collection, preservation and analysis of evidence, timestamps of any process applied to the data, as well as contextual information such as case numbering, involved agencies and laboratories, additional data regarding the individuals or entities involved in the case, and a brief description of each item. Another definition, provided by the Scientific Working Group on Digital Evidence and Imaging Technology, defines Chain of Custody as "the chronological documentation of the movement, location and possession of evidence". Thus anyone involved in the forensic examination process can be held responsible for failing to maintain the Chain of Custody due to bad practices or missing documentation. Maintaining the Chain of Custody, though, is confronted with a number of challenges due to the continuous growth in the complexity and diversity of digital systems, while at the same time most of the tools and techniques on which the investigator relies for preservation and analysis have not been assured with respect to their correctness and the validity of their results (Guo et al. 2009). Such problems can affect the acceptability of digital evidence as genuine and reliable by society (Turner 2005), and especially in the business world, where governments impose data retention rules and relevant legislation on businesses, requiring stringent procedures of data preservation and maintenance in order to verify their authenticity and integrity (Patzakis 2003). In order to ensure the Chain of Custody, different techniques have been suggested and applied. The most common approach is the usage of hash functions for performing integrity checks on the data, accompanied by timestamping of the performed forensic activities.
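A minimal sketch of such an integrity check using Python's standard `hashlib`, with a small temporary file standing in for an acquired disk image (the file contents and chunk size are illustrative):

```python
import hashlib
import os
import tempfile

def image_digest(path, algo="sha256", chunk=1 << 20):
    """Hash an image file in chunks so arbitrarily large images fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Simulate an acquired image with a small temporary file.
fd, image = tempfile.mkstemp(suffix=".dd")
os.write(fd, b"\x00" * 4096)  # stand-in for raw disk sectors
os.close(fd)

acquired = image_digest(image)   # digest recorded at acquisition time
verified = image_digest(image)   # digest recomputed before examination
print(acquired == verified)      # True: the image is unmodified
os.remove(image)
```

Recording the acquisition-time digest alongside timestamps and examiner names is what lets the matching recomputed digest demonstrate that the image was not altered in between.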
A cryptographic hash of an entire disk image can be used to ensure that no modifications have been applied to the image between the acquisition and examination phases. In order to enable higher-level documentation of such images with metadata such as time of acquisition, serial numbers of the device, names of those performing the acquisition etc., tool vendors such as EnCase, iLook and ProDiscover have introduced proprietary image formats to be used with their tools, with varying levels of inter-compatibility and openness. Open formats with advanced capabilities have also been suggested, such as the Digital Evidence Bag (DEB) (Turner 2005), which supports additional metadata on the history of operations performed on the image, and the Advanced Forensic Format (Cohen et al. 2009), which can support different types of evidence such as disk drives, network packets, memory images and extracted files, as well as cryptographic operations and image signing.

2.1.2 Order of Volatility

As good forensic practice suggests, a complete acquisition of a copy of the entire target system is the goal, so that the examiner has as accurate a picture of the system as possible. However, even the process of collecting the data itself, direct modifications incurred by other users connected to the system, or “traps” that an attacker may leave behind in a compromised system may introduce unwanted changes to the data. Given the different nature of the data storage technologies (memory, network state, running processes, disks, and backup media) that comprise a modern computer, special attention has to be given to how the data collection steps for each are prioritized and performed. Order of Volatility is a forensic principle promoting the concept that data must be collected in an order based on their volatile nature, proceeding from the most volatile to the least volatile.
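The principle can be made operational as a simple collection-ordering helper; a sketch in Python, where the source labels follow the ordering of (Brezinski & Killalea 2002) discussed next:

```python
# Volatility ranking, most volatile first, per Brezinski & Killalea.
ORDER_OF_VOLATILITY = [
    "registers",
    "memory",
    "process table",
    "temporary file systems",
    "disk",
    "remote logging and monitoring data",
    "physical configuration and network topology",
    "archival media",
]

def collection_plan(available_sources):
    """Sort the evidence sources present on a system so that the most
    volatile ones are acquired first; unknown sources go last."""
    rank = {name: i for i, name in enumerate(ORDER_OF_VOLATILITY)}
    return sorted(available_sources, key=lambda s: rank.get(s, len(rank)))
```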
(Brezinski & Killalea 2002) suggest collecting data in the following order: registers, memory, process table, temporary file systems, disk, remote logging and monitoring data, physical configuration and network topology, and finally archival media. Especially with the latest advancements in the area of live memory forensics (Schuster 2010) and the capabilities they provide for extracting types of evidence previously missed, such as command prompt history (Stevens & Eoghan Casey 2010), the order of volatility is becoming even more important.

2.2 Digital Investigations

The set of principles and procedures that are followed during the lifecycle of digital evidence, from acquisition and preservation to analysis and reporting, is encompassed under the term digital investigation. (Kruse & Heiser 2002) have defined digital forensics as the “preservation, identification, extraction, documentation and interpretation of computer media for evidentiary and/or root cause analysis”, while the Scientific Working Group on Digital Evidence (SWGDE) has defined computer forensics as “a sub-discipline of digital & multimedia evidence which involves the scientific examination, analysis, and/or evaluation of digital evidence in legal matters” (Scientific Working Group On Digital Evidence 2011). Attendees of the first Digital Forensic Research Workshop have defined digital forensic science as: “the use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations”.
(Palmer 2001) As can be seen, although both definitions cover the various steps of the forensic process, they vary with regard to the source of evidence and its purpose. Terms such as digital forensics and computer forensics are commonly used interchangeably, although, as mentioned before, the introduction of new types of digital devices such as mobile phones and digital cameras may make the term computer forensics sound too restrictive or specialized. (B. L. Schatz 2007) presents an interesting description of how these terms have historically been used and evolved. He concludes that the practice of forensics in different contexts, such as the judicial, military or corporate sector, and the different requirements each imposes regarding the accuracy of results, performance in terms of time, the rigor of the process as a whole and its primary objective, have caused slight divergences between the definitions. On the other hand, security incident response is a closely related term mostly used in the business domain, covering a vast and diverse range of security incidents. A security incident has been defined as “any unintended activity that results in a negative impact on information security” (Tan et al. 2003) or “a violation or imminent threat of violation of computer security policies, acceptable use policies, or standard security practices” (Scarfone & Masone 2004). Incident response thus can be defined as “the process that intends to minimize the incident’s impact, and investigates and learns from such security breaches” (Lamis 2010). Commonly, the purpose of security incident response is the reconstruction of security incidents, the eradication of any remaining vulnerabilities and the recovery of the system to its normal operating status.
Despite the differences in context and objectives between the two procedures, the integration of proper forensic techniques into the incident response field is gaining importance (Kent et al. 2006). Proper documentation and evidence handling is equally important, especially in cases where attacker attribution and subsequent legal prosecution are desired. The generic term digital investigation can reflect the differences of focus and context between the different fields and is used in the current thesis as encompassing both. The rest of the thesis focuses on the forensic analysis of networked computer intrusions and compromises in general, so research and practice from both fields is relevant.

2.2.1 The Event-based Digital Forensic Investigation Framework

The need to standardize and formalize digital investigations has led to the formulation of the conceptual tasks and actions that are performed in the context of a digital investigation. A useful framework has been introduced by Carrier and Spafford under the name Event-based Digital Forensic Investigation Framework (B. D. Carrier & E. H. Spafford 2004). Such a conceptual framework can improve the understanding of the different phases of a digital investigation. The framework has been influenced by procedures followed in investigations of physical crime scenes and extended so as to cover digital ones as well. A graphical overview of the framework is presented in Figure 1. The main phases of the framework are briefly presented below.

The Readiness Phase covers operations readiness (training of people, methodology) and infrastructure readiness (configuration of the infrastructure in a forensics-appropriate manner, i.e. forensic readiness).

The Deployment Phase includes the “detection, notification, confirmation and authorization phases”.
The first two sub-phases deal with the detection and acknowledgment of the incident, while the latter two deal with the investigators being granted permission (e.g. a search warrant) to conduct the investigation.

The Physical Crime Scene Investigation Phase involves the examination of the physical scene of the crime or incident. If any physical device that contains digital data is identified and seized, this leads to the digital crime scene investigation phase.

The Digital Crime Scene Investigation Phase is comprised of three sub-phases, namely the System Preservation & Documentation Phase, the Evidence Searching & Documentation Phase and the Event Reconstruction & Documentation Phase. These sub-phases describe at a high level the main activities involved during the examination of the digital data and are presented in more detail below. It is important to underline the significance of documentation, since it appears in each one of them. It is also worth noting that the sub-phases of evidence searching and event reconstruction are conducted iteratively in an attempt to prove or disprove the hypotheses related to the events. Additionally, analyzed physical or digital evidence can lead to new instances of these phases for further acquisition and analysis.

The Presentation Phase, finally, is where the results of the analysis are presented along with the documentation of all the actions performed throughout the process.

Figure 1: Overview of the Event-based Digital Forensic Investigation Framework, adapted from (B. D. Carrier & E. H. Spafford 2004)

The phase most relevant to the present thesis is of course the Digital Crime Scene Investigation phase. Each of its three sub-phases is an essential part of a proper digital investigative process. In the System Preservation phase, the investigator is confronted with the challenge of not affecting the state of the analyzed system and thus altering the stored digital data in undocumented ways.
The state of the system is preserved by copying the data to other digital media. This process is quite different from those applied in the physical phase, and thus integrity checking of the end result is necessary (e.g. the use of hash functions). The principles of chain of custody and especially the order of volatility, as discussed above, should guide the preservation process, otherwise the validity of the following phases can be challenged. During the Evidence Searching phase, the investigator searches through the preserved data for anything of evidentiary value. This process is highly contextual, since the target of interest depends on the actual case. Different types of searching can be applied during this phase, such as keyword searching, and the results can also depend on the investigator’s knowledge and previous experience or the accuracy of the tools used. Finally, the Event Reconstruction phase is the process where the evidence detected during the searching phase is aggregated and correlated in such a way as to improve the understanding of the incident and provide support for the proof or falsification of the initially formulated hypotheses.

2.2.2 The Digital Investigation Process

Although the previously presented framework can provide a conceptual foundation for most types of digital investigation, it is considered too abstract from a practical perspective. Various authors have proposed process models of a digital investigation with subtle differences between them in terminology or granularity. A prominent process model is discussed below and its main steps briefly explained. (Casey 2004) describes a staircase-like investigative process model that can provide a methodical and practical approach. The model consists of the following steps:

Incident alerts or accusation: This step involves the initial reporting of a crime or policy violation.
Assessment of worth: The worth of investigating is estimated and, in the case of multiple cases running in parallel, a prioritization among them is performed.

Incident/crime scene protocols: This includes the procedures and methodical steps that must be followed by the investigator when accessing the incident/crime scene. The protocols may differ depending on the real or virtual nature of the incident/crime scene.

Identification or seizure: Any relevant object that could be of evidentiary value is recognized and seized. Proper packaging procedures for identification and linking to the specific incident/crime instance are applied.

Preservation: This involves all necessary case management tasks to protect the integrity of the original media and prohibit any inadvertent modifications. This step is the beginning of the chain of custody through documentation that will allow the traceability of the object’s origin to the final evidence. It mostly involves imaging technologies, in order to acquire copies of the original object that are as exact as possible. A variety of solutions are used to fulfill this task, such as specialized imaging hardware, write-block software etc. The Order of Volatility has to be considered regarding what must be collected and in which order.

Recovery: Prior to the analysis step, a recovery of any resident but not directly observable data has to be performed. The most prominent example is that of data resident in a storage device, which can include deleted, hidden, camouflaged or fragmented parts. A complete recovery step allows later access not only to active data but potentially to hidden and deleted data as well, thus providing access to the maximum possible amount of content and therefore enabling the investigator to perform a much more complete analysis.

Harvesting: The investigator identifies categories of data that, based on knowledge or experience, are most related to the case in focus.
The results of the previous step are organized in such a manner as to allow access to specific categories of data known to be relevant to specific types of cases, e.g. pictures and videos in the case of contraband material, or executable files and scripts in the case of computer compromises.

Reduction: During this step, filtering based on relevant criteria is performed in order to reduce the amount of data needed for the analysis. A common technique is the automated removal of known files that are part of operating systems or other applications. The signatures of these files, commonly in the form of hashes, can be stored in a database and used in combination with the forensic tools in order to remove unnecessary data.

Organization and search: In order to facilitate a more thorough and complete analysis, certain files and data in general can be grouped. This gives the investigator easier access to the data, enables search operations that can identify interesting data or events faster, and finally allows cross-referencing between the data as well as in the final reports.

Analysis: This is the main task, where the products of the previous steps are further evaluated for their significance and probative value to the case. The main focus in this step is the content of the data selected in the previous steps, as well as its interpretation in relation to the case at hand. The analysis part is usually quite loosely defined in the majority of digital forensics process models, and a more detailed description of it and its subcategories follows below.

Reporting: The final report should contain all the necessary documentation of actions and results attained throughout all the previous steps. The report also contains the results of the analysis phase, with data of evidentiary value along with any conclusions drawn by the examiner.
The examiner should remain objective, presenting not only the supporting evidence but also alternative scenarios that the current evidence cannot exclude.

Persuasion and testimony: Often, the final result of an investigative process must be communicated to decision makers, outlining the incident and the conclusions of the investigation in a clear and understandable manner. This requires transforming technical and other intrinsic details into an understandable narrative of the incident.

The Analysis part of the aforementioned process model can be further subdivided into five distinct tasks, discussed below:

Assessment: Digital data can include human-readable content that is directly accessible and interpretable by the investigator. This content can be evaluated with regard to its relation to the context of the case, such as the means and motivation of the attacker.

Experiment: Due to the unique nature of each specific case and the particular combination of technologies involved both in the actual incident and in the investigation process, the investigator may be called upon to employ previously untested and untried methods and techniques. Detailed and rigorous documentation of such actions is of paramount importance, so as to enable their reproducibility and testability by, e.g., the courts in order to assess the admissibility of their results.

Fusion: Data collected during an investigation comes in a multitude of different formats and from a variety of sources. Each piece of information can provide insight into a part of the incident. The data have to be fused in order for the investigator to connect the different pieces of the puzzle and gain a better insight into the whole incident and its various parts. As an example, the various reported events or actions can be used for the construction of a timeline of the activities performed during the investigated incident.
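As a minimal sketch of such fusion, heterogeneous event streams can be normalized to a common (timestamp, source, description) shape and merged into one timeline; the sources and events below are invented for illustration:

```python
from datetime import datetime

def fuse_timeline(*event_streams):
    """Merge already-normalized event streams from different sources
    (logs, IDS alerts, file system timestamps, ...) into one
    chronologically ordered timeline."""
    return sorted(
        (event for stream in event_streams for event in stream),
        key=lambda event: event[0],
    )

# Hypothetical per-source streams, each as (timestamp, source, description):
ids_alerts = [(datetime(2012, 3, 1, 10, 5), "ids", "exploit signature matched")]
fs_events = [
    (datetime(2012, 3, 1, 10, 7), "filesystem", "backdoor.exe created"),
    (datetime(2012, 3, 1, 10, 4), "filesystem", "service.log truncated"),
]
timeline = fuse_timeline(ids_alerts, fs_events)
```

The hard part in practice is not the merge but the normalization that precedes it: each source has its own format, terminology and clock, which is precisely the integration problem this thesis addresses.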
Correlation: Correlation attempts to link the various events through causal relationships, such that an action A can be identified as the cause leading to an event B as its effect. Correlation can involve temporal correlation based on the chronological ordering of events, but also other contextual information such as connections between the persons involved in the case.

Validation: This concerns the results of the analysis phase, where the findings along with their backing reasoning are collected and submitted to the jury or other decision makers for further actions such as prosecution.

2.2.3 The Scientific Method and the Hypothesis-Based Approach

Although, in theory, a digital investigation process model such as the one presented above covers all aspects of the investigative process and can provide guidelines to the investigator, its implementation introduces new types of problems and challenges. Despite such models appearing linear, with one step following the previous one, in practice these steps are intertwined and not clearly separated. In order for the results of such processes to be assessed in terms of their validity, and therefore for their admissibility as evidence to be judged, especially in the legal context, evaluation criteria have been developed so as to assess them in scientific terms. The most common criteria, followed by U.S. courts among others, are known by the name of the Daubert standard (Wikipedia 2011). Briefly, these criteria evaluate the scientific value of the investigative process followed, along with its results in the form of evidence, according to the following:

The theory or technique used can be (and has been) tested. In more general terms, the theory or technique must be falsifiable, refutable and testable.

There is a known or potential rate of error of the technique, and standards and controls covering how the technique should be operated must exist and be maintained.
The theory or technique should be subjected to peer review and publication.

The theory or technique should be generally accepted within the relevant scientific community.

Based on the above criteria, it can be seen that process models such as the above do not deal directly with important aspects of each step of an investigation, such as completeness, repeatability and reliability. The details provided for each part of the investigation lack consistency, and parts such as the analysis one in particular are quite ambiguous and abstract. The scientific method has been introduced as a simpler and more flexible methodology that promotes repeatability and testability in order to increase the reliability of the results. The main parts of the scientific method as described in (Casey 2004) are the gathering of facts and their initial validation, hypothesis formation and experimentation/testing, the search for evidence that supports or disproves the hypothesis, and finally the revision of the conclusions in the light of new evidence. (Carrier 2006) has described a hypothesis-based approach to digital forensic investigations, where the general scientific method is bridged to digital investigations as a process of formulating and testing hypotheses about previous states and events. (Casey 2004) provides a more detailed description of the various steps of the scientific method and how they can be applied:

Observation: An event is observed either directly (a system does not perform as it should) or indirectly (a sensor has produced an alert of a possible security incident).

Hypothesis: The investigator uses the current facts about the incident along with experience and knowledge to formulate a theory of what may have happened.

Prediction: Based on the hypothesis and previous knowledge, the investigator may predict where artifacts relevant to that event are located.
Experimentation/Testing: The investigator analyzes the available evidence in order to test the hypothesis. The goal of the scientific method is not only to support the hypothesis but also to use the available evidence to falsify and eliminate other possible alternative explanations.

Conclusion: The investigator forms a conclusion based upon the results of the previous tests, and the conclusions are further communicated to the interested parties.

However, due to its empirical nature and its reliance on the investigator’s knowledge for formulating better hypotheses or performing more accurate predictions, the above method has to deal with various challenges. (Rekhis & Boudriga 2011) have enumerated some of them in the form of the following requirements:

Formalization and proof automation are needed for the purposes of accuracy and practicality. In order to reduce the complexity of the analysis and enable even less experienced or knowledgeable investigators to deal with complex scenarios, a formalized and explicit representation of the investigator’s knowledge and observations is deemed necessary. Automation also has the potential to decrease the amount of analysis time needed.

Integration of the investigator’s entire knowledge about the investigated systems is required. Since evidence comes from a variety of systems and tools of different natures, integration is needed to enable the reconstruction of the attack scenario and to allow the formulation of more complete hypotheses and more accurate testing. Attack scenarios should also be represented in a format expressive enough to allow the hypothesizing of complex or even novel and unknown types of attacks.

Since a digital investigation quite often has to deal with uncertainty and reduced visibility of system vulnerabilities, or with collected evidence that cannot support a full understanding of the case, the ability to reason under uncertainty and even to automatically promote or filter hypotheses is needed.
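A toy illustration of such formalized hypothesis testing, in the spirit of the finite-state approaches discussed in the following section (the states, events and transitions here are invented, not taken from any of the cited papers):

```python
from itertools import product

# Hypothetical transition function of a compromised host.
TRANSITIONS = {
    ("clean", "exploit"): "compromised",
    ("compromised", "install_backdoor"): "backdoored",
    ("compromised", "wipe_logs"): "compromised",
    ("backdoored", "wipe_logs"): "backdoored",
}
EVENTS = ["exploit", "install_backdoor", "wipe_logs"]

def consistent_scenarios(start, observed_final, max_len):
    """Enumerate candidate event sequences (hypotheses) and keep only
    those whose run from the start state ends in the state observed as
    evidence -- every other scenario is falsified and discarded."""
    keep = []
    for length in range(1, max_len + 1):
        for sequence in product(EVENTS, repeat=length):
            state = start
            for event in sequence:
                state = TRANSITIONS.get((state, event))
                if state is None:  # undefined transition: impossible run
                    break
            if state == observed_final:
                keep.append(sequence)
    return keep
```

Even this toy shows the scaling problem of such approaches: the number of candidate sequences grows exponentially with scenario length, which is one reason more expressive representations are sought.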
Various approaches have been suggested with regard to (semi-)automated hypothesis generation and evaluation for the purpose of digital investigations. One of the proposed systems was introduced by (Stallard & Levitt 2003), where an expert system utilizing a decision tree attempts to identify data redundancies in traces of security incidents. The concept is to promote searching through the collected evidence for any possible contradictions. The authors claim that such redundancies can be found in various places, such as file system metadata, log files and application-specific file formats. The expert system allows the digital investigator to hypothesize potential scenarios and rule out the contradicted ones. In order to perform the integrity checking, the system’s normal state must be specified a priori, which in practice is not so easy. (Gladyshev & Patel 2004) proposed a finite state machine approach for formalizing the reconstruction of potential attack scenarios by discarding the ones that do not match the collected evidence. However, FSMs are not very expressive and thus have problems dealing with more complex cases. In a similar approach, (Carrier & Spafford 2006) proposed a computation model based on a finite state machine and a computer’s history. The model suggests that a computer has a history which is not fully known, and that the digital investigation process has to formulate and test hypotheses about the previous states of the system and the events that occurred. The system, though, has practicality issues due to its modeling of quite primitive states, such as the enumeration of all storage locations, and it is not clearly shown how more complex cases of network intrusion can be handled. (Willassen 2008a) focused on the evidentiary value of timestamp evidence and how modifications or errors in it can affect the forensic analysis.
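Willassen's clock-skew idea can be illustrated with a small sketch: hypothesize a constant offset per clock, translate the observed timestamps back to hypothetical true time, and check whether the required causal order of the actions then holds (the event and clock names are invented for illustration):

```python
def skew_consistent(observations, skews, required_order):
    """observations: {event: (clock_name, observed_timestamp)}
    skews: {clock_name: hypothesized_offset}, the clock being assumed
    to have shown true time + offset.
    required_order: events in the causal order they must have occurred.
    Returns True if the corrected times respect that order under the
    hypothesized skews."""
    true_time = {
        event: observed - skews[clock]
        for event, (clock, observed) in observations.items()
    }
    corrected = [true_time[event] for event in required_order]
    return all(a <= b for a, b in zip(corrected, corrected[1:]))
```

For example, if a file download on host A is observed at t=100 but the file's first use on host B at t=90, the naive ordering is contradictory, while the hypothesis that host A's clock runs 20 seconds fast reconciles the observations.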
Willassen suggests formulating and testing hypotheses about skewed clock measurements and checking them against the observed evidence. In a following paper, (Willassen 2008b) improved the previous idea by modeling a set of actions and their effects on timestamps. Testing of the formulated hypotheses is thus performed by attempting to find a possible sequence of actions such that the order of the observed timestamp values matches. Finally, in (Rekhis & Boudriga 2011) a logic-based approach is presented as an extension to Lamport’s Temporal Logic of Actions. The paper lists a set of requirements for a formal digital investigation of security incidents. According to it, a simplified model for defining attack scenarios is needed, so as to be able to describe the underlying patterns of complex attacks and also to cope with new, unknown ones. Secondly, a method for hypothetical reasoning is needed in order to deal with incomplete knowledge of attack techniques or missing evidence. Thirdly, they suggest that the existence of a library of attacks can promote knowledge reuse and collaboration between digital investigators. Furthermore, modeling the evidence and integrating it as originating from different sources is of paramount importance, since each type of evidence is usually insufficient by itself to provide a full understanding of the attack scenario. Finally, hypotheses should be dynamically generated and prioritized based on their suitability.

2.3 Forensic Tools & Integration Issues

Forensic tools play a major role in modern digital investigation practice. Especially in the case of complex security incidents, a multitude of different sources can provide evidentiary data, such as file system analysis of media devices (e.g. hard disks, USB flash drives), network sources (IDS and firewall logs, packet or flow captures), as well as new types of sources such as live memory and mobile devices.
The distributed nature of digital evidence calls for advanced methods of tool interoperability and correlation of evidence. In (Garfinkel 2010), various challenges that the area of digital forensics encounters today are enumerated, most of them pertinent to tools. One of the fundamental issues is that the majority of modern tools have an evidence-oriented design, being specialized for acquiring or analyzing specific pieces of evidence, while at the same time focusing more on crimes where the computer is used as a repository of evidence for crimes against persons than on crimes committed with or against computers. These problems require a considerable amount of manual effort from the investigator, which becomes even harder considering the size of today’s data volumes. (Ayers 2009) considers modern storage media analysis tools such as EnCase and FTK to be first-generation tools, mostly suited for manual analysis and heavily limited in the analysis of large volumes of data, and concludes by proposing requirements for second-generation tools, with efficient data acquisition and data representation rated highly. (Carrier 2003) has proposed a series of generic requirements for forensic tools regarding their usability, comprehensibility, accuracy and verifiability. Carrier has suggested that a forensic tool can be seen as an interpreter of data between layers of abstraction. The tool handles a set of inputs and, based on a specified rule set, produces a set of outputs. During this process, two types of errors can be introduced. Tool implementation errors are flaws that prevent the tool from functioning as specified. Secondly, abstraction errors can be introduced when the tool’s representation of the system is inaccurate.

2.3.1 Tool Integration

In recent years, the commercial forensic tools market has been marked by the emergence of integrated forensic platforms such as EnCase and FTK.
These vendors strive to integrate a variety of tools and features into their platforms, such as navigation, search and presentation (Schatz 2007). Navigation allows the visualization and exploration of the structure of the collected evidence, while search allows easier identification of relevant data. Various searching techniques are offered, with keyword search being the most prominent, but also more advanced ones such as regular expressions, date ranges and search by data type. Furthermore, these suites incorporate a variety of viewers in order to be able to present different types of objects, or even the same data object in multiple formats. However, the proprietary and undocumented formats that these tools use impair further integration between tools, and especially with specialized open-source tools. On the other hand, the use of open-source tools can promote the reliability of their results, since their code is open for review. Still, open-source tools are quite often focused on specific tasks and do not provide sophisticated solutions for documentation and further integration into the digital investigation process. In (S. L. Garfinkel 2010), a number of future directions that research in the area of forensics should pursue are mentioned. Some of them, related to integration issues, are the need for research on meta-file-carving techniques that can combine the results of multiple individual carvers, the need for a modularized, standardized architecture that can provide a plug-in capability for dynamically supporting additional features, and the need for multi-threaded, multi-server distributed processing, as initiated by projects such as the “Distributed Environment for Large-scale inVestigations” (DELV) and the Open Computer Forensics Architecture (OCFA) (Vermaas et al. 2010).

2.3.2 Data Representation

How data is represented can be a key factor in improving the effectiveness of forensic tools and their capacity for integration.
(Garfinkel 2010) prioritized forensic data abstraction as one of the most important future research directions. The need for establishing a standardized set of abstractions and data formats covering the broad scope of data types encountered during a digital investigation is quite pressing. One major initiative towards the goal of a unified, common data representation format was promoted by the Digital Forensics Research Workshop (DFRWS): the definition of a standardized Common Digital Evidence Storage Format (CDESF), an open data format that can store both the copy of the digital evidence and related metadata. Unfortunately, due to resource limitations, the group disbanded and the project was halted. Other suggestions have been proposed, mostly by academia, without achieving wide acceptance and adoption. Most of these suggestions are XML-based and thus are further discussed in Chapter 4 of the thesis.

2.3.3 Correlation-based Analysis Techniques

Although correlation has been described above as one of the most important steps of the analysis phase, current support for correlation analysis between most forensic tools is poor. (Case et al. 2008) have proposed a framework for supporting automated evidence discovery and correlation under the name FACE. The goal of the framework is to automatically integrate and correlate data objects contained in memory dumps, network traces, disk images, log files and configuration files. By correlating all of the above, the investigator is able to acquire a more complete view in the form of the involved users, groups, processes, files and networks. (Abbott et al. 2006) improved on a previously described Event Correlation for Forensics (ECF) framework, with the goal of cross-correlating event information from log files collected from heterogeneous sources and identifying event scenarios from queries posed by the investigator.
The log events, after canonicalization, could be stored in a relational database to allow interactive or automated scenario identification based on these events. In another attempt, (Garfinkel 2006a) performed automated correlation analysis on a large number of secondary-market hard drives for the purpose of detecting interesting information such as credit card numbers, social security numbers (SSN), email addresses etc. The author introduced the term Cross-Drive Analysis (CDA) for correlating information extracted from the drives and identified by pseudo-unique identifiers. Event correlation has also been studied quite extensively in the area of Intrusion Detection Systems, with the objective of producing intrusion reports that capture a high-level view of network activity by taking advantage of the spatial and temporal properties of the alerts. Alerts are usually combined into meta-alerts, and false positives may be identified and discarded throughout the correlation process (Kruegel et al. 2005). The main problem with the correlation techniques and approaches presented above is that they usually focus on single domains and utilize very specific characteristics. A digital investigator, on the other hand, has to deal with a variety of different domains, which makes event correlation even more difficult. Another challenge is the extensibility of such solutions, which usually rely on their own devised event description languages that are weak on semantics, make it cumbersome to introduce new terms, and lack contextual information such as the configuration of the network and/or the system.
As a conclusion drawn from this section, we can see that a more expressive and extensible data representation format for representing evidence, contextual information and attack scenarios can improve the analysis part of the investigation process by promoting tool integration and high-level correlation of events from disparate domains for the purpose of evaluating hypotheses about complex attack scenarios.

Semantic Web Technologies

This section gives a brief overview of the basic principles and mechanisms on which the Semantic Web is built. The purpose of this chapter is to provide a background to semantics and to show how relevant technologies can provide solutions to knowledge representation and integration issues.

3.1 Semantic Web Foundations

Undoubtedly, the World Wide Web (WWW) is nowadays the most extensive distributed platform that has ever existed, realizing most of the goals initially envisioned by its creators for efficient information representation and inter-linking: publishing and accessing data of every scale, from personal notes up to enormous databases, advanced searching capabilities, and dynamic, personalized generation and presentation of data (Tim Berners-Lee et al. 1992). The Web, though, confronts a variety of challenges nowadays. The vast amount of information on the Web is designed for human consumption, with limited support for automated machine processing. Even when data are well structured and organized under database schemas, variations in the terminology used prevent the machine from automatically 'understanding' the structure, giving rise to new research fields such as information integration and schema matching. Another major problem concerns the quality of search engine results and of the retrieved content. Despite advancements in this area with a variety of heuristics, the keyword-based form of search has restrictions.
The same word may convey different meanings in different contexts, or relevant content may have been expressed with different terms and thus not retrieved, reducing the accuracy of the search engine (Tim Berners-Lee 1998). The aforementioned challenges and shortcomings of the current Web gave rise to the idea of the Semantic Web, which has been defined as follows: "The Semantic Web is not a separate Web but an extension to the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Tim Berners-Lee et al. 2001) The first observation is that the term "information" is used rather than "data". Data should be considered mere symbols, whereas information objects are collections of data along with the semantics that enable their correct interpretation (Haslhofer & Neuhold 2011). In order to better explain what a 'well-defined meaning' is, we need to introduce two important concepts fundamental to the Semantic Web, namely metadata and ontologies. Metadata is usually defined as "data about data". A common example of metadata used over the Web is HTML tags. HTML tags such as <p> and <div> are associated with content and describe how it should be presented to the user. In general, tags such as these have been used for the annotation of electronic documents in order to allow data-, presentation- or process-related semantics to be attached to the data itself. However, metadata can be of a dynamic nature, describing contextual or domain-specific information about the content. This dynamicity may create serious challenges for automated processing, since the metadata may not have been explicitly defined, and their semantic meaning then has to be extracted with other approaches, such as linguistic ones [Dhamankar2004], with questionable results. (Haslhofer & Neuhold 2011) have categorized these interoperability issues into three main categories, namely technical, syntactic and semantic ones.
The term ontology can be defined as "an explicit and formal specification of a conceptualization" (Gruber 1993). Ontologies can be used to capture domain-specific knowledge and express it in the form of entities and their relationships. Knowledge representation can be at both an assertional and an instance level, and can express in a semantic manner aspects such as entities, attributes, domain vocabularies, factual knowledge and their interrelationships. An ontology can be considered similar to a database schema describing the structure and semantics of data, although with important differences, since ontologies enable automated reasoning over a set of given facts (Noy & Klein 2004). It is important to highlight that the Semantic Web relies on an open-world assumption, in which the absence of a specific statement does not imply knowledge of its truth value (e.g. false, as in the closed-world assumption), thus better depicting the partial or incomplete knowledge that is apparent in the digital investigations field. This is the advantage of the Semantic Web approach over previous integration attempts: it allows reasoning over data by combining data and metadata along with their references to ontologies. This allows conclusions to be inferred from a given set of facts, thus making implicit knowledge explicit. The vision behind the Semantic Web is that intelligent agents will be able to utilize knowledge and reasoning in order to better understand their context, work autonomously and share information (Zhang et al. 2011).

3.2 Semantic Web Architecture

The Semantic Web has evolved over the last decade into a complex aggregation of different technologies, each one responsible for different aspects of the Semantic Web framework. (Antoniou & Van Harmelen 2004) have depicted the Semantic Web architecture using a layered approach, as shown in Figure 2.
Such an approach enables a better understanding of the main functions of the different technologies, as well as how the layers relate to each other and share results.

Figure 2: Semantic Web Architecture (Antoniou & Van Harmelen 2004)

A brief presentation of the various layers follows.

3.2.1 Uniform Resource Identifier (URI)

Since the Semantic Web is built on top of the existing Web architecture, URIs provide the foundation on which the other layers are based. A URI is a "compact sequence of characters that identifies an abstract or physical resource" (Berners-Lee et al. 2005). URIs are used extensively in the Semantic Web so that each resource described can be uniquely identified and referenced. URIs enable unique identification of a resource under a global scope with a consistent interpretation, thus alleviating the interoperability problems that arise when different resources are represented by the same name in different contexts. It is important to note that a URI does not necessarily imply access to the resource, as in the more familiar Uniform Resource Locator (URL) scheme, but can simply be used to denote a resource. Unicode is a standard promoting a consistent encoding and representation of text, supporting most modern writing systems and thus enabling support for multi-lingual environments. An Internationalized Resource Identifier (IRI) is a form of URI that can contain characters outside the ASCII character set and is thus more useful in the modern Web's internationalized context.

3.2.2 XML / Namespaces

XML stands for Extensible Markup Language. XML is a markup language that allows the encoding of arbitrary documents in a consistent tagged format that is both human- and machine-consumable. XML gains its extensibility from the fact that there is no predefined set of tags, thus allowing the construction and use of custom 'tag' sets per context. These tags take the form of either elements or attributes.
XML actually only imposes specific rules regarding the well-formed syntax and validity of the resulting encoded document. XML's contribution over the last decade, in almost every imaginable field, by promoting shared domain-specific 'tag' sets between communities and consistent sharing of data, has been tremendous and almost impossible to enumerate here. XML namespaces enable the qualification of element and attribute names used in XML documents by tying them to namespaces identified by URI references. This allows the inclusion and usage of elements and attributes from different XML 'tag' sets, resolving the syntactic conflicts that arise when the same 'tag' names are used to represent different concepts.

3.2.3 XML Schema / XML Query

XML Schema is a markup language used to define the shared vocabularies or 'tag' sets discussed before. XML Schema defines the rules to which an XML document using the specified vocabulary must conform. It enables the definition of custom elements and attributes, along with support for a variety of primitive data types (e.g. string, date and numeric) and simple restrictions on the acceptable values. Although XML Schema has been used extensively for the definition of shared vocabularies and the automated exchange and parsing of XML documents, it lacks the expressive power to define the meaning of statements and the interrelationships of the various elements. Thus, although automated parsing of XML documents is well supported by most programming languages, the system does not really understand the concepts expressed, nor is it able to infer new information. XML focuses mostly on the technical and syntactic interoperability problems discussed before but provides very limited support for conveying semantic information. XML Query is another technology that allows the extraction and manipulation of data from XML documents.
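The namespace mechanism described above can be sketched with Python's standard xml.etree.ElementTree module; its limited XPath subset also gives a flavor of how query languages address parts of the document tree. The namespace URIs and tag names below are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Two vocabularies use the same local tag name "name"; namespaces
# (identified by URI references) keep them distinct.
doc = """<inventory xmlns:hw="http://example.org/hardware"
                    xmlns:sw="http://example.org/software">
  <hw:name>disk controller</hw:name>
  <sw:name>carving tool</sw:name>
</inventory>"""

root = ET.fromstring(doc)

# After parsing, ElementTree expands each prefix to the form
# {namespace-URI}localname, so the two "name" elements stay unambiguous.
hw_name = root.find("{http://example.org/hardware}name").text
sw_name = root.find("{http://example.org/software}name").text
```

Because the qualified names differ, a query over one vocabulary never accidentally retrieves elements from the other, even though both use the local name "name".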
As XML documents employ a tree-structured information model, XQuery provides the means to address specific parts of it. New standards under development will provide full-text search functionality as well as the capability to update XML documents.

3.2.4 RDF / RDF Schema

RDF stands for Resource Description Framework and was introduced by the W3C as a standard for encoding arbitrary metadata. The motivation behind RDF was that the amount of data present on the Web had reached such a large scale that management, and even more so automated processing, had become extremely difficult. The concept of using metadata for describing resources could promote automated processing such as searching, referencing and clustering based on criteria such as content relevance. The problem with XML is that, although it is flexible, allowing the creation of custom sets of elements and attributes and the definition of syntactic rules that apply to them, the semantic relationships between the different XML elements cannot be expressed. Thus, differences in how elements are organized in the XML tree structure cannot be resolved automatically based on semantic relationships, but need custom transformations to be defined by the developer, for example using XSLT. Whereas XML introduced a common data serialization format by promoting an interoperable and internationalized text encoding, RDF introduces a common data model that allows a consistent, simple, flexible and structured manner of encoding metadata concerning arbitrary resources. RDF is an assertional language with three fundamental components, namely resources, properties and statements. A resource can be literally any kind of concept, in any domain, about which assertions are to be made. In order to promote the integration of data over the Web, RDF requires that the name of a resource be global and uniquely identified by a URI.
The same applies to a property, which must be identified by a URI and be further specified as an element of a formal vocabulary, which is the purpose of RDF Schema and ontologies, as discussed later. A statement consists of just three components: a subject, a predicate and an object. The subject must be a resource and the predicate a property, as explained before. The object, though, can be either a resource or a constant value such as a literal. Statements of this form are commonly referred to as triples and can be represented by graph structures, with subjects and objects mapped to nodes and predicates mapped to the arcs connecting the nodes. This data model is conceptually simple but flexible and powerful, allowing new metadata to be bound to nodes, and graphs to be merged, thus enabling easier integration of data. RDF provides additional features such as support for resources acting as containers (open sets) or collections (closed sets) of other resources; reification, where statements can be made about other statements; blank nodes for supporting anonymous resources when a specific resource does not need to be directly referenced; and support for typed literals, so that constant values for objects can be associated with their types, such as those offered by the XML Schema datatype definitions. RDF statements can be serialized in a variety of formats. The most supported and recommended syntax for applications is RDF/XML, due to the good tool support for XML processing: RDF statements are mapped onto XML elements, attributes, element content and attribute values. Another popular format is Notation 3 (N3), a non-XML language with enhanced readability and a more compact form. N3 provides support for more advanced features such as rule expression using variables and quantification. Finally, another serialization format, introduced through a collaboration of HP and Nokia, is TriX, an alternative XML-based language for RDF graph representation.
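The triple model and graph merging described above can be reduced to a minimal sketch: a graph as a set of (subject, predicate, object) tuples, with merging as set union. All URIs below are invented for illustration and belong to no published vocabulary.

```python
# RDF's data model at its core: a graph is a set of
# (subject, predicate, object) triples.
EX = "http://example.org/"

graph_a = {
    (EX + "case42", EX + "hasEvidence", EX + "disk1"),
    (EX + "disk1", EX + "hashValue", "d41d8cd9"),      # literal object
}
graph_b = {
    (EX + "disk1", EX + "imagedBy", EX + "investigator7"),
}

# Because resources are globally named by URIs, merging two graphs
# automatically connects statements about the same resource (disk1 here).
merged = graph_a | graph_b

def about(graph, subject):
    """All statements whose subject is the given resource."""
    return {t for t in graph if t[0] == subject}
```

The point of the sketch is the merge: no schema alignment is needed, because identity is carried by the URI itself rather than by position in a tree, as it would be in XML.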
All these formats allow the usage of XML namespaces, thus enabling the intermixing of resources from different vocabularies in the same graph or document. Although there are subtle differences between them, these formats permit the serialization of complex RDF graphs in a uniform and consistent manner so as to allow the interchange of data between applications. RDF Schema is a language for constructing simple RDF vocabularies and thus provides the framework for the introduction of domain-specific classes and properties. RDF Schema provides the following constructs:

Class: A class is used to specify a domain-specific type of entity, similar to the concept of classes in object-oriented programming. rdfs:Resource is the root class and rdfs:Class is the class representing all classes.

Property: Properties are the concepts that link instances of classes to each other or to literal values; they are the predicates of the triple statements. Properties are instances of the rdfs:Property class and their use can be restricted with the rdfs:domain and rdfs:range properties. The former defines instances of which classes can participate as subjects of this predicate in statements, while the latter defines instances of which classes, or which types of literals, can be the objects of this predicate in statements. The special property rdf:type is used to link resources to their class membership.

Class and Property Relationships: RDF Schema allows the creation of simple hierarchies between classes as well as between properties. The special properties rdfs:subClassOf and rdfs:subPropertyOf allow the description of relationships between classes or properties. These relationships are quite similar to those found in the inheritance concept of object-oriented programming, but with an important difference.
Properties are defined under a global scope and are not encapsulated as members of a class, thus allowing the definition of new properties for an existing class without the class needing to be modified.

3.2.5 Ontologies

As defined before, an ontology describes a domain of interest in a formal way: a formal specification of a conceptualization. An ontology allows the description of the various classes and objects pertinent to the domain, as well as the relationships between them. Although RDF Schema allows the expression of basic relationships in the form of properties and parent-child hierarchies, an ontology supports much more expressivity and complex relationships, such as value restrictions, class property restrictions, cardinality constraints, union and disjointness relationships, equality and more. These capabilities support ontological modeling and automated reasoning, which are the cornerstones of the Semantic Web initiative. OWL stands for Web Ontology Language and is the most prominent ontology definition language used in the Semantic Web framework. The most recent version is OWL 2, which became a W3C standard in 2009 and has been defined as follows: "The W3C OWL 2 Web Ontology Language (OWL) is a Semantic Web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL is a computational logic-based language such that knowledge expressed in OWL can be reasoned with by computer programs either to verify the consistency of that knowledge or to make implicit knowledge explicit." (Hitzler et al. 2009) Some of the basic constructs and capabilities of the language are briefly discussed below:

Class: owl:Thing is the root of all classes and the base class of rdfs:Resource. owl:Class is used to define classes and is a sub-class of rdfs:Class. OWL classes should be considered as sets of individuals, describing formally the requirements for an individual to claim class membership.
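The rdfs:subClassOf hierarchy described in the previous section yields a simple but useful inference: if an individual has rdf:type C and C is a subclass of D, the individual is also implicitly of type D. A minimal sketch, with hypothetical class names:

```python
# rdfs:subClassOf inference: membership propagates up the hierarchy.
# The class names below are invented for illustration.
subclass_of = {
    "HardDisk": "StorageDevice",
    "StorageDevice": "Resource",
}

def inferred_types(direct_type):
    """Walk the subclass hierarchy upward, collecting every class
    the individual is implicitly a member of."""
    types = [direct_type]
    while types[-1] in subclass_of:
        types.append(subclass_of[types[-1]])
    return types
```

A query for all instances of StorageDevice would thus also return individuals asserted only as HardDisk, which is exactly the kind of implicit knowledge a reasoner makes explicit.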
The set of individuals associated with a class is called the class extension. owl:Nothing signifies the empty set.

Class Description Enumeration: owl:oneOf supports the description of a class by explicitly enumerating the allowed individuals. The class extension contains exactly the enumerated individuals, like a closed set.

Property Restriction: Describes an anonymous class containing the individuals satisfying the restriction.

Value Constraint: Defines constraints on the range of a property as it applies to a particular class description, in contrast to the global scope of rdfs:range. There are two types of constraints that add requirements between properties and classes, namely owl:allValuesFrom, where all property values of an individual must be of the specified class type, and owl:someValuesFrom, where at least one of the property values must be of the specified class type. owl:hasValue describes individuals that have at least once the specified property value, either in the form of an individual or of a data value.

Cardinality Constraint: Defines constraints on the number of semantically distinct values (individuals or data values) that an individual can have for a specific property. owl:maxCardinality sets an upper limit while owl:minCardinality sets a lower limit.

Intersection, union and complement: These properties can be seen as analogous to the logical AND, OR and NOT operators, where a class extension can be defined as a combination of other class extensions. owl:intersectionOf includes only those individuals that are members of all the class descriptions in the list, while owl:unionOf contains those that are members of at least one of the class descriptions in the list. owl:complementOf includes the individuals that are not members of the class extension of the specified class.
Class axioms: Class descriptions can be combined into class axioms in different ways:

Subclass: rdfs:subClassOf states that the class extension of one class description is a subset of the class extension of another.

Equivalence: owl:equivalentClass states that two class descriptions have the same class extension.

Disjointness: owl:disjointWith states that the class extensions of two class descriptions have no common members.

Properties: There are two main types of properties: object properties, which link individuals to individuals, and datatype properties, which link individuals to data values.

Property Relationships: owl:equivalentProperty denotes that two properties share the same property extension (the subject-object pairs of property statements), without this meaning that they are the same property. owl:inverseOf denotes that one property has as its domain the range of another, and vice versa.

Global Cardinality Constraints: owl:FunctionalProperty states that a property can have only one unique value for each instance, e.g. a person can have only one biological mother. Conversely, owl:InverseFunctionalProperty signifies that the object of a property statement uniquely determines the subject, e.g. an individual uniquely identifies another individual as being the former's biological mother.

Logical Characteristics: owl:TransitiveProperty states that if the object of one statement coincides with the subject of another and both use the same transitive property, then the subject of the first can also be associated with the object of the second over that same property. A symmetric property states that, from a statement using the property, a new statement can be inferred with the subject and object roles swapped.

Individual Identity: It is important to highlight that OWL does not follow the so-called 'unique names' assumption, allowing e.g. multiple URI references to refer to the same individual.
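The logical characteristics just described, transitivity and symmetry, amount to small inference rules that are applied until no new statement can be derived. A minimal forward-chaining sketch, with invented property names ("partOf", "connectedTo"):

```python
# Forward-chaining closure for owl:TransitiveProperty and
# owl:SymmetricProperty semantics, reduced to their essence.
def saturate(triples, transitive=(), symmetric=()):
    """Repeatedly apply the transitivity and symmetry rules until
    no new statement can be inferred (a fixpoint)."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in triples:
            if p in symmetric:
                new.add((o, p, s))                 # swap subject/object roles
            if p in transitive:
                for s2, p2, o2 in triples:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))        # chain the links
        if not new <= triples:
            triples |= new
            changed = True
    return triples

facts = {("sector5", "partOf", "volume1"),
         ("volume1", "partOf", "disk1"),
         ("hostA", "connectedTo", "hostB")}
closed = saturate(facts, transitive={"partOf"}, symmetric={"connectedTo"})
```

After saturation the graph contains the inferred statements that sector5 is part of disk1 and that hostB is connected to hostA, neither of which was asserted directly.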
The owl:sameAs property then allows a link to be established between two individuals that have the same identity but different references. owl:differentFrom explicitly states that two references refer to different individuals, while owl:AllDifferent allows the description of a list of individuals that are pairwise different. As can be seen, OWL can be quite expressive, with the side effect, though, of increased complexity of the reasoning process. Indeed, using all the features of OWL can make reasoning so complex as to be computationally infeasible. In order to deal with this, the designers proposed two different ways of assigning meaning to an ontology: the Direct Model-Theoretic and the RDF-based Semantics. The former leads to the OWL DL dialect, which restricts the expressiveness of the language so as to be compatible with Description Logic and its proven computational completeness and decidability, while the latter leads to OWL Full, with maximum expressiveness but the risk of undecidability in its reasoning process. OWL 2 further refined this idea with the introduction of three sublanguages, EL, QL and RL, which are syntactic subsets of the OWL 2 language with varying trade-offs between computational complexity (e.g. logarithmic or polynomial) and expressiveness, in order to better accommodate different needs with regard to the size of ontologies and reasoning power.

3.2.6 Rules / Query

With the support of RDF Schema and OWL, the semantics of concepts relevant to a domain of interest can be expressed, and based on these semantics, reasoning within ontologies and knowledge bases can be performed. In order to support the description of rules that cannot be expressed with the available constructs of these languages, rule languages are also being standardized by the Semantic Web community. SWRL stands for Semantic Web Rule Language and is intended to become the main rule language of the Semantic Web framework.
SWRL is based on OWL DL and its rules are expressed in terms of OWL concepts (classes, properties and individuals) in the form of an antecedent and a consequent. SWRL can support some more advanced use cases that reasoning over OWL either cannot support or supports only in a cumbersome way, such as:

Reclassification: If an individual is a member of a specific class extension, then it must also be a member of another class's extension.

Property Value Assignment: If specific conditions hold, then a new statement can be asserted about a subject with a specific predicate and object.

Built-in expressions: Checks on literal values such as comparisons (equal, lessThan, greaterThan), math operations (add, subtract, multiply, divide), string operations (substring, contains, startsWith), date and time operations, and list operations (listIntersection, length).

The main problem that may arise with the use of SWRL is that the added expressivity can affect computational decidability. It is worth noting that OWL follows the open-world assumption, so that, for example, unknown assertions about an individual cannot prevent it from being a potential member of a class extension. As such, SWRL cannot support negation in rules, since negation cannot be interpreted as failure but only as a potentially temporary lack of knowledge. Furthermore, the unique-names assumption is also not adopted, so individuals must be explicitly stated to be different. Finally, due to the above, some features are not supported, such as retraction (where the consequent modifies a resource present in the antecedent so that the rule can be applied again) and counting. In order to provide support for the retrieval of information from RDF data, as well as from RDFS and OWL ontologies, a query language was needed. SPARQL, which stands for SPARQL Protocol and RDF Query Language, is an SQL-like language that uses RDF triples for expressing queries and returning results.
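A SWRL-style rule of the reclassification kind described above can be sketched as a set of antecedent triple patterns (with variables) plus a consequent template. This is a simplified illustration, not SWRL itself, and all class and property names are hypothetical.

```python
# A rule sketch in the SWRL spirit: antecedent patterns with variables
# ("?x"), and a consequent template instantiated per matching binding.
def match(pattern, triple, bindings):
    """Try to unify one triple pattern (variables start with '?')
    against one triple, extending the current variable bindings."""
    b = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if b.get(p, t) != t:
                return None          # conflicting binding
            b[p] = t
        elif p != t:
            return None              # constant mismatch
    return b

def apply_rule(facts, antecedent, consequent):
    """Fire the rule for every combination of facts satisfying the
    antecedent; return the newly asserted triples."""
    bindings_list = [{}]
    for pattern in antecedent:
        bindings_list = [b2 for b in bindings_list for t in facts
                         if (b2 := match(pattern, t, b)) is not None]
    return {tuple(b.get(term, term) for term in consequent)
            for b in bindings_list}

facts = {("img1", "type", "DiskImage"),
         ("img1", "hashMatches", "knownContraband")}
# Reclassification: a disk image matching a known-bad hash is Suspicious.
new = apply_rule(facts,
                 antecedent=[("?x", "type", "DiskImage"),
                             ("?x", "hashMatches", "knownContraband")],
                 consequent=("?x", "type", "Suspicious"))
```

The same pattern-matching machinery, read without the consequent, is essentially how SPARQL's basic graph patterns select bindings from a triple store, which is why rules and queries are often discussed together.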
SPARQL is not only a query language but also defines a protocol for accessing RDF data. SPARQL supports four main variations of queries, as follows:

SELECT Query: Extracts raw values from a SPARQL endpoint (a SPARQL protocol service that enables users to query a knowledge base using SPARQL) and presents the results in tabular format.

CONSTRUCT Query: Same as SELECT, but the results are transformed into valid RDF.

ASK Query: Supports Boolean-type queries, e.g. about the existence of an individual in the knowledge base.

DESCRIBE Query: Returns an RDF graph as a result; the query processor is responsible for deciding how to structure this RDF data if the client does not define a query result pattern.

Recent research addresses support in SPARQL for federated queries, where the query author can direct specific portions of a query to particular SPARQL endpoints and the results are later combined with the other parts by the federated query processor. Such features will allow information aggregation from disparate sources in a single query, greatly improving both the quality and the amount of results.

3.2.7 Top Layers

The top layers are still under research and no technology specifications or languages have been standardized so far. According to (Al-Feel et al. 2009), the logic layer is supposed to provide "the answer for the question of why this piece of information is taken or appear to the user", while the proof layer should be able to provide deductive reasoning on why the results should be accepted. Formal proofs, along with the vertical layers of security features for authenticating the origin of data and protecting its confidentiality and integrity, will enable software agents or users to actually trust the results.

Semantic Web & Digital Investigations

The purpose of this chapter is to present in more detail the work most relevant to the present thesis.
Although the previous two chapters presented the two different areas in isolation from each other, various solutions have also been proposed and implemented that combine the two. Most of this work is motivated by the fact that the information aggregation and automated reasoning capabilities offered by Semantic Web technologies can provide a solution to the problems that digital investigations confront with respect to the large volume of data and their disparate origins.

4.1 XML-based Approaches

One of the major attempts to introduce the XML markup language to the area of digital investigations is DFXML (Digital Forensics XML) (S. L. Garfinkel 2009). The project intends to provide a standardized XML vocabulary for representing metadata of a disk image (e.g. filename, acquisition date, device info), information about the content of the disk image, such as the addresses and lengths of the "byte runs" (file fragments) of the resident files along with their cryptographic hash values, and operating-system-specific information such as Microsoft Windows Registry entries. DFXML provides various intuitive forensics-related XML tags such as <source>, <imagefile>, <acquisition_date>, <volume>, <fileobject>, <byte_run>, <hashdigest>, <filesize> etc. Garfinkel promotes the DFXML format by introducing a variety of tools that can either produce or consume DFXML for various purposes. Fiwalk is a tool that produces DFXML describing the files in a disk image, depending on The Sleuth Kit for the actual interaction with the disk image. Fiwalk provides an abstraction layer over the complexity of The Sleuth Kit, thus allowing new features such as reporting the differences between two disk images based on their DFXML representations, easy removal of known files based on a "redaction file", or even the removal of personally identifiable information for privacy reasons.
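To give a flavor of how such a tagged representation becomes machine-processable, the sketch below parses a hand-written fragment that borrows a few DFXML tag names. Real DFXML documents are namespaced and far richer; the structure, file names and hash values here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A simplified fragment in the spirit of DFXML, using a few of its
# element names (<fileobject>, <filename>, <filesize>, <hashdigest>).
dfxml = """<dfxml>
  <fileobject>
    <filename>notes.txt</filename>
    <filesize>2048</filesize>
    <hashdigest type="md5">d41d8cd98f00b204e9800998ecf8427e</hashdigest>
  </fileobject>
  <fileobject>
    <filename>tool.exe</filename>
    <filesize>40960</filesize>
    <hashdigest type="md5">9e107d9d372bb6826bd81d3542a419d6</hashdigest>
  </fileobject>
</dfxml>"""

root = ET.fromstring(dfxml)
# Index every file by its hash digest; this is the kind of view that
# enables diffing two images or filtering out known files.
by_hash = {fo.findtext("hashdigest"): fo.findtext("filename")
           for fo in root.iter("fileobject")}
```

Once evidence metadata is in a tagged, documented format, operations such as hash-set matching reduce to ordinary dictionary lookups rather than tool-specific parsing.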
The format is gaining support from other tool authors as well, such as the PhotoRec and Scalpel carvers, and from existing digital evidence container formats such as AFF4 and EWF, which can output disk-image-related metadata in DFXML format. Other approaches have focused on specific operating-system or application artifacts that are quite important in the context of a digital investigation. The hivexml project allows the extraction of Registry entries from hive files in an XML format, using a simplistic tag set with <hive>, <node> and <value> elements and key, type and default attributes. The Electronic Discovery Reference Model (EDRM) XML project is an XML-based format with the goal of improved interchange of metadata relevant to e-discovery cases. This XML schema focuses on metadata fields of documents, such as Microsoft Office files and email messages, defining elements and attributes such as Email Author, Email Subject, Doc Identifier, Full Folder Path etc. The TrID project provides a utility that can identify file types based on their binary signatures. The utility can also output its results in an XML format, whose schema defines elements such as <FileType>, <Ext>, <Pattern> etc. A generic XML-based framework by the name of XIRAF (XML Information Retrieval Approach to Digital Forensics) has been suggested by (Alink et al. 2006). The authors have underlined the importance of having "a clean separation between feature extraction and analysis" and have proposed a single XML-based output format for forensic analysis tools. The concept behind XIRAF is that most forensic analysis tools operate on large binary objects (BLOBs) and, depending on their specific functionality, can either extract specific features, such as log files from a disk image, or generate new BLOB content, such as by unzipping a compressed archive. The outputs of these tools are then wrapped in an XML format and stored in an XML-aware database.
The investigator can then use the XQuery language to submit queries to this database and extract information about the case and the evidence. In order to evaluate the proposed framework, the authors have provided concrete examples of advanced forensic capabilities, such as a timeline browser over XML fragments produced by different tools (file-system metadata, chat logs, EXIF picture metadata etc.), as well as advanced types of search, such as searching based on specific metadata criteria or comparing against a hash set of known contraband material. (Levine & Liberatore 2009) proposed the DEX (Digital Evidence Exchange) format, which can also wrap the output from various tools in an XML-based representation, but with the additional capability of encoding the specific instructions given and the sequential order of the tools used. This can be used to provide provenance-related metadata on the sequence of actions leading to the specific extracted artifacts, and even enables comparison between different investigation processes and their results. DEX defines a set of XML elements similar to those present in DFXML, such as <DiskImage>, <Volume>, <File> and <CommandLine>. In a similar approach, (Lee et al. 2010) have proposed an XML-based data collection framework for live forensics purposes. The authors have defined three main information object schemas using XML Schema, covering live data, target software and Windows-related objects. The live data schema contains information about the system (e.g. running processes), the user (e.g. Windows account) and the network (e.g. IP address, active connections, executables with open ports). The target software schema can describe application-specific information such as email accounts, web browsing history, instant messaging logs etc. The Windows information schema focuses on Windows-related artifacts such as installed software, hardware configuration, user accounts and more.
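DEX's idea of recording the command that produced each artifact can be sketched with the standard library as well. The element names below follow those mentioned in the text; the invocation and file name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Build a DEX-like wrapper that keeps the tool invocation next to its output.
root = ET.Element("DiskImage")
cmd = ET.SubElement(root, "CommandLine")
cmd.text = "fiwalk -X report.xml evidence01.dd"  # hypothetical invocation
vol = ET.SubElement(root, "Volume")
f = ET.SubElement(vol, "File")
f.set("name", "secret.doc")

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Because the command line travels with the extracted artifacts, a reviewer can replay or compare the sequence of actions, which is the provenance benefit the format targets.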
The authors also propose an architecture consisting of four main components: the scenario type analyzer, the data collection block, the report manager and the case database. The scenario type analyzer’s purpose is to specify the types of forensic artifacts most pertinent to the case at hand. The data collection block is responsible for the actual use of various tools for the collection of data and their representation in the XML format following the previously discussed XML schemas. The report manager presents the results of the previous step and can support multiple different views, depending on the needs, based on XSLT transformations. Finally, the case database archives the XML documents produced from previous cases, which can be mined for case-handling patterns through data mining techniques. XML approaches have also been followed in the network forensics field, with the PDML and PSML languages being markup languages for describing packet details and packet summaries respectively. The PDML language is an XML-based markup language providing a detailed view of a packet, including elements such as <packet>, <proto> and <field>, where the field element supports attributes such as name, value, size and pos. PSML, on the other hand, expresses the summary view of a packet in an XML format. The elements provided by the respective schema are <structure>, <packet> and <section>. PSML is quite flexible and mostly focuses on the visual representation of the summary view, thus the provided constructs are quite abstract and unstructured. A plethora of other XML formats have also been defined in the computer and network security area in general, such as MANDIANT’s Indicators of Compromise, an XML-based language for describing signatures of malware, such as the existence of specific files or registry entries, with support for conditional logic such as combining indicators with AND and OR clauses.
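The PDML structure described above can be read with a few lines of standard-library code. The fragment is hand-written from the element and attribute names given in the text (<packet>, <proto>, <field> with name/value/size/pos), not real tool output.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PDML-style fragment for illustration.
pdml = """
<pdml>
  <packet>
    <proto name="tcp">
      <field name="tcp.dstport" value="80" size="2" pos="36"/>
    </proto>
  </packet>
</pdml>
"""

root = ET.fromstring(pdml)
# Collect the destination ports of all packets.
dst_ports = [
    f.get("value")
    for f in root.iter("field")
    if f.get("name") == "tcp.dstport"
]
print(dst_ports)  # ['80']
```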
The MITRE Corporation has defined a plethora of XML languages for various purposes. The Open Vulnerability and Assessment Language (OVAL) describes systems’ configuration information and the current machine state with regard to vulnerabilities and patches, as well as the reporting of assessment results. The Common Event Expression (CEE) attempts to standardize the representation of logs through the definition of a common dictionary of event fields and an event expression taxonomy for the classification of different event types. The Malware Attribute Enumeration and Characterization (MAEC) language encodes malware-related information such as artifacts, behaviors, payloads, propagation mechanisms and type of malware. Finally, the Common Attack Pattern Enumeration and Classification (CAPEC) language provides a schema for describing common attack patterns, covering their execution flow, related weaknesses (see CWE), related vulnerabilities (see CVE) and methods of attack. As we can observe, there is a plethora of different XML formats covering different aspects of digital investigations. Although all these languages promote a standardized format and tool interoperability, they fail to convey the semantic content of what they represent. As such, real understanding and automated processing by a software agent cannot be performed, since the semantic interrelations of all these elements are not expressed. As an example, a software agent, although able to parse all these different XML documents, is not able to infer that the <FileObject> concept found in a DFXML document is equivalent to the <File> concept found in DEX, or that the <proto> element in PDML is equivalent to the <Protocol> element defined in XLive.
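Without machine-readable semantics, such correspondences can today only be bridged by hard-coding them. The sketch below illustrates such a manual mapping table from format-specific element names to shared concepts; the concept names on the right-hand side are invented for illustration, which is exactly the kind of ad-hoc glue that ontology-based approaches aim to replace.

```python
# Hand-maintained cross-format mapping: (format, element) -> shared concept.
ELEMENT_TO_CONCEPT = {
    ("DFXML", "fileobject"): "File",
    ("DEX", "File"): "File",
    ("PDML", "proto"): "Protocol",
    ("XLive", "Protocol"): "Protocol",
}

def same_concept(a, b):
    """True if two (format, element) pairs denote the same shared concept."""
    ca, cb = ELEMENT_TO_CONCEPT.get(a), ELEMENT_TO_CONCEPT.get(b)
    return ca is not None and ca == cb

print(same_concept(("DFXML", "fileobject"), ("DEX", "File")))  # True
```

Every new format requires extending the table by hand, whereas with ontologies the equivalence could be asserted once and then inferred automatically.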
Therefore, although XML-based approaches can assist in establishing a common set of terms that can be both read and written by different tools, they lack support for an intelligent agent to perform automated reasoning based on relationships between these terms, such as relating packets targeting a specific port as described in PDML to the process listening on that port as expressed in XLive.

4.2 RDF-based Approaches

Despite the ability of RDF to express arbitrary metadata, it has found limited usage in the area of digital investigations so far. The most prominent practical use of RDF is as the basic information model of the AFF4 forensic format. The Advanced Forensic Format (AFF) was introduced by (Garfinkel 2006b) as a file format and container to store digital evidence. The container includes both a copy of the original data and arbitrary metadata. Metadata can include system-related information such as device-related data, or user-specified information such as the name of the examiner. New requirements posed by practitioners, though, such as support for distributed forensic processes, forced a revision of AFF to its latest version, AFF4 (Cohen et al. 2009). In a subsequent paper (Cohen & B. Schatz 2010), the authors present in more detail the reasoning behind their choice of RDF. They argue that RDF was an important improvement over the previous ad-hoc serialization protocol used for metadata, due to the ability to use standard libraries for generating and parsing RDF statements, as well as the added support for creating attributes with standard or custom types instead of string types only. (Giova 2011) builds on this idea: by introducing new RDF concepts such as evidence access information (e.g. date and geographical location of access), examiner information (e.g. name, institution, and role), as well as information about artifacts produced during the investigation process (e.g. name of artifact, date, action), the chain of custody can be significantly improved. Such additional metadata, with respect to who accessed the evidence and when, from where, and what actions were performed with it, can be considerably important in cases of remote or distributed investigations and increase the admissibility of the evidence in courts.

4.3 Ontological Approaches

One of the first attempts to bridge digital investigations with the Semantic Web has been the introduction of the FORE (Forensics for Rich Events) architecture by (Schatz et al. 2004a). The architecture is composed of a forensic ontology, a generic event log parser and a custom-defined rule language, FR3. The forensic ontology is based on two main concepts: the Entity, for representing tangible objects, and the Event, for representing the entities’ state changes over time. In order to represent causal linkage, an ObjectProperty linking individuals of the Event class was defined. A set of regular-expression-based parsers was developed that could wrap low-level events into individuals of the specific classes of Event described in the ontology. These semantic representations of the events are then added to the knowledge base. Rules expressed in the FR3 correlation rule language specified by the authors are evaluated against the knowledge base in order to compose higher-level events out of combinations of lower-level ones and to add new causality links between events based on the existing domain knowledge. The authors further demonstrated the possibilities of such an approach through example cases where the automatic inference of higher-level events could lead to more meaningful and important notification alerts being sent to the investigator, or allow the investigator to formulate hypotheses by using the owl:sameAs equivalence property to connect different individuals that represent the same “object”.
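Giova's chain-of-custody idea can be sketched in plain Python by representing access and examiner metadata as RDF-style (subject, predicate, object) triples. The URIs and property names below are invented for illustration; a real implementation would use an RDF library and the AFF4 vocabulary.

```python
# Hypothetical URI identifying a piece of evidence.
evidence = "urn:example:evidence/img01"

# Chain-of-custody statements as RDF-style triples (all names illustrative).
custody = {
    (evidence, "ex:accessedBy", "ex:examiner/jdoe"),
    (evidence, "ex:accessDate", "2012-03-14T09:30:00Z"),
    (evidence, "ex:accessPlace", "Stockholm, SE"),
    ("ex:examiner/jdoe", "ex:role", "Lead Examiner"),
}

def who_accessed(triples, subject):
    """Return everyone recorded as having accessed the given resource."""
    return {o for s, p, o in triples if s == subject and p == "ex:accessedBy"}

print(who_accessed(custody, evidence))  # {'ex:examiner/jdoe'}
```

Because each statement is just another triple about the evidence resource, custody records accumulate alongside the forensic metadata itself rather than in a separate log.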
Finally, navigation through the causal links between events could promote a search-oriented investigation that could accelerate the whole process significantly by making events of forensic interest easier to locate. In a subsequent paper, (Schatz, Mohay & Clark 2004b) demonstrated the ability to reuse concepts defined in various disparate domain-specific ontologies, such as the Network Entity Relationship Diagram about networked hosts and the SOUPA ontology about context-aware pervasive computing environments, for much more generalized event aggregation. By taking advantage of OWL’s capability of referencing external ontologies, the authors were able to express correlation rules that combined concepts and events from disparate domains, such as Windows login events, SAP login events and physical door access logs. Other approaches have focused on the ontological modeling of various aspects of digital investigations. (Brinson et al. 2006) presented a cyber-forensics ontology focusing on the major participants in a digital investigation along with their specific roles and specializations. The ontology has two top-level concepts, technology and profession, further analyzed into concepts such as hardware, software, law, academia, military, private sector and education. The ontology provides a conceptual mapping of all the parties involved in a digital investigation and could be used as a guideline for further curriculum development and certification pursuits. (Kahvedzic & Kechadi 2009) have presented a Digital Investigation Ontology (DIALOG) that conceptualizes different aspects of the digital investigation process, focusing more on the actual case and the evidence. Its “DigitalInvestigationConcept” is divided into the four high-level concepts of “CrimeCase”, “Information”, “InformationLocation” and “ForensicResource”.
The “CrimeCase” was further subclassed into concepts representing specific types of crimes, while the “ForensicResource” was further refined into the “ForensicServiceObject”, representing various supporting resources during the investigation, and the “ForensicSoftwareObject”, covering taxonomically different forensic software and their roles in the investigation process. The “Information” concept was further analyzed into different types of data that can be retrieved and be relevant to an investigation, such as “DataObject”, “FileObject” and “SoftwareObject”, as well as collective concepts relevant to investigations such as “UserActivityEvidence”, “SystemConfigurationEvidence” etc. The authors further refined their ontology to model forensic knowledge related to the Windows Registry data structure and exemplified it by using their ontology along with SWRL-based rules for automatic aggregation over registry data, such as composing higher-level events like a software installation from correlations between registry keys and registry snapshots. Continuing this approach, (Kahvedžić & Kechadi 2011) have demonstrated concrete examples of using a blank ontology for encoding results retrieved from forensic tools, with the ability to encode various forensically relevant types of data, such as metadata, content and events, and their relationships, such as the author of a document, the persons and location depicted in an image file, or the person that performed an event. The authors suggest automated inference of the category of every instance based on its properties, its mapped concept, and its subsequent place in the concept hierarchy. Finally, the ontology query language SQWRL can be used for evaluating queries against the individuals in order to extract additional information of evidentiary value. In a similar approach, (Saad & Traore 2010) propose an ontology, represented in OWL, with network forensics as the domain of focus.
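The kind of rule-based aggregation described above, expressed in the original work as SWRL rules over an OWL ontology, can be approximated in plain Python to show the principle: a higher-level "software installation" event is composed from lower-level registry observations. The registry keys, predicate names and the rule itself are invented for illustration.

```python
# Low-level observations as RDF-style triples (all names illustrative).
facts = {
    ("reg:Uninstall/Foo", "rdf:type", "RegistryKeyAdded"),
    ("reg:Uninstall/Foo", "ex:observedIn", "snapshot2"),
    ("reg:Run/foo.exe", "rdf:type", "RegistryKeyAdded"),
}

def infer_installations(triples):
    """Hypothetical rule: a newly added key under an Uninstall path
    implies a higher-level SoftwareInstallation event."""
    inferred = set()
    for s, p, o in triples:
        if p == "rdf:type" and o == "RegistryKeyAdded" and "Uninstall" in s:
            inferred.add((s, "rdf:type", "SoftwareInstallation"))
    return inferred

new_facts = infer_installations(facts)
print(new_facts)
```

A rule engine evaluates many such rules repeatedly until no new facts are produced; the single-pass function above only illustrates the shape of one rule.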
The authors define an ontology with various concepts relevant to the area of network forensics, defining taxonomic relations between concepts of the same taxonomy and ontological relations between concepts of different taxonomies. In order to surpass the restriction that relations represented in OWL are binary only, more complex concepts are introduced for representing N-ary ontological relations, or lists are used for representing sequences of arguments. Furthermore, the authors argue that ontology reasoning can support generalization, prediction and drawing conclusions from facts by supporting the main forms of reasoning: deductive, inductive and abductive. This argument is further supported through examples and a case study analysis based on the previously defined ontology.

Research Methodology

According to (Hevner et al. 2004), the two main research paradigms in the IT research area are behavioural science and design science. Behavioural science’s goal is mostly to improve the knowledge about existing IT systems and how they are used, whereas design science’s goal is to promote novel ideas and solutions for the further development of IT. The main focus of design science is the design and development of new artifacts that address practical problems, where a problem is defined as a difference between the current and the desired state of the researched topic. However, in contrast to bare design processes, an additional goal of design science is the production of additional knowledge with respect to the artifacts and their context, which can be seen as a contribution to the academic world. (Hevner et al. 2004) state that “effective design-science research must provide clear and verifiable contributions in the areas of the design artifact, design foundations and/or design methodologies”.
As such, the former demands that design science research apply rigorous research strategies and methods, so as to enable critical evaluation and testing of the proposed design and artifact as well as of the new knowledge produced. Since the focus of the current thesis is the application of the Semantic Web initiative and its associated technologies in the context of digital investigations, with the goal of promoting data integration and novel correlation techniques as a solution to current practical problems, the design science paradigm has been deemed more appropriate for providing a framework for conducting the research. (Hevner et al. 2004) have identified four different types of artifacts that can be the outcome of a design science process, namely constructs, models, methods and instantiations. Constructs are terms and concepts that constitute the building blocks for asserting statements and representing the various entities of the problem and solution domains. Models can be descriptions of possible solutions to practical problems, such as a proposed system architecture, that can further assist in the production of new artifacts. Methods describe processes for solving problems, such as algorithms or best practices. Finally, instantiations are actually implemented systems that can be used in practice to solve practical problems. The main outcome of this thesis is the suggestion and evaluation of proposed methods for incorporating Semantic Web technologies into existing digital investigation processes and tools. However, due to the nature of Semantic Web technologies, and for the purpose of evaluating the proposed methods, models in the form of domain ontologies are also used, providing an explicit definition of constructs relevant to the area of digital investigations.
Finally, instantiations in the form of specialized tools used for evaluating the suggested methods are presented, which can provide the basis for a more complex and complete semantically enabled system for digital investigations. Due to the complexity that a design science project can entail, methods have been suggested that provide a structured plan of the various required research activities and specify their interrelations. Such a method, providing a framework of activities to be performed throughout the project, is schematically presented in Figure 3 and further discussed below.

Figure 3: Overview of the Design Science Method (Johannesson & Perjons 2012)

The goal of the first step is the clarification of the practical problem situation, its precise definition, as well as a well-supported motivation of why a solution to this problem is needed. Knowledge resources originating from previous research and stakeholders’ interests can be considered mechanisms supporting this research activity. Research strategies such as case studies and action research, and research methods such as questionnaires and observations, can be applied for controlling the activity and its results. In the current thesis, domain knowledge obtained through extensive literature review and empirical experience has been the main resource for the formulation and elaboration of the practical problem. A focus has been given to peer-reviewed scientific articles and journals, as accessed through digital library facilities, and to the proceedings of reputable conferences and workshops relevant to digital investigation, due to their increased validity and their relevance to current academic research and practitioners’ problems.
Case studies of digital investigations involving a variety of data sources (network captures, event logs, file-system forensics), in the context of international forensic contests and workshops, have been studied in order to better understand the problems arising due to the lack of advanced data integration and correlation techniques. The qualitative analysis of documents produced as output of these digital investigations has been the main research method for identifying and defining the practical problems of manual integration and correlation of acquired evidence. The goal of the second step of the methodology is to identify and outline an artifact as a possible solution to the previously explicated problem, and to further define the main requirements of the artifact to be developed. Similar previously proposed or implemented solutions, along with the requirements they have addressed, as well as the interests and opinions of stakeholders, can be considered resources for this activity. As in the previous step, case studies have been the main research strategy for identifying and specifying requirements for the proposed method, with observations and document analysis as the main research methods for gaining a better understanding of the restrictions of current forensic practices and techniques, as well as of the usage environment and the analytical capabilities required by the complexity of the cases and the investigators’ needs. The subsequent activity is the actual design and development of the proposed artifact. The proposed method, as the output artifact of the research process, is described in detail, and instantiations in the form of prototype tools are developed. The design and development of the artifacts follow best practices and established techniques from the Semantic Web area, and components originating from applied Semantic Web technologies are combined with new ones in order to produce the desired artifact.
In the next step, the developed artifact is used to demonstrate if and how it can solve aspects of the previously stated problem in the context of an illustrative or real-life case. The proposed method, along with the developed instantiation artifacts of this thesis, is validated by applying them to the digital investigation of representative cases, such as those used as case studies in the previous research steps when enough data are available, or through experiments attempting to resemble realistic and probable cases, where data and evidence are artificially generated. The goal of the next phase is the evaluation of the proposed artifact and the solution it provides to the original practical problem, along with the level of fulfillment of the identified requirements. Due to the limitations of this research, a full-scale evaluation, introducing the developed artifacts into an organization performing digital investigations and identifying or measuring the impact of such a solution, could not be performed. As such, the evaluation strategy followed in this research is “ex ante evaluation”, where the evaluation of the artifacts is carried out by the researcher in a theoretical manner, supported by “informed arguments” with respect to the fulfillment of the requirements and the generality of the solution. The final step is the communication of artifact knowledge, where information about the proposed artifact is communicated to other researchers and practitioners. This report serves this exact purpose, documenting the research activities and the results attained from the research project.

A Framework for Semantically-Enabled Digital Investigations

In this chapter, the framework of the proposed method is presented by outlining its high-level characteristics, describing its relation to the reference digital investigation models presented before, and eliciting its main requirements, which will act as the evaluation criteria of the method developed later.
6.1 An approach for digital evidence integration, correlation and hypothesis evaluation based on Semantic Web technologies

As observed through the previous background studies, the problems that the Semantic Web initiative attempts to solve, such as the integration and automated processing of vast amounts of information distributed all over the Web, have much in common with the restrictions and needs of modern forensic investigation processes. The complexity of modern digital investigation cases, involving a broad range of concepts, technologies and entities, makes efforts towards a common universal evidence representation schema too difficult to succeed. The need thus arises for an expressive but flexible manner of representing both domain knowledge and collected evidence information, with the ability to integrate and correlate them regardless of their different origins or formats. This can support advanced analytical capabilities and the formulation and testing of hypotheses posed by the investigator, without committing to particular conceptual schemas that may have limited expressivity or lack reasoning capabilities. The proposed approach of integrating Semantic Web technologies into modern digital investigation processes and tools can promote, through extensible vocabularies with clearly defined semantics, cooperation between the investigator and the analysis platform, due to shared grounds of communication and understanding, as well as the automation of parts of the analysis with respect to evidence discovery and correlation. In order to further elaborate on the main characteristics and potential advantages of such an approach, relations are drawn and described between strengths and benefits of the Semantic Web identified in relevant previous studies (Reynolds et al. 2005)(Zhao & Sandahl 2002) and the needs of digital investigations.
Information Integration: The RDF data model promotes easy integration of heterogeneous data due to its schema independence and its standardized statement representation of subject-predicate-object form. RDF datasets containing such sets of statements can be merged into larger datasets. In contrast to traditional approaches, where syntactic naming similarities may lead to unwanted merges, the Semantic Web’s promotion of unique identifiers in the form of URIs and shared vocabularies of concepts alleviates the problem and enables automated data merging for statements that reference the same resource. In the case that the same concept is represented with different URIs in different ontologies, the non-unique name assumption, along with OWL constructs such as owl:sameAs, can enable automated equality inference and integration. The ability to associate the digital objects collected, created or extracted during the digital investigation process with a unique name enables unambiguous referencing and can promote better methods for case management and archiving. The use of a common resource for representing a concept that can be used in different domains of the digital investigation process can enable the automated integration of data, as the example below demonstrates. RDF statements are represented as Directed Labeled Graphs (DLG), where circular nodes represent resources, arcs represent object or datatype properties, and rectangles represent literal values.

Figure 4: Data Integration based on shared resource URI

Figure 4 is the result of the merge of two simpler graphs, where the URI reference identifies a named individual representing a specific IP address. The arc representing the ‘hasCountry’ datatype property originates from an RDF dataset of IP-to-country mappings, while the ‘hasHostname’ datatype property originates from an RDF dataset of IP-to-DNS mappings.
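The merge in Figure 4 can be sketched in plain Python: two triple sets from different sources are combined by simple set union, and they integrate automatically because both reference the same resource URI. The URIs and property names are invented for illustration; an RDF library would behave the same way when merging graphs.

```python
# Hypothetical shared URI identifying one IP address.
ip = "urn:example:ip/192.0.2.7"

# Two independently produced datasets about the same resource.
geo_dataset = {(ip, "ex:hasCountry", "SE")}
dns_dataset = {(ip, "ex:hasHostname", "host7.example.org")}

# Merging RDF graphs is, at its core, just a union of statements.
merged = geo_dataset | dns_dataset

# Both properties are now attached to the one shared resource.
props = {p: o for s, p, o in merged if s == ip}
print(props["ex:hasCountry"], props["ex:hasHostname"])
```

No schema negotiation or record linkage step is needed; the shared URI alone makes the two sources join correctly.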
The RDF engine is able to automatically connect the two properties to the common URI reference, thus achieving integration of data originating from different sources. In the case that a resource is identified by multiple URI references, even from different namespaces, OWL provides the capability of either manually asserting or automatically inferring an equality relationship between the multiple references, as shown in Figure 5.

Figure 5: Data Integration based on owl:sameAs

Figure 5 depicts two file resources that are identified by URIs from different namespaces (e.g. extracted from network and disk forensic analysis respectively) and are declared to be semantically the same by the owl:sameAs property. Using this property, a link can be established between different instances of the same digital object, thus adding additional information with respect to its origin, the processes in which it participated, or the agents that acted upon it. Semi-structured data support: Semi-structured data present irregularities such as missing attributes, mixed typing and variable numbers of occurrences of attributes (Suciu 1998). Semi-structured data do not necessarily rely on an a priori schema, and this gives the flexibility to deal with schema-less or incomplete information. This is quite a common case in the digital investigation context, where uncertainty and partial knowledge due to deleted, fragmented or missing data can manifest. As such, RDF statements can be processed by RDF processors even without explicit schema information, although schema information can enable consistency checking and additional automated inferences. Semantic constraints and type-related information and relationships can be defined in a schema and used by a semantic reasoner for verifying the internal consistency of the RDF dataset.
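The owl:sameAs semantics of Figure 5 can likewise be approximated without a reasoner: once two URIs are declared equal, statements about one apply to the other. A real OWL reasoner performs this as part of entailment; the naive single-pass propagation below, with invented URIs and properties, only illustrates the effect.

```python
net_file = "urn:net:file/abc"    # the file as seen in network traffic
disk_file = "urn:disk:file/123"  # the same file recovered from disk

triples = {
    (net_file, "ex:downloadedFrom", "http://example.org/x"),
    (disk_file, "owl:sameAs", net_file),
}

def saturate(ts):
    """Naively propagate properties across owl:sameAs pairs (one pass)."""
    same = {(s, o) for s, p, o in ts if p == "owl:sameAs"}
    extra = set()
    for a, b in same | {(b, a) for a, b in same}:
        extra |= {(a, p, o) for s, p, o in ts if s == b and p != "owl:sameAs"}
    return ts | extra

sat = saturate(triples)
print((disk_file, "ex:downloadedFrom", "http://example.org/x") in sat)  # True
```

After saturation, the disk-recovered file carries the download provenance observed on the network, exactly the cross-source linking the text describes.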
As an example, as presented in Figure 6, an instance of the File class that is associated with an IP address through the ‘hasIPAddress’ property may be flagged as inconsistent, based on the schema-expressed restrictions that the former property can only have instances of the Packet class in its domain and that Packet and File are two disjoint classes. Such consistency checks can be quite important when merging data from different sources, as URI referencing misuse or semantic inconsistencies may be automatically detected and communicated to the examiner.

Figure 6: Semantic inconsistency as reported by the reasoning engine

Classification and Inference: An important aspect is that the RDF/OWL combination, in contrast to the object-oriented programming paradigm, can infer class membership and typing information from data based on the ontologically specified definitions. Properties should not be considered similar to attributes in object-oriented programming, since they are not part of a class definition; rather, their usage defines the class membership of an individual, thus also enabling multiple class memberships. As an example, a binary stream present in a physical disk’s image initially has no explicit type information. Throughout the investigation process and the output of different tools, such as specialized carvers (e.g. NTFS file-system reconstruction, file signature analysis, file metadata analysis), new properties and their values are introduced about the initial resource. A semantic reasoner can use definitions and restrictions defined in the ontology in order to dynamically infer the class membership of the initial resource. As an example, in Figure 7, a digital object has been extracted from an NTFS partition, and the NTFS file-system parser has inserted a datatype property stating that the file was deleted, as per its MFT entry.
The digital object was examined by various parsers, and an additional datatype property about its MIME content type has been added, implying that it is of an image type. According to the ontological definitions, a class hierarchy of the ‘File’, ‘ImageFile’ and ‘DeletedImageFile’ concepts has been specified. The ‘DeletedImageFile’ class extension has been specified as equivalent to the intersection of the class extension of the individuals having ‘image/jpeg’ as the value of the ‘hasMIMEType’ property and that of those having the boolean value true as the value of the ‘isDeleted’ property. The reasoner, upon finding an individual fulfilling these restrictions, is able to infer that the resource is a member of the ‘DeletedImageFile’ class and subsequently of its superclasses ‘File’ and ‘ImageFile’.

Figure 7: Class membership entailment based on value restrictions

Extensibility and Flexibility: Due to the continuous new advances and technologies pertinent to the area of digital investigations, it is important that applications dealing with evidence demonstrate backward and forward compatibility with respect to the data model of their input and output. As such, a backwards-compatible application may be able to process input based on previous data models. RDF/OWL inherently provides such support, where a reasoner can infer new axioms based on the additional ontological definitions when presented with information based on older data models. The same applies to forward compatibility, where new concepts and axioms defined in an existing ontology do not affect existing tools, which can still consume the remaining axioms. Based on existing practice, it is difficult to expect tool vendors and other stakeholders to agree on standardized ontological definitions of the domain area. This is manifested by the high variation of existing formats and the lack of standardization among them.
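The entailment of Figure 7 can be mimicked with a tiny classifier. In OWL, ‘DeletedImageFile’ would be defined as an intersection of value restrictions on ‘hasMIMEType’ and ‘isDeleted’; the sketch below hard-codes that one rule plus superclass propagation, so it only illustrates what the reasoner derives, not how a reasoner works.

```python
# Class hierarchy from the text: DeletedImageFile ⊑ ImageFile ⊑ File.
SUPERCLASSES = {"DeletedImageFile": ["ImageFile", "File"]}

def classify(props):
    """Infer class membership from property values, then add superclasses."""
    types = []
    # Equivalent-class definition: hasMIMEType = image/jpeg AND isDeleted.
    if props.get("hasMIMEType") == "image/jpeg" and props.get("isDeleted"):
        types.append("DeletedImageFile")
    for t in list(types):
        types.extend(SUPERCLASSES.get(t, []))
    return types

obj = {"hasMIMEType": "image/jpeg", "isDeleted": True}
print(classify(obj))  # ['DeletedImageFile', 'ImageFile', 'File']
```

The key contrast with object-oriented typing is visible here: the object never declared a class; its accumulated property values alone determine its membership.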
However, OWL provides the flexibility that different ontologies can be specified according to each stakeholder’s interests and scope, yet still be integrated through ontology mapping and alignment processes. Provenance: RDF/OWL provides good annotation capabilities, making it possible to associate each RDF statement with additional metadata regarding its originating source or other relevant information. Reification is a feature of RDF whose idea can be useful for digital investigations due to the requirements for establishing the chain of custody. The reification of a triple statement of the form subject-predicate-object (e.g. <ex:a> <ex:b> <ex:c>) is another graph that represents a blank node with the following properties: _:x rdf:type rdf:Statement . _:x rdf:subject <ex:a> . _:x rdf:predicate <ex:b> . _:x rdf:object <ex:c> . Expressed in this way, a triple is represented as an RDF statement, and thus additional metadata can be attached to it by using the blank node as the subject of other statements. However, the reified graph does not entail the original one, and due to its loose semantic clarity and weak query support, its use is gradually being discouraged.

Another approach, addressing the weaknesses of the reification approach, is that of Named Graphs. A named RDF graph is a set of RDF statements named with a URI reference. In such a way, an RDF data set can be referred to by its graph’s URI reference, and additional metadata such as provenance, trust and access control information about the RDF data can be asserted using the graph’s URI reference as subject (Carroll et al. 2005). SPARQL, as the preferred query language, provides support for named graphs and allows queries to be executed against explicitly named graphs.
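Assuming, for illustration, that a firewall-log dataset has been loaded under its own graph URI (the graph URI, prefix and property name below are hypothetical), a SPARQL query restricted to that dataset alone could look like:

```sparql
# Retrieve hosts and IP addresses asserted only in the firewall-log dataset
PREFIX ex: <http://example.org/forensics#>

SELECT ?host ?ip
FROM NAMED <http://example.org/case42/firewall-log>
WHERE {
  GRAPH <http://example.org/case42/firewall-log> {
    ?host ex:hasIPAddress ?ip .
  }
}
```

The same graph URI can also serve as the subject of provenance statements (acquisition date, examiner, tool version) kept in a separate metadata graph.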
This can be quite useful in the context of digital investigations, where RDF data sets originating from different tools, processes or systems, as well as other log data that may be archived daily, can be uniquely identified, annotated and independently queried. Search: The Semantic Web advocates that ontology-based searching can provide significant improvements over traditional keyword-based information retrieval (Shah et al. 2002). Concepts and semantic relationships defined in the respective ontologies support the text extraction, annotation and inference mechanisms. Documents are annotated with additional semantic markup that is later used instead of, or along with, traditional keywords for document indexing purposes. A query can be expanded through a reasoning engine and return more meaningful data by utilizing the semantic relationships between the searched concept and the semantically annotated concepts of the documents. Keyword-based information retrieval is a prominent method in all types of modern digital forensics, employed when examining media devices, mobile phones, security logs etc. The use of ontologies providing semantic relationships between different terms and concepts makes it possible to retrieve data that the examiner did not directly ask for but that is semantically related to her query. In a similar approach, although focused only on semantic linguistic relations, (Du et al. 2008) utilize WordNet, a general purpose lexical dictionary with additional labeling of the semantic relationships between words, for query expansion, taking advantage of synonym and antonym relationships between terms for improved precision in the information retrieval process. In this section, an outline has been presented of the concrete advantages that a Semantic Web based approach may offer over current practices and needs of the area of digital investigation.
Based on the former argumentation and examples, along with the successful application of Semantic Web technologies in relevant fields such as information retrieval, it can be claimed that such an approach can indeed constitute a possible solution to the practical problems of digital investigations discussed in Chapter 2.

6.2 Relation to Digital Investigation Reference Models

The proposed approach of semantically enabled integration and correlation of digital evidence relies upon and refers to the previously described digital investigation frameworks and processes. As such, a short description of the correspondence of the proposed approach to these is presented, with the intention to provide better insight into how the proposed approach can be effectively combined with existing reference models.

The event-based digital investigation framework has been selected as the most relevant framework due to its focus on events and the forensic steps required for the collection and interpretation of digital evidence as reconstructed past events. The framework covers both the physical and digital parts of an investigation, but our proposed approach focuses only on the digital aspect. However, the phases of Readiness, Deployment and Physical Crime Investigation are considered necessary predecessors of any digital investigation, and their proper execution determines the quality and trustworthiness of any input data to our proposed approach. The digital investigation phase consists of three main sub-phases: System Preservation, Evidence Searching and Event Reconstruction. The System Preservation phase emphasizes the necessity of proper acquisition and collection of the different sources that may contain data of evidentiary value, along with proper documentation of any actions performed on them.
The results of this phase are again considered prerequisites for the quality of the collected evidence and any subsequent analysis. The Evidence Searching phase has the role of utilizing techniques and methods to separate collected data with possible evidentiary value from unrelated and noisy data. Although our proposed approach does not focus on this phase, its successful execution can be of paramount importance due to the computational complexity that large volumes of data can bring to any data integration and correlation approach. Depending on the granularity of the ontologies used, as well as the concept relations that can be inferred or asserted, the transformation of a given input to semantic data can lead to huge amounts of triples and render any analysis attempt extremely cumbersome, if not infeasible. It is, of course, important to mention that the evidence searching phase must maintain the integrity of any extracted, filtered or transformed data, along with detailed documentation of any applied operation. The Event Reconstruction phase is the step most closely related to the proposed approach. In this step, evidence that has been properly collected and preserved during the first phase and extracted or filtered during the second phase is analyzed, integrated and correlated with the goal of reconstructing any past actions performed. In this phase, a semantically enabled approach can provide a semi-automated or even fully automated method for representing the input data using the expressivity of Description Logic and OWL, for integrating disparate sources of information, and even for intelligent inference of complex types of events and their interrelationships.
Finally, although the current thesis does not deal directly with the Presentation phase, a semantically enabled approach, with its inherent graph-based connection of data fragments, can provide a basis for advanced and more intuitive visual interfaces (e.g. timelines of events, data path trails etc.) that may enable a more meaningful interpretation and presentation of case-relevant evidence. The phases of the event-based digital investigation framework, as discussed before, present a high-level view of the main activities involved in any digital investigation, without providing a more detailed description of how each of these phases has to be applied and what actions the examiner is expected to perform. As such, the chosen Digital Investigation Process provides a much more detailed description of the various steps and their expected outputs. Although, as previously discussed, the steps of the presented Digital Investigation Process can be conducted in iterations and not necessarily in a linear form, the steps that precede the Analysis part are not considered of direct relation to the proposed approach and are not further discussed. However, as previously mentioned, the successful performance of these steps according to forensic practices can significantly improve the quality of the input data for the analysis step, considerably increase the efficiency, automation and speed of the analysis, and positively affect the results of the subsequent steps. As such, reduction steps, such as removing known files to reduce the amount of data needing to be processed, as well as further organization of the evidence files based on the nature of the information they hold, are performed in this thesis by the author before applying the proposed analysis method. The main parts of the analysis phase that are of interest to the present thesis are ‘Fusion’ and ‘Correlation’.
The purpose of the ‘Fusion’ phase is to merge data of different types and natures so as to allow the examiner to acquire a broader and more thorough view of past events. This step maps directly to the data integration capabilities that Semantic Web technologies offer: data of different forms can be represented using common and well-defined ontological concepts and terminologies, and user-defined relations can easily be drawn between them. The proposed approach, along with the method presented later, can fuse data of different types (network captures, disk images, log files etc.) using an extensible set of ontologies and interconnect data with different types of relations, such as part-whole, temporal and comparative ones. In the ‘Correlation’ phase, data can be combined in complex ways so as to reveal additional patterns of activity and establish relationships (causal, temporal, contextual etc.) between different events or involved entities. The Semantic Web approach can support such advanced forms of correlation through its automated inference capabilities, along with its support for rule engines that provide even greater expressivity for establishing correlations between different events. The ‘Validation’ phase, where the results of the analysis along with their backing reasoning are collected and further communicated, can also be enhanced by current and future research on the top ‘Logic’ and ‘Proof’ layers of the Semantic Web architectural stack. Finally, based on the discussion regarding the scientific value of evidence and the concerns about the reliability of any digital investigation process, the hypothesis-based approach and its process of formulating and testing hypotheses has been considered an important principle that should be followed when conducting digital investigations.
In accordance with the additional requirements of such a hypothesis-based approach, as expressed in (Rekhis & Boudriga 2011), the proposed semantically enabled approach can provide an automated, accurate and replayable evaluation of hypotheses, as well as formalized and expressive means for a conceptual representation of existing domain knowledge along with examiner-submitted hypotheses. In our approach, we consider the querying capabilities that Semantic Web technologies offer, along with the formalized conceptualization of forensic knowledge and involved entities, as a structured and powerful way for expressing and evaluating hypotheses not only against raw data but also against complex events that may have resulted from the integration of disparate data sources.

Figure 8: Conceptual relation between forensic frameworks and the Semantic Web stack

In Figure 8, the conceptual relations between the different phases of the presented framework and process and the proposed approach based on Semantic Web technologies are depicted, as previously discussed.

6.3 Evaluation Criteria

Due to the lack of a common and established approach for evaluating digital forensic methods and tools, along with the interdisciplinary nature of the proposed method, evaluation criteria for assessing its quality and results have been specified, based mostly on relevant work in both the area of digital investigations and that of Semantic Web applications. In order to establish a better structured definition of the criteria, the ‘Goal-Question-Metric’ (GQM) approach (Basili et al. 1994) is followed.
According to the GQM methodology, a measurement model is defined that can be used to evaluate the alignment of a proposed or implemented system with its purpose, as well as its correspondence to its operational goals. The three levels that constitute the model are the goals, which identify the engineering goals behind the system under evaluation; the questions, which provide a way of refining the goals into an operationalized characterization of the level of goal achievement; and the metrics, which are defined as the data that need to be collected to provide a quantitative basis for the evaluation. It should be noted, though, that due to the restricted scope of the current thesis and the nature of the proposed method as an alternative form of analysis for digital investigations, the impact of specific criteria was difficult to measure in the context of a single project or case study; instead, qualitative arguments are given in their support, pointing in parallel to possible extensions of the current work. Based on relevant work regarding evaluation criteria and evaluation methodologies for forensic tools (Hildebrandt et al. 2011), Semantic Web based services and applications (Küster et al. 2010) and software engineering methods in general (Kitchenham et al. 1997), a list of goals and criteria has been collected and grouped into three main categories: generic criteria, forensic criteria and Semantic Web related criteria. The goals are presented in aggregated form in the tables below, along with their associated questions and metrics.

Table 1: List of Generic Criteria in terms of the GQM methodology

Goal: The proposed method should be appropriate for the task in hand.
Questions: What is the relationship of the proposed method with existing digital investigation practices and tools? What are the case context requirements for the method to be applied?
Metric: The ability of the method to handle different types of cases (network-related events, media device examination etc.), measured by the number of different data types it can process.

Goal: The method should provide good support for decision-making by providing relevant and usable results.
Questions: What types of new knowledge can the method extract, and what is their usefulness? How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric: The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.

Goal: The method should be cost effective in terms of storage and time needs.
Questions: How does the method accept and store input data, intermediate and final results? What are the storage requirements for such an implementation? How much time is needed for applying the method on the input data, and how can it reduce the time that the investigation process takes?
Metrics: Storage size requirements for representing input and output data. Time needed for performing the analysis of data or evaluating user-submitted queries.

Goal: The method should be flexible and scalable.
Questions: Can the method deal with new sources of data or seamlessly integrate new forms of ontologically expressed knowledge and rules? Can the method support large amounts of data, and what problems may such complexity incur?
Metrics: The ability of the method to process new data and accept additional ontologies or rules without the need for major (possibly even any) modifications to the existing steps, measured by the amount of configuration or code modification such changes require. The method’s ability to handle large amounts of data, measured by the input size (e.g. number of captured network packets, firewall logs, disk image sizes etc.) in relation to the processing time or produced errors.
Table 2: List of Forensic related criteria in terms of the GQM methodology

Goal: The method’s results should be reproducible.
Question: Do the results of the method behave in a deterministic manner when applied on the same input data, or are they inconsistent among multiple tests?
Metric: The method’s results (e.g. inferred axioms, query results) should be the same given the same dataset, independently of other factors such as the order in which the evidence files are processed. This can be measured by the number of errors or differing results after multiple applications of the method on the same dataset.

Goal: The method’s possible errors should be minimal and determined.
Questions: Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric: The method’s results can be automatically checked by a reasoning engine for possible inconsistencies between the asserted and inferred axioms and the given ontologies. The method’s error rate can be measured by the error messages produced during its lifecycle.

Goal: The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Question: Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms?
Metric: The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.

Goal: The method should protect the integrity of the collected data.
Questions: Can the method operate on forensic copies of the collected evidence? Does the method allow the logging of its various steps as they are applied and of the results they produce? Does the method use hashing algorithms in order to ensure the consistency and integrity of these forensic copies?
Metric: The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by working on forensic copies instead of the originals and verifying any hash values that these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered as a metric.

Table 3: List of criteria regarding the Semantic Web principles in terms of the GQM methodology

Goal: System heterogeneity and platform independence.
Questions: Can parts of the method be applied on different systems and the partial results later recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric: The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.

Goal: Implementable with the current Semantic Web stack, i.e. the method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status.
Question: Can the method’s steps that utilize Semantic Web concepts be implemented with current technology, or are other improvements/extensions needed?
Metric: Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.

Goal: The method and its results should be semantically rich, allowing the description of high-level contexts and events along with their interrelationships.
Questions: Can the method describe arbitrary data? Can the method accept descriptions of high-level and user-defined concepts and associate sets of lower-level events with them? Can the method establish relationships between these higher-level descriptions?
Metric: The method should be able to accept user-defined high-level concepts and associate lower-level events to them using well-defined rules/restrictions.
Errors produced, or the inability to define custom events, can be considered a metric of how semantically rich the method is.

The goals defined above and their associated metrics can be used to establish an evaluation framework for the proposed method as well as for the results obtained. It should be noted once more, though, that the main goals of the proposed approach, namely large-scale automation of the digital investigation process, integration of and reasoning over all types of collected data, and a formalized ontological representation of the existing body of knowledge, thus making digital investigation more approachable to less technically adept people, are not fully covered by the former goals. A much broader evaluation framework would be required, involving studies of the people engaged in digital investigations and of the effect that such a semantically enabled method can have on their existing practices and perspectives, using research methods such as action research, surveys and interviews.

A Semantically enabled Method for Digital Evidence Integration, Correlation and Hypothesis Evaluation

In this chapter, the proposed semantically enabled method for digital evidence fusion, correlation and hypothesis evaluation is discussed at both a theoretical and a practical level. The first part deals with the abstract building blocks that constitute the parts of the method and their relationships, while the second part deals with implementation details of the method, along with the suggested software architecture that has been used for a proof-of-concept practical application of the method.

7.1 Description of the Method

The basic design structure of the method is presented in Figure 9 and further discussed afterwards.

Figure 9: The abstracted method's structure

The Data Collection phase encompasses all current and future forensic acquisition techniques.
The goal of the data collection phase is to generate the appropriate input for the remaining parts of the method. Forensic principles such as proper seizure and acquisition of the involved data sources (e.g. hard disks, network packet captures, OS and application logs, memory contents, mobile devices etc.) are assumed to be maintained during the process. Although the current method does not impose strict requirements or checks on the input data, applying the method to forensic copies of the original data, with their integrity well protected and their chain of custody maintained, is important for the subsequent credibility and admissibility of the generated results. Due to computational complexity issues, it is expected that a preprocessing and data reduction step can also be applied, using common techniques such as KFF (Known File Filter), so as to reduce the amount of input data to the method. The second step of the method involves the parsing of the input data with appropriate software or even hardware tools and their representation in a common format using the RDF data model and the respective ontologies. As discussed before, ontologies, expressed in OWL in the Semantic Web context, allow the modeling of the domain that each input source represents and the explicit definition of the pertinent concepts, properties and interrelations. Unfortunately, as discussed before, little work has been done so far on the specification of ontologies that conceptualize the various types of evidence commonly used in digital investigations. Most ontological approaches have focused on surrounding concepts and elements of digital investigations, such as the involved entities and procedures, or on specialized sources of evidence such as the Windows Registry. In the context of the proof-of-concept method implementation presented below, a number of lightweight ontologies are specified.
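To illustrate the kind of output such a parsing step produces under a lightweight ontology, a deleted file recovered from a disk image might be asserted as follows (the namespace, the case-scoped URI scheme, the property names and the values are all illustrative only):

```turtle
@prefix disk: <http://example.org/diskimage#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# One resource per parsed file-system object, identified by a case-scoped URI
<http://example.org/case42/disk1/mft/102-3>
    a disk:File ;
    disk:hasFileName "invoice.jpg" ;
    disk:hasMIMEType "image/jpeg" ;
    disk:isDeleted   "true"^^xsd:boolean ;
    disk:storedOn    <http://example.org/case42/disk1> .
```

Each tool contributes further triples about the same URI, so the description of a resource grows monotonically as the investigation proceeds.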
However, domain experts should cooperate in order to reach a consensus on more complete and accurate models of the different evidence domains, expressed in an ontological manner. Despite the above, the parsing tools can be flexible enough to operate even with custom-made ontologies specified by the investigator or by other communities such as forensic tool vendors. The output of the ‘Ontological Representation’ step is the transformation of the input data from their various formats (binary, XML, forensic disk image formats) into the triple-based RDF data model. The RDF data model’s flexibility allows the representation of arbitrary information in the form of subject, predicate and object. Subjects should be given a URI so as to be uniquely distinguishable, while objects can be represented either as resources with their own URIs or as data values under suitable namespaces. The predicates are specified as object or data properties in the respective ontology and enable the representation of the interrelationships between resources or the specification of the values of resources’ properties. The asserted axioms can be contained in a new, initially empty ontology that references the ontology file describing the classes and the properties. The final output can be stored using the various ontology formats such as RDF/XML or RDF/OWL, since most Semantic Web frameworks support both equally well. The next step is the ‘Automated Reasoning’ step. During this step, an OWL-based reasoner is run over the previously generated ontological representation of the data in order to infer additional axioms based on the resources’ ontological relations. Depending on the sophistication of the given ontology and the detail of the given data, different types of new axioms can be inferred, such as: Class Assertion Axioms: The inference engine takes defined property restrictions into consideration in order to classify resources as members of specific class extensions.
The advantage of this inference is that a resource can belong to multiple classes in parallel. Additionally, a resource with partially asserted knowledge can dynamically be classified under the best matching class descriptions, thus allowing the mapping of evidence data onto higher-order and custom-defined concepts not necessarily included in the original data. Property Assertion Axioms: The inference engine takes property relations, such as property hierarchies or transitivity, into consideration in order to infer new property axioms that are attached to the given individuals. New features such as property chains, defined in OWL 2, allow the inference of new property assertions based on more complex relationships between resources. Inverse Object Property Axioms: The inference engine takes into consideration the definition of properties that have an inverse relationship in order to infer inverse object property assertions from the existing ones. This enables the resources connected by these properties to establish bi-directional connections that further enable more advanced queries. Subclass Axioms: The inference engine is able to resolve a newly inferred subclass hierarchy based on previously asserted class descriptions, when these extra hierarchical relations have not been explicitly defined in the ontology. Following this, the ‘Rule Evaluation’ step is where SWRL rules are evaluated by a rule engine and the newly asserted axioms are inserted back into the ontology. SWRL has been chosen as one of the most promising rule languages for the Semantic Web stack. SWRL rules can assert additional axioms that cannot be inferred within the expressivity limits of OWL DL. To the best of our knowledge there is no full support for SWRL rule evaluation in most OWL inference engines, and thus some rules may have to be evaluated by external rule engines such as Jess.
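As a sketch of what such a rule can express, the following SWRL rule, written in the human-readable syntax and using hypothetical class and property names, marks every packet whose source IP address belongs to a host reported as malicious:

```
Packet(?p) ^ hasSourceIP(?p, ?ip) ^
MaliciousHost(?h) ^ hasIPAddress(?h, ?ip)
    -> SuspiciousPacket(?p)
```

The head of the rule asserts a new class membership that subsequent reasoning rounds and SPARQL queries can build upon; rules whose constructs the OWL reasoner cannot handle are delegated to an external rule engine such as Jess.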
These external rule engines may represent axioms differently from the RDF data model and thus require an additional translation layer. The axioms newly inferred by the rule engine should be translated back into the RDF data model and integrated into the existing ontology. Depending on the ontologies used, an additional round of inference may be applied if new axioms can be inferred on the basis of those produced by the evaluation of the rules. The final step is the ‘Integrated Query’ step, where queries expressed in the SPARQL language are submitted to a SPARQL endpoint hosting the sets of asserted and inferred axioms. The term ‘Integrated’ is used because, in the previous two steps, both OWL inference and SWRL evaluation can establish relations, in the form of predicates, between resources of the same or different datasets.

7.2 Ontological Representation of Digital Evidence

As explained before, the ontology plays a central role in the proposed method. Currently, there are no well-established ontologies for the area of digital investigations covering different types of evidence. It would be too hopeful to expect that one commonly accepted ontology describing the concepts and relations of different types of digital evidence will soon be accepted by tool vendors and integrated into their tools. Based on this assumption, the method has been developed with its ability to handle different ontologies from different namespaces in mind. As such, a choice adopted during this research is that different types of evidence (network packets, forensic disk images, logs etc.) are each associated with their own ontology under their respective namespace. Conceptually, in the context of the current research, sources of data have been categorized into two main types, namely ‘Case Related Data’ (CRD) and ‘Supportive Data’ (SD).
CRD are data that originate from the main sources resulting from the initial steps of the forensic process, such as collection and examination. CRD describe data sources which contain the data of direct evidentiary value for the case at hand. CRD can include captures of network traffic in either raw (pcap) or sessionized (NetFlow) format, forensic images of hard disks, and logs produced by different applications or network appliances (firewalls, IDSs etc.). SD, on the other hand, are additional information that originates from further analysis of CRD, or other types of metadata that can be provided by external services. Examples of SD include domain name information such as DNS reverse lookups, WHOIS domain information, IP geolocation and IP to Autonomous System Number (ASN) mappings, and the output of antivirus and antimalware engines upon scanning extracted files. A clear distinction between CRD and SD may not be feasible, taking into consideration the procedures followed by different investigation teams. For this research, the main criterion for the distinction is whether the data and metadata can be collected directly from the collected sources, as opposed to being obtained as additional metadata from external services. For the purposes of the implementation of a proof-of-concept system, a number of lightweight and focused ontologies, expressed in OWL, have been defined and designed using the Protégé OWL Editor. Based on the assumption that heavyweight ontologies that could better conceptualize the area of digital investigations require large-scale cooperation and consensus among academia, industry and individuals, a set of lightweight ontologies specializing in the ontological representation of different types of data pertinent to a digital investigation can provide a more pragmatic approach.
Such an approach can demonstrate the ability of Semantic Web solutions to provide considerable advantages, albeit at the price of added complexity, in the areas of information integration and automated reasoning. Three main types of CRD have been chosen: network packet captures, which provide an accurate image of the network communications a system had in the past; forensic images of hard disks, which provide a forensic copy of the files and directories stored on a hard disk, structured in the context of a file system such as NTFS; and finally firewall logs such as those created by the Windows XP Firewall. For SD, we have considered sources of data such as malware detection services like ‘VirusTotal’ (www.virustotal.com), which can provide additional information about the potentially malicious nature of a file; network registration information such as that provided by RIPE (www.ripe.net/data-tools/db), which includes the IP range to which an IP address belongs and the Autonomous System (AS) to which it is currently assigned; and finally the results produced by other projects such as the FIRE project (http://maliciousnetworks.org/), which provides real-time data from monitoring reported malicious networks that may contain hosts used to serve phishing sites or malware. In the sections below, each information source and its respective proposed ontology are discussed in more detail.

7.2.1 Network Packet Capture Ontology

Network packet capturing records data packets as they are transferred over a computer network. Our focus is, of course, on the most prominent type of modern computer networks, which are based on the TCP/IP stack. The TCP/IP stack is structured in four layers, namely, from bottom to top, the link layer, internet layer, transport layer and application layer. A packet capture can monitor a network, wired or wireless, and record all data transmitted in the link layer.
As such, a packet transmitted in the link layer carries along the data from all the hierarchically higher layers, such as protocol headers from the internet layer (e.g. source and destination IP addresses), the transport layer (e.g. TCP or UDP source or destination port) and of course the application layer data (e.g. an HTTP request or response). A Network Forensic Analysis Tool is able to analyze the protocols of each layer that each packet uses and additionally assemble the packets into streams, thus providing a higher level of abstraction such as the complete file or request that an application sent to another over the network. Network forensics has the objective of analyzing such captures of network traffic in order to extract transmitted files (e.g. HTTP downloaded files) or application messages (emails or instant messaging conversations), as well as traces of attempted or successful intrusions such as web application attack vectors like SQL injections. A basic ontology that contains terms and concepts important in packet capture analysis has been designed, and the hierarchical structure of a part of the defined classes is presented in Figure 10. The basic approach followed during the design of this ontology is that a packet capture file contains a set of network packets which can be further aggregated into IP conversations between pairs of IP addresses. An IP conversation between two distinct IP addresses may in turn include a set of TCP or UDP flows between different source and destination ports. Finally, a TCP or UDP flow may contain a set of application layer messages such as HTTP request and response messages. An application protocol such as HTTP has an internal structure with a set of different header fields and values with different semantics, such as the type of browser a user is using, the user credentials used for authentication and the type of the returned resource (e.g. image, text or binary content).
In this ontology we have focused on the HTTP protocol, which is one of the most used protocols on the Internet today. In order to further semantically annotate the HTTP requests and responses present in a packet capture, the RDF vocabulary of HTTP as specified by the W3C ERT working group (Koch et al. 2011) has been used. As such, the content that is exchanged via the HTTP protocol can be formally annotated in a machine-understandable manner. In an approach similar to the current thesis, although focused only on HTTP-based attacks, (Munir et al. 2011) further leverage the HTTP RDF vocabulary by introducing semantic rules expressed in SWRL in order to detect malicious and malformed HTTP requests or responses, such as HTTP requests including HTTP response headers.

Figure 10: Ontological modeling of network packet captures

A brief description of the defined classes, object properties and data properties follows:

Table 4: Entities of the Network Packet Capture Ontology

CLASSES
PacketCaptureFile: A class that contains in its class extension all the resources that represent packet capture files. In order to better manage network packet captures in large networks that span long periods of time, various policies can be followed, such as splitting the captured network traffic into different files by a threshold file size or number of captured packets. An individual that belongs to this class represents an individual packet capture file and can be properly annotated with additional properties useful for documentation and chain-of-custody purposes.
IPAddress: A class that represents IP addresses. Each different IP address is identified by a different URI.
IPv4_Communication: A class that represents IP communications between two IP addresses. An IP conversation has a source and a destination IP address. By source we mean the client that initiates the connection, and by destination the server accepting the connection.
Port: A class representing the network ports that a computer has. A port can be either a TCP or a UDP port and is identified by its number.
TCPFlow: A class representing a data flow over the TCP transport layer protocol from a source TCP port to a destination TCP port. The same applies for the respective UDP classes.
ApplicationLayerProtocol: A TCP or UDP flow is characterized by the application layer protocol that is used, such as HTTP or DNS. This class extension contains all the resources that represent a communication between two network hosts using a specific application layer protocol under a specific TCP or UDP connection.

OBJECT PROPERTIES
hasCommunication: This object property is used to link a packet capture file as the subject with the set of IP communications that it contains as the object. Its subproperties are ‘hasTCPCommunication’ and ‘hasUDPCommunication’.
hasSourceIP, hasDestinationIP: These object properties link IP communications with individuals of the IPAddress class that are the source and destination endpoints.
hasSourcePort, hasDestinationPort: These object properties link a TCP or UDP flow with the source and destination ports, which are individuals that are members of the Port class.
hasApplicationLayer: This object property connects an IP communication with the application protocol that is used.
hasHTTPRequest: This object property links an individual that is a member of an HTTP communication with a set of individuals that are members of the class HTTP Request.

DATA PROPERTIES
hasPortNumber: An integer data value that carries the port's number.
hasStartTimeStamp, hasEndTimeStamp: Data values of the DateTime type as specified in the XML Schema Datatypes. These values indicate the temporal duration of a specific TCP or UDP stream.
hasContentMD5: This data value contains a hex representation of the MD5 hash of the content that was carried in an HTTP response message.
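To make the data model concrete, the following minimal sketch (not part of the thesis implementation) shows how a captured HTTP conversation could be expressed as subject-predicate-object triples using the property names of the Network Packet Capture Ontology. The ‘pcap:’ namespace prefix and the individual URIs are illustrative assumptions.

```python
# Sketch: a captured HTTP conversation as subject-predicate-object triples.
# The "pcap:" prefix and the individual names are invented for illustration.
NS = "pcap:"

triples = [
    (NS + "capture1",   NS + "hasCommunication",    NS + "comm1"),
    (NS + "comm1",      NS + "hasSourceIP",         NS + "ip_10.0.0.5"),
    (NS + "comm1",      NS + "hasDestinationIP",    NS + "ip_93.184.216.34"),
    (NS + "comm1",      NS + "hasSourcePort",       NS + "tcp_49152"),
    (NS + "comm1",      NS + "hasDestinationPort",  NS + "tcp_80"),
    (NS + "comm1",      NS + "hasApplicationLayer", NS + "http_comm1"),
    (NS + "http_comm1", NS + "hasHTTPRequest",      NS + "req1"),
]

def objects(graph, subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?o)."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# All destination IPs reachable from the capture file:
for comm in objects(triples, NS + "capture1", NS + "hasCommunication"):
    print(objects(triples, comm, NS + "hasDestinationIP"))
```

A SPARQL engine generalizes exactly this pattern-matching over the triple store, with the added benefit of operating over both asserted and inferred axioms.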
Further classes such as Request, Response, RequestHeader, ResponseHeader and Content are imported from the HTTP RDF vocabulary and are used to further annotate the various parts of HTTP requests and responses.

7.2.2 Forensic Disk Image Ontology

Analysis of forensic images of hard disks, or storage devices in general, is unquestionably a central part of almost every modern digital investigation. There are various types of storage devices in use, such as hard drives (SATA, SCSI, SSD), USB flash disks, SD cards etc. There is also a plethora of file systems that are used to organize and manage the contents of a device in the form of files, directories and other associated data structures, such as NTFS, FAT32, EXT3, HFS etc. Storage devices, especially in the enterprise environment, can be combined using various techniques such as RAID or disk spanning for added benefits such as fault tolerance or dynamically increasing storage sizes. There are also different techniques for acquiring forensic images of such storage devices, including specialized hardware devices or software solutions. A forensic image of a storage device can also come in different formats such as raw (dd), EWF, AFF and more. Most forensic suites, especially the commercial ones, are able to read most of these image formats and also interpret and reconstruct different file systems. Depending on the capabilities of the tool and the file system used, files and directories can be extracted, including deleted or fragmented ones; various metadata can be retrieved, such as timestamps, file type and hash signatures; and even files hidden using various anti-forensic techniques (e.g. using the slack space) can be recovered. The results of the above can be used for determining the creation or possession of specific files by a user (e.g. contraband material), for reconstructing the historical usage of a system, or even for analyzing malware-infected systems.
The DFXML approach (Simson Garfinkel 2011) attempts to insert a layer of abstraction over all this variety of formats and types by introducing an XML-based vocabulary with XML entities and attributes that capture important metadata about files and directories that are common amongst most existing file systems and storage types. DFXML is accompanied by a series of scripts/tools (e.g. fiwalk) that can parse forensic images of storage devices and generate an XML document that provides a listing of contained files and directories along with their most important metadata. As mentioned before, although XML provides a common data exchange format that almost all systems can generate or process, it lacks the additional semantic relationships between the different entities. Based on the XML vocabulary proposed by DFXML, an ontology has been designed with the goal of semantically expressing commonly used concepts and attributes in the area of digital forensics along with their interrelationships. The ontology has been created using the Protégé Ontology Editor and is graphically presented in Figure 11. The design approach followed is that, conceptually, the term binary content can describe equally well an image of a large storage device, a small file, or even a sequence of bytes as part of a file. As such, an output of the fiwalk tool represents the contents of an image of a storage device, which in turn may consist of multiple partitions formatted with different file systems which contain the various files. A file's content can also be described by its byte run, i.e. the sequence of bytes and their location in the image, or multiple byte runs in the case of fragmentation.
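As an illustration of the kind of input the ontology mapping consumes, the sketch below parses a simplified DFXML-like fragment with Python's standard XML library. The element names follow the DFXML vocabulary (fileobject, filename, filesize, mtime, hashdigest), but the fragment and its values are invented for illustration, not actual fiwalk output.

```python
import xml.etree.ElementTree as ET

# A simplified fragment modeled on DFXML's <fileobject> structure.
# Element names follow the DFXML vocabulary; the values are invented.
sample = """
<dfxml>
  <volume offset="32256">
    <fileobject>
      <filename>WINDOWS/system32/evil.dll</filename>
      <filesize>24576</filesize>
      <mtime>2012-03-14T09:26:53Z</mtime>
      <hashdigest type="md5">d41d8cd98f00b204e9800998ecf8427e</hashdigest>
    </fileobject>
  </volume>
</dfxml>
"""

root = ET.fromstring(sample)
files = []
for vol in root.findall("volume"):
    for fo in vol.findall("fileobject"):
        files.append({
            "path": fo.findtext("filename"),
            "size": int(fo.findtext("filesize")),
            "mtime": fo.findtext("mtime"),
            "md5": fo.findtext("hashdigest"),
        })

print(files[0]["path"])  # WINDOWS/system32/evil.dll
```

Each dictionary produced here corresponds to the data-property values (hasPathName, hasFileModificationTime etc.) that the disk image ontology attaches to an individual of the File class.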
Figure 11: Ontological modeling of a forensic disk image

A brief description of the defined classes, object properties and data properties follows:

Table 5: Entities of the Disk Image Ontology

CLASSES
BinaryContent: A generic concept that describes all types of binary content independently of their size or format.
FiwalkReport: The class represents individual instances of the fiwalk tool output.
MediaDeviceImage: The class represents all types of storage devices.
Partition: The class represents blocks of the total storage area that are logically separated and possibly differently formatted. Partitioning is used in the case of systems with multiple operating systems, or to separate large storage devices into smaller blocks.
FileSystem: This class represents the different types of file systems that are used to organize the data inside a partition as well as manage them (e.g. create or delete files).
File: This class represents a logical entity of arbitrary information that stores data formatted in a specific way.
ByteRun: This class represents a byte run, which is a sequence of bytes that are part of a file and are sequentially stored on a storage device.

OBJECT PROPERTIES
describes, isDescribedBy: These object properties are used to establish the relationship between an individual that is a member of the FiwalkReport class and the individual that is a member of the MediaDeviceImage class that the former describes.
hasPartition, isPartitionOf: These object properties establish the relationship between an individual that is a member of the MediaDeviceImage class and the one or multiple individuals that are members of the Partition class that it may contain.
hasFileSystem, belongsToPartition: These object properties establish the relationship between an individual that is a member of the Partition class and an individual that is a member of the FileSystem class, representing the file system with which the partition is formatted.
containsFile, isContainedInFileSystem: These object properties establish the relationship between an individual that is a member of the FileSystem class and the multiple individuals that are members of the File class.
hasByteRun, belongsToFile: These object properties establish the relationship between an individual that is a member of the File class and the single or, in the case of fragmentation, multiple individuals that are members of the ByteRun class.

DATA PROPERTIES
hasPathName: A string value of the path and name of a specific file contained in a storage device image, or even of the image file itself.
hasType: A string value that holds the type of the file after the file identification process performed by fiwalk using the libmagic library.
hasFileModificationTime: The lexical representation of the timestamp of the last modification of the file, using the XML Schema DateTime datatype.

There is also a large set of data-type properties that describe different metadata about a file, which due to space limitations have not been included here.

7.2.3 Windows Firewall Log Ontology

The main purpose of a firewall device is to inspect the flowing network traffic and accept or deny network packets based on some predefined security policy. Firewalls can be implemented either as network devices positioned at the perimeter of a network, thus effectively protecting the set of hosts that reside behind them, or they can operate on single hosts as well. Without going into too much detail about the different types of firewalls and the ability of some to even inspect application layer data, for the purpose of this thesis we will focus on stateful host-based firewalls such as the Windows Firewall. The Windows Firewall was first integrated with the Windows operating system with the release of Windows XP Service Pack 2.
The Windows Firewall has a preconfigured set of rules describing allowed or disallowed traffic based on packet characteristics such as source and destination ports and source and destination IP addresses. The firewall is able to inspect both incoming and outgoing traffic. The Windows Firewall enables the administrator to activate the secure logging capabilities of the product through its configuration settings. The administrator has the option to log the dropped packets that the firewall rejected based on the specified rules, the successful connections that were allowed to pass through the firewall, or both. The security log uses the W3C Extended Log File Format. This format is based on a W3C Working Draft with the aim of providing a standardized and flexible format for keeping log files related to Web activities. The format is actually used by a variety of applications such as the Microsoft IIS Server as well as other web server software such as the Apache Web Server. Of course, different vendors of either software or specialized network firewall devices, such as Cisco, may use their own proprietary log formats. However, basic information such as that kept in the W3C Extended Log File Format is expected to be logged by most other solutions as well. The following table provides basic information about the various fields that can be found in a Windows firewall log.

Table 6: List of fields of Windows Firewall log entries

Version: The Windows Firewall software version.
Software: Name of the application producing the log.
Time: The time format used for reporting timestamps.
Date-Time: The timestamp of the log event in the form YYYY-MM-DD HH:MM:SS.
Action: Describes the action that the firewall has taken upon the packet.
The available values are OPEN for an allowed outgoing connection, OPEN-INBOUND for an allowed incoming connection, CLOSE for a normal closure of a TCP connection, DROP when a packet violates a firewall rule and is subsequently rejected, and INFO-EVENTS-LOST, which describes a number of occurred events not recorded in the log.
Protocol: The network protocol in use, such as TCP, UDP or ICMP, or the protocol number in the case of other protocols.
Src-ip, Dst-ip: The source and destination IP addresses.
Src-port, Dst-port: The source and destination port numbers.
Size: The packet size in bytes.
TCPFlags: The first letter of each active TCP flag present in any TCP packet.
Path: Indicates the direction of the communication. The value SEND is used for an outgoing packet from the host and RECEIVE for an incoming packet.

A firewall log can provide a wealth of information regarding connections that the system has established with other networked systems, or possible incoming connection attempts that have been either accepted or rejected. A high number of rejected packets can be used as an indicator of a possible intrusion attempt, while large sets of packets with specific characteristics, such as TCP SYN packets, may indicate network port scanning or even Denial of Service attacks. As such, a firewall log can be quite useful in the case of a digital investigation that pertains to network intrusions or other network-related activities. A firewall log that is properly stored, maintained and archived can potentially be used as evidence of previous network activity and further correlated with network packet captures. A lightweight ontology has been designed for semantically representing important terms and concepts used in firewall log analysis, as well as their interrelationships. The ontology has also been designed using the Protégé Ontology Editor and its structure is graphically presented in Figure 12.
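To illustrate how such entries are consumed, the sketch below parses log lines according to the field order announced by the ‘#Fields:’ header, as in the W3C Extended Log File Format, and maps each entry's Action value to a firewall event concept. The sample lines use only a subset of the fields and are invented; this is an assumption-laden sketch, not the thesis implementation.

```python
# Sketch: parsing firewall log lines (W3C Extended Log File Format style)
# and mapping the Action field to a firewall event class. Sample data is
# invented, and only a subset of the real field list is shown.
sample_log = """\
#Fields: date time action protocol src-ip dst-ip src-port dst-port
2012-03-14 09:26:53 OPEN TCP 10.0.0.5 93.184.216.34 49152 80
2012-03-14 09:27:10 DROP TCP 203.0.113.7 10.0.0.5 3312 445
"""

# Action value -> event concept (here: descriptive names for the four
# kinds of firewall events modeled by the ontology)
EVENT_CLASS = {
    "OPEN": "OpenOutboundSessionEvent",
    "OPEN-INBOUND": "OpenInboundSessionEvent",
    "CLOSE": "CloseSessionEvent",
    "DROP": "DropDataEvent",
}

lines = sample_log.splitlines()
fields = lines[0][len("#Fields: "):].split()
entries = []
for line in lines[1:]:
    entry = dict(zip(fields, line.split()))
    entry["event_class"] = EVENT_CLASS.get(entry["action"])
    entries.append(entry)

print(entries[1]["event_class"])  # DropDataEvent
```

Because the header names the fields, the same parser tolerates logs that include or omit optional columns such as Size or TCPFlags.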
The main design approach followed is that a Firewall Log File can contain a set of Firewall Log Entries. Each entry includes a source and destination host, a source and destination port, as well as the associated protocol used. Besides that, the different actions that a firewall may take upon each packet have been conceptualized as a Firewall Event, with its sub-concepts of Open Inbound or Outbound Session Event, Close Session Event and Drop Data Event.

Figure 12: Ontological modeling of Windows firewall logs

A brief description of the defined classes, object properties and data properties follows:

Table 7: Entities of the Windows Firewall Log Ontology

CLASSES
Host: A class representing all the network hosts.
Port: A class representing the network ports that a computer has. A port can be either a TCP or a UDP port and is identified by its number.
Protocol: A class representing the different network protocols that a connection may use. For the case of a Windows Firewall log, the options TCP, UDP and ICMP have been specified as individuals, members of this class.
FirewallLogContainer: A class representing an entity that acts as the container of a set of firewall logs. Windows Firewall logs are commonly stored in text files on the local system or on remote storage. Advanced configuration options enable log rotation, where logs are kept in different files depending on the date or a threshold amount of already logged entries.
FirewallLogEntry: A class representing firewall log entries. Practically, each line in the W3C Extended Log File Format is mapped to a distinct firewall log entry.
FirewallEvent: A class representing the action that the firewall took upon a network packet or session, as described by a log entry. The sub-classes CloseSessionEvent, DropDataEvent, OpenInboundSessionEvent and OpenOutboundSessionEvent describe the different actions that the Windows Firewall may take.
OBJECT PROPERTIES
hasLogEntry: This object property connects individuals that are members of the FirewallLogContainer class with individuals that are members of the FirewallLogEntry class. This is a useful property that allows the digital investigator to track down the actual log file that contains a specific log entry of interest.
represents: A mapping between a firewall log entry and the firewall event that describes it. This is a basic abstraction from the raw format of a single firewall entry to a more abstract and descriptive event. Events can thus be more meaningfully aggregated later on into events of even higher abstraction, as well as reasoned and queried about.
hasSourceHost, hasDestinationHost: These object properties connect individuals that are members of the FirewallEvent class with individuals that are members of the Host class.
hasSourcePort, hasDestinationPort: These object properties connect individuals that are members of the FirewallEvent class with individuals that are members of the Port class.
hasProtocol: This object property connects individuals that are members of the FirewallEvent class with the specified individuals of the Protocol class.

DATA PROPERTIES
hasAction: This data property holds the lexical value of the action that the firewall applied upon a packet or session, such as DROP, CLOSE, OPEN etc.
hasAddress: This data property holds the lexical value of the IP address of a network host.
hasNumber: This data property holds the numerical value of a network port.
hasDateTime: This data property holds the date and timestamp when the log entry was logged. It is formatted based on the DateTime data type specified in the XML Schema Datatypes specification.

7.2.4 WHOIS Ontology

The modern Internet is without doubt quite different from the initial small research-oriented networks from which it emerged.
In order to deal with the ever-increasing complexity of managing the IP address space and administering its allocations, as well as the associated domain name infrastructure, a hierarchy of administrative organizations has emerged. Without going into too much detail, nowadays there are five Regional Internet Registries (RIRs) with the responsibility of managing specific ranges of IP addresses: ARIN for North America, RIPE for Europe, the Middle East, Russia and Central Asia, APNIC for Asia and Australia, AfriNIC for Africa, and LACNIC for South America. RIRs are responsible for further splitting and allocating IP ranges and accepting domain registrations from their customers, such as ISPs and organizations. The RIR stores information about the entity being assigned an IP range or domain name in its own records. WHOIS, on the other hand, is a query and response protocol for querying such databases maintained by the RIRs or other subsidiaries, such as companies providing domain registration services. The WHOIS protocol is further described in RFC 3912. Information retrieved from a WHOIS server can include the domain name assigned to a network, the IP address block assigned, or the Autonomous System. An Autonomous System (AS) is characterized by its Autonomous System Number (ASN) and identifies a set of routing prefixes under the control of a single entity. The Autonomous System publishes a well-defined routing policy and is used for interconnecting large-scale networks on the Internet using the BGP protocol. In the case of digital investigations that involve network activity with remote IP addresses, it is of significant value to acquire more information regarding the entity responsible for managing the network to which an IP address belongs.
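Whether a given IP address falls inside an address block returned by WHOIS can be checked mechanically. The sketch below, using Python's standard ipaddress module, assumes the block is expressed in the common ‘first address - last address’ inetnum notation; the sample range is illustrative, not an actual allocation.

```python
import ipaddress

# Sketch: testing whether an IP address lies inside a WHOIS range given
# as "first address - last address". The sample range is invented.
def in_range(ip, whois_range):
    first, last = (ipaddress.ip_address(p.strip())
                   for p in whois_range.split("-"))
    return first <= ipaddress.ip_address(ip) <= last

print(in_range("193.0.6.139", "193.0.0.0 - 193.0.23.255"))  # True
print(in_range("8.8.8.8",     "193.0.0.0 - 193.0.23.255"))  # False
```

This membership test is exactly the relationship that the containsIP/isContainedInRange properties of the WHOIS ontology materialize as triples, so that it becomes available to reasoning and querying.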
Information acquired using the WHOIS protocol and the DNS infrastructure is commonly used to monitor whole networks for possible malicious activity, such as spamming or malware distribution, and to further notify the responsible network operators or blacklist them in order to isolate them from the rest of the Internet. There is a variety of tools for almost all modern operating systems for submitting WHOIS queries. These tools include both command-line and graphical ones, as well as specialized web services provided by various web sites. The RIPE Network Coordination Centre provides such a web interface to its database that can be used to submit WHOIS queries over HTTP and display the results in the browser. Recently, RIPE has integrated its system with those of the other RIRs, thus providing a unified query interface. The online service can be accessed at https://apps.db.ripe.net/search/query.html. The results can be returned in either XML or JSON format. This service, or others similar to it, can be used in the context of a digital investigation in order to provide a mapping between IP addresses and the networks to which they belong, and, based on the ASNs of those networks, to further integrate them with other sets of information regarding malicious or blacklisted networks. A quite simple ontology has been defined in the context of this research in order to semantically represent the information that can be retrieved from a WHOIS query. A graphical visualization of the ontology is presented in Figure 13. The main approach followed during the design of this ontology is that an IP address belongs to an IP address block (an IP range) and is managed by an Autonomous System (AS).

Figure 13: Ontological modeling of WHOIS data as provided by RIPE

A brief discussion of the defined classes, object and data properties follows:

Table 8: Entities of the RIPE WHOIS Ontology

CLASSES
ASSystem: A class that represents an Autonomous System.
IPRange: A class representing blocks of IP addresses.
IPAddress: A class representing individual IP addresses.

OBJECT PROPERTIES
containsRange, isContainedInAS: This object property connects individuals that are members of the ASSystem class with individuals that are members of the IPRange class; the inverse object property provides the reverse relationship. Using this property, an AS is connected with the IP address blocks that have been assigned to it.
containsIP, isContainedInRange: This object property connects individuals that are members of the IPRange class with individuals that are members of the IPAddress class; the inverse object property establishes the inverse relationship. This property is used to map an individual IP address to the IP address block to which it belongs.

DATA PROPERTIES
hasASNumber: This data property provides the numerical value of an Autonomous System Number.
hasNetName: A descriptive name of the network responsible for a specific IP address range, as provided by the RIR.
hasCountry: A two-letter country code that indicates the country in which the specified network operator is located.
hasRange: A lexical representation of an IP address block in the form of a starting to an ending IP address.
hasAddress: The lexical value of an IP address.

7.2.5 Malicious Networks Ontology

The modern Internet threat landscape is continuously becoming more complex. Over the last decade a shift has manifested from server-side attacks to client-side attacks. Modern attack vectors include sophisticated techniques such as phishing sites and drive-by downloads, where the user is tricked or system vulnerabilities are exploited in order to infect a client machine with malware.
Compromised machines often become members of botnets: large sets of compromised hosts under the control of a criminal group that are further used for distributing spam e-mails or performing other types of activities, such as distributed denial of service (DDoS) attacks against specified targets, or acting as proxies for the main malware-distributing servers in order to evade detection (fast-flux networks). These techniques are quite complex, and a lot of research on these topics has been conducted and is ongoing. For the purposes of this research, though, we consider the knowledge that an IP address, or network in general, which appears in the communication logs that an investigator examines demonstrates such malicious behavior to be quite important. Such information can be used by an investigator, especially in cases of network-related incidents, to quickly identify suspicious traffic that any of the examined systems may have had with such malicious Internet hosts. There are a large number of projects with the purpose of actively or passively monitoring Internet hosts in order to detect such malicious behavior. Most commonly, such projects produce blacklists of IPs that have been observed to perform malicious actions such as sending spam emails, hosting malicious web pages or performing intense scanning activities. In this project, we have considered the FIRE project (Stone-Gross et al. 2009), which is part of the European FP7 WOMBAT project (http://www.wombatproject.eu/). The project aggregates and correlates security-related information from a variety of sources, such as the Anubis software, which monitors the actions performed by a malicious Windows executable; Wepawet, for the analysis of malicious JavaScript, PDF or Flash files that can be contained in a web page; lists of spam and advertising URLs such as SpamCop; and lists of phishing sites such as PhishTank.
The project correlates all these data sources and, along with information about the ASNs to which these IP addresses belong, is able to identify networks that contain hosts that consistently exhibit malicious behavior. The results are further communicated via the project's website at http://maliciousnetworks.org. In the context of this research, a lightweight ontology has been designed, as graphically presented in Figure 14. The main design approach has been that an Autonomous System has a set of Hosts that may be further characterized as Malicious Hosts in the case that the FIRE project, based on its analysis, determines so. The FIRE project separates malicious hosts into three main categories based on their behavior, namely ‘Phishing Server’ for hosting phishing web sites, ‘Exploit Server’ for hosting malware files such as Windows executables, and ‘CCServer’ in the case that they act as Command & Control servers for managing botnets.

Figure 14: Ontological modeling of FIRE's blacklist of malicious networks/hosts

A brief description of the defined classes, object properties and data properties follows:

Table 9: Entities of the MaliciousNetworks Ontology

CLASSES
AS: The class represents the concept of an Autonomous System.
Country: The class represents the concept of a country.
Host: The class represents network hosts. A direct subclass is the ‘MaliciousHost’ class, which is further sub-classed into the ‘PhishingServer’, ‘ExploitServer’ and ‘CCServer’ classes.
IPAddress: The class represents the concept of an IP address.

OBJECT PROPERTIES
containsHost, isContainedInAS: This object property and its inverse connect individuals that are members of the AS class with individuals that are members of the Host class.
locatedIn: This object property connects members of the Host class with members of the Country class. The FIRE project further correlates IP addresses with IP geolocation databases in order to annotate a host with the country in which it is most likely located.
DATA PROPERTIES
hasASName: A string value of a descriptive name of the Autonomous System.
hasASNumber: The integer value of the Autonomous System Number.
hasCountryName: The lexical value of a country's name.
hasIPAddressString: The lexical value of an IP address.

7.2.6 Malware Detection Ontology

A final Supportive Data source that has been used in this project is a malware detection service. It is common practice, especially in cases where compromised systems are involved, to scan the files found in a system's storage image with an antimalware engine in order to detect any traces of malware resident on the system. Investigators may use either an antimalware product of their choice or similar web services. One such web service is the site ‘VirusTotal’ (https://www.virustotal.com/). This free online service provides access to a large set of commercial and free antimalware engines, such as AVG, Avast, McAfee and Symantec, and returns a summary of the results of each one of these engines. The use of this service provides increased accuracy when analyzing a suspicious file, since hardly any single engine can claim a 100% detection rate. One limitation is that the web service returns search results only in cases where a file with the queried hash value has previously been submitted to and analyzed by the service. Unfortunately, there is no common terminology or naming convention followed by all these different vendors and, as such, the results of a file analysis are mostly of a descriptive form following each vendor's naming scheme. Accordingly, a quite simple ontology has been defined in order to semantically represent the results of such a file analysis, as graphically shown in Figure 15. The main design approach has been that a File object is analyzed by the ‘VirusTotal’ service, which in return provides an Antivirus Report with the description that each one of the engines returned.
Figure 15: Ontological modeling of VirusTotal's anti-malware detection service

A brief description of the defined classes, object properties and data properties follows:

Table 10: Entities of the VirusTotal ontology

Classes:
File – The class represents a File object, commonly extracted from a forensic disk image or a network packet capture, that is submitted for analysis for possible malware behavior.
AntivirusEngine – The class represents an antivirus/antimalware engine. Individuals of this class are the engines currently supported by the ‘VirusTotal’ service, which amount to over 30 at present.
AntivirusReport – The class represents a collection of the results of the different engines as returned by the service.

Object properties:
hasAVReport – This object property connects an individual of the File class with an individual of the AntivirusReport class.
hasResult – This object property connects an individual of the AntivirusReport class with a blank node that is used to represent the result obtained from one antimalware engine.
generatedBy – This object property connects the aforementioned blank node, representing an engine result, with the individual of the AntivirusEngine class that represents the specific engine providing the result.

Data properties:
hasAVName – The name of the engine, formatted as a string.
hasDate – The date on which the report was produced.
hasMD5Hash – The MD5 hash value of the submitted file object.
hasPermanentLink – A URL at which the ‘VirusTotal’ service stores the generated report for future reference.
hasResultDescription – The lexical representation of the output of the antimalware engine.
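To make this modeling pattern concrete, the following minimal Python sketch builds the set of triples for one file's report, using one blank node per engine result. This is an illustration only, not the thesis's Java implementation; all URIs are hypothetical, while the property names mirror the entities of Table 10 (the MD5 shown is the well-known EICAR test file hash).

```python
import itertools

_bnode = itertools.count()  # fresh blank-node identifiers

def virustotal_axioms(md5, engine_results):
    """Sketch of the VirusTotal ontology pattern: a File has an
    AntivirusReport whose per-engine results are blank nodes."""
    file_uri, report = f"vt#File_{md5}", f"vt#Report_{md5}"
    axioms = {(file_uri, "vt:hasMD5Hash", md5),
              (file_uri, "vt:hasAVReport", report)}
    for engine, verdict in engine_results.items():
        b = f"_:result{next(_bnode)}"   # blank node for one engine's result
        axioms |= {(report, "vt:hasResult", b),
                   (b, "vt:generatedBy", f"vt#{engine}"),
                   (b, "vt:hasResultDescription", verdict)}
    return axioms

ax = virustotal_axioms("44d88612fea8a8f36de82e1278abb02f",
                       {"AVG": "EICAR_Test", "McAfee": "EICAR test file"})
```

Each blank node groups one engine's verdict with the engine that produced it, so reports from services with no shared naming convention can still be queried uniformly.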
7.3 Semantic Integration and Correlation of Forensic Evidence

Building on the semantic descriptions presented in the previous section, an automated method can access evidence files that are pertinent to a case and structured in the supported formats described before, and represent them in a semantic manner. The data contained in these formats can be transformed into semantic representations using the classes, object properties and data properties defined in the respective ontology. The result of this operation is a set of axioms following the underlying fundamental data model of ‘subject-predicate-object’. This set of axioms can also be represented graphically in the form of a graph by specialized visualization platforms. A conceptual representation of this approach is presented in Figure 16. Source data can come in different forms, such as native formats like ‘pcap’ for network packet captures and ‘ewf’, ‘dd’ or ‘aff’ for disk images. By native formats we mean the direct output of acquisition tools such as disk imaging software or hardware, or network monitoring tools. Another category of source data is the output of specialized forensic tools that preprocess the native evidence format and introduce a first level of abstraction. Tools in this category include, for example, file system parsers that reconstruct the file system resident in a disk image and output the structure of directories and files along with their content and metadata, or network stream assemblers that group network packets into higher-layer streams and connections such as TCP/UDP streams, application-layer messages or transmitted files. The final category is the supportive data exchange formats, covering all the different types of knowledge bases that can provide additional information, for instance about IP addresses, networks or files.
There is a plethora of formats in which this data can be contained, including online HTML pages, XML-formatted web service responses, data stored in relational databases, custom text formats and many more.

Figure 16: Transformation process of raw data to their semantic representation

The output of this transformation process is the set of axioms, graphically represented by a graph structure, where individual resources, as members of the classes defined in the respective ontology (represented by circles in the figure), are interconnected by object properties and associated with data values by data properties. Each resource is uniquely identified by its URI under a namespace that follows a naming scheme decided by the examiner. A simple naming scheme is to use the file name of the evidence file, or the name of the source of supportive data, appended with a descriptive name of the resource, e.g. an IP address, a filename etc. The transformation process can either be natively supported by forensic acquisition or analysis tools, where an export function to a set of ontological axioms is provided, or be performed by specialized software components/parsers. In the next step, the generated sets of axioms can be merged in order to provide a complete ontological representation of all the sources of data involved in the case. During the merging, some apparent issues that need further treatment are those of integration and correlation within and between the different data sets, which are discussed in the subsequent sections.

7.3.1 Semantic Integration

In order to reduce the complexity of the asserted axioms generated by the aforementioned process, as well as to establish semantic relationships between multiple identical or closely related individuals, an integration process is required.
Three issues have been identified that the integration step can resolve, namely: integrating the same individuals within the same set of axioms; integrating the same individuals within different sets of axioms expressed under the same ontology; and integrating the same or similar individuals within different sets of axioms expressed under different ontologies. The first issue can be tackled through de-duplication. As an example, consider an IP address that appears in multiple network communications with different hosts in the same network capture file. In order to reduce the complexity of subsequent reasoning, it is important that the transformation process creates a single OWL Individual for each IP address. The RDF data model allows the reuse of this resource as either the subject or the object of other axioms, which promotes integration of axioms that have shared members. An example is given in Figure 17, where two different TCP sessions are connected with the same resource representing an individual that is a member of the IP address class. The two sessions are even connected with this resource via different object properties, as an IP address can act simultaneously both as a client of a remote service and as a server providing other network services. The forensic tool or the respective parser can easily support such a feature with an internal Set data structure that keeps a single URI resource for each distinct value.

Figure 17: De-duplication of data by semantic integration using URIs

The second issue is the integration of individuals representing the same entity when present under different namespaces. Each forensic tool or parser can transform a specific type of evidence into an ontological set of axioms under a unique per-source namespace. In the case of multiple evidence files of the same type, there is the possibility that multiple OWL Individuals representing the same entity are created under each file's namespace, e.g.
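The de-duplication just described can be sketched in a few lines. The following Python fragment (illustrative only; the PoC itself is in Java, and the namespace shown is hypothetical) shows a parser keeping an internal map so that every distinct value yields exactly one URI:

```python
class UriInterner:
    """Keeps exactly one URI per distinct value (e.g. an IP address),
    so repeated occurrences in an evidence file map to one resource."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._seen = {}  # (kind, value) -> URI; plays the role of the internal Set

    def uri_for(self, kind, value):
        key = (kind, value)
        if key not in self._seen:
            self._seen[key] = f"{self.namespace}#{kind}_{value.replace('.', '_')}"
        return self._seen[key]

interner = UriInterner("http://example.org/pcap1")   # hypothetical namespace
a = interner.uri_for("IPAddress", "10.0.0.1")
b = interner.uri_for("IPAddress", "10.0.0.1")        # same value, same URI
c = interner.uri_for("IPAddress", "10.0.0.2")        # new value, new URI
```

Any axiom the parser emits afterwards reuses the interned URI as subject or object, which is what allows the two TCP sessions of Figure 17 to share a single IP address individual.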
the same IP address being present in multiple network captures. OWL 2 has introduced support for keys (Parsia et al. 2008), a DL-safe form of inverse functional properties with support for data values as well. By defining a HasKey axiom, named instances of a class that have the same values on specified object and/or data properties can be considered to be the same, and the object relationship owl:sameAs connecting them can be inferred by a reasoning engine. A last issue is the integration of individuals that may represent the same or similar concepts across different ontologies. An example is the concept of the IP address, which is present in network captures as well as in firewall logs and many more network-related sources of data. As discussed before, an advantageous approach would be to specify well-accepted ontologies covering such generic terms, to be imported and reused in tool-specific ontologies. However, this thesis follows a more realistic approach by accepting that each tool or data source is associated with its respective ontology without any shared common base. This introduces the problem that classes in different ontologies may represent the same or closely related concepts, thus hampering automated integration of this data. Although research efforts exist on automated or semi-automated multi-ontology alignment (Jean-Mary et al. 2009), in the context of this thesis a more manual approach is applied. The solution followed in this thesis is the introduction of SWRL rules, which can be used to infer additional axioms that the expressivity limitations of OWL cannot support. SWRL rule evaluation by a rule engine can establish interconnections between individuals that are members of classes belonging to different ontologies but representing related concepts. These new connections can be established in the form of new object properties with a descriptive name and appropriately defined domain and range restrictions.
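The effect of a HasKey axiom can be approximated procedurally. The sketch below (a simplification in Python, not the reasoner's actual algorithm; individual IRIs are hypothetical) groups named individuals by their key property values and emits the owl:sameAs pairs a reasoner would infer:

```python
from collections import defaultdict
from itertools import combinations

def infer_same_as(individuals, key_props):
    """DL-safe key reasoning sketch: named individuals that agree on
    all key properties are inferred to be owl:sameAs each other."""
    groups = defaultdict(list)
    for iri, props in individuals.items():
        try:
            groups[tuple(props[p] for p in key_props)].append(iri)
        except KeyError:
            continue  # keys apply only when all key values are present
    same_as = set()
    for members in groups.values():
        for x, y in combinations(sorted(members), 2):
            same_as.add((x, y))
    return same_as

# Two captures, each minting its own IPAddress individual for 10.0.0.1:
inds = {
    "pcap1#ip_10_0_0_1": {"hasIPAddressString": "10.0.0.1"},
    "pcap2#ip_10_0_0_1": {"hasIPAddressString": "10.0.0.1"},
    "pcap2#ip_10_0_0_2": {"hasIPAddressString": "10.0.0.2"},
}
pairs = infer_same_as(inds, ["hasIPAddressString"])
```

Note the DL-safe restriction: only named individuals participate, which is exactly why keys, unlike plain inverse functional properties, may also range over data values.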
An example is an object property named ‘FWHostToPacketCaptureIPAddress’, with individuals that are members of the class ‘Host’ of the WindowsXPFirewall ontology as the domain restriction and individuals that are members of the class ‘IPAddress’ of the PacketCapture ontology as the range restriction. The axioms defining these ‘bridging’ object properties can be included either in the respective ontologies that need to be integrated or in a new dedicated ontology. This thesis follows the second approach so as to decouple the different domain ontologies, which can then be developed and expanded separately from each other. A new ontology is designed that imports the various domain ontologies the provided sources of data may require, and in which such ‘bridging’ object properties along with other concepts of a combinatorial nature can be defined. A graphical example is given in Figure 18, where the green arrow represents an object property that ‘bridges’ two individuals that are members of different classes of different ontologies. In this case, an individual of the ‘IPAddress’ class from the ‘PacketCapture’ ontology is connected with an individual of the ‘Host’ class from the ‘WindowsXPFirewall’ ontology. This allows the integration of different datasets, enabling an automated way to combine data from different sources as well as further advanced reasoning and query capabilities.
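The bridging behavior of such a SWRL rule can be illustrated outside the rule engine. The following Python sketch (property and class IRIs are hypothetical; the thesis itself expresses this as SWRL evaluated by Jess) links individuals of two classes from different ontologies whenever their literal values coincide, e.g. the same IP address string:

```python
def bridge(triples, class_a, class_b, value_prop_a, value_prop_b, bridge_prop):
    """Rule-style sketch: assert bridge_prop between members of two
    classes from different ontologies whose literal values match."""
    def members_with_value(cls, prop):
        inds = {s for (s, p, o) in triples if p == "rdf:type" and o == cls}
        return {s: o for (s, p, o) in triples if s in inds and p == prop}
    a_vals = members_with_value(class_a, value_prop_a)
    b_vals = members_with_value(class_b, value_prop_b)
    return {(sa, bridge_prop, sb)
            for sa, va in a_vals.items()
            for sb, vb in b_vals.items() if va == vb}

triples = {
    ("pcap:ip1", "rdf:type", "pcap:IPAddress"),
    ("pcap:ip1", "pcap:hasValue", "192.168.1.5"),
    ("fw:host1", "rdf:type", "fw:Host"),
    ("fw:host1", "fw:hasAddress", "192.168.1.5"),
}
new = bridge(triples, "pcap:IPAddress", "fw:Host",
             "pcap:hasValue", "fw:hasAddress", "bridge:PcapIPToFWLogHost")
```

The inferred triples are added back to the merged dataset, after which queries can traverse from a packet capture into the firewall log through the new property.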
Figure 18: Semantic Integration of related individuals represented in different ontologies

Based on the types of data sources that this thesis focuses on and the ontologies specified in the previous section, the following similar classes between different ontologies have been identified, and respective object properties have been specified for interlinking individuals that are members of these classes:

Table 11: Integration semantic mappings between ontologies

Ontology A : Class A            Ontology B : Class B             Linking Object Property
PacketCapture : IPAddress       WindowsXPFirewallLog : Host      PcapIPToFWLogHost
PacketCapture : IPAddress       WHOIS : IPAddress                PcapIPToWHOISIpAddr
PacketCapture : IPAddress       FIRE : IPAddress                 PcapIPToFireIPAddr
WindowsXPFirewallLog : Host     WHOIS : IPAddress                FWLogHostToWHOISIpAddr
WindowsXPFirewallLog : Host     FIRE : Host                      FWLogHostToFireHost
WHOIS : IPAddress               FIRE : IPAddress                 WHOISIpAddrToFireIPAddr
PacketCapture : Port            WindowsXPFirewallLog : Port      PcapPortToFWLogPort
HTTP : Content                  DigitalMedia : File              HTTPContentToMediaFile
HTTP : Content                  VirusTotal : File                HTTPContentToVTFile
DigitalMedia : File             VirusTotal : File                MediaFileToVTFile

The defined linking properties focus on two main types of information, namely IP addresses and hash values of files. The IP address is a piece of information present in almost any type of network-related log and is quite fundamental in every network forensics investigation. An IP address can be observed either as the source or destination endpoint of a network packet or communication stream, or as the source or destination of a logged event from various network security appliances such as network or host firewalls, IDS, VPN and network authentication logs etc.
The ability to link multiple observations of the same IP address in different data sources in an automated manner enables the reconstruction of the actions that an IP address has performed and its history of communications, the acquisition of additional information about the network and the operator that owns it, as well as cross-checking against IP reputation and ban lists for possible malicious behavior. On the other hand, a file signature such as an MD5 or SHA1 hash value can be used to reconstruct the trail of a file from its network transmission up to its storage on the media device, as well as for cross-checking with antimalware or other file analysis services that can provide more information about its content or malicious nature. These two types of integrable information are visually presented in Figure 19.

Figure 19: Integration of IP addresses/MD5 hash signatures

7.3.2 Evidence Correlation

Besides integration, which strives to interlink individual resources that represent the same or closely related concepts, correlation enables the investigator to connect resources of totally different natures. As such, in the context of this thesis we use the term correlation for the ability to automatically establish relations between resources of different natures, of the same or different domains/ontologies, either by the reasoner or by rule evaluation. In order to correlate resources of different types there has to be a common reference, some type of information to which these resources are directly or indirectly connected. Different types of correlation techniques are explicitly or implicitly used in digital investigations, such as temporal, spatial, mereological, size-based, IP-to-user and many more. In this thesis we focus on two types of correlation, namely temporal and mereological, which are further discussed below.

7.3.2.1 Temporal Correlation

Time is of paramount importance in almost every type of digital investigation.
A variety of different resources can carry time-related metadata, including the transmission time of a network packet, the duration of a network communication, timestamps of file activities on a file system such as creation, modification and last access times, timestamps of logged events from firewalls, IDS, operating systems and many more. Correlating different resources based on their reference to time enables the investigator to reconstruct a global timeline of events that incorporates events from heterogeneous sources such as disk images, network activity, logs etc. One forensic tool worth mentioning is log2timeline (Guðjónsson 2010), which supports a large number of different input formats and merges different types of events into one combined timeline. One problem with this approach, though, is that although the output format of the tool supports advanced visualization techniques, events are linked only sequentially based on their timestamps, without support for the more advanced types of queries discussed below. One point that needs attention is the translation of the different timestamps to a common format and locale. Timestamps may be expressed in different time zones, depending on the local time-zone settings of each data source, which can create inconsistent results when timestamps from different sources are combined. We follow the assumption that this translation is part of the preprocessing phase and that the timestamps given as input to the method are expressed in a common format and time zone. The approach followed in this thesis for semantically describing time-related information, as well as for performing temporal correlation of resources, is based on the method discussed in (M. J. O'Connor & Das 2011).
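The preprocessing assumption above amounts to normalizing every source's timestamps to one zone and format. A minimal Python sketch of such a step (the timestamp format and offsets are assumptions for illustration; a real preprocessor must know each source's actual format and zone):

```python
from datetime import datetime, timedelta, timezone

def to_utc_xsd(datetime_str, utc_offset_hours):
    """Normalize a local timestamp from a data source to a UTC
    xsd:dateTime string, so events from sources recorded in different
    time zones line up on one common timeline."""
    local = datetime.strptime(datetime_str, "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# A firewall logging in UTC+2 and a packet capture already in UTC
# record the same instant:
fw_event = to_utc_xsd("2012-03-01 14:30:00", 2)
pcap_pkt = to_utc_xsd("2012-03-01 12:30:00", 0)
```

Without this step, the 14:30 firewall entry and the 12:30 packet would wrongly appear two hours apart on the merged timeline.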
The proposed SWRL Temporal Ontology is based on the valid-time temporal model, in which a fact or proposition can be associated with time instants or time intervals during which the fact is considered to be true or valid. The SWRL Temporal Ontology provides various OWL entities in the form of classes, object properties and data properties that can be used to represent arbitrary propositions, time instants and intervals, as well as the granularity of the temporal information. The ontology specifies the class ‘ExtendedProposition’ as a semantic representation of any type of fact or proposition that can carry temporal information. An instance of this class can be connected via the ‘hasValidTime’ object property with instances of the ‘ValidInstant’ or ‘ValidPeriod’ classes, which represent time instants and periods respectively. There are two main ways to introduce semantically expressed temporal information into existing ontologies. The first is to modify the existing ontologies by adding new properties to declared classes, with their range pointing to instances of the ‘ValidInstant’ or ‘ValidPeriod’ classes. The problem with this approach is that modifications may be needed to a large number of ontologies, and inconsistencies may arise between modified and non-modified ontologies. The other approach is to use sub-classing, declaring classes of the existing ontologies to be subclasses of the ‘ExtendedProposition’ class. One advantage of the latter approach is that the original ontology does not have to be modified, as the sub-classing axioms can be specified in an external ontology that imports the original ones. The only problem with this approach is that axioms asserted by the transformation process that include references to date and time values have to be re-expressed, so as to introduce instances of the new temporal classes and link to them properly.
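The required re-expression can be pictured as a triple rewrite. The sketch below (a Python illustration, not the thesis code; the instant-naming scheme is an assumption, while the ‘hasValidTime’/‘ValidInstant’/‘hasTime’ names follow the SWRL Temporal Ontology) replaces a domain-specific timestamp data property with a linked ValidInstant individual:

```python
def temporalize(axioms, time_prop, onto_prefix="temporal"):
    """Rewrite sketch: replace a domain-specific timestamp data property
    with a ValidInstant individual linked via hasValidTime/hasTime."""
    out, n = set(), 0
    for (s, p, o) in axioms:
        if p == time_prop:
            instant = f"{onto_prefix}#Instant{n}"  # one individual per timestamp
            n += 1
            out |= {(s, "temporal:hasValidTime", instant),
                    (instant, "rdf:type", "temporal:ValidInstant"),
                    (instant, "temporal:hasTime", o)}
        else:
            out.add((s, p, o))
    return out

ax = {("pcap:TCP1", "pcap:hasStartTimestamp", "2012-03-01T12:30:00Z"),
      ("pcap:TCP1", "rdf:type", "pcap:TCPSession")}
new = temporalize(ax, "pcap:hasStartTimestamp")
```

For periods, the same idea applies with a ‘ValidPeriod’ individual carrying ‘hasStartTime’ and ‘hasFinishTime’ instead of a single ‘hasTime’ value.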
This process is graphically represented in Figure 20, where an original axiom about a timestamp with a domain-specific data property is converted so as to use the SWRL Temporal Ontology.

Figure 20: Conversion process according to the SWRL Temporal Ontology

Specifically, the class to which the instance belongs is declared to be a subclass of the ‘ExtendedProposition’ class. An individual is created for every distinct timestamp as a member of the class ‘ValidInstant’, or of the class ‘ValidPeriod’ in case a time period needs to be represented. This new individual is then linked by the data property ‘hasTime’ to the literal value of the XML Schema dateTime type. In the case that the individual represents a time period, two data properties named ‘hasStartTime’ and ‘hasFinishTime’ are used to connect it to the two time endpoints. After this conversion operation finishes, all time-related information has been converted into a uniform semantic representation that can be further leveraged to perform temporal correlations between individuals of different types. In this thesis, we have used Allen's Interval Algebra (AIA) as described in (Allen 1983), which provides a description of the different relationships that time intervals can have between them. AIA defines 13 relations between time intervals, which are graphically represented in Figure 21.

Figure 21: Temporal relations of Allen's Interval Algebra

The rules that define the relations between the start and end timestamps of two intervals, and the resulting predicate that links them, have been encoded in corresponding SWRL rules. Besides the above relations that pertain to time intervals, (Hobbs & Pan 2004) have defined additional predicates for relations between time instants and time intervals.
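The endpoint comparisons behind these SWRL rules are simple to state. The following Python sketch classifies two intervals into the seven base Allen relations (the remaining six of the 13 are the inverses, obtained by swapping the arguments); this is an illustration of the algebra, not the thesis's SWRL encoding:

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a = (start, end) to b.
    Covers the seven base relations; the other six are their inverses."""
    (as_, ae), (bs, be) = a, b
    if ae < bs:
        return "before"
    if ae == bs:
        return "meets"
    if as_ == bs and ae == be:
        return "equals"
    if as_ == bs and ae < be:
        return "starts"
    if as_ > bs and ae == be:
        return "finishes"
    if as_ > bs and ae < be:
        return "during"
    if as_ < bs and bs < ae < be:
        return "overlaps"
    # a relates to b by the inverse of one of the relations above
    return "inverse"
```

In the SWRL encoding, each such comparison on the ‘hasTime’/‘hasStartTime’/‘hasFinishTime’ values of two temporal individuals asserts the corresponding object property between them.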
The predicate ‘inside’ can be used to describe a time instant that lies between the start and end points of an interval, while the predicates ‘before’ and ‘after’ can be used when a time instant is before the start or after the end of an interval, respectively. The research of (M. J. O'Connor & Das 2011) has provided a set of SWRL built-ins implementing the Allen temporal operators on temporal entities; these can be used in SWRL rules which, upon evaluation, assert new axioms that utilize the temporal predicates. The predicates described above have been defined in a separate ontology that imports all the domain-specific ones, with the aim of minimizing the need for modifications to the original ontologies. The domain and range of these object properties have been defined to be instances of the Event class, a generic class used to represent any type of event that can be referred to with respect to time.

7.3.2.2 Mereological Correlation

The other type of correlation studied in this thesis is part-to-whole relations, which belong to the field of ‘mereology’. These relations commonly represent the connections between the parts of an entity and the entity itself. A common characteristic of such relations is transitivity: if A is part of B and B is part of C, then A is also part of C. The semantics of this type of relation must not be confused with similar types of relations such as containment, membership and sub-classing, as the semantics of being part of something, as well as the transitivity, are lacking in the latter cases (Keet & Artale 2007). One such relation used in the selected types of data sources is the correlation between IP networks and Autonomous Systems (AS). An Autonomous System is commonly a collection of IP prefixes managed by the same network operator.
By correlating instances of IP addresses contained in either network packet captures or firewall logs with the data sets provided by Internet registries such as RIPE, part-to-whole relations can be established between the IP addresses and the Autonomous Systems they are parts of. Another such relation is the connection between a disk image, a partition, a file and a sequence of bytes in the same data stream. More specifically, given the manner in which a modern file system like NTFS operates and the possibility of fragmentation, a logical file allocated in the file system can be split into a number of byte streams at different physical locations on the disk. A file, in turn, is also part of a partition, which is a separation of the disk's capacity into distinct areas that can host different file systems. Finally, the partition is itself part of the disk image, thus completing a chain of part-to-whole relations. Part-to-whole correlations between individuals of different types, such as those mentioned above, can be established either within the same ontology or between different ontologies. In the first case, the initial transformation process from the raw input data to its ontological representation can establish these relations via appropriately defined predicates in the respective ontology. However, the method does not rely on the domain-specific ontology always providing such expressivity, and therefore also allows the establishment of such relations via SWRL rules, which can later be utilized by the reasoning engine. This applies also in the second case, where these ‘partOf’ types of relations can be established between individuals originating from the ontological representations of different sources of data and their respective ontologies.
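The payoff of transitivity is that a reasoner derives the whole chain from the directly asserted links. A naive Python sketch of that inference (individual names are hypothetical; a DL reasoner does this for any property declared transitive):

```python
def transitive_closure(part_of):
    """Compute all inferred partOf pairs from the directly asserted
    ones, exploiting the transitivity of mereological relations."""
    closure = set(part_of)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# byte run -> file -> partition -> disk image:
asserted = {("run1", "file.doc"), ("file.doc", "partition1"),
            ("partition1", "disk.dd")}
inferred = transitive_closure(asserted)
```

From three asserted links, the closure also yields that the byte run is part of the partition and of the disk image, and the file part of the disk image, without those facts ever being stated explicitly.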
A graphical example of the establishment of these relations is presented in Figure 22, where an individual representing an IP address is connected through a ‘partOfAS’ predicate with a resource representing an Autonomous System. This correlation can provide additional information about different IP addresses that are under the same network operator, as well as the inverse relation of which IP addresses belong to an Autonomous System.

Figure 22: Mereological correlation between IP addresses and Autonomous Systems

7.4 Query Formulation and Evaluation

After the semantic representations of the different evidence files and supportive data have been interconnected through the inference and rule engines, a single integrated dataset is available to the investigator for further examination. SPARQL has been standardized as the recommended query language for the Semantic Web. SPARQL can operate on RDF data such as those produced by the semantic representation phase. The use of SPARQL allows the investigator to query the set of asserted and inferred RDF triples via pattern matching. In this thesis, the SELECT type of query has been the main focus, since it is the primary means of information retrieval compared to the usage patterns of the other types of queries. The structure of a SELECT query can be divided into the following parts:

PREFIX: The prefixes establish a mapping between a URI and a shortcut that can be used in the rest of the query for improved readability.
FROM: This optional clause is used when multiple datasets exist. In this thesis all RDF triples were instead aggregated in a single dataset.
WHERE: This is the core part of the query, where triple patterns are specified against which the dataset is searched for possible matches.
LIMIT, OFFSET, ORDER BY: These are optional query modifiers with similar use as in SQL.
FILTER: This optional clause enables the query to include constraints that restrict the set of query results upon various criteria.
There is a variety of constraints that can be specified, such as XPath comparison tests between XSD-typed literals, regular expression checks against literal values, XPath arithmetic operations on numeric values etc. The set of the triple patterns specified in the WHERE clause forms a so-called ‘basic graph pattern’. Triple patterns are similar to RDF triples, with the only difference that any or all of the three parts (subject, predicate, object) can be a variable, syntactically denoted by the prefix ‘?’. The SPARQL processor searches the given RDF dataset, attempting to find a sub-graph of it that matches the given triple patterns. When such sub-graphs are found, the values of the dataset replace the corresponding variables in the ‘basic graph pattern’. The variables that have been matched to their corresponding values in the dataset are said to be bound, and all or a subset of them can be returned as the result of the query. SPARQL also supports OPTIONAL triple patterns, in which case a graph pattern that is only partially matched to the dataset, leaving some variables unbound, can still be included in the query results. An example of such graph pattern matching is shown in Figure 23, where a small set of triple patterns is queried against a larger dataset and its variables are bound to their respective values. This simplistic example shows a query that searches for TCP connections whose destination IP addresses belong to networks located in China. The graph pattern on the left has variables in the place of the RDF resources that are searched for. The SPARQL engine scans the whole graph of the given dataset in order to find existing graph patterns that match the queried one. On the right side, a portion of the dataset is presented, where the resources colored green are the ones that match the query's graph pattern. The variables of the query are bound to the values of these resources, URIs or literals.
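The matching semantics just described can be demonstrated with a toy basic-graph-pattern matcher. The Python sketch below (a didactic illustration of the mechanism, not a SPARQL engine; all IRIs are hypothetical, echoing the TCP/UDP example of Figure 23) binds ‘?’-prefixed variables pattern by pattern and yields only complete solutions:

```python
def match_bgp(patterns, triples, binding=None):
    """Minimal basic graph pattern matcher: terms starting with '?'
    are variables; yields every binding satisfying all patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    s, p, o = patterns[0]
    for (ts, tp, to) in triples:
        trial, ok = dict(binding), True
        for pat, val in ((s, ts), (p, tp), (o, to)):
            if pat.startswith("?"):
                if trial.setdefault(pat, val) != val:  # conflicting rebinding
                    ok = False
                    break
            elif pat != val:                           # constant mismatch
                ok = False
                break
        if ok:
            yield from match_bgp(patterns[1:], triples, trial)

data = [("pcap:TCP1", "rdf:type", "pcap:TCPSession"),
        ("pcap:TCP1", "pcap:hasDstIP", "pcap:ip1"),
        ("pcap:UDP1", "rdf:type", "pcap:UDPFlow"),
        ("pcap:UDP1", "pcap:hasDstIP", "pcap:ip1")]
query = [("?conn", "rdf:type", "pcap:TCPSession"),
         ("?conn", "pcap:hasDstIP", "?ip")]
results = list(match_bgp(query, data))
```

As in the figure, only the TCP session satisfies the full pattern; the UDP flow matches the second pattern but fails the type constraint, so it contributes no solution.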
As can be seen, barring the use of the OPTIONAL clause, a graph pattern has to be matched completely in order to be included in the results. For example, the individual pcap:UDP1 of Figure 23 is not included in the result, since it is a UDP flow instead of a TCP connection, although the remaining parts match the query pattern.

Figure 23: SPARQL graph pattern matching

7.5 A reference method implementation

So far, the method has been described on a generic level, elaborating on its different parts along with a discussion of how to perform key functions such as integrating and correlating individuals among the ontological representations of the various data sources. In this section, a reference implementation of the method is presented, along with a brief discussion of the tools and techniques used. Semantic Web technologies are continuously reshaping and gaining advanced capabilities with the introduction of new standards and technologies, so some of the ones used in this thesis may change significantly in the future. The method is quite flexible, and the presented proof-of-concept (PoC) system is just one of the different ways in which the method can be implemented. As discussed in the last section of this report, new advancements in distributed SPARQL endpoints and queries, or the Rule Interchange Format, can allow the integration of different technologies and even more distributed and decentralized architectures.

7.5.1 Overview of the tools used

The main tools used to build the reference implementation of the method are given below, along with a short description.

Java 1.6.0 Update 25 and the Eclipse IDE – Most of the code written for the proof-of-concept implementation has been in Java, and the Eclipse IDE has been used as a code editor along with its project management capabilities.

Protégé 4.1.0 – Protégé is an ontology development and editing tool.
Protégé has been used in order to specify the various domain ontologies employed in this work. Protégé also provides an embedded reasoning engine that allows the manual creation of individuals and the inference of new axioms based on the defined restrictions and class hierarchy. The tool can assist in the discovery of inconsistencies in the defined ontology as well as in visualizing its parts for easier understanding.

OWL API 3.2.4 – The OWL API is a Java API that provides a reference implementation for the programmatic creation, modification and serialization of OWL ontologies. The project is open source and already supports most of the features of OWL 2. The API provides a number of different parsers and can thus support different serialization formats for ontologies such as RDF/XML, OWL/XML, Turtle, KRSS and others. It also provides interfaces for integration with reasoning engines for performing inference on given ontologies. The API has been used extensively in order to transform the different raw input data to their respective ontological representations according to the specified domain ontologies.

Pellet 2.3.0 – Pellet is an OWL reasoning engine that provides an API for programmatic reasoning on OWL-based ontologies. Pellet is one of the most established reasoners for OWL, with support for most of OWL 2. Pellet can be integrated with the OWL API, thus providing reasoning services to the application.

Protégé-OWL API 3.4.8 – The Protégé editor has two major versions, 3.x and 4.x, which are mutually incompatible. Although Protégé 4.x uses the OWL API as its internal API for editing ontologies, Protégé 3.x uses the Protégé-OWL API, another Java library for the manipulation of OWL ontologies. This API has been used in this project in order to provide integration with a rule engine, as well as support for the SWRL Temporal Ontology and the SWRL built-ins, since the latter have been developed on top of it.
Jena 2.6.4 – Jena is a Java framework that contains an extensive set of libraries dealing with Semantic Web technologies. The framework has mostly focused on support for RDF and RDF Schema, with support for different storage mechanisms for sets of triples, such as in-memory or in relational databases, and provision of basic inference capabilities. Jena can also be integrated with DL reasoning engines such as Pellet for more advanced OWL 2 inference, while recent versions provide support for the manipulation of OWL ontologies as well. Jena also provides support for the SPARQL query language of the Semantic Web stack through its SPARQL API, which allows querying and updating of RDF knowledge bases. The SPARQL API is used in this project for the querying part of the method, although restricted to a command-line environment, without using the advanced data publishing capabilities over HTTP that the sub-project Fuseki provides.

Jess 7.1p2 – Jess is a Java-based rule engine that has the ability to reason upon a given knowledge base in the form of declarative rules. Jess is based on the Rete algorithm for efficient processing of rules (Forgy 1982). The research of (M. O'Connor et al. 2005) has led to the ‘SWRLTab’, which is a development environment integrated with Protégé as a plugin as well as an API that allows the definition and evaluation of SWRL rules. The ‘SWRLTab’ provides a bridging interface for the integration of its core APIs with different rule engines. Currently only the Jess rule engine is supported; although Jess does not support SWRL natively, the ‘SWRLTab’ internally handles the conversion between the OWL representation and the one Jess requires. The API allows the definition of SWRL rules in a convenient text-based format and automatically handles the import of the OWL model and the SWRL rules into the Jess rule engine, the evaluation of the rules and the transfer of any inferred knowledge from the Jess rule engine back to the OWL model.
Kraken PCAP 1.3.0 – The Kraken PCAP is a Java API that allows the programmatic manipulation of network packet captures in the 'pcap' format. This API provides a number of different network protocol processors that can handle various technical aspects of network communications such as IP fragmentation, TCP stream reassembly and even application layer protocols such as HTTP or SMTP decoding and the extraction of transmitted files. This library has been slightly modified in order to be better adjusted to the needs of this project and has been used for the ontological representation of network packet captures.

Apache HTTP Components, JSoup, JSON – Other supportive Java APIs that have been used include the Apache HTTP Components, which provide a toolset for the programmatic support of a basic HTTP client, the JSoup library, which eases the parsing of received HTML content for the extraction of various pieces of data, and a JSON library that allows the parsing of received data formatted in JSON (JavaScript Object Notation).

7.5.2 Architecture of the PoC system

The PoC system has been designed so as to support the selected types of data sources, namely network packet captures, hard disk forensic images in the form of the DFXML-formatted output of the 'fiwalk' tool, Windows XP firewall logs, IP registration information from the RIPE registry database, reputation lists of maliciously behaving autonomous systems and their hosts, and finally results of the online VirusTotal anti-malware service. A schematic overview of the implemented PoC system is presented in Figure 24 and its various components are further discussed below.

Figure 24: Proof-of-concept system architecture

The first component of the system is the Evidence Manager. The Evidence Manager is tasked with parsing the arguments given to the application regarding the locations and names of the different evidence files and with loading their byte contents for further processing.
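Beyond loading raw bytes, an evidence manager of this kind typically needs to recognize what kind of file it has been handed. The sketch below illustrates detection by leading "magic" bytes; it is a hedged illustration, not the PoC code — the class, method and enum names are invented, while the pcap magic constants and the '#Version:' header of Windows XP firewall logs are real format details.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of automatic evidence-type identification from the
// leading bytes of a file, used when no explicit type parameter is given.
public class EvidenceTypeSniffer {

    enum EvidenceType { PCAP, DFXML, FIREWALL_LOG, UNKNOWN }

    static EvidenceType identify(byte[] head) {
        if (head.length >= 4) {
            int magic = ((head[0] & 0xff) << 24) | ((head[1] & 0xff) << 16)
                      | ((head[2] & 0xff) << 8) | (head[3] & 0xff);
            // Classic libpcap global header magic, in either byte order.
            if (magic == 0xa1b2c3d4 || magic == 0xd4c3b2a1) {
                return EvidenceType.PCAP;
            }
        }
        String text = new String(head, StandardCharsets.US_ASCII);
        if (text.startsWith("<?xml")) {
            return EvidenceType.DFXML; // fiwalk output is an XML document
        }
        if (text.startsWith("#Version:")) {
            return EvidenceType.FIREWALL_LOG; // W3C-style log header
        }
        return EvidenceType.UNKNOWN;
    }
}
```

A real implementation would read only the first few bytes of each file rather than the whole content, and fall back to the user-supplied type when sniffing is inconclusive.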
The Evidence Manager is also responsible for identifying the type of a given source (e.g. pcap, disk image, firewall log), either manually based on a user-given parameter or automatically based on the contents or the signature of the input file, as well as for properly managing and categorizing all these evidence files. The Evidence Manager can also be responsible for verifying the integrity of the given files in the case that hash values of the original files are also provided. The Evidence Manager has been implemented to accept the locations of the evidence files in the form of command-line arguments, although this could also have been implemented using a graphical interface. The Evidence Manager can only load files from the examiner's local system, although future enhancements could enable it to load evidence files via the network as well. The Evidence Manager provides simple call interfaces that allow the other parts of the system to fetch files of specific types or to iterate over all the input evidence files.

The second component is the Semantic Parser. The Semantic Parser is a generic Java interface that contains the definition of a 'parseToOntology' method. In fact, two overloaded methods have been defined: one that accepts the path and name of an evidence file and another that accepts a collection of strings which, as discussed below, could be sets of IP addresses, MD5 hash values, timestamps etc. The method returns an object of the class 'OWLOntology', which is part of the OWL API and represents an OWL ontology along with the OWL axioms and OWL annotations it consists of. Concrete Java classes have been created that implement the above abstract methods for the different types of source data. As such, six parsers have been implemented, one for each source type, which ontologically represent the input data according to the specified ontologies.
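The parser interface just described could be sketched roughly as follows. To keep the sketch self-contained, 'Ontology' is a placeholder standing in for the OWL API's 'OWLOntology' class, and the concrete parser shown is illustrative rather than taken from the PoC code.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Minimal placeholder for org.semanticweb.owlapi.model.OWLOntology so the
// sketch compiles without the OWL API on the classpath.
class Ontology {
    final List<String> axioms; // stand-in for a set of OWL axioms
    Ontology(List<String> axioms) { this.axioms = axioms; }
}

// The generic interface with the two overloaded methods described in the
// text: one for a single evidence file, one for a collection of strings
// (IP addresses, MD5 hashes, timestamps, ...).
interface SemanticParser {
    Ontology parseToOntology(String evidenceFilePath);
    Ontology parseToOntology(Collection<String> values);
}

// Illustrative concrete parser: turns a list of IP address strings into
// pseudo-axioms. A real implementation would emit OWL class and property
// assertions via the OWL API according to the RIPE/WHOIS ontology.
class RipeSemanticParser implements SemanticParser {
    public Ontology parseToOntology(String evidenceFilePath) {
        throw new UnsupportedOperationException(
            "RIPE data is fetched per collected address, not from a file");
    }
    public Ontology parseToOntology(Collection<String> ipAddresses) {
        return new Ontology(ipAddresses.stream()
                .map(ip -> "ClassAssertion(WHOIS:IPAddress, " + ip + ")")
                .collect(Collectors.toList()));
    }
}
```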
As seen in the figure, the parsers also load the respective domain ontologies, i.e. the ontologies for the different formats of data or tool outputs as specified in §7.2. The Semantic Parser iteratively loads each of the files that the Evidence Manager handles along with the respective ontology and transforms the input data into a set of OWL axioms that contains the OWL individuals, their class memberships, the links between them by object properties as well as their data values by data properties. Semantic Parsers are able to load an evidence file from the examiner's local disk or to fetch data from online services such as VirusTotal and the RIPE database. One problem that the system had to address was the amount of axioms that a database like RIPE or VirusTotal could lead to. RIPE provides an interface to its registry database that can theoretically provide data for each one of the 4 billion IP addresses. Similarly, VirusTotal maintains a database of over 100 million files. In order to reduce the complexity of the implemented system, and given the practical limitation that specific online services such as RIPE or VirusTotal may block large amounts of parallel requests, the idea of collectors has been implemented. The collectors are list-like data structures that the different parsers have access to and fill with entries as the input data are processed. As such, collectors are responsible for maintaining a list of the entities that are actually observed in the main evidence files, which is later loaded into the Semantic Parser for further processing. Four collectors have been defined, namely the 'IPAddressCollector', the 'MD5HashCollector', the 'ASNumberCollector' and the 'TimestampCollector'. The 'IPAddressCollector' is filled with entries of IP addresses that the parsers of the packet captures and the firewall logs encounter.
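A collector of this kind can be as simple as a shared, deduplicating set of strings. The sketch below is an assumption about the shape of such a structure (the class and method names are illustrative), not the PoC code itself.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative collector: parsers call add() for every IP address they
// encounter; the set silently drops duplicates, so each address is queried
// against RIPE only once, keeping both the number of remote requests and
// the number of resulting axioms low.
public class IPAddressCollector {
    private final Set<String> addresses = new LinkedHashSet<>();

    public void add(String ipAddress) {
        addresses.add(ipAddress);
    }

    /** The final, duplicate-free set handed to the RIPE-specific parser. */
    public Set<String> entries() {
        return Collections.unmodifiableSet(addresses);
    }
}
```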
After finishing processing all the files of these types, the final list of IP addresses is then fed to the RIPE-specific semantic parser that iteratively queries the RIPE database in order to acquire additional information about these IP addresses and represent them semantically based on the respective ontology. The ‘MD5HashCollector’ maintains entries of MD5 hash values of files that are either found to be in a disk image or transferred via a network protocol like HTTP and extracted from a network packet capture. The list of MD5 hash values is then fed to the VirusTotal-specific semantic parser that iteratively queries the VirusTotal service through its online web service in case of any file being already examined and listed as malicious. The ‘ASNumberCollector’ is maintaining a list of the numbers of Autonomous Systems based on the results retrieved from the RIPE database. The AS numbers of networks whose IP addresses have been encountered during the processing of the evidence files are then used by the FIRE-specific semantic parser in order to search for networks that are blacklisted as containing malicious hosts. Finally the ‘TimestampCollector’ collects timestamps of either time instants as those found in firewall logs or MAC file timestamps from disk images or representing time intervals such as the beginning and end 76 timestamps of a TCP session. The timestamps are used by a specialized component which upon iteration generates individuals of the ‘ValidInstant’ and ‘ValidPeriod’ classes as defined in the SWRL Temporal ontology which are later connected with other resources upon evaluation of SWRL rules. The collectors are not an essential part of the method as they can be implemented in different ways. One solution could be if online services like RIPE and VirusTotal provided SPARQL endpoints that could enable remote implementations of the method query and fetch axioms relevant to the encountered entities. 
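The instant/period distinction that the 'TimestampCollector' maintains can be sketched with the standard java.time API; the class and its fields are hypothetical illustrations of the idea, not the PoC implementation.

```java
import java.time.Instant;

// A collected timestamp is either a single instant (end == null), e.g. a
// firewall log entry or a MAC file timestamp, or a period with a start and
// an end, e.g. the lifetime of a TCP session. Each entry later becomes a
// 'ValidInstant' or 'ValidPeriod' individual of the SWRL Temporal ontology.
public class CollectedTimestamp {
    final Instant start;
    final Instant end; // null for a plain instant

    CollectedTimestamp(Instant instant) { this(instant, null); }
    CollectedTimestamp(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    boolean isPeriod() { return end != null; }

    /** Name of the SWRL Temporal ontology class the entry maps to. */
    String temporalClass() { return isPeriod() ? "ValidPeriod" : "ValidInstant"; }
}
```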
In such a case, the integration of individuals created from the domain ontologies could be performed dynamically against the online sets of OWL axioms and their URI references. Another point is that the semantic parsers have been implemented in a more involved manner, since they have to accept object references to the collectors' data structures and update them appropriately. A different, although more complex, technique that could have been adopted is the iteration over the set of axioms that a semantic parser returns and, for example, the extraction of all datatype properties of interest (e.g. all the datatype properties that are of type XML Schema dateTime or of custom-defined types representing, for instance, IP addresses). In an additional effort to reduce the complexity of the resulting set of axioms, a technique commonly used in digital forensics with respect to known file hashes has been adopted. Most vendors of forensic suites, such as EnCase and FTK, provide lists of hashes of known files that are commonly part of the operating system or of well-recognized applications and thus should be ignored during the investigation. NIST maintains such a list, known as the National Software Reference Library (NSRL). The SANS website provides a query interface to this library via an online HTML-based form (https://isc.sans.edu/tools/hashsearch.html). The user may fill in the hash value of a file and retrieve information in the case that the hash is included in the database, along with the name of the file it belongs to. In order to promote the automated removal of such files from forensic tools, the database can also be queried using the DNS protocol. A special DNS zone, 'md5.dshield.org', has been configured where a tool may issue a DNS request for a hostname of the form 'hashvalue.md5.dshield.org' by substituting 'hashvalue' with the corresponding value.
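Such a DNS-based lookup can be sketched in plain Java as follows. The helper names are illustrative assumptions, and the network call is wrapped so that an unresolved name simply means the hash is not listed in the zone.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch of querying the md5.dshield.org zone described in the text: a
// successful DNS resolution of "<md5>.md5.dshield.org" means the hash is in
// the database of known 'good' files and the file can be ignored.
public class KnownFileLookup {

    /** Builds the hostname to resolve for a given MD5 hash value. */
    static String lookupName(String md5Hex) {
        return md5Hex.toLowerCase() + ".md5.dshield.org";
    }

    /** Returns true if the zone resolves the name, i.e. the hash is known. */
    static boolean isKnownGoodFile(String md5Hex) {
        try {
            InetAddress.getByName(lookupName(md5Hex));
            return true;
        } catch (UnknownHostException e) {
            return false; // not listed (or DNS unavailable)
        }
    }
}
```

In a production tool the lookups would be batched and cached, since a disk image can easily contain tens of thousands of distinct hashes.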
In the case of a successful lookup, the tool can infer that the hash value is contained in the database of known 'good' files and can thus remove or ignore the file in further forensic analysis. The semantic parser that processes the output of the fiwalk tool, which represents the contents of a disk image in XML, has been adapted with such functionality. Thus, the resulting set of axioms can contain considerably fewer files and their resulting axioms. For a relatively fresh disk image of a Windows XP SP3 OS installation, approximately 40-50% of the files were removed by such an approach.

The next component is the Inference Engine. The OWL axioms that the semantic parsers have asserted are aggregated together under a temporary namespace of an initially blank ontology. This blank ontology can of course be named after some case identifier or any other type of identification scheme the investigator may use. The Inference Engine imports all the referenced domain ontologies that the semantic parsers have used and is then able to perform automated reasoning according to the OWL/OWL 2 specifications. Pellet is the reasoning engine used in the PoC system, and it offers granular control over which types of OWL axioms the examiner wants inferred. One type is the generation of class assertion axioms, such as in the case of class hierarchies, where an individual that is a member of a subclass can be inferred to be a member of the parent class as well. Another useful type is the generation of inverse object property axioms, where two individuals establish reverse links based on an asserted object property and the defined inverse of the latter, which can improve the performance of query execution later on.
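The first of these inference types, propagating class membership up a class hierarchy, can be illustrated without a DL reasoner by a small transitive-closure sketch. The class names are invented for the example, and a real system would of course delegate this (and much more) to Pellet.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of class-assertion inference over a subclass hierarchy:
// if an individual is asserted to belong to a class, it is inferred to
// belong to every superclass reachable via subClassOf edges.
public class ClassHierarchyInference {
    // child -> parent (single inheritance keeps the sketch short)
    private final Map<String, String> subClassOf = new HashMap<>();

    void addSubClassOf(String child, String parent) {
        subClassOf.put(child, parent);
    }

    /** All classes an individual of 'assertedClass' is inferred to belong to. */
    Set<String> inferredClasses(String assertedClass) {
        Set<String> result = new HashSet<>();
        // Walk up the hierarchy; result.add() also guards against cycles.
        for (String c = assertedClass; c != null && result.add(c); ) {
            c = subClassOf.get(c);
        }
        return result;
    }
}
```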
Due to the fact that, in a realistic environment, the expressivity of the different domain tools is not guaranteed, or that the investigator may desire to introduce new concepts that are combinations of multiple concepts dispersed in different ontologies without being able to modify the original ontologies, the method allows the creation of additional ontologies to support the definition of additional ontological assertions. In the PoC system, an additional ontology called 'IntegrationOntology' has been created; it imports all the referenced domain ontologies as well as their indirect references. The investigator is able to use an ontology editing tool such as Protégé to define new classes, additional restrictions, new object properties etc. The Inference Engine is then able to infer new axioms based not only on the domain ontologies but also on the investigator's custom-defined one. This allows for a level of flexibility, since the domain ontologies may not be easily modifiable, but the OWL specification allows, and even promotes, the creation of supplementary ontologies that may reuse entities defined in other ontologies.

The next component is the SWRL Rule Engine. In the PoC system the Jess rule engine has been adopted; although Jess does not directly support SWRL rules and OWL ontologies, a bridging API that is part of the Protégé-OWL API provides the capability to load and evaluate the SWRL rules and the set of ontological axioms in the rule engine as well as to import back any newly inferred axioms. SWRL rules have been used in order to establish relations between individuals that belong to different ontologies but represent similar concepts, like IP addresses, as well as to correlate different individuals based on shared grounds like time. As such, SWRL rules play a major role in the automated integration and correlation parts of the method.
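As a concrete illustration, one of the bridging rules used in the experiments equates individuals that carry the same IP address string in the packet capture and firewall ontologies, asserting a new object property between them:

```
PacketCapture:hasIPValue(?x, ?y) ^
WindowsXPFirewallLog:hasAddress(?w, ?z) ^
swrlb:stringEqualIgnoreCase(?y, ?z)
  -> IntegrationOntology:PcapIPToFWLogHost(?x, ?w)
```

The antecedent matches a packet-capture individual ?x and a firewall-log individual ?w whose IP address strings compare equal (case-insensitively, via the standard swrlb built-in); the consequent links the two via the custom 'PcapIPToFWLogHost' property defined in the IntegrationOntology.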
The SWRL rules can be kept in a separate text file, which promotes the decoupling of the actual rules from the implementation of the method and enables the sharing and reuse of the rules across multiple cases as well.

The final component is responsible for accepting SPARQL queries from the user and evaluating them against a SPARQL query engine. The Jena framework provides the ARQ query engine, which supports the SPARQL RDF query language. The set of RDF triples that have been either asserted during the semantic parsing of the source data or inferred by the reasoning or the rule engine is loaded in-memory, and SPARQL queries can be evaluated against it. The queries can once more be stored in separate files, thus promoting reuse and decoupling from the implemented system. The folder that contains the text files of the SPARQL queries is given to the program in the form of a command-line argument; the program then iteratively loads and evaluates them. Currently the results are output to the console with a simplistic table-based formatting.

The above PoC system provides a basic implementation of the proposed method. Each component implements a part of the method as the latter has been described in §7.1. It needs to be emphasized that the implemented system is far from optimal, since all the processing is performed sequentially and the whole set of triples is stored in-memory. The system could be designed much more efficiently by taking advantage of upcoming technologies such as SPARQL Federated Query, where a single SPARQL query can be evaluated over multiple and diverse data sources, and by utilizing dedicated persistence layers for storing the triples/graphs such as Jena's TDB and SDB SPARQL databases. One other point of discussion is the order of the calls to the reasoning and the rule engine. In the case that the call to the reasoning engine needs to take advantage of axioms that can only be inferred by the SWRL rules, the call order may need to be reversed.
Another option is to call the reasoning engine twice, once before the rule engine and once after it. An interesting point to highlight, though, is that the implemented system is quite flexible since, besides the code of the semantic parsers, the calls to the reasoning, the rule and the query engine can be performed in a dynamic and non-hardcoded manner. The information needed to perform reasoning, rule and query evaluation does not have to be part of the code but can be kept in separate files, such as RDF/XML files for the ontologies and text files for the SWRL rules and the SPARQL queries.

8 Demonstration of the Method

In this section a demonstration of the method is shown using a common scenario that involves a host getting infected by malware after visiting a malicious website. The conducted experiment has the goal of evaluating the feasibility of the method as well as the qualitative advantages that it provides to the investigator for the analysis of the case and the acquired evidence files. The infection of a machine upon visiting a malicious website, and its further exploitation via a remote terminal or even a remote desktop connection that the attacker may establish, is a quite common method of intruding into a private network. In the case of an advanced type of malware that includes antivirus evasion techniques, or when the compromised machine is not protected by any local anti-malware engine, or perhaps by one that is not properly updated, the system's compromise may go unnoticed. A forensic examination of the system's drive and its further scanning against an updated anti-malware engine by the investigator may reveal the presence of malicious files (e.g. executables, DLL libraries) that the attacker may have downloaded to the system after the exploitation.
A typical scenario of a remote compromise of a networked host involves three or four steps: reconnaissance, in order to identify running and possibly vulnerable services on the remote system; exploitation, where the actual vulnerability is exploited using one or multiple attack vectors; privilege escalation, where the attacker attempts to gain administrative privileges on the compromised system; and finally post-exploitation, where the attacker may perform various tasks such as downloading additional malicious software onto the compromised system, leaving backdoors for further connections, exfiltrating important files resident on the victim, modifying OS settings etc. The exploitation stage can be performed either in a direct manner, where the attacker attempts to exploit a vulnerability of a networked service running on the remote system, or in an indirect manner, where the attacker attempts to trick the victim into activating the exploit code directly on the system. The first method commonly uses buffer overflow attacks as the attack vector, such as stack-based and heap-based techniques that may enable the attacker to execute arbitrary code on the compromised system. The second method is usually carried out using social engineering techniques, such as tricking the user into downloading and executing some malicious file sent as an email attachment, from a malicious website or through file sharing channels. Common attack vectors employed in the second method are vulnerabilities in document-related software such as Microsoft Word, Adobe Acrobat Reader or picture viewers, the support that such formats provide for active content like embedded JavaScript or Flash code, or vulnerabilities in web browsers that enable specially crafted web content to bypass their security sandbox and execute code on the victim's machine without even the need for any user interaction (drive-by download & execution).
After successful vulnerability exploitation, the attacker is able to execute code of his choice on the compromised system. Commonly, this code is termed 'shell-code' since it usually binds a command shell process to a network port. The attacker, upon establishing a network connection to this port, can remotely access the compromised system through this command shell, enabling him to perform a vast array of tasks (e.g. create users, set up remote desktop services, disable local anti-malware or firewall services etc.). Even more advanced techniques exist that allow the attacker to inject the compromised system with additional code libraries that enable him to perform even more advanced tasks (e.g. taking desktop screenshots, logging user keystrokes etc.). After a successful exploitation, the connection can be set up in two different manners, the forward and the reverse one. In the forward manner, the shell-code that has been executed on the system establishes a listening port waiting for incoming connections by the attacker for serving the remote shell over it. Because most network or host firewalls disable incoming requests to most ports, and modern operating systems prompt the user when a process attempts to run a listening service, this way of remote connection is likely to be unsuccessful. The reverse connection forces the compromised system to initiate the connection to a listening service on the attacker's end, since outgoing connections are less likely to be disallowed by a firewall.

8.1 Description of the Experiments

In order to better evaluate the proposed method and the proof-of-concept implementation, experiments have been performed in a controlled local area network (LAN) environment between two connected hosts, the attacking one and the victim.
In order to avoid distributing malicious files over the normal university network or other commercial ones (ISPs), a choice has been made to simulate such a system infection in an isolated LAN environment. However, in order for the method to be able to integrate data from the collected evidence with other data sources such as RIPE and FIRE, a manual process of IP address modification has been followed. Since data sources like RIPE and FIRE obviously do not maintain any data about local IP addresses, the IP address of the attacking machine has been changed to a known malicious one, such as those reported by FIRE. This change does not affect the integrity of the results of the method, and it can be implemented either in a properly configured local routed setup or by modifying the values of the generated triples after the semantic transformation of any packet capture evidence file including the attacker's IP address. The experiments have been performed on two systems interconnected via a network hub in a LAN that provided Internet access as well. Both systems were of the same model, namely HP Compaq 8000 Elite with an Intel Core 2 Duo E8400 processor and 4096 MB of RAM. The first system was running Microsoft Windows XP Professional with Service Pack 3. The other system was running BackTrack 5 Release 1, a Linux-based distribution known for its collection of exploitation tools such as Metasploit. The exploitation part of the experiment has been centered on the disclosed MS11-006 vulnerability in Windows Shell graphics processing, which affected multiple versions of the Windows OS such as Windows XP Service Pack 3, Windows Vista SP 1 and 2 as well as Windows Server 2008. The corresponding Metasploit module exploits a buffer overflow vulnerability in how Windows handles thumbnails in Office documents. A specially crafted Office document can trigger the vulnerable code when the user navigates to the folder that contains it and displays the folder's contents in 'Thumbnails' view.
Metasploit provides a plugin that allows an attacker to generate such a malicious document combined with a payload of his choice. The payload could be either code that will establish a predefined listening network port on the victim or code that will force the victim to initiate a connection to a predefined remote IP address and port. The payload can also either establish a simple shell connection or additionally use the 'meterpreter' post-exploitation extension provided by Metasploit, giving even more capabilities to the attacker. After generating the malicious document, the attacker has to find a way to transfer the file to the victim's system. In the current setup, the attacker has set up a web server from which the user could download it. As said before, the attacker simulates a known malicious host such as those listed by the FIRE project under the 'Exploit Server' category. The user could be led to such a site serving malicious files directly through a link received in the body of an email, an IM message or in the content of another webpage or social networking site. The user could also be led there in a drive-by scenario by initially visiting other sites and her browser being automatically redirected. After the download of the file, we assume that the user navigates to the folder containing the received file at some point in time and, in the case that 'Thumbnails' view is selected, the exploit code is triggered.

In the first implemented scenario, the Windows XP system gets compromised by executing the exploit code, but a locally configured firewall disallows incoming connections to unknown ports. The attacker has selected to create a listening port on the victim machine at port 4444. As such, although the port is listening, the Windows XP Firewall does not allow incoming connections to be established, thus preventing the attacker from continuing his attack.
The Windows Firewall has been configured so as to log both dropped packets as well as successfully established connections. The network traffic has been captured by a third system connected to the hub, using Wireshark, and saved as a file in the 'pcap' format. The scenario is schematically presented in Figure 25 below.

Figure 25: Attack scenario of a 'bind_tcp' shellcode triggered by a malicious Word document downloaded from the Web.

Finally, a forensic image of the victim's system has been taken using FTK Imager and stored in the raw format (dd). The fiwalk tool was used to get a DFXML-formatted XML file representing the files present on the disk and their associated metadata. The firewall log file is extracted from the disk image using common forensic techniques, such as using FTK. The malicious file has not been deleted from the victim's system, and the forensic image was taken approximately 5 minutes after the execution of the malicious code. In a more realistic scenario, though, there would possibly be a considerable amount of time between the compromise of the system and the initiation of the digital investigation process. However, given proper retention of logs such as captured network traffic, and provided that the malicious file was still allocated on the file system, the method should produce the same result, although with a much larger dataset hindering its performance. In order to further reduce the complexity of the analysis, the Windows XP system was used for the experiment just after the completion of the OS installation, thus reducing the number of user-generated files to a minimum. In the second experiment, a variation of the first scenario is simulated. In that case, the malicious document that is generated and downloaded by the victim attempts, after the exploitation succeeds, to establish a TCP connection to a remote IP address and port that are under the control of the attacker.
It is quite common for such outgoing connections to use well-known ports, such as port 80 that is used for HTTP traffic, so as to minimize the risk of getting denied by firewalls. The attacker has properly set up the attacking system to listen for incoming connections from freshly compromised systems. Upon the establishment of the connection, and depending on the payload used, the attacker may be able to establish a shell connection to the compromised system or utilize the 'meterpreter' plugin provided by Metasploit for even more advanced post-exploitation capabilities. The 'meterpreter' plugin injects a DLL into the memory of the compromised system, providing advanced functions to the attacker such as automated privilege escalation, downloading/uploading files between the two systems, taking a screenshot of the victim's desktop, operating a keylogger etc. The injected DLL is not stored on the disk, thus minimizing any possible traces of the connection. In the conducted experiment, the meterpreter plugin has been used, and after the establishment of the connection some sample operations have been performed, such as downloading an additional malicious file to the compromised system, uploading a random file, as well as dumping the hashed passwords of the user accounts from the system's registry. The firewall allowed outgoing connections towards port 80 by default, although it was configured so as to log allowed connections as well. The implemented scenario is graphically represented in Figure 26.

Figure 26: Attack scenario of a 'reverse_tcp' shellcode triggered by a malicious Word document downloaded from the Web.

8.2 Integration and Correlation of Digital Artifacts

The first step of the experiment is the parsing of the collected evidence files (one packet capture in the 'pcap' format, one DFXML-formatted file as the output of 'fiwalk' on the disk image, and a text file that contained the Windows XP firewall logs).
The respective semantic parsers are applied on the evidence files and the generated triples are stored in individual ontologies, which are serialized to RDF/XML-formatted text files. In the table below some quantitative data about the collected evidence files are summarized.

Table 12: Semantic Representation of the Experiment 1 Evidence Files

CompromisedSystem.xml (fiwalk output of the system's disk image)
- Original Disk Size: 25GB
- Original Fiwalk XML Output File Size: 9,46MB
- RDF/XML Serialization File Size: 7,08MB
- Number of Allocated Files on the Disk: 6610
- Number of Nodes in the Graph Representation: 34012
- Number of Edges in the Graph Representation: 83032

Network Packet Capture (filtered for the system's IP address and TCP protocol only)
- Original File Size: 454KB
- RDF/XML Serialization File Size: 662KB
- Number of TCP Sessions: 40
- Number of Nodes in the Graph Representation: 1616
- Number of Edges in the Graph Representation: 5891

Windows XP Firewall Log of the compromised system
- Original File Size: 38KB
- RDF/XML Serialization File Size: 684KB
- Number of Log Entries: 413
- Number of Nodes in the Graph Representation: 1344
- Number of Edges in the Graph Representation: 5866

RIPE NCC WHOIS Database
- RDF/XML Serialization File Size: 210KB
- Number of Queried IP Addresses: 37
- Number of Nodes in the Graph Representation: 137
- Number of Edges in the Graph Representation: 395

FIRE Malicious Networks Database
- RDF/XML Serialization File Size: 113KB
- Number of Queried Autonomous Systems: 5
- Number of Nodes in the Graph Representation: 384
- Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
- RDF/XML Serialization File Size: 2,45MB
- Number of Files Queried and Indexed by VT: 2304
- Number of Nodes in the Graph Representation: 11519
- Number of Edges in the Graph Representation: 18508

After the parsing of the evidence files and their conversion to their respective semantic representations, the sets of triples generated by each parser are merged into a single ontology.
The resulting set of triples is the sum of all the nodes and edges present in the individual graphs, which leads to a quite complex graph. Visualization software such as Gephi can be used to render the resulting set of graphs, as presented in Figure 27 where, although not very clear, the sub-graphs of the evidence files are disconnected from each other.

Figure 27: Visualization of the semantic representation of the evidence files.

Use of the inference engine can introduce new triples based on the specified ontologies. Despite the ontologies specified for the evidence types used in this thesis being quite lightweight, restricted to basic expressions such as parent-child class relationships and inverse object properties, the Pellet inference engine introduced 72130 inferred axioms, which amounted to an increase of the RDF/XML serialization of the ontology by approximately 6,1MB. The next step is the establishment of interconnections between the separate sub-graphs through the use of the 'bridging' ontology and its specified classes and object properties. As discussed before, such bridging can be performed by a variety of modeling approaches such as hierarchical relationships of properties between different ontologies, SPARQL CONSTRUCT queries, SWRL rules etc. The main approach followed in this thesis was through SWRL rules, since they provided additional expressivity beyond OWL as well as better programmatic support for temporal-related rules. The following table contains the definitions of the SWRL rules that have been specified as well as the number of axioms that the Jess rule engine generated and later imported back into the main ontology.
Table 13: SWRL Rule Evaluation Results for Experiment 1

Rule 1 (26 generated axioms). The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the Firewall ontologies:
  PacketCapture:hasIPValue(?x,?y) ^ WindowsXPFirewallLog:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToFWLogHost(?x,?w)

Rule 2 (17 generated axioms). The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the RIPE ontologies:
  PacketCapture:hasIPValue(?x,?y) ^ WHOIS:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToWHOISIpAddr(?x,?w)

Rule 3 (1 generated axiom). The rule 'bridges' individuals referring to the same IP address value between the 'PacketCapture' and the FIRE ontologies:
  PacketCapture:hasIPValue(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToFireIPAddr(?x,?w)

Rule 4 (37 generated axioms). The rule 'bridges' individuals referring to the same IP address value between the Firewall and the RIPE ontologies:
  WindowsXPFirewallLog:hasAddress(?x,?y) ^ WHOIS:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:FWLogHostToWHOISIpAddr(?x,?w)

Rule 5 (2 generated axioms). The rule 'bridges' individuals referring to the same IP address value between the Firewall and the FIRE ontologies:
  WindowsXPFirewallLog:hasAddress(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:FWLogHostToFireHost(?x,?w)

Rule 6 (2 generated axioms). The rule 'bridges' individuals referring to the same IP address value between the RIPE and the FIRE ontologies:
  WHOIS:hasAddress(?x,?y) ^ Fire:hasIPAddressString(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:WHOISIpAddrToFireIPAddr(?x,?w)

Rule 7 (34 generated axioms). The rule 'bridges' individuals referring to the same network port number between the 'PacketCapture' and the Firewall ontologies:
  PacketCapture:TCPPort(?x) ^ PacketCapture:hasNumericalValue(?x,?y) ^ WindowsXPFirewallLog:hasNumber(?w,?z) ^ swrlb:equal(?y,?z) -> IntegrationOntology:PcapPortToFWLogPort(?x,?w)

Rule 8 (18 generated axioms). The rule 'bridges' individuals referring to the same MD5 hash value between the 'PacketCapture' and the 'DigitalMedia' ontologies:
  PacketCapture:hasContentMD5(?x,?y) ^ DigitalMedia:hasMD5(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:HTTPContentToMediaFile(?x,?w)

Rule 9 (1 generated axiom). The rule 'bridges' individuals referring to the same MD5 hash value between the 'PacketCapture' and the 'VirusTotal' ontologies:
  PacketCapture:hasContentMD5(?x,?y) ^ VirusTotal:hasMD5Hash(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:HTTPContentToVTFile(?x,?w)

Rule 10 (22 generated axioms). The rule 'bridges' individuals referring to the same MD5 hash value between the 'DigitalMedia' and the 'VirusTotal' ontologies:
  DigitalMedia:hasMD5(?x,?y) ^ VirusTotal:hasMD5Hash(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:MediaFileToVTFile(?x,?w)

Rule 11 (413 generated axioms). The rule 'connects' individual firewall events to the individuals representing the temporal instants:
  WindowsXPFirewallLog:FirewallEvent(?x) ^ WindowsXPFirewallLog:hasDateTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) -> temporal:hasValidTime(?x,?z)

Rule 12 (9795 generated axioms). The rule 'connects' file individuals to temporal instant individuals based on their file creation timestamp; a new individual is also created that is a member of the 'FileCreationEvent' class:
  DigitalMedia:File(?x) ^ DigitalMedia:hasFileCreationTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filecreationevent,?x) -> IntegrationOntology:FileCreationEvent(?filecreationevent) ^ IntegrationOntology:hasFileCreationEvent(?x,?filecreationevent) ^ temporal:hasValidTime(?filecreationevent,?z)

Rule 13 (9795 generated axioms). The rule 'connects' file individuals to temporal instant individuals based on their file last access timestamp; a new individual is also created that is a member of the 'FileAccessEvent' class:
  DigitalMedia:File(?x) ^ DigitalMedia:hasFileAccessTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?fileaccessevent,?x) -> IntegrationOntology:FileAccessEvent(?fileaccessevent) ^ IntegrationOntology:Event(?fileaccessevent) ^ temporal:hasValidTime(?fileaccessevent,?z)

Rule 14 (9795 generated axioms). The rule 'connects' file individuals to temporal instant individuals based on their file metadata change timestamp; a new individual is also created that is a member of the 'FileMetadataChangeEvent' class:
  DigitalMedia:File(?x) ^ DigitalMedia:hasMetadataChangeTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filemetadatachangeevent,?x) -> IntegrationOntology:FileMetadataChangeEvent(?filemetadatachangeevent) ^ IntegrationOntology:Event(?filemetadatachangeevent) ^ temporal:hasValidTime(?filemetadatachangeevent,?z)

Rule 15 (9795 generated axioms). The rule 'connects' file individuals to temporal instant individuals based on their file modification timestamp; a new individual is also created that is a member of the 'FileModificationEvent' class:
  DigitalMedia:File(?x) ^ DigitalMedia:hasFileModificationTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filemodificationevent,?x) -> IntegrationOntology:FileModificationEvent(?filemodificationevent) ^ IntegrationOntology:Event(?filemodificationevent) ^ temporal:hasValidTime(?filemodificationevent,?z)

Rule 16 (17 generated axioms). The rule 'connects' TCP/UDP sessions to temporal period individuals based on their start and finish timestamps:
  PacketCapture:hasStartTimeStamp(?x,?y1) ^ PacketCapture:hasEndTimeStamp(?x,?y2) ^ temporal:hasStartTime(?z,?z1) ^ temporal:hasFinishTime(?z,?z2) ^ temporal:equals(?y1,?z1) ^ temporal:equals(?y2,?z2) -> temporal:hasValidTime(?x,?z)

In the second experiment, the same approach has been followed. Some quantitative data describing the evidence files of the second experiment are shown below.

Table 14: Semantic Representation of the Experiment 2 Evidence Files

CompromisedSystem.xml (Fiwalk output of the system's disk image)
  Original Disk Size: 25 GB
  Original Fiwalk XML Output File Size: 9.34 MB
  RDF/XML Serialization File Size: 6.44 MB
  Number of Allocated Files in the Disk: 3273
  Number of Nodes in the Graph Representation: 16330
  Number of Edges in the Graph Representation: 45039

Network Packet Capture (filtered for the system's IP address and the TCP protocol only)
  Original File Size: 2.63 MB
  RDF/XML Serialization File Size: 2 MB
  Number of TCP Sessions: 57
  Number of Nodes in the Graph Representation: 5419
  Number of Edges in the Graph Representation: 21712

Windows XP Firewall Log of the compromised system
  Original File Size: 46 KB
  RDF/XML Serialization File Size: 784 KB
  Number of Log Entries: 480
  Number of Nodes in the Graph Representation: 1510
  Number of Edges in the Graph Representation: 6794

RIPE NCC WHOIS Database
  RDF/XML Serialization File Size: 38 KB
  Number of Queried IP Addresses: 41
  Number of Nodes in the Graph Representation: 181
  Number of Edges in the Graph Representation: 326

FIRE Malicious Networks Database
  RDF/XML Serialization File Size: 113 KB
  Number of Queried Autonomous Systems: 5
  Number of Nodes in the Graph Representation: 384
  Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
  RDF/XML Serialization File Size: 54 KB
  Number of Files Queried and Indexed by VirusTotal: 2540
  Number of Nodes in the Graph Representation: 253
  Number of Edges in the Graph Representation: 386

In addition to the evaluation of the aforementioned rules, which establish relationships between similar or identical individuals belonging to different ontologies, particular emphasis has also been placed on temporal rules in the second case. Taking advantage of the semantic parsers' work, which generated individuals representing time instants and time periods in accordance with the SWRL Temporal Ontology, as well as the SWRL temporal built-ins provided by the Protégé-OWL API, which implement most of Allen's temporal operators, custom temporal rules can be specified and evaluated by the rule engine. The second case's evidence files have resulted in 1024 individuals representing time instants (file and firewall events) and 21 individuals representing time periods (TCP sessions). Examples of such temporal rules are presented below, along with their results for the second case's evidence files; they enable the investigator to establish temporal relations, such as before, after, or starting at the same time, between the semantic representations of time events and their associated forensic events.
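The one-minute windows used in several of these rules are built from temporal:add and temporal:before: an instant t1 is related to an instant t2 when t1 precedes t2 and t2 still falls before t1 + 60 seconds. A small Python sketch of that predicate, with hypothetical timestamps (the thesis evaluates the actual rules through the SWRL temporal built-ins and the Jess engine):

```python
from datetime import datetime, timedelta

def temporal_before_within(t1, t2, window_seconds=60):
    """True when t1 precedes t2 by no more than `window_seconds`, mirroring
    before(?t1,?t2) ^ add(?t1Plus,?t1,60,Seconds) ^ before(?t2,?t1Plus)."""
    return t1 < t2 < t1 + timedelta(seconds=window_seconds)

# Hypothetical instants: an HTTP download and two file-creation events.
download = datetime(2012, 3, 1, 14, 30, 5)
created = datetime(2012, 3, 1, 14, 30, 47)   # 42 s later: inside the window
late = datetime(2012, 3, 1, 14, 35, 0)       # minutes later: outside it
```

A rule engine applies such a check pairwise over all instant individuals, which is why the first rule below generates tens of thousands of axioms even for a modest number of instants.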
Table 15: SWRL Rule Evaluation Results for Experiment 2

Rule 1 (55770 generated axioms). The rule connects individuals representing time instants when one timestamp is positioned before the other, but by no more than one minute:
  temporal:hasTime(?x,?t1) ^ temporal:hasTime(?y,?t2) ^ temporal:before(?t1,?t2) ^ temporal:add(?t1Plus,?t1,60,temporal:Seconds) ^ temporal:add(?t2Plus,?t2,0,temporal:Seconds) ^ temporal:before(?t2Plus,?t1Plus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)

Rule 2 (136 generated axioms). The rule connects individuals representing time periods when one period begins before another by no more than one minute and also ends before the other has started:
  temporal:hasStartTime(?x,?z) ^ temporal:hasStartTime(?y,?w) ^ temporal:hasFinishTime(?x,?z2) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?z,?w) ^ temporal:before(?z2,?w) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?wPlus,?w,0,temporal:Seconds) ^ temporal:before(?wPlus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)

Rule 3 (1008 generated axioms). The rule connects individuals representing time instants and time periods when a time instant is positioned before the start of the period by no more than one minute:
  temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?z,?w1) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?w1Plus,?w1,0,temporal:Seconds) ^ temporal:before(?w1Plus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?x,?y)

Rule 4 (1841 generated axioms). The rule connects individuals representing time instants and time periods when a period starts and ends before a time instant by no more than one minute:
  temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:before(?w2,?z) ^ temporal:add(?zPlus,?z,60,temporal:Seconds) ^ temporal:add(?w1Plus,?w1,0,temporal:Seconds) ^ temporal:before(?w1Plus,?zPlus,temporal:Seconds) -> IntegrationOntology:temporalBefore(?y,?x)

Rule 5 (33 generated axioms). The rule connects individuals representing time periods when the two periods start at the same time but end at different timestamps:
  temporal:hasStartTime(?x,?t1) ^ temporal:hasStartTime(?y,?t3) ^ temporal:hasFinishTime(?x,?t2) ^ temporal:hasFinishTime(?y,?t4) ^ temporal:equals(?t1,?t3) ^ temporal:before(?t2,?t4) -> IntegrationOntology:temporalStarts(?x,?y)

Rule 6 (64 generated axioms). The rule connects individuals representing time instants and time periods when a time instant lies between the beginning and the end of the time period:
  temporal:hasTime(?x,?z) ^ temporal:hasStartTime(?y,?w1) ^ temporal:hasFinishTime(?y,?w2) ^ temporal:after(?z,?w1) ^ temporal:before(?z,?w2) -> IntegrationOntology:temporalInside(?x,?y)

The investigator clearly has considerable flexibility to create even more meaningful temporal rules by utilizing the full spectrum of the Allen operators. In the examples above, a time window of one minute has been chosen, both because such exploitation events (the download and execution of a malicious file) can happen quite fast and in order to reduce the complexity of the rules. Unrestricted temporal relations between timestamps, especially in cases with numerous and large files, can lead to quite 'heavy' rules with high numbers of generated axioms. The context of the case, along with the shared expertise of the forensic community, can lead to various heuristics for the specification of more useful and lightweight rules.

8.3 Hypothesis formulation and evaluation

After the successful completion of all the previous steps, the investigator has at his or her disposal a large set of triples representing the concepts commonly used in a digital investigation, the individual entities of these concepts corresponding to the observed and logged events of the case, as well as the semantic relationships that interconnect all these resources.
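Each of the hypotheses that follow is evaluated by matching a basic graph pattern, a conjunction of triple patterns with shared variables, against this merged set of triples. A toy, pure-Python matcher over hypothetical triples illustrates the core operation (a real SPARQL engine is used in the thesis):

```python
# Toy SPARQL-style basic graph pattern matching over (s, p, o) triples.
# Data and names are hypothetical illustrations, not the thesis dataset.

triples = [
    ("urn:tcpSession_6", "packetcapture:hasDestinationIP", "urn:ip_1"),
    ("urn:ip_1", "packetcapture:hasIPValue", "78.46.173.193"),
    ("urn:tcpSession_4", "packetcapture:hasDestinationIP", "urn:ip_1"),
]

def match(pattern, triples, binding=None):
    """Yield variable bindings for a list of triple patterns.
    Terms starting with '?' are variables."""
    binding = binding or {}
    if not pattern:
        yield binding
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        b = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if b.setdefault(term, value) != value:
                    ok = False  # variable already bound to something else
                    break
            elif term != value:
                ok = False      # constant term does not match
                break
        if ok:
            yield from match(rest, triples, b)

# "Which TCP flows have destination IP 78.46.173.193?"
query = [("?flow", "packetcapture:hasDestinationIP", "?ip"),
         ("?ip", "packetcapture:hasIPValue", "78.46.173.193")]
flows = sorted(b["?flow"] for b in match(query, triples))
# both hypothetical sessions bind ?flow
```

Shared variables (here ?ip) are what let a single query traverse evidence from different sources once the bridging properties are in place.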
In the last step, the investigator can use the powerful SPARQL language to pose queries to the collected dataset and receive meaningful results. The main benefit of the approach followed so far is that the dataset no longer consists of different formats that require separate tools and manual interpretation by the investigator, but is instead a conceptual and logical representation of the evidence. In this section, we follow the possible stages of a digital investigator's analysis process in order to verify the security incident that may have compromised the system, as well as to reconstruct the involved parties and the sequence of events. Assuming that the investigator has been provided with the aforementioned evidence files and has no access to any additional information, she is asked to determine whether the system has been compromised and, if so, how this may have been accomplished. One of the first steps the investigator may take is to gain familiarity with the provided dataset and the information it carries. The investigator should, of course, already have an understanding of the ontologies used for the semantic representation of the evidence, which we claim is a much easier task than gaining technical expertise in all the different tools that evidence processing demands. One of the greatest advantages of the SPARQL language, and of RDF in general, is that the actual data and metadata, such as information about the schema, can both be present in the same set of triples. In this manner, the investigator can get an overview of the provided dataset by querying the dataset itself. Additionally, in order to reduce the length of the queries reported in this document, the SPARQL prologue that maps the prefixes used to their respective vocabulary namespaces is shown once below and is assumed in all the following queries.
PREFIX whois: <http://people.dsv.su.se/~dossis/ontologies/WHOIS.owl#>
PREFIX integration: <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#>
PREFIX xpfw: <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#>
PREFIX fire: <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#>
PREFIX packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#>
PREFIX digitalmedia: <http://people.dsv.su.se/~dossis/ontologies/DigitalMedia.owl#>
PREFIX virustotal: <http://people.dsv.su.se/~dossis/ontologies/VirusTotal.owl#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

The investigator may formulate some hypotheses, expressed in the form of queries, in order to quickly evaluate the potential significance or relevance of the provided evidence to the case. Some sample queries are provided below, along with excerpts of their results, demonstrating the potential of the proposed method for fast and efficient queries over aggregated data from heterogeneous domains.

Hypothesis: The investigator hypothesizes that the compromised system may have had network communications with external IP addresses belonging to autonomous systems that are listed as malicious networks.

Query:
SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
  ?tcpflow packetcapture:hasDestinationIP ?destip .
  ?destip packetcapture:hasIPValue ?destipvalue .
  ?destip integration:PcapIPToWHOISIpAddr ?whoisip .
  ?whoisip whois:isContainedInRange ?range .
  ?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?range whois:hasRange ?rangeValue .
  ?range whois:isContainedInAS ?as .
  ?as whois:hasNetName ?netname .
  ?as whois:hasASNumber ?asnumber .
  ?as whois:hasRoute ?route
}

Results (literals are typed xsd:string):
  tcpflow = <urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>, destipvalue = "78.46.173.193", netname = "HETZNER-AS", asnumber = "24940"
  tcpflow = <urn://bind_tcp_FWed_tcp.pcap#tcpSession_4>, destipvalue = "78.46.173.193", netname = "HETZNER-AS", asnumber = "24940"

Interpretation: The results of the query support the hypothesis that the compromised system did indeed have network communications with IP addresses belonging to autonomous systems known to demonstrate malicious behavior. The query matches a graph pattern in the provided dataset, thereby retrieving additional information about the specific blacklisted AS.

Hypothesis: A common attack vector for compromising a system is malware that has been downloaded from the Web and executed by the user. The hypothesis is that a file downloaded from the Web, as extracted from the packet capture, is recognized as malicious by the anti-malware engine.

Query:
SELECT DISTINCT ?tcpflow ?http ?uri ?md5 ?link
WHERE {
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?http rdf:type packetcapture:HTTP .
  ?http packetcapture:hasHTTPRequest ?httpreq .
  ?httpreq http:resp ?httpresp .
  ?httpreq http:requestURI ?uri .
  ?httpresp http:body ?httpbody .
  ?httpbody packetcapture:hasContentMD5 ?md5 .
  ?httpbody integration:HTTPContentToVTFile ?vtfile .
  ?vtfile virustotal:hasAVReport ?vtreport .
  ?vtreport virustotal:hasPermanentLink ?link
}

Results (literals are typed xsd:string):
  tcpflow = <urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>
  http = <urn://bind_tcp_FWed_tcp.pcap#http_52>
  uri = "/msf.doc"
  md5 = "9E10A7844BA8BA4EFE1A514D27105735"
  link = "http://www.virustotal.com/file/8bacecdc64d63b334bca23f46cb0723119bbaafa148479d4b917852c2ee44943/analysis/"

Interpretation: The SPARQL results show that, when the provided query is evaluated over the integrated data from the different types of evidence, a path can be found connecting one of the files extracted from the packet capture with an identified and analyzed known malware sample. The result supports the investigator's hypothesis that the compromised system may indeed have downloaded and potentially executed a malicious file. The SPARQL SELECT graph pattern can obviously be expanded to retrieve more information, such as the involved IP addresses, TCP ports, HTTP request and response headers, or the individual verdicts of each antivirus engine that VirusTotal supports.

Hypothesis: The investigator hypothesizes that, in the event of a successful compromise, traces of malicious files may also be found in the image of the system's hard disk. The query searches for any files listed as malicious by the anti-malware service.

Query:
SELECT DISTINCT ?file ?pathName ?md5
WHERE {
  ?file rdf:type digitalmedia:File .
  ?file digitalmedia:hasPathName ?pathName .
  ?file digitalmedia:hasMD5 ?md5 .
  ?file integration:MediaFileToVTFile ?vtfile .
  ?vtfile virustotal:hasAVReport ?report .
  ?report virustotal:hasResult ?result .
  ?result virustotal:hasResultDescription ?description
}

Results (excerpt; literals are typed xsd:string):
  file = <urn://infectedHostNEW.xml#file_5578>, pathName = "WINDOWS/system32/drivers/beep.sys", md5 = "da1f27d85e0d1525f6621372e7b685e9"
  file = <urn://infectedHostNEW.xml#file_758>, pathName = "Documents and Settings/John/My Documents/msf.doc", md5 = "9e10a7844ba8ba4efe1a514d27105735"
  file = <urn://infectedHostNEW.xml#file_5686>, pathName = "WINDOWS/system32/drivers/vdmindvd.sys", md5 = "55e01061c74a8cefff58dc36114a8d3f"
  file = <urn://infectedHostNEW.xml#file_6139>, pathName = "WINDOWS/system32/services.exe", md5 = "0e776ed5f7cc9f94299e70461b7b8185"
  file = <urn://infectedHostNEW.xml#file_2847>, pathName = "WINDOWS/pchealth/helpctr/System/Remote Assistance/Interaction/Client/RAClient.htm", md5 = "cb4a33bd4fce7cc2eeccdc45d939e8b7"

Interpretation: The results support the hypothesis that the system was indeed infected with malicious files. The rows above are an excerpt of the complete results, which amounted to 22. Some of the reported files appear to be false positives; however, the structured format of the results allows further integration with additional datasets, such as hash lists of known malicious files, or further checks against online anti-malware services (e.g., online sandboxes and binary analysis).

Hypothesis: The investigator hypothesizes that some of the malicious files identified on the image of the system's drive may have been downloaded during web communications the system had.
If the downloaded malware was subsequently stored on the disk without modification, then hash-value equality provides a way to track the path of the file as it was downloaded and then stored on the disk. To make the hypothesis even more explicit, the investigator may refine the query to search for such files originating from known malicious hosts and networks.

Query:
SELECT ?file ?uri ?destip ?host_fire ?type ?asname
WHERE {
  ?file rdf:type digitalmedia:File .
  ?file digitalmedia:hasMD5 ?md5 .
  ?httpbody integration:HTTPContentToMediaFile ?file .
  ?httpresp http:body ?httpbody .
  ?httpreq http:requestURI ?uri .
  ?httpreq http:resp ?httpresp .
  ?http packetcapture:hasHTTPRequest ?httpreq .
  ?http rdf:type packetcapture:HTTP .
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?tcpflow packetcapture:hasDestinationIP ?destip .
  ?destip integration:PcapIPToFireIPAddr ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?host_fire rdf:type ?type .
  ?host_fire fire:isContainedInAS ?as .
  ?as fire:hasASName ?asname
}

Results (excerpt; every row binds file = <urn://infectedHostNEW.xml#file_758>, uri = "/msf.doc", and host_fire = <urn://firedb#host_78.46.173.193>, while ?type takes each of the following values):
  type = <http://www.w3.org/2002/07/owl#Thing>
  type = <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#ExploitServer>
  type = <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#Host>
  type = <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#MaliciousHost>
  type = <http://www.w3.org/2002/07/owl#NamedIndividual>

Interpretation: The results show that a malicious file named 'msf.doc' was indeed downloaded and later stored on the disk. The host from which the file was downloaded is included in the FIRE blacklist as a malicious host; more specifically, it has been categorized as an Exploit Server.

Hypothesis: The investigator hypothesizes that, after the malicious file was downloaded and stored on the system, some user or automated action may have led to an actual exploitation of the system and the execution of arbitrary malicious code. It is common for malicious code that manages to install itself on a system to attempt to communicate with external systems (e.g., to exfiltrate data, receive commands, or participate in DDoS attacks). The query searches for any firewall events that indicate unsuccessful connection attempts to the host.
Query (this query additionally uses PREFIX fn: <http://www.w3.org/2005/xpath-functions#>):
SELECT DISTINCT ?event ?ripeAS ?type ?fireip ?host_type ?fireipB
WHERE {
  ?event xpfw:hasSourceHost ?host .
  ?event rdf:type xpfw:FirewallEvent .
  ?event rdf:type ?type .
  ?host integration:FWLogHostToFireHost ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?host_fire rdf:type ?host_type .
  ?host_fire fire:isContainedInAS ?as .
  ?host integration:FWLogHostToWHOISIpAddr ?ripeip .
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?tcpflow packetcapture:hasDestinationIP ?destip .
  ?destip integration:PcapIPToFireIPAddr ?fireipB .
  ?fireipB fire:IPbelongsToHost ?host_fireB .
  ?host_fireB fire:isContainedInAS ?as .
  ?ripeip whois:isContainedInRange ?riperange .
  ?riperange whois:isContainedInAS ?ripeAS .
  ?ripeAS whois:hasCountry ?country .
  FILTER regex(str(?type),"Event","i")
  FILTER regex(str(?host_type),"Server","i")
}

Results (excerpt):
  event = <urn://pfirewall.log#event_299>
  event = <urn://pfirewall.log#event_315>
  event = <urn://pfirewall.log#event_273>
  (In all three rows: ripeAS = <urn://ripedb#as_24940>; type = <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#DropDataEvent>; fireip = <urn://firedb#ipaddr_78.46.104.43>; host_type = <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer>; fireipB = <urn://firedb#ipaddr_78.46.173.193>.)

Interpretation: The results support the hypothesis that the system's firewall has indeed blocked and logged incoming connections from an IP address belonging to the same autonomous system as some of the web communications the system is reported to have had with malicious hosts. Mereological correlation of the part-to-whole relationship that connects IP addresses with the AS to which they belong allows quick identification of further communications that the compromised system may have had with other malicious hosts. In this case, the firewall events are of the 'DropDataEvent' type, which represents unsuccessful incoming connections. The originating IP address of these attempts not only belongs to the same malicious network as other web communications the system had, but is also blacklisted as a Command and Control server (CCServer).
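The FILTER regex(str(?x), pattern, "i") conditions in the query above keep a binding when the pattern matches anywhere in the string form of the term, ignoring case. A short Python equivalent over hypothetical candidate bindings:

```python
import re

# Hypothetical candidate bindings for ?type, analogous to rdf:type values.
types = [
    "http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#DropDataEvent",
    "http://www.w3.org/2002/07/owl#NamedIndividual",
]

def regex_filter(values, pattern):
    """SPARQL FILTER regex(str(?v), pattern, "i"): case-insensitive,
    unanchored substring match, like re.search with re.IGNORECASE."""
    return [v for v in values if re.search(pattern, v, re.IGNORECASE)]

kept = regex_filter(types, "Event")
# only the DropDataEvent IRI survives the filter
```

This is how the query above narrows a generic rdf:type variable down to event classes and server categories without enumerating them.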
Based on the sample queries presented above, the investigator has gained considerable support for some of the hypotheses that may be significant to the case at hand. The queries' results indicate that the system had web communications with known malicious hosts, that malicious files were downloaded and stored on the system's disk, and that there were further attempts to communicate with other malicious hosts probably related to the former ones. Based on the temporal relations established in the second case, temporal queries can also be evaluated against the aggregated and integrated data. Some examples of such temporal queries are given below.

Hypothesis: After extracting and identifying a malicious file in one of the system's web communications, the investigator hypothesizes that its successful execution may have led to the infection of the machine with additional malicious files injected by the attacker. The investigator formulates a query to identify any possible malicious files created on the host's disk shortly after the malicious web communication.

Query:
SELECT DISTINCT ?uri ?file ?pathname ?descriptionDisk
WHERE {
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?http rdf:type packetcapture:HTTP .
  ?http packetcapture:hasHTTPRequest ?httpreq .
  ?httpreq http:resp ?httpresp .
  ?httpreq http:requestURI ?uri .
  ?httpresp http:body ?httpbody .
  ?httpbody packetcapture:hasContentMD5 ?md5 .
  ?httpbody integration:HTTPContentToVTFile ?vtfile .
  ?vtfile virustotal:hasAVReport ?report .
  ?report virustotal:hasResult ?result .
  ?tcpflow temporal:hasValidTime ?duration .
  ?duration integration:temporalBefore ?timestamp .
  ?fileevent temporal:hasValidTime ?timestamp .
  ?file integration:hasFileCreationEvent ?fileevent .
  ?file digitalmedia:hasPathName ?pathname .
  ?file integration:MediaFileToVTFile ?vtfileDisk .
  ?vtfileDisk virustotal:hasAVReport ?reportDisk .
  ?reportDisk virustotal:hasResult ?resultDisk .
  ?resultDisk virustotal:hasResultDescription ?descriptionDisk
}

Results (excerpt; every row binds uri = "/~dossis/Windows7Serial.doc", file = <urn://reverseTCP.xml#file_754>, and pathname = "Documents and Settings/John/MyFile.exe", while ?descriptionDisk takes each of the following values):
  descriptionDisk = "Trojan.Win32.Generic!BT"
  descriptionDisk = "BackDoor.Netguy.4"
  descriptionDisk = "Malware.Kernelbot"
  descriptionDisk = "TR/Rootkit.Gen"
  descriptionDisk = "Riskware.WinNT.Boupke!IK"

Interpretation: The results verify the hypothesis that, within a short time period (one minute, according to the aforementioned custom rules) after a malicious document was downloaded from a Web server, an additional malicious executable was created and stored on the system's disk.

Hypothesis: The investigator further hypothesizes that a successful compromise of the system by a downloaded malicious file may have enabled an attacker to establish a shell connection to the system and access various files on the system's disk.
The investigator formulates a query to identify which files were accessed within a short time period after the download of a malicious file from the web.

Query
SELECT DISTINCT ?uri ?file ?pathname
WHERE {
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?http rdf:type packetcapture:HTTP .
  ?http packetcapture:hasHTTPRequest ?httpreq .
  ?httpreq http:resp ?httpresp .
  ?httpreq http:requestURI ?uri .
  ?httpresp http:body ?httpbody .
  ?httpbody packetcapture:hasContentMD5 ?md5 .
  ?httpbody integration:HTTPContentToVTFile ?vtfile .
  ?vtfile virustotal:hasAVReport ?report .
  ?report virustotal:hasResult ?result .
  ?tcpflow temporal:hasValidTime ?duration .
  ?duration integration:temporalBefore ?timestamp .
  ?fileevent temporal:hasValidTime ?timestamp .
  ?file integration:hasFileLastAccessEvent ?fileevent .
  ?file digitalmedia:hasPathName ?pathname
}

Results
Excerpt; ?uri is "/~dossis/Windows7Serial.doc" in every row, and datatype annotations (xsd:string) are omitted:

  file: <urn://reverseTCP.xml#file_4785>  pathname: "WINDOWS/system32/config/SAM"
  file: <urn://reverseTCP.xml#file_907>   pathname: "Documents and Settings/LocalService/Local Settings/Application Data/Microsoft/Windows/UsrClass.dat"
  file: <urn://reverseTCP.xml#file_4795>  pathname: "WINDOWS/system32/config/system.LOG"
  file: <urn://reverseTCP.xml#file_4788>  pathname: "WINDOWS/system32/config/SECURITY"
  file: <urn://reverseTCP.xml#file_4791>  pathname: "WINDOWS/system32/config/software.LOG"

Interpretation
The results (only an excerpt is shown above) list all files whose last access time falls within a short time window after the download of the malicious file. They include registry hive files such as the SAM, which contains the user accounts’ password hashes, and may indicate that the attacker dumped the hashed passwords.

Hypothesis
The investigator hypothesizes that after a malicious file has been downloaded and possibly executed, the compromised system may attempt to communicate with the attacker. As is common in botnets, the compromised system may contact a Command and Control (C&C) server controlled by the attacker. The formulated query searches for firewall events, logged within a short time window after the malicious download, towards hosts that are already blacklisted.

Query
SELECT DISTINCT ?uri ?type ?host ?typeB
WHERE {
  ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
  ?http rdf:type packetcapture:HTTP .
  ?http packetcapture:hasHTTPRequest ?httpreq .
  ?httpreq http:resp ?httpresp .
  ?httpreq http:requestURI ?uri .
  ?httpresp http:body ?httpbody .
  ?httpbody packetcapture:hasContentMD5 ?md5 .
  ?httpbody integration:HTTPContentToVTFile ?vtfile .
  ?vtfile virustotal:hasAVReport ?report .
  ?report virustotal:hasResult ?result .
  ?tcpflow temporal:hasValidTime ?duration .
  ?timestamp integration:temporalInside ?duration .
  ?event temporal:hasValidTime ?timestamp .
  ?event rdf:type xpfw:FirewallEvent .
  ?event rdf:type ?type .
  ?event xpfw:hasDestinationHost ?host .
  ?host integration:FWLogHostToFireHost ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?host_fire rdf:type ?typeB .
  FILTER regex(str(?type),"Event","i")
  FILTER regex(str(?typeB),"Server","i")
}

Results
All three rows bind ?uri to "/~dossis/Windows7Serial.doc", ?host to <urn://pfirewall.log#ip_78.46.104.43> and ?typeB to <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#CCServer>, while ?type varies:

  <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#OpenOutboundSessionEvent>
  <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#Event>
  <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#FirewallEvent>

Interpretation
The results show that an outgoing connection was indeed established, within a short time period after the malicious download, to a system already blacklisted as a known Command and Control server.

Hypothesis
Upon identifying a suspicious connection to a malicious host in the firewall log, the investigator wants to determine which files were accessed shortly before this event. The formulated query searches for files whose last access timestamp precedes the suspicious connection within a short time window.

Query
SELECT DISTINCT ?host ?pathname
WHERE {
  ?event xpfw:hasDestinationHost ?host .
  ?event rdf:type xpfw:FirewallEvent .
  ?event rdf:type ?type .
  ?host integration:FWLogHostToFireHost ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?event temporal:hasValidTime ?timestampA .
  ?timestampB integration:temporalBefore ?timestampA .
  ?fileevent temporal:hasValidTime ?timestampB .
  ?file integration:hasFileLastAccessEvent ?fileevent .
  ?file digitalmedia:hasPathName ?pathname
}

Results
?host is <urn://pfirewall.log#ip_78.46.104.43> in every row, while ?pathname takes the values:

  "Documents and Settings/John/My Documents/Windows7Serial.doc"
  "Documents and Settings/All Users/Documents/My Music/.."
  "Documents and Settings/John/My Documents/My Music/.."
  "Documents and Settings/John/My Documents/."
  "Documents and Settings/John/Recent/Windows7Serial.lnk"

Interpretation
The results show that the malicious file downloaded from the web was accessed shortly before the communication to the C&C server was established. This further corroborates the hypothesis that this file was indeed executed and was the one that caused the suspicious connection.

Hypothesis
The investigator hypothesizes that the user may have been lured into downloading the malicious file through a phishing attempt by a malicious site. The formulated query searches for the web pages visited by the user shortly before, or at the same time as, the download of the malicious file.
Query
PREFIX whois: <http://people.dsv.su.se/~dossis/ontologies/WHOIS.owl#>
PREFIX integration: <http://people.dsv.su.se/~dossis/ontologies/IntegrationOntology.owl#>
PREFIX xpfw: <http://people.dsv.su.se/~dossis/ontologies/WindowsXPFirewallLog.owl#>
PREFIX fire: <http://people.dsv.su.se/~dossis/ontologies/Fire.owl#>
PREFIX packetcapture: <http://people.dsv.su.se/~dossis/ontologies/PacketCapture.owl#>
PREFIX digitalmedia: <http://people.dsv.su.se/~dossis/ontologies/DigitalMedia.owl#>
PREFIX virustotal: <http://people.dsv.su.se/~dossis/ontologies/VirusTotal.owl#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX content: <http://www.w3.org/2011/content#>

SELECT DISTINCT ?uri ?fieldValue ?title
WHERE {
  {
    ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
    ?http rdf:type packetcapture:HTTP .
    ?http packetcapture:hasHTTPRequest ?httpreq .
    ?httpreq http:resp ?httpresp .
    ?httpreq http:requestURI ?uri .
    ?httpresp http:body ?httpbody .
    ?httpbody packetcapture:hasContentMD5 ?md5 .
    ?httpbody integration:HTTPContentToVTFile ?vtfile .
    ?vtfile virustotal:hasAVReport ?vtreport .
    ?vtreport virustotal:hasResult ?result .
    ?tcpflow temporal:hasValidTime ?durationA .
    ?durationB integration:temporalStarts ?durationA .
    ?tcpFlowB temporal:hasValidTime ?durationB .
    ?tcpFlowB packetcapture:hasApplicationLayerProtocol ?httpB .
    ?httpB rdf:type packetcapture:HTTP .
    ?httpB packetcapture:hasHTTPRequest ?httpreqB .
    ?httpreqB http:requestURI ?uriB .
    ?httpreqB http:headers ?header .
    ?httpreqB http:resp ?httprespB .
    ?httprespB http:body ?httpbodyB .
    ?httpbodyB content:title ?title .
    ?header http:fieldName ?fieldName .
    ?header http:fieldValue ?fieldValue .
    FILTER (regex(str(?fieldName),"Host","i") || regex(str(?fieldName),"Referer","i"))
  }
  UNION
  {
    ?tcpflow packetcapture:hasApplicationLayerProtocol ?http .
    ?http rdf:type packetcapture:HTTP .
    ?http packetcapture:hasHTTPRequest ?httpreq .
    ?httpreq http:resp ?httpresp .
    ?httpreq http:requestURI ?uri .
    ?httpresp http:body ?httpbody .
    ?httpbody packetcapture:hasContentMD5 ?md5 .
    ?httpbody integration:HTTPContentToVTFile ?vtfile .
    ?vtfile virustotal:hasAVReport ?vtreport .
    ?vtreport virustotal:hasResult ?result .
    ?tcpflow temporal:hasValidTime ?durationA .
    ?durationB integration:temporalBefore ?durationA .
    ?tcpFlowB temporal:hasValidTime ?durationB .
    ?tcpFlowB packetcapture:hasApplicationLayerProtocol ?httpB .
    ?httpB rdf:type packetcapture:HTTP .
    ?httpB packetcapture:hasHTTPRequest ?httpreqB .
    ?httpreqB http:requestURI ?uriB .
    ?httpreqB http:headers ?header .
    ?httpreqB http:resp ?httprespB .
    ?httprespB http:body ?httpbodyB .
    ?httpbodyB content:title ?title .
    ?header http:fieldName ?fieldName .
    ?header http:fieldValue ?fieldValue .
    FILTER (regex(str(?fieldName),"Host","i") || regex(str(?fieldName),"Referer","i"))
  }
}

Results
Excerpt; ?uri is "/~dossis/Windows7Serial.doc" in every row, and datatype annotations (xsd:string) are omitted:

  fieldName: "Host"     fieldValue: "people.dsv.su.se"
    title: "http://people.dsv.su.se/~dossis/"
  fieldName: "Referer"  fieldValue: "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s"
    title: "http://people.dsv.su.se/~dossis/"
  fieldName: "Host"     fieldValue: "people.dsv.su.se"
    title: "http://people.dsv.su.se/~dossis/coolsites.htm"
  fieldName: "Referer"  fieldValue: "http://people.dsv.su.se/~dossis/"
    title: "http://people.dsv.su.se/~dossis/coolsites.htm"
  fieldName: "Host"     fieldValue: "www.google.se"
    title: "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s"
  fieldName: "Referer"  fieldValue: "http://www.google.se/"
    title: "http://www.google.se/search?hl=sv&source=hp&q=semantic+web+dsv&gbv=2&oq=semantic+web+dsv&gs_l=hp.3...1743.4076.0.4216.16.14.0.2.2.0.110.1162.12j2.14.0...0.0...1c.lAuiKGn2X-s"

Interpretation
An excerpt of the results shows some of the websites visited by the user before downloading the malicious file. The results show that, before the download, the browser visited the main page of the malicious site (people.dsv.su.se/~dossis in our example) as well as a specific page under it (coolsites.htm). By also utilizing the Referer HTTP request header, the examiner is able to identify a web search submitted to the Google search engine (“semantic web dsv”) whose results included the malicious site.
The ability of the above queries to quickly provide integrated and correlated results over the multitude of initially unstructured or semi-structured collected evidence should be apparent by now. Although no specific measurements have been taken of the time or expertise an investigator would have needed to produce such answers manually, it is reasonable to claim that this approach is considerably more flexible and scalable. It should be emphasized that such queries allowed the investigator to connect events, such as incoming connections dropped by the firewall, with the network communications and the disk image, while evading the complexity of the sheer amount of data. Moreover, a dropped incoming connection from an external host would probably not have raised any suspicion or alert on its own, even though it could have logical connections with events logged by other tools and techniques. The evaluation of the above hypotheses can greatly assist the investigation team in reconstructing the sequence of events and providing a more complete and accurate narrative of the investigated incident.

8.4 Evaluation of the Method

As a final part, a lightweight evaluation of the method and the lessons learned are discussed, and the attained results are compared to the goals and criteria initially specified in section 6.3. A generic goal was that the proposed method should be appropriate and relevant to the digital investigation in focus. Through the design and the prototype implementation, it has been shown that the method can be used in a variety of contexts involving different types of evidence files. The method should not be considered a replacement of existing tools or techniques, but rather an additional layer on top of them, able to meaningfully integrate and correlate their respective results.
In the proof-of-concept implementation, six different types of data have been used (disk images, packet captures, firewall logs, site blacklists, an antimalware engine, and network data), which are quite commonly used, wholly or partly, in digital investigations and especially in cases involving network intrusions. There are no restrictions on the types of cases or data that the method can handle, besides the technical requirement that a respective ontology must be designed and an appropriate semantic parser implemented that can operate on the semi-structured or unstructured data that an existing tool outputs.

One of the strongest points of this method, compared to traditional techniques, is that through the power of SPARQL it gives the user an expressive query interface against the integrated data. The queries can include numerous unbound variables and form quite complex graph patterns against which results are obtained. The queries can use every term specified in the respective ontologies, and evaluation over datasets spanning hundreds of thousands of triples, representing the complete body of the collected evidence, was demonstrated. The resulting semantic representations of the different evidence files, expressed in the RDF/XML format, have been shown to be of equal or smaller size than the original files. Of course, the resulting size of the set of axioms depends on the abstraction level of the ontology used as well as on the OWL features employed, since evaluation and storage of the inferred axioms leads to even larger files. However, the proposed method provides an efficient way for the investigator to focus the initial phase of the investigation on evidence metadata such as timestamps, hash values, and file types and names, and then return to the original evidence files to retrieve specific content once the items most relevant to the case, such as executables or digital images, have been identified. In addition, although no explicit time measurements have been made, query resolution has proven almost instantaneous on commodity hardware for a case that resulted in approximately 300 thousand triples. Inference and storage of inverse object properties has been shown to accelerate most of the queries considerably.

An additional benefit of this method is its flexibility and its potential to fully decouple the implementation code from the ontologies, rules and queries used. Indeed, as described before, all these entities can be stored as separate external files or even downloaded dynamically from URLs. The prototype code’s main role is to organize and streamline the whole process, from calling the appropriate parsers to later invoking the inference and rule engines. The various semantic parsers are the only parts that are highly dependent on the formats of their source data; they are responsible for extracting and asserting the semantic information according to the specified ontology. A few hard-coded references to the locations of the ontology files, and the fixed way in which the parsers are invoked, are shortcomings of this prototype implementation that can easily be lifted with a more pluggable architecture and a graphical interactive layer.

The prototype implementation was capable of dealing with the complexity and size of the evidence files used in the experiments in a timely manner. Most parsing, inference and rule tasks did not take more than 5-10 minutes to complete. The only tasks that needed a considerable amount of time were the live matching of all files against the online antivirus engine and the evaluation of the temporal rules. More specifically, rules that needed to cross-match around 1000 timestamps for, e.g., temporal ‘before’ relations led the Jess rule engine to memory exhaustion. By trial and error, a group size of 500 timestamps was found to avoid such problems, needing approximately 7-8 hours to complete. The task of splitting the set of time instant individuals into subgroups and later merging the inferred axioms was performed manually in this thesis. However, with more computing resources available, and automated ways or heuristics to ‘divide and conquer’ such problems, the ability of the method to handle even larger amounts of data will increase.

Regarding the initially specified forensic-related requirements, the method has proven to be quite accurate and precise. The method was applied to the same evidence files twice, providing approximately the same results. Small differences were caused by minor network disruptions or a few dropped API calls against the online services used, without significantly affecting the bulk of the case’s body. However, reliance on online sources can certainly affect the results of this method and disrupt the automated processing performed by the semantic parsers. The ability to employ the parsers asynchronously and separately from each other, and to later merge their generated statements, can alleviate such problems, at the cost of implementing some form of temporary storage for the collector modules. The results of the method have been checked successfully for possible ontological inconsistencies using Protégé’s reasoning engines. The most probable causes of inconsistencies are erroneous ontologies, or multiple ontologies that cross-reference each other inconsistently. Such problems must be checked and corrected before the method is employed.
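The subgroup splitting described above can be automated. The sketch below is an illustrative Python analogue (not the thesis’s Jess implementation): it streams over pairs of timestamp chunks so that only two chunks are cross-matched at any one time, which bounds the per-step working set even though the full relation is still quadratic in size. The chunk size of 500 mirrors the empirically found limit.

```python
from itertools import product

def before_relations(instants, chunk=500):
    """Compute all ordered (a, b) pairs with a < b, i.e. a temporal
    'before' relation, by cross-matching one pair of chunks at a time
    instead of asserting all instants into a rule engine at once.
    Illustrative sketch; the thesis performed this splitting manually."""
    instants = sorted(set(instants))
    groups = [instants[i:i + chunk] for i in range(0, len(instants), chunk)]
    relations = set()
    for ga, gb in product(groups, repeat=2):
        for a, b in product(ga, gb):
            if a < b:
                relations.add((a, b))  # merge step: union of per-chunk-pair results
    return relations

rels = before_relations([3, 1, 2], chunk=2)
# → {(1, 2), (1, 3), (2, 3)}
```

Because each chunk pair is independent, the inner loop could equally be dispatched to separate processes or machines and the partial results merged afterwards, which is the automated ‘divide-and-conquer’ direction suggested above.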
The method is capable of working on forensic copies of the various sources of evidence, since the various semantic parsers can be implemented to do so without any major hindrance. Minor logging capabilities have been added to the prototype implementation, covering the start and finish of the various steps of the method. The resulting file is an ontology file that is fully standards-compliant and can be processed by ontology editors such as Protégé for the further inclusion of annotation statements.

Finally, concerning the semantic-related criteria that were set, the prototype implementation has proven quite flexible and able to operate successfully on at least two different testing laptops. The implementation is based on Java and is thus platform- and OS-independent. Additionally, the various steps of the method can easily be separated and run on different systems. As an example, different parsers can work in parallel on different evidence files of the case, and their results can then easily be merged into a single dataset. Such techniques were applied during the thesis in order to deal with increased complexity in a manual manner; future improvements may take advantage of contemporary distributed processing techniques such as WebPIE (Urbani et al. 2010), a parallel inference engine based on the Hadoop framework. The proof-of-concept system has been implemented using contemporary tools and libraries based on the current Semantic Web standards. To the best of the author’s knowledge, there is as yet no standard for the semantic representation of temporal information; however, the approach followed seems promising. This has led to the adoption of the Jess rule engine, which is able to handle the temporal rules even though it is not a Semantic Web based tool.
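The decoupling of parsers from integration, and the merge of their outputs into one dataset, can be sketched as follows. This is an illustrative simplification in Python (the prototype itself is in Java): each parser maps one evidence format to a set of (subject, predicate, object) statements, and the integrated dataset is simply their union. The example input lines and URN scheme are hypothetical, though the predicate names echo the ontologies used in the queries above.

```python
# Each semantic parser turns one evidence format into RDF-like triples.
def parse_firewall_log(lines):
    triples = set()
    for i, line in enumerate(lines):
        action, dst = line.split()          # e.g. "OPEN 78.46.104.43" (made-up format)
        event = f"urn:pfirewall.log#event_{i}"
        triples.add((event, "rdf:type", "xpfw:FirewallEvent"))
        triples.add((event, "xpfw:hasAction", action))
        triples.add((event, "xpfw:hasDestinationHost", f"urn:pfirewall.log#ip_{dst}"))
    return triples

def parse_blacklist(ips):
    # A second, independent parser over a different source.
    return {(f"urn:fire#ip_{ip}", "rdf:type", "fire:MaliciousHost") for ip in ips}

# Parsers can run in parallel, even on different machines; the single
# integrated dataset is just the union of their asserted statements.
dataset = parse_firewall_log(["OPEN 78.46.104.43"]) | parse_blacklist(["78.46.104.43"])
```

The bridging of the two IP individuals (the `FWLogHostToFireHost` step) would then be asserted on top of this merged dataset by the rule engine.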
Finally, the method can accept arbitrary user-specified concepts, such as ontology classes and properties, which can be asserted either as custom extensions of the various specific ontologies or as parts of an additional, separate ontology, as done in this thesis. Overall, the described method, along with its prototype implementation, fulfilled almost all of the specified criteria quite successfully. During the evaluation, the author was not able to gather accurate timing information, mostly because the testing systems were usually in use in parallel with the execution of the developed application, and because its performance varies with factors such as the CPU speed and RAM size of the system or the network speed. Furthermore, the size of the evidence files used and the amount and complexity of the generated triples can vary greatly, making such measurements difficult to obtain accurately. However, approximate results regarding the time- and complexity-related performance of the method have been presented, showing that the method is both feasible and relatively fast compared to a traditional manual analysis, while exploiting most of the advantages of the Semantic Web technologies in integrating and correlating heterogeneous data sources and maintaining the potential to adhere to the strict requirements of a forensic process.

9 Conclusions and Future Work

9.1 Conclusions

The aim of this thesis was to identify and describe the potential benefits that Semantic Web based technologies and ideas can bring to the area of digital investigations, as well as to tackle some of its most prominent problems. Such problems pertain to the ever-increasing amount and complexity of data, the heterogeneity and incompatibility of various disparate tools and techniques, and the lack of automation and advanced forms of analytical capability.
The thesis started with extended background research on both fields: the state of the art of digital investigations, covering both their conceptual foundations and the concrete problems and limitations they currently face; and the Semantic Web, with its stack of complementary technologies and standards and its distinctive capabilities in automated reasoning, rule evaluation and expressive querying. The thesis continued with a study and evaluation of recent approaches to merging these two fields, along with their promising results and possible shortcomings. Based on this background knowledge, the thesis proposed a generic and adaptable method based on the semantic representation, integration and correlation of digital evidence, describing both a generic conceptual framework bridging the two fields and a proof-of-concept implementation of it. The thesis continued with a demonstration of the method, using the implemented prototype system on two experiments that closely resemble a quite common contemporary way of compromising a system with malicious payloads over the Internet. The demonstration showed various examples of how sources of data of different origin and nature (disk images, network captures, firewall logs) can be automatically represented semantically according to respective ontologies, and how similar or identical entities can be integrated and correlated on various factors such as hash values, IP addresses, network blocks and time. The ability of such integrated and correlated data to provide fast and meaningful insight to the investigator has been showcased through a number of relevant queries of a combinatorial nature, which yield results with far less effort and much greater analytical richness.
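Correlation on a shared hash value, one of the integration factors just mentioned, is the idea that the SWRL bridging rules (e.g. `HTTPContentToVTFile`, `MediaFileToVTFile`) implement declaratively. A hedged procedural sketch, with purely illustrative MD5 values and paths:

```python
# Two evidence sources indexed by MD5 hash: HTTP response bodies carved
# from the packet capture, and files extracted from the disk image.
# All values below are illustrative, not from the actual case data.
http_bodies = {"9e107d9d372bb6826bd81d3542a419d6": "/~dossis/Windows7Serial.doc"}
disk_files  = {"9e107d9d372bb6826bd81d3542a419d6":
               "Documents and Settings/John/My Documents/Windows7Serial.doc"}

def bridge_by_hash(a, b):
    """Return (hash, item_a, item_b) links for every hash present in both
    sources -- the procedural analogue of a hash-equality bridging rule."""
    return [(h, a[h], b[h]) for h in a.keys() & b.keys()]

links = bridge_by_hash(http_bodies, disk_files)
```

Expressed as a rule rather than code, the same link is inferred once and then becomes available to every subsequent query, which is what makes the declarative formulation attractive.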
The thesis concluded with an evaluation of the proposed method against previously specified criteria, highlighting some of its strong points, such as increased automation, improved analytical capabilities, a decoupled implementation, and the ability to accept user-defined concepts, rules and queries. The evaluation also pinpointed some of the shortcomings of the method, mostly with respect to its performance and scalability, which can be further improved. Overall, the outcome of the thesis, in accordance with the research questions posed at the beginning, is that Semantic Web technologies can offer a great deal to the field of digital investigations, and to information security overall, due to their distinctive abilities in automation and data integration. An ontological representation of the various collected data can alleviate the problem of heterogeneous, non-integrable forensic and security tools and databases, while enabling a more abstract yet expressive conceptualization of existing forensic knowledge, thus lowering the barrier for people with less technical expertise. The thesis also showed that the (semi-)automated capabilities of an inference engine and a rule engine can improve current forensic techniques and analytical skills by allowing easy and fast ways to integrate and correlate forensic evidence, while encapsulating most of the complexity behind a querying interface that better matches the thinking process behind an investigation.

9.2 Future Work

Both fields, the Semantic Web on the one hand and digital forensics on the other, are continuously evolving and reshaping. The main contribution of the thesis was to propose the adoption of the Semantic Web framework as an enabling technology that can alleviate some of the most challenging problems that the field of digital forensics faces.
The main objective was to take advantage of key features of the Semantic Web in the areas of heterogeneous data integration, support for automation, and improved analytical capabilities for the investigator in the form of an expressive and flexible querying layer. The goal of such a proposal, as the proof-of-concept system showed, is to reduce the time and expertise the investigator needs to deal with an ever-increasing arsenal of specialized tools and data formats, as well as with the respective corpus of forensic knowledge and techniques. As such, further research should focus on studying how real investigations are conducted and on whether, and how, such an approach can produce tangible benefits when introduced into the workflow. A better evaluation of the method, in terms of both its time efficiency and its ability to deal with complex data, can be performed through real-life usage by investigation teams or individuals and through relevant research methods such as interviews and surveys. On the technical side, the proposed system was merely a proof of concept and still quite far from a production-level implementation. Numerous improvements can be made to such a system by taking advantage of recent advancements in both fields. The first and most important step would be the engineering of ontologies that cover additional areas such as OS artifacts, mobile forensics, and live memory. It would be beneficial if such ontologies could reach a consensus within the digital forensic community, since the use of common ontologies avoids the various difficulties that arise in multi-ontology environments where ontologies overlap. Even the ontologies described in this thesis do not cover their respective knowledge areas exhaustively and can be further extended.
One basic premise adopted in this thesis was that, since common domain ontologies have not yet been developed and standardized, the method can still be implemented even in such a multi-ontology environment. One of the main issues was the establishment of relations between individuals representing the same concept, e.g. the same IP address. The technique followed throughout this thesis was the use of SWRL rules for performing such ‘bridging’. This part can be further improved, and even automated, by taking advantage of recent developments in the automated linking of concepts and individuals across multiple ontologies. Furthermore, developments in distributed inference engines, RDF triple stores and federated querying can significantly improve the robustness, efficiency and computational feasibility of such a method, making it capable of dealing with massive amounts of data while reducing the time needed. Eventually, such technologies may even enable real-time integration and correlation of logs and captured events from various information security appliances such as firewalls, IDSs, antivirus software, and server and database systems. Finally, a graphical, interactive and user-friendly layer on top of such a method can greatly increase its usability while reducing the training time needed for adoption. Such a method could also serve an educational purpose, allowing persons with less technical background to develop the analytical skills needed during a digital investigation.

List of References

ACPO, 2007. Good Practice Guide for Computer-Based Evidence. Available at: http://www.7safe.com/electronic_evidence/ACPO_guidelines_computer_evidence_v4_web.pdf.

Abbott, J. et al., 2006. Automated recognition of event scenarios for digital forensics. Proceedings of the 2006 ACM Symposium on Applied Computing (SAC ’06), p.293. Available at: http://portal.acm.org/citation.cfm?doid=1141277.1141346.
Al-Feel, H., Koutb, M.A. & Suoror, H., 2009. Toward An Agreement on Semantic Web Architecture. 49, pp.806-810. Available at: http://www.akademik.unsri.ac.id/download/journal/files/waset/v49-142.pdf.

Alink, W. et al., 2006. XIRAF – XML-based indexing and querying for digital forensics. Digital Investigation, 3(Supplement 1), pp.50-58. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000776.

Allen, J.F., 1983. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), pp.832-843. Available at: http://portal.acm.org/citation.cfm?doid=182.358434.

Antoniou, G. & Van Harmelen, F., 2004. A Semantic Web Primer. The MIT Press.

Ayers, D., 2009. A second generation computer forensic analysis system. Digital Investigation, 6(Supplement 1), pp.S34-S42. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000371.

Basili, V.R., Caldiera, G. & Rombach, H.D., 1994. The goal question metric approach. In J.J. Marciniak, ed. Encyclopedia of Software Engineering. John Wiley & Sons, pp.528-532.

Berners-Lee, T., Fielding, R. & Masinter, L., 2005. RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax. Available at: http://tools.ietf.org/html/rfc3986.

Berners-Lee, T., 1998. Semantic Web road map. Design Issues for the World Wide Web. Available at: http://www.w3.org/DesignIssues/Semantic.html.

Berners-Lee, T. et al., 1992. World-Wide Web: The Information Universe. Internet Research, 2(1), pp.52-58. Available at: http://www.emeraldinsight.com/10.1108/eb047254.

Berners-Lee, T., Hendler, J. & Lassila, O., 2001. The Semantic Web. Scientific American, 284(5), pp.34-43. Available at: http://www.nature.com/doifinder/10.1038/scientificamerican0501-34.

Brezinski, D. & Killalea, T., 2002. RFC 3227: Guidelines for Evidence Collection and Archiving. Available at: http://portal.acm.org/citation.cfm?id=RFC3227.

Brinson, A., Robinson, A. & Rogers, M., 2006. A cyber forensics ontology: Creating a new approach to studying cyber forensics. Digital Investigation, 3(2), pp.37-43. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000703.

Carrier, B. & Spafford, E., 2006. Categories of digital investigation analysis techniques based on the computer history model. Digital Investigation, 3(Supplement 1), pp.121-130. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000739.

Carrier, B.D., 2006. A Hypothesis-Based Approach to Digital Forensic Investigations. CERIAS Tech Report 2006-06, Center for Education and Research in Information Assurance and Security, Purdue University, West Lafayette, IN 47907-2086. Available at: https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2006-06.pdf.

Carrier, B.D. & Spafford, E.H., 2004. An Event-Based Digital Forensic Investigation Framework. Proceedings of the 4th Digital Forensic Research Workshop (DFRWS), pp.1-12. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.3355&rep=rep1&type=pdf.

Carrier, B., 2003. Defining Digital Forensic Examination and Analysis Tools Using Abstraction Layers. International Journal of Digital Evidence, 1(4), pp.1-12. Available at: http://www.utica.edu/academic/institutes/ecii/publications/articles/A04C3F91-AFBB-FC13-4A2E0F13203BA980.pdf.

Carroll, J.J. et al., 2005. Named graphs, provenance and trust. Proceedings of the 14th International Conference on World Wide Web (WWW ’05), p.613. Available at: http://portal.acm.org/citation.cfm?doid=1060745.1060835.

Case, A. et al., 2008. FACE: Automated digital evidence discovery and correlation. Digital Investigation, 5(Supplement 1), pp.S65-S75. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287608000340.

Casey, E., 2004. Digital Evidence and Computer Crime: Forensic Science, Computers and the Internet. Academic Press.

Casey, E., 2002. Handbook of Computer Crime Investigation: Forensic Tools and Technology. p.462. Available at: http://www.ncjrs.gov/App/abstractdb/AbstractDBDetails.aspx?id=195111.

Cohen, M., Garfinkel, S. & Schatz, B., 2009. Extending the advanced forensic format to accommodate multiple data sources, logical evidence, arbitrary information and forensic workflow. Digital Investigation, 6, pp.S57-S68. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000401.

Cohen, M. & Schatz, B., 2010. Hash based disk imaging using AFF4. Digital Investigation, 7(Supplement 1), pp.S121-S128. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287610000423.

Connor, M.J.O. & Das, A.K., 2011. A Method for Representing and Querying Temporal Information in OWL. In A. Fred, J. Filipe & H. Gamboa, eds. pp.97-110.

Connor, M.O. et al., 2005. Supporting Rule System Interoperability on the Semantic Web with SWRL. The Semantic Web – ISWC 2005, LNCS 3729, pp.974-986. Available at: http://www.springerlink.com/index/f16373n77h8p2181.pdf.

Du, L. et al., 2008. A Latent Semantic Indexing and WordNet based Information Retrieval Model for Digital Forensics. In IEEE International Conference on Intelligence and Security Informatics. pp.70-75.

Forgy, C., 1982.
Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19(1), pp.17-37. Available at: http://linkinghub.elsevier.com/retrieve/pii/0004370282900200. Garfinkel, S, 2006. Forensic feature extraction and cross-drive analysis. Digital Investigation, 3, pp.71-81. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287606000697. Garfinkel, S.L., 2006. AFF : A New Format for Storing Hard Drive Imagens. Communications of the ACM, 49(2), pp.85-87. Garfinkel, S.L., 2009. Automating Disk Forensic Processing with SleuthKit, XML and Python. 2009 Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, pp.73-84. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5341559. Garfinkel, S.L., 2010. Digital forensics research: The next 10 years. Digital Investigation, 7(Suppl. 1), p.S64S73. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287610000368. Garfinkel, Simson, 2011. Digital forensics XML and the DFXML toolset. Digital Investigation, pp.1-14. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287611000910. Giova, G., 2011. Improving Chain of Custody in Forensic Investigation of Electronic Digital Systems. Journal of Computer Science, 11(1). Gladyshev, P. & Patel, A., 2004. Finite state machine approach to digital event reconstruction. Digital Investigation, 1(2), pp.130-149. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287604000271. Gruber, T.R., 1993. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), pp.199-220. Available at: http://linkinghub.elsevier.com/retrieve/doi/10.1006/knac.1993.1008. Guo, Y., Slay, J. & Beckett, J., 2009. Validation and verification of computer forensic software tools—Searching Function. Digital Investigation, 6(SUPPL.), p.S12-S22. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000358. Guðjónsson, K., 2010. 
Mastering the Super Timeline With log2timeline. SANS Institute. Haslhofer, B. & Neuhold, E.J., 2011. A Retrospective on Semantics and Interoperability Research. In D. Fensel, ed. Foundations for the Web of Information and Services. Springer Berlin Heidelberg, pp. 3-27. Available at: http://cs.univie.ac.at/research/research-groups/multimedia-information-systems/publikation/infpub/2921/. 114 Hevner, A.R. et al., 2004. Design Science in Information Systems Research. MIS Quarterly, 28(1), pp.75-105. Available at: http://www.jstor.org/stable/25148625. Hildebrandt, M., Kiltz, S. & Dittmann, J., 2011. A Common Scheme for Evaluation of Forensic Software. In IT Security Incident Management and IT Forensics (IMF), 2011 Sixth International Conference on. pp. 92-106. Hitzler, P. et al., 2009. OWL 2 Web Ontology Language Primer, W3C. Available at: http://www.w3.org/TR/2009/REC-owl2-primer-20091027/. Hobbs, J.R. & Pan, F., 2004. An ontology of time for the semantic web. Acm Transactions On Asian Language Information Processing, 3(1), pp.66-85. Available at: http://portal.acm.org/citation.cfm?doid=1017068.1017073. IOCE, 2002. G8 Proposed Principles for Forensic Evidence. Available at: http://www.ioce.org/fileadmin/user_upload/2002/G8 Proposed principles for forensic evidence.pdf [Accessed December 20, 2011]. Jean-Mary, Y.R., Shironoshita, E.P. & Kabuka, M.R., 2009. Ontology Matching with Semantic Verification. Web semantics Online, 7(3), pp.235-251. Available at: http://www.ncbi.nlm.nih.gov/pubmed/20186256. Kahvedzic, D. & Kechadi, T., 2009. DIALOG: A framework for modeling, analysis and reuse of digital forensic knowledge. Digital Investigation, 6(Supplement 1), p.S23 - S33. Available at: http://www.sciencedirect.com/science/article/pii/S174228760900036X. Kahvedžić, D. & Kechadi, T., 2011. Semantic Modelling of Digital Forensic Evidence. In O. Akan et al., eds. Digital Forensics and Cyber Crime. Springer Berlin Heidelberg, pp. 149-156. 
Available at: http://dx.doi.org/10.1007/978-3-642-19513-6_13. Keet, C.M. & Artale, A., 2007. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 3(1), pp.91-110. Available at: http://portal.acm.org/citation.cfm?id=1412417.1412418. Kent, K. et al., 2006. Guide to Integrating Forensic Techniques into Incident Response. Nist Special Publication, August(SP 800-86), p.121. Available at: http://cybersd.com/sec2/800-86Summary.pdf. Kitchenham, B., Linkman, S. & Law, D., 1997. DESMET: a methodology for evaluating software engineering methods and tools H. Rombach, V. Basili, & R. Selby, eds. Computing Control Engineering Journal, 8(3), p.120. Available at: http://link.aip.org/link/CCEJEL/v8/i3/p120/s1&Agg=doi. Koch, J., Velasco, C.A. & Abou-Zahra, S., 2011. HTTP Vocabulary in RDF 1.0. W3C Working Draft. Available at: http://www.w3.org/TR/2011/WD-HTTP-in-RDF10-20110510/. Kruegel, C., Valeur, F. & Vigna, G., 2005. Intrusion Detection and Correlation, Springer. Available at: http://www.netlibrary.com/Details.aspx. Kruse, W. & Heiser, J., 2002. Computer Forensics: Incident Response Essentials, Addison-Wesley. Available at: http://www.best-seller-books.com/computer-forensics-incident-response-essentials.pdf. Küster, U., König-ries, B. & Klusch, M., 2010. Criteria , Approaches and Challenges Evaluating Semantic Web Service Technologies M. L. A. Sheth, ed. Progressive Concepts for Semantic Web Evolution Application and Developments, pp.1-24. Lamis, T., 2010. A Forensic Approach to Incident Response. Human Factors, pp.177-185. Lee, S. et al., 2010. A proposal for automating investigations in live forensics. Computer Standards Interfaces, 32(5-6), pp.246-255. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0920548909000762. Levine, B.N. & Liberatore, M., 2009. DEX: Digital evidence provenance supporting reproducibility and comparison. Digital Investigation, 6(9), p.S48-S56. 
Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287609000395. Munir, R.F. et al., 2011. Detect HTTP Specification Attacks Using Ontology, IEEE. National Institute of Justice, U., 2011. U.S. National Institute of Justice. Crimes Scene Guides. Available at: http://www.nij.gov/topics/law-enforcement/investigations/crime-scene/guides/glossary.htm. Noy, N.F. & Klein, M., 2004. Ontology Evolution: Not the Same as Schema Evolution. Knowledge and Information Systems, 6(4), pp.428-440. Available at: http://springerlink.metapress.com/openurl.asp?genre=article&id=doi:10.1007/s10115-003-0137-2. Palmer, G., 2001. A Road Map for Digital Forensic Research. New York, 1, pp.27–30. Available at: http://www.dfrws.org/2001/dfrws-rm-final.pdf. Parsia, B., Sattler, U. & Schneider, T., 2008. Easy Keys for OWL. OWLed. Available at: http://www.webont.org/owled/2008/papers/owled2008eu_submission_3.pdf. Patzakis, B.J., 2003. Maintaining The Digital Chain of Custody. IFOSEC. 115 Rekhis, S. & Boudriga, N., 2011. Logic-based approach for digital forensic investigation in communication Networks. Computers Security, In Press,. Available at: http://www.sciencedirect.com/science/article/B6V8G52BPJWH-1/2/1f69b962893b83cb7ceffcc14c4ee2e3. Reynolds, D. et al., 2005. An assessment of RDF / OWLmodelling. October, 12, pp.2005–189. Available at: http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf. Saad, S. & Traore, I., 2010. Method ontology for intelligent network forensics analysis. In Privacy Security and Trust {(PST)}, 2010 Eighth Annual International Conference on. pp. 7-14. Scarfone, K. & Masone, K., 2004. Computer Security Incident Handling Guide Recommendations of the National Institute of Standards and Technology. Nist Special Publication, 2(Revision 1), pp.800–61. Available at: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Computer+Security+Incident+Handling+Gu ide+Recommendations+of+the+National+Institute+of+Standards+and+Technology#1. Schatz, B., Mohay, G. 
& Clark, A., 2004a. Generalising Event Forensics Across Multiple Domains. Most. Schatz, B., Mohay, G. & Clark, A., 2004b. Rich Event Representation for Computer Forensics. Asia Pacific Industrial Engineering and Management Systems APIEMS 2004, pp.1-16. Available at: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:RICH+EVENT+REPRESENTATION+FO R+COMPUTER+FORENSICS#0. Schatz, B.L., 2007. Digital evidence: representation and assurance. Queensland University of Technology. Available at: http://eprints.qut.edu.au/16507/. Schuster, A., 2010. Recent Advances in Memory Forensics. In ZISC. Scientific Working Group On Digital Evidence, 2011. Scientific Working Group on Digital Evidence ( SWGDE ) SWGDE Model Quality Assurance Manual for Digital Scientific Working Group on Digital Evidence ( SWGDE ). Quality Assurance, 2011(Version 1), pp.1-117. Shah, U., Finin, T. & Joshi, A., 2002. Information retrieval on the semantic web. In C. Nicholas et al., eds. Proceedings of the eleventh international conference on Information and knowledge management. ACM New York, NY, USA, pp. 461–468. Available at: http://portal.acm.org/citation.cfm?id=584868. Stallard, T. & Levitt, K., 2003. Automated analysis for digital forensic science: semantic integrity checking L. Karl, ed. 19th Annual Computer Security Applications Conference 2003 Proceedings, 0(Acsac), pp.160-167. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1254321. Stevens, R.M. & Casey, Eoghan, 2010. Extracting Windows command line details from physical memory. Digital Investigation, 7(Suppl. 1), p.S57-S63. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287610000356. Stone-Gross, B. et al., 2009. FIRE: FInding Rogue nEtworks. 2009 Annual Computer Security Applications Conference, pp.231-240. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5380682. Suciu, D., 1998. An overview of semistructured data. SIGACT News, 29(4), pp.28-38. 
Available at: http://portal.acm.org/citation.cfm?id=306198.306204. Tan, T., Ruighaver, T. & Ahmad, A., 2003. Incident Handling : Where the need for planning is often not recognised. Network, (November), pp.1-10. Turner, P., 2005. Unification of digital evidence from disparate sources (Digital Evidence Bags). Digital Investigation, 2(3), pp.223-228. Available at: http://linkinghub.elsevier.com/retrieve/pii/S1742287605000575. Urbani, J. et al., 2010. OWL reasoning with WebPIE: calculating the closure of 100 billion triples L. Aroyo et al., eds. The Semantic Web Research and Applications, 6088, pp.213-227. Available at: http://www.springerlink.com/index/2581664J64961667.pdf. Vermaas, O., Simons, J. & Meijer, R., 2010. Open Computer Forensic Architecture a Way to Process Terabytes of Forensic Disk Images Huebner E And Zanero S, ed. Architecture, pp.45-67. Wikipedia, 2011. Daubert standard. Available at: http://en.wikipedia.org/wiki/Daubert_standard [Accessed December 15, 2011]. Willassen, S.Y., Chapter 1 HYPOTHESIS BASED INVESTIGATION OF DIGITAL TIMESTAMPS. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.173.5585 [Accessed January 10, 2012]. Willassen, S.Y., 2008. Timestamp evidence correlation by model based clock hypothesis testing. , p.15. Available at: http://dl.acm.org/citation.cfm?id=1363217.1363237 [Accessed January 10, 2012]. 116 Zhang, S. et al., 2011. An Ontology-Based Context-aware Approach for Behaviour Analysis. In L. Chen et al., eds. Activity Recognition in Pervasive Intelligent Environments. Atlantis Press, pp. 127-148. Available at: http://dx.doi.org/10.2991/978-94-91216-05-3_6. Zhao, Y. & Sandahl, K., 2002. Potential advantages of semantic web for internet commerce. Computer, (3). Available at: www.ida.liu.se/~yuxzh/doc/iceis-030120.pdf. 117 Departmen of Computer and Systems Sciences Stockholm University Forum 100 SE-164 40 Kista Phone: 08 – 16 20 00 www.su.se 118
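The bridging idea described above can be illustrated with a minimal sketch. The property names (`netont:ipValue`, `fwont:address`) and individuals are hypothetical, invented here for illustration; a real deployment would express the same pattern as a SWRL rule of the form `netont:ipValue(?x, ?v) ^ fwont:address(?y, ?v) -> owl:sameAs(?x, ?y)` and let a reasoner materialize the links. The sketch emulates that rule procedurally over a small in-memory triple list:

```python
from collections import defaultdict

# Hypothetical triples from two independently modelled evidence ontologies.
# Both represent IP addresses as individuals with a literal value property,
# but under different vocabularies.
triples = [
    ("netont:ip_1", "netont:ipValue", "192.168.1.10"),
    ("fwont:addr_A", "fwont:address", "192.168.1.10"),
    ("fwont:addr_B", "fwont:address", "10.0.0.5"),
]

# Properties that, in each ontology, carry the notion of "IP address value".
IP_PROPERTIES = {"netont:ipValue", "fwont:address"}

def bridge_same_ip(triples):
    """Emulate the SWRL bridging rule
    netont:ipValue(?x, ?v) ^ fwont:address(?y, ?v) -> owl:sameAs(?x, ?y):
    individuals whose IP-valued properties hold the same literal are
    linked pairwise with owl:sameAs."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in IP_PROPERTIES:
            by_value[o].add(s)
    links = []
    for value, subjects in by_value.items():
        ordered = sorted(subjects)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                links.append((a, "owl:sameAs", b))
    return links

print(bridge_same_ip(triples))
```

Only `192.168.1.10` occurs under both vocabularies, so a single `owl:sameAs` link is produced between `fwont:addr_A` and `netont:ip_1`; `10.0.0.5` has no counterpart and yields nothing. Once such links are asserted, an OWL reasoner treats the bridged individuals as one, so queries phrased against either ontology reach the evidence recorded under the other.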