Royal Military Academy
Brussels, BELGIUM
www.rma.ac.be

Comparison of Data Base Technologies for APT Detection

Abidi Marwa

2014

Abstract

Nowadays, cyber security is becoming a primary concern for organizations as well as for individuals. As cyber space is considered vital for multiple key functions, such as air traffic control systems and nuclear reactor safety systems, cyber security issues are emerging as conspicuous elements of many studies. Data theft, network disruption and other incidents can affect companies and individuals in ways that range from embarrassing to lethal. Communication, research, development and most aspects of personal life and business now rely on networks. Governments, militaries, corporations, social networks, etc. collect, process and store enormous amounts of restricted data every day, which makes cyber security a high-priority concern. Given the noticeable growth of complex, sophisticated and advanced cyber attacks, appropriate security measures are required to protect confidential information and to prevent disastrous consequences. Among the most daunting challenges for cyber security solutions are "Advanced Persistent Threats" (APT). APT are relentless and targeted cyber attacks, considered the latest cyber security scourge. These attacks are most of the time carried out by nation states or hacktivists possessing high technical capabilities and the appropriate material infrastructure, giving them the opportunity to achieve their goals without being noticed or detected by existing security tools. In order to detect APT, large amounts of data must be processed during the investigation, given the silent nature of these intrusions, an operation which is extremely time consuming. One possibility to perform these analyses consists in using databases, which are far from obsolete for this task.
This project consists in a study of how data can be interrogated through sets of queries, together with a comparison of the performance of three databases: a classical SQL database, a NoSQL database and a database specialized in text search. It is performed in support of the development of the MASFAD project (Multi-Agent System For APT Detection), funded by the Belgian Royal Military Academy. Within this thesis, we will demonstrate that MongoDB is better adapted than the other databases to the monitoring of large data sets, and we will show an interesting behavior exhibited by PostgreSQL on complex queries.

Foreword

When I began working on this Master thesis, I had no idea how it would influence my life, not only during its elaboration but also for the remainder of my career. This work gave me the opportunity to learn in many respects and to improve my knowledge of network security, database management and Java programming. I wish to sincerely thank all the people who have believed in me and who have always been there for me during the best and worst moments. I thank my family, my fiancé and my friends for their patience. I thank my promoters for the great help and support they gave me and for letting me freely choose the orientation of this project. I thank the employees of the Tunisian Military Academy and the Tunisian Ministry of Defense for their support. I am grateful for this opportunity that let me experience such a wonderful journey. I thank the Belgian Royal Military Academy staff for their hospitality and guidance. I hope I made you all proud.

Marwa

Contents

List of Algorithms
List of Figures
List of Tables

1 Introduction
  1.1 Preamble
  1.2 Motivation
  1.3 Objectives of the project
  1.4 State of the Art
    1.4.1 APT definition
    1.4.2 Terminology
    1.4.3 History
    1.4.4 Life cycle
    1.4.5 APT detection
  1.5 Originality and output
  1.6 Project outline

2 Comparative study of databases
  2.1 Introduction
  2.2 Preliminary comparison
    2.2.1 Classical SQL databases
    2.2.2 NoSQL databases
    2.2.3 Databases specialized in text search
  2.3 Presentation of the used databases
    2.3.1 PostgreSQL
    2.3.2 MongoDB
    2.3.3 Elasticsearch
      2.3.3.1 Presentation
  2.4 Comparative tables
  2.5 Summary

3 Implementations
  3.1 Introduction
  3.2 Data storage
    3.2.1 Data format
    3.2.2 Database design
    3.2.3 Codes
    3.2.4 Elasticsearch
  3.3 Agent Uploader
    3.3.1 Algorithm
    3.3.2 Codes
      3.3.2.1 PostgreSQL
      3.3.2.2 MongoDB
      3.3.2.3 Elasticsearch
  3.4 Agent Uploader with aggregations
    3.4.1 Algorithm
    3.4.2 Codes
      3.4.2.1 PostgreSQL
      3.4.2.2 MongoDB [Net 22]
      3.4.2.3 Elasticsearch
  3.5 Summary

4 Performance tests
  4.1 Introduction
  4.2 Methodology
  4.3 First test: Data import
    4.3.1 Presentation
    4.3.2 Results
      4.3.2.1 Before indexing
      4.3.2.2 After indexing
    4.3.3 Analysis
  4.4 Second test: Agent Uploader
    4.4.1 Presentation
    4.4.2 Results
      4.4.2.1 Before indexing
      4.4.2.2 After indexing
    4.4.3 Analysis
  4.5 Third test: Agent Uploader with aggregations
    4.5.1 Presentation
    4.5.2 Results
      4.5.2.1 Before indexing
      4.5.2.2 With indexing
    4.5.3 Analysis
  4.6 Comparison of the two agents
    4.6.1 Presentation
    4.6.2 PostgreSQL
    4.6.3 MongoDB
    4.6.4 Analysis
  4.7 Summary

5 Conclusion
  5.1 Ascertainment
  5.2 Recommendations
  5.3 Closing remarks

Bibliography
Netography
Appendices
  PostgreSQL Codes
  MongoDB Codes
  Elasticsearch Codes

List of Algorithms

3.1 Agent Uploader
3.2 Agent Uploader with aggregations

List of Figures

1.1 APT attack life cycle
2.1 Generalized Full-Text Architecture
2.2 MongoDB Subscriptions
3.1 DBloading: GetConnection query
3.2 DBloading: extracting the transaction's fields
3.3 DBloading: Insert query
3.4 PostgreSQL: data table
3.5 MongoLoading: GetConnection query
3.6 MongoLoading: Insert query
3.7 MongoDB: data collection
3.8 ElasticLoading: getClient query
3.9 ElasticLoading: Insert query
3.10 Elasticsearch: data type
3.11 PostgreSQL: GetClients query
3.12 PostgreSQL: GetServers query
3.13 PostgreSQL main: Select query
3.14 MongoDB: GetClients query
3.15 MongoDB: GetServers query
3.16 MongoDB main: Select query
3.17 Elasticsearch: GetClients query
3.18 Elasticsearch: GetServers query
3.19 Elasticsearch main: Select query
3.20 PostgreSQL main: Select with group by query
3.21 MongoDB main (with group by): Operation $match
3.22 MongoDB main (with group by): Operation $project
3.23 MongoDB main (with group by): Operation $group
3.24 MongoDB main (with group by): aggregation execution
3.25 Elasticsearch main (with group by): aggregation query
4.1 Comparative diagram: data import
4.2 Comparative curve: impact of the data size rise on the data import
4.3 Comparative diagram: Agent Uploader without indexing
4.4 Comparative diagram: Agent Uploader using single-field indexes
4.5 Comparative diagram: Agent Uploader using multiple-field index
4.6 Comparative curve: impact of the rise of data size on Agent Uploader
4.7 Comparative diagram: Agent Uploader with aggregations without indexing
4.8 Comparative diagram: Agent Uploader with aggregations using single-field index
4.9 Comparative diagram: Agent Uploader with aggregations using multiple-field index
4.10 Comparative curve: impact of the rise of data size on Agent Uploader with aggregations
4.11 Comparative diagram: comparison of the two agents on PostgreSQL
4.12 Comparative diagram: comparison of the two agents on MongoDB

List of Tables

2.1 Text-search databases taxonomy
2.2 Databases comparison: organization and community
2.3 Databases comparison: compatibility
2.4 Databases comparison: administration
2.5 Databases comparison: general information
2.6 Databases comparison: costs and software license
2.7 Databases comparison: capabilities/limitations
4.1 Data import: results
4.2 Data import: results on indexed tables
4.3 Agent Uploader: results without indexing
4.4 Agent Uploader: results using single-field indexes
4.5 Agent Uploader: results using multiple-field index
4.6 Agent Uploader with aggregations: results without indexing
4.7 Agent Uploader with aggregations: results using single-field indexes
4.8 Agent Uploader with aggregations: results using multiple-field index
4.9 PostgreSQL results: comparison of the two agents
4.10 MongoDB results: comparison of the two agents

Chapter 1

Introduction

1.1 Preamble

As an introduction to this project, we provide some definitions which are useful for a better understanding of this work.

Cyber space is the virtual terrain in which computers communicate through the exchange of data blocks between networks.
A cyber attack is an attempt to make a computer network malfunction, or to steal confidential data, by taking advantage of existing vulnerabilities.

An APT attack is an advanced and complex cyber attack performed by highly skilled hackers, and even by governments, in order to intercept and exfiltrate critical data.

A database is an information container offering a flexible infrastructure for data storage and querying.

1.2 Motivation

Among the large number of cyber attacks occurring daily, APT are considered the most dangerous and severe. These intrusions are sophisticated and silent: they can remain unperceived for a long period of time, during which critical and high-value information is exfiltrated from high-profile targets. In light of these aspects, serious measures should be taken in order to block potential attacks. According to Mandiant's annual threat report [1]:

• 66% of organizations are unaware that they are under attack, and are only notified by an external entity.
• Most attacks can remain unnoticed for nearly eight months.
• Hackers rely on outsourced service providers and partners in order to remain unobserved.
• A reconnaissance step is employed, in which attackers inspect the ordinary behavior of the network entities so they can mimic it and not arouse suspicion.

In order to carry out an APT attack, hackers attempt to use new, unknown vulnerabilities and to constantly change their hacking strategies in order to bypass traditional security measures. Nevertheless, these attacks show some distinguishable signatures which can be used for detection purposes. Many open source software tools monitoring these patterns are nowadays available for public use. These tools can ensure early APT detection by analyzing outbound traffic and uncovering suspicious inter-zone communications. They implement different detection techniques and can be used together to build a complete detection strategy.
Here are some examples:

• Splunk: a search, analysis and reporting tool that indexes and queries data coming from various sources, and triggers alerts when abnormal behavior is noticed.
• Suricata: a Network Security Monitoring engine, an Intrusion Detection System and an Intrusion Prevention System at once. It allows writing rules against protocols rather than ports, making it possible to bring a malicious communication with a Command and Control server down.
• Kismet: a sniffer, Intrusion Detection System and wireless network detector operating on 802.11, layer 2. It sniffs wireless traffic, analyzes the data, and can detect the presence of hidden or default networks as well as probe requests.
• Snort: a signature-based Network Intrusion Detection System working with a combination of rules and pre-processors to analyze network traffic in real time, search within its database for the attack and finally report the results.
• Squert: a visual engine which analyzes and queries events stored in an internal database in order to retrieve result sets, to which it adds further information with a view to discovering new ways of interrogating the data.

APT detection is a process that requires monitoring huge amounts of data, such as proxy or firewall log files, a challenging task for traditional IT systems and security tools. To carry out an investigation on such volumes of data, analysts have to perform complex queries. Databases are certainly not obsolete here: they can be used to store the data and to let analysts interrogate it through sets of queries. Big data is difficult to handle, and processing it within a tolerable execution time is a challenge, but recent software tools with high-performance indexes and distributed-architecture possibilities appear to be of great advantage. Much research in APT detection nowadays faces the recurring problem of Big Data.
Big Data technology can make important contributions, as its analysis improvements offer efficient opportunities to improve the exposure of APT intrusions in critical security situations. HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics [3]. In 2010, the Economist annual report recorded that this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole. In order to develop a multi-agent system for APT detection, the MASFAD project is being funded by the European Defence Agency (EDA), the Belgian Royal Military Academy (RMA), the Netherlands Organisation for Applied Scientific Research (TNO), and the German Fraunhofer Institute for Communication, Information Processing and Ergonomics (FKIE), who are joining forces to offer the military network analyst a powerful tool ensuring a high level of defense against APT.

1.3 Objectives of the project

From the preceding discussion and definitions, it clearly appears that, in order to improve Big Data processing time in the APT detection procedure, we need to work with recent big data tools which offer the opportunity to unlock significant value by making information accessible in nearly real time. The main goal of this project is to study how to translate an analyst's complex questions into sets of queries, and to carry out a comparison between the PostgreSQL, MongoDB and Elasticsearch database management systems in order to determine which database offers the best performance in terms of processing and execution time. To validate the chosen approach, two different versions of a Java agent that investigates abnormal upload behavior will be implemented and tested against the three different database types. The data sets manipulated will consist of log files from a proxy server of a medium-sized network.
The primary concerns for us will be:

• How many inserts per second can the database support?
• How fast will the database respond to the queries?
• How will each database manage the growth of event data?

In the next section, we present the State of the Art in Advanced Persistent Threats.

1.4 State of the Art

1.4.1 APT definition

Advanced Persistent Threats are prolonged and silent cyber attacks, characterized as sophisticated and complex hacking operations, as they are coordinated by large groups of proficient hacktivists, generally supported by governments, to conduct multipurpose espionage. They target governments, the military, and companies from the business, financial, social and health care sectors, etc., where valuable data can be found and be of great benefit.

To be really successful, an APT operation requires a prolonged period of time as well as a high level of stealthiness. The persistent aspect is usually fulfilled through the use of an external Command and Control server, which retrieves targeted strategic information via a pre-built botnet army within the victim's network [Net 2].

1.4.2 Terminology

Advanced Persistent Threats seem hard to define due to the large number of existing definitions. The term was first used in 2006 by the United States Air Force (USAF) in order to conceal the intrusion investigations and prevent civilian interference [2]. In the next paragraphs we will try to clarify the APT definition by explaining its terminology [4], [5]:

Advanced: APT actors acquire multiple existing malware technologies as well as deep knowledge and technological skills, making them capable of creating new, unknown crimeware by combining existing components, which makes it ever more powerful and hard to detect. As these groups are most of the time sponsored by very powerful entities, and even nation states, they have access to advanced information-gathering devices such as telephone-spying equipment.
Persistent: an APT attack usually takes a long period of time to carry out (it can be performed over several years), during which small amounts of important data are retrieved each time, in a strategy described as "low and slow" that allows the intrusion to remain silent and not arouse suspicion by showing deviant behavior.

Threats: APT are aggressions conducted by humans with the intention of stealing valuable information in order to make subsequent use of it. They cannot be automated or executed by readily available programs and modules.

1.4.3 History

The examination of APT history is a crucial part of APT understanding, as it permits the derivation of useful observations that help in future detection procedures.

Targeted, socially engineered emails dropping Trojans to exfiltrate confidential information were noticed by both the U.K. National Infrastructure Security Co-ordination Centre (UK-NISCC) and the U.S. Computer Emergency Response Team (US-CERT) in June 2005. This attack was qualified as an Advanced Persistent Threat, as it lasted for a significant period of time during which it was able to bypass all the existing security tools [6]. The designation APT is attributed to an attack showing signs of advancedness and persistence. Here are some examples of the most famous APT intrusions of the past few years [Net 1], [7]:

• The Gozi virus: first detected in 2007, it is a banking virus designed to steal personal banking information. One million computers were infected by the virus in many countries, such as the U.S., the U.K., Germany, Poland, France, Finland, Italy and Turkey, and even systems at NASA. It was mainly spread to the victims through a seemingly benign PDF document, among other methods. The virus was sold to many hacktivist groups around the world by its developers; many variants of it were detected over the past years, and banks continue to experience intrusions from Gozi.
• Operation Aurora: a series of cyber attacks against companies in multiple industries, which started in 2009 and is reported to have originated in China. The attack began with a large phishing operation to gain access to the victims' systems, where a Trojan horse designed to retrieve data was installed. The companies that were attacked preferred to remain anonymous, fearing to lose clients and shareholders through a lack of confidence in their security systems, which allowed the attackers to keep performing the intrusions more widely. The first public announcement was made by Google in January 2010.

• RSA breach: a successful APT attack against the network of RSA, an American computer and network security company, detected in March 2011. Similar to Operation Aurora, it began with a successful phishing campaign exploiting an Adobe Flash vulnerability and obtaining access to, and control over, the victims' computers. The malware used was a remote access Trojan (RAT) named Poison Ivy, a program that installs a back door on the victim's computer, giving the attacker administrative control over the infected machine. The attack was far from complex, but it was very effective, as data relating to RSA's best-selling SecurID authentication technology was stolen.

• Flame: multiple cyber espionage attacks on governmental ministries, educational institutions and individuals in Middle Eastern countries, discovered by Iran's National Computer Emergency Response Team in 2012. Unlike the other attacks presented above, a sophisticated malware was spread via USB keys into internal networks.

Footnotes:
1. A Trojan horse is a malware program containing malicious code whose execution causes data theft or system damage.
2. NASA stands for the National Aeronautics and Space Administration.
1.4.4 Life cycle

Basically, an Advanced Persistent Threat life cycle is divided into four major stages, as shown in the following figure:

Figure 1.1: APT attack life cycle
(source: blog.trendmicro.com/trendlabs-security-intelligence/in-depth-look-apt-attack-tools-of-the-trade/)

1. Preparation: the first thing the hackers need is to determine their target. Depending on the security level of the victim's information system, they seek help from other hacktivist groups, companies or even nation states. The next step consists in elaborating the intrusion strategy and gathering the devices, malware and techniques needed for the intrusion mission. They might also decide to develop their own crimeware by combining existing techniques, in order to make the attack more complex and harder to notice.

2. Infection: the second stage consists in infiltrating the victim's local network by gaining access and administrative control over local machines. This step can be achieved with the use of different techniques, such as phishing and social engineering, and by taking advantage of existing network vulnerabilities and installing covert backdoors.

3. Deployment: when the hackers are sure of possessing enough control over the target's network, the main malware causing the attack is spread. This crimeware is designed to gather valuable data from the internal network. Communications between the hackers and the victims are managed through a Command and Control server (C&C server), which guarantees the anonymity of the attackers. To raise the number of infected machines in the targeted network, attackers perform lateral movement, which consists in infecting other machines and taking control over them.

4.
Persistence: an APT intrusion is always a long-term mission. To maintain a presence inside the victim's system, APT performers make sure to remain silent and transparent. They tend to mimic the normal behavior of the machines inside the network in order to bypass all security measures and devices. They also ensure continued control over the access channels and equipment acquired previously to achieve their goals, and keep exfiltrating stolen confidential data from the targeted network.

1.4.5 APT detection

As previously detected APT attacks have shown noticeable attributes, current and future detection and investigation procedures will be based on similar signatures, whose analysis can lead to an early discovery of such intrusions. Furthermore, APT rely on Command and Control (C&C) servers, whose activity can be identified, contained and disrupted through the analysis of outbound network traffic [4].

Footnotes:
5. Phishing is the act of disguising oneself as a trustworthy person in order to gain one's confidence in an electronic communication, which permits the transmission of malware to the victim.
6. Social engineering is a technique used to influence employees psychologically into revealing important information about their jobs, as part of a process of gathering valuable data on the victim before performing an attack.
7. A C&C server is a machine capable of sending commands to equipment that is part of a botnet army, and of retrieving outputs from it.
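To make the outbound-traffic analysis mentioned above concrete, the following sketch flags clients whose uploaded volume dwarfs what they download, a pattern compatible with exfiltration toward a C&C server. It is an illustrative Python snippet, not part of the thesis's Java implementation; the record layout, field names and thresholds are all invented for the example.

```python
from collections import defaultdict

# Hypothetical proxy-log records: (client_ip, server_ip, bytes_sent, bytes_received).
# Real proxy logs (e.g. from Squid) carry more fields and a different layout.
records = [
    ("10.0.0.5", "203.0.113.9", 520_000, 4_000),
    ("10.0.0.5", "203.0.113.9", 610_000, 3_500),
    ("10.0.0.7", "198.51.100.2", 1_200, 90_000),
]

def suspicious_uploaders(records, ratio_threshold=10.0, min_bytes=100_000):
    """Flag clients whose total outbound volume far exceeds inbound traffic,
    a possible sign of data exfiltration toward a C&C server."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for client, _server, b_sent, b_recv in records:
        sent[client] += b_sent
        received[client] += b_recv
    flagged = []
    for client in sent:
        # Require both a minimum absolute volume and a skewed up/down ratio,
        # so ordinary low-traffic clients are not reported.
        if sent[client] >= min_bytes and sent[client] / max(received[client], 1) >= ratio_threshold:
            flagged.append(client)
    return flagged

print(suspicious_uploaders(records))  # → ['10.0.0.5']
```

The same aggregation (sum of bytes sent, grouped by client) is what the thesis's Agent Uploader expresses as database queries against the three systems under test.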
In order to effectively detect APT, a security strategy should be put into place, using multiple software tools and implementing several security measures such as [7]:

• Rule sets
• Statistical and correlation methods
• Manual approaches
• Automatic blocking of data exfiltration

Nowadays, multiple commercial software tools are available on the market, implementing different defense technologies such as Data Loss Prevention (DLP, a technology designed to detect and prevent potential data breaches or exfiltration transmissions), advanced reporting, forensic analysis, real-time threat analysis, etc. The following list presents some examples of dedicated APT detection tools, as well as some of their distinctive properties:

Verint CYBERVISION Advanced Detection System [Net 3]: this detection system offers high-performance and scalable network security by providing superior malware detection capabilities to national-level organizations and cyber security operation centers (CSOCs). It implements a unique centralized approach and is able to provide a holistic overview of all the relevant critical networks at a national level. Its most distinctive characteristic is its ability to enable a fast and coordinated response in case of a security breach.

FireEye Threat Prevention Platform [Net 4]: a system that supplements traditional security defenses such as firewalls and Intrusion Detection Systems. By enabling the aggregation and correlation of events, it can identify blended attacks. It is also considered useful for deeper, hands-on analysis and investigation of APT, by building a 360-degree, stage-by-stage analysis of the advanced attack. It uses a cloud platform which efficiently shares auto-generated threat intelligence, such as covert callback channels, as well as new threat findings from FireEye Labs.
ISC8 Cyber adAPT system [Net 5]: As an APT detection tool, it learns the network topology and leverages knowledge of advanced malware’s actions after weeks and months of network activities analysis and correlations. It also offers signature-less, network based advanced malware detection at speeds of 10Gb and higher thanks to the implementation of a sensor-based, near real-time forensics technology. This advanced technology identifies next-generation APT ahead of perimeter solutions before devastating damage or critical data theft can occur. These detection systems present great capabilities in the APT detection field. Therefor, they can not be fully trustworthy as APT guise the local network typical behavior so they can manage to bypass them. For this reason, a detection system that is adequate for the internal network dependencies and needs will be of a higher benefit. 1.5 Originality and output A deep analysis of the APT detection requirements as well as some of the most famous attacks during the past few years allowed us to identify APT attributes which can be used in the investigation process. Existing commercial and open source solutions don’t seem to be advantageous for the protection of a local military network which totally justifies the need of the development of a customized APT detection system. In order to study the efficiency of databases use within the MASFAD project, we have to perform several tests that will allow us to choose the more suitable database management system for our detection needs. We will effectuate our study through the investigation of anomalous uploading attitudes which can indicate an APT intrusion. A high rate of data uploading can lead to the identification of a Command and Control channel and an outputs exfiltration from machines being part of a botnet within the internal network. The main output of this work is: • A comparison between three different databases performances rel- CHAPTER 1. 
atively to APT detection needs in terms of monitoring enormous amounts of data.
1.6 Project outline
This dissertation is organized in four chapters and a concluding one. The present chapter has given the motivations and the goals of the project, as well as a review of the state of the art in the relevant domain. An overview of the other chapters is given below. Chapter 2 begins with a presentation of the different databases used during this work. It is followed by a section devoted to a comparative study of these databases, justifying our choices by showing that they fulfill the requirements of an APT detection system. Chapter 3 presents the implementation of our agent as well as the different data import codes. We will explain the main programming features that we have used, and we will try to simplify our design in order to make this dissertation easier to understand for its readers. Chapter 4 is devoted to the results and analysis of detailed performance tests. The last chapter presents a subjective point of view about the different databases studied during this project. Finally, future study directions are suggested and global conclusions are drawn.
Chapter 2 Comparative study of databases
2.1 Introduction
Databases are becoming more and more ubiquitous and essential to the aspects of modern societies based on computer and network communications. In this chapter, we will study the different types of databases. This study will be divided into two major stages. The first one will consist in a preliminary analysis of the three database categories used during this project, with the purpose of defining the general concepts that vary from one category to another. The second one will be devoted to a comparison of the three database management systems on which this study is based.
2.2 Preliminary comparison
2.2.1 Classical SQL databases
Databases have been in use since the 1960s.
The need to create such systems for data organization and management was due to the remarkable speed of data production, as well as to the rise of data demand and accessibility throughout the strata of modern society. The first prototype SQL database model was developed by IBM in 1974 [8]. As shown by its denomination, this database category is based on the SQL1 language, which is a normalized language conceived to query relational databases.
1 SQL stands for Structured Query Language.
CHAPTER 2. COMPARATIVE STUDY OF DATABASES 14
This standard is divided into three classes depending on the type of data querying:
• DDL: Data Definition Language, for data description, structuring and codification.
• DML: Data Manipulation Language, for data restitution: insertion, modification and interrogation.
• DCL: Data Control Language, for data administration, security and integrity insurance.
SQL databases are necessarily relational ones. In fact, a relational database is based on a group of concepts that define the database structure. This structure is called a schema, and it describes the data, the relations between them, as well as a set of integrity rules which should be respected. In a relational database, data are grouped in tables composed of rows and columns. The tables are vertically scalable, meaning that the database is scaled by increasing the horse-power of the hardware. This scaling process is managed by increasing the load capacity through additional CPU, RAM or SSD resources, etc. Using the SQL language makes these databases very powerful when it comes to performing complex queries, and suitable for highly transactional applications. However, they are not considered a best fit for hierarchical data storage [Net 6].
2.2.2 NoSQL databases
This new type of database came out as a reaction against the traditional relational data model, caused by the enormous explosion in data volumes [8].
Actually, the term NoSQL refers to "not only SQL", and its main characteristic is being schemaless. NoSQL offers a new way of data storage and querying, different from the tabular relational model, and shows an easier integration into applications as it has fewer restrictions. For this reason, NoSQL databases are known to be more flexible than relational ones. Schema changes driven by the evolution requirements of applications will not drag along long, complicated modifications, which enables a faster data iteration and a better integrity management [9]. We don't talk about tables anymore: it's all about collections, which can store heterogeneous data rows called documents. NoSQL databases introduce a new way of data design which takes future progression into consideration. They are horizontally scalable and fit perfectly the addition of clusters within the staple infrastructure, for the purpose of handling highly transactional behavior [Net 7]. Data is stored using the key-value pair (KVP) model, which makes the databases untyped, so that most of the data is stored with the string type. However, this data storage model isn't ACID compliant, unlike the typical relational model. For this reason, database designers and implementers have to think about data placement, replication and fault tolerance, as these are not expressly controlled by the technology itself [10]. ACID actually refers to [Net 8]:
• Atomicity: transactions have to follow an "all or nothing" rule, which means that an interrupted modification is entirely rolled back.
• Consistency: the database is subject to a set of integrity and consistency rules which must be fulfilled by any data insertion request.
• Isolation: transactions occurring at the same time must not impact or perturb each other's execution.
• Durability: any transaction which has been committed cannot be lost, and the modifications that it has brought to the database cannot be wasted randomly.
Databases which do not respect the ACID model, one of the most crucial concepts in database theory, are not considered reliable. The term "not only SQL" refers to the fact that the SQL language can also be used within this type of database. These databases are not fully stable as of yet, and their support is not considered good enough: users have no choice other than to rely on communities, as experts in the field are not easy to find given the young age of this database model.
2.2.3 Databases specialized in text search
The explosion in the data volumes gathered and manipulated by companies, social networks, etc. has come with a new problem, which consists in unstructured textual data presented in electronic format. As this type of information cannot be stored and queried using typical database management systems, new engines capable of performing text search over unstructured data were required [Net 9]. Full-text search databases are not completely independent from the older technologies, as they use the traditional tabular model in one way or another. Here is the taxonomy used by text-search databases:

Traditional terminology    Text-search databases terminology
Database                   Index
Table                      Type
Row                        Document
Column                     Field
Schema                     Mapping
Indexes                    Automatic indexing

Table 2.1: Text-search databases taxonomy

The architecture of a text search engine is different from that of other databases. The procedure is the following: external data is inserted into rows within the main document index. The search engine automatically creates an ordered word list to load into the main word index. In response to a data query, the search engine initially performs the search in the word index in order to identify which documents match the request, and returns pointers to the matching results in the main document index.
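As a concrete illustration of the procedure described above, the following self-contained Java sketch maintains a main document store and a word index, and answers a query by searching the word index first and following the resulting pointers back into the document store. Class and method names are ours, purely illustrative, and the sketch ignores ranking and tokenization subtleties.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyInvertedIndex {
    // Main document index: the position in the list is the document id.
    private final List<String> documents = new ArrayList<>();
    // Word index: each word maps to the ids of the documents containing it.
    private final Map<String, Set<Integer>> wordIndex = new HashMap<>();

    // Inserts a document and updates the word index automatically.
    public int insert(String doc) {
        int id = documents.size();
        documents.add(doc);
        for (String word : doc.toLowerCase().split("\\W+"))
            wordIndex.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
        return id;
    }

    // Searches the word index first, then follows the pointers
    // back into the main document index.
    public List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (int id : wordIndex.getOrDefault(word.toLowerCase(), Set.of()))
            hits.add(documents.get(id));
        return hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.insert("suspicious upload to evil.example");
        idx.insert("normal GET request");
        System.out.println(idx.search("upload")); // documents containing "upload"
    }
}
```

The word index is what gives this family of engines its speed: a query touches only the (small) posting set of the queried word, never the full document store.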
This process offers great performance when monitoring textual data, thanks to the high granularity of its index, which allows rapid indexed access to specific words. A text search engine architecture is described by the following figure:
Figure 2.1: Generalized Full-Text Architecture source: http://www.ideaeng.com/database-full-text-search-0201
2.3 Presentation of the used databases
2.3.1 PostgreSQL
PostgreSQL is a database management system based on the relational model. It was developed by the PostgreSQL Global Development Group, written in the C language and initially released on May 1, 1995 under the PostgreSQL License. In addition to the traditional functions offered by a relational database server, such as the storing and retrieving mechanisms required by any type of application, PostgreSQL proposes extensions which are not available in previous database management systems, such as:
• Several indexing methods for handling complex SQL queries, updateable and materialized views, rules, constraints, triggers, foreign keys, transaction integrity, etc. [Net 10].
PostgreSQL is ACID-compliant and uses multiversion concurrency control (MVCC)2 to avoid locking issues [11]. It also gives access to additional concepts like classes, functions, inheritance, etc., extending the system and joining it to the object-relational category.
Positives [Net 10]:
• Portability and execution on multiple platforms.
• Fast and easy installation, especially thanks to the Postgres installer provided since version 8.0.
• Sturdiness.
• Rich set of powerful functions.
• Active support offered by a large community of professionals.
• Conformity with the majority of the SQL:2011 standard.
• BSD3 license.
• Configurability and extensibility.
• Stability.
Limitations [Net 10]:
• Low performances when manipulating small data volumes.
• Data distribution problems.
2.3.2 MongoDB
MongoDB is a free and open-source NoSQL, document-oriented database management system developed by MongoDB Inc, written in the C++ programming language and initially released in 2009 under a combination of the GNU AGPL v3.04 and the Apache License. It is schemaless and has a JSON-like5 document structure, making data integration easier, faster and more flexible. The schema in MongoDB is described as dynamic, as we can store in the same collection documents that do not have the same structure, and common fields in a collection can hold different types of data.
2 MVCC is a system avoiding concurrency problems by making transactions invisible until the time they are committed, in order not to disturb other changes on the database.
3 BSD refers to Berkeley Software Distribution.
4 AGPL refers to Affero General Public License.
5 JSON refers to JavaScript Object Notation: it is an open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs.
According to db-engines.com, in April 2014 MongoDB was in 5th place among the most popular database management systems, and in first place among NoSQL database management systems [Net 12]. MongoDB has a large number of great features giving it advantages over other document-oriented databases, such as subscriptions. MongoDB subscriptions are features addressing the most demanding requirements of an application and enabling users to be more rapid and effective.
Figure 2.2: MongoDB Subscriptions Source: http://www.mongodb.com/mongodb-overview
The main purposes of MongoDB are high availability, performance and especially scalability, as it is designed to work on large server deployments and multi-site architectures. It supports standalone operations and handles traffic growth perfectly, as more servers can easily be added.
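The dynamic schema can be illustrated with two documents destined for the same collection. In this sketch, plain Java maps stand in for the driver's document type (BasicDBObject) so that the example has no external dependency; the field names are illustrative, not taken from the project code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DynamicSchemaDemo {
    // Two documents for the same collection: they share "client_adr"
    // but otherwise have different structures, something a fixed
    // relational schema would not allow without NULL-filled columns.
    public static Map<String, Object> docA() {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("client_adr", "10.0.149.23");
        d.put("bytes", 934);                  // numeric field only in docA
        return d;
    }

    public static Map<String, Object> docB() {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("client_adr", "10.0.149.24");
        d.put("domain", "example.org");       // string field only in docB
        return d;
    }

    public static void main(String[] args) {
        System.out.println(docA().keySet());
        System.out.println(docB().keySet());
    }
}
```

Both documents could be inserted into the same "data" collection without any prior schema declaration; the database simply stores whatever fields each document carries.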
MongoDB offers a large number of drivers for various programming languages such as Java, C++, PHP, JavaScript, etc.
Positives [Net 12]:
• Short execution time, even with increasing data volumes.
• High scalability performances thanks to the existing replication mechanisms.
• Easily distributable.
• Flexibility in failure cases, with fail-over and recovery mechanisms.
• Partial or total availability.
• Advanced features.
• Powerful query system when combined with the best database design and the appropriate indexing strategy.
• Great modeling for many sensitive data types such as graphs, location-based data, fog data of any format, etc.
Limitations [Net 12]:
• Risk of running into the limitations of the query model: a minimum level of coherence is required.
• No transactions supported.
• No equivalent for the SQL join query.
• Administration issues caused by the absence of a data model.
• Absence of a proper debugger and of a graphical user interface.
2.3.3 Elasticsearch
2.3.3.1 Presentation
Elasticsearch is an open source, RESTful6 search engine developed by Shay Banon, written in the Java programming language and released under the Apache License. It is mainly arranged for inspecting through text, returning textual results to a given query, and performing statistical analysis of text bodies.
6 RESTful refers to REpresentational State Transfer, which is a software architecture designed for distributed hypermedia systems.
Data is stored in a special format optimized for language-based searches, and its main protocol is implemented with HTTP/JSON [Net 14]. Elasticsearch indexes JSON documents automatically, using a unique type-level identifier for each indexing operation. It fits the storage of large amounts of unstructured data in a distributed architecture of several servers, on which it can perform powerful search functions.
If the purpose is simply data storage, it is not wise to use Elasticsearch, as its main use is to perform text search; in that case it is better to use other databases which are suitable for this need [Net 13]. Despite the fact that Elasticsearch is based on Apache Lucene7, it provides a simpler API for public use, which is better than Lucene's old one. It also offers a rich infrastructure making the scaling process across machines and data centers a lot easier, inter-operation with non-Java languages, and operational ease of use.
7 Apache Lucene is a free/open source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.
Positives [Net 14]:
• High search performances thanks to the data sharding concept.
• Easy to use, as no configuration is needed.
• Efficiency: starting a node requires only its integration into the ecosystem, in which it benefits from automatic replication and dimensioning.
• Parallelism between nodes' treatments.
• Large number of features proposed: facets, percolation, plug-ins, etc.
• Easy integration into other information systems, since it is under the Apache2 license.
Limitations [Net 14]:
• Absence of a graphical user interface.
• Not completely stable.
• No support for transactions.
• Immaturity, as the software is considered relatively new.
• Near real-time (rather than immediate) data availability.
• No access-control or authentication features.
• Unsatisfying documentation, which looks more like a tutorial.
2.4 Comparative tables
The following information was mainly retrieved from the official sites of the three studied databases [Net 12] [Net 13] [Net 14].
Table 2.2: Databases comparison: organization and community
Table 2.3: Databases comparison: compatibility
Table 2.4: Databases comparison: administration
Table 2.5: Databases comparison: general information
Table 2.6: Databases comparison: costs and software license
Table 2.7: Databases comparison: capabilities/limitations
2.5 Summary
After this theoretical study, we can see that the three database management systems, PostgreSQL, MongoDB and Elasticsearch, are potential candidates for our research, as they show great abilities in the big data processing field. A practical study is definitely needed in order to determine which of the databases mentioned above is the most suitable for APT detection procedures. To achieve this goal, a prototype of an agent specialized in the investigation of abnormal uploading behavior will be implemented in the Java programming language and will be tested on the three databases, with log files of different sizes, in order to be able to make comparisons, observations and final conclusions.
Chapter 3 Implementations
3.1 Introduction
This chapter will be devoted to the explanation of the codes used to test the databases studied in the previous chapter. To implement our agent, we thought of two methods: the first one is to retrieve only the wanted data fields from the databases and perform the rest of the operations on the request results, while the second one consists in using aggregation methods on the database fields and getting the results ready for analysis, without the need for subsequent operations. As each database management system possesses its own Java API1 and its own query system, we have nine codes in total: one for data storage and two versions of the Agent Uploader for each database, which will be fully explained in the current chapter.
1 API refers to Application Programming Interface.
3.2 Data storage
The first thing to think about when storing data into a database is the schema design.
Even with schemaless databases, it is crucial to extract the information into specific individual fields within the documents, in order to facilitate data extraction; otherwise we would need to use regular expressions and to perform a full scan each time we want to retrieve specific information. As mentioned in the first chapter, the real challenge with APT detection systems is Big Data. During this project, we were given the Squid proxy log file of a medium-sized network to work with. The following section will be devoted to the analysis of the data format of a Squid proxy log file, in order to arrange it properly into the databases.
3.2.1 Data format
Two default formats are built into the Squid proxy; they are defined by the option logformat in the file squid.conf:
• The native format
• The common format
In our case, the log file format we will be using during this work is the native one [Net 15]. It logs, in order, the fields: time (%9d.%03d), elapsed (%6d), remotehost (%s), code/status (%s/%03d), bytes (%d), method (%s), URL (%s), rfc931 (%s), peerstatus/peerhost (%s/%s) and type (%s).
For example, a line in the log file named access.log comes like the following:
1394950584.861    438 10.0.149.23 TCP_MISS/200 934 GET http://dzayfqe.trwvkpc.au/lsetyuxs.html - DIRECT/69.114.1.230 text/html
Each line in this file contains ten fields which hold significant information that can be used to investigate the state of the network. The succeeding paragraphs explain the meaning of each field separately.
CHAPTER 3. IMPLEMENTATIONS 27
Squid native access.log format in detail [Net 18]
• time: A decimal(11,3) value referring to the time between the end of the transaction's execution and its logging by the proxy.
• elapsed = duration: Expressed in milliseconds; it represents the time during which the transaction occupied the cache to be executed.
• remotehost = client address: The IP address of the client machine from which the transaction originated.
• code/status = result codes: It is composed of two entries separated by a slash: the first one refers to the kind of the request; the second one is the code describing how the transaction succeeded or failed.
• bytes: The amount of data that was delivered to the client.
• method: The request method used to demand the procuration of certain data.
• URL: The URL to which the client is trying to get access.
• rfc931: It contains the ident lookup for the current client, but it is always replaced by a "-" for performance reasons.
• peerstatus/peerhost = hierarchy code: It is divided into two entities: the first one refers to a code which demonstrates how the demand was handled; the second one is the IP address of the server to which the request was forwarded.
• type: The content type of the object, as written in the HTTP reply header.
3.2.2 Database design
In order to build a coherent database, with expressive and concise enough information, on which several APT detection agents can be run effectively, we have decided to decompose some fields further and to get rid of some others that don't hold any useful information. We divided the URL into three separate fields:
1. Protocol: we will be able to query the database and retrieve the protocol field independently, as the botnet protocols through which APT attacks are performed incorporate commonly used protocols such as HTTP, FTP, etc. in order to go unnoticed.
2. Domain: as APT attackers rely mostly on C&C2 servers, we can query the domains visited by local machines in order to look for known domain names, although these tend to change constantly.
We can also search for domains which are frequently consulted to retrieve confidential data, which can allow an early APT discovery.
3. Document
We have eliminated the rfc931 field, as it doesn't hold any important information that might be needed by a network analyst in his mission to detect APT intrusions. The peerstatus/peerhost field was decomposed into two independent fields:
• rfc
• host_adr
2 C&C refers to Command and Control servers.
The database schema is the following:

Field name    Type
time          double
duration      int
client_adr    String
req_kind      String
rslt_code     int
bytes         int
req_method    String
protocol      String
domain        String
document      String
rfc           String
host_adr      String
type          String

3.2.3 Codes
PostgreSQL
We will be using the PostgreSQL 9.2 version. We first created a database named "proxydata" and a table named "data" in order to hold the log information that will be queried by the Agent Uploader. The procedure consists in getting a connection to the database, reading the log file line by line, splitting each line into separate fields and inserting them into the table "data". In order to communicate with a PostgreSQL database, we had to import the postgresql-9.2-1003.jdbc4.jar file from http://mvnrepository.com/artifact/org.postgresql/postgresql/9.2-1003-jdbc4 and add it to the Libraries package of our project, named DBloading.
Getting connection to the database [Net 16]
We have written a class named "GetConnection" which returns an object of type Connection. The connection is ensured by the DriverManager using a connection URL: jdbc:postgresql://host:port/database. The default port is 5432 and, as we are working on a local machine, we will be using localhost.
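A minimal sketch of such a GetConnection helper might look like the following; the class name, the credentials and the exact structure are ours, and the actual code of Figure 3.1 may differ.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class GetConnectionSketch {
    static final String HOST = "localhost";
    static final int PORT = 5432;              // PostgreSQL default port
    static final String DATABASE = "proxydata";

    // Builds the JDBC connection URL: jdbc:postgresql://host:port/database
    public static String url() {
        return "jdbc:postgresql://" + HOST + ":" + PORT + "/" + DATABASE;
    }

    // Opens the connection via the DriverManager; user and password
    // are placeholders to be supplied by the caller.
    public static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(url(), user, password);
    }

    public static void main(String[] args) {
        System.out.println(url()); // jdbc:postgresql://localhost:5432/proxydata
    }
}
```

Note that `open()` requires the PostgreSQL JDBC driver jar mentioned above to be on the classpath; only the URL construction runs without it.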
The DriverManager.getConnection() method is called like the following: Connection object = DriverManager.getConnection(url, username, password) [Net 21]
Figure 3.1: DBloading: GetConnection query
Getting the transaction's fields
In order to insert the transactions into the database, we had to extract the log information into independent fields corresponding to the design previously presented, which will later be inserted into the table "data". To achieve this goal, we used the String[] split(String delimiter) method of the Java String class, which takes as parameter a group of delimiters. These delimiters consist of the different characters that separate two fields. They are presented in the list below, along with an example of where they can be found:
• Multiple spaces, to separate the time and duration.
• '/', to separate the req_kind and rslt_code.
• '://', to separate the protocol and domain.
• '-', to separate the document and the rfc.
For an appropriate splitting procedure, we have to specify to the method that these delimiters can occur several times within the same line. This is achieved with the use of a '+' operator following the delimiter list. The split method returns an array of the fields contained in the original line of the log file.
Figure 3.2: DBloading: extracting the transaction's fields
Insert query
Inserting the fields that we have extracted from the log file requires performing an SQL insert query. Each time we issue an SQL query to the database, we require a Statement or PreparedStatement instance. The PreparedStatement object is a slightly more powerful version of a Statement, as it may be parametrized. This means that it should be used when user-input parameters are required within the query. PreparedStatement is also known for providing the necessary mechanisms to avoid SQL injections; otherwise both methods do the same.
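For completeness, here is what a parametrized version of the insert would look like with a PreparedStatement. The column names follow the schema of section 3.2.2; this is a sketch under that assumption, not the exact code of Figure 3.3.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InsertSketch {
    // One "?" placeholder per column of the 13-column schema of section 3.2.2.
    static final String SQL =
        "INSERT INTO data (time, duration, client_adr, req_kind, rslt_code, "
      + "bytes, req_method, protocol, domain, document, rfc, host_adr, type) "
      + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";

    // Binds the extracted fields to the placeholders and executes the insert.
    public static void insert(Connection conn, Object[] fields) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (int i = 0; i < fields.length; i++)
                ps.setObject(i + 1, fields[i]); // JDBC parameter indexes are 1-based
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // 13 placeholders, one per schema column
        System.out.println(SQL.chars().filter(c -> c == '?').count());
    }
}
```

The placeholders make the query immune to SQL injection, since field values are never spliced into the SQL text itself.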
In our case, we don’t need to pass any parameters to the query so we will use a Statement object instance to hold our insert query. We will be inserting the fields that we retrieved previously from each line in the log file. CHAPTER 3. IMPLEMENTATIONS Figure 3.3: DBloading: Insert query Verification of the database insertion Figure 3.4: PostgreSQL: data table MongoDB We will be using MongoDB 2.4.9 version. 33 CHAPTER 3. IMPLEMENTATIONS 34 Unlike PostgreSQL, we don’t have to create the database or the collection as a preliminary step because they will automatically be created when the java program is run and this is due to the fact that MongoDB is a schemaless document-oriented database that doesn’t need a preconfiguration or design for the database. Each line of the log file will be represented by a document in the collection ”data” and as we already said in the database storage section, we have to divide each document into fields in order to facilitate the data picking. Each transaction will be read from the log file, splited into fields as we have done in the DBloading code and finally inserted into a document of a data collection within a proxydata database. In order to communicate with a MongoDB database, we had to import the mongo-java-driver-2.4 jar file from http://mvnrepository.com/ artifact/org.mongodb/mongo-java-driver and add it to the Libraries package of our project named MongoLoading. Getting connection to the database To get connection to the MongoDB data collection, we have to connect to the database proxydata on the default port 27017. The GetConnection method will return a DBCollection that consists on the collection we will be working on. Figure 3.5: MongoLoading: GetConnection query Insert query Each document corresponding to a log transaction will be built of the fields extracted from this transaction using the split method as we did in the project DBloading. 
We have to get a document instance of the type BasicDBObject, add the necessary fields to it, then insert it into the collection.
Figure 3.6: MongoLoading: Insert query
Verification of the database insertion
Figure 3.7: MongoDB: data collection
3.2.4 Elasticsearch
The Elasticsearch version we will be using is 1.1.1. We had to import the elasticsearch-1.1.1 jar file into the Libraries package of the new project we created, named ElasticLoading. As Elasticsearch is based on Lucene, we had to look for the lucene-core jar file compatible with our database version. So, we downloaded the lucene-core-4.7.2.jar file from http://lucene.apache.org/core/downloads.html and added it to our Libraries package too.
Getting connection to the database [Net 19]
The connection to an Elasticsearch database needs a client instance of the class TransportClient. It connects to an existing Elasticsearch node through a transport socket on the port 9300 and returns an object of the type TransportClient.
Figure 3.8: ElasticLoading: getClient query
Insert query
First, we have to get an insert request instance for our index. This task is ensured by the method prepareIndex, which is related to the client we got earlier and which takes as parameters the index name along with the type name. It returns an IndexRequestBuilder object instance that will hold the document to be inserted. We also have to start a XContentBuilder instance, corresponding to a container of a JSON object, to which we will add the fields that we extracted from each line of the log file. This instance should be added to the IndexRequestBuilder corresponding to the insert query, which will later be executed.
Figure 3.9: ElasticLoading: Insert query
Verification of the database insertion
Figure 3.10: Elasticsearch: data type
3.3 Agent Uploader
The Agent Uploader performs an investigation mission within sets of proxy log transactions in order to detect an abnormal uploading rate. For each server visited by each client, the agent sums the bytes corresponding to each request method (GET, POST or CONNECT), then calculates an upload rate which is compared to a chosen threshold. This threshold is chosen by the network analyst as a function of the typical upload behavior inside the local network. If the agent discovers a rate that exceeds the pre-fixed threshold, it should raise an alert.
3.3.1 Algorithm
The first version of our Agent Uploader consists in:
1. Selecting the list of client addresses
2. Selecting the list of server addresses visited by each client
3. Selecting req_method and bytes for each couple (client address, server address)
4. Performing the calculations
5. Performing the analysis

Algorithm 3.1 Agent Uploader
    Connect to the database
    for each client
        for each server visited by this client
            select req_method, bytes
            for each result in the result set
                perform the calculations:
                if req_method == GET
                    BytesByGet = BytesByGet + bytes
                else if req_method == POST
                    BytesByPost = BytesByPost + bytes
                else if req_method == CONNECT
                    BytesByConnect = BytesByConnect + bytes
            perform the analysis:
            if Get_method exists
                if Post_method exists
                    ratio = BytesByPost / BytesByGet
                else
                    ratio = 0
            if Connect_method exists
                ratio = BytesByConnect / resultSet_size * 100000
            if ratio > threshold
                print(client, server, ratio)

3.3.2 Codes
3.3.2.1 PostgreSQL
Each time we need to query a PostgreSQL database, we need to get an instance of the object Statement to hold the query; its execution returns a ResultSet object containing all the results retrieved from the consulted data table.
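The calculation and analysis steps of Algorithm 3.1 can be sketched independently of any database, which also makes them easy to unit-test. Only the GET/POST ratio branch is shown; class and method names are ours, and the guard for the no-GET case (ratio 0) follows the algorithm above.

```java
import java.util.List;

public class UploadAnalyzer {
    // One result-set row reduced to the two fields the agent needs.
    public static final class Req {
        final String method;
        final long bytes;
        public Req(String method, long bytes) { this.method = method; this.bytes = bytes; }
    }

    // POST/GET byte ratio for one (client, server) couple, per Algorithm 3.1;
    // when no GET traffic was seen, the ratio is defined as 0.
    public static double uploadRatio(List<Req> reqs) {
        long bytesByGet = 0, bytesByPost = 0;
        for (Req r : reqs) {
            if (r.method.equals("GET")) bytesByGet += r.bytes;
            else if (r.method.equals("POST")) bytesByPost += r.bytes;
        }
        return bytesByGet == 0 ? 0 : (double) bytesByPost / bytesByGet;
    }

    // Raises an alert when the ratio exceeds the analyst-chosen threshold.
    public static boolean alert(List<Req> reqs, double threshold) {
        return uploadRatio(reqs) > threshold;
    }

    public static void main(String[] args) {
        List<Req> reqs = List.of(new Req("GET", 100), new Req("POST", 300));
        System.out.println(uploadRatio(reqs)); // 3.0
    }
}
```

A client that downloads 100 bytes but uploads 300 bytes to the same server thus yields a ratio of 3.0, which a threshold of, say, 2 would flag as a potential exfiltration channel.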
A ResultSet object supports a cursor pointing to the current result record. When we create a ResultSet, we have to set three attributes [Net 20]:

Property     Value                          Explanation
Type         TYPE_SCROLL_INSENSITIVE        The result records can be navigated both forwards and backwards
Concurrency  CONCUR_READ_ONLY               The result records can only be read
Holdability  HOLD_CURSORS_OVER_COMMIT       The ResultSet is kept open during the whole connection period

Select clients
The method GetClients distinctly selects the client addresses existing in the log file. It takes as parameter a Connection instance and returns an array of client addresses. The select query is performed as follows.
Figure 3.11: PostgreSQL: GetClients query
Select servers visited by each client
The method GetServers distinctly selects the server addresses visited by a given client address. It takes as parameters a Connection instance and a client address, and returns an array of server addresses. The select query is done as follows.
Figure 3.12: PostgreSQL: GetServers query
Select req_method, bytes for each server visited by a specific client
The select query of the project's main section is done within a Statement instance, and its execution returns a ResultSet object which contains the information needed to carry on the agent's investigation.
Figure 3.13: PostgreSQL main: Select query
3.3.2.2 MongoDB
Select clients
The method GetClients distinctly selects the client addresses existing in the log file. It takes as parameter a DBCollection instance and returns a list of client addresses. The select query is performed as above.
Figure 3.14: MongoDB: GetClients query
Select servers visited by each client
The method GetServers distinctly selects the server addresses visited by a given client address. It takes as parameters a DBCollection instance and a client address, and returns a list of server addresses.
An instance of a BasicDBObject is used in order to specify the where clause of the select query, which is executed as above.

Figure 3.15: MongoDB: GetServers query

Select req_method, bytes for each server visited by a specific client

To select the request method and bytes corresponding to a couple (client_adr, host_adr), we need to build a BasicDBObject named query, which holds the where clause of the select query, and a second one named fields, in order to limit the fields retrieved from the database to the two we will need for our subsequent calculations and analysis.

Figure 3.16: MongoDB main: Select query

3.3.2.3 Elasticsearch

Select clients

The method GetClients selects the distinct client addresses existing in the log file. It takes as parameter a TransportClient instance and returns a list of client addresses. An Elasticsearch query is contained within an instance of a SearchRequestBuilder object, to which we add the wanted properties of the selection. The search can be executed across one or more types, so we had to specify the name of the type that we will be searching through. To perform the distinct selection, we had to use a terms facet, which allows us to specify field facets that return the most frequent terms (facets are a group of available filters which can be applied to a set of search results). It is necessary to specify the facet size, meaning the number of most frequent distinct client addresses, but this has a major drawback: if we set a size smaller than the actual number of clients, the result will be incomplete, and if we set a very large number in order to avoid the first disadvantage, this can lead to performance issues. This select query is shown in the next figure.
Figure 3.17: Elasticsearch: GetClients query

Select servers visited by each client

The method GetServers selects the distinct server addresses visited by a given client address using a terms facet. It takes as parameters a TransportClient instance and a client address, and returns a list of server addresses.

Figure 3.18: Elasticsearch: GetServers query

Select req_method, bytes for each server visited by a specific client

To select the request method and bytes corresponding to a couple (client_adr, host_adr), we need to use a bool query to hold the where clauses. A bool query is a query that matches documents matching boolean combinations of other queries. We retrieve only the two fields we need with the addField method of the SearchRequestBuilder instance.

Figure 3.19: Elasticsearch main: Select query

3.4 Agent Uploader with aggregations

The methods GetClients and GetServers are the same as in the first version of our agent, so we will not go through them again in the following section, as they were already explained in the previous one.

3.4.1 Algorithm

The second version of our Agent Uploader consists in:

1. Selecting the list of client addresses
2. Selecting the list of server addresses visited by each client
3. Selecting req_method and Sum(bytes) for each couple (client_adr, host_adr), grouped by req_method
4. Performing the analysis
Algorithm 3.2 Agent Uploader with aggregations

Connect to the database
for each client
    for each server visited by this client
        select req_method, Sum(bytes) group by req_method
        for each result in the result set
            perform the analysis:
                if Get_method exists
                    if Post_method exists
                        ratio = BytesByPost / BytesByGet
                    else
                        ratio = 0
                if Connect_method exists
                    ratio = BytesByConnect / resultSet_size * 100000
                if ratio > threshold
                    print(client, server, ratio)

3.4.2 Codes

3.4.2.1 PostgreSQL

Select req_method, Sum(bytes) for each server visited by a specific client, grouped by req_method

To retrieve Sum(bytes) for each couple (client_adr, host_adr), we need to use the aggregate function Sum, which performs a sum calculation over the field bytes. These calculations are grouped by req_method in order to get the results classified by request method type, as in the following example:

req_method    Sum(bytes)
GET           100
POST          200
CONNECT       0

This select query is shown in the following figure.

Figure 3.20: PostgreSQL main: Select with group by query

3.4.2.2 MongoDB [Net 22]

In order to use aggregations on MongoDB with the Java driver, we need to build an aggregation pipeline. This pipeline consists of three major operations:

• Operation $match: describes the "where" clause of the query.

Figure 3.21: MongoDB main (with group by): Operation $match

• Operation $project: specifies the fields which need to pass through the pipeline.

Figure 3.22: MongoDB main (with group by): Operation $project

• Operation $group: presents the "group by" clause by specifying the fields by which the aggregation will be grouped.

Figure 3.23: MongoDB main (with group by): Operation $group

Other operations can be defined and added to the pipeline if needed, such as $sort for sorting the results. To execute the aggregation query, we have used a second method, aggregate(List<DBObject>), as follows.
Figure 3.24: MongoDB main (with group by): aggregation execution

3.4.2.3 Elasticsearch

During the implementation of this code, we could not retrieve the aggregation results, as we did not find any documentation on the subject. On the Elasticsearch blog, we found that many other users were asking for Java API documentation on aggregations, but it had not been provided by the time of the writing of this dissertation. In order to build an aggregation query on Elasticsearch with the Java driver, we have to use two types of aggregations:

• A bucket aggregation: to group documents for each req_method. It returns the unique terms indexed for a given req_method.

• A metrics aggregation: to return the Sum(bytes) of the groups of documents returned by the previous query, using the sum metric.

Each of the two aggregations is built within an AggregationBuilders instance. The bucket aggregation encapsulates the metrics one as a subAggregation. This means that the sum aggregation takes as input the output of the buckets: the sum metric is applied to the content of each group, i.e. each req_method in our case.

Figure 3.25: Elasticsearch main (with group by): aggregation query

3.5 Summary

During this chapter, we have presented the implementations of our agents, as well as an explanation of the methods, libraries and objects used to communicate with and query the different database management systems. The next chapter will be devoted to testing our agents and analyzing the performances of the different databases. Results will be provided in order to draw conclusions about each database's capabilities.

Chapter 4

Performance tests

4.1 Introduction

The last section of our comparative study consists in running multiple tests in order to observe the performances of the three databases. The retrieved results will allow us to draw conclusions about the most suitable database management system for our APT detection needs.
The tests that follow have been executed locally on a Microsoft Windows 7 Professional 32-bit machine with the following properties:

• Hard disk: 193 GB.
• RAM: 4 GB.
• Processor: Intel® Core™ i5 CPU M480 @ 2.67 GHz–2.66 GHz.

4.2 Methodology

The following tests are divided into two parts. The first part attempts to measure the execution time needed to fill each database with log data. The second is similar to the first, except that it targets the execution time of the different agents on each database type. In order to analyze the behavior of each database management system as the data size grows, the tests are done with three log files of different sizes: 836 KB, 33 MB and 153 MB.

In addition, we will be using two different indexing strategies, as indexes are said to speed up data retrieval and relieve the server by providing quick jump references to where the searched information can be found. However, indexes present some disadvantages which we have to take into consideration, such as the space they take up and the time needed to update them after each modification of the database, which can slow it down [Net 17].

As indexes are used to limit the number of results we are trying to find, the fields that should be indexed are those appearing in the where clauses of the select queries used in our codes. These fields are client_adr and host_adr. In a first step, we index each field separately, which means we end up with two single-field indexes. In a second step, we gather the two fields inside one multiple-fields index, as it is said to beat the single-field approach in terms of speeding up data retrieval. The indexing strategies are only applied to PostgreSQL and MongoDB, as in Elasticsearch data is indexed automatically.
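The practice of disabling indexes during loading, developed just below, can be sketched for PostgreSQL as a drop-and-recreate pattern (the index and table names match those used later in this chapter, but the loader scripts themselves are not reproduced in this work, so this is an assumed sketch):

```sql
-- Drop the indexes before a bulk insert, recreate them afterwards.
DROP INDEX IF EXISTS index_client;
DROP INDEX IF EXISTS index_server;
-- ... bulk INSERT / COPY of the log file goes here ...
CREATE INDEX index_client ON data(client_adr);
CREATE INDEX index_server ON data(host_adr);
```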
To avoid slowing down the data insertion in each database, we disable the indexes at each insertion attempt, but we will also present a few test results showing how indexes can deteriorate insertion performance.

4.3 First test: Data import

4.3.1 Presentation

This test consists in measuring the time needed by each database management system to upload a set of transactions from a log text file and store the data in its internal structure: table, type or collection. As said earlier, we will be using three different log files in order to observe the database behavior as the data size rises. Handling big data volumes is a crucial criterion, as the APT detection process is based on big data manipulation.

4.3.2 Results

4.3.2.1 Before indexing

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     38            27         100
273141 (33 MB)                    730           173        1354
1263317 (153 MB)                  5208          795        6727

Table 4.1: Data import: results

Figure 4.1: Comparative diagram: data import

4.3.2.2 After indexing

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB
6682 (836 KB)                     68            57
273141 (33 MB)                    1908          617
1263317 (153 MB)                  9154          2088

Table 4.2: Data import results on indexed tables

4.3.3 Analysis

These tests demonstrate that inserting data into indexed databases lowers the insertion performance remarkably, even with small amounts of data. For this reason, it is wise to delete the indexes before any insert query in order to get the best execution time.

Looking at the results of Table 4.1, PostgreSQL and MongoDB show similar results when importing the first log file, but that is not the case with Elasticsearch, which shows a higher data import time. The more the log file size rises, the more the differences between the insertion times grow.
It clearly comes out that MongoDB presents the best execution time for the data insertion procedure, beating both PostgreSQL and Elasticsearch. Another observation is that PostgreSQL does not fall far behind MongoDB when inserting small to medium-sized data sets. Elasticsearch takes longer to insert data, as it indexes the documents automatically, using a unique type-level identifier for each indexing operation, which consumes more time than a plain document insertion.

In order to get a deeper analysis, we have examined the influence of the data size rise on inserting data into the three databases. The results are presented in the following curve.

Figure 4.2: Comparative curve: impact of the data size rise on the data import

The curve demonstrates that data growth affects MongoDB less than the other databases, which reinforces the previous findings. For inserting big volumes of data, we conclude that MongoDB is the most efficient database, as it exhibits an admissible execution time which would fulfill the requirements of a future APT detection system.

4.4 Second test: Agent Uploader

4.4.1 Presentation

In order to test the Agent Uploader we have implemented, we run it on the different databases so we can observe its behavior as the amount of manipulated data increases. As a first step, tests are run without creating indexes. In a second step, two types of indexes are created and used within the tests: single-field indexes and multiple-fields ones.

Creating single-field indexes

To create the single-field indexes we used the following queries:
On PostgreSQL

• CREATE INDEX index_client ON data(client_adr);
• CREATE INDEX index_server ON data(host_adr);

On MongoDB

• db.data.ensureIndex({ client_adr: 1 });
• db.data.ensureIndex({ host_adr: 1 });

Creating multiple-fields indexes

To create the multiple-fields indexes we used the following queries:

On PostgreSQL

• CREATE INDEX index_client ON data(client_adr, host_adr);

On MongoDB

• db.data.ensureIndex({ client_adr: 1, host_adr: 1 });

4.4.2 Results

4.4.2.1 Before indexing

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB
6682 (836 KB)                     26            30
273141 (33 MB)                    1391          4200
1263317 (153 MB)                  7246          10855

Table 4.3: Agent Uploader: results without indexing

Figure 4.3: Comparative diagram: Agent Uploader without indexing

4.4.2.2 After indexing

Using single-field indexes

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     8             9          10
273141 (33 MB)                    101           70         81
1263317 (153 MB)                  1560          961        295

Table 4.4: Agent Uploader: results using single-field indexes

Figure 4.4: Comparative diagram: Agent Uploader using single-field indexes

Using multiple-fields index

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     8             5          10
273141 (33 MB)                    57            24         81
1263317 (153 MB)                  1271          75         295

Table 4.5: Agent Uploader: results using multiple-fields index

Figure 4.5: Comparative diagram: Agent Uploader using multiple-fields index

4.4.3 Analysis

Without indexing, PostgreSQL and MongoDB show very long execution times that exceed multiple hours. Querying data which is not indexed is far from being useful. Unindexed data is the default data state, but it is definitely not a sagacious choice to work with it as it is. It is obvious that indexes reduce the agent's execution time on a large scale, from multiple hours down to a few minutes.
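As a side note not covered in the original test protocol, PostgreSQL offers a standard way to check that the agent's select queries actually hit the indexes created above: prefixing a query with EXPLAIN prints the planner's chosen strategy. The addresses below are placeholders:

```sql
-- An "Index Scan using index_client on data" line in the output
-- confirms that the index is being used instead of a sequential scan.
EXPLAIN SELECT req_method, bytes
FROM data
WHERE client_adr = '10.0.0.5' AND host_adr = '93.184.216.34';
```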
Using single-field indexes clearly reduces the execution time for both PostgreSQL and MongoDB, but the multiple-fields index reduces it even further. The two indexing strategies seem to have different effects on each database's performance. It is easy to notice that using the multiple-fields index is a lot more beneficial, as it divides the single-field execution time by more than 10 in MongoDB's case. PostgreSQL does not show great results, especially when monitoring big data sizes, and there is no noticeable difference between its two indexing methods. MongoDB, however, presents great results once again, as querying more than a million documents takes only a minute and a few seconds. Elasticsearch also works remarkably well with big data amounts, and even beats MongoDB's execution time when the latter uses single-field indexes.

For a more profound analysis, we take a closer look at the impact of the data rise on the agent's performance. The results are given below.

Figure 4.6: Comparative curve: impact of the rise of data size on Agent Uploader

The curve exhibits important results, as it shows how both PostgreSQL and Elasticsearch are affected by the growth of the monitored data size: the more data we have to query, the lower the performance we get, especially for PostgreSQL. MongoDB, however, presents only a slight performance variation as a function of the data rise. It actually displays the best adaptation to the enlargement of the data through which the agent executes its investigation. Eventually, MongoDB wins the competition once again, providing the best execution time when monitoring large sets of log transactions.
4.5 Third test: Agent Uploader with aggregations

4.5.1 Presentation

The same tests that we performed with the first version of the Agent Uploader are run with the second version, in order to compare the two of them and to be able to conclude, in a subsequent section, which one presents the best performances and the lowest execution time. As the Agent Uploader with aggregations was not fully implemented on Elasticsearch, these tests are run only on PostgreSQL and MongoDB.

4.5.2 Results

4.5.2.1 Before indexing

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB
6682 (836 KB)                     27            50
273141 (33 MB)                    2102          5520
1263317 (153 MB)                  8507          12762

Table 4.6: Agent Uploader with aggregations: results without indexing

Figure 4.7: Comparative diagram: Agent Uploader with aggregations without indexing

4.5.2.2 With indexing

Using single-field indexes

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB
6682 (836 KB)                     11            15
273141 (33 MB)                    99            151
1263317 (153 MB)                  1073          1908

Table 4.7: Agent Uploader with aggregations: results using single-field indexes

Figure 4.8: Comparative diagram: Agent Uploader with aggregations using single-field indexes

Using multiple-fields index

Number of transactions treated    Execution time in seconds
                                  PostgreSQL    MongoDB
6682 (836 KB)                     7             12
273141 (33 MB)                    57            41
1263317 (153 MB)                  544           156

Table 4.8: Agent Uploader with aggregations: results using multiple-fields index

Figure 4.9: Comparative diagram: Agent Uploader with aggregations using multiple-fields index

4.5.3 Analysis

As was said in the previous analysis, it is not wise at all to query unindexed data on any of the databases we used. It really takes hours to get results, which does not fit the needs of a future APT detection system.
During the first two tests, PostgreSQL showed a lower execution time than MongoDB, but using the multiple-fields index turned the results upside down, as it reduced MongoDB's execution time remarkably, to the point of beating PostgreSQL. These observations confirm the study we made in the second chapter, in which we said that PostgreSQL, along with other relational SQL databases, is a better fit for performing complex queries than NoSQL ones. It also comes out that using the appropriate indexing method on MongoDB simplifies the execution of complex queries and improves the agent's execution time.

A deeper analysis of the databases' response to the enlargement of the monitored data sets is provided by the following curve.

Figure 4.10: Comparative curve: impact of the rise of data size on Agent Uploader with aggregations

PostgreSQL shows a rather exponential behavior relative to the data size rise, while MongoDB exhibits a more linear one. These observations demonstrate that MongoDB presents the best adaptation to the data growth, especially with very large data sets, as for small to medium-sized data its results are similar to PostgreSQL's.

For the third time, MongoDB shows the best performance in terms of execution time, but on the condition of using an appropriate indexing method that speeds up data querying and offers good results, which can be of great use in a future deployment, especially when performing complex queries that might otherwise slow down the database.

4.6 Comparison of the two agents

4.6.1 Presentation

So far, we have compared the performances of each version of the Agent Uploader on each of the three databases separately. This section is devoted to the comparison between the two versions' results for PostgreSQL and MongoDB.
It will not be possible for us to apply this comparison to Elasticsearch, as we were not able to carry out the second version's implementation due to the absence of the required documentation. In this section, we seek to conclude about the best way to query each database in order to get the best performance, which is needed in a future APT detection system.

4.6.2 PostgreSQL

Number of transactions treated    Execution time in seconds
                                  First version    Second version
6682 (836 KB)                     8                7
273141 (33 MB)                    57               57
1263317 (153 MB)                  1271             544

Table 4.9: PostgreSQL results: comparison of the two agents

Figure 4.11: Comparative diagram: comparison of the two agents on PostgreSQL

4.6.3 MongoDB

Number of transactions treated    Execution time in seconds
                                  First version    Second version
6682 (836 KB)                     5                12
273141 (33 MB)                    24               41
1263317 (153 MB)                  75               156

Table 4.10: MongoDB results: comparison of the two agents

Figure 4.12: Comparative diagram: comparison of the two agents on MongoDB

4.6.4 Analysis

For PostgreSQL, the second version of the Agent Uploader takes less time to execute than the first one. This can be explained by the fact that SQL databases, including PostgreSQL, are a good fit for complex query-intensive environments. The gain in execution time grows along with the data volume.

In MongoDB's case, the two agent versions seem to have similar results, but the first one has a slightly lower execution time. Once again, we can justify this observation by the fact that NoSQL databases are not the best fit for handling complex queries. In general, NoSQL databases, including MongoDB, do not have standard interfaces for complex queries, and NoSQL queries are not as powerful as SQL ones.
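To make this SQL-versus-NoSQL contrast concrete, the group-by step of the second agent can be expressed in both systems roughly as follows (field names come from the log schema; the client and host values are placeholders, and these are sketches rather than the exact code of the figures):

```sql
-- PostgreSQL: one declarative statement
SELECT req_method, SUM(bytes)
FROM data
WHERE client_adr = '10.0.0.5' AND host_adr = '93.184.216.34'
GROUP BY req_method;
```

```js
// MongoDB: the same question as a multi-stage aggregation pipeline,
// matching the $match / $project / $group stages described in section 3.4.2.2
db.data.aggregate([
  { $match:   { client_adr: "10.0.0.5", host_adr: "93.184.216.34" } },
  { $project: { req_method: 1, bytes: 1 } },
  { $group:   { _id: "$req_method", total: { $sum: "$bytes" } } }
]);
```

The SQL form stays a single statement, while the pipeline form spreads the same logic over several explicit stages, which illustrates why complex queries are heavier to express and tune on the NoSQL side.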
Unfortunately, we could not get through the entire implementation of the second version of our agent for the Elasticsearch database, owing to the lack of documentation, so we could not compare the performances of its two versions.

4.7 Summary

During this chapter, we have presented the different tests that we performed on the three studied databases. Looking at the results we obtained and the analysis we made, it comes out that MongoDB shows a tolerable execution time, especially when monitoring big data volumes, that beats both PostgreSQL and Elasticsearch. By providing result sets within a limited time interval, MongoDB allows a future APT detection system to perform network investigation in an admissible time, especially when running multiple agents to carry out the APT uncovering mission.

From these conclusions, it stems that we have answered our research question, even if we still have some unsolved issues with the implementation of the second version of our agent on Elasticsearch, due to the absence of the required documentation. We have shown that MongoDB presents the best results when monitoring big data in an APT detection process, ahead of PostgreSQL and Elasticsearch. We have also demonstrated that MongoDB, as a NoSQL database, does not fit perfectly for performing complex queries, whereas PostgreSQL shows interesting behavior toward these queries, beating the performance of simple ones. The results are not bad, but if we seek a minimum execution time, it is wiser to limit the communication with the database to the data retrieval and perform any further calculations or analysis on the server side, which speeds up the agents on a very large scale.

Chapter 5

Conclusion

The detection of Advanced Persistent Threats is a challenging task, given the sophistication and the complexity of these intrusions. Many software tools, open source, free and commercial, are available for public use, implementing different approaches.
In order to protect a network from being intruded, an appropriate APT detection strategy should be put into place, combining several APT detection mechanisms. The real challenge behind APT detection procedures is "Big Data". Big Data refers to extremely large amounts of varied, fast-changing information through which searching is not convenient for traditional technologies. Due to its heavy impact on applications in a wide variety of fields, databases specialized in Big Data monitoring, called "Big Data Tools", have been launched on the market.

In this thesis, we addressed the problem of running multiple database queries in order to help detect APT. Actually, APT detection joins the log data collection and investigation field, to which Big Data Tools seem to be of great benefit. We have carried out a comparative study of three different database types in order to draw conclusions about the most suitable database management system for a future APT detection software. We have implemented two versions of an agent which investigates abnormal upload rates, in order to analyze the best way to query a database and to be able to conclude which of the two querying methods demonstrates the best performances. To test the different databases, we have used different log file sizes with the purpose of closely examining the impact of data growth on each database's performances.

Below we provide a summary of the findings we have collected during this study and a set of recommendations for best practice based on the results we retrieved.

5.1 Findings

PostgreSQL

It shows good results when monitoring small to medium-sized data sets, but low performances when manipulating large amounts of data. PostgreSQL is a good fit for performing complex queries, where it presents interesting results compared to the performance of simple ones.
MongoDB

It shows the best performances in data import and in the Agent Uploader on data indexed with multiple-fields indexes. However, it is not a good fit for complex queries, although it fits greatly for simple querying. For the development of an APT detection system, we advise the use of MongoDB, as it is capable of providing the needed requirements.

Elasticsearch

It takes a long time to import data but shows good results when running the Agent Uploader. Elasticsearch is still far from being stable and lacks documentation on performing aggregations with the Java API, along with lots of other issues. For this reason, we do not recommend its integration within the MASFAD project.

5.2 Recommendations

Looking at the entirety of the study we have carried out, we conclude that MongoDB is the most suitable database, among the studied ones, for APT detection needs in terms of response time. MongoDB shows great performances when monitoring big data and offers several concepts which can optimize big data processing even further. Here we present a few recommendations and subsequent solutions stemming from our study, in order to enhance the results we have collected in future work.

MongoDB offers sharding as a solution when data amounts challenge a single server's capacities, either CPU or storage. It consists in distributing the data over multiple machines called "shards" which, all together, make up a single entire database.

In order to lower big data's impact on the database's performances, splitting the data into several tables can have a remarkable effect on the execution time. This concept within MongoDB is called "data partitioning". It joins the sharding concept but occurs at the collection level. The purpose of this procedure is to speed up the execution time of the most frequent queries. To reach this goal, the splitting should be made on the objects used by these queries to filter the results.
In our case, we could create a database per client. Once again, to avoid performance issues caused by big data volumes, MongoDB provides a specific collection type named "capped collection", whose maximum size can be specified in advance. When this maximum is reached, old documents are deleted, so we would not have to monitor insignificant data. Furthermore, methods for maintaining a balanced data distribution are provided by MongoDB, facilitating the database administration and ensuring an equitable data partitioning between the different shards.

5.3 Closing remarks

During this work, we have only studied three database management systems, and we do not pretend to have presented the best existing solutions or to have thoroughly studied all related aspects. Nevertheless, the results that we collected are promising and can present a strong starting point for further research, or even for an APT detection prototype implementation.

Bibliography

[1] "M-Trends® 2013: Attack the Security Gap™", March 13, 2013
[2] Bejtlich, Richard. Understanding the Advanced Persistent Threat. 2010
[3] The Economist annual report, 2010
[4] "What's an APT? A Brief Definition". Damballa. January 20, 2010
[5] "The changing threat environment ...". Command Five Pty. March, 2011
[6] M. Hutchins, J. Clopperty, M. Amin, Ph.D. "Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains". Lockheed Martin Corporation. March, 2013
[7] Beth E. Binde, Russ McRee, Terrence J. O'Connor. Assessing Outbound Traffic to Uncover Advanced Persistent Threat. May 5th, 2011
[8] Stephen Fortune, A Brief History of Databases, February 27th, 2014
[9] Guy Harrison, 10 things you should know about NoSQL databases, August 26th, 2010
[10] Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, Big Data For Dummies, April 2013
[11] "Appendix D. SQL Conformance". PostgreSQL 9 Documentation. PostgreSQL Global Development Group.
2009

Netography

[Net 1] http://www.itbusinessedge.com/slideshows/the-most-famous-advanced-persistent-threats-in-history.html
[Net 2] https://www.academia.edu/6309905/Advanced_Persistent_Threat
[Net 3] http://www.verint.com/solutions/communications-cyber-intelligence/solutions/cyber-security/advanced-persistent-threats-apt-detection/index
[Net 4] http://www.fireeye.com/products-and-solutions/
[Net 5] http://www.isc8.com/products/cyber-adapt.html
[Net 6] http://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/
[Net 7] http://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/
[Net 8] http://databases.about.com/od/specificproducts/a/acid.html
[Net 9] http://www.ideaeng.com/database-full-text-search-0201
[Net 10] http://www.postgresql.org/
[Net 11] http://www.db-engines.com
[Net 12] http://www.mongodb.org/
[Net 13] http://www.elasticsearch.org/
[Net 14] http://exploringelasticsearch.com/overview.html
[Net 15] http://wiki.squid-cache.org/Features/LogFormat#Feature:_Customizable_Log_Formats
[Net 16] http://www.tutorialspoint.com/postgresql/postgresql_java.html
[Net 17] http://www.interspire.com/content/2006/02/15/introduction-to-database-indexes/
[Net 18] http://wiki.squid-cache.org/Features/LogFormat
[Net 19] http://www.elasticsearch.org/guide/en/elasticsearch/client/java-api/current/client.html
[Net 20] http://tutorials.jenkov.com/jdbc/resultset.html
[Net 21] http://jdbc.postgresql.org/documentation/80/connect.html
[Net 22] http://docs.mongodb.org/ecosystem/tutorial/use-aggregation-framework-with-java-driver/

Appendices

PostgreSQL Codes

Data import
Agent Uploader
Agent Uploader with aggregations

MongoDB Codes

Data import
Agent Uploader
Agent Uploader with aggregations

Elasticsearch Codes

Data import
Agent Uploader