Royal Military Academy
Brussels, BELGIUM
www.rma.ac.be
Comparison of Data Base Technologies for APT Detection
Abidi Marwa
2014
Abstract
Nowadays, cyber security is becoming a primary concern for organizations as well as for individuals. As cyber space is considered vital for multiple key functions, such as air traffic control systems and nuclear reactor safety systems, cyber security issues are emerging as conspicuous elements of several studies. Data theft, network disruption and other incidents can affect companies and individuals in ways that range from embarrassing to lethal. Today, communication, research, development and all aspects of personal life and business rely on networks. Governments, militaries, corporations, social networks, etc., collect, process and store humongous amounts of restricted data every day, which makes cyber security a high concern. Given the noticeable growth of complex, sophisticated and advanced cyber attacks, appropriate security measures are required to protect confidential information and to prevent disastrous consequences.

Among the most daunting challenges for cyber security solutions are "Advanced Persistent Threats" (APT). APT are relentless and targeted cyber attacks considered the latest cyber security scourge. These attacks are most of the time carried out by nation states or hacktivists possessing high technical capabilities and the appropriate material infrastructure, giving them the opportunity to achieve their goals without being noticed or detected by existing security tools.

In order to detect APT, large amounts of data must be processed during the investigation, given the silent aspect of these intrusions, an operation which is extremely time consuming. One possibility for performing these analyses consists in using databases, which are far from obsolete for this purpose.

This project consists of a study of how data can be interrogated through sets of queries, in addition to a comparison of the performance of three databases: a classical SQL database, a NoSQL database and a database specialized in text search. It is performed in support of the development of the MASFAD project (Multi-Agent System For APT Detection), funded by the Belgian Royal Military Academy. Within this thesis, we will demonstrate that MongoDB is better adapted than the other databases to the monitoring of large data sets, and that PostgreSQL shows interesting behavior in the execution of complex queries.
Foreword
When I began working on this Master thesis, I had no idea how it would influence my life, not only during its elaboration, but also for the remainder of my career. This work gave me the opportunity to learn in many respects and to improve my knowledge in network security, database management and Java programming.
I wish to sincerely thank all the people who have believed in me and who have always been there for me during the best and worst moments.
I thank my family, my fiancé and my friends for their patience.
I thank my promoters for the great help and support they gave me and for letting me freely choose the orientation of this project.
I thank the employees of the Tunisian Military Academy and the Tunisian Ministry of Defense for their support. I am grateful for this opportunity that let me experience such a wonderful journey.
I thank the Belgian Royal Military Academy staff for their hospitality and guidance.
I hope I made you all proud.
Marwa
Contents

1 Introduction
   1.1 Preamble
   1.2 Motivation
   1.3 Objectives of the project
   1.4 State of the Art
       1.4.1 APT definition
       1.4.2 Terminology
       1.4.3 History
       1.4.4 Life cycle
       1.4.5 APT detection
   1.5 Originality and output
   1.6 Project outline

2 Comparative study of databases
   2.1 Introduction
   2.2 Preliminary comparison
       2.2.1 Classical SQL databases
       2.2.2 NoSQL databases
       2.2.3 Databases specialized in text search
   2.3 Presentation of the used databases
       2.3.1 PostgreSQL
       2.3.2 MongoDB
       2.3.3 Elasticsearch
   2.4 Comparative tables
   2.5 Summary

3 Implementations
   3.1 Introduction
   3.2 Data storage
       3.2.1 Data format
       3.2.2 Database design
       3.2.3 Codes
       3.2.4 Elasticsearch
   3.3 Agent Uploader
       3.3.1 Algorithm
       3.3.2 Codes (PostgreSQL, MongoDB, Elasticsearch)
   3.4 Agent Uploader with aggregations
       3.4.1 Algorithm
       3.4.2 Codes (PostgreSQL, MongoDB [Net 22], Elasticsearch)
   3.5 Summary

4 Performance tests
   4.1 Introduction
   4.2 Methodology
   4.3 First test: Data import
       4.3.1 Presentation
       4.3.2 Results (before and after indexing)
       4.3.3 Analysis
   4.4 Second test: Agent Uploader
       4.4.1 Presentation
       4.4.2 Results (before and after indexing)
       4.4.3 Analysis
   4.5 Third test: Agent Uploader with aggregations
       4.5.1 Presentation
       4.5.2 Results (before and after indexing)
       4.5.3 Analysis
   4.6 Comparison of the two agents
       4.6.1 Presentation
       4.6.2 PostgreSQL
       4.6.3 MongoDB
       4.6.4 Analysis
   4.7 Summary

5 Conclusion
   5.1 Ascertainment
   5.2 Recommendations
   5.3 Closing remarks

Bibliography
Netography
Appendices: PostgreSQL Codes, MongoDB Codes, Elasticsearch Codes

List of Algorithms

3.1 Agent Uploader
3.2 Agent Uploader with aggregations

List of Figures

1.1 APT attack life cycle
2.1 Generalized Full-Text Architecture
2.2 MongoDB Subscriptions
3.1 DBloading: GetConnection query
3.2 DBloading: extracting the transaction's fields
3.3 DBloading: Insert query
3.4 PostgreSQL: data table
3.5 MongoLoading: GetConnection query
3.6 MongoLoading: Insert query
3.7 MongoDB: data collection
3.8 ElasticLoading: getClient query
3.9 ElasticLoading: Insert query
3.10 Elasticsearch: data type
3.11 PostgreSQL: GetClients query
3.12 PostgreSQL: GetServers query
3.13 PostgreSQL main: Select query
3.14 MongoDB: GetClients query
3.15 MongoDB: GetServers query
3.16 MongoDB main: Select query
3.17 Elasticsearch: GetClients query
3.18 Elasticsearch: GetServers query
3.19 Elasticsearch main: Select query
3.20 PostgreSQL main: Select with group by query
3.21 MongoDB main (with group by): operation $match
3.22 MongoDB main (with group by): operation $project
3.23 MongoDB main (with group by): operation $group
3.24 MongoDB main (with group by): aggregation execution
3.25 Elasticsearch main (with group by): aggregation query
4.1 Comparative diagram: data import
4.2 Comparative curve: impact of the data size rise on the data import
4.3 Comparative diagram: Agent Uploader without indexing
4.4 Comparative diagram: Agent Uploader using single-field indexes
4.5 Comparative diagram: Agent Uploader using a multiple-field index
4.6 Comparative curve: impact of the data size rise on Agent Uploader
4.7 Comparative diagram: Agent Uploader with aggregations without indexing
4.8 Comparative diagram: Agent Uploader with aggregations using single-field indexes
4.9 Comparative diagram: Agent Uploader with aggregations using a multiple-field index
4.10 Comparative curve: impact of the data size rise on Agent Uploader with aggregations
4.11 Comparative diagram: comparison of the two agents on PostgreSQL
4.12 Comparative diagram: comparison of the two agents on MongoDB

List of Tables

2.1 Text-search databases taxonomy
2.2 Databases comparison: organization and community
2.3 Databases comparison: compatibility
2.4 Databases comparison: administration
2.5 Databases comparison: general information
2.6 Databases comparison: costs and software license
2.7 Databases comparison: capabilities/limitations
4.1 Data import: results
4.2 Data import: results on indexed tables
4.3 Agent Uploader: results without indexing
4.4 Agent Uploader: results using single-field indexes
4.5 Agent Uploader: results using a multiple-field index
4.6 Agent Uploader with aggregations: results without indexing
4.7 Agent Uploader with aggregations: results using single-field indexes
4.8 Agent Uploader with aggregations: results using a multiple-field index
4.9 PostgreSQL results: comparison of the two agents
4.10 MongoDB results: comparison of the two agents
Chapter 1

Introduction

1.1 Preamble
As an introduction to this project, we provide some definitions which are useful for a better assimilation of this work.

Cyber space is the virtual terrain in which computers communicate through the exchange of data blocks between networks.

A cyber attack is an attempt to make a computer network malfunction or to steal confidential data by taking advantage of existing vulnerabilities.

An APT attack is an advanced and complex cyber attack performed by highly skilled hackers and even by governments in order to intercept and exfiltrate critical data.

A database is an information container offering a flexible infrastructure for data storage and querying.
1.2 Motivation
Among the large number of cyber attacks occurring daily, APT are considered the most dangerous and severe. These intrusions are sophisticated and silent, as they can remain unperceived for a long period of time, during which critical and high-value information is exfiltrated from high-profile targets. Given these aspects, serious measures should be taken in order to block potential attacks. According to Mandiant's annual threat report [1]:

• 66% of organizations are unaware that they are under attack, and are only notified by an external entity.

• Most attacks can remain unnoticed for nearly eight months.

• Hackers count on outsourced service providers and partners in order to remain unobserved.

• A reconnaissance step is employed in which attackers inspect the ordinary behavior of the network entities so that they can mimic it and not arouse suspicion.
In order to carry out an APT attack, hackers attempt to use new, unknown vulnerabilities and to constantly change their hacking strategies in order to bypass traditional security measures. However, these attacks show some distinguishable signatures which can be used for detection purposes. Many open source software tools monitoring these patterns are nowadays available for public use. These tools can ensure an early APT detection by analyzing outbound traffic and uncovering suspicious interactions. They implement different detection techniques and can be used together to build a complete detection strategy. Here are some examples:

• Splunk: a search, analysis and reporting tool that indexes and queries data coming from various sources and triggers alerts when abnormal behavior is noticed.

• Suricata: a Network Security Monitoring engine, an Intrusion Detection System and an Intrusion Prevention System at once. It allows rules to be written against protocols rather than ports, which makes it possible to bring a malicious communication with a Command and Control server down.

• Kismet: a sniffer, Intrusion Detection System and wireless network detector operating on 802.11 layer 2. It sniffs wireless traffic, analyzes the data and can detect the presence of hidden or default networks as well as probe requests.

• Snort: a signature-based Network Intrusion Detection System working with a combination of rules and pre-processors to analyze network traffic in real time, search its database for the attack and finally report the results.

• Squert: a visual engine that analyzes and queries events stored in an internal database in order to retrieve result sets to which it adds further information, looking forward to discovering new data interrogations.
APT detection is a process that requires monitoring huge amounts of data, such as proxy or firewall log files, which is a challenging task for traditional IT systems and security tools. To carry out an investigation on lots of data, analysts have to perform complex queries. Certainly, databases are not obsolete and can be used to store data and allow analysts to interrogate it through sets of queries. Big Data is difficult to handle and raises the challenge of processing it within a tolerable execution time, but recent software tools with high-performance indexes and distributed architecture possibilities seem to be of great advantage.

Much research in APT detection is nowadays faced with the recurring problem of Big Data. Big Data technology can make important contributions, as its analysis improvements offer efficient opportunities to improve the exposure of APT intrusions in critical security situations. HP and Dell have spent more than $15 billion on software firms specializing only in data management and analytics [3]. In 2010, the Economist annual report recorded that this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.

In order to develop a multi-agent system for APT detection, the MASFAD project is being funded by the European Defence Agency (EDA), the Belgian Royal Military Academy (RMA), the Netherlands Organisation for Applied Scientific Research (TNO) and the German Fraunhofer Institute for Communication, Information Processing and Ergonomics (FKIE), who are joining forces to offer the military network analyst a powerful tool ensuring a high level of defense against APT.
1.3 Objectives of the project
From the preceding discussion and definitions it clearly appears that, in order to improve Big Data processing time in the APT detection procedure, we need to work with recent Big Data tools which offer the opportunity to unlock significant value by making information accessible in near real time.

The main goal of this project is to study how to translate an analyst's complex questions into sets of queries, and to compare the PostgreSQL, MongoDB and Elasticsearch database management systems in order to determine which database presents the best performance in terms of processing and execution time.

To validate the chosen approach, two different versions of a Java agent that investigates an abnormal upload behavior will be implemented and tested with the three different database types. The data sets manipulated will consist of log files from a proxy server of a medium-sized network. Our primary concerns will be:

• How many inserts per second can the database support?

• How fast will the database respond to the queries?

• How will each database manage the growth of event data?
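The first of these questions can be made concrete with a small timing harness. The Java sketch below is purely illustrative: the class name is hypothetical and the database backend is abstracted behind a callback (in the real agents this would be a JDBC statement or a MongoDB/Elasticsearch driver call), so the harness itself stays self-contained.

```java
import java.util.function.IntConsumer;

// Hypothetical micro-benchmark sketch: measures how many inserts per
// second a backend sustains. The backend is abstracted as a callback so
// that the harness does not depend on any particular database driver.
public class InsertBenchmark {

    /** Runs `count` inserts through `insertOne` and returns the inserts/second rate. */
    public static double insertsPerSecond(int count, IntConsumer insertOne) {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            insertOne.accept(i);  // e.g. stmt.executeUpdate(...) in the JDBC case
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return count / seconds;
    }

    public static void main(String[] args) {
        // Stand-in backend: an in-memory counter instead of a real database.
        final long[] stored = {0};
        double rate = insertsPerSecond(100_000, i -> stored[0]++);
        System.out.printf("%d rows stored, ~%.0f inserts/s%n", stored[0], rate);
    }
}
```

In the actual tests, the callback would wrap a real insert statement for each of the three systems, so the same harness yields comparable rates.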
In the next section, we present the State of the Art in Advanced Persistent Threats.
1.4 State of the Art

1.4.1 APT definition
Advanced Persistent Threats are prolonged and mute cyber attacks qualified as sophisticated and complex hacking actions, as they are coordinated by large groups of proficient hacktivists generally supported by governments to conduct multipurpose espionage. They target governments, militaries and companies from the business, financial, social, health care and other sectors where valuable data can be found and be of great benefit.

To be really successful, an APT operation requires a prolonged period of time as well as a high level of stealthiness. The persistent aspect is usually fulfilled with the use of an external Command and Control server which retrieves targeted strategic information via a pre-built botnet army within the victim's network [Net 2].
1.4.2 Terminology
Advanced Persistent Threats seem hard to define due to the large number of existing definitions. The term was first used in 2006 by the United States Air Force (USAF) in order to conceal the intrusion investigations and prevent civilian interference [2].

In the next paragraphs we will try to clear up the APT definition by explaining its terminology [4], [5]:

Advanced: APT actors acquire multiple existing malware technologies as well as high knowledge and technological skills, making them capable of creating new, unknown crimeware by combining the existing ones, which makes it more and more powerful and hard to detect. As these groups are most of the time sponsored by very authoritative entities and even nation states, they have access to advanced information-gathering devices such as telephone-spying equipment.

Persistent: an APT attack usually takes a long period of time to be carried out (it can be performed over several years), during which small amounts of important data are retrieved each time, in a strategy qualified as "low and slow" that allows the intrusion to keep silent and not arouse suspicion by showing deviant behavior.

Threats: APT are aggressions conducted by humans with the intention of stealing valuable information in order to make subsequent use of it. They cannot be automated or executed by readily available programs and modules.
1.4.3 History
The examination of APT history is a crucial part of APT understanding and assimilation, as it permits the derivation of useful observations and annotations helping in future detection procedures.

Targeted, socially-engineered emails dropping Trojans¹ to exfiltrate confidential information were noticed by both the U.K. National Infrastructure Security Co-ordination Centre (UK-NISCC) and the U.S. Computer Emergency Response Team (US-CERT) in June 2005. This attack was qualified as an Advanced Persistent Threat, as it lasted for an important period of time during which it was able to bypass all the existing security tools [6].

The designation APT is attributed to an attack showing signs of advance and persistence. Here are some examples of the most famous APT intrusions of the past few years [Net 1], [7]:

• The Gozi virus: first detected in 2007, it is a banking virus designed to steal personal banking information. One million computers were infected by the virus in many countries, such as the U.S., UK, Germany, Poland, France, Finland, Italy and Turkey, and even systems at NASA². It was basically spread to the victims through a benign-looking PDF document, as well as by many other methods. The virus was sold to many hacktivist groups around the world by its developers; many variants of it were detected during the past years, and banks continue to experience intrusions from Gozi.

• Operation Aurora: a series of cyber attacks against companies in multiple industries which started in 2009. It is reported that these attacks originated in China. The attack began with a large phishing operation in order to gain access to victims' systems, where a Trojan horse designed to retrieve data was installed. Companies that were attacked preferred to remain anonymous for fear of losing their clients and shareholders through a lack of confidence in their security systems, which allowed the attackers to keep performing the intrusions more widely. The first public announcement was made by Google in January 2010.

• RSA breach: a successful APT attack against the RSA³ network which was detected in March 2011. Similar to Operation Aurora, it began with a successful phishing campaign exploiting an Adobe Flash vulnerability and obtaining access to and control over victims' computers. The malware used was a remote access Trojan (RAT)⁴ named Poison Ivy. The attack was far from complex, but it was very effective, as data relating to RSA's best-selling SecurID authentication technology was stolen.

• Flame: multiple cyber espionage attacks on governmental ministries, educational institutions and individuals in Middle Eastern countries, discovered by Iran's National Computer Emergency Response Team in 2012. Unlike the other attacks presented above, a sophisticated malware was spread via USB keys into internal networks.

¹ A Trojan horse is a malware program containing malicious code whose execution causes data theft or system damage.
² NASA stands for the National Aeronautics and Space Administration.
³ RSA is an American computer and network security company.
1.4.4 Life cycle
Basically, an Advanced Persistent Threat life cycle is divided into four major stages, as shown in the following figure:

Figure 1.1: APT attack life cycle (source: blog.trendmicro.com/trendlabs-security-intelligence/in-depth-look-apt-attack-tools-of-the-trade/)

1. Preparation: the first step for the hackers is to determine their target. Depending on the security level of the victim's information system, they seek help from other hacktivist groups, companies or even nation states. The next step consists in elaborating the intrusion strategy and gathering the devices, malware and techniques needed for the intrusion mission. They might also decide to develop their own crimeware by combining existing techniques in order to make the attack more complex and hard to notice.

2. Infection: the second stage consists in infiltrating the victim's local network by gaining access and administrative control over local machines. This step can be achieved through different techniques, such as phishing⁵ and social engineering⁶, and by taking advantage of existing network vulnerabilities and installing covert backdoors.

3. Deployment: when the hackers are sure of possessing enough control over the target's network, the main malware causing the attack is spread. This crimeware is designed to gather valuable data from the internal network. Communications between the hackers and the victims are managed through a Command and Control server (C&C server)⁷, which guarantees the anonymity of the attackers. To raise the number of infected machines in the targeted network, attackers perform a lateral move, which consists in infecting other machines and taking control over them.

4. Persistence: an APT intrusion is always a long-term mission. To maintain presence inside the victim's system, APT performers make sure to keep silent and transparent. They tend to masquerade as the normal behavior of the machines inside the network in order to bypass all security measures and devices. They also ensure continued control over the access channels and equipment acquired previously to achieve their goals, and keep exfiltrating stolen confidential data from the targeted network.

⁴ A RAT is a program including malware code which installs a back door on the victim's computer, giving the attacker administrative control over the infected machine.
1.4.5 APT detection

As previously detected APT attacks have shown noticeable attributes, current and future detection and investigation procedures will be based on similar signatures whose analysis can lead to an early discovery of such intrusions. Furthermore, APT rely on C&C⁸ servers whose activity can be identified, contained and disrupted through the analysis of outbound network traffic [4].

⁵ Phishing is the act of disguising oneself as a trustworthy person in order to gain one's confidence in an electronic communication, which permits the transmission of malware to the victim.
⁶ Social engineering is a technique used to psychologically influence employees to reveal important information about their jobs, in a process of gathering valuable data on the victim before performing an attack.
⁷ A C&C server is a machine capable of sending commands to equipment being part of a botnet army and retrieving outputs from it.
⁸ C&C refers to Command and Control.
In order to effectively detect APT, a security strategy should be put into place using multiple software tools and implementing several security measures, such as [7]:

• Rule Sets

• Statistical and Correlation Methods

• Manual Approaches

• Automatic Blocking of Data Exfiltration
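As an illustration of the statistical and correlation methods listed above, a very simple heuristic flags the regular, low-jitter callbacks that an infected machine makes to its C&C server. The Java sketch below is hypothetical and not taken from any of the cited tools; the class name, timestamps and jitter threshold are invented for illustration.

```java
import java.util.List;

// Illustrative sketch of a statistical rule: regular, low-jitter
// connections to one external host are a classic C&C beaconing sign.
// All names and threshold values here are hypothetical.
public class BeaconRule {

    /** True when the inter-arrival times of the connections are suspiciously regular. */
    public static boolean looksLikeBeacon(List<Long> timestampsSec, double maxJitterSec) {
        if (timestampsSec.size() < 3) return false;  // not enough samples to judge
        int n = timestampsSec.size() - 1;
        double[] gaps = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) {
            gaps[i] = timestampsSec.get(i + 1) - timestampsSec.get(i);
            mean += gaps[i] / n;
        }
        double var = 0;
        for (double g : gaps) var += (g - mean) * (g - mean) / n;
        return Math.sqrt(var) <= maxJitterSec;  // low standard deviation = regular heartbeat
    }

    public static void main(String[] args) {
        // A connection every ~300 s is flagged; irregular human browsing is not.
        List<Long> beacon = List.of(0L, 300L, 601L, 900L, 1201L);
        List<Long> browsing = List.of(0L, 40L, 500L, 530L, 1400L);
        System.out.println(looksLikeBeacon(beacon, 5.0));
        System.out.println(looksLikeBeacon(browsing, 5.0));
    }
}
```

Real tools combine many such rules with correlation across hosts and time windows; a single heuristic like this one only illustrates the principle.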
Nowadays, multiple commercial software tools are available on the market, implementing different defense technologies such as Data Loss Prevention (DLP)⁹, advanced reporting, forensic analysis, real-time threat analysis, etc. The following list presents some examples of tools dedicated to APT detection, as well as some of their distinctive properties:

Verint CYBERVISION Advanced Detection System [Net 3]: this detection system offers high-performance and scalable network security by providing superior malware detection capabilities to national-level organizations and cyber security operation centers (CSOCs). It implements a unique centralized approach and is able to provide a holistic overview of all the relevant critical networks at a national level. Its most distinctive characteristic is its ability to enable a fast and coordinated response in case of a security breach.

FireEye Threat Prevention Platform [Net 4]: a system that supplements traditional security defenses such as firewalls, Intrusion Detection Systems, etc. By enabling the aggregation and correlation of events, it can identify blended attacks. It is also considered to be of good use for deeper, hands-on analysis and investigation of APT by building a 360-degree, stage-by-stage analysis of the advanced attack. It uses a cloud platform which efficiently shares auto-generated threat intelligence, such as covert callback channels, as well as new threat findings from FireEye Labs.

ISC8 Cyber adAPT system [Net 5]: as an APT detection tool, it learns the network topology and leverages knowledge of advanced malware's actions after weeks and months of network activity analysis and correlation. It also offers signature-less, network-based advanced malware detection at speeds of 10 Gb and higher, thanks to the implementation of a sensor-based, near real-time forensics technology. This advanced technology identifies next-generation APT ahead of perimeter solutions, before devastating damage or critical data theft can occur.

These detection systems present great capabilities in the APT detection field. However, they cannot be fully trusted, as APT mimic the typical behavior of the local network and can thus manage to bypass them. For this reason, a detection system that is adapted to the internal network's dependencies and needs will be of higher benefit.

⁹ Data Loss Prevention is a technology designed to detect and prevent potential data breach or exfiltration transmissions.
1.5 Originality and output
A deep analysis of the APT detection requirements, as well as of some of
the most famous attacks of the past few years, allowed us to identify
APT attributes which can be used in the investigation process. Existing
commercial and open source solutions do not seem adequate for the
protection of a local military network, which justifies the need for the
development of a customized APT detection system. In order to study the
efficiency of database use within the MASFAD project, we have to perform
several tests that will allow us to choose the most suitable database
management system for our detection needs. We will conduct our study
through the investigation of anomalous uploading behavior, which can
indicate an APT intrusion. A high rate of data uploading can lead to the
identification of a Command and Control channel and of data exfiltration
from machines that are part of a botnet within the internal network.
The main output of this work is:
• A comparison between the performances of three different databases
relative to APT detection needs in terms of monitoring enormous
amounts of data.
1.6 Project outline
This dissertation is organized into four chapters plus a concluding one.
The present chapter has given you the motivations and the goals of the
project as well as a review of the State of the Art in the relevant domain.
An overview of the other chapters is given below.
Chapter 2 begins with a presentation of the different databases used
during this work. It is followed by a section devoted to a comparison
study between these databases to justify our choices by showing that they
fulfill the requirements of an APT detection system.
Chapter 3 presents the implementation of our agent as well as the
different data import codes. We explain the main programming features
that we have used and try to simplify our design in order to make this
dissertation easier to understand for its readers.
Chapter 4 is devoted to the results and analysis of detailed performance tests.
The last chapter is devoted to presenting a subjective point of view about
the different databases studied during this project. Finally, future study
directions are suggested and global conclusions are drawn.
Chapter 2
Comparative study of databases
2.1 Introduction
Databases are becoming more and more ubiquitous and essential to many
aspects of modern societies based on computer and network communications.
In this chapter, we will be studying the different types of databases.
This study will be divided into two major stages. The first one consists in
a preliminary analysis of the three database categories used during this
project, with the purpose of defining the general concepts that vary from
one category to another. The second one is devoted to a comparison of
the three database management systems on which this study is based.
2.2 Preliminary comparison
2.2.1 Classical SQL databases
Databases have been in use since the 1960s. The need to create such systems
for data organization and management was due to the remarkable speed
of data production as well as the rise of data demand and accessibility
through the strata of modern society. The first prototype SQL
database model was developed by IBM in 1974 [8].
As its name indicates, this database category is based on the
SQL1 language, which is a normalized language conceived to query relational databases.
1 SQL stands for Structured Query Language.
This standard is divided into three classes depending
on the type of data querying:
• DDL: Data Definition Language, for data description, structuring
and codification.
• DML: Data Manipulation Language, for data restitution: insertion,
modification and interrogation.
• DCL: Data Control Language, for data administration, security and
integrity assurance.
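As an illustration (the table and user names here are hypothetical, not taken from the project), one statement from each of the three classes could look like:

```sql
-- DDL: define a structure
CREATE TABLE example (id INT PRIMARY KEY, name VARCHAR(50));
-- DML: manipulate the data
INSERT INTO example (id, name) VALUES (1, 'test');
SELECT name FROM example WHERE id = 1;
-- DCL: administer access rights
GRANT SELECT ON example TO analyst;
```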
SQL databases are necessarily relational ones. In fact, a relational database
is based on a group of concepts that define the database structure. This
structure is called a schema and it describes the data, the relations between
them as well as a set of integrity rules which should be respected.
In a relational database, data are grouped in tables composed of rows and
columns. The tables are vertically scalable, meaning that the database is
scaled by increasing the horse-power of the hardware. This scaling process
can be managed by increasing the capacities of the CPU, RAM, SSD, etc.
Using the SQL language makes these databases very powerful when it
comes to performing complex queries, and suitable for highly transactional
applications. However, they are not considered a best fit for hierarchical
data storage [Net 6].
2.2.2 NoSQL databases
This new type of database came out as a reaction against the traditional
relational data model, caused by the enormous explosion in data volumes [8].
Actually, the term NoSQL refers to "not only SQL", its main characteristic
being that it is schemaless. NoSQL offers a new way of data storage and
querying, different from the tabular relational model, and shows an easier
integration into applications as it has fewer restrictions. For this reason,
NoSQL databases are known to be more flexible than relational ones.
Schema changes driven by application evolution requirements will not
entail long, complicated modifications, which enables faster data iteration
and better integrity management [9]. We don't talk about tables anymore.
It's all about collections, which can store
heterogeneous data rows called documents. NoSQL databases introduce
a new way of data design which takes future progressions into consideration.
They are horizontally scalable and fit perfectly the addition of
clusters to the staple infrastructure for handling high transactional
loads [Net 7]. Data is stored using the key-value pair (KVP)
model, which makes the databases untyped, with most of the data
stored as strings.
However, this data storage model is not ACID compliant, unlike the
typical relational model. For this reason, database designers and
implementers have to think about data placement, replication, and fault
tolerance, as these are not expressly controlled by the technology itself [10].
ACID actually refers to [Net 8]:
• Atomicity: transactions have to follow an "all or nothing" rule, which
means that an interrupted modification is entirely rolled back.
• Consistency: the database is subject to a set of integrity and
consistency rules which must be fulfilled by any data insertion request.
• Isolation: transactions occurring at the same time must not impact
or perturb each other's execution.
• Durability: any transaction which has been committed cannot be lost,
and the modifications it has brought to the database cannot be
discarded randomly.
Databases which do not respect the ACID model, which is among the most
crucial concepts in database theory, are not considered reliable.
The term "not only SQL" refers to the fact that the SQL language can
also be used within this type of database. These databases are not yet
fully stable, and their support is not considered good enough: users
have no choice other than to rely on communities, as experts in
the field are not easy to find given the young age of this database
model.
2.2.3 Databases specialized in text search
The explosion in data volumes gathered and manipulated by companies,
social networks, etc. has come with a new problem: unstructured textual
data presented in electronic format. As this type of
information cannot be stored and queried using typical database
management systems, new engines capable of performing text search over
unstructured data were required [Net 9].
Full-text search databases are not completely independent from the
older technologies, as they use the traditional tabular model in one
way or another. Here is the taxonomy used by text-search databases:
Traditional Terminology    Text-search databases Terminology
Database                   Index
Table                      Type
Row                        Document
Column                     Field
Schema                     Mapping
Indexes                    Automatic indexing

Table 2.1: Text-search databases taxonomy
The architecture of a text search engine is different from that of other
databases. The procedure is the following: external data is inserted into rows
within the main document index. The search engine automatically
creates an ordered word list to load into the main word index. In response
to a data query, the search engine first performs the search in the word
index in order to identify which documents match the request, and
returns pointers to the matching results in the main document
index. This process offers great performance when monitoring textual
data thanks to the index's high granularity, allowing rapid indexed access
to specific words.
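The procedure above can be illustrated with a toy inverted index (a deliberately minimal sketch to show the word-index-then-document-pointer lookup; it is not how any of the studied engines is actually implemented):

```java
import java.util.*;

// Minimal inverted index: maps each word to the set of document ids
// containing it, mirroring the "main word index" described above.
public class TinyTextIndex {
    private final Map<String, Set<Integer>> wordIndex = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Insert a document and index its words automatically.
    public int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            wordIndex.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // A query consults the word index first, then returns pointers
    // (ids) into the main document store.
    public Set<Integer> search(String word) {
        return wordIndex.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyTextIndex idx = new TinyTextIndex();
        idx.add("suspicious upload to remote server");
        idx.add("normal browsing traffic");
        idx.add("large upload detected");
        System.out.println(idx.search("upload")); // ids of matching documents
    }
}
```

The query never scans the documents themselves; it goes straight from the word to the matching document ids, which is what gives this architecture its speed on textual data.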
A text search engine architecture is described by the following figure:
Figure 2.1: Generalized Full-Text Architecture
source: http://www.ideaeng.com/database-full-text-search-0201
2.3 Presentation of the used databases
2.3.1 PostgreSQL
PostgreSQL is a database management system based on the relational
model. It was developed by the PostgreSQL Global Development Group,
written in the C language and initially released on May 1, 1995 under the
PostgreSQL License. In addition to the traditional functions offered by a
relational database server, such as the storing and retrieving mechanisms
required by any type of application, PostgreSQL proposes extensions
which are not available in previous database management systems,
such as:
• Several indexing methods for handling complex SQL queries, updateable
and materialized views, rules, constraints, triggers, foreign keys,
transaction integrity, etc [Net 10].
PostgreSQL is ACID-compliant and uses multiversion concurrency control
(MVCC)2 to avoid locking issues [11]. It also gives access to additional
concepts like classes, functions, inheritance, etc., extending the system
and joining it to the object-relational category.
Positives [Net 10]:
Portability and execution on multiple platforms.
Fast and easy installation, especially thanks to the Postgres installer
provided since version 8.0.
Sturdiness.
Rich in powerful functions.
Active support offered by a large community of professionals.
Conformity with the majority of the SQL:2011 standard.
BSD3 license.
Configurability and extensibility.
Stability.
Limitations [Net 10]:
Low performances when manipulating small data volumes.
Data distribution problems.
2.3.2 MongoDB
MongoDB is a free and open-source NoSQL, document-oriented database
management system developed by MongoDB Inc, written in the C++
programming language and initially released in 2009 under a combination
of the GNU AGPL v3.04 and the Apache License. It is schemaless and
2 MVCC is a system avoiding concurrency problems by making the transactions invisible until the time they are committed, in order not to disturb other changes on the database.
3 BSD refers to Berkeley Software Distribution.
4 AGPL refers to Affero General Public License.
has a JSON-like5 document structure, making data integration easier,
faster and more flexible. The schema in MongoDB is qualified as dynamic,
as we can store in the same collection documents that do not have the
same structure, and common fields in a collection can hold different
types of data. According to db-engines.com, in April 2014 MongoDB was
in 5th place among the most popular database management systems,
and in first place among NoSQL database management systems [Net 12].
MongoDB has a large number of great features giving it advantages
over other document-oriented databases, such as subscriptions. MongoDB
subscriptions are features addressing the most demanding requirements
of an application and enabling users to be more rapid and effective.
Figure 2.2: MongoDB Subscriptions
Source:http://www.mongodb.com/mongodb-overview
5 JSON refers to JavaScript Object Notation: it is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
The main purposes of MongoDB are high availability, performance
and especially scalability, as it is designed to work on large server
deployments and multi-site architectures. It supports standalone operation
and handles traffic growth perfectly, as more servers can easily be
added. MongoDB offers a large number of drivers for various programming
languages such as Java, C++, PHP, JavaScript, etc.
Positives [Net 12]:
Short execution time even with the increase of data volumes.
High scalability performances thanks to the existing replication mechanisms.
Easily distributable.
Flexibility in failure cases with fail-over and recovery mechanisms.
Partial or total availability.
Advanced features.
Powerful query system when it is combined with the best database
design and the appropriate indexing strategy.
Great modeling for many sensitive data types such as graphs, location-based
data, log data of any format, etc.
Limitations [Net 12]:
Risks arising from the limitations of the query model: a minimum
level of coherence is required.
No transactions supported.
No equivalent for the join query in SQL.
Administration issues caused by the absence of a data model.
Absence of a proper debugger and of a graphical user interface.
2.3.3 Elasticsearch
2.3.3.1 Presentation
Elasticsearch is an open source, RESTful6 search engine developed by
Shay Banon, written in the Java programming language and released under
the Apache License. It is mainly designed for searching through text,
6 RESTful refers to REpresentational State Transfer, which is a software architecture designed for distributed hypermedia systems.
returning textual results to a given query, and for statistical analysis of
text bodies. Data is stored in a special format optimized for language-based
searches, and its main protocol is implemented with HTTP/JSON [Net 14].
Elasticsearch indexes JSON documents automatically, using a unique
type-level identifier for each indexing operation. It fits the storage of
large amounts of unstructured data in a distributed architecture of several
servers, on which it can perform powerful search functions. If the
purpose is simply data storage, it is not wise to use Elasticsearch: its
main use is text search, and in that case other databases are more
suitable for the need [Net 13].
Despite the fact that Elasticsearch is based on Apache Lucene7, it provides
a simpler API for public use, better than Lucene's old one. It also offers a
rich infrastructure making the scaling process across machines and data
centers a lot easier, inter-operation with non-Java languages, and
operational ease of use.
Positives [Net 14]:
High search performance thanks to the data sharding concept.
Easy to use, as no configuration is needed.
Efficiency: starting a node requires only its integration into the ecosystem,
in which it benefits from automatic replication and dimensioning.
Parallelism between nodes' treatments.
Large number of proposed features: facets, percolation, plug-ins, etc.
Easy integration into other information systems, since it is under the
Apache2 license.
Limitations [Net 14]:
Absence of a graphical user interface.
Not completely stable.
No support for transactions.
Immaturity as the software is considered relatively new.
Near real-time data availability.
7 Apache Lucene is a free/open source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.
No access-control or authentication features.
Unsatisfying documentation which looks more like a tutorial.
2.4 Comparative tables
The following information was mainly retrieved from the official sites of
the three studied databases [Net 12] [Net 13] [Net 14].
Table 2.2: Databases comparison: organization and community
Table 2.3: Databases comparison: compatibility
Table 2.4: Databases comparison: administration
Table 2.5: Databases comparison: general information
Table 2.6: Databases comparison: costs and software license
Table 2.7: Databases comparison: Capabilities/Limitations
2.5 Summary
After this theoretical study, we can see that the three database management
systems, PostgreSQL, MongoDB and Elasticsearch, are potential
candidates for our research, as they show great abilities in the big data
processing field. A practical study is definitely needed in order to determine
which of the databases mentioned above is the most suitable for
APT detection procedures. To achieve this goal, a prototype of an agent
specialized in the investigation of abnormal uploading behavior will be
implemented in the Java programming language and tested on the three
databases, with log files of different sizes, in order to be able to
make comparisons, observations and final conclusions.
Chapter 3
Implementations
3.1 Introduction
This chapter is devoted to the explanation of the code used to test
the databases studied in the previous chapter. To implement our agent,
we thought about two methods: the first one is to retrieve only the wanted
data fields from the databases and perform the rest of the operations on
the request results, while the second one consists in using aggregation
methods on the database fields and getting results ready for analysis
without the need for subsequent operations.
As each database management system possesses its own Java API1
and its own query system, we have nine programs in total: one for data
storage and two versions of the Agent Uploader for each database. They
will be fully explained in the current chapter.
3.2 Data storage
The first thing to think about when storing data into a database is the
schema design. Even with schemaless databases, it is crucial to extract
the information into specific individual fields within the documents in
order to facilitate data extraction; otherwise we would need to use
regular expressions and to perform a full scan each time we desire to
retrieve specific information.
1 API refers to Application Programming Interface.
As mentioned in the first chapter, the real challenge with APT
detection systems is Big Data. During this project we were given the Squid
proxy log files of a medium-sized network to work with. The following
section is devoted to the analysis of the data format of a Squid proxy
log file in order to arrange it properly in the databases.
3.2.1 Data format
Two default formats are built into the Squid proxy and are defined by the
option logformat in the file squid.conf:
• The native format
• The common format
In our case, the log file format we will be using during this work is the
native one [Net 15]:
The format is:

time                 %9d.%03d
elapsed              %6d
remotehost           %s
code/status          %s/%03d
bytes                %d
method               %s
URL                  %s
rfc931               %s
peerstatus/peerhost  %s%s%s
type                 %s

For example, a line in the log file named access.log looks like the following:

1394950584.861 438 10.0.149.23 TCP_MISS/200 934 GET http://dzayfqe.trwvkpc.au/lsetyuxs.html - DIRECT/69.114.1.230 text/html

Each line in this file contains ten fields which hold significant information
that can be used to investigate the state of the network. The succeeding
paragraph explains the signification of each field separately.
Squid native access.log format in detail [Net 18]
• time: A decimal(11,3) value giving the time at which the transaction
was logged by the proxy, expressed in seconds since the Unix epoch
with millisecond resolution.
• elapsed = duration: Expressed in milliseconds; it represents the time
during which the transaction occupied the cache to be executed.
• remotehost = client address: The IP address of the client machine
from which the transaction originated.
• code/status = result codes: Composed of two entries separated by a
slash: the first one refers to the kind of request; the second one is the
code describing how the transaction succeeded or failed.
• bytes: The amount of data that was delivered to the client.
• method: The request method used to demand the procuration of
certain data.
• URL: The URL to which the client is trying to get access.
• rfc931: Contains the ident lookup for the current client, but it is
always replaced by a "-" for performance reasons.
• peerstatus/peerhost = hierarchy code: Divided into two entities: the
first one is a code which shows how the request was handled; the
second one is the IP address of the server to which the request was
forwarded.
• type: The content type of the object as written in the HTTP reply
header.
3.2.2 Database design
In order to build a coherent database, with enough expressive and concise
information, on which several APT detection agents can be run
effectively, we have decided to decompose some fields further and
to get rid of some others that do not hold any useful information.
We divided the URL into three separate fields:
1. Protocol: we will be able to query the database and retrieve the
protocol field independently, as the botnet protocols through which APT
attacks are performed incorporate commonly used protocols such as
http, ftp, etc. in order to go unnoticed.
2. Domain: as APT attackers rely mostly on C&C2 servers, we can query
the domains visited by local machines in order to look for known
domain names, although these tend to change constantly. We can
also look for domains which are frequently consulted to reclaim
confidential data, which can allow an early APT discovery.
3. Document
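This decomposition can be sketched as follows (a minimal illustration assuming well-formed URLs of the form protocol://domain/document; the helper name is ours, not the project's):

```java
// Splits a URL into the three fields used in our schema:
// protocol, domain and document. Assumes the URL is well formed
// (protocol://domain/document); a real agent would need more checks.
public class UrlFields {
    static String[] decompose(String url) {
        String[] protoRest = url.split("://", 2);
        String protocol = protoRest[0];
        int slash = protoRest[1].indexOf('/');
        String domain = slash >= 0 ? protoRest[1].substring(0, slash) : protoRest[1];
        String document = slash >= 0 ? protoRest[1].substring(slash + 1) : "";
        return new String[] { protocol, domain, document };
    }

    public static void main(String[] args) {
        // The URL from the example log line of Section 3.2.1.
        String[] f = decompose("http://dzayfqe.trwvkpc.au/lsetyuxs.html");
        System.out.println(f[0] + " | " + f[1] + " | " + f[2]);
    }
}
```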
We have eliminated the rfc931 field, as it doesn't hold any important
information that might be needed by a network analyst in his mission to
detect APT intrusions.
The peerstatus/peerhost field was decomposed into two independent
fields:
• rfc
• host_adr
2 C&C refers to Command and Control servers.
The database schema is the following:

Field name    Type
time          double
duration      int
client-adr    String
req_kind      String
rslt_code     int
bytes         int
req_method    String
protocol      String
domain        String
document      String
rfc           String
host_adr      String
type          String

3.2.3 Codes
PostgreSQL
We will be using PostgreSQL version 9.2. We first created a database
named "proxydata" and a table named "data" to hold the log information
that will be queried by the Agent Uploader.
The procedure consists in connecting to the database, reading the log
file line by line, splitting each line into separate fields and inserting them
into the table "data".
In order to communicate with a PostgreSQL database, we had to import
the postgresql-9.2-1003.jdbc4.jar file from http://mvnrepository.com/artifact/org.postgresql/postgresql/9.2-1003-jdbc4 and add it
to the Libraries package of our project, named DBloading.
Getting connection to the database [Net 16]
We have written a class named ”GetConnection” which will return an object of type Connection.
The connection is ensured by the DriverManager using a connection
URL: jdbc:postgresql://host:port/database. The default port is 5432 and,
as we are working on a local machine, we will be using localhost. The
DriverManager.getConnection() method is called like the following:
Connection object = DriverManager.getConnection(url, username, password) [Net 21]
Figure 3.1: DBloading: GetConnection query
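A minimal sketch of this connection step might look as follows (host, port, database name and credentials are illustrative placeholders, not the project's actual values):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of the GetConnection step: build the JDBC URL and ask the
// DriverManager for a Connection. Host, port, database name and
// credentials below are placeholders.
public class GetConnection {
    static String buildUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    static Connection connect() throws SQLException {
        // Would open a session against a running PostgreSQL server.
        return DriverManager.getConnection(
                buildUrl("localhost", 5432, "proxydata"), "user", "password");
    }

    public static void main(String[] args) {
        // Print the URL that connect() would use.
        System.out.println(buildUrl("localhost", 5432, "proxydata"));
    }
}
```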
Getting the transaction’s fields
In order to insert the transactions into the database, we had to extract
the log information into independent fields corresponding to the design
previously presented, which will later be inserted in the table data. To
achieve this goal, we used the String[] split(String delimiter) method of
the Java String class, which takes as parameter a group of delimiters.
These delimiters are the characters that separate two fields. They are
presented in the list below, along with an example of where they can
be found:
• Multiple spaces, to separate the time and duration.
• '/', to separate the req_kind and rslt_code.
• '://', to separate the protocol and domain.
• '-', to separate the document and the rfc.
For a proper splitting procedure, we have to specify to the method
that these delimiters can be repeated several times within the same line.
This task is achieved with the use of a '+' operator that follows the
delimiter list.
The split method returns an array of the fields contained in the original
line of the log file.
Figure 3.2: DBloading: extracting the transaction’s fields
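A runnable sketch of the same splitting step (the exact delimiters here are a simplification of the list above; the example line is the one shown in Section 3.2.1) might look like:

```java
// Illustrative sketch: split one native-format log line into fields.
// Field order follows Squid's native format: time elapsed remotehost
// code/status bytes method URL rfc931 peerstatus/peerhost type.
public class LineSplitter {
    public static void main(String[] args) {
        String line = "1394950584.861    438 10.0.149.23 TCP_MISS/200 934 "
                    + "GET http://dzayfqe.trwvkpc.au/lsetyuxs.html - "
                    + "DIRECT/69.114.1.230 text/html";
        String[] t = line.trim().split(" +");   // '+' handles repeated spaces
        String[] codeStatus = t[3].split("/");  // req_kind / rslt_code
        String[] peer = t[8].split("/");        // rfc / host_adr
        System.out.println("client=" + t[2] + " bytes=" + t[4]
                + " method=" + t[5] + " rslt_code=" + codeStatus[1]
                + " host_adr=" + peer[1]);
    }
}
```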
Insert query
Inserting the fields extracted from the log file requires an insert SQL
query. Each time we issue a SQL query to the database, we require a
Statement or PreparedStatement instance. The PreparedStatement object
is a slightly more powerful version of a Statement, as it may be
parametrized; this means that it should be used when user-input
parameters are required within the query. PreparedStatement is also known
for providing the necessary mechanisms to avoid SQL injections; otherwise,
both methods do the same. In our case, we don't need to pass any
parameters to the query, so we will use a Statement object instance to
hold our insert query.
We will be inserting the fields that we retrieved previously from each
line of the log file.
Figure 3.3: DBloading: Insert query
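As a sketch of how the insert statement is assembled from the extracted fields (shortened to three of the thirteen columns for readability; in the real code the resulting string is passed to the Statement's executeUpdate()):

```java
// Sketch: build the INSERT statement for one parsed transaction.
// Only three columns are shown; the actual table "data" holds all
// thirteen fields of the schema in Section 3.2.2.
public class InsertQuery {
    static String insertSql(String clientAdr, int bytes, String reqMethod) {
        return "INSERT INTO data (client_adr, bytes, req_method) VALUES ('"
                + clientAdr + "', " + bytes + ", '" + reqMethod + "')";
    }

    public static void main(String[] args) {
        System.out.println(insertSql("10.0.149.23", 934, "GET"));
    }
}
```

With a PreparedStatement, the values would instead be bound as parameters, which avoids the manual quoting done here.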
Verification of the database insertion
Figure 3.4: PostgreSQL: data table
MongoDB
We will be using MongoDB version 2.4.9.
Unlike with PostgreSQL, we don't have to create the database or the
collection as a preliminary step, because they will automatically be created
when the Java program is run. This is due to the fact that MongoDB
is a schemaless, document-oriented database that doesn't need a
pre-configuration or design of the database.
Each line of the log file will be represented by a document in the
collection "data" and, as already said in the data storage section, we
have to divide each document into fields in order to facilitate the data
picking. Each transaction will be read from the log file, split into fields
as we have done in the DBloading code, and finally inserted as a document
of the data collection within the proxydata database.
In order to communicate with a MongoDB database, we had to import
the mongo-java-driver-2.4 jar file from http://mvnrepository.com/artifact/org.mongodb/mongo-java-driver and add it to the Libraries
package of our project, named MongoLoading.
Getting connection to the database
To connect to the MongoDB data collection, we have to connect
to the database proxydata on the default port 27017.
The GetConnection method will return a DBCollection that consists
of the collection we will be working on.
Figure 3.5: MongoLoading: GetConnection query
Insert query
Each document corresponding to a log transaction will be built from the
fields extracted from this transaction using the split method, as we did
in the project DBloading. We have to get a document instance of the
type BasicDBObject, add the necessary fields to it, then insert it into the
collection.
Figure 3.6: MongoLoading: Insert query
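Conceptually, the document built for one transaction looks like the following sketch (a plain LinkedHashMap stands in for the driver's BasicDBObject so the example runs without a MongoDB server; only four of the thirteen fields are shown):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the document built for one log transaction. In the real
// code a BasicDBObject is populated field by field in the same way
// and then passed to collection.insert(...).
public class MongoDocSketch {
    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("client_adr", "10.0.149.23");
        doc.put("req_method", "GET");
        doc.put("bytes", 934);
        doc.put("domain", "dzayfqe.trwvkpc.au");
        System.out.println(doc);
    }
}
```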
Verification of the database insertion
Figure 3.7: MongoDB: data collection
3.2.4 Elasticsearch
The Elasticsearch version we will be using is 1.1.1.
We had to import the elasticsearch-1.1.1 jar file into the Libraries
package of the new project we created, named ElasticLoading.
As Elasticsearch is based on Lucene, we had to look for the lucene-core
jar file compatible with our database version. So, we downloaded
the lucene-core-4.7.2.jar file from http://lucene.apache.org/core/downloads.html and added it to our Libraries package too.
Getting connection to the database [Net 19]
The connection to an Elasticsearch database needs a client instance of
the class TransportClient. It connects to an existing Elasticsearch node
through a transport socket on port 9300 and returns an object of the
type TransportClient.
Figure 3.8: ElasticLoading: getClient query
Insert query
First, we have to get an insert request instance for our index. This task
is ensured by the method prepareIndex, which is related to the client we
got earlier and which takes as parameters the index name along
with the type name. It returns an IndexRequestBuilder object instance
that will hold the document to be inserted.
We also have to start an XContentBuilder instance, corresponding to
a container of a JSON object, to which we add the fields extracted
from each line of the log file. This instance should be added to
the IndexRequestBuilder corresponding to the insert query, which
will later be executed.
Figure 3.9: ElasticLoading: Insert query
Verification of the database insertion
Figure 3.10: Elasticsearch: data type
3.3 Agent Uploader
The Agent Uploader performs an investigation mission within sets of proxy
log transactions in order to detect an abnormal uploading rate. For each
server visited by each client, the agent sums the bytes corresponding to
each request method: GET, POST or CONNECT, then calculates an upload
rate which is compared to a chosen threshold. This threshold
is chosen by the network analyst as a function of the typical upload
behavior inside the local network. If the agent discovers a rate that exceeds
the pre-fixed threshold, it declares an alert.
3.3.1 Algorithm
The first version of our Agent Uploader consists in:
1. Selecting the list of client addresses
2. Selecting the list of server addresses visited by each client
3. Selecting req_method and bytes for each couple (client address, server
address)
4. Performing the calculations
5. Performing the analysis
Algorithm 3.1 Agent Uploader
Connect to the database
for each client
    for each server visited by this client
        select req_method, bytes
        for each result in the result set
            perform the calculations:
                if req_method == GET
                    BytesByGet = BytesByGet + bytes
                else if req_method == POST
                    BytesByPost = BytesByPost + bytes
                else if req_method == CONNECT
                    BytesByConnect = BytesByConnect + bytes
        perform the analysis:
            if Get_method exists
                if Post_method exists
                    ratio = BytesByPost / BytesByGet
                else
                    ratio = 0
            if Connect_method exists
                ratio = BytesByConnect / resultSet_size * 100000
            if ratio > threshold
                print(client, server, ratio)
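The calculation and analysis steps of Algorithm 3.1 can be sketched as self-contained Java, operating on an in-memory list of (req_method, bytes) pairs instead of a database result set; the threshold value of 1.5 is arbitrary, not the one a network analyst would choose:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the Agent Uploader computation for one (client, server)
// pair: sum bytes per request method, then derive the upload ratio
// exactly as in Algorithm 3.1.
public class UploadRatio {
    static double ratio(List<String[]> rows) { // each row: {req_method, bytes}
        long byGet = 0, byPost = 0, byConnect = 0;
        boolean hasGet = false, hasPost = false, hasConnect = false;
        for (String[] r : rows) {
            long b = Long.parseLong(r[1]);
            switch (r[0]) {
                case "GET":     byGet += b;     hasGet = true;     break;
                case "POST":    byPost += b;    hasPost = true;    break;
                case "CONNECT": byConnect += b; hasConnect = true; break;
            }
        }
        double ratio = 0;
        if (hasGet) ratio = hasPost ? (double) byPost / byGet : 0;
        if (hasConnect) ratio = (double) byConnect / rows.size() * 100000;
        return ratio;
    }

    public static void main(String[] args) {
        // Much more data POSTed than GETted: a suspicious upload pattern.
        List<String[]> rows = Arrays.asList(
                new String[] {"GET", "1000"},
                new String[] {"POST", "4000"},
                new String[] {"GET", "1000"});
        double r = ratio(rows);
        System.out.println(r > 1.5 ? "ALERT ratio=" + r : "ok ratio=" + r);
    }
}
```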
3.3.2 Codes
3.3.2.1 PostgreSQL
Each time we need to query a PostgreSQL database, we need an
instance of the Statement object to hold the query, whose execution
will return a ResultSet object containing all the results retrieved from the
consulted data table. A ResultSet object supports a cursor pointing to the
current result record. When we create a ResultSet, we have to set three
attributes [Net 20]:
Property      Value                       Explanation
Type          TYPE_SCROLL_INSENSITIVE     The result records can be navigated both forward and backwards
Concurrency   CONCUR_READ_ONLY            The result records can only be read
Holdability   HOLD_CURSORS_OVER_COMMIT    The ResultSet is kept open during the whole connection period
Select clients

The method GetClients selects the distinct client addresses present in the log file. It takes a Connection instance as parameter and returns an array of client addresses. The select query is performed as follows.
Figure 3.11: PostgreSQL: GetClients query
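Since the figure itself is not reproduced here, a plausible form of the query, assuming the table and column names used elsewhere in this chapter (table data, field client_adr), is:

```sql
-- Hedged reconstruction, not the thesis's exact code:
SELECT DISTINCT client_adr FROM data;
```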
Select servers visited by each client

The method GetServers selects the distinct server addresses visited by a given client address. It takes as parameters a Connection instance and a client address, and returns an array of server addresses. The select query is done as follows.
Figure 3.12: PostgreSQL: GetServers query
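As the figure is not reproduced here, the query presumably resembles the following sketch (same assumed table and field names as before):

```sql
-- Hedged reconstruction: distinct servers visited by one client.
SELECT DISTINCT host_adr FROM data WHERE client_adr = ?;
```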
Select req_method, bytes for each server visited by a specific client

The select query of the project's main section is done within a Statement instance, and its execution returns a ResultSet object which contains the information needed to carry on the agent's investigation.
Figure 3.13: PostgreSQL main: Select query
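A hedged sketch of this query, under the same naming assumptions, would be:

```sql
-- Hedged reconstruction: the two fields needed for the analysis.
SELECT req_method, bytes FROM data WHERE client_adr = ? AND host_adr = ?;
```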
3.3.2.2 MongoDB
Select clients

The method GetClients selects the distinct client addresses present in the log file. It takes a DBCollection instance as parameter and returns a list of client addresses. The select query is performed as shown below.
Figure 3.14: MongoDB: GetClients query
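In mongo-shell terms, a hedged equivalent of the Java-driver call shown in the figure (the collection name data matches the index commands used later in Chapter 4) is:

```js
// Hedged mongo-shell equivalent of the driver call:
db.data.distinct("client_adr")
```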
Select servers visited by each client

The method GetServers selects the distinct server addresses visited by a given client address. It takes as parameters a DBCollection instance and a client address, and returns a list of server addresses. An instance of a BasicDBObject is used to specify the where clause of the select query, which is executed as shown below.
Figure 3.15: MongoDB: GetServers query
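A hedged mongo-shell equivalent, using a hypothetical client address as the filter, is:

```js
// Distinct servers for one (hypothetical) client address:
db.data.distinct("host_adr", { client_adr: "192.168.1.10" })
```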
Select req_method, bytes for each server visited by a specific client

To select the request method and bytes corresponding to a couple (client_adr, host_adr), we need to build a BasicDBObject named query, which holds the where clause of the select query, and a second one named fields, in order to limit the fields retrieved from the database to the two we will need for our subsequent calculations and analysis.
Figure 3.16: MongoDB main: Select query
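In mongo-shell notation, the query/fields pair described above corresponds roughly to the following (the addresses are hypothetical placeholders):

```js
// where-clause object ("query") and field projection ("fields"):
db.data.find(
  { client_adr: "192.168.1.10", host_adr: "10.0.0.5" },  // query
  { req_method: 1, bytes: 1 }                            // fields
)
```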
3.3.2.3 Elasticsearch
Select clients

The method GetClients selects the distinct client addresses present in the log file. It takes a TransportClient instance as parameter and returns a list of client addresses.

An Elasticsearch query is contained within an instance of a SearchRequestBuilder object, to which we add the wanted properties of the selection. The search can be executed across one or more types, so we had to specify the name of the type that we will be searching through. To perform the distinct selection, we had to use a terms facet, which allows us to specify field facets (facets are a group of available filters which can be applied to a set of search results) that return the most frequent terms. It is necessary to specify the facet size, meaning the number of most frequent distinct client addresses, but this has a major drawback: if we choose a size smaller than the actual number of clients, the result will be incomplete, and if we choose a very large size to avoid this, it can lead to performance issues.

This select query is shown in the next figure.
Figure 3.17: Elasticsearch: GetClients query
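In the JSON query DSL of that Elasticsearch generation, a terms facet over the client field would look roughly like the following sketch (the facet name and the size value are our own arbitrary choices, illustrating the size drawback discussed above):

```json
{
  "query": { "match_all": {} },
  "facets": {
    "clients": { "terms": { "field": "client_adr", "size": 10000 } }
  }
}
```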
Select servers visited by each client

The method GetServers selects the distinct server addresses visited by a given client address using a terms facet. It takes as parameters a TransportClient instance and a client address, and returns a list of server addresses.
Figure 3.18: Elasticsearch: GetServers query
Select req_method, bytes for each server visited by a specific client

To select the request method and bytes corresponding to a couple (client_adr, host_adr), we need to use a bool query to hold the where clauses. A bool query is a query that matches documents matching boolean combinations of other queries. We retrieve only the two fields we need with the addField method of the SearchRequestBuilder instance.
Figure 3.19: Elasticsearch main: Select query
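Expressed directly in the JSON DSL, such a bool query with field limitation would look roughly like this sketch (the addresses are hypothetical placeholders):

```json
{
  "query": {
    "bool": {
      "must": [
        { "term": { "client_adr": "192.168.1.10" } },
        { "term": { "host_adr": "10.0.0.5" } }
      ]
    }
  },
  "fields": ["req_method", "bytes"]
}
```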
3.4 Agent Uploader with aggregations

The methods GetClients and GetServers are the same as in the first version of our agent, so we will not go through them again in the following section.
3.4.1 Algorithm

The second version of our Agent Uploader consists of:
1. Selecting the list of client addresses
2. Selecting the list of server addresses visited by each client
3. Selecting req_method and Sum(bytes) for each couple (client_adr, host_adr), grouped by req_method
4. Performing the analysis
Algorithm 3.2 Agent Uploader with aggregations
    Connect to the database
    for each client
        for each server visited by this client
            select req_method, Sum(bytes) group by req_method
            for each result in the result set
                perform the analysis:
                    if Get_method exists
                        if Post_method exists
                            ratio = BytesByPost / BytesByGet
                        else
                            ratio = 0
                    if Connect_method exists
                        ratio = BytesByConnect / resultSet_size * 100000
                    if ratio > threshold
                        print(client, server, ratio)
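The grouped selection in Algorithm 3.2 can be mirrored in plain Java; the following sketch (illustrative names, not the thesis code) computes the same per-method byte sums that the database returns for one (client, server) couple:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory equivalent of
// "select req_method, Sum(bytes) group by req_method".
public class ByteAggregator {

    public record Request(String method, long bytes) {}

    // Groups requests by method and sums the bytes of each group,
    // preserving the first-seen order of the methods.
    public static Map<String, Long> sumBytesByMethod(List<Request> requests) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (Request r : requests) {
            sums.merge(r.method(), r.bytes(), Long::sum);
        }
        return sums;
    }
}
```

In the second agent this work is pushed to the database, so only the already-summed rows travel back to the agent.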
3.4.2 Codes

3.4.2.1 PostgreSQL
Select req_method, Sum(bytes) for each server visited by a specific client, grouped by req_method

To retrieve Sum(bytes) for each couple (client_adr, host_adr), we use the aggregate function Sum, which performs a sum calculation over the field bytes. These calculations are grouped by req_method in order to get the results classified by request method type, as in the following example:
req_method    Sum(bytes)
GET           100
POST          200
CONNECT       0
This select query is shown in the following figure.
Figure 3.20: PostgreSQL main: Select with group by query
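Under the same naming assumptions as before, a hedged sketch of this grouped query is:

```sql
-- Hedged reconstruction of the grouped selection:
SELECT req_method, SUM(bytes)
FROM data
WHERE client_adr = ? AND host_adr = ?
GROUP BY req_method;
```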
3.4.2.2 MongoDB [Net 22]
In order to use aggregations in MongoDB with the Java driver, we need to build an aggregation pipeline. This pipeline consists of three major operations:
• Operation $match: describes the ”where” clause of the query.
Figure 3.21: MongoDB main (with group by): Operation $match
• Operation $project: specifies the fields which need to pass through the pipeline.
Figure 3.22: MongoDB main (with group by): Operation $project
• Operation $group: presents the ”group by” clause by specifying the
fields by which the aggregation will be grouped.
Figure 3.23: MongoDB main (with group by): Operation $group
Other operations can be defined and added to the pipeline if needed, such as $sort for sorting the results. To execute the aggregation query, we used a second method, aggregate(List<DBObject>), as follows.
Figure 3.24: MongoDB main (with group by): aggregation execution
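The whole pipeline, expressed in mongo-shell notation rather than the Java driver (addresses and the output field name are hypothetical placeholders), corresponds roughly to:

```js
db.data.aggregate([
  { $match:   { client_adr: "192.168.1.10", host_adr: "10.0.0.5" } },
  { $project: { req_method: 1, bytes: 1 } },
  { $group:   { _id: "$req_method", total: { $sum: "$bytes" } } }
])
```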
3.4.2.3 Elasticsearch
During the implementation of this code, we could not retrieve the aggregation results, as we did not find any documentation on the subject. On the Elasticsearch blog, we found that many other users are asking for Java API documentation on aggregations, but none was available at the time of writing this dissertation.

In order to build an aggregation query on Elasticsearch with the Java driver, we have to use two types of aggregations:

• A bucket aggregation: to group documents for each req_method. It returns the unique terms indexed for a given req_method.

• A metrics aggregation: to return the Sum(bytes) of the groups of documents returned by the previous query, using the sum metric.

Each of the two aggregations is built within an AggregationBuilders instance. The bucket aggregation encapsulates the metrics one as a subAggregation. This means that the sum aggregation takes as input the output of the buckets: the sum metric is applied to the content of each group, i.e. each req_method in our case.
Figure 3.25: Elasticsearch main (with group by): aggregation query
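For reference, the equivalent aggregation expressed directly in Elasticsearch's JSON DSL (the aggregation names are our own choices) would look roughly like:

```json
{
  "aggs": {
    "by_method": {
      "terms": { "field": "req_method" },
      "aggs": {
        "total_bytes": { "sum": { "field": "bytes" } }
      }
    }
  }
}
```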
3.5 Summary
In this chapter, we have presented the implementations of our agents, as well as an explanation of the methods, libraries and objects used to communicate with and query the different database management systems. The next chapter is devoted to testing our agents and analyzing the performance of the different databases. Results will be provided in order to draw conclusions about each database's capabilities.
Chapter 4

Performance tests

4.1 Introduction
The last part of our comparative study consists of running multiple tests in order to observe the performance of the three databases. The retrieved results will allow us to draw conclusions about the most suitable database management system for our APT detection needs. The tests that follow have been executed locally on a Microsoft Windows 7 Professional 32-bit machine with the following properties:

• Hard disk: 193 GB.
• RAM: 4 GB.
• Processor: Intel® Core™ i5 CPU M480 @ 2.67 GHz.
4.2 Methodology
The following tests are classified into two parts. The first attempts to measure the execution time needed to fill each database with log data. The second is similar, except that it targets the execution time of the different agents on each database type. In order to analyze the behavior of each database management system as the data size grows, the tests are done with three log files of different sizes: 836 KB, 33 MB and 153 MB.
In addition, we will be using two different indexing strategies, as indexes are said to speed up data retrieval and reduce the server load by providing quick jump references to where the searched information can be found. However, indexes can present some disadvantages which we have to take into consideration, such as the space they occupy and the time needed to update them after each modification of the database, which can slow it down [Net 17].

As indexes are used to limit the number of results we are trying to find, the fields that should be indexed are those appearing in the where clauses of the select queries used in our code. These fields are client_adr and host_adr. In a first step, we will index each field separately, which means we will end up with two single-field indexes. In a second step, we will gather the two fields inside one multiple-fields index, as it is said to beat the single-field approach in terms of speeding up data retrieval.

The indexing strategies will only be applied to PostgreSQL and MongoDB, as Elasticsearch indexes data automatically.

To avoid slowing down the data insertion in each database, we will disable the indexes at each insertion attempt, but we will also present a few test results showing how indexes can degrade insertion performance.
4.3 First test: Data import

4.3.1 Presentation
This test consists of measuring the time needed by each database management system to upload a set of transactions from a log text file and store the data in its internal structure: table, type or collection. As mentioned earlier, we will be using three different log files in order to observe the database behavior as the data size rises. Monitoring big data volumes is a crucial criterion, as the APT detection process is based on big data manipulation.
4.3.2 Results

4.3.2.1 Before indexing
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     38            27         100
273141 (33 MB)                    730           173        1354
1263317 (153 MB)                  5208          795        6727

Table 4.1: Data import: results
Figure 4.1: Comparative diagram: data import
4.3.2.2 After indexing
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB
6682 (836 KB)                     68            57
273141 (33 MB)                    1908          617
1263317 (153 MB)                  9154          2088

Table 4.2: Data import results on indexed tables
4.3.3 Analysis
In the previous tests, we have shown that inserting data into indexed databases lowers insertion performance remarkably, even with small amounts of data. For this reason, it is wise to delete the indexes before any insert query in order to get the best execution time.

Looking at the results of Table 4.1, PostgreSQL and MongoDB show similar results while importing the first log file, but this is not the case for Elasticsearch, which shows a higher data import time. The more the log file size rises, the more the differences between the insertion times appear. It clearly comes out that MongoDB presents the best execution time for the data insertion procedure, beating both PostgreSQL and Elasticsearch. Another observation is that PostgreSQL does not fall far behind MongoDB when inserting small to medium sized data sets.

Elasticsearch takes longer to insert data, as it automatically indexes the documents using a unique type-level identifier for each indexing operation, which consumes more time than a plain document insertion.

In order to get a deeper analysis, we have examined the influence of the data size rise on inserting data into the three databases. The results are presented in the following curve.
Figure 4.2: Comparative curve: impact of the data size rise on the data import
The curve demonstrates that data growth affects MongoDB less than the other databases, which reinforces the previous findings. For inserting big volumes of data, we conclude that MongoDB is the most efficient database, as it exhibits an admissible execution time which would fulfill the requirements of a future APT detection system.
4.4 Second test: Agent Uploader

4.4.1 Presentation
In order to test the Agent Uploader we have implemented, we run it on the different databases so we can observe how it adapts to the increase of manipulated data. As a first step, tests are run without creating indexes. In a second step, two types of indexes are created and used within the tests: single-field indexes and multiple-fields ones.
Creating single-field indexes

To create the single-field indexes we used the following queries:

On PostgreSQL
• CREATE INDEX index_client ON data(client_adr);
• CREATE INDEX index_server ON data(host_adr);

On MongoDB
• db.data.ensureIndex({ client_adr:1 });
• db.data.ensureIndex({ host_adr:1 });

Creating multiple-fields indexes

To create the multiple-fields indexes we used the following queries:

On PostgreSQL
• CREATE INDEX index_client ON data( client_adr, host_adr );

On MongoDB
• db.data.ensureIndex({ client_adr:1 , host_adr:1 });
4.4.2 Results

4.4.2.1 Before indexing
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB
6682 (836 KB)                     26            30
273141 (33 MB)                    1391          4200
1263317 (153 MB)                  7246          10855

Table 4.3: Agent Uploader: results without indexing
Figure 4.3: Comparative diagram: Agent Uploader without indexing
4.4.2.2 After indexing

Using single-field indexes
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     8             9          10
273141 (33 MB)                    101           70         81
1263317 (153 MB)                  1560          961        295

Table 4.4: Agent Uploader: results using single-field indexes
Figure 4.4: Comparative diagram: Agent Uploader using single-field indexes
Using multiple-fields index
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB    Elasticsearch
6682 (836 KB)                     8             5          10
273141 (33 MB)                    57            24         81
1263317 (153 MB)                  1271          75         295

Table 4.5: Agent Uploader: results using multiple-fields index
Figure 4.5: Comparative diagram: Agent Uploader using multiple-fields index
4.4.3 Analysis
Without indexing, PostgreSQL and MongoDB show very long execution times that exceed multiple hours. Querying data which is not indexed is far from useful. Unindexed data is the default state, but working with it as-is is definitely not a wise choice.

It is obvious that indexes reduce the agent's execution time on a large scale, from multiple hours to a few minutes. Using single-field indexes clearly reduces the execution time for both PostgreSQL and MongoDB, but the multiple-fields index reduces it even further. The two indexing strategies seem to have different effects on each database's performance. It is easy to notice that using a multiple-fields index is a lot more beneficial, as it divides the single-field execution time by more than a factor of 10 in MongoDB's case. PostgreSQL does not show great results, especially when monitoring big data sizes, and there is no noticeable difference between the two indexing methods. MongoDB, however, presents great results once again, as querying more than a million documents takes only a minute and a few seconds. Elasticsearch also works remarkably well with big data amounts and even beats MongoDB's execution time using single-field indexes.
For a more profound analysis, we take a close look at the impact of data growth on the agent's performance. The results are given below.

Figure 4.6: Comparative curve: impact of the rise of data size on Agent Uploader
The curve exhibits important results, as it shows how both PostgreSQL and Elasticsearch are affected by the growth of the monitored data size: the more data we have to query, the lower the performance, especially for PostgreSQL. However, MongoDB presents only a slight performance variation as data grows. It actually displays the best adaptation to the enlargement of the data through which the agent executes its investigation. Eventually, MongoDB wins the competition once again, providing the best execution time when monitoring large sets of log transactions.
4.5 Third test: Agent Uploader with aggregations

4.5.1 Presentation
The same tests that we performed with the first version of the Agent Uploader will be run with the second version, in order to compare the two and conclude in a subsequent section which one presents the best performance and the lowest execution time. As the Agent Uploader with aggregations was not fully implemented on Elasticsearch, these tests will be run only on PostgreSQL and MongoDB.
4.5.2 Results

4.5.2.1 Before indexing
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB
6682 (836 KB)                     27            50
273141 (33 MB)                    2102          5520
1263317 (153 MB)                  8507          12762

Table 4.6: Agent Uploader with aggregations: results without indexing
Figure 4.7: Comparative diagram: Agent Uploader with aggregations without indexing
4.5.2.2 With indexing

Using single-field indexes
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB
6682 (836 KB)                     11            15
273141 (33 MB)                    99            151
1263317 (153 MB)                  1073          1908

Table 4.7: Agent Uploader with aggregations: results using single-field indexes
Figure 4.8: Comparative diagram: Agent Uploader with aggregations using single-field index
Using multiple-fields index
Execution time in seconds:

Number of transactions treated    PostgreSQL    MongoDB
6682 (836 KB)                     7             12
273141 (33 MB)                    57            41
1263317 (153 MB)                  544           156

Table 4.8: Agent Uploader with aggregations: results using multiple-fields index
Figure 4.9: Comparative diagram: Agent Uploader with aggregations using multiple-fields index
4.5.3 Analysis
As stated in the previous analysis, it is not wise at all to query unindexed data on any of the databases we used: it takes hours to get results, which does not fit the needs of a future APT detection system.

During the first two tests, PostgreSQL showed a lower execution time than MongoDB, but using the multiple-fields index turned the results upside down, as it reduced MongoDB's execution time remarkably, to the point of beating PostgreSQL. These observations confirm the study we made in the second chapter, in which we said that PostgreSQL, along with other relational SQL databases, is a better fit for performing complex queries than NoSQL ones. It also comes out that using the appropriate indexing method on MongoDB simplifies the execution of complex queries and improves the agent's execution time.

A deeper analysis of the databases' response to the enlargement of the monitored data sets is provided by the following curve.
Figure 4.10: Comparative curve: impact of the rise of data size on Agent Uploader with aggregations
PostgreSQL shows roughly exponential behavior with respect to the data size, while MongoDB exhibits a more linear one. These observations demonstrate that MongoDB adapts best to the data growth, especially with very large data sets; for small to medium sized data, its results are similar to PostgreSQL's.

For the third time, MongoDB shows the best performance in terms of execution time, on the condition of using an appropriate indexing method that speeds up data querying and offers good results. This can be of great use in a future application, especially when performing complex queries that might slow down the database.
4.6 Comparison of the two agents

4.6.1 Presentation
So far, we have compared the performance of each version of the Agent Uploader on each of the three databases on its own. This section is devoted to the comparison between the two versions' results for PostgreSQL and MongoDB. It will not be possible to apply this comparison to Elasticsearch, as we were not able to carry out the second version's implementation due to the absence of the required documentation.

In this section, we seek to conclude on the best way to query each database in order to get the best performance, which is needed in a future APT detection system.
4.6.2 PostgreSQL
Execution time in seconds:

Number of transactions treated    First version    Second version
6682 (836 KB)                     8                7
273141 (33 MB)                    57               57
1263317 (153 MB)                  1271             544

Table 4.9: PostgreSQL results: comparison of the two agents
Figure 4.11: Comparative diagram: comparison of the two agents on PostgreSQL
4.6.3 MongoDB
Execution time in seconds:

Number of transactions treated    First version    Second version
6682 (836 KB)                     5                12
273141 (33 MB)                    24               41
1263317 (153 MB)                  75               156

Table 4.10: MongoDB results: comparison of the two agents
Figure 4.12: Comparative diagram: comparison of the two agents on MongoDB
4.6.4 Analysis
For PostgreSQL, the second version of the Agent Uploader takes less time to execute than the first one. This can be explained by the fact that SQL databases, including PostgreSQL, are a good fit for complex, query-intensive environments. The gain in execution time grows along with the data volume.

In MongoDB's case, the two agent versions seem to have similar results, but the first one has a slightly lower execution time. Once again, we can explain this observation by the fact that NoSQL databases are not the best fit for handling complex queries. In general, NoSQL databases, including MongoDB, don't have standard interfaces for complex queries, and NoSQL queries are not as powerful as SQL ones.
Unfortunately, we could not get through the entire implementation of the second version of our agent for the Elasticsearch database owing to the lack of documentation, so we could not compare the performance of its two versions.
4.7 Summary
In this chapter, we have presented the different tests that we performed on the three studied databases. Looking at the results and our analysis, it comes out that MongoDB shows a tolerable execution time, especially when monitoring big data volumes, that beats both PostgreSQL and Elasticsearch. By providing result sets within a limited time interval, MongoDB allows a future APT detection system to perform network investigation in an admissible time, especially when running multiple agents to perform the APT uncovering mission.

From these conclusions, it follows that we have answered our research question, even if we still have some unsolved issues with the implementation of the second version of our agent on Elasticsearch due to the absence of the required documentation. We have shown that MongoDB presents the best results when monitoring big data in an APT detection process, over PostgreSQL and Elasticsearch. We have also demonstrated that MongoDB, as a NoSQL database, does not fit perfectly for performing complex queries, while PostgreSQL shows interesting behavior toward these queries, beating the performance of the simple ones. If we seek a minimum execution time, it is wiser to limit the communication with the database to data retrieval and perform any further calculations or analysis on the server side, which speeds up the agents considerably.
Chapter 5

Conclusion

The detection of Advanced Persistent Threats is a challenging task, given the sophistication and the complexity of these intrusions. Many software tools, open source, free and commercial, are available for public use, implementing different approaches. In order to protect a network from being intruded, an appropriate APT detection strategy should be put into place, combining several APT detection mechanisms.

The real challenge behind APT detection procedures is "Big Data". Big Data refers to extremely large amounts of varying, fast-changing information through which searching is not convenient for traditional technologies. Due to its heavy impact on applications in a wide variety of fields, databases specialized in Big Data monitoring, called "Big Data Tools", have been launched on the market.
In this thesis, we addressed the problem of running multiple database queries in order to help detect APT. APT detection joins the field of log data collection and investigation, to which Big Data Tools seem to be of great benefit. We have carried out a comparative study of three different database types in order to draw conclusions about the most suitable database management system for a future APT detection software.

We have implemented two versions of an agent which investigates an abnormal upload rate, in order to analyze the best way to query a database and to conclude which one of the two querying methods demonstrates the best performance. To test the different databases, we used different log file sizes, with the purpose of closely examining the impact of data growth on each database's performance.

Below we provide a summary of the findings we have collected during this study and a set of recommendations for best practice based on the results we retrieved.
5.1 Findings
PostgreSQL

It shows good results when monitoring small to medium sized data sets, but low performance when manipulating large amounts of data. PostgreSQL is a good fit for performing complex queries, where it presents interesting results compared to the performance of simple ones.

MongoDB

It shows the best performance for data import and for the Agent Uploader on data indexed with a multiple-fields index. However, it is not a good fit for complex queries, though it fits simple querying greatly. For the development of an APT detection system, we advise the use of MongoDB, as it is capable of providing the needed requirements.
Elasticsearch

It takes a long time to import data but shows good results when running the Agent Uploader. Elasticsearch is still far from stable and lacks documentation on performing aggregations with the Java API, along with many other issues. For this reason, we do not recommend its integration within the MASFAD project.
5.2 Recommendations
Considering the study as a whole, we conclude that MongoDB is the most suitable database among the studied ones for APT detection needs in terms of response time. MongoDB shows great performance when monitoring big data and offers several concepts which can optimize big data processing further. Here we present a few recommendations and subsequent solutions, stemming from our study, in order to enhance the results we have collected in a future work.

MongoDB offers sharding as a solution when data amounts challenge a single server's capacities, either CPU or storage. It consists of distributing the data over multiple machines called "shards" which, all together, make up a single complete database.

In order to lower big data's impact on database performance, splitting the data into several tables can have a remarkable effect on the execution time. Within MongoDB this concept is called "data partitioning". It joins the sharding concept but occurs at the collection level. The purpose of this procedure is to speed up the execution time of the most frequent queries. To reach this goal, the splitting should be made on the objects used by these queries to filter the results. In our case, we could create a database per client.

Once again, to avoid performance issues caused by big data volumes, MongoDB provides a specific collection type named "capped collection", whose maximum size can be specified in advance. When this maximum is reached, old documents are deleted, so we would not have to monitor insignificant data.

Finally, MongoDB provides methods for maintaining a balanced data distribution, facilitating database administration and ensuring an equitable data partitioning between the different shards.
5.3 Closing remarks
In this work, we have only studied three database management systems, and we do not pretend to have presented the best existing solutions or to have thoroughly studied all related aspects. Nevertheless, the results that we collected are promising and can present a strong starting point for further research, or even for the implementation of an APT detection prototype.
Netography

[Net 1] http://www.itbusinessedge.com/slideshows/the-most-famous-advanced-persistent-threats-in-history.html
[Net 2] https://www.academia.edu/6309905/Advanced_Persistent_Threat
[Net 3] http://www.verint.com/solutions/communications-cyber-intelligence/solutions/cyber-security/advanced-persistent-threats-apt-detection/index
[Net 4] http://www.fireeye.com/products-and-solutions/
[Net 5] http://www.isc8.com/products/cyber-adapt.html
[Net 6] http://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/
[Net 7] http://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/
[Net 8] http://databases.about.com/od/specificproducts/a/acid.html
[Net 9] http://www.ideaeng.com/database-full-text-search-0201
[Net 10] http://www.postgresql.org/
[Net 11] http://www.db-engines.com
[Net 12] http://www.mongodb.org/
[Net 13] http://www.elasticsearch.org/
[Net 14] http://exploringelasticsearch.com/overview.html
[Net 15] http://wiki.squid-cache.org/Features/LogFormat#Feature:_Customizable_Log_Formats
[Net 16] http://www.tutorialspoint.com/postgresql/postgresql_java.html
[Net 17] http://www.interspire.com/content/2006/02/15/introduction-to-database-indexes/
[Net 18] http://wiki.squid-cache.org/Features/LogFormat
[Net 19] http://www.elasticsearch.org/guide/en/elasticsearch/client/java-api/current/client.html
[Net 20] http://tutorials.jenkov.com/jdbc/resultset.html
[Net 21] http://jdbc.postgresql.org/documentation/80/connect.html
[Net 22] http://docs.mongodb.org/ecosystem/tutorial/use-aggregation-framework-with-java-driver/
Appendices

PostgreSQL Codes
• Data import
• Agent Uploader
• Agent Uploader with aggregations

MongoDB Codes
• Data import
• Agent Uploader
• Agent Uploader with aggregations

Elasticsearch Codes
• Data import
• Agent Uploader
• Agent Uploader with aggregations