Analytics in security
Technical white paper
Contents
Executive summary
Introduction to analytics
Evolution in analytics
Rule-based methods
Statistical methods
Detection- and prediction-based methods
Relations among the three types of methods
From detection to prediction: Data-mining vs. machine-learning methods
Trends of future analytics methods
Typical analytics workflow
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Big Data analytics
What is Big Data?
Big Data analytics architecture
Data layer
Processing layer
Visualization and presentation layer
Applying analytics to security
The goal of analytics in security
Speed and accuracy
Formalizing the unknown
Security data
Vulnerability data
Device data
Traffic data
Asset enrichment data
Organization enrichment data
The possibilities of analytics in security
Detecting new and unknown threats
True positive matching
False positive evaluation
Root-cause analysis
Designing attack vectors
Predicting a future security concern
Resource planning
Case study: Detecting new threats
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Handling misalignments between the goal and result
Deployment
The challenges with analytics
Security risks
Data centralization
Data custodians
Operational risks
Data processing
Time to prototype
System complexities
Conclusion
Authors
Executive summary
In recent times, analytics has been a hot topic in a variety of industries. The security industry is no exception to this phenomenon. With data
being the new form of currency in enterprises, there is a wealth of information that can be gleaned from this data. All it needs is a little bit of time,
the right set of skills, and a robust path to follow. A combination of these traits could create the perfect analytics program that may assist an
existing security team in their day-to-day activities.
In general, analytics isn’t expected to be a one-stop shop that solves all problems in the world. Rather, it is a meticulous approach that intelligently sorts data, groups it into logical aggregations, and highlights the most important items to consider. Everything else—before and after this general workflow—is similar to today’s processes.
This paper dives into some of the basic concepts of analytics and explains the various processes and architecture involved in an analytics
program. A follow-up discussion elaborates on some of the various possible narratives of applying analytics in the context of an organization’s
security program. A detailed case study is also provided, describing a real experience of implementing such a process.
Finally, the other side of the coin is revealed by a short discussion on some anticipated risks within a security analytics program. While this
paper may benefit people across different roles, it is important to mention that all perspectives captured here are based on a security
analytics researcher’s point of view. This approach showcases the real benefits and risks of such a program based on the fundamental
research that drives the entire program.
Introduction to analytics
Analytics is a process of performing necessary actions on recorded data to discover, interpret, or represent meaningful patterns or knowledge of
the data. From a broader perspective, any method or process of analyzing raw data to extract useful information is analytics. Historically, analytics
started a long time ago. For example, the Swedish government began collecting population data as early as 1749 to record and understand the
geographical distribution of Sweden’s population. This exercise was carried out to sustain an appropriate military force. 1
Over the last few centuries, analytics has been applied to different fields of the society, forming multidimensional analytics subdomains, such as
architectural analytics, behavioral analytics, business analytics or business intelligence, customer analytics, news analytics, web analytics, speech
analytics, and more. Each of these subdomains has many challenging problems that need to be answered, and analytics is an effective way to answer these questions, extract insights, or draw conclusions from data.
Take business analytics, for example: data-driven companies can leverage their data assets, including marketing data, sales data, customer data, and so on, to make informed decisions, such as forecasting marketing changes, exploring new sales patterns, or identifying new customer groups. These insights would not exist without analytics. Another interesting application is web analytics, which collects, measures,
analyzes, and reports web traffic data to understand and optimize web usage. This technique has been broadly used by online businesses to
maximize their web traffic and improve their business profits. Additionally, even in the finance area, sophisticated analytics algorithms are
frequently used to perform automatic high-frequency trading in the stock market to maximize profits.
No doubt, analytics has been widely used in various scenarios across industries to reach different business or research goals, especially
wherever recorded data is available.
1. Statistics of Sweden’s history, retrieved on 17 November 2016 from Statistics Sweden
Evolution in analytics
Methodologies and tools used in analytics have evolved quickly over the last few centuries, especially the last few decades. With science, mathematics, and technology being widely applied in every area of society, scientific computation and statistical analysis are now predominantly used in analytics to gain a deep understanding of the data.
Evolution in analytics has also been pushed by the exponential growth of data. With the fast evolution in the IT industry, data collection and
storage has become extremely easy and cheap. Therefore, the amount of data to be analyzed in many fields has grown exponentially, forcing the
tools and methodologies in analytics to change greatly as well. Traditionally, records were made on paper, which led to manual analytics on paper. With the emergence of electronics, storing and processing data electronically became a reality. Then, in the last few decades, with the evolution of very-large-scale integration, saving and processing data at a very large scale became possible, answering more previously unanswerable questions than ever before. Analytics, by utilizing computing power and the convenience of software development, started a new era—the Big Data analytics era.
Figure 1. Trend of exponential growth of data and decrease in storage costs 2
The methods used in analytics can be mainly classified into three categories, in terms of the chronological evolution. With data being generated
exponentially and more scientific methods getting involved in analytics, organizations are seeking answers to more sophisticated and challenging
questions using the data. This trend is pushing analytics from the original experience-based or rule-based analysis to a more mathematical
way—statistical-based or behavior-based methods—and finally, to more advanced and predictive strategies, such as machine learning or data
mining-based methods. Meanwhile, implementing and deploying these methods have also become more time- and computation-intensive.
Rule-based methods
In the beginning, the data volume was small, the number of data sources was limited, and this data was normally saved in table-based forms. Only simple questions could be answered via analytics, such as totals, averages, changes over time, and others. Frequently, domain knowledge and experience drove the data analysis, so simple mathematical formulas or rule-based filtering algorithms were the main methods used in analytics.
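The sketch below is a minimal illustration of a rule-based filter in Python; the log fields, thresholds, and sample records are invented for the example rather than taken from any particular product.

```python
# A minimal sketch of rule-based filtering: hand-written rules derived from
# domain knowledge flag records of interest. Field names and thresholds here
# are hypothetical.

def is_suspicious(record):
    """Return True when any hand-written rule fires for this record."""
    rules = [
        record.get("failed_logins", 0) > 5,          # repeated failed logins
        record.get("bytes_out", 0) > 1_000_000_000,  # unusually large upload
        record.get("dest_port") in {23, 2323},       # legacy telnet ports
    ]
    return any(rules)

logs = [
    {"host": "web01", "failed_logins": 2, "bytes_out": 10_000, "dest_port": 443},
    {"host": "db02", "failed_logins": 9, "bytes_out": 5_000, "dest_port": 22},
]

# Only records matching a known rule are reported.
print([r for r in logs if is_suspicious(r)])
```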
2. forbes.com/sites/gilpress/2016/08/05/iot-mid-year-update-from-idc-and-other-research-firms/ and zdnet.com/article/enterprise-storage-trends-and-predictions/
Statistical methods
Next, more mathematical and statistical methods were introduced into analytics when more data became available. Basic statistics of data (for example, average, median, standard deviation, and others), analysis of variance (ANOVA 3), factor analysis, regression analysis, and well-known statistical tests are some of the popular statistical methods for analytics.
Basic statistical metrics are the most straightforward way to spot problems in any data, such as skewness or fluctuations. ANOVA comprises statistical models used to analyze the differences between group means and their variations. Factor analysis is used to identify a smaller number of variables (factors) that can describe the observed, correlated variables—such an analysis is very useful when data sets with large numbers of variables depend on a few underlying hidden factors.
Regression analysis is normally used to estimate the relationship between one dependent variable and one or more independent variables
(predictors)—the resulting regression function is particularly useful for depicting the trend in the data. Finally, statistical tests mainly test a statistical hypothesis about a process that is modeled via a set of random variables. This is very useful for drawing conclusions directly from the data.
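As a small illustration of these techniques, the following Python sketch runs an ANOVA, a simple linear regression, and a t-test with SciPy; the measurement values are made up purely for demonstration.

```python
# Illustrative statistical methods with SciPy on made-up measurements.
from scipy import stats

# Analysis of variance: do three groups share the same mean?
group_a = [4.1, 3.9, 4.3, 4.0]
group_b = [4.8, 5.1, 4.9, 5.0]
group_c = [4.2, 4.1, 4.4, 4.0]
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# Regression analysis: estimate the trend between a predictor and a response.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2]
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}")

# A statistical test: are two samples drawn from populations with equal means?
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print(f"t-test: t={t_stat:.2f}, p={t_p:.4f}")
```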
Detection- and prediction-based methods
The next wave of analytics methods leveraged the flourishing research and application of the machine-learning and data-mining fields.
Traditional statistical methods normally have to assume particular statistical models behind the data, which in many cases do not fit the large volume and variety of data well, leaving them with a weak ability to uncover hidden patterns in the data.
Machine-learning and data-mining methods, on the other hand, are better able to simulate or represent the deeper and more complicated meanings behind the data. What’s more, these methods are capable of making predictions from new data—using deeply hidden knowledge learned from existing data to anticipate the outcomes of new data. Therefore, data mining and machine learning-based methods have become the leading methodology driving the evolution of analytics, especially Big Data analytics.
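The following sketch illustrates the prediction idea with scikit-learn: a model is trained on labeled historical observations and then predicts outcomes for previously unseen data. The feature names, values, and labels are hypothetical.

```python
# A minimal prediction-based sketch: learn from labeled history, predict on
# new observations. Features are [failed_logins, bytes_out_mb]; label 1 marks
# activity previously confirmed as malicious (hypothetical data).
from sklearn.ensemble import RandomForestClassifier

X_train = [[1, 5], [0, 2], [12, 800], [9, 650], [2, 10], [15, 900]]
y_train = [0, 0, 1, 1, 0, 1]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Anticipate the outcome for new, unlabeled activity.
X_new = [[11, 700], [1, 3]]
print(model.predict(X_new))        # predicted labels
print(model.predict_proba(X_new))  # class probabilities, useful for triage
```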
Relations among the three types of methods
The three categories of methods used in analytics have different focuses. Rule-based methods rely significantly on known knowledge and experience, focusing on explaining facts from the data that match what is already known. Statistical methods tend to use mathematical models to identify the variance or relations among data entries to draw a conclusion or observe a trend. Lastly, detection- and prediction-based methods use data-mining and machine-learning techniques to extract new insight from the data, which is more challenging but more useful.
The three types of methods also differ in analysis performance and deployment time. Rule-based methods normally need less computing, hence providing the fastest response and deployment times of the three types. Statistical methods need a certain amount of computation during statistical modeling or statistical tests, so they require more analysis and deployment time. Finally, detection- and prediction-based methods normally need several iterations of modeling, along with a training process, to find the best algorithms and parameters. Therefore, the cost of development and deployment time for this type is the highest in real applications.
However, the actual performance of the implementation of analytical methods also depends on the hardware and software architecture. If more
computation resources are available and the execution architecture can utilize the full power of such resources, the performance of the
sophisticated analytics can be greatly enhanced. For example, distributed computing frameworks often leverage computer clusters built from
commodity hardware to perform machine-learning tasks to output near real-time results.
Despite the differences in focus and performance, the three types of methods are not exclusive to each other, but quite the opposite, as they are
normally mixed together to solve more complicated problems from the data. For example, in the cybersecurity industry, where the goal is to
detect an ongoing data breach event, a mature analytics solution may use rule-based methods to filter unrelated data based on domain
knowledge, and use statistical methods (e.g., correlation analysis) to create more aggregated data vectors. What’s more, the solution can also
apply machine-learning methods to identify anomalous activities on one or many hosts. Different categories complement each other to provide
effective ways of analyzing the data.
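A rough sketch of such a mixed pipeline is shown below, loosely following the breach-detection example: a rule-based filter, a simple statistical aggregation per host, and a machine-learning anomaly detector. The event fields, thresholds, and contamination rate are assumptions made for illustration.

```python
# Mixing the three method types on hypothetical traffic events.
from collections import defaultdict
from statistics import mean
from sklearn.ensemble import IsolationForest

events = [
    {"host": "web01", "bytes_out": 1_200, "internal": False},
    {"host": "web01", "bytes_out": 1_500, "internal": False},
    {"host": "web02", "bytes_out": 1_100, "internal": False},
    {"host": "web02", "bytes_out": 1_300, "internal": False},
    {"host": "app01", "bytes_out": 900, "internal": False},
    {"host": "app01", "bytes_out": 1_000, "internal": True},
    {"host": "db02", "bytes_out": 900_000, "internal": False},
    {"host": "db02", "bytes_out": 850_000, "internal": False},
]

# 1) Rule-based: drop traffic known from domain knowledge to be unrelated.
external = [e for e in events if not e["internal"]]

# 2) Statistical: aggregate per host into a small feature vector.
per_host = defaultdict(list)
for e in external:
    per_host[e["host"]].append(e["bytes_out"])
hosts = sorted(per_host)
features = [[mean(per_host[h]), max(per_host[h])] for h in hosts]

# 3) Machine learning: flag hosts whose aggregated behavior looks anomalous.
labels = IsolationForest(contamination=0.25, random_state=0).fit_predict(features)
print([h for h, label in zip(hosts, labels) if label == -1])  # e.g., ['db02']
```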
3. statisticssolutions.com/manova-analysis-anova/
New methods are not always better
Even though new analytics methods keep evolving to handle more complicated data and problems, it does not mean that new methods always provide better answers. Each type of method has its own advantages and weaknesses, and every analytics problem is different and specific. That is why so many engineers and scientists work hard to find the best process and methods for the particular data problem at hand. Sometimes, they have to mix different existing methods together to find the best combination to extract the most useful information from the data. In short, there is no single best approach to solving all problems in data analytics.
From detection to prediction: Data-mining vs. machine-learning methods
Both data mining and machine learning are used as advanced analytics methods to understand data more deeply and to discover unknown or hidden knowledge. However, there are some differences between them. Data mining mainly aims at finding patterns hidden in the data to explain some
phenomenon by using statistics and other programming methods. Machine learning uses different learning algorithms, including statistical
methods and/or data mining methods, to build models on existing data so that it can predict future outcomes. In short, data mining explains the
data by detecting hidden patterns, while machine learning focuses on predicting new information with models. This trend of shifting from
detection-based to prediction-based methods in analytics reflects the changes from passive remediation insights to proactive prevention
decisions needed in business data.
List of popular data mining and machine learning methods
Here is a list of the most common and popular data mining and machine learning methods. Readers can refer to textbooks or online resources for these algorithms; be advised that the list is constantly changing as the industry actively adopts them into different applications. With so many choices, it is hard to know which one best suits the problem. In many cases, a trial-and-error process (see the sketch after this list) is needed to explore which of these works better and to balance accuracy, performance, deployment requirements, and complexity.
• Naïve Bayes Classifier Algorithm
• K-Means Clustering Algorithm
• Support Vector Machine Algorithm
• Apriori Algorithm
• Linear Regression
• Logistic Regression
• Artificial Neural Networks
• Random Forests
• Decision Trees
• Nearest Neighbors
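The sketch referenced above shows one way to run that trial-and-error comparison with scikit-learn: several of the listed algorithms are scored on the same synthetic data with cross-validation before one is chosen. The data set and parameters are placeholders.

```python
# Compare several candidate algorithms on the same data before choosing one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Nearest Neighbors": KNeighborsClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:22s} mean accuracy = {scores.mean():.3f}")
```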
Trends of future analytics methods
The current trends in advancing analytics are toward two areas: Big Data analytics and deep learning. The reason for this is twofold—data is
growing exponentially but answers are required in near real-time, and questions are getting more challenging, requiring several layers of analysis,
along with a broader and deeper understanding. Big Data utilization is driven by leveraging distributed software systems to store and process a
large volume of data very quickly. Meanwhile, deep learning builds several layers of learning ability to extract information from raw data for a
higher and more abstract understanding.
In addition, cloud technologies have grown very fast in the last few years, enabling more flexible, on-demand data storage and processing on remote cloud infrastructure controlled from simple client-side software. Therefore, analytics in cloud environments has naturally attracted more attention, and building analytics as a subscription service is becoming a phenomenon. It is easy to understand this trend: analytics goes
where data goes.
With more domains leveraging data analytics to serve their business goals, the trend of applying analytics in broader industries has continued.
Healthcare analytics and social media analytics are two examples that show this trend.
Typical analytics workflow
Here is a standard workflow when using analytics in real business scenarios.
Figure 2. Workflow of analytics 4
As shown in figure 2, there are six steps in a typical analytics task and some of them have feedback loops, which means several iterations of trial
and error are needed to solve a sophisticated problem for a good outcome. The more detailed tasks of each of the six steps are as follows.
Business understanding
While starting any project, it is essential to begin with the end in mind. This step sets up business objectives, assesses project feasibility,
determines data mining goals, and produces project plans. This is critical as it determines where the final goals are and whether or not these
goals are achievable. Setting up goals and success criteria for the data mining tasks is very important to keep us focused on the targets and to tell us when we can stop experimenting with different methods and call it a success. Assessing feasibility and creating plans involve understanding what tools, data, and techniques are accessible for the project, which is critical in helping us understand the limitations.
Some questions to ask oneself at this point would be:
• What problem are we trying to solve?
• Why do we need to implement this project or program?
• What kind of data do we have access to?
Data understanding
This step involves raw data collection, basic data exploration, and data quality verification. In short, this collecting and reviewing step gives us an initial hands-on impression of the data. The main aim of this step is to help us get an idea of the possibilities and limitations of things we can
do based on the data. In a real analytics task, data may come from multiple types of sources, show various characteristics, and represent
outcomes of complicated relations.
For example, analytics on weather conditions may include image data from satellites, radar scan data from major ground monitoring hubs, sensor data from many ground monitoring stations, geographic terrain data, and historical data for each of these. Understanding every type of data set and
knowing the basic characteristics are very important for further data analysis steps.
Some questions to ask oneself:
• Do we have access to all the necessary data feeds?
• Do the data feeds provide the data we need to solve our problem?
• Is the data accurate?
Data preparation
Data preparation is a critical step that properly selects and sanitizes useful data sets or data feeds, understands the relationships among them in order to integrate or consolidate them, and finally formats them for the next step.
4. exde.files.wordpress.com/2009/03/crisp_visualguide.pdf
For example, in the case of weather analytics, this step can correlate the different types of weather data (satellite images, radar scans, and ground sensor readings) by timestamp and geolocation to provide an overview of weather conditions for a particular area at a given time. Meanwhile, this step can also cleanse duplicated, incorrectly generated, or unrelated data to make sure the final data set provides the most accurate information for the targeted problem.
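A small pandas sketch of this kind of preparation, loosely based on the weather example, is shown below; the feeds, column names, and values are hypothetical.

```python
# Cleanse and consolidate two hypothetical weather feeds by timestamp and
# location, then format the result for the modeling step.
import pandas as pd

radar = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 00:00", "2017-01-01 01:00"],
    "location": ["area_1", "area_1", "area_1"],
    "reflectivity": [12.5, 12.5, 30.1],   # first reading arrived twice
})
sensors = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 01:00"],
    "location": ["area_1", "area_1"],
    "temperature_c": [21.0, 19.5],
})

radar = radar.drop_duplicates()                              # cleanse
merged = radar.merge(sensors, on=["timestamp", "location"])  # consolidate
merged["timestamp"] = pd.to_datetime(merged["timestamp"])    # format
print(merged)
```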
Some questions to ask oneself when working in this phase:
• Is our data clean and normalized?
• Do we have the right logical aggregations to work with?
Modeling
Modeling is the core step to manipulate data and draw conclusions—extract unknown patterns, detect new information, or predict new
knowledge. This step contains four sub-steps—select and implement modeling techniques, create appropriate test plans, experiment with different model settings, and validate the results of the models.
Sometimes, choosing appropriate models from various data mining and machine learning methods can be very challenging, as different methods
tend to fit different types of data and problems. The best solution here is to try different methods to find the best one. That is why creating an appropriate testing procedure is also very important here. Besides, quite frequently, people have to choose the best solution by balancing false-positive and false-negative rates. It is also not uncommon to discover incomplete or missing data while modeling an
approach. In this case, the project is iterated back to the data preparation phase, where additional data is aggregated from existing or new data
feeds until the requirement is satisfied.
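The toy sketch below shows how two hypothetical model settings might be compared on a validation set by their false-positive and false-negative trade-off; the labels and predictions are invented for illustration.

```python
# Compare model settings by false positives, false negatives, precision, recall.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]              # hypothetical ground truth

settings = {
    "aggressive":   [0, 1, 1, 1, 1, 1, 0, 1, 0, 1],  # catches everything, noisier
    "conservative": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # quieter, misses some
}

for name, y_pred in settings.items():
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"{name:12s} FP={fp} FN={fn} "
          f"precision={precision_score(y_true, y_pred):.2f} "
          f"recall={recall_score(y_true, y_pred):.2f}")
```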
Some questions to think about at this point:
• How can we analyze the given data set?
• How can we make observations that correspond to the business goal of the project?
• Can we make any interesting inferences based on the observed conclusions?
Evaluation
The evaluation step is when the results or conclusions generated by the modeling step are assessed. This is done by checking whether the business criteria have been met, the goals have been reached, and whether or not the implemented modeling methods are feasible for practical use. It is possible
that in some cases, the analytics results do not match the original business goals due to various reasons, such as a small misalignment in each of
the steps or the ineffectiveness of the data or modeling methods used to provide targeted insights or conclusions. In such cases, two feedback
approaches can be used:
1. Fall back to the modeling step to use other modeling algorithms or continue to optimize the model parameters.
2. Return to the first step of business understanding. While the results may be acceptable, they may be off-target, so a new set of business goals may be created to align with the produced outcome.
In either case, there may be multiple iterations to make sure the results are what we wanted and the process is reproducible.
Meanwhile, the evaluation also determines the options for the next steps. The analytics results might be in different forms, leading to
different actions afterward. For example, if the result is an effective prediction technique on existing data, then it should be implemented for
practical usage; if the result and goal are summarizing similar clusters in the data, then the next step may be to provide an open report for
educational purposes.
Deployment
In many cases, there will be a deployment step to apply the conclusions or results to the business process. After all, the whole analytics task
is to help the business in some way. Some concrete tasks here are to implement the whole process in the production environment, create a
monitoring and maintenance plan, create a final summary report on the examined data, or provide a retrospective discussion on how the analytics succeeded or failed. A detailed review report might also be useful here to provide insights for similar future projects involving data analytics.
Big Data analytics
As briefly discussed in the trends of future analytics, Big Data analytics is the new trend that utilizes the advancement of commodity hardware
and distributed software architecture. The three fundamental types of analytics methods might still be the same in Big Data analytics, but the process, data workflow, and system architecture are greatly enhanced to leverage the storage and computing power of the Big Data framework. But what exactly is Big Data? What does a Big Data analytics system look like? How do we manipulate data in such a system? Let’s briefly describe
these in this section.
What is Big Data?
Big Data describes large data sets that traditional systems are inadequate or unable to process. A broader explanation is that Big Data
refers to storing and processing of large data sets, including tools, systems, and infrastructure that are used to collect, integrate, analyze, store,
share, transfer, search, visualize, and protect such data. The main characteristics of Big Data were frequently described as the “three Vs” in the
earlier days:
• Volume—the quantity of generated or stored data
• Variety—the diversity of type and nature of data
• Velocity—the speed at which data is generated or processed
Two new additional Vs are also frequently used in describing Big Data:
• Veracity—the uncertainty or quality of data sets affecting the accuracy of analysis
• Variability—the inconsistency and dynamic nature of data
These special characteristics have pushed a new type of analytics—Big Data analytics—that is different from traditional analytics and advanced
analytics. Traditional analytics such as rule-based or statistical-based methods are slow and cannot provide enough accuracy or handle the complexity of Big Data, while advanced analytics such as pure machine learning-based methods only work well on small data sets—on the order of dozens of gigabytes of data.
Figure 3. Characteristics of Big Data and Big Data analytics vs. traditional analytics 5
Fortunately, Big Data analytics can take advantage of both worlds—the large storage space and the fast processing power of distributed
systems, along with advanced analytics methods that solve sophisticated problems and relations within data. The result is an active Big Data
ecosystem that contains many enhanced analytics methods adapted to the distributed storing and processing power of several Big Data
frameworks.
5. Big Data: The Next Big Thing, NASSCOM and CRISIL GR&A, 2012
Big Data analytics architecture
A typical architecture of an analytics system has at least three big components: data layer (for data collection and storage), processing layer
(for data analysis and processing management), and visualization or presentation layer (for result visualization or conclusion representation).
With so many mature Big Data tools and frameworks being implemented and proposed, Big Data analytics has significantly leveraged them to
solve problems in real business applications. This section not only discusses generic analytics architecture but also emphasizes the tools used in
Big Data analytics.
Figure 4. Typical architecture for Big Data analytics
Figure 4 shows a high-level abstract architecture for typical Big Data analytics, explained as follows. However, in some cases, such as analytics for streaming data, the boundary between the processing and data layers is blurred. Streaming data may not be stored persistently in its raw form but passed immediately to the processing stage, so the data collection and processing steps appear to be integrated. Normally, though, intermediate or final results are stored persistently to serve as a data source for the visualization or presentation layer.
Importantly, the core of Big Data analytics still revolves around data and its manipulation, whether the data is in transit or at rest in those analytics layers.
The whole flow of data manipulation in analytics typically includes data generation or feeds, data storage, data processing, data transfer, and
data visualization. Each of these steps has challenging problems to solve, and fortunately, the industry and academics have together created
many tools and frameworks to tackle the problems individually and together. We will discuss these data manipulation tasks in the appropriate
architecture layer.
Data layer
A typical analytics system needs methods and tools for data collection and storage. This can be as simple as a population survey that sends many surveyors to record basic birth information from hospitals or government offices on a simple tabular form, either on paper or in an Excel-like digital system. With more electronic devices used everywhere, data has mostly been saved directly in digital form so that data
collection can be automatic. Since more data can be generated at a large scale, relational databases are used to provide more effective insert,
update, and search abilities than basic digital forms. Many relational database management systems (RDBMS), such as Oracle, MySQL and
Microsoft® SQL Server, have been broadly used to store data for analytics, as well as for many other purposes. However, with the advent of the
Big Data age, a traditional RDBMS cannot satisfy the needs for data volume or data variety, leading to the creation of columnar storage
databases, non-relational database systems and distributed file systems. Columnar store databases like Vertica can scale because they store data
in columns rather than rows. When analytics are requested, the database does not need to scan through huge amounts of data to answer it.
Instead it can go directly to the data it needs. There are advantages in compression and scaling, as these solutions include massively parallel
processing (MPP) to take advantage of a cluster. Non-relational database systems can handle unstructured data types, such as texts or
documents. Such database systems are also referred to as Not Only SQL (NoSQL) databases because they can have more features than traditional
SQL databases, such as auto scalability, storing unstructured data, built-in search ability, and others. Examples of NoSQL databases include
MongoDB, Cassandra, HBase, CouchDB, and others.
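As a brief illustration of the document-store style, the sketch below stores and queries a semi-structured record with pymongo; it assumes a MongoDB instance is reachable locally, and the database, collection, and field names are made up.

```python
# Store and query schema-less documents in a NoSQL database (MongoDB).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reports = client["analytics_demo"]["vulnerability_reports"]

# Documents need not share a fixed schema.
reports.insert_one({
    "host": "web01",
    "finding": "outdated TLS configuration",
    "cvss": 7.5,
    "notes": ["reported by weekly scan", "owner notified"],
})

# Query on whichever fields matter for the analysis.
for doc in reports.find({"cvss": {"$gte": 7.0}}):
    print(doc["host"], doc["finding"])
```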
Additionally, distributed file systems are leveraged to handle the volume of data from a size perspective. For example, Google™ File System (GFS), Hadoop Distributed File System (HDFS), and Windows® Distributed File System are three popular options for storing a large volume of data in a distributed format.
Working with data in this layer is primarily comprised of two different actions: data generation and data storage.
Data generation: Data generation can be from various types of physical or virtual entities such as electronic monitor sensors, endpoint computer
devices and network devices, sales orders, digital messages, and others. The output of data can be in various forms as well, such as single or
multiple-tuple data points, audio or video data streams, texts, and more. The key for a good data feed is providing a consistent format of data
that can be further analyzed.
Data storage: As mentioned earlier, storing data electronically takes two major forms—databases or file systems. Either storage form can generally handle various types of data input, though in some cases one is better suited than the other. For example, a database is better for handling shorter, consistent, and structured data in string form, while file systems are good at saving streaming, unstructured, or document-structured data in long string (text) or binary forms.
Meanwhile, depending on the actual storage implementation, data might be stored in more than one place to provide fault-tolerance ability and
localization.
Processing layer
Processing data is the key task of analytics. It is the step where computational analysis happens; it also includes management services that schedule the execution of these analysis tasks. Depending on the business goal, there are normally many ways of performing computation on the data. For example, computation can be done at once for the whole data set or by using a divide-and-conquer method. Sometimes, iterative computation may be needed to improve the results.
For traditional analytics, this step needs a scientific approach to manipulate data properly to solve the targeted problem or meet the business
goal. Typically, analysts need to understand the data comprehensively, such as getting the statistical metric from the data, before applying any
more sophisticated computation. Depending on the size of data and performance of data processing (for example, real-time or offline long-time
processing), different tools or methods are implemented to process the data effectively. For a small data size or lower processing requirements,
for example, less than tens of GB data in offline mode, a typical computer system with enough disk storage, memory, computing power, and
appropriate software system can meet this demand. For example, projecting and analyzing population trends at a state level will probably involve a small data set that can be handled on a single modern computer.
But if the data set is larger than tens of GB and needs close to real-time processing results, then a Big Data architecture is the right way to go. Quite a few Big Data system frameworks support distributed data storage and processing. Here are a few popular examples, each with its own characteristics for handling a particular type of task.
MapReduce 6
MapReduce is a programming framework for processing and generating large data sets with a parallel, distributed algorithm using a large number of computer nodes. The core idea of MapReduce is a two-step procedure—a map step, which applies a map function in parallel to every key-value pair of the input data sets to output pairs in a different data domain, and a reduce step, which applies a reduce function in parallel to summarize the collection of values for each key into the target domain. It was originally proposed in a scientific paper by Google and has been implemented in many different programming languages to support distributed computing.
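The toy, single-process sketch below illustrates the map and reduce idea on a handful of records; real frameworks execute the same two functions in parallel across many nodes, and the record format here is invented for the example.

```python
# A single-process illustration of the MapReduce idea: map emits key-value
# pairs, the pairs are grouped by key, and reduce summarizes each group.
from collections import defaultdict

records = ["web01 login", "db02 scan", "web01 login", "web01 upload"]

def map_fn(record):
    host, _action = record.split()
    yield (host, 1)                     # emit one count per event

def reduce_fn(key, values):
    return (key, sum(values))           # summarize all counts for a key

grouped = defaultdict(list)             # the shuffle/group phase
for record in records:
    for key, value in map_fn(record):
        grouped[key].append(value)

print([reduce_fn(k, v) for k, v in grouped.items()])  # [('web01', 3), ('db02', 1)]
```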
6. MapReduce—simplified data processing on large clusters, Communications of the ACM 51.1 (107-113), 2008
Apache Hadoop 7
Apache Hadoop is an open-source software framework for distributed storage and processing of large data sets, consisting of two important parts—the distributed storage part called HDFS and the processing part, which is an implementation of the MapReduce model. It also contains two other modules for utility and management services: Hadoop Common, containing necessary utility libraries, and Hadoop YARN, a resource-management platform for managing computing resources. Hadoop also refers to the ecosystem around the framework, including many software packages that can be used with Hadoop, such as Apache Pig, Apache HBase, Apache ZooKeeper, Apache Spark, Apache Storm, and others.
Apache Spark 8
Apache Spark is also an open-source framework for distributed computing, one that addresses the limitation of the linear data flow structure of distributed programs designed in the MapReduce computing model. Spark proposed a new data structure called the resilient distributed data set (RDD), a read-only multiset of data items distributed across multiple machines and maintained in a fault-tolerant way. Spark provides an RDD API that offers distributed, shared-memory working data sets for cluster computing. The advantage of Spark is that computing can be iterative in memory, resulting in performance increases of several orders of magnitude compared to the classic Hadoop framework. Therefore, machine-learning algorithms leverage these characteristics to be used effectively in Spark for advanced data analytics.
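A minimal PySpark sketch of the RDD API is shown below, assuming a local Spark installation with the pyspark package available; the sample events and application name are placeholders.

```python
# Build an RDD, transform it, and cache it in memory for iterative reuse.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

events = sc.parallelize(["web01 login", "db02 scan", "web01 upload"])

counts = (events
          .map(lambda line: (line.split()[0], 1))   # lazy transformation
          .reduceByKey(lambda a, b: a + b)
          .cache())                                 # keep in memory for reuse

print(counts.collect())   # e.g., [('web01', 2), ('db02', 1)]
sc.stop()
```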
Apache Storm 9
Apache Storm is an open-source framework for processing streaming data. Stream processing is particularly suitable for distributed tasks that are small, independent, compute-intensive, and read the data only once or twice. This framework provides a computing topology in the shape
of directed acyclic graph (DAG) in which the vertices represent the actual computing, and the edges represent data streaming from one node to
another. Storm is suitable for data that needs real-time processing, as opposed to batch processing in Spark or the Hadoop framework.
The types of data manipulation in the processing layer mainly consist of data processing and data transfer.
Data processing: For Big Data analytics, two basic strategies are mostly used. First, distributed computing—leveraging the distributed systems
in Big Data frameworks to perform parallel computing. Fault tolerance is a must here. Second, the advanced machine-learning or data-mining
methods are used to solve challenging tasks. When data is abundant, there is always a more complicated relation or hidden information that one
needs to understand. Traditional rule-based methods or statistical methods will not be effective enough here.
Meanwhile, data processing also needs effective scheduling and management strategy, so that each processing task is executed as desired,
for example, once and only once. Big Data frameworks, such as MapReduce or Storm, have been designed to include components to schedule
such tasks.
Data transfer: Transferring data is needed when the data is not local during computation. This can be within a single computer system or while
crossing multiple computer systems.
In a single system, data can be stored at one of three storage levels: CPU cache, volatile memory, or hard disk drive. Each level has a different access speed, hence transferring data among levels requires effective scheduling management. The Big Data framework also leverages these characteristics to process different types of computing tasks. For example, Hadoop uses batch processing for a large volume of data saved on disk, while Storm intensively uses memory to process streaming data tasks quickly.
In multiple systems, data transfer can happen within local environments, local data centers, or cross-region data centers. The Big Data framework also leverages data locality to keep data transfer to a minimum and to use the closest data as far as possible.
Visualization and presentation layer
Providing a good architecture component to deliver analytics results is also an important step. Result visualization, graphic representation of conclusions, or showing streaming messages are all effective options. Software tools for this step include graphics libraries, web UI libraries for displaying results via web services, and others. Interestingly, some graph tools, such as Neo4j, are built directly on top of the database. 10
The only interaction with data in this layer is data visualization.
7. Apache Hadoop: hadoop.apache.org/
8. Apache Spark: spark.apache.org/
9. Apache Storm: storm.apache.org/
10. Neo4j: neo4j.com/
Data visualization: Visualizing is normally a read-only operation on data, so it is harmless. The challenge here is to use the right format and choose the right amount of information to show the most interesting results to users. Data visualization helps decision makers judge whether the analytics program is a success, so it normally deserves the same attention as other data operations.
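As a hedged illustration of how a graph store such as Neo4j can feed the visualization layer, the sketch below reads connectivity edges through the official Python driver. The connection URI, credentials, and the Host/CONNECTED_TO schema are assumptions made for the example, not details from this paper.

# Sketch: pull a small connectivity graph from Neo4j for display in a web UI or
# graphics library. Assumes a local instance and a hypothetical
# (:Host)-[:CONNECTED_TO]->(:Host) schema.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"          # assumed endpoint
AUTH = ("neo4j", "example-password")   # assumed credentials


def fetch_edges(limit: int = 100):
    """Return (source, target) pairs suitable for a graph-drawing library."""
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            result = session.run(
                "MATCH (a:Host)-[:CONNECTED_TO]->(b:Host) "
                "RETURN a.name AS src, b.name AS dst LIMIT $limit",
                limit=limit,
            )
            return [(record["src"], record["dst"]) for record in result]
    finally:
        driver.close()


if __name__ == "__main__":
    for src, dst in fetch_edges(limit=10):
        print(f"{src} -> {dst}")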
Many of these techniques are designed to suit particular requirements or constraints; therefore, it is easy to choose the appropriate one during a real analytics task. Several Big Data analytics frameworks have been created in a modular way so that they can cover real applications as comprehensively as possible. However, real applications may pose unexpected challenges over time. Hence, orchestrating such data manipulation components under a Big Data architecture to reach a common business result requires experienced engineers and data scientists to work closely together.
Applying analytics to security
Recently, analytics in security seems to be a part of multiple conversations, especially in conference expo floors and vendor pitches. It’s not
uncommon to hear various buzzwords such as machine learning, deep learning, data clustering, and others, being used to describe the capability
of various products. While the previous section broke down the hype behind these concepts, this section will explore real-world methodologies
and implications of implementing an analytics program in an organization’s security group.
The goal of analytics in security
When thinking about applying analytics to a given domain (in this case, security), there are typically two primary objectives to keep in mind.
Speed and accuracy
Implementing a robust security analytics program may help with faster and more reliable detection of threats. Speed is always a security analyst’s best friend, but when fast results are also accurate, it eases the pain that many security teams face every day.
Striking this combination of performance and accuracy in an analytical module can help in several cases, such as validating the results produced (decreasing false positives), prioritizing the issues to be addressed, and responding appropriately based on the issue at hand.
Formalizing the unknown
Traditional systems have always targeted known issues or patterns. By using various analytical methods, it is now possible to identify issues that were previously unknown (decreasing false negatives). This is done by creating behavior-based algorithms that don’t depend on specific signatures or patterns. By introducing machine-learning concepts, the effectiveness of such algorithms may improve further, as the assumed benign baseline may be customized automatically based on an organization’s behavior.
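As one hedged illustration (not this paper's prescribed method), the sketch below learns an assumed-benign baseline from per-host activity counts using scikit-learn's IsolationForest; the feature set, sample values, and contamination rate are hypothetical.

# Sketch: learn a behavioral baseline from per-host activity and flag outliers.
# Features and numbers are illustrative; IsolationForest is one of many possible
# unsupervised methods.
import numpy as np
from sklearn.ensemble import IsolationForest

# Rows: hosts. Columns: [logins per day, failed DNS lookups, MB uploaded].
baseline = np.array([
    [12, 3, 40], [10, 2, 35], [14, 4, 50], [11, 3, 42], [13, 2, 38],
])
model = IsolationForest(contamination=0.1, random_state=0).fit(baseline)

# New observations: the second row deviates strongly from the learned baseline.
new_obs = np.array([[12, 3, 41], [11, 250, 900]])
labels = model.predict(new_obs)   # +1 = consistent with baseline, -1 = anomalous
for row, label in zip(new_obs, labels):
    print(row, "anomalous" if label == -1 else "normal")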
Security data
A security team is typically known to deal with a variety of data. While the heterogeneity of the data adds to the complication of analytics, it can
usually be classified into one of the following five categories.
Vulnerability data
Organizations may use various tools to scan, record, and track vulnerabilities in the systems, network, and applications deployed. These
vulnerabilities may include issues with improper configurations, usage of a known vulnerable library, or incorrect rendering of images from a
CCTV. Data from risk and compliance assessments such as archive records may also be included.
Device data
Raw data, configurations, and logs from various devices connected to the network can provide insight into a variety of issues. These devices
could represent endpoint hosts with user accounts, firewalls, intrusion prevention systems, routers, and others.
Traffic data
This represents all data that is in motion within the organization’s network. When data transmitted through various protocols is aggregated, many hidden patterns and behaviors can be revealed through proper analysis.
Asset enrichment data
Asset enrichment data may improve the quality and variety of data associated with the assets. Examples include an extensive network map of
the organization’s connectivity, details of a subnet or asset’s usage, costs associated with assets, context of an application, and others.
Organization enrichment data
Similar to enriching data about assets, organization enrichment data may provide details about users, geolocation of departments or users,
permissions, and others. Such data may be acquired from the organization’s HR system and provides context to various activities observed from
other data feeds.
The possibilities of analytics in security
Analyzing an organization’s data feeds can open a huge door of opportunities to gain security-oriented knowledge and act accordingly. While a
security division may have multiple teams to concentrate on various specialties such as application security, operations security, physical security,
and others, the fundamental goals of all teams are very similar. Along with the goals, the processes to achieve these goals are also comparable
across these teams. The following is a list of common security narratives that may be derived based on the typical data feeds available in an
organization. These narratives may be applied to any type of security using their corresponding data feeds.
Detecting new and unknown threats
This would mostly be one of the first reasons an organization decides to set up a security analytics program. The most common activities here
include attempting to identify previous false negatives based on the available data. Additionally, security analysts may attempt to solve
challenging problems such as identifying high-severity vulnerabilities that may occur by combining multiple lower severity issues.
For example, an application may have two separate instances of local file inclusion and unvalidated file upload vulnerabilities. While these vulnerabilities pose their own set of risks to the application, combining them may allow an attacker to upload a file and execute commands on the server. Such complex relationships are not trivial for standard scanners to detect, so post-processing of results may unearth a few hidden gems and allow for better prioritization.
True positive matching
This is an interesting problem that security auditors face every day. Every time a new audit of an asset is performed, it is essential to understand
the issues that existed previously and the issues that were newly introduced. Being able to do this allows auditors to assess the real risk of the
given asset. While this can be easily done within a specific tool or vendor solution, matching issues reported by different tools, auditors, or data
feeds is a challenging task.
Analysts need to create a robust set of parameters that may be used to correlate issues across multiple feeds and create a unified view. This
would enable the security team to understand the true nature of an issue, thus accurately assessing the risk for the organization.
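A minimal sketch of such a correlation key is shown below, using hypothetical field names; a real implementation would need fuzzier matching (CWE mappings, path normalization, and so on).

# Sketch: normalize findings from different scanners into one correlation key so
# that the same underlying issue is counted only once. Field names are hypothetical.
from collections import defaultdict


def correlation_key(finding: dict) -> tuple:
    """Build a tool-agnostic key from asset, vulnerability class, and location."""
    return (
        finding["asset"].lower(),
        finding["vuln_class"].upper(),        # e.g., a CWE identifier
        finding.get("location", "").lower(),  # URL path, file, or port
    )


findings = [
    {"tool": "scanner_a", "asset": "app01", "vuln_class": "CWE-79", "location": "/search"},
    {"tool": "scanner_b", "asset": "APP01", "vuln_class": "cwe-79", "location": "/Search"},
    {"tool": "manual",    "asset": "app02", "vuln_class": "CWE-22", "location": "/files"},
]

unified = defaultdict(list)
for finding in findings:
    unified[correlation_key(finding)].append(finding["tool"])

for key, tools in unified.items():
    print(key, "reported by", tools)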
In another case, such capabilities may also allow analysts to identify multiple issues across the organization with a similar remediation pattern.
Identifying such scenarios helps to optimize the work of the security team and improve their speed to action. It also allows the team to evaluate
the efficiency of current remediation strategies and patch management systems used to handle the reported issues.
False positive evaluation
Similar to analyzing true positives, it is also interesting to look at the false positives that may be reported through automated or manual testing methods. Primarily, the cause of the false positives may hold significant value. It can bring various issues to light, such as an incorrect configuration for an automated scan, an erroneous methodology in a manual process, and others. The data feeds for such analysis would include results from vulnerability assessments along with alerts generated by monitoring techniques.
Another valuable outcome of evaluating false positives would be to educate the technical resources on common mistakes that may occur across
a given group. This would directly affect the quality of future work produced by the teams.
Root-cause analysis
Once the final results are audited, it is important for a security investigator to analyze the reason behind certain risky issues. This is done by
correlating the issue at hand with various related events that triggered the alert. Behind the scenes, building such a capability requires one to
attach a context to the issue and its associated asset. The context would include metadata such as the purpose of the asset, the criticality of the
issue, connectivity maps, and others.
Such analysis is typically done using data feeds that are generated from real-time monitoring and blocking solutions. These solutions are typically
sprinkled across the organization’s network and combining the feeds into one seamless data set may provide a better understanding of the network.
Designing attack vectors
While various automated solutions may assist with assessing the state of security in an organization, a manual assessment is also necessary to
ensure proper coverage. While doing so, results from previous automated and manual assessments may be analyzed to design newer attack
vectors that may target specific weaknesses in the environment. The attack vectors may use additional information such as the context of a web
page, understanding the nature of an application, or relating traffic between multiple hosts or services. Such a utility would especially benefit a
red team 11 and provide them with a head start on the assessment.
Automated assessments may also see benefits by utilizing metadata derived from previous assessments. This may be used to optimize
subsequent scans, thus resulting in faster and more accurate results.
Predicting a future security concern
One of the best ways to reduce the load on a security team is to be able to prevent an incident from happening. While in the initial phases,
existing machine-learning algorithms may provide primitive insights into potential issues that may occur in the near future. The algorithms may
be used to generate a probabilistic model for the occurrence of a given vulnerability or type of incident. While this may not be very accurate, it
may allow security teams to prepare for certain scenarios and help quickly remediate or even avoid an incident from occurring.
Resource planning
Analyzing the plans and actions executed in the previous year may allow a team to plan their work for the subsequent year. The plans may
include improvement to processes, detection or prevention techniques, remediation strategies, and others. It is also possible to derive commonly
made mistakes and create educational initiatives based on them to prevent the occurrence of such issues at the source.
Along with future plans, it is also possible to prioritize the current work to be done from a long list of tasks. Such prioritization may be done based
on various parameters such as the severity of an issue or incident, confidence of the report, value of the asset, and others. This would immensely
help the teams that have time constraints and/or limited access to resources.
Case study: Detecting new threats
This section will discuss an example use case in greater depth by assuming a very specific scenario—detecting command and control (C2) communication by malware from an infected host over DNS traffic. Recently, malware instances have started using Domain Generation Algorithms (DGAs) to identify their C2 servers, since the generated domains are harder to detect and block. DGAs are used by malware to generate pseudo-random domain names based on a given seed. One of the attempted domains would resolve to the C2 server, while the other DNS queries would result in an NXDOMAIN response from the DNS server, implying its failure to find a valid IP address that maps to the domain name. While many techniques have been published to detect malicious DNS activity, this case study focuses on a project to design and implement one way of doing so.
The previously discussed steps to implement an analytics program will be followed here to assist with creating this module. Note that this was a
real project and that the technical details are based on real experiences and experiments. Since C2 servers may have varying domain name
patterns, the approach taken for this project was to perform a time-based analysis of DNS queries.
Business understanding
As discussed earlier, the primary goal of this project was to identify hosts infected by malware that uses DGA patterns over DNS to identify its C2 server. The fundamental idea was to look for hosts that sent multiple failed DNS queries to a variety of domains. To reach this goal, the minimal requirements were as follows:
• Access to DNS queries and replies from all hosts within the organization.
• Timestamped data, so that relevant patterns can be identified.
Data understanding
Typically, an organization may have various types of data feeds that can be used to derive security implications, such as DNS packets, HTTP
requests or responses, Active Directory logs, records from various systems such as the configuration management database (CMDB), HR system,
and others.
11 An independent team that challenges and verifies the state of security in an organization.
In this case, once the access to DNS-related data was granted, the data and its structure were studied and classified based on the requirements.
The data was explored and the following items were confirmed.
• The data set included NXDOMAIN records.
• The recorded timestamp was consistent.
While exploring these data sets, it was important to understand their relationship with other feeds and confirm the completeness of the data set being observed. For example, the timestamp of the data set should refer either to the time the packet was captured or to the time it was inserted into the database, but not to a mix of both, which would skew the results. In this case, the data set included the time each record was inserted into the database and was consistent throughout.
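A minimal sketch of this kind of sanity check is shown below; the column layout of the consolidated DNS feed is an assumption for illustration, not the project's actual schema.

# Sketch: confirm that the DNS data set contains NXDOMAIN records and that the
# timestamp column is consistent. Column names are hypothetical.
import pandas as pd

dns = pd.DataFrame({
    "ts": pd.to_datetime(["2017-03-01 10:00:01", "2017-03-01 10:00:02",
                          "2017-03-01 10:00:05"]),
    "host": ["10.0.0.5", "10.0.0.5", "10.0.0.9"],
    "qname": ["abc123.example.com", "www.example.org", "xyz987.example.net"],
    "rcode": ["NXDOMAIN", "NOERROR", "NXDOMAIN"],
})

# 1. The feed must include failed lookups, otherwise the whole approach fails.
assert (dns["rcode"] == "NXDOMAIN").any(), "feed contains no NXDOMAIN records"

# 2. The timestamp should come from a single source (here, database insert time)
#    and be non-decreasing within the export.
assert dns["ts"].is_monotonic_increasing, "timestamps are out of order"

print(dns.groupby("rcode").size())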
Data preparation
This is the step when the defined data feeds are correlated to identify various entities such as the user, a machine, and others. Along with such
entities, the data feeds may also identify logical actions such as administrative account login, sensitive data being accessed by a privileged user
over VPN, local account creation on a file server, and others.
For the current project, all DNS feeds were integrated into one seamless data set. The final consolidated data set consisted of all DNS queries and
replies from within the organization’s environment. While merging the feeds, it was important to ensure that duplicates weren’t present in the
final result. Also, it was important to normalize the data structure from multiple feeds and create a single structure that could maintain all the
required data. While normalizing, all domain names were converted to lower case to enable case-insensitive comparisons during the analysis.
DNS queries usually have the full domain name that is being requested. In addition to this, it is sometimes good to have only the top-level domains
(TLD) and second-level domains since this would point to the parent server being requested. While this sounds like a trivial problem, there are some
interesting patterns to consider. For example, some domains have country-specific TLDs such as “.co.uk” or “.co.in”. In such cases, it might be good to
consider both these segments together as the TLD. This will ensure that the second-level domain for both yahoo.com and yahoo.co.uk is yahoo.
Thankfully, there are some third-party libraries such as Nomulus 12 that can be used to solve this problem.
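In a Python pipeline, the same problem can be handled with the tldextract library, which relies on the public suffix list; this is offered only as a possible alternative to the Java-based Nomulus, and the sample domains below are illustrative.

# Sketch: normalize query names and derive the registered (parent) domain,
# handling country-specific suffixes such as .co.uk via the public suffix list.
import tldextract

queries = ["Mail.Yahoo.COM", "news.yahoo.co.uk", "abc123.example.net"]

for q in queries:
    name = q.lower()                 # lower case for case-insensitive comparison
    ext = tldextract.extract(name)
    # ext.suffix is "com" or "co.uk"; ext.domain is the second-level label ("yahoo").
    print(name, "->", ext.registered_domain)
    # mail.yahoo.com     -> yahoo.com
    # news.yahoo.co.uk   -> yahoo.co.uk
    # abc123.example.net -> example.net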
Modeling
With the data normalized, it was time to design the actual solution. In this case, the primary interest was in DNS queries that were not resolved. The time series of these queries from each host was split into windows of length “t”, and a set of constraints was applied to all queries that fell within a given window. These constraints were designed by observing the behavior of malware that uses DGAs. Some of the constraints were:
• If the organization maintains a whitelist of domains that are known to be benign or refer to internal domains, those domains can be ignored.
• The domains requested within the time “t” should not all be the same. This is because DGAs typically cycle through a list of generated
domains and rarely repeat the same domain consecutively.
• The domains should not be requested during a previous time span “s”. This can be chosen to be a specific range, such as a day, week, or a
month. Since DGAs usually use the current date as a seed, it is rare to see them requested across multiple days or months.
• A minimum of “q” requests should have been sent by the host within the given time “t”.
This list provides an idea of the type of constraints being applied to the data.
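A simplified sketch of applying such windowed constraints is shown below. The window length "t", the minimum query count "q", the whitelist, and the event layout are illustrative assumptions; the real project used a richer rule set.

# Sketch: flag hosts whose failed DNS queries within a window of length "t"
# satisfy DGA-like constraints. Parameters and data layout are hypothetical.
from datetime import datetime, timedelta

WHITELIST = {"corp.example.com"}   # known-benign or internal parent domains
T = timedelta(minutes=10)          # window length "t"
Q = 5                              # minimum number of failed queries "q"


def suspicious_hosts(events, seen_before):
    """events: (timestamp, host, parent_domain) tuples for NXDOMAIN replies,
    sorted by timestamp. seen_before: parent domains observed in the span "s"."""
    flagged = set()
    by_host = {}
    for ts, host, domain in events:
        if domain in WHITELIST or domain in seen_before:
            continue
        window = by_host.setdefault(host, [])
        window.append((ts, domain))
        # Keep only entries that fall within the window length "t".
        by_host[host] = window = [(t0, d) for t0, d in window if ts - t0 <= T]
        domains = {d for _, d in window}
        # Constraints: at least "q" failed queries, and not all to the same domain.
        if len(window) >= Q and len(domains) > 1:
            flagged.add(host)
    return flagged


if __name__ == "__main__":
    base = datetime(2017, 3, 1, 10, 0, 0)
    events = [(base + timedelta(seconds=i * 5), "10.0.0.5", "x%d.example.net" % i)
              for i in range(8)]
    print(suspicious_hosts(events, seen_before=set()))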
An interesting case arises when queries generated by internal scripts also satisfy these constraints and result in false positives. While whitelisting these domains may solve the problem, other solutions could also be used to suit the organization. For example, if most scripts within an organization connect to a few common servers, then the model could ignore requests within the time “t” that have the same TLD and second-level domain but different lower-level domains (LLDs). DGAs usually generate different parent domains instead of multiple LLDs for the same server. While this is one way of solving the problem, organizations may implement their own version based on their requirements.
12 Nomulus: github.com/google/nomulus
Evaluation
Once the model was designed, experiments were devised and executed for various values of “t” and “q” to identify the most effective combination. The result set of each experiment was validated using third-party reputation services to understand the true positive and false positive ratios. Based on the validation numbers, a reasonable solution was chosen as the winner.
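A hedged sketch of the parameter sweep is shown below; run_detection and lookup_reputation are hypothetical placeholders standing in for the windowed model and for a third-party reputation service, not the project's actual interfaces.

# Sketch: sweep window length "t" and minimum query count "q", scoring each
# configuration against a reputation service. Helper functions are placeholders.
from datetime import timedelta


def run_detection(events, t, q):
    """Placeholder for the windowed model sketched earlier; returns flagged domains."""
    return {domain for _, _, domain in events} if len(events) >= q else set()


def lookup_reputation(domain):
    """Placeholder for a third-party reputation lookup; True means known malicious."""
    return domain.endswith(".bad-example.net")


def sweep(events, t_values, q_values):
    scored = []
    for t in t_values:
        for q in q_values:
            flagged = run_detection(events, t, q)
            if not flagged:
                continue
            true_positives = sum(1 for d in flagged if lookup_reputation(d))
            scored.append({"t": t, "q": q, "flagged": len(flagged),
                           "precision": true_positives / len(flagged)})
    # Prefer the configuration with the highest true positive ratio.
    return sorted(scored, key=lambda r: r["precision"], reverse=True)


if __name__ == "__main__":
    events = [(i, "10.0.0.5", "x%d.bad-example.net" % i) for i in range(6)]
    grid = sweep(events, [timedelta(minutes=m) for m in (5, 10, 30)], [3, 5, 10])
    print(grid[0] if grid else "no configuration flagged anything")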
It is important to understand the scope of the project at this stage. This analysis was not meant to be a catch-all for all malicious activity. Rather,
it was focused on a very specific case. Doing so allowed for a more solid design and a clear vision while modeling the solution. It is important to
understand and articulate this scope when delivering the final result. Following is a sample result that was derived from the project.
Figure 5. Suspicious DNS patterns over a 24-hour period
Figure 5 represents the suspicious DNS queries generated from 31 different hosts in a network of over 340,000 hosts during a 24-hour period.
Each horizontal block signifies a single host and the scatter plot represents bursts of requests at a given time. It can be clearly seen that most of
the hosts exhibit a pattern of bursts of requests every few minutes. It would be interesting to study and understand the underlying program
that’s responsible for doing so.
Another interesting observation is the large number of requests sent by a single host in the result. This host may have high malicious activity,
may execute multiple scripts that send requests, or may represent a proxy server that aggregates requests from hosts behind it. Further
investigation may reveal the exact case and allow for a better understanding of DNS query patterns within the organization.
Handling misalignments between the goal and result
At the start of the project, the grand vision was to identify the hosts that were definitely infected by malware. After validating the results, it was
understood that the result set represented a set of suspicious requests that may signify an infected host. Identifying this characteristic feature
was very important as it defines the actual result of the project. There were two options to pursue at this point—either update the goal of the
project to match the result or update the model to reflect the goal accurately. In this case, the goal was updated since this phase of the project
successfully detected suspicious hosts within a given environment. Also, it was decided to plan a subsequent phase of the project to confirm whether these suspicious hosts were actually infected.
Deployment
The final phase of the project included the deployment of the algorithm in a live environment. Most importantly, the algorithm and experiments
executed were documented clearly for future reference. Along with the final report, a plan was initiated to begin phase 2 of the project as well, in
order to confirm the detected suspicious behavior as malicious. This would be treated as a separate project and all the steps of an analytics
project would be followed again to achieve the new goal.
The challenges with analytics
As seen earlier, analytics in security can be a great boon to an organization. On the other hand, it is important for organizations to consider any new risks that such systems and processes may introduce. Following are some of the security and operational risks that teams should anticipate ahead of time in order to create a better plan.
Security risks
An analytics platform is simply another system used within an organization. Hence, this system should also be threat modeled, similar to all the
other systems. While doing so, there are a few specific concerns to note.
Data centralization
Creating a security analytics program rarely includes new data. Rather, existing data is centralized in order to derive interesting findings by
correlating various feeds together. In such a situation, it is important to note that this central data repository may be a single point of failure and
needs to be considered in the threat model. Additionally, other risks that are typical of data centralization such as potential abuse of data,
managing privacy of compartmentalized data, managing access to users of the data, and others also need to be considered. While these issues
are typical of data centralization, they are new to the security team and need to be handled accordingly.
Data custodians
Before centralizing data, each feed would typically have been locally maintained by a custodian. In many cases, it’s possible for the users of the
data to have been its custodians as well. But when centralizing these data feeds, it is important to establish a formal process to manage the data,
its infrastructure, and its access. This could result in the creation of new security policies and methodologies to ensure the safety of the data,
without compromising its usefulness.
Also, managing the large number of users and providing them the right type of access to the data is a challenge. While not impossible, the
organization needs to understand and address this complexity ahead of time to reduce unnecessary costs.
Operational risks
Along with security-centric issues, a number of operational challenges may also be encountered during the creation of an analytics program.
Some of these challenges are:
Data processing
Security teams use a variety of data, so normalizing and consolidating these data feeds is always a challenge. Additionally, enriching the data with contextual information greatly helps with correlation, but such enrichment data is usually hard to find, especially in a structured format.
Some data feeds are also harder to retrieve due to policies that have been in place for a long time. For example, retrieving network device configurations, capturing traffic universally, and correlating information across multiple systems such as HR and the CMDB are new processes that may make system operators nervous. It is important to explain the necessity of these data feeds and create a secure process that works well for the organization and ensures the comfort of all participants in the program.
Time to prototype
When starting any analytics project, the time to create the first prototype should be taken into account. It is typical for a combination of technical
and policy-oriented hurdles to delay progress in the initial stages. Additionally, once the raw data is normalized, it is very easy to get distracted
from the initial goal as playing with data can always bring up something interesting. These distractions may be documented and revisited at a
later stage. But to create the prototype, it is important to focus on the goal and achieve it by breaking the work into smaller tangible goals.
It is also important to take the threat model into account and make sure that the work being done does not negatively affect the model in any
way. Each threat vector may require a different analytical process, so it is essential to be focused on the current goal at hand and avoid chasing
tangents while exploring the data.
System complexities
One of the major differences between traditional and analytical systems is that traditional systems produce binary results based on a given set of
rules. On the other hand, analytical systems generate a probability of occurrence for the produced result. This probability may be considered as
part of the prioritization process implemented by the security team. It is important to note that this probability should not be the only parameter
used for prioritization, but its value should play a major role in the process.
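A minimal sketch of such a prioritization scheme is shown below, where the model's probability is one weighted factor among several; the weights and field names are illustrative assumptions, not a recommended formula.

# Sketch: combine the analytical probability with other parameters into a single
# priority score. Weights and field names are illustrative only.
def priority(issue: dict) -> float:
    """Higher is more urgent. Probability matters, but never stands alone."""
    return (0.4 * issue["probability"]      # model confidence (0..1)
            + 0.3 * issue["severity"]       # normalized severity (0..1)
            + 0.2 * issue["asset_value"]    # business value of the asset (0..1)
            + 0.1 * issue["exposure"])      # internet-facing = 1, internal = 0


issues = [
    {"id": "A", "probability": 0.9, "severity": 0.5, "asset_value": 0.3, "exposure": 0},
    {"id": "B", "probability": 0.6, "severity": 0.9, "asset_value": 0.9, "exposure": 1},
]
for issue in sorted(issues, key=priority, reverse=True):
    print(issue["id"], round(priority(issue), 2))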
While it has been clear so far that an analytics system is complex, the general infrastructure can still be built to span across all security domains.
However, the data being processed and the results being generated vastly differ between each domain. So the implementation of the analysis is
very specific to each domain. For example, both the AppSec and OpSec teams may share a goal of detecting new vulnerabilities using the data
feeds. But, the raw data, results, and algorithms designed are specific to the corresponding domain.
Dealing with machine-learning systems brings its own set of challenges. In general, all machine-learning systems depend heavily on the cleanliness of the learning data. Tainted learning data could lead to incorrect and undesired results. Specifically, in the case of security domains, a system may fail to detect an active attack because the attack’s indicators were present in the learning phase. While such risks may be hard to avoid, it is essential to be aware of them and account for them while designing a system that uses machine-learning algorithms.
Conclusion
Traditionally, mitigating security threats was primarily based on perimeter protection, such as intrusion detection and prevention systems (IDSs
and IPSs), web application firewalls (WAFs), and others that monitor incoming traffic to filter and block malicious activities or policy violations.
As the types of network devices, endpoint devices, and data locations increase in an enterprise, an effective way to view the whole organization’s
state of security is needed. This has evolved into the second generation of security defense infrastructure—security information and event
management (SIEM).
Nowadays, bring your own device (BYOD) policies introduce more uncontrolled devices into an organization, the Internet of Things (IoT) keeps
inventing new ways of interfacing with devices over the network, and attackers are discovering more sophisticated ways to target enterprises.
Companies must deploy deeper network defenses and endpoint protections, and use more advanced security analysis tools and software to
collect, integrate, and correlate diverse types of security information to understand their security infrastructure.
Meanwhile, with security data growing diversely and exponentially, real-time or near real-time analysis is needed to detect and mitigate threats
quickly and prevent damage. This has led to the birth of security analytics, which leverages Big Data, analytical tools, and machine-learning
concepts to satisfy such business needs.
Over the course of this discussion, it can be seen that security analytics is a tangible process that can be implemented by any enterprise. With the
right team, infrastructure, and budget, any organization would be able to get their analytics program going. It is to be noted that the discussion
predominantly revolves around setting up a new program. Other than this, efforts for further development, maintenance, and constant innovation
should also be considered. This would ensure continuous benefits and success for the security teams in the long run.
Applying analytics to information security is a young and fast-growing area. As such, it brings its own set of risks and challenges. While some of
them have been discussed in this report, it is important to anticipate and plan for specific risks that may be relevant to each organization. Doing
so will allow for a more productive program that can produce greater benefits quickly.
Analytics may not be the ultimate solution that solves all security problems. But, it offers a solution to many issues that have plagued the
security industry. Eventually, most mid- to large-sized organizations may find the necessity and budget to invest in a full-fledged analytics
program. Until then, this area is ripe for innovative research and creative approaches. Overall, a security analytics program is a great path to
follow, but not a quick one to conquer.
Authors
Barak Raz
Sasi Siddharth Muthurajan
Jason Ding
Learn more at
www8.hp.com/us/en/software-solutions/siem-big-data-security-analytics/
Sign up for updates
© Copyright 2017 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.
The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying
such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall
not be liable for technical or editorial errors or omissions contained herein.
Google is a registered trademark of Google Inc. Microsoft and Windows are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries. Oracle is a registered trademark of Oracle and/or its affiliates. All other
third-party trademark(s) is/are property of their respective owner(s).
a00000103ENW, March 2017, Rev. 1