Analytics in security
Technical white paper

Contents
Executive summary .... 4
Introduction to analytics .... 4
  Evolution in analytics .... 5
    Rule-based methods .... 5
    Statistical methods .... 6
    Detection- and prediction-based methods .... 6
    Relations among the three types of methods .... 6
    From detection to prediction: Data-mining vs. machine-learning methods .... 7
    Trends of future analytics methods .... 7
  Typical analytics workflow .... 8
    Business understanding .... 8
    Data understanding .... 8
    Data preparation .... 8
    Modeling .... 9
    Evaluation .... 9
    Deployment .... 9
Big Data analytics .... 10
  What is Big Data? .... 10
  Big Data analytics architecture .... 11
    Data layer .... 11
    Processing layer .... 12
    Visualization and presentation layer .... 13
Applying analytics to security .... 14
  The goal of analytics in security .... 14
    Speed and accuracy .... 14
    Formalizing the unknown .... 14
  Security data .... 14
    Vulnerability data .... 14
    Device data .... 14
    Traffic data .... 14
    Asset enrichment data .... 14
    Organization enrichment data .... 15
  The possibilities of analytics in security .... 15
    Detecting new and unknown threats .... 15
    True positive matching .... 15
    False positive evaluation .... 15
    Root-cause analysis .... 15
    Designing attack vectors .... 16
    Predicting a future security concern .... 16
    Resource planning .... 16
Case study: Detecting new threats .... 16
  Business understanding .... 16
  Data understanding .... 16
  Data preparation .... 17
  Modeling .... 17
  Evaluation .... 18
    Handling misalignments between the goal and result .... 19
  Deployment .... 19
The challenges with analytics .... 19
  Security risks .... 19
    Data centralization .... 19
    Data custodians .... 19
  Operational risks .... 19
    Data processing .... 20
    Time to prototype .... 20
    System complexities .... 20
Conclusion .... 21
Authors .... 21

Executive summary
In recent times, analytics has been a hot topic in a variety of industries, and the security industry is no exception. With data being the new form of currency in enterprises, there is a wealth of information that can be gleaned from this data. All it needs is a little bit of time, the right set of skills, and a robust path to follow. A combination of these traits could create an analytics program that assists an existing security team in its day-to-day activities. In general, analytics is not expected to be a one-stop shop that solves all the world's problems. Rather, it is a meticulous approach that intelligently sorts data, groups it into logical aggregations, and highlights the most important items to consider. Everything else, before and after this general workflow, is similar to today's processes.
This paper dives into some of the basic concepts of analytics and explains the processes and architecture involved in an analytics program. A follow-up discussion elaborates on several possible narratives for applying analytics in the context of an organization's security program. A detailed case study is also provided, describing a real experience of implementing such a process. Finally, the other side of the coin is revealed by a short discussion of some anticipated risks within a security analytics program. While this paper may benefit people across different roles, it is important to mention that all perspectives captured here are based on a security analytics researcher's point of view. This approach showcases the real benefits and risks of such a program based on the fundamental research that drives the entire program.

Introduction to analytics
Analytics is the process of performing necessary actions on recorded data to discover, interpret, or represent meaningful patterns or knowledge in the data. From a broader perspective, any method or process of analyzing raw data to extract useful information is analytics. Analytics has a long history. For example, the Swedish government began collecting population data as early as 1749 to record and understand the geographical distribution of Sweden's population; this exercise was carried out to sustain an appropriate military force.1 Over the last few centuries, analytics has been applied to different fields of society, forming multidimensional analytics subdomains, such as architectural analytics, behavioral analytics, business analytics or business intelligence, customer analytics, news analytics, web analytics, speech analytics, and more. Each of these subdomains has many challenging problems that need to be answered, and analytics is an effective way to answer these questions, extract insights, or draw conclusions from data.
Take business analytics, for example: data-driven companies can leverage their data assets, including marketing data, sales data, and customer data, to make informed decisions, such as forecasting market changes, exploring new sales patterns, or identifying new customer groups. These insights cannot exist without the analytics involved. Another interesting application is web analytics, which collects, measures, analyzes, and reports web traffic data to understand and optimize web usage. This technique has been broadly used by online businesses to maximize their web traffic and improve their profits. Additionally, even in finance, sophisticated analytics algorithms are frequently used to perform automatic high-frequency trading in the stock market to maximize profits. No doubt, analytics is used in various scenarios across industries to reach different business or research goals, wherever recorded data is available.

1 Statistics of Sweden's history, retrieved on 17 November 2016 from Statistics Sweden

Evolution in analytics
Methodologies and tools used in analytics have evolved quickly over the last few centuries, especially the last few decades. With science, mathematics, and technology being widely used in every area of society, scientific computation and statistical analysis are predominantly used in analytics for a deep understanding of the data. Evolution in analytics has also been pushed by the exponential growth of data. With the fast evolution of the IT industry, data collection and storage have become extremely easy and cheap. Therefore, the amount of data to be analyzed in many fields has grown exponentially, forcing the tools and methodologies of analytics to change greatly as well. Traditionally, records were made on paper, which led to manual analytics on paper.
With the emergence of electronics, storing and processing data electronically became a reality. Then, in the last few decades, with the evolution of very large-scale integration, saving and calculating data at a very large scale became possible, answering more previously unanswerable questions than ever before. Analytics, by utilizing computing power and the convenience of software development, started a new era: the Big Data analytics era.

Figure 1. Trend of exponential growth of data and decrease in storage costs2

The methods used in analytics can be classified into three main categories in terms of their chronological evolution. With data being generated exponentially and more scientific methods getting involved in analytics, organizations are seeking answers to more sophisticated and challenging questions using the data. This trend is pushing analytics from the original experience-based or rule-based analysis toward more mathematical, statistics-based or behavior-based methods, and finally toward more advanced and predictive strategies, such as machine-learning or data-mining-based methods. Meanwhile, implementing and deploying these methods has also become more time- and computation-intensive.

Rule-based methods
In the beginning, the data volume was small, the number of data sources was limited, and data was normally saved in table-based forms. Only simple questions could be answered via analytics, such as totals, averages, changes over time, and others. Frequently, domain knowledge and experience took the main control of data analysis, which meant that simple mathematical formulas or rule-based filtering algorithms were used as the main methods in analytics.
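As a minimal sketch of such rule-based filtering, the following example flags login records that break simple, experience-derived rules. The record fields, rule names, and thresholds are hypothetical, chosen purely for illustration:

```python
# Rule-based filtering: each rule encodes a piece of domain knowledge
# as a simple predicate over a record. Records matching any rule are
# flagged for review. All field names and thresholds are hypothetical.

RULES = [
    ("too_many_failures", lambda r: r["failed_logins"] > 5),
    ("off_hours_access",  lambda r: r["hour"] < 6 or r["hour"] > 22),
]

def apply_rules(records):
    """Return (record, matched rule names) for every record that matches a rule."""
    flagged = []
    for record in records:
        matched = [name for name, rule in RULES if rule(record)]
        if matched:
            flagged.append((record, matched))
    return flagged

logins = [
    {"user": "alice", "failed_logins": 1, "hour": 9},
    {"user": "bob",   "failed_logins": 8, "hour": 23},
]
print(apply_rules(logins))
```

Such rules are fast to evaluate and easy to deploy, which is exactly why this category offers the quickest response time of the three, but they can only catch what the rule author already knows to look for.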
2 forbes.com/sites/gilpress/2016/08/05/iot-mid-year-update-from-idc-and-other-research-firms/ and zdnet.com/article/enterprise-storage-trends-and-predictions/

Statistical methods
Next, more mathematical and statistical methods were introduced into analytics as more data became available. Basic statistics of the data (for example, average, median, standard deviation, and others), analysis of variance (ANOVA3), factor analysis, regression analysis, and well-known statistical tests are some of the popular statistical methods for analytics. Basic statistical metrics are the most straightforward way to examine any data problem, such as skewness or fluctuations. ANOVA is a set of statistical models used to analyze the differences between group means and their variations. Factor analysis is used to identify a smaller number of variables (factors) that can describe observed and correlated variables; such an analysis is very useful when data sets with large numbers of variables depend on a few underlying hidden factors. Regression analysis is normally used to estimate the relationship between one dependent variable and one or more independent variables (predictors); the output regression function is particularly useful for depicting the trend of the dependent variable with respect to the predictors. Finally, statistical tests evaluate statistical hypotheses by observing a process that is modeled via a set of random variables. This is very useful for drawing conclusions directly from the data.

Detection- and prediction-based methods
The next wave of analytics methods leveraged the prosperous research and application of the machine-learning and data-mining fields. Traditional statistical methods normally have to assume particular statistical models behind the data, which in many cases do not fit well with the large volume and variety of data, leading to a weak ability to find hidden patterns in the data.
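As a small illustration of the statistical methods above (and of the kind of assumed model that limits them), here is a descriptive-statistics and least-squares regression sketch using only the Python standard library; the daily alert counts are invented:

```python
# Descriptive statistics and a one-variable least-squares regression,
# using only the standard library. The daily alert counts are invented
# purely for illustration.
import statistics

daily_alerts = [12, 15, 11, 18, 22, 25, 30]

# Basic statistical metrics: the most straightforward first look at data.
print(statistics.mean(daily_alerts))    # 19.0
print(statistics.median(daily_alerts))  # 18

def least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Regression analysis: the fitted line assumes a linear model behind the
# data, which is exactly the kind of built-in assumption discussed above.
slope, intercept = least_squares(range(len(daily_alerts)), daily_alerts)
print(slope > 0)  # a positive slope suggests alert volume is trending up
```

The linear form `y = slope * x + intercept` is the "assumed statistical model" in this case: if the real relationship is nonlinear, the fit will mislead, which motivates the machine-learning methods described next.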
On the other hand, machine-learning and data-mining methods are better suited to simulate or represent the deeper, more complicated meanings behind the data. What's more, these methods are capable of making predictions from new data: understanding deeply hidden knowledge in the data in order to anticipate outcomes for new data. Therefore, data-mining and machine-learning-based methods have become the leading methodology driving the evolution of analytics, especially Big Data analytics.

Relations among the three types of methods
The three categories of methods used in analytics have different focuses. Rule-based methods rely significantly on known knowledge and experience, focusing on explaining the facts in the data to match what is already known. Statistical methods tend to use mathematical models to identify the variance or relations among data entries to draw a conclusion or observe a trend. Lastly, detection- and prediction-based methods use data-mining and machine-learning techniques to extract new insight from the data, which is more challenging but more useful.

The three types of methods also differ in analysis performance and deployment time. Rule-based methods normally need less computing, hence providing the fastest response and deployment time among the three types. Statistical methods need a certain amount of computation during statistical modeling or statistical tests, so they require more analysis and deployment time. Finally, detection- and prediction-based methods normally need several iterations of modeling, along with a training process, to find the best algorithms and parameters. Therefore, the cost of development and deployment time for this type is the highest in real applications. However, the actual performance of the implemented analytical methods also depends on the hardware and software architecture.
If more computation resources are available and the execution architecture can utilize the full power of such resources, the performance of sophisticated analytics can be greatly enhanced. For example, distributed computing frameworks often leverage computer clusters built from commodity hardware to perform machine-learning tasks and output near real-time results.

Despite the differences in focus and performance, the three types of methods are not mutually exclusive; quite the opposite, they are normally mixed together to solve more complicated problems in the data. For example, in the cybersecurity industry, where the goal is to detect an ongoing data breach, a mature analytics solution may use rule-based methods to filter unrelated data based on domain knowledge and use statistical methods (e.g., correlation analysis) to create more aggregated data vectors. What's more, the solution can also apply machine-learning methods to identify anomalies in activities on one or many hosts. The different categories complement each other to provide effective ways of analyzing the data.

3 statisticssolutions.com/manova-analysis-anova/

New methods are not always better
Even as new analytics methods keep evolving to handle more complicated data and problems, new methods do not always provide better answers. Each type of method has its own advantages and weaknesses, and every analytics problem is different and specific. That is why so many engineers and scientists work hard to find the best process and methods for a particular data problem at hand. Sometimes, they have to mix different existing methods to find the best combination for extracting the most useful information from the data. In short, there is no single best approach to solving all problems in data analytics.

From detection to prediction: Data-mining vs. machine-learning methods
Both data mining and machine learning are used as advanced analytics methods to understand data more deeply and to discover unknown or hidden knowledge. However, there are some differences between them. Data mining mainly aims at finding patterns hidden in the data to explain some phenomenon, using statistics and other programming methods. Machine learning uses different learning algorithms, including statistical and/or data-mining methods, to build models on existing data so that it can predict future outcomes. In short, data mining explains the data by detecting hidden patterns, while machine learning focuses on predicting new information with models. This shift from detection-based to prediction-based methods in analytics reflects the change from passive remediation insights to proactive prevention decisions needed in business data.

List of popular data mining and machine learning methods
Here is a list of the most common and popular data-mining and machine-learning methods. Readers can refer to textbooks or online resources for these algorithms; be advised that the list is constantly changing as the industry actively adopts them in different applications. With so many choices, it is hard to know which one best suits the problem. In many cases, a trial-and-error process is needed to explore which of these works better and to balance accuracy, performance, deployment requirements, and complexity.
• Naïve Bayes Classifier Algorithm
• K-Means Clustering Algorithm
• Support Vector Machine Algorithm
• Apriori Algorithm
• Linear Regression
• Logistic Regression
• Artificial Neural Networks
• Random Forests
• Decision Trees
• Nearest Neighbors

Trends of future analytics methods
The current trends in advancing analytics are toward two areas: Big Data analytics and deep learning.
The reason for this is twofold: data is growing exponentially but answers are required in near real-time, and questions are getting more challenging, requiring several layers of analysis along with a broader and deeper understanding. Big Data utilization is driven by leveraging distributed software systems to store and process a large volume of data very quickly. Meanwhile, deep learning builds several layers of learning ability to extract information from raw data for a higher and more abstract understanding.

In addition, cloud technologies have grown very fast in the last few years, enabling more flexible and on-demand data storage and processing on remote cloud infrastructure controlled from simple client-side software. Therefore, analytics in cloud environments has naturally attracted more attention, and offering analytics as a subscription service is becoming common. It is easy to understand this trend: analytics goes where data goes. With more domains leveraging data analytics to serve their business goals, the trend of applying analytics in broader industries has continued. Healthcare analytics and social media analytics are two examples of this trend.

Typical analytics workflow

Here is a standard workflow for using analytics in real business scenarios.

Figure 2. Workflow of analytics 4

As shown in figure 2, there are six steps in a typical analytics task, and some of them have feedback loops, meaning several iterations of trial and error are needed to solve a sophisticated problem with a good outcome. The detailed tasks of each of the six steps are as follows.

Business understanding

When starting any project, it is essential to begin with the end in mind. This step sets up business objectives, assesses project feasibility, determines data mining goals, and produces project plans. This is critical as it determines where the final goals are and whether or not these goals are achievable.
Setting up goals and success criteria for the data mining tasks is very important; it reminds us to focus on the targets and to understand when we can stop experimenting with different methods and call it a success. Assessing feasibility and creating plans involve understanding what tools, data, and techniques are accessible for the project, which is critical in helping us understand the limitations. Some questions to ask at this point:

• What problem are we trying to solve?
• Why do we need to implement this project or program?
• What kind of data do we have access to?

Data understanding

This step involves raw data collection, basic data exploration, and data quality verification. In short, this collecting and reviewing step gives us a first hands-on impression of the data. The main aim of this step is to form an idea of the possibilities and limitations of what we can do based on the data. In a real analytics task, data may come from multiple types of sources, show various characteristics, and represent the outcomes of complicated relations. For example, analytics on weather conditions may involve image data from satellites, radar scan data from major ground monitoring hubs, sensor data from many ground monitoring stations, geographic terrain data, and historical data for each of these. Understanding every type of data set and knowing its basic characteristics are very important for the further data analysis steps. Some questions to ask:

• Do we have access to all the necessary data feeds?
• Do the data feeds provide the data we need to solve our problem?
• Is the data accurate?

Data preparation

Data preparation is a crucial step that selects and sanitizes useful data sets or data feeds, understands the relationships among them in order to integrate or consolidate them, and finally formats them for the next step.
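A minimal sketch of the preparation step, using two hypothetical weather feeds (all record fields and values are invented for illustration): the records are deduplicated and then correlated on timestamp and location.

```python
# Hypothetical weather feeds; field names and values are illustrative only.
radar = [
    {"ts": "2017-06-01T12:00", "loc": "KS-01", "echo_db": 42},
    {"ts": "2017-06-01T12:00", "loc": "KS-01", "echo_db": 42},  # duplicate record
    {"ts": "2017-06-01T13:00", "loc": "KS-01", "echo_db": 55},
]
sensors = [
    {"ts": "2017-06-01T12:00", "loc": "KS-01", "temp_c": 24.5},
    {"ts": "2017-06-01T13:00", "loc": "KS-01", "temp_c": 23.1},
]

def dedupe(records):
    """Cleansing: drop exact duplicate records while preserving order."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def join_feeds(a, b):
    """Integration: correlate two feeds on (timestamp, location)."""
    index = {(r["ts"], r["loc"]): r for r in b}
    return [{**r, **index[(r["ts"], r["loc"])]}
            for r in a if (r["ts"], r["loc"]) in index]

prepared = join_feeds(dedupe(radar), sensors)
print(prepared)
```

Real pipelines would typically do this with a database join or a dataframe library, but the operations, cleansing, correlation, and consolidation, are the same.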
4 exde.files.wordpress.com/2009/03/crisp_visualguide.pdf

For example, in the case of weather analytics, this step can correlate the different types of weather data (satellite images, radar scans, and ground sensor readings) by timestamp and geolocation to provide an overview of weather conditions for a particular area at a given time. Meanwhile, this step can also cleanse duplicated, incorrectly generated, or unrelated data to make sure the final data set provides the most accurate information for the targeted problem. Some questions to ask when working in this phase:

• Is our data clean and normalized?
• Do we have the right logical aggregations to work with?

Modeling

Modeling is the core step that manipulates data and draws conclusions: extracting unknown patterns, detecting new information, or predicting new knowledge. This step contains four sub-steps: select and implement modeling techniques, create appropriate test plans, experiment with different model settings, and validate the results of the models. Sometimes, choosing appropriate models from the various data mining and machine learning methods can be very challenging, as different methods tend to fit different types of data and problems. The best solution here is to try different methods to find the best one; that is why creating an appropriate testing procedure is also very important. Quite frequently, people have to choose the best solution by balancing false-positive and false-negative rates. It is also not uncommon to discover incomplete or missing data while modeling. In that case, the project iterates back to the data preparation phase, where additional data is aggregated from existing or new data feeds until the requirement is satisfied. Some questions to think about at this point:

• How can we analyze the given data set?
• How can we make observations that correspond to the business goal of the project?
• Can we make any interesting inferences based on the observed conclusions?

Evaluation

The evaluation step assesses the results or conclusions generated from the modeling step, by checking whether the business criteria have been met, the goals have been reached, and whether or not the implemented modeling methods are feasible for practical use. In some cases, the analytics results do not match the original business goals, for reasons such as a small misalignment in one of the steps or the ineffectiveness of the data or modeling methods at providing the targeted insights or conclusions. In such cases, two feedback approaches can be used:

1. Fall back to the modeling step to use other modeling algorithms or to continue optimizing the model parameters.
2. Return to the first step, business understanding. The results may be acceptable but off-target, so a new set of business goals may be created to align with the produced outcome.

In either case, there may be multiple iterations to make sure the results are what we wanted and the process is reproducible. The evaluation also determines the options for the next steps. The analytics results might take different forms, leading to different actions afterward. For example, if the result is an effective prediction technique on existing data, then it should be implemented for practical usage; if the result and goal are summarizing similar clusters in the data, then the next step may be to provide an open report for educational purposes.

Deployment

In many cases, there will be a deployment step to apply the conclusions or results to the business process. After all, the whole analytics task exists to help the business in some way.
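The experiment-and-validate loop in the modeling step, including the balancing of false positives and false negatives, can be sketched without any library. This is a toy: the data, the threshold "models," and the settings tried are all invented, and a real project would use a framework such as scikit-learn with proper train/test splits.

```python
# Toy labeled data: (feature value, is_malicious label)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.45, 1), (0.6, 1),
        (0.7, 1), (0.25, 0), (0.8, 1), (0.15, 0), (0.5, 1)]

def threshold_model(t):
    """A trivial 'model': predict malicious when the feature exceeds t."""
    return lambda x: 1 if x > t else 0

def evaluate(model, data):
    """Validate a model: report accuracy plus false-positive/negative counts."""
    fp = sum(1 for x, y in data if model(x) == 1 and y == 0)
    fn = sum(1 for x, y in data if model(x) == 0 and y == 1)
    acc = 1 - (fp + fn) / len(data)
    return {"accuracy": acc, "false_pos": fp, "false_neg": fn}

# Experiment with different model settings and keep the best-scoring one.
candidates = {t: evaluate(threshold_model(t), data) for t in (0.2, 0.4, 0.6)}
best = max(candidates, key=lambda t: candidates[t]["accuracy"])
print(best, candidates[best])
```

In practice the selection criterion is rarely raw accuracy alone; the same loop can rank settings by whichever of the false-positive or false-negative rates the business goal weighs more heavily.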
Some concrete tasks here are to implement the whole process in the production environment, create a monitoring and maintenance plan, create a final summary report on the examined data, or provide a retrospective discussion of how the analytics succeeded or failed. A detailed review report might also be useful here to provide insights for similar future projects involving data analytics.

Big Data analytics

As briefly discussed in the trends of future analytics, Big Data analytics is the new trend that utilizes advances in commodity hardware and distributed software architecture. The three fundamental types of analytics methods might still be the same in Big Data analytics, but the process, data workflow, and system architecture are greatly enhanced to leverage the storage and computing power of the Big Data framework. But what exactly is Big Data? What does a Big Data analytics system look like? How do we manipulate data in such a system? Let's briefly describe these in this section.

What is Big Data?

Big Data describes large data sets that traditional systems are inadequate or unable to process. A broader explanation is that Big Data refers to the storing and processing of large data sets, including the tools, systems, and infrastructure used to collect, integrate, analyze, store, share, transfer, search, visualize, and protect such data.
The main characteristics of Big Data were frequently described as the "three Vs" in the early days:

• Volume—the quantity of generated or stored data
• Variety—the diversity of type and nature of data
• Velocity—the speed at which data is generated or processed

Two additional Vs are also frequently used in describing Big Data:

• Veracity—the uncertainty or quality of data sets affecting the accuracy of analysis
• Variability—the inconsistency and dynamic nature of data

These special characteristics have pushed a new type of analytics—Big Data analytics—that differs from both traditional and advanced analytics. Traditional analytics such as rule-based or statistical methods are slow and cannot provide enough accuracy or handle the complexity of Big Data, while advanced analytics such as pure machine-learning methods tend to work well only on smaller data sets, on the order of dozens of gigabytes.

Figure 3. Characteristics of Big Data and Big Data analytics vs. traditional analytics 5

Fortunately, Big Data analytics can take advantage of both worlds: the large storage space and fast processing power of distributed systems, along with advanced analytics methods that solve sophisticated problems and relations within data. The result is an active Big Data ecosystem that contains many enhanced analytics methods adapted to the distributed storing and processing power of several Big Data frameworks.

5 Big Data: The Next Big Thing, NASSCOM and CRISIL GR&A, 2012

Big Data analytics architecture

A typical architecture of an analytics system has at least three big components: a data layer (for data collection and storage), a processing layer (for data analysis and processing management), and a visualization or presentation layer (for result visualization or conclusion representation).
With so many mature Big Data tools and frameworks implemented and proposed, Big Data analytics has leveraged them significantly to solve problems in real business applications. This section not only discusses generic analytics architecture but also emphasizes the tools used in Big Data analytics.

Figure 4. Typical architecture for Big Data analytics

Figure 4 is a high-level abstract architecture for typical Big Data analytics; explanations follow. In some cases, such as analytics for streaming data, the boundary between the processing and data layers is blurred. Streaming data may not be stored persistently in its raw form but instead passed immediately to the processing stage, so the data collection and processing steps appear integrated. Normally, however, intermediate or final results are stored persistently to serve as a data source for the visualization or presentation layer. Importantly, the core of Big Data analytics still revolves around data and its manipulation, whether that data is in transit or at rest in the analytics layers. The whole flow of data manipulation in analytics typically includes data generation or feeds, data storage, data processing, data transfer, and data visualization. Each of these steps has challenging problems to solve, and fortunately, industry and academia have together created many tools and frameworks to tackle the problems individually and together. We will discuss these data manipulation tasks in the appropriate architecture layer.

Data layer

A typical analytics system needs methods and tools for data collection and storage. This can be as simple as a population survey that sends many surveyors to record basic birth information from hospitals or government offices on a simple tabular form, either on paper or in an Excel-like digital system. With more electronic devices used everywhere, data is now mostly saved directly in digital form so that data collection can be automatic.
Since more data can be generated at a large scale, relational databases are used to provide more effective insert, update, and search abilities than basic digital forms. Many relational database management systems (RDBMS), such as Oracle, MySQL, and Microsoft® SQL Server, have been broadly used to store data for analytics, as well as for many other purposes. However, with the advent of the Big Data age, a traditional RDBMS cannot satisfy the needs for data volume or data variety, leading to the creation of columnar storage databases, non-relational database systems, and distributed file systems.

Columnar store databases like Vertica can scale because they store data in columns rather than rows. When analytics are requested, the database does not need to scan through huge amounts of data to answer; instead, it can go directly to the data it needs. There are also advantages in compression and scaling, as these solutions include massively parallel processing (MPP) to take advantage of a cluster.

Non-relational database systems can handle unstructured data types, such as texts or documents. Such database systems are also referred to as Not Only SQL (NoSQL) databases because they can have more features than traditional SQL databases, such as auto scalability, storing unstructured data, built-in search ability, and others. Examples of NoSQL databases include MongoDB, Cassandra, HBase, HPE Vertica, CouchDB, and others.

Additionally, distributed file systems are leveraged to handle the volume of data from a size perspective.
For example, Google™ File System (GFS), Hadoop Distributed File System (HDFS), and Windows® Distributed File System are three popular options for storing a large volume of data in a distributed format. Working with data in this layer primarily comprises two actions: data generation and data storage.

Data generation: Data can be generated by various types of physical or virtual entities, such as electronic monitoring sensors, endpoint computer devices and network devices, sales orders, digital messages, and others. The output can take various forms as well, such as single- or multiple-tuple data points, audio or video data streams, texts, and more. The key to a good data feed is a consistent data format that can be further analyzed.

Data storage: As mentioned earlier, storing data electronically takes two major forms: databases or file systems. Either form can generally handle various types of data input, though in some cases one is better suited than the other. For example, a database is better for handling shorter, consistent, and structured data in string form, while file systems are good at saving streaming, unstructured, or document-structured data in long string (text) or binary forms. Depending on the actual storage implementation, data might be stored in more than one place to provide fault tolerance and localization.

Processing layer

Processing data is the key task of analytics. It is the step where computational analysis happens; it also includes the management services that schedule the execution of the analysis tasks. Depending on the business goal, there are normally many ways of performing computation on the data. For example, computation can be done at once for the whole data set or using a divide-and-conquer method. Sometimes, iterative computation may be needed to improve the quality of the results.
For traditional analytics, this step needs a scientific approach to manipulate data properly to solve the targeted problem or meet the business goal. Typically, analysts need to understand the data comprehensively, for example by getting statistical metrics from the data, before applying any more sophisticated computation. Depending on the size of the data and the required processing performance (for example, real-time versus offline long-running processing), different tools or methods are used to process the data effectively. For a small data size or lower processing requirements (say, less than tens of GB of data in offline mode), a typical computer system with enough disk storage, memory, computing power, and the appropriate software can meet the demand. For example, projecting and analyzing population trends at a state level will probably involve a small data set that a single modern computer can handle. But if the data set is more than tens of GB and close to real-time results are needed, then a Big Data architecture is the right approach. Quite a few Big Data frameworks support distributed data storage and processing. Here are a few popular examples; each has unique characteristics suited to a particular type of task.

MapReduce 6

MapReduce is a programming framework for processing and generating large data sets with a parallel, distributed algorithm using a large number of computer nodes. The core idea of MapReduce is a two-step procedure: a map step, which applies a map function in parallel to every key-value pair of the input data set to output pairs in a different data domain, and a reduce step, which applies a reduce function in parallel to summarize the collection of values for each key into the target domain. It was originally described in a paper by Google and has been implemented in many different programming languages to support distributed computing.
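The two-step procedure can be illustrated with a single-process sketch of the classic word-count example. A real framework would run the map and reduce calls in parallel across many nodes; here the "shuffle" grouping and both phases run in one Python process.

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: summarize all values for one key into a single total."""
    return word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group every mapped value under its key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: one call per key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [(1, "big data big insight"), (2, "big cluster")]
print(mapreduce(docs, map_fn, reduce_fn))  # per-word totals
```

The map output domain (word, 1) differs from the input domain (line number, text), and the reduce step collapses each key's value list, which is exactly the structure the paragraph above describes.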
6 MapReduce—simplified data processing on large clusters, Communications of the ACM 51.1 (107-113), 2008

Apache Hadoop 7

Apache Hadoop is an open-source software framework for distributed storage and processing of large data sets, consisting of two important parts: the distributed storage part, HDFS, and the processing part, an implementation of the MapReduce model. It also contains two other modules for utility and management services: Hadoop Common, which contains the necessary utility libraries, and Hadoop YARN, a resource-management platform for managing computing resources. Hadoop also refers to the ecosystem around this framework, including many software packages that can be used with Hadoop, such as Apache Pig, Apache HBase, Apache ZooKeeper, Apache Spark, Apache Storm, and others.

Apache Spark 8

Apache Spark is also an open-source framework for distributed computing; it addresses the limitations of the linear data flow structure of distributed programs designed in the MapReduce computing model. Spark introduced a new data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed across multiple machines and maintained in a fault-tolerant way. Spark provides an API around RDDs to offer distributed, shared-memory working data sets for cluster computing. The advantage of Spark is that computation can iterate in memory, yielding performance improvements of several orders of magnitude over the classic Hadoop framework. Machine-learning algorithms therefore leverage these characteristics to run effectively on Spark for advanced data analytics.

Apache Storm 9

Apache Storm is an open-source framework for processing streaming data. Stream processing is particularly suitable for distributed tasks that are small, independent, compute-intensive, and read the data only once or twice.
This framework provides a computing topology in the shape of a directed acyclic graph (DAG), in which the vertices represent the actual computation and the edges represent data streaming from one node to another. Storm is suitable for data that needs real-time processing, as opposed to the batch processing of Spark or the Hadoop framework.

The types of data manipulation in the processing layer mainly consist of data processing and data transfer.

Data processing: For Big Data analytics, two basic strategies are mostly used. First, distributed computing: leveraging the distributed systems in Big Data frameworks to perform parallel computing, where fault tolerance is a must. Second, advanced machine-learning or data-mining methods are used to solve challenging tasks. When data is abundant, there are always more complicated relations or hidden information to understand, and traditional rule-based or statistical methods will not be effective enough. Data processing also needs an effective scheduling and management strategy, so that each processing task is executed as desired, for example, once and only once. Big Data frameworks such as MapReduce or Storm include components designed to schedule such tasks.

Data transfer: Transferring data is needed when the data is not local during computation, either within a single computer system or across multiple computer systems. In a single system, data can be stored at one of three storage levels: CPU cache, volatile memory, or hard disk drive. Each level has a different access speed, hence transferring data among levels requires effective scheduling management. Big Data frameworks also leverage these characteristics to process different types of computing tasks. For example, Hadoop uses batch processing for large volumes of data saved on disk, while Storm uses memory intensively to process streaming data tasks quickly.
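As a toy illustration of the topology idea (this is plain Python generators, not the Storm API): a source vertex emits a stream of tuples, and two chained downstream vertices transform and aggregate them, with the function composition playing the role of the DAG's edges.

```python
from collections import Counter

def spout(lines):
    """Source vertex: emit a stream of raw records."""
    yield from lines

def split_bolt(stream):
    """Intermediate vertex: split each record into word tuples."""
    for line in stream:
        yield from line.split()

def count_bolt(stream):
    """Terminal vertex: maintain running counts as tuples arrive."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wire the vertices together; each record flows through the whole
# pipeline as it is emitted, rather than being batched up front.
lines = ["error disk full", "error net down", "ok"]
result = count_bolt(split_bolt(spout(lines)))
print(result["error"])
```

Because generators pull one item at a time, each record is processed as it arrives, which mirrors (in miniature) why stream processing suits real-time workloads better than batch jobs.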
In multiple systems, data transfer can happen within local environments, within local data centers, or across regional data centers. Big Data frameworks also exploit data locality: using as little data transfer as possible and preferring data that is as close as possible.

Visualization and presentation layer

Providing a good architectural component to deliver analytics results is also an important step. Result visualization, graphic representation of conclusions, or showing streaming messages are all effective options. Software tools used in this step include graphics libraries, web UI libraries for displaying results via web services, and others. Interestingly, some graph tools, such as Neo4j, are built directly on top of the database. 10 The only interaction with data in this layer is data visualization.

7 Apache Hadoop: hadoop.apache.org/
8 Apache Spark: spark.apache.org/
9 Apache Storm: storm.apache.org/
10 Neo4j: neo4j.com/

Data visualization: Visualizing is normally a read-only operation on data, so it is harmless. The challenge is to use the right format and choose the right amount of information to show the most interesting results to users. Data visualization can help decision makers judge whether the analytics program is a success, so it normally needs as much attention as other data operations. Many visualization techniques are designed to suit particular requirements or constraints; therefore, it is easy to choose the appropriate one during a real analytics task. Several Big Data analytics frameworks have been created in a modular way to be applicable to real applications as comprehensively as possible. However, real applications may pose unexpected challenges over time. Hence, orchestrating such data manipulation components under a Big Data architecture to reach a common business result requires experienced engineers and data scientists working closely together.
Applying analytics to security

Recently, analytics in security seems to be part of many conversations, especially on conference expo floors and in vendor pitches. It is not uncommon to hear buzzwords such as machine learning, deep learning, data clustering, and others used to describe the capabilities of various products. While the previous section broke down the hype behind these concepts, this section explores real-world methodologies and the implications of implementing an analytics program in an organization's security group.

The goal of analytics in security

When thinking about applying analytics to a given domain (in this case, security), there are typically two primary objectives to keep in mind.

Speed and accuracy

Implementing a robust security analytics program may help with faster and more reliable detection of threats. Speed is always a security analyst's best friend, but when fast results are also accurate, it eases the pain that many security teams face every day. Striking this combination of performance and accuracy in an analytical module can help in several ways, such as validating the results produced (decreasing false positives), prioritizing the issues to be addressed, and responding appropriately to the issue at hand.

Formalizing the unknown

Traditional systems have always targeted known issues or patterns. By using various analytical methods, it is now possible to identify issues that were previously unknown (decreasing false negatives). This is done by creating behavior-based algorithms that do not depend on specific signatures or patterns. By introducing machine-learning concepts, the effectiveness of such algorithms may improve further, as the assumed benign baseline may be automatically customized based on an organization's behavior.

Security data

A security team typically deals with a wide variety of data.
While the heterogeneity of the data adds to the complication of analytics, it can usually be classified into one of the following five categories.

Vulnerability data

Organizations may use various tools to scan, record, and track vulnerabilities in the systems, network, and applications deployed. These vulnerabilities may include issues with improper configurations, usage of a known vulnerable library, or incorrect rendering of images from a CCTV. Data from risk and compliance assessments, such as archive records, may also be included.

Device data

Raw data, configurations, and logs from various devices connected to the network can provide insight into a variety of issues. These devices could include endpoint hosts with user accounts, firewalls, intrusion prevention systems, routers, and others.

Traffic data

This represents all data in motion within the organization's network. When data transmitted through various protocols is aggregated together, many hidden patterns and behaviors can be revealed through proper analysis.

Asset enrichment data

Asset enrichment data may improve the quality and variety of data associated with the assets. Examples include an extensive network map of the organization's connectivity, details of a subnet or asset's usage, costs associated with assets, the context of an application, and others.

Organization enrichment data

Similar to enriching data about assets, organization enrichment data may provide details about users, the geolocation of departments or users, permissions, and others. Such data may be acquired from the organization's HR system and provides context to various activities observed in other data feeds.

The possibilities of analytics in security

Analyzing an organization's data feeds can open a huge door of opportunities to gain security-oriented knowledge and act accordingly.
While a security division may have multiple teams concentrating on various specialties such as application security, operations security, and physical security, the fundamental goals of all teams are very similar, and the processes to achieve these goals are also comparable across teams. The following is a list of common security narratives that may be derived from the typical data feeds available in an organization. These narratives may be applied to any type of security using the corresponding data feeds.

Detecting new and unknown threats

This is often one of the first reasons an organization decides to set up a security analytics program. The most common activities here include attempting to identify previous false negatives based on the available data. Additionally, security analysts may attempt to solve challenging problems such as identifying high-severity vulnerabilities that arise from combining multiple lower-severity issues. For example, an application may have two separate instances of local file inclusion and unvalidated file upload vulnerabilities. While each of these vulnerabilities poses its own set of risks to the application, combined they may allow an attacker to upload files and execute remote commands on the server. Such complex relationships are not trivial for standard scanners to detect, so post-processing of results may unearth a few hidden gems and allow for better prioritization.

True positive matching

This is an interesting problem that security auditors face every day. Every time a new audit of an asset is performed, it is essential to understand which issues existed previously and which were newly introduced. Being able to do this allows auditors to assess the real risk of the given asset. While this can easily be done within a specific tool or vendor solution, matching issues reported by different tools, auditors, or data feeds is a challenging task.
Analysts need to create a robust set of parameters that can be used to correlate issues across multiple feeds and create a unified view. This enables the security team to understand the true nature of an issue and thus accurately assess the risk to the organization. Such capabilities may also allow analysts to identify multiple issues across the organization with a similar remediation pattern. Identifying such scenarios helps optimize the work of the security team and improve its speed to action. It also allows the team to evaluate the efficiency of the current remediation strategies and patch management systems used to handle the reported issues.

False positive evaluation

Similar to analyzing true positives, it is also interesting to look at the false positives that may be reported through automated or manual testing methods. Primarily, the cause of a false positive may hold significant value: it can bring issues to light such as an incorrect configuration for an automated scan, an erroneous methodology in a manual process, and others. The data feeds for such analysis would include results from vulnerability assessments along with alerts generated by monitoring techniques. Another valuable outcome of evaluating false positives is educating technical staff on the common mistakes that occur across a given group, which directly affects the quality of future work produced by the teams.

Root-cause analysis

Once the final results are audited, it is important for a security investigator to analyze the reason behind certain risky issues. This is done by correlating the issue at hand with the various related events that triggered the alert. Behind the scenes, building such a capability requires attaching a context to the issue and its associated asset. The context would include metadata such as the purpose of the asset, the criticality of the issue, connectivity maps, and others.
Such analysis is typically done using data feeds generated from real-time monitoring and blocking solutions. These solutions are typically sprinkled across the organization's network, and combining their feeds into one seamless data set may provide a better understanding of the network.

Designing attack vectors

While various automated solutions may assist with assessing the state of security in an organization, a manual assessment is also necessary to ensure proper coverage. While doing so, results from previous automated and manual assessments may be analyzed to design newer attack vectors that target specific weaknesses in the environment. The attack vectors may use additional information such as the context of a web page, an understanding of the nature of an application, or the relationship of traffic between multiple hosts or services. Such a utility would especially benefit a red team [11] and provide them with a head start on the assessment. Automated assessments may also benefit from metadata derived from previous assessments. This may be used to optimize subsequent scans, resulting in faster and more accurate results.

Predicting a future security concern

One of the best ways to reduce the load on a security team is to prevent an incident from happening in the first place. Even in the initial phases, existing machine-learning algorithms may provide primitive insights into potential issues that may occur in the near future. The algorithms may be used to generate a probabilistic model for the occurrence of a given vulnerability or type of incident. While this may not be very accurate, it may allow security teams to prepare for certain scenarios and help quickly remediate or even avoid an incident.

Resource planning

Analyzing the plans and actions executed in the previous year may allow a team to plan their work for the subsequent year.
The plans may include improvements to processes, detection or prevention techniques, remediation strategies, and others. It is also possible to identify commonly made mistakes and create educational initiatives based on them to prevent such issues at the source. Along with future plans, it is also possible to prioritize the current work from a long list of tasks. Such prioritization may be based on various parameters such as the severity of an issue or incident, the confidence of the report, the value of the asset, and others. This would immensely help teams that have time constraints and/or limited access to resources.

Case study: Detecting new threats

This section discusses an example use case in greater depth by assuming a very specific scenario: detecting command and control (C2) communication by malware from an infected host over DNS traffic. Recently, malware instances have started using Domain Generation Algorithms (DGAs) to identify their C2 servers, since these are harder to detect and block. DGAs are used by malware to generate pseudo-random domain names based on a given seed. One of the attempted domains would resolve to the C2 server, while the other DNS queries would result in an NXDOMAIN response from the DNS server, implying its failure to find a valid IP address that maps to the domain name. While many techniques have been published to detect malicious DNS activity, this case study focuses on a project to design and implement one way of doing so. The previously discussed steps to implement an analytics program will be followed here to assist with creating this module. Note that this was a real project and that the technical details are based on real experiences and experiments. Since C2 servers may have varying domain name patterns, the approach taken for this project was to perform a time-based analysis of DNS queries.
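To make the DGA behavior concrete, the following toy generator (not taken from any real malware family) shows how a date-seeded pseudo-random generator produces the same candidate domain list for attacker and implant alike, with only one name actually registered as the C2 server:

```python
# Toy DGA for illustration only: a date-seeded PRNG emits pseudo-random
# candidate C2 domains; the same seed reproduces the same list.
import datetime
import random

def generate_domains(date, count=5, length=12):
    """Generate deterministic candidate C2 domains seeded by the date."""
    rng = random.Random(date.toordinal())   # same date -> same domain list
    domains = []
    for _ in range(count):
        name = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                       for _ in range(length))
        domains.append(name + ".com")
    return domains

print(generate_domains(datetime.date(2017, 3, 1)))
```

Because only one candidate resolves, an infected host emits a burst of NXDOMAIN responses each day, which is precisely the pattern the time-based analysis described above keys on.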
Business understanding

As discussed earlier, the primary goal of this project was to identify hosts infected by malware that use DGA patterns over DNS to identify their C2 server. The fundamental idea was to look for hosts that sent multiple failed DNS queries to a variety of domains. To reach this goal, the minimal requirements would be as follows:

• Access to DNS queries and replies from all hosts within the organization.
• Data should be timestamped in order to identify relevant patterns.

Data understanding

Typically, an organization may have various types of data feeds that can be used to derive security implications, such as DNS packets, HTTP requests or responses, Active Directory logs, and records from systems such as the configuration management database (CMDB), HR system, and others.

[11] An independent team that challenges and verifies the state of security in an organization.

In this case, once access to DNS-related data was granted, the data and its structure were studied and classified based on the requirements. The data was explored and the following items were confirmed:

• The data set included NXDOMAIN records.
• The recorded timestamp was consistent.

While exploring these data sets, it was important to understand their relationship with other feeds and confirm the completeness of the data set being observed. For example, the timestamp of the data set should refer either to the time the packet was captured or to the time it was inserted into the database, but should not be a combination of both, as that would skew the results. In this case, the data set recorded the time of insertion into the database and was consistent throughout.

Data preparation

This is the step in which the defined data feeds are correlated to identify various entities such as a user, a machine, and others.
Along with such entities, the data feeds may also identify logical actions such as an administrative account login, sensitive data being accessed by a privileged user over VPN, local account creation on a file server, and others. For the current project, all DNS feeds were integrated into one seamless data set. The final consolidated data set consisted of all DNS queries and replies from within the organization's environment. While merging the feeds, it was important to ensure that duplicates were not present in the final result. It was also important to normalize the data structures from the multiple feeds into a single structure that could maintain all the required data. While normalizing, all domain names were converted to lower case to enable case-insensitive comparisons during the analysis. DNS queries usually carry the full domain name being requested. In addition to this, it is sometimes useful to record only the top-level domain (TLD) and second-level domain, since these point to the parent server being requested. While this sounds like a trivial problem, there are some interesting patterns to consider. For example, some domains have country-specific TLDs such as ".co.uk" or ".co.in". In such cases, it is better to treat both segments together as the effective TLD. This ensures that the second-level domain for both yahoo.com and yahoo.co.uk is yahoo. Thankfully, there are third-party libraries such as Nomulus [12] that can be used to solve this problem.

Modeling

With the data normalized, it was time to design the actual solution. In this case, the primary interest was in DNS queries that were not resolved. The time series of these queries from a given host was split into windows of size "t", and a set of constraints was applied to all queries that fell within a given window. These constraints were designed by observing the behavior of malware that uses DGAs.
Some of the constraints were:

• If the organization maintains a whitelist of domains that are known to be benign or refer to internal domains, these can be ignored.
• Not all domains requested within the time "t" should be the same. This is because DGAs typically cycle through a list of generated domains and rarely repeat the same domain consecutively.
• The domains should not have been requested during a previous time span "s". This can be chosen to be a specific range, such as a day, a week, or a month. Since DGAs usually use the current date as a seed, it is rare to see the same domains requested across multiple days or months.
• A minimum of "q" requests should have been sent by the host within the given time "t".

This list provides an idea of the type of constraints applied to the data. An interesting case arises when queries generated by internal scripts also satisfy these constraints and result in false positives. While whitelisting these domains may solve the problem, other solutions could be used that better suit the organization. For example, if most scripts within an organization connect to some common servers, then the model could ignore requests within the time "t" that share the same TLD and second-level domain but have different lower-level domains (LLDs). DGAs usually generate different parent domains instead of multiple LLDs for the same server. While this is one way of solving the problem, organizations may implement their own version based on their requirements.

[12] github.com/google/nomulus

Evaluation

Once the model was designed, experiments were devised and executed for various values of "t" and "q". This allowed the identification of the most optimal solution. The result set of each experiment was validated using third-party reputation services to understand the true positive and false positive ratios. Based on the validation numbers, a reasonable solution was chosen as the winner.
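The windowed constraints and the tunable parameters "t" and "q" described above can be sketched as follows. This is a simplified illustration under stated assumptions: the record field names are invented, the whitelist is a stand-in for the organization's real list, and the prior-span "s" check is reduced to a precomputed history set:

```python
# Sketch of the windowed DGA constraints. Parameter names t, s, q follow
# the text; thresholds and record field names are illustrative assumptions.
from collections import defaultdict

WHITELIST = {"example.com"}          # known-benign / internal domains

def flag_hosts(queries, t=300, q=10, history=None):
    """Flag hosts whose failed lookups within a window of t seconds match
    DGA-like behavior: at least q distinct, non-whitelisted domains that
    were not seen during the prior span (the "s" check, via `history`)."""
    history = history or set()       # domains seen in the prior span "s"
    windows = defaultdict(set)       # (host, window index) -> distinct domains
    for rec in queries:
        if rec["rcode"] != "NXDOMAIN":
            continue                 # only failed lookups are of interest
        domain = rec["domain"].lower()
        if domain in WHITELIST or domain in history:
            continue
        windows[(rec["host"], rec["ts"] // t)].add(domain)
    # Distinct domains enforce "not all the same" (DGAs rarely repeat names).
    return {host for (host, _), doms in windows.items() if len(doms) >= q}

demo = [{"host": "10.0.0.9", "ts": s, "rcode": "NXDOMAIN",
         "domain": "qkx%d.net" % s} for s in range(12)]
print(flag_hosts(demo))  # {'10.0.0.9'}
```

Sweeping `t` and `q` over a validation set, as described in the Evaluation step, is then a matter of calling this function in a loop and scoring each parameter pair against a reputation service.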
It is important to understand the scope of the project at this stage. This analysis was not meant to be a catch-all for all malicious activity. Rather, it was focused on a very specific case. Doing so allowed for a more solid design and a clear vision while modeling the solution. It is important to understand and articulate this scope when delivering the final result. Following is a sample result that was derived from the project.

Figure 5. Suspicious DNS patterns over a 24-hour period

Figure 5 represents the suspicious DNS queries generated from 31 different hosts in a network of over 340,000 hosts during a 24-hour period. Each horizontal block signifies a single host, and the scatter plot represents bursts of requests at a given time. It can be clearly seen that most of the hosts exhibit a pattern of bursts of requests every few minutes. It would be interesting to study and understand the underlying program responsible for this behavior. Another interesting observation is the large number of requests sent by a single host in the result. This host may have high malicious activity, may execute multiple scripts that send requests, or may represent a proxy server that aggregates requests from hosts behind it. Further investigation may reveal the exact case and allow for a better understanding of DNS query patterns within the organization.

Handling misalignments between the goal and result

At the start of the project, the grand vision was to identify hosts that were definitely infected by malware. After validating the results, it was understood that the result set represented suspicious requests that may signify an infected host. Identifying this characteristic was very important, as it defines the actual result of the project. There were two options to pursue at this point: either update the goal of the project to match the result, or update the model to reflect the goal accurately.
In this case, the goal was updated, since this phase of the project successfully detected suspicious hosts within a given environment. It was also decided to plan a subsequent phase of the project to confirm whether these suspicious hosts were actually infected.

Deployment

The final phase of the project included the deployment of the algorithm in a live environment. Most importantly, the algorithm and the experiments executed were documented clearly for future reference. Along with the final report, a plan was initiated to begin phase 2 of the project, in order to confirm the detected suspicious behavior as malicious. This would be treated as a separate project, and all the steps of an analytics project would be followed again to achieve the new goal.

The challenges with analytics

As seen earlier, analytics in security can be a great boon to an organization. On the other hand, it is important for organizations to consider any new risks that the systems and processes may introduce. Following are some of the security and operational risks that teams should anticipate in order to create a better plan.

Security risks

An analytics platform is simply another system used within an organization. Hence, this system should also be threat modeled, similar to all other systems. While doing so, there are a few specific concerns to note.

Data centralization

Creating a security analytics program rarely involves new data. Rather, existing data is centralized in order to derive interesting findings by correlating various feeds together. In such a situation, it is important to note that this central data repository may be a single point of failure and needs to be considered in the threat model. Additionally, other risks typical of data centralization, such as potential abuse of data, managing the privacy of compartmentalized data, managing access for users of the data, and others, also need to be considered.
While these issues are typical of data centralization, they have been newly introduced to the security team and need to be handled accordingly.

Data custodians

Before centralization, each feed would typically have been locally maintained by a custodian. In many cases, the users of the data may have been its custodians as well. But when centralizing these data feeds, it is important to establish a formal process to manage the data, its infrastructure, and its access. This could result in the creation of new security policies and methodologies to ensure the safety of the data without compromising its usefulness. Managing the large number of users and providing them the right type of access to the data is also a challenge. While not impossible, the organization needs to understand and address this complexity ahead of time to reduce unnecessary costs.

Operational risks

Along with security-centric issues, a number of operational challenges may also be encountered during the creation of an analytics program. Some of these challenges are:

Data processing

Security teams use a wide variety of data, so normalizing and consolidating these data feeds is always a challenge. Additionally, enriching the data with additional information greatly helps with correlation, but the enrichment data is usually hard to find, especially in a structured format. Some data feeds are also harder to retrieve due to policies that have been in place for a long time. For example, retrieving network device configurations, capturing traffic universally, and correlating information across multiple systems such as HR and CMDB are new processes that may make system operators nervous. It is important to explain the necessity of these data feeds and create a secure process that works well for the organization, as well as ensures the comfort of all participants in the program.
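As a toy illustration of the normalization and consolidation challenge mentioned above, two differently shaped DNS log feeds might be mapped into one common schema and de-duplicated. All field names here are invented examples; real products each have their own schemas:

```python
# Toy illustration: normalize two differently shaped DNS log feeds into
# one common schema and drop duplicates. Field names are invented.
def from_feed_a(rec):
    """Map a feed-A record into the common schema."""
    return {"host": rec["src_ip"],
            "domain": rec["qname"].lower().rstrip("."),
            "ts": rec["epoch"]}

def from_feed_b(rec):
    """Map a feed-B record into the common schema."""
    return {"host": rec["client"],
            "domain": rec["query"].lower().rstrip("."),
            "ts": rec["time"]}

def consolidate(feed_a, feed_b):
    """Merge both feeds, keeping each (host, domain, ts) record once."""
    merged = [from_feed_a(r) for r in feed_a] + [from_feed_b(r) for r in feed_b]
    seen, out = set(), []
    for rec in merged:
        key = (rec["host"], rec["domain"], rec["ts"])
        if key not in seen:          # both feeds may log the same query
            seen.add(key)
            out.append(rec)
    return out

a = [{"src_ip": "10.0.0.5", "qname": "Example.COM.", "epoch": 100}]
b = [{"client": "10.0.0.5", "query": "example.com", "time": 100},
     {"client": "10.0.0.6", "query": "example.org", "time": 101}]
print(len(consolidate(a, b)))  # 2 -- the overlapping record is kept once
```

The per-feed adapter functions keep feed quirks (case, trailing dots, field names) out of the downstream analysis, which then operates on a single schema.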
Time to prototype

When starting any analytics project, the time to create the first prototype should be taken into account. It is typical for a combination of technical and policy-oriented hurdles to delay progress in the initial stages. Additionally, once the raw data is normalized, it is very easy to get distracted from the initial goal, as playing with data can always bring up something interesting. These distractions may be documented and revisited at a later stage. But to create the prototype, it is important to focus on the goal and achieve it by breaking the work into smaller tangible goals. It is also important to take the threat model into account and make sure that the work being done does not negatively affect the model in any way. Each threat vector may require a different analytical process, so it is essential to focus on the current goal at hand and avoid chasing tangents while exploring the data.

System complexities

One of the major differences between traditional and analytical systems is that traditional systems produce binary results based on a given set of rules, while analytical systems generate a probability of occurrence for the produced result. This probability may be considered as part of the prioritization process implemented by the security team. It is important to note that this probability should not be the only parameter used for prioritization, though its value should play a major role in the process. While it is clear that an analytics system is complex, the general infrastructure can still be built to span all security domains. However, the data being processed and the results being generated vastly differ between domains, so the implementation of the analysis is very specific to each domain. For example, both the AppSec and OpSec teams may share the goal of detecting new vulnerabilities using their data feeds.
But the raw data, results, and algorithms designed are specific to the corresponding domain. Dealing with machine-learning systems brings its own set of challenges. In general, all machine-learning systems depend heavily on the cleanliness of the learning data. Tainted learning data could lead to incorrect and undesired results. Specifically, in the case of security domains, a system may fail to detect an active attack because the attack's indicators were present in the learning data. While such risks may be hard to avoid, it is essential to be aware of and account for them while designing a system that uses machine-learning algorithms.

Conclusion

Traditionally, mitigating security threats was primarily based on perimeter protection, such as intrusion detection and prevention systems (IDSs and IPSs), web application firewalls (WAFs), and others that monitor incoming traffic to filter and block malicious activities or policy violations. As the types of network devices, endpoint devices, and data locations in an enterprise increase, an effective way to view the whole organization's state of security is needed. This need evolved into the second generation of security defense infrastructure: security information and event management (SIEM). Nowadays, bring your own device (BYOD) policies introduce more uncontrolled devices into an organization, the Internet of Things (IoT) keeps inventing new ways of interfacing with devices over the network, and attackers are discovering more sophisticated ways to target enterprises. Companies must deploy deeper network defenses and endpoint protections, and use more advanced security analysis tools and software to collect, integrate, and correlate diverse types of security information to understand their security infrastructure. Meanwhile, with security data growing diversely and exponentially, real-time or near real-time analysis is needed to detect and mitigate threats quickly and prevent damage.
This has led to the birth of security analytics, which leverages Big Data, analytical tools, and machine-learning concepts to satisfy such business needs. Over the course of this discussion, it can be seen that security analytics is a tangible process that can be implemented by any enterprise. With the right team, infrastructure, and budget, any organization would be able to get its analytics program going. Note that this discussion predominantly revolves around setting up a new program; beyond this, efforts for further development, maintenance, and constant innovation should also be considered. This would ensure continuous benefits and success for the security teams in the long run. Applying analytics to information security is a young and fast-growing area. As such, it brings its own set of risks and challenges. While some of them have been discussed in this report, it is important to anticipate and plan for the specific risks that may be relevant to each organization. Doing so will allow for a more productive program that can produce greater benefits quickly. Analytics may not be the ultimate solution that solves all security problems, but it offers a solution to many issues that have plagued the security industry. Eventually, most mid- to large-sized organizations may find the necessity and budget to invest in a full-fledged analytics program. Until then, this area is ripe for innovative research and creative approaches. Overall, a security analytics program is a great path to follow, but not a quick one to conquer.

Authors

Barak Raz
Sasi Siddharth Muthurajan
Jason Ding

Learn more at www8.hp.com/us/en/software-solutions/siem-big-data-security-analytics/

© Copyright 2017 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.
The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein. Google is a registered trademark of Google Inc. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Oracle is a registered trademark of Oracle and/or its affiliates. All other third-party trademark(s) is/are property of their respective owner(s). a00000103ENW, March 2017, Rev. 1