Conference Session A11
Paper Number 185

Disclaimer: This paper partially fulfills a writing requirement for first year (freshman) engineering students at the University of Pittsburgh Swanson School of Engineering. This paper is a student, not a professional, paper. This paper is based on publicly available information and may not provide complete analyses of all relevant data. If this paper is used for any purpose other than these authors’ partial fulfillment of a writing requirement for first year (freshman) engineering students at the University of Pittsburgh Swanson School of Engineering, the user does so at his or her own risk.

USING CLUSTERING AND MACHINE LEARNING FOR ANOMALOUS BREACH DETECTION

Jarod Vickers, [email protected], Mahboobin 4:00
Andrew Tran, [email protected], Mahboobin 4:00

Abstract: In the current digital world, where almost everyone has constant access to the internet in some way, users are constantly in danger from hackers who can remotely access and control devices as well as view personal information, all while remaining undetected. Traditional cyber security systems rely on a simple firewall and on basic statistical analysis that notifies human analysts of suspicious activity that may be a sign of remote access. However, the problem with this method is that these systems often overwhelm security analysts with red flags that are mostly false positives, and the firewalls are fairly simple for advanced hackers to breach. A solution to this problem is to incorporate clustering, an analysis model that incorporates both data mining and machine learning algorithms, into the threat detection process. This method is known as user entity behavioral analytics (UEBA). There are already security systems that implement UEBA, but they are not widely used due to many factors such as high cost and ethical concerns.
UEBA can take full advantage of clustering paired with machine learning, as it groups data into two overall categories: normal behavior and anomalous behavior. This grouping happens quickly and automatically, making UEBA the most efficient form of breach detection available as well as the most sustainable option. The implementation of UEBA will have a profound effect on the future of cyber security by significantly decreasing the number of successful cyber-attacks, especially when its flaws are overcome.

Key Words: Sustainability, Computers, Cyber security, Breach detection, Machine learning, Data mining, Clustering

University of Pittsburgh, Swanson School of Engineering 3.31.2017

CYBERSECURITY: DETECTION OVER PROTECTION

The rapid advancement of computing systems has led to more convenient lives, but has also left personal information vulnerable to anyone smart enough to access it. In recent years, information has been stored on computer hard drives and, where there is access to the internet, can even be sent to and stored in large data storage centers, commonly referred to as the cloud. This information can include passwords to a personal email account, credit card information, and financial records. Not just anyone can access this information, of course. If the data is stored on a personal computer (PC), a user must log in to access their information, and if it is stored on the cloud, such as an email, a password must be used. Without the correct information, such as a password, the data stays encrypted, making it unreadable, but there are ways to access this information without knowing a user’s password.

Every day, people send personal information to make accounts, apply for loans, make online purchases, etc. Encryption ensures that this personal information cannot be seen by anyone without the user’s permission. However, encryption is not perfect. Similar to how doors can be unlocked without keys by lock picking, encrypted data can be unlocked without passwords by hacking. If they are able to break an encryption, hackers are able to access personal information remotely, and sometimes they cannot be stopped using current methods of cyber security.

Cyber security is a major issue as computers and big data become more and more ubiquitous, because people’s personal information is being given out to companies and organizations who are trusted to keep this information private. The current cyber security methods are not enough to stop all cyberattacks, however. Current cyber security methods have one main flaw: they are computer programs. Intelligent and clever hackers discover new ways of bypassing security that the current methods are not programmed to deal with or even detect. When a hacker can figure out how a program works, they can find a way to get around it. For this reason, these methods, like many other computer programs, are inherently unsustainable. A sustainable computer program can best be described as a program that can be maintained and updated to continuously perform at a high level. An example of a sustainable computer program is Microsoft Office. Microsoft Office was released in 1990, and still continues to be updated and sold worldwide [1]. A solution to this sustainability problem is to allow the program to change, adapt, and essentially learn. This technology already exists, and it is called user entity behavioral analytics (UEBA).

Evolution of Cyber Security and Cyber Attacks

Ever since the dawn of the World Wide Web in 1990, computer viruses have been a serious issue in the technological world. Viruses started off simple, as basic worms, with the first being created by Robert Morris in 1989 [2]. Worms initially only served as Denial of Service (DoS) attacks, which simply flooded users’ networks with superfluous requests and disrupted internet connection.
Due to the infancy of the internet, DoS attacks had no profound effect. These basic worms eventually evolved into modern day viruses, which have more impact, as they are able to access and retrieve information from PCs. These viruses led to the creation of antivirus programs. Antivirus programs were initially developed in the early 1990s, following the widespread attack of the Melissa and ILOVEYOU viruses. Both of these viruses were sent through email and infected personal computers, with the ability to extract private information [2]. Antivirus programs were designed to recognize the signature of computer viruses and prevent them from executing. This task was accomplished by what is known as a firewall. Firewalls, at the most basic level, function as a routing device; they act as a gateway and only allow certain addresses to have access to whatever information the firewall is protecting [3]. In the early days of the internet, a firewall was sufficient, as it was able to stop most basic attacks.

As time and technology progressed, hackers became smarter and began to develop new ways to breach security systems. Now, rather than creating viruses that infect computer systems, computer hackers manually breach firewalls and do the dirty work themselves. Modern hackers are smart and advanced enough to breach practically any firewall. In this day and age, firewalls simply exist to slow these hackers down. As evidenced by the recent breach of Target, a massive corporation, hackers have the ability to breach almost anything. As a result, the focus of cyber security has shifted from prevention to containment. However, most cyber breaches go unnoticed for over 100 days, and thus millions of bits of data can be compromised. In the Target breach alone, over 40 million credit and debit card numbers were compromised [2]. UEBA is the future in containing these threats, as it has the ability to detect cyber breaches and notify professionals in a fraction of the time.
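The gateway role of a basic firewall described in this section, admitting only certain addresses, can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not how any particular firewall product is implemented: the allowed networks and the function name `is_allowed` are hypothetical, and a real firewall applies far richer rules than an address check.

```python
import ipaddress

# Hypothetical allowlist: these networks are illustration values only.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_allowed(address, allowed=ALLOWED_NETWORKS):
    """Return True if the source address falls inside any allowed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in allowed)


if __name__ == "__main__":
    for source in ("10.1.2.3", "192.168.1.77", "8.8.8.8"):
        verdict = "allow" if is_allowed(source) else "block"
        print(source, verdict)
```

A rule of this kind only checks where a request claims to come from, which is consistent with the point made above that such a gateway slows attackers down rather than stopping them.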
Current Methods of Cyber Security

Security systems, such as anti-virus software, that are currently being implemented use methods known as misuse-based analytics, also known as signature-based analytics, and anomaly-based analytics. Misuse-based analytics can detect viruses by comparing them to previously known viruses. As a benefit, this method does not raise many false alarms, but it requires frequent updates as new types of viruses are discovered. This method is effective only if new viruses are similar to previously detected viruses, which makes it useless the very first time a new type of virus is used, known as a “day-zero” attack. The other method is anomaly-based analytics, which is a step towards more intelligent analysis. Anomaly-based techniques model a system’s normal behavior and detect anomalies and deviations from that behavior. This method can make a unique model for every system and can even detect “day-zero” attacks. However, a major problem with this method is its high false alarm rate. Any behavior that deviates from the normal model will be flagged, even if it is completely legitimate [4].

One of the flaws of the current methods of cyber security is the heavy dependence on developers. Programs are only capable of doing what they are programmed to do. If a new type of virus is created, the program will be useless until the virus is discovered by a human and the program is updated to protect against it. To solve this problem, these cyber security programs must have the ability to learn and adapt automatically.

ALGORITHMS USED IN BREACH DETECTION

The ability to learn is made possible through two key algorithms: machine learning and data mining. An algorithm is essentially a sequence of instructions to convert an input to an output. For example, algorithms can take a set of numbers and output a sorted list from least to greatest, or take a list of names and output the list in alphabetical order. Simple algorithms for tasks such as sorting have simple known instructions and rules, such as that one is less than two and that A comes before B in the alphabet. This is the basis behind the signature-based method of cyber security. If attacks, such as viruses, are detected, they can be recorded and made into a new “rule.” This method works, but not against novel “day-zero” attacks. For certain types of data, the rules are too abstract for an algorithm to be written by hand. For example, the rules for what is considered a spam email are different from person to person. The solution to this is to allow computers to learn and automatically create algorithms specifically for their users. This solution will not only increase the effectiveness of cyber security, but also increase sustainability. Security systems will be able to change and adapt even when day-zero attacks occur, possibly preventing any future attacks that may occur. One of the algorithms that allows a system to adapt is machine learning.

Machine Learning

Machine learning is a type of artificial intelligence that allows a system to learn without explicitly being programmed. Learning, as it pertains to machines, is the ability to change and update its algorithms automatically. There are two types of machine learning: supervised and unsupervised. Supervised machine learning allows a computer to learn from inputs and corresponding outputs, drawn from previously collected data and provided by a supervisor.

FIGURE 1 [5]
A hypothetical engine power versus price plot of various cars.

These data points are then categorized manually into positive and negative examples, which is known as the training process. For example, in Figure 1, various cars are plotted based on engine power and cost. The positive examples, represented as “+” marks, are considered by a supervisor to be family cars, while negative examples, represented by “-” marks, are not family cars. Based on the data, a computer can apply bounds to the positive examples based on how specific the supervisor wants to be. The specificity is called the hypothesis class, represented by the shaded rectangles in Figure 1. In real-life scenarios, there are often more than just two variables to consider, and hypothesis classes do not have simple rectangular shapes. There are more factors that go into categorizing a car as a family car than just engine power and price, such as passenger limit, efficiency, etc. With machine learning, a computer is able to adjust its own hypothesis class as new data is provided. New data can later be categorized by comparing its characteristics with the characteristics of data previously plotted.

Unsupervised learning works similarly to supervised learning, but requires no supervisor during the training process, which only includes inputs [5]. Without predetermined outputs to determine positive and negative examples, unsupervised machine learning analyzes the data given during the training period to automatically determine hypothesis classes. This is done by comparing the characteristics of each given data point and grouping it with other similar data points. When new data points are analyzed, they will be assigned to existing groups, or a new group will be made if a data point deviates significantly from the others. Unsupervised learning is completely autonomous, which is beneficial when new and unknown data is received, as it does not need a human to tell the machine what to do with it. This method is called clustering when it is used in tandem with another algorithm: data mining.

Data Mining: Clustering

Data mining is the process of extracting knowledge from a large amount of data. Through machine learning, data mining techniques are able to extract data and create a set of patterns and rules to explain the data. This can be accomplished through a wide variety of different methods, with clustering being one of these methods. Clustering is a set of techniques for finding patterns in high-dimensional unlabeled data. It is a data mining method in which similar data is grouped together by focusing on dividing separate instances into natural groups, rather than into predicted groups [6]. Natural groups are groups created from data patterns (unsupervised learning), whereas predicted groups are groups created by a user (supervised learning). There are two primary types of clustering techniques: distance-based and density-based. In cases involving anomaly detection and cybersecurity, the distance-based clustering style is preferred, as it produces more results overall than density-based clustering [7].

The basic technique of distance-based clustering is derived from k-means clustering, as the method relies on distance from means, or averages. The first step in this process is establishing a cluster set, C = {C1, …, Cj}, where j is the number of clusters. Following this step, each data point x in a data set X = {x1, …, xk} is assigned to the closest cluster, according to the Euclidean distance (see Figure 2) between the data point and the cluster’s center. The Euclidean distance is a calculated difference between a data point’s value and the mean of a cluster. If the distance between any data point and the center of the cluster is considered to be too great, a new cluster is created to accommodate this data point and future data points [7]. Once all data points are assigned to a certain cluster, the average of each cluster, also known as a centroid, is calculated. These centroids are then used as the new centers for the clusters, and the entire process stated above is repeated. Once the centroids have stabilized and are no longer fluctuating, the data is considered completely clustered [8].

FIGURE 2 [6]
The Euclidean Distance formula. This formula calculates the distance between any 2 given data points.

When applying clustering to network behavior, it can be difficult to tell what is considered ‘normal’ behavior and what is ‘anomalous’ behavior. In most cases, it can be safely assumed that ‘normal’ data will make up a much larger percentage of the data set than ‘anomalous’ data, and thus will be labeled as such [7]. Unfortunately, with this basic method, data can sometimes fall under more than one category, and it is not completely clear which group the data belongs to, which presents a major issue. However, advanced algorithms of clustering deal with this issue, namely probability-based clustering.

Probabilistic Clustering: An Advanced Method

Probability-based clustering works very similarly to k-means clustering. The base algorithm is exactly the same; data is organized into clusters based on their distance from the centroids of the clusters. In the case when a data point falls between two or more clusters, a statistical approach is taken [6]. Probability-based clustering defines clusters in terms of a mean and standard deviation. Using the values of the means and standard deviations, each data point can be described as a function of x (the data point), the mean of each cluster, and the standard deviation of each cluster. As a result of this function, each data point is statistically assigned to a cluster. This minimizes the aforementioned issue of data points that are located somewhere in between 2 clusters.

FIGURE 3 [6]
2 distributions of clusters, each with their own mean and standard deviation.

FIGURE 4 [6]
The statistical formula f(x; μ, σ), where x is the value, μ is the mean, and σ is the standard deviation.

Clustering is a powerful tool that harnesses machine learning algorithms to analyze and sort data. This ability to sort data with fair ease into certain categories can be very useful, especially for network anomaly detection. Clustering in this sense is used to analyze user behavior and categorize it as either ‘normal’ or ‘anomalous’ behavior. The ability to correctly categorize different behaviors comes from machine learning, which allows the security system to “learn and make judgements without being programmed explicitly for every scenario” [10]. These algorithms behind machine learning and clustering are still in their infancy stages. As time passes, the algorithms will be optimized and become more effective and efficient. Therefore, the incorporation of these flexible methods in cybersecurity programs, such as UEBA, vastly increases the ability to maintain and update a program, leading to an overall sustainable program.

NETWORK ANOMALY DETECTION

In the new modern age of technology, network attacks have increased dramatically in both number and quality. In an effort to combat the always developing hackers of the world, significant research has been conducted on network intrusion detection. Network anomaly detection can be described as “finding exceptional patterns in network traffic that do not conform to expected normal behavior” [9]. These non-normal patterns are often referred to as anomalies, and can signify serious breaches in a network. Anomaly detection can be applied to a plethora of different scenarios, including fraud detection, intrusion detection, and military surveillance [9]. With the recent developments of machine learning, and its application to data mining, anomaly detection has improved greatly in recent years. Machine learning, and clustering in particular, have made network anomaly detection a reality.

The overarching idea of clustering is to sort collected data into different clusters. When applied to network anomaly detection, this primarily involves the analysis of user and network behavior. The data being analyzed is any action made by a user or the network. The cluster that each action is assigned to depends on the characteristics of that action, such as when the action took place, where it originated from, and what files or programs that action affects. Behavior, although a categorical value, can be transformed into a numeric value for purposes of clustering. This numeric data is then run through a probability-based clustering algorithm. All the data is statistically analyzed as a function of x, μ, and σ, and is then sorted into 3 general clusters: intrusion attack data, denial of service data, and normal data.

Intrusion attack behavior is generally considered the more dangerous of the two types of attacks [9]. Intrusion attacks are defined as actions “aimed to compromise the security of computer and network components in terms of confidentiality, integrity, and availability” [9]. Intrusions can be performed by internal sources (individuals who have permission to access the network) or external sources (those who do not have permission) [9]. One example of these attacks is the Target hack of 2013, in which Target networks were hacked and upwards of 40 million credit cards were compromised [2]. Clustering is the superior method of data mining when it comes to detecting these intrusion attacks. In a study performed by Blower and Williams, a clustering method was used to group normal versus anomalous network data [4]. In the study, the clustering method had a reported performance of ninety-eight percent in determining whether there was an attack or not. This value was significantly higher than that of other data mining methods, such as Bayesian networks and fuzzy rules, in revealing anomalous behavior [4]. There are several different programs on the market that perform network anomaly detection using clustering algorithms.
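The statistical assignment described in this section, in which each numeric behavior value is analyzed as a function of x, μ, and σ and then sorted into one of three clusters, can be sketched briefly. The following is only a minimal illustration and not the method of the cited study or of any product: it assumes f(x; μ, σ) is the normal (Gaussian) density, it uses a single hypothetical feature (requests per minute), and the cluster means, standard deviations, and function names are made-up illustration values.

```python
import math

# Hypothetical clusters over one numeric feature (requests per minute).
# Each entry is (mean, standard deviation); the values are made up.
CLUSTERS = {
    "normal": (20.0, 5.0),
    "intrusion": (60.0, 10.0),
    "denial_of_service": (500.0, 100.0),
}


def log_gaussian(x, mu, sigma):
    """Logarithm of the normal density f(x; mu, sigma)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def cluster_probabilities(x, clusters=CLUSTERS):
    """Relative probability of each cluster for value x.

    Assumes every cluster is equally likely beforehand. Working with
    log densities avoids division by zero for values far from all means.
    """
    logs = {name: log_gaussian(x, mu, sigma)
            for name, (mu, sigma) in clusters.items()}
    largest = max(logs.values())
    weights = {name: math.exp(value - largest) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def assign(x, clusters=CLUSTERS):
    """Name of the most probable cluster for value x."""
    probabilities = cluster_probabilities(x, clusters)
    return max(probabilities, key=probabilities.get)


if __name__ == "__main__":
    for rate in (22.0, 65.0, 480.0):
        print(rate, assign(rate))
```

Under these made-up parameters, a value that lies between two cluster means is not simply given to the nearer mean; the wider or narrower spread of each cluster also affects which one is more probable, which is the advantage of the probability-based approach over plain distance.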
Network anomaly detection programs include host-based intrusion detection systems and network-based intrusion detection systems. One of the more modern programs is known as User and Entity Behavior Analytics (UEBA), which combines these 2 forms of intrusion detection systems.

USER ENTITY BEHAVIORAL ANALYTICS

UEBA is a relatively new intrusion detection system that builds upon the foundation of past systems, known as User Behavior Analytics (UBA). As stated previously, UEBA combines both host-based intrusion detection and network-based intrusion detection [11]. Host-based intrusion detection systems (HIDS) focus on the software on an actual device. HIDS monitor internal activity, such as what programs are running and the processes they are carrying out [9]. Network-based intrusion detection systems monitor network systems to detect any potential compromises [9]. These primarily focus on the transfer of data throughout the network. By implementing both of these methods, UEBA is able to easily detect cyber-criminals attempting to hack into systems.

Cyber-criminals typically intrude into network systems through a method called spearphishing. Spearphishing is the process of posing as an employee or trusted individual so as not to raise any alarms within a company [10]. For example, in the Target hack, the perpetrators accessed the network through a ventilation and heating supplier to Target [2]. For this reason, many cyber-attacks go unnoticed for long periods of time; prior to UEBA, the average intrusion went unnoticed for 265 days and took 69 days to contain. UEBA cuts the time required to detect a security breach significantly, by over 50 percent, according to Niara, a company that produces and distributes UEBA technology [10].

Niara’s version of UEBA uses all of the processes described above. Niara incorporates machine learning and clustering algorithms into its product to analyze user behavior and external device behavior. The distinguishing factor of Niara’s UEBA from other UEBA programs is its contextual risk scoring. The program performs probabilistic clustering on every piece of data, and based on the probability that any given piece of data is categorized as an intrusion attack, a risk score is assigned to the data [10]. This creates a user-friendly interface that allows professionals to easily assess whether a specific individual needs to be investigated or not. Therefore, Niara’s use of statistical analysis ensures that hackers must almost perfectly impersonate an entity’s behavior to remain undetected. This results in a program that is fairly unlikely to become obsolete, unless hackers find a method to perfectly replicate someone’s behaviors. Niara’s UEBA thus will perform at an optimal level for the foreseeable future, and is a sustainable program. Niara’s product and other forms of UEBA are currently the most effective versions of anomaly breach detection, as they discover threats in a fraction of the time that older methods do.

UEBA Compared to Other Methods

In the infancy stages of the internet, firewalls and basic antivirus systems were sufficient in dealing with cyber security threats. As time has passed, and hackers have become much more intelligent, these basic systems are simply not powerful enough to detect and eliminate cybersecurity threats. Therefore, anomalous breach detection systems were created to actively detect these threats by analyzing network activity, leading to their elimination. UEBA is currently the most effective of these anomalous breach detection systems, as it analyzes both interior users and exterior entities, unlike most cybersecurity systems. Old versions, known simply as UBA, only monitored and analyzed devices and users within the network. This, however, does not provide extensive enough coverage, as is evident in the Target hack case, in which hackers accessed the network via a third-party company. UEBA addresses this problem, as it analyzes any and all devices that are somehow involved with a network [11]. Although UEBA provides an overall superior security blanket in terms of detecting security threats, the product is not perfect, and thus there are still some issues with it.

Limitations of UEBA

The primary issue with UEBA is the program’s limited ability to actually deal with threats. UEBA is simply a program that analyzes behavior and discovers any anomalies that may potentially be threats. UEBA does absolutely nothing to prevent these threats or deal with any successful intrusions. Therefore, UEBA must be used in tandem with other programs to deal with these threats. Preventing threats before they occur is primarily the job of a firewall. Network threats are slowed down greatly by firewalls, and if they do breach the firewall, UEBA algorithms can detect the breach, creating a very effective cybersecurity system [9]. Also, UEBA programs do nothing to contain a breach once it has occurred. Unfortunately, in our current day and age, the only thing that can deal with intrusions once they occur is a professional. Therefore, a cybersecurity professional must always be readily available to contain an intrusion and terminate it [9].

Although it is an issue that UEBA cannot deal with specific threats, there are potential advancements that could help fix the issue. These, however, involve technology that is beyond our current reach. Every attack is different, and requires different methods to respond to and contain. Computers in their current stage are unable to make difficult decisions and respond properly to all situations. Further advancements in machine learning and artificial intelligence could eventually allow cybersecurity programs to eliminate threats, but not in the foreseeable future. UEBA could also be incorporated more into a network’s infrastructure, and have the capability to change network traffic flow temporarily [11].
Although potential advancements are possible, it is still plausible that hackers could create a method to spearphish perfectly without leaving traces. This unfortunately would make UEBA and other behavior-based network breach detection systems obsolete, despite the fact that UEBA has the potential to be improved upon. UEBA, when partnered with other sources of cybersecurity, is still one of the most powerful tools in dealing with cyber threats, however. UEBA might not be perfect and may detect false positives, but it is still reliable enough to have a ninety-eight percent success rate.

Unpredictability

Another issue with UEBA is its unpredictability, which is a result of using artificial intelligence (AI). The predictability of any technology is important for many reasons, such as safety and accountability. Every piece of technology is created to complete a certain task and to cause an expected result. However, the technology can sometimes cause unwanted results. Having a predictable product allows engineers to develop fail-safes ahead of time in case of a malfunction [12]. AI does not have the same level of predictability, meaning that it becomes difficult to predict when an unexpected result will occur. In the case of cyber security and UEBA, an unexpected result would be a false positive, or when the system categorizes a user’s normal behavior as anomalous. Unfortunately, there is no way to know or predict what conditions would result in a false positive, or even how often it will happen.

The reason the lack of predictability of AI can be an issue can be explained by considering a classic example of AI: programs made to play chess against humans, such as Deep Blue, the first machine to beat the world chess champion, Garry Kasparov. If Deep Blue were predictable, that would mean that its developers would have been able to predict every move the machine was going to make before it was made. The developers would also need to be able to realize every time an unintended result occurs, in this case the machine making a bad move. In other words, the developers would have needed to be better at chess than the machine and could have beaten Kasparov without it [12]. Similar to the chess example, the unpredictability of UEBA means that developers must know when a false positive occurs, but with the multitude of variables that are considered by UEBA, a developer would need to perform a complete investigation of the alert just to make sure it is not a false positive, which trivializes the function of the security system itself. A false positive is not necessarily harmful, however, because a human expert makes the final decision on how to act against the behavior, but having a technology that is unpredictable could cause problems in the future. Some may find it hard to trust an alert from an unpredictable technology.

Ethical Concerns of Machine Learning

While UEBA may be beneficial to cyber security, there are important ethical concerns that must be considered. The main concern with UEBA comes from the method of data collection. Systems that implement UEBA must collect data from users, such as where they send and receive data, as well as when files and programs are accessed. Essentially, all of their activities while using the computer are monitored by the security system and recorded to determine their normal behavior. This can be considered a violation of the user’s right to privacy. Privacy can be defined as the state of being free from being observed. However, as mentioned before, this system is essentially observing every action that is made and every file that is opened. There are many questions about the extent of this observation that need to be answered. For example, how far can this monitoring go? Do the developers of the system have access to the information? Is it possible for the program to actually access these data files? If so, how can this be prevented so no one exploits these permissions? Given the sensitivity of the personal information, it is important that these questions are answered before UEBA becomes a widely used method.

CURRENT STATE OF UEBA

UEBA is the next step in perfecting cyber security. It solves many of the problems faced by older methods, such as over-reliance on the developer, and it also has the ability to detect threats from external sources. All of these problems were able to be solved through the implementation of two key algorithms: machine learning and clustering. These two algorithms allow security systems to categorize any behavior and action that affects a network as anomalous or safe before alerting human analysts. UEBA has proven to be the most effective cyber security system to date and is the most advanced. However, there are still many flaws and ethical issues that need to be resolved before UEBA becomes more widespread. Currently, there is no autonomous way to actually deal with an anomalous threat. UEBA can only detect threats and warn experts. This limitation cannot be resolved until advancements in machine learning and artificial intelligence are made. Even though there are some issues, UEBA is still being used currently, and until more technological advancements come along, it will be the best breach detection option. It will also dramatically increase the sustainability of cyber security software. New methods of breach detection may not need to be developed for a long time, as UEBA will be able to change and adapt using machine learning.

SOURCES

[1] “A Brief History of Microsoft Office.” Microsoft. 9.10.2015. Accessed 3.25.2017. https://enterprise.microsoft.com/en-gb/articles/roles/itleader/brief-history-microsoft-office/
[2] T. Julian. “Defining Moments in the History of Cyber security and the Rise of Incident Response.” Infosecurity. 2015. Accessed 2.20.2017. https://www.infosecuritymagazine.com/opinions/the-history-of-cybersecurity/
[3] K. Scarfone, P. Hoffman. “Guidelines on Firewalls and Firewall Policy.” National Institute of Standards and Technology. 2009. Accessed 2.1.2017. http://csrc.nist.gov/publications/nistpubs/800-41-Rev1/sp80041-rev1.pdf
[4] A. L. Buczak, E. Guven. “A Survey of Data Mining Learning Methods for Cyber Security Intrusion Detection.” IEEE. 10.26.2015. Accessed 2.20.2017. http://ieeexplore.ieee.org.pitt.idm.oclc.org/document/7307098/
[5] E. Alpaydin. “Introduction to Machine Learning.” Massachusetts Institute of Technology. 2014. Accessed 2.15.2017.
[6] I. H. Witten, E. Frank, M. A. Hall. “Data Mining Practical Machine Learning Tools and Techniques.” Elsevier. 2011. Accessed 2.24.2017.
[7] S. Dua, X. Du. “Machine Learning in Cybersecurity.” CRC. 2011. Accessed 2.24.2017.
[8] Y. Zhao. “R and Data Mining: Examples and Case Studies.” Elsevier. 10.20.2015. Accessed 2.21.2017. https://78462f86-a-e2d7344e-ssites.googlegroups.com/a/rdatamining.com/www/docs/RDataMining-book.pdf?attachauth=ANoY7cp85M9jpNUCtACrnD6Bd0bqQlSUBLIOv026Fj4DHSpb2PskMx7krHMuf5qGBGs5YtlCTpK_BmsEureCQnAp_i6Xlk_o77f1I3O4Kea_BqeKKMgl8rDuvEs7UAEjiSxcafLmMjAChNhDLcOEJffQqVaq63FFha8tIn1U9idild47U4Ho7q4j_AESkoUq4NHMUbAtqBBda27vBp6_05ezIsIJttsA%3D%3D&attredirects=0
[9] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita. “Network Anomaly Detection: Methods, Systems, and Tools.” IEEE. 2014. Accessed 2.23.2017. http://www.nr2.ufpr.br/~jefferson/pdf/Network_Anomaly_Detection-Methods,_Systems_and_Tools.pdf
[10] “How UEBA and Machine Learning Detect Attacks.” niara. 2016. Accessed 2.09.2017. http://info.niara.com/hubfs/PDFs/Guides/Security_Analysts_Guide_How_UEBA_and_Machine_Learning_Detect_Attacks.pdf
[11] D. Shackleford. “Active Breach Detection: The Next-Generation Security Technology?” SANS. 2.1.2016. Accessed 2.09.2017. https://www.sans.org/readingroom/whitepapers/analyst/active-breach-detection-nextgeneration-security-technology-36812
[12] N. Bostrom, E. Yudkowsky. “The Ethics of Artificial Intelligence.” Machine Intelligence Research Institute. 6.12.2014. Accessed 2.10.2017. https://intelligence.org/files/EthicsofAI.pdf

ACKNOWLEDGEMENTS

We’d like to thank our friends and the wonderful card game of bridge for keeping us sane throughout this entire process.