Ensuring Scientific Integrity of High Performance Computing Applications Using Provenance Data

Spencer Callicott
Mentors: Dr. Sean Peisert, Dr. Aydin Buluc, Dr. Nitin Sukhija

Introduction
Historically, scientific workflows have been verified through methods such as fingerprinting, e.g., examining the MPI calls of known-good runs. However, cyber attacks and exploits are growing at a faster rate than ever before. Recognizing this risk demands the ability to verify the integrity of scientific work through provenance data.

Objectives
Hypothesis: Anomaly detection using machine learning on netflow data from a data transfer node (DTN) is effective because DTN traffic is normalized and low in volume.

Methods
Our dataset contains the connection records and system logs from a set of four data transfer nodes. The data was parsed using Apache Spark and Python scripts.

Observations
[Figure: Connections by IP]
Data analysis:
• Traffic on the data transfer nodes is highly normalized and very low in volume; most of it comes from one or two IPs.
• Few hosts connected to all four DTNs over the period of one day.
• In this dataset, large transfers of bytes from one or many hosts residing on the same network are potentially anomalous.

[Figures: Network K-core (legend: 9-16, 8-5, 4-3, 2-1); Protocol breakdown (tcp, udp, icmp, other)]

Considerations when implementing machine-learning-based anomaly detection [1]:
• What is the system doing? What service does it provide?
• How is regular traffic structured?
• What kind of environment does the system target? Who are its users?
• What specifically are the attacks to be detected?

Glossary
Provenance Data – Describes the origin and history of a set of data.
Data Transfer Node (DTN) – A designated node that handles the high-speed transfer of large amounts of data to and from other institutions or private computers.
Anomaly Detection – Discovering an element of a dataset that does not fit the general pattern.

Evaluations for future systems:
• What can the system detect efficiently? Why?
• What attacks go undetected? Why?
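The Methods mention parsing the DTN connection records with Python scripts. A minimal sketch of such a step, assuming Bro/Zeek-style tab-separated conn.log records (the field positions and the "REJ" connection state follow that log format; the threshold and function names are illustrative, not taken from the poster):

```python
from collections import Counter

# Assumed default Bro/Zeek conn.log field order:
# ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p,
# proto, service, duration, orig_bytes, resp_bytes, conn_state, ...
ORIG_HOST = 2
CONN_STATE = 11

def rejected_counts(lines):
    """Count connections in the rejected state "REJ" per originating host."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip conn.log header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > CONN_STATE and fields[CONN_STATE] == "REJ":
            counts[fields[ORIG_HOST]] += 1
    return counts

def flag_heavy_rejecters(counts, threshold=1000):
    """Hosts whose REJ volume exceeds a (hypothetical) threshold,
    as a rough signal for scans or DDoS-like behavior."""
    return [host for host, n in counts.items() if n >= threshold]
```

The same per-host counting also supports the port-scan observation above: a scan shows up as many attempted connections that never complete.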
• How reliable is the system at detecting and reporting anomalies?
• Where does the system break?
• What skills and resources would attackers have available to them?
• How can false detections be reduced?

Further Application
• Port scans and Internet census scans can easily be identified by the number of attempted connections to ports.
• Analyze connection attempts without a proper TCP flag history.
• Filter rejected connections as a hardware anomaly and test for faults.
• Improve system robustness through the introduction of more detection methods.
• Apply clustering algorithms as a form of anomaly detection.

[Figure: Services observed (http, ssl, ftp, gridftp, ssh, ftp-data, dns, gridftp-data, other)]

The connections were then visualized using Gephi, a network graphing tool. The figure above shows the total network with the rejected packets overlaid in blue. After the data has been visualized, the network graph is analyzed with an anomaly ranking algorithm such as AMEN [2] or Oddball [3].

• The Bro IDS detected the traffic as mostly HTTP, with few connections using GridFTP and other services.
• DDoS attacks can be detected by large numbers of connections with the rejected state "REJ".

Works Cited
[1] Sommer, Robin, and Vern Paxson. "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection." 2010 IEEE Symposium on Security and Privacy (2010). Web.
[2] Perozzi, Bryan, and Leman Akoglu. "Scalable Anomaly Ranking of Attributed Neighborhoods." (n.d.). Web.
[3] Akoglu, Leman, Mary McGlohon, and Christos Faloutsos. "Oddball: Spotting Anomalies in Weighted Graphs." Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science (2010): 410-21. Web.