Ensuring Scientific Integrity of High Performance Computing
Applications using Provenance Data
Spencer Callicott, Mentors: Dr. Sean Peisert, Dr. Aydin Buluc, Dr. Nitin Sukhija
Introduction
Historically, scientific workflows have been verified through methods such as fingerprinting, which examines the MPI calls of known-good functions.
However, the number of cyber attacks and exploits is growing at a faster rate than ever before.
Recognizing this risk demands the ability to ensure the integrity of scientific workflows through provenance data.
Objectives
Hypothesis
Anomaly detection using machine learning on netflow data from a DTN is effective due to its normalized and low-volume traffic.
Methods
Our dataset contains the connections and system logs from a set of four data transfer nodes. The data was parsed using Apache Spark and Python scripts.
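The parsing code itself is not shown on the poster; the following is a minimal plain-Python sketch of the kind of step described, using made-up Bro/Zeek-style connection records (the field names and sample values are assumptions, not the actual dataset):

```python
from collections import Counter

# Hypothetical tab-separated connection records: ts, src_ip, dst_ip, dst_port, proto, state
SAMPLE_LOG = """\
1470000001.1\t10.0.0.5\t192.168.1.10\t443\ttcp\tSF
1470000002.2\t10.0.0.5\t192.168.1.10\t80\ttcp\tSF
1470000003.3\t172.16.0.9\t192.168.1.11\t22\ttcp\tREJ
"""

def parse_connections(text):
    """Split tab-separated connection log lines into dicts."""
    fields = ["ts", "src_ip", "dst_ip", "dst_port", "proto", "state"]
    return [dict(zip(fields, line.split("\t"))) for line in text.splitlines() if line]

connections = parse_connections(SAMPLE_LOG)
# Count connections per source IP, as in the "Connections by IP" view
by_ip = Counter(c["src_ip"] for c in connections)
print(by_ip.most_common(1))
```

In the actual project the same grouping would be expressed as a Spark aggregation over the full logs rather than an in-memory counter.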
Observations
Further Application
Connections by IP
Data analysis:
• Traffic on the Data Transfer Node is highly normalized and very low in volume; most of the traffic comes from one or two IPs.
• In this dataset, large transfers of bytes from one or many hosts residing on the same network are potentially anomalous.
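The second observation suggests a simple aggregate check. A hedged sketch, assuming hypothetical (source IP, byte count) records and an arbitrary 10 MB threshold, that groups transferred bytes by /24 network:

```python
from collections import defaultdict

# Hypothetical (src_ip, bytes_transferred) records, invented for illustration
records = [
    ("10.0.0.5", 1200), ("10.0.0.6", 900), ("10.0.0.7", 50_000_000),
    ("172.16.3.2", 1500), ("172.16.3.9", 48_000_000),
]

def bytes_per_network(rows):
    """Aggregate transferred bytes by /24 network prefix."""
    totals = defaultdict(int)
    for ip, nbytes in rows:
        prefix = ".".join(ip.split(".")[:3])  # crude /24 grouping by dotted prefix
        totals[prefix] += nbytes
    return dict(totals)

THRESHOLD = 10_000_000  # assumed cutoff for a "large" transfer, not from the poster
totals = bytes_per_network(records)
flagged = {net: b for net, b in totals.items() if b > THRESHOLD}
print(flagged)
```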
[Figure: Network K-core graph of DTN connections, with nodes grouped by k-core range (9-16, 8-5, 4-3, 2-1) and colored by protocol (tcp, udp, icmp, other)]
• Few hosts connected to all four DTNs throughout the period of one day
Considerations when implementing machine learning based anomaly detection1:
• What is the system doing?
• How is regular traffic structured?
• What kind of environment does the system target? Who are its users?
• What specifically are the attacks to be detected?
Glossary
Provenance Data – Describes the origin and history of a set of data.
Data Transfer Node (DTN) – Designated nodes that handle the high-speed transfer of large amounts of data to and from other institutions or private computers.
Anomaly Detection – Discovering an element of a dataset that does not fit a general pattern.
Evaluations for future systems:
• What can the system detect efficiently?
Why?
• What attacks go undetected? Why?
• How reliable is the system at detecting and reporting anomalies?
• Where does the system break?
[Figure: Services observed in DTN traffic – http, ssl, ftp/ssl/gridftp, ssh, ftp-data, dns, gridftp-data, other]
• What skills/resources would hackers have
available to them?
• How can the system focus on reducing false detections?
• Port scans and Internet census scans can
easily be identified by the number of
attempted connections to ports.
• Analyze connection attempts without proper TCP flag history.
• Filter rejected connections as hardware anomalies and test for faults.
• Improve system robustness with the introduction of more detection methods.
• Apply clustering algorithms as a form of anomaly detection.
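As a sketch of that last idea, a tiny hand-rolled k-means (not the project's actual implementation) over made-up per-host features, treating a singleton cluster as a candidate anomaly:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns final centroids and cluster assignments."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        assign = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:  # keep old centroid if the cluster empties out
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign

# Hypothetical per-host features: (connection count, MB transferred)
hosts = [(10, 1.2), (12, 1.0), (11, 1.1), (9, 0.9), (300, 80.0)]
centroids, assign = kmeans(hosts, k=2)

# A cluster containing a single host is a candidate anomaly
sizes = [assign.count(i) for i in range(len(centroids))]
anomalies = [h for h, a in zip(hosts, assign) if sizes[a] == 1]
print(anomalies)
```

On this toy data the isolated high-volume host ends up alone in its cluster; real DTN features would need scaling before clustering, since byte counts and connection counts live on very different ranges.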
Then, the connections were visualized using Gephi, a network graphing software. The above figure shows the total network with the rejected packets overlaid in blue.
After the data has been visualized, the network graph is then analyzed with an anomaly ranking algorithm, such as AMEN2 or Oddball3.
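Oddball ranks nodes using features of their ego-nets, such as ego-net edge count versus node count. The sketch below computes just those two features on a toy graph; it does not implement the full power-law fit from the paper:

```python
# Hypothetical undirected connection graph as a set of edges
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")}

def neighbors(node):
    """All nodes directly connected to `node`."""
    return {v for u, v in edges if u == node} | {u for u, v in edges if v == node}

def egonet_features(node):
    """Return (node count, edge count) of the ego-net around `node`."""
    ego = neighbors(node) | {node}
    ego_edges = {e for e in edges if e[0] in ego and e[1] in ego}
    return len(ego), len(ego_edges)

for n in sorted({u for e in edges for u in e}):
    print(n, egonet_features(n))
```

Oddball-style detectors then flag nodes whose edge count deviates strongly from the power law fitted over all (node count, edge count) pairs, e.g. near-cliques and near-stars.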
• The BroIDS detected traffic as mostly HTTP, with few connections using GridFTP and other services.
• DDoS attacks can be detected by large numbers of connections with the rejected state “REJ”.
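Both the port-scan and the REJ-based detection ideas reduce to counting. A minimal sketch over hypothetical connection records (the IPs, ports, and threshold are invented for illustration):

```python
from collections import Counter

# Hypothetical connection records: (src_ip, dst_port, state)
conns = [
    ("10.0.0.5", 443, "SF"), ("203.0.113.7", 22, "REJ"),
    ("203.0.113.7", 23, "REJ"), ("203.0.113.7", 80, "REJ"),
    ("203.0.113.7", 8080, "REJ"), ("10.0.0.6", 80, "SF"),
]

REJ_THRESHOLD = 3  # assumed cutoff for suspicious rejected-connection counts

# Sources with many rejected connections (possible DDoS participants)
rej_counts = Counter(src for src, _, state in conns if state == "REJ")
suspects = [ip for ip, n in rej_counts.items() if n >= REJ_THRESHOLD]

# Port scans also show many distinct destination ports per source
ports_per_src = {ip: len({p for s, p, _ in conns if s == ip}) for ip in rej_counts}
print(suspects, ports_per_src)
```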
Works Cited
1 Sommer, Robin, and Vern Paxson. "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection." 2010 IEEE Symposium on Security and Privacy (2010). Web.
2 Perozzi, Bryan, and Leman Akoglu. "Scalable Anomaly Ranking of Attributed Neighborhoods." (n.d.). Web.
3 Akoglu, Leman, Mary McGlohon, and Christos Faloutsos. "OddBall: Spotting Anomalies in Weighted Graphs." Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science (2010): 410-21. Web.