5th Symposium on Information Systems Assurance
Data Mining of E-Mails to Support Periodic & Continuous Assurance
Glen L. Gray
California State University at Northridge
Roger Debreceny
University of Hawai`i at Mānoa
Toronto: October 2007
In this Presentation
 Continuous monitoring of emails – why?
 Technologies
    Social Network Analysis
    Text analysis
 Challenges
 Opportunities
Continuous Monitoring of Emails – Why?
 Increased focus on forensic approaches to auditing
 Increased interest in continuous assurance and monitoring of business processes
 Emails = Organization’s DNA
 Evidential matter on:
    Employee & management fraud (overrides)
    Compliance (e.g., HIPAA)
    Loss of intellectual property
    Corporate policies
Enron Email Archive
 Released by Federal Energy Regulatory Commission
 500K emails
 151 Enron employees
 Cleaned version at Carnegie Mellon
   www.cs.cmu.edu/~enron/
 Relational DB version at USC
   www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf
Email Mining Targets
 Email Data Mining
    Content Analysis
       Key Word Queries
       Deception Clues
    Log Analysis
       Volume & Velocity
    Social Network Analysis
Content Analysis
Key Word Queries
 Yes, people do say self-incriminating things in their emails
    Fraud
    Corporate dysfunction
 Overwhelming false positives
 Need “smart” compound queries
 Good continuous auditing (CA) candidate
 Already scanning for spam, porn, etc.
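
A "smart" compound query can be sketched as follows. This is a minimal illustration and not a tool from the presentation: the `CompoundQuery` class, the term lists, and the sample messages are all hypothetical. The idea is that a message is flagged only when it matches every term group and none of the suppressing terms, which cuts down the false positives a single-keyword search produces.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CompoundQuery:
    """Fires only when every term group is matched and no suppressor occurs.

    all_of:  list of sets; at least one word from each set must appear.
    none_of: words that suppress a hit (for example, bulk-mail vocabulary).
    """
    name: str
    all_of: list
    none_of: set = field(default_factory=set)

    def matches(self, text: str) -> bool:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & self.none_of:
            return False
        return all(words & group for group in self.all_of)


# Hypothetical term lists, chosen only for illustration.
query = CompoundQuery(
    name="possible revenue manipulation",
    all_of=[{"revenue", "earnings", "reserve"}, {"hide", "backdate", "adjust"}],
    none_of={"unsubscribe"},
)

emails = [
    "Quarterly revenue is up, see the attached report.",
    "We need to backdate the contract so the revenue lands in Q3.",
    "Hide your earnings from the taxman! Click to unsubscribe.",
]

hits = [e for e in emails if query.matches(e)]
single_word_hits = [e for e in emails if "revenue" in e.lower()]
```

In this toy data the single keyword "revenue" flags two messages, while the compound query flags only the second one.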
Sender Deception -- Content
 Deceptive emails include:
    Fewer first-person pronouns to dissociate themselves from their own words
    Fewer exclusive words, such as but and except, to indicate a less complex story
    More negative emotion words because of the sender’s underlying feeling of guilt
    More action verbs to, again, indicate a less complex story
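
The four cues above can be turned into simple per-message rates. The sketch below is only an assumption about how that might look: the word lists are tiny hypothetical samples, the function name `cue_rates` is invented here, and action verbs are matched from a fixed list instead of a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny word lists; a real study would use a
# validated lexicon and a part-of-speech tagger for the verbs.
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}
EXCLUSIVE = {"but", "except", "without", "although", "unless"}
NEGATIVE_EMOTION = {"hate", "worthless", "afraid", "angry", "sorry", "worried"}
ACTION_VERBS = {"go", "went", "take", "took", "move", "moved", "send", "sent"}


def cue_rates(text):
    """Return the share of words in `text` that fall into each cue category."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty messages

    def rate(vocabulary):
        return sum(w in vocabulary for w in words) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative_emotion": rate(NEGATIVE_EMOTION),
        "action_verbs": rate(ACTION_VERBS),
    }


sample = cue_rates("I went to the office but I was worried")
```

The rates are only meaningful relative to a baseline, for example the same sender's earlier messages; a single message's numbers say little on their own.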
Sender Deception -- Identification
 Writeprint features
    Lexical -- characters & words
    Function words
    Root words
    Syntactic -- sentences
    Structural -- paragraphs
    Content-specific
Sender Deception -- Identification
 Number of potential features unlimited
    Optimum number can vary by context and language
 Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
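
Comparing a new email to a user profile can be sketched as follows. This is an illustration under stated assumptions, not the Writeprints method itself: the eight function words, the four summary features, the averaging in `profile`, and the use of cosine similarity are all choices made here for brevity.

```python
import math
import re

# Hypothetical subset of function words; real feature sets are far larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but", "i")


def writeprint(text):
    """Return a small stylometric feature vector for one email body."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_words = len(words) or 1
    features = [
        sum(len(w) for w in words) / n_words,      # lexical: mean word length
        len(set(lowered)) / n_words,               # lexical: vocabulary richness
        len(words) / (len(sentences) or 1),        # syntactic: words per sentence
        len(sentences) / (len(paragraphs) or 1),   # structural: sentences per paragraph
    ]
    # Relative frequency of each function word.
    features += [lowered.count(fw) / n_words for fw in FUNCTION_WORDS]
    return features


def profile(texts):
    """Average the feature vectors of a user's known emails."""
    vectors = [writeprint(t) for t in texts]
    return [sum(column) / len(column) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


known = [
    "I think the plan works. But the numbers look off.\n\nCall me.",
    "The draft is in the folder. I will send the rest to you.",
]
candidate = "I read the report. But the totals in the table look off."
similarity = cosine(profile(known), writeprint(candidate))
```

The features here sit on different scales, so a real comparison would normalize them first. Even this small version shows why real-time use is demanding: every user needs a stored profile, and every new message needs a full feature pass.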
Temporal/Log Analysis
Volume & Velocity
 Volume = number of emails a person sends and/or receives over a period of time.
 Velocity = how quickly the volume changes.
 Many external factors (e.g., vacations, seasonal activities, etc.) impact these numbers
 Need “rolling histogram”
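
The two definitions above translate directly into code. The sketch below assumes daily periods and reads "rolling histogram" as a sliding window over the per-period counts, with the window mean used as a baseline; both the reading and the function names are assumptions made for illustration.

```python
from collections import Counter, deque
from datetime import datetime


def daily_volume(timestamps):
    """Volume: number of emails per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def velocity(volumes):
    """Velocity: change in volume between consecutive periods."""
    return [later - earlier for earlier, later in zip(volumes, volumes[1:])]


def rolling_baseline(volumes, window):
    """Mean of the previous `window` periods; None until the window is full."""
    recent = deque(maxlen=window)
    baseline = []
    for v in volumes:
        baseline.append(sum(recent) / window if len(recent) == window else None)
        recent.append(v)
    return baseline


sent = [
    datetime(2001, 10, 1, 9, 0),
    datetime(2001, 10, 1, 15, 30),
    datetime(2001, 10, 2, 11, 0),
]
per_day = daily_volume(sent)

volumes = [10, 12, 11, 40, 9]            # made-up daily counts
changes = velocity(volumes)              # [2, -1, 29, -31]
baseline = rolling_baseline(volumes, 3)  # [None, None, None, 11.0, 21.0]
```

Comparing each day to the baseline built from the days before it, and not to a fixed threshold, is one way to absorb slow seasonal drift while still exposing the jump to 40.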
Volume & Velocity
 Key issue -- determining the optimum time intervals to sample the data
 Continuous monitoring cannot be continuous in terms of sampling in real time
 Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives
 Optimum time interval could vary by job title
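
The effect of the sampling interval can be seen by bucketing the same timestamps at different widths. The sketch below is illustrative only: the job titles and their intervals in `SAMPLING_INTERVAL` are hypothetical values, not recommendations from the presentation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical mapping from job title to sampling interval.
SAMPLING_INTERVAL = {
    "trader": timedelta(hours=1),
    "accountant": timedelta(days=1),
}


def bucket_counts(timestamps, interval, origin):
    """Count emails per bucket; bucket k covers [origin + k*interval, origin + (k+1)*interval)."""
    return Counter((ts - origin) // interval for ts in timestamps)


origin = datetime(2001, 10, 1)
sent = [
    datetime(2001, 10, 1, 0, 30),
    datetime(2001, 10, 1, 0, 45),
    datetime(2001, 10, 1, 5, 0),
]

hourly = bucket_counts(sent, SAMPLING_INTERVAL["trader"], origin)
daily = bucket_counts(sent, SAMPLING_INTERVAL["accountant"], origin)
```

At the hourly width the three messages look like a burst followed by a gap; at the daily width they are a single unremarkable count. Which view is the right one depends on what is normal for the role.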
Social Network Analysis
Social Network Analysis
 Social relationships as an undirected graph
 Importance of understanding relationships within the flow of email exchanges
Social Network Analysis in Emails
 Emails are semi-structured data:
    sender
    primary recipient(s)
    copied recipient(s)
    date
    subject line
 Social groups and cliques
 CA = who doesn’t belong?
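
The "who doesn't belong?" question can be sketched with the header fields listed above. In this minimal example the dictionary layout of a message, the function names, and the rule used (flag a participant who has no prior link to anyone else on the message) are assumptions chosen for illustration; real clique detection would be more involved.

```python
from collections import defaultdict


def build_graph(emails):
    """Undirected graph: an edge links the sender to every To and Cc recipient."""
    graph = defaultdict(set)
    for msg in emails:
        for other in msg["to"] + msg["cc"]:
            if other != msg["sender"]:
                graph[msg["sender"]].add(other)
                graph[other].add(msg["sender"])
    return graph


def outsiders(graph, msg):
    """Participants of `msg` with no prior link to any other participant."""
    participants = {msg["sender"], *msg["to"], *msg["cc"]}
    flagged = []
    for person in sorted(participants):
        others = participants - {person}
        if not (graph.get(person, set()) & others):
            flagged.append(person)
    return flagged


# Made-up history of earlier messages.
history = [
    {"sender": "alice", "to": ["bob", "carol"], "cc": []},
    {"sender": "bob", "to": ["carol"], "cc": ["alice"]},
    {"sender": "dave", "to": ["erin"], "cc": []},
]
graph = build_graph(history)

new_msg = {"sender": "alice", "to": ["bob"], "cc": ["erin"]}
flagged = outsiders(graph, new_msg)
```

Here `erin` is flagged because her only earlier contact is `dave`, who is not on the new message. The graph is built from history only, so the new message does not vouch for itself.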
Thread Analysis – This?
[Figure: email thread drawn along a Time axis, with nodes labeled S, R, and C]
Thread Analysis – Or this?
[Figure: alternative email thread drawn along a Time axis, with nodes labeled S, R, and C]
Integrating Content Analysis and Social Network Analysis
[Figure: Email Data Mining diagram showing Content Analysis (Key Word Queries, Deception Clues), Log Analysis (Volume & Velocity), and Social Network Analysis]
Challenges of Email Mining
 Textual
    Inconsistent use of abbreviations
    Misspelled words
    Smileys, etc.
    Replies, replies, and more replies…
 Inability to identify:
    Identities of email participants
       [email protected]
    Roles and responsibilities
What Do the Enron Emails Show?
 People do say the darnedest things
 What did he know and when did he know it?
 Verified numerous bodies of email data mining research
    Content analysis
    Social network analysis
Tools
 Content monitoring
    eSoft Corporation’s ThreatWall
    Symantec’s Mail Security 8x00 Series
    Vericept Corporation’s Vericept Content 360°
    Reconnex Corporation’s iGuard Appliance
    InBoxer, Inc. Anti-Risk Appliance
 Social networks
    Microsoft SNARF
    Heer’s Vizster
Research Opportunities
Research Questions
 Role of email monitoring in overall CA environment?
    Join SNA with examination of textual patterns.
    Link SNA with control environment
    Frauds/control overrides footprint?
    What email cleaning is required for CA purposes?
    Privacy and policy issues?
    Lessons from existing commercial products?
Your Questions
Thank You
[email protected]
[email protected]