Text Mining SEC Filings for Fraud Detection
Fletcher Glancy
ISQS 7342
Research Issues
1. Can fraud be detected from SEC filings?
2. Can text mining provide a methodology for
detection of potential fraud?
3. If text mining can provide an indication of
potential fraud, which algorithm gives the
best performance?
12/2/2008
Fletcher Glancy
Brief Background
• Corporate governance fraud has been a major
concern, e.g., Enron, WorldCom, HealthSouth.
• Detection has typically come only after many years of abuse.
• Most techniques involve ratio analysis.
• Churyk et al. used Context Analysis to detect
fraud in the MD&A sections of 10-K filings.
Potential Strengths of Text Mining
• Text mining (TM) can be automated.
• The results can be used for further data
mining.
• TM eliminates researcher bias that is
potentially present in Context Analysis.
Potential Problems/Weaknesses
• There is no context in text mining, only
statistics.
• It is difficult to understand the relationships
with a document-term matrix.
• Text mining is unable to handle negatives or punctuation.
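
The "only statistics" weakness can be illustrated with a minimal sketch. It assumes a plain bag-of-words count, which is one common text mining representation and not necessarily the one used in this study. The two example sentences and the helper name `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count them.
    Word order and punctuation are discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two hypothetical sentences with opposite meanings.
a = bag_of_words("Results were good, not bad.")
b = bag_of_words("Results were bad, not good.")

# Both sentences reduce to the same term counts, so a model that
# sees only these counts cannot tell them apart.
same_counts = (a == b)
```

Because the counts are identical, any negation handling has to happen before the counts are taken.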
Narrow the Focus - Negatives
• Antonyms – Word Opposites.
• Negatives – not good = bad.
• Interference by articles.
Not a good day.
• Interference by modifiers.
Not highly motivated.
Possible Data Preparation Options
• Preprocessing to remove articles.
• Convert punctuation to text.
Replace ‘;’ with semicolon.
• Combine “not” with the word it negates.
“Not highly motivated” becomes
“highly not_motivated”.
• Create a not_word token and replace it with its antonym.
“not_dead” is replaced with “alive”.
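
The options above could be combined along the following lines. This is only a sketch under stated assumptions: the function name `preprocess`, the modifier list, the name given to the period, and the antonym table are all hypothetical. The slides specify only the article removal, the ‘;’ replacement, and the two negation examples.

```python
ARTICLES = {"a", "an", "the"}
# The slides name only ';'. The entry for '.' is an assumption.
PUNCT_NAMES = {";": "semicolon", ".": "period"}
# Hypothetical list of modifiers that "not" is allowed to skip over.
MODIFIERS = {"highly", "very", "really", "quite"}
# Hypothetical antonym lookup, seeded with the examples from the slides.
ANTONYMS = {"not_dead": "alive", "not_good": "bad"}

def preprocess(text):
    # Convert punctuation marks to word tokens.
    for mark, name in PUNCT_NAMES.items():
        text = text.replace(mark, f" {name} ")
    # Lowercase, split on whitespace, and drop articles.
    tokens = [t for t in text.lower().split() if t not in ARTICLES]
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "not":
            # Skip modifiers to find the word that "not" negates.
            j, skipped = i + 1, []
            while j < len(tokens) and tokens[j] in MODIFIERS:
                skipped.append(tokens[j])
                j += 1
            if j < len(tokens) and tokens[j] not in PUNCT_NAMES.values():
                merged = "not_" + tokens[j]
                out.extend(skipped)
                # Use the antonym when one is known, else keep not_word.
                out.append(ANTONYMS.get(merged, merged))
                i = j + 1
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

example = preprocess("Not highly motivated")
```

Each step could also be switched on separately, which is what testing the alternatives individually and cumulatively would require.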
Testing Data Preparation Options
• Select/Create text database.
– 10-K notes and MD&A.
– Firms that have received an SEC Accounting and Auditing Enforcement Release (AAER).
• Preprocess with each alternative individually
and cumulatively.
• Create a document-term matrix and compute its SVD.
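
A minimal sketch of the matrix and SVD step, assuming NumPy is available. The four short documents are invented toy text, not filings, and the helper `document_term_matrix` and the choice of `k = 2` components are illustrative assumptions.

```python
import re
from collections import Counter
import numpy as np

def document_term_matrix(docs):
    """Return a (documents x terms) count matrix and its vocabulary."""
    counts = [Counter(re.findall(r"[a-z_]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[t] for t in vocab] for c in counts], dtype=float)
    return matrix, vocab

# Toy documents (hypothetical), already preprocessed so that
# negations appear as single not_word tokens.
docs = [
    "revenue increased and cash flow was strong",
    "revenue not_increased and restatement was required",
    "cash flow was strong and revenue increased",
    "restatement was required after revenue not_increased",
]

X, vocab = document_term_matrix(docs)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the first k components as a low-dimensional document representation.
k = 2
doc_scores = U[:, :k] * s[:k]
```

Running the same steps on each preprocessing alternative would give one set of `doc_scores` per alternative to compare.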
Testing Data Preparation Options (continued)
• Calculate the variance of the document set using SVD.
• Fit a logistic regression on the set's SVD components and
calculate the variance.
• Test for predictability using validation set.
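
The three steps above might look like the following sketch, assuming NumPy. Everything here is a stand-in: the data are synthetic counts generated in the code, the labels are invented, and the logistic regression is a plain gradient-descent fit substituted for whatever statistical package would actually be used. "Variance" is read here as the share of variance captured by each SVD component of the mean-centered matrix, which is one possible interpretation of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a document-term matrix: 40 documents, 12 terms.
# Label 1 = firm received an AAER, 0 = it did not (all values invented).
n_docs, n_terms = 40, 12
y = np.repeat([0, 1], n_docs // 2)
X = rng.poisson(2.0, size=(n_docs, n_terms)).astype(float)
X[y == 1, :3] += 3.0  # flagged documents use the first three terms more

# Hold out a validation set.
order = rng.permutation(n_docs)
train, valid = order[:30], order[30:]

# SVD of the mean-centered training matrix.
mean = X[train].mean(axis=0)
U, s, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)
explained = s**2 / np.sum(s**2)  # share of variance per component

# Project both sets onto the first k right singular vectors.
k = 2
Z_train = (X[train] - mean) @ Vt[:k].T
Z_valid = (X[valid] - mean) @ Vt[:k].T

def fit_logistic(Z, labels, lr=0.1, steps=2000):
    """Logistic regression with an intercept, fit by gradient descent."""
    A = np.column_stack([np.ones(len(Z)), Z])
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - labels) / len(labels)
    return w

def predict(w, Z):
    A = np.column_stack([np.ones(len(Z)), Z])
    return (1.0 / (1.0 + np.exp(-A @ w)) >= 0.5).astype(int)

w = fit_logistic(Z_train, y[train])
accuracy = float(np.mean(predict(w, Z_valid) == y[valid]))
```

Comparing `explained` and `accuracy` across the preprocessing alternatives is one way the options could be ranked.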
Questions?
Welcome to my potential
dissertation topic!