DATA MINING
APPLICATIONS
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr Eamonn Keogh; University of California Riverside;
[email protected]
7/10/07
7/10/07--SEDE'07
SEDE'07
1
The 2000 ozone hole over the Antarctic, as seen by EP/TOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
OBJECTIVE
Explore some of the applications of data mining techniques.
Data Mining Applications Outline
• Introduction – Data Mining Overview
• Classification (Prediction, Forecasting)
• Clustering
• Association Rules (Link Analysis)
• Applications
• Fraud Detection & Illegal Activities
• Facial Recognition
• Cheating & Plagiarism
• Bioinformatics
• Conclusions
Data Mining Overview
• Finding hidden information in a database
• Fit data to a model
• You must know what you are looking for
• You must know how to look for it
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description: Classification (Profiling)
Behavior: Clustering (Similarity)
Associations: Link Analysis
Classification Applications
• Teachers classify students’ grades as A, B, C, D, or F.
• Letter Recognition
• Handwriting Recognition
• Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
• Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254
Classification Example
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
[Figure panels: Katydids, Grasshoppers]
(c) Eamonn Keogh, [email protected]
[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids]
(c) Eamonn Keogh, [email protected]
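One simple way to make the Katydid/Grasshopper decision concrete is a nearest-neighbor rule: give the unlabeled insect the label of the closest annotated examples in the (abdomen length, antenna length) plane. The sketch below is only an illustration. The ten measurements, the `classify` helper, and the choice of a nearest-neighbor rule are all assumptions; none of them is taken from the slides.

```python
import math

# Hypothetical (abdomen length, antenna length) measurements.
# These values are invented for illustration; they are not the slide's data.
TRAINING = [
    ((8.0, 9.1), "Katydid"),
    ((5.4, 8.5), "Katydid"),
    ((6.6, 7.0), "Katydid"),
    ((4.7, 7.9), "Katydid"),
    ((7.0, 6.6), "Katydid"),
    ((1.9, 1.0), "Grasshopper"),
    ((3.5, 2.1), "Grasshopper"),
    ((4.7, 2.8), "Grasshopper"),
    ((6.0, 3.3), "Grasshopper"),
    ((8.0, 3.9), "Grasshopper"),
]


def classify(sample, training=TRAINING, k=1):
    """Return the majority label among the k training points nearest to sample."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    labels = [label for _, label in nearest]
    # With two classes and an odd k there is always a strict majority.
    return max(set(labels), key=labels.count)
```

Calling `classify((5.1, 7.0))` labels a new insect from its two measurements; passing an odd `k` greater than 1 makes the decision less sensitive to a single unusual training example.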
Clustering Applications
• Targeted Marketing
• Determining Gene Functionality
• Identifying Species
• Clustering vs. Classification
  • No prior knowledge
  • Number of clusters
  • Meaning of clusters
  • Unsupervised learning
http://149.170.199.144/multivar/ca.htm
What is Similarity?
(c) Eamonn Keogh, [email protected]
Association Rules Applications
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• www.amazon.com
• Book Stores
• Department Stores
• Advertising
• Product Placement
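A rule such as "people who buy diapers also buy beer" is usually judged by two standard measures that the slide does not spell out: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the five baskets and the function names are hypothetical and serve only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, invented for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(BASKETS, {"diapers", "beer"})
rule_confidence = confidence(BASKETS, {"diapers"}, {"beer"})
```

With these baskets the rule holds in two of five transactions, and in two of the three transactions that contain diapers.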
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Fraud Detection
• Identify fraudulent behavior
• Used extensively in the financial, law enforcement, health care, and other sectors
• http://www.aaai.org/AITopics/html/fraud.html
• SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
• Neural Technologies: http://www.neuralt.com/fraud_management.html
Law Enforcement
• Identify suspect behavior and relationships
• I2 Inc.
  • Investigative analytic/visualization software
  • http://www.i2inc.com
• Social Network Analysis – Analyze patterns of relationships
  • Relationships: personal, religious, operational, etc.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm
Facial Recognition
• Based upon features in face
• Convert face to a feature vector
• Less invasive than other biometric techniques
• http://www.face-rec.org
• http://computer.howstuffworks.com/facialrecognition.htm
• SIMS: http://www.casinoincidentreporting.com/Products.aspx
(c) Eamonn Keogh, [email protected]
Cheating on Multiple Choice Tests
• Similarity between tests based on number of common wrong answers.
  (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol. 27, no. 7, 2000, pp. 909-923.)
• The number of common correct answers is often ignored.
• H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol. 73, no. 4, pp. 349-351):
  H-H = (Number of exact answers in common) / (Number of different answers)
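The H-H ratio can be sketched in a few lines for one pair of answer sheets. This is a hedged illustration, not the authors' implementation. Following the earlier bullets, the numerator here counts only identical wrong answers; that reading of "exact answers in common" is an assumption. The function name `hh_index` and the handling of sheets with no differing answers are also assumptions.

```python
def hh_index(answers_a, answers_b, answer_key):
    """Ratio of shared wrong answers to differing answers for two students."""
    if not (len(answers_a) == len(answers_b) == len(answer_key)):
        raise ValueError("all three answer sequences must have the same length")
    triples = list(zip(answers_a, answers_b, answer_key))
    # Assumption: only identical *wrong* answers count toward the numerator.
    common_wrong = sum(1 for a, b, key in triples if a == b and a != key)
    different = sum(1 for a, b, _ in triples if a != b)
    if different == 0:
        # No differing answers: the ratio is unbounded if any errors are
        # shared, and zero if both sheets are simply all correct.
        return float("inf") if common_wrong else 0.0
    return common_wrong / different
```

For example, `hh_index("ABCCABDDAA", "ABCCABDDBB", "ABCDABCDAB")` compares two ten-question sheets against a key; larger values mean the pair shares more wrong answers relative to how often they disagree.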
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
www.bioalgorithms.info; chapter 6; Gene Prediction
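The two steps on the Central Dogma slide can be reproduced with a short sketch. It mirrors the slide's simplification, in which transcription is shown as replacing T with U on the displayed strand. The codon table is deliberately partial and contains only the codons needed for this one sequence; the function names are illustrative assumptions.

```python
# Partial codon table: only the codons that occur in the slide's sequence.
# A full table has 64 entries, including stop codons.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna):
    """Slide-level view of transcription: replace thymine (T) with uracil (U)."""
    return dna.replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time and map each codon to an amino acid."""
    usable = len(rna) - len(rna) % 3  # ignore any trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
protein = translate(rna)
```

Here `rna` matches the RNA line of the slide and `protein` spells out the one-letter amino acid string shown as the protein. A codon outside the partial table raises `KeyError`.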
miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
  • Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …
Questions
• If each cell in an organism contains the same DNA –
  • How does each cell behave differently?
  • Why do cells behave differently during childhood/?
  • What causes some cells to act differently – such as during disease?
• DNA contains many genes, but only a few are being transcribed – why?
• One answer – miRNA
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
Human Genome
• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of Chimps. What makes the difference?
Visualization from UCR: dnaQT.mov
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
siRNA may be artificially added to cell!
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
Computer Science & Bioinformatics
• Algorithms
• Data Structures
• Improving efficiency
• Data Mining
• Biologists don’t usually understand or even appreciate what Computer Science can do
• Issues:
  • Scalability
  • Fuzzy
• We will look at:
  • Microarray Clustering
  • TCGR
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis
• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means
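To illustrate how K-means can group genes that "behave in a similar manner," the sketch below clusters a handful of expression profiles. The gene names and expression values are hypothetical, and seeding the centroids with the first `k` profiles is a simplifying assumption made to keep the example deterministic; practical analyses typically use random or smarter initialization and real microarray measurements.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means: returns (cluster index per point, final centroids)."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical expression levels of four genes across three samples.
PROFILES = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (5.0, 5.1, 4.8),
    "geneC": (1.1, 0.9, 1.0),
    "geneD": (4.9, 5.2, 5.0),
}

labels, centers = kmeans(list(PROFILES.values()), k=2)
clusters = dict(zip(PROFILES, labels))
```

In this toy data, `clusters` places the two low-expression genes in one group and the two high-expression genes in the other, which is the kind of grouping the slide's questions are after.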
Microarray Data - Clustering
“Gene expression profiling identifies clinically relevant subtypes of prostate cancer,” Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
miRNA Research Issues
• Predict / Find miRNA in genomic sequence
• Predict miRNA targets
• Identify miRNA functions
Temporal CGR (TCGR)
• 2D Array
  • Each Row represents counts for a particular window in sequence
    • First row – first window
    • Last row – last window
    • We start successive windows at the next character location
  • Each Column represents the counts for the associated pattern in that window
    • Initially we have assumed order of patterns is alphabetic
• Size of TCGR depends on sequence length and subpattern length
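The array described above can be sketched as follows. This is a reading of the slide's bullets, not the authors' code: each row is a window that starts one character after the previous one, and each column counts one sub-pattern, with columns in alphabetical order. The function name `tcgr`, the RNA alphabet `ACGU`, and counting overlapping sub-patterns inside a window are assumptions.

```python
from itertools import product


def tcgr(sequence, window=5, pattern_len=3, alphabet="ACGU"):
    """Return (patterns, rows): rows[i][j] counts patterns[j] in window i."""
    # Columns: every sub-pattern of the given length, in alphabetical order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {pattern: j for j, pattern in enumerate(patterns)}
    rows = []
    # Rows: successive windows, each starting at the next character.
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Assumption: overlapping occurrences inside the window are counted.
        for i in range(window - pattern_len + 1):
            counts[column[chunk[i:i + pattern_len]]] += 1
        rows.append(counts)
    return patterns, rows


patterns, rows = tcgr("ACGUACG", window=5, pattern_len=3)
```

With these settings the array has one row per window (sequence length minus window length plus one) and 4^3 = 64 columns, which matches the slide's point that the size depends on sequence length and sub-pattern length. Characters outside the alphabet raise `KeyError`.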
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
TCGR – Mature miRNA
(Window=5; Pattern=3)
[Figure panels: C. elegans, Homo sapiens, Mus musculus, All Mature; labeled patterns: ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
[Figure panels: POSITIVE, NEGATIVE]
TCGRs for Xue Test Data
[Figure panels: POSITIVE, NEGATIVE]
Conclusions
• Not magic
• Doesn’t work for all applications
  • Stock Market Prediction
• Issues
  • Privacy
  • Data
• Here are some infamous examples of failed data mining applications
Dallas Morning News
October 7, 2005
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
BIG BROTHER?
• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
  • http://www.thedenverchannel.com/news/9559707/detail.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS