ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; [email protected]
2/25/09 - GCSU
The 2000 ozone hole over the Antarctic, as seen by EPTOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
Data Mining Outline
• Introduction
• Techniques
  • Classification
  • Clustering
  • Association Rules
• Examples
Explore some interesting data mining applications
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
But it isn’t Magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description, Behavior, Associations
Classification (Profiling), Clustering (Similarity), Link Analysis
CLASSIFICATION
Assign data into predefined groups or classes.
Classification Ex: Grading
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F
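The grading tree can be read as a chain of threshold tests from the root down. The sketch below is one way to write that chain in Python; the function name `grade` is hypothetical, and it assumes the F branch is simply the complement of the D test (scores below 60).

```python
def grade(x: float) -> str:
    """Walk the grading tree from the root, testing the highest cutoff first."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    # Assumption: everything below the D cutoff falls through to F.
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 59):
        print(score, grade(score))
```

Each `if` corresponds to one internal node of the tree, so a score is classified after at most four comparisons.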
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
[Figure: images of Katydids and Grasshoppers]
(c) Eamonn Keogh, [email protected]
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????       (previously unseen instance)
The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
(c) Eamonn Keogh, [email protected]
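The slide leaves the label of instance 11 open and does not name a classifier. As one illustration only, the sketch below applies a k-nearest-neighbor vote with Euclidean distance to the ten labeled rows; the choice of k-NN, the value k = 3, and the names `TRAINING` and `knn_classify` are all assumptions, not something the slides prescribe.

```python
import math
from collections import Counter

# Training rows copied from the insect table: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_classify(abdomen, antennae, k=3):
    """Return the majority class among the k nearest training rows (Euclidean distance)."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.hypot(row[0] - abdomen, row[1] - antennae),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Instance 11 from the table: abdomen 5.1, antennae 7.0.
    print(knn_classify(5.1, 7.0))
```

Under these assumptions the three nearest rows to instance 11 are insects 7, 5, and 1, so the vote comes out as Katydid. A different classifier or a different k could in principle disagree.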
[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10), with Grasshoppers and Katydids as the two classes]
(c) Eamonn Keogh, [email protected]
Facial Recognition
(c) Eamonn Keogh, [email protected]
Handwriting Recognition
[Figure: George Washington Manuscript, with a plotted series (vertical axis 0 to 1, horizontal axis 0 to 450)]
(c) Eamonn Keogh, [email protected]
Rare Event Detection
Dallas Morning News
October 7, 2005
CLUSTERING
Partition data into previously undefined groups.
http://149.170.199.144/multivar/ca.htm
What is Similarity?
(c) Eamonn Keogh, [email protected]
Two Types of Clustering
Hierarchical
Partitional
(c) Eamonn Keogh, [email protected]
Hierarchical Clustering Example
Iris Data Set
Versicolor
Setosa
Virginica
• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster
ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data
ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
• http://www.amazon.com/Data-Mining-Introductory-AdvancedTopics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
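A rule such as "people who buy diapers also buy beer" is commonly scored by its support and confidence. The slides do not define these measures, so the sketch below uses their usual textbook definitions as an assumption. The basket data and the helper names `support` and `confidence` are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from transaction counts.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets, invented for illustration only.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"diapers", "beer"}))
    print(confidence(BASKETS, {"diapers"}, {"beer"}))
```

In this made-up data, diapers and beer appear together in two of five baskets, and two of the three diaper baskets also contain beer.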
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Data Mining Outline
• Introduction
• Techniques
• Examples
  • Vision Mining
  • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  • Bioinformatics
Vision Mining
• License Plate Recognition
  • Red Light Cameras
  • Toll Booths
  • http://www.licenseplaterecognition.com/
• Computer Vision
  • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
transcription
RNA: CCUGAGCCAACUAUUGAUGAA
translation
Protein (Amino Acid)
www.bioalgorithms.info; chapter 6; Gene Prediction
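The two steps on this slide can be traced on the slide's own sequence. The sketch below treats the DNA string as the coding strand (so transcription reduces to replacing T with U, as the slide's RNA line shows) and translates with a codon table deliberately limited to the seven codons that occur here. The names `transcribe`, `translate`, and `CODON_TO_AMINO_ACID` are hypothetical, and start and stop codons are not handled.

```python
# Codon table restricted to the codons that occur in the slide's RNA string;
# a full translation would need all 64 codons of the standard genetic code.
CODON_TO_AMINO_ACID = {
    "CCU": "P",  # proline
    "CCA": "P",  # proline
    "GAG": "E",  # glutamate
    "GAA": "E",  # glutamate
    "ACU": "T",  # threonine
    "AUU": "I",  # isoleucine
    "GAU": "D",  # aspartate
}


def transcribe(dna: str) -> str:
    """Rewrite a coding-strand DNA string as RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time and look up each codon.

    Raises KeyError for any codon missing from the partial table above.
    """
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TO_AMINO_ACID[c] for c in codons)


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```

Read in one-letter amino acid codes, the example sequence spells PEPTIDE.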
Human Genome
• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
siRNA may be artificially added to cell!
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993, but significance not widely appreciated until 2001
• Impact / prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Cause some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …
TCGR – Mature miRNA
(Window=5; Pattern=3)
[Figure: TCGR panels for C. elegans, Homo sapiens, Mus musculus, and All Mature; labeled patterns: ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.
[Figure: TCGR panels labeled POSITIVE and NEGATIVE]
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis
• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means
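To make the question "which genes behave in a similar manner?" concrete, the sketch below runs a plain k-means (Lloyd's algorithm) over a tiny expression matrix. The expression values are made up, and the seeding rule (first k points as initial centroids), the fixed iteration count, and the name `kmeans` are assumptions chosen to keep the example deterministic; it is not the procedure used in any study cited in these slides. It needs Python 3.8 or later for `math.dist`.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute means."""
    centroids = [list(p) for p in points[:k]]  # assumption: seed with the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical expression levels for six genes across three samples (made-up numbers).
EXPRESSION = [
    (0.1, 0.2, 0.1),  # gene g1
    (5.0, 5.2, 4.9),  # gene g2
    (0.2, 0.1, 0.3),  # gene g3
    (4.8, 5.1, 5.3),  # gene g4
    (0.0, 0.3, 0.2),  # gene g5
    (5.1, 4.7, 5.0),  # gene g6
]

if __name__ == "__main__":
    labels, centers = kmeans(EXPRESSION, k=2)
    print(labels)
```

With this seeding, the low-expression genes (g1, g3, g5) land in one cluster and the high-expression genes (g2, g4, g6) in the other. Real microarray data would usually need normalization and a more careful initialization.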
Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
BIG BROTHER?
• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236