ACM Distinguished Speakers Program
ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 and NIH Grant No. 1R21HG005912-01A1
Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email protected]
•2/25/13 - Union University
The 2000 ozone hole over the Antarctic, as seen by EP/TOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
Data Mining Outline
• Introduction
• Techniques
    • Classification
    • Clustering
    • Association Rules
• Examples
Explore some interesting data mining applications
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
But it isn’t Magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner
CLASSIFICATION
• Assign data into predefined groups or classes.
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description                  Behavior                  Associations
Classification (Profiling)   Clustering (Similarity)   Link Analysis
Classification Ex: Grading
x >= 90: A
x < 90:
    x >= 80: B
    x < 80:
        x >= 70: C
        x < 70:
            x >= 60: D
            x < 60: F
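The grading tree can be read as a chain of threshold tests on the score x. A minimal Python sketch, assuming the D/F boundary is 60 and using an illustrative function name `grade` that does not come from the slides:

```python
def grade(x: float) -> str:
    """Classify a numeric score into a letter grade by walking the tree."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed cutoff between D and F
        return "D"
    return "F"


for score in (95, 85, 72, 60, 41):
    print(score, grade(score))
```

Each score falls into exactly one predefined class, which is what makes this a classification problem rather than a clustering one.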
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Katydids
Grasshoppers
(c) Eamonn Keogh, [email protected]
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
Previously unseen instance = 11
The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
(c) Eamonn Keogh, [email protected]
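One way to attack this problem is a nearest-neighbor vote. The slides do not name a classifier here, so the sketch below is an assumption chosen for illustration: it uses the ten training rows from the insect table and a hypothetical helper `knn_predict` with k = 3.

```python
from collections import Counter
import math

# Training database from the insect table:
# (abdomen length, antennae length, insect class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_predict(abdomen, antennae, k=3):
    """Return the majority class among the k closest training instances."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.hypot(row[0] - abdomen, row[1] - antennae),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Instance 11 from the table: abdomen 5.1, antennae 7.0.
print(knn_predict(5.1, 7.0))  # Katydid
```

Under this sketch the three closest rows to instance 11 are insects 7, 5, and 1, so the vote is two Katydids to one Grasshopper. A different classifier or a different k could, in principle, answer differently.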
[Figure: scatter plot of Antenna Length (y-axis, 1 to 10) against Abdomen Length (x-axis, 1 to 10), with Grasshoppers and Katydids plotted as separate groups]
(c) Eamonn Keogh, [email protected]
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm
Facial Recognition
(c) Eamonn Keogh, [email protected]
Handwriting Recognition
[Figure: George Washington Manuscript, shown with a time-series plot (y-axis 0 to 1, x-axis 0 to 450)]
(c) Eamonn Keogh, [email protected]
Rare Event Detection
Dallas Morning News
October 7, 2005
Classification Performance
True Positive
False Negative
False Positive
True Negative
© Prentice Hall
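The four outcomes can be counted directly from paired actual and predicted labels. A minimal sketch; the helper name `confusion_counts` and the fraud/ok labels are made up for illustration and are not data from the talk:

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FN, FP, TN with one class treated as 'positive'."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive case, correctly flagged
        elif a == positive:
            fn += 1  # positive case, missed
        elif p == positive:
            fp += 1  # negative case, wrongly flagged
        else:
            tn += 1  # negative case, correctly passed
    return tp, fn, fp, tn


# Made-up labels for illustration only.
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]
tp, fn, fp, tn = confusion_counts(actual, predicted, positive="fraud")
print(tp, fn, fp, tn)  # 1 1 1 3
```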
Behavior Based Classification/Prediction
• Credit Card Fraud Detection
• Credit Score
• Home Mortgage Approval
CLUSTERING
• Partition data into previously undefined groups.
http://149.170.199.144/multivar/ca.htm
What is Similarity?
(c) Eamonn Keogh, [email protected]
Two Types of Clustering
Hierarchical
Partitional
(c) Eamonn Keogh, [email protected]
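As an illustration of partitional clustering, the sketch below runs k-means on the ten insect measurements from the classification example with the class labels removed. k-means is an assumption here (the slides name only the two families of methods), as are the function name `kmeans` and the choice to seed with the first two points.

```python
import math

# The ten (abdomen, antennae) measurements from the insect table, labels removed.
POINTS = [
    (2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
    (2.9, 1.9), (6.1, 6.6), (0.5, 1.0), (8.3, 6.6), (8.1, 4.7),
]


def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means; returns (cluster index per point, final centroids)."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Assumption: seed with the first two points so the run is deterministic.
labels, centers = kmeans(POINTS, [POINTS[0], POINTS[1]])
print(labels)  # [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
```

With this seeding the two groups happen to coincide with the Grasshopper and Katydid rows, even though the algorithm never saw the labels. Other seeds or other data need not behave as neatly.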
Hierarchical Clustering Example
Iris Data Set
Versicolor
Setosa
Virginica
• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster .
ASSOCIATION RULES / LINK ANALYSIS
• Find relationships between data
ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
• http://www.amazon.com/Data-Mining-Introductory-AdvancedTopics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
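A rule such as "diapers => beer" is usually scored by its support and confidence. Those two measures are standard but are not named on the slide, and the baskets below are made up for illustration. A minimal sketch:

```python
# Made-up market baskets, for illustration only (not data from the talk).
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in TRANSACTIONS if itemset <= t)
    return hits / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for antecedent => consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"diapers", "beer"}))       # 0.4 (2 of 5 baskets)
print(confidence({"diapers"}, {"beer"}))  # about 0.67 (2 of 3 diaper baskets)
```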
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Data Mining Outline
• Introduction
• Techniques
• Examples
    • Vision Mining
    • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
    • Bioinformatics
Vision Mining
• License Plate Recognition
    • Red Light Cameras
    • Toll Booths
    • http://www.licenseplaterecognition.com/
• Computer Vision
    • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
Arnet Miner
• http://arnetminer.org/
DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
    (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
    (translation)
Protein: Amino Acid
www.bioalgorithms.info; chapter 6; Gene Prediction
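The two steps can be traced on the slide's own sequence. The sketch below is a simplification under stated assumptions: transcription is modeled as replacing T with U, as the slide shows, and the codon table is deliberately partial, covering only the seven codons that occur in this sequence. The function names are illustrative.

```python
# Partial codon table: only the codons that occur in the slide's sequence.
CODON_TO_AMINO = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna: str) -> str:
    """Write the DNA sequence as RNA by replacing thymine (T) with uracil (U)."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time and map each codon to an amino acid."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TO_AMINO[c] for c in codons)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The one-letter amino acid codes for this sequence spell PEPTIDE. A sequence containing any codon outside the partial table would raise a KeyError, which is intentional for a sketch this small.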
Human Genome
• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of Chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
siRNA may be artificially added to cell!
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
    • Cause some cancers
    • Guide embryo development
    • Regulate cell differentiation
    • Associated with HIV
    • …
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure: TCGR panels for C Elegans, Homo Sapiens, Mus Musculus, and All Mature; patterns labeled ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data
[Figure: TCGR panels labeled POSITIVE and NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
BIG BROTHER?
• Total Information Awareness
    • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
    • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
    • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
    • http://blog.wired.com/27bstroke6/2008/02/us-terrorwatch.html
• CAPPS
    • http://en.wikipedia.org/wiki/CAPPS
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
My DM Toolbelt
• C, C++
• Perl, Ruby
• Weka
• R, SAS
• Excel, XLMiner
• Vi, Word, …
• Grep, sed, …