WPI Center for Research in Exploratory Data and Information Analysis
CREDIA
Research on Data Mining and
Knowledge Discovery at WPI
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
Outline of this talk
• Short tutorial on Data Mining and
Knowledge Discovery in Databases (KDD)
• Sample ongoing KDD research projects at
WPI
Need for Data Mining
• Data are being gathered and stored
extremely fast
– Currently, the amount of new data stored in digital computer
systems every day is roughly equivalent to 3000 pages of text for
every person on Earth (estimate based on a projection to 2003 of a
study led by Lyman & Varian at UC-Berkeley in 2000).
• Computational tools and techniques are
needed to help humans in summarizing,
understanding, and taking advantage of
accumulated data
What is Data Mining?
or more generally, Knowledge Discovery in Databases (KDD)
“Non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data”
[Fayyad et al. 1996]
• Raw Data → Data Mining → Patterns
» Analytical and Statistical Patterns (rules, decision trees, …)
» Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases."
AI Magazine, pp. 37-54. Fall 1996.
Data Analysis (KDD) Process
• data sources
– quantitative
– qualitative
• data management
– databases
– data warehouses
• data “pre”processing
– noisy/missing data
– dim. reduction
• clean data
• data analysis / data mining
– analytical
– statistical
– visual
• models
• model/pattern evaluation
• “good” model
• model/patterns deployment on new data
– prediction
– decision support
KDD is Interdisciplinary
techniques come from multiple fields
• Machine Learning (AI)
– Contributes (semi-)automatic
induction of empirical laws from
observations & experimentation
• Statistics
– Contributes language, framework,
and techniques
• Pattern Recognition
– Contributes pattern extraction and
pattern matching techniques
• Databases
– Contributes efficient data
storage, data cleansing, and
data access techniques
• Data Visualization
– Contributes visual data displays
and data exploration
• High Performance Computing
– Contributes techniques to
efficiently handle complexity
• Application Domain
– Contributes domain knowledge
Data Mining Modes
• Confirmatory
(verification)
– Given a hypothesis, verify
its validity against the data
• Exploratory
(discovery)
– Prescriptive patterns
• Patterns for predicting
behavior of newly
encountered entities
– Descriptive patterns
• Patterns for presenting the
behavior of observed
entities in a human-understandable format
Analytical and Visual Data
Mining
• Analytical
– A model that
represents the data is
constructed using
computational methods
• Visual
– Data are displayed on a
computer screen using
colors and shapes
– Patterns in the data are
identified by the
human (user) eye.
What do you want to learn from your data?
KDD approaches
• regression
• classification
– IF A & B THEN
– IF A & D THEN
• clustering
• change/deviation detection
• summarization
• dependency/assoc. analysis
– IF a & b & c THEN d & k
– IF k & a THEN e
– A, B -> C 80%
– C, D -> A 22%
[Figure: sample output for each approach, including a class table over attributes A, B, C, D (labels blue, orange), a bar chart (East, West, North by quarter), and a weighted dependency graph over A, B, C, D (weights 0.5, 0.3, 0.75)]
Commercial Data Mining Systems
Closer Look: IBM’s Intelligent Miner
Data Mining Academic Systems
• CBA: Liu et al., National Univ. of Singapore
• WEKA: Frank et al., University of Waikato, New Zealand
• ARMiner: Cristofor et al., UMass/Boston
• WPI WEKA: Our Temporal/Spatial Association Rules
Some Current Analytical Data
Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
– Systems performance Data
– Sleep Data
– Financial Data
– Web Data
• Data Mining for Genetic Analysis
– Correlating genetic information with diseases
– Predicting gene expression patterns
• Data Mining for Electronic Commerce
– Collaborative and Content-Based Filtering
• Using Association Rules and using Neural Networks
Mining Complex Data
     names/aliases                            | bank account   | age | felonies                                   | gender | iris scan | …
P1   {joe smith, greg jones}                  |                | 27  | <burglary 2/86, fraud 11/93, murder 3/99>  | M      |           |
P2   {kathy pearls, kathy dow, susan harris}  | 97,72,67,80,…  | 53  | <child abuse 9/98, kidnapping 2/03>        | F      |           |
P3   {drew harris}                            | 10,29,37,16,…  | 49  | <>                                         | M      |           |
…
Based partially on work w/ Norfolk County Sheriff Office
Sample Complex Patterns
Potential temporal/spatial association:
– Teenage males from Eastern Massachusetts
who are convicted of burglary are likely (7%)
to commit violent crimes when they are adults.
Analyzing Sleep Data
Purpose:
• Associations between sleep patterns and health/pathology
• Obtain patterns of different sleep stages (4 sleep + REM + Wake)
DATA SET
Clinical (sequential)
• Electro-encephalogram (EEG)
• Electro-oculogram (EOG)
• Electro-myogram (EMG)
(Source: http://www.blsc.com)
Diagnostic (tabular)
• Questionnaire responses
• Patient’s demographic info.
• Patient’s medical history
• Probe measuring flow of oxygen in blood, etc.
Potential Rules:
(A) Association Rules
(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
confidence=92%, support=13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA**
=> (Race = Caucasian) confidence=70%, support=8%
WPI, UMassMedical, BC
*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
Input Data
• Each instance: [Tabular | set | sequential] * attributes
     attr1                        | attr2           | attr3 | attr4              | attr5  | [class]
     illnesses                    | heart rate      | age   | oxygen             | gender | Epworth
P1   {depression, fatigue}        |                 | 27    |                    | M      | 5
P2   {stroke, dementia, fatigue}  | 97,72,67,80,…   | 73    | 90,92,96,89,86,…   | F      | 23
P3   {arthritis}                  | 102,99,87,96,…  | 49    | 97,100,82,80,70,…  | M      | 14
…
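One way to hold such a mixed instance in memory is sketched below in Python. The class name `Instance` and its field names are illustrative assumptions, not part of the talk; the values are the ones listed for patient P2, whose sequences are cut off by "…" on the slide, so only the leading values are stored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Instance:
    """One data instance mixing set, sequential, and tabular attributes."""
    illnesses: FrozenSet[str]      # set-valued attribute
    heart_rate: Tuple[int, ...]    # sequential attribute (leading values only)
    age: int                       # tabular attribute
    oxygen: Tuple[int, ...]        # sequential attribute (leading values only)
    gender: str                    # tabular attribute
    epworth: int                   # class attribute


# Patient P2; the slide elides the rest of each sequence with "…".
p2 = Instance(
    illnesses=frozenset({"stroke", "dementia", "fatigue"}),
    heart_rate=(97, 72, 67, 80),
    age=73,
    oxygen=(90, 92, 96, 89, 86),
    gender="F",
    epworth=23,
)
```

Keeping the three attribute kinds as distinct types makes it explicit which mining step (set mining, sequence mining, or ordinary tabular mining) applies to which field.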
Analyzing Financial Data
• Sequential data – daily stock values
• “Normal” (tabular/relational) data
– sector (computers, agricultural, educational, …), type of
government, product releases, company awards, …
• Desired rules:
– If DELL’s stock value increases & 1999<year<2002 =>
IBM’s stock value decreases
Financial Data Analysis
[Table: one row per company (Cisco, AMD, …); columns: Stock values, Products, Awards, Neg. Events, Expansion/Merge. Entries that survive extraction: Products: Aironet 1100 Series (Oct 2, 02) for Cisco and Athlon XP 2200+ (Nov 11, 02) for AMD; Awards: Lifetime Achievement (Oct 31, 02); Neg. Events: Reduce workforce (Nov 14, 02); remaining cells: None or …]
Events – Sleep Data
6 basic sleep events/stages: W, S1, S2, S3, S4, REM
• SaO2: the mean oxygen saturation (SaO2), around 90%
• heart rate, shown by ECG in beats per minute
• the sleep stages: W or Wake; 1 or Stage 1; 2; 3; 4; and REM or Rapid Eye Movement stage
Also shown, in brown markings, are:
• Epoch (of duration 30 sec) and
• Clock time (indicating total sleep time).
Events – Financial Data
Basic events: 16 or so financial templates
[Little&Rhodes78]
difficult pattern matching – alignments and time warping
Panic Reversal
Rounding Top Reversal
Head & Shoulders Reversal
Descending Triangle Reversal
Example: Event Identification
• Templates = increase, decrease, sustain
• Confidence = 90%, support = 15%, class = Epworth
     illnesses                    | heart rate      | age | oxygen             | gender | Epworth
P1   {depression, fatigue}        |                 | 27  |                    | M      | 5
P2   {stroke, dementia, fatigue}  | 97,72,67,80,…   | 73  | 90,92,96,89,86,…   | F      | 23
P3   {arthritis}                  | 102,99,87,96,…  | 49  | 97,100,82,80,70,…  | M      | 14
…
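A minimal Python sketch of how a numeric sequence might be turned into increase / decrease / sustain events follows. The function name `identify_events`, the tolerance parameter `tol`, and the rule of merging equal neighbouring steps are assumptions made for illustration; the talk does not say how its templates are matched.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (template, start index, end index)


def identify_events(values: List[float], tol: float = 2.0) -> List[Event]:
    """Label each step between consecutive samples and merge equal neighbours."""
    events: List[Event] = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tol:
            label = "increase"
        elif delta < -tol:
            label = "decrease"
        else:
            label = "sustain"  # change is within the tolerance band
        if events and events[-1][0] == label:
            # Extend the current run instead of opening a new event.
            events[-1] = (label, events[-1][1], i)
        else:
            events.append((label, i - 1, i))
    return events


# Leading oxygen values listed for patient P3.
oxygen_p3 = [97, 100, 82, 80, 70]
events = identify_events(oxygen_p3)
```

With the default tolerance of 2, the P3 oxygen prefix yields four events in order: increase, decrease, sustain, decrease. Each event carries the index interval it spans, which is what a temporal rule would later refer to.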
Temporal Relations between two Events
event1 <relation> event2, with <relation> one of:
• meets
• before
• after
• overlaps
• is equal to
• starts
• during
• finishes
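The relations listed here can be decided from interval endpoints alone. The sketch below assumes each event is a pair (start, end) with start < end; the function name `relation` and the "inverse of …" labels for the cases the slide does not name are illustrative choices, not the talk's notation.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end), assumed start < end


def relation(e1: Interval, e2: Interval) -> str:
    """Return how event1 relates to event2 in time."""
    s1, f1 = e1
    s2, f2 = e2
    if (s1, f1) == (s2, f2):
        return "is equal to"
    if f1 < s2:
        return "before"
    if f2 < s1:
        return "after"
    if f1 == s2:
        return "meets"
    if s1 == s2 and f1 < f2:
        return "starts"
    if f1 == f2 and s1 > s2:
        return "finishes"
    if s1 > s2 and f1 < f2:
        return "during"
    if s1 < s2 < f1 < f2:
        return "overlaps"
    # Every remaining case is one of the relations above with the roles swapped.
    return "inverse of " + relation(e2, e1)


# Hypothetical time points: an oxygen increase on [0, 1] and a heart-rate decrease on [1, 2].
oxygen_inc = (0, 1)
hr_dec = (1, 2)
print(relation(oxygen_inc, hr_dec))
```

In the usage lines, the oxygen increase ends exactly where the heart-rate decrease begins, so the first event "meets" the second.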
Example: temporal association rules
• heart rate decreases immediately after oxygen stops increasing &
gender=M => epworth=10 (conf=95%, supp=23%)
– HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender=M => epworth=10
• heart rate sustains while oxygen increases & patient suffers from
dementia => ethnicity=white (conf=99%, supp=16%)
• patient suffers from dementia and depression & gender=F
& REM[t0,t2] => oxygen-inc[t1,t3] (conf=91%, supp=17%)
Time points: t0, t1, t2, t3
Closer Look: WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI),
and Alvarez (CS, BC)
• SNP analysis
– discovering correlations between
sequence variations and diseases
• Gene expression
– discovering patterns that cause a gene
to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize data mining techniques with actual
genetic data sampled from research
• Spinal Muscular Atrophy:
inherited disease that
results in progressive
muscle degeneration and
weakness.
Genomic Data Resources
Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114
Wirth, B. et al. Journal of Human Molecular Genetics
Data Mining Techniques
• Association Rule Mining
• Metrics for evaluation of mined rules
– Confidence: P(Consequent | Premise)
– Support: P(Consequent & Premise)
– Lift: P(Consequent | Premise) / P(Consequent)
• Example:
Ag1-CA, 110 = absent & Ag1-CA, 108 = associated & Gender = Female => SMA Type = Severe
Confidence: 100%, Support: 9.364%, Lift: 2.39
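The three metrics can be computed by counting records, as in the Python sketch below. The helper names `matches` and `rule_metrics` and the eight toy records are hypothetical and serve only to make the arithmetic concrete; they are not the SMA data set from the talk.

```python
from typing import Dict, List

Record = Dict[str, str]


def matches(record: Record, conditions: Record) -> bool:
    """True if the record has every attribute=value pair in conditions."""
    return all(record.get(key) == value for key, value in conditions.items())


def rule_metrics(records: List[Record], premise: Record, consequent: Record) -> Dict[str, float]:
    """Support, confidence, and lift of the rule premise => consequent."""
    n = len(records)
    n_premise = sum(matches(r, premise) for r in records)
    n_consequent = sum(matches(r, consequent) for r in records)
    n_both = sum(matches(r, premise) and matches(r, consequent) for r in records)
    if n_premise == 0 or n_consequent == 0:
        raise ValueError("premise and consequent must each match at least one record")
    support = n_both / n                  # P(Consequent & Premise)
    confidence = n_both / n_premise       # P(Consequent | Premise)
    lift = confidence / (n_consequent / n)  # confidence / P(Consequent)
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical toy records (gender, severity), invented for this example.
toy = [
    {"gender": "F", "type": "Severe"}, {"gender": "F", "type": "Severe"},
    {"gender": "F", "type": "Mild"},   {"gender": "M", "type": "Severe"},
    {"gender": "M", "type": "Mild"},   {"gender": "M", "type": "Mild"},
    {"gender": "F", "type": "Severe"}, {"gender": "M", "type": "Mild"},
]
metrics = rule_metrics(toy, {"gender": "F"}, {"type": "Severe"})
```

On the toy records the rule gender=F => type=Severe has support 3/8 = 0.375, confidence 3/4 = 0.75, and lift 0.75 / 0.5 = 1.5. A lift above 1 means the premise makes the consequent more likely than its base rate.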
Mining Gene Expression Patterns
• Different cells require different proteins
• DNA uses a four letter alphabet (ATCG)
• Cell expression pattern depends on motifs
[Figure: promoter sequences of genes expressed in neurons (n1, n2, n3) and muscle (m1, m2, m3), with scale marks at 10 and 20 basepairs; red = ON, white = OFF]
Gene Expression Analysis
PROMOTER(S) | motifs          | sequence                                     | gene   | CELL TYPES
PR1         | M1, M4, M2, M5  | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT   | Gene 1 | neural
PR2         | M1, M4, M5      | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA   | Gene 2 | neural
PR3         | M4, M1          | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA   | Gene 3 | muscle
PR4         | M1, M2, M5      | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA   | Gene 4 | neural
PR5         | M1, M4          | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA   | Gene 5 | muscle
PR6         | M3, M4, M5      | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC   | Gene 6 | neural
PR7         | M5, M2, M3, M1  | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA   | Gene 7 | neural
PR8         | M2, M4, M5      | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC  | Gene 8 | neural
PR9         | M4, M3          | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA   | Gene 9 | muscle
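As a rough illustration of turning a promoter into motif features, the Python sketch below scans a sequence for exact motif occurrences. The motif strings in `HYPOTHETICAL_MOTIFS` are invented for the example, since the talk does not give the sequences of M1 to M5, and exact string matching is a simplifying assumption; the functions `find_motif` and `motif_features` are likewise illustrative.

```python
from typing import Dict, List


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return every 0-based start position of motif in sequence (overlaps allowed)."""
    positions: List[int] = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def motif_features(sequence: str, motifs: Dict[str, str]) -> Dict[str, bool]:
    """Map each motif name to whether it occurs anywhere in the sequence."""
    return {name: bool(find_motif(sequence, pattern)) for name, pattern in motifs.items()}


# PR1 as listed in the talk; the motif strings below are made up for illustration.
PR1 = "CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT"
HYPOTHETICAL_MOTIFS = {"X1": "TATATA", "X2": "GGGCC", "X3": "AAAAAA"}
features = motif_features(PR1, HYPOTHETICAL_MOTIFS)
```

A presence/absence vector like `features`, paired with the cell type in which the gene is expressed, is the kind of tabular input that rule or classifier mining can then work on.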
Our System: CAGE
To predict gene expression based on DNA
sequences.
[Figure: for each cell type (Muscle Cell, Neural Cell, Seam Cells), CAGE marks Gene 1, Gene 2, and Gene 3 as On or Off]
Summary
• KDD is the “non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data”
• The KDD process includes data collection and pre-processing, data
mining, and evaluation and validation of the discovered patterns
• Data mining is the discovery and extraction of patterns from data,
not the extraction of data
• Important challenges in data mining: privacy, security, scalability,
real-time analysis, and handling non-conventional data
Data Mining Resources – Books
1. Advances in Knowledge Discovery and Data Mining. Eds.: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
2. Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann Publishers, 2001.
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I. Witten and E. Frank. Morgan Kaufmann Publishers, 2000.
4. Data Mining. Technologies, Techniques, Tools, and Trends. B. Thuraisingham. CRC, 1998.
5. Principles of Data Mining. D. J. Hand, H. Mannila and P. Smyth. MIT Press, 2000.
6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Springer Verlag, 2001.
7. Data Mining Cookbook, modeling data for marketing, risk, and CRM. O. Parr Rud. Wiley, 2001.
8. Data Mining. A hands-on approach for business professionals. R. Groth. Prentice Hall, 1998.
9. Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann, 1999.
10. Data Mining Methods for Knowledge Discovery. Cios, Pedrycz, & Swiniarski. Kluwer, 1998.
Data Mining Resources – Books (cont.)
11. Mastering Data Mining. M. Berry & G. Linoff. John Wiley & Sons, 2000.
12. Data Mining Techniques for Marketing, Sales and Customer Support. Berry & Linoff. John Wiley & Sons, 1997.
13. Decision Support using Data Mining. S. Anand and A. Buchner. Financial Times Pitman Publishing, 1998.
14. Feature Selection for Knowledge Discovery and Data Mining. Liu and Motoda. Kluwer, 1998.
15. Feature Extraction, Construction and Selection: A Data Mining Perspective. Eds: Motoda and Liu. Kluwer, 1998.
16. Knowledge Acquisition from Databases. Xindong Wu.
17. Mining Very Large Databases with Parallel Processing. A. Freitas & S. Lavington. Kluwer, 1998.
18. Predictive Data-Mining: A Practical Guide. Weiss & Indurkhya. Morgan Kaufmann, 1998.
19. Machine Learning and Data Mining: Methods and Applications. Michalski, Bratko, and Kubat. John Wiley & Sons, 1998.
20. Rough Sets and Data Mining: Analysis of Imprecise Data. Eds: Lin and Cercone. Kluwer.
21. Seven Methods for Transforming Corporate Data into Business Intelligence. Vasant Dhar and Roger Stein. Prentice-Hall, 1997.
Data Mining Resources – Journals
• Data Mining and Knowledge Discovery Journal
Newsletters:
• ACM SIGKDD Explorations Newsletter
Related Journals:
• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Intl. Journal of Intelligent Information Systems
Data Mining Resources – Conferences
• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
• ICDM: IEEE International Conference on Data Mining
• SIAM International Conference on Data Mining
• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Related Conferences:
• ICML: Intl. Conf. on Machine Learning
• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
• IJCAI: International Joint Conference on Artificial Intelligence
• AAAI: American Association for Artificial Intelligence Conference
• SIGMOD/PODS: ACM Intl. Conference on Data Management
• ICDE: International Conference on Data Engineering
• VLDB: International Conference on Very Large Data Bases