WPI Center for Research in Exploratory Data and Information Analysis
CREDIA
KDDRG Research Projects
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
– Systems performance data
– Sleep data
– Financial data
– Web data
• Data Mining for Genetic Analysis
– Correlating genetic information with diseases
– Predicting gene expression patterns
• Data Mining for Electronic Commerce
– Collaborative and Content-Based Filtering
• Using Association Rules and using Neural Networks
Analyzing Sleep Data
Purpose:
• Find associations between sleep patterns and health/pathology
• Obtain patterns of different sleep stages (4 sleep + REM + Wake)
DATA SET
Clinical (sequential)
• Electro-encephalogram (EEG)
• Electro-oculogram (EOG)
• Electro-myogram (EMG)
• Probe measuring flow of oxygen in blood, etc.
Diagnostic (tabular)
• Questionnaire responses
• Patient’s demographic info
• Patient’s medical history
(Source: http://www.blsc.com)
Potential Rules:
(A) Association Rules
(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
confidence=92%, support=13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian)
confidence=70%, support=8%
*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
WPI, UMass Medical, BC
Input Data
• Each instance: [Tabular | set | sequential] * attributes
      attr1                        attr2            attr3  attr4              attr5   [class]
      illnesses                    heart rate       age    oxygen             gender  Epworth
P1    {depression, fatigue}        [not recovered]  27     [not recovered]    M       5
P2    {stroke, dementia, fatigue}  97,72,67,80,…    73     90,92,96,89,86,…   F       23
P3    {arthritis}                  102,99,87,96,…   49     97,100,82,80,70,…  M       14
…     …                            …                …      …                  …       …
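A minimal sketch of how one such mixed instance might be represented in code. The class name `SleepInstance` and its field names are hypothetical and chosen only for illustration; only the sequence prefixes visible on the slide are used, and the elided values are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class SleepInstance:
    """One patient record mixing set, sequential, and tabular attributes."""
    patient_id: str
    illnesses: FrozenSet[str]  # set-valued attribute
    heart_rate: List[int]      # sequential attribute
    age: int                   # tabular attribute
    oxygen: List[int]          # sequential attribute
    gender: str                # tabular attribute
    epworth: int               # class attribute


# Visible prefixes only; the remainder of each sequence is elided on the slide.
p2 = SleepInstance("P2", frozenset({"stroke", "dementia", "fatigue"}),
                   [97, 72, 67, 80], 73, [90, 92, 96, 89, 86], "F", 23)
p3 = SleepInstance("P3", frozenset({"arthritis"}),
                   [102, 99, 87, 96], 49, [97, 100, 82, 80, 70], "M", 14)


def has_illness(instance: SleepInstance, illness: str) -> bool:
    """Set-membership test, the basic predicate over a set-valued attribute."""
    return illness in instance.illnesses
```

Keeping the three attribute kinds as distinct types makes it explicit which predicates apply to each: membership for sets, pattern matching for sequences, comparisons for tabular values.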
Analyzing Financial Data
• Sequential data – daily stock values
• “Normal” (tabular/relational) data
– sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
– If DELL’s stock value increases & 1999<year<2002 => IBM’s stock value decreases
Events – Financial Data
• Basic events: 16 or so financial templates [Little&Rhodes78]
• Difficult pattern matching: alignments and time warping
• Example templates:
– Panic Reversal
– Rounding Top Reversal
– Head & Shoulders Reversal
– Descending Triangle Reversal
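The time-warping difficulty can be illustrated with classic dynamic time warping (DTW), which scores how well a price series matches a template even when the pattern is stretched in time. This is a generic textbook sketch, not necessarily the matching method used in the project; the template and series values below are hypothetical and are not taken from [Little&Rhodes78].

```python
from typing import Sequence


def dtw_distance(series: Sequence[float], template: Sequence[float]) -> float:
    """Classic O(n*m) dynamic time warping with absolute-difference cost."""
    n, m = len(series), len(template)
    inf = float("inf")
    # cost[i][j] = best alignment cost of series[:i] against template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series[i - 1] - template[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat template point
                                    cost[i][j - 1],      # repeat series point
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical shapes for illustration only.
rounding_top = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]  # same shape, slower in time
flat = [1.0] * 7
```

Here `dtw_distance(stretched, rounding_top)` is zero because the stretched series is a pure time-warp of the template, while the flat series scores a positive distance.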
WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI),
and Alvarez (CS, BC)
• SNP analysis
– discovering correlations between sequence variations and diseases
• Gene expression
– discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources
Patient Gender  SMA Type (Severity)  SNP Location  C212 (Father / Mother)  AG1-CA (Father / Mother)
Female          Severe               Y272C         31 / 28 29              102 / 108 112
Male            Mild                 Y272C         28 29 / 25              108 112 / 114
Source: Wirth, B. et al. Journal of Human Molecular Genetics
Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: Gene 1, Gene 2, and Gene 3 shown On or Off in a muscle cell, a neural cell, and seam cells, with CAGE predicting the expression]
Gene Expression Analysis
PROMOTER  MOTIFS          SEQUENCE                                     GENE    CELL TYPE
PR1       M1, M4, M2, M5  CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT   Gene 1  neural
PR2       M1, M4, M5      AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA   Gene 2  neural
PR3       M4, M1          CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA   Gene 3  muscle
PR4       M1, M2, M5      GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA   Gene 4  neural
PR5       M1, M4          ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA   Gene 5  muscle
PR6       M3, M4, M5      GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC   Gene 6  neural
PR7       M5, M2, M3, M1  TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA   Gene 7  neural
PR8       M2, M4, M5      ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC  Gene 8  neural
PR9       M4, M3          ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA   Gene 9  muscle
Gene Expression
• Transcription of DNA into RNA
[Figure: in a muscle cell, transcriptional proteins TF 1, TF 2, and TF 3 bind motifs M1, M2, and M4 in the promoter region upstream of the gene; sequence shown: ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA; labels 240 and 100 on the motifs]
Association rules mined from promoters PR1 to PR9:
R1: M1, M4, M5 => Neural
supp=22%, conf=100% [Supp. instances: PR1, PR2]
R2: M2, M4, M5 => Neural
supp=22%, conf=100% [Supp. instances: PR1, PR8]
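The support and confidence figures for R1 and R2 can be checked against the nine promoters. The sketch below assumes that promoter PRn belongs to Gene n and that support is the fraction of all promoters containing the motifs and expressed in the given cell type; the names `PROMOTERS` and `rule_stats` are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Motif sets and cell types as listed for promoters PR1..PR9.
PROMOTERS: Dict[str, Tuple[FrozenSet[str], str]] = {
    "PR1": (frozenset({"M1", "M4", "M2", "M5"}), "neural"),
    "PR2": (frozenset({"M1", "M4", "M5"}), "neural"),
    "PR3": (frozenset({"M4", "M1"}), "muscle"),
    "PR4": (frozenset({"M1", "M2", "M5"}), "neural"),
    "PR5": (frozenset({"M1", "M4"}), "muscle"),
    "PR6": (frozenset({"M3", "M4", "M5"}), "neural"),
    "PR7": (frozenset({"M5", "M2", "M3", "M1"}), "neural"),
    "PR8": (frozenset({"M2", "M4", "M5"}), "neural"),
    "PR9": (frozenset({"M4", "M3"}), "muscle"),
}


def rule_stats(antecedent: Set[str], consequent: str) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    # Cell types of every promoter that contains all antecedent motifs.
    matches = [cell for motifs, cell in PROMOTERS.values() if antecedent <= motifs]
    hits = sum(1 for cell in matches if cell == consequent)
    support = hits / len(PROMOTERS)
    confidence = hits / len(matches) if matches else 0.0
    return support, confidence


r1 = rule_stats({"M1", "M4", "M5"}, "neural")  # matched by PR1 and PR2
r2 = rule_stats({"M2", "M4", "M5"}, "neural")  # matched by PR1 and PR8
```

Both rules are supported by 2 of 9 promoters (about 22%), and every matching promoter is neural, which gives 100% confidence.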
“Well-clustered” motifs
Coefficient of variation of distances (cvd) between two motifs:
$\mathrm{cvd}_{IR_n}(M_j, M_k) = \sigma_{IR_n}(M_j, M_k) / \mu_{IR_n}(M_j, M_k)$
Example for IR1={M1,M2,M5}:
σ(M1,M2) = 120.1
μ(M1,M2) = 216.6
cvd(M1,M2) = 0.55
[Figure: positions of motifs M1 to M5 and the distances between them along several promoter sequences]
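A minimal sketch of the cvd computation: the standard deviation of the distances between two motifs divided by their mean. The slide does not say whether the sample or population standard deviation is meant; the sample version (`statistics.stdev`) is assumed here, and the function name `cvd` is only illustrative.

```python
from statistics import mean, stdev
from typing import Sequence


def cvd(distances: Sequence[float]) -> float:
    """Coefficient of variation: standard deviation divided by mean.

    Assumes the sample standard deviation and at least two distances.
    """
    return stdev(distances) / mean(distances)


# Figures quoted on the slide for the pair (M1, M2).
slide_sdev, slide_mean = 120.1, 216.6
slide_cvd = slide_sdev / slide_mean  # about 0.55, as on the slide
```

A low cvd means the two motifs sit at a nearly constant distance from each other across the supporting promoters, which is the sense in which they are "well-clustered".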
Distance-based Association Rules
• Given thresholds:
– min-support
– min-confidence
– max-cvd
• Mine:
– all distance-based association rules
Sample distance-based assoc. rule
R1: M1, M2, M5 => Neural (sup=33%, conf=100%)
      M2                                  M5
M1    cvd 0.554, mean 216.6, sdev 120.1   cvd 0.076, mean 462.0, sdev 35.0
M2                                        cvd 0.433, mean 237.0, sdev 103.0
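One plausible reading of the three thresholds is sketched below: a rule is kept when it meets min-support and min-confidence and every pair of its motifs has a cvd no greater than max-cvd. The "every pair" condition, the function names, the threshold values, and the assignment of the slide's cvd values to motif pairs are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set

# cvd values from the sample rule; the pair assignment is assumed from the layout.
R1_CVD: Dict[FrozenSet[str], float] = {
    frozenset({"M1", "M2"}): 0.554,
    frozenset({"M1", "M5"}): 0.076,
    frozenset({"M2", "M5"}): 0.433,
}


def is_well_clustered(motifs: Set[str], pair_cvd: Dict[FrozenSet[str], float],
                      max_cvd: float) -> bool:
    """True when every motif pair has a cvd at or below max_cvd (assumed criterion)."""
    return all(pair_cvd[frozenset(pair)] <= max_cvd
               for pair in combinations(sorted(motifs), 2))


def accept_rule(support: float, confidence: float, motifs: Set[str],
                pair_cvd: Dict[FrozenSet[str], float],
                min_support: float, min_confidence: float, max_cvd: float) -> bool:
    """Apply the min-support, min-confidence, and max-cvd thresholds together."""
    return (support >= min_support
            and confidence >= min_confidence
            and is_well_clustered(motifs, pair_cvd, max_cvd))
```

With hypothetical thresholds of 20% support and 90% confidence, R1 would pass at max-cvd 0.6 but fail at max-cvd 0.5, because the (M1, M2) pair has cvd 0.554.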
Grad. & Undergrad. Students
• Ali Benamara.
• Dharmesh Thakkar.
• Senthil K Palanisamy.
• Zachary Stoecker-Sylvia.
• Keith A. Pray.
• Jonathan Freyberger.
• Maged El-Sayed.
• Parameshvyas Laxminarayan.
• Aleksandar Icev.
• Wendy Kogel.
• Michael Sao Pedro.
• Christopher Shoemaker.
• Weiyang Lin.
• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov.
• Takeshi Kawato
• Cindy Leung and Sam Holmes.
• John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young.
• Zachary Stoecker-Sylvia.
• Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB)
• Wendy Kogel, Brooke LeClair, Christopher St. Yves.
• Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB).
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB).
• Christopher Cole.
• Michael Ciman and John Gulbrandsen.
• Tara Halwes
• Christopher Martino.
• Matthew Berube.
• Anna Novikov.
• Amy Kao and Dana Rock.