Recent Work at ISI
Jose Luis Ambite
Yigal Arens
Eduard Hovy
Andrew Philpot
USC/ISI
Overview
1. EDC system
– NHANES health questionnaire data
– (Semi-)automatic domain model construction
– NL-based question understanding
2. Proposals
– Urban Transportation SGER awarded
– Submitted proposal to ITR
3. Outreach
– Connections to USC campus
– Conference planning: dg.o 2002
NHANES Data Collection
• We acquired and wrapped NHANES database
– From National Center for Health Statistics
– Survey of thousands of records (people); each record contains max. 12,000 questions about health, family, medical history, etc.
– Database wrapped and accessible via EDC system
Challenge: can we learn the domain model automatically?
– Try to extract terms from DB, cluster them, and then
link them into Ontology
– Then test Domain Model using SIMS
Automated Domain Modeling Research
• Step 1: performed manual pre-test
– extracted approx. 60 column headings (database questions)
– clustered them manually
– compared accuracy: only about 50% overlap
• Step 2: developed clustering toolkit
– assembled CLINK, SLINK, Median, k-Means, etc. into toolkit
– developed speedup techniques
• Step 3: ran series of 10 experiments
– various word manipulations (word weighting by inverse frequency,
etc.; word stemming; longer passage extracts; etc.)
– mapped out extensive parameter space; ran a targeted sweep
• Results still not great
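The kind of experiment listed above can be illustrated with a minimal sketch. It assumes column headings are treated as bags of words, weighted by smoothed inverse frequency, and grouped by single-link (SLINK-style) clustering at a fixed cosine-similarity threshold. The sample headings, the threshold, and all function names are hypothetical; this is not the ISI toolkit.

```python
import math
from collections import Counter


def idf_vectors(docs):
    """Build one sparse, inverse-frequency-weighted word vector per heading."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each word once per heading
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that words shared by every heading keep a nonzero weight.
        vectors.append({w: c * (math.log((1 + n) / (1 + df[w])) + 1.0)
                        for w, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_link_clusters(docs, threshold=0.3):
    """Single-link clustering cut at `threshold`: two headings share a cluster
    if a chain of pairwise similarities >= threshold connects them."""
    vecs = idf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, d in enumerate(docs):
        clusters.setdefault(find(i), []).append(d)
    return sorted(clusters.values())


# Hypothetical column headings, not taken from NHANES.
headings = [
    "blood pressure reading",
    "blood pressure medication",
    "family income level",
    "family income source",
]
for cluster in single_link_clusters(headings):
    print(cluster)
```

On these four headings the sketch separates the blood-pressure pair from the family-income pair. Stemming or longer passage extracts, as mentioned above, would change the vectors but not the clustering step.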
NL Question Understanding
Challenge: can we interpret user’s question when posed in English, not using menus or ontology?
• Approach:
1. create new Finite State Machine
2. create question grammar and lexicon (linked to Ontology)
3. create conversion routines that assemble SQL queries out
of user input
4. test and evaluate using EDC system and SIMS
• Current status:
– new FSM completed
– grammar and conversion routines under construction
– will demo English (+ other?) query input at conference
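The approach above (finite state machine, lexicon, conversion to SQL) can be sketched in miniature. Everything in this example is an assumption made for illustration: the one-pattern grammar, the lexicon entries, the column names, and the table name are hypothetical and do not come from the EDC system or NHANES.

```python
# Hypothetical lexicon: English term -> database column name.
LEXICON = {
    "diabetes": "has_diabetes",
    "asthma": "has_asthma",
}

# Hypothetical finite state machine for one question pattern:
#   "how many people have <TERM>"
# Each state maps an input symbol to the next state.
TRANSITIONS = {
    "START": {"how": "HOW"},
    "HOW": {"many": "MANY"},
    "MANY": {"people": "SUBJECT"},
    "SUBJECT": {"have": "VERB"},
    "VERB": {"<TERM>": "DONE"},
}


def parse(question):
    """Run the FSM over the question; return the lexicon term or None."""
    tokens = question.lower().rstrip("?").split()
    state = "START"
    term = None
    for tok in tokens:
        symbol = "<TERM>" if tok in LEXICON else tok
        next_state = TRANSITIONS.get(state, {}).get(symbol)
        if next_state is None:
            return None  # no transition: the question is outside the grammar
        if symbol == "<TERM>":
            term = tok
        state = next_state
    return term if state == "DONE" else None


def to_sql(question, table="survey"):
    """Assemble a SQL count query from a recognized question."""
    term = parse(question)
    if term is None:
        raise ValueError("question not covered by the grammar: " + question)
    # The column name comes from the fixed lexicon, never from raw user text.
    return f"SELECT COUNT(*) FROM {table} WHERE {LEXICON[term]} = 1"


print(to_sql("How many people have diabetes?"))
```

Taking column names only from the lexicon keeps user input out of the generated SQL. A fuller grammar would add states for other question forms and link lexicon entries to ontology concepts rather than directly to columns.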
Proposals
• SGER proposal funded
– Topic: Urban transportation study—new methods for freight
tracking in LA by comparing across databases
– Grant awarded to USC, shared by ISI and USC’s Dept of Policy
and Planning
– Jose Luis Ambite will spend approx. 25% time on this study
• White paper to DoT
– Topic: Searching for patterns in freight traffic
– Submitted by USC campus people and Jose Luis Ambite
• ITR proposal submitted
– Topic: Semi-automated topic hierarchy creation
– Partners: Eduard Hovy communicated with EPA group
– If funded will use EPA’s CARAT ontology as starting point and
evaluation standard
Outreach
• USC Campus Group
– Urban policy planners, digital democracy sociologists, industrial and
systems engineers, etc.
– Held several meetings, chaired by Yigal Arens and Genevieve
Giuliano, to explore collaborations and to see if we can extend DGRC
to start a separate organization
– Drafted a statement of goals to hand to Provost and USC-based small
funding offices
• New issue of DG Online! http://www.dgrc.org/dg-online/
• Conference: dg.o 2002
– Hotel arranged
– Website up (but still need fancy graphics)
– Call for presentations disseminated
– Some portions of program and invitees determined