Data Mining
for the Health Sciences
Christoph F. Eick
www.cs.uh.edu/~ceick/
University of Houston
Organization
1. Health Care and Computer Science
2. Promising Technologies
2.1 KDD / Data Mining
2.2 Agent-based Systems
2.3 Shared Ontologies and Knowledge Brokering
3. Model Generation as an Example
4. Summary and Conclusion
Data Mining for the Health Sciences, Houston, Feb. 9, 2000.
1. Health Care and Computer Science
Not too long ago (e.g. 1989):
• Offline data / missing data / handwritten reports
• Computers that cannot talk to each other
• Lack of standardization (Tower of Babel, too many languages…)
Today: faster computers, cheaper computers, better computer networks, electronic scanners, better connectivity, the internet, ...
• We have a lot of computerized knowledge on almost any aspect of human health (a well of knowledge)
• We have much more computing power to conduct complex data analysis tasks
• New problems:
– How can we find anything?
– How do we gather information that is distributed over various computer systems and represented using different formats?
– If we find something, how do we know that it is complete?
– How can this large amount of information be analyzed?
– What information can we trust?

2. Promising “Newer” Technologies to Cope with the Information Flood
• Knowledge Discovery and Data Mining (KDD)
• Agent-based Technologies
• Ontologies and Knowledge Brokering
• Non-traditional data analysis techniques
Model Generation as an Example to Explain / Discuss Technologies

Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
• Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
• Frequently, the term data mining is used to refer to KDD.
• Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html)
• The field is dominated more by industry than by research institutions.

General KDD Steps
Data sources → [Select/preprocess] → Selected/preprocessed data → [Transform] → Transformed data → [Data mine] → Extracted information → [Interpret/Evaluate/Assimilate] → Knowledge
Data preparation: the Select/preprocess and Transform steps
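The steps in the diagram above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the patient records, the field names (`age`, `smoker`, `readmitted`), the age band cut-off, and the `min_support` threshold are all hypothetical and do not come from the presentation.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"age": 72, "smoker": True, "readmitted": True},
    {"age": 35, "smoker": False, "readmitted": False},
    {"age": None, "smoker": True, "readmitted": True},
    {"age": 68, "smoker": True, "readmitted": True},
    {"age": 41, "smoker": False, "readmitted": False},
    {"age": 66, "smoker": False, "readmitted": True},
]

def select_preprocess(records):
    """Select/preprocess: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transform: replace the numeric age by a coarse age band."""
    return [
        {"age_band": "65+" if r["age"] >= 65 else "<65",
         "smoker": r["smoker"], "readmitted": r["readmitted"]}
        for r in records
    ]

def data_mine(records):
    """Data mine: count how often each (age_band, readmitted) pair occurs."""
    return Counter((r["age_band"], r["readmitted"]) for r in records)

def interpret(counts, min_support=2):
    """Interpret/evaluate: keep only patterns seen at least min_support times."""
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

def kdd(records):
    """Chain the four steps in the order shown in the diagram."""
    return interpret(data_mine(transform(select_preprocess(records))))

if __name__ == "__main__":
    print(kdd(RAW))
```

Each function corresponds to one arrow of the diagram, so a step can be swapped (for example, a different mining technique) without touching the others.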
KDD and Classical Data Analysis
• KDD is less focused than data analysis in that it looks for interesting patterns in data; classical data analysis centers on analyzing particular relationships in data. The notion of interestingness is a key concept in KDD. Classical data analysis centers more on generating and testing pre-structured hypotheses with respect to a given sample set.
• KDD is more centered on analyzing large volumes of data (many fields, many tuples, many tables, …).
• In a nutshell, the KDD process consists of preprocessing (generating a target data set), data mining (finding something interesting in the data set), and post-processing (representing the found patterns in understandable form and evaluating their usefulness in a particular domain); classical data analysis is less concerned with the preprocessing step.
• KDD involves collaboration between multiple disciplines: namely, statistics, AI, visualization, and databases.
• KDD employs non-traditional data analysis techniques (neural networks, association rules, decision trees, fuzzy logic, evolutionary computing, …).

3. Generating Models as an Example
• The goal of model generation (sometimes also called predictive data mining) is the creation, evaluation, and use of models to make predictions and to understand the relationships between the variables that are described in a data collection.
• Typical example applications include:
– generate a model that predicts a student’s academic performance based on the applicant’s data, such as past grades, test scores, past degree, …
– generate a model that predicts (based on economic data) which stocks to sell, hold, and buy.
– generate a model that predicts whether a patient suffers from a particular disease based on the patient’s medical and other data.
• Model generation centers on deriving a function that can predict a variable using the values of other variables: v = f(a1, …, an)
• Neural networks, decision trees, naïve Bayesian classifiers and networks, regression analysis and many other statistical techniques, fuzzy logic and neuro-fuzzy systems, and association rules are the most popular model generation tools in the KDD area.
• All model generation tools and environments employ the basic train-evaluate-predict cycle.
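The train-evaluate-predict cycle and the function v = f(a1, …, an) can be sketched with a 1-nearest-neighbor classifier. The technique is chosen here only because it is short; the slides do not prescribe it for this purpose. The attributes (body temperature, heart rate), the labels, and all data values are hypothetical.

```python
import math

def train(examples):
    """'Training' a 1-nearest-neighbor model just stores the labeled examples."""
    return list(examples)

def predict(model, attributes):
    """Predict v = f(a1, ..., an) as the label of the closest stored example."""
    _, label = min(model, key=lambda ex: math.dist(ex[0], attributes))
    return label

def evaluate(model, test_set):
    """Fraction of held-out examples whose label is predicted correctly."""
    hits = sum(predict(model, a) == v for a, v in test_set)
    return hits / len(test_set)

# Hypothetical data: (body temperature in C, resting heart rate) -> label
TRAIN = [((36.6, 60), "healthy"), ((36.8, 72), "healthy"),
         ((39.1, 110), "ill"), ((38.7, 98), "ill")]
TEST = [((36.9, 65), "healthy"), ((39.4, 105), "ill")]

model = train(TRAIN)                         # train
accuracy = evaluate(model, TEST)             # evaluate on held-out data
new_prediction = predict(model, (38.9, 101)) # predict for an unseen case
```

Note that the attributes are used unscaled, so the heart rate dominates the distance. That is one instance of the sensitivity to attributes of different importance that distinguishes techniques from one another.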
Why Do We Need So Many Data Mining / Analysis Techniques?
• No generally good technique exists.
• Different methods make different assumptions with respect to the data set to be analyzed (to be discussed on the next transparency).
• Cross fertilization between different methods is desirable and frequently helpful in obtaining a deeper understanding of the analyzed dataset.

Example: Decision Tree Approach (figure)
Example: Decision Tree Approach, continued (figure)
Example: Nearest Neighbor Approach (figure)

Characteristics and Assumptions of Popular Data Mining/Analysis Techniques
• Distance-based approaches (assume that a distance function with respect to the objects in the dataset exists) vs. order-based approaches (just use the ordering of values in their decision making; 3>2>1 is indistinguishable from 2.01>2>1.99)
• Approaches that make no assumptions vs. approaches that assume a particular distribution of the data in the underlying dataset
• Differences in employed approximation techniques
– Rectangular vs. other approximations
– Linear vs. non-linear approximations
• Sensitivity to redundant attributes (variables)
• Sensitivity to irrelevant attributes
• Sensitivity to attributes of different degrees of importance
• Different training performance / testing performance
• What does the learnt function tell us about the analyzed data set? How difficult is it to understand the learnt function?
• Deterministic / non-deterministic approaches
• Stability of the obtained results
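The difference between distance-based and order-based approaches can be made concrete with the two value sequences from the slide. A minimal sketch: the helper names `ranks` and `gaps` are hypothetical, and each stands in for the only information the respective family of techniques would see.

```python
def ranks(values):
    """Order-based view: replace each value by its rank (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def gaps(values):
    """Distance-based view: absolute differences between consecutive values."""
    return [abs(b - a) for a, b in zip(values, values[1:])]

a = [3, 2, 1]
b = [2.01, 2, 1.99]

same_order = ranks(a) == ranks(b)  # an order-based method cannot tell a from b
same_gaps = gaps(a) == gaps(b)     # a distance-based method can
```

Here `same_order` is true and `same_gaps` is false: both sequences have the same ordering, but the spacing between the values differs by two orders of magnitude.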
Players in the Model Generation Society
• Data Analysts
• Data Collection Providers
• Tool Builders
• End Users (Managers, Doctors, Decision Makers, Gamblers, ...)

(More) Problems of Model Generation
• It is difficult to find appropriate data collections.
• Sharing of models is not supported.
• Model generation is mostly performed in a centralized environment, not taking advantage of distributed computing technology.
• The degree of tool standardization is low, which makes it more difficult to use different tools for the same data analysis problems.
• Evaluation of claims with respect to the performance of models is very difficult. Problem: the model itself, as well as the tools and data collections that were used to generate the model, are not accessible online.

Key Ideas of Agent-based Technologies
• “Agents operate independently and anticipate user needs” (P. Maes)
• “Agents help users suffering from information overload” (O. Etzioni), rather than mimicking human intelligence
• “Agents are important because they allow users to interoperate with modern applications such as electronic commerce and information retrieval. Most of these applications assume that components are added dynamically and that they will be autonomous (serve different users and providers to fill different goals) and heterogeneous.” (M. Singh)
• “Essentially, agent-based architectures are characterized by three key features: autonomy, adaptation, and cooperation. Agent-based systems are computational systems in which several agents interact for their own good and for the good of the overall system.”
• “In an agent-based architecture services are provided in the context of a community of loosely coupled agents of various types in a distributed environment.”
• “Agents are aware of their environment and capable of communicating with other agents that belong to the same agent community.”

Simplified View of Agent-based Systems
• End User Agents: agents that act on behalf of end users that look for services
• Mediator Agents: agents that act as a matchmaker between service providers and end users
• Service Provider Agents: agents that act on behalf of service providers
Conversation Layer
Message Layer
Agent-based Model Generation
• Model generation services are provided in the context of a community of loosely coupled agents of various types in a distributed environment.
• Model generation tools are accessed using a unified interface.
• Tool providers and data collection providers offer their services to data analysts and end users via the internet. New forms of collaboration can easily be supported in this environment:
– data analysts no longer run the tools in their own computing environment
– brokering techniques can be used to find interesting data collections, suitable tools, useful models, and available ontologies.
– tool developers offer tool services on the internet, charging a one-time tool-use fee.

Model Generation Agent Communities
(Figure: Agent-based Model Generation Community. Labels in the diagram: Data Collection Provider, Data Analyst, Tool Developer, End User Browser, Model Generation Browser, Resource Agents, Data Collection Broker, Model Broker, Tool Broker, Tool Integration Tool, Model Generation Tools, Data Collections, Models.)

Shared Ontologies
• “Ontologies are content theories about sorts of objects, properties of objects, and relationship between objects that are possible in a specified domain of knowledge” (Chandrasekaran)
• “We consider ontologies to be domain theories that specify a domain-specific vocabulary of entities, classes, properties, predicates, and functions, and a set of relationships that necessarily hold among those vocabulary items” (Fikes)
• “Shared ontologies form the basis for domain specific knowledge representation languages” (Chandrasekaran)
• “If we could develop ontologies that could be used as the basis of multiple systems, they would share a common terminology that would facilitate sharing and reuse” (W. Swartout)
• “Ontologies play an important role for the standardization of terminology in medicine (e.g. UMLS) and other domains”
• “Ontologies can serve as the glue between knowledge that is represented at different, usually heterogeneous information sources.”

Ontologies and Brokering
• Service providers describe their capabilities in terms of a domain (or task) ontology.
• Agents that seek services describe their needs in terms of a domain (or task) ontology.
• Broker agents serve as matchmakers between service providers and service seekers by finding suitable agents and by evaluating the extent to which they can provide those services, relying on a semantic brokering approach.
• Various languages have been advocated in recent years to specify ontologies: OKBC, CKML/OML, ONTOLINGUA, XML, UMLS, ...

Promising Technologies to Use the Flood of Data for Providing Better Health Care
(Figure: “The Well of Knowledge” surrounded by the following technologies.)
• Agent-based Systems
• KDD
• Software Development Environments
• Knowledge Acquisition Tools
• Visualization
• Database Technology
• Ontologies
• Traditional Data Analysis Techniques
• Knowledge Brokering

References
WWW-Links:
– http://ksl-web.stanford.edu/Reusable-ontol/P001.html (Richard Fikes’ (Stanford University) slide show on “Reusable Ontologies”)
– http://www.kdnuggets.com/index.html (KDD Nuggets Directory: Data Mining and Knowledge Discovery Resources)
– http://www.mcc.com/projects/infosleuth/ (InfoSleuth (MCC), an agent-based system for information gathering)
– http://www.cs.cmu.edu/~softagents/ (CMU Intelligent Software Agents Page)
– http://www.cs.uh.edu/~ceick/6368.html (Homepage UH Graduate AI-class)
Papers:
– Special Issue of IEEE Intelligent Systems on “Coming to Terms with Ontologies”, Jan./Feb. 1999.
– Special Issue of IEEE Intelligent Systems on “Unmasking Intelligent Agents”, March/April 1999.
– Special Issue of IEEE Computer on “Data Mining”, October 1999.

End of Presentation
Transparencies that follow are very
likely not to be used
What is KDD?
• Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
• The identified knowledge is used to
– make predictions
– classify new examples
– summarize the content of data collections and documents to facilitate understanding and decision making, and to support search and indexing
– support graphical visualization to aid humans in discovering deeper patterns
• Example applications:
– learn to classify brain tissue from examples
– predict a patient’s life expectancy from his or her medical history
– summarize/cluster/mine clinical trial reports

What are Ontologies good for?
• As a shared conceptual model of a particular application domain that describes the semantics of the objects that are part of the domain, and captures knowledge that is inherent to the particular domain --- idea: knowledge base
• Ontologies provide a vocabulary for representing knowledge about a domain and for describing specific situations in a domain (tool for defining and describing domain-specific vocabularies) --- idea: language for communication
• For data/knowledge translation and transformation (provide a solution to the translation problem between different terminologies); for fusion and refinement of existing knowledge --- idea: interoperation
• For matchmaking between users, agents, and information resources in agent-based systems --- idea: collaboration, brokering (focus of next slides)
• As reusable building blocks to build systems that solve particular problems in the application domain --- idea: model reuse
Summary: “Ontologies can be used as building block components of knowledge bases, object schema for object-oriented systems, conceptual schema for data bases, structured glossaries for human collaborations, vocabularies for communication between agents, class definitions for conventional software system, etc.” (Fikes)

A “Traditional” Approach
• End User Agents: specify keywords with respect to the documents they are looking for
• Search Engine
• Service Provider Agents: Abstract Clinical Trial Report, Summary Clinical Trial Report
Semantic Brokering Approach
• End User Agents: specify subset of ontology
• Semantic Brokering := matchmaking
• Service Provider Agents: Subset of an Ontology, Summary Clinical Trial Report

Why do agent-based systems show promise for health care?
• Scalability
• Tasks to be solved involve the collaboration between different groups
• Well suited for the world-wide web
• Health care is a dynamically changing environment
• Establish standards (as a by-product)

Example Semantic Brokering
Data Analyst’s Information Requirement:
  Patient: Age>40, weight
  Intensive-CarePatient: Hours-in-intensive-care
Data Collection1:
  Patient: Age<15
  Intensive-CarePatient: Hours-in-intensive-care
Data Collection2:
  Patient: age, weight
  Intensive-CarePatient: Hours-in-intensive-care
Data Collection3:
  Patient: Age>60, Weight>300
  Intensive-CarePatient: Hours-in-intensive-care
Result Semantic Brokering:
((DataCollection1 nil ((missing slot weight)
                       (contradictory (< age 15) (> age 40))))
 (DataCollection2 t)
 (DataCollection3 t ((> age 60) (> weight 300))))
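The matchmaking in this example can be sketched as interval checking over slots. This is an illustrative reconstruction, not the brokering algorithm behind the slide: the dictionary encoding, the `broker` function, and the use of open intervals for constraints are assumptions, and the class structure (Patient, Intensive-CarePatient) is flattened into plain slot names.

```python
import math

INF = math.inf
ANY = (-INF, INF)  # slot is present, with no constraint on its value

# Each description maps a slot to an open interval (low, high) of allowed values.
REQUIREMENT = {"age": (40, INF), "weight": ANY, "hours-in-intensive-care": ANY}

COLLECTIONS = {
    "DataCollection1": {"age": (-INF, 15), "hours-in-intensive-care": ANY},
    "DataCollection2": {"age": ANY, "weight": ANY,
                        "hours-in-intensive-care": ANY},
    "DataCollection3": {"age": (60, INF), "weight": (300, INF),
                        "hours-in-intensive-care": ANY},
}

def broker(requirement, collection):
    """Return (usable, notes) for one data collection.

    usable is False if a required slot is missing or its constraint cannot
    overlap the requirement; otherwise notes lists slots on which the
    collection is narrower than what was asked for.
    """
    problems, restrictions = [], []
    for slot, (req_lo, req_hi) in requirement.items():
        if slot not in collection:
            problems.append(("missing slot", slot))
            continue
        lo, hi = collection[slot]
        if max(lo, req_lo) >= min(hi, req_hi):
            problems.append(("contradictory", slot))   # empty intersection
        elif lo > req_lo or hi < req_hi:
            restrictions.append((slot, (lo, hi)))      # only partly covered
    if problems:
        return False, problems
    return True, restrictions

RESULT = {name: broker(REQUIREMENT, c) for name, c in COLLECTIONS.items()}
```

Under these assumptions the sketch agrees with the slide's result: DataCollection1 is rejected (age constraint contradictory, weight slot missing), DataCollection2 matches without restrictions, and DataCollection3 matches but only covers patients with age > 60 and weight > 300.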
Critical Problems with Respect to Shared Ontologies
• Scientific communities have to agree on ontologies; otherwise, the whole approach is flawed.
• Development of ontologies for a particular domain is a difficult task (see the Digital Anatomist project at UW and the development of UMLS). The development of user-friendly and intelligent knowledge acquisition tools is very important for the successful development of shared ontologies.
• The expressiveness of the languages that are used to define ontologies limits what can be done with domain ontologies.
• Reasoning capabilities are important for systems that use shared ontologies (we need a language to specify ontologies and an inference engine that can reason with the given ontologies), for example for:
– finding inconsistencies in knowledge bases and finding errors at data entry
– semantic brokering
– more intelligent mappings between terms
– ...