RESEARCH ISSUES IN DATA MINING
Sanjeev Kumar
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012
[email protected]
1. Introduction
The field of data mining and knowledge discovery is emerging as a new, fundamental research
area with important applications to science, engineering, medicine, business, and education. Data
mining attempts to formulate, analyze and implement basic induction processes that facilitate the
extraction of meaningful information and knowledge from unstructured data. Data mining
extracts patterns, changes, associations and anomalies from large data sets. Work in data mining
ranges from theoretical work on the principles of learning and mathematical representations of
data to building advanced engineering systems that perform information filtering on the web,
find genes in DNA sequences, help understand trends and anomalies in economics and
education, and detect network intrusion. Data mining is also a promising computational
paradigm that enhances traditional approaches to discovery and increases the opportunities for
breakthroughs in the understanding of complex physical and biological systems. Researchers
from many intellectual communities have much to contribute to this field. These include the
communities of machine learning, statistics, databases, visualization and graphics, optimization,
computational mathematics, and the theory of algorithms.
2. The Process of Data Mining
The data mining process is often characterized as a multi-stage iterative process involving data
selection, data cleaning, and application of data mining algorithms, evaluation, and so forth. Here
we adopt a somewhat different process-oriented view and break it down into different steps:
2.1. Exploring and Pre-processing
The initial steps of exploring, visualizing, and querying the data, to gain insight into the data in
an interactive manner. Preprocessing steps such as variable selection, data focusing, and data
validation can also be included in these initial steps.
2.2. Modeling
The steps involved in (a) selecting the model representations that we seek to fit to the data (e.g.,
a tree, a linear function, a probability density model, etc.), (b) selecting the score functions that
score different models with respect to the data, and (c) specifying the computational methods and
algorithms to optimize the score function (e.g., greedy local search). These "components"
combined together specify the data mining algorithm to be used. The components may be
"precompiled" into a specific algorithm (e.g., CART or C4.5 decision tree implementations) or
may be integrated in a "customized" manner for a specific application (much more common in
the sciences).
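As a rough illustration of how the three components fit together, the sketch below uses a linear function as the model representation, the sum of squared errors as the score function, and a greedy local search as the optimization method. The data, step sizes, and function names are assumptions made for this example and do not come from any particular data mining package.

```python
# Illustrative sketch only: names, data, and step sizes are assumptions.

def predict(params, x):
    """Model representation: a linear function y = a*x + b."""
    a, b = params
    return a * x + b

def score(params, data):
    """Score function: sum of squared errors (lower is better)."""
    return sum((y - predict(params, x)) ** 2 for x, y in data)

def greedy_local_search(data, start=(0.0, 0.0), step=1.0, min_step=1e-6):
    """Search method: try a small move in each parameter, keep the best
    improving move, and halve the step when no move improves the score."""
    params = start
    best = score(params, data)
    while step > min_step:
        a, b = params
        neighbours = [(a + step, b), (a - step, b), (a, b + step), (a, b - step)]
        candidate = min(neighbours, key=lambda p: score(p, data))
        candidate_score = score(candidate, data)
        if candidate_score < best:
            params, best = candidate, candidate_score
        else:
            step /= 2.0
    return params, best

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(5)]
params, sse = greedy_local_search(data)
```

Swapping any one component (a tree instead of a line, absolute error instead of squared error, a different search) yields a different algorithm, which is the sense in which the components "specify" it.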
2.3. Mining
The step (often repeated) of actually running a particular data mining algorithm on a particular
data set.
2.4. Evaluating
The step (often ignored) of critically evaluating the quality of the output of the data mining
algorithm from Step 3, both the predictions of the model and the interpretation of the fitted model
itself.
2.5. Deploying
The step (rarely achieved) of putting a model from a data mining algorithm into routine
predictive use, e.g., using the model continuously in real-time for scoring customers visiting an
e-commerce Web site. A challenging (and under-appreciated) technical issue in this context is
how and when models should be updated for such continuous data stream applications.
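One simple way to approach the "when" part of this question is to monitor the deployed model's recent error rate and flag the model for updating once that rate drifts past a tolerance. The sketch below is only one possible policy, not a method described in this text; the class name, window size, and threshold are hypothetical.

```python
from collections import deque

class RetrainingMonitor:
    """Tracks the error rate of a deployed model over a sliding window of
    recent predictions and flags when it exceeds a tolerance.
    (Hypothetical helper; window size and threshold are assumptions.)"""

    def __init__(self, window_size=1000, max_error_rate=0.2):
        self.outcomes = deque(maxlen=window_size)  # 1 = wrong, 0 = correct
        self.max_error_rate = max_error_rate

    def record(self, predicted, actual):
        self.outcomes.append(0 if predicted == actual else 1)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_update(self):
        # Decide only once the window is full, to avoid reacting to noise.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.max_error_rate

# Tiny usage example with made-up predictions and outcomes.
monitor = RetrainingMonitor(window_size=4, max_error_rate=0.25)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
```

In practice the true outcome often arrives long after the prediction, so the window would be filled with delayed labels; that delay is part of what makes the problem hard.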
3. Two Extremes of Data Mining
Given the vast numbers of different users of data analysis tools, across different application
disciplines, it is clearly dangerous to make broad generalizations about data analysis and data
mining! Nonetheless, we will now consider two "prototype" users of data mining tools in the
general context of the 5-step data mining process above. These two prototypes are in some
respects at opposite ends of a hypothetical spectrum of approaches to data mining.
4. The Business Data Miner
The first prototype data miner is "The Business-Person," or "BP" for short. BP typically deals
with large numbers of customers, for example in retail consumer or financial services
environments. BP's main goal is to use data mining techniques for prediction of customer
behavior to gain competitive advantage, e.g., using decision trees to try to identify promising
customers for a marketing campaign. BP's approach to data mining is often strictly constrained
by a number of factors: cost factors (the cost of data mining development and deployment should
not exceed the anticipated gains in revenue), integration factors (the deployed predictive model
may need to be integrated into an existing transaction-oriented environment), and time constraints
(any potential competitive advantage may depend critically on being able to develop and deploy
a predictive model quickly). If one asks the BP what their most pressing problems are, they will
almost certainly not say that they need a slightly more accurate decision tree algorithm, or a
slightly faster association rule algorithm. Instead their most pressing problem is that of managing
the overall process: how should features be defined for time-dependent data? Which models are
most appropriate? How much data is needed for training? Will a random sample of 100k
customers from a database of 10 million be "good enough"? What kind of system can be put in
place to update the models as new data arrives? And so on.
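For the narrow case of estimating a single proportion (say, the fraction of customers who respond to a campaign), the sample-size question above has a textbook answer. The sketch below computes an approximate 95% margin of error for a simple random sample drawn without replacement; the function name and the worst-case choice p = 0.5 are assumptions for illustration, and the result says nothing about whether 100k records suffice for rare events or for fitting a complex model.

```python
import math

def proportion_margin_of_error(sample_size, population_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for an estimated proportion from a
    simple random sample drawn without replacement."""
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

# 100k customers sampled from a database of 10 million.
margin = proportion_margin_of_error(100_000, 10_000_000)
```

Under these assumptions the margin comes out at roughly 0.3 percentage points, so for a simple aggregate the sample is ample; the harder part of BP's question concerns models and subpopulations for which no such closed form exists.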
5. The Science Data Miner
The other prototype we consider is the "Science Person" (SP for short), who is in some sense at
the other end of a hypothetical spectrum of data miners. The SP might be an atmospheric
scientist, investigating global climate change patterns via analysis of spatio-temporal gridded
observations of temperature, pressure, and wind speed, taken on a global grid on a daily basis for
the past 30 years. Or the SP might be working in computational biology, exploring a large gene
expression data set and its relation to cancer. The SP's data mining approach is quite different from
that of the BP. SP typically spends significant time in Steps 1 and 2: exploring, visualizing,
defining alternative models, and so forth. For example, in atmospheric science, the data can be
"shaped" in different ways via numerous pre-processing techniques such as principal components
analysis, time-series smoothing, and so forth. A single model may take 6 months or a year to
develop and only be evaluated once on a particular data set. SP's primary goal is to explore
model space in such a manner as to better understand the phenomena generating the data:
generating better predictions is merely a stepping stone on the path to identifying better models.
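To make the pre-processing step concrete, here is a minimal sketch of one of the techniques mentioned above, time-series smoothing, implemented as a centered moving average. The function name, the edge handling (truncated windows), and the temperature values are assumptions chosen for illustration.

```python
def moving_average(series, window):
    """Centered moving average; windows are truncated at the series edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed

# Made-up daily temperatures at one grid point, with a one-day spike.
daily_temperature = [14.0, 15.0, 19.0, 15.0, 14.0]
smoothed = moving_average(daily_temperature, window=3)
```

The choice of window is itself a modeling decision: a wider window removes more noise but also blurs short-lived events that the scientist may care about.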
6. Recent Research Achievements
The opportunities in data mining today rest solidly on a variety of research achievements that
were interdisciplinary in nature, building on discoveries made by researchers from different
disciplines working together.
• Neural Networks: Neural networks are systems inspired by the human brain. A basic
example is provided by a back propagation network which consists of input nodes, output
nodes, and intermediate nodes called hidden nodes. Initially, the nodes are connected with
random weights. During the training, a gradient descent algorithm is used to adjust the
weights so that the output nodes correctly classify data presented to the input nodes. The
algorithm was invented independently by several groups of researchers.
• Tree-based Classifiers: A tree is a convenient way to break large data sets into smaller ones.
By presenting a learning set to the root and asking questions at each interior node, the data at
the leaves can often be analyzed very simply. For example, a classifier to predict the
likelihood that a credit card transaction is fraudulent may use an interior node to divide a
training data set into two sets, depending upon whether or not five or fewer transactions were
processed during the previous hour. After a series of such questions, each leaf can be labeled
fraud/no-fraud by using a simple majority vote. Tree based classifiers were independently
invented in information theory, statistics, pattern recognition and machine learning.
• Graphical Models and Hierarchical Probabilistic Representations: A directed graph is a
good means of organizing qualitative knowledge about conditional
independence and causality gleaned from domain experts. Graphical models generalize
Markov models and hidden Markov models, which have proved themselves to be a powerful
modeling tool. Graphical models were independently invented by computational probabilists
and artificial intelligence researchers studying uncertainty.
• Ensemble Learning: Rather than use data mining to build a single predictive model, it is
often better to build a collection or ensemble of models and to combine them, say with a
simple, efficient voting strategy. This simple idea has now been applied in a wide variety of
contexts and applications. In some circumstances, this technique is known to reduce variance
of the predictions and therefore to decrease the overall error of the model.
• Linear Algebra: Scaling data mining algorithms often depends critically upon scaling
underlying computations in linear algebra. Recent work on parallel algorithms for solving
linear systems and on algorithms for solving sparse linear systems in high dimensions is
important for a variety of data mining applications, ranging from text mining to detecting
network intrusions.
• Large Scale Optimization: Some data mining algorithms can be expressed as large-scale,
often non-convex, optimization problems. Recent work has provided parallel and distributed
methods for large-scale continuous and discrete optimization problems, including heuristic
search methods for problems too large to be solved exactly.
• High Performance Computing and Communication: Data mining requires statistically
intensive operations on large data sets. These types of computations would not be practical
without the emergence of powerful SMP workstations and high performance clusters of
workstations supporting protocols for high performance computing such as MPI and MPI-IO.
Distributed data mining can require moving large amounts of data between geographically
separated sites, something which is now possible with the emergence of wide area high
performance networks.
• Databases, Data Warehouses, and Digital Libraries: The most time consuming part of the
data mining process is preparing data for data mining. This step can be streamlined in part if
the data is already in a database, data warehouse, or digital library, although mining data
across different databases, for example, is still a challenge. Some algorithms, such as
association algorithms, are closely connected to databases, while some of the primitive
operations being built into tomorrow's data warehouses should prove useful for some data
mining applications.
• Visualization of Massive Data Sets: Massive data sets, often generated by complex
simulation programs, require graphical visualization methods for best comprehension.
Recent advances in multi-scale visualization allow the rendering to be done far more quickly
and in parallel, making these visualization tasks practical.
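Returning to the tree-based classifier and ensemble learning items in the list above, the following sketch combines the two ideas: each model is a one-question tree whose leaves are labeled by majority vote, and the ensemble combines the models' predictions with another majority vote. Training each model on a bootstrap resample (bagging) is an assumption of this sketch, as are the function names and the toy fraud data; the text itself only says that a collection of models is built and combined.

```python
import random
from collections import Counter

def majority(labels, default=0):
    """Most common label; used to label leaves and to combine votes."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows):
    """One-question tree: 'is x <= threshold?'. Each leaf is labeled by a
    majority vote of the training labels that reach it."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_ensemble(rows, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the learning set."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def ensemble_predict(models, x):
    """Combine the individual predictions with a simple majority vote."""
    return majority([model(x) for model in models])

# Toy learning set: (transactions in the previous hour, 1 = fraud, 0 = no fraud).
rows = [(n, 0) for n in range(1, 6)] + [(n, 1) for n in range(6, 11)]
models = fit_ensemble(rows)
```

Individual stumps trained on different resamples may place their thresholds differently; the vote smooths over those differences, which is the variance reduction the ensemble item refers to.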
7. Research Challenges
The amount of digital data has been exploding during the past decade, while the number of
scientists, engineers, and analysts available to analyze the data has been static. To bridge this gap
requires the solution of fundamentally new research problems, which can be grouped into the
following broad areas:
• Developing a unifying theory of data mining
• Scaling up for high dimensional data and high speed data streams
• Mining sequence data and time series data
• Mining complex knowledge from complex data
• Data mining in a network setting
• Distributed data mining and mining multi-agent data
• Data mining for biological and environmental problems
• Data mining process-related problems
• Security, privacy and data integrity
• Dealing with non-static, unbalanced and cost-sensitive data
Scaling data mining algorithms: Most data mining algorithms assume that the data fits into
memory. Although success on large data sets is often claimed, usually this is the result of
sampling large data sets until they fit into memory. A fundamental challenge is to scale data
mining algorithms as:
1. the number of records or observations increases;
2. the number of attributes per observation increases;
3. the number of predictive models or rule sets used to analyze a collection of observations
increases; and,
4. the demand for interactivity and real-time response increases.
Not only must distributed, parallel, and out-of-memory versions of current data mining
algorithms be developed, but genuinely new algorithms are required. For example, association
algorithms today can analyze out-of-memory data with one or two passes, while requiring only
some auxiliary data be kept in memory.
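The two-pass pattern mentioned above can be sketched as follows, in the spirit of the first two levels of an Apriori-style association algorithm. The data is streamed once to count single items and a second time to count pairs of frequent items, so only the counters are held in memory. The function names and the tiny transaction set are hypothetical; a real system would read records from disk rather than from a list.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(read_transactions, min_support):
    """read_transactions is a zero-argument callable returning a fresh
    iterator over transactions, so each pass streams the data again."""
    # Pass 1: count single items; only the counts are held in memory.
    item_counts = Counter()
    for transaction in read_transactions():
        item_counts.update(set(transaction))
    frequent_items = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: count only pairs made up of frequent items.
    pair_counts = Counter()
    for transaction in read_transactions():
        kept = sorted(set(transaction) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def read_transactions():
    # Stand-in for reading records from disk one at a time.
    yield from [
        ["bread", "milk"],
        ["bread", "butter", "milk"],
        ["butter", "jam"],
        ["bread", "milk", "jam"],
    ]

pairs = frequent_pairs(read_transactions, min_support=2)
```

The auxiliary data here is the pair of counters; their size depends on the number of distinct items and frequent pairs, not on the number of records.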
Extending data mining algorithms to new data types: Most data mining algorithms work
with vector-valued data. It is an important challenge to extend data mining algorithms to work
with other data types, including 1) time series and process data, 2) unstructured data, such as
text, 3) semi-structured data, such as HTML and XML documents, 4) multi-media and
collaborative data, 5) hierarchical and multi-scale data, and 6) collection-valued data.
Developing distributed data mining algorithms: Today most data mining algorithms require
bringing together all the data to be mined in a single, centralized data warehouse. A fundamental
challenge is to develop distributed versions of data mining algorithms so that data mining can be
done while leaving some of the data in place. In addition, appropriate protocols, languages, and
network services are required to handle the meta-data and mappings
needed for mining distributed data. As wireless and pervasive computing environments become
more common, algorithms and systems for mining the data produced by these types of systems
must also be developed.
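A very small example of mining while leaving data in place is a statistic that can be assembled from per-site summaries. In the sketch below each site computes a count, a sum, and a sum of squares locally, and a coordinator combines them into a global mean and variance without ever seeing the raw records. The function names and site data are assumptions for illustration, and most data mining algorithms do not decompose this cleanly, which is why the challenge above remains open.

```python
def local_summary(values):
    """Computed at each site; only these three numbers leave the site."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(summaries):
    """Computed at a coordinator from the per-site summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance

# Made-up observations held at two geographically separate sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
mean, variance = combine([local_summary(site_a), local_summary(site_b)])
```

The sum-of-squares formula is used here for brevity; it can lose precision when the variance is small relative to the mean, so a production version would likely merge running means and squared deviations instead.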
Ease of Use: Data mining today is at best a semi-automated process and perhaps destined to
always remain so. On the other hand, a fundamental challenge is to develop data mining systems
which are easier to use, even by casual users. Relevant techniques include improving user
interfaces, supporting casual browsing and visualization of massive and distributed data sets,
developing techniques and systems to manage the meta-data required for data mining, and
developing appropriate languages and protocols for providing casual access to data. In addition,
the development of data mining and knowledge discovery environments which address the
process of collecting, processing, mining, and visualizing data, as well as the collaborative and
reporting aspects necessary when working with data and information derived from it, is another
important fundamental challenge.
Privacy and Security: Data mining can be a powerful means of extracting useful information
from data. As more and more digital data becomes available, the potential for misuse of data
mining grows. A fundamental challenge is to develop privacy and security models and protocols
appropriate for data mining and to ensure that next generation data mining systems are designed
from the ground up to employ these models and protocols.
8. Conclusions
Data mining and knowledge discovery are emerging as a new discipline with important
applications to science, engineering, health care, education, and business. Data mining rests
firmly on 1) research advances obtained during the past two decades in a variety of areas and 2)
more recent technological advances in computing, networking and sensors. Data mining is driven
by the explosion of digital data and the scarcity of scientists, engineers, and domain experts
available to analyze it. Data mining is beginning to contribute research advances of its own, by
providing scalable extensions and advances to work in associations, ensemble learning, graphical
models, techniques for on-line discovery, and algorithms for the exploration of massive and
distributed data sets.
Advances in data mining require a) supporting single investigators working in data mining and
the underlying research domains supporting data mining; b) supporting inter-disciplinary and
multi-disciplinary research groups working on important basic and applied data mining
problems; and c) supporting the appropriate test-beds for mining large, massive and distributed
data sets. Appropriate privacy and security models for data mining must be developed and
implemented.