International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 6, June 2014)
Data Mining Techniques and Research Challenges and Issues
Girish Kumar Sorot
Arya Institute of Engineering and Technology, Kukas Industrial Area (RIICO), Delhi Road, Jaipur, Rajasthan (India)
The development of information technology has paved the
way for the generation of large databases and huge volumes
of data in various areas. Research in databases and
information technology has given rise to approaches for
storing and manipulating this valuable data for further
decision making [1]. Data mining is the process of
extracting implicit information and knowledge, which is
not known in advance but is potentially useful to various
fields, from massive, incomplete, noisy, fuzzy and random
data [2].
Topics of interest include, but are not limited to,
practical areas that span a variety of aspects of data
integration and mining, including large-scale data
integration and mining; metadata integration and
management; data security and privacy; social media
data analysis and computing; Web-scale data mining and
semantic discovery; network data integration and delivery;
data filtering and cleaning; data integration environments
and applications; data models and schemas; database
integration systems; and data management and analysis in
specific application domains.
Data mining algorithms are widely used today for the
analysis of large corporate and scientific datasets stored
in databases and data archives. Industry, science, and
commerce fields often need to analyze very large datasets
maintained over geographically distributed sites by using
the computational power of distributed and parallel
systems.
Abstract-- Data mining deals with the huge amounts of data
kept in databases in order to locate required information
and facts. Data mining is the exploration and analysis of
large quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable patterns
in data. It has also been described as the non-trivial
extraction of implicit, previously unknown and potentially
useful information from data, and as the exploration and
analysis, by automatic or semi-automatic means, of large
quantities of data in order to discover meaningful patterns.
In this paper, we discuss data mining techniques and
functionalities together with their applications. We also
discuss the research challenges in science and engineering
from the data mining perspective, with a focus on data
mining issues.
Keywords-- data mining, data mining techniques and
functionalities, research challenges, data mining issues.
I. INTRODUCTION
With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful means
for analysis and perhaps interpretation of such data and
for the extraction of interesting knowledge that could
help in decision-making.
Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and
potentially useful information from data in databases.
II. DATA MINING TECHNIQUES AND FUNCTIONALITIES
The data mining functionalities and the variety of
knowledge they discover are briefly presented in the
following list:
A. Characterization:
Data characterization is a summarization of general
features of objects in a target class, and produces what is
called characteristic rules. The data relevant to a
user-specified class are normally retrieved by a database
query and run through a summarization module to extract the
essence of the data at different levels of abstraction. For
example, one may want to characterize the
OurVideoStore customers who regularly rent more than
30 movies a year. With concept hierarchies on the
attributes describing the target class, the
attribute-oriented induction method can be used, for
example, to carry out data summarization. Note that with a
data cube containing summarization of data, simple OLAP
operations fit the purpose of data characterization.
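The characterization example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the `age_group` concept hierarchy and the function names are hypothetical and do not come from the paper, and the sketch performs one simple generalization step rather than full attribute-oriented induction.

```python
from collections import Counter

# Hypothetical customer records: (customer_id, age, rentals_per_year).
customers = [
    ("c1", 17, 42),
    ("c2", 34, 12),
    ("c3", 15, 55),
    ("c4", 41, 31),
    ("c5", 68, 36),
    ("c6", 29, 8),
]


def age_group(age):
    """Hypothetical concept hierarchy: generalize a raw age to a coarser level."""
    if age < 20:
        return "teen"
    if age < 60:
        return "adult"
    return "senior"


def characterize(records, min_rentals=30):
    """Summarize the target class: customers renting more than min_rentals a year."""
    # Step 1: retrieve the data relevant to the user-specified class.
    target = [r for r in records if r[2] > min_rentals]
    # Step 2: generalize the age attribute and count each generalized value.
    counts = Counter(age_group(age) for _, age, _ in target)
    total = len(target)
    # Each entry reads: "this share of the target class falls in this age group".
    return {group: n / total for group, n in counts.items()}


summary = characterize(customers)
```

Each entry of `summary` plays the role of a simple characteristic rule for the target class, expressed at the level of age groups rather than raw ages.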
Figure 1. Data mining (knowledge discovery in databases)
While data mining and knowledge discovery in
databases (or KDD) are frequently treated as synonyms,
data mining is actually part of the knowledge discovery
process. Figure 1 shows data mining as a step in an
iterative knowledge discovery process.
B. Discrimination:
Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization with the exception that data discrimination results include comparative measures.
C. Association analysis:
Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases, and based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the OurVideoStore manager to know what movies are often rented together or if there is a relationship between renting a certain type of movies and buying popcorn or pop. The discovered association rules are of the form: P -> Q [s, c], where P and Q are conjunctions of attribute-value pairs, and s (for support) is the probability that P and Q appear together in a transaction and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule RentType(X, “game”) ∧ Age(X, “13-19”) -> Buys(X, “pop”) [s = 2%, c = 55%] would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
1) Association Rule Discovery: Application
1.1 Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So, don’t be surprised if you find six-packs stacked next to diapers!
1.2 Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
D. Classification:
Classification analysis is the organization of data in given classes. Also known as supervised classification, the classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects. For example, after starting a credit policy, the OurVideoStore managers could analyze the customers’ behaviours vis-à-vis their credit, and label accordingly the customers who received credits with three possible labels “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
1) Classification: Application
1.1 Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.
1.2 Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach: Use credit card transactions and the information on the account-holder as attributes: when does a customer buy, what does he buy, how often he pays on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.
E. Prediction:
Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification.
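The classification workflow described above (a training set with known class labels, a learned model, and the labelling of new objects) can be sketched as follows. The paper does not name a specific algorithm, so this sketch assumes a nearest-centroid classifier purely for illustration; the attributes `late_payments` and `rentals_per_month`, the training data and the function names are hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical training set: ((late_payments, rentals_per_month), class label).
training_set = [
    ((0, 8), "safe"),
    ((1, 6), "safe"),
    ((4, 5), "risky"),
    ((5, 3), "risky"),
    ((9, 1), "very risky"),
    ((10, 2), "very risky"),
]


def train(examples):
    """Build the model: one centroid (mean attribute vector) per class label."""
    grouped = defaultdict(list)
    for attributes, label in examples:
        grouped[label].append(attributes)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, attributes):
    """Predict the class label whose centroid is closest to the new object."""
    return min(model, key=lambda label: dist(model[label], attributes))


model = train(training_set)
prediction = classify(model, (6, 4))  # a new customer, not in the training set
```

Here `model` is what the classification algorithm learns from the training set, and `classify` is the step that predicts a class label for data that was not seen during training.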
Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
F. Clustering:
Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
1) Clustering: Application
1.1. Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
1.2. Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
G. Outlier analysis:
Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.
H. Evolution and deviation analysis:
Evolution and deviation analysis pertain to the study of time-related data that changes in time. Evolution analysis models evolutionary trends in data, which makes it possible to characterize, compare, classify or cluster time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.
It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge and at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.
III. ISSUES IN DATA MINING
Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.
A. Security and social issues:
Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.
B. User interface issues:
The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are “screen real-estate”, information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.
C. Mining methodology issues:
These issues pertain to the data mining approaches applied and their limitations. Topics such as versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc. are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve user’s needs differently. Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information. More than the size of data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends on the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This “curse” affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
D. Performance issues:
Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later. Incremental updating is important for merging results from parallel mining, or updating data mining results when new data becomes available without having to re-analyze the complete dataset.
F. Data source issues:
There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.
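The exponential growth of the search space noted under the mining methodology issues can be made concrete with two small counts. The sketch below is an illustration only: it assumes a domain where every dimension is discretized into the same number of bins, and a pattern search over all non-empty item sets; the function names and the choice of 10 bins are hypothetical.

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells when every dimension is cut into the same number of bins."""
    return bins_per_dimension ** dimensions


def candidate_itemsets(num_items):
    """Number of non-empty item sets that can be formed from num_items items."""
    return 2 ** num_items - 1


# With 10 bins per dimension, each added dimension multiplies the space by 10.
for d in (1, 2, 5, 10, 20):
    print(f"{d:>2} dimensions: {grid_cells(10, d):,} cells")
```

Under these assumptions a fixed amount of data becomes sparser with every added dimension, which is one reason why approaches that behave well in a few dimensions can degrade sharply in many.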
IV. MAJOR RESEARCH CHALLENGES
In this section, we will examine several major challenges raised in science and engineering from the data mining perspective, and point out some promising research directions.
A. Information network analysis
With the development of Google and other effective web search engines, information network analysis has become an important research frontier, with broad applications, such as social network analysis, web community discovery, terrorist network mining, computer network analysis, and network intrusion detection. However, information network research should go beyond explicitly formed, homogeneous networks (e.g., web page links, computer networks, and terrorist e-connection networks) and delve deeply into implicitly formed, heterogeneous, and multidimensional information networks. Science and engineering provide us with rich opportunities for the exploration of networks in this direction.
There are many massive natural, technical, social, and information networks in science and engineering applications, such as gene, protein, and microarray networks in biology; highway transportation networks in civil engineering; topic- or theme-author-publication-citation networks in library science; and wireless telecommunication networks among commanders, soldiers and supply lines in a battle field. In such information networks, each node or link in a network contains valuable, multidimensional information, such as textual contents, geographic information, traffic flow, and other properties. Moreover, such networks could be highly dynamic, evolving, and inter-dependent.
Many domains of interest today are best described as a network of interrelated heterogeneous objects. As future work, link mining may focus on the integration of link mining algorithms for a spectrum of knowledge discovery tasks. Furthermore, in many applications, the facts to be analyzed are dynamic and it is important to develop incremental link mining algorithms. Besides mining knowledge from links, objects and networks, we may wish to construct an information network based on both ontological and unstructured information.
B. Discovery, understanding, and usage of patterns and knowledge
Scientific and engineering applications often handle massive data of high dimensionality. The goal of pattern mining is to find item sets, subsequences, or substructures that appear in a data set with frequency no less than a user-specified threshold. Pattern analysis can be a valuable tool for finding correlations, clusters, classification models, sequential and structural patterns, and outliers.
Frequent pattern mining has been a focused theme in data mining research for over a decade [HCXY07]. Abundant literature has been dedicated to this research, and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent item set mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structural pattern mining, correlation mining, associative classification, and frequent-pattern-based clustering, as well as their broad applications.
The promotion of effective application of pattern analysis methods in scientific and engineering applications is an important task in data mining. Moreover, it is important to further develop efficient methods for mining long, approximate, compressed, and sophisticated patterns for advanced applications, such as mining biological sequences and networks and mining patterns related to scientific and engineering processes. Furthermore, the exploration of mined patterns for classification, clustering, correlation analysis, and pattern understanding will still be interesting topics in research.
C. Stream data mining
Stream data refers to the data that flows into and out of the system like streams. Stream data is usually in vast volume, changing dynamically, possibly infinite, and containing multi-dimensional features. Typical examples of such data include audio and video recording of scientific and engineering processes, computer network information flow, web click streams, and satellite data flow. Such data cannot be handled by traditional database systems, and moreover, most systems may only be able to read a data stream once in sequential order. This poses great challenges on effective mining of stream data [BBD+02, Agg06].
First, the techniques to summarize the whole or part of the data streams are studied, which is the basis for stream data mining. Such techniques include sampling [DH01], load shedding [TcZ+03] and sketching techniques [Mut03], synopsis data structures [GKMS01], stream cubing [CDH+02], and clustering [AHWY03]. Progress has been made on efficient methods for mining frequent patterns in data streams [MM02], multidimensional analysis of stream data (such as construction of stream cubes) [CDH+02], stream data classification [AHWY04], stream clustering [AHWY03], stream outlier analysis, rare event detection [GFHY07], and so on. The general philosophy is to develop single-scan algorithms to collect information about stream data in tilted time windows, exploring micro-clustering, limited aggregation, and approximation.
The focus of stream pattern analysis is to approximate the frequency counts for infinite stream data.
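As a concrete instance of approximating frequency counts in a single scan with bounded memory, the sketch below implements the Misra-Gries summary. This is a standard textbook technique chosen here for illustration; it is an assumption of this example and is not claimed to be the method of any of the works cited in this section. The sample stream and the parameter `k` are hypothetical.

```python
def misra_gries(stream, k):
    """Single-pass frequency summary that keeps at most k - 1 counters.

    After n items, every reported count underestimates the true count
    by at most n / k, so any item occurring more than n / k times is kept.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical click stream; in practice this would be an unbounded iterator.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e"]
summary = misra_gries(stream, k=3)
```

The summary reads each element once and never stores more than `k - 1` counters, which matches the single-scan, limited-memory setting described above, at the price of approximate rather than exact counts.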
Algorithms have been developed to count frequency using tilted windows [GHPY02], based on the fact that users are more interested in the most recent transactions; to perform approximate frequency counting based on previous historical data so as to calculate the frequent patterns incrementally [MM02]; and to track the most frequent k items in the continuously arriving data [CM03].
Stream data is often encountered in science and engineering applications. It is important to explore stream data mining in such applications and develop application-specific methods, e.g., real-time anomaly detection in computer network analysis, in electric power grid supervision, in weather modeling, in engineering and security surveillance, and other stream data applications.
D. Mining moving object data, RFID data, and data from sensor networks
With the popularity of sensor networks, GPS, cellular phones, other mobile devices, and RFID technology, a tremendous amount of moving object data has been collected, calling for effective analysis. This is especially true in many scientific, engineering, business and homeland security applications.
Sensor networks are finding an increasing number of applications in many domains, including battle fields, smart buildings, and even the human body. Most sensor networks consist of a collection of light-weight (possibly mobile) sensors connected via wireless links to each other or to a more powerful gateway node that is in turn connected with an external network through either wired or wireless connections. Sensor nodes usually communicate in a peer-to-peer architecture over an asynchronous network. In many applications, sensors are deployed in hostile and difficult-to-access locations with constraints on weight, power supply, and cost. Moreover, sensors must process a continuous (possibly fast) stream of data. Data mining in wireless sensor networks (WSNs) is a challenging area, as algorithms need to work in the extremely demanding and constrained environment of sensor networks (such as limited energy, storage, computational power, and bandwidth). WSNs also require highly decentralized algorithms.
Development of algorithms that take into consideration the characteristics of sensor networks, such as energy and computation constraints, network dynamics, and faults, constitutes an area of current research. Some work has been done in developing localized, collaborative, distributed and self-configuration mechanisms in sensor networks.
In designing algorithms for sensor networks, it is imperative to keep in mind that power consumption has to be minimized. Since even gathering the distributed sensor data in a single site could be expensive in terms of battery power consumed, some attempts have been made towards making the data collection task energy efficient and balancing the energy-quality trade-offs.
Clustering the nodes of the sensor networks is an important optimization problem. Nodes that are clustered together can easily communicate with each other, which can be applied to energy optimization and developing optimal algorithms for clustering sensor nodes. Other works in this field include identification of rare events or anomalies, finding frequent item sets, and data preprocessing in sensor networks.
Recent years have witnessed an enormous increase in moving object data from RFID records in supply chain operations, toll and road sensor readings from vehicles on road networks, or even cell phone usage from different geographic regions. These movement data, including RFID data, object trajectories, and anonymous aggregate data such as that generated by many road sensors, contain rich information. Effective management of such data is a major challenge facing society today, with important implications for business optimization, city planning, privacy, and national security. Interesting research has been conducted on warehousing RFID data sets [GHLK06], which could handle moving object data sets by significantly compressing such data and proposing a new aggregation mechanism that preserves their path structures. Mining moving objects is a challenging problem due to the massive size of the data and its spatiotemporal characteristics. The methods developed along this line include FlowGraph [GHL06b], which is a probabilistic model that captures the main trends and exceptions in moving object data; FlowCube [GHL06a], which is a multi-dimensional extension of the FlowGraph; and an adaptive fastest path algorithm [GHL+07] that computes routes based on driving patterns present in the data. RFID systems are known to generate noisy data, so data cleaning is an essential task for the correct interpretation and analysis of moving object data, especially when it is collected from RFID applications, and thus calls for cost-effective cleaning methods (such as [GHS07]). One important application with moving objects is automated identification of suspicious movements. A framework for detecting anomalies [LHKG07] is proposed to express object trajectories using discrete pattern fragments, extract features to form a hierarchical feature space and learn effective classification rules at multiple levels of granularity. Another line of work on outlier detection in trajectories focuses on detecting outlying sub-trajectories [LHL08] based on a partition-and-detect framework, which partitions a trajectory into a set of line segments and then detects outlying line segments for trajectory outliers. The problem of clustering trajectory data [LHW07] is also studied, where common sub-trajectories are discovered using the minimum description length (MDL) principle.
Overall, this is still a young field with many research issues to be explored on mining moving object data, RFID data, and data from sensor networks.
Open questions include how to explore correlation and regularity to clean noisy sensor network and RFID data, how to integrate and construct data warehouses for such data, how to perform scalable mining for peta-byte RFID data, how to find strange moving objects, and how to classify multidimensional trajectory data. Given the time, location, moving direction, speed, and multidimensional semantics of moving object data, multi-dimensional data mining will likely play an essential role in this study.
F. Spatial, temporal, spatiotemporal, and multimedia data mining
Scientific and engineering data is usually related to space and time, and often comes in multimedia modes (e.g., containing color, image, audio, and video). With the popularity of digital photos, audio DVDs, videos, YouTube, web-based map services, weather services, satellite images, digital earth, and many other forms of multimedia, spatial, and spatiotemporal data, mining spatial, temporal, spatiotemporal, and multimedia data will become increasingly popular, with far-reaching implications [MH01, SC03]. For example, mining satellite images may help detect forest fires, find unusual phenomena on earth, predict hurricane landing sites, discover weather patterns, and outline global warming trends.
Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from large spatial data sets [SZHV04]. Extracting interesting and useful patterns from spatial data sets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. Interesting research topics in this field include prediction of events at particular geographic locations, detecting spatial outliers whose non-spatial attributes are extreme relative to their neighbors, finding co-location patterns where instances containing the patterns are often located in close geographic proximity, and grouping a set of spatial objects into clusters. Future research is needed to compare the differences and similarities between classical data mining and spatial data mining techniques, model semantically rich spatial properties other than neighborhood relationships, design effective statistical methods to interpret the mined spatial patterns, investigate proper measures for location prediction to improve spatial accuracy, and facilitate visualization of spatial relationships by representing both spatial and non-spatial features.
The problems of incorporating domain knowledge into mining when data is scarce and of integrating data collection with mining are worth studying in spatial data mining, and both theoretical analyses toward general studies of spatial phenomena and empirical model designs targeted for specific applications represent the trends for future research.
Research in this domain needs the confluence of multiple disciplines including image processing, pattern recognition, geographic information systems, parallel processing, and statistical data analysis. Automatic categorization of images and videos, classification of spatiotemporal data, finding frequent/sequential patterns and outliers, spatial collocation analysis, and many other tasks have been widely studied. As such data accumulates, the development of scalable analysis methods and new data mining functions will be an important research frontier for years to come.
G. Mining text, Web, and other unstructured data
The Web is a common place for scientists and engineers to publish their data, share their observations and experiences, and exchange their ideas. There is a tremendous amount of scientific and engineering data on the Web. For example, in biology and bioinformatics research, there are GenBank, ProteinBank, GO, PubMed, and many other biological or biomedical information repositories available on the Web. Therefore, the Web has become the ultimate information access and processing platform, housing not only billions of link-accessed "pages", containing textual data, multimedia data, and linkages, on the surface Web, but also query-accessible "databases" on the deep Web. With the advent of Web 2.0, there is an increasing amount of dynamic "workflow" emerging. As it penetrates deeply into our daily life and evolves into unlimited dynamic applications, the Web is central to our information infrastructure. Its virtually unlimited scope and scale offer immense opportunities for data mining.
H. Data cube-oriented multidimensional online analytical mining
Scientific and engineering datasets are usually high-dimensional in nature. Viewing and mining data in multidimensional space will substantially increase the power and flexibility of data analysis. Data cube computation and OLAP (online analytical processing) technologies developed in data warehousing have substantially increased the power of multidimensional analysis of large datasets.
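As a minimal sketch of the multidimensional view that a data cube provides, the Python code below materializes every cuboid (group-by) of a tiny fact table. The readings, the dimension names (`region`, `year`, `sensor`), and the helpers `build_cube` and `average` are illustrative assumptions and do not come from any system cited in this paper; the full materialization shown here is naive and would not scale to the large datasets discussed above.

```python
from itertools import combinations
from collections import defaultdict

ALL = "*"  # marks a dimension that has been aggregated away


def build_cube(rows, dims, measure):
    """Return {cell: (count, total)} for every cuboid over `dims`.

    A cell is a tuple with one entry per dimension; ALL means that
    dimension is rolled up. This is a naive full materialization.
    """
    cube = defaultdict(lambda: (0, 0.0))
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):  # one cuboid per subset of dims
            for row in rows:
                cell = tuple(row[d] if d in kept else ALL for d in dims)
                count, total = cube[cell]
                cube[cell] = (count + 1, total + row[measure])
    return dict(cube)


def average(cube, cell):
    """Derive the mean of a cell from its additive (count, total) summary."""
    count, total = cube[cell]
    return total / count


# Hypothetical sensor readings used only for illustration.
rows = [
    {"region": "north", "year": 2013, "sensor": "s1", "temp": 10.0},
    {"region": "north", "year": 2014, "sensor": "s1", "temp": 14.0},
    {"region": "south", "year": 2014, "sensor": "s2", "temp": 20.0},
    {"region": "south", "year": 2014, "sensor": "s3", "temp": 24.0},
]
dims = ("region", "year", "sensor")
cube = build_cube(rows, dims, "temp")

print(average(cube, ("north", ALL, ALL)))  # mean temperature in the north
print(average(cube, (ALL, 2014, ALL)))     # mean temperature in 2014
```

Because each cell keeps only a count and a sum, a higher-level cell such as `("north", ALL, ALL)` could also be obtained by adding the summaries of `("north", 2013, ALL)` and `("north", 2014, ALL)` rather than rescanning the rows.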
Some researchers have begun to investigate how to conduct traditional data mining and statistical analysis efficiently in a multi-dimensional manner. For example, the regression cube [CDH+06] is designed to support efficient computation of statistical models. In this framework, each cell can be compressed into an auxiliary matrix with a size independent of the number of tuples, and the statistical measures for any data cell can then be computed from the compressed data of the lower-level cells without accessing the raw data. In a prediction cube [CCLR05], each cell contains a value that summarizes a predictive model trained on the data corresponding to that cell and characterizes its decision behavior or predictiveness. The authors further show that such cubes can be efficiently computed by exploiting the idea of model decomposition. In [LH07], the issues of anomaly detection in multi-dimensional time-series data are examined. A time-series data cube is proposed to capture the multi-dimensional space formed by the attribute structure and to facilitate the detection of anomalies based on expected values derived from higher-level, more general time-series. Moreover, an efficient search algorithm is proposed to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. A recent study on sampling cubes [LHY+08] discusses the desirability of OLAP over sampling data, which may not represent the full data in the population. The proposed sampling cube framework can efficiently calculate confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase the sample size when needed. Further, to handle high-dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality. Such multi-dimensional, especially high-dimensional, analysis tools will ensure data can be analyzed in hierarchical, multidimensional structures efficiently and flexibly at the user's fingertips. This leads to the integration of online analytical processing with data mining, i.e., OLAP mining. Some efforts have been devoted to this direction, but grand challenges still exist when one needs to explore the large space of choices to find interesting patterns and trends [RC07].
We believe that OLAP mining will substantially enhance the power and flexibility of data analysis and lead to the construction of easy-to-use tools for the analysis of massive data with hierarchical structures in multidimensional space. It is a promising research field for developing effective tools and scalable methods for exploratory-based scientific and engineering data mining.
I. Visual data mining
A picture is worth a thousand words. There have been numerous data visualization tools for visualizing various kinds of data sets in massive amounts and in multidimensional space [Tuf01]. Besides popular bar charts, pie charts, curves, histograms, quantile plots, quantile-quantile plots, boxplots, and scatter plots, there are also many visualization tools using geometric (e.g., dimension stacking, parallel coordinates), hierarchical (e.g., treemap), and icon-based (e.g., Chernoff faces and stick figures) techniques. Moreover, there are methods for visualizing sequences, time-series data, phylogenetic trees, graphs, networks, and the web, as well as various kinds of patterns and knowledge (e.g., decision trees, association rules, clusters, and outliers) [FGW01]. There are also visual data mining tools that may facilitate interactive mining based on the user's judgement of intermediate data mining results [AEEK99]. Recently, we have developed a DataScope system that maps relational data into 2-D maps so that multidimensional relational data can be browsed in the manner of Google Maps [WLX+07]. We believe that visual data mining is appealing to scientists and engineers because they often have a good understanding of their data, can use their knowledge to interpret their data and patterns with the help of visualization tools, and can interact with the system for deeper and more effective mining. Tools should be developed for mapping data and knowledge into appealing and easy-to-understand visual forms, and for interactive browsing, drilling, scrolling, and zooming of data and patterns to facilitate user exploration. Finally, for visualization of large amounts of data, parallel processing and high-performance visualization tools should be investigated to ensure high performance and fast response.
J. Domain-specific data mining: Data mining by integration of sophisticated scientific and engineering domain knowledge
Besides general data mining methods and tools for science and engineering, each scientific or engineering discipline has its own data sets and special mining requirements, some of which could be rather different from the general ones. Therefore, in-depth investigation of each problem domain and development of dedicated analysis tools are essential to the success of data mining in this domain. Here we examine two problem domains: biology and software engineering.
1) Biological data mining
The fast progress of biomedical and bioinformatics research has led to the accumulation and publication (on the web) of vast amounts of biological and bioinformatics data. However, the analysis of such data poses much greater challenges than traditional data analysis methods can handle [BHLY04]. For example, genes and proteins are gigantic in size (e.g., a DNA sequence could be in billions of base pairs) and very sophisticated in function, and the patterns of their interactions are largely unknown. Thus it is a fertile field in which to develop sophisticated data mining methods for in-depth bioinformatics research.
We believe substantial research is badly needed to produce powerful mining tools in many biological and bioinformatics subfields, including comparative genomics, evolution and phylogeny, biological data cleaning and integration, biological sequence analysis, biological network analysis, biological image analysis, biological literature analysis (e.g., PubMed), and systems biology. From this point of view, data mining is still very young with respect to biology and bioinformatics applications. Substantial research should be conducted to cover the vast spectrum of data analysis tasks.
2) Data mining for software engineering
Software program executions potentially (e.g., when program execution traces are turned on) generate huge amounts of data. However, such data sets are rather different from the datasets generated from nature or collected from video cameras, since they represent the executions of program logic coded by human programmers. It is important to mine such data to monitor program execution status, improve system performance, isolate software bugs, detect software plagiarism, analyze programming system faults, and recognize system malfunctions.
Data mining for software engineering can be partitioned into static analysis and dynamic/stream analysis, based on whether the system can collect traces beforehand for post-analysis or must react in real time to handle online data. Different methods have been developed in this domain by integration and extension of the methods developed in machine learning, data mining, pattern recognition, and statistics. For example, a statistical analysis (such as hypothesis testing) approach [LFY+06] can be performed on program execution traces to isolate the locations of bugs which distinguish program success runs from failing runs. Despite its limited success, it is still a rich domain for data miners to research and further develop sophisticated, scalable, and real-time data mining methods.
V. CONCLUSION
In this paper, we have examined a few important research challenges and issues in science and engineering data mining. We have also examined data mining techniques for data security and privacy (such as fraud detection and direct marketing), social media data analysis and computing, Web-scale data mining and semantic discovery, and large-scale data integration and mining.
REFERENCES
[1] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, January/February 2001.
[2] Philip K. Chan (Florida Institute of Technology), Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo (Columbia University), "Distributed Data Mining in Credit Card Fraud Detection," November/December 1999, 1094-7167/99/$10.00 © 1999 IEEE.
[3] Rachna Somkunwar, "A Study on Various Data Mining Approaches of Association Rules," IJARCSSE, Volume 2, Issue 9, September 2012, ISSN: 2277 128X.
[4] Hongjun Lu, Rudy Setiono, and Huan Liu, "Effective Data Mining Using Neural Networks," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, December 1996.
[5] Daniel A. Keim, "Information Visualization and Visual Data Mining," IEEE Transactions on Visualization and Computer Graphics, vol. 7, no. 1, January-March 2002.
[6] Mario Cannataro, Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio, "Distributed Data Mining on Grids: Services, Tools, and Applications," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 34, no. 6, December 2004.
[7] Michael Goebel and Le Gruenwald, "A Survey of Data Mining and Knowledge Discovery Software Tools," SIGKDD Explorations, Volume 1, Issue 1, June 1999, page 21. Copyright 1999 ACM SIGKDD.
[8] S. Hameetha Begum, "Data Mining Tools and Trends – An Overview," International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359.
[9] Tipawan Silwattananusarn and Assoc. Prof. Dr. Kulthida Tuamsuk, "Data Mining and Its Applications for Knowledge Management: A Literature Review from 2007 to 2012," International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 2, No. 5, September 2012.
BIBLIOGRAPHY
Girish Kumar Sorot received his B.Tech. degree in computer science & engineering from Rajasthan Technical University, Kota, and is currently pursuing the M.Tech. degree in computer science & engineering from Rajasthan Technical University, Kota (Rajasthan). His current research interests are in the areas of network security, cloud computing, and real-time systems.