SYSTEM ANALYSIS AND DESIGN
RESEARCH REPORT
DATA MINING
26.11.2016
Seyit Mert AYVAZ
2012555008
Introduction
Data is a term that rose to prominence with the rapid development of computers. As
the evolution of computers continued, data was separated into large and disparate levels,
and this separation led to the study of data from different aspects. In time, big databases
became too hard to work with directly. Both scientists and the employees of big companies
started to look for a solution, and out of these efforts the "Data Mining" process took its
place in the informatics world.
In this report, I approach data mining in various ways. The report covers almost
every topic concerning data mining. First, some introductory topics are considered on a
preferential basis, such as the term data mining, its architecture, and the mining process.
Since it is important to understand how data mining evolved, its history and milestones are
then reviewed in detail. Afterwards comes what I consider the most important topic
regarding data mining: its scope. As important as it is for academics, it is also considered
very important for the business world, so the major effects of data mining and the current
studies are described for both sides, academic and business. Finally, some ideas and future
trends are explained briefly.
I hope that this report will be beneficial to its readers and to anyone who is curious
about data mining.
Page 2 of 25
TABLE OF CONTENTS
1.Introduction to Data Mining...................................................................................................4
1.1 What is Data Mining.................................................................................................4
1.1.1.Automatic Discovery..................................................................................4
1.1.2.Prediction..................................................................................................5
1.1.3.Grouping....................................................................................................5
1.1.4.Actionable Information.............................................................................5
1.2.Architecture of Data Mining.....................................................................................6
1.2.1. Data Sources.............................................................................................7
1.2.2. Database or Data Warehouse Server.......................................................7
1.2.3. Data Mining Engine...................................................................................7
1.2.4. Pattern Evaluation Modules.....................................................................7
1.2.5. Graphical User Interface...........................................................................7
1.2.6. Knowledge Base........................................................................................7
1.3.Data Mining Processes.............................................................................................8
1.3.1. Problem definition....................................................................................8
1.3.2. Data exploration.......................................................................................9
1.3.3. Data preparation......................................................................................9
1.3.4. Modeling...................................................................................................9
1.3.5. Evaluation.................................................................................................9
1.3.6. Deployment..............................................................................................9
2.History of Data Mining .........................................................................................................10
2.1 Foundations of Data Mining...................................................................................10
2.2. Evolution in data mining for business...................................................................11
2.3. Milestones of Data Mining....................................................................................12
3.Scope of Data Mining............................................................................................................15
3.1. Usage of Data Mining Techniques ........................................................................16
3.1.1. Association..............................................................................................16
3.1.2. Classification...........................................................................................16
3.1.3. Clustering................................................................................................17
3.1.4. Prediction...............................................................................................17
3.1.5. Sequential Patterns................................................................................17
3.1.6. Decision trees.........................................................................................17
3.2. Data Mining in Academia.......................................................................................18
3.2.1.Science and Engineering..........................................................................18
3.2.2. Medical Data Mining...............................................................................19
3.2.3. Spatial Data Mining.................................................................................19
3.2.4. Pattern mining........................................................................................20
3.2.5. Human Rights.........................................................................................20
3.2.6. Sensor Data Mining................................................................................20
3.3 Data Mining in Business.........................................................................................20
4.Future of Data Mining...........................................................................................................23
4.1. Distributed/Collective Data Mining (DDM) ..........................................................23
4.2. Ubiquitous Data Mining (UDM) ............................................................................23
4.3. Hypertext and Hypermedia Data Mining...............................................................23
4.4. Multimedia Data Mining........................................................................................24
4.5. Time Series/Sequence Data Mining.......................................................................24
1.Introduction to Data Mining
Before anything else, one has to study and understand some terms related to data
mining, such as data, information, and knowledge. Since all studies related to data mining
also involve these terms, it is important to grasp the relationship between data,
information, and knowledge.
Data: data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases. This includes:
• operational or transactional data, such as sales, cost, inventory, payroll, and
accounting
• nonoperational data, such as industry sales, forecast data, and macroeconomic data
• metadata - data about the data itself, such as logical database design or data
dictionary definitions
Information: the patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
Knowledge: information can be converted into knowledge about historical patterns and
future trends. For example, summary information on retail supermarket sales can be
analyzed in light of promotional efforts to provide knowledge of consumer buying behavior.
Thus, a manufacturer or retailer could determine which items are most susceptible to
promotional efforts.
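As a toy sketch of this data-to-information step, raw point-of-sale records can be aggregated into per-product summaries. The transactions and field layout below are assumed purely for illustration:

```python
from collections import defaultdict

# Hypothetical point-of-sale transactions: (product, quantity)
sales = [("beer", 2), ("crisps", 1), ("beer", 4), ("milk", 3), ("crisps", 2)]

# Aggregate raw transaction data into summary information per product
totals = defaultdict(int)
for product, quantity in sales:
    totals[product] += quantity

print(dict(totals))  # {'beer': 6, 'crisps': 3, 'milk': 3}
```

Such a summary, examined alongside promotion schedules, is the kind of input from which knowledge about buying behavior is derived.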
1.1.What is Data Mining ?
Data mining is an interdisciplinary subfield of computer science. It is the
computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Data mining uses sophisticated mathematical algorithms to segment the data and evaluate
the probability of future events. Data mining is also known as Knowledge Discovery in Data
(KDD).
In general, the key properties of data mining can be summarized as:
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large data sets and databases
To understand these properties better, we can explain them in more detail as
follows.
1.1.1.Automatic Discovery
Data mining is accomplished by building models. A model uses an algorithm to act on a set of
data. The notion of automatic discovery refers to the execution of data mining models. Data
mining models can be used to mine the data on which they are built, but most types of
models are generalizable to new data. The process of applying a model to new data is known
as scoring.
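A minimal sketch of the build-then-score idea, using assumed toy data and a deliberately simple threshold "model" rather than any real mining algorithm:

```python
# "Build" phase: learn a simple threshold from a labeled training set.
# Records are (years_of_education, high_income) pairs, assumed for illustration.
train = [(12, 0), (25, 1), (30, 1), (8, 0)]
threshold = sum(x for x, _ in train) / len(train)  # 18.75 on this data

def score(record):
    """Apply the built model to a new, previously unseen record (scoring)."""
    return 1 if record >= threshold else 0

# "Score" phase: the same model generalizes to new data.
new_records = [10, 28]
print([score(r) for r in new_records])  # [0, 1]
```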
1.1.2.Prediction
Many forms of data mining are predictive. For example, a model might predict income based
on education and other demographic factors. Predictions have an associated probability
(How likely is this prediction to be true?). Prediction probabilities are also known as
confidence. Some forms of predictive data mining generate rules, which are conditions that
imply a given outcome. For example, a rule might specify that a person who has a bachelor's
degree and lives in a certain neighborhood is likely to have an income greater than the
regional average.
1.1.3.Grouping
Other forms of data mining identify natural groupings in the data. For example, a model
might identify the segment of the population that has an income within a specified range,
that has a good driving record, and that leases a new car on a yearly basis.
1.1.4.Actionable Information
Data mining can derive actionable information from large volumes of data. For
example, a town planner might use a model that predicts income based on demographics to
develop a plan for low-income housing. A car leasing agency might use a model that
identifies customer segments to design a promotion targeting high-value customers.
The actual data mining task is the automatic or semi-automatic analysis of large
quantities of data to extract previously unknown, interesting patterns such as groups of data
records (cluster analysis), unusual records (anomaly detection), and dependencies
(association rule mining). This usually involves using database techniques such as spatial
indices. These patterns can then be seen as a kind of summary of the input data, and may be
used in further analysis or, for example, in machine learning and predictive analytics. For
example, the data mining step might identify multiple groups in the data, which can then be
used to obtain more accurate prediction results by a decision support system. Neither the
data collection, data preparation, nor result interpretation and reporting is part of the data
mining step, but do belong to the overall KDD process as additional steps.
In other words, data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into useful
information - information that can be used to increase revenue, cuts costs, or both. Data
mining software is one of a number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations or
patterns among dozens of fields in large relational databases.
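As a small illustration of finding a correlation between two fields, Pearson's r can be computed directly; the two columns below are assumed toy values, not data from a real database:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric fields."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]  # hypothetical field 1
revenue = [12, 24, 33, 41, 55]   # hypothetical field 2
print(round(pearson(ad_spend, revenue), 3))  # 0.996 (a strong positive correlation)
```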
1.2.Architecture of Data Mining
Figure 1.2.1: Architecture of data mining levels.
The major components of any data mining system are data source, data warehouse
server, data mining engine, pattern evaluation module, and graphical user interface. To
gain a better understanding of these, we will examine what each component is and what its
purpose is.
1.2.1. Data Sources:Database, data warehouse, World Wide Web (WWW), text files and
other documents are the actual sources of data. It is a necessity to have a large quantity of
historical data for data mining to be successful. Organizations generally store data in
databases or data warehouses. Data warehouses may contain one or more databases, text
files, spreadsheets or other kinds of information repositories. Sometimes, data may reside
even in plain text files or spreadsheets. World Wide Web or the Internet is another big
source of data.
Why does the data go through different processes?
The data needs to be cleaned, integrated and selected before passing it to the database or
data warehouse server. As the data is from different sources and in different formats, it
cannot be used directly for the data mining process because the data might not be complete
and reliable. So, first data needs to be cleaned and integrated. Again, more data than
required will be collected from different data sources and only the data of interest needs to
be selected and passed to the server. These processes are not as simple as we think. A
number of techniques may be performed on the data as part of cleaning, integration and
selection.
1.2.2. Database or Data Warehouse Server:The database or data warehouse server contains
the actual data that is ready to be processed. Hence, the server is responsible for retrieving
the relevant data based on the data mining request of the user.
1.2.3. Data Mining Engine:The data mining engine is the core component of any data mining
system. It consists of a number of modules for performing data mining tasks including
association, classification, characterization, clustering, prediction, time-series analysis etc.
1.2.4. Pattern Evaluation Modules:The pattern evaluation module is mainly responsible for
the measure of interestingness of the pattern by using a threshold value. It interacts with the
data mining engine to focus the search towards interesting patterns.
1.2.5. Graphical User Interface:The graphical user interface module communicates between
the user and the data mining system. This module helps the user use the system easily and
efficiently without knowing the real complexity behind the process. When the user specifies
a query or a task, this module interacts with the data mining system and displays the result
in an easily understandable manner.
1.2.6. Knowledge Base:The knowledge base is helpful in the whole data mining process. It
might be useful for guiding the search or evaluating the interestingness of the result
patterns. The knowledge base might even contain user beliefs and data from user
experiences that can be useful in the process of data mining. The data mining engine might
get inputs from the knowledge base to make the result more accurate and reliable.
1.3.Data Mining Processes
Figure 1.3.1: Phases of the Cross Industry Standard Process for Data Mining (CRISP-DM)
process model.
Many organizations in various industries, including manufacturing, marketing,
chemicals, and aerospace, are taking advantage of data mining to increase their business
efficiency. Accordingly, the need for a standard data mining process has grown. A data
mining process must be reliable, and it must be repeatable by business people with little or
no background in data mining. As a result, in 1996 the Cross-Industry Standard Process for
Data Mining (CRISP-DM) was first announced, after many workshops and contributions
from over 300 organizations.
Cross Industry Standard Process for data mining is an iterative process that typically involves
the following phases:
1.3.1. Problem definition
A data mining project starts with the understanding of the business problem. Data
mining experts, business experts, and domain experts work closely together to define the
project objectives and the requirements from a business perspective. The project objective is
then translated into a data mining problem definition.
In the problem definition phase, data mining tools are not yet required.
1.3.2. Data exploration
Domain experts understand the meaning of the metadata. They collect, describe, and
explore the data. They also identify quality problems of the data. A frequent exchange with
the data mining experts and the business experts from the problem definition phase is vital.
In the data exploration phase, traditional data analysis tools, for example, statistics, are used
to explore the data.
1.3.3. Data preparation
Domain experts build the data model for the modeling process. They collect, cleanse,
and format the data because some of the mining functions accept data only in a certain
format. They also create new derived attributes, for example, an average value.
In the data preparation phase, data is tweaked multiple times in no prescribed order.
Preparing the data for the modeling tool by selecting tables, records, and attributes is a
typical task in this phase. The meaning of the data is not changed.
1.3.4. Modeling
Data mining experts select and apply various mining functions because you can use
different mining functions for the same type of data mining problem. Some of the mining
functions require specific data types. The data mining experts must assess each model.
In the modeling phase, a frequent exchange with the domain experts from the data
preparation phase is required.
The modeling phase and the evaluation phase are coupled. They can be repeated
several times to change parameters until optimal values are achieved. When the final
modeling phase is completed, a model of high quality has been built.
1.3.5. Evaluation
Data mining experts evaluate the model. If the model does not satisfy their
expectations, they go back to the modeling phase and rebuild the model by changing its
parameters until optimal values are achieved. When they are finally satisfied with the model,
they can extract business explanations and evaluate the following questions:
“Does the model achieve the business objective?”
“Have all business issues been considered?”
At the end of the evaluation phase, the data mining experts decide how to use the data
mining results.
1.3.6. Deployment
Data mining experts use the mining results by exporting the results into database
tables or into other applications, for example, spreadsheets.
The Intelligent Miner™ products assist you in following this process. You can apply the
functions of the Intelligent Miner products independently, iteratively, or in combination.
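A deployment step of this kind might be as simple as exporting scored results to a spreadsheet-readable file. The result rows and file name below are assumed for illustration:

```python
import csv

# Hypothetical mining results: (segment, predicted score)
results = [("segment_A", 0.82), ("segment_B", 0.41)]

# Export into a CSV file that spreadsheets and database loaders can consume.
with open("mining_scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["segment", "score"])  # header row
    writer.writerows(results)              # one row per mining result
```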
2.History of Data Mining
First of all, to understand the history and evolution of data mining, it is important to
identify the milestones and foundations of data mining and how they evolved. None of
these processes has a long background, apart from the theories of several scientific fields
such as statistics, machine learning, and artificial intelligence. In later sections, the
foundations and milestones will be expounded in detail. Here we will look at the relation
between statistics, machine learning, artificial intelligence, and data mining, and how that
relation evolved.
Data mining roots are traced back along three family lines: classical statistics, artificial
intelligence, and machine learning.
Statistics are the foundation of most technologies on which data mining is built, e.g.
regression analysis, standard distribution, standard deviation, standard variance,
discriminate analysis, cluster analysis, and confidence intervals. All of these are used to study
data and data relationships.
Artificial intelligence, or AI, which is built upon heuristics as opposed to statistics,
attempts to apply human-thought-like processing to statistical problems. Certain AI
concepts were adopted by some high-end commercial products, such as query optimization
modules for Relational Database Management Systems (RDBMS).
Machine learning is the union of statistics and AI. It could be considered an evolution
of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning
attempts to let computer programs learn about the data they study, such that programs
make different decisions based on the qualities of the studied data, using statistics for
fundamental concepts, and adding more advanced AI heuristics and algorithms to achieve its
goals.
Data mining, in many ways, is fundamentally the adaptation of machine learning
techniques to business applications. Data mining is best described as the union of historical
and recent developments in statistics, AI, and machine learning. These techniques are then
used together to study data and find previously-hidden trends or patterns within.
2.1 Foundations of Data Mining
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
 Massive data collection
 Powerful multiprocessor computers
 Data mining algorithms
2.2. Evolution in data mining for business
In the evolution from business data to business information, each new step has built
upon the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
From the user’s point of view, the four steps listed in Table 2.2.1 were revolutionary because
they allowed new business questions to be answered accurately and quickly.
Data Collection (1960s)
  Business Question: "What was my total revenue in the last five years?"
  Enabling Technologies: Computers, tapes, disks
  Product Providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Data Access (1980s)
  Business Question: "What were unit sales in New England last March?"
  Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
  Business Question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
  Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
  Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

Table 2.2.1: Steps in the Evolution of Data Mining.[1]
The core components of data mining technology have been under development for
decades, in research areas such as statistics, artificial intelligence, and machine learning.
Today, the maturity of these techniques, coupled with high-performance relational database
engines and broad data integration efforts, make these technologies practical for current
data warehouse environments.
2.3. Milestones of Data Mining
Figure 2.3.1: Milestones of data mining related to the main topics.
The following are major milestones and “firsts” in the history of data mining plus how
it’s evolved and blended with data science and big data.
1763 Thomas Bayes’ paper is published posthumously regarding a theorem for relating
current probability to prior probability called the Bayes’ theorem. It is fundamental to data
mining and probability, since it allows understanding of complex realities based on
estimated probabilities.
1805 Adrien-Marie Legendre and Carl Friedrich Gauss apply regression to determine the
orbits of bodies about the Sun (comets and planets). The goal of regression analysis is to
estimate the relationships among variables, and the specific method they used in this case is
the method of least squares. Regression is one of the key tools in data mining.
1936 This is the dawn of the computer age, which made possible the collection and
processing of large amounts of data. In a 1936 paper, On Computable Numbers, Alan Turing introduced
the idea of a Universal Machine capable of performing computations like our modern day
computers. The modern day computer is built on the concepts pioneered by Turing.
1943 Warren McCulloch and Walter Pitts were the first to create a conceptual model of a
neural network. In a paper entitled A logical calculus of the ideas immanent in nervous
activity, they describe the idea of a neuron in a network. Each of these neurons can do 3
things: receive inputs, process inputs and generate output.
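A threshold unit in the spirit of McCulloch and Pitts can be sketched in a few lines; the weights and threshold here are assumed for illustration:

```python
# Sketch of a McCulloch-Pitts-style neuron: receive inputs, process them
# (weighted sum), and generate a binary output.
def neuron(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With weights (1, 1) and threshold 2, the unit computes logical AND.
print(neuron([1, 1], [1, 1], 2))  # 1
print(neuron([1, 0], [1, 1], 2))  # 0
```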
1965 Lawrence J. Fogel formed a new company called Decision Science, Inc. for applications
of evolutionary programming. It was the first company specifically applying evolutionary
computation to solve real-world problems.
1970s With sophisticated database management systems, it’s possible to store and query
terabytes and petabytes of data. In addition, data warehouses allow users to move from a
transaction-oriented way of thinking to a more analytical way of viewing the data. However,
extracting sophisticated insights from these data warehouses of multidimensional models is
very limited.
1975 John Henry Holland wrote Adaptation in Natural and Artificial Systems, the groundbreaking book on genetic algorithms. It is the book that initiated this field of study,
presenting the theoretical foundations and exploring applications.
1980s HNC trademarks the phrase “database mining.” The trademark was meant to protect
a product called DataBase Mining Workstation. It was a general purpose tool for building
neural network models and now no longer is available. It’s also during this period that
sophisticated algorithms can “learn” relationships from data that allow subject matter
experts to reason about what the relationships mean.
1989 The term “Knowledge Discovery in Databases” (KDD) is coined by Gregory Piatetsky-Shapiro. It is also at this time that he co-founds the first workshop, also named KDD.
1990s The term “data mining” appeared in the database community. Retail companies and
the financial community are using data mining to analyze data and recognize trends to
increase their customer base, predict fluctuations in interest rates, stock prices, customer
demand.
1992 Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested an
improvement on the original support vector machine which allows for the creation of
nonlinear classifiers. Support vector machines are a supervised learning approach that
analyzes data and recognizes patterns used for classification and regression analysis.
1993 Gregory Piatetsky-Shapiro starts the newsletter Knowledge Discovery Nuggets
(KDnuggets). It was originally meant to connect researchers who attended the KDD
workshop. However, KDnuggets.com seems to have a much wider audience now.
2001 Although the term “data science” has existed since the 1960s, it wasn’t until 2001 that
William S. Cleveland introduced it as an independent discipline. As per Build Data Science
Teams, DJ Patil and Jeff Hammerbacher then used the term to describe their roles at
LinkedIn and Facebook.
2015 In February 2015, DJ Patil became the first Chief Data Scientist at the White House.
Today, data mining is widespread in business, science, engineering and medicine just to
name a few. Mining of credit card transactions, stock market movements, national security,
genome sequencing and clinical trials are just the tip of the iceberg for data mining
applications.
Present (2016) Finally, one of the most active techniques being explored today is “Deep
Learning”. Capable of capturing dependencies and complex patterns far beyond other
techniques, it is reigniting some of the biggest challenges in the world of data mining, data
science and artificial intelligence. [2]
3.Scope of Data Mining
In this section, the scope will be examined according to the types of relations
between transactional and analytical systems, the levels of analysis, and the tasks of data
mining. Then the usage of data mining in academia and in business will be explained.
While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Accordingly, mining
software has been developed continuously. Data mining software analyzes relationships
and patterns in stored transaction data based on open-ended user queries. Several types of
analytical software are available, such as statistical, machine learning, and neural networks.
Mostly, any of four types of relationships are sought:
 Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers
visit and what they typically order. This information could be used to increase traffic
by having daily specials.
 Clusters: Data items are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or
consumer affinities.
 Associations: Data can be mined to identify associations. The beer-diaper example is
an example of associative mining.
 Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack
being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data by application software.
5. Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:

 Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
 Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
 Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both are decision tree techniques used for classification of a dataset: they provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
 Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
 Rule induction: The extraction of useful if-then rules from data based on statistical significance.
 Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
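As an illustration of the nearest neighbor method listed above, here is a minimal k-nearest-neighbor classifier in plain Python; the toy dataset and the choice of k are made up for the example:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of ((x, y), label) pairs; distance is squared Euclidean."""
    by_dist = sorted(train,
                     key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy historical dataset: two well-separated classes.
train = [((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
         ((5, 5), "high"), ((5, 6), "high"), ((6, 5), "high")]
print(knn_classify(train, (0.5, 0.5)))  # low
print(knn_classify(train, (5.5, 5.5)))  # high
```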
3.1. Usage of Data Mining Techniques
From an academic standpoint, this topic divides into two parts: the techniques used for mining, and significant studies that apply these techniques in different areas.

Several major data mining techniques have been developed and used by researchers in recent studies, including association, classification, clustering, prediction, sequential patterns and decision trees. We will briefly examine these techniques in the following sections.
3.1.1. Association
Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction; for this reason, the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify sets of products that customers frequently purchase together.
Retailers use the association technique to research customers' buying habits. Based on historical sales data, a retailer might find that customers who buy beer often buy crisps as well, and can therefore place beer and crisps next to each other to save customers time and increase sales.
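The market-basket idea can be sketched in a few lines: count how often item pairs co-occur across transactions and keep the pairs above a minimum support, which is the first step of algorithms in the Apriori family. The transactions and the support threshold below are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions from a retailer's sales history.
transactions = [
    {"beer", "crisps", "milk"},
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"milk", "bread"},
]

# Count every item pair that appears together in a transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs meeting a minimum support of 3 transactions.
frequent = [p for p, c in pair_counts.items() if c >= 3]
print(frequent)  # [('beer', 'crisps')]
```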
3.1.2. Classification
Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. Classification makes use of mathematical techniques such as decision trees, linear programming, neural networks and statistics. In classification, we develop software that can learn how to classify data items into groups. For example, we can apply classification to the problem "given all records of employees who left the company, predict who will probably leave the company in a future period." In this case, we divide the records of employees into two groups named "leave" and "stay", and then ask our data mining software to classify the employees into the separate groups.
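A minimal sketch of this leave/stay classification uses a single learned threshold (a "decision stump") rather than a full data mining package; the employee records and the overtime feature are invented for illustration:

```python
# Hypothetical employee records: (years_at_company, overtime_hours), label.
records = [((1, 60), "leave"), ((2, 55), "leave"), ((1, 70), "leave"),
           ((8, 10), "stay"), ((6, 20), "stay"), ((7, 15), "stay")]

def train_stump(data):
    """Learn the single threshold on overtime hours that makes the fewest
    mistakes on the training data -- the simplest possible classifier."""
    best = None
    for (_, thr), _ in data:  # every observed overtime value is a candidate
        errors = sum((ot > thr) != (label == "leave") for (_, ot), label in data)
        if best is None or errors < best[1]:
            best = (thr, errors)
    return best[0]

thr = train_stump(records)
predict = lambda overtime: "leave" if overtime > thr else "stay"
print(predict(65), predict(12))  # leave stay
```

A real system would learn from many features at once (e.g. with a decision tree), but the train-then-predict workflow is the same.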
3.1.3. Clustering
Clustering is a data mining technique that automatically forms meaningful or useful clusters of objects with similar characteristics. The clustering technique defines the classes and puts objects into them, whereas in classification objects are assigned to predefined classes. To make the concept clearer, consider book management in a library. A library holds a wide range of books on various topics, and the challenge is to shelve them so that readers can find several books on a particular topic without hassle. Using the clustering technique, we can keep books that share some similarity in one cluster, or one shelf, and label it with a meaningful name. Readers who want books on that topic then only have to go to that shelf instead of searching the entire library.
3.1.4. Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship between dependent and independent variables. For instance, the prediction technique can be used in sales to forecast future profit: if we consider sales an independent variable, profit could be the dependent variable. Then, based on historical sales and profit data, we can fit a regression curve and use it for profit prediction.
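The fitted regression curve mentioned above can be sketched with ordinary least squares on hypothetical sales and profit figures:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical historical (sales, profit) figures.
sales  = [10, 20, 30, 40]
profit = [3, 5, 7, 9]
a, b = fit_line(sales, profit)
print(a, b)        # slope 0.2, intercept 1.0
print(a * 50 + b)  # predicted profit for sales of 50
```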
3.1.5. Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover similar patterns, regular events or trends in transaction data over a business period.

In sales, with historical transaction data, a business can identify sets of items that customers buy together at different times in a year. It can then use this information to recommend those items to customers with better deals, based on their purchasing frequency in the past.
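A minimal sketch of sequential pattern support: the fraction of customer purchase sequences that contain a given pattern in order (not necessarily consecutively). The per-customer sequences are hypothetical:

```python
# Hypothetical per-customer purchase sequences over a year, in time order.
sequences = [
    ["sleeping_bag", "hiking_shoes", "backpack"],
    ["tent", "sleeping_bag", "backpack"],
    ["hiking_shoes", "tent"],
]

def support(pattern, sequences):
    """Fraction of sequences containing `pattern` as an in-order subsequence."""
    def contains(seq, pat):
        it = iter(seq)                      # consume the sequence left to right,
        return all(item in it for item in pat)  # finding each pattern item in order
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

print(support(["sleeping_bag", "backpack"], sequences))  # 2 of 3 customers
```

A high-support sequence like "sleeping bag, then backpack" is exactly the kind of pattern a retailer would use for follow-up recommendations.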
3.1.6. Decision trees
A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the tree is a simple question or condition with multiple answers. Each answer then leads to a further set of questions or conditions that help narrow down the data, so that a final decision can be made based on it.
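A hand-built tree makes the root-question/answer structure concrete; the questions and decisions below are assumptions for illustration, not a learned model:

```python
# Each internal node asks a question about a record; leaves hold the decision.
tree = {
    "question": lambda r: r["income"] > 50_000,
    "yes": {
        "question": lambda r: r["has_debt"],
        "yes": "review",
        "no": "approve",
    },
    "no": "reject",
}

def decide(node, record):
    """Walk from the root down, following the answer at each node,
    until a leaf (the final decision) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](record) else node["no"]
    return node

print(decide(tree, {"income": 80_000, "has_debt": False}))  # approve
print(decide(tree, {"income": 30_000, "has_debt": True}))   # reject
```

In practice the questions themselves are learned from data (as in CART or CHAID); only the way a finished tree is applied to new records is shown here.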
3.2. Data Mining in Academia
Since data mining algorithms were first developed and used in research, studies have diversified and been applied in many different areas.
3.2.1.Science and Engineering
In recent years, data mining has been used widely in the areas of science and
engineering, such as bioinformatics, genetics, medicine, education and electrical power
engineering.
In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between inter-individual variations in human DNA sequence and variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. One data mining method used to perform this task is known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining methods have been widely
used for condition monitoring of high voltage electrical equipment. The purpose of
condition monitoring is to obtain valuable information on, for example, the status of the
insulation (or other important safety-related parameters). Data clustering techniques such as the self-organizing map (SOM) have been applied to vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there is considerable variability among normal-condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.
Data mining methods have also been applied to dissolved gas analysis (DGA) in power transformers. DGA has been available as a diagnostic for power transformers for many years. Methods such as SOM have been applied to analyze the generated data and to determine trends which are not obvious to the standard DGA ratio methods (such as the Duval Triangle).
In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning, and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate finding experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other applications include mining biomedical data facilitated by domain ontologies, mining clinical trial data, and traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998,
used data mining methods to routinely screen for reporting patterns indicative of
emerging drug safety issues in the WHO global database of 4.6 million suspected adverse
drug reaction incidents. Recently, similar methodology has been developed to mine large
collections of electronic health records for temporal patterns associating drug
prescriptions to medical diagnoses.
Data mining has been applied to software artifacts within the realm of software
engineering: Mining Software Repositories.
3.2.2. Medical Data Mining
Some machine learning algorithms can be applied in the medical field as second-opinion diagnostic tools and as tools for the knowledge extraction phase in the process of knowledge discovery in databases. One of these classifiers, the Prototype Exemplar Learning Classifier (PEL-C), is able to discover syndromes as well as atypical clinical cases.

In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States ruled that pharmacies may share information with outside companies. This practice was authorized under the First Amendment of the Constitution, protecting the "freedom of speech." Meanwhile, the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to drive the adoption of the electronic health record (EHR) and supporting technology in the United States. The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining. Prior to the signing of this law, an estimated 20% of United States-based physicians were using electronic patient records. Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities." Hence, electronic patient records further expand the possibilities for medical data mining, opening the door to a vast source of medical data analysis.
3.2.3. Spatial Data Mining
Spatial data mining is the application of data mining methods to spatial data. The end
objective of spatial data mining is to find patterns in data with respect to geography. So
far, data mining and Geographic Information Systems (GIS) have existed as two separate
technologies, each with its own methods, traditions, and approaches to visualization and
data analysis. Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data occasioned by
developments in IT, digital mapping, remote sensing, and the global diffusion of GIS
emphasizes the importance of developing data-driven inductive approaches to
geographical analysis and modeling.
3.2.4. Pattern mining
"Pattern mining" is a data mining method that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers who bought beer also bought potato chips.
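The 80% figure in such a rule is its confidence, which can be computed directly; the five baskets below are constructed to reproduce it:

```python
# Transactions behind the hypothetical rule "beer => potato chips (80%)".
transactions = [
    {"beer", "chips"}, {"beer", "chips"}, {"beer", "chips"},
    {"beer", "chips"}, {"beer"},
]

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): among baskets containing the antecedent,
    the fraction that also contain the consequent."""
    with_ante = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in with_ante) / len(with_ante)

print(confidence({"beer"}, {"chips"}, transactions))  # 0.8
```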
In the context of pattern mining as a tool to identify terrorist activity, the National
Research Council provides the following definition: "Pattern-based data mining looks for
patterns (including anomalous data patterns) that might be associated with terrorist
activity — these patterns might be regarded as small signals in a large ocean of noise."
Pattern mining also includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported into classical knowledge discovery search methods.
3.2.5. Human Rights
Data mining of government records – particularly records of the justice system (i.e.,
courts, prisons) – enables the discovery of systemic human rights violations in
connection to generation and publication of invalid or fraudulent legal records by various
government agencies.
3.2.6. Sensor Data Mining
Wireless sensor networks can be used to facilitate the collection of data for spatial data mining in a variety of applications, such as air pollution monitoring. A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy, due to the spatial correlation between sensor observations, inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed for more efficient spatial data mining.
3.3. Data Mining in Business
In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of business uses of data mining include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers with greater accuracy.
In today's world, raw data is being collected by companies at an exploding rate. For
example, Walmart processes over 20 million point-of-sale transactions every day. This
information is stored in a centralized database, but would be useless without some type
of data mining software to analyze it. If Walmart analyzed their point-of-sale data with
data mining techniques they would be able to determine sales trends, develop marketing
campaigns, and more accurately predict customer loyalty.
Categorization of the items available on an e-commerce site is a fundamental problem. A correct item categorization system is essential for the user experience, as it helps determine which items are relevant to the user for search and browsing. Item categorization can be formulated as a supervised classification problem in data mining, where the categories are the target classes and the features are the words composing some textual description of the items. One approach is to first find groups of items that are similar and place them together in a latent group. Given a new item, we first classify it into a latent group (coarse-level classification), and then do a second round of classification to find the category to which the item belongs.
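The coarse-then-fine scheme can be sketched with simple keyword overlap standing in for a real classifier; the category scheme and keyword sets below are assumptions for illustration:

```python
# Assumed two-level category scheme for an e-commerce catalogue.
coarse_keywords = {"electronics": {"battery", "usb", "screen"},
                   "clothing": {"cotton", "sleeve", "zip"}}
fine_keywords = {
    "electronics": {"phone": {"screen", "sim"}, "charger": {"usb", "battery"}},
    "clothing": {"shirt": {"cotton", "sleeve"}, "jacket": {"zip"}},
}

def classify(description):
    """Coarse-level classification first, then a second round within the group."""
    words = set(description.lower().split())
    coarse = max(coarse_keywords,
                 key=lambda g: len(words & coarse_keywords[g]))
    fine = max(fine_keywords[coarse],
               key=lambda c: len(words & fine_keywords[coarse][c]))
    return coarse, fine

print(classify("usb wall charger with battery indicator"))
# ('electronics', 'charger')
```

A production system would train both stages on labelled item descriptions (e.g. with naive Bayes or a linear model), but the two-pass structure is the same.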
Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is collected about the user's behavior. Many people find the amount of information that companies such as Google, Facebook, and Amazon store about us disturbing, and are concerned about privacy. Although there is the potential for our personal data to be used in harmful or unwanted ways, it is also being used to make our lives better. For example, Ford and Audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions.
Data mining in customer relationship management (CRM) applications can contribute
significantly to the bottom line. Rather than randomly contacting a prospect or customer
through a call center or sending mail, a company can concentrate its efforts on prospects
that are predicted to have a high likelihood of responding to an offer. More sophisticated
methods may be used to optimize resources across campaigns so that one may predict
to which channel and to which offer an individual is most likely to respond (across all
potential offers). Additionally, sophisticated applications could be used to automate
mailing. Once the results from data mining (potential prospect/customer and
channel/offer) are determined, this "sophisticated application" can either automatically
send an e-mail or a regular mail. Finally, in cases where many people will take an action
without an offer, "uplift modeling" can be used to determine which people have the
greatest increase in response if given an offer. Uplift modeling thereby enables
marketers to focus mailings and offers on persuadable people, and not to send offers to
people who will buy the product without an offer. Data clustering can also be used to
automatically discover the segments or groups within a customer data set.
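Uplift itself is just the difference between the treated (offered) and control response rates; a sketch with hypothetical campaign numbers:

```python
# Hypothetical campaign results per customer segment:
# (bought with offer, n offered, bought without offer, n not offered).
segments = {
    "young_urban":  (120, 1000, 40, 1000),
    "loyal_buyers": (300, 1000, 290, 1000),
}

def uplift(bought_t, n_t, bought_c, n_c):
    """Uplift = response rate when offered minus the control response rate.
    A high uplift marks the 'persuadable' customers worth mailing."""
    return bought_t / n_t - bought_c / n_c

for name, stats in segments.items():
    print(name, round(uplift(*stats), 3))
```

Here the young_urban segment gains 0.08 in response rate from the offer, while loyal_buyers gain only 0.01: they would mostly buy anyway, so mailing them is wasted effort.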
Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. For
example, rather than using one model to predict how many customers will churn, a
business may choose to build a separate model for each region and customer type. In
situations where a large number of models need to be maintained, some businesses turn
to more automated data mining methodologies.
Data mining can be helpful to human resources (HR) departments in identifying the
characteristics of their most successful employees. Information obtained – such as
universities attended by highly successful employees – can help HR focus recruiting
efforts accordingly. Additionally, Strategic Enterprise Management applications help a
company translate corporate-level goals, such as profit and margin share targets, into
operational decisions, such as production plans and workforce levels.
Market basket analysis relates to data mining use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. This example deals with association rules within transaction-based data; not all data are transaction based, however, and logical or inexact rules may also be present within a database.
Market basket analysis has been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich database of customer transaction history for millions of customers dating back a number of years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.
Data mining for business applications can be integrated into a complex modeling and
decision making process. LIONsolver uses Reactive business intelligence (RBI) to
advocate a "holistic" approach that integrates data mining, modeling, and interactive
visualization into an end-to-end discovery and continuous innovation process powered
by human and automated learning.
In the area of decision making, the RBI approach has been used to mine knowledge
that is progressively acquired from the decision maker, and then self-tune the decision
method accordingly. The relation between the quality of a data mining system and the
amount of investment that the decision maker is willing to make was formalized by
providing an economic perspective on the value of “extracted knowledge” in terms of its
payoff to the organization. This decision-theoretic classification framework was applied
to a real-world semiconductor wafer manufacturing line, where decision rules for
effectively monitoring and controlling the semiconductor wafer fabrication line were
developed.[3]
4.Future of Data Mining
Over recent years, data mining has been establishing itself as one of the major disciplines in computer science, with growing industrial impact. Undoubtedly, research in data mining will continue and even increase over the coming decades. In this section we will examine the future trends and applications of data mining.
4.1. Distributed/Collective Data Mining (DDM)
One area of data mining which is attracting a good amount of attention is that of distributed and collective data mining. Much of the data mining being done today focuses on a database or data warehouse that is physically located in one place. However, situations arise where information is spread across different physical locations; this is known generally as distributed data mining (DDM). The goal, therefore, is to effectively mine distributed data located at heterogeneous sites. Examples include biological information located in different databases, data that comes from the databases of two different firms, or analysis of data from different branches of a corporation, the combining of which would be an expensive and time-consuming process.
Distributed data mining offers an alternative to traditional centralized analysis by combining localized data analysis with a global data model. In more specific terms, this means performing local data analysis to generate partial data models, and then combining the local data models from different data sites to develop the global model. This global model combines the results of the separate analyses. The global model produced can become incorrect or ambiguous, especially if the data at the different locations has different features or characteristics. This problem is especially critical when the data at the distributed sites is heterogeneous rather than homogeneous.
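The local-models-to-global-model step can be sketched in its simplest form: count-weighted merging of local summary statistics. The per-site statistics below are hypothetical:

```python
# Hypothetical local models computed independently at three sites.
# Each site reports (sample_mean, sample_count) for the same feature.
local_models = [(10.0, 200), (12.0, 300), (11.0, 500)]

def combine(models):
    """Build the 'global model' as the count-weighted mean of local means --
    the simplest instance of merging partial models from distributed sites."""
    total = sum(n for _, n in models)
    return sum(m * n for m, n in models) / total

print(combine(local_models))  # 11.1
```

Real DDM combines richer partial models (cluster centroids, rule sets, classifier ensembles), and the heterogeneity problem described above arises precisely when such simple merging is no longer valid.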
4.2. Ubiquitous Data Mining (UDM)
The advent of laptops, palmtops, cell phones, and wearable computers is making ubiquitous access to large quantities of data possible. Advanced analysis of data for extracting useful knowledge is the next natural step in the world of ubiquitous computing. Accessing and analyzing data from a ubiquitous computing device offers many challenges. For example, UDM introduces additional costs due to communication, computation, security, and other factors, so one of the objectives of UDM is to mine data while minimizing the cost of ubiquitous presence.
4.3. Hypertext and Hypermedia Data Mining
Hypertext and hypermedia data mining can be characterized as mining data which includes text, hyperlinks, text mark-ups, and various other forms of hypermedia information. As such, it is closely related to both web mining and multimedia mining, which are quite close to it in terms of content and applications. While the World Wide Web is substantially composed of hypertext and hypermedia elements, there are other kinds of hypertext/hypermedia data sources which are not found on the web, such as the information found in online catalogues, digital libraries, and online information databases. Some of the important data mining techniques used for hypertext and hypermedia data mining include classification (supervised learning), clustering (unsupervised learning), semi-supervised learning, and social network analysis.
In the case of classification, or supervised learning, the process starts by reviewing training data in which items are marked as being part of a certain class or group. This data is the basis on which the algorithm is trained. One application of classification is in web topic directories, which can group similarly sounding or spelled terms into appropriate categories, so that searches will not bring up inappropriate sites and pages.
Semi-supervised learning and social network analysis are other methods important to hypermedia-based data mining. Semi-supervised learning covers the case where there are both labelled and unlabelled documents, and there is a need to learn from both types. Social network analysis is also applicable because the web itself is considered a social network; it examines networks formed through collaborative association, whether between friends, academics doing research or serving on committees, or between papers through references and citations.
4.4. Multimedia Data Mining
Multimedia data mining is the mining and analysis of various types of data, including images, video, audio, and animation. As multimedia data mining incorporates the areas of text mining as well as hypertext/hypermedia mining, these fields are closely related, and much of the information describing those areas also applies here. This field is rather new but holds much promise for the future. Multimedia information, because of its nature as a large collection of multimedia objects, must be represented differently from conventional forms of data. One approach is to create a multimedia data cube which converts multimedia-type data into a form suited to analysis using one of the main data mining techniques, while taking into account the unique characteristics of the data.
4.5. Time Series/Sequence Data Mining
Another important area in data mining centres on the mining of time series and sequence-based data. Simply put, this involves mining a sequence of data which is either referenced by time (time series, such as stock market and production process data) or simply ordered in a sequence. In general, one aspect of mining time series data focuses on identifying the movements or components which exist within the data (trend analysis). These can include long-term or trend movements, seasonal variations, cyclical variations, and random movements. Sequential pattern mining focuses on identifying sequences which occur frequently in a time series or sequence of data. This is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as the likely follow-up purchase after buying a certain electronics item or computer.
Conclusion
I started this report aiming to cover the various aspects of data mining as a whole. The information given here was composed from different sources into a complete piece of research, so clearly some of the portions are not entirely unique. The hypotheses and methods are explained correctly, and the trends and future dynamics are presented as currently as possible. The statistical and theoretical data were checked carefully and verified with the help of various sources.

In addition, we can all see the importance of data mining in an increasingly globalized world. There are many techniques, studies and software products, based on data mining, that make life easier and increase companies' market values; in particular, enterprise resource planning and customer relationship management software have lately been taking an ever larger share of companies' budgets. Since data mining also underlies business intelligence, we will see many more studies related to it in the future.

I hope this research report can be beneficial for both its readers and people who are curious about data mining.
Glossary
cluster analysis: or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
anomaly detection: anomaly detection (also outlier detection) is the identification of items,
events or observations which do not conform to an expected pattern or other items in a
dataset.
association rule mining: a method for discovering interesting relations between variables in large databases.
predictive analytics: encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.
classification: is the problem of identifying to which of a set of categories (sub-populations)
a new observation belongs, on the basis of a training set of data containing observations (or
instances) whose category membership is known.
data warehouse: in computing, a data warehouse (DW or DWH), also known as an
enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is
considered a core component of business intelligence
time series analysis: comprises methods for analyzing time series data in order to extract
meaningful statistics and other characteristics of the data.
threshold value: The threshold limit value (TLV) of a chemical substance is a level to which it
is believed a worker can be exposed day after day for a working lifetime without adverse
effects.
LIONsolver: an integrated software product for data mining, business intelligence, analytics, and modeling (Learning and Intelligent OptimizatioN).
Reactive business intelligence (RBI): advocates a holistic approach that integrates data mining, modeling and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.
VLSI test: very large scale integration test.
IC: integrated circuit.
References
[1] www.thearling.com
[2] www.rayli.net
[3] www.wikipedia.org
[4] http://www.ibm.com/support/knowledgecenter/
[5] https://www.linkedin.com/pulse/what-does-future-hold-data-mining-thiensi-le
[6] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[7] https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/materials.shtml#dataware
[8] http://www.cs.bu.edu/~gkollios/dm07/lectnotes.html
[9] http://searchsqlserver.techtarget.com/definition/data-mining
[10] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Michigan State University / University of Minnesota, March 25, 2006.
[11] Sanjay Ranka, Introduction to Data Mining, Computer and Information Science and Engineering, University of Florida, Gainesville.