Data Mining
Data mining has become a valuable technique relied upon by businesses all over the
world so they can better understand their markets and consequently gain competitive
advantage. Data mining is practical only because of advances in computer
hardware: faster processors, larger and cheaper memory, and the
evolution of data storage devices. Businesses all over the world are learning to take
advantage of this combination of factors. Many data mining software programs are
available to businesses but, as is usually the case in MIS, the best system for you
depends on what you want to accomplish and your current situation.
What Is Data Mining?
Data mining can be defined as “the process of searching and analyzing data in order to
find implicit, but potentially useful, information. It involves selecting, exploring, and
modelling large amounts of data to uncover previously unknown patterns, and ultimately
comprehensible information, from large databases” (Shaw, Subramanian, Tan, and Welge
2001:128). This technique allows organizations to use the data residing in enormous
databases to their own advantage. It provides organizations with statistical tools to
uncover hidden patterns that may provide the useful information needed to understand
why a certain group of customers behaves in a certain way. Your textbook provides a few
cases in which the use of data mining techniques has helped organizations succeed.
The statistical tools available for organizations in data mining software applications are
broad in range, from regression analyses to genetic algorithms. Some examples of these
tools are Association Discovery; Bayesian Statistics; Bayesian Networks; Classification
Trees; Regression Trees; Conceptual Clustering; Decision Trees; Fuzzy Logic; Genetic
Algorithms; and Neural Networks. Every software package will provide you with
different combinations of tools. Therefore, when selecting data mining software, one
should carefully analyze the needs of the organization.
For the information to be useful, knowledge workers who perform such analyses must be
careful to avoid common mistakes that will not only invalidate the results, but could
provide the organization with the wrong answer to the initial problem.
What Are Some Key Considerations Before Mining Data?
As in any research or statistical analysis, one must start with the correct understanding of
what needs to be done and why it needs to be done. This requires the correct
understanding of the problem, the selection of the right set of actions to address the
problem, and the analyses of the results in light of the initial question(s). If data mining
(or any other data analysis) is undertaken without this initial consideration, chances are
that the final product will not solve any problems. On the contrary, it might create more
serious ones.
For example, if a business would like to know what its customers’ buying behaviour is at
Thanksgiving, what would be a sensible approach? First, the analyst would have to look
at the data and restrict it to a certain period of time (e.g., a month before the
Thanksgiving holiday to a few days after since many people buy goods after the holiday
to increase their savings). It would be nonsense to include year-round data. Also, if the
business has operations in Canada and the United States, for each relevant population
different time periods would apply since Thanksgiving in Canada is earlier than in the
U.S. This is an example of correct analysis and understanding of the task before moving
on to data preparation.
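The windowing described above can be sketched in a few lines of Python. This is only an illustration: the transaction records, the country codes, and the window lengths (30 days before the holiday, 5 days after) are assumptions chosen to match "a month before" and "a few days after." The holiday rules themselves are the standard ones: the second Monday of October in Canada and the fourth Thursday of November in the U.S.

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Return the date of the n-th given weekday (Mon=0 .. Sun=6) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def thanksgiving(year, country):
    """Canada: second Monday of October; U.S.: fourth Thursday of November."""
    if country == "CA":
        return nth_weekday(year, 10, 0, 2)
    if country == "US":
        return nth_weekday(year, 11, 3, 4)
    raise ValueError(f"unsupported country: {country}")

def in_holiday_window(purchase_date, country, days_before=30, days_after=5):
    """True if the purchase falls inside the analysis window for its country."""
    holiday = thanksgiving(purchase_date.year, country)
    start = holiday - timedelta(days=days_before)
    end = holiday + timedelta(days=days_after)
    return start <= purchase_date <= end

# Hypothetical transactions: (purchase date, country code, amount)
transactions = [
    (date(2023, 10, 5), "CA", 120.0),
    (date(2023, 10, 5), "US", 80.0),
    (date(2023, 11, 24), "US", 200.0),
    (date(2023, 6, 1), "CA", 50.0),
]

# Keep only purchases inside the window that applies to each customer's country.
selected = [t for t in transactions if in_holiday_window(t[0], t[1])]
```

The same purchase date (October 5) is kept for the Canadian customer and dropped for the U.S. customer, which is the point of applying a separate window to each population.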
There have been many discussions on how to correctly address the data mining process.
One result of such discussions among professionals in the field is the CRISP-DM Model.
CRISP-DM stands for CRoss Industry Standard Process for Data Mining. You can find
more detailed information at the following Web site: http://www.crisp-dm.org. The
model calls for a six-phase approach to data mining: (1) Business Understanding—
understand the project objectives and requirements; (2) Data Understanding—collect the
relevant data and become familiar with it, and also critically look into the quality of the
data you have; (3) Data Preparation—clean, organize, and prepare the data for the
analysis by one or many of the statistical tools available. Good data preparation leads to
data that can be trusted and that will yield valid results; (4) Modeling—the phase in
which data analysis takes place. Note that some of the modelling
techniques require data prepared in specific ways, hence the importance of the
previous phase; (5) Evaluation—this phase requires a serious and non-biased analysis of
your results. Here you need to find out if the model(s) is (are) really valid and useful.
You need to make sure that the analysis was done right, that the data was prepared right,
and that the results make business and research sense; and (6) Deployment—put the
model(s) into action. Please refer to the above-mentioned Web site to gain a more
in-depth understanding of this critical process. If you follow such a model
and perform the tasks carefully and to a high standard, you will avoid many
serious pitfalls that would invalidate your findings.
Some of the software available provides a step-by-step approach to performing some of the
above-mentioned phases. For example, SAS Enterprise Miner makes use of what is called
the SEMMA approach: Sample, Explore, Modify, Model, and Assess. The goal of this
approach is to provide the user with a logical, organized framework for conducting data
mining. Note that CRISP-DM phases 1 and 6 are not part of SEMMA. It is assumed that
you and your organization have done your "homework" (phase 1) before using the tools
provided with a data mining package. Phase 6 is how you put the model into practice in
your organization; it is not part of a data mining package but follows from the
business understanding and purpose that motivated the model. You can find more
information on the SEMMA approach and SAS Enterprise Miner at http://www.sas.com.
Problems Occurring When Data Is Not Correctly Cleaned and Prepared
Not spending enough time cleaning and preparing data will raise questions about the
validity and reliability of your analysis. Textbooks on the topics of research design and statistical
techniques provide the analyst with many reasons for carefully cleaning and preparing
data before any analysis can be run. Here are some examples:
a) Missing values. When we survey people (customers, clients, employees,
colleagues, etc.), many questions tend not to be answered. In this case, the analyst
should understand (via descriptive statistics) the importance the missing values
have to his/her analysis. Let’s say that what the analyst is looking at requires
information on income, family size, and age. If respondents did not provide
enough income information (people in higher income brackets do not report
income as often as people in lower income brackets), a comparison between
different levels of income may be biased. The results may not reliably show the
difference in behaviour between higher-income customers and lower-income
customers. Age is another common issue when it comes to missing values. If,
because of non-response, the sample does not represent the age groups in
proportion, any analysis that depends on age differences is biased.
b) Correct sample selection. Not selecting the correct sample also brings problems.
Let’s say that the question the analyst needs to answer requires enough subjects
from the teenage and young adult groups of customers. If the analyst fails to
understand that his/her analysis should zoom in on these two groups, the results
will not provide enough statistical evidence for correct decision making and the
analysis (or data collection) will have to be re-done.
c) Outliers. Outliers are values that lie far from the other answers to the same
question and can distort their distribution. Consider this example:
The analyst has a clean breakdown of subjects by income bracket. Suppose
the analyst did not notice that, due to an external factor (market behaviour), for a
certain period of time (say March to April) a sub-sample of subjects in the
higher-income bracket bought substantially more goods than they usually do. Had the
analyst plotted the data, s/he would have spotted this phenomenon.
(The sub-sample would show as dots far to the right of the
"money spent in purchases" normal curve.) S/he could then decide
what to do with those data points. If the analyst did not realize what was
happening and ran the analysis, s/he would be misled into believing that clients
are spending more money on purchases than is true. Good data
preparation catches outliers, and good decision making (based on knowledge of
the data, the problem, and research/statistics) allows the analyst to deal correctly
with such extreme cases so that they do not bias the distribution and the final
results of the analysis.
d) Normality. Many statistical techniques assume that variables are normally
distributed (the bell-shaped curve). Failing to test for normality and, where
needed, to transform variables so that their distributions are closer to normal
can invalidate the results of the analysis.
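Two of the checks in the list above, missing values and outliers, can be sketched with Python's standard library. The survey records below are invented for illustration, and the 1.5 x IQR fence (Tukey's rule) is one common convention for flagging outliers, not a method prescribed by this text. Flagged values still need the analyst's judgment before anything is removed.

```python
import statistics

def missing_rate(rows, field):
    """Share of rows in which `field` was left unanswered (None)."""
    return sum(1 for r in rows if r[field] is None) / len(rows)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical survey records; None marks an unanswered question.
records = [
    {"age": 34, "income": 28000, "spend": 210.0},
    {"age": 45, "income": None, "spend": 190.0},
    {"age": 29, "income": 31000, "spend": 205.0},
    {"age": None, "income": 52000, "spend": 220.0},
    {"age": 51, "income": None, "spend": 198.0},
    {"age": 38, "income": 47000, "spend": 215.0},
    {"age": 62, "income": 95000, "spend": 900.0},
    {"age": 41, "income": 39000, "spend": 207.0},
]

# How much of each variable is missing?
income_missing = missing_rate(records, "income")
age_missing = missing_rate(records, "age")

# Which spending values sit far from the rest of the distribution?
spend = [r["spend"] for r in records]
flagged = iqr_outliers(spend)
```

Here a quarter of the income answers are missing, and the single 900.0 purchase is flagged for review.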
There are many other considerations that a knowledge worker (analyst) should address
when collecting, cleaning, preparing, and analyzing data. A good knowledge of research
and statistical techniques is required to perform a correct analysis.
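For the normality point in item (d), a rough first check is the skewness of a variable, followed by a transformation if the distribution is clearly lopsided. The sketch below uses invented purchase amounts and a natural-log transform. Both are assumptions for illustration: skewness is a quick diagnostic rather than a formal normality test, and the log is only one of several transformations an analyst might try.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution, > 0 for a right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed purchase amounts (a few very large values).
amounts = [12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 45.0, 120.0, 400.0]

# A log transform pulls in the long right tail (amounts must be positive).
logged = [math.log(v) for v in amounts]

raw_skew = skewness(amounts)
log_skew = skewness(logged)
```

Comparing `raw_skew` with `log_skew` shows whether the transformation moved the variable closer to symmetry. It does not prove normality.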
SAS Enterprise Miner
Parts of the information below can be found in "Enterprise Miner: Applying Data Mining
Techniques Course Notes" by SAS.
SAS Miner is a user-friendly, graphic-oriented tool that allows the user to define what
needs to be done during the data mining process by selecting “nodes” (using a mouse,
clicking on them) and placing the nodes (dragging them) in the “workspace” area. These
nodes can then be easily connected by graphically inserted lines (similar to when you are
using drawing software or other visualization software). That is, once the knowledge
worker/analyst has decided what needs to be done, s/he can tell the
software what to do through this graphical interface instead of having to
write lines of syntax.
The SEMMA approach that SAS Miner makes use of guides the knowledge
worker/analyst through the process of preparing the data and running the desired analysis.
The first step is Sample. In this step, the analyst identifies the input data sets,
selects a sample from the larger data set, and partitions that sample into training,
validation, and test data.
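Outside a graphical tool, the partitioning part of the Sample step might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the stand-in records are assumptions for illustration and are not Enterprise Miner defaults.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains becomes the test set
    )

rows = list(range(100))  # stand-in for 100 customer records
train_set, valid_set, test_set = partition(rows)
```

Shuffling before splitting keeps any ordering in the source data (for example, records sorted by date) from ending up in a single partition.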
The second step is Explore. In this step the analyst should explore the data statistically
and graphically by means of plotting and descriptive statistics. This step allows one to
better know the data and to correctly select the appropriate and important variables.
The third step is Modify. Data is prepared for the subsequent analysis. Here the analyst
will transform variables (e.g., to compensate for skew), re-group answers (e.g., collapse
the five marital status categories of single, married, divorced, widowed, and living
with partner into two: having a partner/spouse and not having one), deal with
missing values (replace or not), identify outliers, select control, dependent, and
independent variables, and perform any other analyses that are required by the
investigation the analyst plans to perform.
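Two of the Modify operations, re-grouping categories and replacing missing values, are sketched below in Python. The sample values are invented, and median replacement is only one possible policy. As the text notes, whether to replace missing values at all is a decision for the analyst.

```python
import statistics

# Categories treated as "having a partner/spouse"; all others count as not.
PARTNERED = {"married", "living with partner"}

def has_partner(status):
    """Collapse five marital-status categories into a two-level flag."""
    return status in PARTNERED

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

statuses = ["single", "married", "divorced", "widowed", "living with partner"]
flags = [has_partner(s) for s in statuses]

incomes = [30000, None, 42000, 55000, None]  # hypothetical survey answers
filled = impute_median(incomes)
```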
These first three steps accomplish what, in research methodology, is referred to as
"getting to know and cleaning your data." They are of extreme importance since the correct
preparation of data for analysis is what underpins the reliability of the results (see
the previous section).
Step number 4 is Model. A model or models will be created by using a regression,
decision tree, neural network, or user-defined model.
Step number 5 is Assess. The analyst compares competing predictive models.
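The Model and Assess steps can be illustrated with a small comparison of two competing models. The data, the variable meanings, and the choice of candidates (a least-squares line versus a constant baseline) are assumptions made for this sketch. A real project would compare the regression, decision tree, or neural network models mentioned above.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return statistics.fmean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: income in thousands (x) and holiday spend (y).
train_x = [20, 30, 40, 50, 60]
train_y = [110, 148, 205, 251, 290]
valid_x = [25, 45, 55]
valid_y = [130, 228, 270]

# Model step: fit two candidate models on the training data only.
a, b = fit_line(train_x, train_y)
baseline = statistics.fmean(train_y)

candidates = {
    "regression": lambda x: a + b * x,
    "baseline mean": lambda x: baseline,
}

# Assess step: score each candidate on the held-out validation data.
scores = {name: mse(f, valid_x, valid_y) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

Scoring on validation data rather than training data is what makes the comparison fair: a model that merely memorized the training set would not be rewarded.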
One very important point in this whole process is that the knowledge worker can
make sound decisions only if s/he has understood the business needs
and understands research and statistical analysis methodology. This is
why it is important to delegate such tasks to workers who have been trained in
research methods and statistics.
Conclusion
Data mining does, indeed, provide organizations with very powerful analytical tools to
extract relevant and useful information from their ever-growing data pools. There is no
doubt that a wealth of relevant and crucial information that businesses can learn from
lies dormant in databases. But, as the textbook correctly emphasizes, one needs to understand the
organization and its needs in order to ask the right questions, choose the right systems,
and implement the right solutions. Data mining without careful consideration of factors
such as the ones presented in the CRISP-DM methodology is doomed to sub-optimal
results.
References
1. CRISP-DM: CRoss Industry Standard Process for Data Mining Methodology.
Downloaded from web site: http://www.crisp-dm.org.
2. Shaw, M.J., Subramanian, C., Tan, G.W. and Welge, M.E. (2001). Knowledge
Management and Data Mining for Marketing. Decision Support Systems, 31, 127-137.
3. Enterprise Miner: Applying Data Mining Techniques – Course Notes. Edited by
SAS. 1999.