International Journal of Engineering Technology, Management and Applied Sciences
www.ijetmas.com October 2015, Volume 3, Issue 10, ISSN 2349-4476
Understanding Linkage between Data Mining and Statistics
Jaya Srivastava, Dr. Abhay Kumar Srivastava
Department of Computer Science, Jaipur National University, Jaipur, India
Department of Decision Sciences, Jaipuria Institute of Management, Lucknow, India
Abstract: Data Mining and Statistics, though of different origins, serve a common purpose. Many practitioners do
not fully understand the scope and limitations of the two disciplines or how they are interrelated [6, 8]. This paper
highlights these issues and the linkage between Data Mining and Statistics [12, 13].
Keywords: “Data Mining”, “Statistics”, “knowledge extraction”, “connection”, “linkage”
Introduction
Data Mining (DM) and Statistics are two disciplines commonly used in data analysis and knowledge
extraction. Statistics is a traditional branch that evolved from applied mathematics, while Data Mining is a
multidisciplinary branch that evolved from computer science, yet both are used for the same purpose. Many
techniques are common to both disciplines, and some approaches used in statistics can reduce the work of a
data miner. In this paper we highlight these issues by introducing both Data Mining and Statistics and
identifying the linkages and differences between the two streams.
The growth of data mining has been massive in the past decade. Its application has increased with the growth
of data generation, as more and more data are captured through various means of Information Technology
such as the internet. Research on databases that makes use of data mining is growing. Because data mining
can be used in advanced data analysis and is capable of extracting valuable knowledge from large data sets, it
has emerged as a new scientific and engineering discipline to meet such requirements. Data Mining is
commonly described as “solving problems by analyzing data that already exists in databases”. In addition to
the mining of structured and numeric data stored in data warehouses, there is growing interest in the mining
of unstructured and non-numeric data such as text and web content.
1. Defining Data Mining
DM is a combination of computational and statistical techniques for performing exploratory data analysis
(EDA) on rather large and mostly not very well cleaned data sets (or databases). In recent times, capturing
data is no longer considered a major issue; however, since a huge amount of data does not by itself convey
any information, separating useful from non-useful data has become a major challenge. Most modern problems
involve electronically stored data accumulated over many years [39]. This leads to a requirement for
training data miners in statistics, or statistics graduates in data mining.
Although DM has a short history, its importance is felt in various domains. Experts have defined it in
different ways. Some definitions of DM are given below:
Data Mining is the analysis of large observational data sets rather than experimental data sets to find
unsuspected relationships and presenting the data in novel ways which are easy to understand and useful for
the users.
DM is the extraction of hidden predictive information from large databases. [28]
DM is process of analyzing data from different perspectives and summarizing it into useful information within
a particular context. [4]
DM is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of
data in order to discover meaningful patterns and rules. [2]
DM is finding interesting structure (patterns, statistical models, relationships) in databases. [39]
DM is the application of statistics in the form of exploratory data analysis and predictive models to reveal
patterns and trends in very large data sets. (Insightful Miner 3.0 User Guide)
DM is “learning from the data” or “turning data into information”. [24]
DM is the process of knowledge discovery in databases (KDD). [41]
In almost all definitions the focus is on analyzing large (generally exploratory) data and finding hidden
information in databases. The extraction of information is automatic or semi-automatic, and the results are
presented in a simple manner. Figure 1 below shows the evolution of data mining nomenclature. It started
with terms used by statisticians such as data fishing, data dredging and data snooping. Nowadays data mining
is often called Knowledge Discovery in Databases.
Figure 1: Evolution of Nomenclature in Data Mining
1.1 Major goals of data mining
We can distinguish the major goals of data mining by two types:
a. Verification of the user’s hypothesis
b. Discovery of new patterns that can be used for prediction and description
Data mining methods seek to discover unexpected and interesting regularities, called patterns, in the data
sets presented. Statistical significance testing, also called hypothesis testing, can be applied in these
scenarios to select the surprising patterns that do not appear as clearly in random data. As each pattern is
tested for significance, a set of statistical hypotheses is considered simultaneously. Such multiple
comparisons, in which several hypotheses are tested simultaneously, arise often in Data Mining.
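The multiple-comparisons issue can be made concrete with a minimal sketch. It assumes each discovered pattern already has a p-value from some significance test; the p-values below are hypothetical, and the two corrections shown (Bonferroni and Benjamini-Hochberg) are standard statistical procedures chosen here for illustration, not methods prescribed by the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Keep a pattern only if its p-value is at most alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Keep the k smallest p-values, where k is the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            keep[idx] = True
    return keep


# Hypothetical p-values for five candidate patterns found in the same data set.
p_values = [0.001, 0.008, 0.020, 0.041, 0.300]

print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Tested one at a time at the 0.05 level, four of the five patterns would look significant; both corrections retain fewer, which is the point of treating the hypotheses as a set.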
Prediction involves using some variables or fields in the database to forecast unknown or future values of
other variables of interest. Description focuses on finding human-interpretable patterns describing the data.
Various complexities in the stored data (data interrelations) have limited the use of verification-driven data
mining in decision-making. It must be complemented with discovery-driven data mining. Furthermore, in
the context of Data Mining, description tends to be more important than prediction. This is in contrast to
pattern recognition and machine learning applications, where prediction is often the primary goal.
1.2 Popular Data Mining Methods
There are many different techniques for data mining. Often the technique to use is determined by the type of
data we have and the type of information we are trying to determine from the data. The most popular data
mining methods in current use are classification, clustering, neural networks, association, Bayesian Network,
estimation, and visualization. A very brief explanation of each method is given below:
S. No. | D.M. Technique              | Description
1.     | Classification              | Predicting an item class
2.     | Clustering                  | Finding similar groups and sub-groups in data
3.     | Associations                | Determining which things go together, also known as dependency modeling
4.     | Description & Visualization | Depicting visual summaries in data and exploring
5.     | Summarization               | Describing a group
6.     | Estimation                  | Predicting a continuous value such as income, bank balance etc.
7.     | Deviation Detection         | Finding changes
8.     | Link Analysis               | Finding relationships
2. Statistics
Statistics deals with the quantification, collection, analysis and interpretation of data, and with drawing
conclusions from them. It is considered to be the oldest research stream and has become one of the branches
of pure and applied mathematics. Different statisticians have defined statistics in different ways:
Statistics is both art and science which examines the principles and methods implemented in collecting,
summarizing, analyzing and interpreting the numerical data on a research field [38]
Statistics is the branch of mathematics concerned with collection, classification, analysis, and interpretation
of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability). It is
subdivided into descriptive statistics and inferential statistics. [31]
The most important science in the whole world: for upon it depends the practical application of every other
science and every art: the one science essential to all political and social administration, all education, and all
organization based on experience, for it only gives results of our experience [11]
Statistics is the science of counting and averages. [31]
Statistics is the science of estimate and probabilities. [31]
It is the method of judging collective, natural or social phenomena from the results obtained from the analysis
or enumeration or collection of estimates. [31]
Statistics is the numerical statement of facts capable of analysis and interpretation and the science of statistics
is the study of the principles and the methods applied in collecting, presenting, analysis and interpreting the
numerical data in any field of inquiry.
Statistics consists of two main parts, descriptive and inferential statistics. The methodology for organizing
and summarizing sample data is called descriptive statistics. When one uses these summaries to draw
conclusions about an entire population, the methodology is called statistical inference [1].
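The distinction between describing a sample and inferring about a population can be illustrated with a short sketch. The sample values are hypothetical, and the interval uses a normal approximation (z = 1.96) as a simplifying assumption; for a sample this small a t-quantile would normally be preferred.

```python
import math
import statistics

# Hypothetical sample of ten observations drawn from a larger population.
sample = [52, 48, 55, 61, 47, 50, 58, 49, 53, 57]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
std_dev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)


def mean_confidence_interval(data, z=1.96):
    """Approximate 95% confidence interval for the population mean (normal approximation)."""
    centre = statistics.mean(data)
    std_error = statistics.stdev(data) / math.sqrt(len(data))
    return centre - z * std_error, centre + z * std_error


# Inferential statistics: use the summaries to say something about the population.
low, high = mean_confidence_interval(sample)
print(f"mean={mean}, median={median}, sd={std_dev:.2f}")
print(f"approximate 95% CI for the population mean: ({low:.2f}, {high:.2f})")
```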
2.1 Major Approaches in Statistics
S. No. | Statistical Technique                                                                   | Description
1.     | Descriptive Statistics                                                                  | Central Tendency; Dispersion; Shape (Graphical Display)
2.     | Regression (Linear, Logistic, Non Linear)                                               | Prediction; Modeling
3.     | Correlation Analysis (Pearson correlation, Spearman correlation)                        | Association
4.     | Probability Theory (Marginal, Union, Joint, Conditional)                                | -
5.     | Probability Distribution (Discrete Probability Distribution, Continuous Probability Distribution) | Prediction of the behavior of the system defined
6.     | Bayesian Classification                                                                 | Bayes’ Theorem and Naïve Bayesian classification
7.     | Estimation Theory                                                                       | Model Selection; Estimating Confidence interval and significance level; ROC Curves
8.     | Analysis of Variance (ANOVA)                                                            | Test equality of the means of more than two groups
9.     | Factor Analysis (FA)                                                                    | Reduction of a large no. of variables into some general ones, also known as Data reduction Technique
10.    | Discriminant Analysis                                                                   | Predict a categorical response variable
11.    | Time series analysis (Moving Average Method, Exponential smoothing, Auto regression method) | Forecasting trends and seasonality
12.    | Quality Control Charts (Attributes Charts, Variable Charts)                             | Display the spread of individual observation with reference to mean
13.    | Principal Component Analysis                                                            | Data Reduction
14.    | Canonical Correlation Analysis                                                          | -
15.    | Cluster Analysis (Hierarchical, Non Hierarchical)                                       | -
16.    | Sampling (Random Sampling, Non Random Sampling)                                         | -
3. Why Data Mining?
With so many data analysis tools available in statistics, one may question the relevance of Data Mining.
Many reasons can be stated in support of DM. First, as industry needs solutions for real-life problems, largely
in Customer Relationship Management [23, 27], one of the most important issues is problem-solving speed:
many data mining methods can deal with very large datasets very efficiently, while the algorithmic
complexity of statistical methods may turn out to be prohibitive for their use on very large databases. Next,
the results of the analysis need to be represented in an appropriate, usually human-understandable way; apart
from the analytical languages used in statistics, data mining methods also use other forms of representation,
the most popular being decision trees and rule sets. Another important issue is the assumptions made about
the type of data. In general one may claim that data mining deals with all sorts of structured tabular data
(e.g., non-numeric, highly unbalanced, unclean data) as well as with non-structured data (e.g., text
documents, images, multimedia), and does not make assumptions about the distribution of the data. Finally,
while one of the main goals of statistics is hypothesis testing, one of the main goals of data mining is the
construction of hypotheses.
3.1 Nature of Data used in Statistics and Data Mining
Most statisticians are concerned with primary data analysis. That is, the data are collected with a particular
question or set of questions in mind. In fact, experimental design and survey design were developed to
facilitate the efficient collection of data so as to answer the given questions. Data mining, on the other hand,
is entirely concerned with secondary data analysis. In fact we might define data mining as the process of
secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value
to the database owners. We see from this that data mining is very much an inductive exercise, as opposed to
the hypothetico-deductive approach often seen as the paradigm for how modern science progresses.
Statistics is more concerned with learning from data, or turning data into information, which can then be
used for making rational decisions. The context of data mining encompasses statistics, but with a somewhat
different emphasis. In particular, data mining involves retrospective analysis of data: if something has
occurred, we try to investigate why it occurred. Experimental design may therefore not be very suitable in
data mining [18]. Data miners are often more interested in ease of understanding or interpretation than in
accuracy or predictability. Thus, there is a focus on relatively simple, interpretable models involving rules,
trees, graphs, and so forth. Applications involving very large numbers of variables and vast numbers of
measurements are also common in data mining. Thus computational efficiency and scalability are very
important, and issues of statistical significance may be a secondary consideration.
4. Role of Statistics in Data Mining
Both Statistics and Data Mining are data-centered processes, and real-time data is always error-prone due to
several factors such as ultra-large size, noise, incompleteness, redundancy and dynamism. In Data Mining,
data-driven techniques either rely on heuristics to guide their search through the large space of possible
relations between combinations of attribute values, or adopt some kind of data-reduction method to make the
algorithm more efficient. Statistics provides several algorithms which can also be used for data analysis in
data mining. For ultra-large and dynamic data, traditional statistics provides sampling and Bayesian analysis,
which can be used effectively to counter these problems [8].
As shown in Figure 2, DM can be viewed as the intersection of Artificial Intelligence, classical statistics, and
Machine Learning and advanced statistics.
4.1 Algorithms for data analysis in statistics
Computing has always been fundamental to statistics. Some of the important computational tools for data
analysis rooted in classical statistics are: efficient estimation by maximum likelihood, least squares and least
absolute deviation estimation, and the EM algorithm; analysis of variance (ANOVA, MANOVA, ANCOVA)
and the analysis of repeated measurements; nonparametric statistics; log-linear analysis of categorical data;
linear regression analysis, generalized additive and linear models, logistic regression, survival analysis, and
discriminant analysis; frequency domain (spectrum) and time domain (ARIMA) methods for the analysis of
time series; multivariate analysis tools such as factor analysis, principal component analysis and, later,
independent component analysis, and cluster analysis; density estimation, smoothing and de-noising, and
classification and regression trees (decision trees); Bayesian networks and the Markov chain Monte Carlo
(MCMC) algorithm for Bayesian inference. An overview of these topics is readily available [12].
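As one illustration of the classical tools listed above, the following minimal sketch fits a straight line by ordinary least squares using the closed-form solution. The function name `fit_line` and the data points are hypothetical and chosen only so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The same estimate is what a data mining tool computes when it fits a linear predictive model; only the scale of the data and the surrounding workflow differ.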
Figure 2: Data mining as an interdisciplinary branch
4.2 Handling massive data through sampling
While data storage has become cheaper as memory has become increasingly affordable, CPU, throughput,
memory management and network bandwidth continue to be constraints when it comes to processing large
quantities of data. Business analysts and data scientists are often so overwhelmed by the sheer volume that
they do not know where to start in order to convert data into information. Sampling can be used in a smart
way to overcome this problem. Sampling is a very well developed area of statistics, but is usually used in DM
only at a very basic level.
Exploring a representative sample is easier, more efficient, and can be as accurate as exploring the entire
database. After the initial sample is explored, some preliminary models can be fitted and assessed. If the
preliminary models perform well, then perhaps the data mining project can continue to the next phase.
However, it is likely that the initial modeling generates additional, more specific questions, and more data
exploration is required.
A major benefit of sampling is the speed and efficiency of working with a smaller data table that still contains
the essence of the entire database. Ideally, one uses enough data to reveal the important findings, and no more.
Sufficient quantity depends mostly on the modeling technique, and that in turn depends on the problem.
Sampling enables analysts to spend relatively more time fitting models and thereby less time waiting for
modeling results.
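The claim that a representative sample can stand in for the full database can be checked with a small simulation. This is a sketch under stated assumptions: the "database" is a synthetic column of 100,000 normally distributed values, and the sample size of 1,000 and the fixed seed are arbitrary choices made for reproducibility.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for one numeric column of a large database table.
population = [rng.gauss(100, 15) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the rows.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"full-table mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {abs(population_mean - sample_mean):.2f}")
```

With a standard deviation of 15 and 1,000 sampled rows, the standard error of the sample mean is roughly 15 / sqrt(1000), or about 0.5, so the two means typically differ by well under one unit while the sample is a hundred times cheaper to process.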
Visualization of the data and its structure, which facilitates understanding of the data as well as the drawing
of conclusions from it, is another central theme in DM. Visualization of quantitative data as a major activity
flourished in the statistics of the 19th century. Both the theory and the practice of visualizing quantitative
data have changed dramatically in recent years. To better understand a variable, univariate plots of the
distribution of values are useful. To examine relationships among variables, bar charts and scatter plots
(2-dimensional and 3-dimensional) are helpful. Spinning data to gain a 3-dimensional understanding of
point-clouds and the use of projection pursuit are just two examples of visualization technologies that
emerged from statistics.
Data cleansing (detecting, investigating, and correcting errors, outliers, missing values, and so on) can be very
time-consuming. To cleanse the entire database might be a very difficult and frustrating task. Data
augmentation (adding information to the data such as demographics, credit bureau scores, and so on), like data
cleaning, will be less expensive if applied only to a sample.
4.3 Bayesian network: An important Linkage between Statistics and Data Mining
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest.
The model is well suited for Data Mining. When used in conjunction with statistical techniques, this graphical
model has several advantages for data modeling.
1. Since the model encodes dependencies among all variables, it readily handles situations where some
data entries are missing. Hence it is suited to incomplete and noisy data.
2. A Bayesian network can be used to learn causal (cause-effect) relationships, and hence can be used
to gain understanding about a problem domain and to predict the consequences of intervention, especially in
the case of exploratory data analysis.
3. As the model has both causal and probabilistic semantics, it is an ideal representation for combining
prior knowledge (which often comes in causal form) and data. In a real-world modeling task, one knows the
importance of prior or domain knowledge, especially when data is scarce or expensive. The fact that some
commercial systems (i.e., expert systems) can be built from prior knowledge is a testament to this point. Thus
posterior values can be derived with the help of prior knowledge.
4. Overfitting occurs when a model describes random error or noise instead of the underlying
relationship. Overfitting generally occurs when a model is excessively complex, such as having too many
parameters relative to the number of observations. A model that has been overfit will generally have
poor predictive performance, as it can exaggerate minor fluctuations in the data (source: Wikipedia). Bayesian
statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for
avoiding the overfitting of data.
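The first advantage in the list above, handling missing entries, can be sketched with a tiny three-node network. Everything here is an illustrative assumption: the variables (Rain, Sprinkler, WetGrass), the network structure and all probabilities are hypothetical textbook-style values, not taken from the paper. The sketch computes the posterior probability of rain given wet grass while the sprinkler state is unobserved, by summing it out of the joint distribution.

```python
from itertools import product

# Hypothetical prior probabilities for the two root nodes.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}

# Hypothetical conditional table: P(WetGrass = True | Rain, Sprinkler).
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """Joint probability from the network factorization P(R) * P(S) * P(W | R, S)."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)


def posterior_rain_given_wet(wet=True):
    """P(Rain = True | WetGrass = wet), marginalizing over the unobserved Sprinkler."""
    numerator = sum(joint(True, s, wet) for s in (True, False))
    evidence = sum(joint(r, s, wet) for r, s in product((True, False), repeat=2))
    return numerator / evidence


print(f"prior P(rain)           = {P_RAIN[True]:.3f}")
print(f"posterior P(rain | wet) = {posterior_rain_given_wet():.3f}")
```

The missing sprinkler reading does not have to be imputed or the record discarded; the encoded dependencies let the prior belief about rain be updated directly from the evidence that is available.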
5. Conclusion & Future Work
This paper presents several aspects of Statistics and Data Mining, aiming to provide the background and a
basic understanding of the topics and the linkage between them.
Statistics is observed to be very useful for verifying relationships among a few parameters when the
relationships are linear. It covers every aspect, from planning the collection of data and subsequent data
management to end-of-the-line activities such as drawing inferences from numerical facts, called data, and
presenting the results. It can be viewed as fulfilling a basic human need.
On the other hand Data mining may be viewed as building many complex, predictive, nonlinear models which
are used for predicting behavior impacted by many factors. It is used to discover those hidden patterns and
relationships in our data that make business decisions more accurate and realistic. Another reason why data
mining has a scientific and commercial future was given by Friedman “Every time the amount of data
increases by a factor of 10, we should totally rethink how we analyze it.”[12]
What distinguishes data mining from conventional statistical data analysis is that data mining is usually done
for the purpose of 'secondary analysis' aimed at finding unsuspected relationships, perhaps, unrelated to the
purposes for which the data were originally collected. In other words, data mining is very much an inductive
exercise, as opposed to the traditional hypothetico-deductive approach of statistics. Data mining sits at the
common frontiers of fields such as Information Systems (Database management & Data warehousing),
Computer Science (Artificial Intelligence, Machine Learning & Pattern Recognition), and Statistics (Data
Visualisation & Modelling) [13].
There is a strong linkage between Statistics and Data Mining because most of the basic functions in DM are
covered in statistics using proper algorithms. Since DM focuses on non-parametric data, which traditional
statistics relies on less, there is considerable scope for applying the non-parametric tests of statistics in
DM.
References:
[1] Anders Hald, A History of Probability and Statistics and Their Applications before 1750, Wiley-IEEE, ISBN
0471471291, (2003).
[2] Berry, M.J.A. and Linoff, G., Data Mining Techniques: For Marketing, Sales and Customer Support, New York,
Wiley, (1997).
[3] Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship
Management, New York, Wiley, (2000).
[4] Bill Palace, Data Mining: What is Data Mining?, (1996),
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[5] Chatfield C., Data Mining, Royal Statistical Society News, Vol. 25, (1997), 1-2.
[6] Clark Glymour et al., Statistical Themes and Lessons for Data Mining, Data Mining and Knowledge
Discovery, Vol. 1, Kluwer Academic Publishers, (1997), 11-28.
[7] Data Mining Community's Top Resource, Data Mining and Knowledge Discovery: An Introduction,
(2011), http://www.kdnuggets.com/data_mining_course/x1-intro-to-data-mining-notes.html
[8] David J. Hand, Statistics and Data Mining: Intersecting Disciplines, ACM SIGKDD, Vol. 1, Issue 1, (1999).
[9] David J. Hand, Data Mining: Statistics and More?, The American Statistician, Vol. 52, No. 2, (1998).
[10] Emanuel Parzen, Data Mining, Statistical Methods Mining and History of Statistics, Interface Symposium on
Computing Science and Statistics, Proceedings, ed. D. Scott, (1998).
[11] Florence Nightingale, Statistics, (2011),
http://jwilson.coe.uga.edu/emt668/EMAT6680.Folders/Brooks/6690stuff/Statistics/Statistics.htm
[12] Friedman J.H., Data Mining and Statistics: What's the Connection?, 29th Symposium on the Interface, (1998).
[13] Ganesh, S., Data Mining: Should It Be Included in the Statistics Curriculum?, The 6th International
Conference on Teaching Statistics (ICOTS 6), Cape Town, South Africa, (2002).
[14] Glymour et al., Statistical Inference and Data Mining, Communications of the ACM, Vol. 39, No. 11, (1996).
[15] Goodman A., Kamath C. and Kumar V., Data Analysis in the Twenty-First Century, Vol. 1, No. 1,
Lawrence Livermore National Laboratory (LLNL), Livermore, CA, (2008), 1-3.
[16] Gorunescu F., Data Mining: Concepts, Models and Techniques, Vol. 12, Intelligent Systems Reference Library,
Springer, (2011).
[17] Hamparsum Bozdogan et al., Statistical Data Mining and Knowledge Discovery, 2nd edition, London, (2004).
[18] Hand, D.J., Data Mining: Statistics and More?, American Statistician, Vol. 52, (1998), 112-118.
[19] Hand, D.J., Data Mining: New Challenges for Statisticians, Proceedings of the ASC (Association for
Survey Computing) International Conference, (1999), 21-26.
[20] Hand, D.J., Statistics and Data Mining: Intersecting Disciplines, SIGKDD Explorations, 1, (1999), 16-19;
Hand, D.J., Blunt, G., Kelly, M.G. and Adams, N.M., Data Mining for Fun and Profit, Statistical Science, 15,
(2000), 111-131.
[21] Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, Cambridge, MA: MIT Press, (2001).
[22] Hastie, T., Tibshirani R. and Friedman J.H., Elements of Statistical Learning: Data Mining, Inference and
Prediction, Springer Verlag, New York, (2001).
[23] I. Krishna Murthy, Data Mining - Statistics Applications: A Key to Managerial Decision Making, article,
indiastat.com, (2010).
[24] J. Hosking, E. Pednault and M. Sudan, A Statistical Perspective on Data Mining, Future Generation Computing
Systems, special issue on Data Mining, (1997).
[25] Jure Leskovec, Data Mining: Introduction, (2010), http://www.stanford.edu/class/cs345a/slides/01-intro.pdf
[26] Kamber M. and Han J., Data Mining: Concepts and Techniques, 2nd Edition, Elsevier Inc., USA, (2006).
[27] Kuonen, D., Data Mining and Statistics: What is the Connection?, The Data Administrative Newsletter,
Switzerland, (2004).
[28] Kurt Thearling, An Introduction to Data Mining, (2010), http://www.thearling.com/text/dmwhite/dmwhite.htm
[29] Lomax, R.G., An Introduction to Statistical Concepts for Education and Behavioral Sciences (2nd ed.), New
York: Routledge, (2007).
[30] Lovleen Kumar Grover and Rajni Mehra, The Lure of Statistics in Data Mining, Journal of Statistics Education,
Volume 16, Number 1, (2008), www.amstat.org/publications/jse/v16n1/grover.html/
[31] Math Zone, Definition of Statistics, (2011), http://www.emathzone.com/tutorials/basic-statistics/definition-ofstatistics.html
[32] Neal Leavitt, Data Mining for the Corporate Masses, (2011), http://www.leavcom.com/ieee_may02.htm;
Robert Nisbet et al., The Handbook of Statistical Analysis and Data Mining Applications, Academic Press, ISBN:
0123747651, (2009), www.elsevierdirect.com/datamining
[33] SAS Analytics, Statistics Definitions, http://www.businessdictionary.com/definition/statistics.html
[34] Siva Ganesh, Data Mining: Should It Be Included in the 'Statistics' Curriculum?, ICOTS6, (2002).
[35] SPSS Inc., SPSS Data Mining Tips, ISBN 1-56827-282-0, Printed in the U.S.A., (2005); STATISTICA Data
Analysis Software and Services, StatSoft Electronic Statistics Textbook, Data Mining Techniques, (2011),
http://www.statsoft.com/textbook/data-mining-techniques/
[36] Stephen M. Stigler, Statistics on the Table: The History of Statistical Concepts and Methods, Cambridge, Mass:
Harvard University Press, (2002).
[37] Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA:
Belknap Press of Harvard University Press, (1986).
[38] Tim Menzies and Ying Hu, Computing Practices: Data Mining for Very Busy People, (2009),
http://biblioteca.universia.net/html_bura/ficha/params/title/computing-practices-data-mining-for-verybusypeople/id/47808919.html
[39] U. Fayyad, S. Chaudhuri and P. Bradley, Data Mining and Its Role in Database Systems, Proceedings of the 26th
VLDB Conference, Cairo, Egypt, Morgan Kaufmann, (2000), 63-124.
[40] Wiley InterScience, Data Analysis in the 21st Century, (2007), www.interscience.wiley.com/.
[41] Wikipedia, Data Mining, (2011), http://en.wikipedia.org/wiki/Data_mining