ETH and University of Zurich
Profs. A.D. Barbour – P. Bühlmann – H.R. Künsch
Invited talk in the series Seminar über Statistik
A Statistical Perspective on Data Mining?
by John Maindonald
Australian National University, Canberra, Australia
Tuesday, September 11, 13.30 h
in LEO C15, Leonhardstrasse 27, 8006 Zürich
(LEO is close to the main building, across from the hillside station of the ‘Polybahn’)
Abstract
The range of activities at Australian National University that come under the heading of Data Mining
is wide. It includes numerical methods with a particular focus on algorithms for use in the analysis
of large data sets, design of database query systems, networked databases, data visualisation, computational genomics, and statistics. My serious involvement with the term ‘data mining’ started when
the Dean of the Australian National University Graduate School asked me to prepare a document,
intended to inform both University senior administrators and academics, that would give a coherent
account of data mining.
Data analysis aspects of data mining have the same aims as statistical analysis. Differences between
a statistics and a data mining approach to data analysis have arisen from the different disciplinary skills
of their practitioners, from different historical origins, from different views of the role of statistical
theory, and from differences in the sizes of the data sets that they typically encounter. Legitimate
differences are not, in principle, very different from those that one finds between different areas of
statistical application, for example between the ways that statistical methodology is applied in
agriculture, economics, finance, banking, business, and genomics.
Data miners and statisticians have typically used different tools for data analysis problems, and
had different views of the role of statistical theory. There are a number of questions that it is useful
to consider. Under what circumstances is one or other set of tools likely to be effective? Do large data
sets demand the use of tools that are different from those used with data sets of modest size? To
the extent that size affects the tools that are used, is it necessary to take account of major structure
within the data in order to decide whether, for analysis purposes, a data set is large? For example,
there may be huge amounts of data on each of a small number of hospitals. I will illustrate these
points from analyses that I have undertaken with data sets of modest size.
Data mining highlights the challenge that computing developments pose to statistics. Were the
discipline of statistics to begin its development today, it would develop
differently from the way that it developed historically. The ideas and outlook of those
who bring a computer science perspective to data analysis problems hint at the differences one might
expect to find. At the same time, and notwithstanding the historical accidents that have affected its
development, statistical theory does have solid gains to show from several centuries of development
of its central ideas. The current body of statistical theory will remain important, even though its role
and relevance are likely to change.