ETH and University of Zurich
Proff. A.D. Barbour – P. Bühlmann – H.R. Künsch

Invited talk in the series Seminar über Statistik

A Statistical Perspective on Data Mining?
by John Maindonald
Australian National University, Canberra, Australia

Tuesday, September 11, 13.30 h in LEO C15, Leonhardstrasse 27, 8006 Zürich
(LEO is close to the main building, across the hill-side station of the 'Polybahn')

Abstract

The range of activities at Australian National University that come under the heading of Data Mining is wide. It includes numerical methods with a particular focus on algorithms for use in the analysis of large data sets, design of database query systems, networked databases, data visualisation, computational genomics, and statistics. My serious involvement with the term 'data mining' started when the Dean of the Australian National University Graduate School asked me to prepare a document, intended to inform both University senior administrators and academics, that would give a coherent account of data mining.

Data analysis aspects of data mining have the same aims as statistical analysis. Differences between a statistics and a data mining approach to data analysis have arisen from the different disciplinary skills of their practitioners, from different historical origins, from different views of the role of statistical theory, and from differences in the sizes of the data sets that they typically encounter. Legitimate differences are not in principle greatly different from those that one finds between different areas of statistical application, for example between the ways that statistical methodology is applied in areas as different as agriculture, economics, finance, banking, business, and genomics.

Data miners and statisticians have typically used different tools for data analysis problems, and have had different views of the role of statistical theory. There are a number of questions that it is useful to consider.
Under what circumstances is one or other set of tools likely to be effective? Do large data sets demand the use of tools that are different from those used with data sets of modest size? To the extent that size affects the tools that are used, is it necessary to take account of major structure within the data in order to decide whether, for analysis purposes, a data set is large? For example, there may be huge amounts of data on each of a small number of hospitals. I will illustrate these points from analyses that I have undertaken with data sets of modest size.

Data mining highlights the challenge that computing developments offer to statistics. Were the discipline of statistics to be starting its development at the present point in time, it would develop in ways that are different from the way that it developed historically. The ideas and outlook of those who bring a computer science perspective to data analysis problems hint at the differences one might expect to find.

At the same time, and notwithstanding the historical accidents that have affected its development, statistical theory does have solid gains to show from several centuries of development of its central ideas. The current body of statistical theory will remain important, even though its role and relevance are likely to change.