Advanced statistical analysis on working accidents using data mining tools

H. P. Mavropoulos1, G. D. Dounias1, G. Potamias2
1 Technical University of Crete, MANLAB, Chania, Greece
2 Institute of Computer Science, FORTH, Heraklion, Greece

Abstract: In this paper we apply alternative data mining techniques in order to discover useful knowledge and to drive the statistical analysis of databases provided by the Greek Social Insurance Organization. The databases contain records of working accidents that happened in Greece during the years 1989-1995. For this research we use Data Surveyor v. 1.4, a new multi-purpose data mining tool that seems to fit the application domain well.

1. Introduction to data mining

Data mining is the discovery of interesting, yet hidden, knowledge in very large databases. Corporate databases often contain unknown trends, patterns and relationships among objects (e.g. clients and products) that are of strategic importance to the organization. This knowledge cannot be discovered easily with conventional query tools or statistical packages, because they either lack support for handling very large data sets or expect the user to have some idea of the form of the hidden relationships from the beginning of the search process [1]. Data mining tools, in general, apply algorithms to large amounts of data in such a way that the data reveal hidden patterns and relationships and uncover correlations that were previously invisible to workers and the business. Data mining tools help the enterprise understand customer behavior, predict events and expose the linkages between events and trends [4].

2. Presentation of the data mining tool

Data Surveyor [3] allows the expert user (data analyst) to discover knowledge interactively, and the business user to apply data mining technology to daily automated activities. Data Surveyor is a user-friendly data mining tool for the analysis of very large databases.
It uses highly efficient search strategies and database optimization techniques to discover the most interesting business patterns and trends in minimal time. Data Surveyor has been developed in close co-operation with leading data mining research centers in Europe through an EC-funded, multi-national project geared to provide well-researched, advanced data mining software. As a result of this project, existing data mining algorithms have been reviewed and a unique decomposition of these algorithms has been established. Where most data mining tools offer a range of independent algorithms, Data Surveyor brings all these algorithms together in one integrated suite, so that different algorithms can be combined easily.

The following three data mining dimensions describe the spectrum of data mining algorithms that are used in Data Surveyor:
• Hypothesis language: the aim of data mining is to discover a model that uncovers hidden information in a database. A hypothesis language describes a model. Examples of hypothesis languages are trees, networks and rules.
• Quality functions: the quality of a hypothesis defines how well the hypothesis fits the data of the real world. Examples of quality functions are entropy and chi-square measures.
• Search strategies: search strategies are used to find the model that fits the data best. The search strategy aims at finding the hypotheses with the best quality at minimum effort. Examples of such search techniques are exhaustive search, hill climbing, beam search and broad-view search.

Different business problems require different techniques. Data Surveyor offers a set of techniques with which the user can experiment to find the best solution for a given business problem. These techniques are applied in the following areas:
• Decision rules induction: finding profiles of customers or products that are somehow extraordinary, for example faulty products in a production process.
• Decision trees: building segmentation models, e.g. for credit scoring and risk analysis.
• Association rules: used for finding cross-selling patterns in finance and for detecting product combinations in retail.
• Bayesian networks: Bayesian networks automatically detect correlations between variables and offer users an easy-to-interpret overview of the relationships in the data. This knowledge helps to understand the results generated by the above techniques.

All these techniques generate explicit, understandable knowledge. It is very important for data mining techniques to generate knowledge that can be understood by human experts. Although data mining techniques generate results that are statistically correct, this does not mean that these results are valid or useful in the outside world.

3. Description of the application domain

Our application domain deals with the labor (working) accidents that occur annually in Greece and are declared to the Greek Social Insurance Organization (GSIO). The total database of GSIO (not held in a computer) contains approximately 2,000,000 records (i.e. workers), of whom approximately 0.5% have an accident each year. The statisticians of GSIO then construct a computer database containing the annual accident data. These collections of data were encoded following the detailed annual accident report of GSIO. Each record (i.e. each injured subject) consists of four (4) major areas of information and contains the following 13 distinct attributes: GSIO local office of the subject, area of economic activity in which the subject is classified, profession, insurance class, gender, age in years, category of age, cause of accident, type of injury, injured body part, month of accident, days of funding and category of funding [2].

At this stage of experimentation, we applied the following exploration scenario: we used association rules on all the attributes in order to identify the most significant groups and their descriptions (i.e. attribute-value pairs). We then used these groups as starting points for further exploration with decision rules, taking an interesting attribute-value pair as the target.

In order to discover the strongest correlations among the attributes we used association rules, where we selected beam search as our search strategy. We set the associated parameters (width, depth, probability) to reasonable values (for example, the depth of the search was set to 1, as we were only looking for starting points for further analysis). Some of the results we obtained are:
• the most important age groups are 26-40 and 41-65 years old (which is expected, as these ages include almost the whole workforce of the country).
• the most frequently injured body parts are arms/hands and legs/feet.
• most of the accidents occur in Athens/Mainland and Thessaloniki, which is expected, as about 60% of the population of the whole country lives in these two cities.
• injured workers usually receive 2-4 weeks of funding.
• the most common trauma is simple injury.

Starting from the results of the previous experiment, we continued the mining process with decision rules, searching for interesting correlations. We set body part as the target attribute and legs/feet as the target value, since we were trying to discover interesting subgroups of people who injure their legs/feet. The following rules were extracted:
• the most common leg injury is bone dislocation, mainly caused by falls from height.
• construction workers are also prone to leg injuries, as they fall both from heights and at walking level.
• the most common leg injuries for construction workers are bone dislocation and simple fractures.
• construction workers receive 30-180 days of funding either when they fall from height or when they break their leg.

4. Conclusions

In this paper we used a data mining tool in order to perform advanced statistical analysis on working accidents.
The application of Data Surveyor showed that much additional information can be extracted from complex databases. The selected application domain constitutes an ideal opportunity for testing the efficiency of data mining processing, as well as for analyzing complex attribute associations, discovering knowledge representations and demonstrating ways of acquiring the information needed to form managerial strategies and address reengineering issues. Future experiments will focus on finding interesting subgroups that receive many funding days from GSIO, or professions that seem to have a high risk of injury.

Acknowledgements: The Statistics department of GSIO is gratefully acknowledged for providing the authors with the working accident records that were used in this work. This work was funded by the European Community through the project Knowledge Extraction for Statistical Offices (KESO), project nr 20596.

References:
[1] P. Adriaans, D. Zantinge (1996): Data Mining. Addison-Wesley-Longman, UK.
[2] J. Kardaun, B. van der Wateren, E. Kaper, S. Laaksonen, T. Alanko, C. von der Heyde, G. Dounias, G. Potamias (1998): Users evaluation of the interface of KESO v1.3 (Technical report, Project nr 20596, BPA: 6672-98-RSM).
[3] Data Distilleries B.V. (1998): Data Surveyor 1.3 - The KESO III System, Users manual.
[4] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (1996): Advances in knowledge discovery and data mining. AAAI Press/The MIT Press, CA, USA.
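
Appendix (illustrative sketch): The two-step exploration scenario of Section 3 (a depth-1 search for frequent attribute-value pairs, followed by rules for a chosen target attribute-value) can be illustrated with a short Python sketch. This is not Data Surveyor's implementation and does not use its interface: it replaces beam search with plain exhaustive enumeration of single conditions, uses support and confidence as stand-in quality measures, and runs on a handful of invented records. The function names, thresholds and the toy data are all assumptions made for illustration; only the attribute names follow Section 3.

```python
from collections import Counter

# Hypothetical toy records: attribute names follow Section 3, values are invented
# purely for illustration and are NOT taken from the GSIO data.
RECORDS = [
    {"profession": "construction", "cause": "fall from height", "body_part": "legs/feet"},
    {"profession": "construction", "cause": "fall from height", "body_part": "legs/feet"},
    {"profession": "construction", "cause": "fall at walking level", "body_part": "legs/feet"},
    {"profession": "clerk", "cause": "fall at walking level", "body_part": "arms/hands"},
    {"profession": "machinist", "cause": "machinery", "body_part": "arms/hands"},
    {"profession": "machinist", "cause": "machinery", "body_part": "arms/hands"},
]


def frequent_pairs(records, min_support):
    """Depth-1 search: attribute-value pairs with support >= min_support."""
    counts = Counter((attr, val) for rec in records for attr, val in rec.items())
    n = len(records)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def rule_stats(records, condition, target):
    """Return (support, confidence) of the rule: condition -> target."""
    (c_attr, c_val), (t_attr, t_val) = condition, target
    covered = [rec for rec in records if rec[c_attr] == c_val]
    hits = [rec for rec in covered if rec[t_attr] == t_val]
    support = len(hits) / len(records)
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence


def rules_for_target(records, target, min_confidence):
    """Single-condition rules predicting the target, best confidence first."""
    t_attr = target[0]
    conditions = {(a, v) for rec in records for a, v in rec.items() if a != t_attr}
    rules = []
    for cond in conditions:
        support, confidence = rule_stats(records, cond, target)
        if confidence >= min_confidence:
            rules.append((cond, support, confidence))
    # Sort by confidence, then support, then condition for a stable order.
    return sorted(rules, key=lambda r: (-r[2], -r[1], r[0]))


if __name__ == "__main__":
    print(frequent_pairs(RECORDS, min_support=0.5))
    for cond, sup, conf in rules_for_target(RECORDS, ("body_part", "legs/feet"), 0.6):
        print(f"{cond[0]} = {cond[1]} -> legs/feet  support={sup:.2f} confidence={conf:.2f}")
```

The first function plays the role of the initial association-rule pass that yields starting points; the second and third mirror the step in which body part = legs/feet is fixed as the target and candidate conditions are ranked. A real tool would search deeper conjunctions of conditions and prune them with a beam of limited width.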