Advanced statistical analysis on working accidents using data mining tools
H. P. Mavropoulos1, G. D. Dounias1, G. Potamias2
1 Technical University of Crete, MANLAB, Chania, Greece
2 Institute of Computer Science, FORTH, Heraklion, Greece
Abstract: In this paper we apply alternative data mining techniques to discover useful
knowledge and to drive statistical analysis of databases provided by the Greek Social Insurance
Organization. The databases contain records of working accidents that occurred in Greece during the
years 1989-1995. For this research we use Data Surveyor v. 1.4, a new multi-purpose mining tool
that seems to fit the application domain well.
1. Introduction to Data Mining
Data mining is the discovery of interesting, yet hidden, knowledge in very large databases. Corporate
databases often contain unknown trends, patterns and relationships among objects (e.g. clients and
products) that are of strategic importance to the organization. This knowledge cannot be discovered
easily with conventional query tools or statistical packages, because they either lack support for
handling very large data sets or expect the user to have some idea of the form of the hidden
relationships from the beginning of the search process [1]. Data mining tools in general apply
algorithms to large amounts of data in such a way that the data reveal hidden patterns and relationships
and uncover correlations that were previously invisible to workers and the business. Data mining tools
help the enterprise understand customer behavior, predict events and expose the linkages between
events and trends [4].
2. Presentation of the data mining tool
Data Surveyor [3] allows both the expert user (data analyst) to discover knowledge interactively
and the business user to apply data mining technology to his daily automated activities. Data Surveyor
is a user-friendly data mining tool for the analysis of very large databases. It uses highly efficient
search strategies and database optimization techniques to discover the most interesting business
patterns and trends in minimal time.
Data Surveyor has been developed in close co-operation with leading data mining research centers in
Europe through an EC-funded, multi-national project geared to provide well-researched, advanced data
mining software. As a result of this project, existing data mining algorithms have been reviewed and a
unique decomposition of these algorithms has been established. Where most data mining tools offer a
range of independent algorithms, Data Surveyor brings all these algorithms together in one integrated suite.
Different algorithms can now be combined easily.
The following three data mining dimensions describe the spectrum of data mining algorithms that are
used in Data Surveyor:
Hypothesis Language: The aim of data mining is to discover a model that uncovers hidden information
in a database. A hypothesis language describes such a model. Examples of hypothesis languages are trees,
networks and rules.
Quality Functions: The quality of a hypothesis defines how well the hypothesis fits the real-world
data. Examples of quality functions are entropy and chi-square measures.
Search Strategies: Search strategies are used to find the model that fits the data best. The search
strategy aims at finding the hypotheses with the best quality at minimum effort. Examples of such
search techniques are exhaustive search, hill climbing, beam search and broad-view search.
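The two quality functions named above can be illustrated with a minimal Python sketch. It is not taken from Data Surveyor, whose internal formulas are not given here; the function names `entropy` and `chi_square` and the toy inputs are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical toy inputs, not drawn from the GSIO data.
balanced_labels = ["legs/feet", "legs/feet", "arms/hands", "arms/hands"]
independent_table = [[10, 10], [10, 10]]   # no association between rows and columns
dependent_table = [[20, 0], [0, 20]]       # perfect association

print(entropy(balanced_labels))
print(chi_square(independent_table), chi_square(dependent_table))
```

A low entropy within a candidate group, or a high chi-square value between a condition and the target, both indicate a hypothesis that separates the data well, which is what a search strategy tries to maximize.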
Different business problems require different techniques. Data Surveyor offers a set of techniques with
which users can experiment to find the best solution for their business problems. These
techniques are applied in the following areas:
• Decision rules induction: finding profiles of customers or products that are somehow
extraordinary, for example, faulty products in a production process.
• Decision trees: building segmentation models, e.g. for credit scoring and risk analysis.
• Association rules: used for finding cross-selling patterns in finance and for detecting
product combinations in retail.
• Bayesian networks: Bayesian networks automatically detect correlations between
variables and offer users an easy-to-interpret overview of the relationships in the data.
This knowledge helps to understand the results generated by the above techniques.
All these techniques generate explicit, understandable knowledge. It is very important for data mining
techniques to generate knowledge that can be understood by human experts. Although data mining
techniques generate results that are statistically correct, this does not mean that these results are valid or
useful in the outside world.
3. Description of the application domain
Our application domain concerns the labor (working) accidents that occur annually in Greece and
are reported to the Greek Social Insurance Organisation (GSIO). The complete GSIO database (not
computerized) contains roughly 2,000,000 records (i.e. workers), of whom approximately 0.5 % have
an accident each year. The statisticians of GSIO then build a computerized database
containing the annual accident data. These data collections were encoded appropriately,
based on the detailed annual accident report of GSIO. Each record (i.e. each injured subject)
covers four (4) major areas of information and contains the following 13 distinct attributes:
GSIO local office of the subject, area of economic activity in which the subject is classified, profession,
insurance class, gender, age in years, category of age, cause of accident, type of injury, injured body
part, month of accident, days of funding and category of funding [2].
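As an illustration of the record layout just described, the 13 attributes could be held in a structure like the following Python sketch. The field names and types are assumptions chosen for readability; the actual GSIO encoding is not specified here. The last lines work out the order of magnitude implied by the figures above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AccidentRecord:
    """One injured subject; field names and types are hypothetical."""
    local_office: str        # GSIO local office of the subject
    economic_activity: str   # area of economic activity
    profession: str
    insurance_class: str
    gender: str
    age_years: int
    age_category: str
    accident_cause: str
    injury_type: str
    injured_body_part: str
    accident_month: int
    funding_days: int
    funding_category: str


# Rough yearly volume implied by the text: 0.5 % of about 2,000,000 workers.
INSURED_WORKERS = 2_000_000
expected_accidents_per_year = INSURED_WORKERS * 5 // 1000

print(len(fields(AccidentRecord)), expected_accidents_per_year)
```

Under these figures the yearly accident database would hold on the order of 10,000 records.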
At this stage of experimentation we applied the following exploration scenario: we used association
rules on all the attributes in order to identify the most significant groups and their descriptions (i.e.
attribute-value pairs). We then used these groups as starting points for further exploration with
decision rules, aiming at interesting attribute-value groupings (i.e. as a target attribute-value). To
discover the strongest correlations among the attributes we used association rules, where we selected
beam search as our search strategy. We set the associated parameters (width, depth, probability) to the
most reasonable values (for example, the depth of the search was set to 1, as we were just looking for
initial points of further analysis). Some of the results we obtained are:
• the most important age groups are 26-40 and 41-65 years old (which is expected, as these ages
include almost the whole workforce of the country).
• the most frequently injured body parts are arms/hands and legs/feet.
• most of the accidents occur in Athens/Mainland and Thessaloniki, which is expected, as about 60%
of the population of the whole country lives in these two cities.
• injured workers usually receive 2-4 weeks of funding.
• the most common trauma is simple injury.
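A depth-1 exploration of this kind amounts to ranking single attribute-value pairs by how often they occur. The Python sketch below shows the idea under stated assumptions: it is not Data Surveyor's algorithm, the function name `frequent_pairs` and the support threshold are hypothetical, and the four toy records are invented rather than taken from the GSIO data.

```python
from collections import Counter


def frequent_pairs(records, min_support):
    """Return (attribute, value, support) for every single attribute-value
    pair whose relative frequency is at least min_support."""
    counts = Counter()
    for record in records:
        for attribute, value in record.items():
            counts[(attribute, value)] += 1
    total = len(records)
    result = [
        (attribute, value, count / total)
        for (attribute, value), count in counts.items()
        if count / total >= min_support
    ]
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(result, key=lambda item: (-item[2], item[0], item[1]))


# Hypothetical toy records with only two of the 13 attributes.
toy_records = [
    {"age_category": "26-40", "body_part": "arms/hands"},
    {"age_category": "26-40", "body_part": "legs/feet"},
    {"age_category": "41-65", "body_part": "arms/hands"},
    {"age_category": "16-25", "body_part": "head"},
]

pairs = frequent_pairs(toy_records, min_support=0.5)
for attribute, value, support in pairs:
    print(attribute, value, support)
```

The pairs that survive the threshold play the role of the "most significant groups" and can serve as starting points for a deeper search.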
Starting from the results of the previous experiment, we continued the mining process with
decision rules, searching for interesting correlations.
We set body part as the target attribute and legs/feet as the target value, as we were trying to discover
interesting subgroups of people who injure their legs/feet.
The following rules were extracted:
• the most common leg injury is bone dislocation, mainly caused by falls from height.
• construction workers are also prone to leg injuries, as they fall both from heights and on walking
level.
• the most common leg injuries for construction workers are bone dislocation and simple fractures.
• construction workers receive 30-180 days of funding either when they fall from height or when they
break their leg.
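The search for such subgroups can be sketched as a small beam search over conjunctions of attribute-value conditions. The Python code below is an illustrative sketch, not the procedure implemented in Data Surveyor: the quality function (weighted relative accuracy) is an assumption swapped in for the tool's unspecified measure, the `width` and `depth` arguments mirror the parameters named earlier while the "probability" parameter is not modeled, and the toy records are invented.

```python
def covers(rule, record):
    """True if the record satisfies every (attribute, value) condition."""
    return all(record.get(attr) == value for attr, value in rule)


def quality(rule, records, target):
    """Weighted relative accuracy: coverage * (precision - base rate)."""
    target_attr, target_value = target
    covered = [r for r in records if covers(rule, r)]
    if not covered:
        return 0.0
    base_rate = sum(r[target_attr] == target_value for r in records) / len(records)
    precision = sum(r[target_attr] == target_value for r in covered) / len(covered)
    return len(covered) / len(records) * (precision - base_rate)


def beam_search(records, target, width=3, depth=2):
    """Keep the `width` best rules per level, extending them up to `depth` conditions."""
    target_attr = target[0]
    conditions = sorted(
        {(a, v) for r in records for a, v in r.items() if a != target_attr}
    )

    def rank(rule):
        # Best quality first; ties broken by the rule's sorted conditions.
        return (-quality(rule, records, target), sorted(rule))

    beam, found = [frozenset()], set()
    for _level in range(depth):
        candidates = set()
        for rule in beam:
            used = {attr for attr, _value in rule}
            for cond in conditions:
                if cond[0] not in used:
                    candidates.add(rule | {cond})
        beam = sorted(candidates, key=rank)[:width]
        found.update(beam)
    return sorted(found, key=rank)[:width]


# Hypothetical toy records; the target mirrors the one used in the text.
toy_records = [
    {"profession": "construction", "cause": "fall from height", "body_part": "legs/feet"},
    {"profession": "construction", "cause": "fall from height", "body_part": "legs/feet"},
    {"profession": "construction", "cause": "machinery", "body_part": "arms/hands"},
    {"profession": "clerk", "cause": "fall on level", "body_part": "legs/feet"},
    {"profession": "clerk", "cause": "machinery", "body_part": "arms/hands"},
    {"profession": "driver", "cause": "traffic", "body_part": "head"},
]
target = ("body_part", "legs/feet")
rules = beam_search(toy_records, target, width=3, depth=2)
for rule in rules:
    print(sorted(rule), round(quality(rule, toy_records, target), 3))
```

A narrow beam keeps the search cheap at the risk of missing rules whose individual conditions look weak on their own, which is the usual trade-off against exhaustive search.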
4. Conclusions
In this paper we used a data mining tool to perform advanced statistical analysis on working
accidents. The application of Data Surveyor showed that much additional
information can be extracted from complex databases. The selected application domain constitutes an
ideal opportunity for testing the efficiency of data mining processing, as well as for analyzing complex
attribute associations, discovering knowledge representations and demonstrating ways of acquiring the
information needed to form managerial strategies and address reengineering issues. Future experiments
will focus on finding interesting subgroups that receive many funding days from GSIO, or
professions that seem to have a high risk of injury.
Acknowledgements:
The Statistics department of GSIO is gratefully acknowledged for providing the authors with the working
accident records that were used in this work. This work was funded by the European Community
through the project Knowledge Extraction for Statistical Offices (KESO), project nr 20596.
References:
[1] P. Adriaans, D. Zantinge (1996): Data Mining. Addison-Wesley-Longman, UK.
[2] J. Kardaun, B. van der Wateren, E. Kaper, S. Laaksonen, T. Alanko, C. von der Heyde, G. Dounias,
G. Potamias (1998): Users evaluation of the interface of KESO v1.3 (Technical report, Project nr 20596,
BPA: 6672-98-RSM).
[3] Data Distilleries B.V. (1998): Data Surveyor 1.3 - The KESO III System, Users manual.
[4] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (1996): Advances in knowledge discovery
and data mining. AAAI Press/The MIT Press, CA, USA.