Download Development of an Efficient Data Mining System - ASEE

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Development of an Efficient Data Mining System for the WIC
Program in the State of Tennessee
Abdulqadir Ismail Khoshnaw
and
Information Systems Programmer
Dept. of Health, State of Tennessee
Dr. Satinderpaul Singh Devgan
Professor and Head of ECE
Tennessee State University
Abstract
Data mining is the process of discovering meaningful new correlations, patterns and trends by
searching through large amounts of data, using pattern recognition technologies as well as statistical
and mathematical techniques. This paper addresses the issues related to the rapid increase in the
clinical cost of the Women, Infants, and Children (WIC) program in the State of Tennessee over the
past three years. WIC is a special nutrition program that distributes supplemental food vouchers to low
income families and provides clinical services. Throughout the years of clinical operation, the program
has collected millions of records about its participants. The objective is to build an efficient data
mining system for this large database of WIC program in order to discover interesting patterns in the
clinical data. A systems engineering approach was used to define requirements, and then design,
develop and implement a data mining system. In this system, a Microsoft Access Application for data
storage and a data-mining tool were built. The data mining system was successful in identifying
excessive visits by the WIC participants and resulted in changes in the operating policies and
procedures that can now control its costs.
Introduction
Large amounts of data are being collected by companies, organizations, and government
agencies so that proper analysis of the data can benefit the way the business is conducted. Data
mining is the process of discovering meaningful new patterns and trends through analysis of large
amounts of data, using statistical and mathematical techniques as well as pattern recognition
technologies. Data mining has gained a lot of attention in information technology for its use in such
areas as detecting fraud in bank and credit card transactions, diagnoses in the medical field, and
improving sales in the retail business. A number of software packages have been developed for data
mining of different kinds of data. Data mining has advanced a lot with improvements in data
warehouses and data marts. The State of Tennessee has a special nutritional program for women,
infants and children (WIC). This program distributes supplemental food vouchers to low income families
and provides clinical services. It was noticed that the overall cost of the clinical services of the WIC
program increased significantly during the years 1999 and 2000 over that in year 1998. An efficient
data mining system was used to investigate interesting patterns in the large database of the WIC
program that could explain the cause of this cost increase. This paper describes the systematic
development and application of a data mining system used to identify the cause of cost increase and
provides a solution that will pay back the system development cost within a reasonable time period.
The WIC Database
Data was collected from WIC participants at the time of clinic visits. This data is collected using a
central IBM AS/400 computer in each region of the Department of Health of the State of Tennessee.
The WIC data consists of records of participants with their status in a voucher history file of the
database and records of all clinical visits by participants in an encounter file. This data is captured
and stored in the form of tables of voucher history of participants and their clinical records. This is the
only WIC database and it is operational. It does not have a data warehouse or data mart for analysis
purposes.
System Engineering Approach
Since the life of most State and Federal programs is dependent upon the politics of funding, a
systems engineering management plan (SEMP) was developed to design, develop and implement a
data mining system by the end of the first year and to retire or upgrade the system after five years [1].
During the conceptual design phase, it was recognized that a separate PC based database for data
mining applications is necessary to avoid delays that could occur if operational database was used
[2]. The important stages in the conceptual design phase included development of functional,
operational and maintenance requirements, and selection of a data mining system. Four alternate
systems, which included an in-house data mining system development, a DB Miner system, a Neural
Connection System, and a Clusten Graphics system, were analyzed based on ease of use, skill level
required to implement, use and upgrade, and the overall life cycle cost. The in-house system was
selected as a better alternative [3].
In the preliminary design phase of the in-house system, a PC based Microsoft Access database
subsystem was selected from among the Oracle database and Microsoft Access Application mainly
on the basis of cost and ease of use. The PC based subsystem included designing and building the
Access database for data storage and data mining operations. In this phase the data was
downloaded and data preparations were carried-out. The PC was connected to the WIC database
through the Internet. The Access database was linked to the tables in the WIC database via an OpenDatabase-Connectivity (ODBC) interface. Using SQL queries in the Access database, the download
process is carried out in two steps. At first all clinical records are downloaded from the encounter file
and then the records of the patients, with their status, are downloaded from voucher file of the
database. The encounters are matched with the table that contains the patient’s status to get the
status of each participant. The records are then stored in tables in the Microsoft Access Application.
The clinical records of participants are used for the data mining task (4). This data is then prepared for
clustering and data analysis.
Data Mining Model
The next important subsystem is the selection and design of a clustering based data mining tool.
Clustering is a method or a way to group things together that have similar attributes and a cluster is a
group of similar items or entities. One of the important items or entities that can be clustered is data.
Data clustering is often one of the first steps in data mining analysis. It identifies groups of related
records that can be used as a starting point for exploring further relationships. Clustering WIC data was
the method used for discovering interesting patterns of the WIC participant's visits to the clinics. The
objective of the data mining subsystem is to build a data-mining tool based on clustering WIC data
[5].
Three different clustering methods (the Minimum Distance Method, the Maximum Distance
Method and the K-means Method) were tested on the WIC data and K-means Method performed
better than the other two and was selected as a data mining tool. This tool was developed by using
Visual Basic programming language [3, 4, 5].
Data Preparation and Clustering
The selected in-house data mining system consists of a PC based Microsoft Access Application for
data download, storage, and data preparation, and a data-mining tool programmed in Visual Basic.
The data in the Access database is searched using SQL queries and is fed into the data-mining tool for
clustering and analysis. The data mining process often requires going back to the data in the
database to validate results and to further investigate the discovered information.
After downloading the data from the WIC to the PC based Microsoft Access database, it is
prepared for clustering and data analysis. Data preparation is an important step in the data mining
process. It involves linking and matching tables to get different values from different tables into one
record. It could also involve initial calculations to get some attributes which can not be extracted
directly from the data. Initial data preparation included calculating the number of clinic visits per year
for each participant status. WIC participants were clustered according to the number of visits and the
status of the participant. WIC participants have three different status; women, infants, and children.
The women are further subdivided into pregnant, postpartum and breastfeeding. The data was
prepared for data clustering and organized into records for each participant with the number of visits
and status. The K-means clustering program evaluates each record and assigns it to a cluster
according to the status data element and the number of visits. This process is followed for the entire
record. At the end of the process, the data set is distributed according to the clusters. The clusters are
then further studied for identifying the characteristics of the clusters using analytical methods.
The clusters were helpful in determining the pattern of the clinical visits by WIC participants. The
cluster pattern indicated that many participants had more visits than the four allowed by the WIC
program. Further analysis indicated that the percentage of excess visits was higher among infants and
children. Figures 1 and 2 show the distribution of five and more visits, by various participant statuses,
during the years 1999 and 2000 respectively.
7000
6000
5000
4000
3000
2000
1000
0
5
Infants
Children
6
7
Pregnant W
8
9
10
Delivered W
11
12
Breastfeeding W
Figure 1. Extra visit pattern of WIC participants by category in 1999
Pattern Analysis
The pattern of the extra visits was further analyzed to determine the reasons for higher number of
visits, and their effect on the cost. A detailed examination of the records of the participants who had
higher number of visits was carried out by the WIC staff. This included analysis of the extra visits in the
WIC clinical records for the years 1999 and 2000.
16000
14000
Number of Participants
12000
10000
8000
6000
4000
2000
0
5
6
7
8
9
10
11
12
Number of Visits
Infants
Children
Pregnant W
Delivered W
Breastfeeding W
Figure 2. Extra visit pattern of WIC participants by category in 2000
The analysis of all clinical records for the years 1999 and 2000 confirmed that infants and children
had larger percentage of extra visits. Understanding this pattern, its cause and its effect on the clinical
cost is important. The analysis showed that the extra visits contributed to 97% of the increase in
expenditure in 1999 and 98% in 2000.
Acquisition
Phase
Operational Phase $409000
$245460
$33260
$2740
1
2
3
4
5
$2554 $99
$99
$99
$99
$99
0
$11000
$109
39 K 39 K 39 K 39 K
Figure 3. Cash flow diagram for life cycle
Year
Proposed Plan and Breakeven Analysis
It was recommended to eliminate the extra visits gradually. In the first operational year of the
system life, the clinics were to eliminate the 12 th and 11th visits. This would save $2,740 for that year. In
the second operational year, the clinics were to eliminate the 10 th and 9th visits, which will save $33,260
for that year. Next in the third operational year, the clinics must cut the 8 th and 7th visits which will save
about $245,460. Finally in the last year of the operational life of the system, the clinics must cut the 6 th
visits from the participants. The expenditures and cost savings are shown in Figure 3. The Equivalent
Present Worth analysis indicated that the savings from the proposed plan and the cost of data mining
system development will break-even by the end of 3rd year if a 10% interest rate is considered.
Conclusions
WIC data mining system was developed and implemented with the objective to find interesting
patterns in the clinical records of the WIC participant's database. The systems approach to data
mining and analysis identified that the WIC participants visited WIC clinics more than the number of
visits projected by the program and this pattern was more prevalent among infants and children than
among women. It further indicated that the extra visits by the infants and children were for the use of
other services, which were not required by the WIC program. They return to the clinics for these other
services, which contributed to extra cost to the WIC program. The recommended reduction in extra
visits would save enough money to pay off the system in three years. Data mining system application
was helpful in solving this real problem.
Reference
1. Blanchard B.S., and W.J. Fabrycky, (1998). System Engineering and Analysis, 4th Edition, Prentice
Hall, Upper Saddle River, NJ.
2. Youness S., (2000). Professional Data Warehousing with SQL Server 7.0 and OLAP Services, Wrox
Press Ltd., Chicago, Illinois.
3. Chapman, Pete, and Julian Clinton, Randy Kerber, Thomas Khabza, Thomas Reinartz, Colin
Shearer, Rudiger Wirth, (2000). "Step by step data mining guide", The CRISP-DM Consortium, August
2000.
4. Baldwin, Dirk and David Paradice, (2000). Applications Development in Microsoft Access 2000,
Course Technology, Cambridge, MA.
5. Romesburg, H. Charles, (1984). Cluster Analysis for Researchers, Lifetime Learning Publications,
Belmont, California.
Authors
Mr. Abdulqadir I. Khoshnaw is presently working as Information Systems Support in the Health
Department of the State of Tennessee. He received his B.S. in Civil Engineering from the University of
Saladin in Arbil, Iraq and his Master of Science in Computer and Information Systems Engineering
(CISE) in December 2001 from Tennessee State University, Nashville, TN.
Satinderpaul Singh Devgan is Professor and Head of Electrical and Computer Engineering at
Tennessee State University since 1979. He received his M.S. and Ph.D. degrees in Power Systems from
Illinois Institute of Technology in Chicago, IL before joining Tennessee State University in 1970. His areas
of teaching interest include systems engineering, computer communication and networks,
electromagnetic theory, power system analysis, hybrid wind energy source applications. Over the past
twenty years he has received research funding worth over 9.19 million dollars. His current research
funding is in the areas of Systems engineering approach to HPC software user usability enhancement,
intranet development and graphical interface using Java. He has developed and implemented new
graduate programs in Master of Engineering with option in Electrical Engineering and M.S. and Ph.D. in
Computer and Information Systems Engineering (CISE), and has published in IEEE and ASEE
Conference Proceedings. He is a recipient of Outstanding Researcher of the Year award in 1994 from
Tennessee State University and the G.E. Faculty Excellence Award in 1980. He has served as an IEEE
ABET Evaluator and is currently secretary of Southeastern Center for Electrical Engineering Education.
He is a senior member of IEEE, member of ASEE, Eta Kappa Nu and Phi Kappa Phi Honor Societies and
is a Registered Professional Engineer in the States of Illinois and Tennessee. He is past-chairman of
Southeastern Association of Electrical Engineering Department Heads (SAEEDH) and is serving as
Secretary of the BOD of Southeastern Center for Electrical Engineering Education (SCEEE)