Download Intelligence Ingrained Data Mining Engine Architecture

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
Intelligence Ingrained Data Mining Engine Architecture
S. P. Deshpande#1, Dr. V. M. Thakare *2
#
P.G. Department of Computer Science and Technology, D.C.P.E., Amravati (MS), India.
1
[email protected]
*
Department of Computer Science, S.G.B. Amravati University, Amravati (MS), India.
2
[email protected]
Abstract: Information plays vital role in every
field. To take complete advantage of data; it
requires a tool for automatic summarization of
data, extraction of the essence of information
stored, and the discovery of patterns in raw
data. „Data mining‟ is a tool to accomplish the
above mentioned needs. Several attempts have
been made to design and develop the generic
data mining system but no system found
completely generic. The domain experts play
important role in the different stages of data
mining. The decisions at different stages are
influenced by the factors like domain and data
details, aim of the data mining, and the context
parameters. This paper present the architecture
of a Data mining engine that work for specific
problem domain. The engine is developed with
some prerequisite domain knowledge therefore
a non expert can use the system effectively for
data mining.
Key Words: Data Mining, Engine Architecture,
problem specific data mining engine.
I. Introduction
Data mining is the process of
extraction of hidden predictive information
from large databases; it is a powerful
technology with great potential to help
organizations focus on the most important
information in their data warehouses [1,2,3,4].
The automated, prospective analyses offered
by data mining move beyond the analyses of
past events provided by retrospective tools
typical of decision support systems. Data
mining is one of the tasks in the process of
knowledge discovery from the data. The data
stored in the database is used to discover the
patterns of data, which then interpreted by
applying the domain knowledge. The data
mining applications can be generic or domain
specific. The generic application is required to
be an intelligent system that by its own can
takes certain decisions like: selection of data,
selection of data mining method, presentation
and interpretation of the result. Some generic
data mining applications cannot take its own
these decisions but guide users for selection of
data, selection of data mining method and for
the interpretation of the results.
The domain specific applications are
focused to use the domain specific data and
data mining algorithm that targeted for
specific objective. The data generating sources
generate different type of data. Data can be
from a simple text, numbers to more complex
audio-video data. To mine the patterns and
thus knowledge from this data, different types
of data mining algorithms are used. The
collection and selection of context specific
data and applying the data mining algorithm to
generate the context specific knowledge is
thus a skillful job. In many domains specific
data mining applications the domain experts
plays vital role to mine useful knowledge.
II. The Problem Definition
Most of the previous studies on data
mining applications in various fields like:
Language Science [11,12,11], E-learning[34],
Medical
science
and
Health
care
[12,13,15,18], Education [14,33], Banking
[29], Market research [26], E-commerce [31],
Engineering [15], Sports Science [24,28,36],
Detection of terrorist activities [17,27,30,35]
Intrusion detection [21], Quality Control [39],
Software Engineering [19,20], Library Science
[32], Bio-informatics [22,23] etc. uses the
variety of data types range from text to images
and stores in variety of databases and data
structures. The different methods of data
mining are used to extract the patterns and
thus the knowledge from this variety
databases. Selection of data and methods for
data mining is an important task in this process
and needs the knowledge of the domain.
Several attempts have been made to design
and develop the generic data mining system
but no system found completely generic
165
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
[6,7,8,9,10]. Thus, for every domain the
domain expert‟s assistant is mandatory. The
domain experts shall be guided by the system
to effectively apply their knowledge for the
use of data mining systems to generate
required knowledge [5]. The domain experts
are required to select specific data for data
mining, cleaning and transformation of data,
extracting patterns for knowledge generation
and finally interpretation of the patterns for
knowledge generation.
Most of the domain specific data mining
applications show accuracy above 90%. The
generic data mining applications are having
the limitations. From the study of various data
mining applications it is observed that, no
application called generic application is 100 %
generic. The intelligent interfaces and
intelligent agents up to some extent make the
application generic but have limitations. The
domain experts play important role in the
different stages of data mining. The decisions
at different stages are influenced by the factors
like domain and data details, aim of the data
mining, and the context parameters [37]. The
domain specific applications are aimed to
extract specific knowledge. The domain
experts by considering the user‟s interest and
other context parameters guide the system to
select the data, preprocess the data, apply the
data mining algorithm and to discover the
knowledge. The results yields from the domain
specific applications are more accurate and
useful. Therefore it is concluded that the
domain specific applications are more specific
for data mining. From above study it seems
very difficult to design and develop a data
mining system, which can work dynamically
for any domain. Designing a domain specific
system is also a challenging task. Data mining
engine that work for specific problem domain
and can be used by the user do not have much
knowledge of the domain is thus the central
idea of this study.
III. The Proposed System Architecture
In this research work a new
architectural model of a data-mining engine is
proposed. As the problem domain is known
and well defined, the environment in which
the system run is also well defined. The user‟s
requirements are clearly defined and target
processing on the data is also known, in this
situation system design become easy. The
essential domain knowledge is maintained in
the knowledgebase. The system has four major
components: The user Interface, The data
store,
Data
mining
engine
and
Knowledgebase.
The problem related data selected from the
data repository is preprocessed and then
copied in to the data mart. The user interface
for selection of specific data mining algorithm
enables user in the selection of specific
algorithm. The domain knowledge required for
actual mining purpose shall be referenced
from the domain knowledge component. The
discovered knowledge shall be presented to
the users through the user interfaces. The
generated knowledge is also useful for the
future DM applications and the knowledge
discovery process therefore it shall be stored in
the knowledge base also. The Proposed Data
Mining Engine Architecture is as given below:
IV. Implementation of Proposed
Architecture
The
proposed
architecture
is
implemented for generating players profile
from sports related text. In the pilot
application „Profile Generator‟ the system
contains four main components: The user
interface, Data mining engine, data store and
knowledge base. The data store is simple
folder contain text of the interest. This text is
preprocessed; the interfaces are available to
accomplish the preprocessing work. The
166
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
preprocessing includes conversion of text in
lower case letters, removal of punctuation
marks, removal of commonly used words
using stop-list of words, and tokenization of
the text. The knowledge base contain
prerequisite knowledge like: Name of
countries, Name of the sports/games,
commonly used words (stop words) etc. Rules
like: Word presided with, word followed by,
if_word_is etc.
Ex.
IF_PRESIDED_WIT
H
Mr or Ms or Miss
or Mrs or Master or
Shri or Smt or Ku.
Any word
IF_TEXT_IS
<Name>
IF_TEXT_IS
THEN_VALUE
Not
Name_of_Country
Not
Commonly_used_W
ord
Not Numeric string
Not date string
Not
Name_of_Country
Not
Commonly_used_W
ord
Not Numeric string
Not date string
Name
May be Name
IF_FOLLOWED_BY
THEN_VALUE
Of <Country>
Country He/She
representing
The rule base for other attributes like:
Date of Birth, Country he/she representing,
Playing hobby, Favorite food, Favorite Movie,
Favorite clothing and Sports Record etc. is
created and made available with the system.
The knowledge base used with the data mining
engine is lace with domain and problem
specific knowledge. The training phase also
fulfills the knowledge base with relevant
knowledge and therefore a user can use data
mining tool without domain expertise.
This knowledge base is used for the extraction
of values for the above-mentioned variables.
The concept of use of knowledge base in the
data mining application is new. The domain
experts are required to create the initial
knowledge base. The derived values for
different attributes are also stored in the
knowledge base. This table is used to train the
system and to use in the next pass of the
system.
The Data mining engine is design to apply
different data mining algorithm for the
extraction of values for the specific attributes.
In the training phase of the system the
association rule mining algorithm is used to
generate knowledge like associated keywords
in any game This algorithm with 5% support
and 50% confidence generate the rules those
specifies the set of key words associated with
the specific game. Ex. Game Cricket
associated words: Runs, Wicket, Catch, Bold,
Wicketkeeper, Batsman, Bowler, Fielder etc.,
Game Hockey associated words: Goal,
Goalkeeper, Penalty corner, Free hit, Hockey
stick etc. In the process of extraction of values
of the profile attributes the N-Gram analysis
along with association rules is used. The 1tokan, 2-tokans and 3-tokans are used
interchangeably for the extraction of different
attribute values. The interfaces available in
this system provide opportunity to the user to
interact at each level in the data mining
process. They are user friendly and designed
to use effectively by novice user.
V. Result and Discussion
The results obtained from the
application
developed
using proposed
architecture are more convincing. In the pilot
application „Profile Generator‟ the values for
few attributes of a player profile are derived.
The attributes of which values are derived:
Name, Date of Birth, Game he/she plays,
Country he/she representing, other Sports
he/she like, favorite food, favorite movie,
favorite clothing. It has derived values
successfully for these attributes. The results
obtained are as given below:
Sr. Attribute
Value retrieve
No.
Success Rate
1
Full
Name
of 86 %
Player
2
Date of Birth
92 %
3
Game He/She Plays
4
6
Country
He/She 93%
Representing
Other
Sports 84%
He/She Likes
Favorite Food
82%
7
Favorite Movie
84%
8
Favorite Clothing
79%
9
Sports Record
61%
5
99%
167
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
For the attributes for which success rate is
above 90% the algorithm works good and very
rare retrieves false value. For the attributes for
which success rate is between 80% to 90 % the
system prompts the user with few options and
asks for selection of most correct value. In the
extraction of sports record, the text document
plays important role, the algorithm requires
modification to extract sport record as it has
different parameters in different games. This
architecture of a data mining engine is using
domain and problem relevant knowledge. This
knowledge is partially loaded by domain
experts and partially generated in the training
phase of the algorithm. Due to the use of
knowledge base this system can be used by
novice users.
VI. Conclusion
The application developed using
proposed architecture shown success rate
84.44 % in average therefore it can be
comfortably used for data mining. The major
advantage of this architecture is the presence
of domain specific knowledge along with data.
This facilitates the non domain-expert users in
data mining. Existing data mining tools are
required domain expertise and also training of
using the tool. They are limited to extraction
of patterns of data of features which need
further interpretation by the domain experts to
generate knowledge. Using this architecture
system can generate the knowledge but
developing knowledgebase is the main issue in
this context.
[5]
[6]
[7]
[8]
References
[9]
[1]
[2]
[3]
[4]
Introduction to Data Mining and
Knowledge Discovery, Third Edition
ISBN: 1-892095-02-5, Publication of Two
Crows Corporation, 10500 Falls Road,
Potomac, MD 20854 (U.S.A.), 1999.
Daniel
T.
Larose,
Discovering
Knowledge in Data: An Introduction to
Data Mining, ISBN 0-471-66657-2, John
Wiley & Sons, Inc, 2005.
Dunham, M. H., Sridhar S., Data Mining:
Introductory and Advanced Topics,
Pearson Education, New Delhi, ISBN: 817758-785-4, 1st Edition, 2006.
Pete Chapman, Julian Clinton, Randy
Kerber, Thomas Khabaza, Thomas
Reinartz, Colin Shearer and Rüdiger
Wirth, CRISP-DM 1.0 : Step-by-step
[10]
[11]
data mining guide, NCR Systems
Engineering Copenhagen (USA and
Denmark),
DaimlerChrysler
AG
(Germany), SPSS Inc. (USA) and OHRA
Verzekeringenen Bank Group B.V (The
Netherlands), 2000.
Abraham Bernstein and Foster Provost,
An Intelligent Assistant for the
Knowledge Discovery Process, Working
Paper of the Center for Digital Economy
Research, New York University and also
presented at the IJCAI 2001 Workshop on
Wrappers for Performance Enhancement
in Knowledge Discovery in Databases.
H. Baazaoui Zghal, S. Faiz, and H. Ben
Ghezala, A Framework for Data Mining
Based Multi-Agent: An Application to
Spatial Data, volume 5, ISSN 1307-6884,
Proceedings of World Academy of
Science, Engineering and Technology,
April 2005.
Ralf Rantzau and Holger Schwarz, A
Multi-Tier Architecture for HighPerformance Data Mining, A Technical
Project Report of ESPRIT project, The
consortium of CRITIKAL project, Attar
Software Ltd. (UK), Gehe AG (Denmark);
Lloyds TSB Group (UK), Parallel
Applications Centre, University of
Southampton (UK), BWI, University of
Stuttgart (Denmark), IPVR, University of
Stuttgart (Denmark).
J. A., Botia, M. y Garijo, J. R Velasco,.,
A. F Skarmeta., A Generic Data mining
System
basic
design
and
implementation guidelines, A Technical
Project Report of CYCYT project of
Spanish
Government.
http://citeseerx.ist.psu.edu/viewdoc/summ
ary?doi=10.1.1.53.1935
Marcos M. Campos, Peter J. Stengard,
Boriana L. Milenova, Data-Centric
Automated Data Mining, , Publication of
Oracle
Corporation,
Web
Site.
www.oracle.com/technology/products/bi/o
dm/pdf/automated_data_mining_paper_12
05.pdf
J. Sirgo, A. Lopez, R. Janez, R. Blanco, N.
Abajo, M. Tarrio, R. Perez, A Data
Mining Engine based on Internet,
Emerging Technologies and Factory
Automation ETFA 2003, Proceedings of
IEEE Conference, 16-19 Sept. 2003.Web
Sitewww.citeseerx.ist.psu.edu/viewdoc/su
mmary?doi=10.1.1.11.8955
Identification of foreign-accented French
using data mining techniques, By Bianca
Vieru-Dimulescu, Philippe Boula de
168
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
[12]
[13]
[14]
[15]
[18]
[19]
[21]
Mareüil and Martine Adda-Decker,
Computer Sciences Laboratory for
Mechanics and Engineering Sciences
(LIMSI).WebSitewww.limsi.fr/Individu/bi
anca/article/Vieru&Boula&Madda_ParaLi
ng07.pdf
Halteren, H. van, Linguistic Profiling for
Author Recognition and Verification,
Proceedings of the 42nd Annual
Meeting
on
Association
for
Computational
Linguistics
USA,
Barcelona, Spain, Article No. 199, Year of
Publication: 2004.
Linguistic profiling of texts for the
purpose of language verification, By Hans
Van Halteren and Nelleke Oostdijk, The
ILK research group, Tilburg centre for
Creative Computing and the Department
of Communication and Information
Sciences of the Faculty of Humanities,
Tilburg University, The Netherlands.
WebSite
:
www.ilk.uvt.nl/~antalb/textmining/LingPr
ofColingDef.pdf
Application of Data Mining Techniques
for Medical Image Classification, By
Maria-Luiza Antonie, Osmar R. Zaiane,
Alexandru Coman, Proceedings of the
Second International Workshop on
Multimedia Data Mining (MDM/KDD
2001) in conjunction with ACM SIGKDD
conference, San Francisco, August 26,
2001.
Data Mining: Medical and Engineering
Case Studies, By A. Kusiak, K.H.
Kernstine, J.A. Kern, K.A. McLaughlin,
and T.L. Tseng, Proceedings of the
Industrial Engineering Research 2000
Conference, Cleveland, Ohio, pp. 1-7,May
21-23, 2000.
Data Warehousing and Data Mining
System Applied to E-Learning By R.Luis,
J.Redol, D.Simoes, N.Horta, Proceedings
of the II International Conference on
Multimedia
and
Information
&
Communication
Technologies
in
Education, Badajoz, Spain, December 36th 2003.
Crime Data Mining: An Overview and
Case Studies, By Hsinchun Chen,
Wingyan Chung, Yi Qin, Michael Chau,
Jennifer Jie Xu, Gang Wang, Rong Zheng,
Homa Atabakhsh, A project under NSF
Digital Government Programme, USA,
“COPLINK Center: Information and
Knowledge Management for Law
Enforcement,”, July 2000 – June 2003.
Rao, R. B., Krishnan, S. and Niculescu, R.
S., Data Mining for Improved Cardiac
Care, SIGKDD Explorations Volume 8,
Issue 1.
[24]
Kanellopoulos, Y., Dimopulos, T.,
Tjortjis, C., Makris, C., Mining Source
Code Elements for Comprehending
Object-Oriented
Systems
and
Evaluating
Their
Maintainability,
SIGKDD Explorations Volume 8, Issue 1.
[25]
Schultz, M. G., Eskin, Eleazar, Zadok,
Erez, and Stolfo, Salvatore, J., Data
Mining Methods for Detection of New
Malicious Executables, Proceedings of
the 2001 IEEE Symposium on Security
and Privacy, IEEE Computer Society
Washington, DC, USA , ISSN:1081-6011,
2001.
[26]
Cai, W. and Li L., Anomaly Detection
using TCP Header Information,
STAT753 Class Project Paper, May 2004.
Web Site:
http://www.scs.gmu.edu/~wcai/stat753/sta
t753report.pdf.
[27]
Comparative genomics using data mining
tools, By Tannistha Nandi, Chandrika BRao and Srinivasan Ramchandran, Journal
of Bio-Science, Indian Academy of
Sciences, Vol. 27,No. 1, Suppl. 1, page
No. 15-25, February 2002.
[29]
A survey of data mining methods for
linkage dis-equilibrium mapping, By Paivi
Onkamo and Hannu Toivonen, Henry
Stewart Publications 1473 – 9542. Human
Genomics. VOL 2, NO 5, Page No. 336–
340, MARCH 2006.
[30]
Data Mining in Sports: Predicting Cy
Young Award Winners, By Lloyd Smith,
Bret Lipscomb, and Adam Simkins,
Journal of Computer Science, Vol. 22,
Page No. 115-121,April 2007.
[31]
Deng, B., Liu, X., Data Mining in
Quality Improvement, Proceedings of
the Twenty-Seventh Annual SAS Users
Group International Conference 2002 by
SAS Institute Inc., Cary, NC, USA. ISBN
1-59047-061-3.
Web
Site
:
http://www2.sas.com/proceedings/sugi27/
Proceed27.pdf
[32]
Cohen, J. J., Olivia, C., Rud, P., Data
Mining of Market Knowledge in The
Pharmaceutical Industry, Proceeding of
13th Annual Conference of North-East
SAS Users Group Inc., NESUG2000,
Philadelphia Pennsilvania, September 2426 2000.
[33]
Elovici, Y., Kandel, A., Last, M., Shapira,
B., Zaafrany, O.,Using Data Mining
Techniques for Detecting TerrorRelated Activities on the Web, Website:
www.ise.bgu.ac.il/faculty/mlast/papers/JI
W_Paper.pdf
169
ISSN : 2229-6093
S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170
[34]
[36]
[37]
[39]
[40]
Data Mining in Sports: A Research
Overview, By Osama K. Solieman, A
Technical Report, MIS Masters Project,
August 2006.
Foster, D. P. and Stine, R. A., Variable
Selection in Data Mining: Building a
Predictive Model for Bankruptcy,
Journal of the American Statistical
Association, Alexandria, VA, ETATSUNIS, vol. 99, ISSN 0162-1459, pp. 303313 January 15, 2004.
Kraft, M. R., Desouza, K. C., Androwich,
I., Data Mining in Healthcare
Information Systems: Case Study of a
Veterans’ Administration Spinal Cord
Injury Population, IEEE, Proceedings of
the 36th Hawaii International Conference
on System Sciences, 0-7695-1874-5/03,
2002.
Integrating E-Commerce and Data
Mining: Architecture and Challenges, By
Suhail Ansari, Ron Kohavi, Llew Mason,
and Zijian Zheng, IEEE International
Conference on Data Mining, 2001.
Jadhav, S. R., and Kumbargoudar, P.,
Multimedia Data Mining in Digital
Libraries: Standards and Features,
Proceedings of conference Recent
advances in Information Science and
Technology READIT – 2007, pp 54-59,
Organized by Madras Library Association
- Kalpakkam Chapter & Scintific
information Resource Division, Indira
Gandhi Center for Atomic research,
Department
of
Atomic
Energy,
Kalpakkam, Tamilnadu,India. 12-13 July
2007.
[41]
[42]
[43]
[44]
[60]
Anjewierden, A., Koll¨offel, B., and
Hulshof C., Towards educational data
mining: Using data mining methods for
automated chat analysis to understand
and support inquiry learning processes,
International Workshop on Applying Data
Mining in e-Learning, ADML'07, Vol305, Page No 23-32, Sissi, Lassithi - Crete
Greece, 18 September, 2007.
Knowledge Discovery with Genetic
Programming for Providing Feedback to
Courseware Authors, By Cristobal
Romero, Sebastian Ventura and Paul De
Bra, Kluwer Academic Publishers, Printed
in the Netherlands, 30/08/2004.
Crime Data Mining: A General
Framework and Some Examples, By
Hsinchun Chen, Wingyan Chung, Jennifer
Jie Xu, Gang Wang, Yi Qin, Michael
Chau, Technical Report, Published by the
IEEE Computer Society, 0018-9162/04,
pp 50-56, April 2004.
Chodavarapu Y., Using data-mining for
effective (optimal) sports squad selections,
Web Site:
ttp://insightory.com/view/74//using_datamining_for_effective_(optimal)_sports_sq
uad_selections
Vajirkar Pravin, Singh Sachin, and Lee
Yugyung, Context-Aware Data Mining
Framework for Wireless Medical
Application, Lecture Notes in Computer
Science (LNCS), Volume 2736, SpringerVerlag. ISBN 3-540-40806-1, pp. 381 –
391.
170