International Journal of Latest Trends in Engineering and Technology (IJLTET)
Study and Analysis of Data Mining Concepts
M.Parvathi
Head/Department of Computer Applications
Senthamarai College of Arts and Science, Madurai, Tamil Nadu, India
Dr. S.Thabasu Kannan
Principal
Pannai College of Engineering and Information Technology, Madurai, Tamil Nadu, India
Abstract- Data mining is a process that finds useful patterns in large amounts of data. It predicts future trends and
behaviors, allowing businesses to make informed decisions. This paper discusses a few data mining techniques and
algorithms, along with some of the organizations that have adopted data mining technology to improve their businesses
and found excellent results. It also discusses the architecture of data mining systems and the tasks and major issues of
data mining.
Keywords – Data mining techniques, Data mining algorithms, Tasks and Issues.
I. INTRODUCTION
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is
the wide availability of huge amounts of data and the imminent need to turn such data into useful
information and knowledge. The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis to engineering design and science exploration. Data
mining can be viewed as a result of the natural evolution of information technology. An evolutionary path has been
witnessed in the database industry in the development of the following functionalities: data collection and database
creation, data management (including data storage and retrieval, and database transaction processing), and data
analysis and understanding (involving data warehousing and data mining). For instance, the early development of
data collection and database creation mechanisms served as a prerequisite for the later development of effective
mechanisms for data storage and retrieval, and for query and transaction processing. With numerous database systems
offering query and transaction processing as common practice, data analysis and understanding has naturally become
the next target.
II. EVOLUTION OF DATABASE
Since the mid-1980s, database technology has been characterized by the popular adoption of relational technology
and an upsurge of research and development activities on new and powerful database systems. These systems
employ advanced data models such as extended relational, object-oriented, object-relational, and deductive models.
Application-oriented database systems, including spatial, temporal, multimedia, active, and scientific databases,
knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification,
and sharing of data have been studied extensively. Heterogeneous database systems and Internet-based global
information systems such as the World-Wide Web (WWW) have also emerged and play a vital role in the information
industry.
Vol. 5 Issue 1 January 2015
ISSN: 2278-621X
Figure – 1 Evolution of Database
Data Collection and Database Creation (1960s and earlier)
Database Management Systems (1970s – early 1980s)
· hierarchical and network database systems
· relational database systems
· data modeling tools
· query languages
· on-line transaction processing (OLTP)
Advanced Database Systems (mid-1980s – present)
· advanced data models
· advanced applications
Advanced Data Analysis (late 1980s – present)
· data warehouse and OLAP
· data mining and knowledge discovery
· data mining applications
Web-based Databases (1990s – present)
· XML-based database systems
· integration with information retrieval
· data and information integration
New Generation of Data Integration and Information Systems (present and future)
III. EVOLUTION AND FOUNDATIONS OF DATA MINING
Data mining is an application for business and is supported by three technologies:
· Massive data collection
· Multiprocessor computers
· Data mining algorithms
Steps in the evolution of data mining:
· Data collection (1960s) – computers, tapes and disks
· Data access (1980s) – RDBMS, SQL, ODBC
· Data warehousing and decision support (1990s) – online analytic processing (OLAP), multidimensional
databases, data warehouses
· Data mining – advanced algorithms, multiprocessor computers, massive databases
IV. DATA MINING DEFINITION
Data mining is the process of extracting useful information and patterns from huge amounts of data. It is also
called the knowledge discovery process, knowledge mining from data, knowledge extraction, or data / pattern
analysis. The mined information may be any relationship among the data items. The patterns mined should be
valid, actionable, and previously unknown.
Figure – 2 Data Mining Process
Problem definition
Data gathering and preparation
· Data access
· Data sampling
· Data transformation
Model building and evaluation
· Create model
· Test model
· Evaluate and interpret model
Knowledge deployment
· Model apply
· Custom reports
· External applications
Data mining is a logical process that is used to search through large amounts of data in order to find useful data.
The goal of this technique is to find patterns that were previously unknown. Once these patterns are found, they
can be used by businesses to make decisions for their development.
The three steps involved are:
· Exploration
· Pattern identification
· Deployment
Exploration: In the first step, the data are cleaned and transformed into another form, and the important
variables and the nature of the data are determined based on the problem.
Pattern Identification: Once the data have been explored, refined, and defined for the specific variables, the second
step is pattern identification: identify and choose the patterns that make the best prediction.
Deployment: The patterns are deployed to obtain the desired outcome.
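The three steps above can be illustrated with a minimal Python sketch. The records, the function names, and the choice of a single threshold as the "pattern" are all assumptions made for illustration; they are not taken from any particular data mining system.

```python
from statistics import mean

# Hypothetical raw records: (monthly_visits, bought_product); None marks a missing value.
raw = [(2, False), (9, True), (None, False), (7, True), (1, False), (8, True)]

def explore(records):
    """Exploration: clean the data by dropping incomplete records."""
    return [(v, y) for v, y in records if v is not None]

def identify_pattern(records):
    """Pattern identification: choose the visit threshold that predicts purchases best."""
    candidates = sorted({v for v, _ in records})

    def accuracy(t):
        return mean(1.0 if (v >= t) == y else 0.0 for v, y in records)

    return max(candidates, key=accuracy)

def deploy(threshold, visits):
    """Deployment: apply the chosen pattern to a new, unseen customer."""
    return visits >= threshold

clean = explore(raw)
threshold = identify_pattern(clean)
print(threshold, deploy(threshold, 10), deploy(threshold, 3))  # 7 True False
```

In practice each step is far richer (several variables, many candidate patterns), but the division of work is the same.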
V. ARCHITECTURE OF DATA MINING
Figure – 3 Architecture of Data Mining
Components shown, from top to bottom:
· User interface
· Pattern evaluation
· Data mining engine
· Knowledge base
· Data warehouse and database server
· Data cleaning, integration and selection
· Data sources: database, data warehouse, World Wide Web, other repositories
a. Database, Data Warehouse or other Information Repository
This is one database or a set of databases, data warehouses, or other information repositories. Data cleaning
and integration techniques may be applied to the data.
b. Database and Data Warehouse Server
It is responsible for fetching data based on the user's data mining request.
c. Knowledge base
It is used to guide the search or evaluate the interestingness of resulting patterns. It uses concept hierarchies
to organize attributes.
d. Data Mining Engine
It is the essential component. It consists of functional modules for tasks such as characterization,
classification, and clustering.
e. Pattern Evaluation
It applies interestingness measures and interacts with the data mining modules to focus the search on interesting
patterns. It is necessary to confine the search to only the interesting patterns.
f. GUI
It communicates between the user and the data mining system. The user may interact with the system by specifying
data mining queries. It allows the user to browse the database and data structures, evaluate patterns, and visualize
the patterns in different forms.
VI. DATA MINING TASKS
Data mining provides the link between transaction and analytical systems. Data mining software analyzes
relationships and patterns in stored transaction data based on open-ended user queries. Data mining tasks fall
into two categories. Predictive methods use some variables to predict unknown values of other variables.
Descriptive methods identify patterns or relationships in the data.
Figure – 4 Tasks of Data Mining
Predictive tasks:
· Classification
· Regression
· Time series analysis
· Prediction
Descriptive tasks:
· Clustering
· Summarization
· Association rules
· Sequence discovery
a. Classification
It maps data into predefined groups or classes. The classes are determined before the data are examined,
and each stored data item is then placed in one of the predefined groups.
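As a rough illustration of classification, the sketch below assigns a value to one of three predefined classes using a nearest-centroid rule. The data, the class names, and the nearest-centroid rule itself are assumptions chosen for brevity; many other classifiers exist.

```python
from statistics import mean

# Hypothetical labelled training data: (height_cm, class). The classes are fixed in advance.
training = [(150, "short"), (155, "short"), (172, "medium"),
            (175, "medium"), (190, "tall"), (195, "tall")]

def fit_centroids(rows):
    """Compute the mean value of each predefined class."""
    by_class = {}
    for value, label in rows:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}

def classify(centroids, value):
    """Assign a value to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

centroids = fit_centroids(training)
print(classify(centroids, 158), classify(centroids, 188))  # short tall
```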
b. Clustering
In this method the groups are not predefined; they are defined by the data. It determines the similarity among the
data on predefined attributes, and the data are grouped into clusters based on logical relationships.
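To make the contrast with classification concrete, here is a minimal one-attribute k-means sketch in which the groups emerge from the data rather than from labels. The attribute values, the starting centers, and the use of k-means in particular are illustrative assumptions.

```python
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on a single attribute, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

ages = [18, 21, 22, 45, 47, 50]  # hypothetical attribute values
clusters, centers = kmeans_1d(ages, centers=[18, 50])
print(clusters)  # [[18, 21, 22], [45, 47, 50]]
```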
c. Association rules
It identifies data items that are associated with each other. It is often used in the retail sales community to
find items that are frequently purchased together.
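Association rules are usually scored by support and confidence. The sketch below computes both for a toy set of market-basket transactions; the transactions and item names are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5
print(confidence({"butter"}, {"bread"}))  # 1.0
```

Here every basket that contains butter also contains bread, so the rule butter -> bread has confidence 1.0 even though it is supported by only half of the transactions.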
d. Sequence pattern discovery
It is used to determine sequential patterns in data, based on a time sequence of actions. It is similar to
association, except that the relationships are based on time.
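A very small sketch of sequence discovery follows: it counts how many customers bought one item and later another, and keeps the ordered pairs seen at least twice. The purchase histories and the minimum count of 2 are assumptions for illustration; real sequential pattern miners handle longer patterns and time gaps.

```python
from collections import Counter

# Hypothetical time-ordered purchase histories, one list per customer.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]

def ordered_pairs(sequence):
    """All distinct (earlier, later) item pairs within one time-ordered sequence."""
    return {(a, b) for i, a in enumerate(sequence) for b in sequence[i + 1:] if a != b}

counts = Counter(pair for h in histories for pair in ordered_pairs(h))
frequent = {pair: n for pair, n in counts.items() if n >= 2}
print(frequent)
```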
e. Regression
It assumes that the target data fit some known type of function and determines the function of that type that
best fits the data.
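For example, if the assumed function type is a straight line y = a + b*x, the best-fitting line can be found by ordinary least squares. The sketch below uses made-up data points; the linear form is an assumption, not the only function type regression can use.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for the assumed function type y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4]  # hypothetical inputs
ys = [3, 5, 7, 9]  # hypothetical targets, generated here by y = 1 + 2x
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```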
f. Time series analysis
The value of an attribute is examined as it varies over time. The values are obtained over a limited time period.
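One simple way to examine how an attribute varies over time is a moving average, which smooths short-term fluctuation so that the trend is easier to see. The series and the window size below are illustrative assumptions.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive observations."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_sales = [10, 12, 14, 13, 15, 20]  # hypothetical values, one per month
print(moving_average(monthly_sales, 3))   # [12.0, 13.0, 14.0, 16.0]
```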
g. Prediction
It predicts future data states based on past and current data; that is, it predicts a future state rather than the
current state. Applications include flooding, speech recognition, and pattern recognition.
h. Summarization
It maps the data into subsets with associated simple descriptions. It is also known as characterization or
generalization. It derives representative information from the database and characterizes the contents of the database.
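A minimal sketch of summarization is given below: records are mapped into subsets (here, by region) and each subset receives a simple description. The records and the chosen statistics are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (region, sale_amount).
sales = [("north", 100), ("south", 80), ("north", 140), ("south", 120), ("east", 50)]

def summarize(rows):
    """Map each region to a simple description: count, mean, and range."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {region: {"count": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
            for region, v in groups.items()}

summary = summarize(sales)
print(summary["north"])
```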
VII. DATA MINING ISSUES
Figure – 5 Issues of Data Mining
(Data mining shown together with statistics, database technology, visualization, information science, and other
disciplines.)
a. Human interaction
Interfaces may be needed with both domain and technical experts. Experts formulate the queries and interpret the
results. Users identify the data and the desired results.
b. Over fitting
When a model is generated from a given database, it should also fit future databases. Over fitting occurs when the
model fits the current data but not future data. It may arise when the model is created from a small database,
and it may arise even though the data have not changed.
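The effect can be demonstrated on a toy example. In the sketch below, a model that memorizes every training record (including one noisy record) scores perfectly on the training data but poorly on new data, while a simpler rule generalizes better. The data, the noisy record, and both models are assumptions constructed for illustration.

```python
# Hypothetical data: the label is 1 when x >= 5, except for one noisy training record (3, 1).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1)]
test = [(2.6, 0), (3.4, 0), (5.5, 1), (9, 1)]

def memorizer(x):
    """1-nearest-neighbour: copies the label of the closest training record, noise included."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def threshold_rule(x):
    """A simpler model that ignores the noisy record."""
    return 1 if x >= 5 else 0

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

print(accuracy(memorizer, train), accuracy(memorizer, test))            # 1.0 0.5
print(accuracy(threshold_rule, train), accuracy(threshold_rule, test))
```

The memorizing model is over fitted: its perfect training score says little about how it behaves on data it has not seen.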
c. Outliers
Some data entries do not fit the derived model. A model that is built to include these outliers may not behave well
for data that are not outliers.
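One common way to flag candidate outliers before building a model is a z-score test, sketched below. The readings and the threshold of 2 standard deviations are assumptions; other tests (for example, ones based on quartiles) may suit skewed data better.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.0):
    """Return values lying more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

readings = [10, 12, 11, 13, 12, 95]  # hypothetical measurements
print(find_outliers(readings))       # [95]
```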
d. Interpretation of results
Experts are needed to interpret the results, which may be meaningless to average database users.
e. Visualization of results
Visualization is helpful for viewing the output of data mining algorithms.
f. Large data sets
Algorithms designed for small data sets may run into problems when applied to the large data sets associated with
data mining.
g. High dimensionality
Not all attributes are needed to solve a given problem. Using unneeded attributes may increase the complexity and
decrease the efficiency of an algorithm.
h. Multimedia data
Earlier data mining algorithms were designed for traditional data types. Some newer algorithms make use of
multimedia data.
i. Missing data
Missing data may be replaced with estimates. However, missing data can lead to invalid results.
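As a minimal sketch of replacing missing data with estimates, the code below fills each missing value with the mean of the observed values. The data and the choice of the mean as the estimate are assumptions; the estimate can distort later results, which is the risk noted above.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    estimate = mean(observed)
    return [estimate if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # hypothetical attribute with two missing entries
print(impute_mean(ages))         # [20, 30, 30, 40, 30]
```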
j. Irrelevant data
Irrelevant data may not be of use in developing the data mining task.
k. Noisy data
Some data are invalid or incorrect. They must be corrected before the data mining application is run.
l. Changing data
Databases are not static, yet data mining algorithms assume a static database. Algorithms must therefore be run
again whenever the database changes.
m. Integration
Integrating data mining functions into traditional DBMS systems is desirable.
n. Application
The intended use of the information obtained from the data mining function must be determined.
VIII. HOW DATA MINING WORKS
Data mining is a process also called knowledge discovery from databases. It involves scientists and techniques from
machine learning, artificial intelligence, information retrieval, and pattern recognition.
Figure – 6 Working Principles of Data Mining
(Stages shown: learning; collecting relevant data; model building; understanding of business; problem identification;
business strategy and evaluation; action.)
a. Modeling
Modeling means building a model on data from an existing situation where the answer is known and then applying the
model to another situation where the answer is not known.
People have been doing this for a long time. Data storage and communication are no longer a problem. Lots of
information about a variety of situations where the answer is known is loaded, and the data mining software filters
the characteristics of the data that go into the model. Once built, the model can be used in similar situations where
the answer is not known.
b. Discovery
Discovery means finding something new.
Data mining tools sweep through databases and identify previously hidden patterns. An example of pattern discovery
is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
c. Prediction
Prediction seeks the reason behind an outcome by finding a pattern that is associated with a very specific event or
attribute.
d. Over fitting
This data mining term was used in the statistical community.
IX. CONCLUSION
Data mining involves extracting useful rules or interesting patterns from historical data. There are many data mining
tasks, and each has many techniques. By the no free lunch theorem, no single technique is suitable for all kinds of
data in all types of domains. Hybrid techniques have sometimes been observed to perform better than pure ones.
Data mining is a “decision support” process in which we search for patterns of information in data, using techniques
such as classification, clustering, prediction, association, and sequential patterns. Commercial, educational, and
scientific applications are increasingly dependent on these methodologies. Decision trees are a reliable and effective
decision-making technique that provides high classification accuracy with a simple representation of collected
knowledge. They help experts to validate and classify the results and outcomes of tests and to analyze new symptoms of
diseases based on data. Thus, data mining can play an important role in the field of medicine or health care and
disease prediction.