International Journal of Research In Science & Engineering
Special Issue: Techno-Xtreme 16
e-ISSN: 2394-8299 p-ISSN: 2394-8280

REVIEW ON DATA MINING

Madhuri Chawale1, Mayuri Lahe2, Prof. S.A. Zunke3
1 Student, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]
2 Student, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]
3 Assistant Professor, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]

ABSTRACT
This paper focuses on data mining and the current trends associated with it. It presents an overview of data mining systems and clarifies how data mining and knowledge discovery in databases are related both to each other and to related fields. Data mining is a technology used for knowledge discovery and for finding significant relationships such as patterns, associations and changes among variables in databases. These techniques facilitate useful data interpretations for the banking sector to avoid customer attrition. Customer retention is the most important factor to be analysed in today’s competitive business environment. Fraud is also a significant problem in the banking sector. Detecting and preventing fraud is difficult because fraudsters develop new schemes all the time, and the schemes grow more and more sophisticated to elude easy detection. We have also tried to identify the research areas in data mining where further work can be continued.

Keywords: Data Mining, KDD

1. INTRODUCTION
In today’s computer age, data storage has grown to such a size that only computerized methods can find information in the large repositories of data available to organizations, whether online or offline. Data mining was conceptualized in the 1990s as a means of addressing the problem of analyzing the vast repositories of data that are available to mankind and are being added to continuously. Data mining is necessary to extract hidden useful information from the large datasets in a given application.
This usefulness relates to the user’s goal; in other words, only the user can determine whether the resulting knowledge answers that goal. Data mining (DM) is also referred to as analytical intelligence and business intelligence. Because data mining is a relatively new concept, it has been defined in various ways by various authors in the recent past. Some widely used techniques in data mining include artificial neural networks, genetic algorithms, the K-nearest neighbor method, decision trees, and data reduction.

Data mining is part of the process of knowledge discovery in databases (KDD), and its goal is to find hidden and interesting information. KDD involves several important steps that help to convert raw data into knowledge. Data mining is just one step in KDD; it is used to extract interesting patterns from data that are easy to perceive, interpret, and manipulate. Several major kinds of data mining methods will be reviewed, including generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization, and meta-rule guided mining.

The explosive growth of databases makes the scalability of data mining techniques increasingly important. Data mining algorithms have the ability to rapidly mine vast amounts of data. Essentially, the two types of data mining approaches differ in whether they seek to build models or to find patterns. The first approach, concerned with building models, is similar to conventional exploratory statistical methods, apart from the problems inherent in the large sizes of the data sets.
The objective is to produce an overall summary of a set of data in order to identify and describe the main features of the shape of the distribution. The second type of data mining approach, pattern detection, seeks to identify small departures from the norm in order to detect unusual patterns of behavior. Scientific data mining distinguishes itself in that the nature of the datasets is often very different from traditional market driven data mining applications.

In this work, a detailed survey is carried out on data mining applications in the healthcare sector, the types of data used and the details of the information extracted. Data mining algorithms applied in the healthcare industry play a significant role in the prediction and diagnosis of diseases. A large number of data mining applications are found in medical areas such as the medical device industry, the pharmaceutical industry and hospital management. The purpose behind the application of data mining is to find useful and hidden knowledge in the database.

IJRISE | www.ijrise.org | [email protected] [537-543]

2. LITERATURE REVIEW
A literature review is a text that sets out the critical points of current knowledge, including substantive findings as well as theoretical and methodological contributions to a particular topic. Literature reviews are secondary sources and do not report any new or original experimental work.

Hian Chye Koh and Gerald Tan mainly discuss data mining and its applications in major areas such as treatment effectiveness, management of healthcare, detection of fraud and abuse, and customer relationship management.

Jayanthi Ranjan presents how data mining discovers and extracts useful patterns from large data sets. That paper demonstrates the ability of data mining to improve the quality of the decision making process in the pharma industry.
One issue in the pharma industry is adverse reactions to drugs.

M. Durairaj and K. Meena illustrate a hybrid prediction system consisting of Rough Set Theory (RST) and an Artificial Neural Network (ANN) for processing medical data. They explain the process of developing a new data mining technique and software to support efficient solutions for medical data analysis. Experiments are carried out on a spermatological data set for predicting the quality of animal semen. The proposed hybrid prediction system is applied to pre-process the medical database and to train the ANN for production prediction. The prediction accuracy is assessed by comparing the observed and predicted cleavage rates.

K. Srinivas, B. Kavitha Rani and Dr. A. Goverdhan mainly examine the potential use of classification based data mining techniques such as rule based methods, decision trees, Naïve Bayes and artificial neural networks on the massive volume of healthcare data. Using medical profiles such as age, sex, blood pressure and blood sugar, such techniques can predict the likelihood of patients getting heart disease.

3. DATA MINING AND KDD PROCESS:
Data mining is a detailed process of analysing large amounts of data and picking out the relevant information. It refers to extracting or mining knowledge from large amounts of data. Data mining is the fundamental stage in the process of extracting useful, comprehensible and previously unknown knowledge from large quantities of data stored in different formats, with the objective of improving the decisions of the companies and organizations where the data are collected. However, data mining and the overall process known as Knowledge Discovery from Databases (KDD) is usually an expensive process, especially in the stages of business objectives elicitation, data mining objectives elicitation, and data preparation. This is especially the case each time data mining is applied to a blood bank.
Data mining can be defined as the extraction of relevant information, i.e. knowledge, from large repositories of data. For this reason it is also called knowledge mining. Many other synonyms are linked with data mining, viz. knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology and data dredging. Data mining is also popularly known as Knowledge Discovery in Databases (KDD), which refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases.

The KDD process includes selecting the data needed for the data mining process; these data may be obtained from many different and heterogeneous data sources. Preprocessing includes finding incorrect or missing data, and many different activities may be performed at this stage. Erroneous data may be corrected or removed, whereas missing data must be supplied. Preprocessing also includes removing noise or outliers, collecting the information necessary to model or account for noise, and accounting for time sequence information and known changes. Transformation converts the data into a common format for processing. Some data may be encoded or transformed into a more usable format. Data reduction, dimensionality reduction and data transformation methods may be used to reduce the number of possible data values being considered. Data mining is the task performed to generate the desired result. Interpretation/evaluation concerns how the data mining results are presented to the users, which is extremely important because the usefulness of the result depends on it. Different kinds of knowledge require different kinds of representation, e.g. classification, clustering, association rules etc.
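The preprocessing and transformation activities described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a method prescribed by this paper: the function names are hypothetical, missing values are supplied with the median, outliers are removed with the interquartile-range rule, and min-max scaling stands in for "a common format".

```python
from statistics import median, quantiles


def supply_missing(values):
    """Supply missing data: replace None with the median of observed values.

    Median imputation is an illustrative assumption, not the paper's method.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def remove_outliers(values, k=1.5):
    """Remove outliers lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The interquartile-range rule and k = 1.5 are assumed for illustration.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Transformation: rescale values to the common range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def preprocess(values):
    """Chain the three steps: supply missing data, drop outliers, transform."""
    return min_max_scale(remove_outliers(supply_missing(values)))


if __name__ == "__main__":
    # Toy readings with one missing value and one obviously erroneous entry.
    readings = [120, None, 130, 125, 900, 128]
    print(preprocess(readings))
```

The order of the steps matters in this sketch: the missing value is filled before the quartiles are computed, and scaling is applied last so that the removed outlier does not compress the remaining values.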
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data. The following figure (fig. 1) shows data mining as simply an essential step in the process of KDD, i.e. Knowledge Discovery from Data.

4. STEPS OF KDD PROCESS:
The knowledge discovery in databases (KDD) process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

4.1 Data cleaning:
Also known as data cleansing, this is a fundamental step in which noisy and irrelevant data are removed from the collection.

4.2 Data integration:
In this step, multiple data sources, often heterogeneous, may be combined into a common source.
Data selection: in this step, the data relevant to the analysis are decided on and retrieved from the data collection.

4.3 Data transformation:
Also known as data consolidation, this is the stage in which the selected data are transformed into forms appropriate for the mining procedure.

4.4 Data mining:
This is the core step of the KDD process, in which clever techniques are applied to extract potentially useful patterns.

4.5 Pattern evaluation:
In this step, strictly interesting patterns representing knowledge are identified based on given measures.

4.6 Knowledge representation:
This is the final step, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

5. DATA ANALYSIS TASKS AND TECHNIQUES:
Several data mining problem types or analysis tasks are typically encountered during a data mining project.
Depending on the desired outcome, several data analysis techniques with different goals may be applied successively to achieve a desired result. For example, to determine which customers are likely to buy a new product, a business analyst may first need to use cluster analysis to segment the customer database, and then apply regression analysis to predict buying behavior for each cluster.

5.1 Data Summarization:
Data summarization gives the user an overview of the structure of the data and is generally carried out in the early stages of a project. This type of initial exploratory data analysis can help to understand the nature of the data and to find potential hypotheses for hidden information. Simple descriptive statistical and visualization techniques generally apply.

5.2 Segmentation:
Segmentation separates the data into interesting and meaningful sub-groups or classes. In this case, the analyst can hypothesize certain subgroups as relevant for the business question based on prior knowledge or on the outcome of data description and summarization. Automatic clustering techniques can detect previously unsuspected and hidden structures in data that allow segmentation. Clustering techniques, visualization and neural nets generally apply.

5.3 Classification:
Classification assumes that a set of objects characterized by some attributes or features belong to different classes. The class label is a discrete qualitative identifier; for example, large, medium, or small. The objective is to build classification models that assign the correct class to previously unseen and unlabeled objects. Classification models are mostly used for predictive modeling. Discriminant analysis, decision trees, rule induction methods, and genetic algorithms generally apply.

5.4 Prediction:
Prediction is very similar to classification. The difference is that in prediction, the class is not a qualitative discrete attribute but a continuous one.
The goal of prediction is to find the numerical value of the target attribute for unseen objects. This problem type is also known as regression, and if the prediction deals with time series data, it is often called forecasting. Regression analysis, decision trees, and neural nets generally apply. The least-squares criterion is a common method used in regression analysis; it finds the regression coefficients that minimize the sum of the squared deviations of the model's predicted values from the observed values of the data.

5.5 Dependency analysis:
Dependency analysis deals with finding a model that describes significant dependencies (or associations) between data items or events. Dependencies can be used to predict the value of an item given information on other data items. Dependency analysis has close connections with classification and prediction because the dependencies are implicitly used for the formulation of predictive models.

6. CLUSTERING:
A cluster is a subset of objects which are “similar”: a subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it. Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Fig. Process of clustering

7. DATA MINING APPLICATIONS IN HEALTHCARE SECTOR:
The healthcare industry today generates large amounts of complex data about patients, hospital resources, disease diagnosis, electronic patient records, medical devices etc.
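For numeric records of this kind, the least-squares criterion from Section 5.4 can be sketched as follows. This is a minimal illustration for a single predictor only, using the standard closed-form solution; the function names and the toy numbers are hypothetical and are not data from this paper.

```python
def fit_least_squares(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Uses the closed-form solution for one predictor:
    slope = S_xy / S_xx, intercept = mean(y) - slope * mean(x).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_deviation(xs, ys, intercept, slope):
    """Sum of squared deviations of predicted values from observed values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Toy, made-up observations used only to exercise the functions.
    xs = [1, 2, 3, 4]
    ys = [2.1, 3.9, 6.2, 7.8]
    b0, b1 = fit_least_squares(xs, ys)
    print(b0, b1, sum_squared_deviation(xs, ys, b0, b1))
```

Any other choice of intercept or slope gives a sum of squared deviations at least as large as the fitted one, which is exactly what the criterion states.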
These large amounts of data are a key resource to be processed and analysed for knowledge extraction that supports cost savings and decision making. Data mining applications in healthcare can be grouped into the following broad categories:

7.1 Treatment effectiveness:
Data mining applications can be developed to evaluate the effectiveness of medical treatments. Data mining can deliver an analysis of which course of action proves effective by comparing and contrasting causes, symptoms, and courses of treatment.

7.2 Healthcare management:
To aid healthcare management, data mining applications can be developed to better identify and track chronic disease states and high-risk patients, design appropriate interventions, and reduce the number of hospital admissions and claims. Data mining can also be used to analyze massive volumes of data and statistics to search for patterns that might indicate an attack by bio-terrorists.

7.3 Customer relationship management:
Customer relationship management is a core approach to managing interactions between commercial organizations, typically banks and retailers, and their customers; it is no less important in a healthcare context. Customer interactions may occur through call centers, physicians’ offices, billing departments, inpatient settings, and ambulatory care settings.

7.4 Fraud and abuse:
Data mining applications that attempt to detect fraud and abuse typically establish norms and then identify unusual or abnormal patterns of claims by physicians, clinics, or others. Such applications can highlight inappropriate prescriptions or referrals and fraudulent insurance and medical claims.

7.5 Medical Device Industry:
Medical devices are an important part of the healthcare system, and they are mostly used for communication.
Mobile communications and low-cost wireless biosensors have paved the way for the development of mobile healthcare applications that supply a convenient, safe and constant way of monitoring patients’ vital signs. Ubiquitous Data Stream Mining (UDM) techniques, such as lightweight, one-pass data stream mining algorithms, can perform real-time analysis on board small or mobile devices while taking into account available resources such as battery charge and memory.

7.6 Pharmaceutical Industry:
Data mining technology is being used to help pharmaceutical firms manage their inventories and develop new products and services. A deep understanding of the knowledge hidden in pharma data is vital to a firm’s competitive position and organizational decision-making.

8. LIMITATIONS:

8.1 Security issues:
Although companies hold a lot of personal information about us online, they often do not have sufficient security systems in place to protect that information.

8.2 Misuse of information:
Trends obtained through data mining that are intended for marketing or some other ethical purpose may be misused. Unethical businesses or people may use the information obtained through data mining to take advantage of vulnerable people.

9. CONCLUSION
This report presents an overview of data mining and its techniques, which have been used to extract interesting patterns and to discover significant relationships among variables stored in huge datasets. Data mining is needed in many fields to extract useful information from large amounts of data. Large amounts of data are maintained in every field to keep different records, such as medical, scientific, educational, demographic, financial and marketing data. Therefore, different ways have been found to automatically analyze the data, to summarize it, to discover and characterize trends in it and to automatically flag anomalies.
Several data mining techniques have been introduced by different researchers. These techniques are used for classification, for clustering and for finding interesting patterns. In our future work, data mining techniques will be applied to a blood donor data set, collected from a blood bank center, to predict blood donors’ behavior and attitude.

REFERENCES
[1] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, Data Mining Techniques and Their Implementation in Blood Bank Sector, “International Journal of Engineering Research and Applications (IJERA)”, ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 4, July-August 2012, pp. 1303-1309
[2] S.D. Gheware, A.S. Kejkar, S.M. Tondare, Data Mining: Task, Tools, Techniques and Applications, “International Journal of Advanced Research in Computer and Communication Engineering”, Vol. 3, Issue 10, October 2014
[3] Shyam Sundaram and Santhanam T, Data Mining Techniques, “IJCSI International Journal of Computer Science Issues”, Vol. 8, Issue 5, No 2, September 2011, ISSN (Online): 1694-0814
[4] M. Durairaj, V. Ranjani, Data Mining Applications in Healthcare Sector, “International Journal of Scientific & Technology Research”, Volume 2, Issue 10, October 2013