Evolving Ideas: Computing, Communication and Networking
Published by Global Vision Publishing House
Edited by Jeetendra Pande, Nihar Ranjan Pande and Deep Chandra Joshi

Interaction of Computer and Statistics: A Case of Data Mining

Gaurav Shukla (1), Pankaj Tiwari (2) and Mohit Giri Goswami (3)

1. LES Division, IVRI, Izzatnagar, Bareilly (U.P.). Email: [email protected]
2. LES Division, IVRI, Izzatnagar, Bareilly (U.P.). Email: [email protected]
3. AITS, Haldwani, Nainital (Uttarakhand). Email: [email protected]

ABSTRACT

Data mining is emerging as a key technique in fields such as medicine, agriculture and business, and as an important tool for homeland security efforts. It represents a significant advance in the type of analytical tools currently available, and is becoming increasingly common in both the private and public sectors. Industries such as banking, insurance, medicine and retailing commonly use data mining to reduce costs, enhance research, and increase sales. In the public sector, data mining applications were initially used as a means to detect fraud and waste, but have grown to serve purposes such as measuring and improving program performance.

INTRODUCTION

Data mining, a branch of computer science, is the analytic process of extracting hidden predictive information from large datasets by combining methods from statistics and artificial intelligence with database management. It is a powerful new technology that helps companies focus on the most important information in their data warehouses. Modern businesses see data mining as an increasingly important tool for transforming data into business intelligence, giving an informational advantage. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.

Data mining (sometimes called data or knowledge discovery) is also the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications.

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The increasing power of computer technology has increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s) and decision trees (1960s). A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations.
An unavoidable fact of data mining is that the subset(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.

Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions are under development.

FUNDAMENTALS OF DATA MINING

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

In the evolution from business data to business information, each new step has built upon the previous one.
For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed below were revolutionary because they allowed new business questions to be answered precisely and rapidly.

Evolutionary Step: Data Collection (1960s)
  Business Question: "What was my total revenue in the last five years?"
  Enabling Technologies: Computers, tapes, disks
  Product Providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
  Business Question: "What were unit sales in India last March?"
  Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
  Business Question: "What were unit sales in India last March? Drill down to Mumbai."
  Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
  Business Question: "What's likely to happen to Mumbai unit sales next month? Why?"
  Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Some crucial concepts of data mining are:

Bagging (Voting, Averaging)

The concept of bagging (voting for classification problems, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining. It is used to combine the predicted classifications (predictions) from multiple models, or from the same type of model trained on different learning data.
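As an illustration, bagging for a regression-type problem can be sketched in a few lines of Python. The "model" here is deliberately trivial (it predicts the sample mean) and stands in for any regression learner; the function names and toy data are illustrative, not from the chapter.

```python
import random
import statistics

def fit_mean_model(sample):
    # Trivial stand-in for a regression learner: always predict the sample mean.
    mean = statistics.mean(sample)
    return lambda: mean

def bagged_prediction(data, n_models=25, seed=0):
    """Train one model per bootstrap resample, then average their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        model = fit_mean_model(bootstrap)
        predictions.append(model())
    return statistics.mean(predictions)

data = [2.0, 4.0, 6.0, 8.0, 10.0]
print(bagged_prediction(data))
```

Averaging over many bootstrap resamples smooths out the sensitivity of any single fit to the particular data it was trained on.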
It is also used to address the inherent instability of results when applying complex models to relatively small data sets.

Boosting

The concept of boosting applies to the area of predictive data mining. It is used to generate multiple models or classifiers (for prediction or classification), and to derive weights for combining the predictions from those models into a single prediction or predicted classification. Boosting generates a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. Boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs.

Data Preparation

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100) or impossible data combinations (e.g., Gender: Male, Pregnant: Yes).

Data Reduction

The term data reduction in the context of data mining is usually applied to projects where the goal is to aggregate or merge the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, etc.

Deployment

The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data.
After a satisfactory model or set of models has been identified (trained) for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.

Drill-Down Analysis

The concept of drill-down analysis applies to the area of data mining, denoting the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (e.g., gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, we may want to "drill down" to expose and further analyze the data "underneath" one of the categorizations.

Feature Selection

One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. It is therefore used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analysis with any of the other methods for regression and classification.
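A minimal sketch of such a screening pass in Python: each candidate predictor is scored by how well it separates a binary outcome, and only the top k are kept for model building. The score used here (a crude standardized difference of class means) is one simple choice among many; real feature selectors often use non-parametric or information-based measures. All names and data are illustrative.

```python
import statistics

def score_feature(values, labels):
    """Score one predictor: absolute difference between class means,
    scaled by the overall standard deviation (a crude separation measure)."""
    group0 = [v for v, y in zip(values, labels) if y == 0]
    group1 = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(values) or 1.0
    return abs(statistics.mean(group1) - statistics.mean(group0)) / spread

def select_features(columns, labels, k=2):
    """Keep the k highest-scoring predictors from a dict of candidate columns."""
    ranked = sorted(columns,
                    key=lambda name: score_feature(columns[name], labels),
                    reverse=True)
    return ranked[:k]

labels = [0, 0, 0, 1, 1, 1]
columns = {
    "income": [10, 12, 11, 30, 32, 31],  # separates the classes well
    "noise":  [5, 9, 4, 6, 8, 5],        # unrelated to the outcome
    "age":    [20, 22, 21, 40, 41, 39],  # also separates well
}
print(select_features(columns, labels, k=2))
```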
Machine Learning

Machine learning, computational learning theory, and similar terms are often used in the context of data mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques used to generate the predictions are interpretable or open to simple explanation.

Meta-Learning

The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as stacking (stacked generalization).

Models for Data Mining

In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. The general sequence of steps for data mining projects is:

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations).
It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew out of the manufacturing, quality improvement, and process control traditions, and is particularly well suited to production environments. Another framework of this kind is the approach proposed by the SAS Institute, called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project.

Predictive Data Mining

The term predictive data mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent.

Stacking (Stacked Generalization)

The concept of stacking (stacked generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.

Text Mining

While data mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., business-critical) information is stored in the form of text. Unlike numeric data, text is often unstructured and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc., and the preparation of the text processed in that manner for further analysis with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
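As a toy sketch, the co-occurrence step of text mining can be reduced to a few lines of Python: extract key terms from each document, then count pairs of terms that appear together. The tokenizer, stopword list, and example documents are deliberately minimal and purely illustrative.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "at"}

def key_terms(document):
    """Very small tokenizer: lowercase words, punctuation stripped, stopwords removed."""
    words = [w.strip(".,").lower() for w in document.split()]
    return {w for w in words if w and w not in STOPWORDS}

def concept_cooccurrences(documents):
    """Count how often pairs of terms appear in the same document,
    turning free text into numeric data a miner can work with."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(key_terms(doc)), 2):
            counts[pair] += 1
    return counts

docs = [
    "Transformer fault detected in substation A.",
    "Routine inspection of transformer in substation B.",
    "Fault reported at the feeder.",
]
counts = concept_cooccurrences(docs)
print(counts[("substation", "transformer")])  # → 2
```

Once text is reduced to counts like these, the numeric techniques described elsewhere in the chapter (association rules, clustering, etc.) apply directly.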
Architecture for Data Mining

Data Warehousing

Data warehousing is the process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. The most efficient data warehousing architecture will be capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems, using designated technology suitable for any database management system (e.g., Oracle, Sybase, MS SQL Server). To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. The figure below illustrates an architecture for advanced analysis in a large data warehouse.

Fig. 1: An architecture for advanced analysis in a large data warehouse.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The data mining server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the advanced analysis server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP server by providing a dynamic metadata layer that represents a distilled view of the data.
Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

METHODS OF DATA MINING

Association Rule Learning

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {burger}, found in the sales data of a supermarket, would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy burgers. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection and bioinformatics.

Cluster Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in other clusters. Cluster analysis is also called classification analysis. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, and bioinformatics. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.
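The supermarket rule {onions, potatoes} => {burger} described under Association Rule Learning is typically quantified by its support (how often all the items occur together) and its confidence (how often the consequent appears given the antecedent). A toy Python sketch, with invented baskets:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): joint support over antecedent support."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

baskets = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]
# Rule {onions, potatoes} => {burger}
print(support({"onions", "potatoes", "burger"}, baskets))       # 2 of 4 baskets
print(confidence({"onions", "potatoes"}, {"burger"}, baskets))  # 2 of 3 baskets
```

Algorithms such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds.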
Types of Clustering

Hierarchical clustering: Hierarchical clustering is characterized by the development of a hierarchy, or tree-like structure. Hierarchical methods can be agglomerative or divisive. Agglomerative clustering starts with each object in a separate cluster; clusters are formed by grouping objects into bigger and bigger clusters, and this process continues until all objects are members of a single cluster. Divisive clustering starts with all objects grouped in a single cluster; clusters are divided or split until each object is in a separate cluster. Agglomerative methods comprise linkage methods, error sum of squares or variance methods, and centroid methods. A linkage method is an agglomerative hierarchical clustering method that clusters objects based on a computation of the distance between them. A variance method is an agglomerative method in which clusters are generated so as to minimize the within-cluster variance. A centroid method is an agglomerative method in which the distance between two clusters is the distance between their centroids.

Non-hierarchical clustering: The second type of clustering procedure, the non-hierarchical clustering methods, is frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold and optimizing partitioning. Sequential threshold is a non-hierarchical clustering method in which a cluster centre is selected and all objects within a pre-specified threshold value from the centre are grouped together. Similarly, parallel threshold is a non-hierarchical clustering method that specifies several cluster centres at once; all objects within a pre-specified threshold value from a centre are grouped together. Optimizing partitioning is a non-hierarchical clustering method that allows for the later reassignment of objects to clusters in order to optimize an overall criterion.
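A bare-bones sketch of k-means on one-dimensional data: assign each point to its nearest centre, recompute each centre as its cluster mean, and repeat. The initialization (the first k points) and the toy data are illustrative; real implementations use better seeding and convergence checks.

```python
import statistics

def kmeans(points, k, rounds=10):
    """Plain k-means on 1-D data."""
    centres = points[:k]  # naive initialization: the first k points
    clusters = []
    for _ in range(rounds):
        # Assignment step: each point joins its nearest centre's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its cluster.
        centres = [statistics.mean(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centres, clusters = kmeans(points, k=2)
print(sorted(round(c, 1) for c in centres))  # → [1.0, 9.5]
```

With this data the two centres settle on the means of the two obvious groups after a couple of rounds.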
Structured Data Analysis

Structured data analysis is the statistical data analysis of structured data. Structure can arise either a priori, as in multiple-choice questionnaires, or in situations where there is a need to search for structure that fits the given data, either exactly or approximately. This structure can then be used for making comparisons, predictions, manipulations, etc. Regression analysis, Bayesian analysis, combinatorial data analysis and tree-structured data analysis are important types of structured data analysis.

Data Analysis

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, while CDA focuses on confirming or falsifying existing hypotheses.

Statistical Methods

Many statistical methods have been used for statistical data analysis. A very brief list of four of the more popular methods is:

• General linear model: A widely used model on which various statistical methods are based (e.g., t-test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.
• Generalized linear model: An extension of the general linear model for discrete dependent variables.
• Structural equation modelling: Usable for assessing latent structures from measured manifest variables.
• Item response theory: Models for (mostly) assessing one latent variable from several binary measured variables.

Predictive Analytics

Predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behaviour patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict future outcomes. It is important to note, however, that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of the assumptions.

Working of Data Mining

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.
• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
• Sequential patterns: Data is mined to anticipate behaviour patterns and trends.
For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID), both of which are used for classification of a dataset.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
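The nearest neighbor method listed above can be sketched directly: classify a new record by majority vote among the k most similar historical records. The risk labels, toy data, and choice of Euclidean distance are illustrative.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify a record by majority vote among the k nearest
    historical records (Euclidean distance on numeric features)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy historical dataset: (feature vector, class)
history = [
    ((1.0, 1.0), "low-risk"),
    ((1.2, 0.9), "low-risk"),
    ((0.8, 1.1), "low-risk"),
    ((5.0, 5.2), "high-risk"),
    ((5.1, 4.9), "high-risk"),
]
print(knn_classify((1.1, 1.0), history, k=3))  # → low-risk
```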
SCOPES AND APPLICATIONS

Scopes

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

• Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. A typical example of a predictive problem is targeted marketing.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Applications

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering. In the study of human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important in helping to improve the diagnosis, prevention and treatment of such diseases.
The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap-change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there is considerable variability amongst normal-condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities. Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse DGA data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle.

A further area of application for data mining in science and engineering is educational research, where data mining has been used to study the factors leading students to choose to engage in behaviours which reduce their learning, and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized and classified so as to facilitate the finding of experts, particularly in scientific and technical fields.
In this way, data mining can facilitate institutional memory. Other examples of data mining applications include the mining of biomedical data facilitated by domain ontologies, mining clinical trial data, traffic analysis using SOM, et cetera.

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.

CONCLUSION

A new technological leap is needed to structure and prioritize information for specific end-user problems in the engineering, medical, agricultural and business fields. Data mining tools can make this leap. Quantifiable business and other benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users. The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs. Recently, data mining has been increasingly cited as an important tool for homeland security efforts. Some observers suggest that data mining should be used as a means to identify terrorist activities, such as money transfers and communications.

REFERENCES

1. Xingquan Zhu and Ian Davidson (2007). Knowledge Discovery and Data Mining: Challenges and Realities. Hershey, New York. pp. 163-189.
2. Pieter Adriaans and Dolf Zantinge (1996). Data Mining. New York: Addison Wesley. pp. 5-6.
3. George Cahlink (2000). "Data Mining Taps the Trends". Government Executive Magazine, October 1, 2000.
4. Y. Peng, G. Kou, Y. Shi and Z. Chen (2008). "A Descriptive Framework for the Field of Data Mining and Knowledge Discovery". International Journal of Information Technology and Decision Making, Volume 7, Issue 4: 639-682.
5. A.J. McGrail, E. Gulski et al. "Data Mining Techniques to Assess the Condition of High Voltage Electrical Plant". CIGRE WG 15.11 of Study Committee 15.
6. R. Baker (2007). "Is Gaming the System State-or-Trait? Educational Data Mining Through the Multi-Contextual Application of a Validated Behavioral Model". Workshop on Data Mining for User Modeling 2007.