SYSTEM ANALYSIS AND DESIGN RESEARCH REPORT
DATA MINING
26.11.2016
Seyit Mert AYVAZ
2012555008

Introduction

Data is a term that emerged with the rapid development of computers. As computers evolved, data was separated into large and diverse collections, and this separation led researchers to study data from different perspectives. Over time, large databases became hard to work with, and both scientists and the employees of large companies began to look for a solution. Out of this effort, the "data mining" process took its place in the world of informatics.

In this report, I approach data mining from various angles, covering almost every major topic in the field. First, introductory topics are treated on a preferential basis: the term "data mining" itself, its architecture, and the mining process. Since it is important to understand how data mining evolved, its history and milestones are then reviewed. Afterwards comes what I consider the most important topic, the scope of data mining. The field matters as much to the business world as it does to academia, so its impact and the current work on it are described for both sides, academic and commercial. Finally, some ideas about future trends are explained briefly.

I hope this report will be useful to its readers and to anyone who is curious about data mining.
TABLE OF CONTENTS

1. Introduction to Data Mining
   1.1. What is Data Mining?
      1.1.1. Automatic Discovery
      1.1.2. Prediction
      1.1.3. Grouping
      1.1.4. Actionable Information
   1.2. Architecture of Data Mining
      1.2.1. Data Sources
      1.2.2. Database or Data Warehouse Server
      1.2.3. Data Mining Engine
      1.2.4. Pattern Evaluation Modules
      1.2.5. Graphical User Interface
      1.2.6. Knowledge Base
   1.3. Data Mining Processes
      1.3.1. Problem Definition
      1.3.2. Data Exploration
      1.3.3. Data Preparation
      1.3.4. Modeling
      1.3.5. Evaluation
      1.3.6. Deployment
2. History of Data Mining
   2.1. Foundations of Data Mining
   2.2. Evolution in Data Mining for Business
   2.3. Milestones of Data Mining
3. Scope of Data Mining
   3.1. Usage of Data Mining Techniques
      3.1.1. Association
      3.1.2. Classification
      3.1.3. Clustering
      3.1.4. Prediction
      3.1.5. Sequential Patterns
      3.1.6. Decision Trees
   3.2. Data Mining in Academia
      3.2.1. Science and Engineering
      3.2.2. Medical Data Mining
      3.2.3. Spatial Data Mining
      3.2.4. Pattern Mining
      3.2.5. Human Rights
      3.2.6. Sensor Data Mining
   3.3. Data Mining in Business
4. Future of Data Mining
   4.1. Distributed/Collective Data Mining (DDM)
   4.2. Ubiquitous Data Mining (UDM)
   4.3. Hypertext and Hypermedia Data Mining
   4.4. Multimedia Mining
   4.5. Time Series/Sequence Mining

1. Introduction to Data Mining

Before anything else, it is necessary to study and understand some terms related to data mining: data, information, and knowledge. Since every study of data mining also involves these three, it is important to grasp the relationship between them.

Data: Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases.
This includes:

• operational or transactional data, such as sales, cost, inventory, payroll, and accounting
• non-operational data, such as industry sales, forecast data, and macroeconomic data
• metadata, i.e. data about the data itself, such as a logical database design or data dictionary definitions

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

1.1. What is Data Mining?

Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD). In general, the key properties of data mining can be summarized as:

• automatic discovery of patterns
• prediction of likely outcomes
• creation of actionable information
• focus on large data sets and databases

To understand these properties better, we can expand on each of them as follows.

1.1.1. Automatic Discovery

Data mining is accomplished by building models. A model uses an algorithm to act on a set of data.
The notion of automatic discovery refers to the execution of data mining models. Data mining models can be used to mine the data on which they are built, but most types of models generalize to new data. The process of applying a model to new data is known as scoring.

1.1.2. Prediction

Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability (how likely is this prediction to be true?). Prediction probabilities are also known as confidence. Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighborhood is likely to have an income greater than the regional average.

1.1.3. Grouping

Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that has an income within a specified range, a good driving record, and a new car leased on a yearly basis.

1.1.4. Actionable Information

Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might use a model that identifies customer segments to design a promotion targeting high-value customers.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices.
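The idea that a prediction carries an associated confidence can be illustrated with a small sketch. Everything here - the records, the field names, the regional average, and the rule itself - is hypothetical, invented only to show how a rule's confidence can be computed as the fraction of records matching the rule's condition for which the outcome also holds:

```python
# A minimal sketch of "prediction with an associated confidence".
# The records, field names, and the rule below are hypothetical,
# invented for illustration only.

records = [
    {"degree": "bachelor", "neighborhood": "hillside", "income": 62000},
    {"degree": "bachelor", "neighborhood": "hillside", "income": 58000},
    {"degree": "bachelor", "neighborhood": "hillside", "income": 39000},
    {"degree": "none",     "neighborhood": "hillside", "income": 35000},
    {"degree": "bachelor", "neighborhood": "riverside", "income": 41000},
]

REGIONAL_AVERAGE = 45000  # hypothetical figure

def rule_confidence(records, condition, outcome):
    """Confidence of a rule = P(outcome | condition): the fraction of
    records matching the condition for which the outcome also holds."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if outcome(r)) / len(matching)

# Rule: bachelor's degree AND "hillside" neighborhood
# => income greater than the regional average.
conf = rule_confidence(
    records,
    condition=lambda r: r["degree"] == "bachelor"
                        and r["neighborhood"] == "hillside",
    outcome=lambda r: r["income"] > REGIONAL_AVERAGE,
)
print(round(conf, 2))  # 2 of the 3 matching records satisfy the outcome -> 0.67
```

With these toy records, the rule holds for two of the three records matching its condition, so its confidence is about 0.67; a real mining engine computes the same kind of ratio over much larger data.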
These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For instance, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate predictions in a decision support system. Neither the data collection and preparation nor the interpretation and reporting of results is part of the data mining step itself; they belong to the overall KDD process as additional steps.

In other words, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

1.2. Architecture of Data Mining

Figure 1.2.1: Architecture of data mining levels.

The major components of any data mining system are the data sources, the database or data warehouse server, the data mining engine, the pattern evaluation module, the graphical user interface, and the knowledge base. To understand them better, we will examine what each component is and what it is for.

1.2.1. Data Sources: Databases, data warehouses, the World Wide Web, text files, and other documents are the actual sources of data. A large quantity of historical data is necessary for data mining to be successful. Organizations generally store data in databases or data warehouses. Data warehouses may contain one or more databases, text files, spreadsheets, or other kinds of information repositories. Sometimes, data may reside even in plain text files or spreadsheets.
The World Wide Web, or the Internet, is another big source of data. Why does the data pass through several processes first? The data needs to be cleaned, integrated, and selected before being passed to the database or data warehouse server. Because the data comes from different sources and in different formats, it cannot be used directly for data mining: it might not be complete or reliable. So the data is first cleaned and integrated. Moreover, more data than required will usually be collected from the various sources, and only the data of interest needs to be selected and passed to the server. These processes are not as simple as they sound; a number of techniques may be applied to the data as part of cleaning, integration, and selection.

1.2.2. Database or Data Warehouse Server: The database or data warehouse server contains the actual data that is ready to be processed. The server is responsible for retrieving the relevant data based on the user's data mining request.

1.2.3. Data Mining Engine: The data mining engine is the core component of any data mining system. It consists of a number of modules for performing data mining tasks, including association, classification, characterization, clustering, prediction, and time-series analysis.

1.2.4. Pattern Evaluation Module: The pattern evaluation module is mainly responsible for measuring the interestingness of patterns using a threshold value. It interacts with the data mining engine to focus the search towards interesting patterns.

1.2.5. Graphical User Interface: The graphical user interface module mediates between the user and the data mining system. It helps the user use the system easily and efficiently without knowing the real complexity behind the process. When the user specifies a query or a task, this module passes it to the data mining system and displays the result in an easily understandable manner.

1.2.6.
Knowledge Base: The knowledge base supports the whole data mining process. It can be useful for guiding the search or for evaluating the interestingness of the resulting patterns. The knowledge base may even contain user beliefs and data from user experience that can be useful in the mining process. The data mining engine may take inputs from the knowledge base to make its results more accurate and reliable.

1.3. Data Mining Processes

Figure 1.3.1: Phases of the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.

Many organizations in various industries - manufacturing, marketing, chemicals, aerospace, and so on - take advantage of data mining to increase their business efficiency. The need for a standard data mining process therefore grew accordingly. A data mining process must be reliable, and it must be repeatable by business people with little or no background in data mining. As a result, the Cross Industry Standard Process for Data Mining (CRISP-DM) was first announced in 1996, after many workshops and contributions from over 300 organizations. CRISP-DM is an iterative process that typically involves the following phases.

1.3.1. Problem definition

A data mining project starts with an understanding of the business problem. Data mining experts, business experts, and domain experts work closely together to define the project objectives and the requirements from a business perspective. The project objective is then translated into a data mining problem definition. In the problem definition phase, data mining tools are not yet required.

1.3.2. Data exploration

Domain experts understand the meaning of the metadata. They collect, describe, and explore the data. They also identify quality problems in the data.
A frequent exchange with the data mining experts and the business experts from the problem definition phase is vital. In the data exploration phase, traditional data analysis tools such as statistics are used to explore the data.

1.3.3. Data preparation

Domain experts build the data model for the modeling process. They collect, cleanse, and format the data, because some of the mining functions accept data only in a certain format. They also create new derived attributes, for example an average value. In the data preparation phase, the data is reworked multiple times in no prescribed order. Preparing the data for the modeling tool by selecting tables, records, and attributes is a typical task in this phase. The meaning of the data is not changed.

1.3.4. Modeling

Data mining experts select and apply various mining functions, because different mining functions can be used for the same type of data mining problem. Some of the mining functions require specific data types. The data mining experts must assess each model. In the modeling phase, a frequent exchange with the domain experts from the data preparation phase is required. The modeling phase and the evaluation phase are coupled; they can be repeated several times, changing parameters, until optimal values are achieved. When the final modeling phase is completed, a model of high quality has been built.

1.3.5. Evaluation

Data mining experts evaluate the model. If the model does not meet their expectations, they go back to the modeling phase and rebuild it, changing its parameters until optimal values are achieved. When they are finally satisfied with the model, they can extract business explanations and evaluate the following questions: Does the model achieve the business objective? Have all business issues been considered? At the end of the evaluation phase, the data mining experts decide how to use the data mining results.

1.3.6.
Deployment

Data mining experts use the mining results by exporting them into database tables or into other applications, for example spreadsheets. Products such as IBM's Intelligent Miner were designed to support this process; their functions can be applied independently, iteratively, or in combination.

2. History of Data Mining

To understand the history and evolution of data mining, it is important first to identify its foundations and milestones and how they evolved. None of these processes has a long background, apart from the underlying theories of several scientific fields such as statistics, machine learning, and artificial intelligence. Foundations and milestones are expounded in detail in the sections below; here we look at the relationship between statistics, machine learning, artificial intelligence, and data mining, and how that relationship evolved.

Data mining's roots are traced back along three family lines: classical statistics, artificial intelligence, and machine learning. Statistics are the foundation of most technologies on which data mining is built, e.g. regression analysis, standard distributions, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to study data and data relationships. Artificial intelligence (AI), which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Certain AI concepts were adopted by some high-end commercial products, such as query optimization modules for relational database management systems (RDBMS). Machine learning is the union of statistics and AI. It could be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis.
Machine learning attempts to let computer programs learn about the data they study, so that programs make different decisions based on the qualities of the studied data, using statistics for the fundamental concepts and adding more advanced AI heuristics and algorithms to achieve their goals. Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. Data mining is best described as the union of historical and recent developments in statistics, AI, and machine learning, used together to study data and find previously hidden trends or patterns.

2.1. Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation, towards prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• massive data collection
• powerful multiprocessor computers
• data mining algorithms

2.2. Evolution in data mining for business

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining.
From the user's point of view, the four steps listed in Table 2.2.1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

Evolutionary step: Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery

Evolutionary step: Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (Emerging Today)
  Business question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: prospective, proactive information delivery

Table 2.2.1: Steps in the evolution of data mining. [1]

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes them practical for current data warehouse environments.

2.3. Milestones of Data Mining

Figure 2.3.1: Milestones of data mining and related fields.

The following are major milestones and "firsts" in the history of data mining, and how it has evolved and blended with data science and big data.
1763 - Thomas Bayes' paper on a theorem relating current probability to prior probability, now called Bayes' theorem, is published posthumously. It is fundamental to data mining and probability, since it allows us to reason about complex realities on the basis of estimated probabilities.

1805 - Adrien-Marie Legendre and Carl Friedrich Gauss apply regression to determine the orbits of bodies about the Sun (comets and planets). The goal of regression analysis is to estimate the relationships among variables, and the specific method they used was the method of least squares. Regression remains one of the key tools in data mining.

1936 - The dawn of the computer age, which makes the collection and processing of large amounts of data possible. In his 1936 paper "On Computable Numbers", Alan Turing introduced the idea of a universal machine capable of performing computations like our modern-day computers. The modern computer is built on the concepts pioneered by Turing.

1943 - Warren McCulloch and Walter Pitts are the first to create a conceptual model of a neural network. In a paper entitled "A Logical Calculus of the Ideas Immanent in Nervous Activity", they describe the idea of a neuron in a network. Each of these neurons can do three things: receive inputs, process inputs, and generate output.

1965 - Lawrence J. Fogel forms a new company, Decision Science, Inc., for applications of evolutionary programming. It was the first company specifically applying evolutionary computation to solve real-world problems.

1970s - With sophisticated database management systems, it becomes possible to store and query terabytes and petabytes of data. In addition, data warehouses allow users to move from a transaction-oriented way of thinking to a more analytical way of viewing the data. However, the ability to extract sophisticated insights from these multidimensional data warehouses is still very limited.
1975 - John Henry Holland writes Adaptation in Natural and Artificial Systems, the groundbreaking book on genetic algorithms. It is the book that initiated this field of study, presenting the theoretical foundations and exploring applications.

1980s - HNC trademarks the phrase "database mining". The trademark was meant to protect a product called the DataBase Mining Workstation, a general-purpose tool for building neural network models which is no longer available. It is also during this period that sophisticated algorithms begin to "learn" relationships from data, allowing subject matter experts to reason about what those relationships mean.

1989 - The term "Knowledge Discovery in Databases" (KDD) is coined by Gregory Piatetsky-Shapiro, who also co-founds the first workshop of the same name at this time.

1990s - The term "data mining" appears in the database community. Retail companies and the financial industry use data mining to analyze data and recognize trends, to grow their customer base and to predict fluctuations in interest rates, stock prices, and customer demand.

1992 - Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik suggest an improvement on the original support vector machine that allows for the creation of nonlinear classifiers. Support vector machines are a supervised learning approach that analyzes data and recognizes patterns, used for classification and regression analysis.

1993 - Gregory Piatetsky-Shapiro starts the newsletter Knowledge Discovery Nuggets (KDnuggets). It was originally meant to connect researchers who attended the KDD workshop, but KDnuggets.com now has a much wider audience.

2001 - Although the term "data science" has existed since the 1960s, it is not until 2001 that William S. Cleveland introduces it as an independent discipline. According to Building Data Science Teams, DJ Patil and Jeff Hammerbacher then used the term to describe their roles at LinkedIn and Facebook.
2015 - In February 2015, DJ Patil becomes the first Chief Data Scientist at the White House. Today, data mining is widespread in business, science, engineering, and medicine, to name just a few fields. Mining of credit card transactions, stock market movements, national security data, genome sequences, and clinical trials is just the tip of the iceberg for data mining applications.

Present (2016) - One of the most active techniques being explored today is "deep learning". Capable of capturing dependencies and complex patterns far beyond other techniques, it is reigniting some of the biggest challenges in the world of data mining, data science, and artificial intelligence. [2]

3. Scope of Data Mining

In this section, the scope of data mining is examined in terms of the relationship between transactional and analytical systems, the available levels of analysis, and the tasks of data mining. The use of data mining in academia and in business is then explained.

While large-scale information technology has been evolving separate transactional and analytical systems, data mining provides the link between the two, and mining software has been developed continuously. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations.
The beer-diaper example is a classic example of association mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

3.1. Usage of Data Mining Techniques

From an academic standpoint, this topic is separated into two parts: the techniques used for mining, and significant studies that apply them in different areas. Several major data mining techniques have been developed and used by researchers in recent data mining studies, including association, classification, clustering, prediction, sequential patterns, and decision trees. We briefly examine these techniques in the following sections.

3.1.1. Association

Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction; that is why the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify sets of products that customers frequently purchase together. Retailers use the association technique to research customers' buying habits. Based on historical sales data, retailers might find that customers often buy crisps when they buy beer, and can therefore place beer and crisps next to each other to save customers time and increase sales.

3.1.2. Classification

Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we develop software that can learn how to classify data items into groups.
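To make the idea of learned classification concrete, here is a minimal sketch in Python: a hand-rolled "one-rule" learner that picks the single threshold best separating two labeled groups. The function names and data are invented for illustration; real systems would use a full decision tree, neural network, or similar model:

```python
def learn_threshold(records):
    """Learn a one-feature threshold rule from labeled records.
    records: list of (value, label) pairs with exactly two distinct labels.
    Returns (threshold, label_below, label_above) minimizing training errors."""
    labels = sorted({lab for _, lab in records})
    best = None
    for t in sorted({v for v, _ in records}):
        for below, above in [(labels[0], labels[1]), (labels[1], labels[0])]:
            errors = sum(
                1 for v, lab in records
                if lab != (below if v <= t else above)
            )
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    _, t, below, above = best
    return t, below, above

def classify(rule, value):
    t, below, above = rule
    return below if value <= t else above

# Hypothetical training data: (years at company, whether the employee left)
train = [(1, "leave"), (2, "leave"), (3, "leave"),
         (6, "stay"), (8, "stay"), (10, "stay")]
rule = learn_threshold(train)
print(classify(rule, 2))   # → leave (short tenure)
print(classify(rule, 9))   # → stay (long tenure)
```

The "learning" here is simply searching for the decision boundary that best fits the labeled examples; more powerful classifiers generalize the same idea to many features and many boundaries.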
For example, we can apply classification to the task: "given all records of employees who left the company, predict who will probably leave the company in a future period." In this case, we divide the records of employees into two groups named "leave" and "stay", and then ask our data mining software to classify new employee records into these groups.

3.1.3. Clustering

Clustering is a data mining technique that automatically groups objects with similar characteristics into meaningful or useful clusters. The clustering technique defines the classes itself and places objects into them, whereas the classification technique assigns objects into predefined classes. To make the concept clearer, we can take book management in a library as an example. A library holds a wide range of books on various topics, and the challenge is to shelve them so that readers can find several books on a particular topic without hassle. Using the clustering technique, we can keep books that share some kind of similarity in one cluster, that is, on one shelf, and label it with a meaningful name. Readers who want books on that topic then only have to go to that shelf instead of searching the entire library.

3.1.4. Prediction

Prediction, as its name implies, is a data mining technique that discovers the relationship between dependent and independent variables. For instance, prediction analysis can be used in sales to forecast future profit: if we consider sales an independent variable, profit can be a dependent variable. Then, based on historical sales and profit data, we can fit a regression curve and use it for profit prediction.

3.1.5. Sequential Patterns

Sequential pattern analysis is a data mining technique that seeks to discover similar patterns, regular events, or trends in transaction data over a business period. In sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times of the year. Businesses can then use this information to recommend related items, with better deals, based on customers' past purchasing frequency.

3.1.6. Decision Trees

A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the tree is a simple question or condition that has multiple answers. Each answer then leads to a further set of questions or conditions that help us narrow down the data so that we can make a final decision based on it.

3.2. Data Mining in Academia

As data mining algorithms have been developed and used in research, studies have diversified and been applied in many different areas.

3.2.1. Science and Engineering

In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education, and electrical power engineering. In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. One data mining method used to perform this task is known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on, for example, the status of the insulation (or other important safety-related parameters). Data clustering techniques, such as the self-organizing map (SOM), have been applied to vibration monitoring and analysis of transformer on-load tap changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals; however, there is considerable variability amongst normal-condition signals for exactly the same tap position. The SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.

Data mining methods have also been applied to dissolved gas analysis (DGA) in power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Methods such as the SOM have been applied to analyze the generated data and to determine trends that are not obvious to the standard DGA ratio methods (such as the Duval Triangle).

In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors that reduce their learning, and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.
Other examples include the mining of biomedical data facilitated by domain ontologies, the mining of clinical trial data, and traffic analysis using SOMs. In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions with medical diagnoses. Data mining has also been applied to software artifacts within the realm of software engineering: mining software repositories.

3.2.2. Medical Data Mining

Some machine learning algorithms can be applied in the medical field as second-opinion diagnostic tools and as tools for the knowledge extraction phase in the process of knowledge discovery in databases. One of these classifiers (called the Prototype Exemplar Learning Classifier, PEL-C) is able to discover syndromes as well as atypical clinical cases. In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States ruled that pharmacies may share information with outside companies, a practice held to be protected by the First Amendment's guarantee of freedom of speech. Meanwhile, the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and supporting technology in the United States. The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining.
Prior to the signing of this law, it was estimated that only 20% of United States-based physicians were using electronic patient records. Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities." Hence, electronic patient records further expand the possibilities for medical data mining, opening the door to a vast source of medical data for analysis.

3.2.3. Spatial Data Mining

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven, inductive approaches to geographical analysis and modeling.

3.2.4. Pattern Mining

"Pattern mining" is a data mining method that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers who bought beer also bought potato chips.
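The 80% figure in the beer-and-chips rule is the rule's confidence: the fraction of transactions containing the antecedent that also contain the consequent. A small sketch with made-up transactions illustrates the computation (the function name and data are hypothetical, not from any specific library):

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)          # transactions with antecedent
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # ... and consequent too
    support = n_ac / len(transactions)
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence

# Hypothetical transactions: 4 of the 5 beer buyers also bought chips
transactions = [
    {"beer", "chips"}, {"beer", "chips"}, {"beer", "chips"},
    {"beer", "chips", "bread"}, {"beer", "milk"},
    {"bread", "milk"}, {"chips", "soda"}, {"bread"},
]
support, confidence = rule_stats(transactions, {"beer"}, {"chips"})
print(confidence)  # → 0.8, i.e. the "beer => chips (80%)" rule
```

Association mining algorithms such as Apriori essentially search for all rules whose support and confidence exceed user-chosen thresholds.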
In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise." Pattern mining also includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported into classical knowledge discovery search methods.

3.2.5. Human Rights

Data mining of government records – particularly records of the justice system (i.e., courts and prisons) – enables the discovery of systemic human rights violations connected to the generation and publication of invalid or fraudulent legal records by various government agencies.

3.2.6. Sensor Data Mining

Wireless sensor networks can be used to facilitate the collection of data for spatial data mining in a variety of applications, such as air pollution monitoring. A characteristic of such networks is that nearby sensor nodes monitoring the same environmental feature typically register similar values. This kind of data redundancy, due to the spatial correlation between sensor observations, inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed to build more efficient spatial data mining algorithms.

3.3. Data Mining in Business

In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data and assist in discovering previously unknown, strategically useful business information.
Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers more accurately. In today's world, raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a centralized database but would be useless without some type of data mining software to analyze it. By analyzing its point-of-sale data with data mining techniques, Walmart can determine sales trends, develop marketing campaigns, and more accurately predict customer loyalty.

Categorization of the items available on an e-commerce site is a fundamental problem. A correct item categorization system is essential for the user experience, as it helps determine the items relevant to a user's search and browsing. Item categorization can be formulated as a supervised classification problem in data mining, where the categories are the target classes and the features are the words composing textual descriptions of the items. One approach is first to find groups of similar items and place them together in a latent group. Given a new item, it is first classified into a latent group (coarse-level classification) and then, in a second round of classification, into the specific category to which it belongs.

Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is collected about the user's behavior. Many people find the amount of information that companies such as Google, Facebook, and Amazon store about us disturbing, and are concerned about privacy.
Although there is the potential for our personal data to be used in harmful or unwanted ways, it is also being used to make our lives better. For example, Ford and Audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions.

Data mining in customer relationship management (CRM) applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or by mail, a company can concentrate its efforts on the prospects predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns, so that one may predict the channel and the offer to which an individual is most likely to respond (across all potential offers). Additionally, sophisticated applications can be used to automate mailing: once the results from data mining (potential prospect/customer and channel/offer) are determined, such an application can automatically send an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, "uplift modeling" can be used to determine which people show the greatest increase in response if given an offer. Uplift modeling thereby enables marketers to focus mailings and offers on persuadable people, and not to send offers to people who would buy the product anyway. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. For example, rather than using one model to predict how many customers will churn, a business may choose to build a separate model for each region and customer type.
In situations where a large number of models need to be maintained, some businesses turn to more automated data mining methodologies.

Data mining can be helpful to human resources (HR) departments in identifying the characteristics of their most successful employees. Information obtained – such as the universities attended by highly successful employees – can help HR focus recruiting efforts accordingly. Additionally, strategic enterprise management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

Market basket analysis relates to the use of data mining in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although such relationships may be difficult to explain, taking advantage of them is easier. This example deals with association rules within transaction-based data; not all data are transaction-based, however, and logical or inexact rules may also be present within a database. Market basket analysis has also been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich database of transaction history for millions of customers dating back a number of years. Data mining tools can identify patterns among customers and help identify the customers most likely to respond to upcoming mailing campaigns. Data mining for business applications can be integrated into a complex modeling and decision-making process.
LIONsolver uses Reactive business intelligence (RBI) to advocate a "holistic" approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning. In the area of decision making, the RBI approach has been used to mine knowledge progressively acquired from the decision maker, and then to self-tune the decision method accordingly. The relation between the quality of a data mining system and the amount of investment the decision maker is willing to make has been formalized by providing an economic perspective on the value of "extracted knowledge" in terms of its payoff to the organization. This decision-theoretic classification framework was applied to a real-world semiconductor wafer manufacturing line, where decision rules for effectively monitoring and controlling the line were developed. [3]

4. Future of Data Mining

Over recent years, data mining has been establishing itself as one of the major disciplines in computer science, with growing industrial impact. Undoubtedly, research in data mining will continue and even increase over the coming decades. In this section we examine future trends and applications of data mining.

4.1. Distributed/Collective Data Mining (DDM)

One area of data mining attracting a good amount of attention is distributed and collective data mining. Much of the data mining currently being done focuses on a database or data warehouse physically located in one place. However, situations arise where information is spread across different physical locations. Mining such data is known generally as distributed data mining (DDM), and the goal is to effectively mine distributed data located at heterogeneous sites.
Examples of this include biological information located in different databases, data that comes from the databases of two different firms, or the analysis of data from different branches of a corporation, the combining of which would be an expensive and time-consuming process. Distributed data mining offers a different approach from traditional analysis, combining localized data analysis with a global data model. In more specific terms, it consists of performing local data analysis to generate partial data models, and combining the local data models from different data sites to develop the global model. This global model combines the results of the separate analyses. The global model produced can become incorrect or ambiguous, especially if the data in different locations has different features or characteristics. This problem is especially critical when the data at distributed sites is heterogeneous rather than homogeneous.

4.2. Ubiquitous Data Mining (UDM)

The advent of laptops, palmtops, cell phones, and wearable computers is making ubiquitous access to large quantities of data possible. Advanced analysis of data for extracting useful knowledge is the next natural step in the world of ubiquitous computing. Accessing and analyzing data from a ubiquitous computing device offers many challenges. For example, UDM introduces additional costs due to communication, computation, security, and other factors, so one of the objectives of UDM is to mine data while minimizing the cost of ubiquitous presence.

4.3. Hypertext and Hypermedia Data Mining

Hypertext and hypermedia data mining can be characterized as the mining of data that includes text, hyperlinks, text mark-up, and various other forms of hypermedia information. As such, it is closely related to both web mining and multimedia mining, and the fields are quite close in terms of content and applications.
While the World Wide Web is substantially composed of hypertext and hypermedia elements, there are other kinds of hypertext/hypermedia data sources not found on the web. Examples include the information found in online catalogues, digital libraries, online information databases, and the like. Some of the important data mining techniques used for hypertext and hypermedia data mining include classification (supervised learning), clustering (unsupervised learning), semi-supervised learning, and social network analysis.

In the case of classification, or supervised learning, the process starts by reviewing training data in which items are marked as being part of a certain class or group. This data is the basis from which the algorithm is trained. One application of classification is in web topic directories, which can group similar-sounding or similarly spelled terms into appropriate categories, so that searches will not bring up inappropriate sites and pages.

Semi-supervised learning and social network analysis are other methods important to hypermedia-based data mining. Semi-supervised learning covers the case where there are both labelled and unlabelled documents, and there is a need to learn from both types. Social network analysis is applicable because the web is considered a social network: it examines networks formed through collaborative association, whether between friends, academics doing research or serving on committees, or between papers through references and citations.

4.4. Multimedia Data Mining

Multimedia data mining is the mining and analysis of various types of data, including images, video, audio, and animation. As multimedia data mining incorporates the areas of text mining as well as hypertext/hypermedia mining, these fields are closely related, and much of the information describing those areas also applies to multimedia data mining.
This field is rather new but holds much promise for the future. Multimedia information, because of its nature as a large collection of multimedia objects, must be represented differently from conventional forms of data. One approach is to create a multimedia data cube, which can be used to convert multimedia-type data into a form suited to analysis using one of the main data mining techniques, while taking into account the unique characteristics of the data.

4.5. Time Series/Sequence Data Mining

Another important area of data mining centres on the mining of time series and sequence-based data. Simply put, this involves mining a sequence of data that is either referenced by time (time series, such as stock market and production process data) or simply ordered in a meaningful sequence. In general, one aspect of mining time series data focuses on identifying the movements or components that exist within the data (trend analysis). These can include long-term or trend movements, seasonal variations, cyclical variations, and random movements. Sequential pattern mining focuses on identifying sequences that occur frequently in a time series or sequence of data. This is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as the likely follow-up purchase after buying a certain electronics item or computer.

Conclusion

I started this report with the aim of covering the various aspects of data mining as a whole. The information given here was composed from different sources with the goal of a complete survey, and clearly some of the portions written here are not entirely original. The hypotheses and methods are explained carefully, and the trends and future directions are kept as current as possible. The statistical and theoretical data have been checked carefully and verified with the help of various sources.
In addition, we can all see the importance of data mining in an increasingly globalized world. There are many techniques, studies, and software products that make life easier and increase companies' market value. In particular, enterprise resource planning and customer relationship management software, which build on data mining, have lately been taking an ever larger share of companies' budgets. Since data mining also underlies business intelligence, we will see many more studies related to it in the future. I hope this research report is beneficial both for its readers and for anyone who is curious about data mining.

Glossary

cluster analysis (clustering): the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

anomaly detection (also outlier detection): the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset.

association rule mining: a method for discovering interesting relations between variables in large databases.

predictive analytics: encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.

classification: the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
data warehouse: in computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.

time series analysis: comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.

threshold value: a cutoff value against which a score or measurement is compared in order to assign it to one category or another.

LIONsolver: an integrated software product for data mining, business intelligence, analytics, and modeling (Learning and Intelligent OptimizatioN).

Reactive business intelligence (RBI): advocates a holistic approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.

VLSI test: very large scale integration test.

IC: integrated circuit.

References

[1] www.thearling.com
[2] www.rayli.net
[3] www.wikipedia.org
[4] http://www.ibm.com/support/knowledgecenter/
[5] https://www.linkedin.com/pulse/what-does-future-hold-data-mining-thiensi-le
[6] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[7] https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/materials.shtml#dataware
[8] http://www.cs.bu.edu/~gkollios/dm07/lectnotes.html
[9] http://searchsqlserver.techtarget.com/definition/data-mining
[10] Introduction to Data Mining, Pang-Ning Tan (Michigan State University), Michael Steinbach and Vipin Kumar (University of Minnesota), March 25, 2006.
[11] Introduction to Data Mining, Dr. Sanjay Ranka, Professor of Computer and Information Science and Engineering, University of Florida, Gainesville.