Chapter 4
CIS 512 DATA MINING
INTRODUCTION TO DATA MINING

OBJECTIVES
• Learn why data mining is in high demand and how it is part of the natural evolution of information technology.
• Define data mining with respect to the knowledge discovery process.
• Learn about data mining from many aspects, such as:
  • kinds of data that can be mined
  • kinds of knowledge to be mined
  • kinds of technologies to be used
  • targeted applications

Information Systems Department 13-Mar-17

Why Data Mining? Moving toward the Information Age

Data versus Information
• The explosive growth of data: from terabytes to petabytes. Society produces huge amounts of data.
• Sources:
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, YouTube
  • Medicine, economics, geography, environment, sports, …
• This data is a potentially valuable resource.
• Raw data is useless on its own: techniques are needed to extract information from it automatically.
• Data: recorded facts.
• Information: patterns underlying the data (processed data).

Example 1.1 Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries every day. What novel and useful knowledge can a search engine learn from such a huge collection of queries collected from users over time?
• Some patterns found in these queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone.
• For example, Google’s Flu Trends found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms.
• This example shows how data mining can turn a large collection of data into knowledge that can help meet a current global challenge.

What Is Data Mining?
• Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
• Needed: programs that can automatically detect patterns and regularities in the data.
• Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.

What Is Data Mining? Knowledge Discovery (KDD) Process
• Data mining is the central part of a bigger process called knowledge discovery.
• Data mining plays an essential role in the knowledge discovery process.
[Figure: the KDD process. Stage labels, in order: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

DATA AND DATA MINING

Attributes
• Each object is described by a number of variables that correspond to its properties. These variables are called attributes.
• Data is described by a fixed, predefined set of features, called “attributes”.

Instances and datasets
• The set of variable values corresponding to each of the objects is called a record or an instance.
• The complete set of available data is called a dataset.
• A dataset is often depicted as a table, with each row representing an instance. Each column contains the value of one of the variables (attributes) for each of the instances.

Labelled and Unlabelled Data
• A dataset in which one attribute is given special significance is an example of labeled data. The standard name of this special attribute is “class”.
• When there is no such specially significant attribute, we call the data unlabeled.
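
The terms above (attribute, instance, dataset, class) can be made concrete with a small sketch. The attribute names and values below are hypothetical and chosen only for illustration; they do not come from the lecture, and the helper functions `strip_class` and `is_labeled` are likewise assumptions introduced for this example.

```python
# Hypothetical toy dataset: attribute names and values are illustrative only.
attributes = ["outlook", "temperature", "humidity"]
class_attribute = "play"  # the specially designated "class" attribute

# The dataset is a table: each dict is one instance (record, row),
# and each key is one attribute (column).
labeled_dataset = [
    {"outlook": "sunny", "temperature": 85, "humidity": 85, "play": "no"},
    {"outlook": "overcast", "temperature": 83, "humidity": 86, "play": "yes"},
    {"outlook": "rainy", "temperature": 70, "humidity": 96, "play": "yes"},
]


def strip_class(dataset, class_name):
    """Return an unlabeled copy of the dataset with the class attribute removed."""
    return [{k: v for k, v in row.items() if k != class_name} for row in dataset]


def is_labeled(dataset, class_name):
    """Treat a non-empty dataset as labeled if every instance has a class value."""
    return bool(dataset) and all(class_name in row for row in dataset)


unlabeled_dataset = strip_class(labeled_dataset, class_attribute)

print(is_labeled(labeled_dataset, class_attribute))    # labeled data
print(is_labeled(unlabeled_dataset, class_attribute))  # unlabeled data
```

Removing the class column is the only difference between the two tables here; the remaining attributes and instances are unchanged.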

Supervised and Unsupervised Learning
• For labeled data there is a specially designated attribute. The aim is to use the given data to predict the value of that attribute for instances that have not yet been seen.
• Data mining using labeled data is called supervised learning.
• Data that does not have any specially designated attribute is called unlabeled.
• Data mining of unlabeled data is called unsupervised learning.

What Kind of Data Can Be Mined? 1. Database Data
• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name.
• Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
• Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

Example 1.2 A relational database for AllElectronics.

What Kind of Data Can Be Mined? 2. Data Warehouses
• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

What Kind of Data Can Be Mined? 3. Transactional Data
• In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.

Example 1.4 A transactional database for AllElectronics.
• As an analyst of AllElectronics, you may ask, “Which items sold well together?” This kind of market basket data analysis would enable you to bundle groups of items together as a strategy for boosting sales.
• For example, given the knowledge that printers are commonly purchased together with computers, you could offer certain printers at a steep discount (or even for free) to customers buying selected computers, in the hopes of selling more computers (which are often more expensive than printers).
• A traditional database system is not able to perform market basket data analysis. Fortunately, data mining on transactional data can do so by mining frequent itemsets, that is, sets of items that are frequently sold together.

What Kinds of Patterns Can Be Mined?
Data mining functionalities
• Mining of frequent patterns, associations, correlations
• Classification and regression
• Clustering analysis
• Outlier analysis

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make predictions.
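
As a small illustration of a descriptive task, the frequent-itemset idea from Example 1.4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the transactions, the function name `frequent_itemsets`, and the `min_support` threshold are hypothetical and not taken from the lecture, and the code counts itemsets by brute force rather than using an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: trans ID -> items purchased (illustrative data only).
transactions = {
    "T100": ["computer", "printer", "mouse"],
    "T200": ["computer", "printer"],
    "T300": ["printer", "paper"],
    "T400": ["computer", "printer", "paper"],
}


def frequent_itemsets(transactions, size, min_support):
    """Count every itemset of the given size and keep those that occur in
    at least min_support transactions (brute-force counting)."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) drops duplicate items and gives each itemset
        # a single canonical ordering, so it is counted under one key.
        for itemset in combinations(sorted(set(items)), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Pairs of items bought together in at least 3 of the 4 transactions.
frequent_pairs = frequent_itemsets(transactions, size=2, min_support=3)
print(frequent_pairs)  # {('computer', 'printer'): 3}
```

With this toy data, the only pair that reaches the threshold is computer and printer, which is the kind of result that would motivate bundling those two items. Brute-force counting is adequate for a handful of transactions but grows quickly with itemset size, which is why dedicated frequent-pattern algorithms exist.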