Chapter 4
CIS 512
DATA MINING
INTRODUCTION TO DATA MINING
OBJECTIVES
Learn why data mining is in high demand and how it is
part of the natural evolution of information technology.
Define data mining with respect to the knowledge
discovery process.
Learn about data mining from many aspects, such as:
kinds of data that can be mined
kinds of knowledge to be mined
kinds of technologies to be used
targeted applications
Information Systems Department
13-Mar-17
Why Data Mining?
Moving toward the Information Age
Data versus Information
The Explosive Growth of Data: from terabytes to petabytes
Society produces huge amounts of data. Sources:
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• Medicine, economics, geography, environment, sports, …
Potentially valuable resource
Raw data is useless on its own: techniques are needed to
automatically extract information from it
Data: recorded facts
Information: patterns underlying the data (processed data)
Example 1.1 Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries
every day.
What novel and useful knowledge can a search engine learn from such a
huge collection of queries collected from users over time?
• Some patterns found in these queries can disclose invaluable knowledge that cannot be
obtained by reading individual data items alone.
For example, Google’s Flu Trends found a close relationship between the
number of people who search for flu-related information and the number
of people who actually have flu symptoms.
This example shows how data mining can turn a large collection of data
into knowledge that can help meet a current global challenge.
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
Needed: programs that can automatically detect
patterns and regularities in the data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, information
harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
• Data mining is the central part of a
bigger process called Knowledge
Discovery
• Data mining plays an essential role in
the knowledge discovery process
[Figure: the knowledge discovery process, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
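The stages of the knowledge discovery process can be sketched as a chain of small functions. This is only an illustrative toy, not a real KDD system: the records, the function names, and the "average price" pattern are all assumptions made up for the example.

```python
# Hypothetical raw records from two source "databases"; None marks a missing value.
store_a = [{"customer": "c1", "item": "printer", "price": 120.0},
           {"customer": "c2", "item": "laptop", "price": None}]
store_b = [{"customer": "c3", "item": "laptop", "price": 900.0},
           {"customer": "c1", "item": "laptop", "price": 950.0}]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one warehouse-like list."""
    return [r for source in sources for r in source]

def select(records, item):
    """Selection: keep only the task-relevant records."""
    return [r for r in records if r["item"] == item]

def mine(records):
    """Data mining (toy version): summarize the records as a simple 'pattern'."""
    prices = [r["price"] for r in records]
    return {"count": len(prices), "avg_price": sum(prices) / len(prices)}

def evaluate(pattern, min_count=2):
    """Pattern evaluation: keep the pattern only if enough records support it."""
    return pattern if pattern["count"] >= min_count else None

warehouse = integrate(clean(store_a), clean(store_b))
task_data = select(warehouse, "laptop")
result = evaluate(mine(task_data))
print(result)
```

Each function stands for one box in the process; in practice every stage is far more involved, and data mining is only one step among them.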
DATA AND DATA MINING
Attributes
• Each object is described by a number of variables
that correspond to its properties. These variables are
called attributes.
• Data is described by a fixed, predefined set of
features, called “attributes”.
Instances and datasets
• The set of variable values corresponding to each of
the objects is called a record or an instance.
• The complete set of available data is called a
dataset. A dataset is often depicted as a table, with
each row representing an instance. Each column
contains the value of one of the variables (attributes)
for each of the instances.
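As a minimal sketch of these terms, a dataset can be held as a list of rows (instances) together with a list of column names (attributes). The attribute names and values below are invented for illustration.

```python
# Hypothetical attributes (columns) and instances (rows); all values are made up.
attributes = ["age", "income", "owns_car"]
dataset = [
    [25, 30000, False],
    [47, 82000, True],
    [33, 51000, True],
]

def column(dataset, attributes, name):
    """Return the values of one attribute across all instances."""
    idx = attributes.index(name)
    return [instance[idx] for instance in dataset]

# One instance, viewed as attribute -> value pairs.
first_instance = dict(zip(attributes, dataset[0]))
# One attribute, viewed across the whole dataset.
incomes = column(dataset, attributes, "income")

print(first_instance)
print(incomes)
```

Reading along a row gives one instance; reading down a column gives one attribute for every instance.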
Labeled and Unlabeled Data
This dataset is an example of labeled data, where one
attribute is given special significance.
The standard name of this special attribute is “class”.
When there is no such specially designated attribute, we call
the data unlabeled.
Supervised and Unsupervised Learning
For labeled data there is a specially designated
attribute. The aim is to use the given data to predict
the value of that attribute for instances that have not
yet been seen. Data mining using labeled data is
called supervised learning.
Data that does not have any specially designated
attribute is called unlabeled. Data mining of
unlabeled data is called unsupervised learning.
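The contrast can be illustrated with two tiny routines. Both are sketches chosen for brevity, not methods named in the slides: a 1-nearest-neighbor classifier stands in for supervised learning, and a two-group k-means on plain numbers stands in for unsupervised learning. All data values are made up, and the grouping function assumes the input contains at least two distinct values.

```python
# Labeled data: (height_cm, weight_kg) with a class label. All values are made up.
labeled = [((150, 50), "small"), ((155, 55), "small"),
           ((180, 85), "large"), ((185, 90), "large")]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(labeled, query):
    """Supervised: predict the class of an unseen instance from its nearest labeled neighbor."""
    _, label = min(labeled, key=lambda pair: squared_distance(pair[0], query))
    return label

def two_means_1d(values, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups (a tiny k-means with k=2)."""
    low, high = min(values), max(values)  # initial group centers
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

prediction = predict_1nn(labeled, (178, 80))
groups = two_means_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])

print(prediction)
print(groups)
```

The first routine needs the class attribute to make a prediction; the second finds structure without any designated attribute.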
What Kind of Data Can Be Mined?
1. Database Data
A database system, also called a database management
system (DBMS), consists of a collection of interrelated data,
known as a database, and a set of software programs to
manage and access the data.
A relational database is a collection of tables, each of which
is assigned a unique name. Each table consists of a set of
attributes (columns or fields) and usually stores a large set of
tuples (records or rows). Each tuple in a relational table
represents an object identified by a unique key and described
by a set of attribute values.
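A relational table of this kind can be sketched with Python's built-in sqlite3 module. The table name, column names, and rows below are hypothetical and are not taken from the AllElectronics example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table has a unique name and a set of attributes (columns).
# cust_id is the unique key that identifies each tuple (row).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)

# Each inserted tuple describes one object by its attribute values.
conn.executemany(
    "INSERT INTO customer (cust_id, name, age) VALUES (?, ?, ?)",
    [(1, "Smith", 34), (2, "Lee", 27), (3, "Garcia", 45)],
)

# Look up one tuple by its key, then query by an attribute value.
row = conn.execute("SELECT name, age FROM customer WHERE cust_id = ?", (2,)).fetchone()
older = conn.execute(
    "SELECT name FROM customer WHERE age > ? ORDER BY name", (30,)
).fetchall()
conn.close()

print(row)
print(older)
```

The database software handles storage and access; the queries only state which tuples are wanted.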
Example 1.2 A relational database for AllElectronics.
2. Data Warehouses
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
usually residing at a single site.
3. Transactional Data
In general, each record in a transactional database captures
a transaction, such as a customer’s purchase, a flight booking,
or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the
transaction, such as the items purchased in the transaction.
Example 1.4 A transactional database for AllElectronics.
As an analyst of AllElectronics, you may ask, “Which items sold
well together?” This kind of market basket data analysis would
enable you to bundle groups of items together as a strategy
for boosting sales.
For example, given the knowledge that printers are commonly purchased
together with computers, you could offer certain printers at a steep discount
(or even for free) to customers buying selected computers, in the hopes of
selling more computers (which are often more expensive than printers).
A traditional database system is not able to perform market basket data
analysis. Fortunately, data mining on transactional data can do so by mining
frequent itemsets, that is, sets of items that are frequently sold together.
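A minimal sketch of the idea, restricted to pairs of items: count how many transactions contain each pair and keep the pairs that reach a minimum support count. The transactions and the threshold are made up, and this brute-force count is for illustration only; it is not an efficient frequent-itemset algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: trans ID -> list of items purchased.
transactions = {
    "T100": ["computer", "printer", "paper"],
    "T200": ["computer", "printer"],
    "T300": ["computer", "mouse"],
    "T400": ["printer", "paper"],
}

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for items in transactions.values():
        # Sort and deduplicate so each pair is counted once per transaction
        # and always appears in the same order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

Here computers and printers appear together in two of the four transactions, which is the kind of pattern that would motivate the bundling strategy described above.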
What Kinds of Patterns Can Be Mined?
Data mining functionalities:
• Mining of frequent patterns, associations, correlations
• Classification and regression
• Clustering analysis
• Outlier analysis
Data mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks.
In general, such tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize properties of the data in a target data set.
Predictive mining tasks perform induction on the current data in order to make
predictions.
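The difference between the two categories can be sketched on one small data set. The numbers are invented, and the least-squares line is just one simple choice of predictive model, assumed here for illustration.

```python
from statistics import mean

# Hypothetical data: advertising spend and units sold for four months.
spend = [1.0, 2.0, 3.0, 4.0]
units = [12.0, 14.0, 16.0, 18.0]

def describe(values):
    """Descriptive task: characterize properties of the existing data."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

def fit_line(xs, ys):
    """Predictive task: induce a least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

summary = describe(units)
slope, intercept = fit_line(spend, units)
forecast = slope * 5.0 + intercept  # prediction for a spend value not yet seen

print(summary)
print(forecast)
```

The summary only describes the data already collected, while the fitted line is used to say something about a new case.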