
CISC 4631
Data Mining
Lecture 01:
Introduction to Data Mining
1
Let’s Start By Seeing What You Know
• Quick Quiz
– Do you know what Data Mining is?
– Do you know of any examples of Data Mining?
2
What is Data Mining?
• Data Mining has many definitions
– Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
– Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data in order to
discover meaningful patterns
3
Alternative Names
• Data Mining was/is known by these other names
(although many of these have lost favor over time):
– Knowledge discovery in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology, data dredging, information harvesting,
business intelligence, etc.
• Recently introduced new names (maybe with different
emphases):
– Data Science
– Big Data
4
Some Examples
• Netflix and Amazon use data mining to recommend
products (recommender systems)
• Companies use data mining for marketing
– Who should be mailed a catalog
– Who should see what online ads (Google Adwords)
• Fordham’s WISDM project uses smartphone
accelerometer data to classify user activities (walking,
jogging, sitting, etc.)
• Some search engines cluster retrieved documents into
meaningful groups
– Group pages about Jaguar into “car” pages and “cat” pages
5
Why Data Mining and Why Now?
• Data Mining was not very popular until about
10 – 15 years ago
Quick Quiz: What do
you think changed?
6
Why Mine Data?
• There are now tremendous amounts of data
that are automatically collected and
warehoused. What are some examples?
– Web data, e-commerce
– Store purchases
– Bank/Credit Card transactions
– Cell phone GPS information
– Smartphone and Smartwatch Sensor Data
7
Why Mine Data?
• What technological changes have helped make data
mining so prevalent now?
– Computers: cheaper and more powerful
• Smaller mobile devices are exploding in popularity
– Disk and other storage: greater capacity and cheaper
– Increased use of on-line resources and Internet
• We shouldn’t discount the advances in algorithms
but most data mining algorithms are relatively
mature
8
Why Mine Data?
• In business, competitive pressure is strong
– Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
– CRM is a relatively big deal now
• How do we get the most out of the customer over the
long run
• Example: Customer Churn Analysis
9
Why Mine Data?
• Often info “hidden” in data is not evident
• Analysts may take weeks to discover useful
information
• Much of the data is never analyzed at all
– There is just too much data to analyze without
“assistance”
10
Scientific Need
• Data collected at enormous speeds
– remote sensors on satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
• Traditional techniques infeasible
11
How Big is the Data?
• Examples of Large Data Sets
– AT&T’s 26TB call detail database (2003)
– eBay 6PB, IRS 150TB data warehouse
– Yahoo has a 2PB DB to analyze behavior of ½ billion
web visitors/month (24 billion events/day)
– Wal-Mart has a 583 TB database (2006)
– Indexed web contains about 20 Billion pages
– Sites like Facebook, Flickr & Twitter contain lots of
data
• Google is estimated (in 2011) to have 900,000
servers to handle its data!
12
How Much Data is Being Created?
• 5 Exabytes new data created (2002, UC Berkeley)
• Humans created/copied 161 Exabytes in 2006 and 281 Exabytes in 2007 (IDC)
– 1 Exabyte = 10^18 bytes
– 12 stacks of books stretching from Earth to Sun
– 3 million times the books ever written
– Not all data stored at once (includes temporary data)
• In 2012 2.8 ZB (2800 EB) of data will be created/copied
– Forecast for 2020: 40 ZB (57X the number of grains of sand on Earth)
OK, we get the point already! Head hurts.
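The unit conversions and the growth implied by the 2012 figure and the 2020 forecast can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units and steady compound growth over the eight years, neither of which the slide states.

```python
# Decimal (SI) storage units, in bytes (an assumption; the slide gives no convention).
EB = 10**18  # exabyte
ZB = 10**21  # zettabyte

created_2012 = 2.8 * ZB   # figure quoted on the slide
forecast_2020 = 40 * ZB   # forecast quoted on the slide

# 2.8 ZB expressed in exabytes.
created_2012_eb = created_2012 / EB

# Overall growth factor and the implied compound annual growth rate,
# assuming (hypothetically) smooth growth over the 8 years.
growth_factor = forecast_2020 / created_2012
annual_rate = growth_factor ** (1 / 8) - 1

print(f"2012: {created_2012_eb:.0f} EB")
print(f"2012 -> 2020 growth: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```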
13
Why Data Mining? Why Now?
According to BabyCenter.com,
today one in three children born in
the United States already have an
online presence (usually in the
form of a sonogram) before they
are born. That number grows to
92% by the time they are two. In
2012 the average digital birth of
children occurs at approximately six
months, with a third of all
children’s photos and information
posted online within weeks of their
birth. What will it mean to live in a
world where our every moment,
from birth to death, is digitally
chronicled and preserved in vast
cloud based databases, forever?
During the first day of a baby’s life, the amount of data generated by humanity
is equivalent to 70 times the information contained in the library of congress.
14
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems*
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality
– Heterogeneous & distributed data
* databases currently have limited impact;
data mining is rarely done in a database but
rather on “flat files”
[Figure: Data Mining at the intersection of Statistics, Machine Learning / Artificial Intelligence, Pattern Recognition, and Database systems]
15
Statistics vs. Data Mining
• Experience has shown that students with
statistics backgrounds are often confused by
data mining if the differences aren’t highlighted
• When compared to Data Mining:
• Statistics is more theory-based
– Data mining methods are often based on heuristic algorithms
– Statistics is based firmly on mathematics (e.g., probability)
• Statistics is more focused on testing hypotheses vs.
finding interesting relationships
• Statistics makes more assumptions about the data
16
The Process of Data Mining
Data Mining is a process, sometimes referred to as a knowledge
discovery process. In this process there is a data mining step that
applies data mining algorithms to extract knowledge. About 80% of
our class will focus on the data mining step, but in the real world 80%
of the time is spent on the other steps (e.g., prepping data).
17
Second Part of Introduction:
DATA MINING TASKS
18
Top-Level Data Mining Tasks
• At highest level, data mining tasks can be
divided into:
– Prediction Tasks (supervised learning)
• Use some variables to predict unknown or future
values of other variables
– Description Tasks (unsupervised learning)
• Find human-interpretable patterns that describe the
data
19
Key Data Mining Tasks
• Overview of the major data mining tasks
studied in this course:
– Prediction Tasks
• Classification
• Regression
– Description Tasks
• Clustering
• Association Rule Discovery
20
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the
class, which is to be predicted.
• Find a model for class attribute as a function of
the values of other attributes.
– Model maps record to a class value
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine accuracy of the model
• Can you think of classification tasks?
21
Classification Example
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model
Test Set --> Model
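To make the train-then-predict workflow concrete, here is a minimal Python sketch using the records from this example. The slides do not say which classifier is used; a 1-nearest-neighbor rule is swapped in here purely for illustration, and the distance function (one point per categorical mismatch plus the income gap scaled by $100K) is a made-up assumption.

```python
# Training set from the slide: (refund, marital status, income in $K, cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set from the slide: the class label is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def distance(a, b):
    """Hypothetical mixed-attribute distance: 1 per categorical mismatch
    plus the income gap scaled by $100K."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100


def predict(train, record):
    """1-nearest-neighbor: copy the class of the closest training record."""
    nearest = min(train, key=lambda t: distance(t, record))
    return nearest[3]


def accuracy(train, labeled):
    """Fraction of labeled records whose predicted class matches the true one."""
    hits = sum(predict(train, r) == r[3] for r in labeled)
    return hits / len(labeled)


predictions = [predict(TRAIN, r) for r in TEST]
for record, label in zip(TEST, predictions):
    print(record, "->", label)

# The slide's test set has no labels, so to estimate accuracy we hold out
# the last three training records and train on the first seven.
holdout_accuracy = accuracy(TRAIN[:7], TRAIN[7:])
print("hold-out accuracy:", holdout_accuracy)
```

With only ten records the accuracy estimate is very noisy; the point is the workflow of learning from labeled records and then scoring the model on records it has not seen.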
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the
class attribute
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this info as input attributes to learn a classifier model
23
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions
– Approach:
• Use credit card transactions and info on account-holders
as attributes
– When and what the customer buys, how often the customer pays on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
24
Classification: Application 3
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Found 16 new high red-shift quasars, some of
the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
25
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
26
Regression
• Predict a value of a given continuous (numerical)
variable based on the values of other variables
• Greatly studied in statistics
• Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
– Time series prediction of stock market indices
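As a small illustration of the first example, the sketch below fits a straight line to advertising spend versus sales using ordinary least squares. The slides do not name a regression method, so simple linear regression is an assumption here, and the numbers are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative data: advertising spend and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(ad_spend, sales)

# Predict a continuous value for a spend level not seen in the data.
predicted = intercept + slope * 6.0

print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
print(f"predicted sales at ad_spend=6: {predicted:.2f}")
```

Unlike classification, the output is a number rather than a class label, which is the distinction the slide draws.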
27
Clustering
• Given a set of data points find clusters so that
– Data points in same cluster are similar
– Data points in different clusters are dissimilar
You try it on the Simpsons. How can
we cluster these 5 “data points”?
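For numeric data, one common way to find such clusters is k-means. The slides do not name an algorithm at this point, so the sketch below is just one possible choice, run on made-up 2-D points with hand-picked starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignment, centroids


# Made-up points forming two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]

labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centers)
```

Points with the same label are close to each other and far from the other group, which is the similar-within, dissimilar-between goal stated above.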
28
What is a natural grouping among these objects?
29
What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
School Employees
Females
Males
30
What is Similarity?
“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)
Similarity is hard to define, but… “We know it when we see it”
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
31
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
similar customers
– Approach:
• Collect different attributes of customers based on
their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
32
Clustering: Application 2
• Document Clustering:
– Goal: Find groups of documents that are similar to
each other based on the words appearing in them
– Approach: Identify frequently occurring terms in
each document. Form a similarity measure based
on the frequencies of different terms. Use it to
cluster.
– Uses: Information Retrieval can utilize the clusters
to relate a new document or search term to
clustered documents.
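A similarity measure based on term frequencies can be sketched as follows. Cosine similarity is one widely used choice, though the slides do not commit to a specific measure, and the three tiny documents are made up to echo the Jaguar example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents: two about the car, one about the cat.
docs = {
    "car_1": "jaguar car engine speed car dealer",
    "car_2": "jaguar car price engine dealer",
    "cat_1": "jaguar cat jungle prey cat habitat",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_car_car = cosine_similarity(tf["car_1"], tf["car_2"])
sim_car_cat = cosine_similarity(tf["car_1"], tf["cat_1"])

print(f"car_1 vs car_2: {sim_car_car:.3f}")
print(f"car_1 vs cat_1: {sim_car_cat:.3f}")
```

The two car documents share several terms and score much higher than the car/cat pair, which shares only the word "jaguar". A clustering algorithm can then group documents using these pairwise scores.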
33
Association Rule Discovery
• Given a set of records, each of which contains
some number of items from a given collection
– Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
[Images: diapers and beer]
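How strongly the data backs each discovered rule can be quantified with support and confidence. These are the standard measures for association rules, although the slides do not define them here, so treat the sketch below as supplementary. It uses the five transactions from the table.

```python
# The five transactions from the table, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]

for antecedent, consequent in rules:
    s = support(antecedent | consequent)
    c = confidence(antecedent, consequent)
    print(f"{sorted(antecedent)} --> {sorted(consequent)}: "
          f"support={s:.2f}, confidence={c:.2f}")
```

A rule-discovery algorithm would search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores the two rules shown on the slide.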
34
Association Rule Discovery
Application
• Marketing and Sales Promotion Applications
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what
should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products
would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be
used to see what products should be sold with Bagels to
promote sale of Potato chips!
• Can help determine where to position store items
– Supermarket shelf management
– Did you ever notice that some stores have bananas in the
cereal aisle?
35
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
36
What is (and is not) Data Mining?
• Based on the definitions of data mining, are these
DM or not?
– Finding a phone number in a directory
• Not data mining (trivial?, DB query)
– Grouping related documents returned by search engine
• Is data mining (not trivial, clustering)
– Identifying who has a disease based on symptoms
• Is data mining (not trivial, classification)
– Web search on keyword using search engine
• May be data mining**
** More of an information retrieval task than data mining task. However,
since Google does much more than keyword matching, there will be a
data mining component. For example, Google mines the link structure
of the Web to decide which pages are important (link mining is a type
of data mining).
37
If you are Interested in Data
Mining
• Data sets
– NYC open data (https://nycopendata.socrata.com/)
– UCI Data Repository (http://archive.ics.uci.edu/ml/)
• Visit kdnuggets, an online newsletter and more
– http://www.kdnuggets.com
– You can arrange to have newsletter emailed to you
– Also includes job openings
• ACM SIGKDD is the professional organization associated with
data mining
– ACM Special Interest Group (SIG) on data mining
– Can join SIGKDD for $22, or for $54 can also join ACM as a student
member
38