Includes Review of Syllabus
OVERVIEW OF THE CLASS
What is this class about?
This class will introduce data mining
The types of problems that can be addressed
The methods that can be used
Focus will be balanced between learning how to use the
methods and understanding how they work
A significant class project is required
This class is a key component of the new MS in
Data Analytics (MSDA)
The spring course “Machine Learning” will cover
some methods not covered in this course or
covered only superficially
Textbooks
For many years the “Intro to Data Mining” book by
Tan, Steinbach, and Kumar was used
One of the commonly used DM textbooks for CS
Not always very clear, but other books are not really any
better
“Data Science for Business” is much clearer and
well written, so it is also being used.
Provides much more on applications of data mining
Provides surprising technical depth, so eventually it may be
the sole book, supplemented with other materials
Currently in transition. Some readings may be a
bit redundant but good to get two perspectives.
More on the Use of 2 Textbooks
Initially I was put off by the use of “Business” in
the Data Science title since this is a CS course
But book is still relatively technical and covers
some details of the algorithms
Also it is critical for a data analyst/scientist to be
able to understand how and why to use certain
methods and to be able to articulate this.
Intro to Data Mining does a poor job at this
The Class Website & Syllabus
The class website is at:
http://storm.cis.fordham.edu/~gweiss/classes/cisc6930/
Includes the syllabus and is linked to the class schedule
The class schedule is under active development since this course
is being revised from the last time it was offered.
Will keep you up to date about what is current and what may still
be modified
Data Mining
Lecture 01:
Introduction to Data Mining
Much of the material in this presentation is not from either textbook
Let’s Start By Seeing What You
Know
Quick Quiz
Do you know what Data Mining is?
Do you know of any examples of Data Mining?
What is Data Mining?
Data Mining has many definitions
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data in order
to discover meaningful patterns
Alternative Names
Data Mining was/is known by these other names
(although many of these have lost favor over time):
Knowledge discovery in databases (KDD)
Knowledge extraction
Data/pattern analysis
Data archeology, data dredging, information harvesting,
business intelligence, etc.
Recently introduced new names (maybe with different
emphases):
Data Science
Big Data
What is Big Data?
Technically, big data means the data is so large that
conventional data mining methods cannot be
applied in the normal manner
Using this definition, in most cases a data set with
10,000,000 cases is not big data
But the term is overused and is often not
interpreted this way
Big data technologies include Hadoop
Big data technologies are used for implementing
data mining methods
We offer a course in “Big Data Programming”
Some Examples
Netflix and Amazon use data mining to recommend products
(recommender systems)
Companies use data mining for marketing
Who should be mailed a catalog
Who should see what online ads (Google Adwords)
Online advertising: large impact
Financial companies use credit scoring and fraud detection
Customer Churn: who will leave
Fordham’s WISDM project uses smartphone/watch
accelerometer data to classify user activities and perform
biometric identification
Some search engines cluster retrieved documents into
meaningful groups
Group pages about Jaguar into “car” pages and “cat” pages
Interesting specific example
Wal-Mart used data mining to find out what
is needed when a hurricane is coming
Strawberry PopTarts increase in sales 7X ahead of
a hurricane and the pre-hurricane top selling item
is beer. (Data Science for Business page 3)
A Significant Example
Signet Bank was convinced that modeling
profitability, not just default probability, is the
way to go
But they did not have the proper data
Constrained by having data only for strategies they
already used
Decided to purposefully offer loans in new cases
(explore new strategies)
Initially poor results, but eventually learned from
data and got it right
Became one of the most successful credit card
issuers: Capital One
Why Data Mining and Why Now?
Data Mining was not very popular until about
10–15 years ago
Quick Quiz: What do
you think changed?
Why Mine Data?
There are now tremendous amounts of data that
are automatically collected and warehoused.
What are some examples?
Web data, e-commerce
Store purchases
Bank/Credit Card transactions
Cell phone GPS information
Smartphone and Smartwatch Sensor Data
Why Mine Data?
What technological changes have helped make data
mining so prevalent now?
Computers: cheaper and more powerful
Smaller mobile devices are exploding in popularity
Disk and other storage: greater capacity and cheaper
Increased use of on-line resources and Internet
We shouldn’t discount the advances in algorithms
but most data mining algorithms are relatively
mature
In business, competitive pressure is strong
Why Mine Data?
Often info “hidden” in data is not evident
Analysts may take weeks to discover useful
information
Much of the data is never analyzed at all
There is just too much data to analyze without
“assistance”
Scientific Need
Data collected at enormous speeds
remote sensors on satellites
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations
Traditional techniques infeasible
How Big is the Data?
Examples of Large Data Sets
AT&T’s 26TB call detail database (2003)
Ebay 6PB, IRS 150TB data warehouse
Yahoo has a 2PB DB to analyze behavior of ½ billion
web visitors/month (24 billion events/day)
Wal-Mart has a 583 TB database (2006)
Indexed web contains about 20 Billion pages
Sites like Facebook, Flickr & Twitter contain lots of
data
Google is estimated (in 2011) to have 900,000
servers to handle its data!
How Much Data is Being Created?
5 exabytes of new data created (2002, UC Berkeley)
Humans created/copied 161 exabytes in 2006 and 281 exabytes in 2007 (IDC)
1 exabyte = 10^18 bytes
12 stacks of books stretching from Earth to Sun
3 million times the books ever written
Not all data stored at once (includes temporary data)
In 2012, 2.8 ZB (2,800 EB) of data will be created/copied
Forecast for 2020: 40 ZB (57X the number of grains of sand on
Earth)
OK, we get the point
already! Head hurts.
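The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, so 1 EB = 10^18 bytes and 1 ZB = 10^21 bytes, and all variable names are illustrative. The compound annual growth factor is a back-of-the-envelope number derived from the two quoted figures, not a figure reported by IDC.

```python
# Decimal (SI) units are assumed: 1 EB = 10**18 bytes, 1 ZB = 10**21 bytes.
EB = 10**18
ZB = 10**21
EB_PER_ZB = ZB // EB  # exabytes per zettabyte

# Figures quoted on the slide, stored in exabytes to keep the arithmetic exact.
created_2006_eb = 161
created_2007_eb = 281
created_2012_eb = 2800                # 2.8 ZB
forecast_2020_eb = 40 * EB_PER_ZB     # 40 ZB

# Year-over-year growth between the two IDC figures.
growth_2006_2007 = created_2007_eb / created_2006_eb

# Overall growth implied by the 2012 figure and the 2020 forecast, and the
# equivalent compound annual growth factor over those 8 years (derived here).
growth_2012_2020 = forecast_2020_eb / created_2012_eb
annual_factor = growth_2012_2020 ** (1 / 8)

print(f"2006 -> 2007: x{growth_2006_2007:.2f}")
print(f"2012 -> 2020: x{growth_2012_2020:.1f} overall, x{annual_factor:.2f} per year")
```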
Why Data Mining? Why Now?
According to BabyCenter.com, today
one in three children born in the
United States already have an online
presence (usually in the form of a
sonogram) before they are born.
That number grows to 92% by the
time they are two. In 2012 the
average digital birth of children
occurs at approximately six months,
with a third of all children’s photos
and information posted online within
weeks of their birth. What will it
mean to live in a world where our
every moment, from birth to death,
is digitally chronicled and preserved
in vast cloud based databases,
forever?
During the first day of a baby’s life, the amount of data generated by
humanity is equivalent to 70 times the information contained in the Library of
Congress.
Origins of Data Mining
Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems*
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality
Heterogeneous & distributed data
* databases currently have limited impact;
data mining is rarely done in a database but
rather on “flat files”
[Figure: diagram relating Data Mining to Artificial Intelligence, Statistics, Machine Learning, Pattern Recognition, and Database systems]
Statistics vs. Data Mining
Students familiar with statistics are often
confused if differences aren’t highlighted
When compared to Data Mining:
Statistics is more theory-based
Data mining methods are based on heuristic algorithms
Statistics is based firmly on mathematics (e.g., probability)
Statistics is more focused on testing hypotheses vs.
finding interesting relationships
Statistics makes more assumptions about the data
The Process of Data Mining
Data Mining is a process, formerly referred to as a knowledge discovery process. In
this process there is a data mining step that applies data mining algorithms to
extract knowledge. About 80% of our class will focus on the data mining step but in
the real world 80% of the time is spent on the other steps (e.g., prepping data).
The process below was articulated by Fayyad in a seminal paper on Data Mining
and KDD. There should be a loop since the process is iterative.
CRISP Data Mining Process
Second Part of Introduction:
Data Mining Tasks
Top-Level Data Mining Tasks
At highest level, data mining tasks can be
divided into:
Prediction Tasks (supervised learning)
Use some variables to predict unknown or future
values of other variables
Description Tasks (unsupervised learning)
Find human-interpretable patterns that describe the
data
Key Data Mining Tasks
Overview of the major data mining tasks studied
in this course:
Prediction Tasks
Classification (and class probability estimation)
Regression
Description Tasks
Clustering
Association Rule Discovery
Also known as “co-occurrence grouping” or “association
rule mining”
Data Mining Tasks Continued
“Data Science for Business” includes several
more. These are generally not as basic and
may be more application oriented
Additional data mining tasks
Similarity matching: More of a technique and is
always used in clustering and sometimes in
classification. Can be of interest in its own right.
Profiling, Link Prediction, Data Reduction, and
Causal Modeling
We will certainly cover similarity matching and
may cover some of the others
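As a rough illustration of similarity matching, the sketch below scores how similar two customers are by the overlap of the product categories they bought. Jaccard similarity is only one of several possible measures and is chosen here as an assumption; the customer names, categories, and helper functions are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def most_similar(query, candidates):
    """Return the name of the candidate whose item set is most similar to query."""
    return max(candidates, key=lambda name: jaccard(query, candidates[name]))


# Hypothetical customers and the product categories they bought (made-up data).
customers = {
    "ann": {"books", "music", "games"},
    "bob": {"garden", "tools"},
    "cat": {"books", "tools"},
}

query = {"books", "music"}
best = most_similar(query, customers)
print(best, jaccard(query, customers[best]))
```

The same scoring function could serve as the notion of "similar" inside a clustering or nearest-neighbor classification method.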
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the
class, which is to be predicted.
Find a model for class attribute as a function of
the values of other attributes.
Model maps record to a class value
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine accuracy of the model
Class Probability Estimation: estimate the probability
that an object belongs to a class
Can you think of classification tasks?
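The roles of the test set and of class probability estimation can be sketched with the simplest possible "model", one that always predicts the most frequent training class. The labels below are hypothetical, and the function names are illustrative rather than taken from either textbook.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def class_probabilities(labels):
    """Estimate P(class) as the relative frequency of each class label."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


# Hypothetical class labels (made-up data): the model is built from the
# training labels only, and the held-out test labels are used just for scoring.
train_labels = ["No", "No", "Yes", "No", "No", "Yes", "No"]
test_labels = ["No", "Yes", "No"]

probs = class_probabilities(train_labels)
majority = max(probs, key=probs.get)

predictions = [majority] * len(test_labels)
test_accuracy = accuracy(predictions, test_labels)
print(probs, majority, test_accuracy)
```

A real classifier should beat this majority-class baseline on the test set; if it does not, it has learned little from the other attributes.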
Classification Example
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set -> Learn Classifier -> Model; the Model is applied to the Test Set]
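The "learn classifier, then apply the model" flow can be sketched on the records from this example. The code below grows a small decision tree with binary splits chosen by Gini impurity. That choice of learner and split criterion is an assumption made for illustration, not something the slide specifies, and the function names are hypothetical.

```python
from collections import Counter

# Records from the slide: (Refund, Marital Status, Taxable Income in $K, Cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test records have no class label yet.
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]


def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())


def candidate_tests(rows):
    """Yield binary tests as (attribute index, value, is_numeric)."""
    for i in range(len(rows[0]) - 1):
        values = sorted({r[i] for r in rows})
        if isinstance(values[0], (int, float)):
            for lo, hi in zip(values, values[1:]):
                yield i, (lo + hi) / 2, True   # numeric: attribute <= midpoint
        else:
            for v in values:
                yield i, v, False              # categorical: attribute == value


def passes(record, test):
    i, value, numeric = test
    return record[i] <= value if numeric else record[i] == value


def build_tree(rows):
    """Return a class label (leaf) or a (test, pass_subtree, fail_subtree) node."""
    labels = Counter(r[-1] for r in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1:
        return majority
    best = None
    for test in candidate_tests(rows):
        left = [r for r in rows if passes(r, test)]
        right = [r for r in rows if not passes(r, test)]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, test, left, right)
    if best is None:          # no test separates the records
        return majority
    _, test, left, right = best
    return (test, build_tree(left), build_tree(right))


def predict(tree, record):
    """Follow tests from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        test, pass_branch, fail_branch = tree
        tree = pass_branch if passes(record, test) else fail_branch
    return tree


tree = build_tree(TRAIN)
predictions = [predict(tree, record) for record in TEST]
print(predictions)
```

Because this tree is grown until every leaf is pure, it reproduces the training labels exactly, which is why accuracy has to be measured on separate records whose true class is known.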
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which didn’t. This
{buy, don’t buy} decision forms the class attribute
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this info as input attributes to learn a classifier model
Specific Example
KDD Cup is a competition associated with the top DM conference
The KDD Cup 1998 competition was about direct marketing for a
charity. Lots of information is provided
http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions
Approach:
Use credit card transactions and info on account-holders
as attributes
When and what the customer buys, how often the customer pays on time,
etc.
Label past transactions as fraud or fair transactions. This
forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
Segment the image.
Measure image attributes (features) - 40 of them per object.
Model the class based on these features.
Success Story: Found 16 new high red-shift quasars, some of
the farthest objects, which are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light
waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Regression
Predict a value of a given continuous (numerical)
variable based on the values of other variables
Greatly studied in statistics
Examples:
Predicting sales amounts of new product based on
advertising expenditure.
Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
Time series prediction of stock market indices
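As a minimal sketch of regression, the code below fits a straight line by ordinary least squares and uses it to predict a new value. The advertising and sales numbers are hypothetical, and a single-variable linear model is an assumption chosen for brevity; real problems usually involve several input variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data (made up): advertising expenditure and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Predict sales for an expenditure that was not in the data.
predicted_sales = slope * 6.0 + intercept
print(f"sales = {slope:.2f} * spend + {intercept:.2f}; at spend 6.0: {predicted_sales:.2f}")
```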
Clustering
Given a set of data points, find clusters so that
Data points in same cluster are similar
Data points in different clusters are dissimilar
You try it on the Simpsons. How can
we cluster these 5 “data points”?
What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
School Employees
Females
Males
Clustering Application
Market Segmentation:
Goal: subdivide a market into distinct subsets of
similar customers
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
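The "find clusters of similar customers" step can be sketched with k-means, one common clustering algorithm. The slides do not name a specific method, so k-means is an assumption here. The customer attribute values are hypothetical, and seeding the centroids with the first k points is a simplification; practical implementations usually use random or smarter initialization.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Basic k-means: returns (cluster label per point, list of centroids)."""
    centroids = list(points[:k])  # simplistic deterministic initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical customers described by two numeric attributes (made-up values).
customers = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```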
Association Rule Discovery
Given a set of records, each of which contains
some number of items from a given collection
Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
[Figure: diapers and beer]
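How strongly the five transactions back each discovered rule can be checked with the two standard measures, support and confidence. The slide does not define them, so the definitions in the comments are stated as assumptions: support is the fraction of transactions containing every item in the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. Function names are illustrative.

```python
# The five transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(set(lhs) | set(rhs), transactions) / count(lhs, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.2f}, confidence={c:.2f}")
```

Rule mining algorithms typically keep only rules whose support and confidence exceed chosen thresholds.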
Association Rule Discovery Application
Marketing and Sales Promotion Applications
When items are purchased together, one can be used to drive sales of
the other
Can help determine where to position store items
Supermarket shelf management
Some stores place bananas in the cereal aisle
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Streaming Data
Privacy Preservation
What is (and is not) Data Mining?
Based on the definitions of data mining, are these
DM or not?
Finding a phone number in a directory
Not data mining (trivial?, DB query)
Grouping related documents returned by search engine
Is data mining (not trivial, clustering)
Identifying who has a disease based on symptoms
Is data mining (not trivial, classification)
Web search on keyword using search engine
May be data mining**
** More of an information retrieval task than data mining task.
However, since Google does much more than keyword matching,
there will be a data mining component. For example, Google mines
the link structure of the Web to decide which pages are important
(link mining is a type of data mining).
If you are Interested in Data Mining
Data sets
NYC open data (https://nycopendata.socrata.com/)
UCI Data Repository (http://archive.ics.uci.edu/ml/)
Visit kdnuggets, an online newsletter and more
http://www.kdnuggets.com
You can arrange to have newsletter emailed to you
Also includes job openings
ACM SIGKDD is the professional organization associated
with data mining
ACM Special Interest Group (SIG) on data mining
Can join SIGKDD for $22, or for $54 can also join ACM as a student
member