CSE4412 & CSE6412 3.0 Data Mining
Instructor: Nick Cercone – 3050 LAS – [email protected],
Tuesdays, Thursdays 1:00-2:30 – LAS 3033
Fall Semester, 2014
Introduction to the Data Mining course
What is data mining all about
CSE4412, CSE6412 – Data Mining
Instructor:
Nick Cercone
TA:
Nastaran Babanejad
WIKI: https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/
Read Chapter 1 of Han et al. (textbook)
Introduction
•  1.1 What Motivated Data Mining? Why Is It Important?
•  1.2 So, What Is Data Mining?
•  1.3 Data Mining: On What Kind of Data?
•  1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
•  1.5 Are All of the Patterns Interesting?
•  1.6 Classification of Data Mining Systems
•  1.7 Data Mining Task Primitives
•  1.8 Integration of a Data Mining System with a Database or Data Warehouse System
•  1.9 Major Issues in Data Mining
CSE4412, CSE6412 – Requirements
Homework:
small assignment (10%)
big assignment (15%)
not accepted late
In-class:
presentation (10%)
Final Exam: undergraduates (25%), graduate students (10%)
Project:
start early, takes lots of time (40%)
Paper:
graduate students (15%)
Basic Knowledge:
databases, statistics, algorithms,
programming
CSE4412, CSE6412 –Data Mining Project
Project:
Software implementation related to course
subject matter
Should involve an original component or
experiment
More later about available data and computing
resources
It’s going to be fun and hard work
CSE4412, CSE6412 – Projects
•  Many projects deal with collaborative filtering (advice based
on what is done by similar people)
•  Others deal with engineering solutions to machine-learning
problems
•  Project suggestions on wiki
•  Data and infrastructure available on wiki, elsewhere
•  Projects require thought: (1) Tell important pages from
unimportant (PageRank); (2) Tell real news from publicity
(how?); (3) Distinguish positive from negative product
reviews (how?); (4) Feature generation in ML; etc.
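
The PageRank idea named in the project list above can be sketched with plain power iteration. This is only an illustration: the four-page link graph, the damping factor of 0.85, and the function name `pagerank` are assumptions made here, not course material, and every linked page is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: a, b and c all link to "hub"; "hub" links back to "a".
toy_web = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(toy_web)
```

On this toy graph the heavily linked page "hub" ends up with the highest score, and "a" (linked from "hub") outranks "b" and "c", which nobody links to.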
CSE4412, CSE6412 –Data Mining Projects
Working in pairs is okay, but
No more than two per project.
Expect more from a pair than from an individual and
documentation should make clear who did what.
The effort should be roughly evenly distributed
CSE4412, CSE6412 –Data Mining
Course Outline:
see calendar on wiki
What is data mining all about
Data, Information, Knowledge, Understanding & Wisdom
The DIKW Pyramid, also known variously as the "DIKW
Hierarchy", "Wisdom Hierarchy", the "Knowledge Hierarchy",
the "Information Hierarchy", and the "Knowledge Pyramid",
refers loosely to a class of models for representing structural
and/or functional relationships between data, information,
knowledge, and wisdom.
What is data mining all about
Data are symbols or signs, representing stimuli or signals.
Data pervade our daily lives. Such data, properly interpreted and
visualized, become information, which, contextualized, turns into
knowledge to be understood and applied as wisdom and
(business and government) intelligence.
The DIKW Pyramid
[Figures: alternative depictions of the DIKW Pyramid]
What is data mining all about
•  Data: symbols
•  Information: data that are processed to be useful; provides
answers to "who", "what", "where", and "when" questions
•  Knowledge: application of data and information; answers
"how" questions
•  Understanding: appreciation of "why"
•  Wisdom: evaluated understanding.
What is data mining (traditional)
“Data mining is the extraction of implicit, previously
unknown, and potentially useful information from data.”
“The application of specific algorithms for extracting
patterns from data, it is a part of knowledge discovery from
databases”
“Data mining is a process, not just a series of statistical
analyses.”
Traditionally
Computer Science
•  (Semi-)automated application of algorithms for pattern discovery
•  Algorithms developed in the field of Artificial Intelligence (machine learning)
•  Part of the process of knowledge discovery
Statistics
•  Process of discovering patterns in data
•  (Manual) application of a series of statistical techniques (among which machine learning)
•  Incorporates exploration, sampling, modeling, and validation
What can data mining do?
1. Classification
discrete outcomes (bird, cat, or fish)
2. Estimation
continuously valued outcomes (height, income, or weight)
3. Prediction
estimated future value (predicting the number of children in the next year, or which
telephone subscribers will order three-way calling)
4. Affinity Grouping (market basket analysis)
grouping which things go together (pizza and soft drink, coffee and Coffee-mate, or
cat food and kitty litter)
5. Clustering
segmenting a heterogeneous population into a number of more homogeneous
subgroups (a cluster of symptoms might indicate different diseases)
6. Description
describing what is going on (women support Democrats in greater numbers than do
men)
Differences between data mining and machine learning
1. ML is a broader field (not only learning from examples, but also
reinforcement learning, evolutionary computing …)
2. Some overlap in the algorithms used and the problems
addressed
3. Data mining is concerned with finding understandable
knowledge, while ML is concerned with improving
performance of an agent (e.g. training a neural network to
balance a pole is part of ML, but not data mining)
4. Data mining deals with large sets of real-world examples
(realistic examples are normally very large and noisy)
Data Mining Application Examples
Areas where data mining has been applied recently include:
•  Science, Astronomy, Bioinformatics, Drug Discovery, ...
•  Business, Advertising, Customer modeling and Customer
Relationship Management (CRM), e-Commerce, Fraud
Detection, Targeted Marketing
•  Health care, Investments
•  Manufacturing
•  Sports/entertainment
•  Telecom (telephone and communications)
Data Mining Application Examples (cont)
•  Web: Search engines, bots, …
•  Government, Anti-terrorism efforts (we will discuss
controversy over privacy later), Law enforcement, Profiling
tax cheaters
One of the most important and widespread business
applications of data mining is Customer Modeling, also called
Predictive Analytics. This includes tasks such as
Predictive Analytics
•  Predicting attrition or churn, i.e., find which customers are
likely to terminate service
•  Targeted marketing: Customer acquisition - find which
prospects are likely to become customers; Cross-sell - for a
given customer and product, find which other product(s) they
are likely to buy
•  Credit-risk - identify the risk that this customer will not pay
back the loan or credit card
•  Fraud detection - is this transaction fraudulent?
The largest users of Customer Analytics are industries such as
banking, telecom, and retail, where businesses with large numbers
of customers make extensive use of these technologies.
Customer Attrition: Case Study
Let's consider a case study of a mobile phone company. The typical
attrition (also called churn) rate for mobile phone customers is
around 25-30% a year!
The task is
•  Given customer information for the past N months (N can range
from 2 to 18), predict who is likely to attrite in the next month
or two.
•  Also, estimate customer value and the most cost-effective
offer to make to this customer.
Customer Attrition: Case Study (cont)
•  Verizon Wireless is the largest wireless service provider in the
United States, with a customer base of 34.6 million subscribers
(see http://www.kdnuggets.com/news/2003/n19/22i.html).
•  Verizon built a customer data warehouse that
–  Identified potential attriters
–  Developed multiple, regional models
–  Targeted customers with high propensity to accept the offer
–  Reduced attrition rate from over 2%/month to under 1.5%/month (huge
impact over 34 million subscribers)
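
To see why that reduction is a "huge impact", the arithmetic can be worked through as a rough sketch. It treats the slide's "over 2%" and "under 1.5%" as exactly 2% and 1.5% (so the saving is a lower bound) and assumes a constant subscriber base of 34.6 million; the function names are illustrative only.

```python
SUBSCRIBERS = 34_600_000  # customer base quoted on the slide


def monthly_losses(subscribers, monthly_rate):
    """Customers lost in one month at a given monthly attrition rate."""
    return round(subscribers * monthly_rate)


def annual_rate(monthly_rate):
    """Yearly attrition implied by compounding a monthly rate over 12 months."""
    return 1 - (1 - monthly_rate) ** 12


lost_before = monthly_losses(SUBSCRIBERS, 0.02)    # at 2% per month
lost_after = monthly_losses(SUBSCRIBERS, 0.015)    # at 1.5% per month
saved_per_month = lost_before - lost_after         # 173,000 customers retained

yearly_before = annual_rate(0.02)    # roughly 21.5% a year
yearly_after = annual_rate(0.015)    # roughly 16.6% a year
```

Half a percentage point a month therefore corresponds to about 173,000 retained subscribers every month under these assumptions.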
Assessing Credit Risk: Case Study
Consider a situation where a person applies for a loan.
Should a bank approve the loan?
Note: People who have the best credit don't need the loans, and
people with the worst credit are not likely to repay. A bank's best
customers are in the middle.
Banks develop credit models using a variety of machine learning
methods.
Mortgage and credit card proliferation are the results of being
able to successfully predict if a person is likely to default on a
loan. Credit risk assessment is universally used in the US and
widely deployed in most developed countries.
Successful e-commerce - Case Study
•  Amazon.com is the largest on-line retailer, which started with
books and expanded into music, electronics, and other
products. Amazon.com has an active data mining group, which
focuses on personalization. Why personalization? Consider a
person that buys a book (product) at Amazon.com.
•  Task: Recommend other books (and perhaps products) this
person is likely to buy
•  Amazon's initial and quite successful effort used clustering
based on books bought.
Successful e-commerce - Case Study (cont)
•  For example, customers who bought "Advances in Knowledge
Discovery and Data Mining", by Fayyad, Piatetsky-Shapiro,
Smyth, and Uthurusamy, also bought "Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations", by Witten and Frank.
•  The recommendation program is quite successful and more
advanced programs are being developed.
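
The "customers who bought X also bought Y" idea can be illustrated with simple co-purchase counting. This is a minimal sketch, not Amazon's method (the slides say Amazon's early effort used clustering); the order data, item names, and function names below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(orders):
    """Count how often each unordered pair of items appears in the same order."""
    pair_counts = Counter()
    for items in orders:
        # sorted(set(...)) gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return pair_counts


def also_bought(item, orders, top_n=3):
    """Items most often bought together with `item`, most frequent first."""
    scores = Counter()
    for (left, right), count in co_purchase_counts(orders).items():
        if left == item:
            scores[right] += count
        elif right == item:
            scores[left] += count
    return [other for other, _ in scores.most_common(top_n)]


# Hypothetical order history; each inner list is one customer's order.
orders = [
    ["kdd_book", "weka_book"],
    ["kdd_book", "weka_book", "stats_book"],
    ["kdd_book", "cookbook"],
    ["weka_book", "stats_book"],
]
suggestions = also_bought("kdd_book", orders)
```

Here "weka_book" is the top suggestion for buyers of "kdd_book" because the two share the most orders.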
Unsuccessful e-commerce - Case Study
•  Of course, applying data mining is no guarantee of success,
and the Internet bubble of 1999-2000 produced plenty of
examples.
•  Consider the legwear and legcare e-tailer Gazelle.com, whose
clickstream and purchase data were the subject of the KDD
Cup 2000 competition (http://www.ecn.purdue.edu/KDDCUP/)
•  One of the questions was: Characterize visitors who spend
more than $12 on an average order at the site
Unsuccessful e-commerce - Case Study (cont)
•  The data included a dataset of 3,465 purchases, 1,831
customers
•  Very interesting and illuminating analysis was done by dozens
of Cup participants. The total time spent was thousands of hours,
which would have been equivalent to millions of dollars in
consulting fees.
•  However, the total sales of Gazelle.com were only a few
thousand dollars and no amount of data mining could help
them. Not surprisingly, Gazelle.com went out of business in
Aug 2000.
Data Mining, Security & Fraud Detection
•  There are currently numerous applications of data mining for
security and fraud detection. One of the most common is
Credit Card Fraud Detection. Almost all credit card purchases
are scanned by special algorithms that identify suspicious
transactions for further action. I have recently received such a
call from my bank, when I used a credit card to pay for a
journal published in England. This was an unusual transaction
for me (first purchase in the UK on this card) and the software
flagged it.
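
The anecdote above can be mimicked with a toy rule-based check. Real fraud systems generally rely on learned models rather than two hand-written rules, so this is only a sketch; the field names, the `amount_factor` threshold, and the sample history are all assumptions.

```python
def flag_unusual(history, transaction, amount_factor=5.0):
    """Return the reasons a transaction looks unusual versus the card's history."""
    reasons = []
    seen_countries = {past["country"] for past in history}
    if transaction["country"] not in seen_countries:
        reasons.append("first purchase in " + transaction["country"])
    if history:
        typical = sum(past["amount"] for past in history) / len(history)
        # Flag amounts far above the cardholder's average spend.
        if transaction["amount"] > amount_factor * typical:
            reasons.append("amount far above average")
    return reasons


# Hypothetical card history and a new payment to a UK journal publisher.
history = [
    {"country": "CA", "amount": 40.0},
    {"country": "CA", "amount": 25.0},
    {"country": "US", "amount": 60.0},
]
journal_payment = {"country": "UK", "amount": 90.0}
flags = flag_unusual(history, journal_payment)
```

The payment is flagged only because it is the first purchase in a new country, which matches the situation described on the slide.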
Data Mining, Security & Fraud Detection (cont)
•  Other applications include detection of money laundering - a
notable system, called FAIS, was developed by Ted Senator
for the US Treasury [Se96].
•  The National Association of Securities Dealers (NASD), which
runs NASDAQ, has developed a system called Sonar that uses
data mining for monitoring insider trading and fraud through
misrepresentation
(http://www.kdnuggets.com/news/2003/n18/13i.html).
•  Many telecom companies, including AT&T, Bell Atlantic,
British Telecom/MCI have developed systems for catching
phone fraud.
Data Mining, Security & Fraud Detection (cont)
•  Data mining and security were in the headlines with US
Government efforts to use data mining for terrorism
detection, as part of the now-closed Total Information Awareness
Program. However, the problem of terrorism is unlikely to go
away soon, and US government efforts are continuing.
•  Less controversial is the use of data mining for bio-terrorism
detection, as was done at the Salt Lake Olympics in 2002 (the only
thing that was found was a small outbreak of tropical
diseases).
Problems Suitable for Data Mining
The areas where data mining applications are likely to be
successful have these characteristics:
•  require knowledge-based decisions
•  have a changing environment
•  have sub-optimal current methods
•  have accessible, sufficient, and relevant data
•  provide high payoff for the right decisions
•  Also, if the problem involves people, then proper
consideration should be given to privacy
Knowledge Discovery
We define Knowledge Discovery in Data (KDD) as the nontrivial process of identifying
•  valid
•  novel
•  potentially useful
•  and ultimately understandable
patterns in data
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
Knowledge Discovery (cont)
•  Knowledge Discovery is an interdisciplinary field, which
builds upon a foundation provided by databases and statistics
and applies methods from machine learning and visualization
in order to find useful patterns.
•  Data mining has much in common with Statistics and with
Machine Learning. However, there are differences.
•  Statistics provides a theory for dealing with randomness and
tools for testing hypotheses. Statistics does not study topics
such as data preprocessing or results visualization, which are
part of data mining.
Knowledge Discovery (cont)
•  Machine learning has a more heuristic approach and focuses
on improving performance of a learning agent. ML
encompasses other areas such as real-time learning and
robotics - which are not part of data mining. Data Mining and
Knowledge Discovery integrates theory and heuristics
focusing on the entire process of knowledge discovery,
including data cleaning, learning, and integration and
visualization of results.
Knowledge Discovery Process
•  Knowledge Discovery emphasizes process. KDD is not a
single step solution of applying a machine learning method to
a dataset, but a continuous process with loops and feedbacks.
This process has been formalized by an industry group called
CRISP-DM, which stands for CRoss Industry Standard
Process for Data Mining
Knowledge Discovery Process (cont)
The main steps in the KD process include:
•  1. Business (or Problem) Understanding
•  2. Data Understanding
•  3. Data Preparation (including data cleaning & preprocessing)
•  4. Modeling (applying machine learning and data mining
algorithms)
•  5. Evaluation (checking the performance of these algorithms)
•  6. Deployment
•  7. Monitoring.
See www.crisp-dm.org for more information on CRISP-DM.
Historical Note: Many names of Data Mining
•  Data Mining and Knowledge Discovery has many names.
•  In the 1960s, statisticians used terms like "Data Fishing" or "Data
Dredging" to refer to what they considered a bad practice of
analyzing data without an a priori hypothesis.
•  The term "Data Mining" appeared around 1990 in the database
community. The phrase "database mining"™ was trademarked
by HNC, so researchers turned to "data mining". Other terms
used include Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, etc.
Historical Note: Many names of Data Mining
•  Gregory Piatetsky-Shapiro coined the term "Knowledge
Discovery in Databases" for the first workshop on the same
topic (1989) and this term became popular.
•  The term data mining became more popular in the business
community and in the press.
•  Data Mining and Knowledge Discovery are used
interchangeably, and we use these terms as synonyms.
Data Mining Tasks
•  Data mining is about many different types of patterns, and
there are correspondingly many types of data mining tasks.
Some of the most popular are
•  Classification: predicting an item class
•  Clustering: finding clusters in data
•  Associations: e.g. A & B & C occur frequently
•  Visualization: to facilitate human discovery
•  Summarization: describing a group
•  Deviation Detection: finding changes
•  Estimation: predicting a continuous value
•  Link Analysis: finding relationships
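
As one concrete instance of the clustering task in the list above, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The choice of k-means, the starting centers, and the sample values are assumptions for illustration; the slides do not prescribe a particular clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on plain numbers; `centers` are the starting guesses."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical measurements that visibly fall into two groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
```

The two centers settle near 1.0 and 9.5, splitting the values into a low group and a high group without any labels being supplied.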
Brief History of Data Mining
•  The term "Data mining" was introduced in the 1990s, but data
mining is the evolution of a field with a long history.
•  Data mining roots are traced back along three family lines:
classical statistics, artificial intelligence, and machine learning.
•  Statistics is the foundation of most technologies on which
data mining is built, e.g. regression analysis, standard
distribution, standard deviation, standard variance,
discriminant analysis, cluster analysis, and confidence
intervals. All of these are used to study data and data
relationships.
Brief History of Data Mining (cont)
•  Artificial intelligence (AI) attempts to apply human-thought-like
processing to statistical problems. Certain AI concepts
were adopted for commercial products, e.g., query
optimization modules for Relational DB Management Systems
•  Machine learning combines statistics and AI, because it blends
AI heuristics with advanced statistical analysis. Machine
learning techniques learn about the data they study, such that
programs make different decisions based on the qualities of
the studied data, using statistics for fundamental concepts, and
adding more advanced AI heuristics and algorithms to achieve
goals.
Brief History of Data Mining (cont)
•  Data mining, in many ways, is fundamentally the adaptation of
machine learning techniques to business applications. Data
mining is best described as the union of historical and recent
developments in statistics, AI, and machine learning. These
techniques are then used together to study data and find
previously-hidden trends or patterns within.
Computer science conferences on data mining
•  CIKM Conference - ACM Conference on Information and Knowledge
Management
•  DMIN Conference - International Conference on Data Mining
•  DMKD Conference – Research Issues on Data Mining and Knowledge
Discovery
•  ECDM Conference – European Conference on Data Mining
•  ECML-PKDD Conference – European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases
•  EDM Conference – International Conference on Educational Data Mining
•  ICDM Conference – IEEE International Conference on Data Mining
•  KDD Conference – ACM SIGKDD Conference on Knowledge Discovery
and Data Mining
Computer science conferences on data mining
•  MLDM Conference – Machine Learning and Data Mining in Pattern
Recognition
•  PAKDD Conference - The annual Pacific-Asia Conference on Knowledge
Discovery and Data Mining
•  PAW Conference - Predictive Analytics World
•  SDM Conference – SIAM International Conference on Data Mining
•  SSTD Symposium – Symposium on Spatial and Temporal Databases
•  WSDM Conference – ACM Conference on Web Search and Data Mining
Data mining topics are also present at many
data management/database conferences such as the ICDE Conference,
SIGMOD Conference and International Conference on Very Large Data Bases.
Data mining involves 6 common classes of tasks:
•  Anomaly detection (Outlier/change/deviation detection) – The
identification of unusual data records that might be interesting, or of data
errors that require further investigation.
•  Association rule learning (Dependency modeling) – Searches for
relationships between variables. For example, a supermarket might gather
data on customer purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought together
and use this information for marketing purposes, sometimes referred to as
“market basket analysis”.
•  Clustering – the task of discovering groups and structures in the data that
are in some way or another similar, without using known structures in the
data
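
The market basket example in the association rule bullet above is usually quantified with two measures, support and confidence. The sketch below computes both for a handful of hypothetical baskets; the basket contents and function names are illustrative assumptions, and the code assumes the antecedent occurs in at least one basket.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket)) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Hypothetical supermarket baskets.
baskets = [
    {"pizza", "soft drink"},
    {"pizza", "soft drink", "chips"},
    {"pizza", "salad"},
    {"cat food", "kitty litter"},
]
pair_support = support({"pizza", "soft drink"}, baskets)         # 2 of 4 baskets
rule_confidence = confidence({"pizza"}, {"soft drink"}, baskets)  # 2 of 3 pizza baskets
```

A rule such as "pizza implies soft drink" is typically kept only if both numbers exceed thresholds chosen by the analyst.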
Data mining involves 6 common classes of tasks: (cont)
•  Classification – the task of generalizing known structure to apply to new
data. For example, an e-mail program might attempt to classify an e-mail as
“legitimate” or “spam”.
•  Regression – attempts to find a function that models the data with the least
error.
•  Summarization – providing a more compact representation of the data set,
including visualization and report generation.
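
For the regression task above, "a function that models the data with the least error" can be made concrete with an ordinary least-squares line fit. This is a minimal sketch assuming a straight-line model and squared error; the data points and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-variation of x with y, around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying roughly on a line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)   # about 1.94 and 1.15
```

The fitted line can then estimate a continuous value for a new x as `slope * x + intercept`.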
Data mining consists of five major elements:
•  Extract, transform, and load transaction data onto the data
warehouse system.
•  Store and manage the data in a multidimensional database
system.
•  Provide data access to business analysts and information
technology professionals.
•  Analyze the data by application software.
•  Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
•  Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
•  Genetic algorithms: Optimization techniques that use processes such as
genetic combination, mutation, and natural selection in a design based on
the concepts of natural evolution.
•  Decision trees: Tree-shaped structures that represent sets of decisions.
These decisions generate rules for the classification of a dataset. Specific
decision tree methods include Classification and Regression Trees (CART)
and Chi Square Automatic Interaction Detection (CHAID).
CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set
of rules that you can apply to a new (unclassified) dataset to predict which records will have a given
outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square
tests to create multi-way splits. CART typically requires less data preparation than CHAID.
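
A CART-style 2-way split can be sketched as a search for the threshold on one numeric attribute that leaves the purest two groups. The sketch assumes Gini impurity as the purity measure (a common choice for classification trees) and a single feature; the loan data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_binary_split(values, labels):
    """Threshold on one numeric feature that minimizes weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between neighboring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical loan records: income (in thousands) and whether the loan was repaid.
incomes = [20, 25, 30, 60, 70, 80]
repaid = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(incomes, repaid)
```

A full tree builder would apply the same search recursively to each side of the chosen split; the resulting path of thresholds is what reads as an if-then rule.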
Different levels of analysis are available: (cont)
•  Nearest neighbor method: A technique that classifies each record in a
dataset based on a combination of the classes of the k record(s) most
similar to it in a historical dataset (where k ≥ 1). Sometimes called the
k-nearest neighbor technique.
•  Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
•  Data visualization: The visual interpretation of complex relationships in
multidimensional data. Graphics tools are used to illustrate data
relationships.
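
The nearest neighbor method described above can be sketched in a few lines. The sketch assumes Euclidean distance as the similarity measure and a simple majority vote as the "combination of the classes"; the historical records and labels are hypothetical.

```python
from collections import Counter


def knn_classify(query, history, k=3):
    """Majority class among the k historical records closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Sort the historical records by distance to the query and keep the k closest.
    nearest = sorted(history, key=lambda rec: squared_distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (feature vector, class label).
history = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((0.9, 1.1), "cat"),
    ((5.0, 5.0), "bird"),
    ((5.2, 4.9), "bird"),
]
prediction = knn_classify((1.1, 1.0), history)
```

Larger values of k smooth out noisy records at the cost of blurring small classes, which is why k is usually tuned on held-out data.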
Data Mining: Issues
•  One of the key issues raised by data mining technology is not a business or
technological one, but a social one. It is the issue of individual privacy.
Data mining makes it possible to analyze routine business transactions and
glean a significant amount of information about individuals' buying
preferences.
•  Another issue is that of data integrity. Clearly, data analysis can only be as
good as the data that is being analyzed. A key implementation challenge is
integrating conflicting or redundant data from different sources. For
example, a bank may maintain credit card accounts in several different
databases. The addresses (or even the names) of a single cardholder may be
different in each. Software must translate data from one system to another
and select the address most recently entered.
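
The "select the address most recently entered" step can be sketched as follows. It assumes the records from the different databases have already been translated into one common shape with a shared cardholder identifier and an entry date; the field names and sample records are hypothetical.

```python
from datetime import date


def latest_address(records):
    """For each cardholder id, keep the address with the newest entry date."""
    newest = {}
    for rec in records:
        current = newest.get(rec["cardholder_id"])
        if current is None or rec["entered"] > current["entered"]:
            newest[rec["cardholder_id"]] = rec
    return {cid: rec["address"] for cid, rec in newest.items()}


# Hypothetical records for the same cardholders pulled from two databases.
records = [
    {"cardholder_id": 7, "address": "12 Elm St", "entered": date(2012, 3, 1), "source": "cards_db_a"},
    {"cardholder_id": 7, "address": "99 Oak Ave", "entered": date(2014, 6, 15), "source": "cards_db_b"},
    {"cardholder_id": 8, "address": "5 Pine Rd", "entered": date(2013, 1, 9), "source": "cards_db_a"},
]
merged = latest_address(records)
```

In practice the harder part is matching records that refer to the same person when names or identifiers differ between systems; this sketch takes that matching as given.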
Data Mining: Issues (cont)
•  A technical issue is whether it is better to set up a relational database or a
multidimensional structure. In a relational db, data is stored in tables,
permitting ad hoc queries. In a multidimensional structure, sets of cubes are
arranged in arrays, with subsets created according to category. While
multidimensional structures facilitate multidimensional data mining,
relational structures have performed better in client/server environments.
•  Finally, there is cost. Although system hardware costs have dropped
dramatically, data mining and data warehousing tend to be self-reinforcing.
The more powerful the data mining queries, the greater the utility of the
information being gleaned from the data, and the greater the pressure to
increase the amount of data being collected and maintained, which
increases the pressure for faster, more powerful data mining queries. This
increases pressure for larger, faster systems, which are more expensive.
Next Class
•  Course Introduction (continued)
Data mining terminology and introduction to data mining
concepts. Data Preprocessing: Why Preprocess the Data?;
Descriptive Data Summarization; Data Cleaning; Data
Integration and Transformation; Data Reduction; Data
Discretization and Concept Hierarchy Generation
Concluding Remarks
The Road to Wisdom
The road to wisdom? – Well, it's plain
and simple to express:
Err
and err
and err again
but less
and less
and less.