Data Mining Overview
Professor P. Batchelor
Furman University
Overview
Introduction
Explanation of Data Mining Techniques
Advantages
Applications
Privacy
Data Mining
What is Data Mining?
“The process of semiautomatically analyzing large
databases to find useful patterns” (Silberschatz)
KDD – “Knowledge Discovery in Databases”
“Attempts to discover rules and patterns from data”
Discover Rules → Make Predictions
Areas of Use
Internet – Discover needs of customers
Economics – Predict stock prices
Science – Predict environmental change
Medicine – Match patients with similar problems → cure
Example of Data Mining
A credit card company wants to discover information about
its clients from its databases. It wants to find:
Clients who respond to promotions in “junk mail”
Clients who are likely to switch to a competitor
Clients who are likely not to pay
Services that clients use, so that it can promote services
affiliated with the credit card company
Anything else that may help the company provide or promote
services to help its clients and ultimately make more
money.
Data Mining & Data Warehousing
Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored under
a unified schema, at a single site.” (Silberschatz)
Collect data → Store in a single repository
Allows for easier query development as a single repository
can be queried.
Data Mining:
Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
Knowledge is power.
Discovery of Knowledge
Data Mining Techniques
Classification
Clustering
Regression (we have already looked at this)
Association Rules
Classification
Classification: Given a set of items that each belong to one of
several classes, and given past instances (training instances)
labeled with their class, classification is the process of
predicting the class of a new item.
Classify the new item and identify to which class it belongs
Example: A bank wants to classify its Home Loan Customers
into groups according to their response to bank advertisements.
The bank might use the classifications “Responds Rarely,
Responds Sometimes, Responds Frequently”.
The bank will then attempt to find rules about the customers
that respond Frequently and Sometimes.
The rules could be used to predict needs of potential customers.
Technique for Classification
Decision-Tree Classifiers
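[Figure: decision tree for predicting credit risk. The root node splits on
Job (Engineer, Carpenter, Doctor); each job branch then splits on Income,
using thresholds such as <30K / >50K, <40K / >90K and <50K / >100K, and
ends in a leaf labeled Bad or Good.]
Predicting credit risk of a person with the jobs specified.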
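A decision tree of this kind can be read as a series of nested tests: first on the job, then on the income. The sketch below is a minimal hand-written illustration in Python. The pairing of each job with an income threshold, and the use of a single cut-off per job, are assumptions made for the example and are not confirmed by the slide.

```python
# Hypothetical thresholds: one income cut-off per job, for illustration only.
INCOME_CUTOFF = {
    "engineer": 50_000,
    "carpenter": 40_000,
    "doctor": 100_000,
}


def credit_risk(job: str, income: float) -> str:
    """Walk a two-level decision tree: split on job, then on income."""
    cutoff = INCOME_CUTOFF.get(job.lower())
    if cutoff is None:
        return "unknown"  # the tree has no branch for this job
    return "good" if income >= cutoff else "bad"


print(credit_risk("Engineer", 60_000))
print(credit_risk("Doctor", 45_000))
```

In practice the tree is learned from training instances rather than written by hand; the hand-coded version only shows how a finished tree classifies a new item.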
Clustering
“Clustering algorithms find groups of items that are similar.
… It divides a data set so that records with similar content
are in the same group, and groups are as different as
possible from each other. ”
Example: Insurance company could use clustering to group
clients by their age, location and types of insurance
purchased.
The categories are unspecified and this is referred to as
‘unsupervised learning’
Clustering
Group Data into Clusters
Similar data is grouped in the same cluster
Dissimilar data is grouped in a different cluster
How is this achieved?
Hierarchical
Group data into trees
K-Nearest Neighbor
A classification method that classifies a point by
calculating the distances between the point and points in
the training data set. Then it assigns the point to the
class that is most common among its k-nearest
neighbors (where k is an integer)
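The k-nearest-neighbor rule described above can be sketched in a few lines of Python. The two-dimensional points, the labels "A" and "B", and the choice of Euclidean distance with k = 3 are assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Return the majority label among the k training points nearest to `point`.

    `training` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify((1.8, 1.6), training))
print(knn_classify((6.2, 6.4), training))
```

An odd k avoids tied votes when there are only two classes.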
Association Rules
“An association algorithm creates rules that describe how
often events have occurred together.”
Example: When a customer buys a hammer, then 90%
of the time they will buy nails.
Association Rules
Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”
Example:
Hotdog buns and hotdog sausages appear together in a large
fraction of all purchases. = High support
Hotdog buns and hangers appear together in only 0.005% of
all purchases. = Low support
Rules with high support are worth careful attention
E.g. Hotdog sausages should be placed near hotdog buns in
supermarkets if there is also high confidence.
Association Rules
Confidence: “is a measure of how often the consequent is
true when the antecedent is true.”
Example:
90% of Hotdog bun purchases are accompanied by hotdog
sausages.
High confidence is meaningful as we can derive rules.
Hotdog bun ⇒ Hotdog sausage
Two rules may have different confidence levels and
have the same support.
E.g. Hotdog sausage ⇒ Hotdog bun may have a
much lower confidence than Hotdog bun ⇒ Hotdog
sausage, yet both can have the same support.
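Support and confidence can both be computed by counting transactions. The Python sketch below uses a small made-up basket list (an assumption for illustration) to show two rules with the same support but different confidence.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, one set of items per purchase.
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage"},
    {"sausage"},
    {"sausage"},
    {"sausage", "hanger"},
]

print(support(transactions, {"bun", "sausage"}))        # same for both rules
print(confidence(transactions, {"bun"}, {"sausage"}))   # bun => sausage
print(confidence(transactions, {"sausage"}, {"bun"}))   # sausage => bun
```

Both rules involve the same pair of items, so their support is identical; the confidence differs because sausages are bought without buns far more often than buns without sausages.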
Advantages of Data Mining
Provides new knowledge from existing data
Public databases
Government sources
Company Databases
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
Bigger profits
More efficient service
Uses of Data Mining
Sales/ Marketing
Diversify target market
Identify clients' needs to increase response rates
Risk Assessment
Identify customers that pose high credit risk
Fraud Detection
Identify people misusing the system. E.g. People who have
two Social Security Numbers
Customer Care
Identify customers likely to change providers
Identify customer needs
Applications of Data Mining
[Chart not reproduced. Source: IDC, 1998]
Privacy Concerns
Effective Data Mining requires large sources of data
To achieve a wide spectrum of data, link multiple data
sources
Linking sources can be problematic for privacy.
For example, suppose the following histories of a customer
were linked:
Shopping History
Credit History
Bank History
Employment History
The user's life story can be painted from the
collected data
Linking to Re-identify Data
[Figure: two overlapping data sets]
Medical Data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter List only: Name, Address, Date registered, Party affiliation, Date last voted
Shared by both: ZIP, Birth date, Sex
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of
Law, Medicine and Ethics. 1997, 25:98-110.
{date of birth, gender, 5-digit ZIP}
uniquely identifies 87.1% of the U.S. population.
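The linking attack amounts to a join on the attributes the two data sets share. The Python sketch below is a minimal illustration; every record, name and field value in it is made up, and the field names are assumptions.

```python
# Hypothetical records; all names and values are invented for illustration.
medical = [
    {"zip": "29613", "birth_date": "1960-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "29601", "birth_date": "1975-02-14", "sex": "M", "diagnosis": "fracture"},
]
voters = [
    {"name": "Alice Example", "zip": "29613", "birth_date": "1960-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "29605", "birth_date": "1975-02-14", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Tuple of the attributes present in both data sets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical, voters):
    """Return (name, diagnosis) pairs whose quasi-identifiers match exactly one voter."""
    by_key = {}
    for voter in voters:
        by_key.setdefault(key(voter), []).append(voter)
    matches = []
    for record in medical:
        candidates = by_key.get(key(record), [])
        if len(candidates) == 1:  # a unique match re-identifies the patient
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


print(link(medical, voters))
```

The medical records carry no names, yet the first one is re-identified because its ZIP, birth date and sex match exactly one voter.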
Perceived Concerns
Data mining lets you find out about my
private life
I don’t want you, my insurance company, or the
government knowing everything
Data mining doesn’t always get it right
I don’t want to be put in jail because data
mining said so
I don’t want to be denied credit, a job, or
insurance because data mining said so.
Real Concerns
Data mining lets you find out about my private life
Data mining doesn’t always get it right
Learned models allow conjectures
Learning the model requires collecting data
Our legal system is supposed to ensure due process
Data mining typically allows businesses to take risks they
otherwise wouldn’t
Identify people we can give instant credit
But without data mining, decisions would be slower and probably more restrictive.
Why is credit so easy to get, even though bankruptcies are up?
Data Mining and Terrorism
Total Information Awareness (TIA).
The Information Awareness Office (IAO) was established
by the Defense Advanced Research Projects Agency in
January 2002 to bring together several DARPA projects
focused on applying surveillance and information technology
to track and monitor terrorists and other threats to U.S.
national security, by achieving Total Information
Awareness (TIA).
Following public criticism that the development and
deployment of this technology could potentially lead to a mass
surveillance system, the IAO was defunded by Congress in
2003.
However, several IAO projects continued to be funded and
were merely run under different names.
Evidence Extraction and Link Discovery
Development of technologies and tools for automated
discovery, extraction and linking of sparse evidence contained
in large amounts of classified and unclassified data sources
(such as phone call records from the NSA call database,
internet histories, or bank records)
Design systems with the ability to extract data from multiple
sources (e.g., text messages, social networking sites, financial
records, and web pages).
Detect patterns comprising multiple types of links between
data items or people communicating (e.g., financial
transactions, communications, travel, etc.).
Designed to link items relating to potential "terrorist" groups and
scenarios, and to learn patterns of different groups or
scenarios to identify new organizations and emerging threats.
Scalable Social Network Analysis
Aimed at developing techniques based on social network
analysis for modeling the key characteristics of terrorist
groups and discriminating these groups from other types of
societal groups.
Sean McGahan of Northeastern University said the following
in his study of SSNA:
The purpose of the SSNA algorithms program is to extend techniques of
social network analysis to assist with distinguishing potential terrorist cells
from legitimate groups of people ... In order to be successful SSNA will
require information on the social interactions of the majority of people
around the globe. Since the Defense Department cannot easily distinguish
between peaceful citizens and terrorists, it will be necessary for them to
gather data on innocent civilians as well as on potential terrorists.
Does this worry you or make you feel more secure?
Human ID project
The Human Identification at a Distance (HumanID)
project developed automated biometric identification
technologies to detect, recognize and identify humans at
great distances for "force protection", crime prevention, and
"homeland security/defense" purposes.
Its goals included programs to:
Develop algorithms for locating and acquiring subjects out to 150 meters
(500 ft) in range.
Fuse face and gait recognition into a 24/7 human identification system.
Develop and demonstrate a human identification system that operates out
to 150 meters (500 ft) using visible imagery.
Develop a low power millimeter wave radar system for wide field of view
detection and narrow field of view gait classification.
Characterize gait performance from video for human identification at a
distance.
Develop a multi-spectral infrared and visible face recognition system.
Solutions
Data mining lets you find out about my
private life
Privacy-preserving data mining
Data mining doesn’t always get it right
Data scientists know it and are working on it
Educate the user
Privacy-Preserving Data Mining
Data Perturbation
Construct a data set with noise added
Miners given the perturbed data set
Reconstruct distribution to improve results
Solutions out there
Can be released without revealing private data
Decision trees, association rules
Debate: Does it really preserve privacy?
Can we prove impossibility of noise removal?
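The perturbation idea can be sketched in Python as follows. This is a simplified stand-in, not a full distribution-reconstruction algorithm: it adds zero-mean Gaussian noise to each value and then recovers only the mean and variance of the original data. The data, the noise level and the fixed seed are all assumptions made for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical private values (e.g. ages), generated only for illustration.
true_ages = [rng.randint(20, 70) for _ in range(10_000)]

NOISE_SD = 20.0  # assumed noise level, known to the miner
perturbed = [age + rng.gauss(0.0, NOISE_SD) for age in true_ages]

# Individual perturbed values are heavily distorted, but zero-mean noise
# averages out, so the mean of the original data can still be estimated.
true_mean = statistics.fmean(true_ages)
est_mean = statistics.fmean(perturbed)

# Independent noise adds NOISE_SD**2 to the variance, so subtracting it
# gives an estimate of the original spread.
est_var = statistics.pvariance(perturbed) - NOISE_SD ** 2

print(round(true_mean, 1), round(est_mean, 1))
print(round(statistics.pvariance(true_ages), 1), round(est_var, 1))
```

The miner sees only `perturbed` and `NOISE_SD`. Whether the noise can be filtered out well enough to recover individual values is exactly the open question the slide raises.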
Privacy-Preserving Data Mining
Distributed Data Mining
Data owners keep their data
Encryption techniques to preserve
privacy
Collaborate to get data mining results
Proofs that private data is not disclosed
Solutions for Decision Trees, Association
Rules, Clustering
Different solutions needed depending on
how data is distributed, privacy constraints
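One common building block in distributed privacy-preserving mining is a secure sum, in which sites learn a total without revealing their own counts. The Python sketch below simulates a ring-based version in a single process. It is an illustration under stated assumptions (honest sites that do not collude, and a public modulus larger than any possible total), not necessarily the technique the slide has in mind.

```python
import random

MODULUS = 2 ** 32  # assumed public bound, larger than any possible total


def secure_sum(private_values, rng=random):
    """Ring-based secure sum: each site sees only a masked running total."""
    mask = rng.randrange(MODULUS)             # known only to the first site
    running = (mask + private_values[0]) % MODULUS
    for value in private_values[1:]:          # each later site adds its own value
        running = (running + value) % MODULUS
    return (running - mask) % MODULUS         # first site removes the mask


# Hypothetical local counts held by three data owners.
site_counts = [120, 45, 310]
print(secure_sum(site_counts))
```

Because the running total is masked by a random number, no intermediate site can read another site's count from the value it receives; only the final sum is revealed.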
What Next?
Data mining lets you find out about my private life
Constraints that allow us to restrict what models can be
learned
Can we ensure that data mining won’t produce
results that are amenable to misuse? (e.g., 100%
confidence models) Redlining example
Data mining doesn’t always get it right
Educate the public
What data mining does (and doesn’t do)
Do You Agree?
There is a great difference between an
inanimate machine knowing your secrets
and a person knowing the same.
Political solutions can control how and
why information goes from the machine
to trusted analysts who can act on the
knowledge.