Data Mining Overview
Professor P. Batchelor
Furman University
Overview





Introduction
Explanation of Data Mining Techniques
Advantages
Applications
Privacy
Data Mining






What is Data Mining?
“The process of semiautomatically analyzing large
databases to find useful patterns” (Silberschatz)
KDD – “Knowledge Discovery in Databases”
“Attempts to discover rules and patterns from data”
Discover Rules → Make Predictions
Areas of Use




Internet – Discover needs of customers
Economics – Predict stock prices
Science – Predict environmental change
Medicine – Match patients with similar problems → cure
Example of Data Mining

A credit card company wants to discover information about
its clients from its databases. It wants to find:





Clients who respond to promotions in “Junk Mail”
Clients who are likely to switch to a competitor
Clients who are likely not to pay
Services that clients use, so that services affiliated
with the credit card company can be promoted to them
Anything else that may help the company provide and promote
services to its clients and ultimately make more
money.
Data Mining & Data Warehousing

Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored under
a unified schema, at a single site.” (Silberschatz)



Collect data → Store in single repository
Allows for easier query development as a single repository
can be queried.
Data Mining:


Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
Knowledge is power.
Discovery of Knowledge
Data Mining Techniques




Classification
Clustering
Regression (we have already looked at this)
Association Rules
Classification






Classification: Given a set of items that each belong to one of
several classes, and given past instances (training instances)
with their associated classes, classification is the process of
predicting the class of a new item.
Classify the new item and identify to which class it belongs
Example: A bank wants to classify its Home Loan Customers
into groups according to their response to bank advertisements.
The bank might use the classifications “Responds Rarely,
Responds Sometimes, Responds Frequently”.
The bank will then attempt to find rules about the customers
that respond Frequently and Sometimes.
The rules could be used to predict needs of potential customers.
Technique for Classification

Decision-Tree Classifiers
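[Figure: decision tree for credit risk. The root splits on Job
(Engineer, Carpenter, Doctor), each job branch then splits on
Income, and each leaf is labeled Good or Bad. Income splits shown
in the figure: <30K Bad / >50K Good; <40K Bad / >90K Good;
<50K Bad / >100K Good.]
Predicting credit risk of a person with the jobs specified.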
Clustering



“Clustering algorithms find groups of items that are similar.
… It divides a data set so that records with similar content
are in the same group, and groups are as different as
possible from each other. ”
Example: Insurance company could use clustering to group
clients by their age, location and types of insurance
purchased.
The categories are unspecified and this is referred to as
‘unsupervised learning’
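
As a rough illustration of grouping similar records without predefined categories, the sketch below uses single-linkage agglomerative clustering, which is one possible approach and not necessarily the one the slides have in mind. The client records, the choice of features, and the function name `agglomerative` are all hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain (single linkage)."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > num_clusters:
        # Pick the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(a, b) for a in clusters[pair[0]] for b in clusters[pair[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j leaves i valid
        del clusters[j]
    return clusters


# Hypothetical insurance clients: (age, number of policies held)
clients = [(23, 1), (25, 1), (27, 2), (52, 3), (55, 4), (58, 3)]
groups = agglomerative(clients, 2)
print(groups)
```

No class labels are supplied; the two groups emerge only from the distances between records, which is what makes this unsupervised.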
Clustering

Group Data into Clusters



Similar data is grouped in the same cluster
Dissimilar data is grouped in a different cluster
How is this achieved?
 Hierarchical


Group data into trees
K-Nearest Neighbor
 A classification method that classifies a point by
calculating the distances between the point and the points in
the training data set. It then assigns the point to the
class that is most common among its k nearest
neighbors (where k is an integer).
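
The k-nearest-neighbor rule described above can be sketched in a few lines. This is a minimal version assuming Euclidean distance and a simple majority vote; the training records, the feature choice, and the function name `knn_classify` are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training, point, k):
    """Return the majority label among the k training items closest to `point`.

    `training` is a list of (features, label) pairs.
    """
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (income in $1000s, years at current job) -> credit risk
training = [
    ((25, 1), "Bad"), ((30, 2), "Bad"), ((35, 1), "Bad"),
    ((80, 6), "Good"), ((90, 8), "Good"), ((100, 10), "Good"),
]
print(knn_classify(training, (85, 7), k=3))
```

In practice the features are usually rescaled first, since a feature with a large numeric range would otherwise dominate the distance.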
Association Rules


“An association algorithm creates rules that describe how
often events have occurred together.”
Example: When a customer buys a hammer, then 90%
of the time they will buy nails.
Association Rules


Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”
Example:



Hotdog buns and hotdog sausages are bought together in a
large fraction of all purchases. = High support
Hotdog buns and hangers are bought together in 0.005% of
all purchases. = Low support
Rules with high support are worth careful attention

E.g. Hotdog sausages should be placed near hotdog buns in
supermarkets if there is also high confidence.
Association Rules


Confidence: “is a measure of how often the consequent is
true when the antecedent is true.”
Example:





90% of Hotdog bun purchases are accompanied by hotdog
sausages.
High confidence is meaningful as we can derive rules.
Hotdog bun → Hotdog sausage
Two rules may have different confidence levels and
have the same support.
E.g. Hotdog sausage → Hotdog bun may have a
much lower confidence than Hotdog bun → Hotdog
sausage, yet both rules have the same support, because
support counts the purchases that contain both items.
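
The two measures can be computed directly from a list of purchases. The sketch below follows the quoted definitions of support and confidence; the baskets are invented for illustration, and the function names `support` and `confidence` are hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets
baskets = [
    {"bun", "sausage"}, {"bun", "sausage", "mustard"}, {"sausage"},
    {"sausage", "beer"}, {"bun", "sausage"}, {"milk"},
    {"sausage", "milk"}, {"bread"}, {"bun", "sausage", "beer"}, {"sausage"},
]

print(support(baskets, {"bun", "sausage"}))       # same for both rule directions
print(confidence(baskets, {"bun"}, {"sausage"}))  # bun -> sausage
print(confidence(baskets, {"sausage"}, {"bun"}))  # sausage -> bun
```

In this data every bun purchase includes sausages, but only half of the sausage purchases include buns, so the two rules share one support value while their confidences differ.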
Advantages of Data Mining

Provides new knowledge from existing data



Public databases
Government sources
Company Databases

Old data can be used to develop new knowledge

New knowledge can be used to improve services or products

Improvements lead to:


Bigger profits
More efficient service
Uses of Data Mining

Sales/Marketing:
 Diversify target market
 Identify clients’ needs to increase response rates

Risk Assessment:
 Identify customers that pose high credit risk

Fraud Detection:
 Identify people misusing the system, e.g. people who have
 two Social Security Numbers

Customer Care:
 Identify customers likely to change providers
 Identify customer needs
Applications of Data Mining
[Chart not reproduced.] Source: IDC, 1998
Privacy Concerns




Effective Data Mining requires large sources of data
To achieve a wide spectrum of data, link multiple data
sources
Linking sources can be problematic for privacy. For example,
suppose the following histories of a customer were
linked:
 Shopping History
 Credit History
 Bank History
 Employment History
The user’s life story can be painted from the
collected data
Linking to Re-identify Data
Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter List: Name, Address, Date registered, Party affiliation, Date last voted
In both: ZIP, Birth date, Sex
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of
Law, Medicine and Ethics. 1997, 25:98-110.
{date of birth, gender, 5-digit ZIP}
uniquely identifies 87.1% of the U.S. population.
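
The linking attack can be sketched as a join on the fields that both data sets share. Every record, name, and ZIP code below is invented for illustration, and the function name `link` is hypothetical; the point is only that a medical record without a name becomes identifiable once exactly one voter shares its ZIP, birth date, and sex.

```python
# Hypothetical records; all names and values are invented for illustration.
medical = [
    {"zip": "00001", "birth_date": "1961-07-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "00002", "birth_date": "1975-02-03", "sex": "M", "diagnosis": "diabetes"},
]
voters = [
    {"name": "A. Example", "zip": "00001", "birth_date": "1961-07-14", "sex": "F"},
    {"name": "B. Sample", "zip": "00003", "birth_date": "1980-11-30", "sex": "M"},
]

QUASI_IDENTIFIER = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows):
    """Join the two tables on the shared fields; a unique match re-identifies a patient."""
    by_key = {}
    for voter in voter_rows:
        key = tuple(voter[f] for f in QUASI_IDENTIFIER)
        by_key.setdefault(key, []).append(voter)
    matches = []
    for row in medical_rows:
        candidates = by_key.get(tuple(row[f] for f in QUASI_IDENTIFIER), [])
        if len(candidates) == 1:  # exactly one voter shares all three fields
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


print(link(medical, voters))
```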
Perceived Concerns

Data mining lets you find out about my
private life


I don’t want you, my insurance company, the
government knowing everything
Data mining doesn’t always get it right


I don’t want to be put in jail because data
mining said so
I don’t want to be denied credit, a job,
insurance because data mining said so.
Real Concerns

Data mining lets you find out about my private life

Learned models allow conjectures
Learning the model requires collecting data

Data mining doesn’t always get it right

Our legal system is supposed to ensure due process
Data mining typically allows businesses to take risks they
otherwise wouldn’t
 Identify people we can give instant credit
But without data mining, decisions would be slower and probably more restrictive.

Why is credit so easy to get, even though bankruptcies are up?
Data Mining and Terrorism
Total Information Awareness (TIA).



The Information Awareness Office (IAO) was established
by the Defense Advanced Research Projects Agency in
January 2002 to bring together several DARPA projects
focused on applying surveillance and information technology
to track and monitor terrorists and other threats to U.S.
national security, by achieving Total Information
Awareness (TIA).
Following public criticism that the development and
deployment of this technology could potentially lead to a mass
surveillance system, the IAO was defunded by Congress in
2003.
However, several IAO projects continued to be funded, and
were merely run under different names.
Evidence Extraction and Link Discovery




Development of technologies and tools for automated
discovery, extraction and linking of sparse evidence contained
in large amounts of classified and unclassified data sources
(such as phone call records from the NSA call database,
internet histories, or bank records)
Design systems with the ability to extract data from multiple
sources (e.g., text messages, social networking sites, financial
records, and web pages).
Detect patterns comprising multiple types of links between
data items or people communicating (e.g., financial
transactions, communications, travel, etc.).
Designed to link items relating to potential "terrorist" groups and
scenarios, and to learn patterns of different groups or
scenarios to identify new organizations and emerging threats.
Scalable Social Network Analysis




Aimed at developing techniques based on social network
analysis for modeling the key characteristics of terrorist
groups and discriminating these groups from other types of
societal groups.
Sean McGahan of Northeastern University said the following
in his study of SSNA:
The purpose of the SSNA algorithms program is to extend techniques of
social network analysis to assist with distinguishing potential terrorist cells
from legitimate groups of people ... In order to be successful SSNA will
require information on the social interactions of the majority of people
around the globe. Since the Defense Department cannot easily distinguish
between peaceful citizens and terrorists, it will be necessary for them to
gather data on innocent civilians as well as on potential terrorists.
Does this worry you or make you feel more secure?
Human ID project


The Human Identification at a Distance (HumanID)
project developed automated biometric identification
technologies to detect, recognize and identify humans at
great distances for "force protection", crime prevention, and
"homeland security/defense" purposes.
Its goals included programs to:






Develop algorithms for locating and acquiring subjects out to 150 meters
(500 ft) in range.
Fuse face and gait recognition into a 24/7 human identification system.
Develop and demonstrate a human identification system that operates out
to 150 meters (500 ft) using visible imagery.
Develop a low-power millimeter-wave radar system for wide-field-of-view
detection and narrow-field-of-view gait classification.
Characterize gait performance from video for human identification at a
distance.
Develop a multi-spectral infrared and visible face recognition system.
Solutions

Data mining lets you find out about my
private life


Privacy-preserving data mining
Data mining doesn’t always get it right

Data scientists know it and are working on it

Educate the user
Privacy-Preserving Data Mining
Data Perturbation

Construct a data set with noise added
 Can be released without revealing private data

Miners given the perturbed data set
 Reconstruct distribution to improve results

Solutions out there
 Decision trees, association rules

Debate: Does it really preserve privacy?
 Can we prove impossibility of noise removal?
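
A very small sketch of the perturbation idea follows, assuming zero-mean uniform noise and a fixed random seed. It recovers only the mean of the original values from the perturbed set, which is far simpler than reconstructing the whole distribution; the data, the noise scale, and the function name `perturb` are hypothetical.

```python
import random
from statistics import mean


def perturb(values, noise_scale, rng):
    """Return a copy of `values` with independent zero-mean uniform noise added."""
    return [v + rng.uniform(-noise_scale, noise_scale) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.randint(20, 70) for _ in range(10_000)]  # hypothetical private values
released = perturb(ages, noise_scale=30, rng=rng)

# Each released value may be off by up to 30 years, yet the zero-mean
# noise largely averages out when the values are aggregated.
print(round(mean(ages), 1), round(mean(released), 1))
```

Whether such noise can be filtered back out to expose individual values is exactly the open question the debate bullet raises.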
Privacy-Preserving Data Mining
Distributed Data Mining

Data owners keep their data
 Collaborate to get data mining results

Encryption techniques to preserve
privacy
 Proofs that private data is not disclosed

Solutions for Decision Trees, Association
Rules, Clustering
 Different solutions needed depending on
 how data is distributed, privacy constraints
What Next?

Data mining lets you find out about my private life



Constraints that allow us to restrict what models can be
learned
Can we ensure that data mining won’t produce
results that are amenable to misuse? (e.g., 100%
confidence models) Redlining example
Data mining doesn’t always get it right

Educate the public

What data mining does (and doesn’t do)
Do You Agree?


There is a great difference between an
inanimate machine knowing your secrets
and a person knowing the same.
Political solutions can control how and
why information goes from the machine
to trusted analysts who can act on the
knowledge.