Course Introduction - NYU Computer Science

Introduction to Data Mining
Dr. Hany Saleeb
Why Data Mining? Potential Applications
 Direct Marketing
 identify which prospects should be included in a mailing list
 Market segmentation
 identify common characteristics of customers who buy same products
 Market Basket Analysis
 Identify what products are likely to be bought together
 Insurance Claims Analysis
 discover patterns of fraudulent transactions
 compare current transactions against those patterns
What Is Data Mining?
 Combination of AI and statistical analysis to discover information that is “hidden” in the data
 associations (e.g. linking purchase of pizza with beer)
 sequences (e.g. tying events together: marriage and purchase of furniture)
 classifications (e.g. recognizing patterns such as the attributes of employees that are most likely to quit)
 forecasting (e.g. predicting buying habits of customers based on past patterns)
Expert systems or small ML/statistical programs
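To make the "associations" bullet concrete, here is a minimal sketch of how the strength of a rule such as pizza -> beer could be measured with support and confidence. The baskets and the function names are hypothetical illustrations, not data or code from the course.

```python
# Hypothetical toy baskets; not data from the slides.
BASKETS = [
    {"pizza", "beer"},
    {"pizza", "beer", "chips"},
    {"pizza", "soda"},
    {"beer", "chips"},
    {"pizza", "beer", "soda"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the whole itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    antecedent_count = count(antecedent, baskets)
    if antecedent_count == 0:
        raise ValueError("antecedent never occurs in the baskets")
    both = count(set(antecedent) | set(consequent), baskets)
    return both / antecedent_count

print(support({"pizza", "beer"}, BASKETS))       # support of {pizza, beer}
print(confidence({"pizza"}, {"beer"}, BASKETS))  # confidence of pizza -> beer
```

On this toy data the pair appears in 3 of 5 baskets, and 3 of the 4 pizza baskets also contain beer.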
What can data mining do?
 Classification
– Classify credit applicants as low, medium, high risk
– Classify insurance claims as normal, suspicious
 Estimation
– Estimate the probability of a direct mailing response
– Estimate the lifetime value of a customer
 Prediction
– Predict which customers will leave within six months
– Predict the size of the balance that will be transferred by a
credit card prospect
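As one possible illustration of the classification task (credit applicants as low, medium, or high risk), the sketch below uses a k-nearest-neighbor vote. The technique, the two features, and the training rows are all assumptions chosen for brevity; the slides do not prescribe a method.

```python
import math
from collections import Counter

# Hypothetical labeled applicants:
# (annual income in $1000s, debt-to-income ratio) -> risk class.
TRAINING = [
    ((95, 0.10), "low"),
    ((80, 0.15), "low"),
    ((60, 0.30), "medium"),
    ((55, 0.35), "medium"),
    ((30, 0.60), "high"),
    ((25, 0.70), "high"),
]

def distance(a, b):
    """Euclidean distance with income divided by 100 so both features have similar scale."""
    return math.hypot((a[0] - b[0]) / 100.0, a[1] - b[1])

def classify(applicant, training=TRAINING, k=3):
    """Label an applicant by majority vote among the k nearest training examples."""
    nearest = sorted(training, key=lambda row: distance(applicant, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((90, 0.12)))
print(classify((28, 0.65)))
```

The key point is that the classes are known in advance and the labeled examples drive the decision.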
What can data mining do?
(cont’d)
 Association
– Find out items customers are likely to buy together
– Find out what books to recommend to Amazon.com users
 Clustering
– Difference from classification: classes are unknown!
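In contrast to classification, clustering starts without labels. A minimal sketch, assuming k-means (Lloyd's algorithm) on a single hypothetical feature, monthly spend per customer; the data and the choice of algorithm are illustrative assumptions.

```python
# Hypothetical monthly spend per customer; no class labels are given.
SPEND = [12, 15, 14, 80, 85, 90, 300, 310]

def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centers, clusters)."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and len(values)")
    ordered = sorted(values)
    # Deterministic start: k evenly spaced points from the sorted data.
    centers = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep the old center if a cluster is empty.
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break  # assignments are stable
        centers = updated
    return centers, clusters

centers, clusters = kmeans_1d(SPEND, k=3)
print(centers)
print(clusters)
```

The groups that emerge still have to be interpreted by an analyst; the algorithm only proposes them.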
Market Analysis and Management
 Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
 Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
Conversion of a single bank account to a joint one: marriage, etc.
 Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
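One simple way to quantify a correlation between the sales of two products is the Pearson coefficient. The sketch below is an assumption about how such a check might look; the weekly sales figures are hypothetical.

```python
import math

# Hypothetical weekly unit sales for two products.
PIZZA_SALES = [10, 12, 15, 20]
BEER_SALES = [22, 25, 29, 41]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

print(pearson(PIZZA_SALES, BEER_SALES))
```

A value near +1 suggests the two products sell together; it does not by itself show that one drives the other.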
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Machine Learning, Information Science, Statistics, Visualization, and Other Disciplines]
Data Mining: On What Kind of Data?
 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining Process
[Figure: process diagram with the labels Learning, Collecting relevant data, Model building, Understanding of business, Problem identification, Business strategy and evaluation, Action]
Requirements/challenges in Data Mining
User interface
Mining methodology
Performance
Data source
Social and Security
Requirements/challenges in Data Mining (2)
User interface
- Data Visualization
Understandability and interpretation of results
Information representation and rendering
Screen real-estate
- Interactivity
Manipulation of mined knowledge
Focus and refine mining tasks
Focus and refine mining results
Requirements/challenges in Data Mining (3)
Mining Methodology
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Query languages
Expression and visualization of results
Handling noise and incomplete data
Pattern evaluation
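Pattern evaluation asks whether a discovered pattern is actually interesting. One common measure, used here as an assumed example rather than the course's prescribed one, is lift: how much more often two items occur together than they would if they were independent. The baskets are hypothetical.

```python
# Hypothetical toy baskets; not data from the slides.
BASKETS = [
    {"pizza", "beer"},
    {"pizza", "beer", "chips"},
    {"pizza", "soda"},
    {"beer", "chips"},
    {"pizza", "beer", "soda"},
]

def lift(a, b, baskets):
    """Lift of the rule a -> b: P(a and b) / (P(a) * P(b)), computed from counts."""
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_both = sum(1 for basket in baskets if a in basket and b in basket)
    if count_a == 0 or count_b == 0:
        raise ValueError("both items must occur at least once")
    return n * count_both / (count_a * count_b)

print(lift("pizza", "beer", BASKETS))
```

Here 75% of pizza baskets contain beer, yet beer is in 80% of all baskets, so the lift is below 1 and the rule is less interesting than its confidence alone suggests.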
Requirements/challenges in Data Mining (4)
Performance
Efficiency and scalability of data mining algorithms
Linear algorithms needed
Parallel and distributed methods
Incremental methods
Divide and conquer?
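As an illustration of what an "incremental method" can look like, the sketch below keeps a running mean and variance that are updated one record at a time in a single linear pass, using Welford's algorithm. The class name and the sample values are assumptions for illustration.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new value into the statistics without revisiting old data."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical stream of measurements
    stats.update(value)
print(stats.mean, stats.variance)
```

Because each update touches only three numbers, new records can be absorbed as they arrive instead of rescanning the whole database.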
Requirements/challenges in Data Mining (5)
Data Source
Diversity of data types
Handling complex types of data
Mining information from heterogeneous databases or information repositories
Can we expect a DM algorithm to do well on all types of data?
Data glut
Are we collecting the right data for the right answer?
Distinguish between important and unimportant data
Requirements/challenges in Data Mining (6)
Social and Security
- Social Impact
Private and sensitive data is gathered and mined without the individual’s knowledge and/or consent
Appropriate use and distribution of discovered knowledge
- Regulations
Need for privacy and DM policies
Data Mining Tools
DBMiner: A free tool
 DBMiner: A data mining system that originated in the Intelligent Database Systems Lab and was further developed by DBMiner Technology Inc.
 OLAM (on-line analytical mining) architecture for interactive mining of multi-level knowledge in both RDBMS and data warehouses
 Mining knowledge on Microsoft SQL Server 7.0 databases and/or data warehouses
 Multiple mining functions: discovery-driven OLAP, association, classification and clustering
Input and Output
 Input: SQL Server 7.0 data cubes, which are constructed from single or multiple relational tables, data warehouses or spreadsheets (with OLE DB and RDBMS connections)
 Multiple outputs
 Summarization and discovery-driven OLAP: crosstabs and graphical outputs using MS Excel 2000
 Association: rule tables, rule planes and ball graphs
 Classification: decision trees and decision tables
 Clustering: maps and summarization graphs
 Others:
Data and cube views
Visualization of concept hierarchies
Visualization for task management
Visualization of 2-D and 3-D boxplots
Data Mining Tasks
DBMiner covers the following functions
Discovery-driven, OLAP-based multi-dimensional analysis
Association and frequent pattern analysis
Classification (decision tree analysis)
Cluster analysis
3-D cube viewer and analyzer
 Other functions
OLAP service, cube exploration, statistical analysis
Sequential pattern analysis
Visual classification
Summary
Knowing one’s business is critical; technologies are coming together to support data mining.
Data mining is both the process and the result of knowledge production, knowledge discovery and knowledge management.