Download Business Intelligence and Data Mining - Hui Xiong

Transcript
Business Intelligence and Data Mining
Dr. Hui Xiong
Rutgers University
Learning Objectives
• Understand the need for business intelligence systems.
• Know the characteristics of reporting systems.
• Know the purpose and role of data warehouses and data marts.
• Understand fundamental data‐mining techniques.
• Know the purpose, features, and functions of knowledge management systems.
The Need for Business Intelligence Systems
• According to a study done at the University of California at Berkeley, a total of 403 petabytes of new data were created.
• 403 petabytes is roughly the amount of all printed material ever written.
– The printed collection of the Library of Congress is .01 petabytes.
– 400 petabytes equals 40,000 copies of the print collection of the Library of Congress.
Figure 9‐1 How big is an Exabyte?
Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
The Need for Business Intelligence Systems (Continued)
• The generation of all these data has much to do with Moore’s Law.
• The capacity of storage devices increases as their costs decrease.
• Today, storage capacity is nearly unlimited.
• We are drowning in data and starving for information.
Figure 9‐2 Hard‐Disk Storage Capacity
Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
Business Intelligence Tools
• Tools for searching business data in an attempt to find patterns are called business intelligence (BI) tools.
• Reporting tools are programs that read data from a variety of sources, process that data, produce formatted reports, and deliver those reports to the users who need them.
• The processing of data is simple:
– Data are sorted and grouped.
– Simple totals and averages are calculated.
• Reporting tools are used primarily for assessment.
– They are used to address questions like:
• What has happened in the past?
• What is the current situation?
• How does the current situation compare to the past?
Business Intelligence Tools (Continued)
• Data‐mining tools process data using statistical techniques, many of which are sophisticated and mathematically complex.
• Data mining involves searching for patterns and relationships among data.
• In most cases, data‐mining tools are used to make predictions.
• For example, we can use one form of analysis to compute the probability that a customer will default on a loan.
• Another way to distinguish reporting tools from data‐mining tools is:
– Reporting tools use simple operations like sorting, grouping, and summing.
– Data‐mining tools use sophisticated techniques.
Business Intelligence Systems
• An information system is a collection of hardware, software, data, procedures, and people.
• The purpose of a business intelligence (BI) system is to provide the right information, to the right user, at the right time.
• BI systems help users accomplish their goals and objectives by producing insights that lead to actions.
Business Intelligence Systems (Continued)
• A reporting tool can generate a report that shows a customer has canceled an important order.
• A reporting system, however, alerts that customer’s salesperson with this unwanted news, and does so in time for the salesperson to try to alter the customer’s decision.
• A data‐mining tool can create an equation that computes the probability that a customer will default on a loan.
• A data‐mining system uses that equation to enable banking personnel to assess new loan applications.
Reporting Systems
• The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper user on a timely basis.
• Reporting systems generate information from data as a result of four operations:
– Filtering data
– Sorting data
– Grouping data
– Making simple calculations on the data
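The four reporting operations listed above can be illustrated with a minimal Python sketch. The trade records and field names here are hypothetical and invented only for illustration; they are not the trade data shown in the figures.

```python
from collections import defaultdict

# Hypothetical trade records; symbols and field names are illustrative only.
trades = [
    {"symbol": "AAA", "side": "buy", "shares": 100, "price": 10.0},
    {"symbol": "BBB", "side": "sell", "shares": 50, "price": 20.0},
    {"symbol": "AAA", "side": "buy", "shares": 200, "price": 11.0},
    {"symbol": "CCC", "side": "buy", "shares": 10, "price": 5.0},
]


def build_report(records):
    # 1. Filtering: keep only the buy trades.
    buys = [r for r in records if r["side"] == "buy"]
    # 2. Sorting: order the remaining trades by symbol.
    buys.sort(key=lambda r: r["symbol"])
    # 3. Grouping: collect the trades under their symbol.
    groups = defaultdict(list)
    for r in buys:
        groups[r["symbol"]].append(r)
    # 4. Simple calculations: total shares and average price per group.
    report = {}
    for symbol, rows in groups.items():
        report[symbol] = {
            "total_shares": sum(r["shares"] for r in rows),
            "avg_price": sum(r["price"] for r in rows) / len(rows),
        }
    return report


report = build_report(trades)
for symbol, summary in report.items():
    print(symbol, summary)
```

Nothing here goes beyond filtering, sorting, grouping, and simple arithmetic, which is the point the slide makes about reporting systems.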
Figure 9‐3 Trade Data for NDX.X (NASDAQ 100)
Components of Reporting Systems
Figure 9‐4 Report Based on Trade Data in Figure 9‐3
Figure 9‐5 Components of a Reporting System
• A reporting system maintains a database of reporting metadata.
• The metadata describes the reports, users, groups, roles, events, and other entities involved in the reporting activity.
• The reporting system uses the metadata to prepare and deliver reports to the proper users on a timely basis.
Figure 9‐6 Summary of Report Characteristics
Report Type
• In terms of a report type, reports can be static or dynamic.
• Static reports are prepared once from the underlying data, and they do not change.
– Example: a report of the past year’s sales
• Dynamic reports: the reporting system reads the most current data and generates the report using that fresh data.
– Examples: a report on today’s sales and a report on current stock prices
Report Type (Continued)
• Query reports are prepared in response to data entered by users.
• Online analytical processing (OLAP) reports allow the user to dynamically change the report grouping structures.
Report Media
• Reports are delivered via many different report media or channels.
• Some reports are printed on paper, and others are created in a format like PDF whereby they can be printed or viewed electronically.
• Other reports are delivered to computer screens.
• Companies sometimes place reports on internal corporate Web sites for employees to access.
Report Media (Continued)
• Another report medium is a digital dashboard, which is an electronic display customized for a particular user.
– Vendors like Yahoo! and MSN provide common examples.
– Users of these services can define content they want, say, a local weather forecast, a list of stock prices, or a list of news sources.
– The vendor constructs the display customized for each user.
Figure 9‐7 Digital Dashboard Example
Report Media (Continued)
• Other dashboards are particular to an organization.
– The organization might have a dashboard that shows up‐to‐the‐minute production and sales activities.
• Alerts are another form of report.
– Users can declare that they wish to receive notifications of events, say, via email or on their cell phones.
• Reports can be published via a Web service.
– The Web service produces the report in response to requests from the service‐consuming application.
Report Mode
• The report mode can be either push report or pull report.
• Organizations send a push report to users according to a preset schedule.
– Users receive the report without any activity on their part.
• Users must request a pull report.
– To obtain a pull report, a user goes to a Web portal or digital dashboard and clicks a link or button to cause the reporting system to produce and deliver the report.
Functions of Reporting Systems
• Three functions of reporting systems are:
– Authoring
– Management
– Delivery
• Report authoring involves connecting to data sources, creating the reporting structure, and formatting the report.
Report Management
• The purpose of report management is to define who receives what reports, when, and by what means.
• Most report‐management systems allow the report administrator to define user accounts and user groups and to assign particular users to particular groups.
• Reports that have been created using the report‐authoring system are assigned to groups and users.
Report Management (Continued)
• Assigning reports to groups saves the administrator work.
– When a report is created, changed, or removed, the administrator need only change the report assignments to the group.
– All of the users in the group will inherit the changes.
• Metadata also indicates what channel is to be used and whether the report is to be pushed or pulled.
– If the report is to be pushed, the administrator declares whether the report is to be generated on a regular schedule or as an alert.
Report Delivery
• The report‐delivery function of a reporting system pushes reports or allows them to be pulled according to report‐management metadata.
• Reports can be delivered via an email server, Web site, XML Web services, or by other program‐specific means.
• The report‐delivery system uses the operating system and other program security components to ensure that only authorized users receive authorized reports.
Report Delivery (Continued)
• The report‐delivery system also ensures that push reports are produced at appropriate times.
• For query reports, the report‐delivery system serves as an intermediary between the user and the report generator.
– It receives user query data, such as item numbers in an inventory query, passes the query data to the report generator, receives the resulting report, and delivers the report to the user.
Online Analytical Processing
• Online analytical processing (OLAP) provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data.
• The remarkable characteristic of OLAP reports is that they are dynamic.
• The viewer of the report can change the report’s format, hence, the term online.
Online Analytical Processing
• An OLAP report has measures and dimensions.
• A measure is the data item of interest.
– It is the item that is to be summed or averaged or otherwise processed in the OLAP report.
• A dimension is a characteristic of a measure.
– Purchase date, customer type, customer location, and sales region are all examples of dimensions.
Online Analytical Processing (Continued)
• With an OLAP report, it is possible to drill down into the data.
– This term means to further divide the data into more detail.
• Special‐purpose products called OLAP servers have been developed to perform OLAP analysis.
• An OLAP server reads data from an operational database, performs preliminary calculations, and stores the results of those operations in an OLAP database.
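As a rough illustration of measures, dimensions, and drilling down, the following Python sketch sums one measure over a changing set of dimensions. The sales facts, the dimension names (`store_type`, `region`), and the `summarize` helper are assumptions made for this example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: "sales" is the measure, the other fields are dimensions.
facts = [
    {"store_type": "Mall", "region": "East", "sales": 100},
    {"store_type": "Mall", "region": "West", "sales": 150},
    {"store_type": "Outlet", "region": "East", "sales": 80},
    {"store_type": "Outlet", "region": "East", "sales": 20},
]


def summarize(rows, dimensions, measure="sales"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Report grouped by a single dimension.
by_type = summarize(facts, ["store_type"])
# Drill down: divide the same measure further by adding a second dimension.
by_type_and_region = summarize(facts, ["store_type", "region"])

print(by_type)
print(by_type_and_region)
```

Changing the list of dimensions changes the grouping structure of the report without touching the underlying data, which is the dynamic behavior the slides attribute to OLAP reports.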
Figure 9‐13 OLAP Family and Store Location by Store Type
Figure 9‐14 Role of OLAP Server and OLAP Database
Data Warehouses and Data Marts
• Basic reports and simple OLAP analyses can be made directly from operational data.
• For the most part, such reports display the current state of the business; and if there are a few missing values or small inconsistencies with the data, no one is too concerned.
• Operational data are unsuited to more sophisticated analyses, particularly, data‐mining analyses that require high‐quality input for accurate and useful results.
Data Warehouses and Data Marts (Continued)
• Many organizations choose to extract operational data into facilities called data warehouses and data marts, both of which are facilities that prepare, store, and manage data specifically for data mining and other analyses.
• Programs read operational data and extract, clean, and prepare that data for BI processing.
• The prepared data are stored in a data‐warehouse database using a data‐warehouse DBMS, which can be different from the organization’s operational DBMS.
Data Warehouses and Data Marts
Figure 9‐15 Components of a Data Warehouse
• Data warehouses include data that are purchased from outside sources.
• Metadata concerning the data, its source, its format, its assumptions and constraints, and other facts about the data is kept in a data‐warehouse metadata database.
• The data‐warehouse DBMS extracts and provides data to business intelligence tools such as data‐mining programs.
Figure 9‐16 Consumer Data Available for Purchase from Data Vendors
Problems with Operational Data (Continued)
• Inconsistent data are particularly common for data that have been gathered over time.
– When an area code changes, for example, the phone number for a given customer before the change will not match the customer’s number after the change.
• Some data inconsistencies occur from the nature of the business activity.
• Nonintegrated data can cause problems when data comes from different management information systems.
Figure 9‐17 Problems of Using Transaction Data for Analysis and Data Mining
Data Warehouses Versus Data Marts
• The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes the data, and locates the data on the shelves, so to speak, of the data warehouse.
• A data mart is a data collection, smaller than the data warehouse, that addresses a particular component or functional area of the business.
Data Warehouses Versus Data Marts (Continued)
Figure 9‐18 Data Mart Examples
• The data warehouse is like the distributor in the supply chain and the data mart is like the retail store in the supply chain.
• Users in the data mart obtain data that pertain to a particular business function from the data warehouse.
• It is expensive to create, staff, and operate data warehouses and data marts.
Data Mining and Business Intelligence
Knowledge Discovery in Data
Dr. Hui Xiong
Rutgers University
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e‐commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists – in classifying and segmenting data
– in Hypothesis Formation
Mining Large Data Sets ‐ Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, for 1995 to 1999; vertical axis from 0 to 4,000,000]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Scale of Data
Organization        Scale of Data
Walmart             ~ 20 million transactions/day
Google              ~ 8.2 billion Web pages
Yahoo               ~ 10 GB Web data/hr
NASA satellites     ~ 1.2 TB/day
NCBI GenBank        ~ 22 million genetic sequences
France Telecom      29.2 TB
UK Land Registry    18.3 TB
AT&T Corp           26.2 TB
Why Do We Need Data Mining ?
• Leverage organization’s data assets
– Only a small portion (typically 5%‐10%) of the collected data is ever analyzed
“The great strength of computers is that
they can reliably manipulate vast amounts
of data very quickly. Their great weakness is
that they don’t have a clue as to what any
Why Do We Need Data Mining?
• As databases grow, the ability to support the decision support process using traditional query languages becomes infeasible
– Data that may never be analyzed continues to be collected, at a great expense, out of fear that something which may prove important in the future is missing.
– Many queries of interest are difficult to state in a query language (Query formulation problem)
• “find all cases of fraud”
• “find all individuals likely to buy a Ford Expedition”
• “find all documents that are similar to this customer’s problem”
– Growth rates of data preclude the traditional “manually intensive” approach
What is Data Mining?
• Many Definitions
– Non‐trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi‐automatic means, of large quantities of data in order to discover meaningful patterns
What is (not) Data Mining?
• What is not Data Mining?
– Look up phone number in phone directory
– Check the dictionary for the meaning of a word
• What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Data Mining: Confluence of Multiple Disciplines
?
20x20 ~ 2^400 ≈ 10^120 patterns
Data Mining Applications
• Market analysis
• Risk analysis and management
• Fraud detection and detection of unusual patterns (outliers)
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• DNA and bio‐data analysis
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, …
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone‐call fraud
• Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti‐terrorism
Data Mining and Business Intelligence
Data Mining Tasks …
[Figure: data mining tasks illustrated on the sample data set below]
Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
11    No       Married          60K              No
12    Yes      Divorced         220K             No
13    No       Single           85K              Yes
14    No       Married          75K              No
15    No       Single           90K              Yes
Clustering
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: Intra‐cluster distances are minimized; inter‐cluster distances are maximized]
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
Discovered Clusters, by Industry Group
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
• Summarization
– Reduce the size of large data sets
[Figure: Use of K‐means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.]
What is not Cluster Analysis?
• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data
• Supervised classification
– Have class label information
• Association Analysis
– Local vs. global connections
Notion of a Cluster can be Ambiguous
[Figure: How many clusters? The same points grouped as Two Clusters, Four Clusters, and Six Clusters]
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
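The document‐clustering approach described above (count the terms in each document, then form a similarity measure from those frequencies) can be sketched as follows. The slides do not name a specific measure; cosine similarity is assumed here as one common choice, and the three short documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase term occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (an assumed measure)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, echoing the "Amazon" ambiguity mentioned earlier.
docs = {
    "d1": "amazon rainforest trees river",
    "d2": "amazon river rainforest wildlife",
    "d3": "amazon online shopping books",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_d1_d2 = cosine_similarity(tf["d1"], tf["d2"])
sim_d1_d3 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_d1_d2, sim_d1_d3)
```

A clustering algorithm would then group d1 with d2 rather than with d3, because their similarity is higher; the clustering step itself is not shown here.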
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non‐overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering
[Figure: Original Points; A Partitional Clustering]
Hierarchical Clustering
[Figure: points p1 to p4 shown as a Traditional Hierarchical Clustering with its Traditional Dendrogram, and as a Non-traditional Hierarchical Clustering with its Non-traditional Dendrogram]
Other Distinctions Between Sets of Clusters
• Exclusive versus non‐exclusive
– In non‐exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non‐fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
• Well‐separated clusters
• Center‐based clusters
• Contiguous clusters
• Density‐based clusters
• Property or Conceptual
• Described by an Objective Function
Types of Clusters: Well‐Separated
• Well‐Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center‐Based
• Center‐based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
[Figure: 4 center-based clusters]
Types of Clusters: Contiguity‐Based
• Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]
Types of Clusters: Density‐Based
• Density‐based
– A cluster is a dense region of points, which is separated by low‐density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.
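The two kinds of cluster center mentioned for center‐based clusters, the centroid and the medoid, can be compared with a short Python sketch. The slides only call the medoid the most “representative” point; the sketch assumes the common reading that it is the member point with the smallest total distance to the other members. The sample points are made up.

```python
import math


def centroid(points):
    """Average of all the points, coordinate by coordinate."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


def medoid(points):
    """Member point with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical cluster with one far-away member.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (15.0, 15.0)]

center_c = centroid(cluster)
center_m = medoid(cluster)
print(center_c, center_m)
```

The centroid need not be one of the data points and is pulled toward the far‐away member, whereas the medoid is always an actual member of the cluster.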
[Figure: 2 Overlapping Circles]
Characteristics of the Input Data Are Important
• Type of proximity or density measure
– This is a derived measure, but central to clustering
• Sparseness
– Dictates type of similarity
– Adds to efficiency
• Attribute type
– Dictates type of similarity
• Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
Data Mining Tasks …
[Figure: data mining tasks illustrated on a sample data set with columns Tid, Refund, Marital Status, Taxable Income, and Cheat]
Association Rule Discovery: Definition
• Given a set of records each of which contain some number of items from a given collection
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Analysis: Applications
• Market‐basket analysis
– Rules are used for sales promotion, shelf management, and inventory management
• Telecommunication alarm diagnosis
– Rules are used to find combination of alarms that occur together frequently in the same time period
• Medical Informatics
– Rules are used to find combination of patient symptoms and complaints associated with certain diseases
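To see why the two rules listed under “Rules Discovered” are plausible, the sketch below scores them on the five transactions from the TID/Items table. The slides do not define a scoring measure; support and confidence are assumed here as the usual ones, and the function names are illustrative.

```python
# The five transactions from the TID / Items table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

For example, Milk appears in four transactions and three of those also contain Coke, so the rule {Milk} --> {Coke} holds in three of five transactions overall and in three of the four that contain Milk.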
Data Mining Tasks …
Application Deployment Challenge
[Figure: data mining tasks illustrated on a sample data set with columns Tid, Refund, Marital Status, Taxable Income, and Cheat]
Predictive Modeling: Classification
• Find a model for class attribute as a function of the values of other attributes
Classification Example
[Figure: Model for predicting credit worthiness. A training set with columns Tid, Employed, Level of Education, # years at present address, and Credit Worthy is used to learn a classifier, shown as a decision tree that splits on Employed, Education, and Number of years. The model is then applied to a test set whose Credit Worthy values are unknown.]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha‐helix, beta‐sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc
• Identifying intruders in the cyberspace
Classification: Application 1
• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its account‐holder as attributes.
– When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 2
• Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past and present customers, to find attributes.
– How often the customer calls, where he calls, what time‐of‐the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 3
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) ‐ 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red‐shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Classification Techniques
• Base Classifiers
– Decision Tree based Methods
– Rule‐based Methods
– Nearest‐neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
• Ensemble Classifiers
– Boosting, Bagging, Random Forests
Example of a Decision Tree
Another Example of Decision Tree
MarSt
Splitting Attributes
ID
Home
Owner
Marital
Status
Annual Defaulted
Income Borrower
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
ID
Home
Owner
Yes
No
NO
MarSt
Married
Single, Divorced
Income
< 80K
NO
> 80K
YES
NO
Home
Owner
Marital
Status
Annual Defaulted
Income Borrower
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
60K
Married
Single,
Divorced
NO
Yes
Home
Owner
NO
No
Income
< 80K
> 80K
Yes
6
No
Married
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
YES
NO
No
There could be more than one tree that
fits the same data!
10
10
Model:
Decision
Tree
Training
Data
Decision Tree Classification Task
Tid
Attrib1
Attrib3
Class
1
Yes
Large
Attrib2
125K
No
2
No
Medium
100K
No
3
No
Small
70K
No
4
Yes
Medium
120K
No
5
No
Large
95K
Yes
6
No
Medium
60K
No
7
Yes
Large
220K
No
8
No
Small
85K
Yes
9
N
No
M di
Medium
75K
N
No
10
No
Small
90K
Yes
Apply Model to Test Data
Start from the
root of tree.
Home
Owner
Yes
Learn
Model
NO
Tid
Attrib1
Attrib3
Class
11
No
Small
Attrib2
55K
?
12
Yes
Medium
80K
?
13
Yes
Large
110K
?
14
No
Small
95K
?
15
No
Large
67K
?
No
Marital
Status
Married
Annual Defaulted
Income Borrower
80K
?
10
No
MarSt
Single, Divorced
10
Decision
Tree
Apply
Model
Test Data
Home
Owner
Income
< 80K
Married
NO
> 80K
YES
NO
10
Apply Model to Test Data (Continued)
• Home Owner = No: follow the “No” branch from the root to the MarSt node
• Marital Status = Married: follow the “Married” branch to the leaf NO
• Assign Defaulted to “No”
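The traversal in the Apply Model to Test Data slides can be sketched in a few lines of Python. This is only an illustration of walking the tree shown on the slides, not a tree-induction algorithm. The field names (`home_owner`, `marital_status`, `annual_income_k`) are made up for this sketch, and because the slides label the income branches only as "< 80K" and "> 80K", sending an income of exactly 80K to the "> 80K" branch is an assumption.

```python
def classify(record):
    """Walk the slide's decision tree from the root and return the class label."""
    # Root node: Home Owner
    if record["home_owner"] == "Yes":
        return "No"
    # Home Owner = No: test Marital Status
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test Annual Income (in thousands).
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if record["annual_income_k"] < 80 else "Yes"


# The test record from the slides: Home Owner = No, Married, 80K.
test_record = {"home_owner": "No", "marital_status": "Married", "annual_income_k": 80}
print(classify(test_record))  # the record ends at the "Married" leaf

# The ten training records: (home owner, marital status, income in K, defaulted).
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
predictions = [
    classify({"home_owner": h, "marital_status": m, "annual_income_k": inc})
    for h, m, inc, _ in training
]
```

Comparing `predictions` with the last column of `training` is a quick check that the tree fits the training data.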
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
Data Mining Tasks …
Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Anomaly Detection
• Challenges
– How many outliers are there in the data?
– Method is unsupervised
• Validation can be quite challenging (just like for clustering)
– Finding a needle in a haystack
• Working assumption
– There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General Steps
– Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
– Graphical & Statistical‐based
– Distance‐based
– Model‐based
Graphical Approaches
• Boxplot (1‐D), Scatter plot (2‐D), Spin plot (3‐D)
• Limitations
– Time consuming
– Subjective
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
– Data distribution
– Parameter of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
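As a rough illustration of the statistical approach, the sketch below assumes the data are roughly normal, estimates the mean and standard deviation, and flags values whose z-score exceeds a cutoff. The cutoff of 2.5 and the sample readings are assumptions chosen for the example, not values from the slides.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 9.7, 10.3, 10.0, 25.0]
outliers = zscore_outliers(readings)
print(outliers)
```

An extreme value inflates both the estimated mean and the standard deviation, so on small samples a conservative cutoff can hide the very outlier being sought.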
Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, data distribution may not be known
• For high dimensional data, it may be difficult to estimate the true distribution
Distance‐based Approaches
• Data is represented as a vector of features
• Three major approaches
– Nearest‐neighbor based
– Density based
– Clustering based
Density‐based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of its nearest neighbors to the density of sample p
• Outliers are points with the largest LOF values
[Figure: two points, p1 and p2, each marked with ×]
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
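The contrast between the NN approach and the LOF approach can be sketched as follows. This is a simplified density-ratio score, not the full LOF definition (which is built on reachability distances). Density is taken here as the inverse of the average distance to the k nearest neighbors, and the points, the choice k = 2, and the function names are all assumptions for the illustration. The points lie on a line so the distances are easy to check by hand.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def kth_nn_distance(points, i, k):
    """NN-style score: distance from points[i] to its kth nearest neighbor."""
    return math.dist(points[i], points[k_nearest(points, i, k)[-1]])


def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors.

    Assumes no duplicate points (an average distance of 0 would divide by zero).
    """
    neighbors = k_nearest(points, i, k)
    return k / sum(math.dist(points[i], points[j]) for j in neighbors)


def density_ratio_score(points, i, k):
    """Average of (neighbor density / own density); large values suggest outliers."""
    own = density(points, i, k)
    return sum(density(points, j, k) / own for j in k_nearest(points, i, k)) / k


dense = [(float(x), 0.0) for x in (0, 1, 2, 3, 4)]             # tight cluster
sparse = [(float(x), 0.0) for x in (100, 120, 140, 160, 180)]  # loose cluster
p1 = (400.0, 0.0)  # far from everything
p2 = (15.0, 0.0)   # close to the tight cluster, but outside it
points = dense + sparse + [p1, p2]
K = 2

by_knn = sorted(range(len(points)), key=lambda i: kth_nn_distance(points, i, K), reverse=True)
by_ratio = sorted(range(len(points)), key=lambda i: density_ratio_score(points, i, K), reverse=True)
top_knn = [points[i] for i in by_knn[:3]]
top_ratio = {points[i] for i in by_ratio[:2]}
print(top_knn)
print(top_ratio)
```

In this toy data the kth-nearest-neighbor distance ranks p2 below every point of the loose cluster, while the density ratio puts p1 and p2 at the top, since p2 is far away relative to the spacing of its own neighbors.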
Nearest‐Neighbor Based Approach
• Approach:
– Compute the distance between every pair of data points
– There are various ways to define outliers:
• Data points for which there are fewer than p neighboring points within a distance D
• The top n data points whose distance to the kth nearest neighbor is greatest
• The top n data points whose average distance to the k nearest neighbors is greatest
Clustering‐Based
• Basic idea:
– Cluster the data into groups of different density
– Choose points in small clusters as candidate outliers
– Compute the distance between candidate points and non‐candidate clusters
• If candidate points are far from all other non‐candidate points, they are outliers
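The clustering-based idea can be sketched as below. The clustering step here is a simple stand-in (points are linked when they lie within `eps` of each other, and linked points form a cluster), not a density-aware clustering algorithm, and the parameters `eps`, `min_size`, and `far` and the sample points are assumptions for the example.

```python
import math


def cluster(points, eps):
    """Group point indices into clusters by linking points at most eps apart."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        clusters.append(component)
    return clusters


def clustering_outliers(points, eps, min_size, far):
    """Candidates are points in small clusters; keep those far from all normal points."""
    clusters = cluster(points, eps)
    normal = [i for c in clusters if len(c) >= min_size for i in c]
    candidates = [i for c in clusters if len(c) < min_size for i in c]
    outliers = []
    for i in candidates:
        # Distance from the candidate to the closest non-candidate point.
        nearest = min((math.dist(points[i], points[j]) for j in normal), default=math.inf)
        if nearest > far:
            outliers.append(points[i])
    return sorted(outliers)


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # cluster A
    (10, 10), (10, 11), (11, 10), (11, 11),  # cluster B
    (2, 2),    # small cluster, but close to cluster A
    (30, 30),  # small cluster, far from everything
]
result = clustering_outliers(points, eps=1.2, min_size=3, far=3.0)
print(result)
```

Here (2, 2) is a candidate because it forms a cluster of its own, but it is not reported because it sits close to cluster A; only (30, 30) is far from all non-candidate points.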
KDD Process
• Develop an understanding of the application domain
– Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits
• Create target data set
– Collect initial data, describe, focus on a subset of variables, verify data quality
• Data cleaning and preprocessing
– Remove noise, outliers, missing fields, time sequence information, known trends, integrate data
• Data Reduction and projection
– Feature subset selection, feature construction, discretizations, aggregations
• Selection of data mining task
– Classification, segmentation, deviation detection, link analysis
• Select data mining approach
• Data mining to extract patterns or models
• Interpretation and evaluation of patterns/models
• Consolidating discovered knowledge
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• Data from Multi‐Sources
Similarities Between Data Miners and Doctors
[Figure labels: Knowledge Discovery, Data Characteristics, Data Mining Techniques, Textbooks, Medical Devices]
Commercial and Research Tools
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
SAS: http://www.sas.com/
Clementine: http://www.spss.com/spssbi/clementine/
Intelligent Miner: http://www‐3.ibm.com/software/data/iminer/
Insightful Miner: http://www.insightful.com/products/product.asp?PID=26
Knowledge Management
• Knowledge management systems concern the sharing of knowledge that is already known to exist, either in libraries of documents, in the heads of employees, or in other known sources.
• Knowledge management (KM) is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers, customers, and others who need that capital.
Knowledge Management (Continued)
• Knowledge management is a process that is supported by the five components of an information system.
– Its emphasis is on people, their knowledge, and effective means for sharing that knowledge with others.
• The benefits of KM concern the application of knowledge to enable employees and others to leverage organizational knowledge to work smarter.
Content Management Systems
• Content management systems are information systems that track organizational documents, Web pages, graphics, and related materials.
• Such systems differ from operational document systems in that they do not directly support business operations.
• KM content management systems are concerned with the creation, management, and delivery of documents that exist for the purpose of imparting knowledge.
• KM preserves organizational memory by capturing and storing the lessons learned and best practices of key employees.
Content Management Systems (Continued)
• Typical users of content management systems are companies that sell complicated products and want to share their knowledge of those products with employees and customers.
• The basic functions of content management systems are the same as for report management systems: author, manage, and deliver.
• The only requirement that content managers place on document authoring is that the document has been created in a standardized format.
Content Management Problems
• Documents may refer to one another or multiple documents may refer to the same product or procedure.
– When one of them changes, others must change as well.
– Some content management systems keep semantic linkages among documents so that content dependencies can be known and used to maintain document consistency.
• Document contents are perishable.
– Documents become obsolete and need to be altered, removed, or replaced.
• Multinational companies have to ensure document language translations.
Figure 9‐23 Document Management at Microsoft.com (as of December 2003)
Source: microsoft.com/backstage/inside.htm (accessed February 2004). © 2003 Microsoft Corporation. All rights reserved.
Figure 9‐24 Reporting Services: United States
Source: Used with permission of Tom Rizzo of Microsoft Corporation.
Figure 9‐25 Reporting Services: China
Content Delivery
• Almost all users of content management systems pull the contents.
• Users cannot pull content if they do not know it exists.
– The content must be arranged and indexed, and a facility for searching the content devised.
• Documents that reside behind a corporate firewall, however, are not publicly accessible and will not be reachable by Google or other search engines.
– Organizations must index their own proprietary documents and provide their own search capability for them.
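One common way to provide such a search capability is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal illustration under that assumption; the document ids and texts are hypothetical, and a real system would also need ranking, access control, and incremental updates.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Hypothetical internal documents that sit behind the corporate firewall.
documents = {
    "kb-001": "Installing the printer driver on Windows",
    "kb-002": "Printer paper jam troubleshooting",
    "kb-003": "Resetting a Windows password",
}
index = build_index(documents)
print(sorted(search(index, "windows printer")))
```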
Source: Used with permission of Tom Rizzo of Microsoft Corporation.
KM Systems to Facilitate the Sharing of Human Knowledge
• Nothing is more frustrating for a manager to contemplate than the situation in which one employee struggles with a problem that another employee knows how to solve easily.
• KM systems are concerned not only with the sharing of content, but also with the sharing of knowledge among humans.
– How can one person share her knowledge with another?
– How can one person learn of another person’s great idea?
Figure 9‐26 Technology Support of Sharing Human Knowledge
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
• Three forms of technology are used for knowledge‐sharing among humans:
– Portals, discussion groups, and email
– Collaboration systems
– Expert systems
Portals
– Employees can share ideas by posting knowledge on a Web portal whereby managers and employees can pull the knowledge from the portal.
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
Discussion Groups
– Discussion groups allow employees or customers to post questions and queries seeking solutions to problems they have.
– Oracle, IBM, PeopleSoft, and other vendors support product discussion groups where users can post questions and where employees, vendors, and other users can answer them.
– Later, the organization can edit and summarize the questions from such discussion groups into frequently asked questions (FAQs).
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
Discussion groups (continued)
– Basic email can also be used for knowledge‐sharing, especially if email lists have been constructed with KM in mind.
– Two human factors inhibit knowledge‐sharing.
• Employees can be reluctant to exhibit their ignorance.
• Competition exists between employees.
– A KM application may be ill‐suited to a competitive group.
• The company may be able to restructure rewards and incentives to foster sharing of ideas among employees.
Figure 9‐27 Net Meeting Graphic
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
Collaboration Systems
– Collaboration systems are information systems that enable people to work together more effectively.
– The Internet can be used as a broadcast medium for speeches, panel discussions, and other types of meetings.
– Web broadcasts, because they are digital, can be readily saved and replayed at the viewer’s convenience.
– Web broadcasts can also be made interactive by combining them with discussion group bulletin boards that are live during the broadcast.
– Video conferencing is another popular form of IT‐supported meetings.
• Video‐conferencing equipment is expensive and normally is located in selected sites in the organization.
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
Expert Systems
– Expert systems are created by interviewing experts in a given business domain and codifying the rules stated by those experts.
– Many expert systems were created in the late 1980s and 1990s, and some of them have been successful.
– Expert systems suffer from three major disadvantages.
• They are difficult and expensive to develop.
• They are difficult to maintain.
• They were unable to live up to the high expectations set by their name.
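To make "codifying the rules stated by those experts" concrete, here is a toy sketch of rules stored as data and matched against known facts. The help-desk rules and fact names are entirely hypothetical, and real expert systems typically chain rules together and handle uncertainty, which this sketch does not attempt.

```python
# Each rule pairs a set of required conditions with a recommended action.
# These rules are invented for illustration only.
RULES = [
    ({"printer_offline", "cable_unplugged"}, "plug_in_cable"),
    ({"printer_offline", "cable_ok"}, "restart_print_spooler"),
    ({"paper_jam"}, "open_tray_and_clear_paper"),
]


def advise(facts):
    """Return the action of every rule whose conditions are all among the facts."""
    return [action for conditions, action in RULES if conditions <= facts]


print(advise({"printer_offline", "cable_ok"}))
```

Even in this tiny form, the maintenance burden is visible: every new situation the experts describe means another rule to add, and rules can start to overlap or conflict.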