Knowledge Management: Concepts, Models and Applications
Jing Luan, Ph.D. (栾晶)
Chief Planning, Research, & Knowledge Systems Officer
Cabrillo College
Founder, Knowledge Discovery Laboratory
Beijing, China
October 2002
What’s Covered:
1. Why KM?
2. Key KM Concepts
3. Tiered Knowledge Management (TKM) Model
4. KM Applications in the US
5. Data Mining Background
6. Data Mining Algorithms and Applications
7. Demonstration of Data Mining
8. Q&A
© Jing Luan, 2002
Quotes
• Knowledge is information in action - O’Dell and Grayson (APQC)
• Sharing knowledge is 90% culture, 5% technology and the rest is magic - Bob Buckman of Buckman Laboratories
• We live in an increasingly data rich, knowledge poor society (Luan)
• KM is to bring people to people and people to knowledge (Serban and Luan)
• Knowledge changes destiny.
What is Knowledge?
Strong influence from philosophy:
• Epistemology (the branch of philosophy dealing with the origin, foundations, and limits of knowledge)
• Ontology (a branch of metaphysics; deals with the essence of being and of things)
• Plato (knowledge = personality, wisdom, science)
World of Utter Confusion
(the worldly dust one cannot see through)
• Information overload
• Misinformation (stocks, health)
• So many technologies, so little clue
• Not tapping into existing knowledge
Solution:
• Use KM principles
• Distinguish job functions and group technologies accordingly: TKM

Technology Confusion
• Networking (IP, TCP/IP, VPN, WAN)
• Security (firewall, DMZ, SSL)
• Website (dynamic, push/poll, asp*, domain)
• Intranet
• OLAP vs. OLTP
• Data warehouse (star schema, cubes)
• Data mining (algorithms)
• ZXT
• Anything else?
Knowledge Confusion
• What are data?
  - Meaningless by themselves (in philosophy: the universe)
• What is information?
  - Meaning drawn from data (in philosophy: observations)
• What is knowledge?
  - What the information means to me (in philosophy: sense of being)
Why Knowledge Management (KM)?
• Technology advancement
• Professional specialization vs. multi-discipline approach
• Competition (price, time to market, knowledge)
• Workforce mobility and turnover
• Capitalize on organizational knowledge
• Knowledge capital is capital.
• In sum, “you snooze, you lose” (opportunity is fleeting).
Current Western View of Knowledge: One Divides into Two

Explicit Knowledge (documented)
• Characteristics: easily codified; storable; transferable; easily expressed and shared
• Sources: databases and reports; manuals; policies and procedures; websites, advertisements

Tacit Knowledge (know-how embedded in people)
• Characteristics: personal; context-specific; difficult to formalize; difficult to capture/communicate
• Sources: personal experiences; informal business processes; historical understanding/culture; committees, task forces, groups
Is Knowledge Management Possible?
YES! … according to KPMG, Gartner Group
Fueled by technology and economics:
• Storage capacity
• Internet, portals, search engines
• CRM (from “All customers are right” to “Which one is better?”)
Knowledge Management Principles
First Principle (Tiers):
• Data => Information => Knowledge
To demonstrate this point…
Second Principle (Sharing):
• Knowledge sharing (push/pull)
• Information sharing (push/pull)
• Data sharing (push/pull)
TKM: Explicit Knowledge Management
Note: Many data mining projects fail due to a lack of understanding of these three tiers.
TIER THREE
• Mining: Clementine, Enterprise Miner, Statistica, Mineset, Darwin, SpotFire
• Classical statistics: SPSS, SAS, BMDP, SysStat
TIER TWO
• Querying: BrioQuery, Business Objects, PowerPlay, Access, Foxpro, SPSS SmartViewer
• Online data processing: ASP, JSP, iHTML, XML
• (Demonstration: slicing and drilling down)
TIER ONE
• Data engines: SQL Server, Oracle, Informix, Sybase, UniData, DB2
• Enterprise Resource Planning (ERP): PeopleSoft, Datatel, SAP, Oracle, Banner
Figure: Topography of the Tiered Knowledge Management Model (TKM) for explicit knowledge. Courtesy of Jossey-Bass.
Benefits of TKM - Explicit
• Informed use of technology
• Planned skill upgrade
• Balancing resource allocation
• Defining relationship with IT
• Enhancing role of analyst
• Improving decision making process
• Purposeful outsourcing

Tiered Knowledge Management Model (TKM)
Explicit Knowledge
• Tier three: Data Mining
• Tier two: Middleware, OLAP
• Tier one: Data Warehouses, Enterprise Resource Planning (ERP)
Tacit Knowledge
• Tier one: Knowledge Base
• Tier two: Collaborative Working Environment (CWE)
• Tier three: Knowledge Mapping
Between the two sides: Knowledge Workers, Portals, CRM
Courtesy of Jossey-Bass
Knowledge Example: All Facets
Things related to a cell phone that are knowledge intensive:
• Cell phone design
• Cell phone technology
• Cell phone number
• Cell phone conversation
• Cell phone bill
• Cell phone hazard
TKM Model Illustrated
Explicit knowledge
• Tier One (data holding medium): student enrollment data; learning outcome data; census data
• Tier Two (information processing): enrollment trends analysis; student GPA report; socio-economic status
• Tier Three (data mining): Which student is likely to persist? Which clusters of students will have GPA > 3.75? What is associated with any course-taking pattern?
Outcomes: decisions, insights, knowledge, competencies, accountability
Between the two sides: portals, CRM
Tacit knowledge
• Tier One (knowledge base): personal experiences; skills; values; relationships; organization structures
• Tier Two (collaborative working environment): curriculum committees; identifying mission/policies; writing manuals
• Tier Three (knowledge mapping): faculty experts; group leaders; librarians; analysts/institutional researchers
KM Taxonomy of Products
• Business Intelligence;
• Knowledge Base;
• Collaboration;
• Content and Document Management;
• Portals (e.g., Yahoo);
• Customer Relationship Management (CRM);
• Data Mining (e.g., Clementine, LexQuest);
• Workflow;
• E-Learning;
• Search.
Data Mining Topics Covered:
• Data mining overview: concept & demo
• Data mining, statistics and OLAP
• Skills needed
• Software evaluation
• Data mining plan at your organization

Data Mining Definition
Data mining capitalizes on the advances of technology and the extreme richness of enterprise data to improve research and decision making by uncovering hidden trends and patterns that lend themselves to predictive modeling, using a combination of an explicit knowledge base, sophisticated analytical skills, and domain (industry) knowledge.
(Jing Luan)
Why Is Data Mining a Must?
• Best way out of a sea of data
• Workbench of major tools
• Tolerant of multicollinearity
• Appetite for large datasets
• Sledgehammer vs. chisel (trading a shotgun for a cannon)
• Analyst’s impact/affiliation with databases
• Domain-knowledge-intensive research
• A good addiction
But I Spent Years Learning Statistics!
But I Use OLAP For All My Work!
• Statistics knowledge is very useful.
• Data mining cannot replace statistics in a number of areas.
• There are overlapping areas.
• OLAP is the middle tier.
• We must go beyond counting heads!

How Do Data Mining, Statistics and OLAP Compare?

Data Mining                              | Statistics                               | OLAP
Predictive                               | Research                                 | Historical
Neural Net                               | Regression, Structural Equation          | …
C5.0, C&RT                               | PCA, Discriminant                        | …
Kohonen, K-means, TwoStep                | Cluster Analysis, Probability Density    | Cubes
Spatial Visualization                    | 2-3 dimension charts                     | 2-3 dimension charts
Machine Learning/Artificial Intelligence | Mathematics                              | ETL, SQL
Unsupervised Learning                    | Descriptive Statistics, Cluster Analysis | Temporal/Trend Reporting
2 TYPES OF DATA MINING
SUPERVISED (directed)
• Purpose: classification and estimation
• Models: C5.0, C&RT, NN, etc.
UNSUPERVISED (undirected)
• Purpose: clustering and association
• Models: Kohonen, K-means, TwoStep, GRI, etc.
Notes:
1. Clustering (unsupervised) and predictive modeling (supervised) often go hand in hand, with clustering preceding the other.
2. “Pre-classified data” means data whose target (class) is already known.
Data Mining Tasks
• Classification/Estimation: predicting onto new data by using rules, patterns, and behaviors
• Clustering/Association: understanding the groupings, trends, and characteristics of your customers
• Description: visualizing the Euclidean spatial relationships, trends, and patterns of your data
Cross-Industry Use of DM

Task        | Banking                       | Telco                                 | Medicine                       | Higher Ed
Clustering  | fraud, segmentation           | net load, link analysis               | genomics, cell differentiation | learning outcomes, student groups
Predicting  | credit risk, additional cards | peak hour, additional services, churn | disease progress, epidemics    | GPA, donations
Visualizing |                               |                                       |                                |
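Artificial Neural Networks (ANN) (artificial intelligence)
Multi-layer perceptron (MLP): feed-forward, back-propagation
[Figure: network diagram. Inputs x1 (# of terms), x2 (GPA), x3 (demographics), x4 (courses), x5 (financial aid), …, xn, with weights w1, …, wn, feed the output node "Persistence".]
Output of node j:
$$o_j = f\left(\sum_{i=1}^{n} o_i \, w_{ji}\right)$$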
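The node formula on this slide, o_j = f(sum of o_i times w_ji), can be sketched in a few lines of Python. This is a minimal illustration only: the slide does not say which activation f is used, so a logistic (sigmoid) function is assumed here, and the student features and weights are made-up numbers rather than a fitted model.

```python
import math

def sigmoid(z):
    """Logistic activation; an assumed choice for f, not specified on the slide."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, activation=sigmoid):
    """Compute o_j = f(sum_i o_i * w_ji) for a single node j."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(o_i * w_ji for o_i, w_ji in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical, already-scaled inputs for one student:
# number of terms, GPA, demographics score, courses, financial aid flag
student = [0.5, 0.8, 0.3, 0.6, 1.0]
weights = [0.4, 1.2, -0.3, 0.7, 0.5]  # made-up weights, not learned from data

persistence_score = node_output(student, weights)
print(round(persistence_score, 3))
```

In a full multi-layer perceptron this computation is repeated for every node in every layer, and back-propagation adjusts the weights; the sketch covers only the forward pass of one node.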
Decision Trees: Rule Induction
Rule 1:
If Income ≥ $55,000 and # of Children = 3, then multiple policies
Rule 2:
If Income < $55,000, and single, and Age < 30, then single policy
Information theory (entropy):
$$H(N) = -\sum_{i=1}^{n} P(n_i) \log_2 P(n_i)$$
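As a rough sketch of the two ideas on this slide, the snippet below computes Shannon entropy for a list of class labels and applies the two example rules. The function names and the record fields are hypothetical; real rule-induction tools such as C5.0 derive the rules from data instead of hard-coding them.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def apply_rules(income, children, single, age):
    """Apply the two illustrative rules; return None when neither rule fires."""
    if income >= 55000 and children == 3:
        return "multiple policies"
    if income < 55000 and single and age < 30:
        return "single policy"
    return None

# A 50/50 split is maximally uncertain for two classes (1 bit);
# a pure node has zero entropy.
print(entropy(["multiple", "multiple", "single", "single"]))
print(apply_rules(income=60000, children=3, single=False, age=40))
```

A tree-building algorithm picks the split that reduces entropy the most, which is how rules like the two above emerge from the data.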
Clustering
• Fundamental to science and understanding of our world
• No restrictions on number of clusters
• Clusters change continuously
• Grouping, clustering, classifying, typology
• Carnegie and Bloom in education
• Potentially increasing scoring accuracy
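To make the idea of clustering concrete, here is a minimal sketch of k-means (one of the clustering models named elsewhere in this deck) on a single numeric field. The GPA values and starting centers are hypothetical, and in k-means the analyst chooses the number of clusters up front.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric field; returns (centers, groups)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

gpas = [2.1, 2.3, 1.9, 3.8, 3.9, 3.7]  # hypothetical student GPAs
centers, groups = kmeans_1d(gpas, centers=[2.0, 4.0])
print(centers)
print(groups)
```

The resulting group membership can then be fed into a predictive model as an extra field, which is one way clustering may improve scoring accuracy.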

Data Mining in the Telecommunications Industry
Infrastructure (assets)
• Network load analysis
• Employee productivity analysis
• Automation
Customer (sales, marketing)
• Link analysis
• Usage by time, location
• Customer feedback
• Churn
• Rolling out new services (for either existing or new customers)

Telecommunication Example: Increasing Profitable Calls
A two-step process:
1. Clustering
   - Who is a high-cost customer?
2. Prediction
   - Who is likely to be a profitable caller?
Tips: You may recalculate call lengths and call intervals to reveal what is not in the data warehouse. You may visualize the data first. Merge with customer survey data.
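The tip about recalculating call lengths and call intervals can be sketched as follows. The record layout (a start and end timestamp per call) and the definition of an interval as the gap between the end of one call and the start of the next are assumptions made for illustration; an actual data warehouse may store calls differently.

```python
from datetime import datetime

def call_features(calls):
    """Return (lengths, intervals) in seconds for one customer's calls.

    calls: list of (start, end) datetime pairs.
    An interval is the gap between one call's end and the next call's start.
    """
    ordered = sorted(calls)
    lengths = [(end - start).total_seconds() for start, end in ordered]
    intervals = [
        (ordered[i + 1][0] - ordered[i][1]).total_seconds()
        for i in range(len(ordered) - 1)
    ]
    return lengths, intervals

# Hypothetical call records for a single customer
calls = [
    (datetime(2002, 10, 1, 9, 0), datetime(2002, 10, 1, 9, 5)),
    (datetime(2002, 10, 1, 9, 35), datetime(2002, 10, 1, 9, 37)),
    (datetime(2002, 10, 1, 12, 37), datetime(2002, 10, 1, 12, 47)),
]
lengths, intervals = call_features(calls)
avg_length = sum(lengths) / len(lengths)
print(lengths, intervals, avg_length)
```

Derived fields like these can serve as inputs to both the clustering step and the prediction step.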
Banking Industry Use of DM
• Customer credit risk (pay your bills!)
• Fraud (Jing: your account is frozen)
• Customer value
  1. Depositors and users
  2. Customer typing
     - Transactors
     - Convenience users
     - Revolvers
Evaluating Data Mining Software
• Company stability and customer feedback
• User interface
• Scalability (up and down)
• Server/client (real-time, KDD)
• Modeling capacities
• Learning curve
• Join a listserv, such as CLUG
• Cost
Data Mining Skill Set
Driving forces of DM:
• Computer storage
• Algorithms
• Knowledge management
These translate to a skill set:
• Data domain expert
• Familiar with models
• Business domain: system-level view of decision making
CRISP-DM
• Business Understanding (zero in on the specific goal of the data mining task)
• Data Understanding (do you have the data?)
• Data Preparation (cases to variables, missing values, recalculated fields)
• Modeling (typing, balancing, test/validation datasets, bootstrapping, cross-algorithm validation)
• Evaluation
• Deployment
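Two of the Modeling items listed for CRISP-DM, holding out a test dataset and bootstrapping, can be sketched with the Python standard library alone. The function names, the 30% test fraction, and the fixed seeds are illustrative assumptions, not part of CRISP-DM itself.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def bootstrap_sample(records, seed=42):
    """Draw a sample of the same size, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(records) for _ in records]

records = list(range(10))  # stand-in for ten prepared cases
train, test = train_test_split(records)
boot = bootstrap_sample(records)
print(len(train), len(test), len(boot))
```

A model is fitted on the training portion and judged on the held-out portion; repeating the fit on several bootstrap samples gives a sense of how stable the results are.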
Data Mining Plan at Your Organization
1. Determine business needs
2. Determine technology infrastructure and management support
3. Determine data source (got milk, got DW?)
4. Identify mining areas
5. Invite an expert to jump start
6. Pilot test mining results
7. CRISP-DM and real-time data mining, Knowledge Discovery in Databases (KDD)
When Is Data Mining Not Needed?
The world is increasingly moving toward predictive modeling: we must know what’s next so as to better prepare ourselves. But you don’t need it if:
• You are a mom-and-pop shop
• You have no data warehouse
• You do not have people using the tools
• You conduct small experimental studies
Luan’s One-Percent Doctrine
• Average five-year growth of investment: 10% (ROI = 10%)
• What’s the ROI of data mining?
  - 25,000 enrollment ($5,000/ea)
  - One-percent increase (250 * $5,000 = $1,250,000)
  - Data mining total cost: ($75,000 + $50,000 = $125,000)
  - ROI ratio ($1,250,000 / $125,000 = 10)
  - Or ROI rate = 1,000%
  - Or “Give me a buck and I will turn it into 10!”
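The one-percent arithmetic can be reproduced in a few lines. The variable names are hypothetical, and the slide does not say what the two cost components are, so they are labeled generically. The last line is an addition for comparison: a net ROI, which subtracts the cost from the return before dividing, comes out lower than the slide's ratio-based rate.

```python
enrollment = 25_000
revenue_per_student = 5_000   # dollars per enrolled student
increase_percent = 1          # the "one percent"

# The slide lists two cost components without naming them
cost_component_a = 75_000
cost_component_b = 50_000

extra_students = enrollment * increase_percent // 100
extra_revenue = extra_students * revenue_per_student
total_cost = cost_component_a + cost_component_b

roi_ratio = extra_revenue / total_cost      # the slide's "ROI ratio"
roi_rate_pct = roi_ratio * 100              # the slide's "ROI rate"
net_roi_pct = (extra_revenue - total_cost) / total_cost * 100  # net-of-cost view

print(extra_students, extra_revenue, total_cost)
print(roi_ratio, roi_rate_pct, net_roi_pct)
```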
Lift Chart / Gain Chart
Hypothetical database marketing campaign
[Figure: lift chart. Labels recovered: Lift, quota, 35%, 25%, 0, Savings ($), 40th percentile, 70th percentile.]
If every percentage point = $2,500, savings = (70 × $2,500) − (40 × $2,500) = $175,000 − $100,000 = $75,000
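The savings figure can be checked with a short sketch. It assumes one reading of the chart: the campaign quota is reached after contacting the top 40 percentile points of the list with a model, versus 70 points without one, and each percentage point contacted costs $2,500. The function name is hypothetical.

```python
def contact_cost(percentile_points, cost_per_point=2_500):
    """Cost of contacting the given number of percentile points of the list."""
    return percentile_points * cost_per_point

cost_without_model = contact_cost(70)
cost_with_model = contact_cost(40)
savings = cost_without_model - cost_with_model
print(cost_without_model, cost_with_model, savings)
```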
Text Mining
• 80% of information is in texts:
  - Email (not including SMS)
  - Surveys (political polls, marketing, CRM online feedback)
  - Articles (memos, policies, manuals)
  - Books (what have you)
  - Web pages (static & dynamic)
• 26% on paper and 20% in digital media
China Impressions
• The urge for learning is very strong
• Technology understanding is deep
• A belief that data mining only gives a slight edge when economic growth levels off
• Funding and a systematic approach are not complete:
  - Lack of funding for explicit knowledge
  - Unique issues in tacit knowledge and guanxi (关系, personal relationships)
  - Disparate tech advances
From data to…
An Analyst/Data Miner
His Boss
to information to power…
CEO
Vice President
KPI (Key Process Indicator)
• S (specific)
• M (measurable)
• A (attainable)
• R (realistic)
• T (timeline)

Who’s Coming to Dinner?
• Online KM/DM discussion group and future data mining workshops: http://www.kdl1.com/kmdm/index.htm