Anatomy of Massive Data Mining
Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone:18610660375, QQ/WeChat: 155970
http://zlin.ba.ttu.edu [email protected]
2015-06-16
Agenda
• Business Data Examples
• Review - Data mining procedure
• Two-stage predictive modeling
• Handling unstructured data
◦ Text Mining: CRM at Alibaba’s B2B Call Center
◦ Sentiment Analysis: Media-Aware Stock Trading Based on Public Web Information
• Understanding the nature of human beings in socio-economic context
◦ Cyber Credit Assessment for Internet Finance
Survey
• Data processing
1. I know how to cleanse data
2. I know how to do data exploration
3. I know how to fix data quality problems
• Data mining
1. I know how to develop a decision tree model
2. I know the principles of classification modeling
3. I know how to calculate Gini or entropy given a decision tree split
4. I know how to use a confusion matrix to assess the performance of a classification model
• Tools
1. I can do SAS programming
2. I know how to use SAS Enterprise Miner
3. I know how to use other data mining tools
To conduct good research projects in big data
• The following skills are highly recommended
◦ Data preparation: aggregation, cleansing, conversion, quality checking
◦ Managing massive data with a DBMS and a data warehouse (DW)
◦ Basic data mining skills: classification, clustering, association analysis, and text mining
◦ Understanding basic algorithms: CHAID, CRT, K-Means, SOM, etc.
◦ Ability to explain data mining results correctly
Advanced data mining techniques
• Data quality diagnosis
• Handling imbalanced datasets
• Handling missing values
• Coping with the curse of dimensionality
• Multi-stage modeling
• Two-stage classification modeling
• Model performance assessment

BUSINESS DATA EXAMPLES
[Figure: entity-relationship diagram of an order dataset provided by Qiyi Network at Chongqing]
Tables shown: Table 1 data_affix, Table 2 order_air, Table 3 order_air_user, Table 4 order_beselled, Table 5 order_caigou, Table 6 order_data, Table 7 order_refund, Table 8 order_refund_log, Table 9 order_rights, Table 10 order_table, Table 11 order_ship
Linking keys shown: order_sn, user_id, refund_id
Beijing 1039 Traffic Radio
(Ad revenue 3 billion RMB/year)
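Data sources, how each enters the system, and data characteristics

• Traffic management bureau or transportation commission
◦ How it enters the system: traffic information collected by cameras or other means is edited into text and passed to the traffic information center.
◦ Standardization: high
◦ Location and direction: accurate
◦ Quantitative or qualitative: quantitative

• Fixed collection points
◦ How it enters the system: the system automatically calls each collection point's landline; the collection point presses the key for [Congested], [Slow], or [Clear] according to the traffic conditions, and the system generates standardized text that is sent to the traffic information center.
◦ Standardization: high
◦ Location and direction: accurate
◦ Quantitative or qualitative: qualitative

• Floating vehicles
◦ How it enters the system: client software preinstalled on phones distributed by the traffic radio station periodically returns vehicle driving data; traffic conditions are judged from the phone's GPS and the vehicle speed.
◦ Standardization: high
◦ Location and direction: accurate (if the phone's GPS is off, location and direction information is missing)
◦ Quantitative or qualitative: quantitative

• Information reporters
◦ How it enters the system: reporters phone in traffic conditions, and traffic information center staff enter them into the system manually based on the call.
◦ Standardization: high
◦ Location and direction: accurate
◦ Quantitative or qualitative: qualitative

• Traffic information volunteers
◦ How it enters the system: volunteers across the city report traffic conditions on their own initiative through the traffic radio app or the SMS platform.
◦ Standardization: low
◦ Location and direction: the description is not guaranteed to be complete and clear
◦ Quantitative or qualitative: qualitative

The data sample provided here is one week of floating-vehicle data (covering both routine and incident traffic conditions).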
Beijing’s Floating Vehicle Data
Data: Location (X, Y) and Time
Taxis in Fuzhou
This map is updated every 15 seconds
Data: Location (X, Y) and Time
REVIEW - DATA MINING PROCEDURE
Data Mining Process
ISQS 6347, Data & Text Mining
Types of Attributes (Variables)
• There are different types of attributes
◦ Nominal
  • Examples: ID numbers, eye color, zip codes
◦ Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
◦ Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit
◦ Ratio
  • Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
◦ Distinctness: = ≠
◦ Order: < >
◦ Addition: + -
◦ Multiplication: * /
• Properties possessed by each attribute type:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & addition
◦ Ratio attribute: all 4 properties
Discrete and Continuous Attributes
• Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of documents
◦ Often represented as integer variables
◦ Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight
◦ Practically, real values can only be measured and represented using a finite number of digits
◦ Continuous attributes are typically represented as floating-point variables
Important Characteristics of Structured Data
◦ Dimensionality
  • Curse of Dimensionality
◦ Sparsity
  • Only presence counts
◦ Quality
  • Missing values, typos, outliers, etc.
◦ Resolution (frequency)
  • Patterns depend on the scale
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
[Figure: experiment illustrating the effect]
• Randomly generate 500 points
• Compute difference between max and min distance between any pair of points
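The experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, distances are Euclidean, and the max-min gap is divided by the minimum distance and put on a log10 scale so that different dimensionalities are comparable. The slide itself only says to compute the difference between the max and min distance; the normalization, the seed, and the function name `relative_contrast` are choices made here.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """log10((max_dist - min_dist) / min_dist) over all pairs of random points."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit hypercube [0, 1]^dim.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    d_min, d_max = math.inf, 0.0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.dist(points[i], points[j])  # Euclidean distance
            d_min = min(d_min, d)
            d_max = max(d_max, d)
    return math.log10((d_max - d_min) / d_min)

# The contrast between the nearest and farthest pair shrinks as dimension grows.
contrast = {dim: relative_contrast(dim) for dim in (2, 10, 50)}
for dim, value in contrast.items():
    print(f"dim={dim:3d}  log10 relative contrast={value:.3f}")
```

As the dimension grows, the nearest and farthest pairs end up almost equally far apart, which is why distance-based clustering and outlier detection lose discriminating power.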
Dimensionality Reduction
• Purpose:
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise
• Techniques
◦ Principal Component Analysis
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques
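To make Principal Component Analysis concrete, here is a minimal sketch that finds only the first principal component. It uses power iteration on the sample covariance matrix, which is a simplification swapped in for a full eigendecomposition or SVD so the example needs nothing beyond the standard library. The synthetic data (a second column that is roughly twice the first, plus a low-variance third column) and all names are hypothetical.

```python
import math
import random

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def first_principal_component(rows, n_iter=200):
    """Return (unit direction, projected scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores

# Hypothetical 3-D data that really varies along one direction only.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    rows.append([x, 2.0 * x + rng.gauss(0.0, 0.05), rng.gauss(0.0, 0.05)])

direction, scores = first_principal_component(rows)
total_variance = sum(variance([r[k] for r in rows]) for k in range(3))
explained = variance(scores) / total_variance
print(direction, round(explained, 4))
```

Replacing the three original columns with the single `scores` column keeps nearly all of the variance in this toy data, which is the sense in which PCA reduces dimensionality.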
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
◦ duplicate much or all of the information contained in one or more other attributes
◦ Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
◦ contain no information that is useful for the data mining task at hand
◦ Example: students' ID is often irrelevant to the task of predicting students' GPA
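One simple way to spot redundant features such as price and sales tax is a pairwise correlation filter. The sketch below is an assumption-laden illustration, not a method taken from the slides: the 0.95 threshold, the 8% tax rate, the column names, and the greedy keep-first order are all hypothetical choices.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Greedily keep a column only if it is not highly correlated with a kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: sales tax is exactly 8% of the price, so it adds no information.
columns = {
    "price": [10.0, 25.0, 40.0, 55.0, 80.0],
    "sales_tax": [0.8, 2.0, 3.2, 4.4, 6.4],
    "quantity": [3, 1, 4, 1, 5],
}
kept = drop_redundant(columns)
print(kept)
```

A correlation filter only catches linear redundancy between pairs of features. Irrelevant features such as a student ID usually have to be judged against the target variable or by domain knowledge.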
Data Quality
• What are data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
◦ Noise and outliers
◦ Missing values
◦ Duplicate data
Noise
• Noise refers to modification of original values
◦ Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
[Figure panels: Two Sine Waves; Two Sine Waves + Noise]
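The two figure panels can be imitated numerically. The following sketch assumes a signal made of two sine waves at 1 Hz and 3 Hz and additive Gaussian noise with standard deviation 0.5; the frequencies, the noise level, and the sampling grid are arbitrary choices made here, not values from the slide.

```python
import math
import random

NOISE_SD = 0.5  # assumed noise level

def two_sine_waves(t):
    # Assumed frequencies: 1 Hz and 3 Hz.
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)

rng = random.Random(42)
times = [i / 1000 for i in range(1000)]          # one second, 1000 samples
clean = [two_sine_waves(t) for t in times]       # original values
noisy = [y + rng.gauss(0.0, NOISE_SD) for y in clean]  # values modified by noise

# Root-mean-square difference between observed and original values.
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(noisy, clean)) / len(clean))
print(round(rms_error, 3))
```

In real data the clean signal is not available, so the noise level has to be estimated indirectly, for example by smoothing.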
Outliers
• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set
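"Considerably different" needs an operational rule before it can be applied to data. One common convention, used here purely as an example and not prescribed by the slides, is the interquartile-range (IQR) rule: flag values more than 1.5 IQRs outside the first or third quartile. The readings below are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical one-dimensional readings with one suspicious value.
readings = [52, 48, 50, 51, 49, 47, 53, 50, 120]
outliers = iqr_outliers(readings)
print(outliers)
```

The rule works on a single numeric attribute; multivariate outliers generally call for distance-based or density-based methods.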
Missing Values
• Reasons for missing values
◦ Information is not collected (e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
◦ Eliminate Data Objects
◦ Estimate Missing Values
◦ Ignore the Missing Value During Analysis
◦ Replace with all possible values (weighted by their probabilities)
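The first two handling strategies can be sketched as follows. The records, the field names, and the choice of the mean as the estimate are hypothetical; a real project might prefer the median, a model-based estimate, or a tool's built-in imputation.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 51, "weight": None},
    {"age": 29, "weight": 64.0},
]

def drop_incomplete(rows):
    """Eliminate data objects that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, field):
    """Estimate missing values of one field with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

complete_only = drop_incomplete(records)
with_age = impute_mean(records, "age")
print(len(complete_only), with_age[1]["age"])
```

Elimination is simple but discards half of this toy data set, while imputation keeps every row at the cost of understating the variance of the imputed field.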
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates, of one another
◦ Major issue when merging data from heterogeneous sources
• Examples:
◦ Same person with multiple email addresses
• Data cleaning
◦ Process of dealing with duplicate data issues
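A minimal sketch of the email example: records from two sources are grouped on a normalized key so that the same person appears once with all of their addresses. The field names, the sample records, and the assumption that name plus birth date identifies a person are all hypothetical; real entity resolution usually needs fuzzier matching.

```python
from collections import defaultdict

# Hypothetical records merged from two sources.
records = [
    {"name": "Ann Lee", "birth_date": "1985-03-02", "email": "ann.lee@example.com"},
    {"name": "ann  lee ", "birth_date": "1985-03-02", "email": "alee@example.org"},
    {"name": "Bo Chen", "birth_date": "1990-11-23", "email": "bo.chen@example.com"},
]

def normalize_name(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Group rows by (normalized name, birth date) and collect their emails."""
    merged = defaultdict(set)
    for row in rows:
        key = (normalize_name(row["name"]), row["birth_date"])
        merged[key].add(row["email"])
    return {key: sorted(emails) for key, emails in merged.items()}

people = merge_duplicates(records)
for key, emails in people.items():
    print(key, emails)
```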
Data Preprocessing Tasks
• Main tasks
◦ Sampling
◦ Aggregation
◦ Feature creation
◦ Attribute Transformation
◦ Dimensionality Reduction
◦ Feature subset selection
The Process of Classification
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Flow: Training Set → Learn Classifier → Model → applied to Test Set
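As a small worked example of the "learn classifier" step, the sketch below scores two candidate decision-tree splits on the ten training records using the weighted Gini index. It covers only the split-evaluation step, not a full tree learner, and the tuple layout and function names are choices made here for illustration.

```python
from collections import Counter

# (refund, marital status, taxable income in K, cheat) for Tid 1..10.
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]
FIELDS = {"refund": 0, "marital": 1}

def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(rows, field):
    """Weighted Gini index after a multiway split on a categorical field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[FIELDS[field]], []).append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

parent = gini([row[-1] for row in TRAINING])
by_refund = gini_split(TRAINING, "refund")
by_marital = gini_split(TRAINING, "marital")
print(round(parent, 4), round(by_refund, 4), round(by_marital, 4))
```

Both splits lower the impurity of the parent node, and the three-way split on marital status lowers it more, so a Gini-based learner restricted to these two categorical attributes would choose it first.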
Data Mining Tools
SAS Enterprise Miner v13.2
• Basics
◦ How to use the application main menu
◦ Using the pop-up menus
◦ Enterprise Miner documentation
◦ Project – Diagram
• The SEMMA methodology
◦ Sample
◦ Explore
◦ Modify
◦ Model
◦ Assess

Case: German credit benchmark data set
• 1000 observations
• Clean data
• Target variable: “Good_Bad”
• Cost: $1 loss for a “false negative” vs. $5 loss for a “false positive”
• Prior probability of the target variable: 0.9:0.1 vs. sample probability 0.7:0.3
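The priors and costs above change how a model's output should be used, and the arithmetic can be sketched directly. Two assumptions are made here because the slide does not spell them out: the positive class is "good" (so a false positive means accepting an applicant who turns out bad), and the 0.9 and 0.7 figures are the population and sample shares of "good". The function names are hypothetical and this is not SAS Enterprise Miner's API.

```python
SAMPLE_PRIOR_GOOD = 0.7    # share of "good" in the modeling sample
TRUE_PRIOR_GOOD = 0.9      # assumed share of "good" in the population
COST_FALSE_POSITIVE = 5.0  # accept as good, actually bad (assumed reading)
COST_FALSE_NEGATIVE = 1.0  # reject as bad, actually good (assumed reading)

def adjust_for_priors(p_good_sample):
    """Rescale P(good) estimated on the sample to the population priors."""
    good = p_good_sample * TRUE_PRIOR_GOOD / SAMPLE_PRIOR_GOOD
    bad = (1 - p_good_sample) * (1 - TRUE_PRIOR_GOOD) / (1 - SAMPLE_PRIOR_GOOD)
    return good / (good + bad)

def decide(p_good):
    """Pick the action with the lower expected loss."""
    expected_cost_accept = (1 - p_good) * COST_FALSE_POSITIVE
    expected_cost_reject = p_good * COST_FALSE_NEGATIVE
    return "accept" if expected_cost_accept < expected_cost_reject else "reject"

for p in (0.5, 0.7, 0.8):
    adjusted = adjust_for_priors(p)
    print(f"sample P(good)={p:.2f} -> adjusted {adjusted:.3f} -> {decide(adjusted)}")
```

Under these assumptions an applicant is accepted only when the adjusted probability of "good" exceeds 5/6, because one wrongly accepted applicant costs as much as five wrongly rejected ones.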

SAS Enterprise Miner
The Analytic Workflow
1. Define analytic objective
2. Select cases
3. Extract input data
4. Validate input data
5. Repair input data
6. Transform input data
7. Apply analysis
8. Generate deployment methods
9. Integrate deployment
10. Gather results
11. Assess observed results
12. Refine analytic objective
Open Source Data Mining Software: RapidMiner
• RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
• In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and was first in 2010.
• The RapidMiner project was started in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund.
• In 2006 Ingo Mierswa and Ralf Klinkenberg founded the company Rapid-I, which is now the main contributor among the more than 30 international developers working on RapidMiner.

TWO-STAGE PREDICTIVE MODELING
TEXT MINING
SENTIMENT ANALYSIS