Anatomy of Massive Data Mining

Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone: 18610660375, QQ/WeChat: 155970
http://zlin.ba.ttu.edu
[email protected]
2015-06-16

Agenda
Business Data Examples
Review - Data mining procedure
Two-stage predictive modeling
Handling unstructured data
◦ Text Mining: CRM at Alibaba’s B2B Call Center
◦ Sentiment Analysis: Media-Aware Stock Trading Based on Public Web Information
Understanding the nature of human beings in socio-economic context
◦ Cyber Credit Assessment for Internet Finance

Survey
Data processing
1. I know how to cleanse data
2. I know how to do data exploration
3. I know how to fix data quality problems
Data mining
1. I know how to develop a decision tree model
2. I know the principles of classification modeling
3. I know how to calculate GINI or entropy given a decision tree split
4. I know how to use a confusion matrix to assess the performance of a classification model
Tools
1. I can do SAS programming
2. I know how to use SAS Enterprise Miner
3. I know how to use other data mining tools

To conduct good research projects in big data
The following skills are highly recommended
◦ Data preparation: aggregation, cleansing, conversion, quality checking
◦ Managing massive data with DBMS and DW
◦ Basic data mining skills: classification, clustering, association analysis, and text mining
◦ Understand basic algorithms: CHAID, CRT, K-Means, SOM, etc.
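◦ Ability to explain data mining results correctly

Advanced data mining techniques
◦ Data quality diagnosis
◦ Handling imbalanced datasets
◦ Handling missing values
◦ Coping with the curse of dimensionality
◦ Multi-stage modeling
◦ Two-stage classification modeling
◦ Model performance assessment

BUSINESS DATA EXAMPLES

Dataset provided by Qiyi Network at Chongqing
[Figure: relationship diagram of the 11 tables in the dataset: Table 1 data_affix, Table 2 order_air, Table 3 order_air_user, Table 4 order_beselled, Table 5 order_caigou, Table 6 order_data, Table 7 order_refund, Table 8 order_refund_log, Table 9 order_rights, Table 10 order_table, Table 11 order_ship. The links between tables are labeled with the fields order_sn, user_id, and refund_id.]

Beijing 1039 Traffic Radio (Ad revenue 3 billion RMB/year)

Data source | How data enters the system | Standardization | Location and direction | Quantitative or qualitative
Traffic Management Bureau or Transportation Commission | Traffic information collected by cameras or other means is edited into text and passed to the traffic information center. | High | Accurate | Quantitative
Fixed collection points | The system automatically dials the landline at each collection point; the point presses the key for [Congested], [Slow], or [Clear] according to traffic conditions, and the system generates standardized text that is fed back to the traffic information center. | High | Accurate | Qualitative
Floating vehicles | Client software preinstalled on phones distributed by the traffic radio station periodically returns driving data; traffic conditions are judged from the phone's GPS and the vehicle speed. | High | Accurate (location and direction are missing if the phone's GPS is off) | Quantitative
Traffic reporters | Reporters phone in traffic conditions, and traffic information center staff enter them into the system manually from the call. | High | Accurate | Qualitative
Traffic information volunteers | Volunteers across the city report traffic conditions on their own initiative through the traffic radio app or the SMS platform. | Low | Completeness and clarity of the description cannot be guaranteed | Qualitative

The data sample provided here is one week of floating-vehicle data (covering both routine traffic conditions and incidents).

Beijing’s Floating Vehicle Data
Data: Location (X, Y) and Time

Taxis in Fuzhou
This map is updated every 15 seconds
Data: Location (X, Y) and Time

REVIEW - DATA MINING PROCEDURE

Data Mining Process
ISQS 6347, Data & Text Mining

Types of Attributes (Variables)
There are different types of attributes
◦ Nominal. Examples: ID numbers, eye color, zip codes
◦ Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
◦ Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit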
◦ Ratio. Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
◦ Distinctness: =, ≠
◦ Order: <, >
◦ Addition: +, −
◦ Multiplication: *, /
Each attribute type adds one property to the previous type:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & addition
◦ Ratio attribute: all 4 properties

Discrete and Continuous Attributes
Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of documents
◦ Often represented as integer variables
◦ Note: binary attributes are a special case of discrete attributes
Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight
◦ Practically, real values can only be measured and represented using a finite number of digits
◦ Continuous attributes are typically represented as floating-point variables

Important Characteristics of Structured Data
◦ Dimensionality: curse of dimensionality
◦ Sparsity: only presence counts
◦ Quality: missing values, typos, outliers, etc.
◦ Resolution (frequency): patterns depend on the scale

Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful
[Figure: experiment illustrating the effect]
• Randomly generate 500 points
• Compute the difference between the maximum and minimum distance between any pair of points

Dimensionality Reduction
Purpose:
◦ Avoid the curse of dimensionality
◦ Reduce the amount of time and memory required by data mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise
Techniques
◦ Principal Component Analysis
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques

Feature Subset Selection
Another way to reduce the dimensionality of data
Redundant features
◦ Duplicate much or all of the information contained in one or more other attributes
◦ Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
◦ Contain no information that is useful for the data mining task at hand
◦ Example: a student's ID is often irrelevant to the task of predicting the student's GPA

Data Quality
What are data quality problems?
How can we detect problems with the data?
What can we do about these problems?
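As a rough illustration of the curse-of-dimensionality experiment from the earlier slide (randomly generate 500 points, then compare the maximum and minimum pairwise distances), here is a minimal Python sketch. The slide does not say how the points are drawn or how the difference is scaled, so sampling uniformly from the unit hypercube and dividing the difference by the minimum distance are assumptions, as is the function name distance_contrast.

```python
import numpy as np


def distance_contrast(n_points=500, n_dims=2, seed=0):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points.

    Assumption: points are drawn uniformly from the unit hypercube, and the
    difference is divided by the minimum distance to make it scale-free.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))

    # Squared Euclidean distances via the Gram matrix:
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)

    # Keep each unordered pair once (strict upper triangle) and clip the
    # tiny negative values that floating-point rounding can produce.
    upper = np.triu_indices(n_points, k=1)
    dists = np.sqrt(np.clip(sq_dists[upper], 0.0, None))

    return (dists.max() - dists.min()) / dists.min()


if __name__ == "__main__":
    for dims in (2, 5, 10, 20, 50):
        print(dims, distance_contrast(n_dims=dims))
```

Under these assumptions the printed contrast should shrink as the number of dimensions grows: the nearest and farthest pairs end up at similar distances, which is why distance-based clustering and outlier detection lose discriminating power in high dimensions.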
Examples of data quality problems:
◦ Noise and outliers
◦ Missing values
◦ Duplicate data

Noise
Noise refers to modification of original values
◦ Examples: distortion of a person’s voice when talking on a poor phone line, and “snow” on a television screen
[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]

Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

Missing Values
Reasons for missing values
◦ Information is not collected (e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
◦ Eliminate data objects
◦ Estimate missing values
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their probabilities)

Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another
◦ Major issue when merging data from heterogeneous sources
Examples:
◦ Same person with multiple email addresses
Data cleaning
◦ Process of dealing with duplicate data issues

Data Preprocessing Tasks
Main tasks
◦ Sampling
◦ Aggregation
◦ Feature creation
◦ Attribute transformation
◦ Dimensionality reduction
◦ Feature subset selection

The Process of Classification

Training Set
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Test Set
Refund | Marital Status | Taxable Income | Cheat
No | Single | 75K | ?
Yes | Married | 50K | ?
No | Married | 150K | ?
Yes | Divorced | 90K | ?
No | Single | 40K | ?
No | Married | 80K | ?

[Figure: the training set is used to learn a classifier; the resulting model is applied to the test set]

Data Mining Tools

SAS Enterprise Miner v13.2
Basic
◦ How to use the application main menu
◦ Using the pop-up menus
◦ Enterprise Miner documentation
◦ Project – Diagram
The SEMMA methodology
◦ Sample
◦ Explore
◦ Modify
◦ Model
◦ Assess

Case: German credit benchmark data set
1000 observations
Clean data
Target variable: “Good_Bad”
Cost: $1 loss when “false negative” vs. $5 loss when “false positive”
Prior probability of the target variable: 0.9:0.1 vs. sample probability 0.7:0.3

SAS Enterprise Miner

The Analytic Workflow
◦ Define analytic objective
◦ Select cases
◦ Extract input data
◦ Validate input data
◦ Repair input data
◦ Transform input data
◦ Apply analysis
◦ Generate deployment methods
◦ Integrate deployment
◦ Gather results
◦ Assess observed results
◦ Refine analytic objective

Open Source Data Mining Software: RapidMiner
RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second in data mining/analytic tools used for real projects in 2009 and first in 2010.
The RapidMiner project was started in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund.
In 2006, Ingo Mierswa and Ralf Klinkenberg founded the company Rapid-I, which is now the main contributor among the more than 30 international developers working on RapidMiner.

TWO-STAGE PREDICTIVE MODELING

TEXT MINING

SENTIMENT ANALYSIS