What is Data Mining? - WCU Computer Science

COURSE INTRODUCTION
Class 1
CSC 600: Data Mining
Today
• What is Data Mining?
• Syllabus / Course Webpage
• Types of Data
How would you define data mining?
What is Data Mining?
Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
Goal: solving business problems
• Data collection (more and more data is being collected)
• Warehousing of data (readily available for analysis; data from numerous sources already integrated)
• Computer storage and computing power are cheaper every day
• Good software for performing analysis

Data Mining
… blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms
[Figure: Math, CS (programming ability to process big data), Business (businesses interested in decision making), and the “art” of data mining]
Predictive Data Mining

Moving from data to insights to decisions.
Data Mining Applications
Businesses collect lots of data:
• Purchase information
• Web site browsing habits
• Social network data
Business goals:
• Customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
• “Who are the most profitable customers?”
• “What products can be cross-sold?”
• “What is the revenue outlook for the company next year?”
Many variables are collected; few turn out to be useful.
More Applications
• Price Prediction
• Fraud Detection
• Risk Assessment
• Diagnosis
Data Mining Applications
Medicine, science, and engineering are collecting lots of data:
• NASA / weather observations (collecting land surface, ocean, and atmosphere readings)
• Molecular biology data (large amounts of genomic data being gathered to better understand the function of genes)
• Medical data (outcomes of procedures)
Questions that a scientist will try to answer using data mining:
• “How are land surface precipitation and temperature affected by ocean surface temperature?”
• “How well can we predict the beginning and end of the growing season for a region?”
What we will do in this Course
• Learn Basic-to-Intermediate Data Mining Techniques
• Apply them on Datasets
• Program using Python
• Read, Understand, Discuss, Critique Scientific Papers
• Perform Significant Individual Data Mining Project
Syllabus / Course Webpage
What is Data Mining?
• “the process of automatically discovering useful information in large data repositories”
• “to find novel and useful patterns that might otherwise remain unknown”
What is NOT Data Mining?
• “looking up records in a MySQL database” (database)
• “finding relevant web pages based on a Google search query” (information retrieval)
Data Mining and Knowledge Discovery
Process of converting raw data into useful information
Input Data
• MySQL
• .csv
Data Preprocessing
• Feature Selection
• Dimensionality Reduction
• Normalization
Data Mining
• Decision Trees
• Support Vector Machines
• Linear Regression
Postprocessing
• Visualization
• Pattern Interpretation
Reporting to Boss
• “closing the loop”
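The stages above can be sketched end to end in a few lines of Python. This is only an illustration using the standard library: the inline CSV text, the column names, and the stand-in "mining" step (a mean income per class) are assumptions made for the example, not part of the course material.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input: a tiny CSV with a duplicate row and a missing income.
RAW = """id,income,defaulted
1,125,No
2,100,No
2,100,No
3,,No
4,95,Yes
5,85,Yes
"""

def load(text):
    """Input data: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def preprocess(records):
    """Preprocessing: drop duplicate ids and rows with a missing income."""
    seen, clean = set(), []
    for r in records:
        if r["id"] in seen or r["income"] == "":
            continue
        seen.add(r["id"])
        clean.append({"income": float(r["income"]), "defaulted": r["defaulted"]})
    return clean

def mine(records):
    """Data mining (stand-in): average income for each class label."""
    groups = {}
    for r in records:
        groups.setdefault(r["defaulted"], []).append(r["income"])
    return {label: mean(values) for label, values in groups.items()}

def postprocess(summary):
    """Postprocessing: turn the result into a short, readable report."""
    return [f"defaulted={label}: mean income {value:.1f}K"
            for label, value in sorted(summary.items())]

report = postprocess(mine(preprocess(load(RAW))))
for line in report:
    print(line)
```

Keeping each stage as its own function makes it easy to swap the stand-in `mine` step for a real technique such as a decision tree or linear regression.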
Input Data
Data is available in a variety of formats:
• Flat files (.csv or .txt)
• Spreadsheets (Excel .xls is tougher to deal with)
• Relational tables (MySQL)
• Text, data on web pages (scraping necessary)
Big Data / Data Warehouse
• Data spread out over multiple locations
• CS programming ability often necessary
Sometimes an enormous amount of effort
• Digitizing hand-written notes
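As a rough sketch of why programming ability matters here, the snippet below reads the same kind of record from a flat file and from a relational table and combines them. The CSV content and the `customers` table are made up, and an in-memory SQLite database stands in for MySQL so the example needs nothing beyond the standard library.

```python
import csv
import io
import sqlite3

# Flat file: hypothetical CSV content held in a string instead of a file on disk.
csv_text = "id,marital_status,income_k\n1,Single,125\n2,Married,100\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Relational table: an in-memory SQLite database stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, marital_status TEXT, income_k INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(3, "Single", 70), (4, "Married", 120)],
)
query = "SELECT id, marital_status, income_k FROM customers ORDER BY id"
# Convert each SQL row to the same string-valued layout the CSV reader produces.
sql_rows = [
    {"id": str(i), "marital_status": m, "income_k": str(k)}
    for i, m, k in conn.execute(query)
]
conn.close()

# Once both sources share one record layout, they can be combined.
all_rows = csv_rows + sql_rows
print(len(all_rows), "records from 2 sources")
```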
Preprocessing
To transform raw input data into an appropriate format for subsequent analysis:
• Fusing data from multiple sources
• Cleaning data to remove noise
• Duplicate observations
“Garbage in – garbage out” also applies to data mining.
Selecting records and features that are relevant to the data mining task at hand
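The preprocessing steps above might look like this in plain Python. It is a sketch under stated assumptions: the two sources, the field names, the rule that a negative income is noise, and the `RELEVANT` feature list are all hypothetical.

```python
# Hypothetical records from two sources; names and values are made up.
source_a = [
    {"id": 1, "income_k": 125, "marital_status": "Single", "notes": "n/a"},
    {"id": 2, "income_k": 100, "marital_status": "married ", "notes": ""},
]
source_b = [
    {"id": 2, "income_k": 100, "marital_status": "Married", "notes": "dup"},
    {"id": 3, "income_k": -1, "marital_status": "Single", "notes": ""},
]

RELEVANT = ("id", "income_k", "marital_status")  # features kept for the task

def preprocess(*sources):
    """Fuse sources, drop duplicates and noisy rows, keep relevant features."""
    seen_ids = set()
    result = []
    for record in (r for source in sources for r in source):  # fuse sources
        if record["id"] in seen_ids:      # drop duplicate observations
            continue
        if record["income_k"] < 0:        # assumed rule: negative income is noise
            continue
        seen_ids.add(record["id"])
        cleaned = {key: record[key] for key in RELEVANT}  # select features
        # Normalize inconsistent spelling such as "married " vs "Married".
        cleaned["marital_status"] = cleaned["marital_status"].strip().title()
        result.append(cleaned)
    return result

clean = preprocess(source_a, source_b)
print(clean)
```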
Data Mining
Applying an appropriate data mining task:
• Linear Regression
• Support Vector Machines
• Decision Trees
• Clustering
• …
Postprocessing
Performing:
• Visualization
• Statistical significance tests, confidence intervals, and hypothesis testing to eliminate spurious data mining results
• (yikes, math!)
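A confidence interval is one of the simpler checks mentioned above. The sketch below uses a normal approximation for the mean; the accuracy scores are hypothetical, and for a sample this small a t-distribution would usually be the more careful choice.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * stdev(values) / len(values) ** 0.5
    centre = mean(values)
    return centre - half_width, centre + half_width

# Hypothetical accuracy scores from repeated runs of one model.
scores = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80]
low, high = mean_confidence_interval(scores)
print(f"mean accuracy in [{low:.3f}, {high:.3f}] (approx. 95% CI)")
```

If the interval for one model overlaps heavily with the interval for another, the apparent difference between them may be spurious.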
Challenges of Data Mining
Scalability
• Gigabytes, terabytes, petabytes, exabytes of data
• Storage, processing
• “Are data mining algorithms scalable?”
• Limits of Python statistical framework libraries
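One common response to the scalability problem is to process data in a single pass instead of loading it all into memory. The sketch below computes a mean and variance over a stream with Welford's algorithm; the `readings` generator is a hypothetical stand-in for reading a very large file.

```python
def running_stats(stream):
    """Single-pass mean and sample variance (Welford's algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

def readings(n):
    """Yield values one at a time, so the full data set never sits in memory."""
    for i in range(n):
        yield float(i % 10)

count, mean, variance = running_stats(readings(100_000))
print(count, mean, variance)
```

Memory use stays constant no matter how many values the stream produces, which is the property that matters when the data is measured in terabytes.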
Challenges of Data Mining
High Dimensionality
• Datasets with hundreds or thousands of attributes
• Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
• Many variables are collected; few turn out to be useful.
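One reason distance-based techniques struggle in high dimensions is that the nearest and farthest points become almost equally far away. The experiment below is a small illustration with uniformly random points; the point count, the seed, and the chosen dimensions are arbitrary assumptions.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)

# A ratio near 0 means "nearest" is meaningful; near 1 means it is not.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

The ratio climbs towards 1 as the number of dimensions grows, so "nearest neighbour" carries less and less information.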
Challenges of Data Mining
Heterogeneous and Complex Data
• Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
• Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)
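Mixed attribute types usually have to be brought into one numeric form before an algorithm can use them. The sketch below is one simple way to do that, assuming three hypothetical records with a binary, a categorical, and a continuous attribute; it uses one-hot encoding and min-max scaling, and it assumes the incomes are not all identical.

```python
# Hypothetical records mixing binary, categorical, and continuous attributes.
records = [
    {"home_owner": "Yes", "marital_status": "Single", "income_k": 125},
    {"home_owner": "No", "marital_status": "Married", "income_k": 100},
    {"home_owner": "No", "marital_status": "Divorced", "income_k": 95},
]

def encode(records):
    """Turn mixed-type records into equal-length numeric vectors."""
    categories = sorted({r["marital_status"] for r in records})
    incomes = [r["income_k"] for r in records]
    low, high = min(incomes), max(incomes)
    vectors = []
    for r in records:
        binary = [1.0 if r["home_owner"] == "Yes" else 0.0]
        one_hot = [1.0 if r["marital_status"] == c else 0.0 for c in categories]
        scaled = [(r["income_k"] - low) / (high - low)]  # min-max to [0, 1]
        vectors.append(binary + one_hot + scaled)
    return categories, vectors

categories, vectors = encode(records)
print(categories)
for v in vectors:
    print(v)
```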
Challenges of Data Mining
Data Ownership
• “Good data” may be geographically distributed and owned by more than one organization (e.g. medical records)
• Access to “good data”
• Facebook and Google keep their collected data private
What is interesting in this data?
Sample Data
id   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Vocabulary:
• Column: “attribute”, “feature”, “field”, “dimension”, “variable”
• Row: “instance”, “record”, “observation”
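In code, the same vocabulary maps onto a list of records. The sketch below holds the sample data as one dict per row; the Python field names are chosen for the example, and the income for record 6 is taken to be 60K, which is an assumption because that value is displaced in the source table.

```python
from collections import Counter

COLUMNS = ("id", "home_owner", "marital_status", "income_k", "defaulted")
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),  # income for id 6 is an assumption (60K)
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]
# Each row (instance, record, observation) becomes one dict keyed by
# column name (attribute, feature, field, dimension, variable).
data = [dict(zip(COLUMNS, row)) for row in ROWS]
print(len(data), "instances,", len(COLUMNS), "attributes")

# One attribute is a column: the same field taken from every record.
marital_status = [record["marital_status"] for record in data]
print(Counter(marital_status))

# A first look at what might be interesting: defaults by home ownership.
defaults = Counter(r["home_owner"] for r in data if r["defaulted"] == "Yes")
owners = Counter(r["home_owner"] for r in data)
for group in ("Yes", "No"):
    print(f"home owner={group}: {defaults[group]} of {owners[group]} defaulted")
```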
Data Mining Tasks
1. Predictive Tasks
Objective: predict the value of a particular attribute, based on the values of other attributes
• “Defaulted Borrower?” is the target (or dependent) variable
• Attributes/features used for making the prediction are known as explanatory (or independent) variables
Supervised Machine Learning

Machine Learning techniques automatically learn a
model of the relationship between a set of descriptive
features and a target feature from a set of historical
examples.
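To make "learning a model from historical examples" concrete, here is a minimal sketch of one of the simplest learners, a decision stump that picks a single threshold on one numeric feature. The feature (a debt-to-income ratio), the example values, and the function names are hypothetical and chosen so that a clean threshold exists.

```python
def fit_stump(examples):
    """Learn a threshold t so that `value >= t` predicts the positive class.

    examples: list of (value, label) pairs with boolean labels.
    """
    values = sorted({v for v, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((v >= t) != label for v, label in examples)

    return min(candidates, key=errors)  # fewest mistakes on the history

# Hypothetical history: (debt-to-income ratio, defaulted?)
history = [
    (0.10, False), (0.15, False), (0.22, False), (0.30, False),
    (0.35, True), (0.41, True), (0.48, True), (0.60, True),
]
threshold = fit_stump(history)

def predict(value):
    """Apply the learned model to a new, unseen example."""
    return value >= threshold

print(threshold, predict(0.20), predict(0.50))
```

The learned threshold is the model: it was not written by hand but derived from the labelled examples, which is what makes the task supervised.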
Data Mining Tasks
2. Descriptive Tasks
Objective: derive patterns (correlations, clusters) that summarize underlying relationships in the data
Often more exploratory, and requires an explanation of the results found
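Clustering is a typical descriptive task. The sketch below runs a plain k-means on a single attribute, the annual incomes from the sample data in thousands; the value 60 for record 6 and the choice of starting centroids (the smallest and largest income) are assumptions made for the example.

```python
from statistics import mean

def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on one numeric attribute, starting from `centroids`."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # nothing moved, so the result is stable
            break
        centroids = updated
    return centroids, clusters

incomes_k = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
centroids, clusters = kmeans_1d(
    incomes_k, centroids=[min(incomes_k), max(incomes_k)]
)
for centre, members in zip(centroids, clusters):
    print(round(centre, 1), sorted(members))
```

No target attribute is involved: the algorithm only summarizes structure in the data, and it is up to the analyst to explain what the groups mean.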
Available Datasets
References
• Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
• Data Science from Scratch, 1st edition, Grus
• Introduction to Data Mining, 1st edition, Tan et al.
• Data Mining and Business Analytics in R, 1st edition, Ledolter