Download Data Foundations

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Foundations
 Data Attributes and Features
 Data Pre-processing
 Data Storage
 Data Analysis
1
Data Attributes
 Describing data content and characteristics
 Representing data dimensions
 Set of all attributes: attribute vector or
attribute array.
2
1
Attribute Types
Nominal Data
Sortable Data
3
Numerical Attributes
Discrete vs. Continuous
4
2
Statistical Features of Data
Mean:
Median:
Standard Deviation:
5
Similarity Matrix
Distance for Categorical Data:
 p: total # of types;
 m: # of same types
6
3
Distance Measures (Numerical)
Euclidean Distance: L2
Manhattan Distance: L1
Minkowski Distance: Lp
Cosine Distance:
s( X , Y ) 
X
X

Y
Y
7
Data Uncertainty
Source of uncertainty
 Attribute error
 Missing attributes
 Data integration
error
 Resolution
conversion
 Application
uncertainty
8
4
Data Preprocessing
Data Products
Application
database
ETL: Extract,
Data Warehouse
Transform, & Load
Commercial
Intelligence
Analysis
9
Data Preprocessing
1. Data Cleaning
2. Data Integration
 Data Quality
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
10
5
Data Visualization Quality
Data-Ink Ratio:
11
Data Error Types
 Missing data
– Replace with constants
– Replace with average value of attributes
– Regression
– Manual filling
 Noisy data
– Regression
– Outlier analysis
12
6
Visual Data Cleaning
 Using visualization techniques for data
cleaning
13
Data Integration
 Combining data from multiple sources
– Structural conflicts
– Schema differences
– Data Conflicts
– Repeated data
 Providing a uniform visualization
14
7
Data Integration Example
Form 1:
Form 2:
Integrated Form:
15
Data Storage
 File Systems
 Databases and DBMS
 Data Warehouse
16
8
CSV file (comma-separated values)
17
Structured Files
 XML files: eXtensible Markup Language
<note> <to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
 XML Extensions: IVOA VOtable (Space programs), KML
(Web maps), etc.
 Special formats: e.g. HDF (Hierarchical Data Format) for
scientific data
18
9
Databases
 A database is an organized collection of data. The
data is typically organized to model aspects of
reality in a way that supports processes requiring
information. For example, modelling the
availability of rooms in hotels in a way that
supports finding a hotel with vacancies.
19
Relational Databases
 DBMS: Database Management System
 Data definition and query languages (SQL)
20
10
Visualization of a database
Challenge: Interactivity!
Example: National Science Foundation Database:
21
Data Warehouse
 A data warehouse is a system used for
reporting and data analysis. It is a central
repository of integrated data from one or
more disparate sources.
22
11
Database vs Data Warehouse
Database
Data Warehouse
Purpose
Data operation
Information in the data
Application
Business
Analysis
Users
Employee, DB
Administrator
Analyst, manager,
executive
Functionalities
Daily operations
decision making support
Data
Current
Historical, time-variant
Access
Read, write, mean, etc
Read
Focus
Input, query
information output
Size
1 GB ~ < 1 TB
TBs
23
Data Analysis
 Statistical Analysis
 Exploratory Analysis
 Data Mining
24
12
Statistical Analysis
 Statistical description:
properties, parameters,
distribution, correlations.
 Statistical prediction: using
probability methods and
sampling theory to predict
statistical properties
(distribution, correlation,
parameters, forecasting, etc.)
25
Data Exploration Using Visualization
 Raw data drawing
 Statistical drawing
 Multi-view
26
13
Data Trajectory
27
Data Comparison
28
14
Trend and Patterns
29
Relations
30
15
Line Chart
31
Line Chart: Sunspots
32
16
Bar Chart
33
Sorted Bar Chart
34
17
Bar Chart Labeling
35
Bar Chart: Scale
36
18
Bar Chart : Variant
37
Stacked Bar Chart
38
19
Stacked Bar Chart
39
Stacked Chart
40
20
Pie Chart
41
Pie Chart
42
21
Pie Charts of Van
Gogh Paintings
43
Histogram
44
22
Contour Map
45
Scatter Plot Matrix
46
23
Reference Lines in Scatter Plot
47
Heatmap
48
24
Box Plot
49
Box Plot Variations
50
25
Multi-View
51
Data Mining
 “Data mining, also popularly referred to as knowledge
discovery from data (KDD), is the automated or
convenient extraction of patterns representing knowledge
implicitly stored or captured in large databases, data
warehouses, the Web, other massive repositories or data
streams” – J. Han and M Kamber, “Data Mining: Concepts and
Techniques”
Mining
Data
Model
Validation
Knowledge
52
26
Data Mining Tasks
Description:
Algorithm
Features
Data
Prediction
1
2
Training
Data
Trained
Model
Trained
Model
Model
New
Data
Features
53
Data Mining Tasks
Descriptive Tasks:
Concept Description
Association Mining
Clustering
Outlier Analysis
Predictive Tasks:
Classification
Evolution Analysis
54
27
Data Mining Methods
Statistical Method:
Regression, Parameter
Estimation
Machine Learning:
Decision tree, Neural
Networks
Statistical Learning:
Probability model,
Basian networks
Algorithmic Method: Kmean, Graph operations
55
Data Mining New Applications
 Text Mining – summarizes, navigates, and
clusters documents contained in a database.
 Web Mining – integrates data and text mining
within a Web site; enhances the Web site with
intelligent behavior, such as suggesting related
links or recommending new products to the
consumer
56
28
Data Visualization Pipeline
57
Looping Model
58
29
Visual Data Mining and Visual Analytics
 Users are involved in the data mining
process through visualization and user
interactions.
 Certain tasks are difficult to be automated
– Validation of clustering results
– Checking data focal points and noise
– Expert knowledge input, etc.
59
Visual Analytics Paradigm
Knowledge
Visualization
Data
Processing
& Mining
User
Interaction
Revision of Mining
Engine & Decision
Making
60
30
Visual Analytics
By Daniel Keim
61
Visual Data Mining Examples:
Visualizing data correlations
62
31
Visual Data Mining Example:
Biomarker detection
Z 
1
N
N

i 1
f ( D 1 , Pi )
Z 
1
7
7

i 1
f ( D 1 , Pi )
Z
1 7
 f ( D2 , Pi )
7 i 1
Z
Z
Z 
1
7
7

i 1
f ( Pi , D j ),
Z
1
 f (D1, Pi ),i 1, 2, 7
3 i
1
f (Pi , Dj ),i 1, 2, 7
3i
1
f (Pi, Dj ),i 1, 2, 3, 5, 7
5i
63
Visual Data Mining Example:
Facial feature detection for medical diagnosis
64
32
Visual Data Mining Example:
Concept detection in text data
65
Visual Data Mining Example:
Visualizing decision trees
66
33
67
34