Intelligent Information Processing
Cao Sanxing
[email protected]
Communication University of China

Course Methodology
„ Bilingual Lectures
„ Discussions
„ Seminars
„ Lab Work
„ Short Paper Assignments
„ Paper Examination
„ Teaching Assistant
Contents
„ Introduction to IIP
„ Data Mining and Knowledge Discovery
„ Fuzzy Theory and Applications
„ Knowledge Representation
„ Wavelet Analysis
„ Information Fusion
„ Evolutional Computing and Synergetic Computing
„ Media Knowledge Engineering
Bibliography
„ Textbook and References:
   „ Gao Jun (高隽): Introduction to Intelligent Information Processing Methods (智能信息处理方法导论), China Machine Press, 2004.6
   „ Chen Yongli, Li Jinggong et al. (陈永利, 李敬功等): Fuzzy Set Theory and Its Applications (模糊集理论及其应用), Science Press, 2006.9
   „ S. Mallat: A Wavelet Tour of Signal Processing, China Machine Press
   „ T. J. Ross: Fuzzy Logic with Engineering Applications (模糊逻辑及其工程应用), translated by Qian Tonghui et al., Publishing House of Electronics Industry
   „ T. Hastie et al.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, Berlin, 2001.3
   „ P. Giudici: Applied Data Mining: Statistical Methods for Business and Industry, Wiley, 2003
   „ T. Dean, J. Allen and Y. Aloimonos: Artificial Intelligence: Theory and Practice, Addison Wesley, reprinted by Publishing House of Electronics Industry, China, 2004.4
   „ J. Han, M. Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, HEP, 2001.10
   „ J. S. Albus: Intelligent Systems: Architecture, Design, Control, Wiley, PHEI, 2004.8
Chapter I
Introduction to IIP
„ Key Issues:
   „ Concept of Intelligent Information Processing
   „ The Intelligence Environment
„ 1.1 Intelligence
I believe that understanding of intelligence involves
understanding how knowledge is acquired,
represented, and stored; how intelligent behavior is
generated and learned; how motives, and emotions,
and priorities are developed and used; how sensory
signals are transformed into symbols; how symbols
are manipulated to perform logic; to reason about the
past, and plan for the future; and how the mechanisms
of intelligence produce the phenomena of illusion,
belief, hope, fear, and dreams-and yes even kindness
and love. To understand these functions at a
fundamental level, I believe, would be a scientific
achievement on the scale of nuclear physics, relativity,
and molecular genetics.
----- James Albus
   „ Intelligence = Wisdom + Capacity
   „ Intelligence = Behavior + Reasoning + Adaptability
„ 1.2 Artificial Intelligence
   „ Artificial Intelligence is the science that uses computers to simulate the functionality of thinking.
   „ Thinking is Computing.
      --------- Turing, 1946
   „ The Machine Learning Model
[Figure: The Machine Learning Model. Blocks: The Environment, Learning Modules, Knowledge Base(s), Executive Modules. Labels: Representation, Learning, Reasoning]
„ 1.3 Computational Intelligence
   „ CI = NN + EC + FS
      „ CI: Computational Intelligence
      „ NN: Neural Networks
      „ EC: Evolutionary Computation
      „ FS: Fuzzy Systems
   „ CI as a subset of AI?
   „ CI as a domain other than AI?
[Figure: The Environment; Carbon-based / Silicon-based Systems; Sensory; Intelligent Behaviors; World View (Data + Knowledge); Algorithms + PR; Reasoning, Abstraction, Summarization (CI)]

Chapter II
Data Mining and Knowledge Discovery
„ Content:
   „ 2.1 Why Data Mining
   „ 2.2 Concept and Basis of Data Mining
   „ 2.3 Data Mining Functionalities
   „ 2.4 Data Mining: Important Issues
   „ 2.5 Concept and Basis of Data Warehousing
   „ 2.6 The Multidimensional Data Model
   „ 2.7 Data Warehouse: Architecture and Implementation
   „ 2.8 Further Development of Data Warehousing and Mining
„ Key Issues:
   „ Relation of Data and Knowledge
   „ Concept of Data Warehouse
2.1. Why Data Mining
Motivation Leading to Data Mining
„ Wide availability of huge amounts of data
„ Imminent need for turning data into useful information
„ Data mining: a natural result of the evolution of information technology.
Necessity is the mother of Invention.
„ 1960s: Primitive file processing >> Database
„ 1970s: Hierarchical, network >> Relational, query languages
„ 1980s: Wide adoption of relational technology, research on new advanced data models
„ 1990s ~ new century: Great boost of database and information systems, OLAP, Data warehousing, Data mining
[Figure: evolution of database technology. Boxes: Data Collection and Database Creation; Database Management Systems; Advanced Database Systems; Web-based Database Systems; Data Warehousing and Data Mining; New Generation of Integrated Information Systems]
„ Data warehouse: a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.
„ Data warehousing technologies include:
   „ Data cleansing
   „ Data integration
   „ On-Line Analytical Processing
„ OLAP: Analysis techniques with functionalities such as
   „ Summarization,
   „ Consolidation,
   „ Aggregation,
   „ Ability to view information from different angles
„ OLAP includes the basic functionalities of data mining, and the knowledge management based on data models.
DATA RICH… …INFORMATION POOR
„ The fast-growing, tremendous amount of data, collected and stored in large and numerous databases, has far exceeded human capability for comprehension without powerful tools.
[Figure: DATABASES… TOMBS… of data; EXPERT SYSTEMS / KNOWLEDGE BASES…; …DATA WAREHOUSES WITH DATA MINING; …GOLDEN NUGGETS of knowledge]
„ By reshaping databases into a data warehouse, and with the introduction of effective data mining techniques, knowledge can be discovered in huge amounts of data and then put to use.
2.2. Concept and Basis of Data Mining
„ Key Issues:
   „ Data Mining Process
   „ The Basis of Data Mining
„ Data mining: Extracting (mining) Knowledge from Large Amounts of Data.
   „ Knowledge mining from databases
   „ Knowledge extraction
   „ Data/pattern analysis
   „ Data archaeology
   „ Data dredging
   „ KDD: Knowledge Discovery in Databases
„ Data Mining Process:
   „ Data cleaning
   „ Data integration
   „ Data transformation
   „ Data mining
   „ Pattern evaluation
   „ Knowledge presentation
[Figure: the data mining process. Labels: Databases, Flat files, Cleaning and Integration, Selection and Transformation, Data Mining, Patterns, Evaluation and Presentation, Knowledge]
„ Data mining is sometimes interactive with the user or the knowledge base.
„ A broader view: Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.
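The process steps listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions, not a method from the lecture: the record fields (`customer`, `amount`), the two data sources, and the toy "pattern" (customers whose total reaches a threshold) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
source_b = [{"customer": "c1", "amount": 80.0}, {"customer": "c3", "amount": 30.0}]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def transform(records):
    """Data transformation: aggregate amounts per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold):
    """Data mining (toy): customers whose total reaches the threshold."""
    return {c: t for c, t in totals.items() if t >= threshold}

def evaluate(patterns, min_size=1):
    """Pattern evaluation: keep the result only if it is non-trivial."""
    return patterns if len(patterns) >= min_size else {}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{c}: total {t:.2f}" for c, t in sorted(patterns.items())]

records = integrate(clean(source_a), clean(source_b))
knowledge = present(evaluate(mine(transform(records), threshold=100.0)))
print(knowledge)  # ['c1: total 200.00']
```

Each function stands in for one stage; in a real system every stage would be far richer, and the user or a knowledge base could steer any of them.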
„ Data mining architecture:
   „ Data description and storage basis: Database, data warehouse or other information repository
   „ Database or data warehouse server
   „ Knowledge base
   „ Data mining engine
   „ Pattern evaluation module
   „ GUI
„ From a data warehouse perspective, data mining can be viewed as an advanced stage of OLAP.
[Figure: data mining system architecture. Blocks: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge base; Database or data warehouse server; Data cleaning, Data integration, Filtering; Database; Data warehouse]
„ Target of knowledge mined:
   „ Decision making
   „ Process control
   „ Information management
   „ Query processing
„ Data Mining is interdisciplinary:
   „ Database technology, Statistics, Machine Learning, High-performance computing, Pattern recognition, Neural networks, Data visualization, Information retrieval, Image and signal processing, Spatial data analysis.
„ Therefore, data mining is considered one of the most important frontiers in database systems.
The Basis of Data Mining
„ Data mining could be carried out in different types of data stores.
   „ Relational Databases
   „ Data Warehouses
   „ Transactional Databases
   „ Advanced Database Systems and Advanced Database Applications
„ Advanced Database Systems and Advanced Database Applications
   „ Object-Oriented Databases
   „ Object-Relational Databases
   „ Spatial Databases
   „ Temporal Databases and Time-Series Databases
   „ Text Databases and Multimedia Databases
   „ Heterogeneous Databases and Legacy Databases
   „ The World Wide Web
„ Relational Databases
   „ DBMS
      „ Data Storage, Data Access (Concurrent, Shared, Distributed), Consistency and Security Ensuring
   „ Relational Database Constitution
      „ Tables
         ƒ Attributes, Tuples
   „ Relational Data Access:
      „ Database queries written in a relational query language, usually SQL, or with the assistance of GUI.
„ Mining Relational Databases:
   „ Upon the basis of statistical relational queries:
      „ sum, avg, count, max, min
   „ Further: searching for trends and data patterns
      „ Example: Analysis on customer data to predict the credit risk of new customers based on their income, age and previous credit information.
      „ Deviation Detection
„ Mining Data Warehouses
   „ A data warehouse facilitates the mining of useful knowledge by:
      „ Collecting information from multiple sources;
      „ Storing the information under a unified schema;
      „ Preparing the data by cleaning, transformation, and integration;
      „ Organizing data around major subjects;
      „ Providing information from a historical perspective, typically summarized
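The statistical relational queries named above (sum, avg, count, max, min) can be tried out with Python's built-in `sqlite3` module. The `customer` table, its columns and its rows are hypothetical and serve only to show the five aggregates in one SQL query.

```python
import sqlite3

# In-memory database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("Li", 25, 3000.0), ("Wang", 32, 5200.0),
     ("Zhang", 41, 4100.0), ("Chen", 28, 3700.0)],
)

# The five basic statistical aggregates in a single relational query.
row = conn.execute(
    "SELECT SUM(income), AVG(income), COUNT(*), MAX(income), MIN(income) "
    "FROM customer"
).fetchone()
total, mean, count, highest, lowest = row
print(total, mean, count, highest, lowest)  # 16000.0 4000.0 4 5200.0 3000.0
conn.close()
```

Such queries summarize the data; searching for trends and patterns, as in the credit-risk example, goes beyond what these aggregates alone provide.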
„ Mining Data Warehouses
   „ A data warehouse is usually modeled by a multidimensional database structure.
      „ Dimension <-> attribute / set of attributes
      „ Cell <-> the value of some aggregate measure
   „ Actual physical structure of a data warehouse may be:
      „ A relational data store
      „ A multidimensional data cube
[Figure: data sources in Beijing, Shanghai, Guangzhou and Hong Kong; Clean, Transform, Integrate, Load; Data warehouse; Query and Analysis tools; Clients]
„ Mining Data Warehouses
   „ Data Warehouses and Data Marts
      „ A data warehouse collects information about subjects that span an ENTIRE ORGANIZATION
         ƒ Enterprise-wide
      „ A data mart is a departmental subset of a data warehouse. It focuses on selected subjects
         ƒ Department-wide
   „ Important: Data warehouses are suitable for On-Line Analytical Processing (thanks to the multidimensional data views and the pre-computed summarized data)
      „ Drill-down
      „ Roll-up
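Roll-up and drill-down can be illustrated on a tiny data set. The sketch below assumes a hypothetical time hierarchy (month to quarter) and hypothetical unit counts; it is not taken from the lecture and only shows the direction of the two operations.

```python
from collections import defaultdict

# Hypothetical detailed cells: (city, month) -> units sold.
sales = {
    ("Beijing", "2004-01"): 10, ("Beijing", "2004-02"): 12,
    ("Beijing", "2004-04"): 9,
    ("Shanghai", "2004-01"): 7, ("Shanghai", "2004-03"): 11,
    ("Shanghai", "2004-05"): 8,
}

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn', one step up the time hierarchy."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(detail):
    """Roll-up: aggregate month-level cells into quarter-level cells."""
    summary = defaultdict(int)
    for (city, month), units in detail.items():
        summary[(city, quarter_of(month))] += units
    return dict(summary)

def drill_down(detail, city, quarter):
    """Drill-down: return the month-level cells behind one quarter cell."""
    return {key: units for key, units in detail.items()
            if key[0] == city and quarter_of(key[1]) == quarter}

by_quarter = roll_up(sales)
print(by_quarter[("Beijing", "2004-Q1")])        # 22
print(drill_down(sales, "Beijing", "2004-Q1"))   # the two month cells behind it
```

A warehouse would typically keep the quarter-level totals pre-computed, which is part of why it suits OLAP.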
„ Mining Transactional Databases
   „ A transactional database consists of a file where each record represents a transaction.
   „ A transaction typically includes a unique transaction ID and a list of items making up the transaction.
   „ Transactional database mining is best suited to Market Basket Data Analysis.
„ Mining Advanced Database Systems and Applications
   „ Object-oriented Databases
   „ Object-relational Databases
   „ Spatial Databases
   „ Temporal / Time-Series Databases
   „ Text / Multimedia Databases
   „ Heterogeneous / Legacy Databases
   „ World Wide Web
2.3. Data Mining Functionalities
„ Data Mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
„ Basic Data Mining tasks categories:
   „ Descriptive
   „ Predictive
„ Descriptive Mining Tasks:
   „ To characterize the general properties of data in the database or other mining basis.
„ Predictive Mining Tasks:
   „ To perform the inference on the current data in order to make predictions.
REQUIREMENTS:
„ It is important for a data mining system to be
capable of mining multiple kinds of patterns,
to accommodate different user expectations
or applications.
„ Data Mining systems should be able to
discover patterns at various granularities.
„ Data Mining systems should also allow users
to specify hints to guide or focus the
search for interesting patterns.
„ Different kinds of Data mining functionalities:
   „ Concept/Class Description
   „ Association Analysis
   „ Classification and Prediction
   „ Cluster Analysis
   „ Outlier Analysis
   „ Evolution Analysis
„ 2.3.1. Concept/Class Description: Characterization and Discrimination
   „ Data can be associated with Classes or Concepts.
   „ It can be useful to describe individual classes and concepts in summarized, concise and yet precise terms.
   „ Such descriptions of a class or a concept are called Class/concept descriptions.
   „ Concept/Class Descriptions could be derived via
      „ Data characterization
      „ Data discrimination
      „ Both
   „ Data characterization: A summarization of the general characteristics or features of a target class of data.
   „ Methods of data characterization:
      „ Data cube-based OLAP roll-up
      „ Attribute-oriented induction
   „ Data Discrimination: A comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
   „ Discrimination descriptions are usually expressed in rule form: Discriminant Rules.
„ 2.3.2. Association Analysis
   „ Association Analysis: the Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.
   „ Formal Description:
      X => Y
      A1 ∧ A2 ∧ … ∧ Am => B1 ∧ B2 ∧ … ∧ Bn
   „ Example
      age (x, “20…29”) ∧ income (x, “20k…29k”) => buys (x, “MP3 Player”)
      [support = 2%, confidence = 60%]
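The support and confidence figures attached to an association rule X => Y can be computed by simple counting: support is the fraction of all records that satisfy both X and Y, and confidence is the fraction of records satisfying X that also satisfy Y. The sketch below uses five hypothetical records encoded as sets of attribute-value strings; the encoding and the data are assumptions, not the lecture's data set.

```python
def support(records, items):
    """Fraction of all records that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for r in records if items <= r) / len(records)

def confidence(records, antecedent, consequent):
    """Among records matching the antecedent, the fraction that also match the consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    if n_antecedent == 0:
        return 0.0
    return sum(1 for r in records if both <= r) / n_antecedent

# Hypothetical customer records as sets of attribute-value conditions.
records = [
    frozenset({"age=20..29", "income=20k..29k", "buys=MP3"}),
    frozenset({"age=20..29", "income=20k..29k"}),
    frozenset({"age=20..29", "income=20k..29k", "buys=MP3"}),
    frozenset({"age=30..39", "income=30k..39k"}),
    frozenset({"age=20..29", "income=30k..39k", "buys=MP3"}),
]
x = {"age=20..29", "income=20k..29k"}
y = {"buys=MP3"}
rule_support = support(records, x | y)      # 2 of 5 records
rule_confidence = confidence(records, x, y)  # 2 of the 3 records matching x
print(rule_support, rule_confidence)
```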
„ 2.3.3. Classification and Prediction
   „ Classification: The process of finding a set of Models that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
   „ The derived model is based on the analysis of a set of training data.
   „ The derived model could be expressed in:
      „ Classification (IF-THEN) rules,
      „ Decision trees,
      „ Mathematical formulae,
      „ Neural networks
   „ Classification can be used for predicting the class label of data objects.
   „ This is highly relevant to prediction.
   „ Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.
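To make "a model derived from training data and expressed as an IF-THEN rule" concrete, here is a deliberately tiny sketch: a one-attribute, one-threshold classifier (a decision stump, which is a technique chosen for illustration and not one named in the lecture). The income values and the "high"/"low" credit-risk labels are hypothetical.

```python
def fit_stump(values, labels):
    """Find threshold t and labels (left, right) minimizing training errors
    for the rule: IF value <= t THEN left ELSE right."""
    classes = sorted(set(labels))
    best = None
    for t in sorted(set(values)):
        for left, right in ((classes[0], classes[-1]), (classes[-1], classes[0])):
            errors = sum((left if v <= t else right) != y
                         for v, y in zip(values, labels))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1], best[2], best[3]

def predict(model, value):
    """Apply the learned IF-THEN rule to a new object."""
    t, left, right = model
    return left if value <= t else right

# Hypothetical training data: income (in thousands) and known credit risk.
incomes = [15, 22, 28, 35, 48, 60]
risks = ["high", "high", "high", "low", "low", "low"]

model = fit_stump(incomes, risks)
print(f"IF income <= {model[0]} THEN {model[1]} ELSE {model[2]}")
print(predict(model, 30), predict(model, 20))  # low high
```

The same learned threshold could equally be drawn as a one-node decision tree, which is why the representations listed above are often interchangeable for simple models.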
„ 2.3.4. Cluster Analysis
   „ To analyze data objects without consulting a known class label. To find different patterns or clusters that could be used in objects’ classification.
   „ To generate class labels.
   „ Each cluster that is formed can be viewed as a class of objects.
   „ Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
„ 2.3.5. Outlier Analysis
   „ Outliers: data objects that do not comply with the general behavior or model of the data.
   „ Most data mining methods discard outliers as noise or exceptions.
   „ In some applications, such as Fraud Detection, rare events or outliers could be more interesting.
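One simple way to flag objects that do not comply with the general behavior of the data is a z-score rule: mark every value that lies more than a chosen number of standard deviations from the mean. The sketch below assumes roughly bell-shaped data; the threshold of 2.5 and the amounts are hypothetical, and real fraud detection would need far more than this.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.5):
    """Return the values whose distance from the mean exceeds
    z_threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [10, 11, 9, 10, 12, 11, 10, 9, 11, 50]
print(find_outliers(amounts))  # [50]
```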
   „ Outliers may be detected using statistical tests that assume a distribution or probability model for the data,
   „ or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
„ 2.3.6. Evolution Analysis
   „ To describe and model the trends or regularities for objects whose behavior changes over time.
      „ Time-series data analysis
      „ Sequence or periodicity pattern matching
      „ Similarity-based data analysis.
2.4. Data Mining: Important Issues
2.4.1. Filtering of Patterns Found
„ Brief view
   „ A data mining system could potentially generate thousands or even millions of patterns or rules.
   „ Are all the patterns interesting?
   „ What makes a pattern interesting?
   „ Can a data mining system generate ALL the interesting patterns?
   „ Can a data mining system generate ONLY the interesting patterns?
„ A pattern is interesting if it is
   „ Easily understood by humans.
   „ Valid on new or test data with some degree of certainty.
   „ Potentially useful.
   „ Novel.
„ AN INTERESTING PATTERN ALWAYS REPRESENTS KNOWLEDGE.
„ There are some Objective Measures of Pattern Interestingness.
   „ Support
   „ Confidence
„ Generally speaking, each interestingness measure is associated with a Threshold, which may be controlled by users.
„ Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value.
„ Objective measures: INSUFFICIENT … unless combined with subjective measures
„ Subjective interestingness measures are based on user beliefs in the data.
„ These measures find patterns interesting if they are UNEXPECTED (contradicting a user’s belief).
„ Subjective measures reflect the needs and interests of a particular user.
„ Completeness of a data mining algorithm
   „ Can a data mining system generate ALL the interesting patterns?
   „ Unrealistic and inefficient
   „ User-provided constraints and interestingness measures should be used to focus the search.
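The idea of user-controlled thresholds on objective measures can be sketched as a plain filter over candidate rules. The `Rule` type, the two extra rules and the threshold values below are hypothetical; only the first rule's support and confidence come from the MP3 player example.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    description: str
    support: float
    confidence: float

def filter_rules(rules, min_support, min_confidence):
    """Keep only the rules that meet both user-chosen thresholds."""
    return [r for r in rules
            if r.support >= min_support and r.confidence >= min_confidence]

candidates = [
    Rule("age 20-29 and income 20k-29k => buys MP3 player", 0.02, 0.60),
    Rule("hypothetical rule A (rare but reliable)", 0.001, 0.90),
    Rule("hypothetical rule B (common but weak)", 0.05, 0.30),
]

kept = filter_rules(candidates, min_support=0.01, min_confidence=0.5)
for rule in kept:
    print(rule.description)
```

Rule A shows why objective measures alone are insufficient: it is dropped for low support, yet a fraud analyst might find exactly such a rare case interesting.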
„ Optimization of a data mining system:
   „ Can a data mining system generate ONLY the interesting patterns?
   „ Progress has been made in this direction.
   „ However, it remains a challenging issue in data mining.
2.4.2. Classification of Data Mining Systems
[Figure: Data Mining at the center, surrounded by Database technology, Information science, Statistics, Machine learning, Visualization, Other disciplines]
„ Classification according to the kinds of databases mined
   „ Relational, transactional, object-oriented, object-relational, data warehouse mining.
   „ Spatial, time-series, text, multimedia, WWW mining.
„ Classification according to the kinds of knowledge mined.
   „ Characterization, discrimination, association, classification, clustering, outlier analysis, evolution analysis.
   „ Different granularities (levels of abstraction) of knowledge mined.
      „ General knowledge, primitive-level knowledge, knowledge at multiple levels.
   „ Systems that mine data regularities vs. Systems that mine data irregularities
„ Classification according to the kinds of techniques utilized.
   „ Degree of user interaction involved:
      „ Autonomous systems, interactive exploratory systems, query-driven systems.
   „ Methods of data analysis employed:
      „ Database-oriented, data warehouse-oriented, machine learning, statistics, visualization, pattern recognition, neural networks…
„ Classification according to the applications adapted.
   „ Finance data mining systems, telcos, DNA, stock markets, web, e-mail...
   „ Media? Media!
2.4.3. Other Major Issues in Data Mining and Data Warehousing
„ Mining methodology and user interaction issues
   „ Mining different kinds of knowledge in databases
   „ Interactive mining of knowledge at multiple levels of abstraction
   „ Incorporation of background knowledge
   „ Data mining query languages and ad hoc data mining
   „ Presentation and visualization of data mining results
   „ Handling noisy or incomplete data
   „ Pattern evaluation: the interestingness problem
„ Performance Issues
   „ Efficiency and scalability of data mining algorithms
   „ Parallel, distributed and incremental mining algorithms
„ Issues relating to the diversity of database types
   „ Handling of relational and complex types of data
   „ Mining information from heterogeneous databases and global information systems
„ B.1. What is a Data Warehouse?
„ B.2. The Multidimensional Data Model
„ B.3. Data Warehouse Architecture
„ B.4. Data Warehouse Implementation
„ B.5. Further Development of Data Cube Technology
„ B.6. From Data Warehousing to Data Mining
2.5 Concept and Basis of Data Warehousing
„ Data Warehousing provides architectures and
tools for business executives to
systematically organize, understand and use
their data to make strategic decisions.
„ In the last decade, many firms have spent large budgets on building enterprise-wide data warehouses.
„ Data warehousing is considered to be THE LATEST MUST-HAVE MARKETING WEAPON.
„ Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization’s operational databases.
„ Data warehouse systems allow for the integration of a variety of application systems that support information processing by providing a solid platform of consolidated historical data for analysis.
„ What is a data warehouse?
„ Definition of Data Warehouse by W. H. Inmon:
   „ A Data Warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision making process.
„ Subject-oriented:
   „ A data warehouse is organized around major subjects, such as customer, supplier, product and sales.
   „ It is not a database that concentrates on the day-to-day operations and transaction processing of an organization.
   „ Data warehouses typically provide a simple and concise view around particular subject issues by excluding data not useful in decision support.
„ Integrated:
   „ A data warehouse is usually constructed by integrating multiple heterogeneous sources.
   „ Data cleansing and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures and so on.
„ Time-variant (time-varying, dynamic):
   „ Data are stored in a data warehouse to provide information from a historical perspective, usually a period of several years.
   „ Every key structure in the data warehouse contains an element of time either implicitly or explicitly.
„ Nonvolatile (non-volatile, retained):
   „ A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.
   „ Due to this separation, a data warehouse does not require transaction processing, recovery and concurrency control mechanisms.
   „ It usually requires only 2 operations in data accessing:
      „ Initial Data Loading
      „ Data Access
„ What is data warehousing?
„ Data warehousing is the process of Constructing and Using data warehouses.
   „ The construction of a data warehouse requires data integration, data cleaning, and data consolidation.
   „ The utilization of a data warehouse often necessitates a collection of decision support technologies.
„ Data warehouses are used for:
   „ Increasing customer focus
   „ Repositioning products and managing product portfolios
   „ Analyzing operations and looking for sources of profit
   „ Managing the customer relationships
„ Differences between Operational Database Systems and Data Warehouses
   „ OLTP vs. OLAP
      „ Users and System orientation
      „ Data Contents
      „ Database Design
      „ View
      „ Access Patterns
Feature | OLTP | OLAP
Characteristic | operational processing | informational processing
Orientation | transaction | analysis
User | clerk, DBA, database professional | knowledge worker (e.g., manager, executive, analyst)
Function | day-to-day operations | long-term informational requirements, decision support
DB design | ER based, application-oriented | star/snowflake, subject-oriented
Data | current; guaranteed up-to-date | historical; accuracy maintained over time
Summarization | primitive, highly detailed | summarized, consolidated
View | detailed, flat relational | summarized, multidimensional
Unit of work | short, simple transaction | complex query
Access | read/write | mostly read
Focus | data in | information out
Operations | index/hash on primary key | lots of scans
Number of records accessed | tens | millions
Number of users | thousands | hundreds
DB size | 100MB to GB | 100GB to TB
Priority | high performance, high availability | high flexibility, end-user autonomy
Metric | transaction throughput | query throughput, response time

„ Why have a separate Data warehouse?
   „ To help promote the high performance of both operational and analytical systems.
   „ THEY ARE DIFFERENT
      „ Operational Databases are tuned for:
         ƒ Indexing, searching, queries
         ƒ Concurrency control
         ƒ Raw data processing
      „ Data warehouses are designed to support:
         ƒ Complex queries
         ƒ Calculations over large groups of data
         ƒ Multidimensional Data views
         ƒ Historical, consolidated data processing
2.6 The Multidimensional Data Model
„ Multidimensional Data Model is known as the basis of Data warehouses and OLAP tools.
„ The Multidimensional Data Model views data in the form of a Data Cube.
„ Basic Ideas of Data Cubes
   „ What is a data cube?
      „ A Data Cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
   „ Dimensions: The perspectives or entities with respect to which an organization wants to keep records.
      „ time, item, branch, location
   „ Dimension Table: The relational table that implements a dimension.
   „ A Multidimensional Data Model is typically organized around a Central Theme.
   „ Facts: Numerical measures, or quantities by which we want to analyze relationships between dimensions.
   „ Fact Tables: Relational tables that store the names of the facts, or measures, as well as keys to each of the related dimension tables.
„ 2-D view: time (quarter) by item (type), Location=“Beijing”

time (quarter) | home entertainment | computer | phone | security
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

„ 3-D view: time (quarter) by item (type), for each location

Location | Time | home ent. | comp. | phone | sec.
“Shanghai” | Q1 | 854 | 882 | 89 | 623
“Shanghai” | Q2 | 943 | 890 | 64 | 698
“Shanghai” | Q3 | 1032 | 924 | 59 | 789
“Shanghai” | Q4 | 1129 | 992 | 63 | 870
“Guangzhou” | Q1 | 1087 | 968 | 38 | 872
“Guangzhou” | Q2 | 1130 | 1024 | 41 | 925
“Guangzhou” | Q3 | 1034 | 1048 | 45 | 1002
“Guangzhou” | Q4 | 1142 | 1091 | 54 | 984
“Hongkong” | Q1 | 818 | 746 | 43 | 591
“Hongkong” | Q2 | 894 | 769 | 52 | 682
“Hongkong” | Q3 | 940 | 795 | 58 | 728
“Hongkong” | Q4 | 978 | 864 | 59 | 784
“Beijing” | Q1 | 605 | 825 | 14 | 400
“Beijing” | Q2 | 680 | 952 | 31 | 512
“Beijing” | Q3 | 812 | 1023 | 30 | 501
“Beijing” | Q4 | 927 | 1038 | 38 | 580

[Figure: 3-D data cube of the table above. Axes: Location (cities): BJ, SH, GZ, HK; Time (quarters): Q1, Q2, Q3, Q4; Item (types): H, C, P, S]
BJ = Beijing, SH = Shanghai, GZ = Guangzhou, HK = Hong Kong
H = Home entertainment, C = Computer, P = Phone, S = Security
„ 4-D Cube: see the book
„ Cuboid
   „ Construction of a lattice of cuboids
   „ Group By
   „ Base Cuboid: the cuboid that holds the lowest level of summarization
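The lattice of cuboids can be made concrete with the 2-D Beijing table: every subset of the dimensions defines one group-by, so a cube with two dimensions has 2^2 = 4 cuboids. The sketch below is an illustration only; the function and variable names are assumptions, and the fully aggregated cuboid is referred to by its common name, the apex cuboid.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "item")
ITEMS = ("home entertainment", "computer", "phone", "security")
ROWS = {
    "Q1": (605, 825, 14, 400),
    "Q2": (680, 952, 31, 512),
    "Q3": (812, 1023, 30, 501),
    "Q4": (927, 1038, 38, 580),
}
# Base cuboid: one cell per (quarter, item type), the lowest level of summarization.
base = {(q, item): v for q, values in ROWS.items() for item, v in zip(ITEMS, values)}

def group_by(cells, keep):
    """Aggregate the base cuboid, keeping only the dimension positions in `keep`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# Build every cuboid of the lattice, from the base cuboid down to the apex.
lattice = {}
for n in range(len(DIMENSIONS), -1, -1):
    for keep in combinations(range(len(DIMENSIONS)), n):
        name = tuple(DIMENSIONS[i] for i in keep)
        lattice[name] = group_by(base, keep)

print(lattice[("time",)])  # totals per quarter
print(lattice[("item",)])  # totals per item type
print(lattice[()])         # apex cuboid: {(): 8968}
```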
„ Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
   „ 2-D Relational Databases: Entity-Relationship
   „ Data Warehouse: Multidimensional Data Model
„ The Star Schema:
   „ In star schema, a data warehouse contains:
      „ A large central table (fact table)
      „ A set of smaller attendant tables (dimension tables), one for each dimension
[Figure: star schema]
   sales (fact table): time_key, item_key, branch_key, location_key, dollars_sold, units_sold
   time (dimension table): time_key, day, day_of_the_week, month, quarter, year
   item (dimension table): item_key, item_name, brand, type, supplier_type
   branch (dimension table): branch_key, branch_name, branch_type
   location (dimension table): location_key, street, city, province_or_state, country
„ The Snowflake Schema:
   „ It is a variant of the Star Schema Model. In a Snowflake Schema Model, some dimensions are normalized, thus the data are further split into additional tables.
   „ The major difference between Snowflake and Star schema models is:
      „ The dimension tables of the snowflake model may be kept in normalized form, for the purpose of reducing redundancies.
      (See the detailed explanation in book)
[Figure: snowflake schema]
   sales (fact table): time_key, item_key, branch_key, location_key, dollars_sold, units_sold
   time (dimension table): time_key, day, day_of_the_week, month, quarter, year
   item (dimension table): item_key, item_name, brand, type, supplier_key
   supplier (dimension table): supplier_key, supplier_type
   branch (dimension table): branch_key, branch_name, branch_type
   location (dimension table): location_key, street, city
   city (dimension table): city_key, city, province_or_state, country
„ The Fact Constellation Schema
„ Sophisticated applications may require multiple fact tables to share dimension tables.
„ This kind of schema could be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
[Figure: fact constellation schema of a sales and shipping data warehouse]
sales (fact table): time_key, item_key, branch_key, location_key, dollars_sold, units_sold
shipping (fact table): item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time (dimension table): time_key, day, day_of_the_week, month, quarter, year
item (dimension table): item_key, item_name, brand, type, supplier_type
branch (dimension table): branch_key, branch_name, branch_type
location (dimension table): location_key, street, city, province_or_state, country
shipper (dimension table): shipper_key, shipper_name, location_key, shipper_type
„ Examples for Defining Star, Snowflake and Fact Constellation Schemas
„ The Data Mining Query Language ‘DMQL’
„ Cube Definition:
define cube <cube_name> [<dimension_list>]: <measure_list>
„ Dimension Definition:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
„ Example for Star Schema Definition
[Figure: star schema for sales, with the sales fact table and the time, item, branch and location dimension tables]
define cube sale_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
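As a rough illustration of how a star schema is queried, the sketch below builds a cut-down version of the sales schema in an in-memory SQLite database and aggregates the dollars_sold measure by joining the fact table to two of its dimension tables. The table names time_dim and item_dim, the reduced column lists and all rows are assumptions made for this example; they are not taken from the course material.

```python
import sqlite3

# Hypothetical miniature of the sales star schema: one fact table plus two
# of its dimension tables (time and item). All rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
INSERT INTO time_dim VALUES (1, 'Q1', 2004), (2, 'Q2', 2004);
INSERT INTO item_dim VALUES (10, 'laptop', 'computer'), (11, 'handset', 'phone');
INSERT INTO sales VALUES (1, 10, 1200.0, 2), (1, 11, 300.0, 3),
                         (2, 10, 600.0, 1), (2, 11, 100.0, 1);
""")


def dollars_by_quarter_and_type(connection):
    """Join the fact table to its dimensions and sum the measure per group."""
    return connection.execute("""
        SELECT t.quarter, i.type, SUM(s.dollars_sold)
        FROM sales AS s
        JOIN time_dim AS t ON s.time_key = t.time_key
        JOIN item_dim AS i ON s.item_key = i.item_key
        GROUP BY t.quarter, i.type
        ORDER BY t.quarter, i.type
    """).fetchall()


result = dollars_by_quarter_and_type(conn)
for quarter, item_type, dollars in result:
    print(quarter, item_type, dollars)
```

The fact table holds only keys and measures; every descriptive attribute used in GROUP BY comes from a dimension table, which is the point of the star layout.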
„ Example for Snowflake Schema Definition
[Figure: snowflake schema for sales, with the sales fact table; the time, item, branch and location dimension tables; and the supplier and city dimension tables split off from item and location]
define cube sale_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, city, province_or_state, country))
„ Example for Fact Constellation Schema Definition
[Figure: fact constellation schema, with the sales and shipping fact tables sharing the time, item and location dimension tables, plus the branch and shipper dimension tables]
define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]: dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
„ Categorization and Computation of Data Cube Measures
„ A data cube measure is a numerical function that can be evaluated at each point in the data cube space.
„ A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.
„ Measures can be organized into 3 categories:
„ Distributive: count(), sum(), min(), max()
„ Algebraic: avg(), min_N(), max_N(), standard_deviation()
„ Holistic: median(), mode(), rank()
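The practical difference between the three categories shows up when the data are split into partitions and aggregated piecewise. The following sketch uses two made-up partitions (chunk_a and chunk_b are assumptions for illustration): a distributive measure is recovered by applying the same function to the partial results, an algebraic measure is recovered from a fixed number of distributive ones (here sum and count), and a holistic measure such as the median cannot in general be recovered from per-partition medians.

```python
from statistics import median

# Two hypothetical partitions of one measure column.
chunk_a = [4, 8, 15]
chunk_b = [16, 23, 42, 50]
whole = chunk_a + chunk_b

# Distributive: the sum of the partial sums is the sum of the whole.
total_from_parts = sum(sum(chunk) for chunk in (chunk_a, chunk_b))

# Algebraic: avg() is derived from two distributive measures, sum and count.
partials = [(sum(chunk), len(chunk)) for chunk in (chunk_a, chunk_b)]
avg_from_parts = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median([median(chunk_a), median(chunk_b)])
true_median = median(whole)

print(total_from_parts, avg_from_parts, median_of_medians, true_median)
```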
„ Introduction of Concept Hierarchies
„ A Concept Hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
[Figure: concept hierarchy for the dimension location, with the levels all, country, province_or_state and city; further nodes are elided as … in the figure]
country: Canada, USA
province_or_state: British Columbia, Ontario (Canada); New York, Illinois (USA)
city: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)
„ Concept Hierarchies that are common to many applications may be predefined in the data mining system.
„ Concept Hierarchies could also be defined by discretizing or grouping values for a given dimension or attribute.
[Figure: (a) a hierarchy for location with the levels street, city, province_or_state, country; (b) a lattice for time with the levels day, week, month, quarter, year]
[Figure: concept hierarchy for a price attribute: the range ($0…$1000] is divided into $200-wide ranges such as ($0…$200], each of which is divided into $100-wide ranges such as ($0…$100] and ($100…$200]]
B.2. The Multidimensional Data
Model
„ OLAP Operations in the Multidimensional Data Model
„ Roll-up
„ Drill-down
„ Slice and dice
„ Pivot (Rotate)
„ Other OLAP operations (drill-across, drill-through)
„ Roll-up
Aggregation on a data cube:
ƒ Climbing up a concept hierarchy
ƒ Dimension reduction
„ Drill-down
Navigation from less detailed data to more detailed data
ƒ Stepping down a concept hierarchy
ƒ Introducing additional dimensions
„ Slice and dice
Slice: a selection on one dimension, resulting in a subcube
Dice: definition of a subcube by a selection on two or more dimensions
„ Pivot (Rotate)
Visualization, rotating the data axes in view, in order to provide an alternative presentation of the data.
[Figure: starnet query model with one radial line per dimension: location (street, city, province_or_state, country, continent), customer (name, category, group), item (name, brand, category, type), time (day, month, quarter, year)]
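The OLAP operations listed above can be sketched on a tiny cube held in a Python dictionary. Everything here is an illustrative assumption: the cell values are made up, the cube has only three dimensions (quarter, item type, city), and the helper names roll_up, drop_dimension, slice_cube, dice and pivot are hypothetical rather than part of any OLAP product.

```python
from collections import defaultdict

# Cells of a tiny hypothetical cube: (quarter, item_type, city) -> units_sold.
cube = {
    ("Q1", "computer", "Vancouver"): 5,
    ("Q1", "computer", "Toronto"): 7,
    ("Q1", "phone", "Chicago"): 2,
    ("Q2", "computer", "Chicago"): 4,
    ("Q2", "phone", "Vancouver"): 1,
    ("Q2", "phone", "New York"): 3,
}

# One step of a location concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}


def roll_up(cells, axis, mapping):
    """Roll-up by climbing a concept hierarchy on one dimension."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + (mapping[key[axis]],) + key[axis + 1:]] += value
    return dict(out)


def drop_dimension(cells, axis):
    """Roll-up by dimension reduction: aggregate one dimension away."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_cube(cells, axis, member):
    """Slice: select a single member on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == member}


def dice(cells, allowed):
    """Dice: select sets of members on two or more dimensions."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in members for axis, members in allowed.items())}


def pivot(cells, order):
    """Pivot: reorder the axes without changing any value."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}


by_country = roll_up(cube, 2, CITY_TO_COUNTRY)
without_city = drop_dimension(cube, 2)
q1_only = slice_cube(cube, 0, "Q1")
q2_phones = dice(cube, {0: {"Q2"}, 1: {"phone"}})
city_first = pivot(cube, (2, 0, 1))
```

Drill-down is the inverse of roll_up and needs the more detailed cells to be available, which is why it is not shown as a function of the aggregated cube.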
B.3. Data Warehouse Architecture
„ Basics for the Design and Construction of Data Warehouses
„ The Design of a Data Warehouse: A Business Analysis Framework
To design an effective data warehouse, one needs to understand and analyze business needs and construct a business analysis framework.
„ Four views regarding the design of a data warehouse:
„ The top-down view
„ The data source view
„ The data warehouse view
„ The business query view
„ Building and using a data warehouse is a complex task since it requires
„ Business skills;
„ Technology skills;
„ Program management skills
„ The Process of Data Warehouse Design
„ Choose a business process to model;
„ Choose the grain (granularity) of the business process;
„ Choose the dimensions that will apply to each fact table record;
„ Choose the measures that will populate each fact table record.
„ A Three-tier Data Warehouse Architecture
What is a data warehouse architecture like?
[Figure: three-tier data warehouse architecture. Bottom tier: the Data Warehouse Server, with the Data Warehouse and Data Marts loaded from Operational Databases and External Sources, plus Administration, Monitoring and a Metadata Repository. Middle tier: OLAP Server (Data-Warehouse-oriented and Data-Mart-oriented). Top tier: Front-end Tools for Query/Report, Analysis and Data Mining.]
„ From the architecture point of view, there are 3 data warehouse architecture models
„ Enterprise warehouse
„ Data mart
„ Virtual warehouse
„ The top-down development of an enterprise warehouse serves as a systematic solution and minimizes integration problems, but it is expensive.
„ A recommended method for development of data warehouse systems is to implement the warehouse in an incremental and evolutionary manner.
[Figure: a recommended approach for data warehouse development. Define a high-level corporate data model; then, with repeated model refinement, build data marts and an enterprise data warehouse, leading to distributed data marts and a multi-tier data warehouse.]
„ Types of OLAP Servers:
„ Relational OLAP (ROLAP) Servers
„ Multidimensional OLAP (MOLAP) Servers
„ Hybrid OLAP (HOLAP) Servers
B.4. Data Warehouse Implementation
„ Efficient Computation of Data Cubes
„ Multi-way Array Aggregation in the Computation of Data Cubes
„ The compute cube Operator and its Implementation
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
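The compute cube statement asks for the aggregate over every subset of the cube's dimensions, so a cube with n dimensions has 2^n cuboids (8 for item, city, year). The sketch below is a deliberately naive illustration of that idea: it rescans the base rows for every cuboid, whereas efficient methods such as multi-way array aggregation share work between cuboids. The rows and the function name compute_cube are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "city", "year")

# Hypothetical base rows: (item, city, year, sales_in_dollars).
ROWS = [
    ("TV", "Beijing", 2004, 100.0),
    ("TV", "Shanghai", 2004, 150.0),
    ("PC", "Beijing", 2004, 200.0),
    ("PC", "Beijing", 2005, 250.0),
]


def compute_cube(rows, dimensions):
    """Return {group-by dimensions: {group key: sum of the measure}}."""
    n = len(dimensions)
    cube = {}
    for size in range(n + 1):
        for axes in combinations(range(n), size):
            cuboid = defaultdict(float)
            for row in rows:
                # The measure is stored after the n dimension values.
                cuboid[tuple(row[a] for a in axes)] += row[n]
            cube[tuple(dimensions[a] for a in axes)] = dict(cuboid)
    return cube


sales_cube = compute_cube(ROWS, DIMENSIONS)
apex = sales_cube[()]                          # 0-D cuboid: grand total
by_item = sales_cube[("item",)]                # 1-D cuboid
by_city_year = sales_cube[("city", "year")]    # 2-D cuboid
base = sales_cube[("item", "city", "year")]    # base cuboid
```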
B.5. Further Development of Data Cube Technology
„ Discovery-driven Exploration of Data Cubes
„ Complex Aggregation at Multiple Granularities: Multifeature Cubes
„ Other Developments
B.6. From Data Warehousing to Data Mining
„ Data Warehouse Usage
„ 3 kinds of Data Warehouse application
„ Information Processing
„ Analytical Processing
„ Data Mining
„ From OLAP to OLAM
„ OLAP: On-Line Analytical Processing
„ OLAM: On-Line Analytical Mining
„ On-Line Analytical Mining integrates OLAP with data mining and mining knowledge in multidimensional databases.
„ Reasons why OLAM is important:
„ High quality of data in data warehouses
„ Available information processing infrastructure surrounding data warehouses
„ OLAP-based exploratory data analysis
„ On-Line selection of data mining functions
Chapter III
Fuzzy Theory and Application
OBJECTIVES
1. To introduce fuzzy sets and how they are used
2. To define some types of uncertainty and study what methods are used with each of the types.
3. To define fuzzy numbers, fuzzy logic and how they are used
4. To study methods of how fuzzy sets can be constructed
5. To see how fuzzy set theory is used and applied in cluster analysis
OUTLINE
I. INTRODUCTION and BASICS – Lecture 1
A. Why fuzzy sets
1. Data/complexity reduction
2. Control and fuzzy logic
3. Pattern recognition and cluster analysis
4. Decision making
B. Types of uncertainty
1. Deterministic, interval, probability
2. Fuzzy set theory, possibility theory
II. FUZZY SETS AND SYSTEMS – Lecture 2
A. Definitions
1. Sets – classical sets, fuzzy sets, rough sets, fuzzy interval sets, type-2 fuzzy sets
2. Fuzzy numbers
B. Operations on fuzzy sets
1. Union
2. Intersection
3. Complement
C. Operations on fuzzy numbers
1. Arithmetic
2. Relations, equations
3. Fuzzy functions and the extension principle
III. FUZZY THEORY APPLICATION – Lecture 3
A. Introduction
B. Fuzzy propositions
C. Fuzzy hedges
D. Composition, calculating outputs
E. Defuzzification / action
IV. FUZZY SET METHODS: Cluster analysis – Lecture 4
Lecture 1
INTRODUCTION AND BASICS
A. Why fuzzy sets?
- Modeling with uncertainty requires more than probability theory
- There are problems where boundaries are gradual
ƒ Classical sets: either an element belongs or it does not
EXAMPLES:
- Set of integers – a real number is an integer or not
- You are either in an airplane or not
- Your bank account is x yuan, y jiao and z fen
ƒ Fuzzy sets are sets that have gradations of belonging
EXAMPLES: Green, BIG, Near
EXAMPLES:
What is the boundary of China?
Is the boundary a mathematical curve?
What is the area of China?
Is the area a real number?
1. Data reduction – driving a car, computing with language
2. Control and fuzzy logic
a. Appliances, automatic gear shifting in a car
b. Subway systems (control outperformed humans in giving smoother rides)
Example: Temperature control in NASA space shuttles
IF x AND y THEN z is A
IF x IS Y THEN z is A
… etc.
If the temperature is hot and increasing very fast then air conditioner fan is set to very fast and air conditioner temperature is coldest.
There are four types of propositions we will study later.
3. Pattern recognition, cluster analysis
„ A digital TV company that issues IC cards wants to discover whether or not a card is lost or being illegally used prior to a customer reporting it missing
„ An Internet company wants to know what groups (sex, age, ethnicity, profession, income level…) of users are accessing its portal content.
4. Decision making
- Locate digital transmitters to optimally cover a given area
- Locate service centers to optimally cover digital TV user network.
- Position a satellite to cover the largest number of satellite TV users
- Design a content service in the following way: I want the service to be very popular, temporally optimized, last a rather long time and the cost of service is acceptable to subscribers.
B. Types of Uncertainty
[Figure: Types of sets (figure from Klir & Yuan)]
1. Deterministic – the difference between a known real number value and its approximation is a real number (a single number). Here one has error. For example, if we know the answer x must be the square root of 2 and we have an approximation y, then the error is x-y (or if you wish, y-x).
2. Interval – uncertainty is an interval. For example, measuring pi using Archimedes’ approach.
3. Probabilistic – uncertainty is a probability distribution function
4. Fuzzy – uncertainty is a fuzzy membership function
5. Possibilistic – uncertainty is a possibility distribution function, generated by nested sets
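The interval case in item 2 can be made concrete with Archimedes' polygon method, in which pi is known only to lie between the half-perimeters of an inscribed and a circumscribed regular polygon of a unit circle. The sketch below uses the standard doubling recurrence starting from hexagons; the function name archimedes_bounds is an assumption for this example.

```python
import math


def archimedes_bounds(doublings):
    """Return (sides, lower, upper) with lower < pi < upper.

    lower and upper are the half-perimeters of the regular polygons
    inscribed in and circumscribed about a circle of radius 1.
    """
    sides = 6
    lower = 3.0                  # inscribed hexagon: 6 * sin(pi / 6)
    upper = 2.0 * math.sqrt(3)   # circumscribed hexagon: 6 * tan(pi / 6)
    for _ in range(doublings):
        # Doubling the number of sides: harmonic mean, then geometric mean.
        upper = 2.0 * upper * lower / (upper + lower)
        lower = math.sqrt(upper * lower)
        sides *= 2
    return sides, lower, upper


sides, lower, upper = archimedes_bounds(4)   # 96-sided polygons
print(sides, lower, upper, upper - lower)
```

Here the uncertainty about pi is the whole interval [lower, upper]; no single error value is available because the exact value is not assumed known.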
Error, uncertainty - information/data is often imprecise, incoherent, incomplete
DEFINITION: The error is the difference between the exact value (a real number) and a value at hand (an approximation). As such, when one talks about error, one presupposes that there exists a “true” (real number) value. The precision is the maximum number of digits that are used to measure an approximation. It is the property of the instrument that is being used to measure or calculate the (exact) value. When a subset is being used to measure/calculate, it corresponds to a subset that can no longer be subdivided. It depends on the granularity of the input/output pairs (object/value pairs) or the resolution being used.
DEFINITION: Accuracy is the number of correct digits in an approximation.
For example, a GPS reading is (x,y) +/- …
DEFINITION: Item of information – is an ‘A-O-V-C’ quadruple (attribute, object, value, confidence) (definition is from Dubois & Prade, Possibility Theory)
Attribute: a function that attaches value to the object; for
example: area, position, color; it’s the recipe that tells
us how to obtain an output (value) from an input (object)
Object: the entity (domain or input); for example, Sicily
for area or my shirt for color or room 4.2 for
temperature.
Value: the assignment or output of the attribute; for
example 211,417.6 sq. km. for Sicily or green for shirt
Confidence: reliability of the information
VAGUENESS – lack of sharp distinction or boundaries, our ability to discriminate between different states of an event, undecidability (is a glass half full/empty)
AMBIGUITY: a one to many relationship; for example, she is tall, he is happy. There are a variety of alternatives
1. Non-specificity: Suppose one has a heart blockage and is prescribed a treatment. In this case “treatment” is a non-specificity in that it can be an angioplasty, medication, surgery (to name three alternatives)
2. Dissonance/contradiction: One physician says to operate and another says go to Hainan.
LECTURE SUMMARY
INTRODUCTION and BASICS – Lecture 1
A. Why fuzzy sets
1. Data/complexity reduction
2. Control and fuzzy logic
3. Pattern recognition and cluster analysis
4. Decision making
B. Types of uncertainty
1. Deterministic, interval, probability
2. Fuzzy set theory, possibility theory
[Figure: diagram relating SET THEORY, PROBABILITY, POSSIBILITY THEORY, FUZZY SET THEORY and ROUGH SET THEORY]
ASSIGNED QUESTIONS:
1. Understand and explain the A-O-V-C quadruple of an information item in Chinese.
2. By reading over the 1st and 2nd sections of Book Chapter II, present the algorithmic manipulations of the fuzzy set $\tilde{z} = \tilde{f}(x)$
Example – Surface modeling
Surface models
- The problem: Given a set of readings of the bottom of the ocean whose values are uncertain, generate a surface that explicitly incorporates this uncertainty mathematically and visually
- The approach: Consistent fuzzy surfaces
- Here we just introduce the associated ideas
Imprecision in Points: Fuzzy Points (figures from Jorge dos Santos)
[Figure: fuzzy points in 2D and 3D]
Transformation of real-valued functions to fuzzy functions
Instead of a real-valued function $z = f(x)$ or $z = f(x, y)$, let’s now consider a fuzzy function $\tilde{z} = \tilde{f}(x)$ or $\tilde{z} = \tilde{f}(x, y)$, where every element $x$ or $(x, y)$ is associated with a fuzzy number $\tilde{z}$.
Statement of the Interpolation Problem
Knowing the values $\{\tilde{z}_i\}$ of a fuzzy function over a finite set of points $\{x_i\}$ or $\{(x_i, y_i)\}$, interpolate over the domain in question to obtain a (nested) set of surfaces that represent the uncertainty in the data.
Computing surfaces
Given a data set of fuzzy numbers ($\tilde{z}$ = 1-d fuzzy triangular = $a/b/c$):
$$\tilde{p}(x) = \sum_{i=1}^{N} \tilde{z}_i L_i(x)$$
$$[\tilde{p}(x)]_\alpha = \sum_{i=1}^{N} z_i(\alpha) L_i(x)$$
Computing surfaces – Example
$$\tilde{z}_1 = 0.5/1.5/2, \quad \tilde{z}_2 = 0.75/1/1.5$$
$$L_1(x) = x + 2, \quad L_2(x) = 3x - 1$$
$$x = 1 \Rightarrow L_1(1) = 1 + 2 = 3, \quad L_2(1) = 3 \cdot 1 - 1 = 2$$
$$\Rightarrow \tilde{p}(1) = 3\tilde{z}_1 + 2\tilde{z}_2 = 1.5/4.5/6 + 1.5/2/3 = 3/6.5/9$$
$$[\tilde{p}(1)]_{\alpha=0} = [3, 9]$$
$$[\tilde{p}(1)]_{0.5} = [4.75, 7.75]$$
$$[\tilde{p}(1)]_{1} = [6.5, 6.5]$$
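The worked example can be re-traced numerically. The sketch below represents a triangular fuzzy number a/b/c by its three defining points, scales and adds such numbers pointwise, and reads off alpha-cuts by linear interpolation between the support [a, c] and the peak b. The class name Triangular and the function fuzzy_interpolant are assumptions for this illustration, not notation from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangular:
    """Triangular fuzzy number a/b/c: support [a, c], membership 1 at b."""
    a: float
    b: float
    c: float

    def scale(self, k):
        """Multiply by a crisp number k; a negative k swaps the end points."""
        if k >= 0:
            return Triangular(k * self.a, k * self.b, k * self.c)
        return Triangular(k * self.c, k * self.b, k * self.a)

    def __add__(self, other):
        return Triangular(self.a + other.a, self.b + other.b, self.c + other.c)

    def alpha_cut(self, alpha):
        """Closed interval of values whose membership is at least alpha."""
        return (self.a + alpha * (self.b - self.a),
                self.c - alpha * (self.c - self.b))


def fuzzy_interpolant(values, basis, x):
    """Evaluate the sum over i of z_i * L_i(x) for crisp basis functions L_i."""
    total = Triangular(0.0, 0.0, 0.0)
    for z, basis_function in zip(values, basis):
        total = total + z.scale(basis_function(x))
    return total


z1 = Triangular(0.5, 1.5, 2.0)
z2 = Triangular(0.75, 1.0, 1.5)
basis = [lambda x: x + 2, lambda x: 3 * x - 1]   # L1 and L2 from the example

p1 = fuzzy_interpolant([z1, z2], basis, 1)
print(p1)                  # Triangular(a=3.0, b=6.5, c=9.0)
print(p1.alpha_cut(0.0))   # (3.0, 9.0)
print(p1.alpha_cut(0.5))   # (4.75, 7.75)
print(p1.alpha_cut(1.0))   # (6.5, 6.5)
```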
Fuzzy Interpolating Polynomial $\tilde{p}(x)$ (figure from Jorge dos Santos & Lodwick)
Utilizing alpha-levels to obtain fuzzy polynomials, we have:
$$[\tilde{p}(x)]_\alpha \equiv [p_\alpha^-(x), p_\alpha^+(x)] = \{ z \in R : z = p_d(x),\ d_i \in [\tilde{z}_i]_\alpha \}$$
[Figure: the curves $p_\alpha^-(x)$ and $p_\alpha^+(x)$ passing through the alpha-cut end points $z_{1\alpha}^-$, $z_{1\alpha}^+$ at $x_1$ and $z_{2\alpha}^-$, $z_{2\alpha}^+$ at $x_2$]
Consistent Fuzzy Surfaces (curves)
The surfaces (curves) are defined enforcing the following properties:
1. The surfaces are defined analytically via the fuzzy functions; that is, model directly the uncertainty using fuzzy functions $\tilde{z} = \tilde{f}(x)$ or $\tilde{z} = \tilde{f}(x, y)$
2. All fuzzy surfaces maintain the characteristics of the generating method. That is, if splines are being used then all generated fuzzy surfaces have the continuity and smoothness conditions associated with the splines being used.
2-D Example (from Jorge dos Santos & Lodwick)
x_i:    0     15    25    50    90    121   143   165   200
z_i^-:  19.5  14.9  5.8   -3.9  39.0  22.3  32.1  29.4  2.5
z_i^1:  20.0  15.0  6.0   -4.0  40.0  23.0  33.0  30.0  3.0
z_i^+:  20.3  15.6  6.3   -4.2  41.2  23.7  34.0  30.1  3.2
Fuzzy Curves (figures from Jorge dos Santos & Lodwick)
[Figure: fuzzy curves z versus x for the data above, with panels P. Lagrange, Spline linear, Cubic Spline and Consistent Cubic Spline]
Details of the Consistent Fuzzy Cubic Spline (figures from Jorge dos Santos & Lodwick)
[Figure: enlarged view of the consistent fuzzy cubic spline, z versus x, near x = 155 to 170]
3-D Example (from Jorge dos Santos & Lodwick)
[Figure: Another Representation/View of the Fuzzy Points (figure from Jorge dos Santos & Lodwick)]
[Figure: Fuzzy Surface via Triangulation (figure from Jorge dos Santos & Lodwick)]
[Figure: Fuzzy Surfaces via Linear Splines (figure from Jorge dos Santos & Lodwick)]
[Figure: Fuzzy Surfaces via Cubic Splines (figure from Jorge dos Santos & Lodwick)]
EXAMPLES
Cidalia Fonte will go over in more detail the ideas introduced here at a later time.
Example 1. Tejo River
- The problem
The dimension of water bodies, and consequently their position, is subject to variation over time, especially in regions which are frequently flooded or subject to tidal variations, creating considerable uncertainty in positioning these geographical entities. River Tejo is an example, since frequent floods occur in several places along its bed. The region near the village of Constância, where rivers Tejo and Zezere meet, was chosen for this example.
A fuzzy geographical entity corresponding to rivers Tejo and Zezere is considered a fuzzy set. To generate this fuzzy entity, the membership function has to be constructed. This was done using a Digital Elevation Model of the region, created from the contours of the 1:25 000 map of the Army Geographical Institute of Portugal and information regarding the daily means of the river water level registered in the hydrometric station of Almourol, located in the vicinity, from 1982 to 1990. The variation of the water level during these years is shown on the next slide:
Example 1 (figures from Cidalia Fonte & Lodwick)
[Figure: river water level, in meters above the 20m level, from 1982 to 1990]
The membership function of points to the fuzzy set T is given by:
$$\mu_T(x, y) = f[z(x, y)]$$
[Figure: f(z), from 0% to 100%, plotted against altitude z from 20 to 30]
[Figure: map of $\mu_T(x, y)$ showing the river limits represented on the map, the line corresponding to the maximum water level registered during the considered period, and the line corresponding to the region always submerged during the considered period]
Example 2 – Landcover/use (figures from Cidalia Fonte & Lodwick)
[Figure: membership maps $\mu_{\text{Water regions}}(x, y)$, $\mu_{\text{Vegetation}}(x, y)$ and $\mu_{\text{Bareland}}(x, y)$, with legend levels $\mu(x, y)$ = 1, 0.75, 0.5, 0.25, 0]
GIS - Display
[Figure: membership surfaces $\mu_{\text{forest}}(x, y)$, $\mu_{\text{grass}}(x, y)$ and $\mu_{\text{wet regions}}(x, y)$, each ranging from 0 to 1 over the (x, y) plane]
Chapter IV
Knowledge Representation
„ Formal Logic and Intelligent Systems
„ Concepts of Knowledge Representation
„ Rules and Frames
„ Knowledge Representation Examples
„ Key Issue of the Chapter:
„
Concepts of Knowledge Representation
Chapter V
Wavelet Analysis
„ Content
„ Concepts in Wavelet Analysis
„ Characteristics of Wavelet Transform
„ Computational Features of Wavelet Transform
„ Inverse Wavelet Transform
„ Reconstruction Kernel
„ Categories of Wavelets
„ Emphases
„ Multi-resolution Analysis Theory
„ Orthogonal Wavelet Transform
„ Wavelet Packet Analysis
§5.1 Concepts in Wavelet Analysis
„ Definition of the Wavelet Transform
„ Given a basis function $\psi(t)$, suppose:
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)$$
where $a, b$ are constants, and $a > 0$.
„ Obviously, $\psi_{a,b}(t)$ are products of the shifting and scaling of the basis function $\psi(t)$.
„ With the continuous changing of $a, b$, a set of functions, $\psi_{a,b}(t)$, are produced.
„ For an $x(t) \in L^2(R)$ (this indicates that x(t) is square-integrable), the Wavelet Transform of $x(t)$ is defined as:
$$WT_x(a, b) = \frac{1}{\sqrt{a}} \int x(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right) dt = \int x(t)\,\psi^{*}_{a,b}(t)\,dt = \langle x(t), \psi_{a,b}(t) \rangle$$
„ Since a, b and t are continuous variables, this definition of Wavelet Transform is known as the Continuous Wavelet Transform (CWT).
„ CWT is the basis of Wavelet Analysis studies.
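A small numerical sketch can make the definition concrete. It assumes the usual energy-preserving normalization, a factor of one over the square root of a in front of the scaled wavelet, and it picks the real-valued "Mexican hat" function as the mother wavelet purely for illustration; neither the wavelet nor the helper names below come from the lecture. The integral is approximated by the midpoint rule on a finite window.

```python
import math


def mexican_hat(t):
    """Real mother wavelet chosen only for illustration (an assumption)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)


def psi_ab(t, a, b):
    """Shifted and scaled wavelet, normalized by 1 / sqrt(a)."""
    return mexican_hat((t - b) / a) / math.sqrt(a)


def integrate(f, lo=-50.0, hi=50.0, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


def cwt(x, a, b):
    """WT_x(a, b): inner product of x with psi_ab (psi is real, so psi* = psi)."""
    return integrate(lambda t: x(t) * psi_ab(t, a, b))


def energy(a, b):
    """Integral of psi_ab squared; the normalization keeps it independent of a, b."""
    return integrate(lambda t: psi_ab(t, a, b) ** 2)


# A test signal that is itself a wavelet at scale 2 and shift 5.
def signal(t):
    return psi_ab(t, 2.0, 5.0)


matched = cwt(signal, 2.0, 5.0)       # analysis wavelet matches the signal
wrong_shift = cwt(signal, 2.0, 0.0)   # same scale, different time shift
wrong_scale = cwt(signal, 1.0, 5.0)   # same shift, different scale
print(matched, wrong_shift, wrong_scale)
```

The coefficient is largest when the scale a and shift b match the structure present in the signal, which is what makes the (a, b) plane useful for localizing features.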
„ Important Concepts in Wavelet Transform
„ $b$: Time Shift
„ $a$: Scaling Factor
„ $\psi(t)$: The Basic Wavelet or the Mother Wavelet
„ $\psi_{a,b}(t)$: The Wavelet Basis, the set of functions produced by shifting and scaling of the Mother Wavelet
$$WT_x(a, b) = \frac{1}{\sqrt{a}} \int x(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right) dt = \int x(t)\,\psi^{*}_{a,b}(t)\,dt = \langle x(t), \psi_{a,b}(t) \rangle$$
„ The Wavelet Transform could be understood as the inner product of the signal x(t) with a set of Wavelet Basis Functions.
„ The Mother Wavelet could be Real or Complex functions.
„ Functionality of Time Shift (Variable b)
[Figure: $\varphi(t)$, $\varphi(t - b)$ and $\varphi\!\left(\frac{t-b}{a}\right)$ with $a = 2$, plotted against t]
„ The Expression of Wavelet Transform in the Frequency Domain
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \Leftrightarrow \Psi_{a,b}(\Omega) = \sqrt{a}\,\Psi(a\Omega)\,e^{-j\Omega b}$$
$$WT_x(a, b) = \frac{1}{2\pi} \langle X(\Omega), \Psi_{a,b}(\Omega) \rangle = \frac{\sqrt{a}}{2\pi} \int_{-\infty}^{+\infty} X(\Omega)\,\Psi^{*}(a\Omega)\,e^{j\Omega b}\,d\Omega$$
§5.2 Characteristics of Wavelet Transform
„ The Constant-Q Feature
„ Bandwidth versus Central Frequency
$$Q = \Delta\Omega / \Omega_0$$
$$\frac{\Delta\Omega / a}{\Omega_0 / a} = \Delta\Omega / \Omega_0 = Q$$
[Figure: time-frequency cells for three scales. For $a = 1/2$: center frequency $2\Omega_0$, bandwidth $2\Delta\Omega$, time width $\Delta t / 2$. For $a = 1$: center frequency $\Omega_0$, bandwidth $\Delta\Omega$, time width $\Delta t$. For $a = 2$: center frequency $\Omega_0 / 2$, bandwidth $\Delta\Omega / 2$, time width $2\Delta t$.]
Chapter VI
Data and Information Fusion
What is Information Fusion?
“Information fusion is an Information Process dealing with the:
• [association, correlation, and combination of data and information] from
• [single and multiple sensors or sources] to achieve
• [refined estimates of parameters, characteristics, events, and behaviors] for observed entities in an observed field of view”
• It is sometimes implemented as a Fully Automatic process or as a Human-Aiding process for Analysis and/or Decision Support
Most Simply:
Multiple types of data: carrying various types of information (redundant and complementary)
Related to things of interest: “Associated” or “Correlated” to the same object or event or behavior
To improve estimates about those things: so that estimation algorithms (mathematical techniques) or automated reasoning methods (artificial intelligence techniques) can produce better estimates (than based on any single type of data)
These Basic Ideas are Transferable to Many Types of Problems
Data and Information Fusion Process & Functional Model
[Figure: data fusion functional model. Sources (Intel Sources, Air Surveillance, Surface Surveillance, Space Surveillance; also instrumented, intelligent manufacturing systems and patient-monitoring systems) observe the World State of Interest and feed Level 0, Level 1, Level 2, Level 3 (Impact Assessment) and Level 4 (Process Refinement) Processing, supported by a Data Base Management System (Support Database, Fusion Database) and Human Computer Interaction. Annotations: identity of parts, vehicles, organs and tumors; impact assessed as benign or critical. Human engineering topics: decision-aiding, active sensor control, visualization/display, fusion process control, trust in automation systems.]
How is Data/Information Fusion Done?
(Techniques draw on application-domain knowledge.)
L0: Sub-Object Association/Correlation
• Signal Processing
• Combinatoric Optimization
• Numerical/Statistical Estimation Techniques
L1: Association/Correlation
• OR, Statistical Methods, Combinatoric Optimization
• Numerical/Statistical Estimation Techniques
• Knowledge-based, Symbolic Techniques
L3: Impact Assessment
• Numerical/Statistical Estimation Techniques
• Knowledge-based, Symbolic Techniques
L4: Adaptive Control
• Sensor and Source Management; Info-theoretic Techniques and Intelligent Control Theory
• Process Adaptation; Control-theoretic Techniques
A Basic Issue: “Association”
--What measurement goes with what entity?
--Because we then use (fuse) those multiple measurements to get an improved (fused) estimate of something
How to formulate an approach to this problem?
[Figure: a measurement/observable and an estimate propagated to measurement time are compared by a “closeness” score; the measurement carries measurement error (instrumentation, environmental effects) and the estimate carries estimation/prediction error]
Leads to the formulation of a classic OR Assignment problem with usual repertoire of solutions
Data Fusion “Processing Tree”
Data Fusion Tree Node
Now exploit the
Dat a Fusion Tree Node multiple
observational
data for a
fused estimate
DataCorrelation
Data
Association
HW
Radar
Off-Board
FN
MN
FN
Target fensiv
Management
IRST
Prior Data Fusion
Nodes & Sources
Situation
Refinement
Status or “Situation”:
Adaptive
Logic, eg:
Broad Range
HF/HE
Techniques
•Numerical/Statistical
Estimation Techniques
•Pattern Recognition
Techniques
Sub-object Data
Object
Association
& Movement,
Location,
Refinement
Estimation
•Intelligent
Transportation
Systems
Object Refinement
Broad Range of
Estimation
Methods
•Sensors and
L2: Situation Refinement
DATA FUSION DOMAIN
Data Preparation
(Common
Referencing)
Hypothesis
Generation
Hypothesis
Evaluation
Hypothesis
Selection
State
Estimation
&
Prediction
FN
User or
Next Fusion
Node
MN
MN
FN
Core Proces sing
MN
Miss ile
FN
FN
Things that
can cause
expected
observations
Optimally
asigning the
observations to an
How it is that
estimation process
observations
Source/Sensor Status
Resource Management Controls
which is
are related to
estimating a
the entities or
parameter of
objects
• Estimate/predict
• Gating and generation
• Detect and resolve
aggregate states
interest for object&
the
(A notion of - Kinematics.
data conflicts
attributes, ID
of feasible and confirmed
entity/object
From
each perspective (blue, red)
“closeness”—a
• Convert data to common
association hypothesei
•Estimate sensor/source misalignments
time and coordinate frame • Scoring of
“score”)
•Feed forward source/sensor status
• Compensate for
data associations
source misalignments
• Select, delete, or feedback
data associations
Us er
I nterf ace
MN
FN
FN
MN
RWR
MN
MN
FN
Def ensiv e Mgmt.
Countermeasure
MN
RFC M
FN
FN
FN
MN
MN
Expendables
FN
MN
MN
FN= Data F usi on Node
MN= R esource Man agement Node
Architecting these
systems can be difficult
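The association steps described above (generate feasible pairings, score them by "closeness", select the best assignment) can be sketched in a few lines. This is a toy illustration under assumptions that are not in the slides: positions are already in a common frame, the score is plain Euclidean distance, there are as many measurements as tracks, and the assignment problem is solved by brute force over permutations, which is only practical for a handful of objects. The names `GATE`, `score`, and `select_assignment` are hypothetical, and state estimation is left out of the sketch.

```python
import itertools
import math

GATE = 2.0  # hypothetical gate radius: pairings farther apart are infeasible

def score(predicted, measured):
    """Closeness score: Euclidean distance between predicted and measured positions."""
    return math.dist(predicted, measured)

def select_assignment(predictions, measurements, gate=GATE):
    """Return (track, measurement) index pairs minimizing the total score.

    Every pairing in the chosen assignment must pass the gate. Returns None
    when no complete feasible assignment exists.
    """
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(measurements))):
        costs = [score(predictions[i], measurements[j]) for i, j in enumerate(perm)]
        if max(costs) > gate:        # hypothesis generation: gating
            continue
        total = sum(costs)           # hypothesis evaluation: scoring
        if total < best_cost:        # hypothesis selection: keep the best
            best_cost, best_perm = total, perm
    if best_perm is None:
        return None
    return list(enumerate(best_perm))

# Track estimates propagated to the measurement time, and unlabeled measurements.
predictions = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
measurements = [(4.6, 0.3), (0.2, 4.8), (0.1, -0.2)]

pairs = select_assignment(predictions, measurements)
print(pairs)
```

For realistic problem sizes the brute-force search would be replaced by a dedicated assignment solver, which is the "usual repertoire of solutions" from operations research.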
Non-Defense Applications
„ Condition-Based Maintenance (multiply-instrumented, high-value equipment: predict/estimate “health” from sensor data)
„ Multi-spectral mammography (tumor detection from multiple imagery)
„ Intelligent Transportation Systems (intersection collision avoidance from side-looking radars and acoustic sensors)
Fusion-Based Automatic Object Recognition
(A Precursor to Visualization)
[Figure: collected versus predicted imagery for a BRDM-2 target model. Panels: SAR only (34° pose), E-O only (88° pose), SAR component match scores, E-O component match scores, fused component match scores. Match score scale runs from low to high]
• Generate Hypothesis (e.g. BRDM-2, 34° pose, articulation x)
• Predict Measurements
• Evaluate Component-Level Match: Actual vs. Predicted
• Select Hypothesis with Best Match
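The four steps above can be sketched as a hypothesize-and-test loop over two sensors. Everything in this example is made up for illustration: the hypothesis labels, the "component signature" numbers, the scoring rule (one minus the absolute difference, clipped at zero), and the fusion rule (averaging component scores across sensors). The slides do not specify how the match scores are computed or fused.

```python
# Hypothetical predicted component signatures per hypothesis and sensor.
PREDICTED = {
    "A": {"sar": [0.8, 0.3, 0.5], "eo": [0.9, 0.1, 0.1]},
    "B": {"sar": [0.7, 0.3, 0.6], "eo": [0.2, 0.7, 0.5]},
    "C": {"sar": [0.1, 0.9, 0.1], "eo": [0.3, 0.6, 0.6]},
}

# Hypothetical collected (actual) component signatures.
COLLECTED = {"sar": [0.8, 0.3, 0.5], "eo": [0.2, 0.7, 0.6]}

def component_scores(actual, predicted):
    """Per-component match score in [0, 1]; 1 means actual equals predicted."""
    return [1.0 - min(1.0, abs(a - p)) for a, p in zip(actual, predicted)]

def match_score(hypothesis, sensors):
    """Average component-level match over the chosen sensors."""
    scores = []
    for sensor in sensors:
        scores += component_scores(COLLECTED[sensor], PREDICTED[hypothesis][sensor])
    return sum(scores) / len(scores)

def best_hypothesis(sensors):
    """Select the hypothesis whose predictions best match the collected data."""
    return max(PREDICTED, key=lambda h: match_score(h, sensors))

sar_only = best_hypothesis(["sar"])
eo_only = best_hypothesis(["eo"])
fused = best_hypothesis(["sar", "eo"])
print("SAR only:", sar_only, "| E-O only:", eo_only, "| fused:", fused)
```

In this toy data the SAR signature alone slightly favors hypothesis A, while the fused score favors B because A's predicted E-O signature matches the collected one poorly.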
Summary
„ Data fusion is an information process embodied in software, involving estimation algorithms to extract maximum information from multiple observations
„ It is a maturing field of study requiring innovation in application
„ It has been successfully employed in a broad range of applications
„ Applicability to anthropometrical needs warrants consideration
Reference material (originally in Chinese)
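[Diagram: data warehouse architecture]
Layers: front-end tools layer; OLAP service layer; data warehouse service layer; data layer
Components: query system; data analysis; data mining; OLAP services for the data warehouse; OLAP services for data marts; data warehouse; data marts; management system; monitoring system; metadata store; transactional database; external data sources
„ deduction: 演绎
„ induction: 归纳
„ derivation: 导出,引出 (to derive, to draw out)
„ abduction: 推断,不明推论 (abductive inference)
„ inference: 推论
„ predicate calculus: 谓词演算
„ propositional network: 命题网络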