Download Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining: A KDD Process
Pattern Evaluation

Data mining: the core of
knowledge discovery Data Mining
process.
Task-relevant Data
Data Warehouse
Data Cleaning
Data Integration
Databases
Selection
Steps of a KDD Process









Learning the application domain:
 relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
 Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining

summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Architecture of a Typical Data
Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data
warehouse server
Data cleaning & data integration
Databases
Filtering
Data
Warehouse
Data Mining: Confluence of
Multiple Disciplines
Database
Technology
Machine
Learning
Information
Science
Statistics
Data Mining
Visualization
Other
Disciplines
Major Issues in Data Mining (1)


Mining methodology and user interaction

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of
abstraction

Incorporation of background knowledge

Data mining query languages and ad-hoc data mining

Expression and visualization of data mining results

Handling noise and incomplete data

Pattern evaluation: the interestingness problem
Performance and scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)

Issues relating to the diversity of data types



Handling relational and complex types of data
Mining information from heterogeneous databases and
global information systems (WWW)
Issues related to applications and social impacts



Application of discovered knowledge
 Domain-specific data mining tools
 Intelligent query answering
 Process control and decision making
Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
Protection of data security, integrity, and privacy
An Overview of Data
Warehousing and OLAP
Technology
OLAP-Online Analytical
Processing

Data ware house enables OLAP to help decision
support.
Organize and format data in various format.

OLTP-Online Transaction Processing

It uses Operational datatbases.
Data w/h is kept separate from Operational DB.
It covers day to day operations of an org.
Such as purchase, inventory, manufacturing, banking,
payroll, accounting.




OLTP vs OLAP

OLTP applications typically automate clerical data processing tasks

Data warehouses, in contrast, are targeted for decision support

The transactions require detailed, up-to-date data, and read or update a few
(tens of) records accessed typically on their primary keys

Historical, summarized and consolidated data is more important than
detailed, individual records

Operational databases tend to be hundreds of megabytes to gigabytes in size

Enterprise data warehouses are projected to be hundreds of gigabytes
to terabytes in size

Consistency and recoverability of the database are critical, and maximizing
transaction throughput is the key performance metric

Query throughput and response times are more important than
transaction throughput
OLAP-characteristics




Use multi dimensional data analysis
techniques.
Provide advance data base support.
Provides easy to use end user interfaces.
Support client/server architecture.
Architecture and End-to-End
Process
Back End Tools and Utilities



Data Cleaning Tools: Tools that help to detect data
anomalies and correct them
E.g. To correct Inconsistent field lengths, inconsistent
descriptions, inconsistent value assignments, missing
entries and violation of integrity constraints
Types:
Data migration tools e.g. Warehouse Manager from
Prism
Data scrubbing tools e.g. Integrity
Data auditing tools: such tools may be considered as
variants of data mining tools
Back End Tools and Utilities (Contd.)

Load:
After extracting, cleaning and transforming, data
must be loaded into the warehouse. e.g., RedBrick
Table Management Utility
Additional preprocessing may still be required for:
checking integrity constraints; sorting;
summarization, aggregation and other computation
to build the derived tables stored in the warehouse;
building indices and other access paths; and
partitioning to multiple target storage areas.
Methods:
Batch load utilities
Pipelined and partitioned parallelism
To insert only updated table
Back End Tools and Utilities (Contd.)

Refresh: Done only if some OLAP queries need current
data

Most contemporary database systems provide
replication servers that support incremental techniques
for propagating updates from a primary database to one
or more replicas.

Techniques: Data shipping and Transaction shipping
Transaction shipping has the advantage that it does not
require triggers, which can increase the workload on the
operational source databases
Conceptual Model and Front End Tools

A popular conceptual model that influences the front-end
tools, database design, and the query engines for OLAP
is the multidimensional view of data in the warehouse.
Front End Tools

The spreadsheet is still the most compelling front-end
application for OLAP

Popular operations that are supported by the
multidimensional spreadsheet:
rollup (increasing the level of aggregation)
drill-down (decreasing the level of aggregation or
increasing detail) along one or more dimension
hierarchies
slice_and_dice (selection and projection)
pivot (re-orienting the multidimensional view of data).
Front End Tools

Other Applications : Traditional analysis by means of a
managed query environment
These applications often use raw data access tools and
optimize the access patterns depending on the back end
database server.
E.g. there are query environments (e.g., Microsoft
Access) that help build ad hoc SQL queries by “pointingand-clicking”
Database Design Methodology


The database designs recommended by ER diagrams
are inappropriate for decision support systems where
efficiency in querying and in loading data (including
incremental loads) are important
Schema used to represent the multidimensional data
model are:
Star schema
Snowflake schemas
Fact constellations
Star Schema
Snowflake Schema
Warehouse Servers





Data warehouses may contain large volumes of data
Thus, improving the efficiency of scans is important
Index Structures and their Usage: Warehouse servers
can use bit map indices, which support efficient index
operations (e.g., union, intersection).
Materialized Views and their Usage: strategy for using a
materialized view is to use selection on the materialized
view, or rollup on the materialized view by grouping and
aggregating on additional columns
Transformation of Complex SQL Queries: “unnesting”
complex SQL queries containing nested subqueries
Parallel Processing
Warehouse Servers (Contd.)

Server Architectures for Query Processing:

Specialized SQL Servers: The objective here is to provide
advanced query language and query processing support for SQL
queries over star and snowflake schemas in read-only
environments. e.g. Redbrick

ROLAP Servers: These are intermediate servers that sit
between a relational back end server (where the data in
the warehouse is stored) and client front end tools e.g. Microstrategy.

MOLAP Servers: These servers directly support the
multidimensional view of data through multidimensional storage
engine e.g. Essbase (Arbor)
Warehouse Servers (Contd.)

SQL Extensions:
Extended family of aggregate functions: rank,
percentile, mean, mode, median
Reporting Features: moving average
Multiple Group-By: Cube and Rollup
Comparisons
Metadata and Warehouse Management

Administrative metadata includes:
Descriptions of the source databases, back-end and front-end
tools; definitions of the warehouse schema, derived data,
dimensions and hierarchies, predefined queries and reports; data
mart locations and contents; physical organization such as data
partitions; data extraction, cleaning, and transformation rules;
data refresh and purging policies; and user profiles, user
authorization and access control policies
Business metadata includes:
Business terms and definitions, ownership of the data, and
charging policies
Operational metadata includes:
Information that is collected during the operation of the
warehouse: the lineage of migrated and transformed data; the
currency of data in the warehouse (active, archived or purged);
and monitoring information such as usage statistics, error
reports, and audit trails.
Metadata and Warehouse Management (Contd.)

A metadata repository is used to store and manage all
the metadata associated with the warehouse. E.g.
Platinum Repository and Prism Directory Manager

Warehouse management tools (e.g., HP Intelligent
Warehouse Advisor, IBM Data Hub, Prism Warehouse
Manager) are used for monitoring a warehouse

System and network management tools (e.g., HP
OpenView, IBM NetView,Tivoli) are used to measure
traffic between clients and servers, between warehouse
servers and operational databases

Workflow management tools been considered for
managing the extract-scrub-transform-load-refresh
process
Conclusion

There are substantial technical challenges in developing
and deploying decision support systems

While many commercial products and services exist,
there are still several interesting avenues for research
related to the different aspects in designing and
maintaining a data warehouse.
References:



Surajit Chaudhuri Umeshwar Dayal Microsoft Research, Redmond,
Umeshwar Dayal, Hewlett-Packard Labs, Palo Alto, An Overview of Data
Warehousing and OLAP Technology
Inmon, W.H., Building the Data Warehouse. John Wiley, 1992.
Athanasios Vavouras, Stella Gatziu, Klaus R. Dittrich, Modeling and
Executing the Data Warehouse Refreshment Process, Technical Report
2000.01, January 2000