Understanding Data Analytics and
Data Mining
An important aspect of the decision-making process is the
ability to transform seemingly unrelated data into useful
information that can influence a person’s decision.
Understanding what data is needed to make effective
decisions and where that data comes from is just one
step in the process; the next step is mining or analyzing
that data to draw useful conclusions that aid decision
making.
The Understanding Data Analysis and Data Mining
presentation is designed to explore the general
principles behind this second step and to support the
organization in understanding its options for using
data effectively in its business.
Distinguishing Analysis and Mining
The terms “data analysis” and “data mining” are
sometimes used interchangeably, but they are distinctly
different in practice.
In data analysis, a hypothesis is formed and the data is
analyzed to support or disprove the hypothesis.
In data mining, no hypothesis is formed initially but the
data is analyzed to identify any interesting patterns
from which a hypothesis can be drawn.
Despite their differences, the techniques and methods for
both data analysis and data mining are similar.
Knowledge Discovery in Databases
The Knowledge Discovery in Databases
process includes the following steps:
Data Mining
Knowledge Presentation
Defining Data
Data are a set of facts.
Facts are true or proven.
Data can come in a variety of types:
Relational data
Operational data
Transactional data
Define Data Entry
A data entry is a single instance or record in a
database. Data entries are also called data objects.
A data entry establishes a relationship between
data elements, for example:
person and address
customers and purchases
events and outcomes
Define Dimensions
A dimension is a collection of facts about a
measurable situation.
Dimensions define the who, what, where,
when, and how of a particular focus on the
Dimensions are used to construct how data
patterns are identified and analyzed.
Dimensions – Cube Schema
The cube rendering is a product of online
analytical processing (OLAP) and is used to
show how the different dimensions of data
can be viewed.
Retail Example:
4 retail locations
10 products
12 months
2 age groups
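As a rough illustration of the retail example above, the cube can be sketched as a mapping from one member of each dimension to a measure. The member labels, the age-group split, and the "units sold" measure are assumptions made for this sketch; the slide only gives the dimension sizes.

```python
from itertools import product

# Dimension members for the retail example; the labels are placeholders.
locations = [f"store_{i}" for i in range(1, 5)]      # 4 retail locations
products = [f"product_{i}" for i in range(1, 11)]    # 10 products
months = list(range(1, 13))                          # 12 months
age_groups = ["under_40", "40_and_over"]             # 2 age groups (assumed split)

# Each cell of the cube is addressed by one member from every dimension.
# The measure (units sold, an assumption) starts at zero: no real data is given.
cube = {cell: 0 for cell in product(locations, products, months, age_groups)}

cell_count = len(cube)  # 4 * 10 * 12 * 2 = 960 cells

# Record a hypothetical sale and read it back by its coordinates.
cube[("store_1", "product_3", 6, "under_40")] += 5
```

Four dimensions of these sizes give 960 cells, which is why a cube is usually viewed a few dimensions at a time.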
Dimensions – Star Schema
Star schemas are used to design how data is
organized in data warehouses.
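A minimal sketch of a star schema, using Python's built-in sqlite3 module: one central fact table refers to several surrounding dimension tables. All table names, column names, and rows here are hypothetical and chosen only to echo the retail example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member of each dimension.
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_month    (month_id    INTEGER PRIMARY KEY, month_name TEXT);

-- Fact table at the center of the star: keys into each dimension plus a measure.
CREATE TABLE fact_sales (
    location_id INTEGER REFERENCES dim_location(location_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    month_id    INTEGER REFERENCES dim_month(month_id),
    units_sold  INTEGER
);

INSERT INTO dim_location VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_month    VALUES (1, 'January'), (2, 'February');
INSERT INTO fact_sales   VALUES (1, 1, 1, 10), (1, 2, 1, 4), (2, 1, 2, 7);
""")

# Join the fact table to one dimension to total units sold per product.
rows = conn.execute("""
    SELECT p.name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

The fact table stays narrow (keys and measures), while descriptive attributes live in the dimension tables.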
Online Analytical Processing
Online Analytical Processing is an approach
for analyzing multidimensional data from
multiple perspectives interactively.
The acronym for online analytical processing
is OLAP.
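Two common ways of viewing multidimensional data from different perspectives are rolling up (aggregating along a dimension) and slicing (fixing one dimension to a single value). The sketch below assumes a small, made-up list of sales records; the function names are illustrative and not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (location, product, month, units_sold).
facts = [
    ("North", "Widget", 1, 10),
    ("North", "Gadget", 1, 4),
    ("South", "Widget", 1, 6),
    ("South", "Widget", 2, 7),
]

def roll_up(records, key_index):
    """Aggregate units sold along one dimension (a roll-up)."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[3]
    return dict(totals)

def slice_cube(records, key_index, value):
    """Keep only records where one dimension has a fixed value (a slice)."""
    return [r for r in records if r[key_index] == value]

by_location = roll_up(facts, 0)          # view by location
january = slice_cube(facts, 2, 1)        # fix month = 1
january_by_product = roll_up(january, 1) # view January by product
```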
Defining Patterns
A pattern is an expression of data which can be modeled.
Data analysis and data mining focus on identifying,
understanding, and drawing conclusions about interesting
patterns.
An interesting pattern has the following characteristics:
– It can be understood easily by humans
– It can be recreated, meaning it has some level of
certainty as to its validity
– It can potentially be used by the organization
– It is novel, innovative, and requires investigation
– For data analysis, it validates and confirms the
hypothesis
Queries are a mechanism for retrieving
information from a database: they consist of
Standard queries are predefined questions to
ask a database.
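One way to picture a standard query is as a fixed question whose text is written in advance and whose only variable part is a parameter. The sketch below uses Python's built-in sqlite3 module; the table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("Ana", "Widget", 20.0), ("Ben", "Gadget", 35.0), ("Ana", "Gadget", 35.0)],
)

# A "standard query": the question is fixed in advance, only the parameter varies.
TOTAL_SPENT_BY_CUSTOMER = "SELECT SUM(amount) FROM purchases WHERE customer = ?"

def total_spent(customer):
    """Run the predefined query for one customer; unknown customers total 0.0."""
    (total,) = conn.execute(TOTAL_SPENT_BY_CUSTOMER, (customer,)).fetchone()
    return total or 0.0
```

Calling total_spent("Ana") asks the same predefined question each time, with only the customer name changing.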
Data Mining Techniques
There are several techniques of note in data mining:
Characterization and Discrimination
Associations and Correlations
Classification and Regression
Cluster Analysis
Outlier Analysis
Characterization and Discrimination
Characterization will describe the data in
summary or general terms.
Discrimination will describe the data, usually
by means of comparison.
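The difference between the two can be sketched with a small example: characterization summarizes one group on its own, while discrimination describes a group by contrasting it with another. The groups, amounts, and chosen summary statistics below are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical purchase amounts for two customer groups.
purchases = {
    "under_40": [20.0, 35.0, 50.0],
    "40_and_over": [60.0, 80.0, 70.0],
}

def characterize(values):
    """Describe one group in summary terms."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

def discriminate(target, contrast):
    """Describe a target group by comparing it with a contrasting group."""
    return {"mean_difference": mean(target) - mean(contrast)}

summary = characterize(purchases["under_40"])
comparison = discriminate(purchases["under_40"], purchases["40_and_over"])
```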
Association and Correlation
Associations and correlations are pattern
relationships made against data objects.
They are often used in frequent pattern mining.
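As a hedged sketch of frequent pattern mining, the code below counts how often pairs of items appear together in a few made-up transactions. Support and confidence are the usual measures for association rules, although the slides do not name them; the item names are placeholders.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count single items and unordered item pairs across all transactions.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

def support(pair):
    """Fraction of all transactions that contain both items."""
    return pair_counts[tuple(sorted(pair))] / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]
```

Here every transaction containing butter also contains bread, so the rule "butter implies bread" has full confidence even though it covers only half of the transactions.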
Classification and Regression
Classification attempts to find a predefined
data model to describe the data set.
Regression attempts to find an existing data
model to describe missing or unavailable
numerical data.
These are predictive approaches and utilize
methods such as decision trees and neural
networks.
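To make the two predictive approaches concrete, the sketch below fits a one-level decision tree (a single threshold on one feature) for classification and a least-squares line for regression. Both are deliberately simplified stand-ins, and the data is invented for illustration.

```python
# Classification: a one-level decision tree (a "stump") on a single feature.
labeled = [(18, "no"), (25, "no"), (32, "yes"), (47, "yes")]  # (age, bought?)

def fit_stump(rows):
    """Pick the age threshold that misclassifies the fewest training rows."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold, _ in rows:
        errors = sum((age >= threshold) != (label == "yes") for age, label in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def classify(age, threshold):
    """Predict the class label for a new record."""
    return "yes" if age >= threshold else "no"

# Regression: least-squares line y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

threshold = fit_stump(labeled)
slope, intercept = fit_line([1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0])
predicted_month_5 = slope * 5 + intercept  # estimate a value that is not yet available
```

Classification predicts a label for a new record, while regression estimates a missing numerical value.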
Cluster Analysis
Data objects are analyzed without using class
labels; clustering can be used to generate class labels.
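A minimal sketch of clustering, assuming one-dimensional data and a simple k-means procedure (one of many clustering methods, chosen here only for brevity). No class labels are supplied; the groups emerge from the data, and the starting centers are an arbitrary assumption.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical purchase amounts with two visible groups.
amounts = [5.0, 6.0, 7.0, 50.0, 52.0, 54.0]
centers, clusters = k_means_1d(amounts, centers=[min(amounts), max(amounts)])
```

The resulting cluster indexes can then serve as generated class labels for later analysis.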
Outlier Analysis
Outlier analysis looks at abnormalities in data: data
that does not behave as expected.
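One simple way to flag data that does not behave as expected is a z-score test: values far from the mean, measured in standard deviations, are treated as outliers. The threshold of 2.0 and the sample values are assumptions for this sketch; other definitions of "abnormal" are equally valid.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]

# Hypothetical daily sales with one unusual day.
daily_sales = [10, 11, 9, 10, 12, 10, 11, 48]
outliers = find_outliers(daily_sales)
```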
The Cross-Industry Standard Process for Data Mining
(CRISP-DM) was developed under the European
Strategic Program on Research in Information
Technology (ESPRIT).
Sample, Explore, Modify, Model, and Assess
(SEMMA) was developed by SAS Institute Inc.
The Toolkit
The Toolkit is designed to enable an organization to
improve its capabilities in data warehousing and data
analysis, while maintaining a level of neutrality between
specific technical solutions. The Toolkit consists of
two parts: an introduction to the concepts and terms
used in these areas, and usable templates to pursue
and implement specific technical solutions.
The goal of the Data Warehouse and Data Analysis Toolkit
is to define the contributing factors, major components,
and their relationships, while providing the basic tools to
take action based on the organization’s needs.
Moving Forward
The presentations found within the Toolkit provide
education about the different facets of Data
Warehousing and Data Analysis: they can be used for
self-edification or as the foundation for presenting a
case to different levels of the organization.
The process document, Developing Data Analysis
Capabilities, is intended to be a step-by-step guide to
creating a Data Analysis foundation in your
organization. Multiple templates have been created to
support the process and aid organizations in their
efforts to improve their Data Analysis capabilities.