Data Mining and Big Data
Ahmed K. Ezzat
SQL Server 2008 and Data Mining Overview
Outline

MS SQL Server 2008 and Data Mining

MS SQL Server 2008 and Data Mining Extensions (DMX)

Using MS SQL Server Data Mining

MS SQL Server Available Algorithms:

Naïve Bayes

Decision Tree

Time Series

Clustering

Association Rules

Neural Networks and Logistic Regression
MS SQL Server 2008 and Data Mining
MS SQL Server 2008 and Data Mining:
An Overview



Hard drive capacity (holding CRM, ERP, web server log records, etc.)
has grown faster than processing power; data has outpaced the
capability to process it, leaving organizations data-rich and
knowledge-poor.
The main purpose of data mining is to extract knowledge from the
huge volume of data at hand.
With a traditional RDBMS, you issue a query, including an OLAP
query, to find answers to interesting questions. With data mining,
in contrast, you ask the question in terms of the data (and a
possible hypothesis) and let the data mining tools either verify
your hypothesis or discover hypotheses you did not think of!
MS SQL Server 2008 and Data Mining:
Data Mining Tasks

Classification: risk management, targeted advertisement, etc. Find
a model that describes the class attribute as a function of the
input attributes. Algorithms include decision trees, neural
networks, and Naïve Bayes.

Clustering: typically unsupervised learning where all attributes
are treated equally. Most clustering algorithms are iterative in
nature and stop when the model converges, that is, when the
clusters become stable.
[Figures: Decision tree; Clustering]
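
The iterate-until-stable behavior described for clustering can be illustrated with a minimal k-means sketch. This is a generic textbook version, not necessarily the algorithm SQL Server uses; the function name `kmeans`, the starting centers, and the sample points are all assumptions made up for illustration.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means; stops when cluster assignments no longer change."""
    centers = list(initial_centers)  # copy so the caller's list is untouched
    assignment = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Step 1: assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: the clusters are stable
        assignment = new_assignment
        # Step 2: move each center to the mean of the points assigned to it.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centers, iterations


# Invented sample data: two visually separate groups of 2-D points.
sample_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
sample_assignment, sample_centers, sample_iterations = kmeans(
    sample_points, [(0, 0), (10, 10)]
)
```

The loop ends as soon as one full pass leaves every point in the same cluster as before, which is one simple way to express "the model converges".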

Association (Market Basket Analysis): In a sales situation, we
would like to identify products that often appear in the same
shopping basket, for cross-selling purposes.
[Figure: Product Association]

Regression: similar to classification, except that instead of
looking for a pattern that describes a class, the goal is to find a
pattern that determines a numerical value. Example: predict a coupon
redemption rate based on the face value, etc.
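
As a rough illustration of predicting a numerical value, here is a minimal one-variable least-squares sketch. The function names and the coupon figures are invented for this example; a real regression model would typically use more inputs than face value alone.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input value."""
    slope, intercept = model
    return intercept + slope * x


# Invented data: coupon face value (dollars) vs. observed redemption rate.
face_values = [0.25, 0.50, 0.75, 1.00]
redemption_rates = [0.02, 0.03, 0.05, 0.06]
coupon_model = fit_line(face_values, redemption_rates)
estimated_rate = predict(coupon_model, 0.60)
```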

Forecasting (predicting future values): What will the MSFT stock
value be tomorrow? What will the sales amount of wine be next
month?

Sequence Analysis: tries to find patterns in a series of events,
called a sequence. The web navigation figure shows a web click
sequence: each node is a URL category, and each line represents a
transition between two categories, weighted by the probability of
that transition.
[Figures: Time Series; Web Navigation Sequence]
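
The transition weights in a web click sequence can be estimated by counting how often one URL category follows another. The sketch below assumes a simple first-order model (each step depends only on the previous one); the function name and the click sessions are made up for illustration and are not taken from SQL Server.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next category | current category) from click sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Pair each step with the step that follows it.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Invented sessions: each list is one visitor's sequence of URL categories.
click_sessions = [
    ["Home", "News", "Sports"],
    ["Home", "News", "Weather"],
    ["Home", "Sports"],
]
click_weights = transition_probabilities(click_sessions)
```

Each inner dictionary sums to 1, so its values can be read directly as the weights on the lines leaving a node.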

Deviation Analysis: used to find rare cases that behave very
differently from the norm! Examples are credit card fraud
detection, network intrusion detection, manufacturing error
analysis, etc.
There is no standard technique. Analysts usually apply decision
trees, clustering, or neural network algorithms.
MS SQL Server 2008 and Data Mining:
Data Mining Project Cycle

Business problem formulation

Data Collection

Data cleaning and transformation

Model Building

Model Assessment

Reporting and prediction
MS SQL Server 2008 and Data Mining Extensions (DMX)
MS SQL Server 2008 and Data Mining
Extensions (DMX): An Overview

DMX was created by the Microsoft OLAP team, which leveraged OLE DB
as the application programming interface (API) and designed a query
language as close to SQL as possible while meeting the needs of
data mining.

Over time, the target developers expanded to include .NET
developers using C# or VB .NET, and OLE DB became less relevant.
MS SQL Server 2008 and Data Mining
Extensions (DMX): The D.M. Process




First, you need to define the problem!
Create a mining model (an object)
Provide training data to the model
Now, you can provide new data and perform predictions (deductions)
of information using the patterns discovered by the algorithm
during the training
[Figure: The Data Mining Process]
Using MS SQL Server Data Mining
Using MS SQL Server 2008 Data Mining:
The BI Dev Studio

The BI Dev. Studio: a tool integrated into the MS Visual Studio
shell to provide a complete development experience for BI.







Solution Explorer: where you manage your project and where objects
are created
Window tabs: allow you to switch between designer windows
Designer window: where you edit and analyze your objects
Designer tabs: the aspects of an object that you can edit or
interact with
Properties window: a context-sensitive window that displays the
properties of the selected item
BI menu: context-sensitive menus specific to Analysis Services
objects, e.g., open the data source view (DSV)
Output window: displays messages when you build and deploy
projects
Using MS SQL Server 2008 Data Mining:
Understanding Immediate & Offline Modes

Immediate Mode: more natural for data mining users; you are
connected to an Analysis Services server:
When you open an object, you get the object from the server
When you modify the object and save it, the object is immediately
updated on the server

Offline Mode: your project contains files that are stored on your
client machine:
Modifications to objects are stored in XML format on your hard
drive
The model and objects are not reflected on the server until you
decide to deploy them to the destination server
Using MS SQL Server 2008 Data Mining:
Creating & Modifying Data Sources

After you open your project, you must describe your source data
before you create mining structures and models

Two objects in Analysis Services act as interfaces to your data:
the data source and the data source view (DSV)

A data source is a simple object that consists of a connection
string, plus additional information indicating how to connect

A DSV is an abstraction layer that enables you to modify the way
you look at data sources
Using MS SQL Server 2008 Data Mining:
Exploring Data and Evaluating Models

To help you learn about and understand your data, the DSV Designer
leverages controls from Office Web Components (OWC) to provide
functionality for exploring your data in different views.

After organizing, modifying, selecting, and understanding the data
you want to analyze, you can start to create data mining objects.
Two important objects deal with data mining: mining structures and
mining models:
• Mining structure: defines the domain of a mining problem. In
addition, a mining structure contains the list of mining models
that use columns from the structure
• Mining model: applies a mining algorithm to the data in a mining
structure
MS SQL Server Available Algorithms

Naïve Bayes: enables you to create models with predictive
abilities; it learns from evidence, using the correlation between
the variable you are interested in and all other variables, e.g.,
figure out whether a congressman is a Democrat or a Republican
based on their voting record!

Decision Tree: one of the most popular data mining techniques
because of its fast training performance and high degree of
accuracy, e.g., classify whether a loan applicant is high or low
risk!

Time Series: consists of a series of data collected over
successive increments of time or another sequence indicator. The
main purpose is to forecast future series points based on past
history.
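
The Naïve Bayes idea above (score each class by how well the evidence fits it) can be sketched for categorical inputs as follows. This is a generic version with add-one smoothing, not Microsoft's implementation; the function names, bill names, and voting records are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """rows: list of dicts (attribute -> value); labels: one class per row."""
    class_counts = Counter(labels)
    # value_counts[label][attribute][value] = number of matching training rows
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[label][attribute][value] += 1
            values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, row):
    """Return the class with the highest log-probability score for this row."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # prior probability of the class
        for attribute, value in row.items():
            seen = value_counts[label][attribute][value]
            # Add-one smoothing keeps unseen values from zeroing the score.
            score += math.log((seen + 1) / (count + len(values_seen[attribute])))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented voting records: how each member voted on two hypothetical bills.
votes = [
    {"bill_a": "yes", "bill_b": "no"},
    {"bill_a": "yes", "bill_b": "no"},
    {"bill_a": "no", "bill_b": "yes"},
    {"bill_a": "no", "bill_b": "yes"},
]
parties = ["Democrat", "Democrat", "Republican", "Republican"]
party_model = train_naive_bayes(votes, parties)
```

The "naïve" part is the assumption that each vote contributes to the score independently of the others, which is why the per-attribute terms are simply added in log space.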

Clustering: finds natural groupings inside your data when such
groupings are not obvious. In other words, it finds hidden
variables that accurately classify your data. It is a good
technology for discovering hidden patterns but, as usual, you get
the best answers when you ask your question the right way.

Association Rules (market basket analysis): perform market basket
analysis on your customers' transactions. You can learn which
products are commonly purchased together and how likely a
particular product is to be purchased along with another. A
possible outcome is: 5% of your customers have bought X, Y and Z
together, and 75% of the customers who bought X also bought Z. You
could use this insight to manage stock levels, etc.
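
The two percentages in the association example correspond to the usual support and confidence measures. A minimal sketch, assuming each transaction is a set of product names; the helper names and the basket data are made up for illustration.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain the whole itemset."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the share that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)


# Invented baskets for illustration.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Z"},
    {"X", "Z"},
    {"X"},
    {"Y"},
]
xyz_support = support(baskets, {"X", "Y", "Z"})
x_to_z_confidence = confidence(baskets, {"X"}, {"Z"})
```

Support answers "how common is this combination overall", while confidence answers "given the first product, how likely is the second".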

Neural Networks and Logistic Regression: The human mind analyzes a
problem's facts and weights them; these weighted facts are then
grouped to lead to a conclusion.
Neural networks are mathematical models of this process. They work
by creating neural paths (relationships between inputs and outputs)
that are used as patterns for further predictions.
Training a neural network is more time-consuming than training
other models. The complexity comes from the fact that (1) any/all
inputs may be related somehow to any/all outputs, and (2) different
combinations of inputs may be related differently to the outputs!
The MS Logistic Regression algorithm is a special case of a neural
network, one with a single level of relationships. It is typically
used by statisticians to model and predict the probability of
events based on inputs.
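
To make the "single level of relationships" concrete, the sketch below trains a logistic regression as a network with no hidden layer: the inputs are weighted, summed, and passed through a sigmoid to produce a probability. It is a generic gradient-descent version, not the MS Logistic Regression implementation; the function names, learning rate, epoch count, and training data are all assumptions.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit one weight per input plus a bias by stochastic gradient descent."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of the log-loss w.r.t. the weighted sum
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict_probability(model, x):
    """Probability of the event (class 1) for one input vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Invented data: one input feature, event occurs (1) for the larger values.
inputs = [[0.0], [1.0], [2.0], [3.0]]
outcomes = [0, 0, 1, 1]
event_model = train_logistic(inputs, outcomes)
```

A full neural network would insert one or more hidden layers between the inputs and the sigmoid output, which is where the extra training cost comes from.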
END