Using Data Mining and Machine Learning in Retail
Omeid Seide
Senior Manager, Big Data Solutions
Sears Holdings
Bharat Prasad
Big Data Solution Architect
Sears Holdings
Over a Century of Innovation
 A Fortune 100 company, nearly $40 billion in annual revenue.
 The nation’s fourth largest broad line retailer with almost 2,500 full-line and specialty retail stores in the US and Canada.
 A front runner in Big Data efforts including driving personalized marketing and generating savings from legacy migration.
 Running one of the biggest rewards programs that captures and analyzes a very large number of customer transactions quickly.
The Challenge
 Growing data volumes
 Shortened processing windows
 Tight IT budgets
 Escalating costs
 Latency in data
 Hitting scalability ceilings
 ETL complexity
 Demanding business requirements
2
What is Big Data?
Big Data can no longer be defined by the
amount of data, but by the type, speed,
and storage capacity needed to compute
and analyze that data.
3
Data, Data, and More Data
We are creating so much data, so quickly, that 90% of the data in the
world today has been created in the last 2 years.
4
The Problem with Large Scale Data Processing

With traditional computer processing, it can be difficult to
compute everything because of limits on storage space, processing time, and
cost.

This typically leads to incomplete computations, data latency, and
an overall lack of quality analysis.

Hadoop brings horizontal scalability, extremely large storage
capability, and fast data processing.
5
Enter Hadoop
Apache Hadoop is a framework which:
 Runs applications on a large cluster built of commodity hardware.
 Provides reliability and data motion to applications.
 Implements a computational paradigm named MapReduce.
• Applications are divided into small fragments of work, each of which can be
executed or re-executed on any node in the cluster.
 Provides a distributed file system (HDFS) that stores data on the compute
nodes, resulting in high aggregate bandwidth across the cluster.
 Both MapReduce and the distributed file system handle node failures
automatically.
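The MapReduce paradigm described above can be illustrated with a minimal, single-machine sketch. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample baskets are assumptions made for illustration. It only shows how work splits into independent map fragments whose output is grouped by key and then reduced.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (item, 1) pair per item in a single transaction line."""
    for item in record.lower().split():
        yield item, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # On a cluster each record could be mapped on a different node;
    # here the fragments simply run one after another in a local loop.
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: one line per transaction, items separated by spaces.
transactions = ["diapers beer", "beer chips", "diapers beer wipes"]
print(run_job(transactions))
```

Because each call to `map_phase` depends only on its own record, a failed fragment can be rerun on another node without affecting the rest of the job, which is the property the slide refers to as execution/re-execution.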
6
Why Use Hadoop?
 Scalability: Hadoop is “horizontally scalable.”
• Easily stores and processes petabytes of data, just by adding
hardware.
 Economical: Uses commodity hardware.
 Efficient: Extremely powerful processing ability.
 Reliability: Data is replicated three times by default, in different locations; failed
tasks are rerun.
 Storage space & capacity: Central repository; keep everything forever.
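Replication has a direct effect on capacity planning. The arithmetic below is a rough sketch, assuming a replication factor of 3 and ignoring file system overhead and space reserved for temporary data; the function name and the node and disk figures are made up for illustration.

```python
def usable_capacity_tb(nodes, disk_tb_per_node, replication=3):
    """Approximate usable storage when every block is stored `replication` times."""
    raw_tb = nodes * disk_tb_per_node
    return raw_tb / replication


# Hypothetical cluster: 10 nodes with 12 TB of disk each.
print(usable_capacity_tb(10, 12))  # 40.0
# Doubling the hardware doubles the usable capacity.
print(usable_capacity_tb(20, 12))  # 80.0
```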
7
Big Data Analytics in Retail
 How can I better manage my inventory?
 How can I better understand my customers’ buying habits?
 How can I detect fraudulent activity?
 How can I create better targeted interaction with my customer?
 How do I get customers to purchase more products?
8
The Evolution of Data Analysis
9
What is Mahout?
 A top-level Apache Software Foundation project
 Uses scalable machine learning algorithms
 Collection of pre-built data-mining libraries
 Primary focus on collaborative filtering, clustering &
classification
 Includes a Java-based math library for common math
operations
 Uses the MapReduce paradigm
10
Examples of Data Mining & Machine Learning
11
3 Primary Algorithms
 Clustering
 Recommendation Systems
 Market Basket Analysis
12
Clustering
A process of grouping things so that similar
items end up in the same group, each placed
with the other items it most closely
resembles.
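One common way to form such groups is k-means. The sketch below is a plain-Python illustration, not Mahout's implementation; the customer features (store visits per month, average basket value in dollars), the starting centroids, and the function name `kmeans` are assumptions chosen to keep the example small.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket value in dollars).
customers = [(1, 20), (2, 25), (1, 22), (8, 90), (9, 100), (10, 95)]
centroids, clusters = kmeans(customers, centroids=[(1, 20), (10, 95)])
print(clusters[0])  # occasional, low-spend shoppers
print(clusters[1])  # frequent, high-spend shoppers
```

In practice the features would be scaled before computing distances so that the dollar amounts do not dominate the visit counts; that step is left out here for brevity.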
13
Motivation behind Clustering
Why use clustering?
 To better understand a customer’s buying behavior
 To develop targeted marketing campaigns
 To understand interest, motivation, and lifestyle, in
order to move merchandise in and out of stores more
effectively
14
Recommendation Systems
An information filtering system that is used to
predict a user’s rating or preference, typically
using a collaborative, content-based, or hybrid
approach to recommendations.
15
Collaborative Filtering
 Framework that filters and recommends
items based on user behavior, preferences,
and activities.
 Recommendations are based on a user’s similarities to others.
 Recommenders
• User based
• Item based
 Online and offline support
 Can utilize Hadoop
 Uses numerous similarity measurements,
such as Cosine, LLR, Tanimoto, Pearson,
and more.
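A user-based recommender built on two of the similarity measures named above can be sketched as follows. These are hand-written functions, not Mahout's API; the rating data, the function names, and the scoring rule (sum of similarity-weighted ratings from other users) are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two rating dicts, treating unrated items as 0."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def tanimoto(a, b):
    """Tanimoto coefficient: overlap of the two item sets, ignoring rating values."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def recommend(target, ratings, similarity=cosine, top_n=3):
    """Score items the target has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = similarity(ratings[target], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"drill": 5, "saw": 4},
    "bob": {"drill": 5, "saw": 5, "sander": 4},
    "cat": {"tv": 5, "soundbar": 4, "drill": 1},
}
print(recommend("ann", ratings))
print(recommend("ann", ratings, similarity=tanimoto))
```

An item-based recommender follows the same pattern with the roles swapped: similarities are computed between items (columns of the rating data) rather than between users.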
16
Content-Based Filtering
 Looks at the item and the user’s preferences in order to provide a
recommendation.
 Allows for highly precise recommendations.
 Has difficulty making recommendations across different categories of
service, which limits its use for cross-selling.
[Diagram: users (A, B, C) and their ratings of content used in the past
yield feature values that form a user profile; contents (X, Y, Z) yield
feature values that form a content profile. The two profiles are matched,
and content with similar feature values is recommended.]
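The matching step in this slide can be sketched in a few lines: build a user profile from the feature values of content used in the past, then rank unseen content by how closely its feature values match. The feature names, item names, and the rating-weighted averaging are assumptions for illustration, not a description of any particular product.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_user_profile(rated_items, content_profiles):
    """Rating-weighted average of the feature vectors of content used in the past."""
    total = sum(rated_items.values())
    size = len(next(iter(content_profiles.values())))
    profile = [0.0] * size
    for item, rating in rated_items.items():
        for k, value in enumerate(content_profiles[item]):
            profile[k] += rating * value / total
    return profile


def recommend(rated_items, content_profiles, top_n=2):
    """Rank content the user has not rated by similarity to the user profile."""
    profile = build_user_profile(rated_items, content_profiles)
    candidates = [item for item in content_profiles if item not in rated_items]
    candidates.sort(key=lambda item: cosine(profile, content_profiles[item]),
                    reverse=True)
    return candidates[:top_n]


# Hypothetical feature values: (power tool, outdoor, electronics).
content_profiles = {
    "drill": [1, 0, 0],
    "sander": [1, 0, 0],
    "mower": [1, 1, 0],
    "tv": [0, 0, 1],
}
print(recommend({"drill": 5}, content_profiles))
```

The example also shows the cross-selling limitation noted in the slide: a user who has only rated power tools gets a similarity of zero for the TV, so content from an unrelated category is never surfaced.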
17
Market-Basket Model
 A model used to describe a common form of many-to-many relationship
between two kinds of objects.
 Items: anything that is purchased
 Basket: a set of items
 The number of items in a basket is typically small, and the number
of baskets is typically large
18
Market Basket Models
 A list of purchasers
• Additional “purchaser” data can be useful (but
is not needed)
 A list of transactions
 Seek to identify purchasing patterns
• What items are normally purchased together?
• What is the purchasing sequence?
• Is there a seasonality effect to purchasing?
 Categorize buying behavior
 Translate buying behavior into actionable insight
• Targeted promotions
• Inventory placement
• Store layout
• Cross-selling
19
Frequent Itemsets
Any set of items that appears regularly within multiple baskets
 Originally used to analyze a physical “supermarket basket”
 Best used to link pairs of items that are commonly bought together but
often have no obvious relationship to each other
Example: Diapers & Beer
 A major store chain discovered that diapers and beer were regularly
appearing in baskets together. The theory was that someone buying diapers
is likely to have a baby at home; with a baby at home, that person is less
likely to go to a bar to drink and more likely to have a beer at home.
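Finding frequent pairs comes down to counting. The sketch below counts every item pair per basket, keeps the pairs whose support (the fraction of baskets containing the pair) meets a threshold, and computes the confidence of a rule such as "diapers, therefore beer". The baskets, the threshold, and the function names are made up for illustration; on real data volumes the pair counting would be distributed or pruned, for example with the Apriori approach.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs found in at least `min_support` of baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total
            for pair, count in counts.items()
            if count / total >= min_support}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical baskets.
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"beer", "chips"},
    {"diapers", "wipes"},
]
print(frequent_pairs(baskets, min_support=0.5))
print(confidence(baskets, "diapers", "beer"))
```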
20
Applying Market Basket Models
Retail Stores
 Showroom floor planning
 Catalog layout
 Cross-selling
 Fraud analysis
21
Big Data Stack
[Diagram: a layered stack shown for both On-Premises and Cloud deployments.
Layers: Consumption Layer, Semantic Layer, Computation/Access Layer,
Storage Layer, Integration Layer, Security Layer.
Labeled components: Data Visualization & Reporting, Data Analytics,
Data Mining, Hive/Pig, Advanced Query, Storage (HDFS), NoSQL DB, Metadata,
Data Governance & Integration (ETL/ELT).
Frequency: On demand, Real-Time, Streaming, Time series.]
22
Open vs Closed Stack
[Diagram: two stacks side by side, each labeled with a Distribution and the
layers Consumption Layer, Semantic Layer, Computation/Access Layer,
Storage Layer / NoSQL DB, Security Layer, Integration Layer, Source Layer.
An additional label reads "Blog".]
23
Questions?
24