"Analytic Tasks from Business perspective"
ISI CODATA International training Workshop on Big Data
18 March, 2015
DRTC, ISI Bangalore
SQC & OR Division, ISI, Bangalore
• An organization is in business to ensure sustainable
growth in profit.
• Financial results do not mean profit alone.
• An organization looks for
• Quantum of money- Revenue
• Quality of money- Margin/ Savings/ Profit
• Speed of money- Liquidity/ Cash flow
• Ease of Money- Ease of doing business
• The first three are termed the Hard aspects of Business, sometimes spelt out as
`looking for hard savings’, i.e. short-term financial results.
• The fourth one calls for Soft gain. Customers,
Employees, Management and Suppliers should feel
comfortable doing business or working with the
Organization. The Soft aspect is very important for any
• Only the `Soft’ aspect gives an
Organization the ability to address overall sensitivity to all
the stakeholders.
• Anybody who is impacted by the products and
processes of an Organization is called a stakeholder.
• A customer can be defined as the ‘ultimate
recipient of the products or services for their respective
• Addressing sensitivity involves three important issues:
o The customer has to be satisfied.
o Government (legal) requirements have to be at least
complied with.
o We need to be sensitive to all other stakeholders.
Sensitivity need not mean satisfaction. Sensitivity may
mean transparency on values/ policies/ practices/
processes/ requirements/ attitude or behavior etc.
• Stakeholders support the Organization to ensure
long-term survival. Any gain in sensitivity to the
stakeholders would keep the Organization better
Hence the real meaning of the business is
o Business = Hard (Finance) + Soft (Sensitivity to the stakeholders)
• Existing Resources: More output from fewer
resources -> More output from less
investment -> Reducing opportunities for
waste -> Cash Flow
• Existing Resources: More output from same
resources -> Reducing Defects -> Problem
• More Resources: More output from more
resources/investments -> Expand Capacity
-> Prevent Defects from the beginning
-> Revenue
• More Resources: More output from more
resources/investments -> Increase Market
Share -> New Product -> Prevent Defects
from the beginning -> Revenue
What is Big Data
Every day, we create 2.5 quintillion (10^18) bytes of data, so much that
90% of the data in the world today has been created in the last two years
alone. These data come from everywhere (to name a few):
Sensors used to gather climate information
Posts to social media sites
Digital pictures and videos
Purchase transaction records
Cell phone GPS signals, etc.
High Volume
High Velocity
Large Variety
Poor Veracity
BIG DATA: Volume
• Enterprises are acquiring very large volumes of data
through a variety of sources
• Some examples of use:
- Sentiment Analysis - Twitter Data: Terabytes of
Tweets are created each day, which can be used
for improved product sentiment analysis
- Predict Power Consumption: Convert billions of
annual meter readings into better predictions of power
consumption, say for every hour or every minute
BIG DATA: Velocity
• For time-sensitive processes such as catching fraud,
preventing accidents, giving life-saving medication etc., Big
data must be used as it streams into an enterprise in order to
maximize its value
• Some examples of use:
- Scrutinize millions of credit card transactions each day to
identify potential fraud
- Analyze billions of daily call detail records in real time to
predict customer churn faster
- In the ICU, analyze blood chemistry/ECG readings in real time to
deliver life-saving medication
BIG DATA: Variety
• Big data can be of any type: structured and unstructured data
such as text, sensor data, audio, video, click streams, log files and
more. New insights are found when analysing these data
types together.
• Some examples of use:
- Monitor live video feeds from surveillance cameras to
identify potential threats
- Utilize image, audio, video and web information about a
customer to give better product usage training, safety tips
and recommendations.
BIG DATA: Veracity
• Accuracy is a big concern in Big Data. There is no
easy way to segregate good data from bad.
• Some concerns:
- Among thousands of hotel reviews, which ones
are authentic and which ones are not
- How to find out the truth from thousands of product
- How to identify a rumour from an informed
Data and Extraction of Information - Current Scenario
The growth of data availability is mind-boggling. According to Intel, the quantity of
information generated from the dawn of human history till 2003 – some 5 exabytes – is now
created every two days
Data processing and storage costs have decreased by a factor of 1000 over the past decade
Technologies like Hadoop and MapReduce eliminate the need to structure the data in
rigidly defined formats – a costly, labour-intensive proposition
Powerful techniques for analyzing data to extract various insights have been developed and
software is available to enable easy implementation
Advanced statistical, optimization, machine-learning and data-mining techniques enable
extraction of hitherto unavailable insights
At present, technology allows us to keep a lot of data on phenomena as well as
individual entities
It should be possible for us to learn a lot about phenomena and entities from
these data, and this knowledge may be used for improved decision making
The ability to store the right data in an appropriate structure and extract meaningful
information from the same is, therefore, becoming crucial for business success
Some Examples
An automobile manufacturer wants to understand how the fault and failure related data
captured through the sensors may be used to classify the condition of vehicles so that
preventive maintenance may be carried out optimally. Similar situations are applicable to large
manufacturers having many machines, e.g. miners, aircraft manufacturers.
Insurers may wish to classify drivers as very risky, risky, safe etc. on the basis of their driving
habits so that insurance premium may be fixed intelligently
A company engaged in oil exploration may need to estimate the time and expenses of drilling
under different geological conditions before taking up a drilling assignment
A company in any segment may wish to forecast the total demand based on past demands as
well as past and current economic conditions
Manufacturers of consumer electronics may need to understand the sentiment of people
communicating over social media about their products
A large retailer may like to understand the impact of a natural disaster like a hurricane on
purchase behaviour
An e-commerce company may want to know the impact of making changes in the portal or
sales policy on the quantum of sales
Credit card as well as health insurance companies may wish to identify fraudulent transactions
so that appropriate actions may be initiated
A retailer may like to suggest additional products a customer may be willing to buy on the basis
of the current as well as past surfing data
Definition of Business Analytics (BA)
BA is the science and the art of improving business functions using data and
analytical techniques
It is a science since it uses theories of probability, statistics, data mining
techniques and a well defined process
It is an art because, like a brilliant painter, the analyst has to draw from a diverse
palette of colours (data sources) to find the perfect mix that will yield actionable
It is also an art as the analyst must have a deep level of creativity and business
understanding to be able to clearly identify the problem, understand the
implementation challenges and effectively communicate the proposed solution.
As the saying goes – in business analytics problems will often have to be taken
rather than being given
Note: The solution of analytics problems will often come in the following forms
Insights that may be acted upon (or stop unnecessary actions)
Models or solutions that may be used to improve effectiveness of
business functions
Automatic solutions embedded in software systems
Components of Business Analytics
• Data acquisition, engineering and processing (mostly compilation):
operational databases, data warehouses, online processing and
mining, enterprise information acquisition and cleaning, big data
technologies like Hadoop
• Understanding the organizational structure and skills
required for effective implementation, identification
of important business problems, understanding how
information is used and created by line managers
(understanding of cognitive and behavioral sciences),
setting up measurement systems to assess success,
linking business analytics to the strategy of the firm
• Application of statistical and data mining tools on specific
business problems: formulation of the business problem in statistical
terms, breaking down a problem into a set of canonical
analytic tasks, identification of statistical / data mining
tools, verification of assumptions, avoiding traps like
selection bias, processing and interpreting data, model
validation, presentation of quantitative solution, design
of data collection plans, analysis of data collected on
campaign basis (e.g. surveys)
• Identification of types and classes of problems (horizontals and verticals)
Business Analytics Process
Business Understanding
Data Preparation
Model Building and
Types of Analytics Problems
• Analytics problems may be classified from two
perspectives, namely
– Method of analyses
– Type of business problem
Two Broad Types from Methods Perspective
• Supervised learning
– Understanding the behaviour of a target (response /
dependent / Y) variable as a set of inputs (independent /
explanatory / X) vary
– Typically attempts are made to develop a function to
estimate the target
– These methods are often called dependency analyses
• Unsupervised learning
– Discovering associations and patterns among a set of input
measures. After patterns are found, the analyst is
responsible for finding how to interpret and use them.
– These analyses do not attempt to estimate some Y on the
basis of X variables. Rather, attempts are made to
understand relationships / patterns of X variables.
– These methods are often called inter-dependency analyses
Examples of Supervised Learning
• Predict whether a patient, hospitalized due to a heart attack, will
have a second heart attack. The prediction is to be based on
demographic, diet and clinical measurements for that patient.
• Predict the price of a stock 6 months from now, on the basis of
company performance measures and economic data.
• Predict whether a particular credit card transaction is fraudulent.
The prediction is to be based on past transaction history,
transaction type, reputation of the merchants involved and other
similar variables
• Identify the impact of different variables like price, relative brand
position, general economic condition, level of competition, and
product type (luxury / necessity…) on the demand of a particular
product during a given period
Examples of Unsupervised Analytics
Find typical profile of employees who quit quickly
Find products that are usually sold together
Group cities with respect to their characteristics
Develop a scale to measure brand position
Analytic Tasks from Business Perspective
Hypotheses testing
Classification and class probability estimation
Value estimation, explanatory and causal models
Discovering dimensions, and construction and validation of measures
Profiling – understanding behavioural pattern of individual entities
Associations and co-occurrence grouping
Exploration of phenomenon and understanding trends
Link prediction
Constrained optimizations (primarily LP, its variants and network
Most business problems can be solved using a combination of
these tasks. As an analyst one should be in a position to break a
problem down in terms of these tasks.
Hypotheses Testing
• Hypotheses are statements about a given
phenomenon, e.g. increasing number of years
of education increases earning potential;
design A produces a lower defect rate
compared to design B; a particular design of a
web page leads to more conversions compared
to another
• Hypotheses testing consists of determining
the plausibility of the statements on the basis
of data
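As an illustration of the web page example above, a minimal sketch of one possible test is given here: a one-sided two-proportion z-test. The visitor and conversion counts are hypothetical, and the choice of test is an assumption made for illustration, not something the slides prescribe.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided z-test of H0: p_a <= p_b against H1: p_a > p_b."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: conversions and visitors for page designs A and B.
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=90, n_b=1000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value makes the statement "design A converts better than design B" more plausible; it does not by itself say the difference is large enough to matter for the business.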
Classification and Class Probability Estimation
• There are situations where the target is classified, e.g.
whether a particular credit card transaction is
fraudulent or not; whether a customer will renew her
contract or not; whether a sales bid will be won, lost or
abandoned by the customer; how to classify a loan
application as low, high or medium risk
• The problem is to allocate the target variable to one of
the classes based on the value of some explanatory
variables
• In most cases the probability that the target will belong
to different classes is first estimated. An allocation to a
particular class is made on the basis of the estimated
probabilities
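The two-step idea above (estimate class probabilities first, then allocate) can be sketched with a nearest-neighbour estimate. This is only one of many possible classifiers, chosen here because it is short. The loan data, the feature choice and the function names are hypothetical.

```python
from collections import Counter
import math

def class_probabilities(train, query, k=5):
    """Estimate class probabilities as class shares among the k nearest neighbours."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def allocate(probabilities):
    """Allocate to the class with the highest estimated probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical loan applications: (debt-to-income ratio, past defaults) -> risk class.
train = [
    ((0.10, 0), "low"), ((0.15, 0), "low"), ((0.20, 0), "low"),
    ((0.35, 1), "medium"), ((0.40, 1), "medium"), ((0.45, 1), "medium"),
    ((0.70, 3), "high"), ((0.80, 4), "high"), ((0.90, 3), "high"),
]

probs = class_probabilities(train, query=(0.42, 1), k=5)
print(probs, "->", allocate(probs))
```

In practice the features would be put on comparable scales before distances are computed, and the allocation rule might weigh the cost of each kind of misclassification rather than simply picking the most probable class.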
Value Estimation
• Some business problems require estimating the value
of a target variable rather than classifying the same.
• Some examples of value estimation are – finding the
lifetime value of a customer; estimating the effort
required to complete a software development project;
finding the total number of cheques that may arrive for
• The value needs to be estimated based on certain
explanatory variables and hence this task comes under
the broad class of supervised analytics (dependency
analyses)
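For the software effort example, a minimal sketch of value estimation with a single explanatory variable is shown below, using an ordinary least squares line. The project sizes and efforts are hypothetical, and a real model would usually involve more variables and a check of the fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical past projects: size (function points) and effort (person-days).
sizes = [100, 150, 200, 250, 300]
efforts = [52, 74, 101, 127, 148]

intercept, slope = fit_line(sizes, efforts)
# Estimated effort for a new project of size 220.
estimate = intercept + slope * 220
print(f"effort = {intercept:.2f} + {slope:.2f} * size; estimate at 220: {estimate:.1f}")
```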
Discovering Dimensions and Constructing Scales
• Some business problems require understanding unobserved variables, e.g.
it may be important to identify the different dimensions of customer
satisfaction or skill so that the same can be measured.
• Developing dimensions of unobserved variables in terms of observable
variables to facilitate its measurement is called scale construction
• Businesses may also need to group a large number of variables being
measured into a small number of dimensions. This enables reducing the
dimension of a problem.
– A retail store may try to discover the different dimensions of store appearance
and performance on the basis of a large number of measurements. Controlling
a small set of dimensions may be easier than controlling a large number of
– Construction of indices like a corruption index or performance measures requires
discovering the different dimensions of these unobserved variables
• Note: The unobserved (unobservable) variables are often called latent
variables or constructs
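When several observed items are combined into a scale for a latent variable, one common check of internal consistency is Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of total score). A minimal sketch follows; the survey scores are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists (one list per item)."""
    k = len(items)
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical survey: three satisfaction items answered by five customers (1-5 scale).
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Values close to 1 suggest the items move together and may be measuring the same dimension; the threshold for "good enough" is a judgment call that depends on the application.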
Profiling – Understanding Behavioural Pattern
• This is an example of unsupervised learning
• Profiling is often referred to as behaviour
• Examples of profiling questions are
– What are the usual characteristics of persons
buying this brand of automobile
– Can we describe the credit card spending pattern
of a particular customer
– What is the typical cell phone usage for this
customer segment
Association and (Co-Occurrence) Grouping
• Some business analytics tasks require grouping a set of entities
such that actions may be initiated for the entire group. Some
examples are
– Customers may be grouped with respect to payment behaviour such that
finance team may have different collection strategies and targets for the
different groups
– A company wants to carry out test marketing in a large number of tier-II
cities but faces a budget constraint. In order to overcome the budget
constraint, the company groups cities into a small number of clusters and
test markets in one randomly selected city from each group. It is assumed
that the cities within each cluster are likely to have similar behaviour.
– Employees of an organization may be grouped with respect to a number
of behavioural traits. The performance as well as propensity to leave the
company may be similar within groups while between group differences
may be large.
– Identification of simultaneous occurrences like which products are bought
together frequently; which events take place together (a behaviour of the
customer in the beginning of a project may indicate a behaviour later)
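The last example above (products bought together frequently) can be sketched by counting co-occurrences in purchase baskets. Support and confidence are the usual market basket measures. The baskets and function names below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Share of baskets in which each unordered product pair occurs together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket).

    Assumes the antecedent occurs in at least one basket.
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical purchase baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "tea"],
    ["bread", "butter", "tea"],
    ["bread", "milk"],
]
support = pair_support(baskets)
print(support[("bread", "butter")], confidence(baskets, "bread", "butter"))
```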
Link Prediction
• This analytic task attempts to predict connections
between data items usually by suggesting that a link
should exist between the items and also estimating the
strength of the link.
• Link prediction is common in social networking systems
– since you and Amitava share 10 friends, maybe you
would like to be Amitava’s friend
• Link prediction can attempt to measure the strength
of the links and use the same for recommendation
systems, e.g. for recommendation of movies, books or
other product purchases
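The shared-friends example can be sketched as a common-neighbours score: the more friends two unconnected people share, the stronger the predicted link. This is one simple scoring rule among several. The friendship graph below is hypothetical apart from the name used in the example above.

```python
def common_friend_scores(friends, person):
    """Score each non-friend of `person` by the number of shared friends."""
    scores = {}
    for other, other_friends in friends.items():
        if other == person or other in friends[person]:
            continue
        shared = friends[person] & other_friends
        if shared:
            scores[other] = len(shared)
    # Strongest predicted links first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical undirected friendship graph stored as adjacency sets.
friends = {
    "You": {"Asha", "Bimal", "Chitra"},
    "Amitava": {"Asha", "Bimal", "Chitra", "Dev"},
    "Asha": {"You", "Amitava", "Esha"},
    "Bimal": {"You", "Amitava"},
    "Chitra": {"You", "Amitava"},
    "Dev": {"Amitava"},
    "Esha": {"Asha"},
}
suggestions = common_friend_scores(friends, "You")
print(suggestions)
```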
Phenomenon and Trend Understanding
• There are situations when we try to understand how a
system will behave by changing the inputs. Examples
– Variation of customer waiting time (at a bank, call center,
garage…) as the arrival, service time and number of servers
– Stock-out conditions depending on variation of lead time
to purchase, demand, market conditions leading to
changes of demand
– Default risks based upon change of economic conditions
Note: These tasks are the same as classification or value estimation. However, in some cases
the situations become so complex that developing models becomes extremely
difficult. In certain cases closed-form models may not even exist. In these cases the
phenomenon understanding / classification / value estimation is carried out using
a technique called simulation.
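To make the waiting-time example concrete, a minimal simulation sketch is given below. It assumes a single server, first-come-first-served service, and exponentially distributed inter-arrival and service times. These are simplifying assumptions for illustration, and the rates are hypothetical.

```python
import random

def mean_waiting_time(arrival_rate, service_rate, customers=20000, seed=1):
    """Simulate a single-server, first-come-first-served queue; return the mean wait."""
    rng = random.Random(seed)
    wait = 0.0          # waiting time of the current customer
    total_wait = 0.0
    for _ in range(customers):
        total_wait += wait
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)   # time until the next arrival
        # Lindley recursion: the next customer waits for whatever work remains.
        wait = max(0.0, wait + service - gap)
    return total_wait / customers

# Same service rate, two hypothetical arrival rates (customers per minute).
quiet = mean_waiting_time(arrival_rate=0.5, service_rate=1.0)
busy = mean_waiting_time(arrival_rate=0.9, service_rate=1.0)
print(f"mean wait: quiet = {quiet:.2f}, busy = {busy:.2f}")
```

Rerunning the function with different rates shows how the waiting time reacts to changes in the inputs. Extending it to several servers or to other distributions is where simulation becomes more practical than a closed-form model.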
Constrained Optimization
• This is an important class of problems where
we try to maximize or minimize an objective
function subject to a number of constraints
– Planning the transportation system within a city
such that utilization is maximum
– Allocating jobs to different machines so that the
waiting time is minimized
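A very small version of the job allocation example can be solved by trying every assignment. The sketch below assumes one job per machine and uses total processing time as a stand-in objective. The time matrix is hypothetical, and exhaustive search is used only because the instance is tiny; larger problems would use linear programming or a dedicated assignment algorithm.

```python
from itertools import permutations

def best_assignment(times):
    """times[j][m] = processing time of job j on machine m; one job per machine."""
    n = len(times)
    best_total, best_plan = None, None
    for plan in permutations(range(n)):   # plan[j] = machine given to job j
        total = sum(times[job][machine] for job, machine in enumerate(plan))
        if best_total is None or total < best_total:
            best_total, best_plan = total, plan
    return best_total, best_plan

# Hypothetical processing times (hours) for 3 jobs on 3 machines.
times = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
total, plan = best_assignment(times)
print(f"minimum total time = {total}, machine for each job = {plan}")
```

The constraint here is that each machine takes exactly one job, and the objective function is the sum of processing times.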
Relationship Between Fundamental Tasks and Techniques
Fundamental Task | Statistical / Data Mining Techniques
Phenomenon Understanding | Descriptive Statistics, EDA, hypothesis testing, graphical analysis and data visualization, contingency tables
Classification | Logistic Regression, Discriminant Analysis, Decision Trees – CART and CHAID, ANN, Support Vector Machine
Value Estimation | Table Lookup, Naive Bayesian, Nearest Neighbor, Regression models – MLR and its variants including shrinkage methods, Count Regression and zero inflated models, Cox Regression, Survival Analysis, different non-parametric methods, ANN
Profiling | Distributional and descriptive analyses, clustering
Association and Grouping | Correlation and similarity analyses, MTS, Cluster analysis, multi dimensional scaling, market basket analysis / association rule mining
Link Prediction | Graph Theory and graph traversal rules
Phenomenon Exploration and Trend Understanding | Time series analyses, ARIMA models, Other forecasting models, Time Series Regression, Simulation
Discovering Dimensions | Principal Component Analysis, Factor analysis, SEM, Cronbach’s Alpha
Optimization | Linear Programming and its variants, Genetic Algorithms and Swarm
• Business analytics problems may be looked at from two perspectives –
methods and business problems
• Statistical learning problems may be divided into two broad classes from
the methods perspective – supervised and unsupervised
• Supervised learning consists of building a model to establish the relationship
between a dependent variable (target) Y and a set of input (explanatory)
variables. Unsupervised learning finds patterns of association among the inputs
• Business problems may be divided into nine major classes. Real-life
analytics problems are often combinations of these problems
• An expert in Business Analytics must know the different techniques of
supervised and unsupervised learning. At the same time s/he should be in
a position to construct a business problem in terms of the fundamental
tasks. S/he should be aware of the techniques of supervised /
unsupervised learning typically used to address the different fundamental