Cluster Analysis (from Chapter 12)
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi

Cluster analysis
• A class of techniques used to classify cases into groups that are
  • relatively homogeneous within themselves and
  • heterogeneous between each other
• These groups are called clusters

Market segmentation
• Cluster analysis is especially useful for market segmentation
• Segmenting a market means dividing its potential consumers into separate subsets where
  • consumers in the same group are similar with respect to a given set of characteristics
  • consumers belonging to different groups are dissimilar with respect to the same set of characteristics
• This allows one to calibrate the marketing mix differently according to the target consumer group

Other uses of cluster analysis
• Clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches
• Data reduction
  • Factor analysis and principal component analysis reduce the number of variables
  • Cluster analysis reduces the number of observations by grouping them into homogeneous clusters
• Maps that simultaneously profile consumers and products, market opportunities and preferences, as in preference or perceptual mapping

Steps to conduct a cluster analysis
• Select a distance measure
• Select a clustering algorithm
• Define the distance between two clusters
• Determine the number of clusters
• Validate the analysis

Distance measures for individual observations
• To measure similarity between two observations, a distance measure is needed
• Multiple variables require an aggregate distance measure
• The best-known measure of distance is the Euclidean distance, the concept we use in everyday life for spatial coordinates

Examples of distances
Euclidean distance: $D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$
City-block (Manhattan) distance: $D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$
where $D_{ij}$ is the distance between cases $i$ and $j$, and $x_{kj}$ is the value of variable $x_k$ for case $j$
[Figure: the two distances illustrated for points A and B]
Problems:
• Different units of measurement give the variables different weights
• Correlation between variables (double counting)
Solution: standardization, rescaling, principal component analysis

Clustering procedures
• Hierarchical procedures
  • Agglomerative (start from n clusters to get to 1 cluster)
  • Divisive (start from 1 cluster to get to n clusters)
• Non-hierarchical procedures
  • K-means clustering (the number of clusters, c, must be known in advance)

Distance between clusters
• Algorithms vary according to the way the distance between two clusters is defined
• The most common algorithms for hierarchical methods include
  • single linkage method
  • complete linkage method
  • average linkage method
  • Ward algorithm
  • centroid method

Linkage methods
• Single linkage method (nearest neighbour): the distance between two clusters is the minimum of all distances between observations belonging to the two clusters
• Complete linkage method (furthest neighbour): the distance between two clusters is the maximum distance between observations belonging to the two clusters
• Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters

Hierarchical vs. non-hierarchical methods
Hierarchical methods
• No decision about the number of clusters is required in advance
• Problems when data contain a high level of error
• Can be very slow, preferable with small data sets
• Initial decisions are more influential (one-step only)
• At each step they require computation of the full proximity matrix
Non-hierarchical methods
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration

The number of clusters c
• Two alternatives
  • Determined by the analysis
  • Fixed by the researchers
• In segmentation studies, c represents the number of potential separate segments
• Preferable approach: “let the data speak”
  • Hierarchical approach, with the optimal partition identified through statistical tests (stopping rule for the algorithm)
  • However, the detection of the optimal number of clusters is subject to a high degree of uncertainty
• If the research objectives allow the number of clusters to be chosen rather than estimated, non-hierarchical methods are the way to go

Example: fixed number of clusters
• A retailer wants to identify several shopping profiles in order to open new, targeted retail outlets
• The budget only allows him to open three types of outlets
• A partition into three clusters follows naturally, although it is not necessarily the optimal one
• A fixed number of clusters calls for a non-hierarchical (k-means) approach

Determining the optimal number of clusters from hierarchical methods (in SPSS)
• Agglomeration schedule
• Icicle plot
• Dendrogram
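
The distance measures, linkage methods and agglomeration schedule named in these slides can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the procedure SPSS uses: the function names (`euclidean`, `city_block`, `standardize`, `agglomerate`) and the five sample points are hypothetical, `standardize` assumes no variable is constant, and the brute-force search is only practical for small data sets.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def standardize(data):
    """Rescale each variable to mean 0 and sample standard deviation 1.

    Assumes every variable takes at least two distinct values.
    """
    columns = list(zip(*data))
    means = [sum(c) / len(c) for c in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


# How the distances between members of two clusters are combined.
LINKAGE = {
    "single": min,                              # nearest neighbour
    "complete": max,                            # furthest neighbour
    "average": lambda ds: sum(ds) / len(ds),    # mean of all pairs
}


def agglomerate(data, distance=euclidean, linkage="single"):
    """Agglomerative clustering: start from n clusters, end with 1.

    Returns the agglomeration schedule as a list of
    (members_a, members_b, merge_distance) tuples, one per step.
    """
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(data))]
    schedule = []
    while len(clusters) > 1:
        best = None
        # Compare every pair of current clusters and keep the closest pair.
        for a, b in combinations(range(len(clusters)), 2):
            ds = [distance(data[i], data[j])
                  for i in clusters[a] for j in clusters[b]]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return schedule


# Hypothetical data: five cases measured on two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
schedule = agglomerate(points, linkage="single")
for left, right, d in schedule:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

Reading the printed schedule from top to bottom, a large jump in the merge distance between consecutive steps is commonly taken as a hint that two dissimilar clusters have just been joined, which is one informal way to choose the number of clusters. Passing `standardize(points)` instead of `points` applies the standardization suggested above when variables are measured in different units.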