This post is the second part in the customer segmentation analysis. The first post focused on kmeans clustering in R
to segment customers into distinct groups based on purchasing habits. This post takes a different approach, using Pricipal Component Analysis (PCA) in R
as a tool to view customer groups. Because PCA attacks the problem from a different angle than kmeans, we can get different insights. We’ll compare both the kmeans results with the PCA visualization. Let’s see what happens when we apply PCA.
Table of Contents
 Why PCA?
 Where We Left Off
 Getting Ready for PCA
 Applying PCA
 Visualizing the Results
 Conclusions
 Recap
 Further Reading
Why PCA?
PCA is a dimensionality reduction algorithm: PCA takes your data and decomposes it using transformations into principal components (PC). Each PC is selected in the orthogonal direction that maximizes the linear variance of the data. In essence, PCA is nothing more than an algorithm that takes numeric data in x, y, z coordinates and changes the coordinates to x’, y’, and z’ that maximize the linear variance. A great article on the background behind PCA is Pricipal Component Analysis: Explained Visually. As the title suggests, the article has great visuals to explain the algorithm.
How does this help in customer segmentation / community detection? Unlike kmeans, PCA is not a direct solution. What PCA helps with is visualizing the essence of a data set. Because PCA selects PC’s based on the maximum linear variance, we can use the first few PC’s to describe a vast majority of the data set without needing to compare and contrast every single feature. By using PC1 and PC2, we can then visualize in 2D and inspect for clusters. We can also combine the results with the kmeans groups to see what kmeans detected as compared to the clusters in the PCA visualization.
Before we jump into PCA, it’s a good idea to review where we left off in the previous customer segmentation post.
Where We Left Off
In the first post, we used kmeans clustering to analyze the bikes data set
, a collection of excel files that contains data for bike shops (customers), bikes (products), and sales orders for the bike manufacturer, Cannondale. The bike shops and sales orders are fictional / simulated (see the orderSimulatoR post for more on this), but the bikes (products) are actual models from Cannondale’s website.
A hypothesis was formed that bike shops purchase bikes based on bike features such as unit price (high end vs affordable), primary category (Mountain vs Road), frame (aluminum vs carbon), etc. The sales orders were combined with the customer and product information and grouped to form a matrix of sales by model and customer. The kmeans()
function was run on a range of potential clusters, k, and the silhouette()
function from the cluster
package was used to determine the optimal number of clusters.
Getting Ready for PCA
Rather than run through the previous post, the sections below can be used to get everthing ready for PCA.
Getting the Data
You can access the data here if you would like to follow along. You’ll need to download the following files:

orders.xlsx: Contains the fictional sales orders for Cannondale.
customer.id
in the orders.xlsx file relates tobike shop.id
in the bikeshops.xlsx file, andproduct.id
in the orders.xlsx file relates tobike.id
in the bikes.xlsx file. 
bikes.xlsx: Contains information on products (e.g. bike model, primary category, secondary category, unit price, etc).
bike.id
is the primary key. 
bikeshops.xlsx: Contains information on customers (e.g. customer name and location).
bikeshop.id
is the primary key.
The script to load and configure the data into a customer trends matrix is shown below.
Reading the Data
This script will read the data. Make sure you have the excel files in a folder named “data” in your current working directory prior to running the script below.
# Read Cannondale orders data 
library(xlsx) # Used to read bikes data set
customers < read.xlsx2("./data/bikeshops.xlsx", sheetIndex = 1,
colClasses = c("numeric", "character", "character",
"character", "character", "numeric"))
products < read.xlsx2("./data/bikes.xlsx", sheetIndex = 1,
colClasses = c("numeric", "character", "character",
"character", "character", "numeric"))
orders < read.xlsx2("./data/orders.xlsx", sheetIndex = 1, colIndex = 2:7,
colClasses = c("numeric", "numeric", "Date", "numeric",
"numeric", "numeric"))
Manipulating the Data
The script below combines the orders
, customers
and products
data frames into orders.extended
, which is a data frame that simulates output we would get from an SQL query of a sales orders database / ERP system. The data is then manipulated to form customerTrends
, which has the data structured such that the rows contain products and the columns contain purchase quantity (as percentage of total) by customer. The output, customerTrends
, is then used for kmeans clustering.
# Combine orders, customers, and products data frames 
library(dplyr)
orders.extended < merge(orders, customers, by.x = "customer.id", by.y="bikeshop.id")
orders.extended < merge(orders.extended, products, by.x = "product.id", by.y = "bike.id")
orders.extended < orders.extended %>%
mutate(price.extended = price * quantity) %>%
select(order.date, order.id, order.line, bikeshop.name, model,
quantity, price, price.extended, category1, category2, frame) %>%
arrange(order.id, order.line)
# Group by model & model features, summarize by quantity purchased 
library(tidyr) # For spread function
customerTrends < orders.extended %>%
group_by(bikeshop.name, model, category1, category2, frame, price) %>%
summarise(total.qty = sum(quantity)) %>%
spread(bikeshop.name, total.qty)
customerTrends[is.na(customerTrends)] < 0 # Remove NA's
# Convert price to binary high/low category 
library(Hmisc) # Needed for cut2 function
customerTrends$price < cut2(customerTrends$price, g=2)
# Convert customer purchase quantity to percentage of total quantity 
customerTrends.mat < as.matrix(customerTrends[,(1:5)]) # Drop first five columns
customerTrends.mat < prop.table(customerTrends.mat, margin = 2) # columnwise pct
customerTrends < bind_cols(customerTrends[,1:5], as.data.frame(customerTrends.mat))
Here’s the first 6 rows of customerTrends
:
knitr::kable(head(customerTrends))
model  category1  category2  frame  price  Albuquerque Cycles  Ann Arbor Speed  Austin Cruisers  Cincinnati Speed  Columbus Race Equipment  Dallas Cycles  Denver Bike Shop  Detroit Cycles  Indianapolis Velocipedes  Ithaca Mountain Climbers  Kansas City 29ers  Las Vegas Cycles  Los Angeles Cycles  Louisville Race Equipment  Miami Race Equipment  Minneapolis Bike Shop  Nashville Cruisers  New Orleans Velocipedes  New York Cycles  Oklahoma City Race Equipment  Philadelphia Bike Shop  Phoenix Bipeds  Pittsburgh Mountain Machines  Portland Bipeds  Providence Bipeds  San Antonio Bike Shop  San Francisco Cruisers  Seattle Race Equipment  Tampa 29ers  Wichita Speed 

Bad Habit 1  Mountain  Trail  Aluminum  [ 415, 3500)  0.0174825  0.0066445  0.0081301  0.0051151  0.0101523  0.0128205  0.0117340  0.0099206  0.0062696  0.0181962  0.0181504  0.0016026  0.0062893  0.0075949  0.0042135  0.0182648  0.0086705  0.0184783  0.0074074  0.0129870  0.0244898  0.0112755  0.0159151  0.0108696  0.0092251  0.0215054  0.0026738  0.0156250  0.0194175  0.0059172 
Bad Habit 2  Mountain  Trail  Aluminum  [ 415, 3500)  0.0069930  0.0099668  0.0040650  0.0000000  0.0000000  0.0170940  0.0139070  0.0158730  0.0031348  0.0110759  0.0158456  0.0000000  0.0094340  0.0000000  0.0112360  0.0167428  0.0173410  0.0021739  0.0074074  0.0095238  0.0040816  0.0190275  0.0026525  0.0108696  0.0239852  0.0000000  0.0026738  0.0078125  0.0000000  0.0000000 
Beast of the East 1  Mountain  Trail  Aluminum  [ 415, 3500)  0.0104895  0.0149502  0.0081301  0.0000000  0.0000000  0.0042735  0.0182529  0.0119048  0.0094044  0.0213608  0.0181504  0.0016026  0.0251572  0.0000000  0.0140449  0.0167428  0.0086705  0.0086957  0.0172840  0.0242424  0.0000000  0.0126850  0.0053050  0.0108696  0.0092251  0.0053763  0.0000000  0.0156250  0.0097087  0.0000000 
Beast of the East 2  Mountain  Trail  Aluminum  [ 415, 3500)  0.0104895  0.0099668  0.0081301  0.0000000  0.0050761  0.0042735  0.0152108  0.0059524  0.0094044  0.0181962  0.0138289  0.0000000  0.0220126  0.0050633  0.0084270  0.0076104  0.0086705  0.0097826  0.0172840  0.0086580  0.0000000  0.0232558  0.0106101  0.0155280  0.0147601  0.0107527  0.0026738  0.0234375  0.0291262  0.0019724 
Beast of the East 3  Mountain  Trail  Aluminum  [ 415, 3500)  0.0034965  0.0033223  0.0000000  0.0000000  0.0025381  0.0042735  0.0169492  0.0119048  0.0000000  0.0102848  0.0181504  0.0032051  0.0000000  0.0050633  0.0042135  0.0152207  0.0202312  0.0043478  0.0049383  0.0051948  0.0204082  0.0162086  0.0026525  0.0201863  0.0073801  0.0322581  0.0000000  0.0078125  0.0097087  0.0000000 
CAAD Disc Ultegra  Road  Elite Road  Aluminum  [ 415, 3500)  0.0139860  0.0265781  0.0203252  0.0153453  0.0101523  0.0000000  0.0108648  0.0079365  0.0094044  0.0000000  0.0106598  0.0112179  0.0157233  0.0278481  0.0210674  0.0182648  0.0375723  0.0152174  0.0172840  0.0103896  0.0163265  0.0126850  0.0026525  0.0139752  0.0073801  0.0053763  0.0026738  0.0078125  0.0000000  0.0098619 
KMeans Clustering
We used the kmeans()
function to perform kmeans clustering. The kmeans post goes into great detail on how to select the right number of clusters, which I skipped for brevity. We’ll use the groups for and clusters with the PCA analysis. Five clusters was the theoretical best solution, and four clusters was the best solution upon inspection of the data.
# KMeans Clustering (used later) 
set.seed(11) # For reproducibility
km4.out < kmeans(t(customerTrends[,(1:5)]), centers = 4, nstart = 50)
set.seed(11) # For reproducibility
km5.out < kmeans(t(customerTrends[,(1:5)]), centers = 5, nstart = 50)
Applying PCA
Now, back to our main focus: PCA. Applying PCA is very simple once the data is formatted. We use the prcomp()
function available in base R
. It’s strongly recommended to scale and center the data (for some reason this is not the default). Further reading on PCA in R
can be found here.
# PCA using prcomp() 
pca < prcomp(t(customerTrends[,(1:5)]), scale. = T, center = T) # Perform PCA
Once PCA is performed, it’s a good idea to take a look at the variance explained. As stated before, the goal of PCA is to reduce the dimensions of the data. PCA does this by transforming the data into dimensions that are orthogonal to the variation. The greater the variance explained, the more information that is summarized by the PC. The prcomp()
function returns this variance by PC. You can use summary(pca)
to get the variance explained. Let’s take a look at the first nine PC’s visually.
PC1 and PC2 combined explain 44% of the variance of the data, and there’s a steep dropoff between PC2 and PC3. This means that plotting PC’s 1 and 2 will give us a reasonably good understanding of the data, and adding more PC’s beyond PC2 will result in minimal improvement. Note that it won’t always happen that there is a significant dropoff after PC2, and if more PC’s explain variance we would need to evaluate them as well.
Visualizing the Results
We’ll plot PC1 and PC2, and we’ll color the clusters initially using the kmeans groups previously created. To do this, we first need to get the pca
data into a data frame that combines the customers with the PC’s. Luckily, the ggfortify
package has a function called fortify()
to do just that. Once our data is fortified, we create two data frames with the kmeans groups added to the end so we can group on them. We’ll use the kmeans group to color the PCA results so we can compare to kmeans and gain insights.
# Manipulate data for PCA Analyis 
library(ggfortify) # For fortify()
pca.fortify < fortify(pca) # fortify() gets pca into usable format
# Add group (short for color) column using k=4 and k=5 groups
pca4.dat < cbind(pca.fortify, group=km4.out$cluster)
pca5.dat < cbind(pca.fortify, group=km5.out$cluster)
Plotting the PC’s
The script below provides sample ggplot
scripts to create a general plot for PC1 and PC2. The script uses my favorite interactive plotting library, plotly
, to turn the ggplot
results into interactive plots. The scripts used for the next section follow the same general procedure, so I won’t show the code for brevity.
# Plotting PC1 and PC2 using ggplot and plotly 
library(ggplot2)
library(plotly)
# Script for plotting k=4
gg2 < ggplot(pca4.dat) +
geom_point(aes(x=PC1, y=PC2, col=factor(group), text=rownames(pca4.dat)), size=2) +
labs(title = "Visualizing KMeans Clusters Against First Two Principal Components") +
scale_color_brewer(name="", palette = "Set1")
# Use plotly for inteactivity
plotly2 < ggplotly(gg2, tooltip = c("text", "x", "y")) %>%
layout(legend = list(x=.9, y=.99))
PCA Viz: K=4 KMeans Grouping
The PCA segmentation shows some interesting results. It appears that there are five clusters as opposed to four. The kmeans results would not pick this up, because we selected four clusters. Maybe your thinking that the silhouette analysis from the previous post indicated that we should pick , so this is where we went wrong. Let’s see what happens with the five cluster kmeans.
PCA Viz: K=5 KMeans Grouping
Well, that’s not what we thought was going to happen! It looks like Group 2 was incorrectly classified. From the kmeans post, this actually makes sense: We inspected the clusters, and Group 2 and 4 were very similar in their preferences for bikes in the lowend price range. We decided to combine these clusters by switching to clusters. Next, we’ll look at clustering based on inspection of the PCA plot.
PCA Viz: Visual Inspection Grouping
From the PCA visualization, we can see there are two bike shops that do not belong to Group 4. Based on our visual inspection, we can modify the groups (column 128) by switching Group 2 (San Antonio Bike Shop & Philadelphia Bike Shop) with the incorrectly classified Group 4 shops (Denver Bike Shop & Kansas City 29ers).
# Switch Group 2 Bike Shops with misclassified Bike Shops in Group 4 
pca.final.dat < pca5.dat
pca.final.dat[rownames(pca.final.dat) %in%
c("San Antonio Bike Shop", "Philadelphia Bike Shop"), 128] < 4
pca.final.dat[rownames(pca.final.dat) %in%
c("Denver Bike Shop", "Kansas City 29ers"), 128] < 2
And, let’s visualize the results.
Everything looks good, but we need to inspect the newly created Group 2’s preferences to make sure they should truly be a standalone customer segment.
Group 2 Inspection
The script below modifies the customerTrends
data frame to select only customer columns that are in the group we want to inspect. The script then takes the average of the customer’s percent of quantity purchased vs total quantity purchased (the values in the customer columns). The data frame is sorted by most frequently purchased so we can see the group’s central tendency. For Group 2, the central tendency is a preference for lowend mountain bikes.
# Inspect Group 2 Preferences 
# Select only groups in group num and perform rowwise average of bike prefs
library(dplyr)
group.num < 2 # Set group number
group.names < rownames(pca.final.dat[pca.final.dat$group == group.num, ])
groupTrends < customerTrends %>%
select(model:price, match(group.names, names(.))) # Use match() to select column names
group.avg < apply(groupTrends[6:ncol(groupTrends)], 1, mean) # Take average of values
groupTrends < bind_cols(groupTrends, as_data_frame(group.avg)) %>%
arrange(group.avg)
knitr::kable(head(groupTrends, 10)) # Top ten products by group avg. pct. purchased
model  category1  category2  frame  price  Denver Bike Shop  Kansas City 29ers  value 

Catalyst 2  Mountain  Sport  Aluminum  [ 415, 3500)  0.0256410  0.0187266  0.0221838 
Trail 5  Mountain  Sport  Aluminum  [ 415, 3500)  0.0204259  0.0216076  0.0210168 
FSi Carbon 4  Mountain  Cross Country Race  Carbon  [ 415, 3500)  0.0165146  0.0247767  0.0206456 
Scalpel 29 4  Mountain  Cross Country Race  Aluminum  [ 415, 3500)  0.0186875  0.0210314  0.0198595 
Catalyst 4  Mountain  Sport  Aluminum  [ 415, 3500)  0.0199913  0.0184385  0.0192149 
FSi 1  Mountain  Cross Country Race  Aluminum  [ 415, 3500)  0.0204259  0.0175742  0.0190000 
Trail 4  Mountain  Sport  Aluminum  [ 415, 3500)  0.0152108  0.0218957  0.0185532 
Trail 1  Mountain  Sport  Aluminum  [ 415, 3500)  0.0204259  0.0164218  0.0184238 
Trail 2  Mountain  Sport  Aluminum  [ 415, 3500)  0.0208605  0.0158456  0.0183530 
Beast of the East 1  Mountain  Trail  Aluminum  [ 415, 3500)  0.0182529  0.0181504  0.0182017 
Group 4 Inspection
Let’s compare to Group 4. Rerun the previous script changing group.num
from 2 to 4. We can see that Group 4’s preference is similar to Group 2 in that both groups prefer lowend/affordable bikes. However, Group 4’s top purchases contain a mixture of Mountain and Road, while Group 2’s top purchases are exclusively Mountain. It appears there is a difference!
model  category1  category2  frame  price  Albuquerque Cycles  Dallas Cycles  Detroit Cycles  Los Angeles Cycles  Minneapolis Bike Shop  New York Cycles  Philadelphia Bike Shop  Phoenix Bipeds  Portland Bipeds  Providence Bipeds  San Antonio Bike Shop  value 

FSi 2  Mountain  Cross Country Race  Aluminum  [ 415, 3500)  0.0174825  0.0256410  0.0119048  0.0471698  0.0258752  0.0246914  0.0040816  0.0183228  0.0186335  0.0129151  0.0215054  0.0207476 
Slice Ultegra  Road  Triathalon  Carbon  [ 415, 3500)  0.0104895  0.0085470  0.0099206  0.0251572  0.0076104  0.0049383  0.0571429  0.0133897  0.0248447  0.0092251  0.0537634  0.0204572 
CAAD12 Disc 105  Road  Elite Road  Aluminum  [ 415, 3500)  0.0139860  0.0085470  0.0297619  0.0157233  0.0136986  0.0222222  0.0204082  0.0176180  0.0310559  0.0092251  0.0215054  0.0185229 
Catalyst 3  Mountain  Sport  Aluminum  [ 415, 3500)  0.0314685  0.0427350  0.0178571  0.0188679  0.0152207  0.0024691  0.0040816  0.0140944  0.0186335  0.0202952  0.0161290  0.0183502 
FSi Carbon 4  Mountain  Cross Country Race  Carbon  [ 415, 3500)  0.0104895  0.0128205  0.0238095  0.0188679  0.0152207  0.0271605  0.0448980  0.0183228  0.0139752  0.0129151  0.0000000  0.0180436 
Synapse Carbon Disc 105  Road  Endurance Road  Carbon  [ 415, 3500)  0.0139860  0.0170940  0.0099206  0.0125786  0.0304414  0.0271605  0.0285714  0.0091614  0.0170807  0.0166052  0.0107527  0.0175775 
CAAD8 Sora  Road  Elite Road  Aluminum  [ 415, 3500)  0.0069930  0.0213675  0.0099206  0.0188679  0.0273973  0.0098765  0.0040816  0.0204369  0.0217391  0.0202952  0.0215054  0.0165892 
Synapse Disc 105  Road  Endurance Road  Aluminum  [ 415, 3500)  0.0279720  0.0213675  0.0019841  0.0094340  0.0121766  0.0271605  0.0204082  0.0155039  0.0093168  0.0129151  0.0215054  0.0163404 
CAAD12 105  Road  Elite Road  Aluminum  [ 415, 3500)  0.0069930  0.0299145  0.0277778  0.0188679  0.0106545  0.0098765  0.0326531  0.0084567  0.0093168  0.0018450  0.0215054  0.0161692 
Trigger Carbon 4  Mountain  Over Mountain  Carbon  [ 415, 3500)  0.0069930  0.0085470  0.0099206  0.0125786  0.0060883  0.0197531  0.0285714  0.0119803  0.0124224  0.0239852  0.0322581  0.0157362 
Conclusion
PCA can be a valuable crosscheck to kmeans for customer segmentation. While kmeans got us close to the true customer segments, visually evaluating the groups using PCA helped identify a different customer segment, one that the kmeans solution did not pick up.
Recap
This post expanded on our customer segmentation methodology by using PCA to visually examine the clusters. We manipulated our sales order data to obtain a format that relates products to customer purchases. We used the prcomp()
function to perform PCA on our formatted data frame. We fortified the PCA output using the fortify()
function from the ggfortify
package, which enabled us to plot the PC’s by customer. We added the kmeans cluster groups to the fortified data frame for visual inspection of the kmeans clusters. We saw that differences can arise because kmeans programatically determines the clusters while PCA allows us to visually inspect the clusters. The end result was an improvement to the customer segmentation by attacking the community detection problem from two different angles and combining the results!
Further Reading

Pricipal Component Analysis: Explained Visually: This article is an excellent place to start for those that are new to PCA or those that would like to understand the details.

Computing and Visualizing PCA in R: This is a great article that takes PCA to the next level in
R
with biplots, predictions, and the caret package.