Example of using k-means to segment realistic customer data
A common use of cluster analysis is to segment customers.
Here's a worked example of using K-means clustering to segment a realistic customer dataset for targeted marketing. We'll walk through:
- Simulating realistic customer data
- Preprocessing
- Running K-means
- Analyzing and labeling segments
- Marketing strategy per segment
Simulate Customer Dataset
We're creating a realistic, synthetic dataset to mimic typical customer features used in marketing:
- Age
- Annual Income (k$)
- Spending Score (1–100)
- Online Engagement (1–10)
- Region (categorical)
These features are common inputs in marketing analytics. Clustering based on them helps uncover patterns like "high-value young customers" or "low-engagement older shoppers."
import pandas as pd
import numpy as np
np.random.seed(42)
n_customers = 300
data = pd.DataFrame({
    'Age': np.random.normal(35, 10, n_customers).astype(int),
    'Annual Income (k$)': np.random.normal(60, 20, n_customers).astype(int),
    'Spending Score': np.random.randint(1, 101, n_customers),
    'Online Engagement': np.random.randint(1, 11, n_customers),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_customers)
})
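A quick sanity check of the simulated data before we preprocess it (plain pandas, using only the data frame created above):

# Peek at the first rows and the summary statistics
print(data.head())
print(data.describe())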
Preprocessing the Data
- Encoding categorical variables: We one-hot encode Region so that it's usable in numerical clustering algorithms.
- Feature scaling: We standardize the data using StandardScaler.
We do this because K-means uses Euclidean distance, which is sensitive to feature scale. Unscaled data can bias clustering toward variables with larger numeric ranges.
Encoding turns categories like "North" or "South" into numbers that K-means can process.
from sklearn.preprocessing import StandardScaler
data_encoded = pd.get_dummies(data, columns=['Region'], drop_first=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data_encoded)
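To confirm what actually goes into the clustering, we can inspect the encoded feature matrix (this just introspects the objects created above):

# drop_first=True drops one region as the baseline ('East', the first alphabetically),
# leaving 4 numeric features + 3 region indicator columns
print(data_encoded.columns.tolist())
print(X_scaled.shape)  # expected: (300, 7)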
Choosing Number of Clusters and Fitting K-Means
- We run K-means for various values of k (1 to 9).
- We use the elbow method to determine the best number of clusters.
- We fit the final K-means model and assign each customer to a cluster.
Choosing the right k is critical. Too few clusters will miss nuance; too many create noise.
The elbow point shows where adding more clusters yields diminishing returns in explaining the data variance.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
inertia = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
plt.plot(K_range, inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Fit final model (we assume 4 clusters will be picked after elbow method)
kmeans = KMeans(n_clusters=4, random_state=42)
data['Cluster'] = kmeans.fit_predict(X_scaled)
The elbow plot shows why we chose 4 clusters: around k = 4 the drop in inertia levels off, so additional clusters add little explanatory power.
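It's also worth a quick check that no cluster is degenerate (a tiny or empty cluster suggests a poor choice of k); this uses the Cluster column assigned above:

# Number of customers assigned to each cluster
print(data['Cluster'].value_counts().sort_index())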
Analyze and Label Customer Segments
- We group the data by cluster and compute average feature values.
- Based on these profiles, we label each cluster with a descriptive name.
Clusters are just numbers (e.g., Cluster 0, 1). To use them in marketing, we must interpret what they mean. Labeling turns raw clusters into personas, like "Young High Spenders" or "Disengaged Seniors."
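Here's a minimal sketch of that profiling and labelling step. The persona names in segment_labels are illustrative assumptions only; in practice you would choose names after inspecting the printed profiles:

# Average feature values per cluster: the raw material for personas
cluster_profiles = data.groupby('Cluster')[
    ['Age', 'Annual Income (k$)', 'Spending Score', 'Online Engagement']
].mean().round(1)
print(cluster_profiles)

# Hypothetical persona names; replace with labels that match your own profiles
segment_labels = {
    0: 'Young High Spenders',
    1: 'Affluent Low Engagers',
    2: 'Budget Browsers',
    3: 'Disengaged Seniors',
}
data['Segment'] = data['Cluster'].map(segment_labels)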
3D visualisation of customer segments
Here's how to create a 3D visualization of your customer clusters using Principal Component Analysis (PCA) to reduce the data to three dimensions and then plot with matplotlib.
This visualization helps you see the separation between clusters and interpret how distinct or overlapping they are, which can be useful when presenting your segmentation analysis.
We reduce the scaled feature space to 3 principal components using PCA and then plot each customer colored by cluster. (How much variance those three components actually capture is worth checking; see the explained-variance snippet after the plot.)
- K-means clustering works in high-dimensional space, but that’s hard to visualize.
- 3D PCA plots help you intuitively understand the shape, size, and separation of clusters.
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# Reduce to 3 principal components
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)
# Add PCA components to the DataFrame for plotting
data['PCA1'] = components[:, 0]
data['PCA2'] = components[:, 1]
data['PCA3'] = components[:, 2]
# 3D Scatter Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
# Plot each cluster in a different color
scatter = ax.scatter(
    data['PCA1'],
    data['PCA2'],
    data['PCA3'],
    c=data['Cluster'],
    cmap='tab10',
    s=50,
    alpha=0.8
)
ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_zlabel('PCA Component 3')
ax.set_title('3D Visualization of Customer Segments (K-Means + PCA)')
plt.legend(*scatter.legend_elements(), title="Cluster")
plt.show()
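Before reading too much into the 3D picture, check how much of the total variance the three components retain (this uses the pca object fitted above):

# Fraction of variance captured by each principal component
print(pca.explained_variance_ratio_)
print(f"Total variance captured: {pca.explained_variance_ratio_.sum():.1%}")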
Interpretation
If clusters are well-separated, K-means has likely found meaningful groupings.
If there's significant overlap, consider:
- Adding more features
- Trying different values of k
- Using other clustering methods (e.g. DBSCAN, Gaussian Mixture Models)
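One way to put a number on "well-separated" is scikit-learn's silhouette score; here's a minimal sketch using the scaled features and cluster labels from above:

from sklearn.metrics import silhouette_score

# Ranges from -1 to 1: values near 1 mean tight, well-separated clusters,
# values near 0 mean substantial overlap
score = silhouette_score(X_scaled, data['Cluster'])
print(f"Silhouette score (k=4): {score:.3f}")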
Focusing on customer behaviour
K-means treats all features equally when calculating clusters.
But in marketing we are trying to influence behaviour around buying, so it makes sense to look for segments that represent differences in behaviour.
If we look (below) at how Spending Score, Annual Income and Online Engagement are spread across the clusters, we can see that each cluster has a similar spread for each of these features, so the clustering doesn't really distinguish customers based on these behavioural indicators.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate customer data
np.random.seed(42)
n_customers = 300
data = pd.DataFrame({
    'Age': np.random.normal(35, 10, n_customers).astype(int),
    'Annual Income (k$)': np.random.normal(60, 20, n_customers).astype(int),
    'Spending Score': np.random.randint(1, 101, n_customers),
    'Online Engagement': np.random.randint(1, 11, n_customers),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_customers)
})
# One-hot encode region
data_encoded = pd.get_dummies(data, columns=['Region'], drop_first=True)
# --- Standard Clustering ---
scaler = StandardScaler()
X_full_scaled = scaler.fit_transform(data_encoded)
kmeans_full = KMeans(n_clusters=4, random_state=42)
data['Cluster_Full'] = kmeans_full.fit_predict(X_full_scaled)
# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(16, 6))
sns.boxplot(x='Cluster_Full', y='Spending Score', data=data, ax=axes[0])
axes[0].set_title('Spending Score by Cluster (Full Feature Set)')
axes[0].set_xlabel('Cluster')
axes[0].set_ylabel('Spending Score')
sns.boxplot(x='Cluster_Full', y='Annual Income (k$)', data=data, ax=axes[1])
axes[1].set_title('Annual Income by Cluster (Full Feature Set)')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Annual Income')
sns.boxplot(x='Cluster_Full', y='Online Engagement', data=data, ax=axes[2])
axes[2].set_title('Online Engagement by Cluster (Full Feature Set)')
axes[2].set_xlabel('Cluster')
axes[2].set_ylabel('Online Engagement')
plt.tight_layout()
plt.show()
Focusing on behavioural features
By contrast, if we only cluster based on the behavioural features we are interested in, we get a different result, showing clear differentiation between clusters based on our behavioural features:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate customer data
np.random.seed(42)
n_customers = 300
data = pd.DataFrame({
    'Age': np.random.normal(35, 10, n_customers).astype(int),
    'Annual Income (k$)': np.random.normal(60, 20, n_customers).astype(int),
    'Spending Score': np.random.randint(1, 101, n_customers),
    'Online Engagement': np.random.randint(1, 11, n_customers),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_customers)
})
# One-hot encode region
data_encoded = pd.get_dummies(data, columns=['Region'], drop_first=True)
# --- Behavior-Driven Clustering ---
# Use only spending score + behavioral features
features_behavior = ['Spending Score', 'Online Engagement', 'Annual Income (k$)']
X_behavior = data[features_behavior]
# Scale the behavioural features before clustering
scaler = StandardScaler()
X_behavior_scaled = scaler.fit_transform(X_behavior)
kmeans_behavior = KMeans(n_clusters=4, random_state=42)
data['Cluster_Behavior'] = kmeans_behavior.fit_predict(X_behavior_scaled)
# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(16, 6))
sns.boxplot(x='Cluster_Behavior', y='Spending Score', data=data, ax=axes[0])
axes[0].set_title('Spending Score by Cluster (Behavioural features)')
axes[0].set_xlabel('Cluster')
axes[0].set_ylabel('Spending Score')
sns.boxplot(x='Cluster_Behavior', y='Annual Income (k$)', data=data, ax=axes[1])
axes[1].set_title('Annual Income by Cluster (Behavioural features)')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Annual Income')
sns.boxplot(x='Cluster_Behavior', y='Online Engagement', data=data, ax=axes[2])
axes[2].set_title('Online Engagement by Cluster (Behavioural features)')
axes[2].set_xlabel('Cluster')
axes[2].set_ylabel('Online Engagement')
plt.tight_layout()
plt.show()
We can now visualise these clusters in 3D. We don't need PCA this time because we reduced the model to only three features before the cluster analysis:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# Reuse the cluster labels from the behaviour-only model fitted above
data['Cluster'] = data['Cluster_Behavior']
# 3D Scatter Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
# Plot each cluster in a different color
scatter = ax.scatter(
    data['Spending Score'],
    data['Online Engagement'],
    data['Annual Income (k$)'],
    c=data['Cluster'],
    cmap='tab10',
    s=50,
    alpha=0.8
)
ax.set_xlabel('Spending Score')
ax.set_ylabel('Online Engagement')
ax.set_zlabel('Annual Income (k$)')
ax.set_title('3D Visualization of Customer Segments (K-Means on Behavioural Features)')
plt.legend(*scatter.legend_elements(), title="Cluster")
plt.show()
With a different charting library (Plotly) we can create an interactive chart that makes it easier to visualise the clusters. Unfortunately this web page doesn't preserve the interactivity, so please check out and run the linked notebook.
import plotly.express as px
import pandas as pd
import numpy as np
# Plotly treats numeric colour columns as continuous; cast the cluster labels
# to strings so each cluster gets its own discrete colour
data['Cluster'] = data['Cluster'].astype(str)

# Create interactive 3D scatter plot (the columns are already descriptively named,
# so no axis-label overrides are needed)
fig = px.scatter_3d(
    data, x='Spending Score', y='Online Engagement', z='Annual Income (k$)',
    color='Cluster',
    title='Interactive 3D Scatter Plot of Clusters'
)

# Show the plot
fig.show()