How to Apply Machine Learning for Customer Segmentation

    Customer segmentation is both a priority and a challenge for marketing teams that want to personalize messaging, improve customer satisfaction, and optimize product offerings.

    This guide takes a detailed approach to building a customer segmentation model using machine learning and Python. Read on for practical recommendations from our data scientists at each step, along with the common pitfalls to avoid.

    Identify Key Data Points for Your Customer Segmentation

    For the purpose of this article, we will use the example of a customer segmentation project in an e-commerce organization.

    In this industry, customer data is often vast and varied, including transaction history, browsing behavior, and customer demographics. Typically, you would use the following features for your segmentation; a hypothetical example table is sketched after this list:

    • Demographic data: customer age, gender, and location.
    • Behavioral data: number of orders, order frequency, average purchase value, product categories browsed, session duration, and order source channel (mobile app, website, affiliates, etc.).
    • Engagement data: frequency of website visits, email open rates, and click-through rates.
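
    To make this concrete, here is a small, purely hypothetical example of what such a feature table might look like once these data points are assembled (column names and values are illustrative only):

    Python
    import pandas as pd
    
    # Hypothetical customer feature table combining demographic, behavioral, and engagement data
    data = pd.DataFrame({
        'customer_id': [1001, 1002, 1003],
        'age': [34, 52, 27],                            # demographic
        'number_of_orders': [12, 3, 25],                # behavioral
        'average_purchase_value': [48.5, 120.0, 22.3],  # behavioral
        'email_open_rate': [0.45, 0.10, 0.62],          # engagement
    })
    print(data.head())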

    Not sure which features you should use for your machine learning project? Prioritize features that align with business goals, e.g., increasing retention or upselling.

    For a more granular segmentation, we also recommend enriching your data with external data such as social media activity or industry benchmarks.

    Prepare Your Data for Machine-Learning-Based Segmentation

    Extracting Data From All Systems

    Ideally, your CRM, web analytics, and transaction data is centralized in a data warehouse or data lake for easier access. With ClicData’s native connectors, you can collect data from any business application, database, or API. But if you don’t have access to native connectors, Python is a solid option for extracting data, as sketched below.
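
    A minimal sketch, assuming a hypothetical CRM CSV export and a transactions table in a SQL warehouse (file name, connection string, and table name are placeholders):

    Python
    import pandas as pd
    from sqlalchemy import create_engine
    
    # Hypothetical sources: a CRM export and a transactions table in a warehouse
    crm = pd.read_csv('crm_export.csv')
    
    engine = create_engine('postgresql://user:password@host:5432/warehouse')  # placeholder connection string
    transactions = pd.read_sql('SELECT customer_id, order_value, order_date FROM transactions', engine)
    
    # Merge both sources on a shared customer identifier
    data = crm.merge(transactions, on='customer_id', how='left')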

    Cleaning Redundant or Incomplete Data and Detecting Outliers

    When working on machine-learning-based customer segmentation, you typically face three common challenges: data redundancy, missing data, and outliers. Let’s see how to handle each of these situations:

    Data Redundancy

    Data redundancy often arises when multiple features capture similar information, leading to high correlations between them.

    For example, features like total purchase amount and average purchase amount may both describe spending behavior, or session duration and pages viewed might both indicate engagement.

    When redundant features are used in clustering, they can distort the clustering process, causing the algorithm to place undue weight on them. You end up with inaccurate clusters that don’t reflect the reality of customer differences.

    Why is it so common? Redundancy is particularly frequent in customer segmentation because data from multiple sources (e.g., CRM systems, transaction logs, and web analytics) is often merged, and each source may contain overlapping features.

    Now, how can you avoid data redundancy?

    Before clustering, compute a correlation matrix to check for highly correlated features. Using Python:

    Python
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    corr_matrix = data.corr(numeric_only=True)  # pairwise correlations between numerical features
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
    plt.title("Correlation Matrix")
    plt.show()

    If two or more features are highly correlated (e.g., correlation coefficient > 0.8), consider removing one or combining them into a single feature. For instance, instead of using total purchase amount and average purchase amount, calculate the purchase frequency or normalize the total purchase amount by dividing it by the number of purchases.
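
    As a quick sketch of both options, assuming hypothetical columns total_purchase_amount, average_purchase_amount, and number_of_purchases:

    Python
    # Option 1: drop one feature of a highly correlated pair (0.8 is a common rule of thumb)
    if abs(data['total_purchase_amount'].corr(data['average_purchase_amount'])) > 0.8:
        data = data.drop(columns=['average_purchase_amount'])
    
    # Option 2: derive a normalized feature instead of keeping both raw amounts
    data['avg_order_value'] = data['total_purchase_amount'] / data['number_of_purchases']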

    You can also use techniques like Principal Component Analysis (PCA) to transform correlated features into a reduced set of uncorrelated principal components. This ensures that only unique information is retained in the data.
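
    A minimal PCA sketch with scikit-learn, keeping enough components to explain 95% of the variance (the threshold is a common default, not a hard rule). PCA assumes the features have already been scaled, a step covered later in this guide:

    Python
    from sklearn.decomposition import PCA
    
    # Keep the smallest number of components that explains 95% of the variance
    pca = PCA(n_components=0.95)
    reduced_features = pca.fit_transform(scaled_data)
    
    print(f"Reduced from {scaled_data.shape[1]} to {reduced_features.shape[1]} components")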

    Key takeaways
    1. Redundant features can distort clusters, so always check for correlation.
    2. Apply feature selection and reduction techniques to retain only distinct information.
    3. Document the features you’ve removed or combined to keep the analysis interpretable.

    Incomplete Data

    Having missing values in your dataset can be problematic because clustering algorithms require complete data to accurately calculate distances between data points. Missing data can stem from a variety of issues, such as unrecorded customer details, system errors, or customer interactions not being tracked uniformly across all channels.

    Why is it so common? Customer data often comes from diverse sources, and each source may have different standards for capturing data. For example, demographic information may be missing for some customers, or purchase history may be incomplete if a customer primarily engages with the business offline but is partially recorded online.

    Now, how can you fix incomplete data?

    Impute Missing Data

    For numerical data, use mean, median, or mode imputation to fill missing values. These are quick solutions but may introduce bias if data is not missing at random.

    For categorical data, use mode imputation (replacing missing values with the most frequent value), or treat missing values as a category of their own when encoding.

    Advanced imputation techniques, such as K-nearest neighbors (KNN) imputation or multiple imputation, can produce more accurate fills by considering the relationships among features.

    Python
    from sklearn.impute import SimpleImputer
    
    imputer = SimpleImputer(strategy='mean')  # use 'median' for skewed numerical data, 'most_frequent' for categorical
    data['feature'] = imputer.fit_transform(data[['feature']])
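
    For the KNN approach mentioned above, here is a minimal sketch using scikit-learn’s KNNImputer, which fills each missing value from the most similar customers in feature space (applied to numerical columns only):

    Python
    from sklearn.impute import KNNImputer
    
    # Fill each missing value using the average of the 5 most similar customers
    numerical_columns = data.select_dtypes(include='number').columns
    knn_imputer = KNNImputer(n_neighbors=5)
    data[numerical_columns] = knn_imputer.fit_transform(data[numerical_columns])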

    If a feature has more than a certain threshold (e.g., 30%) of missing values, consider omitting it altogether, as the imputation might introduce too much noise or bias.
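
    A quick sketch of applying such a threshold (30% here, purely as an example):

    Python
    # Share of missing values per column
    missing_share = data.isna().mean()
    
    # Drop columns where more than 30% of the values are missing
    data = data.drop(columns=missing_share[missing_share > 0.3].index)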

    In some cases, missing data can be predicted based on other features. For instance, if age is missing, it might be estimated based on purchasing behavior or engagement data.
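
    One way to do this with scikit-learn is IterativeImputer, which models each feature with missing values as a function of the other features; note that it is still flagged as experimental, hence the extra import. A minimal sketch:

    Python
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the estimator
    from sklearn.impute import IterativeImputer
    
    # Predict missing numerical values (e.g., age) from the other numerical features
    numerical_columns = data.select_dtypes(include='number').columns
    iterative_imputer = IterativeImputer(random_state=42)
    data[numerical_columns] = iterative_imputer.fit_transform(data[numerical_columns])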

    Key takeaways
    1. Choose an imputation method based on the amount and pattern of missing data. Simple imputation works well for a few missing values, while predictive modeling or KNN imputation is better for more complex cases.
    2. Document the imputation methods you used to ensure transparency in the analysis.
    3. Omitting a feature entirely is sometimes the most reliable approach, especially if it is not essential for clustering.

    Detecting Outliers

    Outliers are data points that are significantly different from the majority of the data, which can distort statistical analysis and machine learning models. In customer segmentation, outliers may represent unusual customer behavior (e.g., extremely high purchase amounts or session durations) that does not align with typical patterns. If not addressed, outliers can skew clustering algorithms like K-means, which rely on distance metrics and are sensitive to extreme values.

    Why is it so common?

    Outliers often arise due to data entry errors, rare but legitimate customer behaviors, or external factors such as a promotional event that leads to a few large transactions.

    In customer segmentation, outliers are not always errors, but they can be extreme values that don’t align with the patterns of the general customer population.

    You can detect outliers with a statistical technique such as the Interquartile Range (IQR) method or the z-score method.

    Using the Interquartile Range (IQR) Method to Detect Outliers

    The IQR is the range between the first (25th percentile) and third (75th percentile) quartiles of the data.

    Outliers are usually defined as points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

    Python
    # Select the numerical features and calculate the IQR for each
    numerical_columns = data.select_dtypes(include='number').columns
    Q1 = data[numerical_columns].quantile(0.25)
    Q3 = data[numerical_columns].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify outliers
    outliers_iqr = (data[numerical_columns] < lower_bound) | (data[numerical_columns] > upper_bound)
    
    # Remove outliers
    cleaned_data_iqr = data[~outliers_iqr.any(axis=1)]
    
    print(f"Original data shape: {data.shape}")
    print(f"Cleaned data shape: {cleaned_data_iqr.shape}")

    When to Use:

    • Works well when the data is not normally distributed; it is more robust than the z-score method, especially with skewed data.
    • Suitable for numerical features where values are expected to be in a specific range.

    Using the Z-score Method to Detect Outliers

    The z-score method identifies outliers by measuring how far each data point deviates from the mean in terms of standard deviations.

    Typically, data points with a z-score greater than 3 (or less than -3) are considered outliers.

    Formula: z = (X − μ) / σ

    X is the data point, μ is the mean, σ is the standard deviation.

    Python
    import pandas as pd
    import numpy as np
    
    # Sample dataset
    data = pd.DataFrame({
        'purchase_amount': [50, 55, 52, 70, 75, 1000, 53, 60, 58, 45, 65, 3000]
    })
    
    # Calculate z-scores
    data['z_score'] = (data['purchase_amount'] - data['purchase_amount'].mean()) / data['purchase_amount'].std()
    
    # Define threshold (commonly 3 or -3 for typical outlier detection)
    threshold = 3
    outliers = data[np.abs(data['z_score']) > threshold]
    
    # Filter out the outliers
    cleaned_data = data[np.abs(data['z_score']) <= threshold].drop(columns=['z_score'])
    
    print("Outliers detected:\n", outliers)
    print("Data without outliers:\n", cleaned_data)

    Key takeaways
    1. Use the z-score method for normally distributed data. For skewed data, prefer the IQR method as it’s less sensitive to skewness.
    2. Adjust thresholds if needed. For instance, in z-score, lowering the threshold (e.g., to 2.5) will detect more outliers but might also remove legitimate data points. In IQR, changing the 1.5 multiplier can expand or narrow the range of accepted values.
    3. Test the impact of outlier removal on clustering results by comparing model performance with and without outlier removal. Removing outliers generally improves clustering accuracy and consistency.

    Scaling and Transforming Your Data

    Scaling and transforming data are crucial steps in preparing a dataset for clustering algorithms, particularly for distance-based algorithms like K-means and DBSCAN.

    There are multiple possible techniques like z-score scaling, min-max scaling, robust scaling, or log transformation. They all help standardize the data so that all features contribute equally to the clustering results, preventing your customer segmentation from being skewed by outliers or variations in units.

    Why scaling and transformation are important:

    • Distance Consistency: Many clustering algorithms calculate distances between data points. If features are on different scales, those with larger ranges will disproportionately influence the results.
    • Uniform Contribution: Features with high variance (e.g., purchase amount) might overshadow features with low variance (e.g., age) if not scaled appropriately.
    • Model Performance: Proper scaling and transformation generally improve clustering accuracy, leading to more meaningful clusters.

    For our e-commerce customer segmentation project, we’re going to focus on two techniques: z-score scaling and log transformation.

    Standardization (Z-score Scaling)

    Standardization transforms data to have a mean of zero and a standard deviation of one. It’s useful when features have different units or scales but follow approximately normal distributions.

    Why z-score scaling is effective

    • Handles Diverse Feature Scales: E-commerce datasets often include features with vastly different scales, such as purchase amount, session duration, and number of visits. Standardization brings these features to a common scale with a mean of zero and standard deviation of one, which is critical for distance-based clustering algorithms like K-means.
    • Suitable for Normally Distributed Data: If features in the dataset are approximately normally distributed, as often found with customer engagement metrics (e.g., session duration, number of visits), standardization maintains the distribution while making features comparable in scale.
    • Balances Feature Influence: By scaling all features to have similar ranges, no single feature dominates the clustering process. This helps the algorithm consider all features equally, leading to more balanced clusters.

    Formula: Z = (X − μ) / σ

    X is the data point, μ is the mean of the feature, σ is the standard deviation of the feature.

    Python
    from sklearn.preprocessing import StandardScaler
    import pandas as pd
    
    # Sample dataset with different scales
    data = pd.DataFrame({
        'purchase_amount': [50, 100, 150, 200, 1000],
        'age': [25, 35, 45, 20, 60]
    })
    
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data)
    
    print("Standardized Data:\n", standardized_data)
    
    ou ce code-? 
    
    from sklearn.preprocessing import StandardScaler
    
    # Assume `data` is your e-commerce dataset
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data)

    Log Transformation

    Log transformation reduces skewness in features with large ranges by compressing the higher end of the distribution. It’s particularly helpful for features like purchase amount or income, which can have high positive skewness.

    Why log transformation is effective:

    • Reduces Skewness in High-Value Features: In e-commerce, features like purchase amount, order frequency, or lifetime value tend to be highly skewed due to a small number of high-value customers. Log transformation compresses the range of these values, reducing skewness and helping the model better interpret the data.
    • Improves Cluster Formation: By compressing large values, log transformation makes it easier for clustering algorithms to detect meaningful patterns without high-value outliers disproportionately affecting results.
    • Better Representation of Behavioral Patterns: E-commerce customer behavior can vary widely, and log transformation normalizes exponential behavior patterns, making segmentation more reliable and insightful.

    Formula: X′ = log(X+1)

    The +1 shifts the values so the transformation handles zeros safely (log(0) is undefined).

    Python
    import numpy as np
    
    data['log_purchase_amount'] = np.log(data['purchase_amount'] + 1)
    print("Log Transformed Data:\n", data[['purchase_amount', 'log_purchase_amount']])
    
    ou bien ce code ?
    
    import numpy as np
    
    # Apply log transformation to skewed features
    data['log_purchase_amount'] = np.log(data['purchase_amount'] + 1)

    Choose Between Machine Learning Algorithms For Your Customer Segmentation

    K-means Clustering

    K-means is one of the most widely used clustering algorithms, especially popular for customer segmentation due to its simplicity and efficiency.

    K-means performs well with large datasets, making it suitable for e-commerce data. This algorithm generates results that are easier to interpret because each customer is assigned to one of the k clusters based on proximity to a central point, called the centroid. It’s also more flexible as K-means can work with a wide range of features, such as demographics, purchase behavior, and engagement metrics.

    Example Use Case: Segmenting e-commerce customers into groups based on spending behavior and frequency of visits to offer personalized promotions.

    Python
    from sklearn.cluster import KMeans
    
    # Assuming `scaled_data` is your preprocessed and scaled dataset
    kmeans = KMeans(n_clusters=5, random_state=42)
    kmeans.fit(scaled_data)
    data['Cluster'] = kmeans.labels_

    How do you choose k? Determining the optimal number of clusters is essential for an effective segmentation. The elbow method or the silhouette score can help you decide on the right value for k.

    Python code for the Elbow method:

    Python
    import matplotlib.pyplot as plt
    
    sse = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.plot(range(1, 11), sse)
    plt.xlabel('Number of Clusters')
    plt.ylabel('SSE')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    DBSCAN Clustering

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on density rather than distance from a centroid.

    DBSCAN can identify clusters of arbitrary shape and is less affected by outliers, as it can treat noise as separate points. It’s best for datasets with noise or irregular shapes, like customer clickstream data where behavioral clusters are not clearly defined.

    Example Use Case: Grouping customers based on browsing behavior where some outliers (like unusually high session counts) may exist.

    Python
    from sklearn.cluster import DBSCAN
    
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    data['Cluster'] = dbscan.fit_predict(scaled_data)

    Test and Interpret Hyperparameters

    Hyperparameter tuning is essential to optimize clustering algorithms.

    Let’s explore tuning hyperparameters for several clustering algorithms, with practical examples of how they impact the segmentation process.

    K-means: Tuning n_clusters

    The n_clusters parameter in K-means specifies the number of clusters into which the data should be grouped. Selecting the right number of clusters is essential, as too few clusters may combine distinct customer segments, while too many clusters may result in over-segmentation with little practical value.

    Example: Suppose we are segmenting an e-commerce dataset with various customer behavior metrics (e.g., purchase frequency, average order value, and browsing time). We’ll test values for n_clusters from 2 to 15 to find the optimal clustering configuration.

    Techniques for Choosing n_clusters:

    Ask your marketing team… In some cases, your team already has a good sense of the number of segments they see in purchasing trends.

    The Elbow Method: Evaluates the sum of squared errors (SSE) for different numbers of clusters. The “elbow” point (where the rate of SSE decrease slows) indicates the optimal number of clusters.

    Python
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    
    sse = []
    for k in range(2, 16):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.plot(range(2, 16), sse)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Sum of Squared Errors (SSE)')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    Or the Silhouette Score: measures the cohesion and separation of clusters. The score ranges from -1 to 1, and higher values indicate well-defined clusters.

    Python
    from sklearn.metrics import silhouette_score
    
    silhouette_scores = []
    for k in range(2, 16):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        score = silhouette_score(scaled_data, kmeans.labels_)
        silhouette_scores.append(score)
    
    plt.plot(range(2, 16), silhouette_scores)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score for Optimal k')
    plt.show()

    Using these plots, we can choose the number of clusters that balances cohesion (low SSE) and separation (high silhouette score). For example, if k = 4 provides both an “elbow” point in the SSE plot and a high silhouette score, we’d select 4 clusters.
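
    If you prefer to automate the choice rather than read it off the plots, one simple heuristic (a sketch reusing the silhouette_scores list computed above) is to pick the k with the highest silhouette score:

    Python
    import numpy as np
    
    # Pick the candidate k with the best silhouette score
    candidate_ks = list(range(2, 16))
    best_k = candidate_ks[int(np.argmax(silhouette_scores))]
    print(f"Best k by silhouette score: {best_k}")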

    DBSCAN: Tuning eps and min_samples

    DBSCAN, which identifies clusters based on data density, relies on two main parameters:

    • eps: The maximum distance between two points to be considered neighbors.
    • min_samples: The minimum number of points required to form a dense region (i.e., a cluster).

    Example: In a dataset tracking customer interactions (e.g., clicks, session duration, number of items viewed), we might expect some clusters of highly active users, some moderately active users, and some outliers who rarely engage. DBSCAN is ideal for this, as it can isolate noise (low-activity users) and naturally discover clusters of varying shapes.

    Hyperparameter Tuning Steps:

    1. Finding eps with a Nearest Neighbors Plot: Plot the distances to each point’s k-th nearest neighbor, and look for a point where the distance “jumps,” indicating a natural boundary.
    2. Tuning min_samples: Set min_samples based on the expected minimum cluster size or experiment with values like 3, 5, or 10 to find the most cohesive clusters.
    Python
    from sklearn.neighbors import NearestNeighbors
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    
    # Calculate k-nearest neighbors
    neigh = NearestNeighbors(n_neighbors=5)
    nbrs = neigh.fit(scaled_data)
    distances, indices = nbrs.kneighbors(scaled_data)
    
    # Sort and plot distances
    distances = np.sort(distances[:, 4], axis=0)
    plt.plot(distances)
    plt.xlabel('Points sorted by distance')
    plt.ylabel('5th Nearest Neighbor Distance')
    plt.title('Choosing eps with KNN Distance Plot')
    plt.show()
    
    # DBSCAN with chosen parameters
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    dbscan.fit(scaled_data)
    data['Cluster'] = dbscan.labels_

    In this code:

    • The k-nearest neighbors plot helps identify a suitable eps value where the distance to neighbors suddenly increases.
    • After determining eps, we can run DBSCAN with min_samples=5. Adjust min_samples based on the density of clusters observed; a quick way to compare candidate values is sketched below.
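
    A quick way to compare candidate values (a sketch reusing scaled_data from the earlier steps) is to run DBSCAN for each and report how many clusters and noise points it produces:

    Python
    from sklearn.cluster import DBSCAN
    import numpy as np
    
    # Compare cluster counts and noise levels across candidate min_samples values
    for m in [3, 5, 10]:
        labels = DBSCAN(eps=0.5, min_samples=m).fit_predict(scaled_data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
        n_noise = int(np.sum(labels == -1))
        print(f"min_samples={m}: {n_clusters} clusters, {n_noise} noise points")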

    Key takeaways
    1. K-means (n_clusters): Use the elbow method or silhouette score to identify the optimal number of clusters for balanced customer groups.
    2. DBSCAN (eps and min_samples): Use a k-nearest neighbors plot to determine eps and experiment with min_samples to define cohesive customer segments without noise.

    Analyze Output Results

    After clustering, it’s essential to evaluate the characteristics of each cluster to interpret the results and derive actionable insights.

    Cluster Analysis

    Compare feature distributions within each cluster to identify unique characteristics. For instance, you might find that Cluster 1 contains high-spending customers, while Cluster 3 has infrequent but loyal visitors.

    Python Example for Cluster Analysis:

    Python
    # Summarize the feature distributions of each cluster
    for i in sorted(data['Cluster'].unique()):
        cluster = data[data['Cluster'] == i]
        print(f'Cluster {i} Summary:')
        print(cluster.describe())

    Then, translate cluster characteristics into personas; a simple labeling sketch follows the examples below.

    For example:

    1. Bargain Shoppers: Customers who make frequent small purchases.
    2. Loyal High-Spenders: Customers with high purchase amounts and frequent visits.
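
    Once the personas are defined, a simple mapping from cluster IDs to persona labels keeps downstream reports readable. The IDs below are placeholders and depend on your own clustering output:

    Python
    # Hypothetical mapping from cluster IDs to persona labels
    persona_map = {0: 'Bargain Shoppers', 1: 'Loyal High-Spenders'}
    data['Persona'] = data['Cluster'].map(persona_map).fillna('Unlabeled')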

    Visualize clusters to better understand relationships and structure.

    Python Code for Visualization:

    Python
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    
    # Reduce the scaled features to two dimensions for plotting
    reduced_data = PCA(n_components=2).fit_transform(scaled_data)
    
    sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=data['Cluster'], palette='viridis')
    plt.title('E-commerce Customer Segmentation')
    plt.show()

    Or you can simply use ClicData’s built-in dashboard designer to visualize your Python script’s output.

    Practical Takeaways
    1. Use the results to design targeted marketing campaigns for different customer types.
    2. Re-evaluate clusters regularly, as customer behavior may change over time, affecting segmentation.

    Build Your Customer Segmentation Using Python in ClicData

    Looking for an efficient way to build your machine learning model? Do it right from ClicData’s data management and analytics platform.

    With our Python scripting module (Data Scripts), you can seamlessly develop segmentation models using algorithms like K-means and DBSCAN. ClicData’s support for dimensionality reduction techniques, data transformation, and hyperparameter tuning ensures that your models are both accurate and optimized.

    Our built-in visualization tools enable you to instantly create dashboards and reports that can be shared with your teams across your organization. As customer behaviors and data evolve, the ability to iterate, refine, and automate segmentation models within ClicData means that your insights will stay relevant and scalable.

    Interested in seeing Data Scripts? Book a call with our product experts.