How to Apply Machine Learning for Customer Segmentation

    Customer segmentation is both a priority and a challenge for marketing teams that want to personalize messaging, improve customer satisfaction, and optimize product offerings.

    This guide takes a detailed approach to building a customer segmentation model using machine learning and Python. Read on for practical recommendations from our data scientists at each step, along with the common pitfalls to avoid.

    Identify Key Data Points for Your Customer Segmentation

    For the purpose of this article, we will use the example of a customer segmentation project in an e-commerce organization.

    In this industry, customer data is often vast and varied, including transaction history, browsing behavior, and customer demographics. Typically, you would use the following features for your segmentation; a hypothetical example table is sketched after this list:

    • Demographic data: customer age, gender, and location.
    • Behavioral data: number of orders, order frequency, average purchase value, product categories browsed, session duration, and order source channel (mobile app, website, affiliates, etc.).
    • Engagement data: frequency of website visits, email open rates, and click-through rates.
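
    To make this concrete, here is a small, purely hypothetical example of what such a feature table might look like once these data points are assembled (column names and values are illustrative only):

    Python
    import pandas as pd
    
    # Hypothetical customer feature table combining demographic, behavioral, and engagement data
    data = pd.DataFrame({
        'customer_id': [1001, 1002, 1003],
        'age': [34, 52, 27],                            # demographic
        'number_of_orders': [12, 3, 25],                # behavioral
        'average_purchase_value': [48.5, 120.0, 22.3],  # behavioral
        'email_open_rate': [0.45, 0.10, 0.62],          # engagement
    })
    print(data.head())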

    Not sure which features you should use for your machine learning project? Prioritize features that align with business goals, e.g., increasing retention or upselling.

    For a more granular segmentation, we also recommend enriching your data with external data such as social media activity or industry benchmarks.

    Prepare Your Data for Machine-Learning-Based Segmentation

    Extracting Data From All Systems

    Ideally, your CRM, web analytics, and transaction data is centralized in a data warehouse or data lake for easier access. With ClicData’s native connectors, you can collect data from any business application, database, or API. But if you don’t have access to native connectors, Python is a solid option for extracting data, as sketched below.
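
    A minimal sketch, assuming a hypothetical CRM CSV export and a transactions table in a SQL warehouse (file name, connection string, and table name are placeholders):

    Python
    import pandas as pd
    from sqlalchemy import create_engine
    
    # Hypothetical sources: a CRM export and a transactions table in a warehouse
    crm = pd.read_csv('crm_export.csv')
    
    engine = create_engine('postgresql://user:password@host:5432/warehouse')  # placeholder connection string
    transactions = pd.read_sql('SELECT customer_id, order_value, order_date FROM transactions', engine)
    
    # Merge both sources on a shared customer identifier
    data = crm.merge(transactions, on='customer_id', how='left')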

    Cleaning Redundant or Incomplete Data and Detecting Outliers

    When working on machine-learning-based customer segmentation, you typically face three common challenges: data redundancy, missing data, and outliers. Let’s see how to handle each of these situations:

    Data Redundancy

    Data redundancy often arises when multiple features capture similar information, leading to high correlations between them.

    For example, features like total purchase amount and average purchase amount may both describe spending behavior, or session duration and pages viewed might both indicate engagement.

    When redundant features are used in clustering, they can distort the clustering process, causing the algorithm to place undue weight on them. You end up with inaccurate clusters that don’t reflect the reality of customer differences.

    Why is it so common? Redundancy is particularly frequent in customer segmentation because data from multiple sources (e.g., CRM systems, transaction logs, and web analytics) is often merged, and each source may contain overlapping features.

    Now, how can you avoid data redundancy?

    Before clustering, compute a correlation matrix to check for highly correlated features. Using Python:

    Python
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    corr_matrix = data.corr(numeric_only=True)  # pairwise correlations between numerical features
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
    plt.title("Correlation Matrix")
    plt.show()

    If two or more features are highly correlated (e.g., correlation coefficient > 0.8), consider removing one or combining them into a single feature. For instance, instead of using total purchase amount and average purchase amount, calculate the purchase frequency or normalize the total purchase amount by dividing it by the number of purchases.
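
    As a quick sketch of both options, assuming hypothetical columns total_purchase_amount, average_purchase_amount, and number_of_purchases:

    Python
    # Option 1: drop one feature of a highly correlated pair (0.8 is a common rule of thumb)
    if abs(data['total_purchase_amount'].corr(data['average_purchase_amount'])) > 0.8:
        data = data.drop(columns=['average_purchase_amount'])
    
    # Option 2: derive a normalized feature instead of keeping both raw amounts
    data['avg_order_value'] = data['total_purchase_amount'] / data['number_of_purchases']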

    You can also use techniques like Principal Component Analysis (PCA) to transform correlated features into a reduced set of uncorrelated principal components. This ensures that only unique information is retained in the data.
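
    A minimal PCA sketch with scikit-learn, keeping enough components to explain 95% of the variance (the threshold is a common default, not a hard rule). PCA assumes the features have already been scaled, a step covered later in this guide:

    Python
    from sklearn.decomposition import PCA
    
    # Keep the smallest number of components that explains 95% of the variance
    pca = PCA(n_components=0.95)
    reduced_features = pca.fit_transform(scaled_data)
    
    print(f"Reduced from {scaled_data.shape[1]} to {reduced_features.shape[1]} components")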

    Key takeaways
    1. Redundant features can distort clusters, so always check for correlation.
    2. Apply feature selection and reduction techniques to retain only distinct information.
    3. Document the features you’ve removed or combined to keep the analysis interpretable.

    Incomplete Data

    Having missing values in your dataset can be problematic because clustering algorithms require complete data to accurately calculate distances between data points. Missing data can stem from a variety of issues, such as unrecorded customer details, system errors, or customer interactions not being tracked uniformly across all channels.

    Why is it so common? Customer data often comes from diverse sources, and each source may have different standards for capturing data. For example, demographic information may be missing for some customers, or purchase history may be incomplete if a customer primarily engages with the business offline but is partially recorded online.

    Now, how can you fix incomplete data?

    Impute Missing Data

    For numerical data, use mean, median, or mode imputation to fill missing values. These are quick solutions but may introduce bias if data is not missing at random.

    For categorical data, use mode imputation (replacing missing values with the most frequent value), or treat missing values as a category of their own when encoding.

    Advanced imputation techniques, such as K-nearest neighbors (KNN) imputation or multiple imputation, can produce more accurate fills by considering the relationships among features.

    Python
    from sklearn.impute import SimpleImputer
    
    imputer = SimpleImputer(strategy='mean')  # use 'median' for skewed numerical data, 'most_frequent' for categorical
    data['feature'] = imputer.fit_transform(data[['feature']])
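
    For the KNN approach mentioned above, here is a minimal sketch using scikit-learn’s KNNImputer, which fills each missing value from the most similar customers in feature space (applied to numerical columns only):

    Python
    from sklearn.impute import KNNImputer
    
    # Fill each missing value using the average of the 5 most similar customers
    numerical_columns = data.select_dtypes(include='number').columns
    knn_imputer = KNNImputer(n_neighbors=5)
    data[numerical_columns] = knn_imputer.fit_transform(data[numerical_columns])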

    If a feature has more than a certain threshold (e.g., 30%) of missing values, consider omitting it altogether, as the imputation might introduce too much noise or bias.
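
    A quick sketch of applying such a threshold (30% here, purely as an example):

    Python
    # Share of missing values per column
    missing_share = data.isna().mean()
    
    # Drop columns where more than 30% of the values are missing
    data = data.drop(columns=missing_share[missing_share > 0.3].index)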

    In some cases, missing data can be predicted based on other features. For instance, if age is missing, it might be estimated based on purchasing behavior or engagement data.
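
    One way to do this with scikit-learn is IterativeImputer, which models each feature with missing values as a function of the other features; note that it is still flagged as experimental, hence the extra import. A minimal sketch:

    Python
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the estimator
    from sklearn.impute import IterativeImputer
    
    # Predict missing numerical values (e.g., age) from the other numerical features
    numerical_columns = data.select_dtypes(include='number').columns
    iterative_imputer = IterativeImputer(random_state=42)
    data[numerical_columns] = iterative_imputer.fit_transform(data[numerical_columns])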

    Key takeaways
    1. Choose an imputation method based on the amount and pattern of missing data. Simple imputation works well for a few missing values, while predictive modeling or KNN imputation is better for more complex cases.
    2. Document the imputation methods you used to ensure transparency in the analysis.
    3. Omitting a feature entirely is sometimes the most reliable approach, especially if it is not essential for clustering.

    Detecting Outliers

    Outliers are data points that are significantly different from the majority of the data, which can distort statistical analysis and machine learning models. In customer segmentation, outliers may represent unusual customer behavior (e.g., extremely high purchase amounts or session durations) that does not align with typical patterns. If not addressed, outliers can skew clustering algorithms like K-means, which rely on distance metrics and are sensitive to extreme values.

    Why is it so common?

    Outliers often arise due to data entry errors, rare but legitimate customer behaviors, or external factors such as a promotional event that leads to a few large transactions.

    In customer segmentation, outliers are not always errors, but they can be extreme values that don’t align with the patterns of the general customer population.

    You can detect outliers with a statistical technique such as the Interquartile Range (IQR) method or the z-score method.

    Using the Interquartile Range (IQR) Method to Detect Outliers

    The IQR is the range between the first (25th percentile) and third (75th percentile) quartiles of the data.

    Outliers are usually defined as points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

    Python
    # Select the numerical features and calculate the IQR for each
    numerical_columns = data.select_dtypes(include='number').columns
    Q1 = data[numerical_columns].quantile(0.25)
    Q3 = data[numerical_columns].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify outliers
    outliers_iqr = (data[numerical_columns] < lower_bound) | (data[numerical_columns] > upper_bound)
    
    # Remove outliers
    cleaned_data_iqr = data[~outliers_iqr.any(axis=1)]
    
    print(f"Original data shape: {data.shape}")
    print(f"Cleaned data shape: {cleaned_data_iqr.shape}")

    When to Use:

    • Works well when the data is not normally distributed; it is more robust than the z-score method, especially with skewed data.
    • Suitable for numerical features where values are expected to be in a specific range.

    Using the Z-score Method to Detect Outliers

    The z-score method identifies outliers by measuring how far each data point deviates from the mean in terms of standard deviations.

    Typically, data points with a z-score greater than 3 (or less than -3) are considered outliers.

    Formula: z = (X − μ) / σ

    X is the data point, μ is the mean, σ is the standard deviation.

    Python
    import pandas as pd
    import numpy as np
    
    # Sample dataset
    data = pd.DataFrame({
        'purchase_amount': [50, 55, 52, 70, 75, 1000, 53, 60, 58, 45, 65, 3000]
    })
    
    # Calculate z-scores
    data['z_score'] = (data['purchase_amount'] - data['purchase_amount'].mean()) / data['purchase_amount'].std()
    
    # Define threshold (commonly 3 or -3 for typical outlier detection)
    threshold = 3
    outliers = data[np.abs(data['z_score']) > threshold]
    
    # Filter out the outliers
    cleaned_data = data[np.abs(data['z_score']) <= threshold].drop(columns=['z_score'])
    
    print("Outliers detected:\n", outliers)
    print("Data without outliers:\n", cleaned_data)

    Key takeaways
    1. Use the z-score method for normally distributed data. For skewed data, prefer the IQR method as it’s less sensitive to skewness.
    2. Adjust thresholds if needed. For instance, in z-score, lowering the threshold (e.g., to 2.5) will detect more outliers but might also remove legitimate data points. In IQR, changing the 1.5 multiplier can expand or narrow the range of accepted values.
    3. Test the impact of outlier removal on clustering results by comparing model performance with and without outlier removal. Removing outliers generally improves clustering accuracy and consistency.

    Scaling and Transforming Your Data

    Scaling and transforming data are crucial steps in preparing a dataset for clustering algorithms, particularly for distance-based algorithms like K-means and DBSCAN.

    There are multiple possible techniques like z-score scaling, min-max scaling, robust scaling, or log transformation. They all help standardize the data so that all features contribute equally to the clustering results, preventing your customer segmentation from being skewed by outliers or variations in units.

    Why scaling and transformation are important:

    • Distance Consistency: Many clustering algorithms calculate distances between data points. If features are on different scales, those with larger ranges will disproportionately influence the results.
    • Uniform Contribution: Features with high variance (e.g., purchase amount) might overshadow features with low variance (e.g., age) if not scaled appropriately.
    • Model Performance: Proper scaling and transformation generally improve clustering accuracy, leading to more meaningful clusters.

    For our e-commerce customer segmentation project, we’re going to focus on two techniques: z-score scaling and log transformation.

    Standardization (Z-score Scaling)

    Standardization transforms data to have a mean of zero and a standard deviation of one. It’s useful when features have different units or scales but follow approximately normal distributions.

    Why z-score scaling is effective

    • Handles Diverse Feature Scales: E-commerce datasets often include features with vastly different scales, such as purchase amount, session duration, and number of visits. Standardization brings these features to a common scale with a mean of zero and standard deviation of one, which is critical for distance-based clustering algorithms like K-means.
    • Suitable for Normally Distributed Data: If features in the dataset are approximately normally distributed, as often found with customer engagement metrics (e.g., session duration, number of visits), standardization maintains the distribution while making features comparable in scale.
    • Balances Feature Influence: By scaling all features to have similar ranges, no single feature dominates the clustering process. This helps the algorithm consider all features equally, leading to more balanced clusters.

    Formula: Z = (X − μ) / σ

    X is the data point, μ is the mean of the feature, σ is the standard deviation of the feature.

    Python
    from sklearn.preprocessing import StandardScaler
    import pandas as pd
    
    # Sample dataset with different scales
    data = pd.DataFrame({
        'purchase_amount': [50, 100, 150, 200, 1000],
        'age': [25, 35, 45, 20, 60]
    })
    
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data)
    
    print("Standardized Data:\n", standardized_data)
    
    ou ce code-? 
    
    from sklearn.preprocessing import StandardScaler
    
    # Assume `data` is your e-commerce dataset
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data)

    Log Transformation

    Log transformation reduces skewness in features with large ranges by compressing the higher end of the distribution. It’s particularly helpful for features like purchase amount or income, which can have high positive skewness.

    Why log transformation is effective:

    • Reduces Skewness in High-Value Features: In e-commerce, features like purchase amount, order frequency, or lifetime value tend to be highly skewed due to a small number of high-value customers. Log transformation compresses the range of these values, reducing skewness and helping the model better interpret the data.
    • Improves Cluster Formation: By compressing large values, log transformation makes it easier for clustering algorithms to detect meaningful patterns without high-value outliers disproportionately affecting results.
    • Better Representation of Behavioral Patterns: E-commerce customer behavior can vary widely, and log transformation normalizes exponential behavior patterns, making segmentation more reliable and insightful.

    Formula: X′ = log(X+1)

    The +1 shifts the values so the transformation handles zeros safely (log(0) is undefined).

    Python
    import numpy as np
    
    data['log_purchase_amount'] = np.log(data['purchase_amount'] + 1)
    print("Log Transformed Data:\n", data[['purchase_amount', 'log_purchase_amount']])
    
    ou bien ce code ?
    
    import numpy as np
    
    # Apply log transformation to skewed features
    data['log_purchase_amount'] = np.log(data['purchase_amount'] + 1)

    Choose Between Machine Learning Algorithms For Your Customer Segmentation

    K-means Clustering

    K-means is one of the most widely used clustering algorithms, especially popular for customer segmentation due to its simplicity and efficiency.

    K-means performs well with large datasets, making it suitable for e-commerce data. This algorithm generates results that are easier to interpret because each customer is assigned to one of the k clusters based on proximity to a central point, called the centroid. It’s also more flexible as K-means can work with a wide range of features, such as demographics, purchase behavior, and engagement metrics.

    Example Use Case: Segmenting e-commerce customers into groups based on spending behavior and frequency of visits to offer personalized promotions.

    Python
    from sklearn.cluster import KMeans
    
    # Assuming `scaled_data` is your preprocessed and scaled dataset
    kmeans = KMeans(n_clusters=5, random_state=42)
    kmeans.fit(scaled_data)
    data['Cluster'] = kmeans.labels_

    How do you choose k? Determining the optimal number of clusters is essential for an effective segmentation. The elbow method or the silhouette score can help you decide on the right value for k.

    Python code for the Elbow method:

    Python
    import matplotlib.pyplot as plt
    
    sse = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.plot(range(1, 11), sse)
    plt.xlabel('Number of Clusters')
    plt.ylabel('SSE')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    DBSCAN Clustering

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on density rather than distance from a centroid.

    DBSCAN can identify clusters of arbitrary shape and is less affected by outliers, as it can treat noise as separate points. It’s best for datasets with noise or irregular shapes, like customer clickstream data where behavioral clusters are not clearly defined.

    Example Use Case: Grouping customers based on browsing behavior where some outliers (like unusually high session counts) may exist.

    Python
    from sklearn.cluster import DBSCAN
    
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    data['Cluster'] = dbscan.fit_predict(scaled_data)

    Test and Interpret Hyperparameters

    Hyperparameter tuning is essential to optimize clustering algorithms.

    Let’s explore tuning hyperparameters for several clustering algorithms, with practical examples of how they impact the segmentation process.

    K-means: Tuning n_clusters

    The n_clusters parameter in K-means specifies the number of clusters into which the data should be grouped. Selecting the right number of clusters is essential, as too few clusters may combine distinct customer segments, while too many clusters may result in over-segmentation with little practical value.

    Example: Suppose we are segmenting an e-commerce dataset with various customer behavior metrics (e.g., purchase frequency, average order value, and browsing time). We’ll test values for n_clusters from 2 to 15 to find the optimal clustering configuration.

    Techniques for Choosing n_clusters:

    Ask your marketing team… In some cases, your team already has a good sense of the number of segments they see in purchasing trends.

    The Elbow Method: Evaluates the sum of squared errors (SSE) for different numbers of clusters. The “elbow” point (where the rate of SSE decrease slows) indicates the optimal number of clusters.

    Python
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    
    sse = []
    for k in range(2, 16):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.plot(range(2, 16), sse)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Sum of Squared Errors (SSE)')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    Or the Silhouette Score: measures the cohesion and separation of clusters. The score ranges from -1 to 1, and higher values indicate well-defined clusters.

    Python
    from sklearn.metrics import silhouette_score
    
    silhouette_scores = []
    for k in range(2, 16):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        score = silhouette_score(scaled_data, kmeans.labels_)
        silhouette_scores.append(score)
    
    plt.plot(range(2, 16), silhouette_scores)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score for Optimal k')
    plt.show()

    Using these plots, we can choose the number of clusters that balances cohesion (low SSE) and separation (high silhouette score). For example, if k = 4 provides both an “elbow” point in the SSE plot and a high silhouette score, we’d select 4 clusters.
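
    If you prefer to automate the choice rather than read it off the plots, one simple heuristic (a sketch reusing the silhouette_scores list computed above) is to pick the k with the highest silhouette score:

    Python
    import numpy as np
    
    # Pick the candidate k with the best silhouette score
    candidate_ks = list(range(2, 16))
    best_k = candidate_ks[int(np.argmax(silhouette_scores))]
    print(f"Best k by silhouette score: {best_k}")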

    DBSCAN: Tuning eps and min_samples

    DBSCAN, which identifies clusters based on data density, relies on two main parameters:

    • eps: The maximum distance between two points to be considered neighbors.
    • min_samples: The minimum number of points required to form a dense region (i.e., a cluster).

    Example: In a dataset tracking customer interactions (e.g., clicks, session duration, number of items viewed), we might expect some clusters of highly active users, some moderately active users, and some outliers who rarely engage. DBSCAN is ideal for this, as it can isolate noise (low-activity users) and naturally discover clusters of varying shapes.

    Hyperparameter Tuning Steps:

    1. Finding eps with a Nearest Neighbors Plot: Plot the distances to each point’s k-th nearest neighbor, and look for a point where the distance “jumps,” indicating a natural boundary.
    2. Tuning min_samples: Set min_samples based on the expected minimum cluster size or experiment with values like 3, 5, or 10 to find the most cohesive clusters.
    Python
    from sklearn.neighbors import NearestNeighbors
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    
    # Calculate k-nearest neighbors
    neigh = NearestNeighbors(n_neighbors=5)
    nbrs = neigh.fit(scaled_data)
    distances, indices = nbrs.kneighbors(scaled_data)
    
    # Sort and plot distances
    distances = np.sort(distances[:, 4], axis=0)
    plt.plot(distances)
    plt.xlabel('Points sorted by distance')
    plt.ylabel('5th Nearest Neighbor Distance')
    plt.title('Choosing eps with KNN Distance Plot')
    plt.show()
    
    # DBSCAN with chosen parameters
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    dbscan.fit(scaled_data)
    data['Cluster'] = dbscan.labels_

    In this code:

    • The k-nearest neighbors plot helps identify a suitable eps value where the distance to neighbors suddenly increases.
    • After determining eps, we can run DBSCAN with min_samples=5. Adjust min_samples based on the density of clusters observed; a quick way to compare candidate values is sketched below.
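
    A quick way to compare candidate values (a sketch reusing scaled_data from the earlier steps) is to run DBSCAN for each and report how many clusters and noise points it produces:

    Python
    from sklearn.cluster import DBSCAN
    import numpy as np
    
    # Compare cluster counts and noise levels across candidate min_samples values
    for m in [3, 5, 10]:
        labels = DBSCAN(eps=0.5, min_samples=m).fit_predict(scaled_data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
        n_noise = int(np.sum(labels == -1))
        print(f"min_samples={m}: {n_clusters} clusters, {n_noise} noise points")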

    Key takeaways
    1. K-means (n_clusters): Use the elbow method or silhouette score to identify the optimal number of clusters for balanced customer groups.
    2. DBSCAN (eps and min_samples): Use a k-nearest neighbors plot to determine eps and experiment with min_samples to define cohesive customer segments without noise.

    Analyze Output Results

    After clustering, it’s essential to evaluate the characteristics of each cluster to interpret the results and derive actionable insights.

    Cluster Analysis

    Compare feature distributions within each cluster to identify unique characteristics. For instance, you might find that Cluster 1 contains high-spending customers, while Cluster 3 has infrequent but loyal visitors.

    Python Example for Cluster Analysis:

    Python
    # Summarize the feature distributions of each cluster
    for i in sorted(data['Cluster'].unique()):
        cluster = data[data['Cluster'] == i]
        print(f'Cluster {i} Summary:')
        print(cluster.describe())

    Then, translate cluster characteristics into personas; a simple labeling sketch follows the examples below.

    For example:

    1. Bargain Shoppers: Customers who make frequent small purchases.
    2. Loyal High-Spenders: Customers with high purchase amounts and frequent visits.
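
    Once the personas are defined, a simple mapping from cluster IDs to persona labels keeps downstream reports readable. The IDs below are placeholders and depend on your own clustering output:

    Python
    # Hypothetical mapping from cluster IDs to persona labels
    persona_map = {0: 'Bargain Shoppers', 1: 'Loyal High-Spenders'}
    data['Persona'] = data['Cluster'].map(persona_map).fillna('Unlabeled')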

    Visualize clusters to better understand relationships and structure.

    Python Code for Visualization:

    Python
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    
    # Reduce the scaled features to two dimensions for plotting
    reduced_data = PCA(n_components=2).fit_transform(scaled_data)
    
    sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=data['Cluster'], palette='viridis')
    plt.title('E-commerce Customer Segmentation')
    plt.show()

    Or you can simply use ClicData’s built-in dashboard designer to visualize your Python script’s output.

    Practical Takeaways
    1. Use the results to design targeted marketing campaigns for different customer types.
    2. Re-evaluate clusters regularly, as customer behavior may change over time, affecting segmentation.

    Build Your Customer Segmentation Using Python in ClicData

    Looking for an efficient way to build your machine learning model? Do it right from ClicData’s data management and analytics platform.

    With our Python scripting module (Data Scripts), you can seamlessly develop segmentation models using algorithms like K-means and DBSCAN. ClicData’s support for dimensionality reduction techniques, data transformation, and hyperparameter tuning ensures that your models are both accurate and optimized.

    Our built-in visualization tools enable you to instantly create dashboards and reports that can be shared with your teams across your organization. As customer behaviors and data evolve, the ability to iterate, refine, and automate segmentation models within ClicData means that your insights will stay relevant and scalable.

    Interested in seeing Data Scripts? Book a call with our product experts.