Aligning Multi-Armed Bandits for Dynamic Optimization of Customer Assignments in Recommendation Models

Nish · May 30, 2024

Introduction

When you have an app or website, one of your goals is to get people to engage with what you are creating, whether that is a product, a service, or something in between.

Recommendation systems play a crucial role in helping businesses attract, engage, and retain customers. However, the effectiveness of these systems largely depends on the accurate assignment of customers to the most suitable recommendation models. This is where multi-armed bandit algorithms come into play. In this blog post, we will explore how aligning multi-armed bandits for dynamic customer allocation can significantly optimize the performance of recommendation models.

Multi-Armed Bandits Explained

For historical context, the term “bandit” comes from the slot machines found in casinos: they were nicknamed bandits since they ended up “stealing your money”, just like a bandit would. “Multi-armed” comes from the fact that people used to pull the levers (arms) of many slot machines at once.

Showcasing a set of old vintage slot machines taken from ..

Definitions

Lots of deep theory exists in this space, but from a practical perspective a basic grasp of some definitions is more than enough to begin working with these algorithms.

  • Bandit specific
    • The Multi-Armed Bandit Problem is essentially the problem of not knowing which decision is optimal when faced with multiple options (“arms”).
      • The way you attempt to solve these problems is to understand the specific context and then apply Multi-Armed Bandit Algorithms, which are purposefully designed to tackle them.
    • Reward just refers to the payoff or return received after selecting a particular action or arm of the multi-armed bandit.
    • Each arm of the bandit is associated with a reward distribution, which tells you the likelihood of receiving a reward on any given round. When a given arm is pulled, the reward received can be thought of as a realisation of this distribution.
      • This distribution is typically unknown to the entity pulling the arms.
    • The goal is to maximize the reward over a series of trials (arm pulls) while minimizing the regret, which is the difference between the reward received and the reward you would have received had you always chosen the best arm.
      • Just like in RL the reward hypothesis can be used to assert that all goals can be thought of as a maximization of cumulative reward.
    • The rewards themselves are stochastic in nature, meaning that they are randomly determined by the underlying probability distribution.
      • This randomness introduces uncertainty and complexity into the decision-making process, as you must balance the trade-off between exploring new arms to gather more information about their reward distributions (exploration) and exploiting known arms that have shown high rewards in the past (exploitation). This is known as the exploration-exploitation trade-off.
    • Some problem formulations might have rewards that are context dependent, where the reward distributions change based on external or contextual information available at the time of making a decision. This adds an additional strategic layer to the problem, as the optimal action may vary with changes in the context.
  • Business specific
    • An episode refers to the duration of time for which a given customer stays assigned to the same bandit, after which they are reallocated based on some process.
    • When applying customer allocations, a standard approach is staggering (a minimal sketch of this is shown after this list).
      • The best way to think about it is as follows:
        1. Take the full customer base.
        2. Divide them into groups based on the episode length (e.g. if episode length = 28 days then group size = 1/28th of the base).
        3. Each day a single group gets allocated to an experiment.
          • Therefore each day there is a group that is expiring from its previous episode and getting reallocated to another experiment.
        4. After a full episode length (e.g. 28 days) all customers (across all groups) will have been through an entire episode and will therefore begin being reassigned (rolled over) to another episode.
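
To make the staggering idea concrete, here is a minimal sketch (not from the original post; the customer counts and field names are assumptions) of how you could split a customer base into daily reallocation groups:

    import pandas as pd

    # Hypothetical set-up: 280 customers and a 28-day episode length,
    # giving 28 groups of 10 customers, with one group rolling over each day.
    episode_length = 28
    customers = pd.DataFrame({"customer_id": range(1, 281)})

    # Assign each customer to one of `episode_length` staggered groups;
    # group g is the set of customers reallocated on day g of every episode cycle.
    customers["reallocation_group"] = customers.index % episode_length

    print(customers["reallocation_group"].value_counts().sort_index().head())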

Resources

The theory in this space is well established, and as of now a few universally recognised reference materials exist, along with a host of additional articles.

Algorithms and Techniques

Plenty of algorithms and variants of them exist; we’ll touch on the three most common ones.

ε-Greedy

Pros:

  1. Simplicity: Easier to implement and understand in terms of how it implements the trade-off between exploration and exploitation.

Cons:

  1. Inefficient Exploration: Randomly explores at a fixed rate (set by \(\epsilon\)). Intuitively you should explore more initially to gather data and, once you have gathered more data, shift towards exploiting; however, depending on your \(\epsilon\), you might not explore enough initially and not exploit enough later on.
  2. Flexibility: You’d want to choose \(\epsilon\) based on the context of the problem, but tuning it to find the right value might not be easy.
  3. Sub-optimal Decisioning: Due to the constant exploration rate \(\epsilon\), converging to the optimal bandit might take longer.
ε-Greedy Algorithm
    import numpy as np
    # e-greedy policy
    class eGreedyPolicy:
        
        def __init__(self, epsilon):
            """
            Initialize the eGreedyPolicy with a given epsilon value.

            Parameters
            ----------
            epsilon : float
                The probability of choosing a random action instead of the greedy action.
            """
            self.epsilon = epsilon
        
        def choose_bandit(self, k_array, reward_array, n_bandits):
            """
            Choose a bandit to pull based on the epsilon-greedy policy.

            Parameters
            ----------
            k_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the number of times a particular action has been taken.
            reward_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the rewards received for a particular action.
            n_bandits : int
                The number of bandits.

            Returns
            -------
            int
                The index of the chosen bandit.
            """
            # Successes and total draws
            success_count = reward_array.sum(axis=1)
            total_count = k_array.sum(axis=1)
            
            # Ratio of successes vs total
            success_ratio = success_count / total_count
            
            # Choosing best greedy action or random depending on epsilon probability
            if np.random.random() < self.epsilon:
                # Returning random action, excluding best
                return np.random.choice(np.delete(list(range(n_bandits)), np.argmax(success_ratio)))
            else:
                # Returning best greedy action
                return np.argmax(success_ratio)

    

Upper Confidence Bound (UCB)

Pros:

  1. Incorporating uncertainty: Instead of randomly exploring bandits at a constant rate, the exploration is embedded into the quantity the algorithm maximises, weighted by how often each bandit has been assigned up to that point (the exact selection rule is written out just before the code below). The extra square-root term upweights arms that haven’t been seen as much, making it a better explorer than ε-greedy.
  2. Higher average reward: UCB doesn’t explore randomly and often achieves a higher average reward than ε-greedy, since the additional term means more exploration early on (less reward) but far less in later rounds (more reward), causing the average to be better more often than not down the stretch. This also fits more closely with our intuition.

Cons:

  1. Complexity: It is more complex to implement and understand compared to ε-greedy.
  2. Computational Overhead: Since the algorithm requires calculating an additional confidence bound it can be a little more demanding than ε-greedy.
  3. Initial Exploration: Due to weighted exploration it may initially explore sub-optimal bandits more frequently meaning higher initial regret.
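
For reference, the quantity the algorithm maximises in the standard UCB1 formulation (which the code below computes as `success_ratio + sqrt_term`) is

\[ \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}} \]

where \(\hat{\mu}_i\) is the observed success ratio of arm \(i\), \(n_i\) is the number of times arm \(i\) has been pulled, and \(t\) is the total number of pulls across all arms so far.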
UCB Algorithm
    import numpy as np

    # UCB policy
    class UCBPolicy:
        
        def __init__(self):
            """
            Initialize the UCBPolicy.

            This constructor does not require any parameters.
            """
            pass
        
        def choose_bandit(self, k_array, reward_array, n_bandits):
            """
            Choose a bandit to pull based on the Upper Confidence Bound (UCB) policy.

            Parameters
            ----------
            k_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the number of times a particular action has been taken.
            reward_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the rewards received for a particular action.
            n_bandits : int
                The number of bandits.

            Returns
            -------
            int
                The index of the chosen bandit.
            """
            # Successes and total draws
            success_count = reward_array.sum(axis=1)
            total_count = k_array.sum(axis=1)
            
            # Ratio of successes vs total
            success_ratio = success_count / total_count
            
            # Computing square root term
            sqrt_term = np.sqrt(2 * np.log(np.sum(total_count)) / total_count)
            
            # Returning best greedy action
            return np.argmax(success_ratio + sqrt_term)
    
    

Thompson Sampling

Pros:

  1. Optimal Uncertainty Handling: Uses a Bayesian approach to model uncertainty, which is more sophisticated and has been shown empirically to outperform the other methods.
  2. More Adaptive: Exploration comes from sampling from posterior distributions, which represents uncertainty better than either completely random exploration (the ε-greedy case) or altering the update measure (the UCB case). This leads to more efficient exploration and exploitation; the Beta-Bernoulli update used throughout this post is spelled out just before the code below.
  3. Better Performance: It has been shown to outperform the other algorithms in terms of cumulative reward across rounds.

Cons:

  1. Complexity: It is more complex to implement and understand compared to the other techniques.
  2. Computational Cost: Since the algorithm requires sampling and keeping track of the entire posterior distribution it can be costly especially if you have lots of bandits and rounds.
  3. Parameter Sensitive: It can be sensitive to your choice of prior distributions, which can influence performance, particularly early on when little data has been observed.
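
For the Bernoulli (binary) reward case used throughout this post, each arm \(i\) keeps a \(\text{Beta}(\alpha_i, \beta_i)\) posterior over its success probability. Starting from the uniform prior \(\alpha_i = \beta_i = 1\), a reward of 1 increments \(\alpha_i\) and a reward of 0 increments \(\beta_i\); on each round a sample is drawn from every arm’s posterior and the arm with the largest sample is chosen, which is exactly what the code below does.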
Thompson Sampling Algorithm
    import numpy as np

    class TSPolicy:
        """
        Implements the Thompson Sampling policy for the Multi-Armed Bandit problem.

        Thompson Sampling is a probabilistic method that balances the exploration-exploitation trade-off by choosing actions based on the probability of them being optimal. This implementation assumes a Bernoulli bandit model where rewards are binary.

        Attributes
        ----------
        buffer_size : int
            The size of the buffer used to limit the history of rewards and actions considered. This is particularly useful for non-stationary bandits where the reward distributions change over time.

        Methods
        -------
        choose_bandit(k_array, reward_array, n_bandits)
            Chooses a bandit to play based on the Thompson Sampling algorithm.
        """

        def __init__(self, buffer_size):
            """
            Initializes the TSPolicy with a specified buffer size.

            Parameters
            ----------
            buffer_size : int
                The size of the buffer to limit the history of rewards and actions. A larger buffer size allows for more historical data to be considered, while a smaller size makes the policy adapt faster to changes in the reward distributions.
            """
            self.buffer_size = buffer_size

        def choose_bandit(self, k_array, reward_array, n_bandits):
            """
            Chooses a bandit to play based on the Thompson Sampling algorithm.

            This method calculates the posterior distribution of the success probability for each bandit based on the history of actions and rewards. It then samples from these distributions and chooses the bandit with the highest sample.

            Parameters
            ----------
            k_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the number of times a particular action has been taken. The shape of `k_array` is (n_bandits, n_actions).
            reward_array : numpy.ndarray
                A 2D array where each row corresponds to a bandit and each column corresponds to the rewards received for a particular action. The shape of `reward_array` is (n_bandits, n_actions).
            n_bandits : int
                The number of bandits.

            Returns
            -------
            int
                The index of the chosen bandit.
            """
            # Limiting the history considered to the most recent rounds (sliding buffer)
            k_array = k_array[:, -self.buffer_size:]
            reward_array = reward_array[:, -self.buffer_size:]
            success_count = reward_array.sum(axis=1)
            failure_count = k_array.sum(axis=1) - success_count

            # Drawing a sample from each bandit distribution
            samples_list = [np.random.beta(1 + success_count[bandit_id], 1 + failure_count[bandit_id]) for bandit_id in range(n_bandits)]

            # Returning bandit with the best sample
            return np.argmax(samples_list)

    

Application to Recommendation Models

In the context of recommendation systems, multi-armed bandits can be applied to dynamically allocate customers to different recommendation models. Each recommendation model represents an “arm,” and the goal is to assign customers to the model that is most likely to provide them with relevant and satisfying recommendations. By using multi-armed bandit algorithms, the system can continuously learn and adapt its customer assignment strategy based on the observed performance of each recommendation model.

The better the performance, the more assignments the arm will get, and vice versa.
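
As a quick, self-contained illustration of that feedback loop (simulated data only; the model names and engagement rates below are made up), you can watch a Thompson Sampling bandit shift assignments towards the best-performing model:

    import numpy as np

    rng = np.random.default_rng(42)
    models = {"model_A": 0.05, "model_B": 0.08, "model_C": 0.12}  # assumed true engagement rates
    alpha = {m: 1 for m in models}        # Beta prior successes
    beta_params = {m: 1 for m in models}  # Beta prior failures
    assignments = {m: 0 for m in models}

    for _ in range(5_000):  # one iteration = one customer assignment
        # Thompson Sampling: sample each model's posterior and assign the customer to the best sample
        chosen = max(models, key=lambda m: rng.beta(alpha[m], beta_params[m]))
        assignments[chosen] += 1
        reward = rng.random() < models[chosen]  # simulate whether the customer engaged
        alpha[chosen] += reward
        beta_params[chosen] += 1 - reward

    print(assignments)  # the bulk of customers should end up on model_C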

Below I will share some of the core steps, along with considerations you can take, in order to apply bandit algorithms to your data. Specifically, I’ll be showcasing the application of the Thompson Sampling algorithm.

Defining our data

You need data for training your bandit algorithm as well as for choosing the customers you want to perform assignments for. For demo purposes I’ll simulate some random data, but in practice you’d want to create your dataset from your own data sources. This would most likely require using some SQL dialect to transform your source data into something usable.

The key fields required:

  1. customer_id : int Some unique identifier for a customer; in practice you can use multiple fields to define a unique customer if a single id doesn’t exist.
  2. bandit_name : str The bandit that the given customer has been assigned to.
  3. reward : int Some reward corresponding to the performance of the customer. We will consider a discrete reward here, which could correspond to whether the customer clicked or engaged while being recommended content from the given bandit.

The training data would typically cover some period further back in time, while the scoring (assignment) data would be based on the most recent events.

training_data = generate_fictitious_data(num_rows=10_000, customer_id_start=1)
assignment_data = generate_fictitious_data(num_rows=5000, customer_id_start=10_001)
generate_fictitious_data function
    import random
    import pandas as pd

    def generate_fictitious_data(num_rows=5000, customer_id_start=1):
        """
        Generate a pandas DataFrame with fictitious data containing customer_id, bandit_name, and reward.

        Parameters
        ----------
        num_rows : int 
            Number of rows in the DataFrame. Default is 5000.
        customer_id_start : int, optional
            Starting id for the customer_id. Default is 1.

        Returns
        -------
        df : pd.DataFrame
            DataFrame containing the generated data.
        """
        # Generate customer_id as unique numbers
        customer_ids = list(range(customer_id_start, customer_id_start + num_rows))

        # Generate bandit_name as random choices from 'A', 'B', 'C'
        bandit_names = [random.choice(['A', 'B', 'C']) for _ in range(num_rows)]

        # Generate reward as random binary values (0 or 1)
        rewards = [random.choice([0, 1]) for _ in range(num_rows)]

        # Create a dictionary with the generated data
        data = {
            'customer_id': customer_ids,
            'bandit_name': bandit_names,
            'reward': rewards
        }

        # Create a DataFrame from the dictionary
        df = pd.DataFrame(data)

        return df
    
    

Training our bandit

  • High level steps:
    1. Setup a dataset that can store our bandit parameters.
    2. Take these parameters and initialise our Multi-Armed (Thompson Sampling) bandit algo with them.
    3. Feed in our historic training dataset and use it to update our Multi-Armed bandit model.
    4. Update (overwrite) our stored bandit parameters with the final parameters after training.
  • We can also add in some visualisations too.
    1. Visualise the posterior distributions for each bandit after training to gauge the model’s understanding.

Initialise key constants

episode_length = 28
bandit_names = ['A', 'B', 'C']

Create a bandit parameter dataset

You’ll want to have somewhere to store your bandit parameters, since over time these will change and you’d need to use the latest parameters during assignment. Here I have decided to create a DataFrame to hold these, but for a real production system you could use some data store.

bandit_parameters = create_bandit_parameters_dataframe(bandit_names)
bandit_parameters.head()
create_bandit_parameters_dataframe function
    from datetime import datetime

    import pandas as pd

    def create_bandit_parameters_dataframe(bandit_names):
        """
        Creates a DataFrame to store bandit parameters.

        Parameters
        ----------
        bandit_names : list
            The names of the bandits.

        Returns
        -------
        pd.DataFrame
            The created DataFrame with columns for bandit name, alpha, beta, and last_updated.
        """
        return pd.DataFrame({
            'bandit_name': bandit_names,
            'alpha': [1] * len(bandit_names),
            'beta': [1] * len(bandit_names),
            'last_updated': [datetime.now()] * len(bandit_names)
        })
    

Train your bandit

We can now train our bandit by applying it to our historical training data. This allows the algorithm to learn upfront which bandit arm tends to give the most reward, so these learnings can be put to good use when assigning customers.

ts, bandit_parameters = train_bandits(bandit_parameters, 
                    training_data, 
                    bandit_names)
bandit_parameters.head()
train_bandits functionality
    from datetime import datetime

    import pandas as pd
    from scipy.stats import beta

    class ThompsonSampling:
        """
        Implements the Thompson Sampling algorithm for the Multi-Armed Bandit problem.

        Attributes
        ----------
        bandit_names : list
            The names of the bandits (arms).
        alpha : dict
            Dictionary of alpha parameters (successes) for the Beta distribution of each bandit.
        beta : dict
            Dictionary of beta parameters (failures) for the Beta distribution of each bandit.

        Methods
        -------
        choose_bandit()
            Chooses a bandit to play based on the Thompson Sampling algorithm.
        update(chosen_bandit, reward)
            Updates the parameters of the chosen bandit's Beta distribution based on the observed reward.
        """
        def __init__(self, bandit_names, alpha=None, beta=None):
            """
            Initialises the ThompsonSampling class with the specified bandits.

            Parameters
            ----------
            bandit_names : list
                The names of the bandits (arms).
            alpha : dict, optional
                Initial alpha parameters per bandit. Defaults to 1 for each bandit.
            beta : dict, optional
                Initial beta parameters per bandit. Defaults to 1 for each bandit.
            """

            self.bandit_names = bandit_names
            self.alpha = alpha if alpha is not None else {name: 1 for name in bandit_names}
            self.beta = beta if beta is not None else {name: 1 for name in bandit_names}

        def choose_bandit(self):
            """
            Chooses a bandit to play based on the Thompson Sampling algorithm.

            This method samples from the Beta distribution of each bandit and selects the bandit with the highest sample.

            Returns
            -------
            str
                The name of the chosen bandit.
            """

            expected_reward_samples = {name: beta.rvs(self.alpha[name], self.beta[name], size=1) for name in self.bandit_names}
            return max(expected_reward_samples, key=expected_reward_samples.get)

        def update(self, chosen_bandit, reward):
            """
            Updates the parameters of the chosen bandit's Beta distribution based on the observed reward.

            Parameters
            ----------
            chosen_bandit : str
                The name of the chosen bandit.
            reward : int
                The observed reward (1 for success, 0 for failure).
            """
            if reward:
                self.alpha[chosen_bandit] += 1
            else:
                self.beta[chosen_bandit] += 1

    def train_bandits(bandit_params_df, customer_data_df, bandit_names, limit=0):
        """
        Perform assignment of customers to bandits over a specified episode length using Thompson Sampling.

        Parameters
        ----------
        bandit_params_df : pd.DataFrame
            The DataFrame containing bandit parameters.
        customer_data_df : pd.DataFrame
            The DataFrame containing customer data.
        bandit_names : list
            The names of bandits (arms).
        limit : int, optional
            The limit on the number of customers to consider for training. Default is 0 (no limit).

        Returns
        -------
        ts : ThompsonSampling
            The trained ThompsonSampling instance.
        bandit_params_df : pd.DataFrame
            The DataFrame of bandit parameters, updated with the trained values.
        """
        # Retrieve bandit parameters from DataFrame
        bandit_params = get_bandit_parameters_from_dataframe(bandit_params_df, bandit_names)
        alpha = {name: bandit_params[name][0] for name in bandit_names}
        beta = {name: bandit_params[name][1] for name in bandit_names}
        
        ts = ThompsonSampling(bandit_names, alpha, beta)
        
        if limit > 0:
            customer_data_df = customer_data_df.sample(n=limit)
        
        for _, row in customer_data_df.iterrows():
            ts.update(row['bandit_name'], row['reward'])

        # Update the bandit parameters in DataFrame
        for bandit_name in bandit_names:
            update_bandit_parameters_in_dataframe(bandit_params_df, bandit_name, ts.alpha[bandit_name], ts.beta[bandit_name])
        
        return ts, bandit_params_df

    def get_bandit_parameters_from_dataframe(bandit_params_df, bandit_names):
        """
        Retrieves the alpha and beta parameters for each bandit from the DataFrame.

        Parameters
        ----------
        bandit_params_df : pd.DataFrame
            The DataFrame containing bandit parameters.
        bandit_names : list
            The names of the bandits.

        Returns
        -------
        dict
            A dictionary with bandit names as keys and (alpha, beta) tuples as values.
        """
        bandit_params = {}
        for name in bandit_names:
            row = bandit_params_df[bandit_params_df['bandit_name'] == name]
            if not row.empty:
                bandit_params[name] = (row['alpha'].values[0], row['beta'].values[0])
            else:
                bandit_params[name] = (1, 1)
        return bandit_params
    
    def update_bandit_parameters_in_dataframe(bandit_params_df, bandit_name, alpha, beta):
        """
        Updates the alpha and beta parameters for a specific bandit in the DataFrame.

        Parameters
        ----------
        bandit_params_df : pd.DataFrame
            The DataFrame containing bandit parameters.
        bandit_name : str
            The name of the bandit.
        alpha : float
            The updated alpha parameter.
        beta : float
            The updated beta parameter.
        """
        if bandit_name in bandit_params_df['bandit_name'].values:
            bandit_params_df.loc[bandit_params_df['bandit_name'] == bandit_name, ['alpha', 'beta', 'last_updated']] = [alpha, beta, datetime.now()]
        else:
            # Append a new row in place; re-binding the DataFrame via pd.concat would be lost outside this function
            bandit_params_df.loc[len(bandit_params_df)] = [bandit_name, alpha, beta, datetime.now()]
    

Visualise the posterior distributions

It’s useful to get a feel for exactly what the model has learned, and this can be done by visualising the posterior distributions for each of the bandit’s arms. If the distributions are heavily peaked with differing expected values, the model will exploit more, since it is far more likely that the largest sample will come from the distribution with the higher expected value (more divergence implies less overlap, so the lower distributions rarely produce the winning sample).
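
As an illustrative aside (not part of the original pipeline; the Beta parameters below are made-up examples), you can quantify this effect by estimating how often one arm’s posterior sample beats another’s:

    import numpy as np

    rng = np.random.default_rng(0)

    def prob_first_arm_wins(params_a, params_b, n_samples=100_000):
        """Monte Carlo estimate of P(sample from arm A > sample from arm B)."""
        a = rng.beta(*params_a, size=n_samples)
        b = rng.beta(*params_b, size=n_samples)
        return (a > b).mean()

    # Flat, overlapping posteriors -> genuine exploration between the two arms
    print(prob_first_arm_wins((6, 5), (5, 6)))            # roughly 0.65
    # Peaked, divergent posteriors -> the first arm wins essentially every round
    print(prob_first_arm_wins((600, 400), (400, 600)))    # essentially 1.0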

plot_posterior_distributions(ts)
plot_posterior_distributions function
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.stats import beta

    def plot_posterior_distributions(ts_instance, fig_size=(12,8)):
        """
        Plots the posterior Beta distributions for each bandit in the ThompsonSampling instance.

        Parameters
        ----------
        ts_instance : ThompsonSampling
            An instance of the ThompsonSampling class.
        fig_size : tuple, optional
            A tuple representing the figure size (width, height).

        Returns
        -------
        None
            The function displays the plot and doesn't return anything.
        """

        n_bandits = len(ts_instance.bandit_names)
        n_rows = np.ceil(n_bandits / 2).astype(int)
        n_cols = 2

        fig, axes = plt.subplots(n_rows, n_cols, figsize=fig_size)
        # Flatten the axes container so it can always be indexed with a single subscript
        axes = np.atleast_1d(axes).flatten()

        for sub_plot, bandit_name in enumerate(ts_instance.bandit_names):
            a = ts_instance.alpha[bandit_name]
            b = ts_instance.beta[bandit_name]
            reward = np.linspace(0, 1, 100)
            y = beta.pdf(reward, a, b)
            peak_reward = reward[np.argmax(y)]

            axes[sub_plot].plot(reward, y, label=f"Beta({a}, {b})")
            axes[sub_plot].set_title(f"Posterior: {bandit_name} \n Success: {int(a)}, Failure: {int(b)}")
            axes[sub_plot].set_xlabel('Expected Reward')
            axes[sub_plot].axvline(x=peak_reward, color='red', linestyle=':', linewidth=2, label=f"'Average' Expected Reward: {peak_reward:.2f}")
            # axes[sub_plot].text(peak_reward, -1, f'(peak_reward:.2f)', ha='center', va='bottom', color='red')

            axes[sub_plot].legend()
            axes[sub_plot].grid(True)
            axes[sub_plot].set_facecolor('lightgrey')

        for j in range(sub_plot + 1, n_rows * n_cols):
            axes[j].axis('off')

        plt.tight_layout()
        plt.show()
    
View of the posterior distribution plot for each bandit using our functionality.

As you can see, the distributions are fairly close together and not divergent. This makes intuitive sense, as it aligns with what we’d expect given the process used to construct our training dataset (rewards were generated uniformly at random for every bandit).

Performing customer assignment

  • High level steps:
    1. Grab the customers that you want to allocate.
    2. Take the initially trained bandit algorithm and apply it to each customer making sure to update its parameters after each assignment.
    3. Make the assignments for all customers and store them alongside other metadata such as dates for the start and end of the assignment inside some data structure.
    4. Update our stored bandit parameters with the final values.
  • We can also add some visualisations in too:
    1. Visualise the customer group’s most recent reward distribution (based on the episode they are rolling out of).
    2. Visualise the final bandit allocation distributions.

Visualising the real reward distribution

plot_reward_distribution(assignment_data, 
                               group_col="bandit_name", 
                               reward_col="reward")
plot_reward_distribution function
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    def plot_reward_distribution(data, group_col, reward_col, fig_size=(12, 8)):
        """
        Plot the distribution of reward.

        Parameters
        ----------
        data : pd.DataFrame
            The input dataset containing customer ID, group column, reward and other fields.
        group_col : str
            The name of the column indicating the group for each customer.
        reward_col : str
            The name of the column containing the reward values.
        fig_size : tuple, optional
            The size of the figure (width, height) in inches. Default is (12, 8).

        Returns
        ------
        None
            The function displays the plot and does not return any value.
        """

        groups = data[group_col].unique()
        num_groups = len(groups)

        n_cols=2
        n_rows = (num_groups + n_cols - 1) // n_cols
        fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=fig_size)

        for i, group in enumerate(groups):
            # Calculate the row and column indices for the current subplot
            row_idx = i // n_cols
            col_idx = i % n_cols

            group_data = data[data[group_col] == group]
            reward_counts = group_data[reward_col].astype(int).value_counts()
            reward_counts = reward_counts.reindex([0, 1], fill_value=0)

            plot_data = pd.DataFrame({
                'Reward': reward_counts.index,
                'Count': reward_counts.to_numpy()
            })

            sns.barplot(x='Reward', y='Count', data=plot_data, ax=axes[row_idx, col_idx])
            axes[row_idx, col_idx].set_title(f'Group: {group}')
            axes[row_idx, col_idx].set_xlabel(f'Reward')
            axes[row_idx, col_idx].set_ylabel(f'Count')
            axes[row_idx, col_idx].set_ylim([0, data.shape[0]])
            axes[row_idx, col_idx].set_facecolor('lightgray')

            # Annotate each bar with its count (inside the loop so every subplot gets labels)
            for bar in axes[row_idx, col_idx].patches:
                count = bar.get_height()
                axes[row_idx, col_idx].text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(int(count)), ha='center', va='bottom')

        plt.tight_layout()
        plt.show()
    
A view of the rewards across limbs corresponding to the customers who are just finishing their episodes and that we are about to reassign.

Performing Assignments

customer_assignments, bandit_parameters = assign_customers(bandit_parameters, 
                       assignment_data, 
                       bandit_names, 
                       episode_length, 
                       ts)
customer_assignments.head()
assign_customers function
    from datetime import datetime, timedelta

    import pandas as pd

    def assign_customers(bandit_params_df, customer_data_df, bandit_names, episode_length, ts, limit=0):
        """
        Perform assignment of customers to bandits over a specified episode length using Thompson Sampling.
        
        Parameters
        ----------
        bandit_params_df : pd.DataFrame
            The DataFrame containing bandit parameters.
        customer_data_df : pd.DataFrame
            The DataFrame containing customer data.
        bandit_names : list
            The names of the bandits.
        episode_length : int
            The length of the episode in days.
        ts : ThompsonSampling
            The trained ThompsonSampling instance.
        limit : int, optional
            The limit on the number of customers to assign. Default is 0 (no limit).
            
        Returns
        -------
        customer_assignments : pd.DataFrame
            A DataFrame containing each customer's ID, the assigned bandit, and the start and end dates of their episode.
        bandit_params_df : pd.DataFrame
            The DataFrame of bandit parameters, updated after the assignments.
        """
        assignments = []
        start_date = datetime.now()
        
        if limit > 0:
            customer_data_df = customer_data_df.sample(n=limit)
        
        # Perform assignment process
        for _, row in customer_data_df.iterrows():
            chosen_bandit = ts.choose_bandit()
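            # Feed back the reward observed under the customer's previous (expiring) assignment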
            ts.update(row['bandit_name'], row['reward'])
            assignments.append({
                'customer_id': row['customer_id'],
                'bandit_name': chosen_bandit,
                'start_date': start_date,
                'end_date': start_date + timedelta(days=episode_length)
            })
            
        # Update the bandit parameters in DataFrame
        for bandit_name in bandit_names:
            update_bandit_parameters_in_dataframe(bandit_params_df, bandit_name, ts.alpha[bandit_name], ts.beta[bandit_name])
            
        customer_assignments = pd.DataFrame(assignments)
        
        return customer_assignments, bandit_params_df
    

Visualising assignment distributions

plot_bandit_assignment_distributions(customer_assignments, 
                                           "bandit_name", 
                                           "end_date")
plot_bandit_assignment_distributions function
    import matplotlib.pyplot as plt
    import seaborn as sns

    def plot_bandit_assignment_distributions(data, bandit_col, group_date, fig_size=(8, 6)):
        """
        Plot the counts of assignments made by the bandit algorithm.

        Parameters
        ----------
        data : pd.DataFrame
            The input dataset containing the bandit assignments.
        bandit_col : str
            The name of the column to calculate the counts (bandit assignment col).
        group_date : str
            The name of the column containing the date (identifying a given group).
        fig_size : tuple, optional
            The size of the figure (width, height) in inches. Default is (8, 6).

        Returns
        -------
        None
            The function displays the plot and doesn't return anything.
        """

        column_counts = data[bandit_col].value_counts()
        group_name = f"{data[group_date].unique()[0].strftime('%Y-%m-%d')}"

        fig, ax = plt.subplots(figsize=fig_size)

        sns.barplot(x=column_counts.index, y=column_counts.to_numpy(), ax=ax)
        ax.set_title(f"Distribution of Customer Allocation for group end_date {group_name}")
        ax.set_xlabel('Limb')
        ax.set_ylabel('Count')
        ax.set_facecolor('lightgrey')

        for bar in ax.patches:
            count = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(int(count)), ha='center', va='bottom')

        plt.tight_layout()
        plt.show()
    
Snapshot of what the assignments look like for a particular group of customers.

Future Directions

There are many ways to extend the work here; I have listed a few directions below:

  1. Handle continuous rewards
    • At the moment the Thompson Sampling algorithm only applies to the discrete (binary) reward case.
    • To adapt it for continuous rewards, things like the Bayesian prior and the update logic would need to be extended (a minimal sketch of one such extension is shown after this list).
  2. Over-aggressive assignment handling
    • At the moment, if you train on too much data your posterior distributions can become extremely peaked and divergent, which can lead to almost always assigning to the most performant bandit.
    • This might not always be desired, especially in a practical setting, as you may want some variation to continue monitoring behaviour across the recommendation models.
    • Some potential ways to adapt are:
      • Sliding Window: This would cap the total amount of rewards that a given bandit could receive, thereby limiting how peaked the distributions can become. The challenge might be optimising the threshold for the problem.
  3. Contextual reward handling
    • In a lot of real-world cases the reward can vary over time due to a variety of factors, including the context.
    • To handle this you’d want to adapt the “Vanilla MAB” approach into more of a “Contextual MAB” approach.
    • Plenty of resources touch on it; this somewhat recent lecture at the PyData conference might be a nice place to start.
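
On the first point above, here is a minimal sketch (an illustration under assumptions, not part of the post’s pipeline) of what the prior and update logic could look like for continuous rewards, using a Normal likelihood with known observation variance and a Normal prior on each arm’s mean reward:

    import numpy as np

    class GaussianThompsonSampling:
        """Thompson Sampling sketch for continuous rewards (Normal likelihood, known variance)."""

        def __init__(self, bandit_names, obs_var=1.0, prior_mean=0.0, prior_var=1.0):
            self.obs_var = obs_var
            self.post_mean = {name: prior_mean for name in bandit_names}
            self.post_var = {name: prior_var for name in bandit_names}

        def choose_bandit(self):
            # Sample a plausible mean reward for each arm and pick the largest
            samples = {name: np.random.normal(self.post_mean[name], np.sqrt(self.post_var[name]))
                       for name in self.post_mean}
            return max(samples, key=samples.get)

        def update(self, chosen_bandit, reward):
            # Standard Normal-Normal conjugate update for a single observation
            prior_precision = 1.0 / self.post_var[chosen_bandit]
            obs_precision = 1.0 / self.obs_var
            new_precision = prior_precision + obs_precision
            self.post_mean[chosen_bandit] = (prior_precision * self.post_mean[chosen_bandit]
                                             + obs_precision * reward) / new_precision
            self.post_var[chosen_bandit] = 1.0 / new_precision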

Conclusions

With all this we come to the end. Hopefully it’s been insightful and has given you a peek into the purpose behind Multi-Armed Bandit problems, and specifically how Thompson Sampling can be applied to the case of customer assignment and why it can be beneficial.

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Aligning Multi-Armed Bandits for Dynamic Optimization of Customer Assignments in Recommendation Models},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Multi-Armed-Bandits/}},
  note = {[Online; accessed]},
  year = {2024}
}
