Table of Contents
- Introduction
- Multi-Armed Bandits Explained
- Algorithms and Techniques
- Application to Recommendation Models
- Future Directions
- Conclusions
Introduction
When you have an app or website one of your goals is to get people to engage with what you are creating, whether it be a product, service or something in between.
Recommendation systems play a crucial role in helping businesses attract, engage, and retain customers. However, the effectiveness of these systems largely depends on the accurate assignment of customers to the most suitable recommendation models. This is where multi-armed bandit algorithms come into play. In this blog post, we will explore how aligning multi-armed bandits for dynamic customer allocation can significantly optimize the performance of recommendation models.
Multi-Armed Bandits Explained
For historical context the term Bandit comes from the slot machines present in casinos, they were coined bandit since they ended up “stealing your money” just like a bandit would. Multi-Armed comes from fact people used to pull the levers (arms) of many slot machines at once.
Defintions
Lots of deep theory exists in this space but from a practical perspective having a basic grasp on some defintions is more than enough to begin working with these algorithms.
- Bandit specific
- The Multi-Armed Bandit Problem essentially is the problem of not knowing what the optimal decision to make is when faced with multiple options (“arms”).
- The process by which you can attempt solve these problems is to understand the specific context and then apply Multi-Armed Bandit Algorithms which are purposefully designed to tackle these problems.
- Reward just refers to the payoff or return recieved after selecting a particular action or arm of the multi-armed-bandit.
- Each arm of the bandit algorithm is associated with a reward distribution which tells you the likliehood of recieving a reward on any given round. When a given arm is pulled the reward recieved can be thought of as a realisation of this distribution.
- This distribution typically is unknown to the entity pulling the arms.
- The goal is to maximize the reward over a series of trials (arm pullings) while minimizing the regret, which is the difference between the reward recieved from an arm pull and the maximum reward that you would recieve if you always had chosen the best arm.
- Just like in RL the reward hypothesis can be used to assert that all goals can be thought of as a maximization of cumulative reward.
- The rewards themselves are stochastic in nature, meaning that they are randomly dertermined by the underlying probability distribution.
- This randomness introduces uncertainty and complexity into the decision making process as you must balance the tradeoff between exploring new arms to gather more information about their reward distributions (exploration) and exploiting known arms that have shown a high rewards in the past (exploitation). This is known as the exploration-exploitation trade-off problem.
- Some problem forumlations might have rewards that are context dependent where the reward distributions change based on external or contextual information available at the time of making a decision. This could add an additional strategic layer to the problem as the optimal action may vary with changes in the context.
- The Multi-Armed Bandit Problem essentially is the problem of not knowing what the optimal decision to make is when faced with multiple options (“arms”).
- Business specific
- An episode refers to the duration of time which a given customer stays assigned to the same bandit after which they are reallocated based on some process.
- When applying customer allocations a standard way is by staggering.
- The best way to think about it is as follows:
- Take the full customer base.
- Divide them into groups dependent on the length of the episodes (e.g if episode length = 28 days then group size = 1/28th of the base).
- Each day a single group gets allocated to an experiment.
- Therefore each day there would exist a group that is expiring from there previous episode and getting reallocated to another experiment.
- After a full episode length (e.g 28 days) all customers (across all groups) would have been through an entire episode and therefore would begin being reassigned (rolled over) to another episode.
- The best way to think about it is as follows:
Resources
Theory is well established in this space and as of now a few universally recognised reference materials exist along with a host of additional articles.
- Books
- Introduction to Multi-Armed Bandits; Aleksandrs Slivkins
- Universally recognised textbook, not super long but gives a good touch on most things.
- Bandit Algorithms; Tor Lattimore & Csaba Szepesvari
- Universally recognised reference book, alot longer but provides good technical grounding on the topic.
- Introduction to Multi-Armed Bandits; Aleksandrs Slivkins
- Lecture Material
- Online Learning and Multi-Armed Bandits Oxford
- Provides some links to prerequisite materials which is nice.
- Online Probability Summary Book
- Useful to brush up on Beta & Gamma distributions in particular.
- Online Learning and Multi-Armed Bandits Oxford
- Articles
- Introduction to Thompson Sampling: the Bernoulli bandit
- Good intro with particularily nice visualations to help internalise the bandit learning process.
- Introduction to Thompson Sampling: the Bernoulli bandit
Algorithms and Techniques
Plenty of algorithms and variants of them exist, we’ll touch on the three more common ones.
ε-Greedy
Pros:
- Simplicity: Easier to implement and understand in terms of how it implements the trae-off between exploration and exploitation.
Cons:
- Inefficient Exploration: Randomly explores at a fixed rate (set by \(\epsilon\)). Intuitively you should explore more initially to gain data and overtime once you have gathered more data you should exploit however in this case depending on your \(\epsilon\) you might not explore enough initially and not exploit enough later on.
- Flexibility: You’d want to choose $\epsilon$ based on the context of the problem but tuning it to find the right value might not be easy.
- Sub-optimal Decisioning: Due to constant exploration rate $\epsilon$ converging to the optimal bandit might take longer.
ε-Greedy Algorithm
Upper Confidence Bound (UCB)
Pros:
- Incorporating uncertainty: Instead of random exploration of bandits at a constant rate the exploration is embedded into the measure the algorithm is maximising by weighting with the frequency that the particular bandit in question has been assigned up until that point. By adding the extra square root term it upweights things which haven’t been seen as much, making it a better explorer than ε-greedy.
- Higher average reward: UCB doesn’t randomly explore and often achieves a higher average reward compared to ε-greedy since the additional term means more exploration earlier on (less reward) but way less for later rounds (more reward) causing the average to be better more often than not down the stretch. This also fits more closely with our intuition.
Cons:
- Complexity: It is more complex to implement and understand compared to ε-greedy.
- Computational Overhead: Since the algorithm requires calculating an additional confidence bound it can be a little more demanding than ε-greedy.
- Initial Exploration: Due to weighted exploration it may initially explore sub-optimal bandits more frequently meaning higher initial regret.
UCB Algorithm
Thompson Sampling
Pros:
- Optimal Uncertainty Handling: Uses a Bayesian approach to model uncertainty, this is more sophisticated and has been shown to be empirically better than the other methods.
- More Adaptive: The exploration leverages the idea of sampling from posterior distributions which is a better representation than trying to either completely randomly explore (case of greedy) or altering the update measure (case of UCB). This leads to more efficient exploration and exploitation.
- Better Performance: It has been shown to outperform other algos in terms of cumulative reward across rounds.
Cons:
- Complexity: It is more complex to implement and understand compared to the other techniques.
- Computational Cost: Since the algorithm requires sampling and keeping track of the entire posterior distribution it can be costly especially if you have lots of bandits and rounds.
- Parameter Sensitive: It can be parameter sensitive in the sense that your choice on prior distributions.
Thompson Sampling Algorithm
Application to Recommendation Models
In the context of recommendation systems, multi-armed bandits can be applied to dynamically allocate customers to different recommendation models. Each recommendation model represents an “arm,” and the goal is to assign customers to the model that is most likely to provide them with relevant and satisfying recommendations. By using multi-armed bandit algorithms, the system can continuously learn and adapt its customer assignment strategy based on the observed performance of each recommendation model.
The better the performance the more assignments the arm will get and vice a versa.
Below I will share some of the core steps along with considerations you can take in order to apply bandit algorithms to your data. Specifically I’ll be showcasing the application of the Thompson Sampling algorithm
Defining our data
You need data for training your bandit algorithm as well as choosing the customers you want to perform assignments for. For demo purposes I’ll simultate some random data but in practice you’d want to create your dataset given your data sources. This most likely would require using some SQL dilect to transform your source data into something usable.
The key fields required:
- customer_id : int Some unique identifier for a customer, you can use multiple fields in practice to define a unique customer if a single id doesn’t exist.
- bandit_name : str The bandit that the given customer has been assigned too.
- reward : int Some reward corresponding to the performance of the customer, we will consider discrete reward her which could correpond to whether the customer clicked or engaged while being recommended content from the given bandit.
The training data typically would be some time back while the scoring would be based on the most recent events.
training_data = generate_fictitious_data(num_rows=10_000, customer_id_start=1)
assignment_data = generate_fictitious_data(num_rows=5000, customer_id_start=10_001)
generate_fictitious_data function
Training our bandit
- High level steps:
- Setup a dataset that can store our bandit parameters.
- Take these parameters and initialise our Multi-Armed (Thompson Sampling) bandit algo with them.
- Feed in our historic training dataset and use it to update our Multi-Armed bandit model.
- Update (overwrite) our stored bandit parameters with the final parameters after training.
- We can also add in some visualisations too.
- Visualise the posterior distributions for each bandit after training to gauge the models understanding.
Initialise key constants
episode_length = 28
bandit_names = ['A', 'B', 'C']
Create a bandit parameter dataset
You’ll want to have someplace to store your bandit parameters since over time these will change and you’d need to use the latest parameters during assignment. Here I have decided to create a dataframe to hold these but for a real production system you could use some data store.
bandit_parameters = create_bandit_parameters_dataframe(bandit_names)
bandit_parameters.head()
create_bandit_parameters_dataframe function
Train your bandit
We can now train our bandit by applying it to our historical training data, this allows the algorithm to learn upfront which bandit arm tends to give the most reward and so can these learnings can be put into good use when assigning customers.
ts, bandit_parameters = train_bandits(bandit_parameters,
training_data,
bandit_names)
bandit_parameters.head()
train_bandits functionality
Visualise the posterior distributions
It’s useful to get a feel for exactly what the model has learned and this can be done by visualising the posterior distributions generated for each of the bandits limbs. If the distributions are heavily peaked with differing expected values this can mean the model will exploit more since it’s more likely that a large reward will come from the distribution with the higher expected value (more divergence implies less overlap therefore unlikely to get anything from the low distributions).
plot_posterior_distributions(ts)
plot_posterior_distributions function
As you can see the distributions are fairly close togther and not divergent. This makes intuitive sense as it aligns with what we’d expect given the process used to construct our training dataset.
Performing customer assignment
- High level steps:
- Grab the customers that you want to allocate.
- Take the initially trained bandit algorithm and apply it to each customer making sure to update its parameters after each assignment.
- Make the assignments for all customers and store them alongside other metadata such as dates for the start and end of the assignment inside some data structure.
- Update our stored bandit parameters with the final values.
- We can also add some visualisations in too:
- Visualise the customer groups most recent reward distribution (based on the episode they are rolling out of).
- Visualise the final bandit allocation distributions.
Visualsing real reward distribution
plot_reward_distribution(assignment_data,
group_col="bandit_name",
reward_col="reward")
plot_reward_dsitrbution function
Performing Assignments
customer_assignments, bandit_parameters = assign_customers(bandit_parameters,
assignment_data,
bandit_names,
episode_length,
ts)
customer_assignments.head()
assign_customers function
Visualising assignment distributions
plot_bandit_assignment_distributions(customer_assignments,
"bandit_name",
"end_date")
plot_bandit_assignment_distributions function
Future Directions
Many ways to extend the work here, I have listed a few directions below:
- Handle continuous rewards
- At the moment the Thompson Sampling algorithm only applies for the discrete reward case.
- To adapt for continuous rewards things like the bayesian prior and update logic will need to be extended.
- Over aggresive assignment handling
- At the moment if you train on too much data your posterior distributions can become extremely peaked and divergent which can lead to almost always assigning to the most performant bandit.
- This might not always be desired espeically in a practical setting as you may want some variation to continue to monitor behaviour between the recommendation models.
- Some potential ways to adapt are:
- Sliding Window: This would cap the total amount of rewards that a given bandit could recieve thereby limiting how peaked the distributions can become. Challenge might be optimizing the threshold for the problem.
- Contextual reward handling
- In alot of real world cases the reward can vary over time due to a variety of factors including the context.
- To handle this you’d want to adapt the “Vanilla MAB” approach for more of a “Contextual MAB” approach.
- Plently of resources touch on it, this somewhat recent lecture at the PyData conference might be a nice place to start.
SlidingWindow Approach
When using a Sliding Window approach it might be necessary to restrict how peaked the posterior distributions get. In our case this is necessary since we want to limit the aggressiveness of the final limb assignments and if we don’t restrict the window size then the resulting distributions will end up diverging (minimal overlap) and therefore resulting samples would almost always be chosen from the distribution with the highest average expected reward.
sliding window algorithm
There are several different approaches to handle this problem as listed below:
- Empirical Tuning
- Steps:
- Specify your desired resultant assignment proportions along with an acceptable threshold deviation value away from these proportions.
- Looping over some predefined window sizes (smallest to largest).
- Each window size instantiates its own SlidingWindowThompsonSampling algorithm and is trained on some specified historical training dataset.
- Make assignments on a historical assignment dataset.
- Evaluating each window size based on how close resulting assignments are to desired proportions are on the historical assignment dataset.
- Choose the largest possible window size (exploitation) which still yields an acceptable level of closeness from the desired proportions.
- This can help the distributions from becoming too peaked by training historically on data and then measuring assignments on some other data and ensuring the proportions of assignments are close enough to what we desired. This way the window size has balanced exploiting known top-performing bandits and exploration by ensuring assignments are diverse.
- Steps:
- Adaptive Windowing
- Steps:
- Change the window size depending on the changes within the underlying data stream.
- Measure the mean between difference sub windows of your rewards for each bandit.
- If a difference is present between the mean reward of two sub-windows by some delta of the standard deviation then we say a difference is present.
- Upon each update of the distributions reward a calculation is done to see if a change is detected on any sub window.
- If yes then the window size is halved to allow the distribution to pick up the change more easily.
- If no then the window size is just incremented to make it a little higher.
- Change the window size depending on the changes within the underlying data stream.
- This can help the distributions from becoming too peaked by constantly adapting the window sizes over time as updates are made and ensures this is done dynamically.
- Steps:
Empircal Tuning of Window size
Here is a brief look at what the output of the empirical tuning process would look like. You’ll want to insure you use appropriate training and assignment datasets but the general process should look similar:
# Define parameters
bandit_names = ['A', 'B', 'C']
window_sizes = [25, 50, 75, 100, 500, 1000]
desired_proportions = {
'A': 0.33,
'B': 0.33,
'C': 0.34
}
difference_threshold = 0.10
# Generate training and assignment data
training_data = generate_fictitious_data(num_rows=10_000, customer_id_start=1)
assignment_data = generate_fictitious_data(num_rows=5_000, customer_id_start=10_001)
# Call the empirical_window_size_tuning function
best_window_size, best_assignments = empirical_window_size_tuning(
bandit_names=bandit_names,
window_sizes=window_sizes,
training_data=training_data,
assignment_data=assignment_data,
desired_proportions=desired_proportions,
difference_threshold=difference_threshold
)
empirical tuning functionality
You can then take a quick look at the resulting best window size and go back and use this inside your prior code (when defining the window size to use when training your intial bandit model).
# Display the results
print(f"Best Window Size: {best_window_size}")
if best_assignments is not None:
display(best_assignments.head())
else:
print("No window size met the desired proportions within the specified threshold.")
Output:
Best Window Size: 25
customer_id bandit_name
0 10001 C
1 10002 B
2 10003 B
3 10004 C
4 10005 A
If all things have gone well then when viewing the distribution of your assignments you should hopefully see resemblance to the proportions you specified at the beginning of tuning.
best_assignments['bandit_name'].value_counts(normalize=True)
In other words something like:
bandit_name
A 0.4206
C 0.3188
B 0.2606
Name: proportion, dtype: float64
Dynamic Adaption of Window size
Now I’ll leave this to the user to have a think about; I suspect a number of ways can be used to perform adaptive windowing. Below is a brief sketch out of one of those ways.
class AdaptiveSlidingWindowThompsonSampling(ThompsonSampling):
"""
Implement an Adaptive Sliding Window functionality onto our Thompson Sampling class.
Attributes
----------
bandit_names : list
The names of the bandits (arms).
alpha : np.ndarray
Array of alpha parameters (successes) for the Beta distribution of each bandit.
beta : np.ndarray
Array of beta parameters (failures) for the Beta distribution of each bandit.
window_size : int
The size of the window (max number of rewards that can be used to build posterior distributions).
rewards : Dict[list]
List of rewards (both a reward and no reward would add an element) for each of the bandits.
adwin : Dict[ADWIN]
An ADWIN class instance for each of the bandit names to monitor its reward and subsequently adjust window size to avoid it getting too peaked.
Methods
-------
update(chosen_bandit, reward)
Updates the parameters of the chosen bandit's Beta distribution based on the observed reward.
"""
def __init__(self, bandit_names, initial_window_size=1000, delta=0.002, **kwargs):
super().__init__(bandit_names, **kwargs)
self.window_size = initial_window_size
self.rewards = {name: deque() for name in bandit_names}
self.adwin = {name: ADWIN(delta) for name in bandit_names}
def update(self, chosen_bandit, reward):
rewards_deque = self.rewards[chosen_bandit]
adwin = self.adwin[chosen_bandit]
rewards_deque.append(reward)
adwin.add_element(reward)
if adwin.detected_change():
self.window_size = max(1, self.window_size // 2) # Shrink window to make it more adaptive to observed change
else:
self.window_size = min(self.window_size * 2, 1000) # Grow window to make estimates more stable
if len(rewards_deque) > self.window_size:
rewards_deque.popleft()
reward_sum = int(sum(rewards_deque))
new_alpha = 1 + reward_sum
new_beta = 1 + (len(rewards_deque) - reward_sum)
self.alpha[chosen_bandit] = new_alpha
self.beta[chosen_bandit] = new_beta
Where the quantification of growth and shrinkage of the window is somewhat arbitrary (might be worth tweaking).
ADWIN functionality
Conclusions
With all this we come to the end, hopefully it’s been insightful and have got a peer into the purpose behind Multi-Armed Bandit problems and specifically how Thompson Sampling can be appled to the case of Customer Assignment and how it can be beneficial.