Background
Collaborative Filtering (CF) methods build on the intuition that users with similar tastes should be recommended similar items. CF became widely known following the strong performance of such approaches in the Netflix Prize competition in the years leading up to the grand prize being awarded in 2009. CF methods are still seen as a strong algorithmic option for industrial recommendation systems in certain settings.
Collaborative filtering can be performed at varying levels of complexity. The diagram below places the different forms of CF in the context of recommendation more generally. We cover each of these in turn below, after defining some mathematical notation/formalism used throughout.
Formalism
Consider a recommendation setting with \(N_I\) items and \(N_U\) users. Denote the rating of item \(i\) according to user \(j\) by \(R_{(i,j)}\). The resulting dataset, i.e. \(\{ R_{(i,j)} \vert i=1,\dots, N_I ; j=1,\dots, N_U \}\), can be represented by the matrix defined below:
\[R = \begin{pmatrix} R_{(1,1)} & R_{(1,2)} & R_{(1,3)} & \dots & R_{(1,N_U)}\\ R_{(2,1)} & R_{(2,2)} & R_{(2,3)} & \dots & R_{(2,N_U)}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ R_{(N_I,1)} & R_{(N_I,2)} & R_{(N_I,3)} & \dots & R_{(N_I,N_U)} \end{pmatrix}\]
Note that in practice this matrix will contain missing values, since most users will not have rated most items. For example, in the case of 3 items and 4 users, it may resemble
\[R = \begin{pmatrix} 1 & ? & 9 & 3 \\ 8 & 3 & ? & 8 \\ ? & 7 & 8 & 2 \end{pmatrix}\]
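As a minimal sketch (using numpy, with `np.nan` standing in for the "?" entries), this example matrix and its observed-entry mask could be represented as:

```python
import numpy as np

# The hypothetical 3-item x 4-user example above, with np.nan marking
# the unobserved ("?") entries.
R = np.array([
    [1, np.nan, 9, 3],
    [8, 3, np.nan, 8],
    [np.nan, 7, 8, 2],
], dtype=float)

observed = ~np.isnan(R)  # boolean mask of which (item, user) pairs were rated
```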
Memory-Based CF
Memory-based CF can be broadly categorised into user-based and item-based CF.
User-Based CF
The idea with user-based CF (also known as user-user CF) is to provide suitable recommendations to a user by looking at the likes/dislikes of users similar to them. A standard way to do this (for concreteness, consider recommending $n$ items for user $i$) is as follows; a code sketch of the procedure is given after the list:
- Replace missing values in the matrix $R$ with $0$s (and possibly subtract each user’s mean score from their scores to reduce the effects of more/less generous users).
- Compute the similarity (using e.g. cosine similarity or Pearson correlation) between user $i$ and all other users. This will result in a similarity matrix $S$, whose $ij^{\text{th}}$ component $S_{(i,j)}$ is the similarity between user $i$ and user $j$.
- For users $j$ with $S_{(i,j)}$ above some threshold, or for the top $m$ most similar users, compute the similarity-weighted score for each item $k$ rated by those users, i.e. $\sum_{j} S_{(i,j)} R_{(k,j)}$ (optionally normalised by $\sum_{j} S_{(i,j)}$ to give a weighted average).
- The items with the top $n$ similarity-weighted scores (excluding items user $i$ has already rated) can be recommended.
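One possible rendering of these steps, as a minimal numpy sketch (the function name and neighbourhood size are illustrative, and a production system would use sparse matrices), might look like:

```python
import numpy as np

def user_based_scores(R, target_user, top_m=10):
    """Similarity-weighted item scores for one user (illustrative only).

    R is the (N_I, N_U) ratings matrix with np.nan for missing entries.
    """
    R_filled = np.nan_to_num(R, nan=0.0)          # step 1: missing ratings -> 0
    users = R_filled.T                            # one row per user

    # Step 2: cosine similarity between the target user and every user.
    norms = np.linalg.norm(users, axis=1) + 1e-12
    sims = users @ users[target_user] / (norms * norms[target_user])
    sims[target_user] = -np.inf                   # exclude the user themselves

    # Step 3: keep only the top-m most similar users as neighbours.
    top_m = min(top_m, sims.size - 1)             # at most all other users
    weights = np.zeros_like(sims)
    neighbours = np.argsort(sims)[-top_m:]
    weights[neighbours] = sims[neighbours]

    # Similarity-weighted score for every item.
    return R_filled @ weights

# Step 4: recommend the n highest-scoring items (excluding already-rated ones), e.g.
# scores = user_based_scores(R, target_user=0); top_n = np.argsort(scores)[::-1][:n]
```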
Item-Based CF
The idea with item-based CF (also known as item-item CF) is to recommend items similar to one that a user has shown a liking for (think Amazon's "customers who bought this item also bought"). A standard way to do this (for concreteness, consider recommending $n$ items given a purchase of item $i$) is as follows; a code sketch follows the list:
- Replace missing values in the matrix $R$ with $0$s (and possibly subtract each user’s mean score from their scores to reduce the effects of more/less generous users).
- Compute the similarity (using e.g. cosine similarity or Pearson correlation) between item $i$ and all other items. This will result in a similarity matrix $S$, whose $ij^{\text{th}}$ component $S_{(i,j)}$ is the similarity between item $i$ and item $j$.
- For items $j$ with $S_{(i,j)}$ above some threshold, or for the top $m$ most similar items, compute the similarity-weighted average rating for each such item.
- The items with the top $n$ similarity-weighted scores can be recommended.
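An analogous numpy sketch for the item-based case, simplified to rank candidate items purely by their cosine similarity to item $i$ (skipping the rating-weighting step; names are illustrative):

```python
import numpy as np

def item_based_scores(R, target_item):
    """Cosine similarity of every item to target_item (illustrative only).

    R is the (N_I, N_U) ratings matrix with np.nan for missing entries;
    items are compared via their vectors of user ratings.
    """
    R_filled = np.nan_to_num(R, nan=0.0)            # missing ratings -> 0
    norms = np.linalg.norm(R_filled, axis=1) + 1e-12
    sims = R_filled @ R_filled[target_item] / (norms * norms[target_item])
    sims[target_item] = -np.inf                     # don't recommend the item itself
    return sims

# Recommend the n items most similar to item i, e.g.
# sims = item_based_scores(R, target_item=0); top_n = np.argsort(sims)[::-1][:n]
```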
Model-Based CF
The idea with model-based CF is to use a generative approach to model the matrix $R$. In what follows we use a linear generative model; in principle this could be extended to something more complex (e.g. a neural network) if desired.
To see how this works, imagine that we knew that each item $i$ could be characterised by a vector $\mathbf{q}^{(i)} \in \mathbb{R}^K$, with $K < N_I$ and $K < N_U$. In a movie setting, this could capture e.g. how arty/new/action-packed the movie is; in a Next Best Action (NBA) setting, it would likely capture the type of product/service being targeted by the card. Imagine that we also knew that each user could be characterised by a vector $\mathbf{p}^{(j)} \in \mathbb{R}^K$, capturing how much that user likes arty/new/action-packed movies or, in the NBA setting, the type of product/service they would be most receptive to. Then the quantity $\hat{R}_{(i,j)} = \langle \mathbf{q}^{(i)}, \mathbf{p}^{(j)} \rangle$
could be used to represent our estimate of $R_{(i,j)}$.
Training
In model-based collaborative filtering we aim to learn these $\mathbf{q}^{(i)}$s and $\mathbf{p}^{(j)}$s. This is done by minimising the squared loss observed when predicting $R_{(i,j)}$ via $\hat{R}_{(i,j)} = \langle \mathbf{q}^{(i)}, \mathbf{p}^{(j)} \rangle$ for observed entries, i.e. by solving the optimisation problem
\[\textrm{minimise w.r.t. } \{\mathbf{q}^{(i)}\}_{i=1}^{N_I}, \{\mathbf{p}^{(j)}\}_{j=1}^{N_U} \textrm{ the objective function } \sum_{( \textrm{observed } i,j)} \left( R_{(i,j)} - \langle \mathbf{q}^{(i)}, \mathbf{p}^{(j)} \rangle \right)^2.\]
In practice, this objective is also regularised by adding $L_2$ terms of the form $\sum_i \| \mathbf{q}^{(i)} \|^2_2$ and $\sum_j \| \mathbf{p}^{(j)} \|^2_2$. This objective can be minimised using gradient descent or alternating least squares.
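As an illustrative sketch of the stochastic-gradient-descent route (all names and hyperparameters here are hypothetical; at scale one would typically reach for a library implementation of ALS or similar):

```python
import numpy as np

def train_mf(R, K=8, lr=0.01, reg=0.1, epochs=100, seed=0):
    """Learn item factors Q (N_I x K) and user factors P (N_U x K) via SGD.

    Minimises the regularised squared error over the observed (non-nan)
    entries of R. An illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    N_I, N_U = R.shape
    Q = 0.1 * rng.standard_normal((N_I, K))    # item embeddings q^(i)
    P = 0.1 * rng.standard_normal((N_U, K))    # user embeddings p^(j)

    observed = list(zip(*np.where(~np.isnan(R))))
    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j in observed:
            q_i, p_j = Q[i].copy(), P[j].copy()
            err = R[i, j] - q_i @ p_j                   # prediction error on this entry
            Q[i] += lr * (err * p_j - reg * q_i)        # gradient step on q^(i)
            P[j] += lr * (err * q_i - reg * p_j)        # gradient step on p^(j)
    return Q, P
```

Each update nudges $\mathbf{q}^{(i)}$ and $\mathbf{p}^{(j)}$ to reduce the squared error on a single observed rating, while the `reg` terms shrink the embeddings in line with the $L_2$ regularisation above.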
The result of this optimisation procedure/training phase is a set of ${\mathbf{q}^{(i)}}$s and ${\mathbf{p}^{(j)}}$s which can be grouped into matrices of the form
\[Q = \begin{pmatrix} \leftarrow & \mathbf{q}^{(1)} & \rightarrow \\ \leftarrow & \mathbf{q}^{(2)} & \rightarrow \\ \vdots & \vdots & \vdots \\ \leftarrow & \mathbf{q}^{(N_I)} & \rightarrow \end{pmatrix}, \quad P = \begin{pmatrix} \leftarrow & \mathbf{p}^{(1)} & \rightarrow \\ \leftarrow & \mathbf{p}^{(2)} & \rightarrow \\ \vdots & \vdots & \vdots \\ \leftarrow & \mathbf{p}^{(N_U)} & \rightarrow \end{pmatrix}.\]
In this format, the predicted ratings are given by $\hat{R} = Q P^\mathsf{T}$. Note the similarity of this to the Singular Value Decomposition; as a result, this approach is often referred to as “(Low Rank) Matrix Factorisation.”
Using the outputs of training
The collaborative filtering training procedure leaves us with the matrices $P$ and $Q$, which can be used to perform recommendation in various ways. For example, one could compute $\hat{R} = Q P^\mathsf{T}$, read off the highest value in user $j$'s column and recommend the item associated with that entry. One could also apply cosine similarity in the item embedding space: if a customer is a fan of item $i$, recommend the item whose embedding has the greatest cosine similarity to that of item $i$. Having the user and item embeddings $P$ and $Q$ also facilitates straightforward visualisation and clustering-style analysis of users and items.
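A minimal sketch of both of these uses, assuming the $Q$ and $P$ returned by the training sketch above (function names are illustrative):

```python
import numpy as np

def recommend_for_user(Q, P, user_j, n=5):
    """Top-n items for user j by predicted rating, using R_hat = Q P^T."""
    scores = Q @ P[user_j]                  # predicted ratings for every item
    # In practice, items the user has already rated/seen would be masked out.
    return np.argsort(scores)[::-1][:n]

def most_similar_items(Q, item_i, n=5):
    """Top-n items by cosine similarity to item i in the embedding space."""
    norms = np.linalg.norm(Q, axis=1) + 1e-12
    sims = Q @ Q[item_i] / (norms * norms[item_i])
    sims[item_i] = -np.inf                  # exclude the item itself
    return np.argsort(sims)[::-1][:n]
```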
Application to Next Best Action systems
When it comes to NBAs, the idea is to use the training procedure above to learn the embeddings, and then use these to predict a customer's rating for each card they have not yet been shown. The next best action for that customer is then the card with the highest predicted rating.
Some important considerations that might push you away from using model-based CF for NBAs are:
- The training procedure is computationally intensive.
- The matrix $R$ is typically very sparse in NBA settings (each customer interacts with only a small number of cards), so the learned embeddings, and hence the predictions, may be poor.
- None of the above methodologies use customer or item metadata that would typically be available in an NBA setting.
Conclusion
Collaborative filtering is a strong algorithmic option for recommendation systems, including NBAs, but it is not without its drawbacks. If you do use a CF approach for an NBA system, it is recommended that you supplement it as part of a hybrid approach (e.g. combining it with content-based methods, contextual bandits, or reinforcement learning).