
Many applications in life, such as recommendation systems and search engines, are based on ranking models. Yet, beginner-friendly materials on ranking are hard to find. For that reason, I will try to provide a gentle introduction to ranking models in this article.

1. Problem framing

Ranking is a task in machine learning that concerns the relative importance or relevance of different items. Perhaps it’s easier to understand ranking if we juxtapose its problem framing with 2 other (more) popular ML tasks: Classification and Regression.

Regression: Given object A that is represented by a list of characteristics X, predict its quantitative attribute y.

E.g. Given person James who has X = [Male, 40 y/o, Single], predict his y = grocery spending per week ($).

Classification: Given object A that is represented by a list of characteristics X, predict its categorical attribute y.

E.g. Given person James who has X = [Male, 40 y/o, Spends $65/week on groceries], predict his relationship status y = [Single or In a relationship].

We can see that both tasks concern only one object and aim to predict an attribute of that object. The ranking task, however, is different in that it concerns 2 object types at once.

Ranking: Given object A of type U and an array of objects B_0, B_1, …, B_k of type V, sort the array B in terms of compatibility with A. The pair (U, V) can represent many different things.

E.g. In the information retrieval literature, (U, V) = (Query, Document). Google is a prime example: given a query represented by the text X = "How to save the whale", Google returns a list of websites ranked by relevance.

E.g. In a recommender system, (U, V) = (User, Item). For example, given person James who has X = [Male, 40 y/o, Single], rank the grocery items that he would want to buy: [Beer, Pork, Bread, Salt].

2. Approaches for the ranking task

Now that we understand the problem, let’s discuss a few approaches to solve it.

First of all, to sort the array, we have to assign a numerical value to each of its elements. To be more specific, we assign a score to each pair (A, B_i). The different ways of assigning that score are what distinguish the approaches.

| (User, Item) pair | Score |
|---|---|
| (James, Beer) | 8 |
| (James, Salt) | 3 |
| (James, Pork) | 9 |
| (James, Bread) | 6 |

=> The ranked recommendation for James would be [Pork, Beer, Bread, Salt]
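
To make the sorting step concrete, here is a minimal Python sketch; the scores are assumed to come from whatever scoring model we end up with:

```python
# Hypothetical scores produced by some scoring model for James's candidate items
scores = {"Beer": 8, "Salt": 3, "Pork": 9, "Bread": 6}

# Ranking = sorting the candidate items by their score, highest first
ranked_items = sorted(scores, key=scores.get, reverse=True)
print(ranked_items)  # ['Pork', 'Beer', 'Bread', 'Salt']
```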

We will take a look at 2 approaches, Point-wise propensity prediction and Pair-wise ranking, and we will use the example above to guide the process.

2.1. Point-wise propensity prediction

In this approach, we want to model the probability of an arbitrary item being bought by James and use that probability as the ranking score. To do that, we frame it as a binary classification problem: whether or not an item was bought in the past.

We form the ground-truth dataset by looking at the purchase history and choosing the positive examples (the 1s) as the items that James has bought in the past. However, “right answers” alone are not enough for the model to learn; we also need to provide “wrong answers”. We create those by randomly sampling items from the set of items that James didn’t buy. This is called “Negative sampling”. The number of negative examples can be adjusted relative to the number of positive examples; a factor of 1x (the same number) or 2x is the most popular choice.

To increase the volume of data, we do the same for all users, not just James. Let’s say James bought Chicken and Wine in the past and Alex bought Shampoo; the ground-truth set will look like this:

| (User, Item) pair | Bought |
|---|---|
| (James, Chicken) | 1 |
| (James, Wine) | 1 |
| (James, Butter) | 0 |
| (James, Eggs) | 0 |
| (Alex, Shampoo) | 1 |
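
As a rough sketch of how such a ground-truth table could be built, the snippet below samples the negatives from the items each user did not buy. The catalogue, the users, and the 1x negative ratio are illustrative assumptions, and the sampled negatives will vary with the random seed:

```python
import random

catalogue = ["Chicken", "Wine", "Butter", "Eggs", "Shampoo", "Beer", "Salt"]
purchases = {"James": ["Chicken", "Wine"], "Alex": ["Shampoo"]}

def build_ground_truth(purchases, catalogue, neg_ratio=1, seed=42):
    """Return (user, item, label) rows with negatives sampled from unbought items."""
    rng = random.Random(seed)
    rows = []
    for user, bought in purchases.items():
        # Positive examples: items the user actually bought
        rows += [(user, item, 1) for item in bought]
        # Negative sampling: randomly pick unbought items, neg_ratio per positive
        not_bought = [item for item in catalogue if item not in bought]
        n_neg = min(len(not_bought), neg_ratio * len(bought))
        rows += [(user, item, 0) for item in rng.sample(not_bought, n_neg)]
    return rows

print(build_ground_truth(purchases, catalogue))
```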

Once we have the training data, all that’s left is to train a binary classification model to distinguish between the 1 pairs and the 0 pairs. This can be done with a variety of algorithms such as logistic regression, decision trees, boosted trees, neural networks, etc. The output of such a model represents the probability of being bought, and we can use it to rank new items for each user.
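
A minimal point-wise sketch with scikit-learn might look like the following. The feature encoding is a deliberately crude one-hot of the user and item identities; better ways to build the feature vector X are discussed in section 3:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Ground-truth rows from the table above: (user, item, bought)
rows = [("James", "Chicken", 1), ("James", "Wine", 1),
        ("James", "Butter", 0), ("James", "Eggs", 0),
        ("Alex", "Shampoo", 1)]

# Crude features: one-hot encode the user and item identities
vec = DictVectorizer()
X = vec.fit_transform([{"user": u, "item": i} for u, i, _ in rows])
y = [label for _, _, label in rows]

model = LogisticRegression().fit(X, y)

# Rank new candidate items for James by their predicted purchase probability.
# (Items never seen in training collapse to the same score with this toy encoding.)
candidates = ["Beer", "Pork", "Bread", "Salt"]
probs = model.predict_proba(
    vec.transform([{"user": "James", "item": c} for c in candidates]))[:, 1]
print([item for _, item in sorted(zip(probs, candidates), reverse=True)])
```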

2.2. Pair-wise ranking model

Note that in method 1 above, we only care about each individual (user, item) pair’s “probability of being bought” and use that to rank the items. For the pair-wise approach, we will train the model to predict a score $s$ directly by using pairs of bought/not-bought items.

We can start by forming the “ground truth” dataset as above. Then we create pairs of bought - not bought items associated with James.

| (User, (Bought, Not Bought)) pair | True? |
|---|---|
| (James, (Chicken, Butter)) | 1 |
| (James, (Butter, Chicken)) | 0 |
| (James, (Wine, Butter)) | 1 |
| (James, (Wine, Eggs)) | 1 |

Notice that $(James, (Chicken, Butter)) = 1$ because $Chicken > Butter$, while $(James, (Butter, Chicken)) = 0$ because $Butter < Chicken$. However, these 2 data points carry the same information, just expressed in different ways, so we can eliminate the second one without any loss of information.
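
Here is a small sketch of how those pairs could be generated from the point-wise ground truth, keeping only the 1-labelled orientation as discussed above:

```python
from itertools import product

# Per-user bought / not-bought items from the ground-truth table
bought = {"James": ["Chicken", "Wine"]}
not_bought = {"James": ["Butter", "Eggs"]}

def make_pairs(bought, not_bought):
    """Yield (user, preferred_item, other_item) with the bought item always first."""
    for user in bought:
        for pos, neg in product(bought[user], not_bought[user]):
            yield (user, pos, neg)  # label is implicitly 1: pos should outrank neg

print(list(make_pairs(bought, not_bought)))
# [('James', 'Chicken', 'Butter'), ('James', 'Chicken', 'Eggs'),
#  ('James', 'Wine', 'Butter'), ('James', 'Wine', 'Eggs')]
```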

You might ask: “But without 0s, how can you train a model? Didn’t you just say in the previous section that the model needs both 1s and 0s?”

The answer is: the model can still learn because it aims to predict the latent ranking score $s$ for each item w.r.t. a user, not the probability of any pair being ranked higher or lower.

The latent ranking score is a hidden attribute that we assume every (user, item) pair has; it could be a combination of the user’s taste and the item’s characteristics aggregated into 1 score. We make the model learn to predict that score by creating a loss function that takes a ground-truth example of (user, (item1, item2)) = 1 and provides feedback on whether or not the model predicts the correct relative ranking. For example, if James likes Chicken more than Eggs, then the model’s prediction should be $s_{Chicken} > s_{Eggs}$ for James. A popular pair-wise loss function is the RankNet loss (Burges 2010):

\(C = \log(1 + e^{-\sigma(s_i - s_j)})\)

where item $i$ is more relevant than item $j$, and $\sigma$ is a parameter that controls the shape of the sigmoid.
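
As a minimal sketch, the RankNet loss can be written in a couple of lines of NumPy; $\sigma$ here is the shape parameter from the formula, and the scores are assumed to come from whatever scoring model we are training:

```python
import numpy as np

def ranknet_loss(s_i, s_j, sigma=1.0):
    """Pair-wise loss for a pair where item i is known to be preferred over item j."""
    return np.log(1.0 + np.exp(-sigma * (s_i - s_j)))

# If the model gets the order right (s_i > s_j), the loss is small...
print(ranknet_loss(s_i=2.0, s_j=0.5))   # ~0.20
# ...and if it gets the order wrong, the loss is large.
print(ranknet_loss(s_i=0.5, s_j=2.0))   # ~1.70
```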

If you are not familiar with latent factors or loss functions, don’t worry about it. They exist in most machine learning algorithms. For example in method 1 (Point-wise propensity prediction approach) above, if we use a neural network, the propensity is the latent factor and it uses binary cross-entropy as the loss function.

The reason I mention the RankNet loss is to highlight the fact that it does not work on 1 example alone like in method 1, but on a pair of examples (user, i) and (user, j).

Please note that the scores $s_i$ and $s_j$ are only comparable for the same user. It’s not meaningful to say: John likes Beer more than Sally likes Chicken wings.

3. Implementation

Finally, let’s talk about how to build a ranking model using Python code.

In section 1, I provided examples of the feature vector X for the regression and classification tasks. That was easy because they were only about James, so we just needed to provide relevant characteristics of James. How do we expand that to represent a pair of (user, item)? There are many different ways, but for simplicity we will only discuss the simplest form in this article: concatenating the user’s attributes with the item’s attributes. For example, for the pair (James, Beer), we can try a few different approaches:

  1. Represent Beer by its type and nutritional value and concatenate with James’s attributes => X = [Male, 40 y/o, Single, Alcoholic Drink, low sugar content, low protein content]. The first 3 describe James and the latter 3 describe Beer (see the short sketch after this list). This approach often works when we have detailed information about the items, but it can fail to capture differences in “taste”: both Beer and Wine might have the same features [Alcoholic Drink, low sugar content, low protein content], yet some users will like Beer and others Wine.
  2. Represent Beer by its identity and concatenate with James’s attributes => X = [Male, 40 y/o, Single, Beer]. This approach addresses “taste” but only works when the number of items is limited (say, < 100-ish). Of course, you will then have to encode Beer’s identity as a numerical value or use an algorithm that can handle categorical data. When the number of items is large (e.g. Amazon’s product recommendations), it is no longer feasible.
  3. Represent Beer by its learned latent factors and concatenate with James’s attributes. There are different ways to learn the latent factors, such as Matrix Factorization or Neural Collaborative Filtering, but they are out of the scope of this article.
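
To make option 1 concrete, here is a minimal sketch of the concatenation; the attribute dictionaries and their numeric encodings are purely illustrative assumptions:

```python
import numpy as np

# Illustrative attribute dictionaries (the encodings are assumptions, not a standard)
users = {"James": {"is_male": 1, "age": 40, "is_single": 1}}
items = {"Beer":  {"is_alcoholic": 1, "sugar": 0.0, "protein": 0.0},
         "Bread": {"is_alcoholic": 0, "sugar": 0.05, "protein": 0.09}}

def pair_features(user, item):
    """Concatenate the user's attribute vector with the item's attribute vector."""
    u, v = users[user], items[item]
    return np.array([u["is_male"], u["age"], u["is_single"],
                     v["is_alcoholic"], v["sugar"], v["protein"]], dtype=float)

print(pair_features("James", "Beer"))   # [ 1. 40.  1.  1.  0.  0.]
```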

Once you have the X (predictors, described above) and the y (target variable, described in section 2), all you need to do is use an ML algorithm to build a model. For approach 1 (Point-wise), it is as easy as using scikit-learn to train a LogisticRegression model. For approach 2 (Pair-wise), we are lucky because LightGBM provides an implementation in LGBMRanker; you can find a short tutorial on Stack Overflow (see the references).
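
For the pair-wise route, a rough LGBMRanker sketch is shown below. Note that LGBMRanker takes per-item relevance labels together with a `group` array saying how many rows belong to each user (query) and forms the item pairs internally, so the input looks like the grouped point-wise table rather than explicit pairs. The synthetic data here is purely illustrative:

```python
import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)

# Synthetic toy data: 50 users, 20 items each, 6 concatenated user+item features per row.
# Shapes, features, and labels are all illustrative, not a real dataset.
n_users, n_items, n_features = 50, 20, 6
X, y, group = [], [], []
for _ in range(n_users):
    X.append(rng.normal(size=(n_items, n_features)))   # feature rows for this user's items
    y.append(rng.integers(0, 2, size=n_items))          # 0/1 relevance labels (bought or not)
    group.append(n_items)                                # all items of one user form one group
X, y = np.vstack(X), np.concatenate(y)

# LGBMRanker builds the bought/not-bought pairs internally from the labels and groups
ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)

# Rank a new user's candidate items by the predicted scores (higher = better)
candidates = rng.normal(size=(5, n_features))
print(np.argsort(-ranker.predict(candidates)))
```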

4. Conclusion

Overall, this article provides a beginner’s introduction to ranking models, from problem framing to approaches to implementation. I hope you find it useful.

5. References

  • From RankNet to LambdaRank to LambdaMART: An Overview by Christopher J.C. Burges, 2010, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf
  • LightGBM ranking example, Stack Overflow, accessed on May 11, 2023, https://stackoverflow.com/a/67621253
