
TF-IDF & BM25 Explained

May 1, 2024


This post is a summary of the Understanding TF-IDF and BM-25 article by KMW Technology. It goes straight to the point, so it can serve as a quick reminder of how these algorithms work.

These algorithms rank documents with respect to a query. They are used in search engines, recommendation systems, and similar applications. In essence, they score the relevance of a document to a query; scoring every document in a collection tells us which ones are most relevant.

TF-IDF

Assumptions

The TF-IDF algorithm is based on two concepts:

- Term Frequency (TF): the more often a term appears in a document, the more relevant that document is for the term.
- Inverse Document Frequency (IDF): terms that appear in many documents carry little discriminative information, so their weight is reduced. The logarithm dampens the ratio so that rare terms do not dominate the score.

[Figure: comparison of the IDF and the logarithm of the IDF.]

Formula

The TF-IDF formula is the following:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where $t$ is a term and $d$ is a document. We can expand the formula as follows:

$$\text{TF-IDF}(t, d) = (\text{count of } t \text{ in } d) \times \log{\frac{\text{total number of documents}}{\text{number of documents containing } t}}$$
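To make the formula concrete, here is a minimal Python sketch (the function name and the base-10 logarithm are my choices; base 10 matches the worked example below):

```python
import math

def tf_idf(term, doc, corpus):
    """Raw-count TF times log-ratio IDF, as in the formula above.

    `doc` is a list of tokens; `corpus` is a list of such documents.
    """
    tf = doc.count(term)  # count of t in d
    df = sum(1 for d in corpus if term in d)  # documents containing t
    if df == 0:
        return 0.0  # term never seen in the corpus: no score
    return tf * math.log10(len(corpus) / df)
```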

Example

Let’s walk through an example of vectorizing a small corpus consisting of two documents: D1 = “I love machine learning” and D2 = “I love deep learning”.

1. Vocabulary

First, we identify the unique terms in the corpus to build our vocabulary. We usually sort them alphabetically:

["deep", "I", "learning", "love", "machine"]

2. Term Frequency (TF)

Next, we calculate the Term Frequency (TF) for each term in each document. In this simple version, we just use the raw count.

| Term | D1 (“I love machine learning”) | D2 (“I love deep learning”) |
|---|---|---|
| deep | 0 | 1 |
| I | 1 | 1 |
| learning | 1 | 1 |
| love | 1 | 1 |
| machine | 1 | 0 |
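These raw counts are easy to reproduce, for example with Python's `Counter` (variable names are mine):

```python
from collections import Counter

vocabulary = ["deep", "I", "learning", "love", "machine"]
tf_d1 = Counter("I love machine learning".split())
tf_d2 = Counter("I love deep learning".split())

# Counter returns 0 for absent terms, matching the table above.
print([tf_d1[t] for t in vocabulary])  # [0, 1, 1, 1, 1]
print([tf_d2[t] for t in vocabulary])  # [1, 1, 1, 1, 0]
```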

3. Inverse Document Frequency (IDF)

Now we calculate the IDF for each term using the formula $\text{IDF}(t) = \log\frac{N}{\text{DF}(t)}$, where $N = 2$ is the total number of documents. We will use base 10 for the logarithm.

| Term | DF (Doc Frequency) | IDF Calculation | IDF Value |
|---|---|---|---|
| deep | 1 | $\log_{10}(\frac{2}{1})$ | 0.301 |
| I | 2 | $\log_{10}(\frac{2}{2})$ | 0 |
| learning | 2 | $\log_{10}(\frac{2}{2})$ | 0 |
| love | 2 | $\log_{10}(\frac{2}{2})$ | 0 |
| machine | 1 | $\log_{10}(\frac{2}{1})$ | 0.301 |
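The same IDF values can be computed in a few lines of Python (a sketch under the same base-10 convention):

```python
import math

corpus = ["I love machine learning".split(), "I love deep learning".split()]
vocabulary = ["deep", "I", "learning", "love", "machine"]

N = len(corpus)
df = {t: sum(1 for d in corpus if t in d) for t in vocabulary}
idf = {t: math.log10(N / df[t]) for t in vocabulary}
print({t: round(v, 3) for t, v in idf.items()})
# {'deep': 0.301, 'I': 0.0, 'learning': 0.0, 'love': 0.0, 'machine': 0.301}
```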

4. TF-IDF Vectors

Finally, we compute the TF-IDF score for each term in each document by multiplying TF by IDF.

For Document 1:

| Term | TF | IDF | TF-IDF |
|---|---|---|---|
| deep | 0 | 0.301 | 0 |
| I | 1 | 0 | 0 |
| learning | 1 | 0 | 0 |
| love | 1 | 0 | 0 |
| machine | 1 | 0.301 | 0.301 |

For Document 2:

| Term | TF | IDF | TF-IDF |
|---|---|---|---|
| deep | 1 | 0.301 | 0.301 |
| I | 1 | 0 | 0 |
| learning | 1 | 0 | 0 |
| love | 1 | 0 | 0 |
| machine | 0 | 0.301 | 0 |

The final TF-IDF vectors corresponding to the vocabulary ["deep", "I", "learning", "love", "machine"] are:

- D1: [0, 0, 0, 0, 0.301]
- D2: [0.301, 0, 0, 0, 0]

Notice how common words like “I”, “love”, and “learning” have a score of $0$ because they appear in all documents and provide no discriminative power in this small corpus. The unique words “machine” and “deep” receive higher scores, identifying the specific topic of each document.
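Putting the steps together, a short end-to-end sketch (names and structure are my own) reproduces the vectors above:

```python
import math
from collections import Counter

corpus = ["I love machine learning".split(), "I love deep learning".split()]
vocabulary = sorted(set(corpus[0]) | set(corpus[1]), key=str.lower)

N = len(corpus)
idf = {t: math.log10(N / sum(1 for d in corpus if t in d)) for t in vocabulary}

for i, doc in enumerate(corpus, start=1):
    tf = Counter(doc)
    print(f"D{i}:", [round(tf[t] * idf[t], 3) for t in vocabulary])
# D1: [0.0, 0.0, 0.0, 0.0, 0.301]
# D2: [0.301, 0.0, 0.0, 0.0, 0.0]
```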

Problems

TF-IDF has two weaknesses that motivate BM25:

- Term frequency grows without bound: a term that appears 100 times contributes 100 times the score of a single occurrence, even though the extra occurrences add little additional evidence of relevance.
- Document length is ignored: longer documents naturally contain more occurrences of a term, so they are favored over shorter ones that may be just as relevant.

BM25

The BM25 algorithm builds on TF-IDF while trying to solve some of its problems.

Assumptions

BM25 changes how the TF part of the score behaves, based on two assumptions:

- Term frequency saturation: additional occurrences of a term should yield diminishing returns, so the TF contribution is capped. The parameter $k_1$ controls how quickly the score saturates.
- Document length normalization: an occurrence counts for less in a long document than in a short one. The parameter $b$ controls how strongly the score is normalized by the ratio of the document length to the average document length.

[Figure: comparison of the TF and the TF with saturation for different values of $k_1$.]

Formula

Although some modifications have been proposed for the IDF part of the formula, the most important changes are the ones described in the Assumptions section above. The BM25 formula is the following:

$$\text{BM25}(t, d) = \frac{\text{TF}(t, d) \times (k_1 + 1)}{\text{TF}(t, d) + k_1 \times \left(1 - b + b \times \frac{\text{doc\_length}}{\text{avg\_doc\_length}}\right)} \times \log{\frac{\text{total number of documents}}{\text{number of documents containing } t}}$$
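Here is a minimal Python sketch of the formula, assuming the commonly used defaults $k_1 = 1.2$ and $b = 0.75$ (the post does not fix these values) and the same base-10 logarithm as the earlier example:

```python
import math

def bm25(term, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of `term` for `doc` (a token list) within `corpus`."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    avg_doc_length = sum(len(d) for d in corpus) / len(corpus)
    # Saturating TF with document length normalization.
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avg_doc_length))
    idf = math.log10(len(corpus) / df)
    return norm_tf * idf
```

Setting $b = 0$ disables the length normalization, while larger $k_1$ values weaken the saturation effect.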

Problems

Credits

This post is based on the Understanding TF-IDF and BM-25 article by KMW Technology.