Term Frequency-Inverse Document Frequency (commonly abbreviated as TF-IDF) is a formula commonly used in Natural Language Processing (NLP) to determine the relative importance of a word. The formula is composed of two sub-formulas: term frequency and inverse document frequency. The basic assumption is that if a word appears often in one document and rarely in every other document in the corpus, then it is very important to that specific document. In the following sections I will describe and give examples of the two parts of the formula, explain how they are combined, and then give an example of how to use this to summarize a whole document. This work is based on the content in my paper on extractive summarization methods.

Term Frequency

The term frequency of a term within a document can be described simply as a raw count of how many times the term appears within the document, as shown below.

$$ tf(t, d) = f_{t, d} $$

Where \(t\) is the term to search for and \(d\) is the document to search in.

This means that if the word cat shows up in your document 10 times then \(tf(\text{``cat''}, document) = 10\). There are other ways to calculate the term frequency; these include binary frequencies (1 if the word is in the document, else 0), frequencies scaled by document length, and others. Wikipedia has more information on different formulas for TF.
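To make this concrete, here is a minimal Python sketch of the raw-count term frequency. The tokenizer is a naive assumption for illustration; real code would use a library such as NLTK or spaCy.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, split on whitespace, strip punctuation.
    # An assumption for illustration, not a robust tokenizer.
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]

def tf(term, document_tokens):
    # Raw count of how many times `term` appears in the document.
    return Counter(document_tokens)[term]

doc = tokenize("The cat sat on the mat. The cat purred.")
print(tf("cat", doc))  # 2
```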

Inverse Document Frequency

The inverse document frequency can be thought of as an offset that makes sure words such as the, an, and, etc. do not get large term importance scores. The IDF score is lower for a word that is used throughout the corpus and higher for a word that is used in only one or two different documents. Wikipedia has more information on different formulas for IDF. The equation can be seen below.

$$ idf(t, D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right) $$

Where \(t\) is the term to search for and \(D\) is the set of all documents within the corpus.
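As a sketch, the same equation in Python, treating the corpus as a list of token lists (the function and variable names here are my own, not from any particular library):

```python
import math

def idf(term, corpus):
    # corpus: a list of documents, each a list of tokens.
    # Count the documents containing the term at least once. Note this
    # assumes the term appears somewhere in the corpus; otherwise the
    # denominator would be zero.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
print(idf("the", corpus))  # log(3/3) = 0.0
print(idf("cat", corpus))  # log(3/1) ≈ 1.10
```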

Putting it all Together

To calculate the TF-IDF of a term, its TF and IDF scores are multiplied together.

$$ \text{tf-idf}_{term}(t, d, D) = tf(t, d) \times idf(t, D) $$

Because TF-IDF cannot be applied to a sentence as a whole, we make the assumption that if the average TF-IDF score of the words in a sentence is high then the sentence's relevance is also high. This is seen in the equation below.

$$ \text{tf-idf}_{sentence}(S, d, D) = \frac{1}{|S|} \sum_{t \in S} \text{tf-idf}_{term}(t, d, D) $$
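Building on the sketches above, the two scoring functions might look like this (again an illustrative sketch, not the exact code behind the results below):

```python
def tf_idf_term(term, document_tokens, corpus):
    # The product of the two scores, as in the term-level equation.
    return tf(term, document_tokens) * idf(term, corpus)

def tf_idf_sentence(sentence_tokens, document_tokens, corpus):
    # Mean per-term TF-IDF over the words of the sentence.
    total = sum(tf_idf_term(t, document_tokens, corpus) for t in sentence_tokens)
    return total / len(sentence_tokens)
```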

Using this you can rank the sentences of a document within a corpus; a code sketch of such a ranking appears after the list below. For example, when run over the whole Moby Dick corpus, the top three most important sentences in chapter one are as follows:

  1. On the contrary, passengers themselves must pay.

  2. Whaling voyage by one Ishmael.

  3. For to go as a passenger you must needs have a purse, and a purse is but a rag unless you have something in it.
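Here is a rough sketch of how such a ranking could be produced with the helpers above. The period-based sentence splitter is a deliberate simplification; a real implementation would use something like nltk.sent_tokenize.

```python
def rank_sentences(document_text, corpus, top_n=3):
    # Naively split into sentences on periods and drop empty strings.
    # Assumes the document's tokens appear in the corpus (e.g. the
    # document is itself one of the corpus documents).
    sentences = [s.strip() for s in document_text.split(".") if s.strip()]
    document_tokens = tokenize(document_text)
    scored = [
        (tf_idf_sentence(tokenize(s), document_tokens, corpus), s)
        for s in sentences
    ]
    # Highest average TF-IDF first.
    return [s for score, s in sorted(scored, reverse=True)[:top_n]]
```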

Conclusions

TF-IDF is a good way to quickly decide on important sentences within a document. It is used by the autoTLDR bot on Reddit. However, because the algorithm cannot shorten sentences and has no understanding of the information they contain, it will never produce good summaries across the board.