One of the problems in Natural Language Processing (and a problem I'm facing at work) is how to cluster documents into groups based on their contents. There are two broad approaches to grouping documents: supervised machine learning, which relies on labeled data, and unsupervised learning, which groups the data without any prior labels. Both approaches have their ups and downs that I will not go into here. For the task of clustering stock descriptions we will be using unsupervised learning, more specifically k-means clustering.

The starting point for the dataset is the S&P 500 list. Using this list, I wrote a small script to download the descriptions of the stocks from Robinhood (a sketch of that step follows the table). The format of the resulting tab separated file can be seen in the table below.

AAP    Advance Auto Parts, Inc. engages in the supply and distribution of aftermarket automot…
CVS    CVS Health Corp. engages in the provision of health care services. It operates through…
WMB    The Williams Cos., Inc. operates as an energy infrastructure company, which explores,…
GOOG   Alphabet, Inc. is a holding company, which engages in the business of acquisition and…
MSFT   Microsoft Corp. engages in the development and support of software, services, devices…
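
The original download script is not shown here, but a minimal sketch of that step might look like the following. The api.robinhood.com/fundamentals/ endpoint and its description field are assumptions about Robinhood's undocumented public API, not a stable interface.

# Hypothetical sketch of the download step; the endpoint and JSON layout
# are assumptions about Robinhood's undocumented API and may change
import csv
import requests

def fetch_description(ticker: str) -> str:
    response = requests.get(f"https://api.robinhood.com/fundamentals/{ticker}/")
    response.raise_for_status()
    return response.json().get("description", "")

def write_descriptions(tickers, path="stock_descs.tsv"):
    # Write one "ticker<TAB>description" row per stock
    with open(path, "w", newline="") as file:
        writer = csv.writer(file, delimiter="\t")
        for ticker in tickers:
            writer.writerow([ticker, fetch_description(ticker)])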

Some imports[1] are required before we can get going; then we can collect all the stocks into a list of (ticker, description) tuples.

# Import the data
with open("stock_descs.tsv") as file:
    stocks = [tuple(line.rstrip("\n").split("\t")) for line in file]
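
Each entry is now a (ticker, description) tuple, which we can spot-check (the exact values depend on the downloaded file):

# Peek at the first parsed entry
print(stocks[0][0])       # the ticker, e.g. "AAP"
print(stocks[0][1][:60])  # the first 60 characters of its description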

Then we can train a sklearn TfidfVectorizer model on the data and transform all of the descriptions into their tf-idf values for each word in the corpus.

# Train the TfidfVectorizer
corpus = [stock[1] for stock in stocks]
# gensim's preprocess_string stems and cleans each description, which is
# why the keywords later come out stemmed (e.g. "insur")
tf_idf = TfidfVectorizer(stop_words="english",
                         preprocessor=lambda doc: " ".join(preprocess_string(doc)))
tf_idf_matrix = tf_idf.fit_transform(corpus)
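
As a quick sanity check, we can look at the shape of the resulting matrix and a few vocabulary terms; the exact numbers depend on the downloaded descriptions:

# One row per stock description, one column per vocabulary term
print(tf_idf_matrix.shape)
# A few of the stemmed terms that make up the vocabulary
print(tf_idf.get_feature_names()[:5])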

We can then train KMeans models with k values ranging from 2 to 29. Using the elbow method, we find that k = 13 is a reasonable choice.

# Elbow method
scores = []
ks = range(2, 30)
for k in ks:
    kmeans = KMeans(n_clusters=k).fit(tf_idf_matrix)
    # score() returns the negative inertia (sum of squared distances to the centroids)
    scores.append(kmeans.score(tf_idf_matrix))
# Plot the elbow
plt.plot(ks, scores, 'bx-')
plt.xlabel('# Clusters')
plt.ylabel('KMeans Score')
choice = 13
# Circle the chosen elbow point (scores is indexed from k = 2)
plt.gca().add_artist(
    Ellipse((choice, scores[choice - 2]), 1.125, 4, color='r', fill=False, lw=5)
)
plt.title('The Elbow Method showing the optimal k')
plt.show()

[Figure: elbow plot of KMeans score against number of clusters, with the elbow at k = 13 circled]
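
Reading the elbow off the plot is subjective. A rough programmatic alternative, sketched here as a heuristic rather than as part of the original pipeline, is to look for the sharpest bend in the inertia curve via its discrete second difference:

# kmeans.score() returns the negative inertia, so flip the sign back to
# get the usual decreasing inertia curve
inertias = np.array([-s for s in scores])
# The second difference peaks where the curve bends most sharply; the +1
# offset maps the double difference back onto its middle point
elbow_k = ks[int(np.argmax(np.diff(inertias, n=2))) + 1]
print(elbow_k)  # may not agree exactly with the eyeballed k = 13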

Then we can sort the stocks into their predicted clusters.

# Train the KMeans clusterer with the specified k
kmeans = KMeans(n_clusters=13, random_state=0)
predictions = kmeans.fit_predict(tf_idf_matrix)

# Gather all clusters and their stocks
clusters = defaultdict(list)
for i, prediction in enumerate(predictions):
    clusters[prediction].append(stocks[i])
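
Before digging into any single cluster, a quick tally (not part of the original post) shows how the stocks spread across the thirteen clusters:

# Print each cluster's size and a few example tickers
for cluster_number in sorted(clusters):
    members = clusters[cluster_number]
    print(cluster_number, len(members), [ticker for ticker, _ in members[:3]])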

And finally, we can print the top keywords and the descriptions for any given cluster. In this case I chose cluster 5, which seems to be some sort of life insurance cluster.

def top_words(cluster_number: int) -> List[str]:
    # Sum each term's tf-idf weight across all descriptions in the cluster
    tf_idf_scores = tf_idf.transform([stock[1] for stock in clusters[cluster_number]]).sum(axis=0)
    scored_words = [(tf_idf_scores[0, i], term) for i, term in enumerate(tf_idf.get_feature_names())]
    # Sort terms by descending weight and keep the top five
    sorted_words = sorted(scored_words, key=lambda x: -x[0])
    return [word[1] for word in sorted_words][:5]

def pretty_print_cluster(cluster_number: int) -> None:
    print("Keywords:")
    print("\n".join(top_words(cluster_number)))
    print("\nStock Descriptions:")
    print("\n".join(stock[1] for stock in clusters[cluster_number]))

pretty_print_cluster(5)

Keywords:

  • insur

  • life

  • segment

  • individu

  • group

Stock Descriptions:

  • Aflac, Inc. is a holding company, which engages in the provision of financial protection services…

  • The Allstate Corp. engages in the property and casualty insurance business and the sale of life, accident, and health insurance products through its subsidiaries…

  • American International Group, Inc. engages in the provision of a range of property casualty insurance, life insurance, retirement products, and other financial services to commercial and individual customers…

  • Anthem, Inc. provides life, hospital and medical insurance plans. It offers a broad spectrum of network-based managed care health benefit plans to the large and small employer, individual, Medicaid, and Medicare markets…

  • and more…
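
As a side note, because KMeans stores its centroids in the same tf-idf space, the largest centroid coordinates offer an alternative route to cluster keywords. This is a common shortcut rather than the method used above, and it assumes the fitted kmeans and tf_idf objects are still in scope:

# A centroid's heaviest coordinates are the cluster's most characteristic terms
terms = tf_idf.get_feature_names()
centroid = kmeans.cluster_centers_[5]
print([terms[i] for i in centroid.argsort()[::-1][:5]])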

Conclusions

TF-IDF clustering is a naive yet computationally inexpensive way of grouping similar documents. It is by no means the current state of the art, but it still has plenty of uses in natural language processing. It is also a good starter project for drawing links between natural language processing and more traditional machine learning algorithms like k-means.

Footnotes


[1]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from random import shuffle
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.patches import Ellipse
from collections import defaultdict
from gensim.parsing import preprocess_string
from typing import List