One of the problems in Natural Language Processing (and a problem I'm facing at work) is how to cluster documents into groups based on their contents. There are two broad approaches to solving the document clustering problem, supervised and unsupervised machine learning. Supervised machine learning relies on labeled data and unsupervised learning tries to categorize the data without any prior labels. These two methods both have their ups and downs that I will not go into here. For the task of clustering stock descriptions we will be using unsupervised clustering, more specifically k-means clustering.
The starting point for the dataset is the S&P 500 list. Using this list I made a small script to download the descriptions of the stocks from Robinhood. The format of the resulting tab separated file can be seen in the table below.
|AAP||Advance Auto Parts, Inc. engages in the supply and distribution of aftermarket automot…|
|CVS||CVS Health Corp. engages in the provision of health care services. It operates through…|
|WMB||The Williams Cos., Inc. operates as an energy infrastructure company, which explores,…|
|GOOG||Alphabet, Inc. is a holding company, which engages in the business of acquisition and…|
|MSFT||Microsoft Corp. engages in the development and support of software, services, devices…|
Some imports1 are required before we can get going, but then we can collect all the stocks into a list of the ticker name and the descriptions.
# Import the data with open("stock_descs.tsv") as file: stocks = [tuple(line.replace("\n", "").split('\t')) for line in file]
Then we can train a sklearn TfidfVectorizer model on the data and transform all of the descriptions into their tf-idf values for each word in the corpus.
# Train TfidfVectorizor corpus = [stock for stock in stocks] tf_idf = TfidfVectorizer(stop_words="english", preprocessor=lambda doc: " ".join(preprocess_string(doc))) tf_idf_matrix = tf_idf.fit_transform(corpus)
We can then train KMeans models with K values ranging from 2 to 30. Using the elbow method we find that the best K value is 13.
# Elbow method scores =  ks = range(2, 30) for k in ks: kmeans = KMeans(n_clusters=k).fit(tf_idf_matrix) kmeans.fit(tf_idf_matrix) scores.append(kmeans.score(tf_idf_matrix))
# Plot the elbow plt.plot(ks, scores, 'bx-') plt.xlabel('# Clusters') plt.ylabel('KMeans Score') choice = 13 plt.gcf().gca().add_artist( Ellipse((choice, scores[choice-2]), 1.125, 4, color='r', fill=False, lw=5) ) plt.title('The Elbow Method showing the optimal k') plt.show()
Then we can sort the stocks into their predicted classes.
# Train the KMeans clusterer with the specified k kmeans = KMeans(n_clusters=13, random_state=0) predictions = kmeans.fit_predict(tf_idf_matrix) # Gather all clusters and their stocks clusters = defaultdict(list) for i, prediction in enumerate(predictions): clusters[prediction].append(stocks[i])
And finally print the keywords and descriptions for any given category. In this case I chose category 5 which seems to be some sort of life insurance category.
def top_words(cluster_number: int) -> List[str]: tf_idf_scores = tf_idf.transform([stock for stock in clusters[cluster_number]]).sum(axis=0) scored_words = [(tf_idf_scores[0,i], term) for i, term in enumerate(tf_idf.get_feature_names())] sorted_words = sorted(scored_words, key=lambda x: -x) return [word for word in sorted_words][:5] def pretty_print_cluster(cluster_number: int) -> None: print("Keywords:") print("\n".join(top_words(cluster_number))) print("\nStock Descriptions:") print("\n".join(stock for stock in clusters[cluster_number])) pretty_print_cluster(5)
Aflac, Inc. is a holding company, which engages in the provision financial protection services…
The Allstate Corp. engages in the property and casualty insurance business and the sale of life, accident, and health insurance products through its subsidiaries…
American International Group, Inc. engages in the provision of a range of property casualty insurance, life insurance, retirement products, and other financial services to commercial and individual customers…
Anthem, Inc. provides life, hospital and medical insurance plans. It offers a broad spectrum of network-based managed care health benefit plans to the large and small employer, individual, Medicaid, and Medicare markets…
TF-IDF clustering is a naive, yet computationally inexpensive way of grouping similar documents. It is by no means the current state of the art but there are still a lot of uses for it in natural language processing. It is also a good project to start to draw links between natural language processing and more traditional machine learning algorithms like k-means.
from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.cluster import KMeans from scipy.spatial.distance import cdist from random import shuffle import numpy as np from matplotlib import pyplot as plt from matplotlib.patches import Ellipse from collections import defaultdict from gensim.parsing import preprocess_string from typing import List