My senior capstone project for my computer science degree is research focused on summarizing sentences. My group mate and I decided to try to accomplish this by converting sentences into Emoji, which we think will produce a more information-dense string. This problem is similar to a number of problems in computer science and other, unrelated, domains. Within computer science, it is adjacent to the Emoji prediction and Emoji embedding problems. Outside of our domain, it resembles translation to and from ideographic languages. This algorithm is a first shot at an implementation after a short-ish literature review.
Before the rest of the algorithm is explained, it is important to understand the underlying technology. sent2vec is a model used to generate a vector embedding of a sentence: an array of 700 floats that places the meaning of the sentence into a vector space. The main use of sent2vec in this algorithm is to measure the closeness of two sentences using cosine similarity.
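As a concrete illustration of that usage, the sketch below embeds two sentences and measures their cosine distance. It assumes the Python bindings from the sent2vec repo; the model path and the sentence pair are just examples.

```python
import sent2vec
from scipy.spatial.distance import cosine

# Load the pre-trained model (path is an example; see the sent2vec repo)
s2v = sent2vec.Sent2vecModel()
s2v.load_model("wiki_unigrams.bin")

# Embed two sentences; ravel() guards against a 2-D (1 x dim) return shape
emb_a = s2v.embed_sentence("christmas music").ravel()
emb_b = s2v.embed_sentence("jingle bells").ravel()

# Cosine distance: lower means the sentences point the same way in the space
print(cosine(emb_a, emb_b))
```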
The general idea behind the algorithm is that a sentence can be split into a series of n-grams such that each n-gram maximizes the cosine similarity between itself and an Emoji definition in the sent2vec vector space. For example, the sentence "Christmas music rings from the clock tower" may be split into the n-grams ['Christmas', 'music', 'ring', 'from the clock tower'], with these individual n-grams being close in the sent2vec space to the Emoji ["🎄", "🎻", "💍", "🏫"]. Currently, every possible combination of n-grams is generated and queried to see which combination gives the lowest average cosine distance.
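The function that enumerates those combinations is not shown in this post; as a rough sketch (my reconstruction, not necessarily the notebook's exact combinations_of_sent), every contiguous split of a sentence can be generated recursively:

```python
from typing import List

def combinations_of_sent(sent: str) -> List[List[str]]:
    """Generate every way to split a sentence into contiguous n-grams.

    A sketch of the helper used by summarize below; the notebook's
    actual implementation may differ.
    """
    words = sent.split()
    # Base case: the empty sentence has exactly one (empty) split
    if not words:
        return [[]]
    combos = []
    # Peel off a leading n-gram of every possible length, then recurse
    # on the rest of the sentence
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for tail in combinations_of_sent(" ".join(words[i:])):
            combos.append([head] + tail)
    return combos

# combinations_of_sent("a b c") ->
# [['a', 'b', 'c'], ['a', 'b c'], ['a b', 'c'], ['a b c']]
```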
The actual algorithm contains three main parts. First, Emoji are loaded into a list of 2-tuples, where each pair holds an Emoji and the vector representation of that Emoji's description. The Emoji descriptions were acquired from the emoji_joined.txt file found in the data dir of the emoji2vec repo. The sent2vec model is the wiki_unigrams model found in the sent2vec repo.
```python
# Define the list to store the (emoji, repr) 2-tuples
emoji_embeddings = []

# Open the file that stores the emoji/description pairs
with open("emoji_joined.txt") as emojis:
    for defn in emojis:
        # The file is tab-delimited
        split = defn.split("\t")
        # Get the emoji and the description from the current line
        emoji = split[-1].replace("\n", "")
        desc = split[0]
        # Add each emoji and its embedded description to the list
        emoji_embeddings.append((emoji, s2v.embed_sentence(desc)))
```
The second part of the algorithm is the closest_emoji function. This takes in a sentence and returns the Emoji whose description embedding is most similar in the vector space, along with the cosine distance. The function has the @lru_cache decorator, which means the 100 most recently used return values are cached. The cache is cleared after each summary is finished.
```python
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=100)
def closest_emoji(sent: str) -> Tuple[str, float]:
    """
    Get the closest emoji to the given sentence

    Args:
        sent (str): Sentence to check

    Ret:
        (Tuple[str, float]): Closest emoji, the respective cosine distance
    """
    # Embed the sentence using sent2vec
    emb = s2v.embed_sentence(sent)
    # Start the lowest cosine distance at higher than it could ever be
    lowest_cos = 1_000_000
    # The best emoji starts as an empty string placeholder
    best_emoji = ""
    # Loop through the (emoji, embedding) pairs
    for emoji, emoji_emb in emoji_embeddings:
        # Check the cosine distance between the emoji's embedding and
        # the sentence's embedding
        curr_cos = cosine(emoji_emb, emb)
        # If it is lower than the lowest so far it is the new best
        if curr_cos < lowest_cos:
            lowest_cos = curr_cos
            best_emoji = emoji
    # Return a 2-tuple containing the best emoji and its cosine distance
    return best_emoji, lowest_cos
```
The third (and most important) function in the algorithm is the actual summarization function. It loops through each possible n-gram combination and returns the Emoji, the cosine distances (labeled as uncertainties), and the n-grams of the combination with the lowest average cosine distance per n-gram.
```python
from typing import List, Tuple

def summarize(sent: str) -> Tuple[str, List[float], List[str]]:
    """
    Summarize the given sentence into emojis

    Args:
        sent (str): Sentence to summarize

    Rets:
        (Tuple[str, List[float], List[str]]): (Emoji sentence, list of
        uncertainty values for the corresponding emoji, list of n-grams
        used to generate the corresponding emoji)
    """
    # Clean the sentence
    sent = clean_sentence(sent)
    # Generate all combinations of the sentence
    sent_combos = combinations_of_sent(sent)
    # Init "best" data members as empty or exceedingly high
    best_emojis = ""
    best_n_grams = []
    best_uncertainties = [100_000_000]
    # Iterate through every combination of the sentence
    for sent_combo in sent_combos:
        # Start the local data members as empty
        emojis = ""
        uncertainties = []
        # Iterate through each n-gram, adding the uncertainty and
        # emoji to the lists
        for n_gram in sent_combo:
            close_emoji, cos_diff = closest_emoji(n_gram)
            emojis += close_emoji
            uncertainties.append(cos_diff)
        # Check if the average uncertainty is less than the best so far
        # TODO: Maybe a median check would be helpful as well?
        if (sum(uncertainties) / len(uncertainties)
                < sum(best_uncertainties) / len(best_uncertainties)):
            # Update the best data members
            best_emojis = emojis
            best_n_grams = sent_combo
            best_uncertainties = uncertainties[:]
    # Clear the function cache on closest_emoji because it is unlikely
    # the next run will make use of it
    closest_emoji.cache_clear()
    # Return the emoji "sentence", the list of cosine distances,
    # and the n-grams
    return best_emojis, best_uncertainties, best_n_grams
```
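The clean_sentence helper comes from elsewhere in the notebook and is not shown here; a minimal placeholder, assuming it only lowercases and strips punctuation, might be:

```python
import string

def clean_sentence(sent: str) -> str:
    """Placeholder cleaning step: lowercase and strip punctuation.
    The notebook's actual clean_sentence may differ."""
    return sent.lower().translate(str.maketrans("", "", string.punctuation))

# With the model and emoji_embeddings loaded, a summary is one call:
# summarize("christmas music rings from the clock tower")
```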
The major downfall of this algorithm is the lack of data it is currently using. There are 1661 Emoji in the corpus and only 6088 definitions, which gives an average of \(\approx 4\) definitions per Emoji. In the context of a 700-dimensional space, that is not much variation. If more data were used, the vector space would become more populated and each n-gram would have a closer Emoji. Putting the limits of the dataset aside, this algorithm is still incredibly slow. The major flaw of searching every single combination of words in a sentence is the time it takes: about 5 seconds for a sentence that is 6 words long, and the curve it follows after that is not pretty. I can see no smart or quick way of speeding this up. Maybe genetic algorithms? Maybe harder caching? If you have any ideas please reach out to me.
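To put a number on that curve: each of the \(n - 1\) gaps between words in an \(n\)-word sentence is either a split point or not, so the number of n-gram combinations the loop has to score is

\[
2^{\,n-1} \quad\Longrightarrow\quad 2^{5} = 32 \text{ combinations at 6 words}, \qquad 2^{11} = 2048 \text{ at 12 words,}
\]

and every n-gram in every combination is compared against all 6088 definition embeddings, which is why the lru_cache on closest_emoji matters at all.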
Here is a look at some of the more accurate results. The less accurate ones are just garbage.
| Input Sentence | Similarity | n-grams | Output Emojis |
|---|---|---|---|
| christmas music rings from the clock tower | 0.983 | ['christmas', 'music', 'ring', 'from the clock tower'] | 🎄🎻💍🏫 |
| It isn't perfect but it is a start | 0.818 | | |
| The sun is rising over new york city | 0.881 | | |
This algorithm fits some of the requirements we set out to fill, but there is still so much to be done. For starters, the training could be improved. Right now we are only training on short descriptions of the Emoji. I think if we expanded to more datasets (maybe /r/emojipasta or something similar) it may have a better shot at transcribing more sentences. Either way, it is a good jumping-off point into the world of English \(\rightarrow\) Emoji translation. The entire Jupyter notebook that I used for this algorithm is available in this GitHub repo.