Naive Sentence to Emoji Translation
My senior capstone project for my computer science degree is research focused on summarizing sentences. My group mate and I decided to try and accomplish this by converting sentences into Emoji. We think that this will produce a more information-dense string. This problem is rather similar to a plethora of different problems in computer science and other, unrelated, domains. Within computer science, it is adjacent to the Emoji prediction and Emoji embedding problems. Outside of our domain, it is similar to problems involving translations to, and from, ideographic languages. This algorithm is the first shot at implementation after a short-ish literature review.
Before the rest of the algorithm is explained, it is important to understand the underlying technology. sent2vec is a model that is used to generate a vector embedding of a sentence. This vector embedding is an array containing 700 floats that place the meaning of the sentence into a vector space. The main use of sent2vec in this algorithm is to determine the closeness of two sentences using the cosine similarity.
The general idea behind the algorithm is that a sentence can be split into a series of n-grams such that each n-gram maximizes the cosine similarity between itself and an Emoji definition in the sent2vec vector space. For example: the sentence
"Christmas music rings from the clock tower" may be split into the following n-grams:
['Christmas', 'music', 'ring', 'from the clock tower'] with these individual n-grams being close in the sent2vec space to the following Emoji
["🎄", "🎻", "💍", "🏫"]. Currently, each possible combination of n-grams is generated and queried to see which combination gives the lowest average cosine distance.
The actual algorithm contains three main parts. First, Emoji are loaded into a list of 2-tuples with the values of the pair being the Emoji and vector representation of that Emoji description. The Emoji descriptions were acquired from the emoji_joined.txt file found in the data dir in the emoji2vec repo. The sent2vec model is the wiki_unigrams model found in the sent2vec repo.
The second part of the algorithm is the
closest_emoji function. This takes in a sentence and returns the Emoji with the most similar description embedding in the vector space. This function has the
@lru_cache decorator which means the last 100 function return values will be cached. This is cleared after each summary is finished
The third (and most important) function in the algorithm is the actual summarization function. This function loops through each possible n-gram combination and then returns the Emoji, the cosine difference (labeled as uncertainty), and the n-grams of the combo with the lowest average cosine difference for each n-gram.
The major downfall of this algorithm is the lack of data that it is currently using. There are 1661 Emoji in the corpus and only 6088 definitions, which gives an average of $$\approx$$ 4 definitions per Emoji. When you put that in the context of 700 dimensional space that's not much variation. If more data was used the vector space would become more populated and each n-gram would have a closer Emoji.
Putting the limits on the dataset aside this algorithm is still incredibly slow. The major flaw of searching every single combination of words in a sentence is the time it takes. It's about 5 seconds for a sentence that is 6 words long, and the curve it follows after that is not pretty. I can see no smart or quick way of speeding this up. Maybe genetic algorithms? Maybe harder caching? If you have any ideas please reach out to me.
Here is a look at some of the more accurate results. The less accurate ones are just garbage.
|Input Sentence||Similarity||n-grams||Output Emojis|
|christmas music rings from the clock tower||0.983||['christmas', 'music', 'ring', 'from the clock tower']||🎄🎻💍🏫|
|It isn't perfect but it is a start||0.818||["it is n't", 'perfect but it is', 'a', 'start']||🙅💯💯🌱|
|The sun is rising over new york city||0.881||['the sun is rising over', 'new york', 'city']||🌄🗽🚏|
This algorithm fits some of the requirements we set out to fill but there is still so much to be done. For starters the training could be improved. Right now we are only training off of short descriptions of the emoji. I think if we expanded to more datasets (maybe /r/emojipasta or something similar) it may have a better shot at transcribing more sentences. Either way it is a good jumping off point into the world of English $$\rightarrow$$ Emoji translation. The entire jupyter notebook that I used for this algorithm is available in this github repo.