Motivation

My senior capstone project for my computer science degree is research focused on summarizing sentences. My groupmate and I decided to try to accomplish this by converting sentences into Emoji, which we think will produce a more information-dense string. This problem is similar to a number of other problems in computer science and in unrelated domains. Within computer science, it is adjacent to the Emoji prediction and Emoji embedding problems. Outside of our domain, it resembles problems involving translation to, and from, ideographic languages. This algorithm is our first shot at an implementation after a short-ish literature review.

Methodology

Before the rest of the algorithm is explained, it is important to understand the underlying technology. sent2vec is a model used to generate a vector embedding of a sentence. This embedding is an array of 700 floats that places the meaning of the sentence in a vector space. The main use of sent2vec in this algorithm is to determine the closeness of two sentences using cosine similarity.
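
For readers who have not used sent2vec before, here is a minimal sketch of loading a model and measuring how close two sentences are. It assumes the Python bindings from the sent2vec repo and a locally downloaded wiki_unigrams.bin; the file path and the variable name s2v (reused in the code below) are my own choices, not fixed by the library.

import sent2vec
from scipy.spatial.distance import cosine

# Load the pre-trained model (the path is wherever you downloaded the .bin file)
s2v = sent2vec.Sent2vecModel()
s2v.load_model("wiki_unigrams.bin")

# Embed two sentences into the vector space. flatten() guards against
# embed_sentence returning a single-row 2-D array, which newer versions
# of scipy's cosine would otherwise reject.
a = s2v.embed_sentence("christmas music rings from the clock tower").flatten()
b = s2v.embed_sentence("festive songs play from the bell tower").flatten()

# scipy's cosine() is a distance (1 - similarity): 0 means the vectors
# point the same way, larger values mean less related sentences.
print(cosine(a, b))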

The general idea behind the algorithm is that a sentence can be split into a series of n-grams such that each n-gram maximizes the cosine similarity between itself and an Emoji definition in the sent2vec vector space. For example, the sentence "Christmas music rings from the clock tower" may be split into the following n-grams: ['Christmas', 'music', 'ring', 'from the clock tower'], with these individual n-grams being close in the sent2vec space to the following Emoji: ["🎄", "🎻", "💍", "🏫"]. Currently, each possible combination of n-grams is generated and scored to see which combination gives the lowest average cosine distance.
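
The helper that enumerates these splits shows up later as combinations_of_sent in the summarize function, but it is not reproduced in this post, so here is a minimal sketch of what such a helper could look like, assuming the sentence is split on whitespace and every contiguous partition of the word sequence is produced.

from typing import List

def combinations_of_sent(sent: str) -> List[List[str]]:
    """Illustrative sketch: return every way to split the sentence into
    contiguous n-grams (each n-gram is a space-joined run of words)."""
    words = sent.split()

    def partitions(start: int) -> List[List[str]]:
        # Base case: nothing left to split, one empty partition
        if start == len(words):
            return [[]]
        combos = []
        # The next n-gram can be anywhere from 1 to all remaining words long
        for end in range(start + 1, len(words) + 1):
            first = " ".join(words[start:end])
            for rest in partitions(end):
                combos.append([first] + rest)
        return combos

    return partitions(0)

# A 4-word sentence has 2**(4 - 1) = 8 possible splits
print(len(combinations_of_sent("christmas music rings loudly")))

Every extra word doubles the number of splits, which is the source of the slowdown discussed under Downsides.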

Algorithm

The actual algorithm contains three main parts. First, Emoji are loaded into a list of 2-tuples, where each pair holds the Emoji and the vector representation of its description. The Emoji descriptions were acquired from the emoji_joined.txt file found in the data dir in the emoji2vec repo. The sent2vec model is the wiki_unigrams model found in the sent2vec repo.

# Define the array to store the (emoji, repr) 2-tuple
emoji_embeddings = []
# Open the file that stores the emoji, description 2-tuple list
with open("emoji_joined.txt") as emojis:
    for defn in emojis:
        # The file is tab-delim
        split = defn.split("\t")

        # Get the emoji and the description from the current line
        emoji = split[-1].replace("\n", "")
        desc = split[0]

        # Add each emoji and embedded description to the list
        emoji_embeddings.append((emoji, s2v.embed_sentence(desc)))

The second part of the algorithm is the closest_emoji function. This takes in a sentence and returns the Emoji whose description embedding is most similar in the vector space. The function has the @lru_cache decorator, which means the 100 most recently used return values are cached. The cache is cleared after each summary is finished.

from functools import lru_cache
from typing import List, Tuple

from scipy.spatial.distance import cosine


@lru_cache(maxsize=100)
def closest_emoji(sent: str) -> Tuple[str, float]:
    """
    Get the closest emoji to the given sentence

    Args:
        sent (str): Sentence to check
    Ret:
        (Tuple[str, float]): The closest emoji and its cosine distance

    """

    # Embed the sentence using sent2vec
    emb = s2v.embed_sentence(sent)

    # Start the lowest cosine distance higher than it could ever be
    lowest_cos = 1_000_000

    # The best emoji starts as an empty string placeholder
    best_emoji = ""

    # Loop through the list of emoji embeddings
    for emoji in emoji_embeddings:
        # Get the current emoji's embedding
        emoji_emb = emoji[1]

        # Check the cosine distance between the emoji's embedding and
        # the sentence's embedding
        curr_cos = cosine(emoji_emb, emb)

        # If it is lower than the lowest so far, it is the new best
        if curr_cos < lowest_cos:
            lowest_cos = curr_cos
            best_emoji = emoji[0]

    # Return a 2-tuple containing the best emoji and its cosine distance
    return best_emoji, lowest_cos
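
A quick, hypothetical usage example; the emoji that comes back depends entirely on the loaded model and the description data, so no output is shown here.

emoji, dist = closest_emoji("christmas music")
print(emoji, dist)

# Thanks to @lru_cache, repeating the same n-gram skips the full scan
# over emoji_embeddings; cache_info() shows the hits and misses.
print(closest_emoji.cache_info())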

The third (and most important) function in the algorithm is the actual summarization function. It loops through each possible n-gram combination and returns the Emoji, the cosine distances (labeled as uncertainties), and the n-grams for the combination with the lowest average cosine distance.
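
summarize also calls a clean_sentence helper that is not shown in this post. Here is a minimal stand-in, assuming all it needs to do is lowercase the text and strip punctuation so the n-grams look more like the emoji descriptions.

import re

def clean_sentence(sent: str) -> str:
    """Illustrative stand-in for the notebook's clean_sentence helper:
    lowercase the sentence and drop everything except letters, digits,
    apostrophes, and spaces."""
    return re.sub(r"[^a-z0-9' ]+", "", sent.lower()).strip()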

def summarize(sent: str) -> Tuple[str, List[float], List[str]]:
    """
    Summarize the given sentence into emojis

    Args:
        sent (str): Sentence to summarize
    Rets:
        (Tuple[str, List[float], List[str]]): (Emoji sentence,
        list of uncertainty values for the corresponding emoji,
        list of n-grams used to generate the corresponding emoji)
    """
    # Clean the sentence
    sent = clean_sentence(sent)

    # Generate all combinations of sentences
    sent_combos = combinations_of_sent(sent)
    # Init "best" datamembers as empty or exceedingly high
    best_emojis = ""
    best_n_grams = []
    best_uncertainties = [100_000_000]
    # Iterate through every combination of sentence combos
    for sent_combo in sent_combos:
        # Start the local data members as empty
        emojis = ""
        uncertainties = []
        # Iterate through each n_gram adding the uncertainty and
        # emoji to the lists
        for n_gram in sent_combo:
            close_emoji, cos_diff = closest_emoji(n_gram)
            emojis += close_emoji
            uncertainties.append(cos_diff)

        # Check if the average uncertainty is less than the best
        # TODO: Maybe a median check would be helpful as well?
        if (sum(uncertainties)/len(uncertainties) <
            sum(best_uncertainties)/len(best_uncertainties)):
            # Update the best emojis
            best_emojis = emojis
            best_n_grams = sent_combo
            best_uncertainties = uncertainties[:]

    # Clear the function cache on closest_emoji because it is unlikely
    # the next run will make use of it
    closest_emoji.cache_clear()

    # Return the emoji "sentence", the list of cosine distances,
    # and all of the n-grams
    return (best_emojis, best_uncertainties, best_n_grams)
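
Putting the pieces together, a run over the example sentence from the Methodology section would look like this. The output is omitted since it depends on the model and emoji data; the Results table below shows what the notebook actually produced.

emojis, uncertainties, n_grams = summarize(
    "christmas music rings from the clock tower"
)
print(emojis)         # emoji "sentence", one emoji per chosen n-gram
print(n_grams)        # the winning split of the input sentence
print(uncertainties)  # cosine distance for each chosen emoji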

Downsides

The major downside of this algorithm is the lack of data it currently uses. There are 1661 Emoji in the corpus and only 6088 definitions, which gives an average of \(\approx\) 4 definitions per Emoji. In the context of a 700-dimensional space, that is not much variation. If more data were used, the vector space would become more densely populated and each n-gram would have a closer Emoji. Setting the dataset's limits aside, this algorithm is still incredibly slow. The major flaw of searching every single combination of words in a sentence is the time it takes: about 5 seconds for a sentence that is 6 words long, and the curve it follows after that is not pretty. I can see no smart or quick way of speeding this up. Maybe genetic algorithms? Maybe harder caching? If you have any ideas, please reach out to me.
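
To make the blow-up concrete: a sentence of n words can be split into contiguous n-grams in \(2^{n-1}\) ways, because each of the n-1 gaps between words is either a cut point or not. Each split triggers a closest_emoji lookup per n-gram, so the work roughly doubles with every added word.

# Number of contiguous splits (and therefore scoring passes) per sentence length
for n in (6, 8, 10, 12, 15, 20):
    print(n, "words ->", 2 ** (n - 1), "splits")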

Results

Here is a look at some of the more accurate results. The less accurate ones are just garbage.

Input Sentence                              | Similarity | n-grams                                      | Output Emojis
christmas music rings from the clock tower | 0.983      | christmas, music, ring, from the clock tower | 🎄🎻💍🏫
It isn't perfect but it is a start          | 0.818      | it is n't, perfect but it is, a, start       | 🙅💯💯🌱
The sun is rising over new york city       | 0.881      | the sun is rising over, new york, city       | 🌄🗽🚏

Conclusion

This algorithm fits some of the requirements we set out to fill, but there is still so much to be done. For starters, the training data could be improved: right now we are only training off of short descriptions of the Emoji. If we expanded to more datasets (maybe /r/emojipasta or something similar), the algorithm may have a better shot at transcribing more sentences. Either way, it is a good jumping-off point into the world of English \(\rightarrow\) Emoji translation. The entire Jupyter notebook that I used for this algorithm is available in this GitHub repo.