## Introduction

Throughout the past semester I have been working on my senior capstone project for my CS undergraduate. The project is to create Emoji summaries for sentences and one of the integral parts of this algorithm is separating a sentence into a sequence of n-grams that represent it. In the initial algorithm, I took a naive approach of generating every single combination of n-grams, summarizing them all, and then returning the summary with the highest result. While this worked it did have some downsides. The main disadvantage of attacking the problem in this way was some of the n-grams contained two big ideas from the sentence and they were only getting translated to one Emoji. It was clear that this was not ideal. The approach detailed within this post uses information about the dependencies between words within the sentence in hopes that this will produce an output that is more representative.

## Dependency Trees

Dependency trees are a way to map the dependencies of words within a sentence. The example in figure 1 maps the dependencies within the sentence "I finished my homework just before the class started". Because the node containing "I" is a child of the "finished" node there is a direct relation between the two words. In this case "I" is the object that actually "finished" something. In this dependency tree (and in the algorithm) the actual relations that link words together are neither shown or used.

## Rules for Collapse

The primary assumption behind this algorithm is that some dependencies can be collapsed, or simplified to produce larger n-grams that make up the sentence. As of right now the algorithm only executes two rules but it is easy to imagine other rules that could be used. The rules are as follows:

1. If a node has only one child then the two nodes can be combined into one

2. If multiple leaves are on the same level and have the same parent then they can be combined into one node

## Results

This algorithm is used directly in the sentence to Emoji algorithm, so it makes the most sense to present results as relative to its' application. Table 1 below shows the results from the old exhaustive n-gram sequencing algorithm as compared to this dependency relation informed algorithm.

SentenceExhaustiveDependency Tree
The student drew a snowflake on the chalk boardThe student drew a snowflake on the chalk boardThe student drew a snowflake the chalk on board
I finished the homework just before class startedI finished the homework just before class startedI finished the homework just before class started
Can you calculate the number of giraffes that have ever existed?can you calculate the number of giraffe that have ever existedcan you calculate number that have ever of giraffes existed

It is relatively clear that the tree collapse n-gram generation gives a more comprehensive n-gram sequence for the sentence. However, this may be more of a reflection of the naive-ness of our initial algorithm. The exhaustive algorithm relies entirely on the dataset that is used for the output Emoji. The lacklustre dataset is something that will be improved shortly.

## Implementation

Below, in figure 5, is the algorithm implemented in Python 3 with spaCy and NLTK. The general flow of the algorithm is that it performs the collapses detailed above and returns a list of lists. Each list within the result is an n-gram split into its' constituent words. It is implemented as such so the word can also contain the initial location of the word within the input so it can later be sorted. Sorting is trivial and it (along with the rest of the code) is contained within this Juypter notebook within the algorithms repo.

def pos_n_gram_sequence(node, n_grams=None):
"""
Turn the sentence given into an n-gram sequence informed
by part of speech tagging

Args:
node(Token or str): Root token for the sentence or the
sentence as a string if first run through
n_grams(List, Optional): List of n-grams for recursion
"""
# Check if the n_grams list is none. If so it is the
# first run through of the function then set up the
# list and the sentence as an spaCy NLP root node.
# We do it with none because python gets
# slightly weird if you do this with an empty list
if n_grams is None:
n_grams = []
if type(node) is str:
node = list(nlp(node).sents)[0].root

# Rule 1.
# While the current node only has one child append the
# data from the current node to a backlog list and then
# loop down to the next node, checking it's child count
# so on and so forth
current_node = node
parent_child_data = []
while current_node.n_lefts + current_node.n_rights == 1:
# Appending both the node's token and its' position in the
# sentence so we can sort later
parent_child_data.append((current_node.orth_, current_node.i))
# Set the current node to the only child of the current node
current_node = list(current_node.children)[0]

# Add the current node's data to the parent-child dependency
# list so we can just add this to the result later
parent_child_data.append((current_node.orth_, current_node.i))
n_grams.append(parent_child_data)

# Rule 2. will only work if the current node has more than one
# child. If it has less than that then we already appended
# everything relevant to the list
if not current_node.n_lefts + current_node.n_rights > 1:
return

# Get the children of the current node that have children
children_with_children = [child for child in current_node.children
if len(list(child.children)) > 0]

# Leaves are nodes that don't have children. Any children
# of the current node that are not in the
# children_with_children list are leaves and should be
# collapsed
leafs = [(child.orth_, child.i)
for child in current_node.children
if child not in children_with_children]

# Append the leafs to the n_grams
n_grams.append(leafs)

# Recurse through all the non-leaf children
for child in children_with_children:
pos_n_gram_sequence(child, n_grams)

return n_grams

## Future Work

The one main future iteration for this algorithm involves generating more potential n-gram sequences. The current implementation only collapses the tree once to produce one n-gram sequence but it wouldn't be that hard to further collapse the tree to produce more. This further collapse could prove to improve the summary for longer sentences and could be combined with other n-gram sequence scoring techniques.