Throughout the past semester I have been working on my senior capstone project for my CS undergraduate degree. The project is to create Emoji summaries for sentences, and one of the integral parts of this algorithm is separating a sentence into a sequence of n-grams that represent it. The initial algorithm took a naive approach: generate every possible combination of n-grams, summarize them all, and return the summary with the highest score. While this worked, it had some downsides. The main disadvantage was that some n-grams contained two big ideas from the sentence but were only translated to one Emoji, which was clearly not ideal. The approach detailed in this post instead uses information about the dependencies between words within the sentence, in the hope of producing a more representative output.
Dependency trees are a way to map the dependencies between words within a sentence. The example in figure 1 maps the dependencies within the sentence "I finished my homework just before the class started". Because the node containing "I" is a child of the "finished" node, there is a direct relation between the two words; in this case "I" is the subject that actually "finished" something. In this dependency tree (and in the algorithm) the actual relation types that link words together are neither shown nor used.
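To make this concrete, you can inspect such a tree directly. Below is a minimal sketch using spaCy; loading the `en_core_web_sm` model is my choice for illustration, not necessarily what the project uses. It prints each word alongside its head in the tree:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I finished my homework just before the class started")

for token in doc:
    # Each token points at its syntactic head; for example, "I"
    # should hang off "finished", mirroring the tree in figure 1
    print(f"{token.text:<10} <- {token.head.text}")
```

The relation labels are also available (as `token.dep_`), but as noted above, the algorithm does not use them.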
The primary assumption behind this algorithm is that some dependencies can be collapsed, or simplified, to produce the larger n-grams that make up the sentence. As of right now the algorithm only executes two rules, but it is easy to imagine other rules that could be added. The rules are as follows (the sketch after the list shows where each would fire on a real parse):

1. If a node has only one child, the two nodes can be combined into one.
2. If multiple leaves are on the same level and share the same parent, they can be combined into a single node.
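Here is a small sketch of where each rule would fire; the exact merges depend on the parse the spaCy model produces, so treat the output as illustrative rather than definitive:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The student drew a snowflake on the chalk board")

for token in doc:
    kids = list(token.children)
    if len(kids) == 1:
        # Rule 1 candidate: a node with a single child merges with it
        print(f"rule 1: merge '{token.text}' with '{kids[0].text}'")
    leaves = [kid for kid in kids if not list(kid.children)]
    if len(leaves) > 1:
        # Rule 2 candidate: sibling leaves under the same parent
        # collapse into one node
        print(f"rule 2: combine {[leaf.text for leaf in leaves]} "
              f"under '{token.text}'")
```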
This algorithm is used directly in the sentence-to-Emoji algorithm, so it makes the most sense to present the results relative to that application. Table 1 below compares the results of the old exhaustive n-gram sequencing algorithm with this dependency-informed algorithm.
| Sentence | Exhaustive algorithm | Tree collapse algorithm |
| --- | --- | --- |
| The student drew a snowflake on the chalk board | *(Emoji summary image)* | *(Emoji summary image)* |
| I finished the homework just before class started | *(Emoji summary image)* | *(Emoji summary image)* |
| Can you calculate the number of giraffes that have ever existed? | *(Emoji summary image)* | *(Emoji summary image)* |
It is relatively clear that the tree collapse n-gram generation gives a more comprehensive n-gram sequence for each sentence. However, this may be more of a reflection of the naivety of the initial algorithm, which relies entirely on the dataset used for the output Emoji. That lacklustre dataset is something that will be improved shortly.
Below, in figure 5, is the algorithm implemented in Python 3 with spaCy and NLTK. The general flow of the algorithm is that it performs the collapses detailed above and returns a list of lists, where each inner list is an n-gram split into its constituent words. It is implemented this way so that each word can be stored along with its original position in the input, which allows the n-grams to be sorted later. Sorting is trivial, and it (along with the rest of the code) is contained within this Jupyter notebook within the algorithms repo.
```python
import spacy

# Load the spaCy English model once so the recursive calls can reuse it
nlp = spacy.load("en_core_web_sm")


def pos_n_gram_sequence(node, n_grams=None):
    """
    Turn the given sentence into an n-gram sequence informed by
    part of speech tagging

    Args:
        node (Token or str): Root token for the sentence, or the
            sentence as a string on the first run through
        n_grams (list, optional): List of n-grams for recursion
    """
    # Check if the n_grams list is None. If so, this is the first run
    # through of the function, so set up the list and turn the sentence
    # into a spaCy root node. We use None because Python gets slightly
    # weird if you use a mutable default like an empty list
    if n_grams is None:
        n_grams = []
    if isinstance(node, str):
        node = list(nlp(node).sents)[0].root

    # Rule 1.
    # While the current node only has one child, append the data from
    # the current node to a backlog list, then move down to the next
    # node and check its child count, and so on
    current_node = node
    parent_child_data = []
    while current_node.n_lefts + current_node.n_rights == 1:
        # Append both the node's token and its position in the
        # sentence so we can sort later
        parent_child_data.append((current_node.orth_, current_node.i))
        # Set the current node to its only child
        current_node = list(current_node.children)[0]

    # Add the current node's data to the parent-child dependency list
    # so we can just add this to the result
    parent_child_data.append((current_node.orth_, current_node.i))
    n_grams.append(parent_child_data)

    # Rule 2 only applies if the current node has more than one child.
    # If it has fewer than that, we already appended everything
    # relevant to the list
    if not current_node.n_lefts + current_node.n_rights > 1:
        return n_grams

    # Get the children of the current node that have children themselves
    children_with_children = [child for child in current_node.children
                              if len(list(child.children)) > 0]

    # Leaves are nodes that don't have children. Any children of the
    # current node that are not in the children_with_children list are
    # leaves and should be collapsed into a single n-gram
    leafs = [(child.orth_, child.i) for child in current_node.children
             if child not in children_with_children]

    # Append the leaves to the n-gram list
    if leafs:
        n_grams.append(leafs)

    # Recurse through all the non-leaf children
    for child in children_with_children:
        pos_n_gram_sequence(child, n_grams)

    return n_grams
```
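A minimal usage sketch, assuming the model loaded in the listing above: each n-gram is a list of (word, position) tuples, so sorting by the stored position restores the original word order within each n-gram.

```python
n_grams = pos_n_gram_sequence(
    "I finished the homework just before class started")

# Restore word order inside each n-gram using the stored positions
ordered = [" ".join(word for word, _ in sorted(gram, key=lambda p: p[1]))
           for gram in n_grams]
print(ordered)
```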
The main future iteration for this algorithm involves generating more potential n-gram sequences. The current implementation collapses the tree only once to produce a single n-gram sequence, but it would not be hard to collapse the tree further to produce more. This further collapse could improve the summaries of longer sentences and could be combined with other n-gram sequence scoring techniques.
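As a rough sketch of what that might look like, the helper below (`coarser_sequences` is a hypothetical name, not part of the current implementation) takes one n-gram sequence and produces every sequence obtainable by merging a single adjacent pair, each of which could then be scored like any other candidate:

```python
def coarser_sequences(n_grams):
    # Hypothetical further-collapse step: produce every sequence that
    # results from merging one adjacent pair of n-grams in the input
    candidates = []
    for i in range(len(n_grams) - 1):
        merged = (n_grams[:i]
                  + [n_grams[i] + n_grams[i + 1]]
                  + n_grams[i + 2:])
        candidates.append(merged)
    return candidates
```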