Context Free Grammars and the Tyranny of Node

May 22nd, 2016

While language and grammar are undoubtably linguistic constructs, they have very practical uses inside the realm of computer science. However, while looking for language and grammar parsing tools written for Node, I was disappointed to find no easy-to-use and suitably-flexible libraries. In response, I began development on Tyranny, a node module that allows for the description and parsing of arbitrary context-free grammars. Here is what it is, how it works, and what it does.

What is a language?

In order to understand grammar as it applies to Tyranny, it is helpful to first understand languages. For many, a language is a means for two people to convey ideas that carry across time and space. For programmers, a language is a code specification that can be used to describe procedures and algorithms. However, in computer science and mathematics, the idea of a language is far more abstract. In fact, a language is just a set of words, using a specific alphabet, that are considered “valid.”

An example of this, assuming that we are considering the language that includes only English words, would use an alphabet containing the characters A to Z and a language resembling L = {"A", "AARDVARK", "AARON", "ABDICATE", ...}. If a word w is an element of the set L, then it is in our language. Otherwise, it is not a member of the language. However, since we are ultimately just checking a boolean value for any w (is it contained in L), we can also think of the language as being described by a program M. If M(w) == true, then w is in L, otherwise if M(w) == false, then w is not in L. In this case, since M describes the language but is not itself the language, we say that M decides L.

For English, this program might be very simple. For instance, it could search for w in the Merriam-Webster dictionary and return true if it finds it, and false otherwise. For a more programmatic example, let us consider the language, defined over the binary alphabet that only contains 0 and 1, which includes only the words that have no 0s in them. Thus, the language looks like:

Bash

content_copy

L = {"1", "11", "111", ...}

And our decider might look like:

JavaScript

content_copy

M = (w) => w.indexOf("0") < 0

Which returns true if and only if the first instance of “0” in w is at an index less than 0, signaling that it never occurs.

What is a Grammar?

Now that we have a general idea of what a language is, let’s look at what a grammar is. The first thing to understand about grammars is that they operate on ordered lists of words, most commonly referred to as sentences. The other concept to understand is that grammars generate sentences from multiple languages. This last point may not make sense with our instinctive definition of language, but if we consider the mathematical definition, and we see that a word can belong to multiple languages, this makes more sense. For instance, “dog” belongs to both the languages ENGLISH and NOUN. Thus, we may think of the english grammar as generating sentences using the languages NOUN, VERB, ADJECTIVE, etc.

Describing Grammars

The most common way that grammars are specified is as a list of substitutions. These substitutions are small sentence fragments that contain either terminal characters, non-terminal characters, or an arrangement of both. This may sound confusing, but it is very intuitive. A terminal character or “token” is a language identifier. For example, noun is a terminal character that can generate any word in the NOUN language. In contrast to this, a non-terminal character is one for which there is a substitution defined in our grammar. For instance, if we have defined the lanugages NOUN, which contains all nouns, and CONJ, which contains all conjunctions, we could define the simple grammar

Bash

content_copy

PAIR --> noun conj noun

Which says that PAIR – a non-terminal – can be expanded into a sentence containing a noun, followed by a conjunction, followed by another noun. Thus, PAIR can generate the sentences “dog and cat,” as well as “table or chair.” This is an example of a regular grammar, defined as such because it contains only non-terminals on the left and only terminals on the right. We can generate even stronger grammars by adding more substitutions such as:

Bash

content_copy

PAIR --> noun conj noun
PAIR --> verb conj verb

So in addition to the previous sentences, we can also generate ones such as “run and climb.” Yet, even by adding this, we still are unable to generate many reasonable grammars. One such example of this are palindromes. Since palindromes cannot be computed by a regular grammar (try it for yourself!), we need something even more powerful. Let us look at a special case of palindromes: parentheses matching. We say that a parenthetical statement is grammatically correct if it follows the form (ⁿ)ⁿ, where a^{n represents the the symbol a repeated n times. Looking at this we see that whenever we perform a substitution that adds a left-paren, we must also add a right-paren. If “open” has the language L = {"("} and “close” has the language L = {")"}, then we can define PAREN as the grammar}

Bash

content_copy

PAREN --> open close
PAREN --> open PAREN close

Thus, if we want to generate “((()))” from PAREN, we substitute as

Bash

content_copy

PAREN
      --> [open PAREN close]
      --> open [open PAREN close] close
      --> open open [open close] close close
      --> ((()))

This type of grammar, with non-terminals on the left and a mix of terminals and non-terminals on the right of the substitution, is known as a context-free grammar. While context-free grammars cannot generate all sentences, they are generally sufficient for most applications.

These are the type of grammars that Tyranny works with.

Uses of Grammars

Before diving into how Tyranny works, we should spend a moment looking at why being able to define and generate grammars is useful. When describing grammars, we thought about them as “building up” sentences from a set of rules. While this is conceptually true, more often then not, grammar definitions are used to deconstruct a sentence into its grammatical components. From there, the application may modify or tag them in some way based on their grammar. In practice, the use of grammars is not unlike the use of regular expressions. A grammar can be “executed” on a string to return whether or not that grammar matches the string, and if so, some details about the structure.

In this way, grammars are most commonly used for parsing data into a sensible and machine understandable format. While much more complex, this is similar to how compiler, natural language processing, and other linguistic tasks are completed. This is where Tyranny enters the picture.

All Hail the King

Tyranny is a recent project of mine that allows developers to generate, parse, and process arbitrary regular and context free grammars. Tyranny derives its name from a 3-stage model of parsing, the Judge, Jury, and Executioner.

The first stage, Judge, lets the user specify serveral named regular languages that can be used as terminals in the Tyrant’s grammar. When used to parse a string, it judges whether it is composed of words that are in the composite language, and if so, determines to which specific language each word belongs.

The second stage, Jury, receives the parsed words from the Judge and determines if is valid according to the spcified grammar. It does this recursively, so if presented with an ambiguous grammar, it may return any of the grammars that construct it.

However, if the sentence is valid in the grammar, it will always return its grammatical structure. Finally, the third stage, Executioner, allows the user to specify a function to be executed on any given sentence or sentence fragment in the grammar. Since it evaluates this function lazily, the function will not be called until parsing is complete.

Examples

Altogether, even at its current stage, Tyranny makes potentially complex parsing very simple. For instance, let’s look at the code to perform parentheses matching.

JavaScript

content_copy

var tyrant = require("./tyranny").tyrant
var george = new tyrant()
george.addTokens({
        "OPEN": /^\($/,
        "CLOSE": /^\)$/
})
george.addRules({
        "E": "(OPEN CLOSE)|(OPEN {E} CLOSE)|[{E} {E}]"
})

The two important pieces here are the addTokens and addRules functions. As mentioned previously, addTokens allows for the specification of regular languages via regular expressions.

In this case, both languages have only a single member, ( and ) respectively. In addRules, we see something very similar to the regular expressions used in addTokens. This is Tyranny’s grammar syntax.

Tyranny’s grammar follows a perl-style regex syntax to make it easily accessible, and then uses Tyranny itself to parse it out into a grammar structure. The syntax right now is fairly simple: TERMINAL refers to a language specified by addTokens and {NON-TERMINAL} refers to a non-terminal specified by addRules. EXPR EXPR is used for concatenation, EXPR|EXPR is used for alternation (match either of them), and EXPR* is used as a Kleene-star style operator (match 0 or more instances of EXPR). In addition (EXPR) is used for match grouping and will group the results in the output, whereas [EXPR] is used for logical grouping, and does not affect the output in any way. Now that we understand what this code is specifying, we can see how this would match any valid parentheses. It also adds a clause that allows for parentheses expression to be concatentated as long as they themselves are valid. We can test this out by executing it on a few strings, and we find

JavaScript

content_copy

> george.parse("()")
[[ '(', ')' ] 
> george.parse("(())")
[[ '(', [ '(', ')' ], ')' ]]
> george.parse("(()())")
[[ '(', [ '(', ')' ], [ '(', ')' ], ')' ]]

…where each pair of parentheses is in an array with itself, its match, and anything between them. If we try this with invalid matchings we find

JavaScript

content_copy

> george.parse("(()")
false
> george.parse(")(")
false

…that Tyranny is incapable of parsing these strings, as it should be. A more interesting example that makes use of the Executioner stage is a program to parse and execute postfix arithmetic.

JavaScript

content_copy

var frank = new tyrant()
frank.addTokens({
        "NUMBER": /^[0-9\.]+[ ]*$/,
        "PLUS": /^\+[ ]*$/,
        "MINUS": /^\-[ ]*$/,
        "TIMES": /^\*[ ]*$/,
        "DIVIDE": /^\/[ ]*$/,
})
var number = frank.compile("NUMBER").apply( parseFloat )
var plus   = frank.compile("{E} {E} PLUS").apply(   (l) => l[0]+l[1] )
var minus  = frank.compile("{E} {E} MINUS").apply(  (l) => l[0]-l[1] )
var times  = frank.compile("{E} {E} TIMES").apply(  (l) => l[0]*l[1] )
var divide = frank.compile("{E} {E} DIVIDE").apply( (l) => l[0]/l[1] )
frank.addRules({
        "NUMBER": number,
        "PLUS": plus,
        "MINUS": minus,
        "TIMES": times,
        "DIVIDE": divide,
        "E": "{NUMBER}|{PLUS}|{MINUS}|{TIMES}|{DIVIDE}"
})

Here, we specify several of the grammars manually, by compiling them and then calling the apply function on them which allows for the specification of an Executioner function. For instance, “num” will match anything in the NUMBER language, and then parse it into a float. Likewise, “plus” will match any two expressions followed by a “+”, and then add the value of the two expressions together. After we specify our sentence fragments NUM, PLUS, MINUS, TIMES, and DIVIDE, we specify the full sentence, E, which can be any of those. Executing this, we find

JavaScript

content_copy

> frank.parse("3")
[3]
> frank.parse("3 2 +")
[5]
> frank.parse("4 3 2 + * 5 /")
[4]

As one familiar with postfix arithmetic would expect.

So What

Tyranny is still very much a work in progress. Even relatively short (~100 Tokens) grammars can take nearly seconds to execute which can cause massive slowdowns during initial compilation and later execution. It also relies on a very limited expression syntax that would benefit from more flexibility. However, it demonstrates the practicality of a context free grammar parser, and fills a gap in currently available javascript modules. If you would like to see more about Tyranny, you can follow development on github or stay tuned for more updates.

Update (4th October 2016)

For those interested, tyranny is finally up and running on npm. While it still has a number of inefficiencies, you can install it with npm install tyranny. If you use it, please report any bugs you find to the above github repository.