Stream-Based Processing in Node

April 24th, 2016

Over the past few weeks, I have been working on a set of tools called HakKit to help me write CTF scripts in Node.js. Many of the ideas are borrowed from pwnlib, but soon after I started, I realized that by using Node’s stream APIs, I could take my tools a step further than the pwnlib ones. By behaving similarly to Unix file descriptors, Node streams allow for powerful and extensible data manipulation.

What is a Stream

For those who are unfamiliar with them, streams are a Node API for moving data between logical endpoints a piece at a time. At first blush, this can seem relatively useless: wouldn’t it be easier to store everything in an object and just pass that around? While this is often true for small pieces of data, once you add several layers of abstraction and multiple types of data manipulation, the code can become unwieldy and slow. Instead, a stream lets you manipulate and parse data as it arrives, keeping the memory allocated for buffers small and letting processing begin before all of the data is available.

Take, for instance, the act of searching for a word in the dictionary. If this has to be done repeatedly, it may be worthwhile to store its contents in memory. If, however, it only needs to be checked once, then it makes much more sense to stream the data. As the data is read in from the file, it can be immediately checked for the desired contents, and then discarded once the data has been checked. Furthermore, the process can stop reading once it has found the entry it is looking for.
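As a rough sketch of that dictionary lookup using only Node’s built-in fs and readline modules: the function name containsWord is my own for illustration, and the word list path in the usage comment is an assumption about your system.

```javascript
const fs = require("fs");
const readline = require("readline");

// Resolves to true as soon as `word` appears on a line of its own in the file
// at `path`, and to false if the end of the file is reached without a match.
async function containsWord(path, word) {
    const input = fs.createReadStream(path, { encoding: "utf8" });
    const lines = readline.createInterface({ input, crlfDelay: Infinity });
    try {
        for await (const line of lines) {
            if (line === word) {
                return true; // stop early instead of reading the whole file
            }
        }
        return false;
    } finally {
        lines.close();
        input.destroy(); // release the file descriptor
    }
}

// Hypothetical usage (the path is an assumption):
// containsWord("/usr/share/dict/words", "banana").then(console.log);
```

Each line is checked and then dropped, so only a small buffer of the file is held in memory at any moment.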

Ultimately, a stream is a data processing mechanism that has one of the following attributes:

It can supply data to another stream (readable)
It can consume data from another stream (writable)
It can transform data from one stream and send it to another (transform)
It can both supply and consume data from either a single stream or two different streams (duplex)
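To make these roles a little more concrete, here is a minimal sketch using the classes from Node’s built-in stream module. The variable names (source, upper, sink, received) are just for illustration. A duplex stream is left out to keep it short; a network socket is the usual example of one.

```javascript
const { Readable, Writable, Transform } = require("stream");

// Readable: supplies data (here, three strings taken from an array).
const source = Readable.from(["a", "b", "c"]);

// Transform: consumes data, changes it, and supplies the result.
const upper = new Transform({
    transform(chunk, encoding, callback) {
        callback(null, chunk.toString().toUpperCase());
    },
});

// Writable: consumes data (here, collecting it into an array).
const received = [];
const sink = new Writable({
    write(chunk, encoding, callback) {
        received.push(chunk.toString());
        callback();
    },
});

// Connect the three and log what arrived once the sink has finished.
source.pipe(upper).pipe(sink);
sink.on("finish", () => console.log(received.join("")));
```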

These streams can then be “piped” together into a more powerful stream capable of doing complex data manipulation. Here is an example of a very simple stream that reads in data, converts it to hex, and writes it back to the file system.

Input File->Hexify->Output File

Notice that this stream is not duplexed: data can only flow in one direction. This is actually the most common way to use a stream, and the way most people are used to seeing them (even if they don’t recognize it as a stream). I mentioned that one stream can be “piped” into another; if that sounds similar to shell terminology, that is because the two are closely related. Although Node implements them with buffers and objects, Node streams are logically equivalent to file descriptor (fd) streams in *nix systems. In the same way that you could run

Bash

$ cat input.txt | hexify > output.txt

to perform a similar function by redirecting file descriptors, Node streams can be chained just as easily.
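A minimal sketch of the Input File -> Hexify -> Output File chain, using Node’s built-in fs and stream modules, might look like the following. The names hexify and hexifyFile are hypothetical helpers for illustration, not part of any library.

```javascript
const fs = require("fs");
const { Transform, pipeline } = require("stream");

// Returns a Transform that re-encodes every incoming chunk as lowercase hex.
function hexify() {
    return new Transform({
        transform(chunk, encoding, callback) {
            callback(null, chunk.toString("hex"));
        },
    });
}

// Input File -> Hexify -> Output File
function hexifyFile(inputPath, outputPath, done) {
    pipeline(
        fs.createReadStream(inputPath),
        hexify(),
        fs.createWriteStream(outputPath),
        done // receives an error if any stage of the chain fails
    );
}

// Hypothetical usage:
// hexifyFile("input.txt", "output.txt", (err) => { if (err) throw err; });
```

I used pipeline here rather than chaining .pipe() calls because it forwards errors from every stage to a single callback and cleans up the other streams when one fails.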

Why Use Streams

Given this, what are the advantages of using streams in Node? In addition to the aforementioned speed and memory benefits, one of the most obvious answers is modularity. This is a buzzword that Node developers especially love to throw around, but it does have some merit. Just as shell commands can be used in a number of different applications, a well-developed stream can be applied to numerous use cases without needing to be modified. In addition, streams are designed to emulate how a generic application runs, and can interface very easily with a wide range of data sources. One thing that surprised me while working on HakKit was how easy it was to interface with a non-Node command using streams. Streams can also be either lazy or greedy, depending on what the writable stream wants and what the readable stream can provide. As a result, they can easily handle infinite or non-halting data sources (such as a network request).
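To illustrate the lazy side of this, here is a small sketch of a consumer pulling from a source that never ends. The helper names naturals and take are my own for illustration. The generator only advances when the stream asks for more data, so taking a handful of values from an infinite source is perfectly safe.

```javascript
const { Readable } = require("stream");

// An endless source: 1, 2, 3, ... produced only on demand.
function* naturals() {
    let n = 1;
    while (true) {
        yield n++;
    }
}

// Collects the first `count` values from `stream`, then stops reading.
async function take(stream, count) {
    const values = [];
    if (count <= 0) {
        stream.destroy();
        return values;
    }
    for await (const value of stream) {
        values.push(value);
        if (values.length >= count) {
            break; // leaving the loop destroys the stream
        }
    }
    return values;
}

// Log the first five values from the endless source.
take(Readable.from(naturals()), 5).then((values) => console.log(values));
```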

However, despite all these benefits, streams are infrequently used in module APIs. Aside from packages that use them internally but never expose them, the only package I have spent any considerable amount of time with that used data streams was node-png. At first, I was also reluctant to use them despite their obvious benefits. Ultimately, I think this arises because streams are confusing and not especially well documented. While streams are a great fit for Node’s asynchronous structure, developers are often used to them behaving synchronously, as in shell commands, which have the benefit of being both intuitive and concise.

With HakKit, I have started to abstract the actual stream implementation away and instead provide a generic API that can be used for interfacing with any type of stream. For instance, the following code performs the dictionary search mentioned earlier:

JavaScript

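var hakkit = require("hakkit")

// Wrap the dictionary file in a tube so it can be read one line at a time
var file = new hakkit.file("/usr/share/dict/web2")
var tube = new hakkit.tube(file)

// recvline() waits synchronously for the next line
var data = tube.recvline()
while (data) {
    if (data.toString() === "banana\n") {
        console.log("Found")
        // Stop reading as soon as the word has been found
        tube.close()
        break
    }
    data = tube.recvline()
}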

While this certainly is not as concise as

Bash

$ cat /usr/share/dict/web2 | grep "banana"

it starts to approach a similar level of readability while keeping the full power of JavaScript syntax. More importantly, because each of these objects is just an abstracted stream, it provides the same underlying functionality as the shell command. Forcing an inherently asynchronous task to be synchronous, such as through the use of tube.recvline(), has clear issues associated with it. Yet ease of use and intuitiveness are often crucial for designing good scripting tools, and the hope is that this syntax strikes a balance between the two.

While streams are not practical for every application, there are many cases in which a stream makes the most sense for manipulating data. Hopefully this was helpful; keep an eye out for new developments with HakKit.