
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 13 Apr 2026 18:07:20 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A History of HTML Parsing at Cloudflare: Part 1]]></title>
            <link>https://blog.cloudflare.com/html-parsing-1/</link>
            <pubDate>Thu, 28 Nov 2019 08:44:00 GMT</pubDate>
            <description><![CDATA[ To coincide with the launch of streaming HTML rewriting functionality for Cloudflare Workers we are open sourcing the Rust HTML rewriter (LOL HTML) used to back the Workers HTMLRewriter API. We also thought it was about time to review the history of HTML rewriting at Cloudflare. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30jkHPpVT3WDwbmxy2U4P4/b7c107bdf8240fa50da653f3eed806bd/HTML-rewrriter_1_3x.png" />
            
            </figure><p>To coincide with the launch of streaming HTML rewriting functionality for <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> we are open sourcing the Rust HTML rewriter (<a href="https://github.com/cloudflare/lol-html">LOL HTML</a>) used to back the Workers <a href="https://developers.cloudflare.com/workers/reference/apis/html-rewriter/">HTMLRewriter API</a>. We also thought it was about time to review the history of HTML rewriting at Cloudflare.</p><p>The first blog post will explain the basics of a streaming HTML rewriter and our particular requirements. We start around 8 years ago by describing the group of ‘ad-hoc’ parsers that were created with specific functionality, such as rewriting e-mail addresses or minifying HTML. By 2016, the state machine defined in the HTML5 specification could be used to build a single spec-compliant, pluggable HTML rewriter to replace the existing collection of parsers. The source code for this rewriter is now public and available here: <a href="https://github.com/cloudflare/lazyhtml">https://github.com/cloudflare/lazyhtml</a>.</p><p>The second blog post will describe the next iteration of the rewriter. With the launch of the edge compute platform <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> we came to realise that developers wanted the same HTML rewriting capabilities with a JavaScript API. The post describes the thoughts behind a low-latency streaming HTML rewriter with a CSS-selector based API. We open-sourced the Rust library as it can also be used as a stand-alone HTML rewriting/parsing library.</p>
    <div>
      <h3>What is a streaming HTML rewriter?</h3>
      <a href="#what-is-a-streaming-html-rewriter">
        
      </a>
    </div>
    <p>A streaming HTML rewriter takes either an HTML string or a byte stream as input and parses it into tokens or any other structured <a href="https://en.wikipedia.org/wiki/Intermediate_representation">intermediate representation</a> (IR) - such as an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax Tree</a> (AST). It then performs transformations on the tokens before converting back to HTML. This provides the ability to modify, extract or add to an existing HTML document as the bytes are being processed. Compare this with a standard HTML tree parser, which needs to retrieve the entire file to generate a full DOM tree. The tree-based rewriter will both take longer to deliver the first processed bytes and require significantly more memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/REe13uX61XBtlevdzEg50/b1398b68b2024b5a5c65f5c84d2c8589/image8.png" />
            
            </figure><p>HTML rewriter</p><p>For example, suppose you own a large site with a lot of historical content that you now want to serve over HTTPS. You will quickly run into the problem of resources (images, scripts, videos) being served over HTTP. This ‘mixed content’ opens a security hole and browsers will warn about or block these resources. It can be difficult or even impossible to update every link on every page of a website. With a streaming HTML rewriter you can select the URI attribute of any HTML tag and change any HTTP links to HTTPS. We built this very feature, <a href="/fixing-the-mixed-content-problem-with-automatic-https-rewrites/">Automatic HTTPS Rewrites</a>, back in 2016 to solve mixed content issues for our customers.</p><p>The reader may already be wondering: “Isn’t this a solved problem? Aren’t there many widely used open-source browsers out there with HTML parsers that can be used for this purpose?” The reality is that writing code to run in 190+ PoPs around the world with a strict low-latency requirement turns even seemingly trivial problems into complex engineering challenges.</p><p>The following blog posts will detail the journey of how starting with the simple idea of finding email addresses within an HTML page led to building an almost spec-compliant HTML parser and then on to a CSS selector matching Virtual Machine. We learned a lot on this journey. I hope you find some of this as interesting as we did.</p>
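            <p>As an illustrative sketch (not Cloudflare’s actual implementation - the function name here is ours), the per-attribute rewrite step for mixed content can be as simple as a prefix swap on the URI value:</p>
            <pre><code>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Sketch only: return a newly allocated copy of an attribute value with a
   leading "http://" upgraded to "https://". A real rewriter operates on
   parsed tokens in place and leaves non-matching values untouched. */
static char *upgrade_to_https(const char *value) {
    static const char http[] = "http://";
    static const char https[] = "https://";
    size_t http_len = sizeof(http) - 1;
    if (strncmp(value, http, http_len) != 0) {
        /* not an insecure absolute URL: return an unchanged copy */
        size_t n = strlen(value);
        char *copy = malloc(n + 1);
        if (copy != NULL) memcpy(copy, value, n + 1);
        return copy;
    }
    size_t rest = strlen(value) - http_len;
    char *out = malloc(sizeof(https) - 1 + rest + 1);
    if (out != NULL) {
        memcpy(out, https, sizeof(https) - 1);
        memcpy(out + sizeof(https) - 1, value + http_len, rest + 1);
    }
    return out;
}</code></pre>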
    <div>
      <h2>Rewriting at the edge</h2>
      <a href="#rewriting-at-the-edge">
        
      </a>
    </div>
    <p>When rewriting content through Cloudflare we do not want to impact site performance. The balance in designing a streaming HTML rewriter is to minimise the pause in response byte flow by holding onto as little information as possible whilst retaining the ability to rewrite matching tokens.</p><p>The differences in requirements compared to an HTML parser used in a browser include:</p>
    <div>
      <h4>Output latency</h4>
      <a href="#output-latency">
        
      </a>
    </div>
    <p>For browsers, the Document Object Model (DOM) is the end product of the parsing process, but in our case we have to parse, rewrite and serialize back to HTML. In Cloudflare’s reverse proxy, any content processing on the edge server adds latency between the server and an eyeball, so we want each of these stages - parsing, rewriting and serializing - to be as fast as possible.</p>
    <div>
      <h4>Parser throughput</h4>
      <a href="#parser-throughput">
        
      </a>
    </div>
    <p>Let’s assume that browsers rarely need to deal with HTML pages bigger than 1 MB and that an average page load time is around 3 seconds at best. HTML parsing is not the main bottleneck of the page loading process, as the browser will be blocked on running scripts and loading other render-critical resources. We can roughly estimate that ~3 Mbps (1 MB over 3 seconds) is an acceptable throughput for a browser’s HTML parser. At Cloudflare we have hundreds of megabytes of traffic per CPU, so we need a parser that is faster by an order of magnitude.</p>
    <div>
      <h4>Memory limitations</h4>
      <a href="#memory-limitations">
        
      </a>
    </div>
    <p>Browsers have the luxury of being able to consume memory. For example, this simple HTML markup, when opened in a browser, will consume a significant chunk of your system memory before the browser eventually halts the tab (and all of this memory will be consumed by the parser):</p>
            <pre><code>&lt;script&gt;
   document.write('&lt;');
   while(true) {
      document.write('aaaaaaaaaaaaaaaaaaaaaaaa');
   }
&lt;/script&gt;</code></pre>
            <p>Unfortunately, buffering some fraction of the input is inevitable even for streaming HTML rewriting. Consider these two HTML snippets:</p>
            <pre><code>&lt;div foo="bar" qux="qux"&gt;</code></pre>
            
            <pre><code>&lt;div foo="bar" qux="qux"</code></pre>
            <p>These seemingly similar fragments of HTML will be treated completely differently when encountered at the end of an HTML page. The first fragment will be parsed as a start tag and the second one will be ignored. By just seeing a `&lt;` character followed by a tag name, the parser can’t determine whether it has found a start tag or not. It needs to traverse the input in search of the closing `&gt;` to make a decision, buffering all content in between, so it can later be emitted to the consumer as a start tag token.</p><p>This requirement means browsers must buffer content indefinitely before eventually giving up with an out-of-memory error.</p><p>In our case, we can’t afford to spend hundreds of megabytes of memory parsing a single HTML file (the actual constraints are even tighter - even using a dozen kilobytes per request would be unacceptable). We need to be much more sophisticated than other implementations in terms of memory usage and gracefully handle all the situations where the provided memory capacity is insufficient to accomplish parsing.</p>
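            <p>One way to picture this buffering rule (a deliberately simplified sketch - the real tokenizer tracks far more state than this): on each chunk, everything before the last `&lt;` that has no matching `&gt;` can be emitted immediately, while the potential tag start must be held back until its closing `&gt;` arrives:</p>
            <pre><code>#include &lt;stddef.h&gt;

/* Simplified sketch: return how many leading bytes of `chunk` can be
   emitted immediately. Bytes from the last '&lt;' without a matching '&gt;'
   onwards must be buffered, since they may turn out to be a start tag. */
static size_t safe_to_emit(const char *chunk, size_t len) {
    size_t hold_from = len;                         /* default: emit all  */
    for (size_t i = 0; i &lt; len; i++) {
        if (chunk[i] == '&lt;') hold_from = i;         /* possible tag start */
        else if (chunk[i] == '&gt;') hold_from = len;  /* tag completed      */
    }
    return hold_from;
}</code></pre>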
    <div>
      <h2>v0 : “Ad-hoc parsers”</h2>
      <a href="#v0-ad-hoc-parsers">
        
      </a>
    </div>
    <p>As usual with big projects, it all started pretty innocently.</p>
    <div>
      <h4>Find and obfuscate an email</h4>
      <a href="#find-and-obfuscate-an-email">
        
      </a>
    </div>
    <p>In 2010, Cloudflare decided to provide a feature that would stop popular email scrapers. The basic idea of this protection was to find and obfuscate emails on pages and later decode them back in the browser with injected JavaScript code. Sounds easy, right? You search for anything that looks like an email, encode it and then decode it with some JavaScript magic and present the result to the end-user.</p><p>However, even such a seemingly simple task already requires solving several issues. First of all, we need to define what an email is, and there is no simple answer. Even the infamous <a href="http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html">regex</a> supposedly covering the entire RFC is, in fact, outdated and incomplete, as newer RFCs added lots of valid email constructions, including Unicode support. Let’s not go down that rabbit hole for now and instead focus on a higher-level issue: transforming streaming content.</p><p>Content from the network comes in packets, which have to be buffered and parsed as HTTP by our servers. You can’t predict how the content will be split, which means you always need to buffer some of it, because content that is going to be replaced can be present in multiple input chunks.</p><p>Let’s say we decided to go with a simple regex like `[\w.]+@[\w.]+`. If the content that comes through contains the email “<a>test@example.org</a>”, it might be split into the following chunks:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SbMCSfBNwSI7ctqCY8ko0/ffd823fb60c086753d7d0c34f0c16a1e/image3.png" />
            
            </figure><p>In order to keep good Time To First Byte (TTFB) and consistent speed, we want to ensure that the preceding chunk is emitted as soon as we determine that it’s not interesting for replacement purposes.</p><p>The easiest way to do that is to transform our regex into a state machine, or a finite automaton. While you could do that by hand, you would end up with hard-to-maintain and error-prone code. Instead, <a href="http://www.colm.net/open-source/ragel/">Ragel</a> was chosen to transform regular expressions into efficient native state machine code. Ragel doesn’t try to take care of buffering or anything other than traversing the state machine. It provides a syntax that not only describes patterns, but can also associate custom actions (code in a host language) with any given state.</p><p>In our case we can pass through buffers until we match the beginning of an email. If we subsequently find out the pattern is not an email, we can bail out from buffering as soon as the pattern stops matching. Otherwise, we can retrieve the matched email and replace it with new content.</p><p>To turn our pattern into a streaming parser we can remember the position of the potential start of an email and, unless it was already discarded or replaced by the end of the current input, store the unhandled part in a permanent buffer. Then, when a new chunk arrives, we can process it separately, resuming from the state Ragel itself remembers, and then use both the buffered chunk and the new one to either emit or obfuscate.</p><p>Now that we have solved the problem of matching email patterns in text, we need to deal with the fact that they need to be obfuscated on pages. 
This is when the first hints of HTML “parsing” were introduced.</p><p>I’ve put “parsing” in quotes because, rather than implementing a whole parser, the email filter (as the module was called) didn’t attempt to replicate the whole HTML grammar, but rather added custom Ragel patterns just for skipping over comments and tags where emails should not be obfuscated.</p><p>This was a reasonable approach, especially back in 2010 - four years before the HTML5 specification - when all browsers had their own quirky handling of HTML. However, as you can imagine, this approach did not scale well. If you’re trying to work around quirks in other parsers, you start gaining more and more quirks in your own, and then have to work around those too. Meanwhile, new features kept being added that also required modifying HTML on the fly (like automatic insertion of the Google Analytics script), and the existing module seemed to be the best place for them. It grew to handle more and more tags, operations and syntactic edge cases.</p>
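            <p>The Ragel-generated machines are far more capable, but the core idea - a state machine whose state persists across chunk boundaries - can be sketched by hand for the simple `[\w.]+@[\w.]+` pattern (a toy illustration, not the production filter, which also buffers the candidate bytes so it can replace them):</p>
            <pre><code>#include &lt;ctype.h&gt;
#include &lt;stddef.h&gt;

/* Toy byte-at-a-time scanner for [\w.]+@[\w.]+ whose state survives chunk
   boundaries, in the spirit of the Ragel approach described above. It only
   counts matches; a match running to the very end of the input is counted
   once a following non-matching byte is seen. */
typedef struct {
    int state; /* 0 = outside, 1 = local part, 2 = just saw '@', 3 = domain */
} email_scanner_t;

static int is_atom(unsigned char c) {
    return isalnum(c) || c == '_' || c == '.';
}

/* Feed one chunk; returns how many emails ended within this chunk. */
static int email_feed(email_scanner_t *s, const char *chunk, size_t len) {
    int found = 0;
    for (size_t i = 0; i &lt; len; i++) {
        unsigned char c = (unsigned char)chunk[i];
        switch (s-&gt;state) {
        case 0: if (is_atom(c)) s-&gt;state = 1; break;
        case 1: if (c == '@') s-&gt;state = 2;
                else if (!is_atom(c)) s-&gt;state = 0;
                break;
        case 2: s-&gt;state = is_atom(c) ? 3 : 0; break;
        case 3: if (!is_atom(c)) { found++; s-&gt;state = 0; } break;
        }
    }
    return found;
}</code></pre>
            <p>Note how feeding “test@exa” and then “mple.org ” in separate chunks still yields one match: the state carried in the struct replaces the need to re-scan or concatenate the chunks.</p>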
    <div>
      <h4>Now let’s minify…</h4>
      <a href="#now-lets-minify">
        
      </a>
    </div>
    <p>In 2011, Cloudflare decided to also add minification to allow customers to speed up their websites even if they had not employed minification themselves. For that, we decided to use an existing streaming minifier - <a href="https://github.com/brianpane/jitify-core">jitify</a>. It already had NGINX bindings, which made it a great candidate for integration into the existing pipeline.</p><p>Unfortunately, just like most other parsers of that time, including ours described above, it had its own processing rules for HTML, JavaScript and CSS, which weren’t precise but rather tried to parse content on a best-effort basis. This left us with two independent streaming parsers that were incompatible with each other and could produce bugs either individually or in combination.</p>
    <div>
      <h2>v1 : "(Almost) HTML5 Spec compliant parser"</h2>
      <a href="#v1-almost-html5-spec-compliant-parser">
        
      </a>
    </div>
    <p>Over the years engineers kept adding new features to the ever-growing state machines, while fixing new bugs arising from imprecise syntax implementations, conflicts between various parsers, and problems in the features themselves.</p><p>By 2016, it was time to get out of the multiple ad hoc parsers business and do things ‘the right way’.</p><p>The next sections will describe how we built our HTML5-compliant parser starting from the specification’s state machine. Using only this state machine, it should have been straightforward to build a parser. However, historically HTML parsing has never been entirely strict, which means that, to avoid breaking existing implementations, building an actual DOM is required for parsing. This is not possible for a streaming rewriter, so a simulator of the parser feedback was developed. In terms of performance, it is always better not to do something, so we then describe why the rewriter can be ‘lazy’ and not perform the expensive decoding and encoding of text when rewriting HTML. Finally, the surprisingly difficult problem of deciding whether a response is HTML is detailed.</p>
    <div>
      <h4>HTML5</h4>
      <a href="#html5">
        
      </a>
    </div>
    <p>By 2016, HTML5 had defined precise syntax rules for parsing and compatibility with legacy content and custom browser implementations. It was already implemented by all browsers and many third-party implementations.</p><p>The <a href="https://html.spec.whatwg.org/multipage/parsing.html">HTML5 parsing specification</a> defines basic HTML syntax in the form of a state machine. We already had experience with <a href="http://www.colm.net/open-source/ragel/">Ragel</a> for similar use cases, so there was no question about what to use for the new streaming parser. Despite the complexity of the grammar, the translation of the specification to Ragel syntax was straightforward. The code looks simpler than the formal description of the state machine, thanks to the ability to mix regex syntax with explicit transitions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lfOyfSPrShxKNNEw5OoF1/7c70383207caeb3741f265725bc5bbf1/image6-1.png" />
            
            </figure><p>A visualisation of a small fraction of the HTML state machine. Source: <a href="https://twitter.com/RReverser/status/715937136520916992">https://twitter.com/RReverser/status/715937136520916992</a></p>
    <div>
      <h3>HTML5 parsing requires a ‘DOM’</h3>
      <a href="#html5-parsing-requires-a-dom">
        
      </a>
    </div>
    <p>However, HTML has a history. To avoid breaking existing implementations, HTML5 is specified with recovery procedures for incorrect tag nesting, ordering, unclosed tags, missing attributes and all the other possible quirks that used to work in older browsers. In order to resolve these issues, the specification expects a tree builder to drive the lexer, essentially meaning you can’t correctly tokenize HTML (split it into separate tags) without a DOM.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WiV28d2K2jxXL3TbpNJJk/2468b04ab896d5ab46721b3446ac8680/image2-2.png" />
            
            </figure><p>HTML parsing flow as defined by the specification</p><p>For this reason, most parsers don’t even try to perform streaming parsing and instead take the input as a whole and produce a document tree as output. This is not something we could do for streaming transformation without adding significant delays to page loading.</p><p>An existing HTML5 JavaScript parser - <a href="https://github.com/inikulin/parse5">parse5</a> - had already implemented spec-compliant tree parsing using a streaming tokenizer and rewriter. To avoid having to create a full DOM, the concept of a “parser feedback simulator” was introduced.</p>
    <div>
      <h4>Tree builder feedback</h4>
      <a href="#tree-builder-feedback">
        
      </a>
    </div>
    <p>As you can guess from the name, this is a module that aims to simulate a full parser’s feedback to the tokenizer, without actually building the whole DOM, but instead preserving only the required information and context necessary for correctly driving the state machine.</p><p>After rigorous testing and upstreaming a test runner to parse5, we found this technique to be suitable for the majority of even poorly written pages on the Internet, and employed it in LazyHTML.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7156cYXNTDlDNGfLV8YCrq/7edc57eb142f4daded43a653d66460fa/image7-1.png" />
            
            </figure><p>LazyHTML architecture</p>
    <div>
      <h3>Avoiding decoding - everything is ASCII</h3>
      <a href="#avoiding-decoding-everything-is-ascii">
        
      </a>
    </div>
    <p>Now that we had a streaming tokenizer working, we wanted to make sure that it was fast enough that users didn’t notice any slowdown to their pages as they went through the parser and transformations. Otherwise it would completely negate any optimisations we’d want to attempt on the fly.</p><p>One obvious thing to avoid is decoding the input. Decoding would not only cause a performance hit, due to decoding and re-encoding any modified HTML content, but would also significantly complicate our implementation due to the multiple sources of potential encoding information required to <a href="https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding">determine the character encoding</a>, including sniffing of the first 1 KB of the content.</p><p>The “living” HTML Standard permits only encodings defined in the <a href="https://encoding.spec.whatwg.org/">Encoding Standard</a>. If we look carefully through those encodings, as well as the remark in the Character encodings section of the HTML spec, we find that all of them are ASCII-compatible with the exception of UTF-16 and ISO-2022-JP.</p><p>This means that any ASCII text will be represented in such encodings exactly as it would be in ASCII, and any non-ASCII text will be represented by bytes outside of the ASCII range. This property allows us to safely tokenize, compare and even modify the original HTML without decoding it or even knowing which particular encoding it uses. This is possible because all the token boundaries in the HTML grammar are represented by ASCII characters.</p><p>We still need to detect UTF-16 by sniffing and either decode or skip such documents without modification. 
We chose the latter, both to avoid potential security-sensitive bugs, which are common with UTF-16 handling, and because, luckily, UTF-16 is seen in less than 0.1% of documents.</p><p>The only issue left with this approach is that in most places the <a href="https://html.spec.whatwg.org/multipage/parsing.html#tokenization">HTML tokenization</a> specification requires you to replace U+0000 (NUL) characters with U+FFFD (replacement character) during parsing. Presumably, this was added as a security precaution against bugs in C implementations of old engines which could treat a NUL character, encoded in ASCII / UTF-8 / ... as a 0x00 byte, as the end of a string (yay, null-terminated strings…). It’s problematic for us because U+FFFD is outside of the ASCII range and will be represented by different sequences of bytes in different encodings. We don’t know the encoding of the document, so this would lead to corruption of the output.</p><p>Luckily, we’re not in the same business as browser vendors and don’t worry about NUL characters in strings as much - we use a “fat pointer” string representation, in which the length of the string is determined not by the position of the NUL character, but stored along with the data pointer as an integer field:</p>
            <pre><code>typedef struct {
   const char *data;
   size_t length;
} lhtml_string_t;</code></pre>
            <p>Instead, we can quietly ignore these parts of the spec (sorry!), keep U+0000 characters as-is, add them as such to tag names, attribute names and other strings, and later re-emit them to the document. This is safe to do because it doesn’t affect any state machine transitions, but merely preserves the original 0x00 bytes and delegates their replacement to the parser in the end user’s browser.</p>
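            <p>The UTF-16 detection mentioned earlier can be sketched as a check of the first two bytes (a simplified illustration: the spec’s full encoding sniffing algorithm also inspects `&lt;meta charset&gt;` declarations, HTTP headers and more):</p>
            <pre><code>#include &lt;stddef.h&gt;

/* Sketch: report whether a document starts with a UTF-16 byte-order mark,
   in which case we skip it rather than risk corrupting it by rewriting. */
static int looks_like_utf16(const unsigned char *buf, size_t len) {
    if (len &lt; 2) return 0;
    return (buf[0] == 0xFF &amp;&amp; buf[1] == 0xFE)   /* UTF-16LE BOM */
        || (buf[0] == 0xFE &amp;&amp; buf[1] == 0xFF);  /* UTF-16BE BOM */
}</code></pre>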
    <div>
      <h3>Content type madness</h3>
      <a href="#content-type-madness">
        
      </a>
    </div>
    <p>We want to be lazy and minimise false positives. We only want to spend time parsing, decoding and rewriting actual HTML, rather than breaking images or JSON. So the question is: how do you decide whether something is an HTML document? Can you just use the Content-Type header, for example? A comment left in the source code best describes the reality.</p>
            <pre><code>/*
Dear future generations. I didn't like this hack either and hoped
we could do the right thing instead. Unfortunately, the Internet
was a bad and scary place at the moment of writing. If this
ever changes and websites become more standards compliant,
please do remove it just like I tried.
Many websites use PHP which sets Content-Type: text/html by
default. There is no error or warning if you don't provide your
own, so most websites don't bother to change it and serve
JSON API responses, private keys and binary data like images
with this default Content-Type, which we would happily try to
parse and transform. This not only hurts performance, but also
easily breaks response data itself whenever some sequence inside
it happens to look like a valid HTML tag that we are interested
in. It gets even worse when JSON contains valid HTML inside of it
and we treat it as such, and append random scripts to the end
breaking APIs critical for popular web apps.
This hack attempts to mitigate the risk by ensuring that the
first significant character (ignoring whitespaces and BOM)
is actually `&lt;` - which increases the chances that it's indeed HTML.
That way we can potentially skip some responses that otherwise
could be rendered by a browser as part of AJAX response, but this
is still better than the opposite situation.
*/</code></pre>
            <p>The reader might think that this is a rare edge case; however, our observations show that almost 25% of the traffic served through Cloudflare with the “text/html” content type is unlikely to be HTML.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/a7BORxQP4K6bUBa06Mr4z/fc6f1596815a90154d35f4d2bcc6b958/image9-1.png" />
            
            </figure><p>The trouble doesn’t end there: it turns out that there is a considerable amount of XML content served with the “text/html” content type, which can’t always be processed correctly when treated as HTML.</p><p>Over time, bailouts for binary data, JSON, AMP, and correctly identifying HTML fragments led to content sniffing logic that can be described by the following diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2aEUXB2iTqRDZIlMxwBZKQ/12b43668197625afce880c56603a7516/image4-1.png" />
            
            </figure><p>This is a good example of divergence between formal specifications and reality.</p>
    <div>
      <h3>Tag name comparison optimisation</h3>
      <a href="#tag-name-comparison-optimisation">
        
      </a>
    </div>
    <p>But just having fast parsing is not enough - we have functionality that consumes the output of the parser, rewrites it and feeds it back for serialization. All the memory and time constraints that apply to the parser apply to this code as well, as it is part of the same content processing pipeline.</p><p>It’s a common requirement to compare parsed HTML tag names, e.g. to determine if the current tag should be rewritten or not. A naive implementation will use regular per-byte comparison, which can require traversing the whole tag name. We were able to narrow this operation down to a single integer comparison instruction in the majority of cases by using a specially designed hashing algorithm.</p><p>The tag names of all <a href="https://html.spec.whatwg.org/multipage/semantics.html#semantics">standard HTML elements</a> contain only alphabetical ASCII characters and the digits from 1 to 6 (in the numbered heading tags, i.e. &lt;h1&gt; to &lt;h6&gt;). Comparison of tag names is case-insensitive, so we only need 26 characters to represent the alphabetical characters. Using the same basic idea as <a href="https://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic coding</a>, we can represent each of the possible 32 characters of a tag name using just 5 bits and, thus, fit up to <i>floor(64 / 5) = 12</i> characters in a 64-bit integer, which is enough for all the standard tag names and any other tag names that satisfy the same requirements! The great part is that we don’t even need to traverse a tag name separately to hash it - we can do it as we parse the tag name, consuming the input byte by byte.</p><p>However, there is one problem with this hashing algorithm, and the culprit is not so obvious: to fit all 32 characters in 5 bits we need to use all possible bit combinations, including 00000. 
This means that if the leading character of a tag name were represented by 00000, we would not be able to differentiate between varying numbers of consecutive repetitions of that character.</p><p>For example, if ‘a’ were encoded as 00000 and ‘b’ as 00001:</p><table>
    <tr>
    <th>Tag name</th>
    <th>Bit representation</th>
    <th>Encoded value</th>
    </tr>
    <tr>
        <td>ab</td>
        <td>00000 00001</td>
        <td>1</td>
    </tr>
    <tr>
        <td>aab</td>
        <td>00000 00000 00001</td>
        <td>1</td>
    </tr>
</table><p>Luckily, we know that HTML grammar <a href="https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state">doesn’t allow</a> the first character of a tag name to be anything except an ASCII alphabetical character, so reserving numbers from 0 to 5 (00000b-00101b) for digits and numbers from 6 to 31 (00110b - 11111b) for ASCII alphabetical characters solves the problem.</p>
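            <p>A hedged reconstruction of this scheme (the function name is ours; the real implementation lives in LazyHTML / LOL HTML): digits ‘1’-‘6’ encode as 0-5, letters as 6-31, five bits per character packed into a 64-bit integer as the name is consumed:</p>
            <pre><code>#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Sketch of the 5-bits-per-character, case-insensitive tag name hash.
   Returns 0 for names that are too long or contain other characters;
   since HTML tag names must start with a letter (which encodes as 6-31),
   no valid tag name hashes to 0. */
static uint64_t tag_name_hash(const char *name, size_t len) {
    if (len == 0 || len &gt; 12) return 0; /* floor(64 / 5) = 12 chars max */
    uint64_t hash = 0;
    for (size_t i = 0; i &lt; len; i++) {
        char c = name[i];
        uint64_t code;
        if (c &gt;= '1' &amp;&amp; c &lt;= '6')      code = (uint64_t)(c - '1');     /* 0..5  */
        else if (c &gt;= 'a' &amp;&amp; c &lt;= 'z') code = (uint64_t)(c - 'a') + 6; /* 6..31 */
        else if (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') code = (uint64_t)(c - 'A') + 6; /* 6..31 */
        else return 0;
        hash = (hash &lt;&lt; 5) | code;
    }
    return hash;
}</code></pre>
            <p>With the first character guaranteed to encode to a non-zero value, ‘ab’ and ‘aab’ now hash differently, while ‘DIV’ and ‘div’ hash identically, so a tag check becomes a single integer comparison.</p>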
    <div>
      <h3>LazyHTML</h3>
      <a href="#lazyhtml">
        
      </a>
    </div>
    <p>After taking everything mentioned above into consideration, the LazyHTML (<a href="https://github.com/cloudflare/lazyhtml">https://github.com/cloudflare/lazyhtml</a>) library was created. It is a fast streaming HTML parser and serializer with a token-based C API, derived from the HTML5 lexer written in Ragel. It provides a pluggable transformation pipeline that allows multiple transformation handlers to be chained together.</p><p>An example of a function that transforms the `href` attribute of links:</p>
            <pre><code>// define static string to be used for replacements
static const lhtml_string_t REPLACEMENT = {
   .data = "[REPLACED]",
   .length = sizeof("[REPLACED]") - 1
};

static void token_handler(lhtml_token_t *token, void *extra /* this can be your state */) {
  if (token-&gt;type == LHTML_TOKEN_START_TAG) { // we're interested only in start tags
    const lhtml_token_starttag_t *tag = &amp;token-&gt;start_tag;
    if (tag-&gt;type == LHTML_TAG_A) { // check whether tag is of type &lt;a&gt;
      const size_t n_attrs = tag-&gt;attributes.count;
      lhtml_attribute_t *attrs = tag-&gt;attributes.items; // non-const: we assign to attr-&gt;value below
      for (size_t i = 0; i &lt; n_attrs; i++) { // iterate over attributes
        lhtml_attribute_t *attr = &amp;attrs[i];
        if (lhtml_name_equals(attr-&gt;name, "href")) { // match the attribute name
          attr-&gt;value = REPLACEMENT; // set the attribute value
        }
      }
    }
  }
  lhtml_emit(token, extra); // pass transformed token(s) to next handler(s)
}</code></pre>
            
    <div>
      <h3>So, is it correct and how fast is it?</h3>
      <a href="#so-is-it-correct-and-how-fast-is-it">
        
      </a>
    </div>
    <p>It is HTML5 compliant, as tested against the official test suites. As part of the work, several contributions were sent to the specification itself for clarification / simplification of the spec language.</p><p>Unlike the previous parser(s), it didn't bail out on any of the 2,382,625 documents from HTTP Archive, although 0.2% of documents exceeded expected bufferization limits as they were in fact JavaScript or RSS or other types of content incorrectly served with Content-Type: text/html, and, since anything is valid HTML5, the parser tried to parse e.g. a&lt;b; x=3; y=4 as an incomplete tag with attributes. This is very rare (and drops even lower, to 0.03%, when two error-prone advertisement networks are excluded from those results), but still needs to be accounted for and is a valid case for bailing out.</p><p>As for the benchmarks: in September 2016 we used an example which transforms the HTML spec itself (a 7.9 MB HTML file) by replacing every `href` attribute (only that attribute, and only in link tags, as in the example above) with a static value. It was compared against the few existing and popular HTML parsers (only tokenization mode was used for a fair comparison, so that they didn't need to build an AST and so on), and timings in milliseconds for 100 iterations are the following (lazy mode means that we're using raw strings whenever possible; the other mode serializes each token, just for comparison):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BUi5lG0xlZjW0jjZuSL6S/68b1cccd68c8fe5fec844de53e80dbee/image5-1.png" />
            
            </figure><p>The results show that the LazyHTML parser is around an order of magnitude faster than the alternatives.</p><p>That concludes the first post in our series on HTML rewriters at Cloudflare. The next post describes how we built a new streaming rewriter on top of the ideas of LazyHTML. The major update was to provide an easier-to-use CSS selector API. It provides the back end for the Cloudflare Workers HTMLRewriter JavaScript API.</p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Workers Sites]]></category>
            <category><![CDATA[JavaScript]]></category>
            <guid isPermaLink="false">6ME5A32CsufvgVgNpOie59</guid>
            <dc:creator>Andrew Galloni</dc:creator>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Faster script loading with BinaryAST?]]></title>
            <link>https://blog.cloudflare.com/binary-ast/</link>
            <pubDate>Fri, 17 May 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ BinaryAST is a new over-the-wire format for JavaScript that aims to speed up parsing while keeping the semantics of the original JavaScript intact. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h3>JavaScript Cold starts</h3>
      <a href="#javascript-cold-starts">
        
      </a>
    </div>
    <p>The <a href="https://www.cloudflare.com/application-services/products/">performance of applications</a> on the web platform is becoming increasingly bottlenecked by startup (load) time. Large amounts of JavaScript code are required to create the rich web experiences that we’ve become used to. When we look at the total size of JavaScript requested on mobile devices from <a href="https://httparchive.org/reports/state-of-javascript#bytesJs">HTTPArchive</a>, we see that an average page loads 350KB of JavaScript, while 10% of pages go over the 1MB threshold. The rise of more complex applications can push these numbers even higher.</p><p>While caching helps, popular websites regularly release new code, which makes cold start (first load) times particularly important. With browsers moving to separate caches for different domains to <a href="https://sirdarckcat.blogspot.com/2019/03/http-cache-cross-site-leaks.html">prevent cross-site leaks</a>, the importance of cold starts is growing even for popular subresources served from CDNs, as they can no longer be safely shared.</p><p>Usually, when talking about cold start performance, the primary factor considered is raw download speed. However, on modern interactive pages one of the other big contributors to cold starts is JavaScript parsing time. This might seem surprising at first, but makes sense - before starting to execute the code, the engine has to first parse the fetched JavaScript, make sure it doesn’t contain any syntax errors and then compile it to the initial bytecode. As networks become faster, parsing and compilation of JavaScript could become the dominant factor.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5CNQyY1zmsuXVJTSwcPTWp/29c4b25ac370568c9946dfb03be196b4/desktop-without-BinJS.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7u0b1HDtALk80vk2bAnizO/b633b495ee777e1107a2a34fe62ad6d5/LowEnd-device-without-BinJS.png" />
            
            </figure><p>The device capability (CPU or memory performance) is the most important factor in the variance of JavaScript parsing times and, correspondingly, in the time to application start. A 1MB JavaScript file <a href="https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/javascript-startup-optimization/">will take on the order</a> of 100 ms to parse on a modern desktop or high-end mobile device, but can take over a second on an average phone (Moto G4).</p><p><a href="https://medium.com/@addyosmani/the-cost-of-javascript-in-2018-7d8950fbb5d4">A more detailed post</a> on the overall cost of parsing, compiling and execution of JavaScript shows how the JavaScript boot time can vary on different mobile devices. For example, in the case of <a href="https://news.google.com/">news.google.com</a>, it can range from 4s on a Pixel 2 to 28s on a low-end device.</p><p>While engines continuously improve raw parsing performance, with V8 in particular <a href="https://twitter.com/mathias/status/1125096214641254400">doubling it</a> over the past year, as well as moving more things off the main thread, parsers still have to do lots of potentially unnecessary work that consumes memory and battery and might delay the processing of useful resources.</p>
    <div>
      <h3>The “BinaryAST” Proposal</h3>
      <a href="#the-binaryast-proposal">
        
      </a>
    </div>
    <p>This is where BinaryAST comes in. BinaryAST is a new over-the-wire format for JavaScript, proposed and actively developed by Mozilla, that aims to speed up parsing while keeping the semantics of the original JavaScript intact. It does so by using an efficient binary representation for code and data structures, as well as by storing and providing extra information to guide the parser ahead of time.</p><p>The name comes from the fact that the format stores the JavaScript source as an AST encoded into a binary file. The specification lives at <a href="https://tc39.github.io/proposal-binary-ast/">tc39.github.io/proposal-binary-ast</a> and is being worked on by engineers from Mozilla, Facebook, Bloomberg and Cloudflare.</p><blockquote><p>“Making sure that web applications start quickly is one of the most important, but also one of the most challenging parts of web development. We know that BinaryAST can radically reduce startup time, but we need to collect real-world data to demonstrate its impact. Cloudflare's work on enabling use of BinaryAST with Cloudflare Workers is an important step towards gathering this data at scale.”</p><p>Till Schneidereit, Senior Engineering Manager, Developer Technologies, Mozilla</p></blockquote>
    <div>
      <h3>Parsing JavaScript</h3>
      <a href="#parsing-javascript">
        
      </a>
    </div>
    <p>For regular JavaScript code to execute in a browser, the source is parsed into an intermediate representation known as an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">AST</a> that describes the syntactic structure of the code. This representation can then be compiled into bytecode or native machine code for execution.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DIWPcDgNpOEH6BhW3GrO5/487a787830d0b3fb4017a72a1135aaba/without-binAST.png" />
            
            </figure><p>A simple example of adding two numbers can be represented in an AST as:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EIAsDST7S6TpVAT8F4WFU/c0d3acf3dc8971dbd3607f95329d37ce/orpmdqtF-udYY2MUOSzd8gJdsSxdABJMSubMUMfTtI47GMBNFYayHWBZH3gKf70ElJPGXBkDvIQOCmcJbGvrbsD4YKDxa8bsonVFpVOJxMZC6w0noCDH82pCMyjq.jpg" />
            
            </figure><p>Parsing JavaScript is not an easy task; no matter which optimisations you apply, it still requires reading the entire text file character by character, while tracking extra context for syntactic analysis.</p><p>The goal of BinaryAST is to reduce the complexity and the amount of work the browser parser has to do overall, by providing additional information and context at the time and place where the parser needs it.</p><p>To execute JavaScript delivered as BinaryAST, the only steps required are:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eomG3q2K9lSST6Hzsxq2R/870db1cfa20c98105a04ee4dc222e313/With-BinAST.png" />
            
            </figure><p>Another benefit of BinaryAST is that it makes it possible to parse only the critical code necessary for start-up, completely skipping over the unused bits. This can dramatically improve the initial loading time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1b2b1rL6qXh2czmgHxaxR3/06b3936abfb6b8c4b94068349d4ed131/desktop-without-BinJS-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h7xWlH4bIj2DEcG9EXKSO/fd214ce71f10e2591e82d5dbc2b32af9/desktop-with-BinJS.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QU4tWZEOZIhC534xVp9Bs/a69c99362a2881857e3a2689f0394961/LowEnd-device-without-BinJS-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/x0v3uvopVvmibsIJVBTXJ/5500ef41ad596746da51c2e5c6c2aabb/LowEnd-device-with-BinJS-1.png" />
            
            </figure><p>This post will now describe some of the challenges of parsing JavaScript in more detail, explain how the proposed format addressed them, and how we made it possible to run its encoder in Workers.</p>
    <div>
      <h3>Hoisting</h3>
      <a href="#hoisting">
        
      </a>
    </div>
    <p>JavaScript relies on hoisting for all declarations - variables, functions, classes. Hoisting is a property of the language that allows you to declare items after the point they’re syntactically used.</p><p>Let's take the following example:</p>
            <pre><code>function f() {
	return g();
}

function g() {
	return 42;
}</code></pre>
            <p>Here, when the parser is looking at the body of <code>f</code>, it doesn’t know yet what <code>g</code> is referring to - it could be an already existing global function or something declared further in the same file - so it can’t finalise parsing of the original function and start the actual compilation.</p><p>BinaryAST fixes this by storing all the scope information and making it available upfront before the actual expressions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Gfo2bgKDwFzYt4UDUAEbn/f91e9b797c2e5c8b16546e4d8fa79d73/global-scope_2x-1.png" />
            
            </figure><p>As shown by the difference between the initial AST and the enhanced AST in a JSON representation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73zazZkCUkmjWppTBdWMIq/00cb51447f344a037eb44cefcbc89e87/1T5U7zKHwc_PuZ7heYpDuQ5HMTQpFiqO-wmz6Vncm7ycWNe65Xvm3PvlFKAtj89vqesgymqj-H9_6-kohc6TLrRkyLwJ5PNIAEOIZSypZpQGAFWFpAIHjsAhLegG.png" />
            
            </figure>
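<p>To give a rough feel for the shape of that information (an illustrative structure only, not the actual BinaryAST schema), a scope-annotated script might list its declared names before any of the function bodies:</p>

```javascript
// Hypothetical JSON-like shape: the scope node lists every declared
// name before any function body appears, so the parser knows what "g"
// refers to inside "f" without having to read ahead.
const annotatedScript = {
  type: "Script",
  scope: { declaredNames: ["f", "g"] },
  statements: [
    { type: "FunctionDeclaration", name: "f" /* body elided */ },
    { type: "FunctionDeclaration", name: "g" /* body elided */ },
  ],
};

console.log(annotatedScript.scope.declaredNames.includes("g")); // true
```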
    <div>
      <h3>Lazy parsing</h3>
      <a href="#lazy-parsing">
        
      </a>
    </div>
    <p>One common technique used by modern engines to improve parsing times is lazy parsing. It utilises the fact that lots of websites include more JavaScript than they actually need, especially for the start-up.</p><p>Working around this involves a set of heuristics that try to guess when any given function body in the code can be safely skipped by the parser initially and delayed for later. A common example of such heuristic is immediately running the full parser for any function that is wrapped into parentheses:</p>
            <pre><code>(function(...</code></pre>
            <p>Such a prefix usually indicates that the following function is going to be an IIFE (immediately-invoked function expression), so the parser can assume that it will be compiled and executed ASAP and wouldn’t benefit from being skipped over and delayed for later.</p>
            <pre><code>(function() {
	…
})();</code></pre>
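<p>A toy version of that heuristic might look like this (purely illustrative; real engines operate on token streams, not raw strings):</p>

```javascript
// If the "function" keyword is directly preceded by "(", assume an
// IIFE and parse its body eagerly; otherwise the body is a candidate
// for being skipped and parsed lazily later.
function shouldParseEagerly(source, functionKeywordIndex) {
  let i = functionKeywordIndex - 1;
  while (i >= 0 && source[i] === " ") i--; // skip whitespace
  return source[i] === "(";
}

console.log(shouldParseEagerly("(function() {})();", 1)); // true
console.log(shouldParseEagerly("function g() {}", 0));    // false
```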
            <p>These heuristics significantly improve the performance of initial parsing and cold starts, but they’re not completely reliable or trivial to implement.</p><p>One of the reasons is the same as in the previous section - even with lazy parsing, you still need to read the contents, analyse them and store additional scope information for the declarations.</p><p>Another reason is that the JavaScript specification requires reporting any syntax errors immediately during load time, and not when the code is actually executed. A class of these errors, called early errors, covers mistakes like usage of reserved words in invalid contexts, strict mode violations, variable name clashes and more. All of these checks require not only lexing the JavaScript source, but also tracking extra state even during lazy parsing.</p><p>Having to do such extra work means you need to be careful about marking functions as lazy too eagerly, especially if they actually end up being executed during the page load. Otherwise, you’re making cold start costs even worse, as now every function that is erroneously marked as lazy needs to be parsed twice - once by the lazy parser and then again by the full one.</p><p>Because BinaryAST is meant to be an output format of other tools such as Babel, TypeScript and bundlers such as Webpack, the browser parser can rely on the JavaScript being already analysed and verified by the initial parser. This allows it to skip function bodies completely, making lazy parsing essentially free.</p><p>This reduces the cost of completely unused code - while including it is still a problem in terms of network bandwidth (don’t do this!), at least it’s not affecting parsing times any more. These benefits apply equally to code that is used later in the page lifecycle (for example, invoked in response to user actions) but is not required during the startup.</p><p>Last but not least, a benefit of this approach is that BinaryAST encodes lazy annotations as part of the format, giving tools and developers direct and full control over the heuristics. For example, a tool targeting the Web platform or a framework CLI can use its domain-specific knowledge to mark some event handlers as lazy or eager depending on the context and the event type.</p>
    <div>
      <h3>Avoiding ambiguity in parsing</h3>
      <a href="#avoiding-ambiguity-in-parsing">
        
      </a>
    </div>
    <p>Using a text format for a programming language is great for readability and debugging, but it's not the most efficient representation for parsing and execution.</p><p>For example, parsing low-level types like numbers, booleans and even strings from text requires extra analysis and computation, which is unnecessary when you can store them as native binary-encoded values in the first place and read them back directly on the other side.</p><p>Another problem is ambiguity in the grammar itself. It was already an issue in the ES5 world, but could usually be resolved with some extra bookkeeping based on the previously seen tokens. However, in ES6+ there are productions that can be ambiguous all the way through until they’re parsed completely.</p><p>For example, a token sequence like:</p>
            <pre><code>(a, {b: c, d}, [e = 1])...</code></pre>
            <p>can start either a parenthesized comma expression with nested object and array literals and an assignment:</p>
            <pre><code>(a, {b: c, d}, [e = 1]); // it was an expression</code></pre>
            <p>or a parameter list of an arrow expression function with nested object and array patterns and a default value:</p>
            <pre><code>(a, {b: c, d}, [e = 1]) =&gt; … // it was a parameter list</code></pre>
            <p>Both representations are perfectly valid, but have completely different semantics, and you can’t know which one you’re dealing with until you see the final token.</p><p>To work around this, parsers usually have to either backtrack, which can easily get exponentially slow, or parse contents into intermediate node types that are capable of holding both expressions and patterns, with a subsequent conversion. The latter approach preserves linear performance, but makes the implementation more complicated and requires preserving more state.</p><p>In the BinaryAST format this issue doesn't exist in the first place, because the parser sees the type of each node before it even starts parsing its contents.</p>
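<p>To make the ambiguity concrete, here are the two readings of that token shape side by side; they produce completely different results at runtime:</p>

```javascript
let b = 0, d = 0, e;

// Reading 1: a parenthesized comma expression. The object and array
// are plain literals, and the whole expression evaluates to the array
// (assigning e = 1 as a side effect).
const asExpression = (1, { b: 2, d }, [e = 1]);

// Reading 2: the same token shape as an arrow function parameter list,
// where { b: c, d } and [e = 1] are destructuring patterns with a default.
const asParams = (a, { b: c, d }, [e = 1]) => c + d + e;

console.log(asExpression);                      // [1]
console.log(asParams(0, { b: 10, d: 20 }, [])); // 31
```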
    <div>
      <h3>Cloudflare Implementation</h3>
      <a href="#cloudflare-implementation">
        
      </a>
    </div>
    <p>Currently, the format is still in flux, but the very first version of the client-side implementation was released under a flag in Firefox Nightly several months ago. Keep in mind this is only an initial unoptimised prototype, and there are already several experiments changing the format to provide improvements to both size and parsing performance.</p><p>On the producer side, the reference implementation lives at <a href="https://github.com/binast/binjs-ref">github.com/binast/binjs-ref</a>. Our goal was to take this reference implementation and consider how we would deploy it at Cloudflare scale.</p><p>If you dig into the codebase, you will notice that it currently consists of two parts.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1hO4rh3KiWnZt9MSyLXmqU/2755ce821df08853758c8e8e34cc7b35/cf-implementation.png" />
            
            </figure><p>One is the encoder itself, which is responsible for taking a parsed AST, annotating it with scope and other relevant information, and writing out the result in one of the currently supported formats. This part is written in Rust and is fully native.</p><p>The other part is what produces that initial AST - the parser. Interestingly, unlike the encoder, it's implemented in JavaScript.</p><p>Unfortunately, there is currently no battle-tested native JavaScript parser with an open API, let alone one implemented in Rust. There have been a few attempts, but, given the complexity of the JavaScript grammar, it’s better to wait a bit and make sure they’re well-tested before incorporating one into the production encoder.</p><p>On the other hand, over the last few years the JavaScript ecosystem has grown to rely extensively on developer tools implemented in JavaScript itself. In particular, this gave a push to rigorous parser development and testing. There are several JavaScript parser implementations that have been proven to work on thousands of real-world projects.</p><p>With that in mind, it makes sense that the BinaryAST implementation chose to use one of them - in particular, <a href="https://shift-ast.org">Shift</a> - and integrate it with the Rust encoder, instead of attempting to use a native parser.</p>
    <div>
      <h3>Connecting Rust and JavaScript</h3>
      <a href="#connecting-rust-and-javascript">
        
      </a>
    </div>
    <p>Integration is where things get interesting.</p><p>Rust is a native language that can compile to an executable binary, but JavaScript requires a separate engine to be executed. To connect them, we need some way to transfer data between the two without sharing memory.</p><p>Initially, the reference implementation generated JavaScript code with an embedded input on the fly, passed it to Node.js and then read the output when the process had finished. That code contained a call to the Shift parser with an inlined input string and produced the AST back in a JSON format.</p><p>This doesn’t scale well when parsing lots of JavaScript files, so the first thing we did was to transform the Node.js side into a long-lived daemon. Now Rust could spawn the required Node.js process just once and keep passing inputs into it and getting responses back as individual messages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3rCs07kw5ldmfRx8zpvgHM/d1f258228fb9e1e6e7e4a57bdf4e47a2/Connecting-rust-java_2x.png" />
            
            </figure>
    <div>
      <h3>Running in the cloud</h3>
      <a href="#running-in-the-cloud">
        
      </a>
    </div>
    <p>While the Node.js solution worked fairly well after these optimisations, shipping both a Node.js instance and a native bundle to production requires some effort. It's also potentially risky and requires manual sandboxing of both processes to make sure we don’t accidentally start executing malicious code.</p><p>On the other hand, the only thing we needed from Node.js was the ability to run the JavaScript parser code. And we already have an isolated JavaScript engine running in the cloud - <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>! By additionally compiling the native Rust encoder to Wasm (which is quite easy with the native toolchain and <a href="https://rustwasm.github.io/docs/wasm-bindgen/">wasm-bindgen</a>), we can even run both parts of the code in the same process, making cold starts and communication much faster than in the previous model.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7dMlhjj8rPxs3pjUScdbFJ/59d8373896f17b68c628416ddfa15741/V8-isolate-01_2x.png" />
            
            </figure>
    <div>
      <h3>Optimising data transfer</h3>
      <a href="#optimising-data-transfer">
        
      </a>
    </div>
    <p>The next logical step is to reduce the overhead of data transfer. JSON worked fine for communication between separate processes, but with a single process we should be able to retrieve the required bits directly from the JavaScript-based AST.</p><p>To attempt this, we first needed to move away from direct JSON usage to something more generic that would allow us to support various input formats. The Rust ecosystem already has an amazing serialisation framework for that - <a href="https://serde.rs/">Serde</a>.</p><p>Aside from allowing us to be more flexible in regard to the inputs, rewriting to Serde helped an existing native use case too. Now, instead of parsing JSON into an intermediate representation and then walking through it, all the native typed AST structures can be deserialized directly from the stdout pipe of the Node.js process in a streaming manner. This significantly reduced both CPU usage and memory pressure.</p><p>But there is one more thing we can do: instead of serializing and deserializing via an intermediate format (let alone a text format like JSON), we should be able to operate [almost] directly on JavaScript values, saving memory and repetitive work.</p><p>How is this possible? wasm-bindgen provides a type called <code>JsValue</code> that stores a handle to an arbitrary value on the JavaScript side. This handle internally contains an index into a predefined array.</p><p>Each time a JavaScript value is passed to the Rust side as a result of a function call or a property access, it’s stored in this array and an index is sent to Rust. The next time Rust wants to do something with that value, it passes the index back and the JavaScript side retrieves the original value from the array and performs the required operation.</p><p>By reusing this mechanism, we could implement a Serde deserializer that requests only the required values from the JS side and immediately converts them to their native representation. 
It’s now open-sourced under <a href="https://github.com/cloudflare/serde-wasm-bindgen">https://github.com/cloudflare/serde-wasm-bindgen</a>.</p>
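<p>A simplified model of that handle mechanism, with the free list folded into the array itself (an assumption-level sketch of the idea, not wasm-bindgen's actual code):</p>

```javascript
// JavaScript values never cross the Wasm boundary directly; only
// small integer indices into this array do.
const heap = [];
let nextFree = 0; // head of an intrusive free list

function addHeapObject(value) {
  if (nextFree === heap.length) heap.push(heap.length + 1);
  const idx = nextFree;
  nextFree = heap[idx]; // unlink the slot from the free list
  heap[idx] = value;
  return idx; // this handle is what the Rust side receives
}

function getObject(idx) {
  return heap[idx];
}

function dropObject(idx) {
  heap[idx] = nextFree; // relink the slot into the free list
  nextFree = idx;
}
```

Dropping a handle returns its slot to the free list, so a long-running deserialization session can keep reusing a small heap instead of growing it per value.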
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KCuxugOCOU8NjfZopkygl/3933c96b0b558b48bd90bcc368f20c7b/V8-isolate-01-copy_2x.png" />
            
            </figure><p>At first, we got much worse performance out of this due to the overhead of more frequent calls between 1) Wasm and JavaScript (SpiderMonkey has improved these recently, but other engines still lag behind) and 2) JavaScript and C++, which also can’t be optimised well in most engines.</p><p>The JavaScript ↔ C++ overhead comes from the usage of <code>TextEncoder</code> to pass strings between JavaScript and Wasm in wasm-bindgen, and, indeed, it showed up as the highest cost in the benchmark profiles. This wasn’t surprising - after all, strings can appear not only in the value payloads, but also in property names, which have to be serialized and sent between JavaScript and Wasm over and over when using a generic JSON-like structure.</p><p>Luckily, because our deserializer doesn’t have to be compatible with JSON any more, we can use our knowledge of Rust types and cache all the serialized property names as JavaScript value handles just once, and then keep reusing them for further property accesses.</p><p>This, combined with some changes to wasm-bindgen which we have upstreamed, allows our deserializer to be up to 3.5x faster in benchmarks than the original Serde support in wasm-bindgen, while saving ~33% off the resulting code size. Note that for string-heavy data structures it might still be slower than the current JSON-based integration, but the situation is expected to improve over time when the <a href="https://github.com/WebAssembly/reference-types">reference types</a> proposal lands natively in Wasm.</p><p>After implementing and integrating this deserializer, we used the <a href="https://github.com/wasm-tool/wasm-pack-plugin">wasm-pack plugin for Webpack</a> to build a Worker with both Rust and JavaScript parts combined and shipped it to some test zones.</p>
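<p>The property-name caching described above is simple in spirit: pay the string-conversion cost once per distinct name, then reuse the result. A hypothetical JavaScript-side sketch of the idea (names and the UTF-8 decoding path are illustrative, not the actual wasm-bindgen internals):</p>

```javascript
// Decode each property name's UTF-8 bytes only once, then hand back
// the same cached JS string on every later field access instead of
// re-crossing the string-conversion boundary per property.
const nameCache = new Map();
const decoder = new TextDecoder();

function internPropertyName(utf8Bytes) {
  const key = utf8Bytes.join(",");
  let name = nameCache.get(key);
  if (name === undefined) {
    name = decoder.decode(Uint8Array.from(utf8Bytes));
    nameCache.set(key, name);
  }
  return name;
}

console.log(internPropertyName([108, 101, 102, 116])); // "left"
```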
    <div>
      <h3>Show me the numbers</h3>
      <a href="#show-me-the-numbers">
        
      </a>
    </div>
    <p>Keep in mind that this proposal is in very early stages, and current benchmarks and demos are not representative of the final outcome (which should improve the numbers much further).</p><p>As mentioned earlier, BinaryAST can mark functions that should be parsed lazily ahead of time. By using different levels of lazification in the encoder (<a href="https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43">https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43</a>) and running tests against some popular JavaScript libraries, we found the following speed-ups.</p>
    <div>
      <h4>Level 0 (no functions are lazified)</h4>
      <a href="#level-0-no-functions-are-lazified">
        
      </a>
    </div>
    <p>With lazy parsing disabled in both parsers we got a raw parsing speed improvement of between 3 and 13%.</p><table><tr><td><p><b>Name</b></p></td><td><p><b>Source size (kb)</b></p></td><td><p><b>JavaScript parse time (average ms)</b></p></td><td><p><b>BinaryAST parse time (average ms)</b></p></td><td><p><b>Diff (%)</b></p></td></tr><tr><td><p>React</p></td><td><p>20</p></td><td><p>0.403</p></td><td><p>0.385</p></td><td><p>-4.56</p></td></tr><tr><td><p>D3 (v5)</p></td><td><p>240</p></td><td><p>11.178</p></td><td><p>10.525</p></td><td><p>-6.018</p></td></tr><tr><td><p>Angular</p></td><td><p>180</p></td><td><p>6.985</p></td><td><p>6.331</p></td><td><p>-9.822</p></td></tr><tr><td><p>Babel</p></td><td><p>780</p></td><td><p>21.255</p></td><td><p>20.599</p></td><td><p>-3.135</p></td></tr><tr><td><p>Backbone</p></td><td><p>32</p></td><td><p>0.775</p></td><td><p>0.699</p></td><td><p>-10.312</p></td></tr><tr><td><p>wabtjs</p></td><td><p>1720</p></td><td><p>64.836</p></td><td><p>59.556</p></td><td><p>-8.489</p></td></tr><tr><td><p>Fuzzball (1.2)</p></td><td><p>72</p></td><td><p>3.165</p></td><td><p>2.768</p></td><td><p>-13.383</p></td></tr></table>
    <div>
      <h4>Level 3 (functions up to 3 levels deep are lazified)</h4>
      <a href="#level-3-functions-up-to-3-levels-deep-are-lazified">
        
      </a>
    </div>
    <p>But with the lazification set to skip nested functions up to 3 levels deep, we see much more dramatic improvements in parsing time, between 90 and 98%. As mentioned earlier in the post, BinaryAST makes lazy parsing essentially free by completely skipping over the marked functions.</p><table><tr><td><p><b>Name</b></p></td><td><p><b>Source size (kb)</b></p></td><td><p><b>JavaScript parse time (average ms)</b></p></td><td><p><b>BinaryAST parse time (average ms)</b></p></td><td><p><b>Diff (%)</b></p></td></tr><tr><td><p>React</p></td><td><p>20</p></td><td><p>0.407</p></td><td><p>0.032</p></td><td><p>-92.138</p></td></tr><tr><td><p>D3 (v5)</p></td><td><p>240</p></td><td><p>11.623</p></td><td><p>0.224</p></td><td><p>-98.073</p></td></tr><tr><td><p>Angular</p></td><td><p>180</p></td><td><p>7.093</p></td><td><p>0.680</p></td><td><p>-90.413</p></td></tr><tr><td><p>Babel</p></td><td><p>780</p></td><td><p>21.100</p></td><td><p>0.895</p></td><td><p>-95.758</p></td></tr><tr><td><p>Backbone</p></td><td><p>32</p></td><td><p>0.898</p></td><td><p>0.045</p></td><td><p>-94.989</p></td></tr><tr><td><p>wabtjs</p></td><td><p>1720</p></td><td><p>59.802</p></td><td><p>1.601</p></td><td><p>-97.323</p></td></tr><tr><td><p>Fuzzball (1.2)</p></td><td><p>72</p></td><td><p>2.937</p></td><td><p>0.089</p></td><td><p>-96.970</p></td></tr></table><p>All the numbers are from manual tests on a Linux x64 Intel i7 with 16 GB of RAM.</p><p>While these synthetic benchmarks are impressive, they are not representative of real-world scenarios. Normally you will use at least some of the loaded JavaScript during the startup. 
To check this scenario, we decided to test some realistic pages and demos on desktop and mobile Firefox and found speed-ups in page loads too.</p><p>For a sample application (<a href="https://github.com/cloudflare/binjs-demo">https://github.com/cloudflare/binjs-demo</a>, <a href="https://serve-binjs.that-test.site/">https://serve-binjs.that-test.site/</a>) which weighed in at around 1.2 MB of JavaScript we got the following numbers for initial script execution:</p><table><tr><td><p><b>Device</b></p></td><td><p><b>JavaScript</b></p></td><td><p><b>BinaryAST</b></p></td></tr><tr><td><p>Desktop</p></td><td><p>338ms</p></td><td><p>314ms</p></td></tr><tr><td><p>Mobile (HTC One M8)</p></td><td><p>2019ms</p></td><td><p>1455ms</p></td></tr></table><p>Here is a video that will give you an idea of the improvement as seen by a user on mobile Firefox (in this case showing the entire page startup time):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EpLne9c9yziJOCzYLUAe6/e2e1f52cece43d6557d4728a17463a2e/binast.gif" />
            
            </figure><p>Next step is to start gathering data on real-world websites, while improving the underlying format.</p>
    <div>
      <h3>How do I test BinaryAST on my website?</h3>
      <a href="#how-do-i-test-binaryast-on-my-website">
        
      </a>
    </div>
    <p>We’ve open-sourced our Worker so that it can be installed on any Cloudflare zone: <a href="https://github.com/binast/binjs-ref/tree/cf-wasm">https://github.com/binast/binjs-ref/tree/cf-wasm</a>.</p><p>One thing to currently be wary of is that, even though the result gets stored in the cache, the initial encoding is still an expensive process, and might easily hit CPU limits on any non-trivial JavaScript files and fall back to the un-encoded variant. We are working to improve this situation by releasing the BinaryAST encoder as a separate feature with more relaxed limits over the next few days.</p><p>Meanwhile, if you want to play with BinaryAST on larger real-world scripts, an alternative option is to use the static <code>binjs_encode</code> tool from <a href="https://github.com/binast/binjs-ref">https://github.com/binast/binjs-ref</a> to pre-encode JavaScript files ahead of time. Then, you can use the Worker from <a href="https://github.com/cloudflare/binast-cf-worker">https://github.com/cloudflare/binast-cf-worker</a> to serve the resulting BinaryAST assets when supported and requested by the browser.</p><p>On the client side, you’ll currently need to download <a href="https://www.mozilla.org/en-US/firefox/channel/desktop/#nightly">Firefox Nightly</a>, go to <code>about:config</code> and enable unrestricted BinaryAST support via the following options:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HbSv4S41klPC6XblrjxEd/cb9d147f349a29cd84e9d6127327efb2/sFlcLvgSb04T7bAtR8cF7Blnkn7pvQhswCjynrLIpUDtVBBnY0VVg3Bu5v1CCWG_dvjHcGXUvMyJpGr2Nf4wag1Kd381l1OQbZpQmFZmvNq15vsieMC4m5ShEZU8.png" />
            
            </figure><p>Now, when opening a website with either of the Workers installed, Firefox will get BinaryAST instead of JavaScript automatically.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>The amount of JavaScript in modern apps presents performance challenges for all consumers. Engine vendors are experimenting with different ways to improve the situation - some are focusing on raw decoding performance, some on parallelizing operations to reduce overall latency, some are researching new optimised formats for data representation, and some are inventing and improving protocols for network delivery.</p><p>No matter which one it is, we all share the goal of making the Web better and faster. On Cloudflare's side, we're always excited to collaborate with all the vendors and combine various approaches to bring that goal closer with every step.</p> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[JavaScript]]></category>
            <guid isPermaLink="false">5WBO0SkrdMzVGxax4hM6pS</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building fast interpreters in Rust]]></title>
            <link>https://blog.cloudflare.com/building-fast-interpreters-in-rust/</link>
            <pubDate>Mon, 04 Mar 2019 16:00:00 GMT</pubDate>
            <description><![CDATA[ In the previous post we described the Firewall Rules architecture and how the different components are integrated together. We created a configurable Rust library for writing and executing Wireshark®-like filters in different parts of our stack written in Go, Lua, C, C++ and JavaScript Workers. ]]></description>
            <content:encoded><![CDATA[ <p>In the <a href="/how-we-made-firewall-rules/">previous post</a> we described the Firewall Rules architecture and how the different components are integrated together. We also mentioned that we created a configurable Rust library for writing and executing <a href="https://www.wireshark.org/">Wireshark</a>®-like filters in different parts of our stack written in Go, Lua, C, C++ and JavaScript Workers.</p><blockquote><p>With a mixed set of requirements of performance, memory safety, low memory use, and the capability to be part of other products that we’re working on like Spectrum, Rust stood out as the strongest option.</p></blockquote>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3emjeRzzAw9z6ipj1FIjoD/fbfb5538cf10d6a5f0c096676dabfa63/Langs.png" />
            
            </figure><p>We have now open-sourced this library under our Github account: <a href="https://github.com/cloudflare/wirefilter">https://github.com/cloudflare/wirefilter</a>. This post will dive into its design, explain why we didn’t use a parser generator and how our execution engine balances security, runtime performance and compilation cost for the generated filters.</p>
    <div>
      <h3>Parsing Wireshark syntax</h3>
      <a href="#parsing-wireshark-syntax">
        
      </a>
    </div>
    <p>When building a custom Domain Specific Language (DSL), the first thing we need to be able to do is parse it. This should result in an intermediate representation (usually called an Abstract Syntax Tree) that can be inspected, traversed, analysed and, potentially, serialised.</p><p>There are different ways to perform such conversion, such as:</p><ol><li><p>Manual char-by-char parsing using state machines, regular expression and/or native string APIs.</p></li><li><p>Parser combinators, which use higher-level functions to combine different parsers together (in Rust-land these are represented by <a href="https://github.com/Geal/nom">nom</a>, <a href="https://github.com/m4rw3r/chomp">chomp</a>, <a href="https://github.com/Marwes/combine">combine</a> and <a href="https://crates.io/keywords/parser-combinators">others</a>).</p></li><li><p>Fully automated generators which, provided with a grammar, can generate a fully working parser for you (examples are <a href="https://github.com/kevinmehall/rust-peg">peg</a>, <a href="https://github.com/pest-parser/pest">pest</a>, <a href="https://github.com/lalrpop/lalrpop">LALRPOP</a>, etc.).</p></li></ol>
    <div>
      <h4>Wireshark syntax</h4>
      <a href="#wireshark-syntax">
        
      </a>
    </div>
    <p>But before trying to figure out which approach would work best for us, let’s take a look at some of the simple <a href="https://wiki.wireshark.org/DisplayFilters">official Wireshark examples</a>, to understand what we’re dealing with:</p><ul><li><p><code>ip.len le 1500</code></p></li><li><p><code>udp contains 81:60:03</code></p></li><li><p><code>sip.To contains "a1762"</code></p></li><li><p><code>http.request.uri matches "gl=se$"</code></p></li><li><p><code>eth.dst == ff:ff:ff:ff:ff:ff</code></p></li><li><p><code>ip.addr == 192.168.0.1</code></p></li><li><p><code>ipv6.addr == ::1</code></p></li></ul><p>You can see that the right hand side of a comparison can be a number, an IPv4 / IPv6 address, a set of bytes or a string. They are used interchangeably, without any special notion of a type, which is fine given that they are easily distinguishable… or are they?</p><p>Let’s take a look at some <a href="https://en.wikipedia.org/wiki/IPv6#Address_representation">IPv6 forms</a> on Wikipedia:</p><ul><li><p><code>2001:0db8:0000:0000:0000:ff00:0042:8329</code></p></li><li><p><code>2001:db8:0:0:0:ff00:42:8329</code></p></li><li><p><code>2001:db8::ff00:42:8329</code></p></li></ul><p>So IPv6 can be written as a set of up to 8 colon-separated hexadecimal numbers, each containing up to 4 digits with leading zeros omitted for convenience. This appears suspiciously similar to the syntax for byte sequences. 
Indeed, if we try writing out a sequence like <code>2f:31:32:33:34:35:36:37</code>, it’s simultaneously a valid IPv6 address and a byte sequence in terms of Wireshark syntax.</p><p>There is no way of telling what this sequence actually represents without looking at the type of the field it’s being compared with, and if you try using this sequence in Wireshark, you’ll notice that it does just that:</p><ul><li><p><code>ipv6.addr == 2f:31:32:33:34:35:36:37</code>: right hand side is parsed and used as an IPv6 address</p></li><li><p><code>http.request.uri == 2f:31:32:33:34:35:36:37</code>: right hand side is parsed and used as a byte sequence (will match a URL <code>"/1234567"</code>)</p></li></ul><p>Are there other examples of such ambiguities? Yup - for example, we can try using a single number with two decimal digits:</p><ul><li><p><code>tcp.port == 80</code>: matches any traffic on port 80 (HTTP)</p></li><li><p><code>http.file_data == 80</code>: matches any HTTP request/response with a body containing a single byte (0x80)</p></li></ul><p>We could also do the same with an Ethernet address, which is defined as a separate type in Wireshark, but, for simplicity, we represent it as a regular byte sequence in our implementation, so there is no ambiguity here.</p>
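<p>To make the ambiguity concrete, here is a minimal sketch of type-directed parsing of a right-hand side. The names (<code>RhsValue</code>, <code>parse_rhs</code>, the type strings) are invented for illustration and are not the actual wirefilter API:</p>

```rust
// Hedged sketch: the same text parses to different values depending on
// the declared type of the field it is compared against.
use std::net::Ipv6Addr;

#[derive(Debug, PartialEq)]
enum RhsValue {
    Ip(Ipv6Addr),
    Bytes(Vec<u8>),
}

// Parse the right-hand side *according to the field type* --
// the text alone is ambiguous.
fn parse_rhs(field_type: &str, text: &str) -> Option<RhsValue> {
    match field_type {
        "ipv6" => text.parse::<Ipv6Addr>().ok().map(RhsValue::Ip),
        "bytes" => text
            .split(':')
            .map(|part| u8::from_str_radix(part, 16).ok())
            .collect::<Option<Vec<u8>>>()
            .map(RhsValue::Bytes),
        _ => None,
    }
}

fn main() {
    let text = "2f:31:32:33:34:35:36:37";
    // Compared with an IPv6 field, it parses as an address...
    assert!(matches!(parse_rhs("ipv6", text), Some(RhsValue::Ip(_))));
    // ...but compared with a bytes field, it's the ASCII string "/1234567".
    assert_eq!(
        parse_rhs("bytes", text),
        Some(RhsValue::Bytes(b"/1234567".to_vec()))
    );
}
```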
    <div>
      <h4>Choosing a parsing approach</h4>
      <a href="#choosing-a-parsing-approach">
        
      </a>
    </div>
    <p>This is an interesting syntax design decision. It means that we need to store a mapping between field names and types ahead of time - a Scheme, as we call it - and use it for contextual parsing. This restriction also immediately rules out many if not most parser generators.</p><p>We could still use one of the more sophisticated ones (like LALRPOP) that allow replacing the default regex-based lexer with your own custom code, but at that point we’re so close to having a full parser for our DSL that the complexity outweighs any benefits of using a black-box parser generator.</p><p>Instead, we went with a manual parsing approach. While (for a good reason) this might sound scary in unsafe languages like C / C++, in Rust all strings are bounds checked by default. Rust also provides a rich string manipulation API, which we can use to build more complex helpers, eventually ending up with a full parser.</p><p>This approach is, in fact, pretty similar to parser combinators in that the parser doesn’t have to keep state and only passes the unprocessed part of the input down to smaller, narrower-scoped functions. Just as in parser combinators, the absence of mutable state also makes it easy to test and maintain each of the parsers for different parts of the syntax independently of the others.</p><p>Compared with popular parser combinator libraries in Rust, one of the differences is that our parsers are not standalone functions but rather types that implement common traits:</p>
            <pre><code>pub trait Lex&lt;'i&gt;: Sized {
   fn lex(input: &amp;'i str) -&gt; LexResult&lt;'i, Self&gt;;
}
pub trait LexWith&lt;'i, E&gt;: Sized {
   fn lex_with(input: &amp;'i str, extra: E) -&gt; LexResult&lt;'i, Self&gt;;
}</code></pre>
            <p>The <code>lex</code> method or its contextual variant <code>lex_with</code> can either return a successful pair of <code>(instance of the type, rest of input)</code> or a pair of <code>(error kind, relevant input span)</code>.</p>
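<p>As a toy illustration of that contract, here is what an implementation of this shape could look like. The <code>BoolLit</code> type and this simplified <code>LexResult</code> are invented for the example, not taken from wirefilter:</p>

```rust
// Simplified stand-in: success yields (value, rest of input),
// failure yields (error kind, relevant input span).
type LexResult<'i, T> = Result<(T, &'i str), (&'static str, &'i str)>;

trait Lex<'i>: Sized {
    fn lex(input: &'i str) -> LexResult<'i, Self>;
}

struct BoolLit(bool);

impl<'i> Lex<'i> for BoolLit {
    fn lex(input: &'i str) -> LexResult<'i, Self> {
        // Consume a literal prefix and hand the remainder back to the caller.
        if let Some(rest) = input.strip_prefix("true") {
            Ok((BoolLit(true), rest))
        } else if let Some(rest) = input.strip_prefix("false") {
            Ok((BoolLit(false), rest))
        } else {
            Err(("expected a boolean literal", input))
        }
    }
}

fn main() {
    let (lit, rest) = BoolLit::lex("true && tcp.port == 80").unwrap();
    assert!(lit.0);
    assert_eq!(rest, " && tcp.port == 80");
    assert!(BoolLit::lex("maybe").is_err());
}
```

Because each parser only borrows the unconsumed tail of the input, parsers like this compose by plain function calls, which is exactly how the snippets below chain them.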
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5L9MIL21iug4jVm8Eo1bGM/a4996c058b046ea785ff40d315772c53/parse.png" />
            
            </figure><p>The <code>Lex</code> trait is used for target types that can be parsed independently of the context (like field names or literals), while <code>LexWith</code> is used for types that need a <code>Scheme</code> or a part of it to be parsed unambiguously.</p><p>A bigger difference is that, instead of relying on higher-level functions for parser combinators, we use the usual imperative function call syntax. For example, when we want to perform sequential parsing, all we do is call several parsers in a row, using tuple destructuring for intermediate results:</p>
            <pre><code>let input = skip_space(input);
let (op, input) = CombinedExpr::lex_with(input, scheme)?;
let input = skip_space(input);
let input = expect(input, ")")?;</code></pre>
            <p>And, when we want to try different alternatives, we can use native pattern matching and ignore the errors:</p>
            <pre><code>if let Ok(input) = expect(input, "(") {
   ...
   (SimpleExpr::Parenthesized(Box::new(op)), input)
} else if let Ok((op, input)) = UnaryOp::lex(input) {
   ...
} else {
   ...
}</code></pre>
            <p>Finally, when we want to automate parsing of some more complicated common cases - say, enums - Rust provides a powerful macro syntax:</p>
            <pre><code>lex_enum!(#[repr(u8)] OrderingOp {
   "eq" | "==" =&gt; Equal = EQUAL,
   "ne" | "!=" =&gt; NotEqual = LESS | GREATER,
   "ge" | "&gt;=" =&gt; GreaterThanEqual = GREATER | EQUAL,
   "le" | "&lt;=" =&gt; LessThanEqual = LESS | EQUAL,
   "gt" | "&gt;" =&gt; GreaterThan = GREATER,
   "lt" | "&lt;" =&gt; LessThan = LESS,
});</code></pre>
            <p>This gives an experience similar to parser generators, while still using native language syntax and keeping us in control of all the implementation details.</p>
    <div>
      <h3>Execution engine</h3>
      <a href="#execution-engine">
        
      </a>
    </div>
    <p>Because our grammar and operations are fairly simple, initially we used direct AST interpretation by requiring all nodes to implement a trait that includes an <code>execute</code> method.</p>
            <pre><code>trait Expr&lt;'s&gt; {
    fn execute(&amp;self, ctx: &amp;ExecutionContext&lt;'s&gt;) -&gt; bool;
}</code></pre>
    <p>The <code>ExecutionContext</code> is pretty similar to a <code>Scheme</code>, but instead of mapping arbitrary field names to their types, it maps them to the runtime input values provided by the caller.</p><p>As with <code>Scheme</code>, initially <code>ExecutionContext</code> used an internal <code>HashMap</code> for registering these arbitrary <code>String</code> -&gt; <code>RhsValue</code> mappings. During the <code>execute</code> call, the AST implementation would evaluate itself recursively, and look up each field reference in this map, either returning a value or raising an error on missing slots and type mismatches.</p><p>This worked well enough for an initial implementation, but using a <code>HashMap</code> has a non-trivial cost that we wanted to eliminate. We already used a more efficient hasher - <a href="https://github.com/servo/rust-fnv"><code>Fnv</code></a> - because we are in control of all keys and so are not worried about hash DoS attacks, but there was still more we could do.</p>
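<p>Swapping the hasher is a small change. The sketch below hand-rolls FNV-1a on top of the standard library just to show the mechanism (in practice the <code>fnv</code> crate provides a maintained equivalent, and the field name is only an example):</p>

```rust
// Std-only sketch of replacing the default SipHash-based hasher with
// FNV-1a: cheaper to compute, but not hash-DoS resistant, which is fine
// when all keys are under our control.
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

// Same HashMap API as before; only the hashing strategy changes.
type FnvHashMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut values: FnvHashMap<&str, u32> = FnvHashMap::default();
    values.insert("tcp.port", 80);
    assert_eq!(values.get("tcp.port"), Some(&80));
}
```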
    <div>
      <h4>Speeding up field access</h4>
      <a href="#speeding-up-field-access">
        
      </a>
    </div>
    <p>If we look at the data structures involved, we can see that the scheme is always well-defined in advance, and all our runtime values in the execution engine are expected to eventually match it, even if the order or the precise set of fields is not guaranteed:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/mMOLvyXxOj9FxO3dIbYwr/6b308db1a7860c67f52209a689226b56/fieldaccess.png" />
            
            </figure><p>So what if we ditch the second map altogether and instead use a fixed-size array of values? Array indexing should be much cheaper than looking up in a map, so it might be well worth the effort.</p><p>How can we do it? We already know the number of items (thanks to the predefined scheme) so we can use that for the size of the backing storage, and, in order to simulate <code>HashMap</code> “holes” for unset values, we can wrap each item in an <code>Option&lt;...&gt;</code>:</p>
            <pre><code>pub struct ExecutionContext&lt;'e&gt; {
    scheme: &amp;'e Scheme,
    values: Box&lt;[Option&lt;LhsValue&lt;'e&gt;&gt;]&gt;,
}</code></pre>
            <p>The only missing piece is an index that could map both structures to each other. As you might remember, <code>Scheme</code> still uses a <code>HashMap</code> for field registration, and a <code>HashMap</code> is normally expected to be randomised and indexed only by the predefined key.</p><p>While we could wrap a value and an auto-incrementing index together into a custom struct, there is already a better solution: <a href="https://github.com/bluss/indexmap"><code>IndexMap</code></a>. <code>IndexMap</code> is a drop-in replacement for a <code>HashMap</code> that preserves ordering and provides a way to get an index of any element and vice versa - exactly what we needed.</p><p>After replacing a <code>HashMap</code> in the <code>Scheme</code> with <code>IndexMap</code>, we can change parsing to resolve all the parsed field names to their indices in-place and store that in the AST:</p>
            <pre><code>impl&lt;'i, 's&gt; LexWith&lt;'i, &amp;'s Scheme&gt; for Field&lt;'s&gt; {
   fn lex_with(mut input: &amp;'i str, scheme: &amp;'s Scheme) -&gt; LexResult&lt;'i, Self&gt; {
       ...
       let field = scheme
           .get_field_index(name)
           .map_err(|err| (LexErrorKind::UnknownField(err), name))?;
       Ok((field, input))
   }
}</code></pre>
            <p>After that, in the <code>ExecutionContext</code> we allocate a fixed-size array and use these indices for resolving values during runtime:</p>
            <pre><code>impl&lt;'e&gt; ExecutionContext&lt;'e&gt; {
   /// Creates an execution context associated with a given scheme.
   ///
   /// This scheme will be used for resolving any field names and indices.
   pub fn new&lt;'s: 'e&gt;(scheme: &amp;'s Scheme) -&gt; Self {
       ExecutionContext {
           scheme,
           values: vec![None; scheme.get_field_count()].into(),
       }
   }
   ...
}</code></pre>
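<p>The index bookkeeping above can be sketched with a tiny std-only stand-in for <code>IndexMap</code> (the real crate does this efficiently with a hash table; this toy only shows the index contract, and the field names are invented):</p>

```rust
// A toy ordered map exposing the two operations the scheme relies on:
// key -> index at parse time, then plain array indexing at execution time.
struct MiniIndexMap<V> {
    entries: Vec<(String, V)>,
}

impl<V> MiniIndexMap<V> {
    fn new() -> Self {
        MiniIndexMap { entries: Vec::new() }
    }
    fn insert(&mut self, key: &str, value: V) {
        self.entries.push((key.to_string(), value));
    }
    fn get_index_of(&self, key: &str) -> Option<usize> {
        self.entries.iter().position(|(k, _)| k == key)
    }
}

fn main() {
    let mut scheme = MiniIndexMap::new();
    scheme.insert("ip.src", "Ip");
    scheme.insert("tcp.port", "Int");

    // Parsing resolves the field name to a stable index once...
    let idx = scheme.get_index_of("tcp.port").unwrap();
    assert_eq!(idx, 1);

    // ...and the execution context uses it as a plain array slot,
    // with Option<_> standing in for unset values.
    let mut values: Vec<Option<u32>> = vec![None; scheme.entries.len()];
    values[idx] = Some(80);
    assert_eq!(values[idx], Some(80));
}
```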
            <p>This gave a significant (~2x) speed-up on our standard benchmarks:</p><p><i>Before:</i></p>
            <pre><code>test matching ... bench:       2,548 ns/iter (+/- 98)
test parsing  ... bench:     192,037 ns/iter (+/- 21,538)</code></pre>
            <p><i>After:</i></p>
            <pre><code>test matching ... bench:       1,227 ns/iter (+/- 29)
test parsing  ... bench:     197,574 ns/iter (+/- 16,568)</code></pre>
            <p>This change also improved the usability of our API, as any type errors are now detected and reported much earlier, when the values are just being set on the context, and not delayed until filter execution.</p>
    <div>
      <h4>[not] JIT compilation</h4>
      <a href="#not-jit-compilation">
        
      </a>
    </div>
    <p>Of course, as with any respectable DSL, one of the other ideas we had from the beginning was “...at some point we’ll add native compilation to make everything super-fast, it’s just a matter of time...”.</p><p>In practice, however, native compilation is a complicated matter, but not due to a lack of tools.</p><p>First of all, there is the question of storage for the native code. We could compile each filter statically into some sort of a library and publish it to a key-value store, but that would not be easy to maintain:</p><ul><li><p>We would have to compile each filter to several platforms (x86-64, ARM, WASM, …).</p></li><li><p>The overhead of native library formats would significantly outweigh the useful executable size, as most filters tend to be small.</p></li><li><p>Each time we’d like to change our execution logic, whether to optimise it or to fix a bug, we would have to recompile and republish all the previously stored filters.</p></li><li><p>Finally, even if we’re sure of the reliability of the chosen store, executing dynamically retrieved native code on the edge as-is is not something that can be taken lightly.</p></li></ul><p>The usual flexible alternative that addresses most of these issues is Just-in-Time (JIT) compilation.</p><p>When you compile code directly on the target machine, you get to re-verify the input (still expressed as a restricted DSL), you can compile it just for the current platform in-place, and you never need to republish the actual rules.</p><p>Looks like a perfect fit? Not quite. As with any technology, there are tradeoffs, and you only get to choose those that make more sense for your use cases. JIT compilation is no exception.</p><p>First of all, even though you’re not loading untrusted code over the network, you still need to generate it in memory, mark that memory as executable and trust that it will always contain valid code and not garbage or something worse. 
Depending on your choice of libraries and complexity of the DSL, you might be willing to trust it or put heavy sandboxing around it, but, either way, it’s a risk that one must explicitly be willing to take.</p><p>Another issue is the cost of compilation itself. Usually, when measuring the speed of native code vs interpretation, the cost of compilation is not taken into account because it happens out of the process.</p><p>With JIT compilers though, it’s different, as you’re now compiling things the moment they’re used and caching the native code only for a limited time. It turns out that generating native code can be rather expensive, so you must be absolutely sure that the compilation cost doesn’t offset any benefits you might gain from the native execution speedup.</p><p>I’ve talked a bit more about this at a <a href="https://www.meetup.com/rust-atx/">Rust Austin meetup</a> and, I believe, this topic deserves a separate blog post, so I won’t go into much more detail here, but feel free to check out the slides: <a href="https://www.slideshare.net/RReverser/building-fast-interpreters-in-rust">https://www.slideshare.net/RReverser/building-fast-interpreters-in-rust</a>. Oh, and if you’re in Austin, you should pop into our office for the next meetup!</p><p>Let’s get back to our original question: is there anything else we can do to get the best balance between security, runtime performance and compilation cost? Turns out, there is.</p>
    <div>
      <h4>Dynamic dispatch and closures to the rescue</h4>
      <a href="#dynamic-dispatch-and-closures-to-the-rescue">
        
      </a>
    </div>
    <p>Introducing the <code>Fn</code> trait!</p><p>In Rust, the <code>Fn</code> trait and friends (<code>FnMut</code>, <code>FnOnce</code>) are automatically implemented on eligible functions and closures. In the case of plain <code>Fn</code>, the restriction is that they must not modify their captured environment and can only borrow from it.</p><p>Normally, you would want to use it in generic contexts to support arbitrary callbacks with given argument and return types. This is important because in Rust, each function and closure has its own unique type, and any generic usage compiles down to a specific call to just that function.</p>
            <pre><code>fn just_call(me: impl Fn(), maybe: bool) {
  if maybe {
    me()
  }
}</code></pre>
            <p>Such behaviour (called static dispatch) is the default in Rust and is preferable for performance reasons.</p><p>However, if we don’t know all the possible types at compile-time, Rust allows us to opt in to dynamic dispatch instead:</p>
            <pre><code>fn just_call(me: &amp;dyn Fn(), maybe: bool) {
  if maybe {
    me()
  }
}</code></pre>
            <p>Dynamically dispatched objects don't have a statically known size, because it depends on the implementation details of the particular type being passed. They need to be passed as a reference or stored in a heap-allocated <code>Box</code>, and then used just like in a generic implementation.</p><p>In our case, this allows us to create, return and store arbitrary closures, and later call them like regular functions:</p>
            <pre><code>trait Expr&lt;'s&gt; {
    fn compile(self) -&gt; CompiledExpr&lt;'s&gt;;
}

pub(crate) struct CompiledExpr&lt;'s&gt;(Box&lt;dyn 's + Fn(&amp;ExecutionContext&lt;'s&gt;) -&gt; bool&gt;);

impl&lt;'s&gt; CompiledExpr&lt;'s&gt; {
   /// Creates a compiled expression IR from a generic closure.
   pub(crate) fn new(closure: impl 's + Fn(&amp;ExecutionContext&lt;'s&gt;) -&gt; bool) -&gt; Self {
       CompiledExpr(Box::new(closure))
   }

   /// Executes a filter against a provided context with values.
   pub fn execute(&amp;self, ctx: &amp;ExecutionContext&lt;'s&gt;) -&gt; bool {
       self.0(ctx)
   }
}</code></pre>
            <p>The closure (an <code>Fn</code> box) will also automatically include the environment data it needs for the execution.</p>
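<p>A stripped-down illustration of that environment capture (the <code>Ctx</code> struct and the port check are invented for the example, not part of wirefilter):</p>

```rust
// "Compilation" here just returns a boxed closure that owns whatever
// pre-processed data it captured from its environment.
struct Ctx {
    port: u16,
}

type CompiledExpr = Box<dyn Fn(&Ctx) -> bool>;

fn compile_port_check(allowed: Vec<u16>) -> CompiledExpr {
    // `allowed` is moved into the closure's environment, which lives
    // inside the Box alongside the code pointer.
    Box::new(move |ctx| allowed.contains(&ctx.port))
}

fn main() {
    let filter = compile_port_check(vec![80, 443]);
    assert!(filter(&Ctx { port: 443 }));
    assert!(!filter(&Ctx { port: 22 }));
}
```

The data is prepared once at "compile" time and reused on every execution, which is what makes pre-processing tricks like the split IP range sets below essentially free at runtime.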
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7x17xAapCcN3PjapVfoIyh/89ca29faa4b157fc2dcd7af0179eacb6/box.png" />
            
            </figure><p>This means that we can optimise the runtime data representation as part of the “compile” process without changing the AST or the parser. For example, when we wanted to optimise IP range checks by splitting them for different IP types, we could do that without having to modify any existing structures:</p>
            <pre><code>RhsValues::Ip(ranges) =&gt; {
   let mut v4 = Vec::new();
   let mut v6 = Vec::new();
   for range in ranges {
       match range.clone().into() {
           ExplicitIpRange::V4(range) =&gt; v4.push(range),
           ExplicitIpRange::V6(range) =&gt; v6.push(range),
       }
   }
   let v4 = RangeSet::from(v4);
   let v6 = RangeSet::from(v6);
   CompiledExpr::new(move |ctx| {
       match cast!(ctx.get_field_value_unchecked(field), Ip) {
           IpAddr::V4(addr) =&gt; v4.contains(addr),
           IpAddr::V6(addr) =&gt; v6.contains(addr),
       }
   })
}</code></pre>
            <p>Moreover, boxed closures can be part of that captured environment, too. This means that we can convert each simple comparison into a closure, and then combine it with other closures, and keep going until we end up with a single top-level closure that can be invoked as a regular function to evaluate the entire filter expression.</p><p>It’s <s>turtles</s> closures all the way down:</p>
            <pre><code>let items = items
   .into_iter()
   .map(|item| item.compile())
   .collect::&lt;Vec&lt;_&gt;&gt;()
   .into_boxed_slice();

match op {
   CombiningOp::And =&gt; {
       CompiledExpr::new(move |ctx| items.iter().all(|item| item.execute(ctx)))
   }
   CombiningOp::Or =&gt; {
       CompiledExpr::new(move |ctx| items.iter().any(|item| item.execute(ctx)))
   }
   CombiningOp::Xor =&gt; CompiledExpr::new(move |ctx| {
       items
           .iter()
           .fold(false, |acc, item| acc ^ item.execute(ctx))
   }),
}</code></pre>
            <p>What’s nice about this approach is:</p><ul><li><p>Our execution is no longer tied to the AST, and we can be as flexible with optimising the implementation and data representation as we want without affecting the parser-related parts of code or output format.</p></li><li><p>Even though we initially “compile” each node to a single closure, in the future we can pretty easily specialise certain combinations of expressions into their own closures and so improve execution speed for common cases. All that would be required is a separate <code>match</code> branch returning a closure optimised for just that case.</p></li><li><p>Compilation is very cheap compared to real code generation. While it might seem that allocating many small objects (one <code>Box</code>ed closure per expression) is not very efficient and that it would be better to replace it with some sort of a memory pool, in practice we saw a negligible performance impact.</p></li><li><p>No native code is generated at runtime, which means that we execute only code that was statically verified by Rust at compile-time and compiled down to a static function. All we do at runtime is call existing functions with different values.</p></li><li><p>Execution turns out to be faster too. This initially came as a surprise, because dynamic dispatch is widely believed to be costly and we were worried that it would be slightly worse than AST interpretation. However, it showed an immediate ~10-15% runtime improvement in benchmarks and on real examples.</p></li></ul><p>The only obvious downside is that each level of the AST requires a separate dynamically-dispatched call instead of a single stretch of inlined code for the entire expression, like you would have even with a basic template JIT.</p><p>Unfortunately, such output could be achieved only with real native code generation, and, for our case, the mentioned downsides and risks would outweigh the runtime benefits, so we went with the safe &amp; flexible closure approach.</p>
    <div>
      <h3>Bonus: WebAssembly support</h3>
      <a href="#bonus-webassembly-support">
        
      </a>
    </div>
    <p>As was mentioned earlier, we chose Rust as a safe high-level language that allows easy integration with other parts of our stack written in Go, C and Lua via C FFI. But Rust has one more target it invests in and supports exceptionally well: WebAssembly.</p><p>Why would we be interested in that? Apart from the parts of the stack where our rules would run, and the API that publishes them, we also have users who like to write their own rules. To do that, they use a UI editor that allows either writing raw expressions in Wireshark syntax or using a WYSIWYG builder.</p><p>We thought it would be great to expose the parser - the same one we use on the backend - to the frontend JavaScript for a consistent real-time editing experience. And, honestly, we were just looking for an excuse to play with WASM support in Rust.</p><p>WebAssembly could be targeted via regular C FFI, but in that case you would need to manually provide all the glue for the JavaScript side to hold and convert strings, arrays and objects back and forth.</p><p>In Rust, this is all handled by <a href="https://github.com/rustwasm/wasm-bindgen">wasm-bindgen</a>. While it provides various attributes and methods for direct conversions, the simplest way to get started is to activate the “serde” feature, which will automatically convert types using <code>JSON.parse</code>, <code>JSON.stringify</code> and <a href="https://docs.serde.rs/serde_json/"><code>serde_json</code></a> under the hood.</p><p>In our case, creating a wrapper for the parser with only 20 lines of code was enough to get started and have all the WASM code + JavaScript glue required:</p>
            <pre><code>#[wasm_bindgen]
pub struct Scheme(wirefilter::Scheme);

fn into_js_error(err: impl std::error::Error) -&gt; JsValue {
   js_sys::Error::new(&amp;err.to_string()).into()
}

#[wasm_bindgen]
impl Scheme {
   #[wasm_bindgen(constructor)]
   pub fn try_from(fields: &amp;JsValue) -&gt; Result&lt;Scheme, JsValue&gt; {
       fields.into_serde().map(Scheme).map_err(into_js_error)
   }

   pub fn parse(&amp;self, s: &amp;str) -&gt; Result&lt;JsValue, JsValue&gt; {
       let filter = self.0.parse(s).map_err(into_js_error)?;
       JsValue::from_serde(&amp;filter).map_err(into_js_error)
   }
}</code></pre>
            <p>And by using a higher-level tool called <a href="https://github.com/rustwasm/wasm-pack">wasm-pack</a>, we also got automated npm package generation and publishing, for free.</p><p>This is not used in the production UI yet because we still need to figure out some details for unsupported browsers, but it’s great to have all the tooling and packages ready with minimal effort. By extending and reusing the same package, it should even be possible to run filters in Cloudflare Workers too (which <a href="/webassembly-on-cloudflare-workers/">also support WebAssembly</a>).</p>
    <div>
      <h3>The future</h3>
      <a href="#the-future">
        
      </a>
    </div>
    <p>The code in the current state is already doing its job well in production and we’re happy to share it with the open-source Rust community.</p><p>This is definitely not the end of the road though - we have many more fields to add, features to implement and planned optimisations to explore. If you find this sort of work interesting and would like to help us by working on firewalls, parsers or just any Rust projects at scale, give us a shout!</p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">2IkqAbjbvhsOMUuOPkvsnL</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
            <dc:creator>Andrew Galloni</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving request debugging in Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/better-workers-debugging-with-a-network-panel/</link>
            <pubDate>Fri, 28 Dec 2018 14:18:11 GMT</pubDate>
            <description><![CDATA[ As some of you might have already noticed either from our public release notes, on cloudflareworkers.com or in your Cloudflare Workers dashboard, there was a small but important change in the look of the inspector. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we are constantly looking into ways to improve development experience for Workers and make it the most convenient platform for writing serverless code.</p><p>As some of you might have already noticed either from our public release notes, on <a href="https://cloudflareworkers.com/">cloudflareworkers.com</a> or in your Cloudflare Workers dashboard, there recently was a small but important change in the look of the inspector.</p><p>But before we go into figuring out what it is, let's take a look at our standard example on <a href="https://cloudflareworkers.com/">cloudflareworkers.com</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/457ro5RNFewcfSnhPPwFA8/62939086d77edc6181ed5e5ba13f6433/image-16.png" />
            
            </figure><p>The example worker code featured here acts as a transparent proxy, printing requests and responses to the console.</p><p>Commonly, when debugging Workers, all you can see from the client-side devtools is the interaction between your browser and the Cloudflare Workers runtime. However, as in most other server-side runtimes, the interaction between your code and the actual origin is hidden from you.</p><p>This is where <code>console.log</code> comes in. Although not the most convenient approach, printing things out is a fairly popular debugging technique.</p><p>Unfortunately, its default output doesn't help much with debugging network interactions. If you try to expand either the request or the response object, all you see is a bunch of lazy accessors:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5K8RNMUh2epL4Swl7M9eso/5f28631909bffda10f3f5e66a5ec1cf2/screenshot-storage.googleapis.com-2018.12.14-11-59-22.png" />
            
            </figure><p>You could expand them one-by-one, getting some properties back, but, when it comes to important parts like headers, that doesn't help much either:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NG4RcXiD3KK8N9U53DAJL/9df91e9705f0bda09a06e2d865153ad3/screenshot-storage.googleapis.com-2018.12.14-12-00-37.png" />
            
            </figure><p>So, since the launch of Workers, the best we have been able to suggest is a few JavaScript tricks to convert the headers into a more readable format:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/583U9GfftSUcqzY54HX8yT/a39d9d63dd253ae1a577cf594874b48c/screenshot-cloudflareworkers.com-2018.12.14-12-03-57.png" />
            
            </figure><p>This works somewhat better, but doesn't scale well, especially if you're trying to debug complex interactions between various requests on a page and subrequests coming from a worker. So we thought: how can we do better?</p><p>If you're familiar with Chrome DevTools, you might have noticed that we were already offering a trimmed-down version of it in our UI, with basic Console and Sources panels. The obvious solution is: why not expose the existing Network panel in addition to these? And we did just* that.</p><p>* Unfortunately, this is easier said than done. If you're already familiar with the Network tab and are interested in the technical implementation details, feel free to <a href="#how-did-we-do-this">skip the next section</a>.</p>
    <div>
      <h3>What can you do with the new panel?</h3>
      <a href="#what-can-you-do-with-the-new-panel">
        
      </a>
    </div>
    <p>You should be able to use most of the features available in the regular Chrome DevTools Network panel, but instead of inspecting only the interaction between the browser and Cloudflare (which is as much as browser devtools can give you), you can now peek into the interaction between your Worker and the origin as well.</p><p>This means you can view request and response headers, including both those internal to your Worker and the ones provided by Cloudflare:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uV34bLhMdZxNY3Gbo3dQG/4b6606ff56ad8ebe3e309c37c75626a8/image-22.png" />
            
            </figure><p>Check the original response to verify content modifications:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PGAymDgcq0ZfPTe4eESyd/7c200054b1efe875b4cf717fc50d2434/screenshot-cloudflareworkers.com-2018.12.14-17-37-19.png" />
            
            </figure><p>Same goes for raw responses:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BUBV00klumZq2Xz8rOOhI/e2143e6ef19c6b9088126a0214438c0c/screenshot-cloudflareworkers.com-2018.12.14-17-37-34.png" />
            
            </figure><p>You can also check the time it took the Worker to reach your website and get data back from it:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36SE1whkvsK8GAhh0rQ70H/d12aaa7e25ff4405461585a4bc515bf9/screenshot-cloudflareworkers.com-2018.12.14-17-41-03.png" />
            
            </figure><p>However, note that timings from the debugging service will differ from the ones in production across different locations, so it makes sense to compare them only with other requests on the same page, or with the same request as you iterate on your Worker's code.</p><p>You can also view the initiator of each request - this comes in handy if your Worker contains complex routing handled by different paths, or if you simply want to check which requests on the page were intercepted and re-issued at all:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7mCM9hsucpRaamv8tGtzWz/ace524a49457ec5e66b0a7e9662c6312/image-25.png" />
            
            </figure><p>Basic features like filtering by type of content also work:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b9Qmjos900pWFl40CIGfQ/c06dd239bd20e4628e9ccb2f0c887bd8/screenshot-cloudflareworkers.com-2018.12.14-17-56-13-1.png" />
            
            </figure><p>And, finally, you can copy or even export subrequests as HAR for further inspection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/SY6GsgFGCCv78EAB8NYyb/2ec50bdc95c349a2bf7f4d86e1fa1cae/image-24.png" />
            
            </figure>
    <div>
      <h3>How did we do this?</h3>
      <a href="#how-did-we-do-this">
        
      </a>
    </div>
    <p>So far we have been using a built-in mode of the inspector which was specifically designed with JavaScript-only targets in mind. This allows it to avoid loading most of the components that would require a real browser (Chromium-based) backend, and instead leaves just the core that can be integrated directly with V8 in any embedder, whether it's Node.js or, in our case, Cloudflare Workers.</p><p>Luckily, the DevTools Protocol itself is pretty well documented - <a href="https://chromedevtools.github.io/devtools-protocol/">chromedevtools.github.io/devtools-protocol/</a> - to facilitate third-party implementors.</p><p>While the protocol is commonly used from the client side (for editor integration), there are third-party implementations of the server side too, for non-JavaScript targets like Lua, Go and ClojureScript, and even for system-wide network debugging on both desktop and mobile: <a href="https://github.com/ChromeDevTools/awesome-chrome-devtools">github.com/ChromeDevTools/awesome-chrome-devtools</a>.</p><p>So there is nothing preventing us from providing our own implementation of the <code>Network</code> domain that would give a native DevTools experience.</p><p>On the Workers backend side, we are already in charge of the network stack, which means we have access to all the necessary information and can wrap all the request/response handlers in our own hooks to send it back to the inspector.</p><p>Communication between the inspector and the debugger backend happens over WebSockets. So far we've just been receiving messages and passing them pretty much directly to V8 as-is. However, if we want to handle Network messages ourselves, that's not going to work anymore, and we need to actually parse the messages.</p><p>To do that in a standard way, V8 provides some build scripts to generate protocol handlers for any given list of domains. While these are used by Chromium, they require quite a bit of configuration and custom glue for different levels of message serialisation, deserialisation and error handling.</p><p>On the other hand, the protocol used for communication is essentially just <a href="https://www.jsonrpc.org/">JSON-RPC</a>, and <a href="https://capnproto.org/">capnproto</a>, which we're already using in other places behind the scenes, provides JSON (de)serialisation support, so it was easier to reuse that than to build a separate glue layer for V8.</p><p>For example, to provide bindings for <a href="https://chromedevtools.github.io/devtools-protocol/tot/Runtime/#type-CallFrame"><code>Runtime.CallFrame</code></a> we just need to define a capnp structure like this:</p>
            <pre><code>struct CallFrame {
  # Stack entry for runtime errors and assertions.
  functionName @0 :Text; # JavaScript function name.
  scriptId @1 :ScriptId; # JavaScript script id.
  url @2 :Text; # JavaScript script name or url.
  lineNumber @3 :Int32; # JavaScript script line number (0-based).
  columnNumber @4 :Int32; # JavaScript script column number (0-based).
}</code></pre>
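            <p>As an illustrative sketch (all names here are hypothetical, not taken from the actual codebase), the message routing described above boils down to: look at the JSON-RPC <code>method</code> of each incoming inspector message, handle it ourselves if it belongs to the <code>Network</code> domain, and pass everything else through to V8. The naive string scan below stands in for real JSON deserialisation:</p>

```rust
// Hypothetical sketch only: route an incoming JSON-RPC inspector message
// either to our own Network-domain handler or straight through to V8.

/// Extract the value of the "method" field from a JSON-RPC message.
/// A real implementation would properly deserialise the JSON instead
/// of scanning the raw string.
fn method_of(message: &str) -> Option<&str> {
    let key = "\"method\":\"";
    let start = message.find(key)? + key.len();
    let end = message[start..].find('"')? + start;
    Some(&message[start..end])
}

/// Where a given inspector message should go.
enum Route<'a> {
    HandleNetwork(&'a str), // handled by our own Network domain implementation
    PassToV8,               // everything else keeps going to V8 as-is
}

fn route(message: &str) -> Route<'_> {
    match method_of(message) {
        Some(m) if m.starts_with("Network.") => Route::HandleNetwork(m),
        _ => Route::PassToV8,
    }
}
```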
            <p>Okay, so by combining these two, we can now parse and handle the supported Network inspector messages ourselves and pass the rest through to V8 as usual.</p><p>Next, we needed to make some changes to the frontend. Wait, you might ask, wasn't the entire point of these changes to speak the same protocol the frontend already does? That's true, but there are other challenges.</p><p>First of all, because the Network tab was designed to be used in a browser, it relies on various components that are irrelevant to us and, if pulled in as-is, would not only make the frontend code larger, but also require extra backend support. Some of them are used for cross-tab integration (e.g. with the Profiler), but some are part of the Network tab itself - for example, it doesn't make much sense to offer request blocking or mobile throttling when debugging server-side code. So we had some manual untangling to do here.</p><p>Another interesting challenge was handling response bodies. Normally, when you click on a request in the browser's Network tab and ask to see its response body, the devtools frontend sends a <a href="https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-getResponseBody"><code>Network.getResponseBody</code></a> message to the browser backend, and the browser sends the body back.</p><p>What this means is that, as long as the Network tab is active, the browser has to store all of the responses for all of the requests from the page in memory, without knowing which of them will actually be requested in the future. Such lazy handling makes perfect sense for local or even remote Chrome debugging, where you are commonly fully in charge of both sides.</p><p>However, for us it wouldn't be ideal to store all of these responses from all of our users in memory on the debugging backend. After some back and forth on different solutions, we decided to deviate from the protocol and instead send original response bodies to the inspector frontend as they come through, letting the frontend store them. This might not seem ideal either, since it sends extra data over the network during debugging sessions, but the tradeoff makes more sense for a shared debugging backend.</p><p>There were various smaller challenges and bug fixes to be made and upstreamed, but we'll let those stay behind the scenes.</p><p>Is this feature useful to you? What other features would help you to debug and develop workers more efficiently? Or maybe you would like to work on Workers and tooling yourself?</p><p>Let us know!</p><p><b>P.S.</b>: If you’re looking for a fun personal project for the holidays, this could be your chance to try out Workers, and play around with our new tools.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Dashboard]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">1W7MPRFd3LOfuxcqHIlaJu</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Writing complex macros in Rust: Reverse Polish Notation]]></title>
            <link>https://blog.cloudflare.com/writing-complex-macros-in-rust-reverse-polish-notation/</link>
            <pubDate>Wed, 31 Jan 2018 12:11:15 GMT</pubDate>
            <description><![CDATA[ Among other interesting features, Rust has a powerful macro system. Unfortunately, even after reading The Book and various tutorials, when it came to trying to implement a macro which involved processing complex lists of different elements, I still struggled to understand how it should be done. ]]></description>
            <content:encoded><![CDATA[ <p>(<i>This is a crosspost of a tutorial </i><a href="https://rreverser.com/writing-complex-macros-in-rust/"><i>originally published</i></a><i> on my personal blog</i>)</p><p>Among other interesting features, Rust has a powerful macro system. Unfortunately, even after reading The Book and various tutorials, when it came to implementing a macro that involved processing complex lists of different elements, I still struggled to understand how it should be done, and it took some time until I got to that "ding" moment and started misusing macros for everything :) <i>(ok, not everything as in the i-am-using-macros-because-i-dont-want-to-use-functions-and-specify-types-and-lifetimes everything like I've seen some people do, but anywhere it's actually useful)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5H983fEVulWNryLr4wz4Al/cb9ff5c7818560c6dd186183a2c6270a/25057125240_939a41249f_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/conchur/25057125240/in/photolist-EbdjmG-8NSN1q-qXhueG-YTYnm3-odaneQ-DxCKQA-228jg4t-DU8Axz-XTQfdD-4p6nJk-UKVzbn-YFeKcW-osZ2XM-e6qefx-Tb3a6Q-dCw1zk-Et3kKh-dbAR9x-zHP8TR-a9cqw4-9JQHRy-Et1Ag5-PqFtx1-7x3Ukq-67VJc6-cvoKSo-qH2S9L-zHJAr9-XmCLsL-8AMWXX-ZV2hHh-XGPiHq-ZKpFSB-yqd2P1-23hMiaC-zETYYa-Wj7BVi-PNP4YA-LCNm6c-8AnkrZ-KA7qmt-KjYPxC-SzQsZD-Cxwvqg-GuZ3nn-J4jBaA-TzyjpB-DcYJA1-YQYNA3-My1uu8">image</a> by <a href="https://www.flickr.com/photos/conchur/">Conor Lawless</a></p><p>So, here is my take on describing the principles behind writing such macros. It assumes you have read the <a href="https://doc.rust-lang.org/book/first-edition/macros.html">Macros</a> section from The Book and are familiar with basic macro definitions and token types.</p><p>I'll take <a href="https://en.wikipedia.org/wiki/Reverse_Polish_notation">Reverse Polish Notation</a> as the example for this tutorial. It's interesting because it's simple enough (you might already be familiar with it from school), and yet, to implement it statically at compile time, you already need a recursive macro approach.</p><p>Reverse Polish Notation (also called postfix notation) uses a stack for all its operations, so that any operand is pushed onto the stack, and any <i>[binary]</i> operator takes two operands from the stack, evaluates the result and puts it back. So an expression like the following:</p>
            <pre><code>2 3 + 4 *</code></pre>
            <p>translates into:</p><ol><li><p>Put <code>2</code> onto the stack.</p></li><li><p>Put <code>3</code> onto the stack.</p></li><li><p>Take the last two values from the stack (<code>3</code> and <code>2</code>), apply operator <code>+</code> and put the result (<code>5</code>) back onto the stack.</p></li><li><p>Put <code>4</code> onto the stack.</p></li><li><p>Take the last two values from the stack (<code>4</code> and <code>5</code>), apply operator <code>*</code> (<code>4 * 5</code>) and put the result (<code>20</code>) back onto the stack.</p></li><li><p>End of expression: the single value on the stack is the result (<code>20</code>).</p></li></ol><p>In the more common infix notation, used in math and most modern programming languages, the expression would look like <code>(2 + 3) * 4</code>.</p><p>So let's write a macro that evaluates RPN at compile-time by converting it into the infix notation that Rust understands.</p>
            <pre><code>macro_rules! rpn {
  // TODO
}

println!("{}", rpn!(2 3 + 4 *)); // 20</code></pre>
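            <p>Before moving the evaluation to compile time, the stack algorithm from the numbered list above can be sanity-checked as an ordinary runtime function (this helper is purely illustrative and won't be part of the macro we're building):</p>

```rust
// Runtime sketch of the same stack algorithm, for illustration only.
fn eval_rpn(tokens: &[&str]) -> Option<f64> {
    let mut stack: Vec<f64> = Vec::new();
    for token in tokens {
        match *token {
            "+" | "-" | "*" | "/" => {
                // Operands come off the stack in reverse order.
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(match *token {
                    "+" => a + b,
                    "-" => a - b,
                    "*" => a * b,
                    _ => a / b,
                });
            }
            num => stack.push(num.parse().ok()?),
        }
    }
    // A well-formed expression leaves exactly one value on the stack.
    if stack.len() == 1 { stack.pop() } else { None }
}
```

            <p>For example, <code>eval_rpn(&["2", "3", "+", "4", "*"])</code> returns <code>Some(20.0)</code> - the same result our macro will later produce at compile time.</p>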
            <p>Let's start with pushing numbers onto the stack.</p><p>Macros currently don't allow matching literals, and <code>expr</code> won't work for us because it can accidentally match a sequence like <code>2 + 3 ...</code> instead of taking just a single number, so we'll resort to <code>tt</code> - a generic token matcher that matches only one token tree (whether it's a primitive token like literal/identifier/lifetime/etc. or a <code>()</code>/<code>[]</code>/<code>{}</code>-parenthesized expression containing more tokens):</p>
            <pre><code>macro_rules! rpn {
  ($num:tt) =&gt; {
    // TODO
  };
}</code></pre>
            <p>Now, we'll need a variable for the stack.</p><p>Macros can't use real variables, because we want this stack to exist only at compile time. So, instead, the trick is to have a separate token sequence that can be passed around and used as a kind of accumulator.</p><p>In our case, let's represent it as a comma-separated sequence of <code>expr</code> (since we will be using it not only for simple numbers but also for intermediate infix expressions) and wrap it into brackets to separate it from the rest of the input:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt) =&gt; {
    // TODO
  };
}</code></pre>
            <p>Now, a token sequence is not really a variable - you can't modify it in-place and do something afterwards. Instead, you can create a new copy of this token sequence with the necessary modifications, and recursively call the same macro again.</p><p>If you are coming from a functional language background or have worked with a library providing immutable data before, both of these approaches - mutating data by creating a modified copy, and processing lists with recursion - are likely already familiar to you:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt) =&gt; {
    rpn!([ $num $(, $stack)* ])
  };
}</code></pre>
            <p>Now, obviously, the case with just a single number is rather unlikely and not very interesting to us, so we'll need to match anything else after that number as a sequence of zero or more <code>tt</code> tokens, which can be passed to the next invocation of our macro for further matching and processing:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
      rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>At this point we're still missing operator support. How do we match operators?</p><p>If our RPN expression were a sequence of tokens that we wanted to process all in exactly the same way, we could simply use a list like <code>$($token:tt)*</code>. Unfortunately, that wouldn't give us the ability to walk through the list and either push an operand or apply an operator depending on each token.</p><p>The Book says that the "macro system does not deal with parse ambiguity at all", and that's true for a single macro branch - we can't match a sequence of numbers followed by an operator like <code>$($num:tt)* +</code>, because <code>+</code> is also a valid token and would be matched by the <code>tt</code> group - but this is where recursion helps again.</p><p>If you have different branches in your macro definition, Rust will try them one by one, so we can put our operator branches before the numeric one and, this way, avoid any conflict:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] + $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] - $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] * $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] / $($rest:tt)*) =&gt; {
    // TODO
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>As I said earlier, operators are applied to the last two numbers on the stack, so we'll need to match them separately, "evaluate" the result (construct a regular infix expression) and put it back:</p>
            <pre><code>macro_rules! rpn {
  ([ $b:expr, $a:expr $(, $stack:expr)* ] + $($rest:tt)*) =&gt; {
    rpn!([ $a + $b $(, $stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] - $($rest:tt)*) =&gt; {
    rpn!([ $a - $b $(, $stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] * $($rest:tt)*) =&gt; {
    rpn!([ $a * $b $(,$stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] / $($rest:tt)*) =&gt; {
    rpn!([ $a / $b $(,$stack)* ] $($rest)*)
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>I'm not really a fan of such obvious repetition, but, just as with literals, there is no special token type to match operators.</p><p>What we can do, however, is add a helper that is responsible for the evaluation, and delegate each explicit operator branch to it.</p><p>In macros, you can't really use an external helper; the only thing you can be sure of is that your macro itself is already in scope. So the usual trick is to have a branch in the same macro "marked" with some unique token sequence, and call it recursively like we did in the regular branches.</p><p>Let's use <code>@op</code> as such a marker, and accept any operator via <code>tt</code> inside it (<code>tt</code> is unambiguous in this context because we'll be passing only operators to this helper).</p><p>And the stack no longer needs to be expanded in each separate branch - since we wrapped it into <code>[]</code> brackets earlier, it can be matched as just another token tree (<code>tt</code>) and passed into our helper:</p>
            <pre><code>macro_rules! rpn {
  (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) =&gt; {
    rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
  };

  ($stack:tt + $($rest:tt)*) =&gt; {
    rpn!(@op $stack + $($rest)*)
  };
  
  ($stack:tt - $($rest:tt)*) =&gt; {
    rpn!(@op $stack - $($rest)*)
  };

  ($stack:tt * $($rest:tt)*) =&gt; {
    rpn!(@op $stack * $($rest)*)
  };
  
  ($stack:tt / $($rest:tt)*) =&gt; {
    rpn!(@op $stack / $($rest)*)
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>Now all tokens are processed by their corresponding branches, and we just need to handle the final case, when the stack contains a single item and no more tokens are left:</p>
            <pre><code>macro_rules! rpn {
  // ...
  
  ([ $result:expr ]) =&gt; {
    $result
  };
}</code></pre>
            <p>At this point, if you invoke this macro with an empty stack and an RPN expression, it will already produce the correct result:</p><p><a href="https://play.rust-lang.org/?gist=cd56f6d7335e2d27c05e7fa89545b2cd&amp;version=stable">Playground</a></p>
            <pre><code>println!("{}", rpn!([] 2 3 + 4 *)); // 20</code></pre>
            <p>However, our stack is an implementation detail and we really wouldn't want every consumer to pass an empty stack in, so let's add a catch-all branch at the end that serves as an entry point and adds <code>[]</code> automatically:</p><p><a href="https://play.rust-lang.org/?gist=d94abc0e20aa5c7f689706af06fd1923&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  // ...

  ($($tokens:tt)*) =&gt; {
    rpn!([] $($tokens)*)
  };
}

println!("{}", rpn!(2 3 + 4 *)); // 20</code></pre>
            <p>Our macro even works for more complex expressions, like the one <a href="https://en.wikipedia.org/wiki/Reverse_Polish_notation#Example">from the Wikipedia page about RPN</a>!</p>
            <pre><code>println!("{}", rpn!(15 7 1 1 + - / 3 * 2 1 1 + + -)); // 5</code></pre>
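            <p>For reference, here are all of the branches we've written so far, assembled into a single working macro (the error-handling branches come next, so this version still assumes a well-formed expression):</p>

```rust
macro_rules! rpn {
    // Helper: apply an operator to the top two values on the stack.
    (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) => {
        rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
    };

    // Delegate each operator to the @op helper.
    ($stack:tt + $($rest:tt)*) => { rpn!(@op $stack + $($rest)*) };
    ($stack:tt - $($rest:tt)*) => { rpn!(@op $stack - $($rest)*) };
    ($stack:tt * $($rest:tt)*) => { rpn!(@op $stack * $($rest)*) };
    ($stack:tt / $($rest:tt)*) => { rpn!(@op $stack / $($rest)*) };

    // Push a number onto the stack.
    ([ $($stack:expr),* ] $num:tt $($rest:tt)*) => {
        rpn!([ $num $(, $stack)* ] $($rest)*)
    };

    // No tokens left and exactly one value on the stack: we're done.
    ([ $result:expr ]) => { $result };

    // Entry point: start with an empty stack.
    ($($tokens:tt)*) => { rpn!([] $($tokens)*) };
}
```

            <p>Invoking <code>rpn!(2 3 + 4 *)</code> expands, step by step, to <code>(2 + 3) * 4</code> and evaluates to <code>20</code>.</p>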
            
    <div>
      <h3>Error handling</h3>
      <a href="#error-handling">
        
      </a>
    </div>
    <p>Now everything seems to work smoothly for correct RPN expressions, but for a macro to be production-ready, we need to be sure it can handle invalid input as well, with a reasonable error message.</p><p>First, let's try to insert another number in the middle and see what happens:</p>
            <pre><code>println!("{}", rpn!(2 3 7 + 4 *));</code></pre>
            <p>Output:</p>
            <pre><code>error[E0277]: the trait bound `[{integer}; 2]: std::fmt::Display` is not satisfied
  --&gt; src/main.rs:36:20
   |
36 |     println!("{}", rpn!(2 3 7 + 4 *));
   |                    ^^^^^^^^^^^^^^^^^ `[{integer}; 2]` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
   |
   = help: the trait `std::fmt::Display` is not implemented for `[{integer}; 2]`
   = note: required by `std::fmt::Display::fmt`</code></pre>
            <p>Okay, that definitely doesn't look helpful, as it doesn't provide any information relevant to the actual mistake in the expression.</p><p>In order to figure out what happened, we need to debug our macro. For that, we'll use the <a href="https://doc.rust-lang.org/unstable-book/language-features/trace-macros.html"><code>trace_macros</code></a> feature (and, as for any other optional compiler feature, you'll need a nightly version of Rust). We don't want to trace the <code>println!</code> call, so we'll move our RPN calculation into a separate variable:</p><p><a href="https://play.rust-lang.org/?gist=610bc0c241aacda3d30a916f89b244cd&amp;version=nightly">Playground</a></p>
            <pre><code>#![feature(trace_macros)]

macro_rules! rpn { /* ... */ }

fn main() {
  trace_macros!(true);
  let e = rpn!(2 3 7 + 4 *);
  trace_macros!(false);
  println!("{}", e);
}</code></pre>
            <p>In the output we'll now see how our macro is recursively expanded, step by step:</p>
            <pre><code>note: trace_macro
  --&gt; src/main.rs:39:13
   |
39 |     let e = rpn!(2 3 7 + 4 *);
   |             ^^^^^^^^^^^^^^^^^
   |
   = note: expanding `rpn! { 2 3 7 + 4 * }`
   = note: to `rpn ! ( [  ] 2 3 7 + 4 * )`
   = note: expanding `rpn! { [  ] 2 3 7 + 4 * }`
   = note: to `rpn ! ( [ 2 ] 3 7 + 4 * )`
   = note: expanding `rpn! { [ 2 ] 3 7 + 4 * }`
   = note: to `rpn ! ( [ 3 , 2 ] 7 + 4 * )`
   = note: expanding `rpn! { [ 3 , 2 ] 7 + 4 * }`
   = note: to `rpn ! ( [ 7 , 3 , 2 ] + 4 * )`
   = note: expanding `rpn! { [ 7 , 3 , 2 ] + 4 * }`
   = note: to `rpn ! ( @ op [ 7 , 3 , 2 ] + 4 * )`
   = note: expanding `rpn! { @ op [ 7 , 3 , 2 ] + 4 * }`
   = note: to `rpn ! ( [ 3 + 7 , 2 ] 4 * )`
   = note: expanding `rpn! { [ 3 + 7 , 2 ] 4 * }`
   = note: to `rpn ! ( [ 4 , 3 + 7 , 2 ] * )`
   = note: expanding `rpn! { [ 4 , 3 + 7 , 2 ] * }`
   = note: to `rpn ! ( @ op [ 4 , 3 + 7 , 2 ] * )`
   = note: expanding `rpn! { @ op [ 4 , 3 + 7 , 2 ] * }`
   = note: to `rpn ! ( [ 3 + 7 * 4 , 2 ] )`
   = note: expanding `rpn! { [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [  ] [ 3 + 7 * 4 , 2 ] )`
   = note: expanding `rpn! { [  ] [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [ [ 3 + 7 * 4 , 2 ] ] )`
   = note: expanding `rpn! { [ [ 3 + 7 * 4 , 2 ] ] }`
   = note: to `[(3 + 7) * 4, 2]`</code></pre>
            <p>If we carefully look through the trace, we'll notice that the problem originates in these steps:</p>
            <pre><code>   = note: expanding `rpn! { [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [  ] [ 3 + 7 * 4 , 2 ] )`</code></pre>
            <p>Since <code>[ 3 + 7 * 4 , 2 ]</code> was not matched by the <code>([$result:expr]) =&gt; ...</code> branch as a final expression, it was caught by our final catch-all <code>($($tokens:tt)*) =&gt; ...</code> branch instead, prepended with an empty stack <code>[]</code>, and then the original <code>[ 3 + 7 * 4 , 2 ]</code> was matched by the generic <code>$num:tt</code> and pushed onto the stack as a single final value.</p><p>To prevent this from happening, let's insert another branch between these last two that matches a stack with any number of values.</p><p>It will be hit only when we've run out of tokens but the stack doesn't contain exactly one final value, so we can treat this as a compile error and produce a more helpful error message using the built-in <a href="https://doc.rust-lang.org/std/macro.compile_error.html"><code>compile_error!</code></a> macro.</p><p>Note that we can't use <code>format!</code> in this context, since it uses runtime APIs to format a string; instead, we'll have to limit ourselves to the built-in <code>concat!</code> and <code>stringify!</code> macros to format the message:</p><p><a href="https://play.rust-lang.org/?gist=e56be9422387bcae54aab3b8405a11e7&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  // ...

  ([ $result:expr ]) =&gt; {
    $result
  };

  ([ $($stack:expr),* ]) =&gt; {
    compile_error!(concat!(
      "Could not find final value for the expression, perhaps you missed an operator? Final stack: ",
      stringify!([ $($stack),* ])
    ))
  };

  ($($tokens:tt)*) =&gt; {
    rpn!([] $($tokens)*)
  };
}</code></pre>
            <p>The error message is now more meaningful and contains at least some details about the current state of evaluation:</p>
            <pre><code>error: Could not find final value for the expression, perhaps you missed an operator? Final stack: [ (3 + 7) * 4 , 2 ]
  --&gt; src/main.rs:31:9
   |
31 |         compile_error!(concat!("Could not find final value for the expression, perhaps you missed an operator? Final stack: ", stringify!([$($stack),*])))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
40 |     println!("{}", rpn!(2 3 7 + 4 *));
   |                    ----------------- in this macro invocation</code></pre>
            <p>But what if, instead, we leave out a number?</p><p><a href="https://play.rust-lang.org/?gist=ce40630b8c1aa610c46b94557fdc9905&amp;version=stable">Playground</a></p>
            <pre><code>println!("{}", rpn!(2 3 + *));</code></pre>
            <p>Unfortunately, this one is still not too helpful:</p>
            <pre><code>error: expected expression, found `@`
  --&gt; src/main.rs:15:14
   |
15 |         rpn!(@op $stack * $($rest)*)
   |              ^
...
40 |     println!("{}", rpn!(2 3 + *));
   |                    ------------- in this macro invocation</code></pre>
            <p>If you try <code>trace_macros</code>, for some reason even it won't expand the stack here, but, luckily, it's relatively clear what's going on - <code>@op</code> has very specific conditions on what it matches (it expects at least two values on the stack), and, when it can't match, the <code>@</code> token gets matched by the same way-too-greedy <code>$num:tt</code> and pushed onto the stack.</p><p>To avoid this, we'll again add another branch that matches anything starting with <code>@op</code> that wasn't matched already, and produces a compile error:</p><p><a href="https://play.rust-lang.org/?gist=8729a8f3c96fa58ed62d35804c48782d&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) =&gt; {
    rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
  };

  (@op $stack:tt $op:tt $($rest:tt)*) =&gt; {
    compile_error!(concat!(
      "Could not apply operator `",
      stringify!($op),
      "` to the current stack: ",
      stringify!($stack)
    ))
  };

  // ...
}</code></pre>
            <p>Let's try again:</p>
            <pre><code>error: Could not apply operator `*` to the current stack: [ 2 + 3 ]
  --&gt; src/main.rs:9:9
   |
9  |         compile_error!(concat!("Could not apply operator ", stringify!($op), " to current stack: ", stringify!($stack)))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
46 |     println!("{}", rpn!(2 3 + *));
   |                    ------------- in this macro invocation</code></pre>
            <p>Much better! Now our macro can evaluate any RPN expression at compile time and gracefully handles the most common mistakes, so let's call it a day and say it's production-ready :)</p><p>There are many more small improvements we could make, but I'd like to leave them outside the scope of this tutorial.</p><p>Feel free to let me know if this has been useful and/or what topics you'd like to see covered in more detail <a href="https://twitter.com/RReverser">on Twitter</a>!</p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <guid isPermaLink="false">57apSexjoH5EJBBcLZNqv4</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we brought HTTPS Everywhere to the cloud (part 1)]]></title>
            <link>https://blog.cloudflare.com/how-we-brought-https-everywhere-to-the-cloud-part-1/</link>
            <pubDate>Sat, 24 Sep 2016 15:46:26 GMT</pubDate>
            <description><![CDATA[ CloudFlare's mission is to make HTTPS accessible for all our customers. It provides security for their websites, improved ranking on search engines, better performance with HTTP/2, and access to browser features such as geolocation that are being deprecated for plaintext HTTP. ]]></description>
            <content:encoded><![CDATA[ <p>CloudFlare's mission is to make HTTPS accessible for all our customers. It provides security for their websites, <a href="https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html">improved ranking on search engines</a>, <a href="/introducing-http2/">better performance with HTTP/2</a>, and access to browser features such as geolocation that are being deprecated for plaintext HTTP. With <a href="https://www.cloudflare.com/ssl/">Universal SSL</a> or similar features, a simple button click can now enable encryption for a website.</p><p>Unfortunately, as described in a <a href="/fixing-the-mixed-content-problem-with-automatic-https-rewrites/">previous blog post</a>, this is only half of the problem. To make sure that a page is secure and can't be controlled or eavesdropped on by third parties, browsers must ensure that not only the page itself but also all of its dependencies are loaded via secure channels. Page elements that don't fulfill this requirement are called mixed content, and they can either result in the entire page being reported as insecure or even being blocked completely, thus breaking the page for the end user.</p>
    <div>
      <h2>What can we do about it?</h2>
      <a href="#what-can-we-do-about-it">
        
      </a>
    </div>
    <p>When we conceived the Automatic HTTPS Rewrites project, we aimed to automatically reduce the amount of mixed content on customers' web pages without breaking their websites and without any delay noticeable to end users while a page is being rewritten on the fly.</p><p>A naive way to do this would be to just rewrite <code>http://</code> links to <code>https://</code>, or to let browsers do that with the <a href="https://www.w3.org/TR/upgrade-insecure-requests/"><code>Upgrade-Insecure-Requests</code></a> directive.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YhZ5thm7SrTbeJ65wiYoT/7428a3a42d5a26fe18a57c2558012046/tumblr_inline_nyupi1faxM1qkjeen_500-1.gif" />
            
            </figure><p>Unfortunately, such an approach is very fragile and unsafe unless you're sure that:</p><ol><li><p>Every single HTTP sub-resource is also available via HTTPS.</p></li><li><p>It's available at the exact same domain and path after the protocol upgrade (more often than you might think, that's <i>not</i> the case).</p></li></ol><p>If either of these conditions is unmet, you end up rewriting resources to non-existent URLs and breaking important page dependencies.</p><p>So we decided to take a look at the existing solutions.</p>
    <div>
      <h2>How are these problems solved already?</h2>
      <a href="#how-are-these-problems-solved-already">
        
      </a>
    </div>
    <p>Many security-aware people use the <a href="https://www.eff.org/https-everywhere">HTTPS Everywhere</a> browser extension to avoid these kinds of issues. HTTPS Everywhere contains a well-maintained database from the <a href="https://www.eff.org/">Electronic Frontier Foundation</a> with all sorts of mappings for popular websites that rewrite HTTP versions of resources to HTTPS, but only when that can be done without breaking the page.</p><p>However, most users are either not aware of it or not able to use it at all, for example, on mobile browsers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nhcIelWSXtuEWz07ilh04/054f2fc8ad7d51fea105f1778e6ccbe7/4542048705_25a394a2f3_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/generated/4542048705/in/photolist-7VnbQz-9sd2LW-4EEoZv-d2T6A7-5hKfKu-8UcLHh-pBjRDg-5gCYKG-8vS5Gw-8vP6yc-bj9pgX-qaSiZi-951EJW-75Xuvx-5pft8J-eyebR1-8dyjPV-r9csMz-991WwM-a3aW4T-3JAiSH-6fGqt7-cs2ud1-nEDWYQ-bLR6yz-4JKM5j-6KMths-4eWtLa-5iij6Z-bQSzaP-dKY18j-8SU3Vr-8nGQmE-bwPWoF-323VBR-FuKadJ-p8VD7-x9knmA-hJG7Bc-3KHP4m-8YmLDZ-6CmJme-ngT44v-7ThvBy-4m9A3n-7AGkE-ogJ97T-yCChfV-ok7E25-8Nkr9w">image</a> by <a href="https://www.flickr.com/photos/generated/">Jared Tarbell</a></p><p>So we decided to flip the model around. Instead of rewriting URLs in the browser, we would rewrite them inside the CloudFlare reverse proxy. By taking advantage of the existing database on the server side, website owners could turn it on and all their users would instantly benefit from HTTPS rewriting. The fact that it’s automatic is especially useful for websites with user-generated content, where it's not trivial to find and fix all the cases of inserted insecure third-party content.</p><p>At our scale, we obviously couldn't use the existing JavaScript rewriter. The performance challenges for a browser extension, which can find, match and cache rules lazily as a user opens websites, are very different from those of a CDN server that handles millions of requests per second. We usually don't get a chance to rewrite pages before they hit the cache either, as many pages are dynamically generated on the origin server and go straight through us to the client.</p><p>That means that to take advantage of the database, we needed to learn how the existing implementation works and create our own as a native library that works without delays under our load. Let's go through the same process here.</p>
    <div>
      <h2>How does HTTPS Everywhere know what to rewrite?</h2>
      <a href="#how-does-https-everywhere-know-what-to-rewrite">
        
      </a>
    </div>
    <p>HTTPS Everywhere rulesets can be found in the <a href="https://github.com/EFForg/https-everywhere/tree/master/src/chrome/content/rules"><code>src/chrome/content/rules</code></a> folder of the <a href="https://github.com/EFForg/https-everywhere">official repository</a>. They are organized as XML files, one for each set of hosts (with a few exceptions). This allows users with basic technical skills to write and contribute missing rules to the database on their own.</p><p>Each ruleset is an XML file with the following structure:</p>
            <pre><code>&lt;ruleset name="example.org"&gt;
  &lt;!-- Target domains --&gt;
  &lt;target host="*.example.org" /&gt;
 
  &lt;!-- Exclusions --&gt;
  &lt;exclusion pattern="^http://example\.org/i-am-http-only" /&gt;
 
  &lt;!-- Rewrite rules --&gt;
  &lt;rule from="^http://(www\.)?example\.org/" to="https://$1example.org/" /&gt;
&lt;/ruleset&gt;</code></pre>
            <p>At the time of writing, the HTTPS Everywhere database consists of ~22K such rulesets covering ~113K domain wildcards with ~32K rewrite rules and exclusions.</p><p>For performance reasons, we can't keep all those ruleset XMLs in memory, walk through their nodes, check each wildcard, perform replacements based on a specific string format, and so on. All that work would introduce significant delays in page processing and increase memory consumption on our servers. That's why we had to perform some compile-time tricks for each type of node to ensure that rewriting is smooth and fast for any user from the very first request.</p><p>Let's walk through those nodes and see what can be done in each specific case.</p>
    <div>
      <h3>Target domains</h3>
      <a href="#target-domains">
        
      </a>
    </div>
    <p>First of all, we get the target elements, which describe the domain wildcards that the current ruleset potentially covers.</p>
            <pre><code>&lt;target host="*.example.org" /&gt;</code></pre>
            <p>If a wildcard is used, it can be <a href="https://www.eff.org/https-everywhere/rulesets#wildcard-targets">either left-side or right-side</a>.</p><p>A left-side wildcard like <code>*.example.org</code> covers any hostname that has example.org as a suffix, no matter how many subdomain levels it has.</p><p>A right-side wildcard like <code>example.*</code> covers only one level instead, so that subdomains with the same beginning but an unexpected extra domain level are not accidentally caught. For example, the Google ruleset, among others, uses the <code>google.*</code> wildcard, and it should match <code>google.com</code>, <code>google.ru</code>, <code>google.es</code> etc., but not <code>google.mywebsite.com</code>.</p><p>Note that a single host can be covered by several different rulesets, as wildcards can overlap, so the rewriter needs the entire database in order to find the correct replacement. Still, matching the hostname instantly reduces all ~22K rulesets to only the 3-5 that we can deal with more easily.</p><p>Matching wildcards at runtime one by one is, of course, possible, but very inefficient with ~113K domain wildcards (and, as noted above, one domain can match several rulesets, so we can't even bail out early). We need to find a better way.</p>
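<p>For illustration, the wildcard semantics just described can be sketched as a naive one-by-one matcher in Rust. This is our own simplified sketch, not the production code, and the helper name <code>matches_target</code> is ours:</p>

```rust
// Naive sketch of HTTPS Everywhere target wildcard semantics:
// "*.example.org" matches any depth of subdomains ending in ".example.org",
// "example.*" matches exactly one extra label on the right,
// anything else must match exactly.
fn matches_target(host: &str, target: &str) -> bool {
    if let Some(suffix) = target.strip_prefix("*.") {
        // Left-side wildcard: any number of subdomain levels.
        host.len() > suffix.len() + 1
            && host.ends_with(suffix)
            && host.as_bytes()[host.len() - suffix.len() - 1] == b'.'
    } else if let Some(prefix) = target.strip_suffix(".*") {
        // Right-side wildcard: exactly one extra domain level.
        match host.strip_prefix(prefix) {
            Some(rest) => rest.starts_with('.') && rest.len() > 1 && !rest[1..].contains('.'),
            None => false,
        }
    } else {
        host == target
    }
}

fn main() {
    assert!(matches_target("maps.google.com", "*.google.com"));
    assert!(matches_target("google.es", "google.*"));
    assert!(!matches_target("google.mywebsite.com", "google.*"));
    assert!(!matches_target("google.com.ua", "google.*"));
    println!("wildcard checks passed");
}
```

<p>Checking every host against all ~113K wildcards this way is exactly the inefficiency the rest of this section is about avoiding.</p>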
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/018PMe56SuXGqrCUnghAoJ/a864c686c92348249d570b30b35419f9/3901819627_c3908690a0_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/vige/3901819627/in/photolist-6WMR3c-qqw1zU-7Gsyt1-4mQ6Sr-7GxrjW-bZTMQs-6HAcEr-58J6-pJb9qT-55o5bP-4c2bxs-4MEcWm-6yf4xg-dkdJkY-crpQwG-br8o3Y-4tXRcD-a3DzL7-nAYdFT-729Vjb-d5qcf-a59ugi-AKFWW-d2g9e5-3LQJEe-fqMVts-762EoB-4Lreh9-57pKGy-wnqcdN-99jyGb-6oAMor-8U28ub-9bYp3-92DYLM-6x8aZg-4MEcLQ-7n2QqA-8pydBi-ocFj72-fAyhG7-7B9Qwt-xxknG-d3Tk63-axF8dU-o4ALKi-grY52F-9bXtY-8KRwXd-a2syrf">image</a> by <a href="https://www.flickr.com/photos/vige/">vige</a></p><p>We use <a href="http://www.colm.net/open-source/ragel/">Ragel</a> to build fast lexers in other parts of our code. Ragel is a state machine compiler which takes grammars and actions described in its own syntax and generates source code in a given programming language as output. We decided to use it here too, and wrote a script that generates a Ragel grammar from our set of wildcards. In turn, Ragel converts it into C code for a state machine capable of going through the characters of URLs, matching hosts and invoking a custom handler on each found ruleset.</p><p>This leads us to another interesting problem. At the time of writing, among the ~113K domain wildcards, ~4.7K have a left-side wildcard and fewer than 200 have a right-side one. 
Left wildcards are expensive in state machines (including regular expressions) as they cause <a href="https://en.wikipedia.org/wiki/Combinatorial_explosion">DFA space explosion</a> during compilation, so Ragel got stuck for more than 10 minutes without producing any result - trying to analyze all the <code>*.</code> prefixes and merge all the possible states they could lead to, resulting in a complex tree.</p><p>Instead, if we look at the host from its end, we can significantly simplify the state tree (as only the 200 right-side wildcards now need to be checked separately instead of the 4.7K left-side ones), reducing compile time to less than 20 seconds.</p><p>Let's take an oversimplified example to understand the difference. Say we have the following target wildcards (3 left-side wildcards against 1 right-side wildcard and 1 plain host):</p>
            <pre><code>&lt;target host="*.google.com" /&gt;
&lt;target host="*.google.co.uk" /&gt;
&lt;target host="*.google.es" /&gt;
&lt;target host="google.*" /&gt;
&lt;target host="google.com" /&gt;</code></pre>
            <p>If we build a Ragel state machine directly from those:</p>
            <pre><code>%%{
    machine hosts;
 
    host_part = (alnum | [_\-])+;
 
    main := (
        any+ '.google.com' |
        any+ '.google.co.uk' |
        any+ '.google.es' |
        'google.' host_part |
        'google.com.ua'
    );
}%%</code></pre>
            <p>We will get the following state graph:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sykJ6ByovKV68tzpCF5zn/7fe8636ec9e22c152e6e8e83e007509a/1.png" />
            </figure><p>You can see that the graph is already pretty complex, as each starting character - even <code>g</code>, which is the explicit starting character of the <code>'google.'</code> and <code>'google.com.ua'</code> strings - still needs to simultaneously continue into the <code>any+</code> matches. Even once you have parsed the <code>google.</code> part of the host name, the input can still end up matching any of the given wildcards, whether as <code>google.google.com</code>, <code>google.google.co.uk</code>, <code>google.google.es</code>, <code>google.tech</code> or <code>google.com.ua</code>. This already blows up the complexity of the state machine, and we only took an oversimplified example with three left wildcards here.</p><p>However, if we simply reverse each rule in order to feed the string starting from the end:</p>
            <pre><code>%%{
    machine hosts;
 
    host_part = (alnum | [_\-])+;
 
    main := (
        'moc.elgoog.' |
        'ku.oc.elgoog.' |
        'se.elgoog.' |
        host_part '.elgoog' |
        'au.moc.elgoog'
    );
}%%</code></pre>
            <p>we get a much simpler graph and, consequently, significantly reduced graph build and matching times:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NfYuIYypMQPvlSOkk40so/e44c0007f0c2d621e5d755cb2e3ad099/2.png" />
            </figure><p>So now, all we need to do is go through the host part of the URL, stop on the <code>/</code> right after it, and start the machine backwards from that point. There is no need to waste time on in-memory string reversal, as Ragel provides the <code>getkey</code> instruction for custom data access expressions, which we can use to access characters in reverse order after we match the ending slash.</p><p>Here is an animation of the full process:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dcX7Zc8OH98Y9SDAIAu5E/a2cb285821563467d2ae59f365402099/third.gif" />
            
            </figure><p>After we've matched the host name and found the potentially applicable rulesets, we need to ensure that we're not rewriting URLs that aren't actually available over HTTPS.</p>
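<p>Stripped of the Ragel machinery, the reverse-traversal trick above boils down to comparing the host against patterns stored in reversed form, starting from the host's last byte, so that no reversed copy of the input is ever allocated. A minimal Rust sketch of that idea (our own illustration, not the generated matcher; the helper name is hypothetical):</p>

```rust
// Check whether `host` ends with the suffix whose *reversed* form is given,
// walking the host from its last byte - analogous to what Ragel's `getkey`
// data-access expression lets the generated machine do.
fn ends_with_reversed(host: &str, reversed_suffix: &str) -> bool {
    let h = host.as_bytes();
    let s = reversed_suffix.as_bytes();
    s.len() <= h.len() && s.iter().enumerate().all(|(i, &b)| h[h.len() - 1 - i] == b)
}

fn main() {
    // "moc.elgoog." is ".google.com" reversed.
    assert!(ends_with_reversed("maps.google.com", "moc.elgoog."));
    assert!(!ends_with_reversed("google.com.ua", "moc.elgoog."));
    println!("reverse matching checks passed");
}
```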
    <div>
      <h3>Exclusions</h3>
      <a href="#exclusions">
        
      </a>
    </div>
    <p>Exclusion elements serve exactly this goal: they mark URL patterns that must not be rewritten.</p>
            <pre><code>&lt;exclusion pattern="^http://(www\.)?google\.com/analytics/" /&gt;
&lt;exclusion pattern="^http://(www\.)?google\.com/imgres/" /&gt;</code></pre>
    <p>The rewriter needs to test against all the exclusion patterns before applying any actual rules. Otherwise, paths that have issues or can't be served over HTTPS would be incorrectly rewritten and would potentially break the website.</p><p>We don't care about matched groups, nor even which particular regular expression matched, so as an extra optimization, instead of going through them one by one, we merge all the exclusion patterns in a ruleset into one regular expression that can be internally optimized by the regexp engine.</p><p>For example, for the exclusions above we can create the following regular expression, whose common parts can be merged internally by the regexp engine:</p>
            <pre><code>(^http://(www\.)?google\.com/analytics/)|(^http://(www\.)?google\.com/imgres/)</code></pre>
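<p>The merge step itself is trivial string manipulation: each pattern is parenthesized and the results are joined with <code>|</code>. A sketch in Rust (the real code hands the merged pattern to PCRE; the helper name <code>merge_exclusions</code> is ours):</p>

```rust
// Merge several exclusion patterns into a single alternation, so one regexp
// compilation and one match attempt cover the whole ruleset's exclusions.
fn merge_exclusions(patterns: &[&str]) -> String {
    patterns
        .iter()
        .map(|p| format!("({})", p))
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    let merged = merge_exclusions(&[
        r"^http://(www\.)?google\.com/analytics/",
        r"^http://(www\.)?google\.com/imgres/",
    ]);
    assert_eq!(
        merged,
        r"(^http://(www\.)?google\.com/analytics/)|(^http://(www\.)?google\.com/imgres/)"
    );
    println!("{}", merged);
}
```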
            <p>After that, in our action, we just need to call <code>pcre_exec</code> without a match data destination: we only care about the completion status, not the matched groups. If a URL matches the regular expression, we bail out of this action, as no rewrites should be applied; Ragel will then automatically call the next matched action (the next candidate ruleset) on its own until an applicable one is found.</p><p>Finally, once we have both matched the host name and ensured that our URL is not covered by any exclusion patterns, we can move on to the actual rewrite rules.</p>
    <div>
      <h3>Rewrite rules</h3>
      <a href="#rewrite-rules">
        
      </a>
    </div>
    <p>These rules are presented as JavaScript regular expressions with replacement patterns. As soon as a host matches and the URL is not covered by an exclusion, the rewriter matches the URL against each of these regular expressions.</p>
            <pre><code>&lt;rule from="^http://(\w{2})\.wikipedia\.org/wiki/" to="https://secure.wikimedia.org/wikipedia/$1/wiki/" /&gt;</code></pre>
            <p>As soon as a match is found, the replacement is performed and the search stops. Note: while exclusions cover dangerous replacements, it's entirely possible and valid for a URL to not match any of the actual rules - in that case it's simply left intact.</p><p>After the previous steps we're usually down to only a couple of rules, so unlike in the case of exclusions, we don't apply any clever merging techniques to them. It turned out to be easier to go through them one by one than to create a regexp engine specifically optimized for multi-regexp replacements.</p><p>However, we don't want to waste time on regexp analysis and compilation on our edge servers. That would require extra time during initialization and extra memory for carrying the unnecessary text sources of the regular expressions around. PCRE allows regular expressions to be precompiled into its own binary format using <code>pcre_compile</code>. We then gather all these compiled regular expressions into one binary file and link it using <code>ld --format=binary</code> - a neat option that tells the linker to attach any given binary file as a named data resource available to the application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oBelLGyvi8KGq7mQmTvFK/e5356dd36c37b29aa1fefccca167e43a/15748968831_9d97f7167f_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/sidelong/15748968831/in/photolist-pZFzKn-nCgsND-5kmFQB-bm5Ny4-3qR9NP-NfYDG-e7AwCH-eqqc2o-e3DgoN-6ZcGVn-pkmTXn-3oT9Nj-8y4HB7-H93FUT-6pSxvu-aukZ2w-2yo3n-2fTgn7-dXH6No-nBzysU-nsnMR1-dHoz6o-zXDcxE-9G5ydk-HJPTCt-qoQnCi-zmKYcs-4vwvyV-ygPe2Q-rUH8dy-dSbR9U-sc8NEN-htr2XH-uDEHXF-ehnr4K-xDLoGG-gMbuTr-bygmuu-r26oQx-bDJmuS-7WHeZ7-o5V5nL-bn3PNf-9Fr7nQ-dbbuB6-4sGsph-77HwTg-gbA7WS-27jJRy-7xGShs">image</a> by <a href="https://www.flickr.com/photos/sidelong/">DaveBleasdale</a></p><p>The second part of the rule is the replacement pattern. It uses the simplest feature of JavaScript regex replacement - number-based groups - and has a form like <code>https://www.google.com.$1/</code>, which means that the resulting string should be the concatenation of <code>"https://www.google.com."</code>, the group matched at position <code>1</code>, and <code>"/"</code>.</p><p>Once again, we don't want to waste time repeatedly scanning for dollar signs and converting string indexes to numbers at runtime. Instead, it's more efficient to split this pattern at compile time into the static substrings <code>{ "https://www.google.com.", "/" }</code> plus an array of the group indexes that need to be inserted in between - in our case just <code>{ 1 }</code>. Then, at runtime, we simply build the resulting string by going through both arrays, concatenating static strings with the found matches.</p><p>Finally, once such a string is built, it's inserted in place of the previous attribute value and sent to the client.</p>
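<p>To make the compile-time split concrete, here is a small Rust sketch of the idea (our illustration only - the production code does this at build time): the template is split once into static substrings and group indexes, and rewriting then just interleaves the two arrays with the regexp matches:</p>

```rust
// Split a "$N"-style replacement template into static substrings plus the
// group indexes to insert between them, so no '$' scanning happens at
// rewrite time.
fn split_template(template: &str) -> (Vec<String>, Vec<usize>) {
    let (mut statics, mut groups) = (Vec::new(), Vec::new());
    let mut current = String::new();
    let mut chars = template.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '$' && chars.peek().map_or(false, |d| d.is_ascii_digit()) {
            let mut n = 0usize;
            while let Some(d) = chars.peek().and_then(|d| d.to_digit(10)) {
                n = n * 10 + d as usize;
                chars.next();
            }
            statics.push(std::mem::take(&mut current));
            groups.push(n);
        } else {
            current.push(c);
        }
    }
    statics.push(current);
    (statics, groups)
}

// At rewrite time, interleave the static parts with the matched groups
// (`matches[0]` is the whole match, `matches[N]` is capture group N).
fn apply(statics: &[String], groups: &[usize], matches: &[&str]) -> String {
    let mut out = String::new();
    for (i, s) in statics.iter().enumerate() {
        out.push_str(s);
        if let Some(&g) = groups.get(i) {
            out.push_str(matches[g]);
        }
    }
    out
}

fn main() {
    let (statics, groups) = split_template("https://www.google.com.$1/");
    assert_eq!(statics, vec!["https://www.google.com.", "/"]);
    assert_eq!(groups, vec![1]);
    // Hypothetical match data: whole match, then capture group 1 ("es").
    let rewritten = apply(&statics, &groups, &["http://google.es/", "es"]);
    assert_eq!(rewritten, "https://www.google.com.es/");
    println!("{}", rewritten);
}
```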
    <div>
      <h3>Wait, but what about testing?</h3>
      <a href="#wait-but-what-about-testing">
        
      </a>
    </div>
    <p>Glad you asked.</p><p>The HTTPS Everywhere extension uses an automated checker that verifies the validity of rewritten URLs on every ruleset change. To make that possible, rulesets are required to contain special test elements that cover all of their rewrite rules.</p>
            <pre><code>&lt;test url="http://maps.google.com/" /&gt;</code></pre>
            <p>What we need to do on our side is collect those test URLs, combine them with our own tests auto-generated from the wildcards, and run both the built-in HTTPS Everywhere JavaScript rewriter and our own side by side to ensure that we get the same results: URLs that should be left intact are left intact by our implementation, and URLs that are rewritten are rewritten identically.</p>
    <div>
      <h2>Can we fix even more mixed content?</h2>
      <a href="#can-we-fix-even-more-mixed-content">
        
      </a>
    </div>
    <p>After all this was done and tested, we decided to look around for other potential sources of guaranteed rewrites to extend our database.</p><p>One such source is the <a href="https://hstspreload.appspot.com/">HSTS preload list</a> maintained by Google and used by all the major browsers. It allows website owners who want to ensure that their website is never loaded via <code>http://</code> to submit their hosts (optionally together with subdomains), and this way opt in to having modern browsers auto-rewrite any <code>http://</code> references to <code>https://</code> before the request even hits the origin.</p><p>This means the origin guarantees that the HTTPS version will always be available and will serve exactly the same content as HTTP - otherwise any resources referenced from it would simply break, as the browser won't fall back to HTTP once the domain is in the list. A perfect match for another ruleset!</p><p>As we already have a working solution, and this list involves no complex regular expressions, we can download the JSON version of it <a href="https://chromium.googlesource.com/chromium/src/net/+/master/http/transport_security_state_static.json">directly from the Chromium source</a> and, as part of the build process, convert it into the same XML ruleset format with wildcards and exclusions that our system already understands and handles.</p><p>This way, both databases are merged and work together, rewriting even more URLs on customer websites without any major changes to the code.</p>
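<p>As a sketch of that conversion step (our own simplified illustration, not the real build script, which parses the full Chromium JSON; the entry fields <code>name</code>, <code>mode</code> and <code>include_subdomains</code> come from that file):</p>

```rust
// Turn one HSTS preload entry into an HTTPS Everywhere-style ruleset.
// `include_subdomains` maps naturally onto a left-side wildcard target, and
// since an HSTS host guarantees HTTPS at the same domain and path, a plain
// protocol upgrade is a safe rewrite rule.
fn hsts_entry_to_ruleset(name: &str, include_subdomains: bool) -> String {
    let mut targets = format!("  <target host=\"{}\" />\n", name);
    if include_subdomains {
        targets.push_str(&format!("  <target host=\"*.{}\" />\n", name));
    }
    format!(
        "<ruleset name=\"{}\">\n{}  <rule from=\"^http:\" to=\"https:\" />\n</ruleset>",
        name, targets
    )
}

fn main() {
    println!("{}", hsts_entry_to_ruleset("example.com", true));
}
```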
    <div>
      <h2>That was quite a trip</h2>
      <a href="#that-was-quite-a-trip">
        
      </a>
    </div>
    <p>It was... but it's not really the end of the story. You see, in order to provide safe and fast rewrites for everyone, and after analyzing the alternatives, we decided to write a new streaming HTML5 parser that became the core of this feature. We intend to use it for even more tasks in the future to ensure that we can improve the security and performance of our customers' websites in even more ways.</p><p>However, that deserves a separate blog post, so stay tuned.</p><p>And remember - if you're into web performance or security, or just excited about the possibility of working on features that must not break millions of pages every second - we're <a href="https://www.cloudflare.com/join-our-team/">hiring</a>!</p><p>P.S. We are incredibly grateful to the folks at the EFF who created the HTTPS Everywhere extension and worked with us on this project.</p> ]]></content:encoded>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Mixed Content Errors]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Crypto Week]]></category>
            <guid isPermaLink="false">3Zps5SYwGYawkTGZfKjlfn</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
    </channel>
</rss>