The Performance Cost of the HTML Tree Builder

I’ve been thinking about the performance gap between the Validator.nu HTML Parser and Xerces. How much of the gap can be attributed to the “extra fix-ups” that an HTML parser has to do, and how much to my code simply being worse than the Xerces code?

Introduction

Tokenizing XML and HTML is pretty similar. Sure, an HTML tokenizer has to check each name character for upper case, but then an XML tokenizer has to check the silliness that is the Name production. The main difference is in the tree construction layer. In general, comparing an XML parser and an HTML parser from different authors doesn’t tell you much about the performance cost of the “extra fix-ups” HTML needs: the parsers may have otherwise fundamentally better or worse implementation strategies, and different code bases have enjoyed different amounts of attention and tweaking. To compare the tree construction layers, the tokenizing layer needs to be kept constant.
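
To make the contrast concrete, here is a rough sketch of the per-character name checks involved. These are simplified stand-ins for illustration, not the actual code of either parser:

    final class NameChecks {
        // HTML tokenizer: fold A-Z to lower case while reading a tag name.
        static char foldHtmlNameChar(char c) {
            return (c >= 'A' && c <= 'Z') ? (char) (c + 0x20) : c;
        }

        // XML tokenizer: may this character continue a Name? (A rough
        // approximation of XML 1.0's NameChar production.)
        static boolean isXmlNameChar(char c) {
            return Character.isLetterOrDigit(c)
                    || c == '.' || c == '-' || c == '_' || c == ':';
        }
    }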

To run a proper benchmark, I implemented a very thin token handler that trivially maps HTML5 tokens to SAX events. This token handler is only 115 lines of code (mostly autogenerated by Eclipse) compared to 3927 lines of the real HTML5 SAX streamer code. With this thin layer, the resulting parser is similar to an XML5 parser without support for Namespaces.
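
In outline, the thin layer looks something like this. This is a sketch with illustrative callback names, not the parser’s actual token handler interface; only the SAX side is real API:

    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Sketch: forward HTML5 tokens to SAX events with no tree-builder
    // fix-ups and no Namespace processing (empty namespace URI).
    final class ThinSaxEmitter {
        private final ContentHandler handler;

        ThinSaxEmitter(ContentHandler handler) {
            this.handler = handler;
        }

        // Called per start tag token; the tokenizer has already
        // lower-cased the name.
        void startTag(String name, Attributes attrs) throws SAXException {
            handler.startElement("", name, name, attrs);
        }

        void endTag(String name) throws SAXException {
            handler.endElement("", name, name);
        }

        void characters(char[] buf, int start, int length) throws SAXException {
            handler.characters(buf, start, length);
        }
    }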

Method

I chose to use the front page of Wikipedia (saved on 2008-04-01) as the test data, because it is a well-known, real-world Web page that happens to be well-formed text/html-compatible XHTML, so it can be used for testing both the HTML and the XML parsers.

Since I wanted to test the parser core, I eliminated the effect of IO and character decoding by letting the parsers read pre-converted UTF-16 data from a CharArrayReader. Instead of building a tree, the parsers ran in SAX mode. The content handler was an XML serializer writing to a mock Writer that discarded everything. (This way, there was some code to touch each attribute in case Xerces builds attributes lazily. I have not checked whether it does.) All the parsers were set to intern element and attribute name strings. The XML parsers were set not to read the DTD.
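
In outline, the harness looked something like this. This is a sketch, not the actual benchmark code; the string-interning and external-DTD feature URIs are the standard SAX and Xerces ones:

    import java.io.CharArrayReader;
    import java.io.Writer;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    final class Harness {
        // A Writer that discards everything: serialization work gets
        // done, but no IO happens.
        static final class NullWriter extends Writer {
            @Override public void write(char[] buf, int off, int len) {}
            @Override public void flush() {}
            @Override public void close() {}
        }

        static void parseOnce(char[] doc) throws Exception {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Intern element and attribute name strings.
            reader.setFeature(
                "http://xml.org/sax/features/string-interning", true);
            // Don't read the external DTD (Xerces feature).
            reader.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                false);
            // Content handler: an identity serializer writing to the
            // NullWriter, so every attribute gets touched on the way through.
            TransformerHandler serializer =
                ((SAXTransformerFactory) TransformerFactory.newInstance())
                    .newTransformerHandler();
            serializer.setResult(new StreamResult(new NullWriter()));
            reader.setContentHandler(serializer);
            // Pre-decoded UTF-16 data: no IO, no charset decoding.
            reader.parse(new InputSource(new CharArrayReader(doc)));
        }
    }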

I ran each parser in a loop, first for 10 minutes to warm up HotSpot and then for another 10 minutes to actually record the benchmark. I ran the tests on Mac OS X 10.5.4 on an Intel Core 2 Duo (x86_64).
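
The timing logic amounts to a pair of deadline loops, along these lines (a sketch reusing the hypothetical parseOnce from above, not the actual benchmark code):

    // Run the parser repeatedly against a deadline and count iterations.
    static long iterationsIn(long minutes, char[] doc) throws Exception {
        long deadline = System.nanoTime() + minutes * 60L * 1000000000L;
        long iterations = 0;
        while (System.nanoTime() < deadline) {
            Harness.parseOnce(doc);
            iterations++;
        }
        return iterations;
    }

    // The first loop warms up HotSpot; only the second one is recorded.
    // iterationsIn(10, doc);               // warm-up, result discarded
    // long score = iterationsIn(10, doc);  // the benchmark figure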

I also included “fast read()” variants of the Validator.nu HTML Parser. These variants have the per-character error reporting and validator-precision source location tracking commented out. (I only commented them out in the most obvious place—there’s more potential for removing stuff that’s only interesting in a validator.)
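
To give an idea of what got commented out, the difference is roughly this. An illustrative sketch only; the actual tokenizer code also deals with buffer refills, CR/LF handling and more:

    final class TokenizerSketch {
        private char[] buf;
        private int pos, line, col;

        private char read() {
            char c = buf[pos++];
            // Validator-precision source-location tracking; the "fast
            // read()" variants have this block commented out, leaving
            // just an array access and an increment:
            if (c == '\n') {
                line++;
                col = 0;
            } else {
                col++;
            }
            return c;
        }
    }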

Results

Here are the results as iterations per unit of time, relative to the tokenizer of the Validator.nu HTML Parser with the thin SAX layer (= 100%). (Note that the two VMs are not equally fast: the 1.6 x86_64 server VM is over 50% faster than the 1.5 x86 client VM.)

Parser                                                          1.5 x86 Client  1.6 x86_64 Server
Xerces-J 2.9.1 with Namespaces                                            109%               112%
Xerces-J 2.9.1 without Namespaces                                         141%               142%
Ælfred2 (Validator.nu fork) with Namespaces                                72%                75%
Validator.nu HTML Parser streaming SAX mode                                89%                93%
Validator.nu HTML tokenizer with thin SAX layer                           100%               100%
Validator.nu HTML Parser streaming SAX mode, fast read()                   95%                93%
Validator.nu HTML tokenizer with thin SAX layer, fast read()              107%               104%

The difference between Xerces with and without Namespaces sure looks interesting. Let’s see what the numbers look like relative to Xerces without Namespaces, i.e., with each figure above divided by the Xerces-without-Namespaces figure for the same VM (for example, the streaming SAX mode on the 1.5 client VM: 89% ÷ 141% ≈ 63%).

Parser                                                          1.5 x86 Client  1.6 x86_64 Server
Xerces-J 2.9.1 with Namespaces                                             78%                79%
Xerces-J 2.9.1 without Namespaces                                         100%               100%
Ælfred2 (Validator.nu fork) with Namespaces                                51%                53%
Validator.nu HTML Parser streaming SAX mode                                63%                66%
Validator.nu HTML tokenizer with thin SAX layer                            71%                71%
Validator.nu HTML Parser streaming SAX mode, fast read()                   68%                66%
Validator.nu HTML tokenizer with thin SAX layer, fast read()               76%                74%

Analysis

Summary

Xerces is faster. Namespaces are worse than the much-maligned HTML “extra fix-ups” (a 21% hit vs. a 7% hit on the 1.6 server VM). And, as the Ælfred2 numbers show, an XML parser can also simply be slow.