HTML5 Parser-Based View Source Syntax Highlighting

A new implementation of the View Source HTML and XML syntax highlighting has landed in Firefox.

Why?

There is a new implementation because the old implementation was based on the old HTML parser, which we want to get rid of. The old View Source implementation was standing in the way of removing the old parser. Also, the old parser did some incorrect highlighting. Most notably, it flagged the unnecessary-but-permitted slash on void elements (e.g. <br/>) as an error, because all such slashes were bogus in non-X HTML prior to HTML5.

The reason why the new implementation uses the HTML5 parser instead of using something new and Orion-integrated in the dev tools is that the new implementation was written before there were publicized plans to integrate dev tools with View Source. Furthermore, there is no way to get HTML syntax highlighting right without the highlighter running the whole HTML(5) parsing algorithm, because tokenizer state transition decisions depend on the tree builder state.
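To illustrate the dependency, here is a minimal Python sketch. It is not the Firefox code. The function name and the in_foreign_content flag are made up for this example; the flag stands in for what the tree builder knows about the current node (whether it is an SVG or MathML element rather than an HTML element).

```python
def classify_markup_declaration(text: str, pos: int, in_foreign_content: bool) -> str:
    """Classify the construct starting with '<!' at text[pos].

    in_foreign_content is a hypothetical flag supplied by the tree builder:
    True when the current node is not an element in the HTML namespace.
    """
    if text.startswith("<!--", pos):
        return "comment"
    if text[pos:pos + 9].lower() == "<!doctype":
        return "doctype"
    if text.startswith("<![CDATA[", pos):
        # The same characters mean different things depending on tree state.
        return "cdata-section" if in_foreign_content else "bogus-comment"
    return "bogus-comment"
```

A highlighter that only looks at the characters cannot make the CDATA decision; it needs to know which elements are currently open.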

New Features

The first and foremost feature is not user-visible per se: the new implementation does not use the old parser code, which makes it possible to get rid of the old parser. However, using the old parser had user-visible consequences.

More Correct Highlighting

As already mentioned, the old parser unconditionally highlighted the slash in <foo/> in red regardless of the element name. Furthermore, the old parser failed to get the highlighting of tricky inline scripts right (when the inline script contained the string </script>). Highlighting of SVG and MathML content in text/html was wrong, too, since the old parser knew nothing about foreign content in text/html.

Consider the following highlighting by the old parser:

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
<script>
var lt = "<";
<!--
var s = "<script>foo</script>";
-->
</script><!-- Not quite optimal highlight there. -->
<style>
/* </foo> */
</style>
</head>
<body>
<p>Entity: &amp; </p>
<iframe><img></iframe>
<noscript><p>Not para</p></noscript>
<svg>
<title><![CDATA[bar]]></title>
<script><!-- this is a comment --></script>
</svg>
</body>
</html>

The first occurrence of </script> is highlighted as an end tag. The content of the SVG title and script elements is treated as if the elements were HTML elements of the same name.

Compare the above to the highlighting performed by the new implementation:

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
<script>
var lt = "<";
<!--
var s = "<script>foo</script>";
-->
</script><!-- Not quite optimal highlight there. -->
<style>
/* </foo> */
</style>
</head>
<body>
<p>Entity: &amp; </p>
<iframe><img></iframe>
<noscript><p>Not para</p></noscript>
<svg>
<title><![CDATA[bar]]></title>
<script><!-- this is a comment --></script>
</svg>
</body>
</html>

The HTML script is tokenized according to the HTML rules. Note that <!-- ... --> inside an HTML script is not a comment node! In the SVG subtree, title and script are not special and can have CDATA sections or comments inside them. (The coloring of the HTML script end tag is inconsistent with other end tags, though, due to technical difficulties.)
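The script rules can be sketched roughly as follows. This is a simplified Python illustration of the escaped and double-escaped script states from the HTML parsing algorithm, not the Firefox tokenizer; the function name, the state names and the SAMPLE constant are my own for this example, and details such as CR handling are left out.

```python
import re

# "<script" or "</script" only counts when followed by whitespace, "/" or ">".
_SCRIPT_OPEN = re.compile(r"<script[\t\n\f />]", re.IGNORECASE)
_SCRIPT_CLOSE = re.compile(r"</script[\t\n\f />]", re.IGNORECASE)


def find_script_end(source: str) -> int:
    """Return the index of the '</script' that ends the script, or -1.

    source is the text that follows the <script> start tag.
    """
    state = "data"  # "data", "escaped" or "double_escaped"
    i = 0
    while i < len(source):
        if state == "data":
            if source.startswith("<!--", i):
                state = "escaped"
                i += 2  # keep the dashes so that "<!-->" closes right away
                continue
            if _SCRIPT_CLOSE.match(source, i):
                return i
        elif state == "escaped":
            if source.startswith("-->", i):
                state = "data"
                i += 3
                continue
            if _SCRIPT_CLOSE.match(source, i):
                return i
            if _SCRIPT_OPEN.match(source, i):
                state = "double_escaped"
                i += len("<script")
                continue
        else:  # double_escaped
            if source.startswith("-->", i):
                state = "data"
                i += 3
                continue
            if _SCRIPT_CLOSE.match(source, i):
                # This end tag only leaves the double-escaped state.
                state = "escaped"
                i += len("</script")
                continue
        i += 1
    return -1


SAMPLE = (
    'var lt = "<";\n<!--\nvar s = "<script>foo</script>";\n-->\n'
    "</script><!-- Not quite optimal highlight there. -->"
)
```

With SAMPLE, which mirrors the script in the example above, the function skips the </script> inside the string literal and reports the second one, whereas a plain search for </script> would stop at the first.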

Better Error Reporting

The old parser highlighted errors so rarely that it was easy to think it was not doing it at all. However, it did indeed have support for highlighting a couple of errors. I am aware of it highlighting doctypes that used XML syntax inappropriate for HTML and highlighting the already mentioned XML-style slash in tags.

To get feature parity with the old implementation, the new implementation had to support at least highlighting the XML-style slash when the use of the slash is wrong per HTML5 / HTML Living Standard. Highlighting the slash correctly is among the more difficult error highlights. The Java version of the parser (from which the C++ version is mechanically translated) already supported full error reporting, since it was originally written for a validator. So I thought I could add support for the easier error highlights, too, while at it.

But why stop at highlights without explaining them? I also made the parser attach error messages to the highlights as tooltips. (Unfortunately, Firefox has long-standing accessibility problems with tooltips, so the error messages are not keyboard-accessible at the moment.)
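As a rough illustration of the idea, attaching messages to a highlight could look like the following Python sketch. The span element, the "error" class name and the function name are assumptions made for this example and do not describe the markup that Firefox actually generates.

```python
from html import escape


def wrap_error(source_fragment: str, messages: list[str]) -> str:
    """Wrap a piece of source text in a highlighted span with a tooltip.

    The class name "error" and the use of the title attribute are
    illustrative assumptions, not the actual View Source markup.
    """
    # Multiple messages share one tooltip, separated by line breaks.
    title = "\n".join(messages)
    return '<span class="error" title="{}">{}</span>'.format(
        escape(title, quote=True), escape(source_fragment)
    )
```

Both the tooltip text and the source fragment are escaped, since the viewed source must show up as text rather than be interpreted as markup.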

The new View Source implementation produces results like this (note the tooltips):

<!DOCTYPE HTML PUBLIC"-//W3C//DTD HTML 4.01//EN">
<form>
<table>
<h1>Error test</h1>
<tr><td><div>cell<td>another cell
</table>
<form>
<select><select>
<div>
<p class="foo"id="bar">
<p/>
<br/>
<h2><h3></h3></h2>
<![CDATA[bogus comment]]>
<svg>
<![CDATA[this is text]]>
<div>
<![CDATA[bogus comment again]]>
<!--foo--bar-->
<!-- foo --!>
<p><i><b>bold italic</i></b></p>
<p>&#x0000;</p>
<p>&auml </p>
<p>&foo </p>
<a class=foohref="foo">
<a></a>
</a>

Note: The tooltips will have line breaks between multiple error messages in one tooltip when viewed in the View Source window in Firefox. The lack of line breaks in Firefox in other contexts (including this page) is a known HTML5 violation bug.

Off-The-Main-Thread Highlighting

As a consequence of the off-the-main-thread design of the HTML5 parser in Firefox, the highlight computation now happens off the main thread.

Limitations

The above may set expectations too high, so it is important to lower them right away.

This Is Not a Validator!

All the errors you have seen above are parse errors. Parse errors are errors defined as such by the HTML parsing algorithm. There is much more to checking HTML validity than just finding the parse errors.

For example, putting a div element as a child of an ul element or as a child of a span element is not a parse error. In an HTML validator, content model errors like that are detected by a validation layer above the parser. The View Source implementation in Firefox does not have a validation layer at all.

The lack of a validation layer has counter-intuitive consequences. The HTML parsing algorithm avoids parse errors that would be redundant with validation errors. For example, <div<div> is a start tag for an element named div<div. Since there is no such element in the HTML language, the validation layer would catch the error. However, when we do not have a validation layer, the typo goes unreported.
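The following Python sketch shows why the typo slips through. It is a simplification for illustration only: the function names are made up, and KNOWN_ELEMENTS is a tiny stand-in for the kind of knowledge a validation layer would have.

```python
def tag_name(start_tag: str) -> str:
    """Return the tag name the way the tokenizer's tag name rules see it.

    The name runs until whitespace, "/" or ">". A "<" inside the name
    is just another name character and is not a parse error.
    """
    if not start_tag.startswith("<"):
        raise ValueError("expected a start tag")
    name = []
    for ch in start_tag[1:]:
        if ch in "\t\n\f />":
            break
        name.append(ch.lower())
    return "".join(name)


# Hypothetical, deliberately incomplete list used only for this example.
KNOWN_ELEMENTS = {"div", "p", "span", "ul", "li"}


def validation_layer_accepts(start_tag: str) -> bool:
    """What a validation layer above the parser could check."""
    return tag_name(start_tag) in KNOWN_ELEMENTS
```

Here tag_name("<div<div>") yields the name div<div without any parse error, and only the validation-style check notices that no such element exists.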

Please do not advertise the new View Source implementation by saying that Firefox now has a validator in the View Source window.

Not All Parse Errors Are Reported!

Even though all the errors that are reported are parse errors according to the specification, not all parse errors are reported.

XML Syntax Highlighting

The old implementation used the HTML tokenizer for highlighting XML source. So does the new implementation. While the tokenizer has support for processing instructions when it is highlighting XML source, that is the only XML-oriented additional capability. As a result, doctypes that have an internal subset are mishighlighted and entity references to custom entities are mishighlighted. This is obvious when viewing the source of Firefox chrome files. However, the mishighlighting should not be a practical problem when viewing source of typical XML files on the Web (to the extent there are XML files on the Web).

Other Known Bugs

The new implementation broke the window title for View Source windows. Also, the highlighting of the end of named character references is off by one.

Release Schedule

The code landed on trunk in time for Firefox 10. However, the landing added quite a bit of code, so it is possible that the code gets turned off after the Aurora uplift. To test the new code, using Nightly is your best bet.

Update: This feature was deferred to Firefox 11 due to regressions that were not fixed in time for Firefox 10.