HTML Parsing: Finally Defined

Henri Sivonen

Vocabulary & Serializations

<!DOCTYPE html>
<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>Foo</p>
  </body>
</html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>Foo</p>
  </body>
</html>

Vocabulary	HTML
Serialization	HTML	XHTML
Media Type	text/html	a…n/xhtml+xml
Parser	HTML	XML
Tree API	DOM

Defining HTML Parsing

Bad Old Days

“HTML 4 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language SGML (defined in [ISO8879]).”

“SGML systems conforming to [ISO8879] are expected to recognize a number of features that aren’t widely supported by HTML user agents. We recommend that authors avoid using all of these features.”

Source: http://www.w3.org/TR/html401/

Glorious HTML5

10.2.4.10 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

Switch to the before attribute name state.

U+002F SOLIDUS (/)

Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)

Switch to the data state. Emit the current tag token.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z

Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.

U+0000 NULL

Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.

EOF

Parse error. Reconsume the EOF character in the data state.

Anything else

Append the current input character to the current tag token's tag name.

Source: http://www.whatwg.org/specs/web-apps/current-work/

Parsing Steps

Bytes
(Encoding sniffing)
Conversion into characters
Tokenizer
Tree builder
DOM

Encoding Sniffing

Scan first 1024 bytes for <meta> on the byte level
In some locales, analyze the data for byte pattern statistics
Please always use HTTP
Content-Type: text/html; charset=utf-8
and/or early
<meta charset=utf-8>
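
A rough illustration of the byte-level scan: the sketch below looks for a charset declaration inside a `<meta>` tag within the first 1024 bytes. The helper name `sniff_meta_charset` and the regular expression are assumptions made for this example; the real prescan in the spec handles many more cases (comments, `http-equiv`, byte order marks), so treat this as a simplification, not the algorithm itself.

```python
import re

# Hypothetical, simplified pattern: a <meta ...> tag containing charset=LABEL.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Return the charset label declared in the first `limit` bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A declaration that appears after the scanned window is not seen at all, which is why the `<meta charset=utf-8>` should come early.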

Tokenization

Doctypes
Comments
Start tags
End tags
Text
End-of-file

Tokenization cont’d

State machine
Transitions per input character
- Always a case for “anything else”!
Deals with named character references like &auml;
Mostly based on IE’s behavior except
- <?import >
- foo=`bar`
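
To see named character references resolved outside a browser, Python's standard `html.unescape` is documented to follow the HTML5 rules and name list. This is only a quick demonstration of the reference handling, not of the tokenizer state machine; the variable names are made up for the example.

```python
from html import unescape

# Named references with the terminating semicolon.
with_semicolon = unescape("caf&eacute; &auml;")

# Legacy names such as "auml" are also recognized without the semicolon.
without_semicolon = unescape("&auml")
```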

Spec Text

10.2.4.10 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

Switch to the before attribute name state.

U+002F SOLIDUS (/)

Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)

Switch to the data state. Emit the current tag token.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z

Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.

U+0000 NULL

Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.

EOF

Parse error. Reconsume the EOF character in the data state.

Anything else

Append the current input character to the current tag token's tag name.

Source: http://www.whatwg.org/specs/web-apps/current-work/
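
The spec text maps almost line for line onto code. Below is a minimal sketch of the tag name state only, assuming a hypothetical `TagToken` class, `None` as the EOF marker, and plain strings as state names; none of these come from the spec or from any particular implementation.

```python
from dataclasses import dataclass

EOF = None  # sentinel for end of input (an assumption of this sketch)


@dataclass
class TagToken:
    name: str = ""


def tag_name_state(ch, token):
    """Handle one input character in the tag name state.

    Returns (next_state, emit_token, parse_error).
    """
    if ch is EOF:
        # Parse error; the EOF is reconsumed in the data state.
        return "data", False, True
    if ch in ("\t", "\n", "\f", " "):
        return "before attribute name", False, False
    if ch == "/":
        return "self-closing start tag", False, False
    if ch == ">":
        # Switch to the data state and emit the current tag token.
        return "data", True, False
    if "A" <= ch <= "Z":
        # Lowercase by adding 0x0020 to the code point.
        token.name += chr(ord(ch) + 0x20)
        return "tag name", False, False
    if ch == "\0":
        # Parse error; append U+FFFD REPLACEMENT CHARACTER.
        token.name += "\ufffd"
        return "tag name", False, True
    # Anything else: append the character unchanged.
    token.name += ch
    return "tag name", False, False
```

Feeding the characters of `DiV` and then `>` leaves the token name as `div` and asks the caller to emit the token and return to the data state.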

Tree Building History

<b><i></b></i>

IE: Non-tree
Opera: Secret augmentation
Gecko: Not deterministic
WebKit: Deterministic magic

Tree Building

Stack-based state machine
Transitions per token
- Always a case for “anything else”!
Mostly based on WebKit
- With refinements based on other browsers and experimentation
Never reads back from the DOM

Spec Text

11.2.5.4.4 The "in head" insertion mode

When the user agent is to apply the rules for the "in head" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character into the current node.

A comment token

Append a Comment node to the current node with the data attribute set to the data given in the comment token.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is one of: "base", "basefont", "bgsound", "command", "link"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

…

…

Anything else

Act as if an end tag token with the tag name "head" had been seen, and reprocess the current token.

Source: http://www.whatwg.org/specs/web-apps/current-work/
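
The tree builder reads the same way: one branch per token kind, ending in "anything else". The sketch below covers only the cases quoted above and skips the elided ones. `Token`, `Node`, the `errors` list and the return convention (a mode name to reprocess the token in, or `None` when the token was consumed) are all assumptions of this example. The "after head" mode name is taken from the spec's handling of the "head" end tag, which is not quoted on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    kind: str       # "character", "comment", "doctype" or "start_tag"
    data: str = ""  # character, comment text or tag name


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


WHITESPACE = {"\t", "\n", "\f", "\r", " "}
HEAD_VOID_TAGS = {"base", "basefont", "bgsound", "command", "link"}


def in_head(token, open_elements, errors):
    """Apply the quoted "in head" rules to one token."""
    current = open_elements[-1]
    if token.kind == "character" and token.data in WHITESPACE:
        current.children.append(token.data)
        return None
    if token.kind == "comment":
        current.children.append(Node("#comment", [token.data]))
        return None
    if token.kind == "doctype":
        errors.append("unexpected DOCTYPE")  # parse error, token ignored
        return None
    if token.kind == "start_tag" and token.data == "html":
        return "in body"  # process using the "in body" rules
    if token.kind == "start_tag" and token.data in HEAD_VOID_TAGS:
        node = Node(token.data)
        current.children.append(node)
        open_elements.append(node)
        open_elements.pop()  # immediately popped again
        return None
    # Anything else: act as if </head> had been seen, then reprocess.
    open_elements.pop()
    return "after head"
```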

What Do Authors Need to Do Now?

Nothing

…As Long as Your Site Already Worked Cross-Browser

Gotchas

Beware of WebKit monoculture on mobile

<foo<bar>
No reparsing
- <!--…EOF
- <title>…EOF
- <script src='foo.js' />…EOF

Implementations You Can Use

Python
Ruby
Java
JavaScript

The Only Right Way to Sanitize against XSS

Regular Expressions Are Not the Right Way

“Now you have two problems”

The Right Way

Parse using an HTML parser
Drop script and style content
Drop elements and attributes not on your whitelist
Serialize using an HTML serializer
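
The four steps can be sketched with Python's standard library. Two loud caveats: `html.parser` is not a spec-conforming HTML5 parser, so a production sanitizer should sit on top of one that is, and the whitelist below is a hypothetical example. URL-bearing attributes such as `href` are deliberately left off it because they would also need scheme checks.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> allowed attribute names.
ALLOWED = {"p": set(), "b": set(), "i": set(), "abbr": {"title"}}
DROP_CONTENT = {"script", "style"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs
                if name in ALLOWED[tag]
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text on the way out (the serializer step).
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Elements that are not on the whitelist lose their tags but keep their text, while `script` and `style` lose their content as well.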

No Mystery

Speculative Parsing

Here be product-specific stuff

Scripts Block the Parser

<script src=foo.js></script>
<img src=photo.jpg>
<script src=bar.js></script>

`document.write()`

Parsing Steps with Inconvenient Truth

Bytes
(Encoding sniffing)
Conversion into characters
Tokenizer ← document.write() ↰
Tree builder → Script execution ⬏
DOM

Old Solution: Speculative Pre-Scan

Started resource loads
Tokenized twice
No luck with
document.write("<script src=foo.js></script>");
document.write("<script src=bar.js></script>");
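
The idea of the pre-scan, sketched in Python: while the real parser is blocked on a script, a second pass over the markup collects URLs that could start loading. The class name `PreScanner` and the element-to-attribute table are assumptions of this example, and the standard-library `html.parser` stands in for a browser's tokenizer.

```python
from html.parser import HTMLParser


class PreScanner(HTMLParser):
    """Collect URLs worth fetching early; builds no tree."""

    # Hypothetical table: element name -> attribute that holds the URL.
    URL_ATTR = {"script": "src", "img": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTR.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def prescan(markup):
    scanner = PreScanner()
    scanner.feed(markup)
    return scanner.urls
```

Markup that only comes into existence through `document.write()` is not in the scanned text, so the scan cannot find the scripts it references.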

Off-the-Main-Thread Parsing

Run tokenizer and tree builder off the main thread
Send tree operations to the main thread
Sync state for main-thread document.write() parsing
Assume benign document.write()
Rewind stream and reparse in case of bad document.write()
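
A toy model of the first two bullets, assuming a worker thread that parses and a queue that carries tree operations to the consuming thread. The operation tuples, the class name `OpEmitter` and the use of `None` as an end marker are invented for this sketch. State syncing for `document.write()` and the rewind path are left out entirely.

```python
import queue
import threading
from html.parser import HTMLParser


class OpEmitter(HTMLParser):
    """Runs on the parser thread; emits operations instead of touching a tree."""

    def __init__(self, ops):
        super().__init__()
        self.ops = ops

    def handle_starttag(self, tag, attrs):
        self.ops.put(("open", tag))

    def handle_endtag(self, tag):
        self.ops.put(("close", tag))

    def handle_data(self, data):
        self.ops.put(("text", data))


def parse_off_thread(markup):
    ops = queue.Queue()

    def worker():
        parser = OpEmitter(ops)
        parser.feed(markup)
        parser.close()
        ops.put(None)  # end-of-stream marker

    threading.Thread(target=worker, daemon=True).start()

    # The calling thread plays the role of the main thread and
    # applies the operations in the order they were produced.
    applied = []
    for op in iter(ops.get, None):
        applied.append(op)
    return applied
```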

`document.write()` Tail Prescan

document.write("<script src=a.js></script>"
             + "<script src=b.js></script>");
document.write("<script src=c.js></script>");

No Need to Worry about It!

One New Feature

MathML and SVG

The solution for quadratic equations is $x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}$.

Warning: Remember that ± means that there are two solutions!

<p>The solution for quadratic equations is  
<math>
  <!-- ... -->
  <mfrac>
    <mrow>
      <mo>&minus;</mo>
  <!-- ... -->
</math>.</p>
<p><svg viewBox='5 9 90 86'>
  <path d='M 10,90 L 90,90 L 50,14 Z'/>
  <line x1=50 x2=50 y1=45 y2=75 />
</svg><b>Warning:</b> Remember that &PlusMinus; 
means that there are two solutions!</p>

SVG or Canvas?

Canvas zooms and prints as bitmap
- Great for games, though
SVG zooms and prints as vector graphics
- Icons
- Maps
- Illustrations
- Charts

Copy and Paste

Just Works
(ignoring degradation in old browsers)

Namespaces

<svg>…</svg> becomes SVG
<math>…</math> becomes MathML
Nested scopes: foreignObject, annotation-xml, etc.
xmlns has absolutely no effect
Right Thing happens with xlink:href
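
A simplified picture of the namespace assignment: the tag name decides the namespace and descendants inherit it. The function `namespace_for` is hypothetical, and the sketch ignores the nested scopes (`foreignObject`, `annotation-xml`) and the break-out rules.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def namespace_for(tag, parent_ns=HTML_NS):
    """Pick a namespace from the tag name alone; xmlns is never consulted."""
    tag = tag.lower()
    if tag == "svg":
        return SVG_NS
    if tag == "math":
        return MATHML_NS
    # Everything else stays in the namespace of its parent.
    return parent_ns
```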

Tokenizer Reuse

<SVG VIEWBOX='0 0 10 10'> works
svg: or math: not supported
Quotes around attribute values optional when optional in HTML
MathML named characters everywhere
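
`<SVG VIEWBOX=…>` works because the shared tokenizer lowercases names and the tree builder then restores the mixed-case SVG spellings from a fixed table. The sketch shows three entries of that table; the function name and the dictionary-based attribute representation are assumptions of this example, and the full table in the spec is considerably longer.

```python
# Excerpt of the lowercase -> mixed-case fix-ups for SVG attributes.
SVG_ATTRIBUTE_FIXUPS = {
    "viewbox": "viewBox",
    "preserveaspectratio": "preserveAspectRatio",
    "gradientunits": "gradientUnits",
}


def adjust_svg_attributes(attrs):
    """Map attribute names as typed by the author to their DOM names."""
    adjusted = {}
    for name, value in attrs.items():
        lowered = name.lower()  # what the tokenizer does to every name
        adjusted[SVG_ATTRIBUTE_FIXUPS.get(lowered, lowered)] = value
    return adjusted
```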

XMLisms for SVG/MathML Only

DOM and CSS case-sensitive
<foo/> is empty
<![CDATA[…]]> works
SVG script and style tokenized as in XML

Gotchas

<circle fill=red/>
- <circle fill=green />
- <circle fill="green"/>
<foo/> in legacy browsers
Triggering break-out

Degrading Gracefully

Not <foo/>
- <foo></foo> instead
<text>Text to show</text>
<text><![CDATA[Text to hide]]></text>

Works Today

Firefox 4
IE9
Chrome

Coming Up

Opera (search for Ragnarök)
Safari (in the nightlies)

Thanks! Questions?