HTML Parsing:
Finally Defined

Henri Sivonen

Vocabulary &
Serializations

<!DOCTYPE html>
<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>Foo</p>
  </body>
</html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>Foo</p>
  </body>
</html>
html
  head
    title
      “Hello World!”
  body
    h1
      “Hello World!”
    p
      “Foo”
Vocabulary      HTML
Serialization   HTML        XHTML
Media Type      text/html   a…n/xhtml+xml
Parser          HTML        XML
Tree API        DOM

Defining
HTML Parsing

Bad Old Days

“HTML 4 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language SGML (defined in [ISO8879]).”

“SGML systems conforming to [ISO8879] are expected to recognize a number of features that aren’t widely supported by HTML user agents. We recommend that authors avoid using all of these features.”

Source: http://www.w3.org/TR/html401/

Glorious HTML5

10.2.4.10 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
Parse error. Reconsume the EOF character in the data state.
Anything else
Append the current input character to the current tag token's tag name.

Source: http://www.whatwg.org/specs/web-apps/current-work/
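The tag name state quoted above maps almost line for line onto code. The following is a minimal sketch in Python, not the code of any shipping parser: the function name `run_tag_name_state`, the `Result` tuple, and the use of plain strings as state names are all assumptions made for illustration. It runs only this one state and reports which state the tokenizer would switch to next.

```python
from typing import NamedTuple


class Result(NamedTuple):
    tag_name: str      # tag name accumulated so far
    next_state: str    # state the tokenizer switches to
    pos: int           # index of the next unconsumed character
    parse_errors: int  # number of parse errors seen in this state
    emit: bool         # True if the current tag token should be emitted


def run_tag_name_state(text, pos=0, tag_name=""):
    """Run the tag name state on text[pos:] until it switches state."""
    name = list(tag_name)
    errors = 0
    while True:
        if pos >= len(text):
            # EOF: parse error; EOF is reconsumed in the data state,
            # so pos is not advanced.
            return Result("".join(name), "data", pos, errors + 1, False)
        ch = text[pos]
        pos += 1
        if ch in "\t\n\f ":
            return Result("".join(name), "before attribute name", pos, errors, False)
        if ch == "/":
            return Result("".join(name), "self-closing start tag", pos, errors, False)
        if ch == ">":
            return Result("".join(name), "data", pos, errors, True)
        if "A" <= ch <= "Z":
            # Lowercase by adding 0x0020 to the code point, as the spec says.
            name.append(chr(ord(ch) + 0x20))
        elif ch == "\x00":
            errors += 1
            name.append("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
        else:
            name.append(ch)
```

Calling `run_tag_name_state("DiV class=x>")` stops at the space with the tag name `div` and the next state `before attribute name`. A full tokenizer would be a loop over many such states.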

Parsing Steps

  1. Bytes
  2. (Encoding sniffing)
  3. Conversion into characters
  4. Tokenizer
  5. Tree builder
  6. DOM

Encoding Sniffing
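Steps 1 to 3 of the parsing steps can be sketched as follows. This is a deliberately reduced illustration in Python: it checks only for a byte order mark and otherwise uses a declared or fallback encoding. The function names, the `declared` parameter, and the `windows-1252` fallback are assumptions for this sketch; the real sniffing algorithm has more steps (for example, looking for a `meta` declaration in the first bytes).

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return None, 0


def bytes_to_characters(data, declared=None, fallback="windows-1252"):
    """Steps 1-3: bytes, (reduced) encoding sniffing, conversion into characters."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # Assumption of this sketch: no BOM means "use the declared
        # encoding if there is one, otherwise the fallback".
        encoding = declared or fallback
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return data[skip:].decode(encoding, errors="replace")
```

The resulting string is what the tokenizer in step 4 consumes.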

Tokenization

Tokenization cont’d

Tree Building History

<b><i></b></i>

IE
Non-tree
Opera
Secret augmentation
Gecko
Not deterministic
WebKit
Deterministic magic

Tree Building

Spec Text

11.2.5.4.4 The "in head" insertion mode

When the user agent is to apply the rules for the "in head" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character into the current node.

A comment token

Append a Comment node to the current node with the data attribute set to the data given in the comment token.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is one of: "base", "basefont", "bgsound", "command", "link"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

Anything else

Act as if an end tag token with the tag name "head" had been seen, and reprocess the current token.

Source: http://www.whatwg.org/specs/web-apps/current-work/
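The rules quoted above read like a dispatch table, and can be sketched as one. The Python below is an illustrative sketch only: `Node`, `Token`, `TreeBuilder`, and the string return values are hypothetical names, and it covers only the rules in this excerpt (the full "in head" mode also has rules for elements such as `meta`, `title`, and `script`). Popping the `head` element and switching to "after head" is what handling an end tag named "head" amounts to in the spec.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


@dataclass
class Token:
    kind: str                   # "character", "comment", "doctype", "start", "end"
    data: str = ""              # the character, the comment text, or the tag name
    self_closing: bool = False


WHITESPACE = {"\t", "\n", "\f", "\r", " "}
VOID_HEAD_TAGS = {"base", "basefont", "bgsound", "command", "link"}


class TreeBuilder:
    def __init__(self):
        self.html = Node("html")
        self.head = Node("head")
        self.html.children.append(self.head)
        self.stack = [self.html, self.head]  # stack of open elements
        self.mode = "in head"
        self.errors = 0

    def in_head(self, token):
        current = self.stack[-1]
        if token.kind == "character" and token.data in WHITESPACE:
            current.children.append(token.data)
        elif token.kind == "comment":
            current.children.append(Node("#comment", [token.data]))
        elif token.kind == "doctype":
            self.errors += 1  # parse error; ignore the token
        elif token.kind == "start" and token.data == "html":
            return "reprocess in body"  # handled by the "in body" rules
        elif token.kind == "start" and token.data in VOID_HEAD_TAGS:
            # Insert the element and immediately pop it: it is appended
            # to the current node but never stays on the stack.
            current.children.append(Node(token.data))
        else:
            # Act as if </head> had been seen, then reprocess the token.
            self.stack.pop()
            self.mode = "after head"
            return "reprocess"
        return "done"
```

A caller would loop while the return value asks for reprocessing, dispatching on `self.mode` each time.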

What Do Authors Need to Do Now?

Nothing

…As Long as Your Site Already Worked Cross-Browser

Gotchas

Beware of WebKit monoculture on mobile

Implementations You Can Use

Python Ruby Java JavaScript

The Only Right Way to Sanitize against XSS

Regular Expressions
Are Not the Right Way

“Now you have two problems”

The Right Way

  1. Parse using an HTML parser
  2. Drop script and style content
  3. Drop elements and attributes not on your whitelist
  4. Serialize using an HTML serializer
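The four steps above can be sketched as follows. One loud caveat: to stay self-contained, this sketch uses `html.parser.HTMLParser` from the Python standard library as a stand-in, and that class is an event-based parser that does not implement the HTML5 parsing algorithm or build a tree. A real sanitizer should use a conforming HTML5 parser and serializer instead. The whitelist contents and the names `Sanitizer` and `sanitize` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> allowed attribute names.
# URL-bearing attributes such as href are left out on purpose, because
# they would also need a check of the URL scheme.
ALLOWED = {"p": {"title"}, "b": set(), "i": set(), "em": set(), "strong": set()}
DROP_CONTENT = {"script", "style"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # step 3: drop elements that are not whitelisted
        parts = [tag]
        for key, value in attrs:
            if key in ALLOWED[tag]:  # step 3: drop unlisted attributes
                parts.append('%s="%s"' % (key, escape(value or "", quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self.skip_depth:  # step 2: drop script and style content
            self.out.append(escape(data, quote=False))  # step 4: re-escape text


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)   # step 1: parse
    parser.close()        # flush any buffered text
    return "".join(parser.out)
```

Because the output is re-serialized from parser events rather than patched in the input string, nothing reaches the output unless the sanitizer explicitly wrote it, which is the point of steps 1 and 4.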

No Mystery

Speculative Parsing

Here be product-specific stuff

Scripts
Block the Parser

<script src=foo.js></script>
<img src=photo.jpg>
<script src=bar.js></script>

document.write()

Parsing Steps with
Inconvenient Truth

  1. Bytes
  2. (Encoding sniffing)
  3. Conversion into characters
  4. Tokenizer ← document.write()
  5. Tree builder → Script execution ⬏
  6. DOM

Old Solution:
Speculative Pre-Scan
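The idea of a speculative pre-scan can be sketched in a few lines: while the real parser is blocked on a script, a lightweight scanner runs over the not-yet-parsed source and collects URLs that could be fetched early. The Python below is an illustration only, not any browser's implementation. The names `Prescanner` and `speculative_urls` are hypothetical, and limiting the scan to `src` on `script` and `img` is an assumption taken from the example under "Scripts Block the Parser".

```python
from html.parser import HTMLParser


class Prescanner(HTMLParser):
    """Collect resource URLs without building a tree or running scripts."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "img"):
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def speculative_urls(rest_of_source):
    """Return URLs found in the source that follows the blocking script."""
    scanner = Prescanner()
    scanner.feed(rest_of_source)
    scanner.close()
    return scanner.urls
```

The scan is speculative because a blocking script may call document.write() and change what the parser sees next, in which case the early fetches may turn out to be unnecessary.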

Off-the-Main-Thread Parsing

document.write() Tail Prescan

document.write("<script src=a.js></script>"
             + "<script src=b.js></script>");
document.write("<script src=c.js></script>");

No Need to
Worry about It!

One New Feature

MathML and SVG

The solution for quadratic equations is $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.

Warning: Remember that ± means that there are two solutions!

<p>The solution for quadratic equations is  
<math>
  <!-- ... -->
  <mfrac>
    <mrow>
      <mo>&minus;</mo>
  <!-- ... -->
</math>.</p>
<p><svg viewBox='5 9 90 86'>
  <path d='M 10,90 L 90,90 L 50,14 Z'/>
  <line x1=50 x2=50 y1=45 y2=75 />
</svg><b>Warning:</b> Remember that &PlusMinus; 
means that there are two solutions!</p>

SVG or Canvas?

Copy and Paste

Just Works
(ignoring degradation in old browsers)

Namespaces

Tokenizer Reuse

XMLisms for SVG/MathML Only

Gotchas

Degrading Gracefully

Works Today

Coming Up

Thanks! Questions?