RELAX NG Datatype Library for HTML5 Datatypes

Working Draft — 28 July 2008

Latest version:
http://hsivonen.iki.fi/html5-datatypes/
This version:
http://hsivonen.iki.fi/html5-datatypes/2008-07-28
Previous versions:
http://hsivonen.iki.fi/html5-datatypes/2006-04-27
http://hsivonen.iki.fi/html5-datatypes/2006-04-10
Editor:
Henri Sivonen, hsivonen@iki.fi

Abstract

This specification defines a RELAX NG datatype library that allows precise attribute datatyping in RELAX NG schemas for (X)HTML5.

Status of this document

This is a work in progress! In its current form, this document is intended to provide a way for the author to organize and communicate his thoughts. Even though this document is intended to develop into an implementable specification, you should not implement this draft spec. This spec has not been endorsed by anyone.

Introduction

RELAX NG does not provide a built-in means for constraining the lexical space of attribute values (or the text content of elements) beyond enumerating permissible string literals (with or without whitespace trimming). However, RELAX NG provides extensibility via datatype libraries. RELAX NG validators are expected to provide an API for plugging in implementations of datatype libraries. This way, the conformance to a datatype specification can be checked using a Turing-complete programming language.

Typically RELAX NG validators have a built-in implementation of the XSD datatype library. The XSD library provides the datatypes from W3C XML Schemas for use in RELAX NG schemas. Most notably, the XSD datatype library provides regular expressions for constraining the lexical space of a datatype to a regular language.

The XSD datatype library is not adequate for developing accurate RELAX NG schemas for (X)HTML5. Hence, the library described in this specification is needed.

General Requirements

The datatypes defined herein do not check that the value contains only XML 1.0 characters. That task is left for another layer of software.

The ID-type of the datatypes of this datatype library is null.

Except for the string type, checking for value equality is not needed for these datatypes in order to be able to write RELAX NG schemas for (X)HTML5. However, in order for implementations of this datatype library to behave consistently under equality tests, the datatypes of this datatype library shall implement the equality test as the strict code point for code point string equality test (except for the string type).

The datatypes of this datatype library are independent of the namespace mapping context.

Whitespace characters are U+0020, U+0009, U+000D and U+000A. If this datatype library is used with the text/html serialization of HTML5, form feed should be mapped to a space before exposing a value to this library.
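As a minimal sketch of the whitespace definition above, the following Python fragment lists the four whitespace characters and maps form feed to a space for values coming from text/html. The names WHITESPACE, is_whitespace and normalize_for_text_html are hypothetical and are not defined by this specification.

```python
# The four whitespace characters named by this specification.
WHITESPACE = frozenset("\u0020\u0009\u000D\u000A")


def is_whitespace(ch: str) -> bool:
    """Return True if the single character ch is whitespace per this spec."""
    return ch in WHITESPACE


def normalize_for_text_html(value: str) -> str:
    """Map form feed (U+000C) to a space before the value reaches the library.

    Hypothetical helper: only relevant when the value comes from the
    text/html serialization.
    """
    return value.replace("\u000C", "\u0020")
```

Note that form feed itself is not in the whitespace set; it only becomes whitespace after the mapping step.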

This specification states which values each datatype shall accept. The datatypes must reject values that they are not defined to accept.

In addition to matching the lexical format, an acceptable value for the date datatypes must be a valid date according to the proleptic Gregorian calendar. For example, 2006-02-29 is not a valid value for date, because 2006 is not a leap year. On the other hand, 1582-10-07 and 1752-09-07 must be treated as valid dates.
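The calendar check could be sketched as follows. This is an illustration only: it assumes a YYYY-MM-DD lexical form with a four-digit year, which this specification does not itself define, and it relies on Python's datetime.date, which uses the proleptic Gregorian calendar but only supports years 1 through 9999. The function name is_valid_date is hypothetical.

```python
import re
from datetime import date

# Assumed lexical form (not defined here): four-digit year, month, day.
# [0-9] is used instead of \d so that non-ASCII digits are not matched.
_DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_valid_date(value: str) -> bool:
    """Check the lexical form, then the proleptic Gregorian calendar."""
    m = _DATE_RE.fullmatch(value)
    if m is None:
        return False
    year, month, day = (int(g) for g in m.groups())
    try:
        # datetime.date extends the Gregorian calendar backwards, so dates
        # skipped by the historical calendar switches are still valid.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

With this sketch, is_valid_date("2006-02-29") is False, while is_valid_date("1582-10-07") and is_valid_date("1752-09-07") are True.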

Leap seconds are not allowed in times.

The Datatypes

browsing-context-or-keyword

This datatype shall accept strings that constitute a valid browsing context name or keyword in HTML5.

browsing-context

This datatype shall accept strings that constitute a valid browsing context name in HTML5.

charset

This datatype shall accept strings that contain only characters allowed according to the Naming Requirements of RFC 2978.

Should this refer to the IANA charset registry instead? Or should this be an explicit list but not the IANA list?

charset-list

Not done.

circle

This datatype shall accept strings that are valid values for the coords attribute in the circle state in HTML5.

date-or-time-content

This datatype shall accept strings that constitute a date or time string in content in HTML5.

date-or-time

This datatype shall accept strings that constitute a date or time string in attributes in HTML5.

date

This datatype shall accept strings that conform to the format specified for date inputs in Web Forms 2.0.

This datatype must not accept the empty string.

datetime

This datatype shall accept strings that conform to the format specified for datetime inputs in Web Forms 2.0.

This datatype must not accept the empty string.

datetime-local

This datatype shall accept strings that conform to the format specified for datetime-local inputs in Web Forms 2.0.

This datatype must not accept the empty string.

datetime-tz

This datatype shall accept strings that conform to the format specified for the datetime attribute of the ins and del elements in HTML5.

If the time zone designator is not “Z”, the absolute value of the time zone designator must not exceed 12 hours.

This datatype must not accept the empty string.

Note that allowing a numeric time zone designator is not the only difference with datetime. This type requires seconds to be explicitly present.
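The 12-hour limit on the time zone designator could be checked along these lines. This is a sketch under the assumption that a numeric designator has the form +HH:MM or -HH:MM; the exact lexical form is defined by HTML5, not here. The function name is_acceptable_tz_designator is hypothetical.

```python
import re

# Assumed form of a numeric designator: sign, two-digit hours, colon,
# two-digit minutes.
_OFFSET_RE = re.compile(r"[+-]([0-9]{2}):([0-9]{2})")


def is_acceptable_tz_designator(tz: str) -> bool:
    """Accept "Z" or a numeric offset whose magnitude is at most 12 hours."""
    if tz == "Z":
        return True
    m = _OFFSET_RE.fullmatch(tz)
    if m is None:
        return False
    hours, minutes = int(m.group(1)), int(m.group(2))
    if minutes > 59:
        return False
    # The sign is irrelevant because only the absolute value is limited.
    return hours * 60 + minutes <= 12 * 60
```

Under this reading, "+12:00" and "-12:00" are accepted, while "+12:30" and "+13:00" are rejected.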

float

This datatype shall accept strings that constitute a valid floating point number in HTML5.

float-non-negative

This datatype shall accept strings that constitute a valid floating point number in HTML5 and whose parsed value is not negative (zero allowed).

float-positive

This datatype shall accept strings that constitute a valid floating point number in HTML5 and whose parsed value is positive (zero not allowed).

float-exp

This datatype shall accept strings that conform to the format specified for number inputs in Web Forms 2.0.

This datatype must not accept the empty string.

float-exp-positive

This datatype shall accept strings that conform to the format specified for number inputs in Web Forms 2.0 and whose value parses to a positive number (zero not allowed).

This datatype must not accept the empty string.

hash-name

This datatype shall accept strings that have U+0023 NUMBER SIGN (#) as the first character.

This datatype must not accept the empty string.
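A literal reading of the hash-name definition is small enough to sketch in one function. The name is_hash_name is hypothetical; note that under this reading a lone "#" is accepted, because the text above only constrains the first character.

```python
def is_hash_name(value: str) -> bool:
    """Accept strings whose first character is U+0023 NUMBER SIGN.

    The empty string has no first character, so it is rejected.
    """
    return value.startswith("\u0023")
```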

ID

This datatype shall accept any string that consists of one or more characters and does not contain any whitespace characters.

IDREF

This datatype shall accept any string that consists of one or more characters and does not contain any whitespace characters.

IDREFS

This datatype shall accept any string that consists of one or more characters and contains at least one character that is not a whitespace character.
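The ID, IDREF and IDREFS definitions can be sketched directly from the whitespace definition in the General Requirements. The function names below are hypothetical. Because ID and IDREF have the same definition, one function serves both.

```python
# Whitespace per this specification: space, tab, carriage return, line feed.
WHITESPACE = frozenset("\u0020\u0009\u000D\u000A")


def is_id(value: str) -> bool:
    """One or more characters, none of which is whitespace."""
    return len(value) >= 1 and not any(ch in WHITESPACE for ch in value)


# IDREF is defined identically to ID.
is_idref = is_id


def is_idrefs(value: str) -> bool:
    """At least one character that is not whitespace.

    A string with such a character is necessarily non-empty, so the
    "one or more characters" condition needs no separate check.
    """
    return any(ch not in WHITESPACE for ch in value)
```

For instance, is_id("a b") is False, whereas is_idrefs("a b") is True and is_idrefs("  ") is False.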

integer

This datatype shall accept strings that constitute a valid integer in HTML5.

integer-non-negative

This datatype shall accept strings that constitute a valid integer in HTML5 and whose parsed value is not negative (zero allowed).

integer-positive

This datatype shall accept strings that constitute a valid integer in HTML5 and whose parsed value is positive (zero not allowed).
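The three integer datatypes differ only in the constraint on the parsed value, which the sketch below makes explicit. It assumes that a valid integer in HTML5 is an optional U+002D HYPHEN-MINUS followed by one or more ASCII digits; that lexical form comes from HTML5 and is an assumption here. The function names are hypothetical.

```python
import re

# Assumed lexical form: optional "-" followed by one or more ASCII digits.
_INTEGER_RE = re.compile(r"-?[0-9]+")


def is_integer(value: str) -> bool:
    return _INTEGER_RE.fullmatch(value) is not None


def is_integer_non_negative(value: str) -> bool:
    """Parsed value must be zero or greater."""
    return is_integer(value) and int(value) >= 0


def is_integer_positive(value: str) -> bool:
    """Parsed value must be strictly greater than zero."""
    return is_integer(value) and int(value) > 0
```

Since the constraint applies to the parsed value rather than to the lexical form, this sketch accepts "-0" as non-negative.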

iri

Need to turn these into charset-sensitive URLs.

This datatype shall accept any RFC 3987 IRI subject to constraints given below.

If the literal violates a “SHOULD”, it must be rejected. If the literal violates security-sensitive RFC language, it must be rejected. If the literal violates DNS-related constraints, it must be rejected.

Scheme-specific knowledge must be used for the following IRI schemes (as augmented by IDNA):

Scheme   Spec
http     RFC 2616
https    RFC 2818
ftp      RFC 1738
mailto   RFC 2368
file     RFC 1738
data     RFC 2397

Scheme-specific knowledge must not be used for other IRI schemes.

If the literal cannot be converted into a URI, the literal must be rejected. (For example, if scheme-specific knowledge tells which part is a host name and it cannot be converted to a conforming Punycode DNS name.)

iri-ref

Need to turn these into charset-sensitive URLs.

This datatype shall accept all the values that the iri datatype is defined to accept and, additionally, relative IRIs. However, relative IRIs with a scheme that the iri datatype is defined to have knowledge about must be rejected (e.g. http:/foo).

language

This datatype shall accept strings that are conforming RFC 3066bis language tags. When a subtag value is not reserved for private use, this datatype shall only accept values that were registered at the time the implementation of this datatype was developed.

When the registry says that a language has a default (“suppressed”) script, this datatype must not accept the version that lists the default script explicitly. For example, “fi-Latn” must be rejected.

Note that the allowed ALPHA letters are A–Z and a–z, so U+0130 and U+0131 must not be accepted as case-insensitive versions of i and I. Likewise, “” is not a conforming language tag for Ossetian.

This datatype must not accept the empty string.

Since registered language and country codes change over time, implementations should document when their internal snapshot of registered language and country codes was taken.

The IANA language subtag registry is not Free as in Free Software.

media-query

Media Queries have changed lately. This datatype needs to be reviewed against the new MQ spec.

meta-charset

This datatype shall accept strings that, compared ASCII case-insensitively, consist of the string “text/html;”, followed by any number of whitespace characters, followed by the string “charset=” and finally followed by a string accepted by the charset datatype.
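The meta-charset structure maps naturally onto a regular expression, sketched below. The character class for the charset part follows one reading of the mime-charset-chars production in RFC 2978 and should be treated as an assumption; the sketch also assumes that "any number" of whitespace characters includes zero. The name is_meta_charset is hypothetical.

```python
import re

# Assumed charset characters, after mime-charset-chars in RFC 2978.
_CHARSET_CHARS = r"[A-Za-z0-9!#$%&'+\-^_`{}~]"

# re.ASCII keeps the case-insensitive comparison ASCII-only, so that
# characters such as U+017F or U+212A do not match "s" or "k".
_META_CHARSET_RE = re.compile(
    r"text/html;[ \t\r\n]*charset=" + _CHARSET_CHARS + r"+",
    re.IGNORECASE | re.ASCII,
)


def is_meta_charset(value: str) -> bool:
    """Match "text/html;", optional whitespace, "charset=", charset name."""
    return _META_CHARSET_RE.fullmatch(value) is not None
```

Without the re.ASCII flag, Python's case-insensitive matching would apply Unicode case folding, which is broader than the ASCII case-insensitivity required here.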

mime-type

This datatype shall accept strings that conform to the syntax of the value of the MIME Content-Type header, except that LWS is allowed only around the semicolon and after the whole value.

mime-type-list

The accept attribute on input type=file. This is still really buggy.

month

This datatype shall accept strings that conform to the format specified for month inputs in Web Forms 2.0.

This datatype must not accept the empty string.

pattern

This datatype shall accept the strings that are allowed as the value of the Web Forms 2.0 pattern attribute.

polyline

This datatype shall accept strings that are valid values for the coords attribute in the polygon state in HTML5.

ratio

This datatype shall accept the strings that do not cause the steps for finding one or two numbers of a ratio in a string to return an error.

rectangle

This datatype shall accept strings that are valid values for the coords attribute in the rectangle state in HTML5.

refresh

This datatype shall accept strings that are permitted in the content attribute of the meta element when the element is in the refresh state.

string

This datatype shall accept all strings.

The equality comparisons for this datatype must be code point for code point, except the ASCII letters A–Z must be treated as equal to the ASCII letters a–z.
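The equality rule for the string datatype can be sketched as a fold of the 26 ASCII uppercase letters followed by a code point comparison. The names ascii_fold and string_equal are hypothetical. Python's built-in str.lower() is deliberately avoided because it applies Unicode case mapping, which is broader than the rule above.

```python
def ascii_fold(value: str) -> str:
    """Map A-Z to a-z and leave every other code point untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in value
    )


def string_equal(a: str, b: str) -> bool:
    """Code point for code point equality after ASCII-only case folding."""
    return ascii_fold(a) == ascii_fold(b)
```

Under this sketch, "Text" equals "tEXT", but U+0130 does not equal "i" and U+212A does not equal "k".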

time

This datatype shall accept strings that conform to the format specified for time inputs in Web Forms 2.0.

This datatype must not accept the empty string.

week

This datatype shall accept strings that conform to the format specified for week inputs in Web Forms 2.0.

This datatype must not accept the empty string.

xml-name

This datatype shall accept the strings that match the Name production in XML 1.0 4th edition.