Charmod Checking

Web Forms 2.0 requires documents to conform to Charmod. The current Web Applications 1.0 draft does not mention Charmod, but since (X)HTML5 includes both Web Applications 1.0 and Web Forms 2.0, my working assumption is that (X)HTML5 documents are required to conform to Charmod.

It turns out that the best opportunity for checking whether a document conforms to Charmod is in the parser. Hence, I added the checks to my special-purpose HTML parser and to HS Ælfred2—my fork of GNU Ælfred2.

Charmod says:

NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: “This word, or the adjective ‘RECOMMENDED’, mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.”

Further, Charmod says: “A specification conforms to this document if it […] documents the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED […]”. What I have is an implementation rather than a specification, but I’m documenting my decisions not to enforce some SHOULDs anyway.

Here’s how I have addressed the requirements of Charmod that apply to content (marked as [C] in Charmod). Disclaimer: The implementation decisions I have taken with prototype software are not endorsed by the WHAT WG or anyone else.

C001	Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. This requirement is not machine-checkable and, hence, is not enforced by the software.
C002	Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. This requirement is not machine-checkable and, hence, is not enforced by the software.
C003	Protocols, data formats and APIs MUST store, interchange or process text data in logical order. HTML5 as a data format uses logical order. It is not practical to try to figure out in software if the author is trying to subvert the nature of the format on this point. Currently, the software doesn’t enforce this at all. However, it might be useful to catch encoding labels that are used for visual Hebrew or Arabic.
C013	Textual data objects defined by protocol or format specifications MUST be in a single character encoding. A single character encoding decoder is instantiated per HTTP resource. Encoding violations are treated as fatal. However, some mixed encodings are not caught by this and need human judgment. For example, software can’t tell if ISO-8859-1 and ISO-8859-2 bytes are mixed in one HTTP resource.
C022	Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement. An error is reported.
C023	If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed. An error is reported.
C049	The character encoding of content SHOULD be chosen so that it maximizes the opportunity to directly represent characters (i.e. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients. UTF-8 maximizes the opportunity to directly represent characters. A warning is issued if the document uses an encoding that is not supported “everywhere”. For XHTML5 the non-obscure encodings are US-ASCII, ISO-8859-1, UTF-8 and UTF-16. For HTML5, the non-obscure encodings are currently the intersection of IANA-registered encodings supported by Sun JDK 1.4.2_8 and Python 2.4.3. (The service supports a wider set of encodings.) The range of characters actually used in the document is not analyzed, because I think it wouldn’t be a useful way to use my time considering that using UTF-8 always satisfies this requirement.
C034	If facilities are offered for identifying character encoding, content MUST make use of them; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement. An error is reported if an HTML5 document does not have an explicit character encoding declaration (either internal or external).
C024	Content and software that label text data MUST use one of the names required by the appropriate specification (e.g. the XML specification when editing XML text) and SHOULD use the MIME preferred name of a character encoding to label data in that character encoding. An error is reported if an encoding label is not the MIME preferred name.
C025	An IANA-registered `charset` name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name. Encoding violations are treated as fatal. However, this doesn’t catch cases where the document byte sequence is legal in the declared encoding. For example, ISO-8859-2 labeled as ISO-8859-1 is not conclusively machine-detectable.
C073	Publicly interchanged content SHOULD NOT use codepoints in the private use area. Charmod does allow the use of the private use area for scripts that have not yet been encoded. Since human judgment is needed, the software only emits a warning. Moreover, C040 forbids disallowing the use of the PUA.
C076	Content MUST NOT use a code point for any purpose other than that defined by its coded character set. This requirement is not machine-checkable and, hence, is not enforced by the software.
C047	Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear. This requirement is not enforced, not even as a warning. Using the five pre-defined entities in XML, using the HTML5 entities from the specification or using numeric character references is harmless when it comes to the parsed document tree. Enforcing this requirement would mean proclaiming a prevalent authoring practice non-conforming on the grounds of the aesthetics of view source. Moreover, Charmod doesn’t give a solid machine-checkable definition for characters whose visual representation is unclear.
C048	Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both. This requirement is not enforced, not even as a warning. Using the five pre-defined entities in XML, using the HTML5 entities from the specification or using numeric character references is harmless when it comes to the parsed document tree. Enforcing this requirement would mean proclaiming a prevalent authoring practice non-conforming on the grounds of the aesthetics of view source.
C054	Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string. This requirement is not machine-checkable to the extent it might apply to the (X)HTML5 layer and, hence, is not enforced by the software.
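
To make the encoding label checks for C022, C023 and C024 more concrete, here is a rough Python sketch. It is not the checker's actual code. The PREFERRED table is a tiny hand-written excerpt of alias-to-preferred-name pairs standing in for the full IANA registry, the function name check_label is made up for illustration, and treating labels as case-insensitive is an assumption of the sketch.

```python
# Hypothetical excerpt of the IANA registry: lowercase alias -> MIME preferred name.
# A real checker would need the complete registry.
PREFERRED = {
    "utf-8": "UTF-8",
    "us-ascii": "US-ASCII",
    "ascii": "US-ASCII",
    "iso-8859-1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
}


def check_label(label):
    """Return the list of Charmod requirement IDs that the label violates."""
    key = label.strip().lower()  # assumption: compare labels case-insensitively
    preferred = PREFERRED.get(key)
    if preferred is None:
        # Not in the (excerpted) registry: C022 applies.
        violations = ["C022"]
        # Unregistered names must at least start with "x-": C023.
        if not key.startswith("x-"):
            violations.append("C023")
        return violations
    if key != preferred.lower():
        # Registered alias, but not the MIME preferred name: C024.
        return ["C024"]
    return []
```

With this table, check_label("latin1") reports C024, and a made-up name without the 'x-' prefix reports both C022 and C023.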
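
The limits noted under C013 and C025 can be illustrated with a short Python sketch that uses Python's built-in codecs rather than the decoders in the actual parsers. The helper names decode_strict and is_fatal are made up for this example. Strict decoding catches byte sequences that are malformed in the declared encoding, but bytes written as ISO-8859-2 decode without complaint as ISO-8859-1.

```python
def decode_strict(data, label):
    """Decode a whole resource with a single decoder; any violation raises."""
    # errors="strict" raises UnicodeDecodeError instead of substituting U+FFFD.
    return data.decode(label, errors="strict")


def is_fatal(data, label):
    """Return True if the bytes are not legal in the declared encoding."""
    try:
        decode_strict(data, label)
    except UnicodeDecodeError:
        return True
    return False


# A lone 0xE9 byte (ISO-8859-1 "é") is malformed as UTF-8, so this is caught.
mislabeled_utf8 = is_fatal(b"caf\xe9", "utf-8")

# "Łódź" encoded as ISO-8859-2 is also a legal ISO-8859-1 byte sequence,
# so the wrong label is not detected; the text merely comes out wrong.
polish = "Łódź".encode("iso-8859-2")
undetected = not is_fatal(polish, "iso-8859-1")
garbled = decode_strict(polish, "iso-8859-1")
```

Here mislabeled_utf8 and undetected both end up True, and garbled is a different string from the original text, which is why human judgment is still needed for single-byte encodings.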
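
The C073 warning amounts to scanning the decoded text for private use code points. A minimal Python sketch of that idea follows. It relies on the Unicode general category "Co" from the standard unicodedata module, and the function name find_private_use is invented for this example, so it should not be read as the checker's own implementation.

```python
import unicodedata


def find_private_use(text):
    """Return (index, code point) pairs for private use characters in text."""
    hits = []
    for index, ch in enumerate(text):
        # General category "Co" covers the BMP private use area and
        # the two supplementary private use planes.
        if unicodedata.category(ch) == "Co":
            hits.append((index, "U+%04X" % ord(ch)))
    return hits
```

A caller would emit one warning per hit rather than an error, since whether a given use is covered by a private agreement cannot be decided by software.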

In the spirit of perpetual beta, the new code is enabled for all (X)HTML presets in the generic UI. Please let me know if it doesn’t work as described.

Cross-posted to the WHAT WG blog. Comments enabled there.