Lists in Attribute Values

Recently, I’ve been working on conformance checking for Web Apps 1.0 (a.k.a. HTML5). Since Web Forms 2.0 is a relatively stable part of HTML5, I am currently focusing on forms.

The Web Forms 2.0 spec is currently defined as an extension to HTML 4 and DOM. Therefore, even if the specification itself aims for precision, it is sometimes necessary to refer to less precise specifications.

accept-charset

The accept-charset attribute is not defined in Web Forms 2.0. Instead, you have to look it up in HTML 4.01. (At least for now. Hopefully, in the future the Web Apps 1.0 spec will be self-contained.) Here’s what HTML 4.01 says:

accept-charset = charset list [CI]

This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received.

The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

In my role as a conformance checker developer I need to scrutinize the specs very carefully so that I can be precise in my implementation in order to avoid endless arguments. (See Why specs matter for further discussion and technical terminology.) “Space- and/or comma-delimited list” may look reasonable on the surface, but what does it mean, exactly? Obviously, the list items can be delimited by either a single space or a single comma. That is reasonably clear. But are multiple spaces allowed? What about multiple commas? Do the spaces have to come first and the commas second? Are spaces and commas before and after the items allowed? At what point should one throw an Undecipherable Specification Error? Does all this even matter if the UA ignores list items that are empty strings?
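To make the ambiguity concrete, here is a minimal sketch of two readings of “space- and/or comma-delimited” that are both defensible from the prose. The function names `split_lenient` and `split_strict` are made up for illustration; neither reading comes from HTML 4.01 itself.

```python
import re


def split_lenient(value: str) -> list[str]:
    """Reading 1: any run of spaces and commas is one separator.

    Leading and trailing separators are tolerated, so no item is ever empty.
    """
    return [item for item in re.split(r"[ ,]+", value) if item]


def split_strict(value: str) -> list[str]:
    """Reading 2: every single space or comma is a separator.

    Adjacent separators therefore produce empty items.
    """
    return re.split(r"[ ,]", value)


value = "utf-8, iso-8859-1"
print(split_lenient(value))  # ['utf-8', 'iso-8859-1']
print(split_strict(value))   # ['utf-8', '', 'iso-8859-1']
```

Under the second reading, the very common “comma followed by a space” style yields an empty list item, and a conformance checker would have to decide whether that is an error.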

The frustrating part about this is that allowing the comma is useless. A “charset value” cannot be the empty string and cannot contain a space, a tab, a carriage return or a line feed. There is a convention in SGMLish and XMLish language design that when you have a list of such values in an attribute value, you separate the items with whitespace and allow whitespace before and after.

RELAX NG formalizes this:

split( s )

returns a sequence of strings one for each whitespace delimited token of s; each string in the returned sequence will be non-empty and will not contain any whitespace

and

a whitespace character is one of #x20, #x9, #xD or #xA
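The two quoted definitions translate almost directly into code. The following is a sketch in Python, not part of RELAX NG; it spells out the four whitespace characters explicitly because Python’s built-in `str.split()` with no arguments also splits on other Unicode whitespace.

```python
import re

# One or more characters that are NOT #x20, #x9, #xD or #xA.
_TOKEN = re.compile(r"[^\x20\x09\x0D\x0A]+")


def split(s: str) -> list[str]:
    """Return the whitespace-delimited tokens of s.

    Every returned string is non-empty and contains none of the four
    whitespace characters; leading and trailing whitespace is ignored.
    """
    return _TOKEN.findall(s)


print(split("  utf-8\tiso-8859-1\r\n"))  # ['utf-8', 'iso-8859-1']
```

An empty or all-whitespace value simply yields an empty list, so the question of whether at least one item is required can be answered separately from the tokenization.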

The good thing about adhering to a convention is that people can use tools. (You know, tools will save us and all that.) The people who actually read the specifications don’t need to write custom code, and the people who don’t read specifications have a greater chance of doing the right thing by calling code someone else wrote.

The main point I am trying to make here is this:

[Image of a nutshell] If you’re designing a language that has an XML 1.0 serialization and you are defining an attribute that takes a list of values, and those values cannot contain whitespace and cannot be the empty string, please use the convention as formalized by RELAX NG: use one or more whitespace characters (where whitespace characters are U+0020, U+0009, U+000D and U+000A) as the item separator and allow zero or more whitespace characters before and after the list.

inputmode

Just for comparison, let’s take a look at another attribute that is specified by reference to another specification. The inputmode attribute is defined as being exactly equivalent to the XForms attribute of the same name.

On the bright side, the token separation is conventional. On the less bright side, the spec is a bit imprecise.

How am I supposed to know which scripts count as bicameral? Where do I find normative text? Not in XForms. Not in UAX #24. Not in UTR #21. Not in section “4.2 Case—Normative” of Unicode 4.0. (Not clearly enough to cover math and whether georgian should be considered bicameral here, that is. And UTR #21 missed Deseret just like I did the first time round.) I figured (from the occurrence of “CAPITAL LETTER” in character names) that armenian, cyrillic, deseret, georgian, greek, latin and math are bicameral. Also, custom scripts and the user script have to be assumed to allow modifiers that apply to bicameral scripts.
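The “CAPITAL LETTER” heuristic can be reproduced roughly with Python’s `unicodedata` module. This is only a sketch of the heuristic, not a normative definition of bicameral scripts: the function name is made up, the result depends on the Unicode version bundled with the Python build, and the collected prefixes include qualifiers such as “FULLWIDTH LATIN” alongside plain script names.

```python
import sys
import unicodedata


def scripts_with_capital_letters() -> set[str]:
    """Collect the text preceding ' CAPITAL LETTER' in character names."""
    prefixes = set()
    for code_point in range(sys.maxunicode + 1):
        # Unassigned code points and surrogates have no name; use "".
        name = unicodedata.name(chr(code_point), "")
        prefix, marker, _rest = name.partition(" CAPITAL LETTER")
        if marker:
            prefixes.add(prefix)
    return prefixes


for prefix in sorted(scripts_with_capital_letters()):
    print(prefix)
```

A current Unicode database reports more prefixes than the scripts listed above, which is one more reason a spec should name the scripts it means instead of leaving this to be inferred.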

Are multiple modifiers allowed? The spec does not say, but I assume the modifiers dealing with case are mutually exclusive and that the modifiers dealing with prediction are mutually exclusive.
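The mutual-exclusion assumption could be checked along the following lines. This is a sketch under stated assumptions: the modifier token names are recalled from the XForms input mode appendix and should be verified against that spec, and the grouping into “case” and “prediction” is the interpretation described above, not something the spec states.

```python
import re

# Assumed token names; verify against the XForms spec before relying on them.
CASE_MODIFIERS = {"lowerCase", "upperCase", "titleCase", "startUpper"}
PREDICTION_MODIFIERS = {"predictOn", "predictOff"}

# Tokens are separated by #x20, #x9, #xD or #xA.
_TOKEN = re.compile(r"[^\x20\x09\x0D\x0A]+")


def modifier_conflicts(value: str) -> list[list[str]]:
    """Return each group of mutually exclusive modifiers used together."""
    tokens = set(_TOKEN.findall(value))
    conflicts = []
    for group in (CASE_MODIFIERS, PREDICTION_MODIFIERS):
        used = sorted(tokens & group)
        if len(used) > 1:
            conflicts.append(used)
    return conflicts


print(modifier_conflicts("latin upperCase"))            # []
print(modifier_conflicts("latin upperCase lowerCase"))  # [['lowerCase', 'upperCase']]
```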

Are IRIs allowed as custom scripts? The spec implies so.

Is at least one token required? The spec prose does not say. Let’s look at the schema. It says:

<xsd:attribute name="inputmode" type="xsd:string" use="optional"/>

It’s like reading DTDs all over again. And I thought XSD was all about data typing.

Thanks to XForms following the whitespace-separation convention, the syntax of the inputmode attribute (unlike accept-charset) was expressible in RELAX NG without resorting to a custom datatype.

Are They Useful?

What else do these two attributes have in common besides taking a list as the value? These are probably the two least useful attributes in the Web Forms 2.0 spec. One exists for backwards compatibility. The other exists in order to match feature bullet points with XForms.

An author shouldn’t want the user agent to submit form data in any encoding other than UTF-8. Also, as a user, I don’t want a form author to mess with my text input method in any non-obvious way. If the UA changes the input method, it makes more sense to do it based on the input type (e.g. change from Kanji to Latin for password fields and, on phones, change to digit input for number fields).