Thoughts on XML Entity Management in Mozilla

This document is not an official mozilla.org document. It has not been endorsed by mozilla.org.

The Current Situation

Currently, Mozilla only supports local external entities. They are used mostly for separating UI strings from XUL. However, the mechanism is also being used for MathML in a way that doesn’t quite conform to the W3C specs and may cause compatibility problems with other browsers or even with future versions of Mozilla.

There are two main problems.

Mnemonic Character References

Mnemonic character entities are specified in external DTDs. When the appropriate DTD hasn’t been loaded, the mnemonic character references don’t work. For XHTML this is mostly a non-issue. All the characters can be represented with numeric references or by encoding the document using UTF-8 (or another encoding that can represent the entire Basic Multilingual Plane of Unicode). However, some authors prefer the mnemonic character references.
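The difference can be reproduced outside Mozilla. The following is a minimal sketch using the expat binding that ships with Python as a stand-in for Mozilla’s parser; the helper name text_of is hypothetical. With no DTD loaded, the numeric reference parses, while the mnemonic reference is rejected as an undefined entity.

```python
from xml.parsers import expat

def text_of(document):
    """Parse `document` without loading any DTD and return its character data."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# A numeric character reference needs no declarations at all.
numeric = text_of("<p>a&#160;b</p>")

# A mnemonic reference fails when nothing declares the entity.
try:
    text_of("<p>a&nbsp;b</p>")
    mnemonic_error = None
except expat.ExpatError as err:
    mnemonic_error = expat.ErrorString(err.code)
```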

With MathML the situation is more complicated. Some mathematical characters aren’t part of the Basic Multilingual Plane, and Mozilla is currently unable to handle characters on higher planes. In order to deal with this, Mozilla maps some of the MathML mnemonic entities to Private Use Area (PUA) code points. This requires a private modified DTD.

The currently used doctype declaration is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd">. This doctype declaration has three problems. Firstly, the public identifier references a DTD that contains no MathML declarations. Secondly, the system identifier points to a different DTD than the one named by the public identifier. And thirdly, the system identifier is relative rather than an absolute URL.

If someone tries to access a document with the above-mentioned doctype using a browser that respects the public identifier, the MathML mnemonic character entities will not be available and the document will be invalid. If someone comes along with a browser that respects the system identifier, it is likely that the DTD cannot be found using the relative system identifier.
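To make the two identifiers concrete, here is a small sketch (again with Python’s expat binding, not Mozilla code) that extracts them from the doctype above and resolves the relative system identifier against the document’s own URL. The function name doctype_ids and the example.org URL are assumptions made for illustration.

```python
from urllib.parse import urljoin
from xml.parsers import expat

def doctype_ids(document, document_url):
    """Return (public_id, absolute_system_id) from the doctype declaration."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["public"] = public_id
        # A relative system identifier only means something relative to
        # the URL of the document that contains it.
        found["system"] = urljoin(document_url, system_id) if system_id else None

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found.get("public"), found.get("system")

DOC = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd">'
       '<html/>')
public_id, system_id = doctype_ids(DOC, "http://example.org/math/page.xhtml")
```

A browser that follows the system identifier would look for the DTD next to the document, at a URL where an author is unlikely to have placed Mozilla’s private DTD.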

IDs and the DOM

Without a DTD, the DOM implementation doesn’t know which attributes are considered id attributes and should work with getElementById(). This has been worked around by including the id declarations in an internal DTD subset.
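The workaround can be sketched as follows: a non-validating parser that reads the internal subset can collect the attributes declared with type ID and hand them to the DOM. This uses Python’s expat binding as a stand-in; the function name id_attributes and the sample document are hypothetical.

```python
from xml.parsers import expat

def id_attributes(document):
    """Collect (element, attribute) pairs that the DTD declares with type ID."""
    ids = set()

    def on_attlist(element, attribute, attr_type, default, required):
        if attr_type == "ID":
            ids.add((element, attribute))

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(document, True)
    return ids

DOC = """<!DOCTYPE math [
  <!ATTLIST mi id ID #IMPLIED>
  <!ATTLIST mi mathvariant CDATA #IMPLIED>
]>
<math><mi id="x">x</mi></math>"""

declared_ids = id_attributes(DOC)
```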

External Entity Management Schemes

If Mozilla is to deal with external entities, it needs better entity management. Here are some possible schemes.

Treating External Entities as Regular Network Objects

The first and most obvious way to deal with external entities is to load the external entities (with URLs as system identifiers) over the network like images and style sheets.

Pros

The external entities would always be up to date. No cheating.
Would work with future DTD revisions and privately developed DTDs.

Cons

This is not very efficient. The same core W3C DTDs are needed over and over again. Also, the DTDs can further reference DTD modules, each of which requires its own request. This option just doesn’t make sense from the performance point of view even if the regular network cache were used. Also, it would burden the W3C’s server unnecessarily.
A local copy of a public DTD would be fetched again even if the official copy was already in cache.
If HTTP Referer was sent, the W3C could find out about access patterns. Even without a referrer, sites could use their own DTD copies or other entities as Web bugs the way images, tiny iframes and style sheets can be used now.

Loading over Network and Caching for a Long Time

As a variation on the previous scheme, DTDs could be placed in a separate long-term cache.

Pros

Would require significantly less network traffic.
Would work with future DTD revisions and privately developed DTDs.

Cons

Rare privately developed DTDs could stay in the cache unnecessarily. It might be necessary to treat entities downloaded from www.w3.org differently from other entities.
A local copy of a public DTD would be fetched again even if the official copy was already in cache.
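A rough sketch of such a cache, keyed by system identifier, might look like the following. The class name DtdCache, the two lifetimes and the fetch callable are all assumptions made for illustration; nothing here reflects an existing Mozilla interface.

```python
import time
from urllib.parse import urlsplit

LONG_TTL = 90 * 24 * 3600   # assumed lifetime for entities from www.w3.org
SHORT_TTL = 24 * 3600       # assumed lifetime for all other entities

class DtdCache:
    """Long-term DTD cache keyed by absolute system identifier (URL)."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # callable: URL -> DTD text
        self._clock = clock
        self._entries = {}       # URL -> (expiry time, DTD text)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]
        text = self._fetch(url)
        # Entities from www.w3.org are kept longer than private ones.
        ttl = LONG_TTL if urlsplit(url).hostname == "www.w3.org" else SHORT_TTL
        self._entries[url] = (now + ttl, text)
        return text
```

Because the key is the URL, a site’s local copy of a public DTD is a different cache entry from the official copy, which is the second con listed above.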

Shipping Notable DTDs with Mozilla and Using the Public Identifiers

Mozilla could be shipped with a collection of DTDs. Then public identifiers could be mapped to the local copies.

Pros

No network overhead.
Mozilla could cheat with character references. The local DTD could map mnemonics to PUA chars. (Better than the current situation but a suboptimal solution overall.)

Cons

Less-known or future DTDs couldn’t be shipped. However, if the DTD list was extensible, a local administrator could add DTDs.
An author could use a bogus system identifier but the document would still appear to “work” in Mozilla. This is a potential interoperability problem. This would have to be addressed in documentation.
Wouldn’t work when only a system identifier is present.
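The mapping from public identifiers to local copies could be sketched like this with Python’s expat binding. The LOCAL_DTDS table, the public identifier “-//EXAMPLE//DTD Tiny//EN” and the one-entity DTD are made up for the example; a real table would point at DTD files shipped with the browser.

```python
from xml.parsers import expat

# Hypothetical catalog: public identifier -> text of the local DTD copy.
LOCAL_DTDS = {
    "-//EXAMPLE//DTD Tiny//EN": '<!ENTITY alpha "&#945;">',
}

def parse_with_local_dtds(document):
    """Return the document's character data, resolving DTDs by public id."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to report the external DTD subset to the handler below.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.CharacterDataHandler = chunks.append

    def resolve(context, base, system_id, public_id):
        dtd = LOCAL_DTDS.get(public_id)
        if dtd is None:
            return 1          # unknown DTD: skip it, load nothing
        # The system identifier is never consulted for known public ids.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    parser.ExternalEntityRefHandler = resolve
    parser.Parse(document, True)
    return "".join(chunks)

DOC = ('<!DOCTYPE p PUBLIC "-//EXAMPLE//DTD Tiny//EN" "bogus.dtd">'
       '<p>&alpha;</p>')
result = parse_with_local_dtds(DOC)
```

Note that the document parses even though “bogus.dtd” does not exist anywhere, which is exactly the interoperability concern listed among the cons.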

Long-Term Caching Using Public Identifiers

This is a variant of the other long-term caching scheme. In this version, the first copy of a DTD would be used when the same public identifier is encountered in another document even if the system identifiers differ.

Pros

Reduced network overhead.
Would be forward-compatible.

Cons

If applied to DTDs loaded from untrusted servers, a false DTD might take the place of a real DTD.
An author could use a bogus system identifier but the document would still appear to “work” in Mozilla. This is a potential interoperability problem. This would have to be addressed in documentation.
Wouldn’t work when only a system identifier is present.
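A minimal sketch of this variant, with the hypothetical class name PublicIdCache and a caller-supplied fetch function, shows both the saving and the first con: whichever copy arrives first is reused for every later document that names the same public identifier.

```python
class PublicIdCache:
    """The first DTD seen for a public identifier wins, whatever its URL."""

    def __init__(self, fetch):
        self._fetch = fetch            # callable: URL -> DTD text
        self._by_public_id = {}

    def get(self, public_id, system_id):
        if public_id is None:
            # Nothing to key on: the scheme does not apply.
            return self._fetch(system_id)
        if public_id not in self._by_public_id:
            self._by_public_id[public_id] = self._fetch(system_id)
        return self._by_public_id[public_id]

# Illustration of the spoofing risk with two made-up servers.
SERVED = {
    "http://untrusted.example/fake.dtd": "fake declarations",
    "http://www.w3.org/real.dtd": "real declarations",
}
cache = PublicIdCache(SERVED.__getitem__)
first = cache.get("-//EXAMPLE//DTD Demo//EN", "http://untrusted.example/fake.dtd")
second = cache.get("-//EXAMPLE//DTD Demo//EN", "http://www.w3.org/real.dtd")
```

Here the second lookup returns the untrusted copy even though the document pointed at the official URL.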

Performance

Parsing a DTD is a performance problem. It makes no sense to spend time parsing the DTD if the document doesn’t use any declarations made in the DTD. Also, the DTDs tend to be quite large in terms of file size. Usually they are even broken into separate files and the parser would have to spend time handling declarations whose only purpose is to control the inclusion of DTD modules.

Fortunately, the XML spec provides a solution to the first problem. If a file doesn’t depend on external declarations, it can be declared standalone="yes" in the XML declaration. Leaving the DTD unparsed for standalone documents is easy. A one-liner in the expat initialization code does it.
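Expat has a parameter-entity parsing mode that skips the external subset for standalone documents. Whether this is the exact one-liner meant above is an assumption, but the effect can be demonstrated through Python’s expat binding; the function name external_subset_requests is hypothetical.

```python
from xml.parsers import expat

def external_subset_requests(document):
    """Return the system identifiers expat asks the application to load."""
    requests = []
    parser = expat.ParserCreate()
    # The one-line setting: load external DTD parts unless standalone="yes".
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def on_external(context, base, system_id, public_id):
        requests.append(system_id)
        return 1   # report success; this sketch loads nothing

    parser.ExternalEntityRefHandler = on_external
    parser.Parse(document, True)
    return requests

DOCTYPE = '<!DOCTYPE p SYSTEM "p.dtd"><p/>'
standalone_doc = '<?xml version="1.0" standalone="yes"?>' + DOCTYPE
plain_doc = '<?xml version="1.0"?>' + DOCTYPE
```

For standalone_doc the handler is never called, so no DTD would be fetched or parsed; for plain_doc it is called once with “p.dtd”.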

The second problem is more difficult to handle. One approach could be making the private DTDs lighter. Mozilla doesn’t validate XML, so there are a lot of declarations in DTDs that are of no use to Mozilla. Mozilla’s local versions could include only the declarations that are of interest to a non-validating parser. These would include named character entities, id attributes and attribute defaults.

Even then the same DTDs would be parsed over and over again. Eventually, it would make sense to keep preparsed hash tables representing the core DTDs in RAM. However, it might not be realistic to expect such a change to make it into Mozilla 1.0.
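As a rough illustration of the preparsed-table idea, the sketch below parses the entity declarations of a light DTD once and serves later requests from memory. It assumes the DTD is simple enough to be wrapped as an internal subset (no conditional sections, no parameter entity references inside declarations); the function name entity_table and the sample DTD are hypothetical.

```python
from functools import lru_cache
from types import MappingProxyType
from xml.parsers import expat

@lru_cache(maxsize=None)
def entity_table(dtd_text):
    """Parse the general entity declarations once and cache the table."""
    table = {}

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation):
        if not is_parameter and value is not None:
            # In XML the first declaration of an entity is the binding one.
            table.setdefault(name, value)

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    # Assumption: the light DTD is valid as an internal subset.
    parser.Parse("<!DOCTYPE x [" + dtd_text + "]><x/>", True)
    # Read-only view, so callers cannot corrupt the shared cached table.
    return MappingProxyType(table)

LIGHT_DTD = '<!ENTITY alpha "&#945;"><!ENTITY nbsp "&#160;">'
table = entity_table(LIGHT_DTD)
again = entity_table(LIGHT_DTD)   # served from the in-memory cache
```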

Open Questions

Are there other worthwhile entity management schemes?
Do these schemes violate any official specs?
Are there other pros and cons?
Are non-validating parsers required to do something that Mozilla doesn’t do already?
Do these entity management schemes include things that only validating parsers are supposed to do and that non-validating parsers should not do?

Possible Implementation Strategy

Making Mozilla ignore external entities when the document has been declared standalone. (Done in my tree.)
Implementing the use of a local light copy (with Mozilla-specific PUA codes) for the public identifier “-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN” as a proof of concept.
Adding a hard-coded list of other local DTDs.
Making the local DTD list extensible.
Implementing regular caching for other cases.
Implementing long-term caching of DTDs.