Thoughts on XML Entity Management in Mozilla

This document is not an official mozilla.org document. It has not been endorsed by mozilla.org.

The Current Situation

Currently, Mozilla only supports local external entities. They are used mostly for separating UI strings from XUL. However, the mechanism is also being used for MathML in a way that doesn’t quite conform to the W3C specs and may cause compatibility problems with other browsers or even with future versions of Mozilla.

There are two main problems.

Mnemonic Character References

Mnemonic character entities are specified in external DTDs. When the appropriate DTD hasn’t been loaded, the mnemonic character references don’t work. For XHTML this is mostly a non-issue. All the characters can be represented with numeric references or by encoding the document using UTF-8 (or another encoding that can represent the entire Basic Multilingual Plane of Unicode). However, some authors prefer the mnemonic character references.

With MathML the situation is more complicated. Some math characters aren’t part of the Basic Multilingual Plane, and Mozilla is currently unable to handle characters on higher planes. In order to deal with this, Mozilla maps some of the MathML mnemonic entities to PUA (Private Use Area) code points. This requires a private modified DTD.

The currently used doctype declaration is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd">. This doctype declaration has three problems. Firstly, the public identifier references a DTD that contains no MathML declarations. Secondly, the system identifier points to a different DTD than the public identifier does. And thirdly, the system identifier is relative rather than an absolute URL.

If someone tries to access a document with the above-mentioned doctype using a browser that respects the public identifier, the MathML mnemonic character entities will not be available and the document will be invalid. If someone comes along with a browser that respects the system identifier, it is likely that the DTD cannot be found using the relative system identifier.

IDs and the DOM

Without a DTD, the DOM implementation doesn’t know which attributes are considered id attributes and should work with getElementById(). This has been worked around by including the id declarations in an internal DTD subset.
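
To illustrate the workaround, here is a minimal sketch using the expat bindings from Python's standard library (Mozilla itself uses expat's C API, and its DOM code is not shown here). The document, the element names and the build_id_index helper are made up for the example. The sketch only shows that an id declaration in the internal subset is enough for a non-validating parser to learn which attributes are IDs.

```python
import xml.parsers.expat as expat

# Illustrative document: the internal subset declares section/@id as type ID.
DOC = """<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ATTLIST section id ID #IMPLIED>
]>
<doc><section id="intro" title="Start"/><note id="n1"/></doc>"""


def build_id_index(document):
    """Return {id value: element name} for attributes declared as type ID."""
    id_attrs = set()  # (element name, attribute name) pairs declared as ID
    index = {}

    def on_attlist(elname, attname, att_type, default, required):
        # Called once per attribute declaration found in the DTD.
        if att_type == "ID":
            id_attrs.add((elname, attname))

    def on_start(name, attrs):
        # Index only attributes that the DTD declared as IDs.
        for attname, value in attrs.items():
            if (name, attname) in id_attrs:
                index[value] = name

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return index


print(build_id_index(DOC))
```

The note element also carries an attribute named id, but since no declaration marks it as an ID, it is left out of the index. That is the situation a DOM implementation is in when no DTD information is available at all.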

External Entity Management Schemes

If Mozilla is to deal with external entities, it needs better entity management. Here are some possible schemes.

Treating External Entities as Regular Network Objects

The first and most obvious way to deal with external entities is to load the external entities (with URLs as system identifiers) over the network like images and style sheets.

Pros
Cons

Loading over Network and Caching for a Long Time

As a variation to the previous scheme, DTDs could be placed in a separate long-term cache.

Pros
Cons

Shipping Notable DTDs with Mozilla and Using the Public Identifiers

Mozilla could be shipped with a collection of DTDs. Then public identifiers could be mapped to the local copies.
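
The mapping could look roughly like the following sketch, again written against Python's expat bindings rather than Mozilla's own code. The catalog dictionary, the two-entity "light" DTD, the placeholder system identifier and the parse_text helper are all assumptions made for illustration. The point is only that the external entity handler can answer from a local table keyed by public identifier and never touch the network.

```python
import xml.parsers.expat as expat

# Hypothetical "light" local DTD: only declarations a non-validating
# parser cares about (here, two character entities).
LIGHT_XHTML_MATHML_DTD = '<!ENTITY nbsp "&#160;">\n<!ENTITY alpha "&#945;">\n'

# Hypothetical catalog mapping public identifiers to bundled DTD text.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN": LIGHT_XHTML_MATHML_DTD,
}

# The system identifier is a placeholder; the sketch never fetches it.
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"\n'
    '  "http://example.org/placeholder.dtd">\n'
    "<html><body>a&nbsp;b &alpha;</body></html>"
)


def parse_text(document, catalog=LOCAL_DTDS):
    """Parse the document and return its character data as one string."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to report the external DTD subset to our handler.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def on_external_entity(context, base, system_id, public_id):
        dtd_text = catalog.get(public_id)
        if dtd_text is None:
            return 1  # unknown public identifier: skip the DTD
        # Feed the local copy to a sub-parser that shares the DTD state.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd_text, True)
        return 1

    parser.ExternalEntityRefHandler = on_external_entity
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


print(repr(parse_text(DOC)))
```

In this sketch the system identifier is ignored entirely whenever the public identifier is known, which is the behavior this scheme proposes.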

Pros
Cons

Long-Term Caching Using Public Identifiers

This is a variant of the other long-term caching scheme. In this version, the first copy of a DTD would be used when the same public identifier is encountered in another document even if the system identifiers differ.
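
As a rough sketch of the lookup rule only (persistence across sessions is not modeled), the cache could be keyed by public identifier alone. The class name, the fetch callback and the example identifiers below are hypothetical.

```python
class PublicIdDtdCache:
    """Return the first DTD copy seen for each public identifier."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable: system identifier -> DTD text
        self._by_public_id = {}    # public identifier -> cached DTD text

    def get(self, public_id, system_id):
        # The system identifier is used only on the first encounter.
        if public_id not in self._by_public_id:
            self._by_public_id[public_id] = self._fetch(system_id)
        return self._by_public_id[public_id]


# Stand-in for a network fetch; the URLs are placeholders.
fetched = []


def fake_fetch(system_id):
    fetched.append(system_id)
    return "<!-- DTD text from %s -->" % system_id


cache = PublicIdDtdCache(fake_fetch)
first = cache.get("-//EXAMPLE//DTD Demo//EN", "http://example.org/a/demo.dtd")
second = cache.get("-//EXAMPLE//DTD Demo//EN", "http://example.com/b/demo.dtd")
print(first == second, fetched)
```

The second lookup returns the copy fetched for the first document even though the system identifiers differ, so whichever site is visited first decides what every later document with that public identifier gets.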

Pros
Cons

Performance

Parsing a DTD is a performance problem. It makes no sense to spend time parsing the DTD if the document doesn’t use any declarations made in the DTD. Also, the DTDs tend to be quite large in terms of file size. Usually they are even broken into separate files, and the parser would have to spend time handling declarations whose only purpose is to control the inclusion of DTD modules.

Fortunately, the XML spec provides a solution to the first problem. If a file doesn’t depend on external declarations, it can be declared standalone="yes" in the XML declaration. Leaving the DTD unparsed for standalone documents is easy. A one-liner in the expat initialization code does it.
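
The text does not say which expat call is meant. A plausible candidate is expat's parameter entity parsing setting with the "unless standalone" option, and the sketch below assumes that. It uses Python's expat bindings, and the documents, the placeholder URL and the helper name are invented for the example.

```python
import xml.parsers.expat as expat

# Placeholder system identifier; nothing is ever fetched in this sketch.
DOCTYPE = '<!DOCTYPE doc SYSTEM "http://example.org/placeholder.dtd">\n'
STANDALONE_DOC = '<?xml version="1.0" standalone="yes"?>\n' + DOCTYPE + "<doc/>"
PLAIN_DOC = '<?xml version="1.0"?>\n' + DOCTYPE + "<doc/>"


def external_subset_requested(document):
    """Return True if expat asked the application for the external DTD."""
    requested = []

    def on_external_entity(context, base, system_id, public_id):
        requested.append(system_id)
        return 1  # report success without loading anything

    parser = expat.ParserCreate()
    # Assumed equivalent of the one-liner: do not ask for the external
    # subset when the document declares standalone="yes".
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return bool(requested)


print(external_subset_requested(STANDALONE_DOC))
print(external_subset_requested(PLAIN_DOC))
```

With this setting the handler is never invoked for the standalone document, so no time is spent locating or parsing its DTD.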

The second problem is more difficult to handle. One approach could be making the private DTDs lighter. Mozilla doesn’t validate XML, so there are a lot of declarations in DTDs that are of no use to Mozilla. Mozilla’s local versions could include only the declarations that are of interest to a non-validating parser. These would include named character entities, id attributes and attribute defaults.

Even then the same DTDs would be parsed over and over again. Eventually, it would make sense to keep preparsed hash tables representing the core DTDs in RAM. However, it might not be realistic to expect such a change to make it into Mozilla 1.0.
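
A minimal sketch of the preparsed-table idea, assuming a light DTD that contains only plain declarations (no conditional sections or parameter entity references), could look like this. It uses Python's expat bindings, and the class name, the parse counter and the sample DTD are hypothetical.

```python
import xml.parsers.expat as expat

# Hypothetical light DTD with two character entities.
LIGHT_DTD = '<!ENTITY nbsp "&#160;">\n<!ENTITY alpha "&#945;">\n'


class EntityTableCache:
    """Keep one parsed entity table per public identifier in memory."""

    def __init__(self):
        self.parse_count = 0   # how many times a DTD was actually parsed
        self._tables = {}      # public identifier -> {entity name: text}

    def _parse(self, dtd_text):
        self.parse_count += 1
        table = {}

        def on_entity_decl(name, is_parameter, value, base,
                           system_id, public_id, notation):
            # Keep internal general entities only.
            if not is_parameter and value is not None:
                table[name] = value

        parser = expat.ParserCreate()
        parser.EntityDeclHandler = on_entity_decl
        # Wrap the declarations in an internal subset so that expat can
        # parse them without a surrounding document.
        parser.Parse("<!DOCTYPE x [\n" + dtd_text + "\n]><x/>", True)
        return table

    def get(self, public_id, dtd_text):
        if public_id not in self._tables:
            self._tables[public_id] = self._parse(dtd_text)
        return self._tables[public_id]


cache = EntityTableCache()
pid = "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"
t1 = cache.get(pid, LIGHT_DTD)
t2 = cache.get(pid, LIGHT_DTD)
print(cache.parse_count, sorted(t1))
```

Only the first request parses the DTD; later documents with the same public identifier reuse the table. Feeding such a table back into the parser is a separate problem that this sketch does not address.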

Open Questions

Possible Implementation Strategy

  1. Making Mozilla ignore external entities when the document has been declared standalone. (Done in my tree.)
  2. Implementing the use of a local light copy (with Mozilla-specific PUA codes) for the public identifier “-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN” as a proof of concept.
  3. Adding a hard-coded list of other local DTDs.
  4. Making the local DTD list extensible.
  5. Implementing regular caching for other cases.
  6. Implementing long-term caching of DTDs.