Lowering memory requirements by replacing Schematron

For a long time, I’ve said that the Schematron schema in the HTML5 facet of Validator.nu was merely a rapid prototype that should be replaced with custom Java code. I finally got around to making that change.

People who have followed things on IRC have probably noticed that most performance problems with Validator.nu were related to Schematron. When the Schematron validation was backed by Xalan, it turned out that the runtime stack became excessively deep and sometimes reasonable input caused validation to take excessive time. These issues were remedied by switching to Saxon 9.

Still, with Saxon there were two problems: the validator ran out of heap space during peaks in concurrent use, and the Schematron messages fired later than logically possible, which made their order wrong compared to all other messages.

Now, loading http://s.validator.nu/html5/assertions.sch doesn’t really instantiate a Schematron validator. Instead, it instantiates custom Java code that detects the same conditions. The legacy XHTML 1.0 schema set still uses Schematron, and Schematron is available for custom schemas. This means that as long as validator.nu and html5.validator.nu run in the same process, it’s still possible to use Schematron on the validator.nu side to deny service on the html5.validator.nu side. But if that becomes a problem, html5.validator.nu can at least be moved to a separate JVM instance that has no Schematron dependency.
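To give a rough idea of what “custom Java code that detects the same conditions” can look like, here is a minimal sketch of a streaming check written as a SAX handler. It is not the actual Validator.nu code: the class name NestedAnchorChecker, the check() helper, the message wording and the choice of assertion (no a element inside another a element) are all made up for illustration. The point is only that a handler like this keeps a counter instead of a document tree and can report a problem the moment the offending start tag arrives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example, not taken from Validator.nu: a streaming replacement
// for a Schematron-style assertion "an a element must not have an a ancestor".
// For brevity it matches on the local name only and ignores the namespace.
final class NestedAnchorChecker extends DefaultHandler {
    private final List<String> messages = new ArrayList<>();
    private Locator locator;
    private int openAnchors = 0; // number of currently open a elements

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("a".equals(localName)) {
            if (openAnchors > 0) {
                // Reported right at the start tag, so the message is emitted
                // in document order along with everything else.
                int line = (locator == null) ? -1 : locator.getLineNumber();
                messages.add("line " + line
                        + ": a element must not be a descendant of an a element");
            }
            openAnchors++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equals(localName)) {
            openAnchors--;
        }
    }

    // Illustrative helper: parses a string and returns the collected messages.
    static List<String> check(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            NestedAnchorChecker checker = new NestedAnchorChecker();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), checker);
            return checker.messages;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling NestedAnchorChecker.check("<p><a>x<a>y</a></a></p>") yields one message, while two sibling a elements yield none. The only state kept per document is a single integer, which is the kind of thing that makes a hand-written checker cheaper on the heap than a general-purpose XSLT-backed one.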

The more obvious user-visible change is that messages from “assertions.sch” now fire as early as logically possible. The less obvious change is that validating a document the size of the HTML5 spec (2.5 MB) now takes 12 MB less RAM, so more concurrent validations can take place before the heap runs out.