On Clipboard Formats – 2006-09-15

The Carbon version of Gecko doesn’t interoperate with anything but other Carbon Gecko processes. I figured I should try to do better with the Cocoa nsClipboard.

This stuff is so underdocumented that it isn’t even funny. This document is written so that others might find something when they search the Web.

The Gecko Clipboard Interface

The clipboard in Gecko is implemented around two interfaces: nsIClipboard and nsITransferable. There’s a single service instance of nsIClipboard. It provides methods for putting an nsITransferable on the clipboard, querying the clipboard for flavor availability and filling an nsITransferable from the contents of the clipboard.

An nsITransferable instance is a wrapper for the data being transferred through the clipboard. An nsITransferable instance can wrap multiple alternative representations of the data, for example HTML and plain text. On copy, the nsITransferable instance advertises an array of flavors it can provide. The array is ordered with the highest-fidelity representations first. On paste, the nsITransferable instance advertises an array of flavors it can accept.

A flavor has a name (a C string, in practice a pseudo-MIME type in ASCII), data (nsISupports) and length (I have no idea why). In theory, the data can be any XPCOM object instance. In practice, it is always an nsISupportsString, an nsISupportsCString or an nsIImage. (More on this later.)

At least in theory, an nsITransferable can promise data. That is, the data is not necessarily traveling in the transferable until it is requested.

The Cocoa Pasteboard Interface

For historical reasons, the clipboard in Cocoa is called “pasteboard” in the API. On the UI layer, it is called “clipboard” for consistency with the Mac tradition.

On OS X, the inter-process pasteboards are implemented as a server process called pbs. (Yes, there’s support for multiple pasteboards.) Applications don’t talk to pbs directly. Instead, they use a Cocoa or Carbon API that takes care of the inter-process communication with pbs. In Cocoa, the communication happens through the NSPasteboard class.

Apps ask the NSPasteboard class to provide an NSPasteboard object for a particular pasteboard. The object can be queried for available flavors, and the data for a flavor can be retrieved. The app can also use the object to promise flavors. The data can then be written immediately or provided on a later callback.

Plain Text

Gecko used to have an internal plain text flavor called text/plain, which meant an nsISupportsCString in the platform encoding. The concept of “platform encoding” is seriously defective. Luckily we no longer need to pretend that Mac users can only deal with MacRoman.

I did not implement any support for the obsolete text/plain flavor.

The contemporary flavor for moving plain text around inside Gecko is called text/unicode. (Notice the completely bogus MIME type.) It is host-order UTF-16 without a BOM using LF line breaks (must use LF) as an nsISupportsString.

The Cocoa way of passing around Unicode plain text is NSStringPboardType. It is usually written and read using convenience methods that use NSStrings. The actual clipboard data format is UTF-8 without a BOM and without a \0 terminator.

Cocoa apps typically write LF line breaks to the pasteboard. Line breaks are preserved between Cocoa apps. Cocoa apps automatically also see “NeXT plain ascii pasteboard type” as the last available flavor on the pasteboard when NSStringPboardType has been provided. No sane app should try to tamper with the lossy legacy NeXT flavor. When a Cocoa app has provided NSStringPboardType, Carbon and Classic apps see utxt and TEXT flavors on the clipboard. LF linebreaks are automatically converted to CR linebreaks.

TEXT is MacRoman (with lossy conversion, obviously) without \0-termination. Sane apps should avoid it if they can help it.

Traditionally on PPC, utxt has been big-endian UTF-16 (or more likely originally UCS2) without a BOM and without U+0000-termination. At 10.2, Mac OS X started putting a BOM in utxt when converting from NSStringPboardType. Apple backed out the change due to compatibility problems, but said that it will reappear later. Frankly, I think it wasn’t the right thing to do. Retroactively changing the format of a clipboard flavor is not the “right thing” to do. The Right Thing to do would be either keeping it big-endian ad infinitum even on little-endian hosts or redefining the flavor as being in the host order and making Rosetta byte-swap utxt to and from PPC apps. I didn’t find proper documentation on this and I don’t have an Intel Mac to test with, but according to the Universal Binary Programming Guidelines, 2nd ed. utxt is now BOMless for good and there’s a new ut16 flavor that has a BOM. The document doesn’t say whether utxt is big-endian or in the host order, though. Also, the benefits of the flavor proliferation are not obvious to me. And, of course, Apple didn’t revise the technote stating that BOM in utxt may reappear.
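To make the byte-order and BOM question concrete, here is a minimal Python sketch of how a reader of raw utxt bytes might cope with both variants. The function name decode_utxt and the assume_big_endian parameter are made up for illustration. Defaulting to big-endian is an assumption based on the PPC tradition described above, not something the documentation promises.

```python
import codecs


def decode_utxt(data: bytes, assume_big_endian: bool = True) -> str:
    """Decode raw utxt bytes, tolerating an optional BOM (hypothetical helper)."""
    # If a BOM is present, trust it and strip it.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to the caller's guess (an assumption, see above).
    return data.decode("utf-16-be" if assume_big_endian else "utf-16-le")
```

A BOMless buffer written by a little-endian producer would need assume_big_endian=False, which is exactly the guesswork the paragraph above complains about.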

Fortunately, Cocoa developers don’t need to worry about utxt byte order or BOM. When a Carbon app (or a Classic app) has put utxt (or TEXT) on the clipboard, a Cocoa app sees NSStringPboardType in addition to the original flavors written by the Carbon app. The Cocoa app just reads NSStringPboardType and leaves the byte order issue to Apple. There’s one catch though: unlike the NSStringPboardType to utxt conversion, the utxt to NSStringPboardType conversion does not change line breaks. Therefore, NSStringPboardType has LF line breaks if the data was exported by a typical Cocoa app but has CR line breaks if the data was exported (as utxt) by a typical Carbon app. Since Gecko requires LF line breaks, the clipboard implementation has to make sure that each CR is replaced with an LF. (Note that sane Mac apps don’t export CRLF line breaks to the pasteboard, so there’s no need to check for those.)
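The line break fix-up is simple enough to show in a few lines. This is a Python sketch of the idea only (the real clipboard code is not Python), and the function name normalize_line_breaks is hypothetical.

```python
def normalize_line_breaks(text: str) -> str:
    """Replace every CR with LF, since Gecko's text/unicode flavor must use LF."""
    # Assumption carried over from the text above: CRLF never shows up on the
    # pasteboard. If it did, this one-for-one replacement would yield two LFs.
    return text.replace("\r", "\n")
```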

HTML

HTML is so common nowadays that one might expect there to be a system-wide pasteboard flavor for HTML. Would be reasonable, right? After all, Windows is known to enable copying and pasting HTML between apps.

The documentation for NSPasteboard lists a type called NSHTMLPboardType. Whoopee!

The type is documented as follows: “HTML (which an NSTextView object can read from, but not write to)”. That’s it! Really. Google for it and you find someone asking for the exact format but no one replies. I tried to find apps on my system that export NSHTMLPboardType but found none.

Since docs were lacking, I dumped the clipboard exports on my own system and visited other Mac users who have apps that could potentially export HTML.

Gecko Internals

When HTML is copied in Gecko, four interrelated flavors are put in the nsITransferable instance: text/html, text/_moz_htmlcontext, text/_moz_htmlinfo and text/x-moz-url-priv. These all QI to nsISupportsString (host-order UTF-16 string without a BOM).

text/html contains a rootless serialization of the selection DOM range that was copied. text/_moz_htmlcontext contains a serialized doctypeless tag soup document that parses into a branchless document tree that only contains the element nodes from the root to the parent of the selection (including the parent) with attributes present. (At least I think that’s what it contains. I haven’t investigated what happens if the range starts and ends in a different parent.) text/_moz_htmlinfo contains a string representation of two numbers (base ten?) separated by a comma. I have no idea what they mean. text/x-moz-url-priv contains the URI of the document from which the selection was copied (about:blank if a real URI is unavailable).

Note how the concept of alternative representations is abused here. The different flavors augment each other instead of being alternatives. Also note how the MIME type text/html is used for labeling a fragment instead of a full document and how the other types are private and two of the private types don’t follow the naming convention for private types (those two are also undocumented).

Upon paste, if text/html is unavailable, Gecko tries to read application/x-moz-nativehtml, which means the Microsoft Windows CF_HTML clipboard data as an nsISupportsCString. (Note that Gecko internals don’t themselves export this flavor.)

Carbon Gecko

Carbon Gecko writes the data from the nsISupportsString flavors as BOMless UTF-16 to the clipboard. (I haven’t investigated the byte order on Intel, but I’d expect host order and I’ve heard about problems with Gecko apps in Rosetta and native Gecko apps not interoperating, which makes perfect sense.)

Each Gecko-internal flavor is mapped to a Carbon scrap type, whose 16 most significant bits are “MZ” when interpreted as MacRoman. The lower 16 bits contain an integer. (Carbon scrap types are 32-bit integers that are usually interpreted as four MacRoman characters with the most-significant byte as the leftmost character.) The Carbon Gecko clipboard implementation assigns an integer from a counter to each Gecko-internal nsISupportsString-based flavor as they are encountered by the clipboard implementation. The mappings are remembered for the lifetime of the Gecko process.
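The scrap type arithmetic can be sketched in Python as follows. The class name MzScrapTypeMap, its method, and the counter starting at 1 are assumptions for illustration; the text above only says that integers come from a counter and that mappings live as long as the process.

```python
def four_char_code(chars: str) -> int:
    """Pack four MacRoman characters into a 32-bit scrap type.

    The leftmost character ends up in the most significant byte.
    """
    raw = chars.encode("mac_roman")
    if len(raw) != 4:
        raise ValueError("a scrap type is exactly four MacRoman bytes")
    return int.from_bytes(raw, "big")


class MzScrapTypeMap:
    """Hands out 'MZ' + 16-bit counter scrap types, one per Gecko flavor."""

    def __init__(self) -> None:
        self._by_flavor = {}
        self._next = 1  # starting value is an assumption, not from the source

    def scrap_type_for(self, gecko_flavor: str) -> int:
        if gecko_flavor not in self._by_flavor:
            if self._next > 0xFFFF:
                raise OverflowError("ran out of 16-bit counter values")
            prefix = int.from_bytes("MZ".encode("mac_roman"), "big")
            self._by_flavor[gecko_flavor] = (prefix << 16) | self._next
            self._next += 1
        # The mapping is remembered for the lifetime of this object.
        return self._by_flavor[gecko_flavor]
```

Because the numbers depend on the order in which flavors are encountered, two processes can assign different scrap types to the same flavor, which is why a mapping has to travel along with the data.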

As the least favored scrap type, Carbon Gecko exports a MOZm scrap whose data contains mapping from the generated MZ types to Gecko-internal flavors, so that other Carbon Gecko processes can import the autogenerated scrap types.

Opera 9.0

Curiously, Opera 9.0 doesn’t export any HTML flavor at all. It just exports a plain text representation as utxt (and TEXT). Apparently, Opera isn’t maintaining an app-internal HTML clipboard, either, because Opera-to-Opera copying and pasting (into a contenteditable part of an HTML document) doesn’t preserve HTML formatting.

WebKit

As of Mac OS X 10.3.9, WebKit exports copied HTML to the pasteboard in its Web Archive format. According to Apple’s documentation, the constant WebArchivePboardType is available only starting with Mac OS X 10.4. The header from Mac OS X 10.3.9 seems to have the constant, though. Hmm… Anyway, to be safe, the constant resolves to the NSString “Apple Web Archive pasteboard type”.

I don’t know if the format of the Web Archive is documented. I didn’t find any documentation. However, there’s a documented API for looking inside the mystery bag of bytes. As far as I can tell, there’s no documentation on what one should expect to find inside the archive in the pasteboard case.

If the copied data originated in a text/html document, the main resource of the Web Archive claims to have the MIME type text/html. However, as with Gecko, the MIME type label doesn’t mean that you should expect to find a full HTML document. Instead, the main resource is a rootless serialization of the selection range like the Gecko text/html flavor with these exceptions:

If the document of origin had a doctype, the serialization starts with a reconstruction of that doctype.
Elements whose computed style differs from what it would be with only the UA style sheet applied get a style attribute with the differing computed style serialized.
Each text node is wrapped in a span element whose class is Apple-style-span. The span has a style attribute that repeats computed style that differs from UA defaults at that point in the document.

The resulting document fragment is exceedingly crufty. Clearly, whoever designed the requirements for this feature did not think in terms of semantic markup but instead continued to believe in the MacWrite legacy notion of rich text where rich text is a string of characters with font properties attached to each character and line breaks acting as paragraph separators. It seems that whoever implemented this tried to make the semantic markup recoverable (if the recipient cares to do some scrubbing) while also satisfying very structure-hostile presentational requirements.

The weirdest and most extreme symptom of the MacWrite legacy-influenced structure-hostile thinking is what happens when pasting structural markup. Suppose you have an h2 element with text content on the clipboard. If the insertion point is on what looks like a blank line on its own when you paste, you get an h2 element in the DOM. However, if there is text on the line when you paste, you don’t get an h2 element in the DOM! Instead, you get a span with a style attribute that reproduces the font properties of an h2!

The joke is that Gecko and WebKit have a reputation of being more standards-oriented than Trident, but when it comes to editing block elements, Trident gets it right and both Gecko (in the form of br elements) and WebKit exhibit block-hostile presentationalism that anyone trying to build a CMS for structural markup will hate.

But back to the format.

The Web Archive is byte-oriented, so the HTML fragment needs to be in some character encoding. WebKit seems to write it always in UTF-8 regardless of the encoding of the source document, although I don’t see it promised anywhere in documentation. I’m going to expect that it is always in UTF-8. If some program other than WebKit chooses to export something other than UTF-8 to the clipboard inside a Web Archive, instead of dealing with it in my code, I think the developer of the other program needs to be attitude-readjusted with a cluestick. (See GoLive below.)

There’s another kind of source document leak, though. If the source document had the MIME type application/xhtml+xml, the fragment exported to the clipboard will be doctypeless (good) and have that MIME type even though the fragment is rootless and produced with a namespace-unaware serializer.

The main improvement of the Web Archive format over what has existed before is that if the copied selection encompasses img elements, the images are also transferred inside the Web Archive. Unfortunately, this feature does not map nicely to Gecko internals.

MS Office

On Windows MS Office is known to export CF_HTML to the clipboard, and this is even documented (well, on the 0.9 level at least even if the export is 1.0). It turns out that MS Office on Mac exports CF_HTML as well. Interestingly, whatever Carbon type they use shows up as NSHTMLPboardType on the Cocoa side (no conversion of data—just flavor name mapping)!

When Googling for NSHTMLPboardType, I had discovered a version control log entry for WebKit stating that they don’t use NSHTMLPboardType due to problems with Word. Perhaps they were exporting a straight fragment without the CF_HTML wrapping or something.

Word puts what appear to be descriptors pointing to internal stuff and RTF on the export list before CF_HTML. Since I wouldn’t be exporting RTF from Gecko, I checked that NSTextView really works with CF_HTML. I captured the clipboard from Word and re-exported only CF_HTML and plain text. NSTextView accepted CF_HTML just fine (so the docs didn’t lie). However, normally when pasting from Word NSTextView takes the RTF version.

Note that CF_HTML is exported even if the document being edited is not an HTML document.

MS Office exports CF_HTML 1.0 but imports 0.9 just fine.
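For readers who have not met CF_HTML: it is a UTF-8 HTML document preceded by a small text header whose fields are byte offsets into the whole buffer. The following Python sketch builds and reads a version 0.9 payload following my understanding of Microsoft's published description. The function names are hypothetical, the minimal html/body wrapper is an assumption, and this is not the code that went into Gecko.

```python
from typing import Optional

PREFIX = b"<html><body>\r\n<!--StartFragment-->"
SUFFIX = b"<!--EndFragment-->\r\n</body></html>"


def build_cf_html(fragment: str, source_url: Optional[str] = None) -> bytes:
    """Wrap an HTML fragment in a CF_HTML 0.9 envelope (illustrative sketch)."""
    body = fragment.encode("utf-8")
    url_line = b""
    if source_url:
        url_line = b"SourceURL:" + source_url.encode("utf-8") + b"\r\n"

    def header(start_html: int, end_html: int, start_frag: int, end_frag: int) -> bytes:
        # Fixed-width numbers keep the header length independent of the values.
        fields = (
            "Version:0.9\r\n"
            "StartHTML:%010d\r\n"
            "EndHTML:%010d\r\n"
            "StartFragment:%010d\r\n"
            "EndFragment:%010d\r\n"
        ) % (start_html, end_html, start_frag, end_frag)
        return fields.encode("ascii") + url_line

    # All offsets are byte counts from the start of the buffer.
    start_html = len(header(0, 0, 0, 0))
    start_frag = start_html + len(PREFIX)
    end_frag = start_frag + len(body)
    end_html = end_frag + len(SUFFIX)
    return header(start_html, end_html, start_frag, end_frag) + PREFIX + body + SUFFIX


def read_offset(data: bytes, name: bytes) -> int:
    """Read one numeric header field such as b'StartFragment'."""
    start = data.index(name + b":") + len(name) + 1
    end = data.index(b"\r\n", start)
    return int(data[start:end])


def extract_cf_html_fragment(data: bytes) -> str:
    """Return the fragment that the header offsets point at."""
    start = read_offset(data, b"StartFragment")
    end = read_offset(data, b"EndFragment")
    return data[start:end].decode("utf-8")
```

The optional SourceURL field is where a source URL such as Gecko's text/x-moz-url-priv value would go. A robust reader would also have to tolerate bare LF line endings and unknown header fields, which this sketch does not.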

NeoOffice

If I recall my experiences correctly, OpenOffice.org on Windows supports CF_HTML. Not so on Mac, alas. NeoOffice 2.0 Alpha 4 patch 3 can copy and paste HTML internally. However, it exports only RTF and plain text to the system clipboard. Interestingly, it doesn’t even export a private marker on the system clipboard, so it has to have another mechanism for tracking whether another app has put something on the clipboard since the time NeoOffice last put some HTML on its internal clipboard.

Anyway, under these circumstances, I can’t make Cocoa Gecko interoperate on the HTML level with NeoOffice.

Dreamweaver

It turns out that Dreamweaver’s idea of an HTML clipboard flavor is closest to Gecko’s internal text/html flavor.

Dreamweaver exports and imports a scrap flavor called DwUH. It contains the HTML source (as seen in the source view) corresponding to the selection in the layout view as U+0000-terminated BOMless UTF-16. On PPC the byte order is big-endian, but there’s no way of knowing whether Macrodobia deliberately or accidentally changes the meaning of the scrap type on Intel (to host-order making native and Rosetta apps not interoperate).

The U+0000-terminator is important. The system pasteboard is designed for transferring arbitrary binary data. Hence, the data buffer has an explicit length. Dreamweaver happily ignores the length and reads until it sees a U+0000—even if that means reading past the end of the buffer.
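In Python terms, writing and reading DwUH data as described above might look like this. The function names are hypothetical, and big-endian is the PPC behavior; what a native Intel build would do is unknown.

```python
def encode_dwuh(html: str) -> bytes:
    """Encode HTML source as BOMless big-endian UTF-16 with a U+0000 terminator."""
    # The terminator is not optional: the reader described above ignores the
    # buffer length and scans for U+0000.
    return (html + "\u0000").encode("utf-16-be")


def decode_dwuh(data: bytes) -> str:
    """Decode DwUH data, stopping at the first U+0000 if there is one."""
    text = data.decode("utf-16-be")
    # Unlike the reader described above, this respects the buffer length too.
    return text.split("\u0000", 1)[0]
```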

GoLive

Like Dreamweaver, GoLive also exports the piece of HTML source as seen in the source view that corresponds to the selection in the layout view. There’s a crucial difference, though: Whereas all the apps discussed above use either UTF-8 or UTF-16 on the clipboard depending on the app, GoLive uses the character encoding of the source document on the clipboard! That’s some bad craziness!

GoLive exports the HTML fragment twice: first as GLTE and then as GLML. I was unable to come up with any test conditions that would cause these two flavors to get different contents. The data does not have a null character at the end. The interpretation of these flavors depends on a third flavor: MENV. It contains an XML document like this:

<MarkupEnv version="1">
        <base url="URL"/>
        <urlsettings>
                <urlhandling version="1" casesensitive="yes" linksabsolute="no" autoaddmailto="yes" honorcgiparameters="yes" cgiparameters="?" hhescapingState="complete" encoding="UTF8"/>
        </urlsettings>
        <markuptype markuptype="2020111469"/>
        <encoding charset="utf-8"/>
        <structure kind="area" boxClass="htmT"/>
        <actions/>
        <selectedVar name=""/>
</MarkupEnv>

Note how UTF-8 is labeled differently when describing markup and when describing URLs.
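Since the meaning of the GLML bytes depends on MENV, a reader has to parse the XML first. Here is a small Python sketch of that dependency. The helper names are hypothetical, the sample is an abbreviated version of the document above, and it assumes the charset label is a name Python's codec registry recognizes.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Abbreviated version of the MENV document shown above.
SAMPLE_MENV = """<MarkupEnv version="1">
    <base url="URL"/>
    <encoding charset="utf-8"/>
</MarkupEnv>"""


def menv_charset(menv_xml: str) -> Optional[str]:
    """Return the charset attribute of the encoding element, if present."""
    root = ET.fromstring(menv_xml)
    encoding = root.find("encoding")
    if encoding is None:
        return None
    return encoding.get("charset")


def decode_glml(glml: bytes, menv_xml: str) -> str:
    """Decode GLML bytes using the charset that MENV declares."""
    charset = menv_charset(menv_xml)
    if charset is None:
        raise ValueError("MENV does not declare a charset")
    return glml.decode(charset)
```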

GoLive also writes a flavor called GLBx, but I haven’t been able to guess the purpose of this flavor. It seems to always contain the same six bytes.

What I did with Cocoa

In the Cocoa clipboard code that I wrote for Gecko, as the last resort, if the flavor does not get special treatment and the data QIs to nsISupportsString, I generate a Cocoa flavor name by prepending “Mozilla nsISupportsString ” to the Gecko flavor string and use the UTF-8 representation of the nsISupportsString as the pasteboard data. (Neither BOM nor \0-termination.)
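The naming scheme is mechanical, so a short Python sketch captures it. The function names are hypothetical; the prefix string and the UTF-8 encoding come from the description above.

```python
from typing import Optional, Tuple

PREFIX = "Mozilla nsISupportsString "


def to_pasteboard(gecko_flavor: str, value: str) -> Tuple[str, bytes]:
    """Map a Gecko string flavor to a (Cocoa type name, data) pair."""
    # UTF-8 with neither a BOM nor a trailing \0.
    return PREFIX + gecko_flavor, value.encode("utf-8")


def from_pasteboard(cocoa_type: str, data: bytes) -> Optional[Tuple[str, str]]:
    """Reverse the mapping; returns None for types that lack the prefix."""
    if not cocoa_type.startswith(PREFIX):
        return None
    return cocoa_type[len(PREFIX):], data.decode("utf-8")
```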

Since Cocoa works with flavor strings like Gecko, there’s no need to map strings to 32-bit flavor identifiers. However, as a result, you cannot read these flavors through the legacy Carbon API.

Copy

text/html is exported the same way, that is, as “Mozilla nsISupportsString text/html”. However, representations in the Web Archive, CF_HTML, Dreamweaver and GoLive formats are also written to the pasteboard (in that order). I didn’t implement support for transferring HTML to and from Carbon Gecko. I figured that once Firefox and Thunderbird switch over to Cocoa, users will upgrade all their Gecko-based apps in active use to versions that use the Cocoa widget implementation, so interop with the old Carbon builds would be wasted effort. (Of course, you still get plain text copying and pasting between Carbon and Cocoa Geckos.)

When exporting to Web Archive, CF_HTML and to the GoLive flavors, the Gecko text/x-moz-url-priv flavor is appropriately mapped to the source URL of the fragment in these formats.

The DwUH flavor is considered to be big-endian regardless of host, since for now the way to run Dreamweaver on an Intel Mac is under Rosetta. I hope Macrodobia keeps the flavor big-endian on Intel Macs when they ship a Universal Binary or uses a different flavor identifier for a little-endian version. Of course, chances are that they don’t care and make it impossible to interoperate with both a native version and a Rosetta-hosted version without guesswork. (I have posted about this to macromedia.dreamweaver.appdev, but it appears that members of the Dreamweaver team don’t respond there.)

Paste

On pasting, when the nsITransferable instance advertises that it accepts text/html, the clipboard implementation looks for “Mozilla nsISupportsString text/html”, “Apple Web Archive pasteboard type”, DwUH and GLML (in that order).
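The lookup is a first-match search over a fixed preference list, which can be sketched in Python like this. The function name is hypothetical, and using the bare strings "DwUH" and "GLML" as type names is a simplification: how Carbon scrap types are spelled as Cocoa pasteboard type names is not covered here.

```python
from typing import Iterable, Optional

# Preference order from the paragraph above, most preferred first.
HTML_PASTE_ORDER = [
    "Mozilla nsISupportsString text/html",
    "Apple Web Archive pasteboard type",
    "DwUH",  # placeholder spelling of the Dreamweaver scrap type
    "GLML",  # placeholder spelling of the GoLive scrap type
]


def pick_html_flavor(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred HTML flavor present on the pasteboard."""
    present = set(available)
    for name in HTML_PASTE_ORDER:
        if name in present:
            return name
    return None
```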

In the case of Web Archive and GLML, only UTF-8 is accepted. With Web Archive, the code checks that the data claims to be UTF-8. With GLML the code checks whether the data looks like UTF-8. New documents in GoLive default to UTF-8, and supporting non-UTF-8 craziness is just not worth it. With Web Archive, the URL of the main resource is mapped to Gecko’s text/x-moz-url-priv flavor. Doing the source URL mapping when pasting from GoLive didn’t seem worth the trouble. Besides, with Dreamweaver and GoLive the main use case is copying from Gecko. Pasting to Gecko is there only for completeness.
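A "looks like UTF-8" check can be as simple as attempting a strict decode. This Python sketch shows the idea; the function name is hypothetical and the actual check in the clipboard code may differ in detail.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII passes trivially, which is harmless here because ASCII bytes mean the same thing in UTF-8. Text with non-ASCII characters in a legacy single-byte encoding usually fails the check.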

I did not implement any scrubbing of the style attribute cruft that WebKit exports. However, such scrubbing is needed. Suppose a user uses Safari as the browser and Thunderbird as the mail app and copies something from a Web page to an HTML email message. It is really hard if not impossible (I couldn’t figure it out quickly) to get rid of WebKit’s character formatting cruft using the Thunderbird UI.

Frankly, I think Apple’s approach to copying “rich text” is just wrong. In general, regardless of HTML, if you have one text in 14 pt Times and another in 12 pt Palatino, why would you want the font and size to be preserved when copying? The first thing you have to do after pasting is making the font match the font of the target document. I often paste into TextWrangler and recopy in order to get rid of RTF on the clipboard. Perhaps I should write a small program to automate this. But I digress.

When the nsITransferable instance advertises that it accepts application/x-moz-nativehtml, the clipboard implementation looks for CF_HTML, i.e. NSHTMLPboardType.

Bitmaps

When the user invokes the context menu on an image, Gecko-based apps typically offer a menu command for copying the image to the clipboard. As far as I can tell, there are no cases where an existing Gecko-based app tries to read a bitmap from the clipboard.

Gecko Internals

The IDL for nsITransferable has defines for four image flavors: image/png, image/gif, image/jpeg and application/x-moz-nativeimage. As far as I can tell, the three image/* flavors are not actually used. However, code for other platforms triggers image copying for all four types, so I did, too. When the transferable is an image, the data object QIs to nsIImage.

QuickDraw gfx

When the QuickDraw gfx is used, the image QIs to nsIImageMac, which can write itself into a PicHandle. Even the Cocoa pasteboard supports a type called NSPICTPboardType, so writing the image to the pasteboard is easy.

Thebes gfx

When the Thebes gfx is used, there’s no easy way to get a PICT, so it makes sense to endeavor to get a TIFF that can be put onto the pasteboard as NSTIFFPboardType. The part of the Cocoa API that makes it possible to take a raw bitmap buffer and turn it into a TIFF is NSBitmapImageRep.

NSBitmapImageRep looks versatile on the surface in terms of the raw buffer formats that it accepts. But it isn’t that versatile really. It assumes a row order (bottom to top?) that is the opposite of what Thebes uses (top to bottom?), so I had to reverse the row order manually. Also, NSBitmapImageRep understands ARGB (which is what Thebes uses) only since Tiger, and Gecko is still supposed to support Panther, so I had to rotate the pixel words to the RGBA format myself.

Thebes uses ARGB as the sample order within the 32-bit pixel word as seen through C bitwise operations. This means that on little-endian systems Thebes actually uses BGRA if you look at the memory buffer iterating by bytes. The documentation for NSBitmapImageRep does not say whether its notion of RGBA means ABGR on Intel.
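The two manual fix-ups, row reversal and pixel word rotation, can be sketched in Python on a flat list of 32-bit pixel words. The function names are hypothetical, and the sketch works on integer words, so it sidesteps the byte-order question raised above instead of answering it.

```python
from typing import List


def argb_to_rgba(pixel: int) -> int:
    """Rotate a 32-bit ARGB word left by 8 bits, giving RGBA."""
    return ((pixel << 8) & 0xFFFFFFFF) | ((pixel >> 24) & 0xFF)


def flip_rows(pixels: List[int], width: int, height: int) -> List[int]:
    """Reverse the order of the rows, keeping the pixels within each row."""
    if len(pixels) != width * height:
        raise ValueError("buffer size does not match the dimensions")
    flipped: List[int] = []
    for row in range(height - 1, -1, -1):
        flipped.extend(pixels[row * width:(row + 1) * width])
    return flipped


def convert_buffer(pixels: List[int], width: int, height: int) -> List[int]:
    """Apply both fix-ups: reverse the rows, then rotate each pixel word."""
    return [argb_to_rgba(p) for p in flip_rows(pixels, width, height)]
```

For the byte-order point, (0x80112233).to_bytes(4, "little") in Python gives the bytes 33 22 11 80, that is, B, G, R, A when an ARGB word is read byte by byte on a little-endian machine.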

What About Astral Characters?

Of course, when dealing with Unicode and dealing with it properly, one has to test with astral characters. It turns out that Word, Dreamweaver and GoLive don’t support them, so there’s nothing to deal with.
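For anyone who wants to run the same test against other apps, an astral character is one above U+FFFF, which takes a surrogate pair in UTF-16 and four bytes in UTF-8. A short Python sketch with a hypothetical helper name, using U+1D11E (MUSICAL SYMBOL G CLEF) as the test character:

```python
G_CLEF = "\U0001D11E"  # an astral character, outside the Basic Multilingual Plane


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units; astral characters count as two."""
    return len(text.encode("utf-16-be")) // 2


# One character, but two UTF-16 code units (surrogates D834 DD1E)
# and four UTF-8 bytes (F0 9D 84 9E).
utf16_bytes = G_CLEF.encode("utf-16-be")
utf8_bytes = G_CLEF.encode("utf-8")
```

A flavor that survives a round trip with this character intact, in both its UTF-16 and UTF-8 forms, is handling astral characters correctly.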