Utopian XHTML at the SEC

05 Jul 2025

I’ve been working on a project that was riffed into existence by friends John, Olov, and Nora at Grand Morelos: it’s called Labor Leverage. It’s loosely based on workshops John and our coworker Stacy Cowley have run previously, where they share with other unions how to research employer finances in order to build power at the bargaining table. Big public companies are required to report data to the SEC as part of that institution’s purpose in stabilizing public markets, and it turns out many of the same data are useful for revealing how your employer and its executives actually spend the company’s money. Profits, stock buybacks, executive salaries and stock compensation: all of it sits in SEC data that’s open to the public!

The format of this data is interesting: companies upload their data to the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system. The API is beautiful, and is barely even an API at all: it’s “just” HTML files, uploaded at predictable paths. Specifically, it’s XHTML, the older and crankier variant of today’s HTML that was meant to validate against a strict XML syntax. Companies embed snippets of another XML flavor, iXBRL (Inline eXtensible Business Reporting Language), inside of the HTML, so that you can have a report that looks like a webpage with profit and loss tables, buried within which are machine-readable XML tags describing key accounting figures as structured data.
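
To make the "tags buried in a webpage" idea concrete, here is a minimal sketch in Python that pulls tagged figures out of an XHTML fragment using only the standard library. The fragment is made up for illustration and is not copied from any real filing, and extract_facts is a hypothetical helper name. The sketch only looks at ix:nonFraction elements and their name, contextRef, scale, and sign attributes; it ignores the format attribute, and real filings may need a more forgiving parser than this one.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by Inline XBRL 1.1 for the "ix:" prefix.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# A made-up fragment for illustration only (assumption: not from a real filing).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <table>
      <tr>
        <td>Net income</td>
        <td><ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2024"
            unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td>
      </tr>
    </table>
  </body>
</html>"""


def extract_facts(xhtml):
    """Return one dict per ix:nonFraction element found in the document."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter("{%s}nonFraction" % IX_NS):
        # The displayed text is formatted for humans ("1,234"), so strip commas.
        text = "".join(el.itertext()).strip().replace(",", "")
        # "scale" is a power of ten: scale="6" means the figure is in millions.
        scale = int(el.get("scale", "0"))
        value = float(text) * 10 ** scale
        # A sign="-" attribute marks a negative value shown without a minus sign.
        if el.get("sign") == "-":
            value = -value
        facts.append(
            {"name": el.get("name"), "context": el.get("contextRef"), "value": value}
        )
    return facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE):
        print(fact["name"], fact["context"], fact["value"])
```

The same page a person reads as "Net income: 1,234" hands a program the tag name, the reporting context, and the unscaled number.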

Many of the tags encode dry, Generally Accepted Accounting Principles (GAAP) data like us-gaap:NetIncomeLoss, us-gaap:StockRepurchasedDuringPeriodValue, or us-gaap:CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents. Maybe more interesting are the data that companies are required to report by the government, but which don’t use machine-readable XML tags because they are not legally required to do so. These include data that are absolutely relevant to unions like the one I belong to, such as the number of total employees, or the CEO Pay Ratio disclosures required by law following the 2008 financial crisis. These data are written as prose in the report, and so extracting them from the text requires the use of cruder text matching techniques. It’s tempting to picture a regulatory regime where financial data relevant to workers also had legally-mandated, machine-readable XML tags!
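
As a rough illustration of what "cruder text matching" can look like, here is a sketch using Python regular expressions. The sample sentence is invented in the style of a proxy statement, and the two patterns and the helper names (find_employee_count, find_pay_ratio) are assumptions made for this example, not the project's actual code. Real filings phrase these disclosures in many different ways, so patterns like these would miss plenty of cases.

```python
import re
from typing import Optional

# Hypothetical prose written for this example, not quoted from a real filing.
SAMPLE = (
    "As of December 31, 2024, we had approximately 12,500 full-time employees. "
    "The ratio of our CEO's annual total compensation to that of our median "
    "employee was 185 to 1."
)

# Matches phrases like "approximately 12,500 full-time employees".
EMPLOYEES_RE = re.compile(
    r"(?:approximately\s+)?(\d[\d,]*)\s+(?:full-time\s+|total\s+)?employees",
    re.IGNORECASE,
)

# Matches ratios written as "185 to 1" or "185:1".
PAY_RATIO_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:to|:)\s*1\b", re.IGNORECASE)


def find_employee_count(text: str) -> Optional[int]:
    """Return the first employee headcount mentioned in the text, if any."""
    match = EMPLOYEES_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def find_pay_ratio(text: str) -> Optional[float]:
    """Return N from the first "N to 1" style ratio in the text, if any."""
    match = PAY_RATIO_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(find_employee_count(SAMPLE))
    print(find_pay_ratio(SAMPLE))
```

Compared with reading a tagged value, every assumption about wording lives in the pattern itself, which is exactly why a mandated tag would be so much nicer.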

Compared to the freewheeling, single-page-app-oriented ecosystem of today’s HTML5, reading iXBRL-flavored XHTML is like discovering an exotic dialect of a language that’s been cut off from its parent for so long that it’s had to invent grammars and styles of its own. <FONT> is everywhere. All elements have 20-30 inline style definitions in lieu of CSS. When I was scraping this data, I was surprised to find that all the 10,000 or so companies clock in with 29GB worth of HTML. The dataset is that large because each document tends to have extreme HTML energy: the biggest document, a DEF-14A filing from Vaso Corporation, weighs in at 36MB, in part because there are long sections where every single character is wrapped in a distinct font tag (!).
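
One way to put a number on that "HTML energy" is to compare how many characters of a document are visible text versus markup. The sketch below uses Python's built-in html.parser to do that on a tiny invented snippet where each letter gets its own font tag. MarkupCounter and markup_overhead are hypothetical names, and this is not necessarily how the figures above were measured.

```python
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Counts visible text characters and <font> start tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_chars = 0
        self.font_tags = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> arrives here as "font".
        if tag == "font":
            self.font_tags += 1

    def handle_data(self, data):
        self.text_chars += len(data)


def markup_overhead(html):
    """Return total size, visible text size, and font tag count for a document."""
    parser = MarkupCounter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return {
        "total_chars": len(html),
        "text_chars": parser.text_chars,
        "font_tags": parser.font_tags,
    }


# Invented snippet: the word "Proxy" with every character in its own font tag.
SAMPLE = "<p>" + "".join(
    '<FONT style="font-family:Times">%s</FONT>' % ch for ch in "Proxy"
) + "</p>"

if __name__ == "__main__":
    stats = markup_overhead(SAMPLE)
    print(stats)
    print("markup share: %.0f%%" % (100 * (1 - stats["text_chars"] / stats["total_chars"])))
```

Five letters of text end up wrapped in a couple hundred characters of tags, which gives a feel for how a single filing can balloon to tens of megabytes.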

I remember writing HTML in the era where the trend was towards making sure your HTML passed strict XML validation, like these documents do; there were hundreds of tools that would feed your HTML thru a validator and scold you for your HTML sins. At some point things trended away from XML generally, and away from HTML validation specifically, but it’s kind of beautiful that the backstop of a powerful government agency has left an oasis where XML validation and truly semantic XML elements still reign. Matt Levine recently wrote about how there are even companies whose entire business is to “EDGARize” companies' HTML:

If you are an SEC lawyer — if you look at this system from a reasonable level of abstraction — then the way Edgar works is (1) a company sends its material nonpublic information to Edgar and (2) Edgar immediately makes it public. Information passes through the system for an infinitesimal time; the filing system does not hold on to a bunch of secret corporate information in its servers. Edgar is a publication interface, not a database of company secrets.
But of course if you are an Edgar typesetter, you do not look at this system from that level of abstraction. You deal with the actual system, which operates at a not-at-all-infinitesimal time scale. You are like “man, this stuff sure takes a long time to typeset.” And there are trades there.

It’s absolutely someone’s dream to run a business helping companies to write XML-compliant HTML, with a sideline making millions from illegal stock market information you’ve read in the XML-compliant HTML!

1 comment

Bennett on Jul 05, 2025 at 11:38 AM: I'm going to show this to Kate, I know that (at least until recently) they were using XML validator tools to upload archival finding aids to the Online Archive of California which all the institutions publish to.


Jeff Sisson's blog (email me)
