Utopian XHTML at the SEC

05 Jul 2025

I’ve been working on a project that was riffed into existence by friends John, Olov, and Nora at Grand Morelos: it’s called Labor Leverage. It’s loosely based on workshops John and our coworker Stacy Cowley have run previously, where they share with other unions how to research employer finances in order to build power at the bargaining table. Big public companies are required to report data to the SEC as part of that institution’s purpose in stabilizing public markets, and it turns out many of the same data are useful to reveal how your employer and its executives actually spend its money. Profits, stock buybacks, executive salaries and stock compensation, all available in SEC data made available to the public!

The format of this data is interesting: companies upload their data to the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system. The API is beautiful, and is barely even an API at all: it’s “just” HTML files, uploaded at predictable paths. Specifically, it’s XHTML, the older and crankier subset of today’s HTML that was meant to validate against a strict XML syntax. Companies embed snippets of another XML flavor, iXBRL (eXtensible Business Reporting Language) inside of the HTML, so that you can have a report that looks like a webpage with tabular profit and loss tables, buried within which are machine-readable XML tags describing key accounting figures as structured data.

Many of the tags encode dry, Generally Accepted Accounting Principles (GAAP) data like us-gaap:NetIncomeLoss, us-gaap:StockRepurchasedDuringPeriodValue, or us-gaap:CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents. Maybe more interesting are the data that companies are required to report by the government, but which don’t use machine-readible XML tags because they are not legally required to do so. These include data that’s absolutely relevant to unions like the one I belong to, such as the number of total employees, or the CEO Pay Ratio disclosures required by law following the 2008 financial crisis. These data are written as prose in the report, and so extracting them from the text requires the use of cruder text matching techniques. It’s tempting to picture a regulatory regime where financial data relevant to workers also had legally-mandated, machine-readable XML tags!

Compared to the freewheeling, single page app oriented ecosystem of today’s HTML5, reading iXBRL-flavored XHTML is like discovering an exotic dialect of a language that’s been cut off from its parent for so long that it’s had to invent grammars and styles of its own. <FONT> is everywhere. All elements have 20-30 inline style definitions in lieu of CSS. When I was scraping this data, I was surprised to find that that all the 10,000 or so companies clock in with 29GB worth of HTML. The reason the dataset is that large is because each document tends to have extreme HTML energy: the biggest document—a DEF-14A filing from Vaso Corporation— weighs in at 36MB, in part because there are long sections where every single character is wrapped in a distinct font tag (!).

I remember writing HTML in the era where the trend was towards making sure your HTML passed strict XML validation, like these documents do; there were hundreds of tools that would feed your HTML thru a validator and scold you for your HTML sins. At some point things trended away from XML generally, and away from HTML validation specifically, but it’s kind of beautiful that the backstop of a powerful government agency has left an oasis where XML validation and truly semantic XML elements still reign. Matt Levine recently wrote about how there are even companies whose entire business is to “EDGARize” companies' HTML:

If you are an SEC lawyer — if you look at this system from a reasonable level of abstraction — then the way Edgar works is (1) a company sends its material nonpublic information to Edgar and (2) Edgar immediately makes it public. Information passes through the system for an infinitesimal time; the filing system does not hold on to a bunch of secret corporate information in its servers. Edgar is a publication interface, not a database of company secrets.
But of course if you are an Edgar typesetter, you do not look at this system from that level of abstraction. You deal with the actual system, which operates at a not-at-all-infinitesimal time scale. You are like “man, this stuff sure takes a long time to typeset.” And there are trades there.

It’s absolutely someone’s dream to run a business helping companies to write XML-compliant HTML, with a sideline making millions from illegal stock market information you’ve read in the XML-compliant HTML!

1 comment

Re-uploading Unix Time

30 May 2025

In 2016 I was inspired to run a series of DIY workshops on the command line and the Unix philosophy generally. I’d been chatting with a handful of friends about interest in learning more about the history of Unix and non-graphical interfaces. I chose the name “Unix Time” as a tribute to the time convention of the same name, and we met for several months in Red Hook, Brooklyn at Beam Center.

I recall these workshops having been pretty fun: a nice group with both technical and non-technical people showed up, and we worked thru some command line and Unix basics (how do I execute a command? what does it look like to setup my shell?) all the way thru to headier parts of the Unix ecosystem like wall, talk, and telnet.

I can’t remember if this was Allen or Sam’s idea but at some point the group co-authored a bash script to produce On Kawara’s Date Paintings from the command line! Here’s that script, for perpituity:

i=0
while read d; do
  i=$((i+1))
  echo $d | convert \
    -size 800x500 \
    xc:black \
    -pointsize 60 \
    -fill white \
    -gravity center \
    -font /usr/share/fonts/truetype/futura/futura.ttf \
    -draw "text 0,0 '$(cat - | tr '[:lower:]' '[:upper:]')'" \
    -paint 2 jpg:- > "kawara-${i}.jpg"
done < "${1:-/dev/stdin}"
convert -delay 120 -loop 0 kawara-*.jpg gif:-

Jason mentioned Unix Time at his birthday party recently, and I was inspired to re-upload the syllabus for the Unix Time workshops at unixtime.bigboy.us. This syllabus used to be on a Unix server I’d rented specifically for Unix Time — everyone had their own proper Unix user accounts — but at some point in our telnet explorations, I exposed a telnet server that was immediately hacked and so I had to take the server down.

We continued to meet well beyond this syllabus, although at some point things petered out. I’m hoping this syllabus — while not a complete artifact showing all of the cool things the people in the workshops did — can at least act as a signpost for people with similar interests in the future.

Comments

Updating volume pops pro

08 Mar 2025

In 2011, which in retrospect was a gentler computer era, I made a piece of software called Volume Pops Pro which installed custom sounds over the system default “pop” sound you get when you increase or decrease the volume on a Mac computer. There was a Funkmaster Flex inspired explosion sound, and an airhorn sound. It was packaged as a package installer that put the files in the correct place. Fun!

At some point in the past 14 years, the installer stopped working. The MacOS operating system has been hardened: there are parts of the file system protected by something called “system integrity protection (SIP)" which prevents making changes to parts of a Mac’s software, including this little pop sound. They also changed the default setting so that instead of playing a feedback sound by default, it’s inverted and you have to deliberately hold down the shift key to hear a feedback sound. Less fun!

My buddy Jivko has asked year after year for a version of Volume Pops Pro that would work again on modern computers. I recently updated the software to make it work. It was a pain: I had to write a MacOS native app, which requires paying Apple a yearly licensing fee in order to sign applications for direct distribution so that they aren’t flagged as potential malware. Apple maintains an under-documented API for listening to global media key events like volume up and down, and so the app listens to these and plays a custom sound in response. There is no way to suppress the system default feedback you get when pressing the “shift” key, though, so I had to consecrate a new key (control) for temporarily muting the custom Volume Pops sound while changing the volume (this is handy in parties or while playing music loud, where you sometimes don’t want to blast an airhorn). There’s definitely a computing metaphor somewhere in here about how something that used to be complicated (put a file on your computer) has now become quite complicated (install a custom app that listens and overrides keypresses, and pay a huge corporation for the privilege).

Comments

blob-15, blob-16 and blob-17.local

15 Oct 2024

ZeroConf is a protocol that lets computers find each other automatically on a local network. In the typical configuration, if your computer is named “Jeff’s Mac”, it’s Zeroconf that makes your computer available for networking using the hostname jeffs-mac.local.

I have what I like to think is a normal number of Raspberry Pi computers around the house, doing Raspberry Pi type things. The Raspberry Pi runs Linux as its operating system, and on Linux the Zeroconf protocol is implemented using a software package called Avahi. Zeroconf makes it really easy to work with these tiny Linux computers: after you’ve picked a name (“blob” or “bouncingbeautiful”), you can then easily login to them later using a memorable hostname like blob.local or bouncingbeautiful.local.

But there’s a bug in Avahi, which sometimes leads to a conflict in how these computers are addressable: if your computer starts up as blob.local, an hour or so later it starts to think that someone else has claimed blob.local and it will automatically add a number to the name and adopt blob-1.local instead. Let this process run wild and pretty soon your computers are describing themselves as blob-17.local, blob-117.local or blob-550.local.

I’m sure this bug will be fixed eventually, but for now I’m enjoying this as a latter day computing equivalent of the old yellow pages naming strategy, where you’d name your business “AAAAA Business Name” to try to game the alphabetized sorting used in printed phone books. Here’s a nice description of that phenomenon, in the LA Times from 1993:

Bob Johnson, a listing service product manager for Pacific Bell, said that when companies wage war in the Yellow Pages to be the first listing in each specific category, the cumulative result is pages of A’s in the white pages.

Bob goes on to describe how their particular sorting algorithm ranked separate letters (“A A A”) higher than combined letters (“AAA”), for reasons which go unmentioned at the time. There’s a great quote from the business owner of “A Aaaaa Bcalvy 24 Hour Carpet Fire Carpet Water Damage Specialist”, who admits “it’s to get as near to the top of the category as possible”. Long live the power of suffixes and prefixes like these! We should all resist the tyranny of alphabetical sorting by adding some letters or numbers to the beginning or the end of our identifiers.

1 comment

weather station update

"it's really coming down out there" bot update

14 Oct 2024

The Big Boy Weather station records data forever: observations for temperature, humidity, rainfall, and wind speed, are all recorded with the original time of observation. These observations are then rolled up into an almanac view, where it’s possible to note exceptional weather events within a day, or a month, or a year.

A timeless way to mark exceptional amounts of rainfall is to say “it’s really coming down out there”. I was first tuned into this phrase via a 2010’s-era Providence-adjacent Twitter riff, and immediately wanted to make a Twitter bot that could say “it’s really coming down out there” whenever it was raining really hard here. I made a bot that did exactly that, tweeting “it’s really coming down out there” whenever the rain rate exceeds 0.33 inches/minute, which is a lot of rain in Queens, NY, where Big Boy Weather is located.

No usernames, no passwords

When the ownership of Twitter changed hands in 2022, bots like these were some of the first casualties: unless you forked over lots of cash, Twitter no longer wanted you to use its API’s. This was a loss, for sure, but depending on an API for a project like this had already started to feel precarious to me. Even outside of Twitter, I’d begun dread all of the paraphernalia involved in “Create an Account”-type businesses with public API’s: you have to run the gauntlet of dealing with the username, the password, the 2 factor authentication, the terms of service checkboxes, the OAuth2 credentials, the marketing email opt-outs, the rate limiting, the ticketed support interface, the API documentation… Some of this is just what it’s like to use the internet now, where doing anything requires an account. But it shouldn’t have to be this way! And relying on someone else’s API for this type of project in particular started to feel faintly embarrassing to me, sort of like broadcasting to the world: “this project will molder and fall apart some day.”

So I took the moment of Twitter exodus as a hiatus from writing software that relied on corporate API’s like these, generally. But as some the Twitter alternatives developed in the shadow of Twitter’s implosion, I started paying particular attention to BlueSky and its AT Protocol federated protocol: some of the people involved had previously contributed to an earlier era of p2p protocols like Dat and Scuttlebutt, networking protocols whose orientation towards custody of data aligned with my own. And when it became clear that most my own Twitter sphere had chosen to move to BlueSky and setup house there, I decided to try to see what it’d look like to revive the “it’s really coming down out there” bot on a protocol that didn’t demand all of the trappings of a “Create an Account” worldview.

Rebuilding the bot on ATProto

On BlueSky, you publish tweets yourself by hosting the data in a format described by the ATProtocol, from your own server. This service is called a Personal Data Store (PDS). Setting up a PDS requires some comfort with operating a server generally, but isn’t so hard…you:

Run a bash script to configure and run a Dockerized node.js implementation of the PDS part of the protocol
Run a command line command to create a ATProto/BlueSky account locally: sudo pdsadmin create-invite-code
Run a command line command to request that the main BlueSky network “crawl” your PDS: sudo pdsadmin request-crawl

…at which point you have an ATProto/BlueSky account that’s 100% local to your server, but crawled and follow-able by the broader network of people on BlueSky. The only account hosted on my PDS is the weather.bigboy.us bot account, which the Big Boy Weather station now posts to whenever “it’s really coming down out there”.

Connecting this bot to the BlueSky network is no guarantee that the network or company that currently hosts it will be available forever; BlueSky is a venture capital funded company, with an unclear path to profitability. But it’s comforting to know that this bot can keep posting “it’s really coming down out there” on my server forever, with the data living on even if the broader network shuts down.

Comments

THE REGEX KING

Jeff Sisson's blog (email me)

Utopian XHTML at the SEC

Re-uploading Unix Time

Updating volume pops pro

blob-15, blob-16 and blob-17.local

"it's really coming down out there" bot update

No usernames, no passwords

Rebuilding the bot on ATProto

More Regex King posts:

Notes on email inference using llamafile

Josh TV

Same temp map

Big Boy Bike Directions

Flip phone review

What is Haberman?

sunrise sunset photos

New York State election data quest

Origins of big boy chat

elvis-tools

present weather sensor

hello world

working hard or hardly working?

non-lethal sentences

Good random numbers

NYC LLC

yes(1)

Legalese

Rip Van Winkle logout

Allen's addendum to International Art English

Go to any level / jump higher / stay bigger / live forever

Speech Acts

Selections from O'Reilly's Regular Expressions Pt. 1