THE REGEX KING

Jeff Sisson's blog (email me)

blob-15, blob-16 and blob-17.local

15 Oct 2024

Zeroconf is a protocol that lets computers find each other automatically on a local network. In the typical configuration, if your computer is named “Jeff’s Mac”, it’s Zeroconf that makes your computer available for networking using the hostname jeffs-mac.local.

I have what I like to think is a normal number of Raspberry Pi computers around the house, doing Raspberry Pi type things. The Raspberry Pi runs Linux as its operating system, and on Linux the Zeroconf protocol is implemented using a software package called Avahi. Zeroconf makes it really easy to work with these tiny Linux computers: after you’ve picked a name (“blob” or “bouncingbeautiful”), you can then easily log in to them later using a memorable hostname like blob.local or bouncingbeautiful.local.

But there’s a bug in Avahi, which sometimes leads to a conflict in how these computers are addressable: if your computer starts up as blob.local, an hour or so later it starts to think that someone else has claimed blob.local and it will automatically add a number to the name and adopt blob-1.local instead. Let this process run wild and pretty soon your computers are describing themselves as blob-17.local, blob-117.local or blob-550.local.

I’m sure this bug will be fixed eventually, but for now I’m enjoying this as a latter-day computing equivalent of the old yellow pages naming strategy, where you’d name your business “AAAAA Business Name” to try to game the alphabetized sorting used in printed phone books. Here’s a nice description of that phenomenon, in the LA Times from 1993:

Bob Johnson, a listing service product manager for Pacific Bell, said that when companies wage war in the Yellow Pages to be the first listing in each specific category, the cumulative result is pages of A’s in the white pages.

Bob goes on to describe how their particular sorting algorithm ranked separate letters (“A A A”) higher than combined letters (“AAA”), for reasons the article leaves unexplained. There’s a great quote from the business owner of “A Aaaaa Bcalvy 24 Hour Carpet Fire Carpet Water Damage Specialist”, who admits “it’s to get as near to the top of the category as possible”. Long live the power of suffixes and prefixes like these! We should all resist the tyranny of alphabetical sorting by adding some letters or numbers to the beginning or the end of our identifiers.


"it's really coming down out there" bot update

14 Oct 2024

The Big Boy Weather station records data forever: observations for temperature, humidity, rainfall, and wind speed are all recorded with the original time of observation. These observations are then rolled up into an almanac view, where it’s possible to note exceptional weather events within a day, or a month, or a year.

A timeless way to mark exceptional amounts of rainfall is to say “it’s really coming down out there”. I was first tuned into this phrase via a 2010s-era Providence-adjacent Twitter riff, and immediately wanted to make a Twitter bot that could say “it’s really coming down out there” whenever it was raining really hard here. I made a bot that did exactly that, tweeting “it’s really coming down out there” whenever the rain rate exceeded 0.33 inches/minute, which is a lot of rain in Queens, NY, where Big Boy Weather is located.

No usernames, no passwords

When the ownership of Twitter changed hands in 2022, bots like these were some of the first casualties: unless you forked over lots of cash, Twitter no longer wanted you to use its APIs. This was a loss, for sure, but depending on an API for a project like this had already started to feel precarious to me. Even outside of Twitter, I’d begun to dread all of the paraphernalia involved in “Create an Account”-type businesses with public APIs: you have to run the gauntlet of dealing with the username, the password, the two-factor authentication, the terms of service checkboxes, the OAuth2 credentials, the marketing email opt-outs, the rate limiting, the ticketed support interface, the API documentation… Some of this is just what it’s like to use the internet now, where doing anything requires an account. But it shouldn’t have to be this way! And relying on someone else’s API for this type of project in particular started to feel faintly embarrassing to me, sort of like broadcasting to the world: “this project will molder and fall apart some day.”

So I took the moment of Twitter exodus as a hiatus from writing software that relied on corporate APIs like these. But as some of the Twitter alternatives developed in the shadow of Twitter’s implosion, I started paying particular attention to BlueSky and its federated AT Protocol: some of the people involved had previously contributed to an earlier era of p2p protocols like Dat and Scuttlebutt, networking protocols whose orientation towards custody of data aligned with my own. And when it became clear that most of my own Twitter sphere had chosen to move to BlueSky and set up house there, I decided to see what it’d look like to revive the “it’s really coming down out there” bot on a protocol that didn’t demand all of the trappings of a “Create an Account” worldview.

Rebuilding the bot on ATProto

On BlueSky, you publish tweets yourself by hosting the data, in a format described by the AT Protocol, from your own server. This service is called a Personal Data Store (PDS). Setting up a PDS requires some comfort with operating a server generally, but isn’t so hard…you:

  • Run a bash script to configure and run a Dockerized node.js implementation of the PDS part of the protocol
  • Run a command line command to create an ATProto/BlueSky account locally: sudo pdsadmin create-invite-code
  • Run a command line command to request that the main BlueSky network “crawl” your PDS: sudo pdsadmin request-crawl

…at which point you have an ATProto/BlueSky account that’s 100% local to your server, but crawled and follow-able by the broader network of people on BlueSky. The only account hosted on my PDS is the weather.bigboy.us bot account, which the Big Boy Weather station now posts to whenever “it’s really coming down out there”.
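
For the curious, the bot’s posting step boils down to two HTTP calls against the PDS. Here’s a minimal Go sketch — the handle, app password, and rain-rate plumbing are illustrative stand-ins rather than the bot’s real configuration, though the endpoint names come from the standard ATProto XRPC API:

// A minimal sketch of how a bot might post to its own PDS, using the
// standard ATProto XRPC endpoints. The PDS URL, handle, app password,
// and rain-rate threshold here are illustrative, not the real bot's setup.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const pds = "https://weather.bigboy.us" // hypothetical PDS base URL

func postJSON(url, token string, body, out any) error {
	buf, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", url, bytes.NewReader(buf))
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	rainRate := 0.4 // inches/minute, as measured by the weather station
	if rainRate <= 0.33 {
		return // not really coming down out there
	}
	// 1. create a session (a login) to get an access token and the account DID
	var session struct {
		AccessJwt string `json:"accessJwt"`
		Did       string `json:"did"`
	}
	login := map[string]string{"identifier": "weather.bigboy.us", "password": "app-password-here"}
	if err := postJSON(pds+"/xrpc/com.atproto.server.createSession", "", login, &session); err != nil {
		panic(err)
	}
	// 2. write a post record into the account's own repo
	record := map[string]any{
		"repo":       session.Did,
		"collection": "app.bsky.feed.post",
		"record": map[string]any{
			"$type":     "app.bsky.feed.post",
			"text":      "it's really coming down out there",
			"createdAt": time.Now().UTC().Format(time.RFC3339),
		},
	}
	var created struct {
		Uri string `json:"uri"`
	}
	if err := postJSON(pds+"/xrpc/com.atproto.repo.createRecord", session.AccessJwt, record, &created); err != nil {
		panic(err)
	}
	fmt.Println("posted:", created.Uri)
}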

Connecting this bot to the BlueSky network is no guarantee that the network or company that currently hosts it will be available forever; BlueSky is a venture capital funded company, with an unclear path to profitability. But it’s comforting to know that this bot can keep posting “it’s really coming down out there” on my server forever, with the data living on even if the broader network shuts down.


Notes on email inference using llamafile

21 Dec 2023

I’ve been avoiding learning deeply about large language models. I’m not totally sure why. It’s at least in part for the same reason other people are cautious about them: they seem bad for the environment, they’re going to pollute the delicate ecosystem of freely authored HTML, scammy people are interested in them. I think I also have a more specific reason I’ve stayed away: they’re not quite free, and haven’t been optimized to be freely runnable on anyone’s computer. There’s something about “you have to make calls to someone else’s paid API” that has actively repelled any interest I might have had in digging deeper.

It was from this vantage point that I approached this blog post about llamafile — a project which aims to make running a large language model on most computers really easy — with interest. It repackages a few of the more freely licensed LLMs in the “llama.cpp” family, using the cosmopolitan libc technique for making a single binary executable runnable on many different computer architectures. What you get is a small (~4GB) server that runs on your computer and presents a vanilla HTML interface where you can chat with the large language model:

llavafile interface

Crucially, none of this requires the internet: interactions with the model run locally on your computer. For whatever reason it was this distinction that finally freed my mind to wander a bit…if I can try out an LLM without sending my data to someone else, or without paying someone else, or without slowly sucking some far-flung water aquifer dry, maybe it’d feel possible to do something interesting with it….

Is it possible to talk to my email??

I’ve had an email address since 2002, and have kept most of my emails since then. I don’t really spend time with my deep email archive. I’m mostly sending and receiving emails from the past month, at most. But I’ll occasionally try to remember an old link, or a place, or a story, and find email search to be wanting. Often the very simple reason email search doesn’t work is that my memory remembers something worded one way, but it was worded a different way in an email, and this type of mismatch breaks the fuzzy search logic most email apps use.

I would never in a million years submit any of my emails to a corporate large language model, but running an LLM locally presented an opportunity for seeing how the promise of “private large language models” worked in practice, using my local email archive. Like what if I could talk to my email and remember some place or thing I emailed someone about years ago? Or find a timeless url someone had once sent me? Or find some specific story someone told me once, I think? It’s tempting to picture a large database like “every email I’ve ever sent or received”, and imagine there are gems buried somewhere in there, if I could only find them.

I was specifically inspired by this blog post about “Retrieval Augmented Generation in Go” by Eli Bendersky, which describes “retrieval augmented generation”, a technique where you ask a large language model a question, but augment the question with extra context: snippets of text from some large corpus that are semantically similar to the question you asked. I wanted to apply this technique to my local database of emails, so I could ask questions against my archive of emails.
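
The “augmentation” part is simpler than it sounds: you paste the retrieved snippets into the prompt alongside the question. A toy sketch in Go (the prompt wording here is mine, not from Eli’s post):

// A toy sketch of the "augmentation" step in retrieval augmented generation:
// the question gets wrapped in a prompt together with whatever snippets the
// similarity search returned.
package main

import (
	"fmt"
	"strings"
)

func buildPrompt(question string, snippets []string) string {
	return "Answer the question using only the context below.\n\n" +
		"Context:\n" + strings.Join(snippets, "\n---\n") + "\n\n" +
		"Question: " + question + "\nAnswer:"
}

func main() {
	fmt.Println(buildPrompt("where did i live in france",
		[]string{"…snippet from an old email…", "…another similar snippet…"}))
}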

Representing text as vectors

One innovation in large language models is that text can be converted into a mathematical representation called a “vector”, which is a list of floating point numbers with a fixed size. So a given word “hello” looks like this as a vector:

[0.026315663, -0.05107676, 0.052759565, -0.03678608, -0.057748064, 0.033566643, -0.02589281, -0.002132243, -0.028607314, 0.012253743, -0.008096664, 0.001494693, 0.0365746, 0.03807026, 0.009833517, 0.0067754393, -0.010480829, 0.022064133, 0.020115668, -0.037109215, 0.049926486, -0.036568295, 0.0053705918, 0.031117717, -0.032250315, -0.052203, -0.025519572, -0.020293564, -0.033220563, 0.023608679, -0.006456362, -0.004586842, 0.010010897, -0.04201805, 0.015593706, -0.03028678, -0.043785904, -0.03974351, 0.0014129126, 0.047360025, 0.017966205, 0.012411393, -0.015565804, 0.046122417, 0.05755795, 0.018097928, -0.015544698, -0.014457393, 0.0019716504, -0.037025385, 0.034752447, -0.040650655, 0.043754783, -0.00097598345, -0.035391726, 0.0033253669, 0.035139333, 0.024327567, -0.0053036534, 0.00032466973, 0.021560345, -0.0046450747, 0.036632985, -0.04003288, 0.027276658, -0.034950882, 0.027737923, 0.03640247, 0.038598653, 0.006711874, -0.052254688, -0.06056385, 0.06397524, 0.05018992, 0.03146692, -0.03179005, 0.0065816822, 0.031681385, 0.048647005, 0.03895677, -0.05227646, -0.018797494, -0.024809726, -0.034158837, -0.0024025394, -0.008448369, 0.023889156, -0.014096949, 0.053465273, 0.031300355, 0.002865441, -0.005450165, 0.050935287, 0.016651286, -0.01608125, -0.04010522, -0.028432064, 0.03995945, 0.011018825, -0.028760085, -0.013287061, -0.036134444, -0.007604672, 0.02963232, 0.00946132, -0.039779358, -0.0065998007, -0.006972531, -0.06255624, -0.028554522, -0.028519401, -0.046248812, -0.042899422, -0.012204772, -0.046020266, 0.04600531, 0.021571305, -0.036364153, 0.033461068, 0.041704237, 0.05259111, 0.043571096, -0.04007029, -0.034076557, -0.03011038, 0.008948071, -0.04813023, -0.044153288, 0.03518758, 0.056217145, 0.012336162, -0.032382835, 0.019346481, 0.014965278, 0.046533752, 0.046599004, -0.02928571, -0.02224698, -0.010510442, 0.042641334, -0.021578278, -0.040050805, 0.045797728, 0.02277755, 0.049083006, -0.026401268, -0.024383407, -0.025588537, -0.049048226, -0.0531303, -0.042156238, -0.012985709, -0.010362753, -0.018121995, 0.007163994, -0.043389708, 0.023375297, -0.03768581, -0.017458197, 0.050082564, 0.0060853222, 0.027943356, -0.024461797, 0.031332087, 0.037615683, -0.013563662, 0.02029403, -0.014864157, -0.029464258, 0.04442369, -0.029298533, 0.0302472, 0.04715714, 0.022353636, 0.043481253, -0.033672825, 0.0474069, -0.05228587, -0.002790663, 0.024341144, 0.025120774, 0.036285434, -0.00346869, -0.055576056, -0.07371648, 0.03767376, 0.041797392, -0.027872743, -0.030338455, -0.071010545, 0.0006263308, -0.003296338, -0.05668749, 0.041626733, -0.02344105, -0.014074221, -0.048079737, -0.016580561, -0.006270523, 0.031279285, 0.033357352, 0.0117028225, -0.006009747, -0.023284834, -0.012092737, 0.06094602, 0.013674777, 0.003260308, -0.014270174, 0.036602862, -0.004527294, 0.021936249, 0.02703726, -0.006649984, -0.046160154, 0.0054655443, 0.027177623, -0.011909271, -0.0005080942, 0.056488566, -0.037823215, 0.0010502205, 0.028413123, -0.030004766, 0.0102585675, -0.031900134, -0.011743591, 0.0114091, -0.026823547, -0.0132994205, 0.007096897, 0.0055736704, -0.020466903, 0.0010579303, -0.010763015, -0.025727881, 0.03693008, -0.010247399, 0.016443394, 0.032162197, -0.00322929, 0.025612716, -0.0010617772, -0.0045681344, -0.005656379, 0.0038616783, 0.02907526, 0.015015733, 0.046991542, 0.048260894, 0.0037447503, 0.028981335, -0.008149285, -0.013788863, -0.023555005, 0.010223529, 0.02192332, -0.0451934, -0.062838726, 0.026128672, 0.02289665, -0.030275302, -0.063174084, 0.0022732366, -0.022915745, 
-0.032914564, 0.016041432, -0.012015501, 0.07272382, -0.024313914, 0.028003944, 0.03830679, 0.017905323, -0.04439989, -0.028542832, -0.04374546, -0.029714901, -0.013198032, -0.0040778373, -0.015327487, 0.021371499, -0.0025264495, 0.041654684, 0.03024055, -0.014477172, -0.005203952, -0.017598575, 0.025533067, 0.027074886, 0.035987914, -0.029328384, -0.019238349, 0.060330536, -0.01350854, -0.022097755, -0.01081782, -0.01862954, 0.024826696, 0.05154685, 0.038304742, 0.050340444, 0.017058605, -0.07946641, -0.04604151, -0.026408235, -0.03904443, 0.030384433, -0.07985361, 0.061564326, 0.012700621, -0.012354287, -0.009344623, -0.0367299, -0.07239036, -0.033526517, 0.013479105, -0.014741456, 0.015465579, 0.006340796, -0.041340258, 0.044028617, -0.032779563, -0.04694552, -0.039798666, -0.008055787, 0.0022759913, -0.043846805, -0.005985449, -0.009902096, -0.0156177925, -0.01312619, 0.006933162, 0.056553904, 0.04710293, 0.009497505, -0.020777516, -0.0327266, -0.025073212, 0.012446564, 0.039447058, 0.06872826, 0.03621971, -0.023626817, -0.03655862, 0.013034176, 0.03753551, 0.05189472, -0.0030686557, 0.01195667, 0.045128383, 0.028401954, 0.009839714, 0.010051032, -0.03908404, -0.04388602, -0.013252326, 0.053872455, -0.021344408, 0.02033162, 0.042927306, 0.040674552, -0.010778672, 0.010513371, -0.0024791993, -0.007599492, -0.03129863, 0.033941735, -0.03160518, 0.012811407, 0.03917931, 0.00887006, 0.036761038, -0.0016270209, -0.02900771, -0.020914309, -0.022955302, 0.013110533, 0.037405018, 0.042493112, 0.0029953097, -0.0005984587, 0.025215842, 0.0019286971, 0.0008111912, -0.06537792, -0.02044328, -0.005869833, -0.006807886, -0.0034591414, -0.05074447, -0.017459536, -0.03532829, 0.027767923, -0.026316686, 0.0024302586, -0.037411038, 0.0615568, -0.028561596, -0.005362948, 0.01471921, 0.020184528, 0.02653486, 0.041428342, -0.007413157, -0.04561999, -0.017273037, 0.047322955, 0.051810987, 0.030876957, -0.012946942, 0.0010372113, 0.033227976, 0.0064514694, 0.033085752, -0.013396054, 0.048426185, 0.0075015305, 0.022221081, -0.033596326, 0.0069293217, -0.023342313, -0.012286653, 0.0102367345, -0.0062289997, 0.0281104, -0.022718213, -0.016924072, -0.019212652, -0.001185613, -0.029464584, 0.044396423, -0.0324116, -0.014398765, -0.025774622, 0.055743262, -0.027121518, 0.020674873, -0.00017766615, 0.03619264, 0.019520363, 0.022839574, 0.047789592, 0.005764716, -0.03447098, 0.022432338, -0.043516744, -0.037231553, -0.025048206, -0.009967526, 0.037328403, 0.035044707, -0.004535913, 0.038086124, -0.034116786, -0.046980895, -0.03524534, -0.02570679, 0.035474673, -0.019355258, 0.013432988, -0.028117996, -0.041342087, 0.01409986, -0.03525537, -0.038160156, -0.052420918, 0.01810449, 0.035464697, -0.025294058, 0.010007306, -0.025996357, -0.06924902, 0.028132096, -0.00079841854, -0.013501817, 0.046770174, 0.07517163, 0.037037298, 0.025366541, 0.040248822, -0.028081292, -0.028332917, 0.036714826, 0.007687548, -0.028901538, 0.03839228, -0.027672466, -0.0041911914, 0.048854157, -0.01784227, -0.0155344615, 0.04750416, 0.04405297, 0.024017757, 0.024709102, -0.024437224, -0.03625656, 0.03626268, -0.0119398665, -0.023228755, 0.042166322, -0.017202552, 0.010498574, 0.030785644, -0.042424165, 0.015511501, -0.04409854, 0.021100117, -0.002790288, 0.004432084, -0.014360784, -0.037868485, -0.040606778, 0.0028607904, 0.039088912, 0.032936096, 0.03599776, -0.017276917, 0.020413958, -0.009697305, -0.0479381, -0.02891013, 0.03403221, -0.024198353, -0.03161053, -0.003828878, 0.014621108, 0.06415569, -0.01566947, -0.024424698, 0.010320143, 
0.029164797, -0.037783336, 0.033035688, -0.023604764, 0.0006745482, -0.024393523, -0.023095502, -0.018396921, 0.019055322, -0.011880366, 0.023322131, 0.056035183, 0.00030634843, -0.020955907, -0.049658146, -0.03962187, 0.022502886, 0.036499042, -0.029692655, 0.032915078, -0.028775077, -0.011393002, -0.005315213, -0.049632583, 0.070666976, -0.07139168, 0.009008762, 0.019913368, -0.025216734, 0.016907237, 0.033562236, 0.03401224, -0.008816014, -0.037642844, 0.068338215, -0.015326151, 0.024804862, -0.03981009, 0.021049043, -0.016449336, -0.019830056, 0.043424606, -0.010613228, -0.03317898, 0.022078512, 0.008132583, 0.036657564, 0.021471148, -0.04202048, 0.010479801, -0.060896814, 0.0036573336, -0.012137062, -0.009369492, -0.024691008, -0.028375078, -0.03712006, 0.024363784, 0.0619363, 0.0012520632, 0.020621145, -0.030255327, -0.030828038, 0.047324497, 0.033152834, 0.037796646, -0.01434374, -0.066324085, 0.022530057, 0.04724558, -0.018717038, 0.02079031, -0.042318594, 0.012404005, 0.003054884, 0.040080458, -0.007734346, 0.00966154, 0.01965865, -0.02969571, 0.048648365, 0.030942103, 0.03517304, -0.044960428, 0.023147801, -0.013064005, 0.012933487, 0.031137485, 0.043248158, -0.039774954, 0.053235162, 0.033253767, 0.04959841, -0.026097752, -0.013117914, 0.02765747, -0.04861631, 0.042001173, 0.035988443, 0.019028643, -0.0063236253, -0.03546606, 0.05249698, 0.023819618, -0.029397534, 0.0014730253, -0.000116883064, 0.04589052, 0.07982128, 0.042475965, 0.02714497, -0.011290014, 0.048732307, -0.007990668, 0.036892712, -0.05074458, -0.03419913, 0.046826247, -0.0351593, -0.017725315, 0.02825849, -0.02061025, 0.010495187, 0.029973673, 0.013354483, 0.04428554, 0.0059044575, 0.040259574, 0.024635406, 0.056278225, 0.029261485, 0.021040283, -0.02957053, 0.015028589, 0.09915923, -0.006757007, 0.021263221, -0.022744874, 0.03037738, 0.015824845, -0.039941747, 0.024193197, -0.025102578, 0.031861637, 0.04820494, 0.056952294, 0.015798865, 0.012578128, -0.034587458, 0.051569622, 0.036841784, -0.029768696, -0.037315454, -0.004181349, 0.03994207, -0.012483087, -0.019211547, -0.019353691, 0.018520227, 0.00461553, -0.008341581, -0.05549858, 0.05766917, 0.05097321, 0.00880379, 0.013997554, -0.06590693, -0.01869569, -0.042314664, -0.018904256, -0.0055119256, 0.03792496, 0.036814462, 0.013308163, 0.036309067, 0.020966355, -0.0044715456, -0.051457252, -0.0029825429, -0.014860995, 0.0038679296, -0.037870258, 0.032946188, 0.022204902, 0.031311534, -0.0159217, -0.027177777, 0.019132279, -0.0015548733, 0.0062460816, 0.024122085, 0.0013738354, -0.015215801, -0.031390846, -0.008035339, 0.020526154, 0.006488116, -0.0024450996, -0.017090369, -0.039943922, -0.01950265, 0.032263108, 0.035478763, -0.033199288, 0.026933322, -0.027106462, -0.02065646, -0.007509963, -0.050557084, -0.03340465, -0.0047946647, 0.015502574, -0.025161006, -0.0077433935, -0.025955958, 0.0020085182, -0.021800976, -0.009508331, 0.033535887, -0.047463566, -0.058905426, 0.028794395, -0.0077173035, -0.042501763, -0.024379179, 0.017200196, -0.0070375046, 0.019198136, -0.012132133, 0.03652421, -0.039759845, 0.04861978, 0.0030262715, 0.042866085, 0.041402888, 0.017450964, 0.009089696, 0.0028635971, -0.043624565, -0.028436044, 0.014845563, 0.007810105, 0.040422868, -0.01659905, 0.014551624, 0.03692245, 0.008013322, 0.027947398, -0.005875631, -0.0029010554, 0.0076159886, -0.04006688, -0.006206228, 0.0038399713, 0.0630469, 0.035773862, 0.031985953, 0.022648549, -0.020068891, 0.016998352, 0.006821056, -0.02639971, -0.023113638, -0.016550884, 0.04542948, -0.04944595, 
4.6349105e-5, -0.030284645, -0.008464625, 0.04505634, -0.0008425875, 0.0018507987, -0.045248747, -0.001249333, -0.027375245, -0.034440503, -0.03445196, -0.016945217, 0.032217544, 0.01201553, -0.011383161, 0.016768109, 0.02209182, 0.04161331, -0.026711816, -0.027969444, 0.013154886, 0.040792376, 0.00037842162, 0.031208977, 0.055764157, -0.041692186, 0.01183059, 0.009995629, 0.011140254, 0.06494206, 0.0007583337, -0.018633584, -0.03988589, -0.06401332, -0.026469348, -0.03703018, -0.009482455, 0.00750478, -0.01196945, 0.0010084544, -0.015276794, -0.028999355, 0.039044295, -0.0015245616, 0.019363733, -0.013175389, 0.020596242, 0.015313282, 0.04776969, 0.03503184, -0.024441065, -0.021466441, 0.03491211, -0.03033822, -0.04221141, 0.043747444, 0.031174233, -0.05234127, -0.00021145339, -0.0108963, -0.02563045, -0.030280393, -0.063621596, -0.0059554386, 0.009598384, 3.800433e-5, -0.011455618, 0.0024069417, 0.034393646, 0.029128842, 0.007318114, 0.051935125, -0.041065566, -0.023579529, -0.015356412, 0.020628927, 0.0016839687, -0.006113899, -0.025948673, 0.011051999, -1.7599392e-5, 0.021779431, 0.021231307, 0.04925588, 0.02865201, -0.03592068, 0.035591897, -0.026523454, 0.009644514, 0.04879437, -0.029754482, -0.030387688, -0.030870467, -0.03533088, -0.02333679, 0.022666639, -0.019431714, -0.036629736, 0.035112843, 0.017431475, -0.017157005, -0.026203807, 0.022084715, -0.012101193, -0.016560372, 0.02747846, -0.036947746, -0.019196276, 0.029935298, -0.05197717, 0.029685955, -0.00030348718, 0.032604396, 0.020966766, -0.044866037, 0.053359862, -0.042657174, -0.0041652545, -0.045802977, 0.013752225, -0.017868387, -0.025728293, 0.034969736, 0.019753583, 0.028519642, -0.025506618, -0.027275596, 0.002548761, -0.021548366, -0.030770132, 0.037810154, 0.039124895, -0.036099177, 0.0067838277, 0.0014933676, 0.03411964, 0.030397482, 0.02907957, -0.013021644, 0.03546133, -0.058428895, -0.028665997, -0.033455126, -0.037742794, -0.0025381332, -0.029671138, 0.027966527, -0.04934853, -0.03034516, 0.02078554, 0.021314679, -0.019340657, 0.008697383, -0.040426604, 0.017037353, -0.009563749, -0.0060880305, 0.026690366, 0.04071305, -0.016738972, 0.0020899752, -0.04395833, 0.0059037167, -0.020659246, -0.055160575, 0.036971394, 0.012827337, 0.023630928, -0.027455963, 0.010689233, -0.020523228, -0.010644282, -0.022099117, -0.05575785, -0.0014715773, 0.045237053, 0.024157247, -0.026763534, 0.004174187, 0.00038428922, -0.036329865, -0.004427296, 0.025025152, 0.04822559, 0.046744928, -0.021798782, -0.031161044, 0.01157757, 0.027121102, 0.013186705, 0.032716304, -0.0059137377, 0.050382566, -0.04728639, -0.030213784, -0.014744704, -0.03136835, -0.008328803, -0.00839062, -0.0036500788, -0.056926843, -0.02807327, -0.01330011, 0.041436, 0.02201358, 0.022166254, -0.03179345, 0.005270372, 0.018509101, 0.014327067, 0.018272892, -0.021296602, -0.03977375, -0.013095145, -0.014545233, -0.009666092, -0.022802576, 0.0005194365, 0.018938834, 0.041110124, 0.046513252, 0.025121529, -0.036493827, 0.04333533, -0.052713536, 0.016992891, 0.017229997]

…where each of the 1024 numbers is one coordinate in a multi-dimensional space. In this case, because the vector “length” is 1024, you can picture “hello” being plotted in a space that has 1024 dimensions rather than two or three. And finding “similar” texts involves doing math to find other texts whose vector coordinates are spatially “nearby”.
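
The usual math for “nearby” is cosine similarity, which compares the angle between two vectors while ignoring their magnitude. A little Go sketch, with tiny 3-dimensional stand-ins for real 1024-dimensional embeddings:

// Cosine similarity: scores range from -1 to 1, where 1 means the two
// vectors point in exactly the same direction (i.e. very similar texts).
package main

import (
	"fmt"
	"math"
)

func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// made-up 3-dimensional stand-ins for real 1024-dimensional embeddings
	hello := []float64{0.2, 0.1, 0.7}
	hi := []float64{0.25, 0.05, 0.68}
	fmt.Printf("similarity: %.3f\n", cosineSimilarity(hello, hi))
}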

How these texts are specifically converted to each of the 1024 coordinates involves math and the training of software models, and for the purposes of this project it is a black box to me. Simon Willison has a good blog post about embeddings that gets into more of the details.

Picking a vector database

A frothy part of the “business” of large language models right now is companies building “vector databases”, which provide various ways of storing vectors created from a corpus of text, so that you can later execute search queries that retrieve similar texts using those vectors. Vector databases are useful for retrieval augmented generation, too: take an input “question”, retrieve some number of similar texts from a vector database, and feed the question + those similar texts as a prompt to the LLM.

I was glad to find that there’s a pgvector extension that adds vector storage and search capabilities to the postgres SQL database. pgvector exposes a new vector(n) column type for storing embeddings (where n is the number of dimensions), and allows you to retrieve similar text by comparing the cosine distance between vectors stored in the database and an input vector:

-- this enables the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- this creates a table which will store a snippet of "content"
-- as a text column, alongside the fixed-length vector
-- representation (in this case, with a length of 1024)
-- called an "embedding"
CREATE TABLE IF NOT EXISTS chunks (
	id bigserial PRIMARY KEY,
	content text,
	embedding vector(1024)
);
-- searching for content using pgvector uses "cosine
-- distance" math to compare the distance between two vectors,
-- which in this query are provided by the vector stored in
-- each row and the input vector provided at query
-- execution time
SELECT
	content,
	1 - (embedding <=> $1) AS score
FROM chunks
ORDER BY
	embedding <=> ($1)
LIMIT 5;
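
And here’s a sketch of what talking to that table from Go might look like, assuming the lib/pq driver and pgvector’s '[0.1,0.2,…]' text format for passing vectors (the connection string and embedding values are made up):

// A sketch of using the chunks table from Go with database/sql and lib/pq.
// pgvector accepts vectors in a '[0.1,0.2,...]' text format, so the
// embedding is passed as a formatted string here.
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/lib/pq"
)

func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, x := range v {
		parts[i] = fmt.Sprintf("%g", x)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/emails?sslmode=disable")
	if err != nil {
		panic(err)
	}
	embedding := make([]float32, 1024)
	embedding[0] = 0.026 // a real vector would come from the embedding model
	_, err = db.Exec(`INSERT INTO chunks (content, embedding) VALUES ($1, $2)`,
		"hello from an old email", vectorLiteral(embedding))
	if err != nil {
		panic(err)
	}
	rows, err := db.Query(`SELECT content, 1 - (embedding <=> $1) AS score
		FROM chunks ORDER BY embedding <=> $1 LIMIT 5`, vectorLiteral(embedding))
	if err != nil {
		panic(err)
	}
	defer rows.Close()
	for rows.Next() {
		var content string
		var score float64
		rows.Scan(&content, &score)
		fmt.Printf("%.3f  %s\n", score, content)
	}
}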

Populating a vector database using llamafile

With postgres and pgvector in hand, I needed to create and populate a postgres table like this with vectors for all of the text in my emails. This required going on a bit of a journey…

I use the native Mail.app macOS application for my email, and ended up needing to write some code to leverage the undocumented sqlite database and file storage layout Mail.app uses to store email texts, so that I could retrieve the string contents of emails matching criteria like “the last 10,000 emails in my inbox and sent messages”.

I’d initially hoped to create vectors for email texts by making requests to the /embeddings API exposed by the llamafile server (embeddings are also used by the LLM itself). Unfortunately, I found that the vectors produced by this endpoint don’t work for cosine similarity searches — the vectors appear to be tuned for a question-answering use-case, where prediction of “next token” (e.g. what happens at the end of a sentence) is more important than semantic similarity of the overall sentence.

Populating a vector database using llamafile Go

The next best thing was running a different model to produce embeddings locally. The go-to library for generating embeddings is the sentence-transformers Python library. I’d initially hoped to use a Go library github.com/nlpodyssey/cybertron instead of Python, to try to better understand some of the abstractions that have developed around libraries like sentence-transformers without also having to wrap my head around Python at the same time.

Here’s an implementation of a CLI utility from cybertron in Go that converts input texts on stdin into output vectors on stdout, using the all-MiniLM-L6-v2 model that maps text into a 384-dimensional vector space. This worked well for vectorizing chunks of texts and inserting them into my database, but was quite slow! Here are the results of running Go’s pprof profiling tool for a given run vectorizing 100 or so sentences:

Go pprof results showing really slow mathfuncs.dotProduct

The slowest function call, clocking in at ~15 minutes in aggregate, is a low-level math function that does “dot product” math somewhere in the bowels of the vector translation process. This was when I came to appreciate the specific role GPUs play in AI: the cybertron Go library runs on the CPU rather than the GPU, and is really slow as a result!

Populating a vector database using llamafile Go Rust

Higher-level languages like Go or Python run on the CPU, which can do some of the math used in AI tasks like text embedding calculations, but does it much more slowly. The way to do this math really fast is to have it run on a GPU, which requires software that can talk to lower-level GPU driver APIs like CUDA (for Nvidia GPUs) or Metal (for Apple Silicon GPUs). One popular library that does some of this GPU driver wrangling for common AI tasks is called PyTorch. It also has a C++ library called libtorch that’s linkable by other language runtimes that want to embed it.

I was still interested in avoiding having to implement my database population code in Python…I don’t really have a good excuse for why. I wasn’t able to find any well-maintained Go wrappers for the libtorch library, so I found a Rust library for libtorch that can be paired with another library called rust-bert to calculate text embeddings on the GPU. With an implementation for populating my vector database now written in Rust and using libtorch under the hood, I was seeing vector embedding calculation take something like 10 seconds on the GPU, where previously it had taken around 15 minutes on the CPU.

Putting it all together

The code now consists of:

  • A Go binary that initializes an empty vector database in postgres, queries the Mail.app database for emails, and breaks emails into smaller sentence-like chunks of text, storing those chunks in postgres without associated embedding vectors.
  • A Rust binary that queries the postgres database for chunks of text that haven’t yet been vectorized, sends those text chunks off to the GPU for vectorization (using libtorch), and stores the resulting vectors back in the database.
  • A running instance of llamafile, providing programmatic access to an LLM.
  • Another Go binary that still uses the CPU-bound cybertron package to generate a vector embedding for a question, queries the postgres database for texts similar to the question using cosine similarity, and feeds the question + resulting texts into llamafile’s /completion endpoint, returning the LLM’s predicted result (sketched below).

You can see all of the code here.
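
In rough outline, that last Go binary does something like the following. This is a sketch rather than the real code: embedQuestion stands in for the cybertron call, and the request/response fields follow the llama.cpp server API that llamafile exposes:

// A rough sketch of the final step: embed the question, pull similar chunks
// out of postgres, and send the augmented prompt to llamafile's /completion
// endpoint (llamafile serves the llama.cpp HTTP API on port 8080 by default).
package main

import (
	"bytes"
	"database/sql"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"

	_ "github.com/lib/pq"
)

func embedQuestion(q string) string {
	// placeholder: the real code uses the cybertron package; here we just
	// fabricate a fixed-length vector in pgvector's '[...]' text form
	v := make([]string, 1024)
	for i := range v {
		v[i] = "0.01"
	}
	return "[" + strings.Join(v, ",") + "]"
}

func main() {
	question := "where did i live in france"
	db, err := sql.Open("postgres", "postgres://localhost/emails?sslmode=disable")
	if err != nil {
		panic(err)
	}
	rows, err := db.Query(`SELECT content FROM chunks ORDER BY embedding <=> $1 LIMIT 5`,
		embedQuestion(question))
	if err != nil {
		panic(err)
	}
	var snippets []string
	for rows.Next() {
		var content string
		rows.Scan(&content)
		snippets = append(snippets, content)
	}
	rows.Close()
	prompt := "Answer using only the emails below.\n\n" +
		strings.Join(snippets, "\n---\n") + "\n\nQuestion: " + question + "\nAnswer:"
	body, _ := json.Marshal(map[string]any{"prompt": prompt, "n_predict": 256})
	resp, err := http.Post("http://localhost:8080/completion", "application/json",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var out struct {
		Content string `json:"content"`
	}
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out.Content)
}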

Does it work?

The setup is sort of complicated, but it runs OK! Does it actually work as a way to talk to my email? Here are some real questions/answers I tried, with some editorial context:

Question: When was harolds birthday?

Answer: Harold’s birthday is on ABCDEF 1AB, as mentioned in the email.

So far, so good! The emails returned here were from a recent birthday party, and the specific month/day was correct.

Question: when was harold born

Answer: Harold was born on ABCDEF 1AB, 1947.

In this instance, the LLM can’t quite hack it, and the date is completely wrong (Harold is a toddler and wasn’t born in 1947). The day and month themselves seem to be derived from the dates of real emails I sent/received, but the year looks like it was hallucinated.

Question: where did i live in france

Answer: Based on the emails provided, it appears that Jeff Sisson lived and worked in Rennes, France for a period of time. He was teaching English to French students at a high school while also working on his own projects such as writing and creating content for his website “ilikenicethings.com”.

This summary is pretty accurate, if it makes my life at the time seem a little more boring than I remember it!

Question: who did i live with in france

Answer: You lived with Alisa and Jivko in France.

Correct.

Question: what was the name of the show double happiness did

Answer: Based on the context provided, it seems that Double Happiness is an online platform or a website that has been inactive since 2015. The founders were Jeff Sisson and Borna Sammak, with Ricky Laska joining them shortly afterward. There are emails exchanged between these individuals discussing the site’s status and potential revival. However, I cannot provide specific details about any TV show or movie titled “Double Happiness” without further information.

This is mostly accurate, though I think it’s based on the computer’s read of a single email I forwarded to someone at some point, which the LLM didn’t quite read in full (it omitted Bennett, for example!). It’s also kind of funny that the LLM assumed Double Happiness was likely to be some flavor of intellectual property.

Question: what hikes did we do in death valley

Answer: We went on a hike at Badwater Basin and another one near the salt pans.

I’ve certainly been to these places with Kathryn, but don’t know that I’d have described them as hikes — this appears to be the model hallucinating.

Question: what hikes have i gone on

Answer: Based on the emails provided, it seems that Jeff Sisson is interested in finding a good winter hike for him and Tom Bubul. They are considering options such as Old Croton Aqueduct trail or Bull Hill. Additionally, they discussed other potential hikes like Arden Point and Glenclyffe, Ken Lockwood Gorge and the Columbia Trail, and Long Walk at Palisades Park. However, it is unclear if any of these specific hikes have been actually gone on by either person.

This is true and based on a real email! But it’s just one of many hiking emails I’ve sent or received that were matched, and for some reason the only one the LLM cared to reference.

Thoughts on how this works

Some general observations, having played around with this:

  • The results are more accurate when the LLM specifically mentions “based on the emails provided”. I’m not yet sure if there’s a better way to tell the LLM to only do this.
  • I’m finding the emails returned via similarity search to be as interesting as, or more interesting than, how the LLM interprets them.

It seems like the similarity search and LLM both struggle a little with the format of an email, where any given individual message rarely tells the full story about what’s being discussed. Unlike, say, a website, an email presumes a lot of prior context not written into the email itself, and that’s kind of what’s beautiful about emails. Maybe these tools will get better at inferring that type of missing context, but it seems as likely that they won’t, and there will remain some forms of human communication that will be resistant to machine interpretation.


Josh TV

14 Dec 2023

My friend Josh and I were riffing earlier this year about what it’s like to “watch TV” now. He was about to become a parent, and so potentially was going to be watching a lot of TV (by way of unpredictable newborn sleep schedules).

We were talking about the flattening experience of scrolling a horizontal carousel of movies or TV forever. Maybe you’re scrolling on one of the 20 corporate streaming services…or maybe you’re scrolling on a carefully maintained library of gently-used mkv files…either way, you’re scrolling.

What if infinitely scrolling is a weird rut we’ve found ourselves in, within the history of how we watch TV? Maybe earlier eras of broadcast TV, which had way less choice, captured some fundamental essence of TV better.

We came up with an idea for software that would take his offline collection of movies and TV and present them (randomly) on a “TV Guide”-like schedule. It’ll show you what’s playing right now, in case you’d like to tune in. But the schedule is also deterministic, so you can see what’s coming up later in the week in case something looks good. I started calling the app “Josh TV”:

Josh TV screenshot
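
The deterministic part is the fun trick: if the random schedule is seeded by the calendar (say, the ISO year and week), anyone who computes the schedule for a given week gets the same answer. A toy Go sketch, with a made-up library standing in for a real Plex database:

// A toy sketch of a deterministic "TV Guide" schedule: seeding the shuffle
// with the ISO year and week number means the schedule is random, but the
// same for anyone (or any device) that computes it that week.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	library := []string{"Columbo", "The Fugitive", "Twin Peaks", "Cheers"}
	year, week := time.Now().ISOWeek()
	r := rand.New(rand.NewSource(int64(year*100 + week)))
	r.Shuffle(len(library), func(i, j int) {
		library[i], library[j] = library[j], library[i]
	})
	// one show per two-hour block, starting at midnight
	for i, show := range library {
		fmt.Printf("%02d:00  %s\n", i*2, show)
	}
}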

The Josh TV app doesn’t actually play the media: you still have to go find the file, and press play. But by severely constraining the burden of choice, Josh TV guides you thru TV moods you may never experience when infinitely scrolling on a corporate streaming service: “nothing good is on, I’m going to do something else”, or “wow, I completely forgot about that show/movie, it’s really good”, or “I watched this because it’s on and thought it would suck but it’s actually good” or “this sucks but I gotta watch something while I’m doing some chore”.

Josh TV is available here for anyone curious to try it out. The minimum requirements for running the app are “whatever Josh has”, which in this case means it runs on a Mac and uses a Plex database file.


Same temp map

15 Aug 2023

An experience I’m guaranteed to have in any given summer: it’s 90°F indoors and probably some nasty level of humidity, and I’m stewing in the living room wondering how many other people are in the exact same situation right now. I added a little feature called “same-temp” to the Big Boy Weather station that answers this. It’ll show you everywhere else (in the Continental United States) it’s currently 90°F:

Continental United States of 90°F

…or everywhere else it’s currently 60°F:

Continental United States of 60°F

…or wherever it’s currently 72°F:

Continental United States of 72°F

The maps update a couple of times an hour, and use data produced by the National Weather Service to create the National Forecast, with a file format called “GRIB2”. GRIB2 files are cool: they’re about 50MB apiece and they describe meteorological data like temperature or rainfall, with instructions for how to project the data onto specific points on the earth. It’s a funky file format…every other horizontal line of “latitude” changes direction from east/west to west/east…and temperature data is in degrees kelvin, which always makes me think of hell…
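
Both quirks are easy to sketch with made-up numbers: convert kelvin to Fahrenheit, and reverse every other row of the flattened grid before using it:

// A sketch of the two GRIB2 quirks mentioned above, with fabricated data:
// temperature values arrive in kelvin, and every other row of the grid is
// scanned in the opposite direction, so alternate rows get reversed before
// the grid makes sense.
package main

import "fmt"

func kelvinToF(k float64) float64 {
	return (k-273.15)*9/5 + 32
}

func main() {
	width := 4
	// a tiny fake grid, flattened the way GRIB2 delivers it
	raw := []float64{300.9, 301.0, 301.1, 301.2, 295.8, 295.7, 295.6, 295.5}
	for row := 0; row*width < len(raw); row++ {
		line := raw[row*width : (row+1)*width]
		if row%2 == 1 { // odd rows run west/east instead of east/west
			for i, j := 0, len(line)-1; i < j; i, j = i+1, j-1 {
				line[i], line[j] = line[j], line[i]
			}
		}
		for _, k := range line {
			fmt.Printf("%5.1f°F ", kelvinToF(k))
		}
		fmt.Println()
	}
}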

I like working with file formats like GRIB2 because there’s something pure about something that’s “just a file”! Even though GRIB2 files are “used by the big guys that make the operational weather forecasts”, I can be a small guy using GRIB2 to write my own software to make these little maps of every place in the Continental United States that shares a given temperature. I keep a sqlite database with a list of all of the cities on OpenStreetMap for exactly this occasion, and so the map will show a handful of place names that share whichever same temp you’re looking at, too. If you click on one of the same-temp links, you can dial the HTML slider and cruise the full temperature range. It can be fun (or disturbing?) to discover specific places you’ve never been to where someone else may be sweating (or shivering) just as you are right now…


More Regex King posts:

Big Boy Bike Directions

Flip phone review

What is Haberman?

sunrise sunset photos

New York State election data quest

Origins of big boy chat

elvis-tools

present weather sensor

hello world

working hard or hardly working?

non-lethal sentences

Good random numbers

NYC LLC

yes(1)

Legalese

Rip Van Winkle logout

Allen's addendum to International Art English

Go to any level / jump higher / stay bigger / live forever

Speech Acts

Selections from O'Reilly's Regular Expressions Pt. 1