Decoding Randomness: An X-Files Experiment in Binary Secrets

I was recently watching an early X-Files episode called “Conduit” (Season 1 Episode 4). The overall plot doesn’t really matter here, but there is a memorable scene involving a kid who sits in front of a TV displaying static, furiously writing down zeros and ones on paper—line after line.

Mulder notices this strange behavior and suspects these numbers might hold a clue to his mysterious case. He collects some of the pages and sends them off to the “geeks” at headquarters for analysis.

In the next scene, the NSA storms in and takes the kid away! The supposedly random zeros and ones turn out to be binary data containing parts of defense satellite transmissions—highly classified information. The government assumes the boy knows more than he’s letting on.

I won’t spoil the rest of the storyline, but that scene made me wonder how realistic it was. So, I set out to replicate the experiment in my own small way and see if I could discover anything “secret” within random bits.


My Experiment

I started with a blank sheet of printer paper and wrote down a random sequence of 0s and 1s, much like the kid in the episode. On average, I fit 47 bits (0s or 1s) on each line, and I had 46 lines per sheet. That totals 2,162 bits on a single page of paper.

Each 0 or 1 is a “bit.” Bits are quite small, so we usually talk in terms of bytes instead. One byte = 8 bits. Therefore, 2,162 bits equals about 270 bytes (2,162 ÷ 8 = 270.25).

How big is 270 bytes? Not very. If you store text, that’s roughly 270 characters (including spaces and punctuation). It might be about three sentences. For storing integers, we often use 4 or 8 bytes per number, so 270 bytes can store around 67 numbers if we use 4 bytes each. A color image needs 3 bytes per pixel (one byte each for red, green, and blue), so 270 bytes would only give us a 9×9-pixel image—far too small to represent anything meaningful.
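As a quick sanity check, the arithmetic above can be verified in a few lines of Python:

```python
BITS = 2_162  # bits on one handwritten page (47 per line x 46 lines)

bytes_total = BITS / 8          # bits to bytes
ints_4byte = (BITS // 8) // 4   # how many 4-byte integers fit
pixels = (BITS // 8) // 3       # RGB pixels at 3 bytes each

print(bytes_total)  # 270.25
print(ints_4byte)   # 67
print(pixels)       # 90 -- enough for roughly a 9x9 image
```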


Two Key Questions

From here, I wanted to explore two questions:

  1. If I generate a random 2,162-bit sequence, how likely is it that this exact sequence also appears somewhere in any of the files on my computer? This would tell us how likely it is that the data could uniquely identify a pattern.
  2. Is 2,162 bits of data enough to store the information the episode claimed? Can 2,162 bits store a defense transmission, a fragment of a classical melody, or part of da Vinci’s “Vitruvian Man”?

My initial guesses were:

  1. This random 2,162-bit pattern would appear often, making it basically impossible to determine its source.
  2. If we restrict ourselves in certain ways, then yes, 2,162 bits could represent a few of those sample pieces of information (like a short text document, a short MIDI sequence, or a small vector image).

Searching Through Existing Files

Answering the first question is simple in theory: generate 2,162 random bits, then scan every file on my computer—pictures, documents, music, movies, system files, AI models, everything—to see if that sequence appears anywhere in their raw data.

A single MP3 is roughly 4 MiB (mebibytes), which is 33,554,432 bits—about 15,000 times larger than our little 2,162-bit sequence. Movies can run into gigabytes, which is thousands of times bigger. So it felt plausible that somewhere in all those bits, my random pattern might appear by chance.

I wrote a program (it’s quite parallelized) that plowed through every bit of every file on my computer, looking for the exact 2,162-bit pattern. You can see the program here (written in Elixir/Erlang). Execution took a lot longer than I expected, but in the end, the results were…

Not once did the random series of bits appear in any of the files!

This was genuinely surprising! In smaller tests (around 200 bits), I had found random patterns appearing multiple times in a much smaller set of files. For reasons that might be statistical or based on how files are encoded, the larger 2,162-bit pattern never showed up in 390 GB of data spanning 1,045,216 files.

So, perhaps The X-Files was onto something. If the data really doesn’t appear in typical large datasets, maybe you could identify the true “source” behind it. That said, you’d still have to compare it across tremendous archives of known data. Given that it took me 40 hours with 16 processors and a solid-state drive to check just one laptop’s worth of data, it seems highly improbable that anyone—especially in 1993—could feasibly run a search at this level across massive datasets.

I suppose if I were more math inclined, I would have skipped this whole experiment and just computed the probability. It turns out that the probability of two specific 2,162-bit sequences matching each other in order is 1 in 5.8 × 10^650. Winning the Powerball jackpot is only about 1 in 2.9 × 10^8 (292 million), and being struck by lightning is only about 1 in 7.5 × 10^5 (750,000)! To even have a 0.01% chance of finding this exact 2,162-bit sequence, we’d need a file with a length of 3.16 × 10^648 bits!

That is 3.16 × 10^633 petabytes: an astronomically large quantity, far exceeding any data storage capacity conceivable with current or theoretical technology. In fact, the total amount of data stored in the world is estimated at about 120 billion petabytes. This number is so far beyond that, it is difficult to compare it to anything!
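If you’d like to see where exponents of that size come from, the back-of-the-envelope version is short: the chance of one specific 2,162-bit window matching is 1 in 2^2,162, and a base-10 logarithm converts that to a power of ten:

```python
from math import log10

BITS = 2_162

# Probability of one specific window matching: 1 in 2**BITS.
exponent = BITS * log10(2)
print(f"1 in about 10^{exponent:.1f}")  # 1 in about 10^650.8
```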

We do get a slight advantage in that we have a “sliding window” within a continuous sequence of bits. Given a large file with a continuous sequence of bits, we start by comparing the first 2,162 bits, then slide one bit to the right and compare the next window of 2,162 bits, and so on.

What this means is that with a file 2,163 bits long, we’d have two chances for our pattern to match. A 2,164-bit file would have three chances, and so on. So, the larger the input files, the more chances we have. But we’d still have to construct a file so enormous (far larger than the whole world’s data collection) that it offers 5.8 × 10^650 different positions for the sequence to appear!
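To make the sliding window concrete, here is a naive Python sketch of the idea (my actual search program, linked above, is in Elixir and works on raw bit arithmetic rather than strings):

```python
def find_bit_pattern(haystack: bytes, pattern_bits: str) -> int:
    """Return the first bit offset where pattern_bits occurs, or -1.

    Naive version: expand the file into a '0'/'1' string and slide
    one bit at a time. Fine for a demo, far too slow for 390 GB.
    """
    bits = ''.join(f'{byte:08b}' for byte in haystack)
    return bits.find(pattern_bits)

data = bytes([0b10101010, 0b11011001])
print(find_bit_pattern(data, '11011'))  # 8 -- starts in the second byte
print(find_bit_pattern(data, '11111'))  # -1 -- not present
```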


Could 2,162 Bits Really Represent “Defense Transmissions,” a Melody, and a Masterpiece?

In the show, the binary patterns are identified as partial snippets of defense transmissions, classical music, and da Vinci’s Vitruvian Man. Let’s see if that’s plausible.

Defense Transmissions

“Defense transmission” can mean many things, but suppose it’s something like:

Secret Military Base 1
ID: xyz
34.180467, -108.292431

Secret Military Base 2
ID: abc
37.542678, -110.055634

Names, IDs, and coordinates could fit into 2,162 bits (270 bytes). So yes, you could store one or two base coordinates in that space.
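To confirm, here is that (entirely made-up) pair of records encoded as ASCII and measured in bits:

```python
record = (
    "Secret Military Base 1\n"
    "ID: xyz\n"
    "34.180467, -108.292431\n"
    "\n"
    "Secret Military Base 2\n"
    "ID: abc\n"
    "37.542678, -110.055634\n"
)

size_bits = len(record.encode("ascii")) * 8
print(size_bits, "bits")  # 872 bits -- well under our 2,162
```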

Melody of a Classical Song

Most people think of music as a recorded file—like a CD track. On a CD, audio is uncompressed: 44,100 samples per second, 16 bits per sample, stereo (2 channels). That’s about 1,411,200 bits for just one second of music, way bigger than our 2,162 bits.

But compressed formats exist. MP3 at 192 kbps (kilobits per second) is smaller but still 192,000 bits per second. Our 2,162 bits would cover only about 0.01 seconds of audio—not enough to identify a classical piece by ear.

There are other ways to record music on a computer. Instead of capturing real-world audio from a microphone, you can use computer-generated sounds. Think of how an electronic keyboard synthesizes its notes. The most common format is called MIDI. In this format, you specify a computer-generated “instrument” and a sequence of notes. Here is an example of what a MIDI file that plays a C major scale would look like:

(MIDI files are binary (0s and 1s), so you can’t really read them as text. However, it would take a LOT of space to write all the 0s and 1s, so I’ve written this in hexadecimal notation instead)

4D 54 68 64
00 00 00 06
00 00
00 01
00 60
4D 54 72 6B
00 00 00 44
00 90 3C 40
60 80 3C 40
00 90 3E 40
60 80 3E 40
00 90 40 40
60 80 40 40
00 90 41 40
60 80 41 40
00 90 43 40
60 80 43 40
00 90 45 40
60 80 45 40
00 90 47 40
60 80 47 40
00 90 48 40
60 80 48 40
00 FF 2F 00

Here’s that same data with comments describing what is happening:

4D 54 68 64    // "MThd"
00 00 00 06    // Header length = 6 bytes
00 00          // Format 0
00 01          // 1 track
00 60          // 96 ticks per quarter note

4D 54 72 6B         // "MTrk"
00 00 00 44         // Track length: 68 bytes (0x44)

-- Note for C4 (MIDI 60, 0x3C) --
00 90 3C 40         // Delta=0, Note On, note 0x3C, velocity 0x40
60 80 3C 40         // Delta=96 (0x60), Note Off, note 0x3C, velocity 0x40

-- Note for D4 (MIDI 62, 0x3E) --
00 90 3E 40         // Delta=0, Note On, note 0x3E, velocity 0x40
60 80 3E 40         // Delta=96, Note Off, note 0x3E, velocity 0x40

-- Note for E4 (MIDI 64, 0x40) --
00 90 40 40         // Delta=0, Note On, note 0x40, velocity 0x40
60 80 40 40         // Delta=96, Note Off, note 0x40, velocity 0x40

-- Note for F4 (MIDI 65, 0x41) --
00 90 41 40         // Delta=0, Note On, note 0x41, velocity 0x40
60 80 41 40         // Delta=96, Note Off, note 0x41, velocity 0x40

-- Note for G4 (MIDI 67, 0x43) --
00 90 43 40         // Delta=0, Note On, note 0x43, velocity 0x40
60 80 43 40         // Delta=96, Note Off, note 0x43, velocity 0x40

-- Note for A4 (MIDI 69, 0x45) --
00 90 45 40         // Delta=0, Note On, note 0x45, velocity 0x40
60 80 45 40         // Delta=96, Note Off, note 0x45, velocity 0x40

-- Note for B4 (MIDI 71, 0x47) --
00 90 47 40         // Delta=0, Note On, note 0x47, velocity 0x40
60 80 47 40         // Delta=96, Note Off, note 0x47, velocity 0x40

-- Note for C5 (MIDI 72, 0x48) --
00 90 48 40         // Delta=0, Note On, note 0x48, velocity 0x40
60 80 48 40         // Delta=96, Note Off, note 0x48, velocity 0x40

-- End of Track --
00 FF 2F 00         // Delta=0, Meta Event (End of Track)

(Download this MIDI file to hear it)

The MIDI file above would be only 720 bits (90 bytes). That easily fits in our 2,162 bits! So, storing a brief melody as a MIDI file is indeed possible.
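You can rebuild the file above programmatically and check the size; this sketch mirrors the hex dump byte for byte:

```python
# MThd header: length 6, format 0, 1 track, 96 ticks per quarter note
header = bytes.fromhex("4D546864 00000006 0000 0001 0060")
# MTrk header: track data is 0x44 = 68 bytes long
track_header = bytes.fromhex("4D54726B 00000044")

events = b""
for note in (0x3C, 0x3E, 0x40, 0x41, 0x43, 0x45, 0x47, 0x48):  # C major scale
    events += bytes([0x00, 0x90, note, 0x40])  # delta 0, Note On, velocity 64
    events += bytes([0x60, 0x80, note, 0x40])  # delta 96, Note Off
events += bytes([0x00, 0xFF, 0x2F, 0x00])      # End of Track meta event

midi = header + track_header + events
print(len(midi), "bytes =", len(midi) * 8, "bits")  # 90 bytes = 720 bits
```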

da Vinci’s Vitruvian Man

A typical uncompressed “raster” image would require a color value for every pixel. At 3 bytes per pixel, 2,162 bits (270 bytes) yields only a 9×9 image—too tiny to depict anything recognizable. We have some really neat compression algorithms, like JPEG, that can get pretty small, but probably not down to 2,162 bits.

Instead of describing an image by specifying a color for every pixel, we can instead describe an image by shapes. Vector graphics store images as shapes, lines, curves, etc. A simplified vector drawing of the Vitruvian Man might fit into a few hundred bytes, depending on how it’s encoded. It’s not as detailed as a full-color photo, but you could theoretically squeeze in a basic line-drawing version.

A vector format might look something like the following:

CIRCLE,BLUE,30,250x34

This tells us to draw a circle with a radius of 30 and its center point at 250×34. Oh, and fill it in with the color blue. This will use much less data than having to specify the color of every single pixel!
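This CIRCLE syntax is something I invented for illustration, but a parser for it is only a few lines:

```python
def parse_shape(line):
    """Parse a toy vector command like 'CIRCLE,BLUE,30,250x34'."""
    kind, color, radius, center = line.split(",")
    x, y = center.split("x")
    return {"kind": kind, "color": color,
            "radius": int(radius), "center": (int(x), int(y))}

shape = parse_shape("CIRCLE,BLUE,30,250x34")
print(shape["radius"], shape["center"])  # 30 (250, 34)
```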

I happened to search the web and found a vector image of the Vitruvian Man. A simplified vector drawing could potentially fit in our allotted space!


The Realistic Takeaway

From my random-data experiment, it turned out that my 2,162-bit sequence did not appear even once in 390 GB of data. That suggests if you did find that exact pattern in a large dataset, you might be able to trace its source or guess its purpose. On the other hand, analyzing only 2,162 bits to say, “This is partial data from a classical song or a list of government bases,” is still incredibly difficult—especially if any bits are missing or out of order.

If you’re missing even a single leading bit, all the subsequent data shifts and becomes gibberish. Without headers or metadata, you may never figure out the file format. So the show’s premise—identifying partial data as defense transmissions, a melody, and a famous image—is a stretch, but at least it chose examples that could feasibly fit into a single sheet of binary scribbles!

While the X-Files scenario isn’t entirely realistic, it’s still fun that they picked data snippets (text, MIDI, vector images) small enough to work on a single page of handwritten bits. And that, in my opinion, is pretty cool!

Camp Counselor 1.4.0 is Here!

Managing your Bandcamp.com wishlist just got smarter and more convenient!

This release introduces support for MPRIS (Media Player Remote Interfacing Specification), a feature that lets you control playback directly from your keyboard’s media keys, your desktop’s media controls, or even other compatible apps. Whether you’re multitasking or just want quick access to playback controls, Camp Counselor has you covered.

Camp Counselor Playing Music while being controlled by MPRIS through the Gnome Desktop

As always, you can install or update the latest version through Flatpak or your distribution’s software manager.

Update now to enjoy these new features, and feel free to share your feedback!

The Joys of Parsing: Using Machine Learning for Data Extraction

Regular Expressions. I love ’em and use them all the time. But sometimes they become too fragile. So what then? Well, maybe you write a nice grammar and do some proper parsing. That is a great solution IF your input actually conforms. But what are better options when your input is freeform, inconsistent, and full of mistakes?

How do you deal with the Wild West of parsing? That is a problem I have been struggling with. This post looks at my journey from parsing with regular expressions to investigating various machine learning algorithms for more reliable data extraction.

The Problem – Regular Expressions and Their Limits

I run a simple website called the “US Black Metal Distributors Index.” It has several web scraper backends for various US Vinyl Record vendors that specialize in Black Metal. The goal is to help you find a vendor selling your favorite Black Metal records in the US. The difficulty of finding a US vendor led me to start my own distributor, Sto’Vo’Kor Records.

Scraping websites is pretty straightforward. Some of the distributors have a nice simple API, others don’t, but Beautiful Soup helps in those cases. Getting a list of products isn’t the problem. The problem is trying to interpret those products.

For the website to be useful, we need to parse out the artist(s), album, format, price, variants, description, and other information. And we need to normalize this data so it is consistent across all the different vendors.

Let’s look at what some of this data looks like and the challenges in parsing it. We’ll start off with Sto’Vo’Kor. Since this is my distributor, I have a lot more control over how I categorize and name things. My site has an API that returns JSON data that looks like this (I’ve omitted fields we aren’t interested in):

{
    "body_html": "LP<br>Label: Infernat Profundus Records<br>Year of release: 2023<br>Vinyl Color: Black",
    "id": 7292745253071,
    "product_type": "Vinyl",
    "title": "Pa Vesh En - Catacombs",
    "updated_at": "2024-07-13T09:20:23-05:00",
    "variants": [
        {
            "available": true,
            "id": 43144713732303,
            "option1": "Black",
            "price": "28.00",
            "product_id": 7292745253071,
        },
        {
            "available": true,
            "id": 43144713797839,
            "option1": "Clear with Black/White Smoke + Red",
            "price": "31.00",
            "product_id": 7292745253071,
        }
    ],
    "vendor": "Sto'Vo'Kor Records"
}

That’s pretty nice. We have a product named “Pa Vesh En - Catacombs.” It’s a vinyl record, and there are two variants, one in Black and one in Clear with Black/White Smoke + Red. We can easily get the price of each variant. To extract our artist and album, we don’t even need a regex; we can just split on the ' - ':

artist, album = obj['title'].split(' - ')

Easy peasy. Let’s look at another title: Wampyric Rites - Demo I - Demo II. Uh-oh, we can’t split directly on a - anymore, as there is now an additional dash in the album title. So, we probably need to move to a smarter split:

artist, album = obj['title'].split(' - ', maxsplit=1)

That works a bit better, but we made a big assumption that any additional - would always be in the album title, not the artist name.
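Here’s what maxsplit buys us on that title:

```python
title = "Wampyric Rites - Demo I - Demo II"
artist, album = title.split(' - ', maxsplit=1)
print(artist)  # Wampyric Rites
print(album)   # Demo I - Demo II
```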

Let’s look at another: Mütiilation - Destroy Your Life for Satan (10" LP). Dang it! Part of the format has now shown up in the title, so we now have to try and remove that. So now we have additional parsing we have to do on just the album title. But we can’t just remove everything in the parenthesis, because we also have titles like: Mahr - Soulmare (I & II). So now we need special exceptions and can’t really do this in a generic way.

Let’s take a look at some other titles we need to parse from other vendors.

Here are some from Metal to the Core 1986:

VLAD TEPES - War Funeral March 12" LP
TENEBRAE – Serenades of the Damned 12″ LP

Ok, that doesn’t seem too bad. It is basically {artist} - {album} {type}. We’ll need special cases for all the different types, but there can’t be that many, right?

import re

def split_title_format(rest):
    r = re.compile(r'^(.*?)\s+\d+"\s+LP$')
    if (match := r.match(rest)) is not None:
        return match.group(1), 'Vinyl'

    return rest, None

artist, rest = obj['title'].split(' - ')
title, format = split_title_format(rest)

We’ll just keep adding regular expressions as we come across different types. That should be easy with a bit of experimentation.

After testing this, it worked great on the first example, but failed completely on the second one. In fact, it failed on the initial splitting of the title by a -. Wtf?

If you look really, really, really close, you may notice something off about the dash in the second item. That isn’t a standard hyphen (ASCII code 45). That is Unicode character 8211, an “en dash.” Great… Let’s convert some Unicode to regular ASCII before doing anything else:

t = obj['title'].replace('–', '-')
artist, rest = t.split(' - ')
...

Ok, that is better. But we still failed on splitting the title and the format. Ugh, they used Unicode again, this time a double prime (″) in place of the ". Ok, this is going to be a pain. Now we have to look at all possible Unicode characters and try to unify them. Did you know there are at least 14 different types of dashes that all basically look the same but are totally different characters?

I imagine you’re starting to see the issue. However, this can be dealt with. There are Unicode libraries to help unify this or find all characters with a single type. So, it’s a pain, but we can survive.
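A minimal sketch of that unification step (the character lists here cover just the lookalikes mentioned above; libraries like Unidecode cover far more):

```python
# Hypothetical helper: map common Unicode lookalikes to ASCII.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"  # hyphens, en/em dashes, minus
QUOTES = {"\u2033": '"', "\u201c": '"', "\u201d": '"'}  # double prime, curly quotes

def asciify(text):
    for dash in DASHES:
        text = text.replace(dash, "-")
    for uni, ascii_char in QUOTES.items():
        text = text.replace(uni, ascii_char)
    return text

print(asciify("TENEBRAE \u2013 Serenades of the Damned 12\u2033 LP"))
# TENEBRAE - Serenades of the Damned 12" LP
```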

Let’s keep going and look at some other vendors.

At The Gates - The Nightmare Of Being LP
USED - Fír – Dodemanswoud   Cassette
BLEMISH - Choke - Desiphon 12" (MG EXCLUSIVE)
BLEMISH / USED - Afsky - Ofte jeg drømmer mig død LP
USED Chevallier Skrog — Rolnická Renesance Cassette
Naxen ‎– Towards The Tomb Of Times CD
EXCLUSIVE - Atræ Bilis - Aumicide LP Color 2024 Press
Anatomia / Druid Lord - Split 7‟

Let’s see what we have. The ideal form is {artist} - {album} {format}, but very few are consistent with that. Ignoring the mix of Unicode and ASCII characters, and extra spaces, we also seem to have some information about the album at the beginning, and sometimes additional information at the end. And it isn’t consistent, sometimes the information at the beginning comes before a -, other times it is just part of the artist. So, more and more special cases. We’ll probably have to hard-code expressions for terms like USED, BLEMISH, and EXCLUSIVE, and hope they don’t introduce too many more.

And this is where the parsing project for this site sat for a long time. Each distributor that I parse has a huge list of rules. Rules that often fail. Rules that have to be updated constantly. Almost every single week I had to update rules for a distributor to keep it parsing. And often the new rules conflicted with the old ones (like when there is a band called Blemish!).

Machine Learning to the Rescue

Enter machine learning. I’ve looked at a variety of machine learning models that might be useful. The problem is, of course, having enough data to train them.

Attempt 1 – The Fail with Vector-to-Vector Models

My first attempts at this started with using vector-to-vector algorithms. These are a class of algorithms that take an input vector and map it to an output vector. Specifically, I attempted to use sequence-to-sequence, which is often used for translation from one language to another. That is basically what I want: take an input and output a new vector that has the information I need in a standard format (like comma-separated).

However, throughout my trials, I never had very high success rates. I think I just didn’t have enough training data to truly make it work. And as artist names and album titles often have very unique words, especially in metal, there wasn’t a lot of overlap in the training data. As the algorithm encountered new words it had never seen, it didn’t seem to know what to do with them. Creating a training dataset appeared to take more effort than just manually fixing parsing rules.

Attempt 2 – The Win(ish) with NER and Classification

NER – Locating Text Within Unstructured Text

After some more research, I came across an NER (Named Entity Recognition) algorithm. This algorithm seeks to locate named entities in unstructured text and categorize them into predefined categories. This looked like a pretty good option.

Basically, the algorithm tries to locate a piece of information within the unstructured text based on the patterns it has learned from the training data. This means it doesn’t necessarily care about the meaning of the information or try to transform it; it just needs to locate it. AND that is one BIG caveat! The output MUST exist in the input string, so if the input string has a typo, like Mutiilation instead of Mütiilation, then the algorithm is NOT going to be able to transform that. But this is a limitation I think we can deal with.

I created a dataset with training data for each of my vendors with a sample of different formatting. The training data looks something like:

description,artist,album,format,extra
Asinhell - Impii Hora - Crimson Red Marbled Vinyl - EU Import Variant,Asinhell,Impii Hora,Vinyl,Crimson Red Marbled Vinyl - EU Import Variant
MORBID ANGEL - GATEWAYS TO ANNIHILATION - FDR RED/BLACK MARBLE VINYL LP,Morbid Angel,Gateways To Annihilation,Vinyl,FDR RED/BLACK MARBLE VINYL LP
Cân Bardd - The Last Rain - 2LP Green/Black Swirl Vinyl DLP,Cân Bardd,The Last Rain,Vinyl,2LP Green/Black Swirl Vinyl DLP
UNDERGANG – Til Døden Os Skiller- Slime Green Vinyl LP,Undergang,Til Døden Os Skiller,Vinyl,Slime Green Vinyl LP
Mastodon - Leviathan - Gold Nugget Vinyl LP,Mastodon,Leviathan,Vinyl,Gold Nugget Vinyl LP
CHILDREN OF BODOM - HEXED - (180 GRAM INDIE EXCLUSIVE VINYL LP),CHILDREN OF BODOM,HEXED,Vinyl,(180 GRAM INDIE EXCLUSIVE VINYL LP)
KRISIUN - Apocalyptic Revelations - LIMITED EDITION TRANSPARENT RED VINYL,Krisiun,Apocalyptic Revelations,Vinyl,LIMITED EDITION TRANSPARENT RED VINYL
SUFFOCATION - Pierced from Within - LTD Splatter Vinyl LP,Suffocation,Pierced from Within,Vinyl,LTD Splatter Vinyl LP
Cannibal Corpse ‎- Gallery of Suicide - Bone/Red Splatter Vinyl LP - Import,Cannibal Corpse,Gallery Of Suicide,Vinyl,Bone/Red Splatter Vinyl LP - Import

(You can find all the sample data on my GitHub repository.)

I then used NER on my training data to create three separate models:

  1. One to locate the artist name
  2. One to locate the album name
  3. One to locate the extra information

I also initially tried to create a model for the format. However, I want to categorize the products into one of three formats: CD, Cassette, or Vinyl. The distributors that I scrape from use all different words to describe their formats, like LP, 2LP, Digipak, and so on, and since the NER algorithm doesn’t categorize well, I decided to skip that for now.

The NER training program is quite simple. I used spaCy, a Python library for natural language processing with a built-in NER component. Training is quick, and with minimal data I was able to start getting some good results.

You can find the training script on my GitHub repository.
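spaCy’s NER training data boils down to character-offset entity spans. Deriving those spans from one row of the CSV looks roughly like this (a sketch, not my actual training script):

```python
import csv
import io

ROW = ("Mastodon - Leviathan - Gold Nugget Vinyl LP,"
       "Mastodon,Leviathan,Vinyl,Gold Nugget Vinyl LP")

description, artist, album, fmt, extra = next(csv.reader(io.StringIO(ROW)))

def span(label, value):
    # Entity spans are (start, end, label) character offsets into the text.
    start = description.index(value)
    return (start, start + len(value), label)

entities = [span("ARTIST", artist), span("ALBUM", album), span("EXTRA", extra)]
print(entities)
# [(0, 8, 'ARTIST'), (11, 20, 'ALBUM'), (23, 43, 'EXTRA')]
```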

One problem with this specific NER implementation is that during prediction, we don’t get back a confidence value. Having a confidence value would be helpful to diagnose any special cases where the model hasn’t been trained.

Format Classification

So at this point, I can throw my input at my NER models and get back an artist, album, and extra information pretty reliably! So, let’s go back to the format categorization. As I mentioned earlier, NER was not ideal for that scenario. But a typical classification algorithm is!

Using scikit-learn in conjunction with xgboost, a gradient boosting library, proved able to handle the classification of the format quite well based on the descriptions. The program vectorizes the input text, reduces its dimensionality, and applies a gradient-boosted classification model to predict the format labels. It is both reliable and quick.

Results

I have had this setup running for several weeks now, and overall, it has been quite successful. I haven’t yet had to add any new training data or special rules.

For future improvements, I would look at using an NER implementation that returns a confidence value during prediction, and would also try to do better matching of artists and albums through something like MusicBrainz, to help with variations in names.

You can find all the training programs in the GitHub repository along with my training data and validation data. The predictor has also been integrated into the parser code.

Conclusion

Switching to machine learning for parsing unstructured text has significantly improved the reliability and maintainability of my data extraction process. While challenges remain, this approach offers a scalable solution to the complexities of parsing diverse and inconsistent data. Using a mix of models and algorithms for different types of parsing problems was the key to solving my parsing issues.