The Joys of Parsing: Using Machine Learning for Data Extraction

Regular Expressions. I love ’em and use them all the time. But sometimes they become too fragile. So what then? Well, maybe you write a nice grammar and do some proper parsing. That is a great solution IF your input actually conforms. But what are better options when your input is freeform, inconsistent, and full of mistakes?

How do you deal with the Wild West of parsing? That is a problem I have been struggling with. This post looks at my journey from parsing with regular expressions to investigating various machine learning algorithms for more reliable data extraction.

The Problem – Regular Expressions and Their Limits

I run a simple website called the “US Black Metal Distributors Index.” It has several web scraper backends for various US Vinyl Record vendors that specialize in Black Metal. The goal is to help you find a vendor selling your favorite Black Metal records in the US. The difficulty of finding a US vendor led me to start my own distributor, Sto’Vo’Kor Records.

Scraping websites is pretty straightforward. Some of the distributors have a nice simple API, others don’t, but Beautiful Soup helps in those cases. Getting a list of products isn’t the problem. The problem is trying to interpret those products.

For the website to be useful, we need to parse out the artist(s), album, format, price, variants, description, and other information. And we need to normalize this data so it is consistent across all the different vendors.

Let’s look at what some of this data looks like and the challenges in parsing it. We’ll start off with Sto’Vo’Kor. Since this is my distributor, I have a lot more control over how I categorize and name things. My site has an API that returns JSON data that looks like this (I’ve omitted fields we aren’t interested in):

{
    "body_html": "LP<br>Label: Infernat Profundus Records<br>Year of release: 2023<br>Vinyl Color: Black",
    "id": 7292745253071,
    "product_type": "Vinyl",
    "title": "Pa Vesh En - Catacombs",
    "updated_at": "2024-07-13T09:20:23-05:00",
    "variants": [
        {
            "available": true,
            "id": 43144713732303,
            "option1": "Black",
            "price": "28.00",
            "product_id": 7292745253071
        },
        {
            "available": true,
            "id": 43144713797839,
            "option1": "Clear with Black/White Smoke + Red",
            "price": "31.00",
            "product_id": 7292745253071
        }
    ],
    "vendor": "Sto'Vo'Kor Records"
}

That’s pretty nice. We have a product named “Pa Vesh En - Catacombs.” It’s a vinyl record, and there are two variants, one in Black and one in Clear with Black/White Smoke + Red. We can easily get the price of each variant. To extract our artist and album, we don’t even need a regex; we can just split on the ' - ':

artist, album = obj['title'].split(' - ')

Easy peasy. Let’s look at another title: Wampyric Rites - Demo I - Demo II. Uh-oh, we can’t split directly on a - anymore, as there is now an additional dash in the album title. So, we probably need to move to a smarter split:

artist, album = obj['title'].split(' - ', maxsplit=1)

That works a bit better, but we made a big assumption that any additional - would always be in the album title, not the artist name.

Let’s look at another: Mütiilation - Destroy Your Life for Satan (10" LP). Dang it! Part of the format has now shown up in the title, so we have to remove it, which means extra parsing on just the album title. But we can’t just remove everything in parentheses, because we also have titles like: Mahr - Soulmare (I & II). So now we need special exceptions and can’t really do this in a generic way.
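To make that special-casing concrete, here is a minimal sketch, assuming a trailing parenthetical should only be treated as a format when it contains a known format word. The names split_trailing_format and FORMAT_TOKENS and the token list itself are hypothetical, not taken from the site's parser.

```python
import re

# Assumption: a small, hand-maintained list of words that signal a format.
FORMAT_TOKENS = re.compile(r'\b(LP|CD|Cassette|Vinyl)\b', re.IGNORECASE)

# Captures the text before a final "(...)" group and the text inside it.
TRAILING_PARENS = re.compile(r'^(.*?)\s*\(([^()]*)\)\s*$')


def split_trailing_format(album):
    """Return (album, format_text); format_text is None when nothing is stripped."""
    match = TRAILING_PARENS.match(album)
    if match and FORMAT_TOKENS.search(match.group(2)):
        return match.group(1), match.group(2)
    # No parenthetical, or one that does not look like a format: keep it all.
    return album, None


print(split_trailing_format('Destroy Your Life for Satan (10" LP)'))
print(split_trailing_format('Soulmare (I & II)'))
```

This handles the two titles above, but it is still a rule: any album whose real name ends in a parenthetical containing one of those words would be cut short.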

Let’s take a look at some other titles we need to parse from other vendors.

Here are some from Metal to the Core 1986:

VLAD TEPES - War Funeral March 12" LP
TENEBRAE – Serenades of the Damned 12″ LP

Ok, that doesn’t seem too bad. It is basically {artist} - {album} {type}. We’ll need special cases for all the different types, but there can’t be that many, right?

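import re

# Matches album text that ends in a size and "LP", e.g.: War Funeral March 12" LP
LP_SUFFIX = re.compile(r'^(.*?)\s+\d+"\s+LP$')

def split_title_format(rest):
    """Split a trailing vinyl size marker off the album title."""
    if (match := LP_SUFFIX.match(rest)) is not None:
        return match.group(1), 'Vinyl'

    # No known format suffix found; return the text unchanged.
    return rest, None

artist, rest = obj['title'].split(' - ')
title, fmt = split_title_format(rest)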

We’ll just keep adding regular expressions as we come across different types. That should be easy with a bit of experimentation.

After testing this, it worked great on the first example, but failed completely on the second one. In fact, it failed on the initial splitting of the title by a -. Wtf?

If you look really, really, really close, you may notice something off about the – in the second item. That isn’t a standard hyphen (ASCII code 45). It is Unicode character 8211 (U+2013), an “en dash.” Great… Let’s convert some Unicode to regular ASCII before doing anything else:

t = obj['title'].replace('\u2013', '-')  # en dash -> ASCII hyphen
artist, rest = t.split(' - ')
...

Ok, that is better. But we still failed on splitting the title and the format. Ugh, they used Unicode again for the ". Ok, this is going to be a pain. Now we have to look at all possible Unicode characters and try to unify them. Did you know there are at least 14 different types of dashes that all basically look the same but are totally different characters?

I imagine you’re starting to see the issue. However, this can be dealt with. There are Unicode libraries to help unify this or find all characters with a single type. So, it’s a pain, but we can survive.
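As a rough illustration of that unification step, here is a sketch that uses only the standard library's unicodedata module. It assumes every character in the Unicode "dash punctuation" category (Pd) can be treated as a plain hyphen, that invisible format characters (category Cf) can be dropped, and that a short hand-picked set of quote-like characters should become a straight double quote. The function name normalize_title and that character set are hypothetical.

```python
import unicodedata

# Assumption: quote-like characters worth folding into a plain '"'.
# This set is illustrative and certainly not exhaustive.
DOUBLE_QUOTE_LIKE = {"\u2033", "\u201c", "\u201d", "\u201f"}


def normalize_title(title):
    """Fold look-alike dashes and quotes to ASCII and tidy whitespace."""
    out = []
    for ch in title:
        category = unicodedata.category(ch)
        if category == "Pd":          # any dash: hyphen, en dash, em dash, ...
            out.append("-")
        elif ch in DOUBLE_QUOTE_LIKE:
            out.append('"')
        elif category == "Cf":        # invisible format chars, e.g. U+200E
            continue
        else:
            out.append(ch)
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(out).split())


print(normalize_title("TENEBRAE \u2013 Serenades of the Damned 12\u2033 LP"))
```

Letters with diacritics (ü, ø, æ) are left alone on purpose, since they are part of the artist and album names.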

Let’s keep going and look at some other vendors.

At The Gates - The Nightmare Of Being LP
USED - Fír – Dodemanswoud   Cassette
BLEMISH - Choke - Desiphon 12" (MG EXCLUSIVE)
BLEMISH / USED - Afsky - Ofte jeg drømmer mig død LP
USED Chevallier Skrog — Rolnická Renesance Cassette
Naxen ‎– Towards The Tomb Of Times CD
EXCLUSIVE - Atræ Bilis - Aumicide LP Color 2024 Press
Anatomia / Druid Lord - Split 7‟

Let’s see what we have. The ideal form is {artist} - {album} {format}, but very few are consistent with that. Ignoring the mix of Unicode and ASCII characters and the extra spaces, we also seem to have some information about the item's condition at the beginning, and sometimes additional information at the end. And it isn’t consistent: sometimes the information at the beginning comes before a -, other times it just runs straight into the artist. So, more and more special cases. We’ll probably have to hard-code expressions for terms like USED, BLEMISH, and EXCLUSIVE, and hope the vendors don’t introduce too many more.

And so this is where the parsing project for this site was for a long time. Each distributor that I parsed had a huge list of rules. Rules that often failed. Rules that had to be updated constantly. Almost every single week I had to update rules for a distributor to keep it parsing. And often the new rules conflicted with the old rules (like when there is a band called Blemish!).
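To show what one of those hard-coded rules might look like, and how it collides with a real band name, here is a minimal sketch. The tag list, the regular expression, and the name strip_condition_prefix are assumptions for illustration, not the rules the site actually used; the title "BLEMISH - Some Album LP" in the last line is made up to trigger the conflict.

```python
import re

# Assumption: the condition/marketing tags seen in the listings above.
CONDITION_TAGS = ("USED", "BLEMISH", "EXCLUSIVE")

_TAG = "|".join(CONDITION_TAGS)
# One or more tags joined by "/", then whitespace and an optional "- ".
_PREFIX = re.compile(rf"^((?:{_TAG})(?:\s*/\s*(?:{_TAG}))*)\s+(?:-\s+)?")


def strip_condition_prefix(title):
    """Return (tags, remaining_title) for a title with leading tags."""
    match = _PREFIX.match(title)
    if not match:
        return [], title
    tags = [tag.strip() for tag in match.group(1).split("/")]
    return tags, title[match.end():]


print(strip_condition_prefix("BLEMISH / USED - Afsky - Ofte jeg drømmer mig død LP"))
print(strip_condition_prefix("USED Chevallier Skrog - Rolnická Renesance Cassette"))
# The conflict: a band named Blemish, listed in capitals, loses its artist.
print(strip_condition_prefix("BLEMISH - Some Album LP"))
```

The last call returns the tag BLEMISH and leaves only "Some Album LP", which is exactly the kind of rule conflict described above.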

Machine Learning to the Rescue

Enter machine learning. I’ve looked at a variety of machine learning models that might be useful. The problem is, of course, having enough data to train them.

Attempt 1 – The Fail with Vector-to-Vector Models

My first attempts at this started with using vector-to-vector algorithms. These are a class of algorithms that take an input vector and map it to an output vector. Specifically, I attempted to use sequence-to-sequence, which is often used for translation from one language to another. That is basically what I want: take an input and output a new vector that has the information I need in a standard format (like comma-separated).

However, throughout my trials, I never had very high success rates. I think I just didn’t have enough training data to truly make it work. And as artist names and album titles often have very unique words, especially in metal, there wasn’t a lot of overlap in the training data. As the algorithm encountered new words it had never seen, it didn’t seem to know what to do with them. Creating a training dataset appeared to take more effort than just manually fixing parsing rules.

Attempt 2 – The Win(ish) with NER and Classification

NER – Locating Text Within Unstructured Text

After some more research, I came across an NER (Named Entity Recognition) algorithm. This algorithm seeks to locate named entities in unstructured text and categorize them into predefined categories. This looked like a pretty good option.

Basically, the algorithm tries to locate a piece of information within the unstructured text based on the patterns it has learned from the training data. This means it doesn’t necessarily care about the meaning of the information or try to transform it; it just needs to locate it. AND that is one BIG caveat! The output MUST exist in the input string, so if the input string has a typo, like Mutiilation instead of Mütiilation, then the algorithm is NOT going to be able to transform that. But this is a limitation I think we can deal with.

I created a dataset with training data for each of my vendors with a sample of different formatting. The training data looks something like:

description,artist,album,format,extra
Asinhell - Impii Hora - Crimson Red Marbled Vinyl - EU Import Variant,Asinhell,Impii Hora,Vinyl,Crimson Red Marbled Vinyl - EU Import Variant
MORBID ANGEL - GATEWAYS TO ANNIHILATION - FDR RED/BLACK MARBLE VINYL LP,Morbid Angel,Gateways To Annihilation,Vinyl,FDR RED/BLACK MARBLE VINYL LP
Cân Bardd - The Last Rain - 2LP Green/Black Swirl Vinyl DLP,Cân Bardd,The Last Rain,Vinyl,2LP Green/Black Swirl Vinyl DLP
UNDERGANG – Til Døden Os Skiller- Slime Green Vinyl LP,Undergang,Til Døden Os Skiller,Vinyl,Slime Green Vinyl LP
Mastodon - Leviathan - Gold Nugget Vinyl LP,Mastodon,Leviathan,Vinyl,Gold Nugget Vinyl LP
CHILDREN OF BODOM - HEXED - (180 GRAM INDIE EXCLUSIVE VINYL LP),CHILDREN OF BODOM,HEXED,Vinyl,(180 GRAM INDIE EXCLUSIVE VINYL LP)
KRISIUN - Apocalyptic Revelations - LIMITED EDITION TRANSPARENT RED VINYL,Krisiun,Apocalyptic Revelations,Vinyl,LIMITED EDITION TRANSPARENT RED VINYL
SUFFOCATION - Pierced from Within - LTD Splatter Vinyl LP,Suffocation,Pierced from Within,Vinyl,LTD Splatter Vinyl LP
Cannibal Corpse ‎- Gallery of Suicide - Bone/Red Splatter Vinyl LP - Import,Cannibal Corpse,Gallery Of Suicide,Vinyl,Bone/Red Splatter Vinyl LP - Import

(You can find all the sample data on my GitHub repository.)
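Span-based NER trainers generally want character offsets rather than the raw strings, so rows like these have to be turned into (start, end, label) spans first. The following is a sketch of that conversion under some assumptions: the helper names locate and annotate are hypothetical, matching is case-insensitive because the labels above are often re-cased, and the two CSV rows are copied from the sample. It is not the training script from the repository.

```python
import csv
import io
import re

SAMPLE_CSV = """description,artist,album,format,extra
Mastodon - Leviathan - Gold Nugget Vinyl LP,Mastodon,Leviathan,Vinyl,Gold Nugget Vinyl LP
MORBID ANGEL - GATEWAYS TO ANNIHILATION - FDR RED/BLACK MARBLE VINYL LP,Morbid Angel,Gateways To Annihilation,Vinyl,FDR RED/BLACK MARBLE VINYL LP
"""


def locate(text, value, label):
    """Return (start, end, label) for value inside text, or None if absent."""
    if not value:
        return None
    match = re.search(re.escape(value), text, flags=re.IGNORECASE)
    return (match.start(), match.end(), label) if match else None


def annotate(row):
    """Map one CSV row to its description and a span per labeled field."""
    text = row["description"]
    spans = {col: locate(text, row[col], col.upper())
             for col in ("artist", "album", "extra")}
    return text, spans


annotations = [annotate(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for text, spans in annotations:
    print(text, spans)

# The caveat in practice: a label that is not literally in the input
# cannot be turned into a span at all.
print(locate("Mutiilation - Destroy Your Life for Satan", "Mütiilation", "ARTIST"))
```

The last call returns None, which is the "output MUST exist in the input" limitation showing up at data-preparation time.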

I then used NER on my training data to create three separate models:

  1. One to locate the artist name
  2. One to locate the album name
  3. One to locate the extra information

I also initially tried to create a model for the format. However, I want to categorize the products into one of three formats: CD, Cassette, or Vinyl. The distributors that I scrape from use all different words to describe their formats, like LP, 2LP, Digipak, and so on, and since the NER algorithm doesn’t categorize well, I decided to skip that for now.

The NER training program is quite simple. I used spaCy, a Python library for natural language processing with a built-in NER component. Training is quick, and with minimal data, I was able to start getting some good results.

You can find the training script on my GitHub repository.

One problem with this specific NER implementation is that during prediction, we don’t get back a confidence value. Having a confidence value would be helpful to diagnose any special cases where the model hasn’t been trained.

Format Classification

So at this point, I can throw my input at my NER models and get back an artist, album, and extra information pretty reliably! So, let’s go back to the format categorization. As I mentioned earlier, NER was not ideal for that scenario. But a typical classification algorithm is!

Using scikit-learn in conjunction with xgboost, which is a gradient boosting library, proved to be able to handle the classification of the format quite well based on the descriptions. The program processes the input text data, reduces its dimensionality, and applies a gradient-boosted classification model to predict the format labels. It is both reliable and quick.
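A rough sketch of that kind of pipeline follows; it is not the script from the repository. Several things here are assumptions: TF-IDF over character n-grams as the text processing step, TruncatedSVD as the dimensionality reduction, and scikit-learn's own GradientBoostingClassifier standing in for xgboost so the example needs only one library. The training rows are titles from this post, with formats hand-assigned from the visible format word where the sample data did not already give one.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; far too small for real use.
TRAIN = [
    ("At The Gates - The Nightmare Of Being LP", "Vinyl"),
    ('VLAD TEPES - War Funeral March 12" LP', "Vinyl"),
    ("Mastodon - Leviathan - Gold Nugget Vinyl LP", "Vinyl"),
    ("Cân Bardd - The Last Rain - 2LP Green/Black Swirl Vinyl DLP", "Vinyl"),
    ("SUFFOCATION - Pierced from Within - LTD Splatter Vinyl LP", "Vinyl"),
    ("BLEMISH / USED - Afsky - Ofte jeg drømmer mig død LP", "Vinyl"),
    ("USED - Fír - Dodemanswoud Cassette", "Cassette"),
    ("USED Chevallier Skrog - Rolnická Renesance Cassette", "Cassette"),
    ("Naxen - Towards The Tomb Of Times CD", "CD"),
]


def train_format_classifier(rows):
    """Fit a text -> format classifier on (description, format) pairs."""
    texts = [text for text, _ in rows]
    labels = [label for _, label in rows]
    model = make_pipeline(
        # Character n-grams cope with tokens like "2LP" or '12"'.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        # Reduce the sparse TF-IDF matrix to a few dense components.
        TruncatedSVD(n_components=5, random_state=0),
        # Stand-in for xgboost: scikit-learn's gradient boosting.
        GradientBoostingClassifier(n_estimators=25, random_state=0),
    )
    model.fit(texts, labels)
    return model


model = train_format_classifier(TRAIN)
print(model.predict(["KRISIUN - Apocalyptic Revelations - LIMITED EDITION TRANSPARENT RED VINYL"]))
```

With this little data the prediction is not meaningful; the point is only the shape of the pipeline. Unlike NER, the classifier is free to map many different surface words (LP, 2LP, Digipak) onto one of a fixed set of labels.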

Results

I have had this setup running for several weeks now, and overall, it has been quite successful. I haven’t yet had to add any new training data or special rules.

For future improvements, I would look at using an NER algorithm that returns a confidence value during prediction, and would also try to do better matching of artists and albums through something like MusicBrainz, to help with variations in names.

You can find all the training programs in the GitHub repository along with my training data and validation data. The predictor has also been integrated into the parser code.

Conclusion

Switching to machine learning for parsing unstructured text has significantly improved the reliability and maintainability of my data extraction process. While challenges remain, this approach offers a scalable solution to the complexities of parsing diverse and inconsistent data. Using a mix of models and algorithms for different types of parsing problems was the key to solving my parsing issues.