The Joys of Parsing: Using Machine Learning for Data Extraction

Regular Expressions. I love ’em and use them all the time. But sometimes they become too fragile. So what then? Well, maybe you write a nice grammar and do some proper parsing. That is a great solution IF your input actually conforms. But what are better options when your input is freeform, inconsistent, and full of mistakes?

How do you deal with the Wild West of parsing? That is a problem I have been struggling with. This post looks at my journey from parsing with regular expressions to investigating various machine learning algorithms for more reliable data extraction.

The Problem – Regular Expressions and Their Limits

I run a simple website called the “US Black Metal Distributors Index.” It has several web scraper backends for various US Vinyl Record vendors that specialize in Black Metal. The goal is to help you find a vendor selling your favorite Black Metal records in the US. The difficulty of finding a US vendor led me to start my own distributor, Sto’Vo’Kor Records.

Scraping websites is pretty straightforward. Some of the distributors have a nice simple API, others don’t, but Beautiful Soup helps in those cases. Getting a list of products isn’t the problem. The problem is trying to interpret those products.

For the website to be useful, we need to parse out the artist(s), album, format, price, variants, description, and other information. And we need to normalize this data so it is consistent across all the different vendors.

Let’s look at what some of this data looks like and the challenges in parsing it. We’ll start off with Sto’Vo’Kor. Since this is my distributor, I have a lot more control over how I categorize and name things. My site has an API that returns JSON data that looks like this (I’ve omitted fields we aren’t interested in):

{
    "body_html": "LP<br>Label: Infernat Profundus Records<br>Year of release: 2023<br>Vinyl Color: Black",
    "id": 7292745253071,
    "product_type": "Vinyl",
    "title": "Pa Vesh En - Catacombs",
    "updated_at": "2024-07-13T09:20:23-05:00",
    "variants": [
        {
            "available": true,
            "id": 43144713732303,
            "option1": "Black",
            "price": "28.00",
            "product_id": 7292745253071,
        },
        {
            "available": true,
            "id": 43144713797839,
            "option1": "Clear with Black/White Smoke + Red",
            "price": "31.00",
            "product_id": 7292745253071,
        }
    ],
    "vendor": "Sto'Vo'Kor Records"
}

That’s pretty nice. We have a product named “Pa Vesh En - Catacombs.” It’s a vinyl record, and there are two variants, one in Black and one in Clear with Black/White Smoke + Red. We can easily get the price of each variant. To extract our artist and album, we don’t even need a regex; we can just split on the ' - ':

artist, album = obj['title'].split(' - ')

Easy peasy. Let’s look at another title: Wampyric Rites - Demo I - Demo II. Uh-oh, we can’t split directly on a - anymore, as there is now an additional dash in the album title. So, we probably need to move to a smarter split:

artist, album = obj['title'].split(' - ', maxsplit=1)

That works a bit better, but we made a big assumption that any additional - would always be in the album title, not the artist name.

Let’s look at another: Mütiilation - Destroy Your Life for Satan (10" LP). Dang it! Part of the format has now shown up in the title, so we have to try and remove that, which means additional parsing on just the album title. But we can’t just remove everything in the parentheses, because we also have titles like: Mahr - Soulmare (I & II). So now we need special exceptions and can’t really do this in a generic way.

Let’s take a look at some other titles we need to parse from other vendors.

Here are some from Metal to the Core 1986:

VLAD TEPES - War Funeral March 12" LP
TENEBRAE – Serenades of the Damned 12″ LP

Ok, that doesn’t seem too bad. It is basically {artist} - {album} {type}. We’ll need special cases for all the different types, but there can’t be that many, right?

import re

def split_title_format(rest):
    r = re.compile(r'^(.*?)\s+\d+"\s+LP$')
    if (match := r.match(rest)) is not None:
        return match.group(1), 'Vinyl'

    return rest, None

artist, rest = obj['title'].split(' - ', maxsplit=1)
title, format = split_title_format(rest)

We’ll just keep adding regular expressions as we come across different types. That should be easy with a bit of experimentation.

After testing this, it worked great on the first example, but failed completely on the second one. In fact, it failed on the initial splitting of the title by a -. Wtf?

If you look really, really, really close, you may notice something off about the dash in the second item. That isn’t a standard hyphen (ASCII code 45). That is Unicode character 8211, an “en dash.” Great… Let’s convert some Unicode to regular ASCII before doing anything else:

t = obj['title'].replace('\u2013', '-')  # fold the en dash to a plain hyphen
artist, rest = t.split(' - ', maxsplit=1)
...

Ok, that is better. But we still failed on splitting the title and the format. Ugh, they used Unicode again for the ": that’s character 8243, a “double prime” (″), not a straight quote. Ok, this is going to be a pain. Now we have to look at all possible Unicode characters and try to unify them. Did you know there are at least 14 different types of dashes that all basically look the same but are totally different characters?

I imagine you’re starting to see the issue. However, this can be dealt with. There are Unicode libraries that can help unify these characters, or at least find all the characters of a given type. So, it’s a pain, but we can survive.
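
As a sketch of that idea: every flavor of dash shares the Unicode category 'Pd', so Python’s standard unicodedata module can fold the whole family in one pass. (The quote-like characters listed here are just the handful we’ve hit so far; it’s not an exhaustive mapping.)

import unicodedata

def normalize_punctuation(s):
    # Fold any Unicode dash (category 'Pd': hyphen, en dash,
    # em dash, figure dash, ...) to an ASCII hyphen-minus,
    # and a few prime/quote variants to a plain double quote.
    QUOTES = '″‟“”'  # the variants seen so far; not exhaustive
    return ''.join(
        '-' if unicodedata.category(c) == 'Pd'
        else '"' if c in QUOTES
        else c
        for c in s)

normalize_punctuation('TENEBRAE – Serenades of the Damned 12″ LP')
# -> 'TENEBRAE - Serenades of the Damned 12" LP'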

Let’s keep going and look at some other vendors.

At The Gates - The Nightmare Of Being LP
USED - Fír – Dodemanswoud   Cassette
BLEMISH - Choke - Desiphon 12" (MG EXCLUSIVE)
BLEMISH / USED - Afsky - Ofte jeg drømmer mig død LP
USED Chevallier Skrog — Rolnická Renesance Cassette
Naxen ‎– Towards The Tomb Of Times CD
EXCLUSIVE - Atræ Bilis - Aumicide LP Color 2024 Press
Anatomia / Druid Lord - Split 7‟

Let’s see what we have. The ideal form is {artist} - {album} {format}, but very few of these are consistent with it. Ignoring the mix of Unicode and ASCII characters and the extra spaces, we also seem to have some information about the album at the beginning, and sometimes additional information at the end. And it isn’t consistent: sometimes the information at the beginning comes before a -, other times it is just part of the artist. So, more and more special cases. We’ll probably have to hard-code expressions for terms like USED, BLEMISH, and EXCLUSIVE, and hope they don’t introduce too many more.

And this is where the parsing project for this site sat for a long time. Each distributor that I parse has a huge list of rules. Rules that often fail. Rules that have to be updated constantly. Almost every single week I had to update rules for a distributor to keep it parsing. And often the new rules conflicted with the old ones (like when there is a band called Blemish!).

Machine Learning to the Rescue

Enter machine learning. I’ve looked at a variety of machine learning models that might be useful. The problem is, of course, having enough data to train them.

Attempt 1 – The Fail with Vector-to-Vector Models

My first attempts at this started with using vector-to-vector algorithms. These are a class of algorithms that take an input vector and map it to an output vector. Specifically, I attempted to use sequence-to-sequence, which is often used for translation from one language to another. That is basically what I want: take an input and output a new vector that has the information I need in a standard format (like comma-separated).

However, throughout my trials, I never had very high success rates. I think I just didn’t have enough training data to truly make it work. And as artist names and album titles often have very unique words, especially in metal, there wasn’t a lot of overlap in the training data. As the algorithm encountered new words it had never seen, it didn’t seem to know what to do with them. Creating a training dataset appeared to take more effort than just manually fixing parsing rules.

Attempt 2 – The Win(ish) with NER and Classification

NER – Locating Text Within Unstructured Text

After some more research, I came across an NER (Named Entity Recognition) algorithm. This algorithm seeks to locate named entities in unstructured text and categorize them into predefined categories. This looked like a pretty good option.

Basically, the algorithm tries to locate a piece of information within the unstructured text based on the patterns it has learned from the training data. This means it doesn’t necessarily care about the meaning of the information or try to transform it; it just needs to locate it. AND that is one BIG caveat! The output MUST exist in the input string, so if the input string has a typo, like Mutiilation instead of Mütiilation, then the algorithm is NOT going to be able to transform that. But this is a limitation I think we can deal with.

I created a dataset with training data for each of my vendors with a sample of different formatting. The training data looks something like:

description,artist,album,format,extra
Asinhell - Impii Hora - Crimson Red Marbled Vinyl - EU Import Variant,Asinhell,Impii Hora,Vinyl,Crimson Red Marbled Vinyl - EU Import Variant
MORBID ANGEL - GATEWAYS TO ANNIHILATION - FDR RED/BLACK MARBLE VINYL LP,Morbid Angel,Gateways To Annihilation,Vinyl,FDR RED/BLACK MARBLE VINYL LP
Cân Bardd - The Last Rain - 2LP Green/Black Swirl Vinyl DLP,Cân Bardd,The Last Rain,Vinyl,2LP Green/Black Swirl Vinyl DLP
UNDERGANG – Til Døden Os Skiller- Slime Green Vinyl LP,Undergang,Til Døden Os Skiller,Vinyl,Slime Green Vinyl LP
Mastodon - Leviathan - Gold Nugget Vinyl LP,Mastodon,Leviathan,Vinyl,Gold Nugget Vinyl LP
CHILDREN OF BODOM - HEXED - (180 GRAM INDIE EXCLUSIVE VINYL LP),CHILDREN OF BODOM,HEXED,Vinyl,(180 GRAM INDIE EXCLUSIVE VINYL LP)
KRISIUN - Apocalyptic Revelations - LIMITED EDITION TRANSPARENT RED VINYL,Krisiun,Apocalyptic Revelations,Vinyl,LIMITED EDITION TRANSPARENT RED VINYL
SUFFOCATION - Pierced from Within - LTD Splatter Vinyl LP,Suffocation,Pierced from Within,Vinyl,LTD Splatter Vinyl LP
Cannibal Corpse ‎- Gallery of Suicide - Bone/Red Splatter Vinyl LP - Import,Cannibal Corpse,Gallery Of Suicide,Vinyl,Bone/Red Splatter Vinyl LP - Import

(You can find all the sample data on my GitHub repository.)

I then used NER on my training data to create three separate models:

  1. One to locate the artist name
  2. One to locate the album name
  3. One to locate the extra information

I also initially tried to create a model for the format. However, I want to categorize the products into one of three formats: CD, Cassette, or Vinyl. The distributors that I scrape from use all different words to describe their formats, like LP, 2LP, Digipak, and so on, and since the NER algorithm doesn’t categorize well, I decided to skip that for now.

The NER training program is quite simple. I used spaCy, a Python library for natural language processing with a built-in NER implementation. Training is quick, and with minimal data, I was able to start getting some good results.
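
To give a feel for it, here is a minimal sketch of spaCy v3 NER training; the label name and the inline examples are illustrative, and the real script builds its examples from the CSV data above:

import spacy
from spacy.training import Example

# Character offsets mark where the artist sits in each title.
TRAIN_DATA = [
    ('Mastodon - Leviathan - Gold Nugget Vinyl LP',
     {'entities': [(0, 8, 'ARTIST')]}),
    ('VLAD TEPES - War Funeral March 12" LP',
     {'entities': [(0, 10, 'ARTIST')]}),
]

nlp = spacy.blank('en')
ner = nlp.add_pipe('ner')
ner.add_label('ARTIST')

optimizer = nlp.initialize()
for epoch in range(30):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

nlp.to_disk('artist-model')

Prediction is then just loading the model with spacy.load() and reading the located span back out of doc.ents.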

You can find the training script on my GitHub repository.

One problem with this specific NER implementation is that we don’t get a confidence value back during prediction. Having a confidence value would help diagnose the special cases the model hasn’t been trained on.

Format Classification

So at this point, I can throw my input at my NER models and get back an artist, album, and extra information pretty reliably! So, let’s go back to the format categorization. As I mentioned earlier, NER was not ideal for that scenario. But a typical classification algorithm is!

Using scikit-learn in conjunction with xgboost, a gradient boosting library, proved able to handle the format classification quite well based on the descriptions. The program processes the input text data, reduces its dimensionality, and applies a gradient-boosted model to predict the format labels. It is both reliable and quick.
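
The real program is in the repository, but the rough shape of that pipeline looks something like this (a sketch; the vectorizer settings and the toy dataset are my own stand-ins, not the actual training data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Toy training data: description -> format class
# (0 = CD, 1 = Cassette, 2 = Vinyl)
descriptions = [
    'Gold Nugget Vinyl LP', '2LP Green/Black Swirl Vinyl DLP',
    'Limited Digipak CD', '6 Panel Digipak CD',
    'Pro-Tape Cassette', 'Cassette with Slipcase',
]
labels = [2, 2, 0, 0, 1, 1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))),
    ('svd', TruncatedSVD(n_components=5)),  # reduce dimensionality
    ('xgb', XGBClassifier()),
])
pipeline.fit(descriptions, labels)

print(pipeline.predict(['Slime Green Vinyl LP']))  # e.g. [2]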

Results

I have had this setup running for several weeks now, and overall, it has been quite successful. I haven’t yet had to add any new training data or special rules.

For future improvements, I would look at using an NER implementation that returns a confidence value during prediction, and I would also try to do better matching of artists and albums through something like MusicBrainz, to help with variations in names.

You can find all the training programs in the GitHub repository along with my training data and validation data. The predictor has also been integrated into the parser code.

Conclusion

Switching to machine learning for parsing unstructured text has significantly improved the reliability and maintainability of my data extraction process. While challenges remain, this approach offers a scalable solution to the complexities of parsing diverse and inconsistent data. Using a mix of models and algorithms for different types of parsing problems was the key to solving my parsing issues.

Go Raleigh + BusTime: Elevating Your Transit Experience 🚌🚀

Hey there transit enthusiasts, it’s been quite the ride this summer in Raleigh, NC, as the Go Raleigh transit agency decided to change things up in the real-time bus tracking game. They swapped out their trusty old TransLoc system for Clever Devices’ BusTime. As a result, Go Raleigh riders had to endure a brief hiatus in their real-time bus tracking experience.

But, guess what? I’m back in the driver’s seat, and after some serious detective work, I’m thrilled to announce that BusTime is now officially in the family of supported backends for the Go Transit App’s impressive Montclair system! 🎉

BusTime isn’t just any newcomer; it’s a proven player in the transit world, trusted by major agencies like the Chicago Transit Authority (CTA). In fact, my testing took me on a virtual tour of the Windy City, where CTA’s buses roll with the power of BusTime.

What’s even more exciting? This integration with BusTime paves the way for future expansion, potentially bringing our stellar transit tracking to even more cities down the road!

Now, let’s get technical. While Clever Devices’ website may hide some of BusTime’s API details, fear not! CTA comes to the rescue with an excellent Developer Page offering access to API Keys and comprehensive API Documentation.

But wait, there’s more! Every feature you’ve come to rely on with Go Transit is fully supported through the BusTime API. This is especially good news for the residents of Raleigh, as it happens to be the city with the highest ridership.

As of the latest update, the Open Source software that powers Go Transit (Montclair) now boasts support for a diverse range of backends, ensuring a smooth ride:

  • Availtec
  • BusTime
  • GTFS-RT (Beta)
  • RouteShout
  • RouteShout v2
  • Transloc
  • Transloc v3

This comprehensive coverage guarantees that you’ll never miss your next bus, no matter which backend you prefer. So, whether you’re a loyal Go Raleigh rider or a transit aficionado eager to explore the wonders of BusTime, we’re here to meet your transit needs.

Stay tuned for more updates as we continue to enhance your transit experience. Thank you for your patience and loyalty – we look forward to serving you better than ever before! 🌟🚌🌟

Grooving with Camp Counselor: Your Bandcamp Buddy

Ah, the sweet siren call of new music! If there’s one thing I’m always on the prowl for, it’s fresh tunes to tickle my eardrums. I’m practically the Bandcamp connoisseur, the aficionado of audio, if you will. My collection’s like a treasure trove of musical gems, all handpicked from the vast Bandcamp galaxy. But there’s a twist to this tale – my wishlist isn’t just a shopping list; it’s a roadmap to musical exploration.

Picture this: thousands of albums languishing in my wishlist, each one a potential masterpiece. It’s a labyrinth of sound, and in my quest for auditory bliss, I’ve encountered a conundrum. How do I keep track of what I’ve listened to and what piqued my interest? You see, I’m not just a ‘skip and forget’ kind of listener. Even if an album doesn’t quite hit the mark, it lingers on my wishlist, a musical enigma begging to be solved. But here’s the kicker – without notes, it’s like trying to recall the lyrics of a song from a dream. Frustrating, right?

So, in the spirit of ‘necessity is the mother of invention,’ I decided to craft a digital companion to my Bandcamp odyssey. Behold, Camp Counselor! This nifty tool syncs your Bandcamp wishlist and purchase history into a cozy database, giving you the power to annotate, star, and sort your albums with reckless abandon.

Now, I know what you’re thinking: “Why not just leave a public review on Bandcamp?” Well, my friend, sometimes the musical journey is a private affair, a secret rendezvous between you and your headphones. That’s where Camp Counselor shines. It’s your personal backstage pass to the music world. You can leave your thoughts, ratings, and quirks in blissful secrecy.

Now, here’s the cherry on top – I wanted to break the mold, to venture into new territories. This wasn’t just about crafting a web app; it was about exploring uncharted sonic realms. So, Camp Counselor is tailored for my trusty Pinephone, my loyal companion on my musical journeys. I built it with Vala and Gtk4 for Linux, complete with a slick and responsive design. As for Mac and Windows users, well, I haven’t given it a whirl, and truth be told, I’m dancing to a different beat.

But wait, there’s more! In the future, I might just sprinkle some more magic into Camp Counselor:

  • Imagine the sheer delight of previewing albums right within the app, eliminating the need to hop between tabs.
  • Ever wish you could magically add new albums from artists you adore to your wishlist? Well, I’m conjuring up just that.
  • Who knows? Camp Counselor might just evolve to embrace the vast world of musical data beyond Bandcamp.

So there you have it, folks – Camp Counselor, your trusty sidekick in the realm of Bandcamp. It’s time to dive into the sonic sea, navigate the waves of melodies, and rediscover your musical journey, one note at a time. Happy listening! 🎵🎉

Upcoming Max Transit Changes

Max Transit has recently released their plans for upcoming changes to Birmingham, AL’s transit system that will go into effect in mid to late May. The plan primarily consists of two parts:

  1. Using the new BRT (Bus Rapid Transit) stations as hubs to begin moving away from the central hub and spoke model that has been out of date for decades.
  2. Using “On-Demand” service for areas of low ridership.

Why is this happening?

There appear to be three major reasons for these changes.

First, Birmingham’s transit system has used a hub and spoke model for 60+ years. The original assumption was that everyone from the edges of Birmingham would want to come to Central Station. With changes in population, this is no longer an accurate representation of how riders want to utilize public transportation.

Several other hub and spoke transportation systems in US cities have already reconfigured, or are planning to reconfigure, into a grid-like system. A grid system has several advantages, including higher frequency, multiple paths to your destination, and better coverage. However, it comes with the potential downside of needing to transfer rather than having a direct route. Transferring is really only an issue if the frequency of buses is poor.

With the completion of the East-West BRT (Bus Rapid Transit) system last year, Birmingham now has several stations or hubs that can be utilized. This is allowing Max to begin reconfiguring their network into a more grid-like system, with the BRT as its backbone.

In my opinion, this is the correct direction to be heading in, and I am excited about realignments that focus on connecting with the BRT.

Second, as seems to always be the case in Birmingham, Max is severely underfunded. This means trying to use what little resources they have as efficiently as possible. Unfortunately, this means removing duplication and eliminating areas of low usage, so you can allocate more resources to areas of high demand.

Third, there is a nationwide shortage of bus operators, mechanics, and other related workers. During the meeting, Director Charlotte Shaw said that there is an 80k-100k worker shortage in public transit nationally. Max currently has 30% fewer operators than it needs to operate the system! This means drivers are working overtime and/or trips on routes have to be canceled daily. In fact, if you subscribe to Max’s alert system, you will get messages like those below throughout the day about canceled trips, likely because of a lack of operators.

Messages of Canceled Runs

Bus operators must have a CDL (commercial drivers license), and finding operators to hire has been difficult for Max since COVID.

One advantage of the On Demand service is that the vehicle is a standard minivan and does not require a CDL, which means it is easier to find operators.

Route Realignments

First, let’s look at routes that are being modified to use the BRT stations as hubs:

17 – Eastwood

This is a map of the current route 17 (Green) and the BRT (Purple). Currently, the 17 meets up with the BRT at the east Woodlawn station, but continues to Central Station downtown on the 3rd/4th Ave S/US 78 corridor.

Existing Route 17 and BRT

The upcoming changes have the 17 terminating at the BRT east Woodlawn station. Riders that want to continue downtown must now transfer to the BRT.

Modified Route 17

The Route 17 is significantly shorter and eliminates duplication. Based upon the proposed timetable, it is a 35-minute loop with a 10-minute buffer at the Woodlawn transit center. This allows the transit agency to run a single bus with 45-minute frequency, whereas previously, frequency was at best 55 minutes. With the updated route, the bus will run the route 21 times a day versus the current 19.

Riders will have to transfer to the BRT, and depending upon that transfer time, the trip to Central Station should take roughly as long as before, as the BRT is an express bus rather than a local.

The biggest effect is that the 17 no longer provides service down US 78 (3rd/4th Av S). To compensate for this, the 25 will be realigned to no longer duplicate the BRT down 1st Av N, but instead take US 78.

Route 25 – Center Point

This is a map of the current route 25 (Red) and the BRT (Purple).

Existing Route 25 and BRT

With the realignment, the Route 25 will still meet the BRT at the east Woodlawn station, but will then follow the previous Route 17 down US 78 to Central Station.

Modified Route 25

The Route 25 currently has two buses servicing it in the morning and evening with 45-minute frequency, but only a single bus during midday with 1 hour and 30 minute frequency. According to the modified schedule, the Route 25 will have two buses servicing it throughout the entire day, giving it a constant 45-minute frequency and 6 extra trips a day.

This is a good thing for Center Point residents; however, overall this is a loss for some of the current riders of the 17. Most notably, the 25 does NOT run on the weekends, which means any riders that previously used the 17 between Central Station and Woodlawn will no longer have Saturday service.

Route 48 – South Powderly

Several years ago, during some major cuts, the Route 8 was eliminated and the Route 48 was realigned to combine the 8 and 48. Frequency was also cut back. This is a map of the current Route 48 (Red) and BRT (Purple).

Existing Route 48 and BRT

As you can see, there is quite a bit of overlap between downtown Central Station and the BRT. With the realignment, the 48 will no longer go to Central Station, and instead will use the 6th Av S / Goldwire BRT station as a hub. The rest of the route will remain the same.

Modified Route 48

According to the proposed timetable, the Route 48 will now have a frequency of 1 hour, instead of 1 hour 25 minutes.

Route 7 – Fairfield

This is a new route that utilizes the BRT’s West Crossplex Transit Station as a hub, services Fairfield, and connects with the Route 5 – Ensley/Wylam/Fairfield.

New Route 7

At this point, the route is somewhat limited: it only runs on weekdays with a frequency of 1 hour, and only in the morning (6am – 9am) and afternoon (3:30pm – 6:30pm). The route also has a 10-minute buffer at the Crossplex Transit Station to help with connections to the BRT.

On Demand Changes / Route Eliminations

For those not familiar, the city of Birmingham entered a partnership with Via several years ago to provide On Demand micro transit. This allows riders to use an app similar to Uber/Lyft to schedule a pickup. A 5-person van picks up riders along the same route and drops them off.

However, this is not a door-to-door private service like Uber/Lyft. You are typically asked to walk around 1-2 blocks to a specific corner/side of the street. This helps optimize the route, so the vehicle isn’t pulling U-turns or changing directions. It is possible that you will get the van to yourself, but if there are other requests along the same path, those riders may be picked up or dropped off along the way.

Originally, this was paid for and managed by the City of Birmingham. Recently, the city of Birmingham has been working in partnership with Max Transit to promote the service as a unified solution to transit. This is a step in the right direction, but eventually, Max Transit should be the owner of this rather than the city of Birmingham.

With constant budget cuts comes more route eliminations. The changes presented eliminate Route 12 – Highland Park, Route 18 – Fountain Heights, and Route 43 – Birmingham Zoo.

Route 12, Route 18, and Route 43 will all be eliminated!

These neighborhoods will no longer be serviced by any fixed route bus service. Instead, Max is proposing extending the zone of the Birmingham On-Demand service to include these areas.

In my opinion, this is a huge loss for these neighborhoods. Highland Park is the original streetcar neighborhood in Birmingham and was designed around the #12 streetcar. Several years ago, when there were major route cuts due to funding, Highland Park lost the #44 and was left with only the #12. The #12 was re-routed to attempt to cover parts of both the #12 and #44, but instead it ended up ruining the route and costing ridership.

While I am not against On Demand Micro Transit, I have concerns about how it is currently being implemented. Micro Transit is not a replacement for fixed route service. Nor should Micro Transit be a way to bypass fixed route service.

If you look at many of the fixed route bus lines, you can see that they are rarely straight. Instead, they meander all over the place to pick up a few riders here or service a business over there. When there are budget cuts, routes get combined, causing them to cover even more area.

This slows down the bus routes and leaves us with long run times and poor frequency, making for a frustrating experience. What should be a 15 minute bus ride becomes 35 minutes due to it not being direct.

Micro Transit can help with this problem. It allows the transit agency to realign and straighten the route. The bus route no longer has to go out of its way, into a low density neighborhood, to possibly pick up 1 or 2 passengers. On Demand Micro-Transit can instead pick up those 1 or 2 passengers and bring them to the bus stop. It solves a critical first-mile/last-mile issue, especially for those who can’t or won’t walk longer distances or ride a bicycle.

This can give everyone a better experience. The bus is quicker. The bus has better frequency. And most importantly, we don’t leave any riders behind.

Unfortunately, this is not how Birmingham’s On-Demand Micro Transit is currently being implemented. Even though it is now a “partnership” with Max, it is still a separate service, with a separate fare and no ability to transfer between fixed route and on-demand services.

Using it as a replacement, or an excuse to replace fixed route service, is not a good solution. I really hope the implementation of micro transit changes in the future to require that either the pickup location or the destination be a transit station, and I sincerely hope Max considers bringing back fixed route service in both Highland Park and Fountain Heights.

Questions Left Unanswered

The public meeting about these changes left several questions unanswered. If I learn any more details about these questions, I will update this page.

Payment Systems + Transfers?

The Fixed Route Buses, the BRT, and On Demand all have different payment/fare systems and are not compatible with each other. This brings up a major issue for riders of the 17, 7, and 48, as they are now expected to transfer from fixed route to the BRT.

However, at this time, it is not possible to use your Fixed Route fare (a single ride ticket, daily ticket, or monthly ticket) on the BRT. You have to purchase a separate BRT fare card. This means that a trip from Eastwood to downtown now essentially costs $3.00 instead of $1.50!

Max leadership indicated that they are currently looking into unifying the system between Fixed Route and the BRT, but it is not clear if it will be available by mid May when these changes take effect.

Because the On Demand Micro Transit is paid for by the City of Birmingham instead of Max and is operated by Via, there are currently no plans to integrate fares or transfers between it and the Fixed Route or BRT systems.

Updated On Demand Coverage Areas?

Currently, the On Demand service has limited coverage areas and does not include Highland Park. It stops just short, at St. Vincent’s Hospital.

Max Leadership indicated that the On Demand coverage area would be expanded to include Highland Park, but there has been no confirmation of this at this time.

The material says “Via Daytime zone covers a good percentage of this local fixed route.” However, that is not true: it only covers from St. Vincent’s Hospital to downtown, nothing in the actual Highland Park neighborhood. The online map still shows coverage of Highland Park lacking:

On Demand coverage of Highland Park

With the changes to the 17, many people between South Side and Woodlawn will now have to utilize the 25. However, the 25 doesn’t run on Saturday. Unfortunately, it does not appear that On Demand coverage will be expanded for this area either, leaving all these residents without any weekend service.

On Demand Coverage of the Route 17

On Demand Hours?

Currently, On Demand doesn’t begin until 6am. Both the 12 and the 18 began service at 5:30am. Some riders were concerned they would no longer be able to make their connections at Central Station.

No Rider Left Behind?

Leadership did take each rider’s concerns very seriously and is looking to make this transition as easy as possible. To that end, they are running a program for a few months during the transition called “No Rider Left Behind.” Any users affected by these route changes will be able to call a support number and have a van pick them up.


For additional information about these changes, please see Max Transit’s Web Site.

Converting Bandcamp Email Updates to an RSS Feed

I love music. I like to closely track bands and labels to learn about upcoming releases. Unfortunately, the type of music I listen to tends to stay more underground, so there isn’t any mainstream coverage of it, which just means I get to do all the work myself.

Fortunately, a majority of the bands and labels I am interested in are on Bandcamp, which makes it pretty easy to subscribe to artists and labels and get updates from them directly. Unfortunately, this all comes through email. While it is easy enough to filter to avoid cluttering my inbox, email isn’t my desired place for this information.

I run a private FreshRSS instance and curate several hundred RSS feeds. It syncs nicely between my various devices and is just a great way to consume content. My goal is to get all my music updates from Bandcamp here, so that they’d be in the same place as the music reviews and other underground music blogs I follow.

Bandcamp doesn’t provide any RSS feeds. There are some options, like using RSSHub and its Bandcamp addon. However, many of the updates that come through email are private and don’t show up in an artist’s or label’s public feed.

I decided I’d go a different route and just convert the emails I was getting into an RSS feed directly.

Step 1 – imapfilter

I already use imapfilter heavily for filtering and tagging my emails. I decided I could use it to pipe the Bandcamp emails to a custom script that would convert them into RSS entries. Here is the relevant section:

function filter_bandcamp(messages)
   -- Grab anything from Bandcamp where the subject
   --  starts with "New". This indicates a message
   --  or update rather than a receipt.
   -- I am taking these "mailing list" bandcamp
   --  messages and converting them to an rss feed.
   results = messages:contain_from('noreply@bandcamp.com') *
             messages:contain_subject('New')

   for _, mesg in ipairs(results) do
      mbox, uid = table.unpack(mesg)
      text = mbox[uid]:fetch_message()
      pipe_to('/opt/email2rss/email2rss', text)
   end

   -- delete the bandcamp messages
   results:delete_messages()

   return messages;
end

You’ll notice I don’t want all messages from “noreply@bandcamp.com”, since that would also include things like purchases and system notifications.

Step 2 – email2rss script

The email2rss script is a Python program. I am using feedgen to generate the feed, and built-in Python libraries to parse the emails.

One issue you’ll notice immediately is that this script runs once for every email message. For RSS, we want a continuous feed with all the entries, which means we have to insert each new entry into an existing file and have some persistence. The quickest/dirtiest method was to use Python’s built-in pickle to serialize and deserialize the whole state. That way, I can quickly load the previous state, create a new entry, write out the RSS file, then serialize and save the state back to disk.

Here is the program in its entirety:

#!/usr/bin/env python3
#

import sys
import email.parser
import datetime
import pickle
import re
import os
from feedgen.feed import FeedGenerator

DATADIR=os.path.join('/', 'opt', 'email2rss', 'data')

def default_feed():
  fg = FeedGenerator()
  fg.id('https://feeds.line72.net/')
  fg.title('Bandcamp Updates')
  fg.description('Bandcamp Updates')
  fg.link(href = 'https://feeds.line72.net')
  fg.language('en')

  return fg

def get_feedgenerator():
  try:
    with open(os.path.join(DATADIR, 'feed.obj'), 'rb') as f:
      return pickle.load(f)
  except IOError:
    return default_feed()

def save_feedgenerator(fg):
  with open(os.path.join(DATADIR, 'feed.obj'), 'wb') as f:
    pickle.dump(fg, f)

def add_item(fg, msg, content):
  msg_id_header = msg.get('Message-ID')
  msg_id = re.match(r'^\<(.*?)@.*$', msg_id_header).group(1)
  sender = msg.get('From')
  subject = msg.get('Subject')

  fe = fg.add_entry()
  fe.id(f'https://feeds.line72.net/{msg_id}')
  fe.title(subject)
  fe.author(name = 'Bandcamp', email = 'noreply@bandcamp.com')
  fe.pubDate(datetime.datetime.now(datetime.timezone.utc))  # aware UTC timestamp (utcnow() is naive)
  fe.description(subject)
  fe.content(content, type = 'CDATA')

def go():
  fg = get_feedgenerator()

  parser = email.parser.Parser()
  msg = parser.parse(sys.stdin)
  for part in msg.walk():
    if part.get_content_type() == 'text/html':
      add_item(fg, msg, part.get_payload())
      break

  fg.rss_file(os.path.join(DATADIR, 'feed.rss'), pretty = True)

  save_feedgenerator(fg)

if __name__ == '__main__':
  go()

Step 3 – Host the Feed

I have a simple nginx server running that hosts the feed.rss. Then I just add this new feed to my FreshRSS instance.

Future Work

There are still some improvements that could be made to this. The history is going to grow out of control at some point, so I should probably go through and delete old entries, maybe keeping a history of 100 or 500 entries.
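
feedgen keeps its entries in a list, so trimming could be as simple as something like this (an untested sketch; the 500 cutoff is arbitrary and it assumes feedgen’s default prepend order, i.e. newest entries first):

MAX_ENTRIES = 500  # arbitrary cutoff

def trim_feed(fg):
    # Oldest entries sit at the end of the list (assuming the
    # default prepend order), so drop everything past the cutoff.
    for entry in fg.entry()[MAX_ENTRIES:]:
        fg.remove_entry(entry)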

The other possible issue (I haven’t run into it yet) is that email2rss could be run simultaneously. If that happens, one entry will likely be lost. I should probably have a lock around feed.obj to keep a second instance from doing anything until the first has written out the new state.
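
A minimal sketch of that locking, reusing the go() function and DATADIR from the script above and the stdlib’s fcntl (Linux-only; the lock file name is arbitrary):

import fcntl

def go_locked():
    # Hold an exclusive lock across the whole
    # load -> add entry -> save cycle so two concurrent
    # deliveries can't clobber each other's state.
    with open(os.path.join(DATADIR, 'feed.lock'), 'w') as lockf:
        fcntl.flock(lockf, fcntl.LOCK_EX)  # blocks until available
        go()
        # the lock is released when the file is closed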

Fixing the Audio on a Pinebook Pro after resume

I recently purchased a Pinebook Pro, which is a Linux laptop based on the Arm processor. It is a great travel laptop, and surprisingly, I have found this to now be my primary personal device.

Everything has worked well, except one annoying bug: after resuming from sleep, the audio stops working. After digging around, I found that a simple command that unbinds and then rebinds the device resolves it. I incorporated this into a systemd script so that it is automatically run upon resume.

Create a new file called /usr/lib/systemd/system-sleep/audio with the following contents (sudo vim /usr/lib/systemd/system-sleep/audio):

#!/bin/bash
#
# mwd - 20200726
# This is a script to reload the audio module
#  upon resume. (Note: bash, not sh, since the <<<
#  here-string and {un,}bind brace expansion are bashisms.)
case $1 in
    post|resume)
        tee /sys/bus/i2c/drivers/es8316/{un,}bind <<< 1-0011
        ;;
esac

Save this file and make sure it is executable:

sudo chmod +x /usr/lib/systemd/system-sleep/audio

New Bus Icons in Go Transit App

I just released version 1.6.1 of Montclair for all the Go Transit cities. This includes a minor visual change of a new icon for the bus.

Previously, I was using Availtec’s Icon Factory, which lets you specify a color and heading, and generates a .png image of that color with an arrow indicating the direction the bus is going. It has worked well and looks nice enough, but I don’t like depending upon a 3rd party service, especially for cities that don’t use Availtec.

After some deliberation, I decided I wanted to do everything client side. I would have a single SVG icon that the client would load in JavaScript, manipulating the stroke/fill colors and rotating the arrow to the heading. This works quite well, and the result is a nice clear image that we only have to load once!

This code works by loading the SVG through the DOMParser, modifying its attributes, reserializing it, and returning it as a base64-encoded data: URL.

// load the svg
let xml = new DOMParser().parseFromString(svg.data, 'image/svg+xml');
// update the attributes //

// 1. the gradient
// stop1
let stop1 = xml.querySelector('#stop958');
stop1.style.stopColor = '#' + this.color;

...

// 5. The bearing, set its rotation
let bearing = xml.querySelector('#bearing');
bearing.setAttribute('transform', `rotate(${this.heading}, 250, 190)`);

// 6. Serialize and generate a data url with base64 encoded data
let serialized = new XMLSerializer().serializeToString(xml);
const url = 'data:image/svg+xml;base64,' + btoa(serialized);

Here’s the result:

New Bus Icons using custom Icon Factory

Go Transit App Updates

I have released a new version of the Montclair software, 1.6.0, that includes some major improvements to seeing estimated arrivals of buses at stops.

When selecting a stop, the app now goes into a full screen split mode with the estimated arrivals and the map. Clicking on one of the estimated arrivals will show you where that specific bus is relative to your stop and will track the bus until it arrives. This makes seeing your next bus super simple!

This version will automatically roll out to your favorite “Go Transit” city app!

Go Memphis and Go Nashville

I have added two more cities to the Go Transit Apps series, Go Memphis and Go Nashville. Both are available through the web, or in the Android Play Store and Apple App Store.


For a full list of cities and features, see the Go Transit App website.