Piperka blog

Behind the scenes: bookmarking

I updated the bookmarking code on Friday and it looks like that went smoothly. The front end interface is pretty much unchanged, and most of what I did for that part was to touch up the messages. Pretty dull, and it's something that just works. But the back end has been rewritten, clearing out a lot of old cruft and making my life easier all around. The best patches are those that remove the most lines. Since I have nothing new to say about this feature (it still works, just more so), I'll write about something old instead.

As I planned it originally, Piperka wasn't even going to keep a table of all the archive pages of a comic. Piperka got its start as a refinement of my personal comic grabbing scripts; you can find such programs on freshmeat.net, if you like. For that use, I only needed to store the last known page. But add more users, each with a different number of unread pages waiting for them, and the leap to just putting all of a comic's pages in a table wasn't all that big. Using that stored information the other way around, to recognize which comic and which page a URL points to, came naturally after that. It was just too obvious a use to miss. But I didn't exactly plan on having it as a feature.

Truth be told, not much of the initial Piperka was planned. I just started coding and features fell into place. I was surprised to see people trying to feed a comic's home page to the bookmarking code, but it made sense to add support for that.

Piperka stores archive page URLs in three parts. There's a common base containing the protocol ("http://"), the domain and part of the path; a tail, which is something like ".html" or just empty; and a content part, which holds whatever varies from page to page, be it a date, a number or something else. There's other preprocessing too, such as stripping the protocol and any initial "www." from a submitted URL. That would fail if somebody hosted different comics under "www.mycomic.com" and "mycomic.com".
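
To make that concrete, here's a minimal sketch of what that layout could look like, borrowing the table and column names from the query quoted below. The example URL and its split are my own invention, and the real schema surely has more to it:

CREATE TABLE comics (
    cid      integer PRIMARY KEY,
    url_base text NOT NULL,  -- common initial part, e.g. 'mycomic.com/archive/'
    url_tail text NOT NULL   -- common end, e.g. '.html', often just ''
);

CREATE TABLE updates (
    cid  integer REFERENCES comics,
    ord  integer,            -- the page's position in the archive
    name text                -- the variable part, e.g. '2012-10-07'
);

-- Normalization along the lines described above: drop the protocol
-- and a leading "www." before splitting and matching.
SELECT regexp_replace('http://www.mycomic.com/archive/2012-10-07.html',
                      '^http://(www\.)?', '');
-- => 'mycomic.com/archive/2012-10-07.html'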

My first version of bookmarking was as simple as (sorry for the raw SQL):

-- url is the page address the user submitted
SELECT cid, ord FROM comics JOIN updates USING (cid) WHERE url=url_base||name||url_tail

Simple, declarative, and utterly inefficient. PostgreSQL couldn't tell from that what I was really trying to do, and it couldn't use any indices: the right-hand side of the comparison has to be computed anew for every row, so every page of every comic gets visited. Lookup time was around a second. Wholly inadequate, and I soon replaced this first version.
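
Had I thought to ask, EXPLAIN would have spelled the problem out. Something along these lines, with a literal standing in for the submitted URL, shows the sequential scan over the whole join:

EXPLAIN SELECT cid, ord FROM comics JOIN updates USING (cid)
 WHERE 'mycomic.com/archive/2012-10-07.html' = url_base||name||url_tail;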

The next version of the code lived in a Perl module. The biggest change was that it used the domain in the given URL to match the corresponding comic before trying to match the specific page. To deal with multiple comics residing under the same domain, I added special cases to the code, each extending the initial part of the string used to match a comic. I had to update that special case code by hand whenever new comics shared a domain, and it eventually grew to be tens of lines long. Another weakness was that the code couldn't handle comics whose URLs are identical all the way up to the variable part. Try feeding, for example, "http://www.meetmyminion.com/?p=" as a bookmark to see how it currently handles that case.

I thought, at that point, that I'd need some persistent data structure to store the initial parts, automatically separated to tell apart comics with similar domains and initial paths. I wrote a daemon to handle bookmark requests. I was never quite happy with that approach, and it had some bugs that I never got around to fixing. Worst of all, it had a habit of sometimes eating all memory until the OOM killer reaped it. I had the web server code restart it if it wasn't running when it was needed, but it was still a problem. And I had to manually tell it to refresh its index whenever I made a change to a comic's entry.

My latest change threw out the daemon code. No more RPC and socket handling; it's all done as PostgreSQL procedures now. No more code maintaining an index in its own process, either: it just uses PostgreSQL's own indices. They can match on the initial part of a string just fine, and the lookup takes a few milliseconds. As usual, it's the approach I should have taken in the first place. I'm not sure whether I would have come up with it back then if I had read PostgreSQL's fine manuals further than just enough to get started. Perhaps, perhaps not. It's been seven years and I'm sure I have developed as a coder along the way. Pretty often the best design choice has been to not start coding, yet.
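
To illustrate the kind of index use this enables (my example, not a transcript of the actual procedures): a btree index declared with text_pattern_ops serves left-anchored LIKE matches, so finding comics by the initial part of a URL becomes a quick range scan instead of a crawl over the whole table.

-- Hypothetical index; text_pattern_ops makes it usable for
-- left-anchored LIKE patterns regardless of locale.
CREATE INDEX comics_url_base_idx ON comics (url_base text_pattern_ops);

-- All comics whose base falls under one domain.
SELECT cid FROM comics WHERE url_base LIKE 'meetmyminion.com/%';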

One other thing that got removed with this change was the code that offered support for using Piperka without registering as a user. That never worked as well as using an account does, and I suspect it has been broken ever since I added the CSRF protection. I don't know if anyone ever used it at all. I'm thinking of adding an option for logging in with OpenID, which would (hopefully) lower the barrier for anyone to try Piperka out. Not that I'd expect that to matter all that much, but it'd still be a nice feature. There'd be no need to come up with a password for Piperka if I added that.

Sun, 07 Oct 2012 18:11:12 UTC