I updated the bookmarking code on Friday and it looks like that went smoothly. The front end interface is pretty much unchanged and most of what I did for that part was to touch up the messages. Pretty dull and it's something that just works. But the back end has been rewritten, fixing up a lot of old cruft and making my life easier all around. The best patches are those that remove the most lines. Since I don't have anything new to tell about this feature (it still works, just more so), I'll write about something old instead.
As I planned it originally, Piperka wasn't even going to keep a table of all the archive pages of a comic. Piperka got its start as a refinement of my personal comic grabbing scripts. You can find such programs on freshmeat.net, if you like. For that use, I only needed to store the last known page. But add more users to that, each with a different number of unread pages waiting for them, and the leap to just putting all the comic's pages in a table wasn't all that big. Using that stored information the other way around, to recognize which comic and which page a given URL belonged to, came naturally after that. It was just too obvious a use to miss. But I didn't exactly plan on having it as a feature.
Truth be told, not much of the initial Piperka was planned. I just started coding and features fell into place. I was surprised to see people trying to feed a comic's home page to the bookmarking code, but it made sense to add that.
Piperka stores archive pages in three parts. There's a common base containing the protocol ("http://"), the domain and a part of the path, and a tail, which is something like ".html" or just empty. The content part between them holds the variable part, be it a date, a number or something else. The matching does other preprocessing too, like stripping out the protocol and any initial "www." from the URL. That would fail if somebody hosted different comics under "www.mycomic.com" and "mycomic.com".
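A minimal sketch of that three-part split, in Python rather than anything Piperka actually runs. The function names normalize and split_page, the handling of "https://", and the example URLs are assumptions for illustration only.

```python
def normalize(url: str) -> str:
    """Strip the protocol and a leading "www." so equivalent URLs compare equal."""
    for scheme in ("http://", "https://"):  # https handling is an assumption
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def split_page(url: str, base: str, tail: str):
    """Return the variable (content) part of url, or None if the URL
    does not have the shape base + content + tail."""
    rest = normalize(url)
    base = normalize(base)
    if not rest.startswith(base):
        return None
    rest = rest[len(base):]
    if tail:
        if not rest.endswith(tail):
            return None
        rest = rest[:-len(tail)]
    return rest
```

With a base of "http://example.com/comics/" and a tail of ".html", split_page("http://www.example.com/comics/2009-05-01.html", ...) would give back "2009-05-01". Because both sides are normalized the same way, "www.mycomic.com" and "mycomic.com" become indistinguishable, which is exactly the failure case mentioned above.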
My first version of bookmarking was as simple as (sorry for the raw SQL):
The next version of the code lived in a Perl module. The biggest update was that it used the domain in the given URL to match with the corresponding comic before trying to match the specific page. To deal with multiple comics residing under the same domain, I added special cases to the code, which would add a bit more to the initial part of the string used to match with a comic. I had to update that special case code by hand whenever I had new comics using the same domains, and it eventually grew to be tens of lines long. Another weakness of the code was that it was unable to handle comics that have identical initial parts up until the variable part. Try feeding, for example, "http://www.meetmyminion.com/?p=" as a bookmark to see how it currently handles that case.
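The domain-first matching with hand-maintained special cases could look roughly like the following. This is a Python sketch of the idea, not the original Perl; the name comic_key, the SHARED_DOMAINS set and the example domains are all hypothetical.

```python
# Hypothetical special cases: on these domains several comics share the
# host, so the first path segment is needed to tell them apart.
SHARED_DOMAINS = {"comichost.example"}


def comic_key(url: str) -> str:
    """Return the string used to look up which comic a URL belongs to."""
    rest = url.split("://", 1)[-1]      # drop the protocol, if any
    if rest.startswith("www."):
        rest = rest[len("www."):]
    domain, _, path = rest.partition("/")
    if domain in SHARED_DOMAINS:
        # Special case: extend the key with a bit more of the path.
        first_segment = path.split("/", 1)[0]
        return f"{domain}/{first_segment}"
    return domain
```

Every new shared host means another entry in the special case list, which is how that kind of code grows by hand. A key built this way also cannot separate two comics whose URLs are identical right up to the variable part.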
I thought, at that point, that I'd need some persistent data structure to store the initial parts, automatically separated to tell apart the comics with similar domains and initial paths. I wrote a daemon to handle bookmark requests. I was never quite happy with that approach, and it had some bugs that I never got around to fixing. Worst of all, it had a habit of sometimes eating all memory until the OOM killer reaped it. I had the web server code restart it if it wasn't running when it was needed, but it was still a problem. And I had to manually tell it to refresh its index whenever I made a change to a comic's entry.
My latest change threw out the daemon code. No more RPC and socket handling; it's all done as PostgreSQL procedures now. No more code maintaining an index in its own process; instead it just uses PostgreSQL's own indices. They can do matches based on the initial part of a string just fine, and they do their job in a few milliseconds. As usual, it's the approach I should have taken in the first place. I'm not sure if I would have come up with it if I had just read PostgreSQL's fine manuals a bit more and not just enough to get started. Perhaps, perhaps not. It's been seven years and I'm sure I have developed as a coder along the way. Pretty often the best design choice has been to not start coding, yet.
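To illustrate why an ordinary sorted index is enough for matching on the initial part of a string, here is a rough Python sketch. It is not the PostgreSQL procedure itself: the class name PrefixIndex, the sample entries, and the use of a sorted list with bisect as a stand-in for a database index are all assumptions.

```python
import bisect


class PrefixIndex:
    """Sorted list of (url, comic) pairs, searchable by initial substring."""

    def __init__(self, entries):
        self._entries = sorted(entries)
        self._keys = [url for url, _ in self._entries]

    def starting_with(self, prefix):
        """Return all entries whose URL begins with prefix.

        In sorted order, every string sharing a prefix sits in one
        contiguous run, so one binary search finds where the run starts.
        """
        i = bisect.bisect_left(self._keys, prefix)
        matches = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            matches.append(self._entries[i])
            i += 1
        return matches
```

The lookup cost is one binary search plus the length of the matching run, with no separate process to keep alive and no index to refresh by hand.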
One other thing which got removed with this change was the code that offered support for using Piperka without registering as a user. That never worked as well as using an account, and I suspect that it has been broken ever since I added the CSRF protection. I don't know if anyone ever used it at all. I'm thinking of adding an option for logging in with OpenID, which would (hopefully) lower the barrier for anyone to try Piperka out. Not that I'd expect that to matter all that much, but it'd still be a nice feature. There'd be no need to come up with a password for Piperka if I added that.
I made something new. Until I think of some other name for it, I'll call it Piperka Reader. Not very original but it was an easy name to pick.
Piperka Reader is a page that embeds comic archives in an iframe, with controls on a bar at the top. There are the usual buttons for going to the first, previous, next and newest page, and a dialog window with a list of all the pages of a comic. The same navigation buttons work for any and all comics listed on Piperka. If you've logged in, it can automatically move your bookmark as you read, or you can set it yourself. When reading comics sequentially, it uses a second iframe to preload the following page in the background, meaning that the next page is already there, ready for viewing, by the time you've read the current page and click next.
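The navigation itself only needs the ordered list of a comic's pages and the current position. A small Python sketch of that logic follows; the real Reader runs in the browser, and the function name reader_targets and the dictionary keys are made up for this example.

```python
def reader_targets(pages, current):
    """Given a comic's ordered page URLs and the index of the page being
    read, return where each navigation button should lead, plus the page
    worth preloading in the background."""
    last = len(pages) - 1
    if not 0 <= current <= last:
        raise IndexError("current page is outside the archive")
    next_page = pages[current + 1] if current < last else None
    return {
        "first": pages[0],
        "previous": pages[current - 1] if current > 0 else None,
        "next": next_page,
        "newest": pages[last],
        # Preload whatever "next" would show, if there is anything.
        "preload": next_page,
    }
```

On the newest page there is nothing left to preload, so both "next" and "preload" come back as None.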
I used parts of the comic's URLs as the page names in the archive dialog. Archive pages' titles would likely be a better choice for that role but as I haven't stored those on Piperka, this'll have to do.
I've labeled Reader as "beta" for now. I'll add more functionality to it yet, and it could use some polish. It's suitable for reading longer stretches of archives, but it'd take a bit more to let it browse daily updates easily, with one or a few unread pages at most. I've developed it using Chrome and I can only hope that it'll work with other browsers too.
If you look closely, you'll find that this means that I've made Piperka's comic index easily downloadable. All the 1546619 pages in it. I don't mind if you access them independently of Reader, but I'd appreciate it if you'd credit the source and let me know if you use them for anything. No guarantee that they continue to be available or that they'd be useful for any particular purpose.
I hope that no comic author minds that I embed their content like this. I'm not trying to pass off whatever they host as mine, or to suggest that they're associated with Piperka. Technically, Piperka itself doesn't access any more content than it did before, it just allows a user to do so in a slightly different manner, but arguing that would be sophistry. Let me know what you think.
I added a quick search box to the top and browse pages. It's not that new a feature anymore, since I added it three weeks ago already. I felt that it was self-evident enough and didn't feel like posting about it on the blog at the time. Now that I've received some feedback, it's high time that I did. Or rather, nobody's said a thing about the quick search itself; the feedback has been about some other changes I made at the same time.
An implementation detail about the quick search box, if you're feeling adventurous: It uses regexps. You can do things like [bc].*ing$ to search for all comics that have "b" or "c" in their name and end with "ing". I'm not going to give you a regexp tutorial but it should be pretty versatile. Also, I added the option to not limit the search results to ten comics. It'll tax your computer a bit but it's there if you want it.
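For a feel of how such a pattern behaves, here is a small Python sketch. The function name quick_search, the sample titles and the case-insensitive matching are assumptions; the actual search box may differ in its details.

```python
import re


def quick_search(pattern, titles, limit=10):
    """Return titles matching the regexp, at most `limit` of them.
    Pass limit=None to get every match."""
    regex = re.compile(pattern, re.IGNORECASE)  # ignoring case is an assumption
    hits = [title for title in titles if regex.search(title)]
    return hits if limit is None else hits[:limit]


titles = ["Being Human", "Catching", "Brewing", "Sing", "Icing", "Cats"]
matches = quick_search(r"[bc].*ing$", titles)
```

Here matches holds "Catching", "Brewing" and "Icing". "Being Human" is left out because the name doesn't end in "ing", and "Sing" because there's no "b" or "c" before the ending.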
Quick search isn't the search function that I've mentioned earlier. It would be something that'd allow stacking and combining search terms, but it's still in the works. This quick search was something I made, well, quickly, and I'm pretty annoyed at myself for not doing it earlier. It really wasn't that big an effort and it's so useful that even I use it.
Another change I made was to remove the subscribed/unsubscribed filters from the browse page. Some people have missed the subscribed filter, but note that the same results are available on your profile page. I even added sort options to that page. The filters were a bit of unnecessary duplication of functionality, and I made the code a bit more straightforward while I was working on it. There's currently no way to get a list of unsubscribed comics, but hopefully that won't be as missed a feature.
Now that I'm writing: there were a couple of hours of unplanned downtime yesterday. Hetzner had some power supply issues at the data center where Piperka's server is located.
I don't have any real big announcements to make this time, but I have a few recent changes and new features that I could introduce now.
Not all changes that I make are visible to users. Like finding a way to cope with comics hosted on Tumblr, despite the lack of any kind of navigation links on the pages. Some become visible when I don't quite get them right on the first try. Like crawler changes, where the crawler may glitch on its hourly update cycle and get stuck. Some of you have certainly noticed that one. Sorry about that. I added support for downloading compressed web pages and parallel downloads, both of which caused a few bugs along the way, but those are hopefully fixed now.
There's a new favicon for Piperka. Much better than the scaled down photo I used before. Thanks to Lulu for that.
Another new thing is an AdBlock Plus whitelist for all the comics listed on Piperka. If you use ABP and want to support comics by allowing the ads on their pages to show, then this could help with that. Yes, piperka.net is on that list, if only because this blog is listed on Piperka itself. Let me know if you find it useful. I still need to refine that list; some four thousand rules might be a bit excessive. Thanks to romnempire for the idea.
I added graphs of daily subscriber counts from the last 30 days to the comic entry pages. I thought that it would be a nice addition. Everybody loves graphs, or at least I do.
I'll need to add some sort of a page for all the miscellaneous auxiliary stuff on Piperka, like that whitelist. There's bound to be more of that later. But not today.
Piperka went down at around 15 Jun 20:00 UTC and came back at around 16 Jun 09:00 UTC. This was totally unplanned. Apparently the server just froze, with no traces of anything in the logs. The reason why it took this long to recover from this was that I was out of town and had never committed my hosting service's web interface's password to memory. A hardware reset resolved the issue.
I wasn't going to come back home until tomorrow, but I would have hardly enjoyed my trip if I had worried about this all the time. Downtime is a risk that I take by using a dedicated server, and with me having a day job and needing to sleep, an outage could last for hours. That's the name of the game, with me being my own server admin and being committed to all sorts of other stuff.
Sorry about this. At least I can say that this was the first major downtime since I moved Piperka to Hetzner, a year and a half ago.