Piperka blog

Progress on the Piperka backend rewrite, part 3

This is my third progress report on my web site backend rewrite. Previous parts: One Two

I would have expected to be ready with the rewrite already. I had even reserved a part of my summer vacation for finishing what remained. It turns out that I was in need of a vacation instead. I also decided to implement OAuth2 login code after all, which, not unexpectedly, was a major undertaking. It is one more delay for the project, but on the other hand it is much less painful to work on it now rather than later. Had I waited, I would only have run into all the design issues that a second authentication method exposes in my code later on, and it would have taken another large transition to adopt it.

I had written a TODO list of needed features in my last progress report. Almost all of them are completed by now. I'm hoping to have the authentication code done in January and the rest soon after. I'm currently at the stage where I'm testing the new code and fixing any issues I run into. The good news is that I'm still not disgusted with the new code base, even after putting OAuth2 logic in it. Reaching that point with the current web site code is what made me decide to start the rewrite in the first place.

Along with adopting the new backend code, I'll be migrating Piperka to run on a new server. Call me old-fashioned, but I like to use a dedicated server. The current one is seven years old and a replacement is due.

I think it's time to talk about my plans for when I have the new code in place. I intend to implement several things that would help with maintaining the crawler and help users keep up with updates.

Self-learning parser

As it stands, every parser that Piperka uses to find that elusive link to the next page is written by hand. Not every comic gets its own parser, and there are several common ones in shared use, but I now know that I can do a lot better than that: I want a piece of software that I can feed the first few archive page locations, and it would automatically extract features from them that would help it recognize the link to the next page on its own. I'd estimate that only roughly 5% of comics would still need custom parsers, which would help the situation immensely. Moreover, this would be an interface to crawler maintenance that would be much more amenable to use by people other than me. A rough sketch of the idea follows.
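
To make the idea concrete, here is a minimal sketch in Python of what such a learning parser could look like. This is not Piperka's actual code: the feature extraction and scoring are deliberately crude, fetching the pages is left out, and the sample format is something I made up for illustration. Given a few known (page URL, page HTML, next page URL) triples, it counts which anchor features tend to mark the next link, then scores candidates on unseen pages.

    # A minimal sketch, not Piperka's actual code. Learn from a few
    # known samples which anchor features tend to mark the "next"
    # link, then guess on unseen pages.
    from collections import Counter
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class AnchorCollector(HTMLParser):
        """Collect (href, attrs, text) for every <a> element on a page."""
        def __init__(self):
            super().__init__()
            self.anchors = []
            self._current = None

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self._current = (dict(attrs), [])

        def handle_data(self, data):
            if self._current is not None:
                self._current[1].append(data)

        def handle_endtag(self, tag):
            if tag == 'a' and self._current is not None:
                attrs, text = self._current
                href = attrs.get('href') or ''
                self.anchors.append((href, attrs, ' '.join(text).strip()))
                self._current = None

    def features(attrs, text):
        """Reduce one anchor to a bag of simple, comparable features."""
        feats = {'text:' + word for word in text.lower().split()}
        for key in ('rel', 'class', 'id'):
            if attrs.get(key):
                feats.add(key + ':' + attrs[key].lower())
        return feats

    def learn(samples):
        """samples: iterable of (page_url, html, known_next_url).
        Count how often each feature occurs on the true next link."""
        weights = Counter()
        for page_url, html, next_url in samples:
            collector = AnchorCollector()
            collector.feed(html)
            for href, attrs, text in collector.anchors:
                if urljoin(page_url, href) == next_url:
                    weights.update(features(attrs, text))
        return weights

    def guess_next(weights, page_url, html):
        """Score every anchor on an unseen page; return the best guess,
        or None if nothing matches any learned feature."""
        collector = AnchorCollector()
        collector.feed(html)
        best, best_score = None, 0
        for href, attrs, text in collector.anchors:
            score = sum(weights[f] for f in features(attrs, text))
            if score > best_score:
                best, best_score = urljoin(page_url, href), score
        return best

Comics whose next links carry no usable features at all, say bare image links with no text and no distinguishing markup, would presumably be among the remaining 5% that still need hand-written parsers.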

I'd expect that software like this exists already, though I suspect that writing my own would still be advantageous. I'd be interested to hear if anyone is familiar with existing solutions for this.

Crawler improvements

The crawler could act smarter than it does now. There are several cases where it can get stuck in situations that it could well detect and work around instead. The current version uses loop detection to avoid inserting duplicate pages into the index. Sometimes a parser is "leaky" and catches the link to the previous page, offering it as the link to the next page, and the check is in place to avoid such loops. But sometimes the archive itself has removed an old page and recycled its name for a new page. In those cases, the crawler could go and see where the next link of the page preceding the removed one would lead, and stitch the removed page out of the index. A sketch of both checks follows.
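
Here is a hedged sketch of both behaviors. The loop check mirrors what the current crawler does as described above; the splice step is the planned workaround. fetch_next_link() and the index helpers are hypothetical stand-ins, not Piperka's actual interfaces.

    # A sketch only; fetch_next_link() and the index helpers are
    # hypothetical stand-ins for whatever the real crawler uses.

    def crawl_forward(start_url, fetch_next_link, known_pages):
        """Follow next links, refusing to insert a page twice. A leaky
        parser that returns the previous page would otherwise send the
        crawler into a loop."""
        seen = set(known_pages)
        new_pages = []
        url = start_url
        while True:
            next_url = fetch_next_link(url)
            if next_url is None or next_url in seen:
                break  # end of the archive, or a loop: stop here
            seen.add(next_url)
            new_pages.append(next_url)
            url = next_url
        return new_pages

    def splice_removed_page(index, comic, dead_page, fetch_next_link):
        """If the archive removed dead_page and recycled its name, the
        preceding page's next link no longer points at it. Stitch the
        old entry out of the index so the recycled name can be
        inserted anew where it now belongs."""
        predecessor = index.page_before(comic, dead_page)
        if fetch_next_link(predecessor) != dead_page:
            index.remove(comic, dead_page)  # stitch the old page out
            return True  # caller may now insert the name at its new spot
        return False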

The current crawler does know how to do backtracking. I can optionally give it a parameter to start seeking for updates from a few pages earlier than the latest page. Any pages found from there on would overwrite what was previously in the index, in addition to inserting the newly found pages. This is all well and good when the pages rewritten this way had only changes like typo fixes and the content stayed the same. It is just the wrong thing to do when the page was a temporary one, where the artist was announcing a delay in the scheduled update or something like that, and which would later be removed. What the crawler should be doing is apply a text distance algorithm to decide whether to readjust users' bookmarks to show them the pages in question. Something like the sketch below.
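
A minimal sketch of that check, using Python's standard difflib as a stand-in; the real crawler could use any string similarity measure, and the 0.9 threshold is an arbitrary placeholder.

    # A sketch using stdlib difflib; the threshold is a placeholder.
    import difflib

    def same_page_content(old_text, new_text, threshold=0.9):
        """Treat a rewritten page as 'the same' if its text is close
        enough to the old version. Typo fixes score near 1.0; a
        temporary delay announcement replaced by a real page scores
        much lower, so the crawler would readjust bookmarks instead
        of silently overwriting the entry."""
        ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
        return ratio >= threshold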

I would likely tie both of these goals to the same change. I'd rather not try to teach any more tricks to the old crawler code, and would instead write new code for calling the new parser.

Ticketing system

When I asked people to just write me emails about crawler updates, I had only dozens of comics listed on Piperka. I had not thought about what the situation would be more than a decade later. Also, having requests in my inbox is yet another blocker for sharing the maintenance work with other people. I have a long backlog, and if I haven't fixed a comic entry you told me about by the time I have a ticketing system in place, then I'll have to ask you to resubmit it there. I won't be going through my old emails.

In-site messaging system

A more proactive crawler would do well to inform users about any actions it had to take on their bookmarks. Tickets need to be replied to. Readers should be notified when a comic entry gets removed for some reason.

Crawler maintenance web interface

Finally, I would have a proper interface for letting others do actual web site maintenance work and not just request fixes from me. With a better parser and crawler, it would enable anyone I choose to give access to perform most operations without reading a single line of code or entering any SQL queries into the database by hand.

But all of this is even further away. My usable time is still the major bottleneck in getting anything done with Piperka. It hasn't escaped me that there's been a recent uptick in Patreon donations. It wasn't my idea to run an ad for my own Patreon page in Piperka's ad box, but I let it pass. I appreciate being compensated for what I do, and I'm grateful to each one of you who contributes. That being said, I just don't have the user base to expect donations to reach a level where they would translate into more time for working on Piperka. There are people whose livelihood depends on Patreon and other sources like it, and I'm not one of them. I have a stable day job and Piperka is still a side project. It's nice to have the server expenses covered now, but even before that I was compensated indirectly, since Piperka is something I put on my resume.

I think I'd be in a much better position to seek out new users once I have the basics of the site running more smoothly than they do now. My plan has been to seek out more income from Piperka only later on, but people asked me to enable donations, so I did. And it does help keep my motivation up to see that people care. It's just that I see all the ways Piperka could be improved, and I'd rather be paid for what could be than for what is, if that makes sense.

Patreon changed their fee structure this month. They're pretty universally used among web comic artists, and there's a strong network effect going on. Those of my users who are likely to go for such a thing probably already have an account on Patreon. In case anyone's seeking an alternative, I set up an account on Liberapay. They charge lower fees, especially if you're European.

Mon, 11 Dec 2017 20:13:18 UTC