Piperka blog

Tackling the crawler

I finished the part of the ticket system where you get to see your own tickets a couple of weeks ago. As of this writing, people have opened tickets on 189 comics on Piperka and I've resolved 52 of those. There are still plenty of ways the ticket system itself could be improved, but I'm setting that aside for the moment.

Another bug I fixed recently was the sidebar notification about newly added comics. Again. It stopped working after the site transition, since I used the user timestamp fields differently, and I fixed that by adding a couple more user data fields. But my fix didn't quite work correctly: it only showed the notification briefly after a comic was added, which happened to match when I myself saw the feature in action. A thank you to the user who asked me about it.

The next part I'm working on is the crawler. The first goalpost is a rather modest one, as I plan to still use the existing parsers for everything and keep most of the crawler's functionality the same. I'll be replacing all the code around the actual parsing. That consists of the database handling that gets the parser instance to use and the address of the page to try next to find an update, the logic that decides what to fetch next on the hourly cycle, the storing of any newly found page addresses, and the actual page downloading.
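To make that split concrete, here is a minimal sketch in Python of how the code around a parser could be shaped. It is an illustration only, not Piperka's actual code: the names find_new_pages, Fetcher and Parser are made up for this example, and the parser is assumed to return the address of the next page, or nothing when there is no update yet.

```python
from typing import Callable, List, Optional

# Hypothetical types: a fetcher downloads a page body, a parser extracts
# the address of the next page (or None when there is no update yet).
Fetcher = Callable[[str], str]
Parser = Callable[[str], Optional[str]]


def find_new_pages(last_url: str, fetch: Fetcher, parse: Parser,
                   max_pages: int = 50) -> List[str]:
    """Follow "next" links from the last known page and return new addresses."""
    found: List[str] = []
    seen = {last_url}
    current = last_url
    while len(found) < max_pages:
        next_url = parse(fetch(current))
        if next_url is None or next_url in seen:
            break  # no update, or the comic links back to a known page
        found.append(next_url)
        seen.add(next_url)
        current = next_url
    return found


# Usage with an in-memory stand-in for the network: each "page body" is
# simply the address of the next page, and an empty body means no update.
pages = {"/1": "/2", "/2": "/3", "/3": ""}
new = find_new_pages("/1", fetch=lambda url: pages[url],
                     parse=lambda body: body or None)
print(new)
```

Passing the fetcher and the parser in as plain functions is what would let the existing parsers stay untouched while everything around them gets replaced.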

One big problem with the current code is that it keeps a database session open for the whole crawling run. This holds all kinds of implicit database locks for minutes at a time, and running the crawler on multiple sites at the same time may sometimes cause deadlocks. Nor does it use any sort of cache for the downloaded pages. Hitting control-C makes it lose all the pages it had downloaded so far or, even worse, I see it crawl for a couple of hours and drop all the work on the floor at the end due to a deadlock. The long-held locks were a problem even before, but they've been more of a bother after the site transition. It may be due to other changes to the code, or the new version of PostgreSQL may just be stricter about them, and if so, it is fully justified in being so.

The full hourly cycles are supposed to fire on the hour, but the crawler takes several minutes before it even begins. It's yet another behavior whose cause I never quite found. I could fix these problems in my old code, but I'm using this opportunity to rework the whole thing into something that makes further development much easier. I'll also start introducing ways to control the crawler from the web site, which should make its maintenance easier. So I'll just start over with it now. The last time I made a decision like that it took me two and a half years to get to the conclusion, but I'm quite certain that this won't take nearly as long.

Progress with Piperka's development has lately been more intermittent than I'd like. January and February were intense and, I'd say, exhilarating, but I can't keep up that kind of pace indefinitely, no matter how much I'd like to. Finding motivation is tricky at times and sometimes, even when I sit down and try, I won't see any progress. There are times when I wish I had other people involved with Piperka's development, more to help keep my own interest up than just for the sake of their contributions, welcome as they are. I still have so much I'd like to see completed with Piperka.

Tue, 24 Apr 2018 17:32:24 UTC