Piperka blog

Crawler rewrite done

I took a slight detour with the new crawler. I've been using the new version to crawl individual comics for a couple of months now, but the hourly crawls were still done with the old version until I retired it last Monday. What I did in the meantime was to integrate the crawler into the web site's admin interface.

Most of my Monday went into running the hourly crawl manually so that I could watch the log output directly and see how it went. I still ran into (and fixed) a few bugs when running it against the whole database. I left the new code to run for the night, but I had misconfigured it and Piperka went without updates in the meantime. Luckily, some comics became unstuck just due to the new download code, with no archive or parser fixes. I still had a bug in handling cases where the parser returned multiple pages from a single page, which I fixed on Saturday.

I'm using the crawler as a library from the web site backend now. I worked through the submissions queue while testing the new interface, finding and fixing more bugs along the way. The crawler interface doesn't yet have all the archive management features I'd like it to have, and it doesn't give me any output about its progress besides adding pages to a table as it runs, but it's already great at reducing the manual steps I've needed to take to add new comics in most cases. The more I can automate, the easier it gets and the fewer errors I'll make.
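To give an idea of what that looks like, here's a minimal sketch with made-up names; crawlArchive and insertPage below are stand-ins for illustration only, not the actual interface in the Piperka repositories.

```haskell
import Control.Monad (forM_)

-- Stand-in for the crawler's entry point: start from a URL and return the
-- archive page URLs it discovers. (Hypothetical; defined here only so the
-- sketch compiles.)
crawlArchive :: String -> IO [String]
crawlArchive startUrl = pure [startUrl]

-- Stand-in for the database insert the web backend would do.
insertPage :: Int -> String -> IO ()
insertPage cid url = putStrLn ("comic " ++ show cid ++ ": " ++ url)

-- With the crawler available as a library, an "add new comic" admin action
-- can reduce to calling it directly and storing what it returns.
addComic :: Int -> String -> IO ()
addComic cid startUrl = do
  pages <- crawlArchive startUrl
  forM_ pages (insertPage cid)

main :: IO ()
main = addComic 1 "https://example.com/comic/1"
```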

I've tended to add new comics in batches, and I had a somewhat longer backlog than usual since I wanted to use the new crawler code for it. I don't know where people have seen Laws and Sausages introduced, but it was submitted 27 times to the queue, all within a few days. It's a known limitation of the new comic addition code that it sends only one email for each processed submission.

Another piece of GDPR fallout was that many (all?) of the comics hosted on Tumblr had been stuck: since Piperka is hosted inside the EU, Tumblr has been serving a "will you accept cookies" page instead of the comics. Along with the new crawler, I introduced the parser action needed to consent to that. The old crawler would actually have been capable of managing that situation, but I felt I'd rather do it with the new version.
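For the curious, here's a rough sketch of the general idea, not Piperka's actual parser action: if the fetched page looks like the consent interstitial, repeat the request with a consent cookie attached. The marker text and the cookie name and value are placeholders; whatever Tumblr actually expects would come from the site-specific parser configuration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.List (isInfixOf)
import Network.HTTP.Simple

-- Fetch a page; if it looks like a cookie consent interstitial, retry the
-- request with a (placeholder) consent cookie and return that body instead.
fetchWithConsent :: String -> IO L8.ByteString
fetchWithConsent url = do
  req  <- parseRequest url
  body <- getResponseBody <$> httpLBS req
  if "will you accept" `isInfixOf` L8.unpack body
    then do
      let req' = setRequestHeader "Cookie" ["gdpr_consent=1"] req
      getResponseBody <$> httpLBS req'
    else pure body

main :: IO ()
main = fetchWithConsent "https://example.tumblr.com/" >>= L8.putStrLn
```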

Next I'm turning my attention towards maintaining the comics already listed on Piperka. Now that the crawler is in better shape, it would be easy to use it to run an audit of all the listed comics. I should catch most of the problems by picking a page from each comic's archive and running the parser on it to see whether it can find the following page. There are far too many squatted domains linked from Piperka, and this would let me check on them and retire them all from a single web page.
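As a sketch of what that audit check could look like (again with made-up names; the Parser type and auditComic are for illustration only): fetch a page known to be in a comic's archive, run the comic's parser on it, and flag the comic if no following page comes back.

```haskell
import qualified Data.ByteString.Lazy as L
import Network.HTTP.Simple

-- Hypothetical per-comic parser: extract next-page URLs from a fetched page.
type Parser = L.ByteString -> [String]

data AuditResult = Ok | NoNextPage | FetchFailed Int
  deriving Show

-- Check one comic: fetch a page known to be in its archive and see whether
-- the parser can find the following page from it.
auditComic :: Parser -> String -> IO AuditResult
auditComic parser knownPageUrl = do
  req  <- parseRequest knownPageUrl
  resp <- httpLBS req
  let status = getResponseStatusCode resp
  if status /= 200
    then pure (FetchFailed status)  -- dead links and server errors show up here
    else case parser (getResponseBody resp) of
           [] -> pure NoNextPage    -- parser found no next page: likely stuck
           _  -> pure Ok
```

Running something like that over every listed comic and collecting the FetchFailed and NoNextPage results would give the list of candidates to fix or retire.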

I expect to get to close tickets by the dozen in one go. All the dead comics and most of the update issues Piperka has should be easy to detect automatically, and I'm finally close to getting that data out in an easily actionable form.

Just as a reminder, the crawler and web site code are available on GitLab.

Mon, 10 Sep 2018 16:58:00 UTC