Piperka blog

Crawler health check page (mostly) empty

I've had more and better data on the crawler's actions since I reimplemented much of it this fall, but I didn't do anything new with it until now. A week ago, I added a view to Piperka that shows the issues found in the crawler log. It lists every comic for which the crawler hit errors when trying to find the next page during the last week, with no successes during the same time. I didn't want to see every transient timeout, only the failures that have little chance of resolving on their own. I was in for a ride. During the last week I've removed 884 comics, and for a good couple of hundred more I reindexed the archive or removed disappeared pages from Piperka's index and made the crawl run again. I didn't keep an exact count of that latter group.
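As a rough illustration of that selection rule (this is not Piperka's actual query), here is a minimal Python sketch. The log format, a list of hypothetical (comic_id, timestamp, ok) tuples, and the function name are assumptions made up for the example.

```python
from datetime import datetime, timedelta

def persistent_failures(log, now, window=timedelta(days=7)):
    """Return ids of comics with at least one error and no success in the window.

    `log` is an iterable of (comic_id, timestamp, ok) tuples; this shape is
    an assumption for the sketch, not the real crawler log schema.
    """
    cutoff = now - window
    failed, succeeded = set(), set()
    for comic_id, when, ok in log:
        if when < cutoff:
            continue  # entry is older than the window, ignore it
        (succeeded if ok else failed).add(comic_id)
    # A single success inside the window hides the comic, so a transient
    # timeout followed by a good crawl never shows up on the list.
    return sorted(failed - succeeded)
```

A comic that timed out once but was crawled fine the next day drops out of the result, which is the point of requiring "no successes" rather than just "any error".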

I didn't quite get the list empty yet, as some of the entries reflect bugs in the crawler itself and not real issues with comics. I also found a few cases it should have caught but didn't, so I'll need to adjust my query a bit. Even in the best case, not all crawler issues will show up on my health page, so there's unavoidably still an element of waiting for Piperka's users to report issues. But there should be far fewer of those now that the crawler finally flags problems to me before anyone else necessarily notices.

I get to see the date of the last successful crawler action for a comic from the log too. It's invaluable when deciding which comics to eject and which to just monitor. The base rate at which the crawler checks on a comic that doesn't update regularly is about once every 20 hours.

I still don't have any kind of messaging functionality built into Piperka. Removals of comics that you read ought to raise some kind of notification. That will have to wait for another day, but I added something that should provide pretty much the same thing: my removed comics page lists any removed comics that you were subscribed to. I didn't necessarily look all over the Internet to see whether they had found new homes somewhere. One thing I won't do is make an entry that points to a former site that still hosts an older copy of the archive, with a message saying that new updates will be on a site that has since disappeared.

I haven't even implemented my idea of running the crawler on old pages to check whether it can still find a known subsequent page. Once I have that, I should catch even more dead comics. Not all domain squatters are nice enough to return an easy 404 error for a former comic page.
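To make the idea concrete, here is a small Python sketch of such a check. It is only a guess at the shape it could take, since nothing like it exists in Piperka yet: `find_next` is a hypothetical callable standing in for the crawler's "find the next page from this URL" step, and `known_pages` is assumed to be the ordered list of archive URLs already in the index.

```python
def broken_links(known_pages, find_next, samples=3):
    """Return (page, expected_next) pairs the crawler can no longer confirm.

    known_pages: ordered archive URLs already in the index (assumed shape).
    find_next:   hypothetical callable, URL -> next URL or None.
    samples:     how many consecutive pairs to check from the archive start.
    """
    pairs = list(zip(known_pages, known_pages[1:]))[:samples]
    # A squatter page may answer with HTTP 200, but it won't link to the
    # page that is known to follow, so the pair shows up as broken.
    return [(cur, nxt) for cur, nxt in pairs if find_next(cur) != nxt]
```

If every sampled pair comes back broken, the comic is a likely candidate for removal even though no request ever returned a 404.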

Piperka's comic index has never been in as good a state as this. I got curious and took a few statistics from the database. For removed comics, the average page count is 189.6 and the median is 91. For live comics, the respective values are 394.7 and 166. Not surprisingly, longer-running comics are likely to have a longer life.

You may have noticed that I've added a bit of styling in listings to comics that update more frequently. I experimented with CSS styles for a long while until I settled on a white corner to mark the more active comics. I try to avoid information overload, but this felt like a valuable addition.

I've been coding and maintaining Piperka pretty much non-stop for three weeks. I could easily come up with plans for another month, but I'll need to ease off a bit for now. I'll consider later what to do with the thumbnail functionality I implemented early this month.

Sun, 25 Nov 2018 10:29:25 UTC