Piperka blog

What the crawler does

Most of the things Piperka does, like setting bookmarks and directing users to updates, are based on having an index of comics' archives. To update and maintain that index, it uses a crawler: a program that works much like a human reading a comic. It downloads the last page it knows of and looks for a link to the next page. If it finds one, it adds that page to the database. These steps are repeated until it can no longer find new pages.
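
As a very rough sketch, that loop might look something like the following in Python. This isn't Piperka's actual code; fetch() and the plain list of known pages are simplifications, and find_next_link() is sketched a bit further down.

    from urllib.parse import urljoin
    from urllib.request import urlopen

    def fetch(url):
        # Download a page as text; error handling omitted for brevity.
        with urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def crawl(known_pages, limit=None):
        # Follow next-page links from the last known page, adding new ones.
        added = 0
        while limit is None or added < limit:
            current = known_pages[-1]
            href = find_next_link(fetch(current))   # sketched further down
            if not href:
                break                               # no next link found
            next_url = urljoin(current, href)       # resolve relative links
            if next_url in known_pages:
                break                               # nothing new; stop here
            known_pages.append(next_url)            # this is what goes in the database
            added += 1
        return added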

For a human, finding a new page is typically easy. They can recognize something like the text "Next" or an image of a right arrow, placed suggestively near the comic page image, as the link to the next page, but the poor crawler trying to do the same won't have such an easy time picking up cues like that. It uses a program called a parser to turn the raw HTML into something it can process more easily. From there, it tries things like matching the text "Next" or checking for an image with a file name like "right_arrow.png", but there are always ways that can go wrong. The comic title could have "Next" in it too, or the image file name may have changed. I could have made an assumption about the comic's archive, like that the page name should always have a number in it, that later turned out not to be true. Conversely, there could be a placeholder page at the end of the archive, and the crawler would need to detect that too and not just add it to the index. There is a web standard for adding meta information to an archive page that would let a crawler like Piperka's forgo all those faulty heuristics, but most sites don't implement it, or do so incorrectly.
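
For illustration only, a next-link finder along those lines could look like this, assuming an HTML parser such as BeautifulSoup. The real crawler is configured on a per-site basis and its heuristics differ from these; the rel="next" check stands in for the kind of meta information mentioned above.

    import re
    from bs4 import BeautifulSoup

    def find_next_link(html):
        soup = BeautifulSoup(html, "html.parser")

        # The well-behaved case: a <link> or <a> element marked rel="next".
        for tag in soup.find_all(["link", "a"]):
            if "next" in (tag.get("rel") or []) and tag.get("href"):
                return tag["href"]

        # Otherwise, fall back to guesswork of the kind described above.
        for a in soup.find_all("a", href=True):
            # Link text that looks like "Next" -- can misfire if the comic's
            # title happens to contain the word.
            if re.search(r"\bnext\b", a.get_text(), re.IGNORECASE):
                return a["href"]
            # An image whose file name suggests a right arrow.
            img = a.find("img")
            if img and re.search(r"next|right_arrow", img.get("src", ""),
                                 re.IGNORECASE):
                return a["href"]

        return None   # dead end, or the heuristics simply missed the link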

All of this means that Piperka's crawler may get stuck in a dead end, there might be something on a page that prevents it from catching the link to the new page, or the archive structure may have changed. For a user, that means there won't be any updates. I may spot the problem sooner or later, or a user may ask me to fix it; I don't mind at all if you do. At that point, if there were pages that needed to be removed, I'll try to determine whether users' bookmarks need to be adjusted too. I don't read nearly all of the comics on Piperka myself, so sometimes I have to guess whether the pages were just renamed or actually removed, and you may end up seeing the same pages again, or skipping a few. A recent addition was to display the latest crawler errors on the comic info page. I'll admit that I'm pretty lax about checking those; I still don't have a good interface for it, which often means I don't.

The crawler catches most of its updates in automatic mode. The check frequency is at most once an hour, but it will gradually drop to once a day if no updates are found for a long time. The algorithm behind that is based on two variables, X and Y. Every hour, X plus a small constant is added to Y. If Y then reaches a threshold, the crawler checks the site. If there is an update, X gets a bonus and Y is reduced so much that it won't hit the threshold again for about a day. Otherwise, Y is reduced by a lesser amount. Each round, X is reduced by a fraction of itself, so that if there aren't any updates for a long time, it decays to zero. Once there are updates again, X rises and the crawler checks for updates more often. It's simple yet effective.
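
In rough pseudo-Python, and with made-up constants (the real values and bookkeeping are different), the idea is something like this:

    THRESHOLD = 24.0   # Y must reach this before a check happens
    BASE      = 1.0    # the small hourly addition; alone it gives about one check a day
    BONUS     = 10.0   # added to X whenever a check finds an update
    DECAY     = 0.05   # fraction of X that evaporates every hour

    class CheckSchedule:
        def __init__(self):
            self.x = 0.0   # roughly: how actively the comic seems to update
            self.y = 0.0   # accumulates towards the check threshold

        def hourly_tick(self, check_site):
            self.y += self.x + BASE
            if self.y >= THRESHOLD:
                if check_site():   # the crawler found new pages
                    self.x += BONUS
                    # Push Y down far enough that the threshold won't be
                    # reached again for roughly a day at the current rate.
                    self.y -= 24 * (self.x + BASE)
                else:
                    # A smaller reduction, so the next check comes sooner.
                    self.y -= THRESHOLD
            # X decays by a fraction of itself, so a dormant comic drifts
            # back towards the once-a-day minimum.
            self.x *= 1.0 - DECAY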

There's an upper limit to how many pages the crawler will download in automatic mode. Currently, that's 50. If the crawler hits that limit, automatic mode is disabled for that comic until I check on it, fix any issues and re-enable it.
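
In terms of the loop sketched earlier, that could amount to something like the following; the disable step is a made-up helper standing in for whatever bookkeeping the real crawler does.

    MAX_AUTO_PAGES = 50   # the current cap for automatic mode

    def crawl_automatic(comic_id, known_pages):
        added = crawl(known_pages, limit=MAX_AUTO_PAGES)
        if added >= MAX_AUTO_PAGES:
            disable_automatic_mode(comic_id)   # hypothetical helper; wait for a manual look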

In manual mode, there's no upper limit. For most sites, I only ever use it once, for the initial crawl. If something exceptional happens, like a comic changing its URL naming scheme, I'll have to rebuild the index and run the crawler again from the start. When I run the crawler in manual mode, I have it open in a window somewhere and keep an eye on what it's doing. For longer-running comics, that can take hours.

I don't have any hard data on how much traffic this all causes on a web site. My monthly inbound traffic is at about 10 GB, and that's spread across all the comics on Piperka and includes the web server traffic and whatever other random things I might use the server for. I'd expect that having a comic on Piperka means a net reduction in web traffic, especially for more popular comics, since readers hit my site instead of theirs when checking for updates. I'm hoping that any reduction in possible ad impressions would be offset by having more unique visitors, if that's your thing.

So far, only one comic site has banned Piperka due to excessive traffic, and frankly, I don't think they have that good a case. There are at least a couple of others that block Piperka, but I don't know their motivations (no reply when I asked one of them), or whether the crawler just incidentally trips some other check they run. Other than that, I've seen authors write that they're happy that Piperka doesn't try to download or hotlink any content and just redirects readers to their own sites.

The crawler doesn't download robots.txt files. I know that netiquette would at least strongly favor doing that, but as I'm programming the crawler on a site by site basis, I hope to spot any site policies myself. The most that a robots.txt file could do is ban the crawler from the archive, or a part of it, and in that case I couldn't do any better than ask the comic site's webmaster about the situation. I'd rather have them contact me directly instead.

I've heard little from comic authors regarding the crawler, or about Piperka at all. I'd love to hear more from authors about what kind of impact being listed on Piperka has. I've seen a few express surprise at being listed, and I don't know how many still don't know that they are; I don't routinely contact comic authors. I've seen quite a few authors' statistics pages show up on my own referrers page, and I love getting the occasional endorsement. But I know that I don't have the kind of user base to really have a role in web comics' destiny.

One of the things I would like to do is to let users register as comic authors, possibly allowing them to see more statistics about their comics and to control crawler behaviour. I'm not sure what kinds of functionality I could offer them. I already implemented a feature where web comic authors could use Piperka to set bookmarks on their own sites, but it never got polished enough to see actual use. Perhaps some day.

Sun, 12 Jun 2011 19:25:58 UTC