Piperka blog

About the crawler's source code

I get occasional requests for access to Piperka's crawler code. I think I should make a statement about that, and about just what's involved with it.

Piperka's source code is available under a free license at a public darcs repository. It only includes a shim database, which specifically doesn't include the crawler code. I'll need to remind you that the main reason why the source code is available is that releasing it was easy for me once I had moved to using a version control system and FAI, with a reasonable expectation that doing so would lead to benefits for Piperka. While I do consider granting access to the source code to be a moral thing to do, with the background I have with free software and Debian, I still don't find it in myself to take the extra effort to make that happen just for the sake of it. I don't much care for the attitude some people have that it's something to call me out about.

As far as expectations go, I would expect to get pretty much no help from anyone even if I made the crawler code available. After all, I've released the rest of the site code and so far have yet to receive any patches based on that. You'll have to excuse me but someone would need to convince me that there's a single person out there who would be ready to get their hands dirty with Piperka's code base before I'd fulfill any request to put out even more.

Now, free software is dead without a process. Linux has one, Debian has one, but Piperka lacks one and therefore nothing is happening. I recognize that there is more I could do to further that. For one, I should replace my email address at the bottom with a link to a contact page, which should encourage people to use the public development mailing list for discussing development ideas instead of just contacting me. I tend to be a bit of a black hole as far as email goes, and if there is ever to be a thriving development community around Piperka, it can't be just about individual people talking to me. If you'd like a more web-forum-like experience, then the Piperka subreddit is as good a place as any. It would also do good if I were to list some development goals I have for Piperka. I tend to prefer showing to telling, but I'd need to let go of that if I wanted to have others participate.

Process is exactly the reason why I'm reluctant to show the crawler code. With the web site code, most of the process is already in place due to the simple fact that it's on a version control system. No such thing is applicable to the crawler. And no, patch is not a suitable tool for the job, even though the most recent email I received about this topic suggested it. I'm going to walk through a few examples of what my usual routine involves when I maintain Piperka.

Most of what I do is based on a few perl scripts and performing SQL queries by hand. The basic editing of the crawler code itself is done with the editparsers script, which opens up all of the crawlers in the database. A particular crawler instance can be shared by multiple comics, and I remember some of the most used ids and can pick a correct parser just by looking at the site's source code. Most of the time I can get by with that. Sometimes I have to search the parsers list for something I could use, or add a case for that particular comic to an existing parser. Or I'll end up writing a new one for it. Web comic authors tend to do whatever renders on people's browsers and I'll just have to adapt to that. Here's an example of a common parser:

### 1
if ($tag eq 'a' && exists $attr->{rel} && $attr->{rel} eq 'next') {
    if ($attr->{href} =~ m</([^/]+)/?$>) {
        $next = $1;
    }
    $self->eof;
}
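
The matching rule in a parser like this can be tried out on its own. What follows is a minimal sketch and not part of Piperka's code: next_page_name is a hypothetical helper that takes a tag name and an attribute hash the same way the snippet does, and the sample links are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Piperka): applies the same test as the
# parser to a single start tag. $tag is the tag name, $attr a hash ref of
# its attributes. Returns the last path segment of the link, or undef.
sub next_page_name {
    my ($tag, $attr) = @_;
    return unless $tag eq 'a';
    return unless exists $attr->{rel} && $attr->{rel} eq 'next';
    return unless exists $attr->{href};
    # Same pattern as in the parser: capture the last run of non-slash
    # characters, allowing one optional trailing slash.
    if ($attr->{href} =~ m</([^/]+)/?$>) {
        return $1;
    }
    return;
}

# Made-up links, only for illustration.
my @samples = (
    [ a => { rel => 'next', href => 'http://example.com/comic/chapter-one-page-2' } ],
    [ a => { rel => 'next', href => '/comic/chapter-one-page-3/' } ],
    [ a => { rel => 'prev', href => '/comic/chapter-one' } ],
);
for my $s (@samples) {
    my $name = next_page_name(@$s);
    print defined $name ? $name : '(no match)', "\n";
}
```

The first two samples yield chapter-one-page-2 and chapter-one-page-3, and the rel="prev" link yields no match. The real parser additionally stops reading the page once it has seen the link, which this sketch leaves out.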

I could then use that parser and call something like

./inject_comic 1 http://www.paranatural.net/comic/ chapter-one /

and then I would finish the job with the genentry script. If I didn't get it right on the first try, I would try it again with getpages_init. With existing entries, I may use SQL queries to update the comics and crawler_config tables, delete the old archives from the updates table, insert a new first page into the table, and call getpages_single to set the crawler to rebuild the archive index. After that, I'll need to compare the old archive index and users' subscriptions to see if any pages were left out when the archive was reorganized on the comic's end. Sometimes hiatus or delay announcement posts or guest pages had crept into the old index and were cleaned out when the comic author rebuilt the archive, and I'd need to account for those somehow.
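
The comparison step at the end can be pictured as a plain set difference. This is only a sketch under assumptions: dropped_pages is a hypothetical helper, the page names are invented, and it looks at page names alone without touching the database or users' subscriptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Piperka): given the page names of the old
# archive index and of the rebuilt one (both array refs), return the old
# names that no longer appear, keeping their original order.
sub dropped_pages {
    my ($old, $new) = @_;
    my %in_new = map { ($_ => 1) } @$new;
    return grep { !$in_new{$_} } @$old;
}

# Made-up page names, only for illustration.
my @old = qw(chapter-one page-2 hiatus-notice page-3 guest-comic);
my @new = qw(chapter-one page-2 page-3);
print join(', ', dropped_pages(\@old, \@new)), "\n";
```

With these sample lists it reports hiatus-notice and guest-comic as the pages that disappeared. Deciding what to do about readers whose bookmarks pointed at such pages is the manual part.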

Don't worry if you didn't follow all of that. As it stands, all the scripts mentioned are included in the source, and that's enough of an example for getting a comic added to the development environment, if you have one set up. My point with this is that the workflow is centered on me doing things on an SQL prompt and with perl scripts on a shell prompt. It's not perfect, it could certainly be improved on, but it's what I've ended up with and it works. If you are about to offer to help me with maintaining Piperka's crawler and index, then you should be very mindful of what I'm doing currently. There are a lot of things that would need to happen before I could ever accept outside help with this. I'm not going to give anyone else direct access to the SQL database, and I hope I don't need to go into detail about what a disaster that would be. I'd need to come up with a workflow that didn't involve using that, for a start. To reiterate my point, I'm not releasing any crawler code until there is a reasonable case that doing so would benefit Piperka. Quid pro quo.

Another aspect of running Piperka's crawler is that how it accesses web comic sites affects its (that is, in the end, my) relationships with comic authors. If they perceived it as causing an unnecessary burden on their sites, they would sour on having their comics listed on Piperka at all. Warranted or not, it is something I need to be mindful of. If scores of people I had little control over, but who were associated with me, were knocking on their sites, it might hurt Piperka's reputation with comic site admins. It can be a bit of a touchy subject.

To go forward with any of this, I'd need a new interface. The natural place for it would be as a part of the web site. But you know what? I've come to realize that I'm disgusted with the web site back end code, which dates back to 2005. If I feel that way about it, then what hope do I have of anyone else touching it? I made an effort to rewrite it around 2008 but nothing came of that. Things improved vastly when I got to apply version control to it and build a real development environment for it, but the code still stinks. I'm ready to toss away the old perl code, built on top of Mason. I'm thinking of going for Snap. I have an idea of an ideal web framework I'd like to use. Snap isn't it, but it'd be a good step forward. It'll take time, but I hope to have a beta version running in a few months.

Thu, 13 Aug 2015 19:20:0 UTC