Piperka blog

Crawler rewrite progress

I had a bit of a slow start, but I'm finally making good progress with the crawler rewrite. It took a while to work out how to run the actual crawling process, so I began by writing the separate low-level functions and left it for later to figure out how they fit together. This time around, I'm putting emphasis on writing tests for everything. Crawling is all about taking the correct action in every scenario the crawler can and will run into. As a secondary goal, I want the crawler to stand out as a better example of how I can design and implement projects with Haskell. It's not that the backend is bad, but it's not quite as crisp as I'd like yet. I still wouldn't mind a Haskell development job.

The core part of parsing is still done with Perl. The last remnants of that code are now separated into a child process that gets fed HTTP responses and returns the link to the next page, if available. I started by firing up a separate process for each comic, but I didn't like that each one had a startup time of 300ms. With a bit more bookkeeping added, the comics on the same crawler run will have to share a single parser process. The new crawler holds no database handle during the main crawling. The old code does loop detection by querying the pages table for duplicates and inserting any new pages into the table immediately, with no intermediate commits. The new code keeps all that in program memory instead.
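To illustrate the in-memory loop detection, here is a minimal sketch. It is not the actual Piperka code: the names `crawl`, `Stop`, `nextLink` and `known` are made up for this example, and `nextLink` stands in for the whole "download the page and ask the parser for the next link" step. It assumes the already archived pages are loaded into a `Set` before crawling starts.

```haskell
import qualified Data.Set as Set

-- | Why the crawl of one comic stopped (hypothetical type for this sketch).
data Stop
  = NoNextPage          -- the parser found no link to a next page
  | LoopDetected String -- the next link points at a page we already have
  deriving (Eq, Show)

-- | Follow "next page" links starting from the last known page.
-- Every URL seen so far lives in a Set in program memory, so no database
-- query is needed to notice a duplicate. Returns the newly found pages in
-- crawl order together with the reason for stopping.
crawl
  :: (String -> Maybe String) -- ^ stand-in for download + parse (assumption)
  -> Set.Set String           -- ^ pages already archived (assumption)
  -> String                   -- ^ page to start from
  -> ([String], Stop)
crawl nextLink known start = go (Set.insert start known) [] start
  where
    go seen acc url =
      case nextLink url of
        Nothing -> (reverse acc, NoNextPage)
        Just next
          | next `Set.member` seen -> (reverse acc, LoopDetected next)
          | otherwise -> go (Set.insert next seen) (next : acc) next
```

Since nothing is written to the database while this runs, the new pages can be inserted in one go once the crawl of a comic has finished.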

The test suite offers a complete environment for crawler operation, up to initializing a dummy database and starting a simple web server. I hadn't practised TDD before this and it took a while to get that set up, but I think it's well worth it at this point, for this type of project at least.

Yesterday, I finally connected the page downloads with the parsing. I had some trouble coming up with a design for that part, as it wasn't something I could just copy from the old code, and I had to solve a number of problems with it. As usual, writing it in Haskell meant that I had to be explicit about many things that I sort of hand-waved through in the Perl code.

I'm pretty confident that I'll get the new code in place in a couple of weeks. What's left now is to write tests for everything and prod the code until they're satisfied. In case anyone wants to have a look, the source code is available. I'll move the source files around later on, as I'm not quite sure yet how to name everything.

Mon, 21 May 2018 15:31:05 UTC