Piperka blog

Crawler rewrite is still ongoing, goodbye Project Wonderful

I set a goal of getting the new crawler done a couple of weeks ago. That didn't happen, and I'm still not quite there yet. I'm far enough along to run crawls for single comics with it, and the hourly cycle code is almost done; it's mainly missing some logging I still want to add.

I think I found the reason why the crawler has been prone to database deadlocks lately. I had added caching for the users' unread comics stat shown in the sidebar (the "n1 new in n2" thing), and it was kept up to date by triggers, which the crawler had been firing during its long-held sessions. I think the deadlocks made the web site itself unresponsive at times as well. It'd be a nice feature to save the few tens of milliseconds that computation takes when possible, but I'll look into bringing it back some other way once the new crawler is in place.

It may take another couple of weeks before I have the opportunity to put the new crawler in place, for real this time. There'll be nothing user-visible in this change for now, though I'm hoping to slowly work on new features based on it to help Piperka catch updates better. I just don't have the heart for another long slog on something that only takes me back to where I started, even if the new version satisfies some odd aesthetic sense of mine.

In other news, Project Wonderful is shutting down. I've been using their ads nearly since they launched and I'm none too happy to see them go. I like their model a lot: it's fair and unobtrusive, and their ads have generally been relevant to Piperka's audience. I've liked them more for offering a place for comic authors to promote their art than for the ad revenue itself. I'd still like to see Piperka generate enough income to support me, but whatever that entails won't involve PW anymore.

I may or may not set up some new ads in July. Whatever I do, I care more about the user experience than about whatever revenue they may or may not generate. If I had infinite time (I do not), I would set up some sort of self-hosted advertising system. Piperka will look quite empty without PW.

Ironically enough, PW sent me an email about receiving my 100000th bid on an ad box a few days after the shutdown announcement.

Mon, 18 Jun 2018 18:33:50

Crawler rewrite progress

I had a bit of a slow start with it, but I'm finally making good progress with the crawler rewrite. It took a while to come up with how to run the actual crawling process, so I began by writing the separate low-level functions and left figuring out how they fit together for later. This time around, I'm putting emphasis on writing tests for everything. Crawling is all about doing the correct thing in all the possible scenarios it can and will run into. As a secondary goal, I want the crawler to stand out as a better example of how I can design and implement projects with Haskell. It's not that the backend is bad, but it's not quite as crisp as I'd like yet. I still wouldn't mind a Haskell development job.

The core part of parsing is still done with Perl. The last remnants of that code are now separated into a child process that gets fed HTTP responses and returns the link to the next page, if available. I started by firing up a separate process for each comic, but I didn't like its 300 ms startup time; after adding a bit more bookkeeping, the comics on the same crawler run now share one. The new crawler holds no database handle during the main crawling. The old code does loop detection by querying the pages table for duplicates and inserts any new pages immediately into the table with no intermediate commits; the new code keeps all of that in program memory instead.
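
To give a rough idea of that arrangement, here's a minimal sketch, and not Piperka's actual code: one long-lived Perl child per crawl run, fed page content over stdin and answering with the next page's link on stdout. The parser.pl name and the line protocol here are made up for illustration.

    module ParserProcess where

    import System.IO
    import System.Process

    -- Start one parser child for the whole crawl run and hand its
    -- stdin/stdout handles to the action, so the ~300 ms startup cost
    -- is paid only once.
    withParserProcess :: (Handle -> Handle -> IO a) -> IO a
    withParserProcess action = do
      (Just toParser, Just fromParser, _, ph) <-
        createProcess (proc "perl" ["parser.pl"])
          { std_in = CreatePipe, std_out = CreatePipe }
      hSetBuffering toParser LineBuffering
      result <- action toParser fromParser
      hClose toParser          -- closing stdin lets the child exit
      _ <- waitForProcess ph
      return result

    -- Ask the child for the next page's address given a downloaded page.
    -- An empty reply is taken to mean no next link was found.
    nextLink :: Handle -> Handle -> String -> String -> IO (Maybe String)
    nextLink toParser fromParser parserName pageBody = do
      hPutStrLn toParser parserName
      hPutStrLn toParser (show (length pageBody))
      hPutStr toParser pageBody
      hFlush toParser
      reply <- hGetLine fromParser
      return (if null reply then Nothing else Just reply)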

The test suite offers a complete environment for crawler operation, up to initializing a dummy database and starting a simple web server. I hadn't practised TDD before this and it took a while to get that set up, but I think it's well worth it at this point, for this type of project at least.
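
The web server part of that boils down to something like the rough sketch below; it's not the actual test suite, and fetchPage here merely stands in for the download code under test:

    {-# LANGUAGE OverloadedStrings #-}
    module Main (main) where

    import qualified Data.ByteString.Lazy.Char8 as L
    import Network.HTTP.Simple (getResponseBody, httpLBS, parseRequest)
    import Network.HTTP.Types (status200)
    import Network.Wai (Application, responseLBS)
    import Network.Wai.Handler.Warp (testWithApplication)
    import Test.Hspec

    -- Stand-in for the download code under test.
    fetchPage :: String -> IO L.ByteString
    fetchPage url = getResponseBody <$> (httpLBS =<< parseRequest url)

    -- A one-page "comic site" served only for the duration of a test.
    fakeSite :: Application
    fakeSite _req respond =
      respond $ responseLBS status200 [("Content-Type", "text/html")]
        "<a href=\"page2.html\">next</a>"

    main :: IO ()
    main = hspec $
      describe "page download" $
        it "fetches the page the fake site serves" $
          testWithApplication (return fakeSite) $ \port -> do
            body <- fetchPage ("http://localhost:" ++ show port ++ "/")
            body `shouldBe` "<a href=\"page2.html\">next</a>"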

Yesterday, I finally connected the page downloads with the parsing. I had some trouble coming up with a design for that part, as it wasn't something I could just copy from the old code, and I had to solve a number of problems along the way. As usual, writing it in Haskell meant being explicit about many things that I sort of handwaved through in the Perl code.

I'm pretty confident that I'll get the new code in place in a couple of weeks. What's left now is to write tests for everything and prod the code until they're satisfied. In case anyone wants to have a look, the source code is available. I'll move the source files around later on, as I'm not yet quite sure how to name everything.

Mon, 21 May 2018 15:31:05 UTC

Tackling the crawler

I finished the part of the ticket system where you get to see your own tickets a couple of weeks ago. As of this writing, people have opened tickets on 189 comics on Piperka and I've resolved 52 of those. There are still plenty of ways that the ticket system itself could be improved on but I'm setting that aside for the moment.

Another bug I fixed recently was the notification about newly added comics in the sidebar. Again. It didn't work after the site transition since I used the user timestamp fields differently, and I made a fix for that by adding a couple more user data fields. But my fix didn't quite work correctly: it only showed the notification briefly after a comic was added, which happened to coincide with when I myself saw the feature in action. A thank you to the user who asked me about it.

The next part I'm working on is the crawler. The first goalpost is a rather modest one: I plan to keep using the existing parsers for everything and have most of the functionality stay the same. I'll be replacing all the code around the actual parsing: the database handling that picks the parser instance to use and the address of the page to try next when looking for an update, the logic that decides what to fetch on each hourly cycle, the storing of any newly found page addresses, and the page downloading itself.
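
To make that a bit more concrete, here's a purely illustrative sketch of the pieces involved; the names are mine for this post, not anything from the actual code:

    module CrawlSketch where

    import Data.Text (Text)

    -- What the hourly cycle hands to a single comic's crawl:
    -- which parser to use and which page address to try next.
    data CrawlTask = CrawlTask
      { taskComicId  :: Int
      , taskParser   :: Text
      , taskStartUrl :: Text
      } deriving Show

    -- What a finished crawl hands back to be stored.
    data CrawlResult = CrawlResult
      { resultNewPages :: [Text]  -- newly found page addresses
      , resultFailed   :: Bool    -- whether the download step errored out
      } deriving Show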

One big problem with the code is that it keeps a database session open for the whole crawling run. This causes all kinds of implicit database locks for minutes at a time, and trying to run the crawler on multiple sites at the same time may sometimes cause deadlocks. Nor does it use any sort of cache for the downloaded pages. Hitting control-C makes it lose all the pages it had downloaded so far, or, even worse, I've seen it crawl for a couple of hours and drop all that work on the floor at the end due to a deadlock. The long-held locks were a problem even before, but they've been more of a bother after the site transition. It may be due to other changes to the code, or the new version of PostgreSQL may just be stricter about them, and if so it's fully justified in being so.
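
One way out of that, sketched below with postgresql-simple (the table and column names are made up for illustration), is to keep the discovered pages in memory during the crawl and write them out in one short transaction at the end, so the locks are held for a fraction of a second instead of the whole run:

    {-# LANGUAGE OverloadedStrings #-}
    module CrawlStore (storeNewPages) where

    import Data.Text (Text)
    import Database.PostgreSQL.Simple

    -- Crawl first, keep the newly found pages in memory, and only open
    -- a short transaction at the very end to store them all at once.
    storeNewPages :: Connection -> Int -> [Text] -> IO ()
    storeNewPages conn comicId pages =
      withTransaction conn $ do
        _ <- executeMany conn
          "INSERT INTO crawled_pages (comic_id, url) VALUES (?, ?)"
          [ (comicId, url) | url <- pages ]
        return ()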

The full hourly cycles are supposed to fire on the hour, but the crawler takes several minutes before it even begins. It's yet another behavior that I never quite found the cause of. I could fix these problems in my old code, but I'm using this opportunity to refactor the whole thing into something that will make further development much easier. I'll start introducing ways to control the crawler from the web site, and that should make its maintenance easier. I'll just start over with it now. The last time I made a decision like that it took me two and a half years to see it through, but I'm quite certain this won't take nearly as long.

Progress with Piperka's development has lately been more intermittent than I'd like. January and February were intense and, I'd say, exhilarating, but I can't keep up that kind of pace indefinitely, no matter how much I'd like to. Finding motivation is tricky at times, and sometimes, even when I sit down and try, I don't see any progress. There are times when I wish I had other people involved with Piperka's development, even more to help keep my own interest up than for their contributions' sake, welcome as those are. I still have so much I'd like to see completed with Piperka.

Tue, 24 Apr 2018 17:32:24 UTC

Ticket System, part 2

I used last week to implement the moderator side of the ticket system. I can now read tickets and mark them as handled. Not that there's yet any way for a user to read their own tickets, processed or not. I've used it to fix a number of comics already, if only to get a taste of what I need from a ticket system for it to help me best.

Superficially, the ticket system does nothing that the old way of just receiving email didn't, which is why I had put off implementing one before. Also, I'm still quite a ways off from the eventual hope of having other people perform actual maintenance tasks. Even so, the ticket system helps me already at this stage. I've set it to display the various bits of data I typically need to nudge a crawler back into action, and that saves me many keystrokes I would previously have spent. Every single manual step I can skip makes me that much more likely to do the task in the first place. It's visually pleasing to have two lists and to be able to move items from one to the other. Marking an email as read gave me no such satisfaction.

It's not perfect nor all I want it to be yet, but this is one place where even a slight improvement matters a lot. I'm still giving priority to development tasks over site maintenance, but I'll be more likely to get to the latter now. I know nobody likes landing on a squatted domain or having a comic they read not catch updates, but I'm working towards improving the situation.

The weekend before that went mostly to implementing the followers page. It was one of the last things still simply missing compared to the old site, just an oversight I rectified once a user sent me an email about it. Piperka has some aspirations for social media-like features, but those plans are still waiting for some later date.

I wrote a brief privacy policy for Piperka. With GDPR I felt it better to state at this point that I'm not holding anyone's personal data. In short, I'm trying to not annoy you. I'll write some form of a TOS document later on, which would state in detail that you shouldn't try to annoy me.

Next up, a way to view your tickets.

Tue, 27 Mar 2018 17:54:28 UTC

Ticket System

I've introduced a ticket system. There's now a link to the ticket page on the comics' info pages. I haven't applied any CSS to it yet, so it looks basic, and I know it. I've passed on the idea of setting up a ticket system before, but that was when I had a lot fewer comics listed than I do now, and I finally have all the pieces in place to set up a proper interface for crawler maintenance. Before this, I had pretty much expected to do the manual update steps with hand-written SQL and all even with a ticket, and it would have mattered little whether I'd read about an issue in an email or in a ticket.

I only implemented ticket submissions as the first step. There's no way yet to view your submitted tickets, nor is there any interface for me to view them, other than doing a database query. Once I add one, removals of dead comics should be much easier to do. Other maintenance operations will still have to wait for later development goals to fall into place.

Sending me email about crawler issues has been a lamentably haphazard business. I'm afraid I'm not going to go through my old emails at this point. If you've sent me one about a comic that still hasn't been fixed, please go and open a ticket. I'm not yet promising more timely fixes than with the old way but at least it'll be much better organised from now on.

I'll be looking into ways to make the ticket system bring the errors the crawler itself finds to my attention in a more actionable way. It really shouldn't be the case that my users need to ask me about crawler issues this much when the crawler could work with me better in the first place.

I spotted an issue with parallel database accesses causing deadlocks on the web site. When that got triggered, the web server would freeze for some time and trying to access the site would just give a gateway timeout. Nasty stuff; I'm hoping that I've fixed it now, or at least made the situation a lot less likely. It had to do with the unread comic counts (the numbers next to the check updates link) and how they get altered when a redirect link is used. I don't know how many of you encountered that bug, but the fix has been in since last Wednesday. The old site used a single database handle for all backend operations (other than AJAX endpoints) and wouldn't even have run into a situation like this.

I've split the web site code and its authentication layer into separate projects. I'm hoping that the latter will find more users, as it's always better for code health to have other people rely on it. I've switched over to using Gitlab for Piperka's code. It pleases me that they're open source, though I'm still somewhat on the fence about git itself; I still have a fondness for darcs even though I went with git this time around. If you think you've found a bug or have a development idea, then head to the issues page. I've opened a few myself already. Please don't use it to report crawler bugs for individual comics. If anyone's interested in taking the code for a spin themselves, I'm ready to provide a copy of the database schema.

Mon, 12 Mar 2018 15:28:50 UTC