Piperka blog

Personal recommendations

I've set up a personal recommendations page. This time I resisted the temptation to implement my own algorithm and just used R's recommenderlab library as is, with default UBCF (user-based collaborative filtering) settings. As far as I can see, the results look pretty reasonable, though not particularly striking. It won't necessarily offer xkcd to you in the hundred results it shows, which I consider a success, but it's unlikely to suggest anything outside of the top 500 comics either. I'm not going to win any Netflix Prizes with this one but it's good to have some baseline.

The only input data the algorithm uses is plain user subscriptions, with no consideration for anything like the date of addition. As such, it's unlikely to suggest anything particularly new. Currently, there are 33 comics with over 200 readers on Piperka and the last one to reach that threshold was Stand Still, Stay Silent which was added five years ago. There's only a handful of comics from the past three years that have even reached the top 500. Inactive users are dropped from the counts after half a year of inactivity so it's not just that disused accounts with old subscriptions on old comics are inflating the measures. It would do well to give a bias to comics a user doesn't necessarily know yet.
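As a rough illustration of what user-based collaborative filtering does with plain subscription data, here is a minimal Python sketch. It is not recommenderlab's implementation: the Jaccard similarity, the `neighbours=25` cut-off, the function names and the sample data are all assumptions made for the example.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets of comic ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(subscriptions, user, neighbours=25, top_n=100):
    """Rank comics the user lacks by the summed similarity of the
    most similar users who subscribe to them.

    `subscriptions` maps a user id to the set of comics they follow
    (a hypothetical in-memory stand-in for the real data).
    """
    own = subscriptions[user]
    # Similarity of every other user to the target user.
    sims = [(jaccard(own, subs), other)
            for other, subs in subscriptions.items() if other != user]
    nearest = sorted(sims, key=lambda pair: pair[0], reverse=True)[:neighbours]

    scores = Counter()
    for sim, other in nearest:
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for comic in subscriptions[other] - own:
            scores[comic] += sim
    return [comic for comic, _ in scores.most_common(top_n)]


example = {
    "alice": {"xkcd", "sssss"},
    "bob": {"xkcd", "sssss", "pepper"},
    "carol": {"other"},
}
suggestions = recommend(example, "alice")
```

Since every neighbour votes with their whole subscription list, widely read comics collect votes from nearly everyone, which fits the observation that the results rarely leave the top 500.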

I think my per comic recommendations (with the totally custom implementation I wrote for them) do a better job at picking more specific results when they don't get suffocated by the strong nucleus of the most popular comics. But I have no clear idea how to turn that into giving per user results, and as much as I like having Piperka as my personal playground I'm not sure it's worth it at this point.

With respect to comic discovery, I would like to add overlays to Piperka Map, to color the comics listed on it according to some variables, like ones from PCA. I would also like to come up with some application for ANNs. I'm going to need an embedding. I'm not a data scientist by any measure but they do have some cool toys.

In other news, I'm hosting an ad for December. Piperka's been adless since Project Wonderful's demise but I'm still considering what to replace it with. I'd rather have ads that target my site over more bids, with no user tracking or personalisation, and there's no obvious choice for that after PW. Web comic authors are always welcome to advertise on Piperka as far as I'm concerned. I do get some pretty imaginative offers for ads from time to time, just by virtue of running a web site, but I don't think you'd care to read about gambling and whatnot. This is a one-off thing at this time and I'm setting the ad up manually, but I would like to eventually have something just as convenient as PW.

If anyone else would like to run an ad then just drop me a message. I'm afraid I'm still a bit rubbish at replying to queries, especially if I'm in the middle of something.

It's been a fun month but I think I'll content myself with leaving Piperka in maintenance mode for a while. I guess I'll play Half-Life: Opposing Force next. I've had it waiting for the right moment for a while. I'm glad they still make good games.

Fri, 30 Nov 2018 16:36:15 UTC

Crawler health check page (mostly) empty

I've had more and better data on the crawler's actions since I reimplemented much of it this fall, but I didn't do anything new with it until now. A week ago, I added a view to Piperka that shows the issues it has found in the log. It shows all the comics that have had errors when the crawler tried to find the next page during the last week and no successes during the same time. I didn't want to see every transient timeout, only those that have little chance of resolving on their own. I was in for a ride. During the last week I've removed 884 comics, and for a good couple of hundred more I reindexed or removed disappeared pages from Piperka's index and made the crawl run again. I didn't keep an exact count of that latter group.
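The selection rule described above (errors within the last week, no successes in the same window) can be sketched in a few lines of Python. The `(comic_id, timestamp, ok)` log layout and the function name are hypothetical stand-ins; the real view presumably runs as a database query over the crawler log.

```python
from datetime import datetime, timedelta


def stuck_comics(log, now, window=timedelta(days=7)):
    """Return comics with at least one crawl error and no successful
    crawl inside the window ending at `now`.

    `log` is an iterable of (comic_id, timestamp, ok) tuples, an
    assumed shape for the crawler's log entries.
    """
    since = now - window
    failed, succeeded = set(), set()
    for comic_id, when, ok in log:
        if when < since:
            continue  # too old to matter for this report
        (succeeded if ok else failed).add(comic_id)
    # A single success in the window means the errors were transient.
    return sorted(failed - succeeded)


now = datetime(2018, 11, 25)
sample_log = [
    (1, datetime(2018, 11, 20), False),  # fails all week: reported
    (1, datetime(2018, 11, 23), False),
    (2, datetime(2018, 11, 21), False),  # transient timeout: not reported
    (2, datetime(2018, 11, 22), True),
    (3, datetime(2018, 11, 1), False),   # outside the window: ignored
]
report = stuck_comics(sample_log, now)
```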

I didn't quite get the list empty yet, as some of the entries reflect bugs in the crawler itself and not real issues with comics. I also found a few cases it should have caught but didn't, so I'll need to adjust my query a bit. Even in the best case, not all crawler issues will show up on my health page, so there's unavoidably still an element of waiting for Piperka's users to report issues. But there should be far fewer of those now that I've finally made the crawler flag problems to me before anyone even necessarily notices.

I get to see the date of the last successful crawler action for a comic from the log too. It's invaluable to know when deciding which comics to eject and which to just monitor. The base rate at which the crawler checks on a comic that doesn't update regularly is about once every 20 hours.

I still don't have any kind of messaging functionality built into Piperka. Removals of comics that you read ought to raise some kind of notification. That will have to wait for another day, but I added something that should provide pretty much the same thing. My removed comics page lists any removed comics that you were subscribed to. I didn't necessarily look everywhere on the Internet for whether they had new homes somewhere. One thing I won't do is make an entry point to a former site that still hosts an older copy of the archive, with some message saying that new updates will be on a site that has since gone too.

I haven't even implemented my idea yet about running the crawler on old pages to see whether it can still find a known subsequent page. When I have that, I should catch even more dead comics. Not all domain squatters are nice enough to return an easy 404 error for a former comic page.

Piperka's comic index has never been in this good a state. I got curious and took a few statistics from the database: for a removed comic, the average page count is 189.6 and the median page count is 91. For live comics the respective values are 394.7 and 166. Not surprisingly, longer running comics are likely to have a longer life.

You may have noticed that I've added a bit of styling to comics in listings that have more frequent updates. I experimented with CSS styles for a long while until I settled on a white corner to mark the more active comics. I try to avoid information overload but this felt like a valuable addition.

I've been coding and maintaining Piperka pretty much non-stop for three weeks. I could easily have plans ready for another month but I'll need to ease off a bit for now. I'll consider later on what to do with the thumbnail functionality I implemented early this month.

Sun, 25 Nov 2018 10:29:25 UTC

Archive thumbnails

I implemented a new feature for Piperka: archive thumbnails. So far, there's only one comic I've enabled it for: Pepper & Carrot. The page number count has been linkified and clicking it will open two dialog windows, one with a listing of archive pages and a second one with thumbnails. Thumbnails would work better with a comic with a fixed page size but this is all I have for now.

The thumbnails are generated with Selenium, which renders the page as a regular browser would (it does indeed use one behind the scenes) and saves a screenshot. The screenshot is then compressed into a smaller size, both to save space and to allow showing the whole archive in a single view, and also to make sure that this form can't actually be used to read the comic.
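The pipeline described above might look roughly like the following Python sketch. The choice of headless Firefox, the Pillow library for downscaling, the 160 pixel width, the JPEG quality and both function names are assumptions for illustration. The post only says that Selenium takes a screenshot which is then compressed.

```python
def thumbnail_size(width, height, max_width=160):
    """Scale (width, height) down to at most max_width, keeping the aspect ratio."""
    if width <= max_width:
        return width, height
    return max_width, max(1, round(height * max_width / width))


def capture_thumbnail(url, out_path, max_width=160):
    """Render `url` in a headless browser and save a small JPEG of it."""
    # Third-party imports are kept local so that the size helper above
    # can be used without Selenium or Pillow installed.
    import io

    from PIL import Image
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        png = driver.get_screenshot_as_png()
    finally:
        driver.quit()  # always release the browser process

    # JPEG has no alpha channel, so convert before saving.
    image = Image.open(io.BytesIO(png)).convert("RGB")
    image = image.resize(thumbnail_size(image.width, image.height, max_width))
    image.save(out_path, "JPEG", quality=60)
```

Calling `capture_thumbnail(page_url, "thumb.jpg")` needs a local Firefox and geckodriver. At 160 pixels wide the result is small enough to tile a whole archive in one view and too small to read.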

I got a bit enthused about this feature while planning for it and implementing it but now I'm a bit uncertain about how to proceed. I certainly would like to go ahead and download and compress thumbnails for all the 2.2 million archive pages indexed on Piperka. I think I would have a pretty good case for fair use with what I'm doing as my use is transformative and it doesn't subtract from the content's original intended use, that is, reading. But I'm subject to Finnish and EU copyright laws and practices and not US ones and they don't recognize that concept over here.

I generally like living in this part of the world but the EU's increasing copyright maximalism doesn't make me feel like singing Ode to Joy. I would expect that most authors wouldn't mind that I'd generate thumbnails. It's not hard to find most of their comics copied on archive.org and they have the originals in full size. It's a nice idea that I'd ask all the authors but at this scale and with me doing it alone it's more a matter of "can't" rather than "won't". Many of them likely wouldn't respond even if they were fine with my use. Some may even be more annoyed to have me contact them at all and would rather have me do whatever I do without bothering them.

Regardless of copyrights, I'd be certain to drop the thumbnails for a comic on request. I couldn't be running Piperka without web comic artists' goodwill and that's not codified in any law. I'd just like it if I could assume that I had a better default position with fair use. If you're an author and would like to have thumbnails generated for your comic then feel free to drop me an email. I just won't get very far with this feature if I make it opt in and wait for authors to contact me.

I'd love to hear your opinions about this feature. Especially if you're an author.

Even without thumbnails, the archive dialog is now openable for all comics on the info pages. I haven't stored the titles for any of the archive pages and the text used on them is a part of the raw archive URL. It's a bit crude but it works. The same dialog was available on Reader all along and I did plan to add it for the info page but I never returned to do it until now.

My next development goal is to add more automation and better reporting to crawl issue detection. With the recent crawler update I have much more data available on its actions in an easily processable form and I would do well to have an interface for reviewing it. I should also add an extra periodic run to check on the health of those long quiet comics. It should tell plenty if downloading an old page that has a known following page no longer yields that following page when parsed.
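The check for long quiet comics can be pictured with a small Python sketch: fetch an old archive page, collect its links and see whether the already known following page is still among them. Piperka's crawler has its own parsing rules for finding the next page, so the generic "any link on the page" test and all the names and URLs here are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute targets of all <a href> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.add(urljoin(self.base_url, href))


def still_links_to_next(page_url, html, known_next_url):
    """True if the old page still links to its known following page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return known_next_url in collector.links


live = '<a href="/comic/1">Prev</a> <a href="/comic/3">Next</a>'
parked = '<a href="http://ads.example/">This domain is for sale</a>'
live_ok = still_links_to_next(
    "http://comic.example/comic/2", live, "http://comic.example/comic/3")
parked_ok = still_links_to_next(
    "http://comic.example/comic/2", parked, "http://comic.example/comic/3")
```

A parked domain that answers every request with a 200 status would pass a plain status check but fail this one.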

Thu, 15 Nov 2018 15:10:51 UTC


This time I'll be talking about bookmarklets. In short, they're browser bookmarks with a bit of JavaScript embedded in them to pass your current page to another site. Strictly speaking they're not a site feature at all but a browser one, though sites provide the targets for them. Plenty of sites beyond Piperka use them and I certainly didn't invent them. In Piperka's case, it's a way to pass an archive page to Piperka to try and set a bookmark with one click.

Piperka has had instructions for setting up a bookmarklet since its beginning, though I took them offline along with the new backend code launch in February. I've readded them and there's a link to that page on the updates page now. The reason I didn't promote them for a while was that I made a change some years ago where Piperka started to ask for confirmation after using a bookmarklet. It was all for user protection in case some other site tried to silently use your session to manipulate your user data.

It was still pretty suboptimal to have Piperka nag with an extra step whenever you used a bookmarklet. I wasn't happy with how the bookmarklets worked and didn't want to drop the CSRF protection either. Bookmarklets are back and I've added a user specific token to them to allow setting bookmarks again with a single click. Even if you had a bookmarklet defined already you may want to grab the new one. It's a small thing but my focus was just elsewhere until now.
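One common way to build such a user specific token is to derive it from a server side secret with an HMAC, so that another site cannot guess it and the server can verify it without storing anything extra. The post doesn't say how Piperka generates its tokens, so the following Python sketch, including the function names and the choice of SHA-256, is an assumption.

```python
import hashlib
import hmac


def bookmarklet_token(secret, user_id):
    """Derive a per-user token from a server-side secret (bytes)."""
    message = str(user_id).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def token_is_valid(secret, user_id, token):
    """Check a token presented by a bookmarklet request."""
    expected = bookmarklet_token(secret, user_id)
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))


secret = b"hypothetical-server-secret"
token = bookmarklet_token(secret, 42)
```

The token would be embedded in the bookmarklet's target URL. A request that arrives with a valid token can skip the confirmation step, while a forged cross-site request that lacks it still gets asked.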

I've also added a new bookmarklet, for going from an archive page directly to Piperka Reader. I've also added the option to launch Reader from the updates page. Reader, with its content embedding, is a bit apart from the traditional way Piperka operates and I'm still somewhat leery of bringing it closer to users' attention. It's useful but I suppose it's more invasive from comic authors' standpoint when compared to just taking people to their sites. If any authors read this then I'd like to hear your thoughts on the matter. I'll certainly add a blacklist for its usage if someone objects.

Other than bookmarklets, I've done some long-standing code cleanups and fixed some minor issues that have been nagging me for a long time, like adding the missing second topmost tick to the readers history charts on info pages and a sanity check so that the readers list isn't clickable when it wouldn't display anything. Piperka Map is once again updating. It had frozen in early 2017 when the number of comics hit a multiple of a buffer size and it encountered an uninitialised pointer in an array. A very C-like bug indeed. Also, I fixed the map to properly use the full browser window. Kudos to Firefox developers: my SVG map implementation was downright sluggish back when I last looked at it and it's almost as good as in Chrome now.

I've edited the outgoing links sections for comics. Google Plus is gone and Ko-fi was added.

All pretty small things that have been waiting for their moment of attention until now. I'm not done yet. I'll be focusing on some new feature development next and I'll let you know more when I have something to show.

Thu, 08 Nov 2018 12:54:20 UTC

Tickets list empty

Too bad that I didn't want to punch two holes to the opposite ends of Piperka and didn't have a handy river to divert. Piperka has, to date, had tickets opened on 1664 comics listed on it. The ticket system was launched at the end of May. During the last month, I've closed tickets on 1364 of them and performed 710 comic removals. There's only one ticket remaining where someone asked me to update a comic's archive a month ago but the site's disappeared since and I'm leaving it to remind me to check on it later to see whether it has reappeared or needs to be removed after all.

Most of my recent development efforts have gone to further improve the crawler web interface. It's not perfect yet but it was in a good enough shape to just get on with applying it instead of waiting for it to do the job even better and more easily and automatically. There's certainly room for making the crawler smarter about any update issues it may face and have it directly ask me to fix things that it couldn't solve by itself, but at this point I just ended up with a crowd sourcing solution. Thanks to everyone who ticketed Piperka and my apologies for taking this long to get hold of the situation. I just didn't have the tools to do a proper job about that part until now.

Getting the crawler back on track was only a part of what I needed to do, as the archives may have been reorganised or had content removed or added. Trying to match users' bookmarks was another part of what I did. Sometimes the old links were dead and gave no idea of what the page content was and I had to go as far as to try to see what the pages were on archive.org. Catching the newly updated page from the archive head is just a part of what I want to do with Piperka.

There's just something poignant about seeing a comic that was stuck in 2010 get new updates after resetting the crawler, only to have them end in 2014. Piperka's old. Long running comics may have user bookmarks sprinkled all along their length and I doubt I'll see most of them ever move again. I'm not publishing exact user counts but it's not difficult to tell that Piperka's user base has long been in decline. I can hope that getting the basic functionality of the site into better shape can stem some of that. I'll admit that I tend to find creating new features more fun than finding bugs to fix, even if those annoy users more than a lack of new features, but at least the new features have finally made bug fixing less toilsome.

I would have liked to have some sort of intra site communications method set up before going through all the tickets but that would have been another couple of months' delay. I think you would have preferred to see comics update sooner rather than later. What's more bothersome is that I had to do some removals and it would have been better if you got a message about those that you read. Some of those were even active but weren't suitable for Piperka for some reason. Either they had some overly creative page navigation solutions or they used crawl prevention measures, like Incapsula. I doubt they're targeting Piperka specifically but I'm not going to go out of my way either to request them to grant access to my crawler.

Piperka's had some connectivity issues lately and they culminated last weekend when it was offline for hours at a time on a couple of occasions. First my hosting provider wanted to relocate the rack where my server was (if it was announced beforehand I missed it totally) and then they had a major power outage that knocked even their own web pages offline. They only give a 99% SLA, which allows a bit over 7 hours of outage each month (1% of a 720 hour month). Budget hosting can be like that. Looks like the network's been working better this week and I'm hoping that it'll last. I may or may not move back to Hetzner at some point.

I tried reaching out to /r/haskell to seek out contributors for Piperka but my text post only got instantly shadow banned. Gee, thanks. Finding some Haskell enthusiast looking for a project who'd care about web comics seems more likely than hoping to find someone technically inclined and available from my own user base. Especially as it involves Haskell, though I can think of plenty of improvements that would only involve JavaScript. I took the effort to lay down the installation steps for both the backend and the crawler and provided a DB schema snapshot to help get a full development environment of the site up and running. I don't think anyone's tried it yet. I'm not sure where I would go ask for contributors next. Piperka's open source and I'm hoping that that would lower the barrier for hopping in and would make me asking for help less of a "work for me for free" thing. My next blog post may be one targeted at a wider developer community and not just at my users.

On a personal note, I'm on a leave of absence for all of November. It may or may not help Piperka. This is not a luxury I can take very often but it felt like the right thing right now.

Sun, 04 Nov 2018 11:07:26 UTC