I’m a great fan of Megatokyo, but the publication rate of new comics can seem glacial at times. There have been times when I’ve checked the site hoping to see an update, only to find that it’s a “Shirt Guy Dom” comic (which I have no interest in reading, although to be fair there seem to have been fewer of those amongst the more recent comics).
Anyway, since I became a NAO developer I’ve been trying to beef up my Python skills and learn a bit about OpenCV, an open source library of computer vision algorithms. So one day I thought to myself: why not write a Python program to determine whether a given Megatokyo comic is a “real” comic or a “Shirt Guy Dom”? Rather than my manually checking for updates, the script could tell me 1) whether a new comic had been published, and 2) whether I should read it. This seemed fairly straightforward: while there is a lot of shading in a normal Megatokyo comic, the “Shirt Guy Dom” comics are very black and white. That means there should be no need for particularly clever feature recognition. Instead, I should be able to produce a histogram for each comic and train a classifier to learn the difference between the histograms of comics I want to read and those I don’t.
There are several steps to this process:
- Check for new comics and download them
- Compute the histograms for each comic
- Create a labelled data set of “comic” and “dontread” images
- Train a support vector machine (SVM) using some pre-classified comics
- Use the trained SVM to determine whether the new comic(s) should be classified as “don’t read”
Step 1 is not particularly hard so I won’t say much about it. It takes advantage of the fact that the comic filenames are sequentially numbered and so it can easily find the number of the most recent comic on disk and then attempt to download the next comic in the sequence.
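As a rough illustration of the "find the most recent comic on disk" part, here is a minimal sketch. The filename pattern (digits followed by an image extension) and the function names are assumptions made for the example, not necessarily what the real script uses, and the download itself is left out.

```python
import os
import re

# Assumed filename pattern for illustration: digits plus an image extension, e.g. "0412.gif".
COMIC_NAME = re.compile(r"^(\d+)\.(gif|jpe?g|png)$", re.IGNORECASE)


def latest_comic_number(directory):
    """Return the highest comic number found in directory, or 0 if there are none."""
    numbers = []
    for name in os.listdir(directory):
        match = COMIC_NAME.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def next_comic_number(directory):
    """Return the number of the next comic to try downloading."""
    return latest_comic_number(directory) + 1
```

The caller would then try to fetch the comic with that number and stop as soon as a download fails.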
Step 2 is where OpenCV comes into play. However, Megatokyo comics can be in GIF or JPEG format, and OpenCV does not support GIF files. So before we can compute the histograms of the comics we need to convert the GIF images to a format OpenCV can use. To avoid having to remember which images were converted from GIF and which were JPEGs, I elected to convert all images to PNG format. For this I used the PythonMagick binding for ImageMagick.
Having got all the images in a format that OpenCV could handle, computing the histograms was straightforward. Since most comics are greyscale, I converted the images to greyscale before computing the histogram. Also, rather than using a full 256 “buckets” for the histograms, I chose to limit them to 64, thinking that this should give enough detail to train a classifier while keeping the number of features down.
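To make the bucketing concrete, here is a small pure-Python sketch of a 64-bucket greyscale histogram. It is only an illustration of the idea: the actual script uses OpenCV for this, and the function name and the flat list of pixel values are assumptions for the example.

```python
def grey_histogram(pixels, buckets=64, levels=256):
    """Count 8-bit greyscale pixel values into equal-width buckets."""
    if levels % buckets != 0:
        raise ValueError("levels must be divisible by buckets")
    width = levels // buckets  # 4 grey levels per bucket when buckets=64
    counts = [0] * buckets
    for value in pixels:
        counts[value // width] += 1
    return counts
```

With 64 buckets, grey levels 0 to 3 land in the first bucket and 252 to 255 in the last, so each comic is reduced to 64 numbers for the classifier.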
Step 3 – I wanted a way to classify images that didn’t require any fancy file formats and would be easy to set up. In the end I settled on using the filesystem. Under the folder where I stored the Megatokyo images I created a folder for each category (there are only two, but in principle there could be more). In each category folder I put a symbolic link for every image belonging to that class, linking back to the original image in the parent directory.
So, for example, in my dontread folder I have links like this
The advantage of this was that to read any comics in a category all I needed to do was point an image viewer at the directory for the category.
I manually classified the first 411 images and wrote a short piece of Python to set up the symbolic links for me:
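Something along these lines would do the job. This is a minimal sketch rather than the exact script from the repository: `link_into_category` is a hypothetical name, and using a relative `../` link target is an assumption made so that the whole image tree can be moved without breaking the links.

```python
import os


def link_into_category(image_dir, category, filenames):
    """Create symlinks in image_dir/category pointing at images in image_dir."""
    category_dir = os.path.join(image_dir, category)
    os.makedirs(category_dir, exist_ok=True)
    for name in filenames:
        link_path = os.path.join(category_dir, name)
        # Skip links that already exist so the function can be re-run safely.
        if not os.path.lexists(link_path):
            # Relative target: resolved from the category folder up to the parent.
            os.symlink(os.path.join("..", name), link_path)
```

For example, `link_into_category(image_dir, "dontread", names)` would populate the dontread folder from a list of hand-picked filenames.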
Steps 4 & 5 – this is where I ran into trouble: although OpenCV does come with an implementation of an SVM, I could not get it to work. It looked like it should work, but everything I tried resulted in the following error message:
After breakpointing the code inside the opencv binding and seeing that I was passing something that should have corresponded to
I decided that I would try something else. I first tried the pyopencv binding which sounded promising but didn’t have a lot of luck with that either so I finally settled on PyML. Training the SVM then meant producing an array of histogram data and labelling it:
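The data preparation can be sketched independently of the SVM library. The example below is not PyML code and not the original listing. It only shows one way to walk the category folders and build parallel lists of feature vectors and labels. `build_dataset` and `feature_fn` (a function that turns an image path into a histogram) are hypothetical names introduced for illustration.

```python
import os


def build_dataset(image_dir, categories, feature_fn):
    """Return (features, labels) for every image linked under each category folder."""
    features, labels = [], []
    for category in categories:
        category_dir = os.path.join(image_dir, category)
        for name in sorted(os.listdir(category_dir)):
            # One feature vector (e.g. a 64-bucket histogram) per image...
            features.append(feature_fn(os.path.join(category_dir, name)))
            # ...and the folder name doubles as the class label.
            labels.append(category)
    return features, labels
```

The two lists line up index by index, which is the shape most SVM libraries expect for training data.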
After being trained on the first 411 images, the system then classified the next 814, of which 25 were classified as “dontread”. Of those 25, five were comics that were incorrectly classified (they had more black in them than a typical comic). There are several things I could probably do to improve matters:
- Attempt to tune the SVM – I used the defaults for PyML and the default linear SVM. Using a different kernel might give better results.
- Use larger histograms (for example 128 or 256 buckets instead of 64) – this might capture more subtlety of shading.
- Make the number of images in the training set more equal amongst the different classes – currently the training data has 386 images in the comic category and 25 in the dontread category (this was simply the proportion in the first 411 comics).
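The last idea, evening out the class sizes, could be tried with simple random undersampling. This is a sketch of one possible approach, not something the current code does; the function name and the fixed seed are assumptions for the example.

```python
import random


def undersample(samples_by_class, seed=0):
    """Randomly trim every class to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    smallest = min(len(samples) for samples in samples_by_class.values())
    return {
        label: rng.sample(samples, smallest)
        for label, samples in samples_by_class.items()
    }
```

With the current training data this would keep all 25 dontread images and a random 25 of the 386 comic images. The trade-off is that the other 361 comic examples are thrown away.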
That said the main purpose of doing this was a learning exercise and I’ve done as much as I feel like for the moment.
The other thing I meant to do, but haven’t yet, is set up the links to the newly classified images so that I only need to look in the “comic” folder to see the comics I want to read.
You can find all the code for this on github at: https://github.com/davesnowdon/PythonOpenCvImageClassifier. Since I’m new to both Python and OpenCV, I make no claims that it is particularly great code, but it does seem to work.