Moving to hexo

This blog has not had much love over the last few years, and I intend to start adding more content to it. First, though, I decided to migrate from WordPress to hexo.

Why would I do such a thing?

  • hexo is fast. Unlike a database-driven website there are no round-trips to the database and no server-side page generation.
  • hexo is secure. There is no database or server-side code to compromise.
  • I can get off the WordPress and PHP update cycle. Even if I didn’t particularly care about getting the latest version of PHP, my hosting provider charges extra for sites on legacy versions of PHP.
  • I already use Node.js and am reasonably comfortable with the toolchain.
  • I use git for developing code, so writing content in Markdown and versioning it in git is more natural to me than using an online web-based editor (which I find tediously slow).

There are many static site generators, and pretty much everything I said above would apply to any of them, so why hexo? I first heard about it on John Stevenson’s blog: jr0cket.co.uk. John had already written a number of mini-tutorials on using hexo, and since I trust John’s opinion it seemed a reasonable choice.

Using OpenCV to avoid Shirt Guy Dom in Megatokyo

I’m a great fan of Megatokyo, but the publication rate of new comics can seem glacial at times. There have been times when I’ve checked the site hoping to see an update, only to find that it’s a “Shirt Guy Dom” comic (which I have no interest in reading, although to be fair there seem to have been fewer of those amongst the more recent comics).

Anyway, since I became a NAO developer I’ve been trying to beef up my Python skills and learn a bit about OpenCV – an open source library of computer vision algorithms. So one day I thought to myself: why not write a Python program to determine whether a given Megatokyo comic was a “real” comic or a “Shirt Guy Dom” one? Rather than manually checking for updates, the script could tell me 1) whether a new comic had been published, and 2) whether I should read it. This seemed fairly straightforward: while there is a lot of shading in a normal Megatokyo comic, the “Shirt Guy Dom” comics are very black and white. There should therefore be no need for particularly clever feature recognition; instead I should be able to produce a histogram for each comic and train a classifier to learn the difference between histograms for comics I wanted to read and those I didn’t.

There are several steps to this process:

  1. Check for new comics and download them
  2. Compute the histograms for each comic
  3. Create a labelled data set of “comic” and “dontread” images
  4. Train a support vector machine (SVM) using some pre-classified comics
  5. Use the trained SVM to determine whether the new comic(s) should be classified as “don’t read”

Step 1 is not particularly hard so I won’t say much about it. It takes advantage of the fact that the comic filenames are sequentially numbered and so it can easily find the number of the most recent comic on disk and then attempt to download the next comic in the sequence.
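
The download code isn’t reproduced in this post, but a minimal sketch of the idea might look something like the following. This is only an illustration, not the code from the repository: the URL pattern, file extension handling and helper names are placeholders.

import os
import re
import urllib2

# Placeholder URL pattern - the real script's constants are not shown in the post
COMIC_URL = "http://www.example.com/strips/%s.gif"

def latest_comic_number(basedir):
    # comic files are zero-padded sequential numbers, e.g. 0411.png
    numbers = [int(os.path.splitext(f)[0]) for f in os.listdir(basedir)
               if re.match(r'^\d{4}\.(gif|jpg|png)$', f)]
    if numbers:
        return max(numbers)
    return 0

def download_next_comic(basedir):
    # try to fetch the comic that follows the most recent one already on disk
    name = "%04d" % (latest_comic_number(basedir) + 1)
    try:
        data = urllib2.urlopen(COMIC_URL % name).read()
    except urllib2.URLError:
        return None  # assume no new comic has been published yet
    dest = os.path.join(basedir, name + ".gif")
    with open(dest, "wb") as f:
        f.write(data)
    return dest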

Step 2 is where OpenCV comes into play. However, Megatokyo comics can be in either GIF or JPEG format, and OpenCV does not support GIF files, so before we can compute the histograms we need to convert the GIF images to a format OpenCV can use. To avoid having to remember which images were converted from GIF and which were JPEGs, I elected to convert all images to PNG format. For this I used the PythonMagick binding for ImageMagick.

import os
import PythonMagick

def convert_file_name_to_png(filename):
    return os.path.splitext(filename)[0] + '.png'

def convert_image_to_png(basedir, src):
    image = PythonMagick.Image(basedir + '/' + src)
    dest = convert_file_name_to_png(src)
    image.write(basedir + '/' + dest)

Once all the images were in a format that OpenCV could handle, computing the histograms was straightforward. Since most comics are greyscale I converted the images to greyscale before computing the histogram. Also, rather than using a full 256 “buckets” for the histograms, I chose to limit them to 64, thinking that this should give enough detail to train a classifier while keeping the number of features down.

import cv

NUM_BINS = 64  # 64 histogram buckets, as discussed above

def make_histogram(imagefile):
    col = cv.LoadImageM(imagefile)
    # convert to a single-channel greyscale image before computing the histogram
    gray = cv.CreateImage(cv.GetSize(col), cv.IPL_DEPTH_8U, 1)
    cv.CvtColor(col, gray, cv.CV_RGB2GRAY)
    hist = cv.CreateHist([NUM_BINS], cv.CV_HIST_ARRAY, [[0,255]], 1)
    cv.CalcHist([gray], hist)
    cv.NormalizeHist(hist, 1.0)
    return hist

Step 3 – I wanted a way to classify images that didn’t require any fancy file formats and would be easy to set up. In the end I settled on using the filesystem. Under the folder where I stored the Megatokyo images I created a folder for each category (there are only two, but in principle there could be more) and, for each image belonging to the class signified by the folder, put a symbolic link in that folder pointing back to the original image in the parent directory.

So, for example, in my dontread folder I have links like this:

0031.png -> ../0031.png
0045.png -> ../0045.png
0065.png -> ../0065.png
0076.png -> ../0076.png
0082.png -> ../0082.png
0086.png -> ../0086.png
0093.png -> ../0093.png
...

The advantage of this was that to read any comics in a category all I needed to do was point an image viewer at the directory for the category.

I manually classified the first 411 images and wrote a short piece of Python to set up the symbolic links for me:

import sys
import getopt
from megatokyo import Usage, make_link

dontread = [ '0031', '0045', '0065', '0076', '0082', '0086', '0093', '0104', '0130', '0170', '0186', '0191', '0227', '0228', '0242', '0257', '0265', '0279', '0302', '0315', '0320', '0328', '0361', '0388', '0411' ]

def make_categories(negative):
    positive = []
    ineg = []
    for i in negative:
        ineg.append(int(i))
    for i in range(1, max(ineg)):
        tmp = str(i)
        # zero-pad the comic number to four digits, e.g. 31 -> "0031"
        pstr = "00000"[0:(4-len(tmp))] + tmp
        if pstr not in negative:
            positive.append(pstr)
    return { 'comic' : positive, 'dontread' : negative}

def make_links(basedir, categories):
    for k in categories.keys():
        vs = categories[k]
        for v in vs:
            make_link(basedir, k, v+".png")

def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h", ["help"])
        except getopt.error, msg:
            raise Usage(msg)
        if 0 == len(args):
            raise Usage("Missing base path")
        basedir = args[0].strip()
        print "Base dir = " + basedir
        make_links(basedir, make_categories(dontread))
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

if __name__ == "__main__":
    sys.exit(main())
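
The make_link helper used above is imported from my megatokyo module and isn’t shown in this post. Based on the directory layout described earlier (a relative symbolic link from the category folder back to the image in the parent directory), a rough guess at what it does might look like this:

import os

def make_link(basedir, category, filename):
    # make sure the category folder (e.g. "comic" or "dontread") exists
    catdir = os.path.join(basedir, category)
    if not os.path.isdir(catdir):
        os.makedirs(catdir)
    # create a relative link such as dontread/0031.png -> ../0031.png
    link = os.path.join(catdir, filename)
    if not os.path.lexists(link):
        os.symlink(os.path.join('..', filename), link)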

Steps 4 & 5 – this is where I ran into trouble since although OpenCV does come with an implementation of a SVM I could not get it to work. It looked like it should work but everything I tried resulted in the following error message:

NotImplementedError: Wrong number or type of arguments for overloaded function 'CvSVM_train'.
Possible C/C++ prototypes are:
train(CvSVM *,CvMat const *,CvMat const *,CvMat const *,CvMat const *,CvSVMParams)
train(CvSVM *,CvMat const *,CvMat const *,CvMat const *,CvMat const *)
train(CvSVM *,CvMat const *,CvMat const *,CvMat const *)
train(CvSVM *,CvMat const *,CvMat const *)

After breakpointing the code inside the OpenCV binding and seeing that I was passing something that should have corresponded to

train(CvSVM *,CvMat const *,CvMat const *,CvMat const *,CvMat const *,CvSVMParams)

I decided to try something else. I first tried the pyopencv binding, which sounded promising, but I didn’t have a lot of luck with that either, so I finally settled on PyML. Training the SVM then meant producing an array of histogram data and labelling it:

def classify(basedir, category_names):
    all_images = get_images(basedir)
    all_classified_images = []
    classified = {}
    for c in category_names:
        pimg = get_png_images(basedir+'/'+c)
        classified[c] = pimg
        for im in pimg:
            all_classified_images.append(im)
    # now need to find the images which are not classified yet
    unclassified = []
    for i in all_images:
        if i not in all_classified_images:
            unclassified.append(i)
    # make histograms of all images
    hmap = make_histograms(basedir, all_images)
    clf = learn(classified, hmap)
    usamples = []
    for u in unclassified:
        hist = hmap[u]
        row = []
        for j in range(NUM_BINS):
            row.append(cv.QueryHistValue_1D(hist, j))
        usamples.append(row)
    data = VectorDataSet(usamples, patternID=unclassified)
    results = clf.test(data)
    patterns = results.getPatternID()
    labels = results.getPredictedLabels()
    # make map of image name to predicted label
    lmap = {}
    for i in range(len(patterns)):
        lmap[patterns[i]] = labels[i]
    return lmap

# train a support vector machine to recognize the images based on histograms
def learn(classified, histograms):
    clf = SVM()
    total_samples = 0
    for c in classified.keys():
        cim = classified[c]
        total_samples = total_samples + len(cim)
    samples = []
    labels = []
    for c in classified.keys():
        cim = classified[c]
        for im in cim:
            hist = histograms[im]
            row = []
            for j in range(NUM_BINS):
                row.append(cv.QueryHistValue_1D(hist, j))
            samples.append(row)
            labels.append(c)
    data = VectorDataSet(samples, L=labels)
    print str(data)
    clf.train(data)
    return clf

Conclusion

After being trained on the first 411 images, the system then classified the next 814, of which 25 were classified as “dontread”. Of those 25, five were comics that were incorrectly classified (they had more black in them than a typical comic). There are several things I could probably do to improve matters:

  • Attempt to tune the SVM – I used the defaults for PyML and the default linear SVM. Using a different kernel might give better results.
  • Use larger histograms (for example 128 or 256 buckets instead of 64) – this might capture more subtlety of shading.
  • Make the number of images in the training set more equal amongst the different classes – currently the training data has 386 images in the comic category and 25 in the dontread category (this was simply the proportion in the first 411 comics); a possible way to rebalance is sketched below.
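
As an illustration of the last point, one simple (if crude) way to rebalance would be to downsample the larger category before training so that each class contributes the same number of examples. This is just a sketch of the idea, not code from the repository; it assumes the classified dictionary of category name to image list that learn() receives above, and it throws training data away rather than gathering more.

import random

def balance_categories(classified):
    # shrink every category down to the size of the smallest one
    smallest = min(len(images) for images in classified.values())
    balanced = {}
    for category, images in classified.items():
        balanced[category] = random.sample(images, smallest)
    return balanced

# for example: clf = learn(balance_categories(classified), hmap)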

That said, the main purpose of doing this was as a learning exercise and I’ve done as much as I feel like for the moment.

The other thing that I meant to do, but haven’t yet, is to set up links to the newly classified images so that I only need to look in the “comic” folder to see the comics I want to read.
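
Given that classify() above already returns a map from image name to predicted label, this would be a small addition. A rough sketch (again, an illustration rather than code from the repository) could simply reuse the make_link helper:

def link_new_classifications(basedir, label_map):
    # label_map is the image-name -> predicted-label dict returned by classify()
    for image, label in label_map.items():
        make_link(basedir, label, image)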

You can find all the code for this on GitHub at: https://github.com/davesnowdon/PythonOpenCvImageClassifier. Since I’m new to both Python and OpenCV I make no claims that it is particularly great code, but it does seem to work.

The LJC Code Share

My first experience with peer code review was during a summer job as a junior developer at a medium-sized software house in London which developed point of sale (POS) systems. Code reviews there were organised on a regular basis: the team of developers working on a product would get together in a meeting room and discuss someone’s code. This was over 23 years ago now and the details are somewhat hazy, but the experience stayed with me as an example of good practice: a way both to improve the quality of the code base and for less experienced developers to learn from more experienced ones.

It would be 23 years before I worked at a company that also practised peer code reviews on a consistent basis – I joined VMware UK in October 2011. I have been impressed by VMware’s commitment to quality, and one of the ways this is achieved (VMware also use other best practices such as continuous integration, but I won’t discuss them in this post) is by ensuring that no code is checked into one of the main branches without being peer-reviewed. Given that VMware’s developers are globally distributed, face-to-face meetings would be impractical and would slow the rhythm of work, so we make use of a web-based tool called Review Board.

In many ways using Review Board is actually better than a face-to-face meeting, as all comments are recorded and reviewers can highlight exactly the lines of code that a comment is discussing. Review Board also tracks changes to the submitted code in response to comments, so reviewers can see not only how the change modifies the existing code base but also how the author has adapted the code in response to comments. Since Review Board is web-based, bug tracker reports can link to the review request for a proposed fix and commit comments can link back to the review that authorized the commit.

Before I go any further let me describe what I see as the benefits of peer code reviews:

  • Flaws are spotted early – Although code reviews are not a substitute for automated tests a second or third pair of eyes looking at the code can spot potential issues and suggest fixes.
  • They encourage consistency in the code base – With multiple developers looking at any given piece of code any deviations from house coding conventions and re-implementation of functionality already present in libraries or the code-base are much more likely to be noticed.
  • They help developers learn the code base – reviewers can point the author to similar functionality already present that they may have overlooked.
  • Less experienced coders can learn from more experienced ones – As well as the opportunity to correct specific errors based on review comments, coders can also learn to improve their general coding style from feedback on specific examples of code. Many coders (all?) learn best by doing, and the more concrete the example and the closer it relates to a specific task, the easier it is to learn and apply what has been learnt.

Anyway, so how does this relate to the London Java Community? Back in August 2011 Ged Byrne and I had an idea for a new meetup for the LJC – our thinking was that the LJC organises technical talks, code dojos and social events (the “developer sessions”) but we don’t spend much time actually talking about code itself in detail. We had the idea of a regular meetup during which people could bring code and discuss it with other developers. Ged also suggested the idea of a monthly challenge in which we’d solicit code that solved a particular problem and analyse it in groups.

Since then we’ve organised five events and I think it’s time to look back and see how things are going – here are links to the previous four events and their respective challenges:

We typically start the code shares with a short introductory talk to present and discuss the background to the challenge. We’ll then break up into groups of 4-6 people and discuss the month’s code in detail. We deliberately keep things low-tech and use printed copies of the code rather than relying on people bringing laptops and wasting time while everyone downloads copies of the code. Once it looks like the groups have run out of things to talk about (after 30-45 minutes) we re-convene as a single group and discuss what we’ve learned. We keep things as informal as possible, and while we’ll pick on people from each group to give feedback to get things started, anyone is free to talk about anything related to the code we’ve just looked at.

For me, it’s the informal group discussions that are the most valuable part of the event and although we talk a lot about the code itself we’ve also had emergent debates on subjects like:

  • is clojure (and all other lisps) inherently unreadable?
  • what is readable code anyway?
  • fluent coding
  • should we comment code? and if so under what circumstances? are comments harmful?

Another aspect that I consider valuable is that we deliberately don’t limit ourselves to Java code – as well as other JVM languages (scala & clojure so far) we’ve used examples written in python. Ged and I believe that this not only adds to the variety but exposes developers to other ways of approaching software development.

Update 19/1/2012: I was remiss not to mention our sponsors in the first version of this post. Thoughtworks have been paying for the cost of printing the handouts (and a round of drinks in the bar afterwards), Recworks have been helping Ged and me with much of the administration and organisation, and Queen Mary, University of London have provided the venue for all the Code Share events so far.

Update 22/1/2012: Details for the sixth LJC code share have been posted on meetup.com – February’s topic is dependency injection.

Migrating from prototype to jQuery

I first used Prototype and script.aculo.us on a live site in 2007 when I wrote the management interface for gsnowdon.com. At the time they seemed pretty cool, and using the effects and event management functions was a lot better than rolling my own. However, for the last couple of years I’ve been using jQuery for all new sites, as I find it a much better match for how I write JavaScript and it really does seem to require less code to do the same thing in jQuery than in Prototype or plain JavaScript. Not only that, but the vast selection of jQuery plugins means that if something I want is not built into jQuery or jQuery UI then there is often a plugin that does the job.

Although I’ve stuck with Prototype & script.aculo.us for gsnowdon.com for the last few years, my wife has requested a major re-working of the way the product management pages work, so I thought I’d bite the bullet, port the site to jQuery and then use that for all the new functionality.

This post aims to make a very informal comparison of the code between the Prototype and jQuery versions of the management pages, since that is the most JavaScript-heavy part of the site and most operations make use of AJAX. Given that gsnowdon.com was one of the first sites I worked on using Prototype, the comparison is probably a little unfair, as there are probably better ways to write equivalent functionality in Prototype than I used at the time.

The first thing to note is that the management.js file dropped from 948 lines to 802 although some of this is due to differences in the way I tend to format jQuery code. Perhaps a better comparison is file size which dropped from 34,732 bytes to 29,259.

The biggest wins were code fragments such as the following:

$A(document.getElementsByTagName("select")).each(
    function(value, index) {
        if (value.hasClassName("printTypeOptions")) {
            value.innerHTML = printTypeOptions;
        }
    }
);

which became one-liners in jQuery:

$('select.printTypeOptions').html(printTypeOptions);

The ability to chain a sequence of jQuery operations also helps to make code more concise, as in:

var list = document.getElementById('productList');
...
Element.show(list);
list.innerHTML = "<ul>" + pgHtml.join("") + "</ul>";

Compared to:

$('#productList').show().html("<ul>" + pgHtml.join("") + "</ul>");

The original code made quite a lot of use of Prototype’s collect() function as in:

var pgHtml = $.collect(productGroups,
    function(index, value) {
        // build HTML string here
    }
);

I’d then use join() on the resulting HTML to create a string I could stuff into the innerHTML of a container element. jQuery does not seem to have a collect() function, so I used the following implementation from http://snippets.dzone.com/posts/show/11483:

// function replacement for the Prototype collect function
// http://snippets.dzone.com/posts/show/11483
$.collect = function(c, f) {
    var a = [];
    $.each(c, function(k, v) {
        a.push(f(k, v));
    });
    return a;
};

The arguments are reversed compared to the Prototype equivalent, but I like this ordering better.

There’s not a great deal of difference between making AJAX requests in Prototype and in jQuery, as the following shows, although the ability of jQuery to automatically process JSON and pass an object rather than a string back to the callback did make the code marginally more concise:

new Ajax.Request(
    "mproducts_helper.php?REQ_TASK=listAll&REQ_UNITS="+units,
    {
        method: "get",
        onSuccess: function(transport) {
            var response = transport.responseText;
            var jsonObj = eval("("+response+")");
            parseAllProductItems(jsonObj);
        }
    }
);

compared to:

$.ajax({
    url: "mproducts_helper.php?REQ_TASK=listAll&REQ_UNITS="+units,
    dataType: 'json',
    success: parseAllProductItems
});

I could have used $.get() and $.post() to make the jQuery versions a little bit more concise but didn’t bother in this case.

There were also a number of cases where code like this:

$('pageUnits').value;

got re-written as below. The only advantage in doing things the jQuery way in this case is that the code would still work no matter what HTML form widget is used.

$('#pageUnits').val();

There’s a lot more I could do to improve the current code. I didn’t make use of any jQuery plugins; for example, I could have used Jeditable to replace some home-grown edit-in-place code, but that can wait until I start re-working the user interface, as I intend to make some significant changes.