Personal musings of all colors, mostly about programming, science and mathematics.

Bayesian inference introduction

I wrote a small introduction to Bayesian inference, but because it is pretty heavy on math, I used the format of an IPython notebook. Bayesian inference is an important process in machine learning, with many real-world applications, but if you were born any time in the 20th century, you were most likely to learn about probability theory from a frequentist point of view. One reason may be that calculating some integrals in Bayesian statistics was too difficult to do without computers, so frequentist statistics was more economical.
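As a rough illustration of the kind of integral that is tedious by hand but trivial for a computer, here is a minimal sketch that is not taken from the notebook. The coin-flip setup, the uniform prior, the function name `grid_posterior`, and the grid size are all assumptions made for this example. It approximates the normalizing integral in Bayes' theorem by a sum over a grid of candidate values for the coin's bias.

```python
def grid_posterior(heads, flips, grid_size=1001):
    """Approximate the posterior over a coin's bias on an evenly spaced grid.

    Hypothetical helper for illustration: uniform prior, binomial likelihood.
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # uniform prior over [0, 1]
    likelihood = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # The sum stands in for the evidence integral that normalizes the posterior.
    evidence = sum(unnormalized)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior


# Assumed data for the example: 7 heads in 10 flips.
thetas, posterior = grid_posterior(heads=7, flips=10)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))

# With a uniform prior the exact posterior is Beta(heads + 1, tails + 1),
# whose mean is (heads + 1) / (flips + 2).
analytic_mean = (7 + 1) / (10 + 2)
print(posterior_mean, analytic_mean)
```

For this conjugate case the answer is known in closed form, which makes it a convenient check on the grid. The same summation works unchanged for priors where no closed form exists.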

6 things you didn’t know about MediaWiki

… (and were afraid to ask). HTML Tag Scope: If you mix HTML tags with wikitext, which is allowed for so-called “transparent tags”, MediaWiki will check the element nesting independently of the wikitext structure in a preprocessing step (includes/Sanitizer.php::removeHTMLtags). Later on, when parsing the wikitext, some elements may be closed automatically (for example at the end of a block). The now-dangling close tag will be ignored, although it is detached from its counterpart by then: <span style="color: red">test this</span> will result in: test this while test this</span> will result in: test this</span> This can happen across a long part of the wikitext document, with many intermediate blocks, so the treatment of close tags has a wide context-sensitivity, which is generally bad for formal parsing.

Replacing native code with Cython

Here is a little exercise in rewriting native code with Cython while not losing performance. It turns out that this requires pulling out all the stops and applying a lot of the optimization magic provided by Cython. On the other hand, the resulting code is portable to Windows without worrying about compilers etc.

A real world example

The example code comes from a real world project, OCRopus, a wonderful collection of OCR tools that uses the latest algorithms in machine learning (such as deep learning) to transform images to text.

Including binary file in executable

A friend asked how to include a binary file in an executable. Under Windows, one would use resource files, but under Linux the basic tools are sufficient to include arbitrary binary data in object files and access them as extern symbols. Here is my example file. To make it more fun, the same file is also a Makefile and a shell script, and the program prints itself when run (without requiring the source file to be present).

I stood up and took a step away from the desk, but a tangle of cables had wrapped around my leg and pulled hard on my AKG K601 headphones. The left speaker stopped making a sound. This particular headphone model, which I absolutely love, is discontinued, but no worries. I suspected a broken solder joint, which would be easy to fix. As usual, the hard part is opening the sucker up to get at the actual problem.

Attention, elasticsearch counts in UTF-16

Here is a surprise. Trying to extract the text of analyzed tokens in elasticsearch, I found that it didn’t match my expectations. The positions start_offset and end_offset were not counted in bytes, and not counted in Unicode graphemes. What was going on? A hint was the behavior of the standard analyzer:

    $ python -c 'print "X \xf0\x9d\x9b\xbe Y"' > test.txt
    $ curl -XGET 'http://localhost:9200/diss/_analyze?analyzer=standard&pretty=1' -d @test.txt
    {
      "tokens" : [
        { "token" : "x", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 1 },
        { "token" : "\uD835\uDEFE", "start_offset" : 2, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 2 },
        { "token" : "y", "start_offset" : 5, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 3 }
      ]
    }

Apparently, the input was converted to UTF-16, and the offsets were measured in UTF-16 code units (2-byte sequences), so a character outside the Basic Multilingual Plane counts as two.
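To make the offset arithmetic concrete, here is a small Python 3 sketch that is not part of the original post. The helper names `utf16_length` and `utf16_to_codepoint_offset` are hypothetical. It counts the same test string in code points, UTF-16 code units, and UTF-8 bytes, and converts UTF-16 offsets like the ones elasticsearch reports into indices usable on a Python string.

```python
def utf16_length(text):
    """Number of UTF-16 code units (2-byte sequences) needed for text."""
    return len(text.encode("utf-16-le")) // 2


def utf16_to_codepoint_offset(text, utf16_offset):
    """Convert an offset in UTF-16 code units to a Python str index."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters above U+FFFF are stored as a surrogate pair (2 units).
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


# Same content as the test file: X, space, U+1D6FE, space, Y.
text = "X \U0001D6FE Y"

print(len(text))                   # code points: 5
print(utf16_length(text))          # UTF-16 code units: 6
print(len(text.encode("utf-8")))   # UTF-8 bytes: 8

# The second token was reported with start_offset 2 and end_offset 4.
start = utf16_to_codepoint_offset(text, 2)
end = utf16_to_codepoint_offset(text, 4)
token = text[start:end]
print(token == "\U0001D6FE")       # True
```

The linear scan is adequate for short strings. For long documents with many tokens, it would be cheaper to build the offset table once and look positions up in it.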

A Bug Tale

So, what’s a software developer doing all day? Sometimes, it can be ridiculous. Of course, ridiculous makes for good stories, if you are the type of person who enjoys programming jokes. It was a sunny day, but the day was long gone, and the night laid its cloth of dark silk over the world, when I cracked my knuckles and straightened my back to face a task that would lead me into the treacherous belly of the cave that is libxslt.

N900 to HTC One X+ Headset Conversion

The N900 smart phone has a nice in-ear headset, but it doesn’t conform to current headset standards. However, headsets are incredibly simple and there is no reason to buy a new one just for such a simple compatibility problem. So here is what to do if you find yourself in this situation. The N900 headset and the HTC One X+ headset have slightly different pin-outs: The tip connects to the left speaker signal (L) on both devices.

Fedora 18 Beta on Thinkpad T430s

I’m dreading this post. So many nice things have happened in free software over the last decade that it is always a huge disappointment to me when I am thrown back into the seat of a new user and get a flashback to the horrible times of abysmal failures and sloppy engineering. Let’s get it over with as quickly as possible. The Linux kernel is struggling with a disastrous SSD communication problem under workloads typical of database applications (such as the popular LAMP stack, so by no means anything extraordinary).