Archive for October, 2009

* Unidecoder

Posted on October 31st, 2009 by John. Filed under programming.


A while back I made a post about ASCIIizing Text. With it was a simple Python application that converts Unicode characters to their ASCII equivalents. It doesn’t just do a basic conversion; it also Latinizes characters that fall outside of the ASCII range.

The uni2ascii package I made has a few shortcomings I’ve decided to fix. The four major problems with it are: 1) very basic permission checking, 2) it only accepts one file, 3) it requires all input to be UTF-8 encoded, 4) the decoder is a very literal port of the Ruby version.

To fix these issues I’ve written an entirely new script. Problems 1, 2 and 3 are fixed. It has robust error checking, can handle an arbitrary number of files, and the file encoding can be specified. Number 4 is fixed by using the Python port created by Tomaz Solc.

I’ve put the source code for the new decoder into a Launchpad branch:

$ bzr branch lp:~user-none/+junk/unidecoder
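
As a rough illustration of the behaviour described above (not the code in the branch), here is a minimal sketch that takes several files and a caller-specified encoding. It does not assume the Unidecode port is installed; as a stand-in it uses the standard library’s unicodedata decomposition, which only strips accents from Latin letters and drops everything else rather than Latinizing it. The function names are hypothetical.

```python
import unicodedata


def asciiize(text):
    # Stand-in for a real transliterator: decompose accented characters
    # (NFKD), then drop every code point that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def asciiize_files(paths, encoding="utf-8"):
    # Convert any number of files; a file that cannot be read or decoded
    # is reported and mapped to None instead of aborting the whole run.
    results = {}
    for path in paths:
        try:
            with open(path, "r", encoding=encoding) as handle:
                results[path] = asciiize(handle.read())
        except (OSError, UnicodeDecodeError) as err:
            print("skipping %s: %s" % (path, err))
            results[path] = None
    return results


print(asciiize("Caf\u00e9 d\u00e9j\u00e0 vu"))
```

Swapping asciiize for a proper transliteration function would give the Latinizing behaviour; the file handling around it would stay the same.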



* Calibre Week in Review

Posted on October 26th, 2009 by John. Filed under calibre.


Mostly bug fixes this week. The majority of them were centered around eReader PDB output and PML generation. eReader PDB output now marks the first image as the cover image if a cover image is not explicitly set. PMLZ output now names images properly. PML generation now adds .png to the end of image names. I also fixed a bug where excessive newlines were not being properly removed. PML, TXT, RB, and FB2 output all had their excessive space removal toned down, so cases where spaces were removed completely should stop happening. Regex header and footer matching was moved to a later stage in the conversion pipeline. This should ease issues with expressions not matching properly. Finally, at Kovid’s request I’ve added some info about header / footer regexes and converting TXT and PDF files to the documentation.
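
To give an idea of what conservative whitespace cleanup can look like, here is a small sketch. It is not calibre’s code; the function name and the exact thresholds are assumptions chosen for illustration. Runs of spaces collapse to a single space rather than to nothing, and long runs of newlines collapse to one blank line.

```python
import re


def tidy_whitespace(text):
    # Collapse runs of spaces/tabs to ONE space, never to zero, so words
    # are not glued together.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Collapse three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


print(repr(tidy_whitespace("one   two\n\n\n\nthree")))
```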



* Calibre Week in Review

Posted on October 19th, 2009 by John. Filed under calibre.


Like every week there were miscellaneous bug fixes. However, this week I did a bit more: I added TCR input and output. Do be warned that the output supports multiple compression levels, and the higher levels are slower than the lower ones. For instance, a 200K TXT file takes around 25 seconds at the lowest level and 3.5 minutes at the highest.

TCR is a compressed text format used mainly by the Psion 3 and 5 series PDAs that were produced in the 90s. The compression used by TCR files is very interesting. It doesn’t have as high a compression ratio as, say, zlib, but that is a trade-off for being decompressible starting at any point in the stream. The history and more information about the format can be found at Andrew Giddings’ TCR page.
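
The property of being decompressible from any point can be illustrated with a toy example. The sketch below assumes a scheme in which every compressed byte is an index into a 256-entry table of byte strings; the table contents here are made up, and this is an illustration of the idea rather than a TCR file parser.

```python
# Hypothetical 256-entry table: by default each byte maps to itself,
# and a few entries are replaced with longer strings.
table = [bytes([i]) for i in range(256)]
table[0x80] = b"the "
table[0x81] = b"quick "
table[0x82] = b"fox"


def expand(codes, start=0):
    # Each code expands on its own, with no dependence on earlier output,
    # so decoding can begin at any offset in the stream.
    return b"".join(table[code] for code in codes[start:])


compressed = bytes([0x80, 0x81, 0x82])
print(expand(compressed))     # b'the quick fox'
print(expand(compressed, 1))  # b'quick fox'
```

A zlib stream, by contrast, generally has to be decoded from the start, because later data refers back to earlier output.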



* Calibre Week in Review

Posted on October 11th, 2009 by John. Filed under calibre.


I haven’t had one of these for quite some time. I’ve been working on other projects, and on the calibre front I’ve only been dealing with small bug fixes. However, this past week I’ve done a bit of work that is worth mentioning.

I’ve cleaned up the FB2 output. The cleanup fixes some invalid markup and some issues with text not being displayed by FBReader. It also fixes some issues with invalid characters making their way into hrefs.

eReader PDB output also got some love. Some kind people have been working on the reverse engineering of the file format and have filled in a number of the blanks I left. All of the additional information that has been discovered has been added to the files produced. The two main things that have been added are chapter and link indexes. The chapter indexes give the nice names at the top of the eReader viewer application. The link index allows links to work in the eReader viewer application.

To coincide with the eReader PDB output changes, PML input and output had some cleanup. The generated PML looks better now, and Unicode characters are replaced with the \UXXXX equivalent.
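
A minimal sketch of that kind of replacement follows. It is not calibre’s implementation: the function name is hypothetical, the choice of uppercase hex digits is an assumption, and code points above U+FFFF are simply left alone because they do not fit in four hex digits.

```python
def escape_non_ascii(text):
    # Replace each non-ASCII character in the Basic Multilingual Plane
    # with a backslash, a capital U, and its four-digit hex code point.
    out = []
    for ch in text:
        code_point = ord(ch)
        if 0x7F < code_point <= 0xFFFF:
            out.append("\\U%04X" % code_point)
        else:
            out.append(ch)
    return "".join(out)


print(escape_non_ascii("caf\u00e9"))
```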
