Posts Tagged ‘calibre’
* Calibre Week in Review
Posted on December 13th, 2009 by John. Filed under calibre.
FB2 output has been improved. It no longer generated very invalid markup. The output generator still isn’t where I want it to be though. The changes are mostly cleanup and fix long standing issues with the output. One major change is I reverted having <h1> tags work as section and title markers. I don’t like having this hard coded into the generator.
As far as I can tell FB2 does not support a true table of contents (TOC). What seems to happen is reader software will dynamically generate the TOC based on the appearance of <section><title><p>text</p></title> within th e text. If I’m wrong about this and FB2 really does support an external TOC I would love to know. The <h1> differentiation causes the files to have these sections. I do not like how this is hard coded and dependent on this single tag. Especially since calibre’s conversion process allows for an XPath expression to be specified to generate TOC point.
I would love to use the TOC sent to the FB2 output generator but this does not seem to be feasible. The problem with the TOC that is sent to the FB2 output generator is how it corresponds to locations in the document. The OEB TOC points are text which may or my not appear in the text and an anchor id. The anchor id is a set point in the document. FB2 section titles are part of the text itself. I cannot use the text from the OEB TOC because it may or may not actually appear in the text. This also prevents me from determining what the text in the document is supposed to be associated with the TOC point. The anchor id points to an anchor in the document but often that point is something like <a id=”blah” />. In this case there is no text associated with the anchor in the document. While I can assume the text following is part of the title I have no fool proof way to determine where it stops.
At this point the only TOC associated with the FB2 output is the inline TOC that can be optionally generated.
Support for two new readers has been added. Ganaxa’s GeR2 and Nokia’s 770 internet tablet. The N770 should have been support a long time ago and I apologize for how long this request has been unfulfilled. I put it at the bottom of my todo list and at some point it simply fell off and I forgot about it.
The GeR2 reader was a bit of a challenge to get supported. This reader and some models of the Cybook Gen 3 have the same vendor, product and revision (BCD) ids. On Windows and OS X this is not an issue because once the ids are matched further matches are done based on the plug and play (PNP) strings. However, on Linux only the ids are matched.
To solve this problem, matching on Linux needed some further checks. Kovid added support for libusb-1 which provides the vendor and product strings. He also added a call back that can be implemented in the device interface to implement platform and device specific checks. We did run into a few problems. The first was an easy to solve 32 vs 64 bit issue with the Python to C interface Kovid wrote for libusb-1. Once that was sorted out we ran into a larger problem. libusb-1 on Ubuntu by default is denied access to the vendor and product strings.
libusb-1, after appearing in only run release (0.6.27), has been dropped. Kovid has now written a custom device scanner for Linux that will parser the devices in /sys/bus/usb to determine if a reader is connected. libusb-1 is supposed to be an easy to use library capable of providing this functionality but unfortunately this turned out not to be the case. The custom scanner works and allowed me to implement differentiation between the GeR2 and the Cybook Gen 3 so both readers can be properly supported without conflict and with the correct device interface being used.
* Calibre Week In Reveiw
Posted on December 5th, 2009 by John. Filed under calibre.
Most this week was spent turning PML input and output. I spent a bit of work bug tracking and enhancing FB2 output as well.
The changes for PML input are as follows. Pass along the included cover as the cover when converting (also applies to eReader PDB). Allow for images to be in top level, archivename_img or images directory for PMLZ. Based on that order it will check for images and if they are not found move onto the next location. For PML, images can be in pmlname_img or images directory. Footnotes and sidebars now display cleaner. They are separated better and EPUB puts them on individual pages. They also include a return link which goes back to the place in the text they are referenced. This assumes one footnote and sidebar per entry in the text, so if it’s referenced multiple times the return link will go back to the return reference.
PML output now creates \a and \U codes only for supported characters. All characters that are not supported and that cannot be turned into a \a or \U code will be replaced with a ?.
Along with the changes for PML input reading the cover they are now read as part of the metadata. This applies to both PML, PMLZ and eReader PDB files.
I’ve created a PML2PMLZ FileType plugin which will run when ever PML is imported into the GUI. It takes a PML file looks for images in the above mentioned locations, takes it all and puts it into a PMLZ archive. The PMLZ archive is them added to the library.
When I went to test the PML2PMLZ plugin I found that the GUI on my system was horribly broken. After a bit of work with Kovid, I found that calibre-parallel had to be in the path if calibre was installed in a non standard location. I install into my home directory using the develop command. Kovid has committed a fix that writes the install path to the launcher for these instances.
FB2 output now turns h1 tags into <section><title> tags to allow for TOC generation. As far as I can tell FB2 has not set TOC and instead readers dynamically generate the TOC based on looking at all of the body and sections and sets the text using the title tag. Right now the FB2 output is limited to only turning h1 tags and cannot use the user defined TOC based on an XPATH expression. I plan to fix this limitation in the future.
* Calibre Week In Review
Posted on November 30th, 2009 by John. Filed under calibre.
I spent the past week fixing as many bugs with the new PML input parser and cleaning up as much of the output as I could. I really need to thank WayneD for helping find bugs with the new code. Also, Kevin Hendricks who has been working on his own parser based on code from a tool that does some work on eReader files. He helped me formulate what output should be derived in certain cases. The new PML input parser has been released as part of calibre 0.6.25.
* Calibre Week in Review
Posted on November 22nd, 2009 by John. Filed under calibre.
PML input had some major changes this week. Thank the user WayneD for helping me out and getting me to actually do the work I’ve been putting off since I introduced PML/eReader as an input format.
There is now a metadata reader for PML and PMLZ. WayneD provided me with a set of regular expressions that can extract the metadata from a metatdata comment within a PML document. I took those regexes and created a metatdata plugin that supports both straight PML files as well as the PMLZ archive file.
The other major change to PML is, I’ve re-written the input parser. It is not longer based on a set of regular expressions. It is now a line oriented simple state machine. When I created the regex parser I intended to replace it at some point in the future with a true parser. The regex based one was simply a quick and dirty way to get PML supported. The new parser is much faster, produces cleaner and more accurate HTML output. It also has the added benefit of reading \CX codes and turns them into table of contents entries for PML and PMLZ input. The new parser is much better and I’m not completely finished with it. I still need to add support for \v comments (they are currently removed), \n codes, and implement font attribute tracking to condense changes (this is how \n will be handled).
WayneD did provide me with his Perl based line oriented simple state machine for PML to HTML conversion. I did use one idea from it. Turning footnote and sidebar xml syntax into custom PML tags. I had intended to port his parser to python and use it as a base but when I started looking at it I remembered I don’t know Perl at all and I can’t make heads or tails of Perl code. I have no desire or need to actually learn Perl, so I ended up writing my own parser.
* Calibre Week in Review
Posted on October 26th, 2009 by John. Filed under calibre.
Mostly bug fixes this week. The majority of them were centered around eReader PDB output and PML generation. eReader PDB output now marks the first image as the cover image if a cover image is not explicitly set. PMLZ got images named properly in the output. PML generation now has .png added to the end of image names. I also fixed a bug where excessive new lines were not being properly removed. PML, TXT, RB, FB2 output all got excessive space removal tones down so instances were spaces were completely removed will stop happening. Regex header and footer matching was tweaked to match at a later stage in the conversion pipeline. This should ease issues of expressions not matching properly. Finally, at Kovid’s request I’ve added some info about header / footer regexes and converting TXT and PDF files to the documentation.
* Calibre Week in Review
Posted on October 19th, 2009 by John. Filed under calibre.
Like every week there were miscellaneous bug fixes. However, this week I did a bit more. TCR input and output. Do be warned that the output supports multiple compression levels; the higher levels being slower than the lower. For instance a 200K TXT file as input will take around 25 seconds on the lowest level and 3.5 minutes at the highest.
TCR is an compressed text format used mainly by the Psion 3 and 5 series PDAs that were produced in the 90s. The compression used by TCR files is very interesting. It doesn’t have as high a compression ratio as say zlib but that is a trade off for being decompressable starting at any point in the stream. The history and more information about the format can be found at Andrew Giddings’ TCR page.
* Calibre Week in Review
Posted on October 11th, 2009 by John. Filed under calibre.
I haven’t had one of these for quite some time. I’ve been working on other projects and on the calibre font I’ve only dealing with small bug fixes. However, this past week I’ve done a bit of work that is worth mentioning.
I’ve cleaned up the FB2 output. It fixes some invalid markup. Fixes some issues with text not being displayed by FBReader. It also fixes some issues with invalid characters making there way into hrefs.
eReader PDB output also got some love. Some kind people have been working on the reverse engineering of the file format and have filled in a number of the blanks I left. All of the additional information that has been discovered has been added to the files produced. The two main things that have been added are chapter and link indexes. The chapter indexes give the nice names at the top of the eReader viewer application. The link index allows links to work in the eReader viewer application.
To coincide with the eReader PDB output changes, PML input and output had some cleanup. It looks better now and replaces unicode characters with the \UXXXX equivalent.
* Calibre Two Weeks in Review
Posted on August 2nd, 2009 by John. Filed under calibre.
This time I missed last weeks week in review because I simply forgot. I’m hoping to keep this to a minimum in the future.
The big news is calibre 0.6 has been released. Kovid is now back to his regular (weekly at the least) bug fix releases too. As of now the latest version is 0.6.4 and I get the feeling that 0.6.5 is right around the corner. For a listing of what’s gone into the 0.6 release take a look here.
Some features I’ve been working on that are included in the current release are: asciiize text, and iRex Iliad support.
The asciiize feature is one I’ve been wanting for a while and I’ve finally implemented it. It’s based on the ASCIIize text post I made last week. There is now an –asciiize option for the ebook-convert command and an option for the conversion dialog in the GUI. The basic premise of this feature is unicode characters often are not displayed correctly by ebook readers. My Cybook Gen 3 exhibits this behavior. ASCIIize transliterates the unicode text to an ASCII representation. Meaning it takes “Михаил Горбачёв” and converts it to “Mikhail Gorbachiov”.
The iRex Iliad is now as supported as I can get it. calibre detects it as a device, displays the list of books on the device and you can send books to it using send to device. The part I’m running into issues with is the manifest.xml file. It looks like it’s similar to the Sony PRS’s media.xml file, meaning it is a quick store for metadata. However, I don’t really know what goes into this file or should I say files (there are more than one). It also doesn’t look like the device updates it in any way because I had a user send me the files off of their Iliad and they were empty even though the user has put 20 or so books on the device with the Mobipocket desktop software. On the bright side it works well enough that I don’t think anyone will notice.
On the GUI tweaks front (these won’t be in a release until some point in the future) I’ve added a history drop down to the search field. There is a swap button for authors and title in the metadata bulk dialog, and you can hide the toolbars in the ebook viewer.
The remainder of what I accomplished over the past two weeks was bug fixes and code refactoring. Just th boring stuff that I would rather put off versus actually doing it.
* Calibre Week in Review
Posted on July 19th, 2009 by John. Filed under calibre.
There was no week in review last week because I went on vacation this past week. So this week in review combines everything since the last week in review.
I’ve made a few bug fixes to some output formats, PDB metadata and FB2 output mainly. The major things I’ve been working on is a bit of restructuring for the GUI and fixing some small bugs.
The GUI has had the button in the status bar (jobs, tags, cover flow) moved to a side bar on the right hand side. The version information and device connected information has moved to the status bar. The donate button was moved to the side bar. The status bar is now collapsible. When collapsed it shows less information (the list of formats for the selected book). When expanded it shows the book info the same as it does currently. The location view that lists the library and the connected device is hidden when no device is connected. I’ve added an About button to the new sidebar that will show some information about calibre. Overall these changes have two major benefits. It makes the interface a lot more netbook friendly and makes the book table larger so more information can be seen.
These changes to the GUI will be part of the 0.6 series but I can’t say for certain if it will be included in the initial 0.6.0 release.
* Calibre Week in Review
Posted on July 5th, 2009 by John. Filed under calibre.
This week has been a productive one. I’ve made a lot of small GUI enhancements and did some work on PDF input as well. All of these changes have not made it into trunk yet. This is mainly because Kovid has been away this week.
I’ve added auto complete to a number of the input control on the GUI. Authors, Publisher, and Tags all auto complete pretty much everywhere now. The Tags will even auto complete in the table view in the main window. However, Authors, Series and Publisher do not auto complete in the main windows as of yet.
I’ve also been working with the GUI’s search. ISBN, Rating, Cover fields are all included in the default search. They are also search field identifiers. Meaning you can do isbn:123 to search just isbn numbers. Searching for empty and filled fields has been implemented as well. Use field:false and field:true respectively.
PDF input, either is or last week, got the ability to specify an unwrap factor for unwrapping lines. Previously this was a fixed value. Now it can be changed by the user. I have some ideas to enhance this further but I’m not going to to into detail because they may not materialize. Use the option –unwrap-factor with a decimal value 0 – 1. It is used by the regular expression that determines the minimum line length required for unwrapping.
PDF input had another highly requested change. The ability to remove headers and footers. However, it’s not as user friendly as I would like. There are four new options in total. –remove-header, –remove-footer, –header-regex, and –footer-regex. If the the –remove-* options are used then a regular expression that can be customized by using –*-regex is used to match headers and footers. The header and footer matching happens before all other processing rules. Use the ebook-convert’s –debug-input option to see the HTML that the regex will be matched against.
$ ebook-convert input.pdf .epub --debug-input output_dir/
Tags
Archives
- April 2012 (1)
- March 2012 (1)
- February 2012 (3)
- January 2012 (3)
- December 2011 (2)
- November 2011 (1)
- October 2011 (3)
- September 2011 (9)
- August 2011 (15)
- July 2011 (5)
- June 2011 (3)
- May 2011 (4)
- April 2011 (2)
- March 2011 (2)
- February 2011 (4)
- January 2011 (4)
- December 2010 (2)
- November 2010 (1)
- October 2010 (1)
- August 2010 (3)
- July 2010 (4)
- June 2010 (1)
- May 2010 (2)
- March 2010 (1)
- January 2010 (8)
- December 2009 (5)
- November 2009 (6)
- October 2009 (4)
- September 2009 (2)
- August 2009 (6)
- July 2009 (6)
- June 2009 (4)
- May 2009 (6)
- April 2009 (4)
- March 2009 (2)
- February 2009 (4)
- January 2009 (4)
- December 2008 (7)
- November 2008 (2)