Posts Tagged ‘mobi’
* Amazon APNX file format
Posted on February 9th, 2011 by John. Filed under programming.
The Kindle 3.1 firmware brings the ability to show real page numbers. To get ready for this, Amazon has put out a preview release of the 3.1 firmware and has started adding the necessary information to Kindle books to show the page numbers.
The page numbers themselves map to the pages of the corresponding print book. Overall it gives a very pleasant experience. Amazon has implemented the page mapping through a new auxiliary file that has the .apnx extension. This way they can easily add the feature to all existing books without having to worry about incompatibilities with older Kindles.
There is an easy way to tell if a book is going to include the APNX file. Look for “Page Numbers Source ISBN:” in the Product Details. All books that map pages to a print book will specify which edition they map to.
Now on to the more technical part of this post. I’ve spent some time looking at various books that Amazon is distributing with the APNX file and I’ve been able to reverse engineer the format. It’s a very simple format: after the header information, it is simply a list of 4-byte big-endian integers that correspond to locations in the uncompressed text. The position of the integer in the list corresponds to its page number.
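To make the idea concrete, here is a minimal Python sketch of how such a list could be decoded and used to look up a page. The function names `decode_page_list` and `page_for_offset` are made up for illustration, and the assumption that the first entry in the list is page 1 is mine, not something confirmed by Amazon.

```python
import bisect
import struct


def decode_page_list(raw):
    """Decode a run of 4-byte big-endian integers into a list of offsets."""
    count = len(raw) // 4
    return list(struct.unpack(">%dI" % count, raw[:count * 4]))


def page_for_offset(page_starts, offset):
    """Return the page containing `offset`, assuming the first entry is page 1.

    `page_starts` must be sorted lowest to highest. An offset that falls
    before the first entry yields 0.
    """
    return bisect.bisect_right(page_starts, offset)


# Three pages starting at offsets 0, 150 and 420 in the uncompressed text.
raw = struct.pack(">3I", 0, 150, 420)
page_starts = decode_page_list(raw)
current_page = page_for_offset(page_starts, 200)
```

Because the list is sorted, a binary search is enough to go from a location in the text to a page number.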
Following is the documentation of the APNX specification I’ve written:
APNX
----

apnx files are used by the Amazon Kindle (firmware revision 3.1+) to map pages from a print book to the Kindle version. Integers within the file are big-endian.

Layout
------

bytes   content         comments

4       00010001        Format identifier. Value of 65537 when read big-endian.
4       start of next   The offset after ending location of the first header. Starts a new sequence of header info
4       length          Length of first header
N       first header    String containing content header

Starts next sequence

2       unknown         Always 1
2       length          Length of second header
2       page count      Total number of bytes after second header that represent pages. This total includes bytes that are ignored by the pageMap.
2       unknown         Always 32
N       second header   String containing the page mapping header
4*N     padding         The first number given in the page mapping header indicates the number of 0 bytes.
4*N     page list

Content Header
--------------

The content header is a string enclosed in {} containing key, value pairs.

content          comments

contentGuid      Guid.
asin             Amazon identifier for the Kindle version of the book.
cdeType          MOBI cdeType. Should always be EBOK for ebooks.
fileRevisionId   Revision of this file.

Example:

{"contentGuid":"d8c14b0","asin":"B000JML5VM","cdeType":"EBOK","fileRevisionId":"1296874359405"}

Page Mapping Header
-------------------

The page mapping header is a string enclosed in {} containing key, value pairs.

content   comments

asin      The ISBN 10 for the paper book the pages correspond to
pageMap   Three value tuple. Looks like: "(N,N,N)"
          1) Number of bytes after header that starts the page numbering sequence
          2) unknown
          3) unknown

Example:

{"asin":"1906694184","pageMap":"(4,a,1)"}

Page List
---------

The page list is a sequence of offsets in the uncompressed HTML. Each value is the beginning of a new page. Each entry is a 4 byte big endian int. The list is ordered lowest to highest.
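As a rough illustration of the layout above, the following Python sketch splits an APNX file into its two headers and the integers that follow. The function name `parse_apnx` is hypothetical. It assumes both headers can be parsed as JSON (the examples above happen to be valid JSON), and it does not interpret the pageMap tuple, so any leading padding is left in the returned list.

```python
import json
import struct


def parse_apnx(data):
    """Split raw APNX bytes into its headers and the trailing integers."""
    # Fixed-size start: format identifier, offset of the second sequence,
    # and length of the content header (4 bytes each, big-endian).
    ident, next_start, content_len = struct.unpack_from(">III", data, 0)
    if ident != 65537:
        raise ValueError("unexpected format identifier: %d" % ident)
    content_header = json.loads(data[12:12 + content_len].decode("utf-8"))

    # Second sequence: unknown (1), header length, page count, unknown (32).
    _unknown1, page_hdr_len, page_count, _unknown2 = struct.unpack_from(
        ">HHHH", data, next_start)
    hdr_start = next_start + 8
    hdr_end = hdr_start + page_hdr_len
    page_header = json.loads(data[hdr_start:hdr_end].decode("utf-8"))

    # Everything after the second header is read as 4-byte big-endian ints.
    # Padding is NOT stripped here (assumption: the caller handles pageMap).
    body = data[hdr_end:]
    count = len(body) // 4
    entries = list(struct.unpack_from(">%dI" % count, body, 0))

    return {
        "content_header": content_header,
        "page_header": page_header,
        "page_count": page_count,
        "entries": entries,
    }
```

A real file can be passed in with `parse_apnx(open(path, "rb").read())`, where `path` points to an .apnx file copied from the Kindle.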
* Calibre Week in Review
Posted on April 18th, 2009 by John. Filed under calibre.
This has been a busy week for me on the Calibre front. All of my changes went into the pluginize branch, and the first three I talk about also made it into trunk and will be appearing in the next release.
I re-worked the mobi metadata reader so that it does not read the entire file into memory. It only reads the parts of the file that hold the metadata. The advantage is that reading the metadata is now about five times faster. These results are from unscientific testing by the bug reporter. Basically he said that listing the books on his Kindle went from 5 minutes to about 1 minute.
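The general idea of reading only what is needed can be sketched in Python as follows. This is not Calibre's reader; it only reads the 78-byte PalmDB header that every mobi file starts with, and the function name `read_palmdb_ident` is made up for illustration. The book metadata itself sits further into the file and needs a few more seeks, which this sketch leaves out.

```python
import io
import struct

PALMDB_HEADER_SIZE = 78  # fixed-size header at the start of a PalmDB file


def read_palmdb_ident(stream):
    """Read only the PalmDB header instead of the whole file.

    Returns (database name, type/creator string, number of records).
    """
    stream.seek(0)
    header = stream.read(PALMDB_HEADER_SIZE)
    if len(header) < PALMDB_HEADER_SIZE:
        raise ValueError("file too short to hold a PalmDB header")
    # Name: 32 bytes, null padded. Type and creator: 4 bytes each at 60.
    name = header[:32].split(b"\x00", 1)[0].decode("latin-1")
    ident = header[60:68].decode("ascii", "replace")
    # Number of records: 2 bytes, big-endian, at offset 76.
    (num_records,) = struct.unpack_from(">H", header, 76)
    return name, ident, num_records


# Fake file: a header followed by a large body that is never read.
fake = io.BytesIO(
    b"My_Book".ljust(32, b"\x00") + b"\x00" * 28 + b"BOOKMOBI"
    + b"\x00" * 8 + struct.pack(">H", 5) + b"x" * 100000
)
info = read_palmdb_ident(fake)
```

With a real file, the stream would come from `open(path, "rb")`, and only the first 78 bytes are pulled off the device no matter how large the book is.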
The metadata writer for pdf files has been re-worked and is now enabled. Kovid built on my initial work so that it won’t lock up the GUI when working with large pdf files.
I (with a bit of help from Kovid on this too) was able to fix bug 2112 (last few pdf files held open). Calibre relies on Python’s garbage collector and object scope for closing files. It does not explicitly close them. The bug was caused by pyPdf, which is a Python library Calibre uses to read and write pdfs. For some reason pyPdf’s file reader wasn’t allowing the files to be closed. They were no longer in use and the object went out of scope, but the garbage collector didn’t close the file immediately. It would close it eventually. A wrapper object was created and is used so that pyPdf doesn’t have a direct reference to the open file, and it now gets closed properly.
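A wrapper along these lines could look like the Python sketch below. It is an illustration of the idea only, not the code that went into Calibre, and the class name `ClosingFileProxy` is hypothetical. The library is handed the proxy, so the only reference to the real file is one we control and can close explicitly.

```python
import io


class ClosingFileProxy:
    """Forwards file calls while keeping ownership of the real file."""

    def __init__(self, fileobj):
        self._file = fileobj

    def _require_open(self):
        if self._file is None:
            raise ValueError("I/O operation on closed proxy")
        return self._file

    def read(self, *args):
        return self._require_open().read(*args)

    def seek(self, *args):
        return self._require_open().seek(*args)

    def tell(self):
        return self._require_open().tell()

    def close(self):
        # Close the real file and drop our reference to it. Anything still
        # holding the proxy no longer keeps the file itself open.
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


raw = io.BytesIO(b"%PDF-1.4 example bytes")
with ClosingFileProxy(raw) as proxy:
    magic = proxy.read(4)  # a pdf library would be given `proxy` here
```

Once the `with` block ends the underlying file is closed, whether or not the library has let go of the proxy.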
The GUI in the releases only supports displaying one storage card from a device. Not all devices support two storage cards, but the Sony PRS devices do. Support for the GUI to display two storage cards has been added.
To go along with the GUI displaying two storage cards, almost all device drivers have been made to support up to two storage cards. The USBMS base class supports two cards, and as most device drivers use this base they all get support for it without much work. However, this doesn’t mean that a device that doesn’t physically have a storage card slot, or has only one, will magically support two cards. None of the drivers except the PRS ones have any user-visible changes. For anyone looking to write a device driver using USBMS: if the device supports two cards, USBMS has you covered.
The PRS505 and PRS700 drivers both received the two card treatment. They also received a bit of work. They have been moved to use the USBMS base class. This removed a lot of redundant code and puts them on the same code path as the other drivers (except PRS500). Overall this change should reduce the work involved in maintenance and in finding and fixing bugs.
Internal work on the PRS505 and PRS700 drivers wasn’t all I did to them. They no longer dump all books into a single directory. Books are stored in an author/title/book hierarchy. News items are stored in a news/title hierarchy. They also support the USBMS / tag as a custom layout path.
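The layout can be pictured with a small Python sketch. The function names, the set of replaced characters, and the choice to put the file name under news/title are all assumptions for illustration and are not taken from the drivers themselves.

```python
import posixpath
import re


def _clean(part):
    """Replace characters that are awkward in card file names (assumed set)."""
    return re.sub(r'[\\/:*?"<>|]', "_", part).strip() or "Unknown"


def device_path(title, author, filename, is_news=False):
    """Build a relative path on the device for a book or news item."""
    if is_news:
        # News items: news/title/file
        return posixpath.join("news", _clean(title), _clean(filename))
    # Books: author/title/file
    return posixpath.join(_clean(author), _clean(title), _clean(filename))


book = device_path("Dune", "Frank Herbert", "dune.epub")
news = device_path("Daily Feed", "calibre", "feed.epub", is_news=True)
```

Grouping by author and title keeps a large library browsable from the device's own file listing instead of leaving everything in one flat directory.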
Earlier I said almost all device drivers got two storage card support. The PRS500 driver did not. It still only supports one storage card. Due to the way the driver works I will not be touching it.
I’ve been working with ldolse from mobileread and with his help the processing rules for pdftohtml (used for pdf input) have been improved.