Amazon APNX file format

The Kindle 3.1 firmware adds support for real page numbers. In preparation, Amazon has put out a preview release of the 3.1 firmware and has started adding the necessary information to Kindle books so they can show page numbers.

The page numbers map to the pages of the corresponding print book. Overall, it makes for a very pleasant experience. Amazon has implemented the page mapping through a new auxiliary file that has the .apnx extension. This lets them add the feature to all existing books without having to worry about incompatibilities with older Kindles.

There is an easy way to tell whether a book is going to include an APNX file: look for “Page Numbers Source ISBN:” in the Product Details. All books that map pages to a print book will specify which edition they map to.

Now on to the more technical part of this post. I’ve spent some time looking at various books that Amazon is distributing with an APNX file, and I’ve been able to reverse engineer the format. It’s a very simple format: after the header information, the file is simply a list of 4-byte big-endian integers that correspond to locations in the uncompressed text. The position of each integer in the list corresponds to its page number.
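
To make the idea of the list concrete, here is a minimal Python sketch that decodes a run of 4-byte big-endian integers. Treating the integers as unsigned is an assumption, and the function name and the sample bytes are made up for illustration rather than taken from a real book.

```python
import struct
from typing import List

def decode_page_offsets(body: bytes) -> List[int]:
    """Decode consecutive 4-byte big-endian integers (assumed unsigned)."""
    count = len(body) // 4  # any trailing partial entry is ignored
    return list(struct.unpack(">%dI" % count, body[:count * 4]))

# Hypothetical page list with three entries.
sample_body = bytes.fromhex("00000000" "000004b2" "00000a10")
page_offsets = decode_page_offsets(sample_body)

# The position in the list identifies the page; the value is the text offset.
for position, offset in enumerate(page_offsets):
    print(position, offset)
```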

The following is the documentation I’ve written for the APNX format:

APNX
----

APNX files are used by the Amazon Kindle (firmware revision 3.1+) to
map pages from a print book to the Kindle version. Integers within
the file are big-endian.


Layout
------

bytes   content             comments 

4       00010001            Format identifier. Value of 65537 big-endian.
4       start of next       Offset just past the end of the first header,
                            where a new sequence of header info starts.
4       length              Length of first header
N       first header        String containing content header
                            (the next sequence starts here)
2       unknown             Always 1
2       length              Length of second header
2       page count          Total number of bytes after second header that
                            represent pages. This total includes bytes that
                            are ignored by the pageMap.
2       unknown             Always 32
N       second header       String containing the page mapping header
4*N     padding             The first number given in the page mapping
                            header indicates the number of 0 bytes.
4*N     page list           
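
As a rough sketch, the layout above could be read in Python along these lines. The function name is hypothetical, decoding the header strings as UTF-8 is an assumption, and the sample data is synthetic, built to match the table rather than copied from a real book.

```python
import struct

def parse_apnx_headers(data):
    """Split an APNX blob into its two header strings, following the layout table."""
    ident, next_start, first_len = struct.unpack_from(">III", data, 0)
    if ident != 0x00010001:
        raise ValueError("unexpected format identifier: %#010x" % ident)
    # The first header string begins right after the three 4-byte fields.
    content_header = data[12:12 + first_len].decode("utf-8")  # encoding is an assumption

    # The second sequence starts at the offset stored in the file.
    unknown_a, second_len, page_count, unknown_b = struct.unpack_from(
        ">HHHH", data, next_start)
    header_start = next_start + 8
    page_header = data[header_start:header_start + second_len].decode("utf-8")
    return {
        "content_header": content_header,
        "page_header": page_header,
        "page_count": page_count,
        "unknowns": (unknown_a, unknown_b),
        "body_offset": header_start + second_len,  # padding and page list follow
    }

# Synthetic sample that follows the table; not taken from a real book.
first = b'{"asin":"B000JML5VM","cdeType":"EBOK"}'
second = b'{"asin":"1906694184","pageMap":"(4,a,1)"}'
sample = (
    struct.pack(">III", 0x00010001, 12 + len(first), len(first)) + first
    + struct.pack(">HHHH", 1, len(second), 8, 32) + second  # 8 = padding + list bytes
    + b"\x00" * 4                # padding
    + struct.pack(">I", 1202)    # a single page list entry
)
info = parse_apnx_headers(sample)
print(info["content_header"])
print(info["page_header"])
```

The sketch stops at the end of the second header and only reports where the padding and page list begin, since the meaning of some fields is still unknown.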


Content Header
--------------

The content header is a string enclosed in {} containing key/value pairs.

content             comments

contentGuid         Guid.
asin                Amazon identifier for the Kindle version of the book.
cdeType             MOBI cdeType. Should always be EBOK for ebooks.
fileRevisionId      Revision of this file.

Example:
{"contentGuid":"d8c14b0","asin":"B000JML5VM","cdeType":"EBOK","fileRevisionId":"1296874359405"}
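
The example above happens to be valid JSON, so one way to read it is with a JSON parser. Treating the content header as JSON in general is an assumption based on the samples seen so far, not something the format is known to guarantee.

```python
import json

# The example content header from above, as a Python string.
header_text = ('{"contentGuid":"d8c14b0","asin":"B000JML5VM",'
               '"cdeType":"EBOK","fileRevisionId":"1296874359405"}')

content_header = json.loads(header_text)

# Every value arrives as a string, including fileRevisionId.
for key, value in content_header.items():
    print(key, "=", value)
```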


Page Mapping Header
-------------------

The page mapping header is a string enclosed in {} containing key/value pairs.

content             comments

asin                The ISBN 10 for the paper book the pages correspond to
pageMap             Three value tuple. Looks like: "(N,N,N)"
                    1) Number of bytes after header that starts the page numbering sequence
                    2) unknown
                    3) unknown

Example:
{"asin":"1906694184","pageMap":"(4,a,1)"}
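
A small sketch of pulling the three pageMap values out of this header follows. It assumes the header parses as JSON and that the tuple is always three comma-separated fields inside parentheses. The helper name is hypothetical, and the fields are left as strings because the meaning of the second and third is unknown.

```python
import json

def parse_page_map(header_text):
    """Return (asin, fields), where fields holds the three pageMap values as strings."""
    header = json.loads(header_text)
    fields = header["pageMap"].strip("()").split(",")
    if len(fields) != 3:
        raise ValueError("expected three pageMap values, got %d" % len(fields))
    return header["asin"], fields

print_asin, page_map = parse_page_map('{"asin":"1906694184","pageMap":"(4,a,1)"}')
print(print_asin, page_map)
```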


Page List
---------

The page list is a sequence of offsets into the uncompressed HTML. Each
value is the beginning of a new page. Each entry is a 4-byte big-endian
integer. The list is ordered lowest to highest.
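
Because the list is ordered, finding the page that contains a given text offset can be done with a binary search. The sketch below returns a 0-based position in the list. How that position relates to the number shown on the device is not covered here, and the function name and sample offsets are invented for illustration.

```python
import bisect

def page_index_for_offset(page_offsets, text_offset):
    """Return the list position of the last page starting at or before text_offset."""
    position = bisect.bisect_right(page_offsets, text_offset) - 1
    return max(position, 0)  # offsets before the first entry fall on the first page

# Hypothetical page list: each value is where a new page begins.
sample_offsets = [0, 1202, 2576]

for text_offset in (0, 1201, 1202, 5000):
    print(text_offset, page_index_for_offset(sample_offsets, text_offset))
```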