Posts Tagged ‘pdf’
This week I focused on PDF output. A bug introduced in 0.8.17 that broke PDF output has now been fixed. I was also able to fix PDF output on OS X, which now uses OS X’s internal PDF engine instead of Qt’s. Page sizes other than A4 are now possible, and the PDFs produced are no longer large image-based monstrosities: text is now selectable and can be copied.
I am currently working on Perl compatible regular expression (PCRE) support. An initial version has been put into git. I have an enhanced version working that allows for case changes in the replacement text. Right now I’m working on caching the results of a search to improve performance.
Posted on January 1st, 2011 by John. Filed under calibre.
I did some work on PDF output. Mainly I refactored some of the output generation code to reduce redundant sections. Overall there won’t be any user-visible changes.
The main reason I dove back into PDF output was that a user on OS X noted that the PDFs produced were not searchable. Windows users are getting searchable PDFs, and Kovid was able to get a searchable PDF on his Gentoo Linux machine. I looked into the issue and cannot get searchable PDFs on OS X. However, I can get searchable PDFs when using the ebook-viewer’s print feature via print to file. I’m not sure why this happens, because the ebook-viewer and PDF output use the same technique for generating a PDF. I’ve decided not to pursue the matter further because PDF output on OS X is pretty much broken due to Qt bugs. See this for an example.
TCR compression was something I added a while ago. I was never fully happy with it because it was slow and produced low quality output. I spent a few days completely rewriting the compressor and now it performs beautifully. The new compressor is cleaner, roughly 10x faster, compresses to a much smaller size, and I would say is on par with the output of Andrew Giddings’ TCR implementation. However, calibre’s TCR compressor is a pure Python implementation and is still considerably slower than Andrew’s C implementation.
A minor bug in FB2 output was brought to my attention and fixed. Basically JPG images in the input document were not being written to the FB2 output file. This has been corrected.
Posted on July 5th, 2009 by John. Filed under calibre.
This week has been a productive one. I’ve made a lot of small GUI enhancements and did some work on PDF input as well. None of these changes have made it into trunk yet. This is mainly because Kovid has been away this week.
I’ve added auto complete to a number of the input controls in the GUI. Authors, Publisher, and Tags all auto complete pretty much everywhere now. The Tags will even auto complete in the table view in the main window. However, Authors, Series and Publisher do not auto complete in the main window as of yet.
I’ve also been working with the GUI’s search. The ISBN, Rating and Cover fields are all included in the default search. They are also search field identifiers, meaning you can do isbn:123 to search just ISBN numbers. Searching for empty and filled fields has been implemented as well. Use field:false and field:true respectively.
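To make the search syntax concrete, here is a minimal Python sketch of how a field:value query could be evaluated against one book. It is not calibre’s search code. The `matches` function and the dict-based book record are hypothetical and exist only for illustration.

```python
def matches(book, query):
    """Return True if `book` (a dict of field name -> value) satisfies `query`.

    Hypothetical helper for illustration; not calibre's implementation.
    """
    field, sep, value = query.partition(":")
    if not sep:
        # No field identifier: look for the text in every filled field.
        needle = query.lower()
        return any(needle in str(v).lower() for v in book.values() if v)
    current = book.get(field.lower())
    if value == "true":
        # field:true matches books where the field is filled.
        return bool(current)
    if value == "false":
        # field:false matches books where the field is empty.
        return not current
    return value.lower() in str(current or "").lower()


book = {"title": "Dune", "isbn": "9780441172719", "rating": None}
print(matches(book, "isbn:978"))      # ISBN-only search
print(matches(book, "rating:false"))  # empty-field search
```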
PDF input, either this week or last, got the ability to specify an unwrap factor for unwrapping lines. Previously this was a fixed value. Now it can be changed by the user. I have some ideas to enhance this further but I’m not going to go into detail because they may not materialize. Use the option --unwrap-factor with a decimal value between 0 and 1. It is used by the regular expression that determines the minimum line length required for unwrapping.
PDF input had another highly requested change: the ability to remove headers and footers. However, it’s not as user friendly as I would like. There are four new options in total: --remove-header, --remove-footer, --header-regex, and --footer-regex. If the --remove-* options are used, then a regular expression, which can be customized with --*-regex, is used to match headers and footers. The header and footer matching happens before all other processing rules. Use ebook-convert’s --debug-input option to see the HTML that the regex will be matched against.
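A rough Python sketch of how an unwrap factor can feed a regular expression follows. It is a simplified illustration, not the actual PDF input code. It assumes the minimum length is the median line length multiplied by the factor, and that only lines ending in a lowercase letter, comma or semicolon are joined. Both rules are assumptions made for this example.

```python
import re
from statistics import median


def unwrap(text, unwrap_factor=0.5):
    """Join hard-wrapped lines that are at least `unwrap_factor` of the
    median line length. Hypothetical helper for illustration only."""
    lengths = [len(line) for line in text.split("\n") if line.strip()]
    if not lengths:
        return text
    min_len = int(median(lengths) * unwrap_factor)
    # The lookbehind requires min_len characters on the same line, then a
    # character suggesting the sentence continues. "." never matches a
    # newline, so short lines (headings, last lines) are left alone.
    pattern = re.compile(r"(?<=.{%d}[a-z,;])\n(?=\S)" % min_len)
    return pattern.sub(" ", text)


sample = "This is a long line that was wrapped by the\nPDF layout engine.\n\nShort\nheading"
print(unwrap(sample))
```

A factor closer to 0 joins more aggressively, and a factor closer to 1 only joins lines that are nearly full width.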
$ ebook-convert input.pdf .epub --debug-input output_dir/
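Once the debug HTML is available, a candidate expression can be tried out in plain Python before passing it to the converter. The sketch below only shows the general idea of stripping matches before further processing. The `strip_headers_footers` function and the sample patterns are made up for this example and are not taken from calibre.

```python
import re


def strip_headers_footers(html, header_regex=None, footer_regex=None):
    """Remove every match of the given patterns from `html`.

    Hypothetical helper for testing a regex against debug output.
    """
    for pattern in (header_regex, footer_regex):
        if pattern:
            html = re.sub(pattern, "", html)
    return html


page = "<p>My Book Title</p><p>Text here.</p><p>Page 12</p>"
cleaned = strip_headers_footers(
    page,
    header_regex=r"<p>My Book Title</p>",
    footer_regex=r"<p>Page \d+</p>",
)
print(cleaned)
```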
Posted on June 20th, 2009 by John. Filed under calibre.
The bulk of what I’ve done this week was the many bug fixes going into the Calibre beta.
The one other thing I worked on was bringing image extraction back to PDF input. It works as well as the 0.5.x series now, meaning it will handle simple cases but there are still some bugs.
Posted on May 2nd, 2009 by John. Filed under calibre.
It seems that PDF is becoming the never ending format for me. Maybe I should start naming the posts PDF Work instead of Calibre Week in Review…
One minor and one major change to PDF processing this week. The minor change was a fix for bug 2342. German umlauts are now displayed correctly in the output. The major change is that PDF output now supports comics. cbz, cbr and cbc are some of the supported input formats for comics, and now you can turn them into a PDF. The huge advantage is for people (like me) who have a Cybook. A comic can be turned into one PDF file sized for the device, keeping down the amount of clutter in the library view.
I also worked on the device framework and have pluginized all of the device interfaces (I like the term interfaces better than drivers because it reduces confusion, as Windows device drivers are very different). They also sport a new configuration system (though they didn’t have configuration before at all). The user will be able to specify their preferred format order for sending to the device, as well as disable certain formats from being sent to the device at all. I say “will” because while the configuration code is done there is currently no way to call it in the preferences dialog. However, this will be rectified before 0.6 is released.
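As a small illustration of what a preferred format order with disabled formats means in practice, here is a Python sketch. The `pick_format` function and its arguments are hypothetical and are not the names used by the device plugins.

```python
def pick_format(available, preferred_order, disabled=()):
    """Return the first preferred format that the book has and that is not
    disabled for the device, or None if nothing qualifies.

    Hypothetical helper for illustration only.
    """
    for fmt in preferred_order:
        if fmt in available and fmt not in disabled:
            return fmt
    return None


# A book available as EPUB and PDF, sent to a device that prefers EPUB
# but has EPUB disabled in its configuration.
print(pick_format({"epub", "pdf"}, ["epub", "pdf", "txt"], disabled={"epub"}))
```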
eReader output has been put on hold for the foreseeable future. eReader input is complete and working, but due to the undocumented nature of the eReader format I have not been able to produce a working output plugin. The main issue I’ve run into is that the eReader header (record 0 within the pdb container) is a 132 byte package with 66 sections. There are too many unknown sections. Even with the inspector script I wrote to see what the values are in working eReader files, I have not been able to understand how all of the sections interact with the file itself. My guesses have all resulted in files that are not readable by the eReader Pro software.
eReader files use the PML markup language, and while I couldn’t get eReader output working I have added support for PML input and PML output. The PML output can be taken and put into either MakeBook or DropBook to produce a working eReader file.
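For anyone who wants to poke at the header themselves, the following Python sketch shows one way to dump it. It is not the inspector script from the source tree. It assumes the 132 bytes split evenly into 66 two-byte sections, read as big-endian unsigned integers. The even split follows from the numbers above, but the byte order and signedness are my assumptions.

```python
import struct

HEADER_SIZE = 132   # size of record 0 as described above
SECTION_COUNT = 66  # number of sections in the header


def dump_header(record0):
    """Return the header sections of record 0 as a tuple of integers.

    Assumption: each section is a big-endian unsigned 16-bit value.
    """
    if len(record0) < HEADER_SIZE:
        raise ValueError("record 0 is shorter than %d bytes" % HEADER_SIZE)
    return struct.unpack(">%dH" % SECTION_COUNT, record0[:HEADER_SIZE])


# Synthetic record for demonstration; a real one comes from an eReader pdb.
fake_record = struct.pack(">66H", *range(66))
for index, value in enumerate(dump_header(fake_record)[:5]):
    print(index, value)
```

Comparing the dumped values across several working files is a cheap way to spot which sections stay constant and which vary with the content.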
Two things to note about the PML support. Input can take either a straight .pml file or a zip archive filled with .pml files and PNG images (the images must be in PNG format). The zip archive must have its extension changed to .pmlz for this to work. PML output will produce a zip archive with the extension .pmlz. Within this archive will be all of the image files in PNG format and the produced .pml files.
.pmlz is simply an easy way to group the files and ensure that there are no issues with missing files or with not being able to find referenced files.
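Since a .pmlz is just a zip archive with a different extension, one can be put together by hand. Here is a minimal Python sketch using the standard zipfile module. The `make_pmlz` function is hypothetical, and placing every file at the top level of the archive is an assumption on my part, not a documented requirement.

```python
import os
import zipfile


def make_pmlz(source_dir, pmlz_path):
    """Pack the .pml and .png files found in `source_dir` into `pmlz_path`.

    Assumption: files sit at the top level of the archive.
    """
    with zipfile.ZipFile(pmlz_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(source_dir)):
            # Only PML text and PNG images belong in the archive.
            if name.lower().endswith((".pml", ".png")):
                archive.write(os.path.join(source_dir, name), arcname=name)
    return pmlz_path
```

Calling `make_pmlz("book_dir", "book.pmlz")` would then give a single file that can be handed to PML input.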
Posted on April 27th, 2009 by John. Filed under calibre.
This week’s review of what I’ve been working on is a little late. Looking at what was accomplished, it wasn’t as productive as last week, but I spent just as much time coding. With projects like this you can’t judge output by the number of features added or bugs fixed.
The GUI received context sensitive treatment for the device menu. It will only offer send to device when a device is connected, and send to card A and B will only be enabled when those cards are available. A simple change, but one that will reduce confusion.
I’ve spent a lot of time working with Lee Dolsen (ldolse from mobileread) on pdftohtml processing rules. They are nearly complete and the output is looking really good. I know I’ve been saying that for a while now but each week it just keeps getting better. However, PDF is still not an ebook format and should not be treated as such. This simply helps to get content out of the PDF format and into a more manageable one.
One big thing I spent most of my time this week on was eReader input. Yep, eReader pdb files can now be converted to any supported output format. Metadata reading of eReader files is not yet supported. That is on my todo list. The html it produces could probably use some work but that will come as people report issues once 0.6 is released.
The other big thing that has taken up my Calibre time is eReader output. Sadly, it does not work. Also, it will not be working for the foreseeable future. The issue I’ve run into is I don’t know enough about the format to produce a file that can be read by eReader’s reading software. The main problem I face is there are around 66 “sections” to the eReader format header (not the pdb header, this is record 0 of an eReader file). I know what 10 of those sections are and what values they should have as they are used for my reader. Around 40ish of the sections should have a value of 0. However, that leaves 26ish sections that I don’t know what they are, what they do or what value they should have and how it relates to the rest of the file. Suffice it to say until I know more about the format I won’t be able to complete the output plugin.
Oh, I did write an inspector script (it’s in the eReader directory in the Calibre source tree) to help understand the eReader format. If anyone is interested in analyzing the format they can use it to help them see what is in the header.
Posted on April 18th, 2009 by John. Filed under calibre.
This has been a busy week for me on the Calibre front. All of my changes were to pluginize and the first three I talk about also made it into trunk and will be appearing in the next release.
I re-worked the mobi metadata reader so that it does not read the entire file into memory. It only reads the parts of the file that hold the metadata. The advantage is that reading the metadata is now about five times faster. These results are from unscientific testing by the bug reporter. Basically he said that listing the books on his Kindle went from 5 minutes to about 1 minute.
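The general idea can be shown with a short Python sketch that reads only the fixed-size Palm database header at the start of a file instead of the whole thing. This is not the reworked reader itself. The real MOBI metadata sits further in, inside record 0, and the `read_pdb_header` function is a hypothetical name used for illustration.

```python
import io
import struct

PDB_HEADER_SIZE = 78  # fixed-size Palm database header


def read_pdb_header(stream):
    """Read just the PDB header and return (name, type_creator, num_records)."""
    header = stream.read(PDB_HEADER_SIZE)
    if len(header) < PDB_HEADER_SIZE:
        raise ValueError("not a Palm database: file too short")
    # Database name: 32 bytes, NUL padded.
    name = header[:32].split(b"\x00", 1)[0].decode("latin-1")
    # Type and creator codes, e.g. b"BOOKMOBI" for MOBI files.
    type_creator = header[60:68]
    # Number of records: big-endian unsigned 16-bit at offset 76.
    (num_records,) = struct.unpack(">H", header[76:78])
    return name, type_creator, num_records


# Synthetic file: a valid header followed by a large payload that is never read.
fake = (b"My_Book".ljust(32, b"\x00") + b"\x00" * 28 + b"BOOKMOBI"
        + b"\x00" * 8 + struct.pack(">H", 3) + b"x" * 1000000)
stream = io.BytesIO(fake)
print(read_pdb_header(stream), stream.tell())
```

With a real file the same function works on an object returned by `open(path, "rb")`, and only the first 78 bytes are ever pulled off the device.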
The metadata writer for pdf files has been re-worked and is now enabled. Kovid did some work to my initial work so that it won’t lock up the GUI when working with large pdf files.
I (with a bit of help from Kovid on this too) was able to fix bug 2112 (last few pdf files held open). Calibre relies on Python’s garbage collector and object scope for closing files. It does not explicitly close them. The bug was caused by pyPdf, which is a Python library Calibre uses to read and write pdfs. For some reason pyPdf’s file reader wasn’t allowing the files to be closed. They were no longer in use and the object went out of scope, but the garbage collector didn’t close the file immediately. It would close it eventually. A wrapper object was created and is used so that pyPdf doesn’t have a direct reference to the open file, and it now gets closed properly.
The GUI in the releases only supports displaying one storage card from a device. Not all devices support two storage cards but the Sony PRS devices do. Support for the GUI to display two storage cards has been added.
To go along with the GUI displaying two storage cards, almost all device drivers have been made to support up to two storage cards. The USBMS base class supports two cards, and as most device drivers use this base they all get support for it without much work. However, this doesn’t mean that a device that physically lacks a storage card slot, or has only one, will magically support two cards. None of the drivers except the PRS ones have any user-visible changes. For anyone looking to write a device driver using USBMS: if the device supports two cards, USBMS has you covered.
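The wrapper idea can be sketched in a few lines of Python. This is not the wrapper that went into Calibre. The `ClosingFile` class is a hypothetical name, and closing the file explicitly through a context manager is my own choice for the example. The point is only that the library is handed the wrapper, so the real file object stays under our control.

```python
class ClosingFile:
    """File-like wrapper that owns the underlying file object.

    A library given this wrapper can read and seek, but the caller decides
    when the real file is closed. Illustrative sketch only.
    """

    def __init__(self, path, mode="rb"):
        self._stream = open(path, mode)

    def read(self, *args):
        return self._stream.read(*args)

    def seek(self, *args):
        return self._stream.seek(*args)

    def tell(self):
        return self._stream.tell()

    def close(self):
        self._stream.close()

    @property
    def closed(self):
        return self._stream.closed

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the real file even if something still references the wrapper.
        self.close()
        return False
```

Used as `with ClosingFile(path) as stream:`, the file handle is released at the end of the block no matter how long the library keeps its reference to the wrapper.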
The PRS505 and PRS700 drivers both received the two card treatment. They also received a bit of work. They have been moved to use the USBMS base class. This removed a lot of redundant code and puts them on the same code path as the other (except PRS500) drivers. Overall this change is to reduce work in finding and fixing bugs and maintenance.
Internal work on the PRS505 and PRS700 drivers wasn’t all I did to them. They no longer dump all books into a single directory. Books are stored in an author/title/book hierarchy. News items are stored in a news/title hierarchy. They also support the USBMS / tag as a custom layout path.
Earlier I said almost all device drivers got two storage card support. The PRS500 driver did not. It still only supports one storage card. Due to the way the driver works I will not be touching it.
I’ve been working with ldolse from mobileread and with his help the processing rules for pdftohtml (used for pdf input) have been improved.
Posted on April 11th, 2009 by John. Filed under programming.
Other than little fixes here and there I’ve mainly focused this week on getting PDF output complete. It now supports profiles as well as custom page sizes. A little work is still needed on the processing rules for the html produced by the PDFInput. Otherwise, PDF Input/Output is complete. Though, pluginize is still in flux so what is complete now might not be complete next week as new requirements and interfaces are added. Or if I get some suggestions about what could or should be added.
Posted on April 4th, 2009 by John. Filed under programming.
This has been a busy week for Calibre. The new conversion pipeline is complete. Part of this change is there is a new framework for command line options.
I’ve spent most of this past week moving the PDF input/output and TXT input/output over to use the new framework.
The other major work I’ve completed is moving the pdfmanipulate program and its commands over to the new command line option framework. I’ve also added a few new commands to pdfmanipulate. The current commands it supports are [crop, decrypt, encrypt, info, merge, reverse, split]. The trim command is now crop. It has also been cleaned up a bit.
Posted on March 29th, 2009 by John. Filed under Uncategorized.
The other day on mobileread there was a post about combining pdf files. The person has their books in pdf and they are divided by chapters. This got me thinking about the state of the pdf tools in Calibre. There was only one, pdftrim.
I’ve added three new pdf manipulation tools. Merge to combine multiple pdfs into one. Split to split a pdf into multiple files by page. And info to show information about the pdf. Info is especially handy when you want to work with split and need to know how many pages are in the document.
To avoid issues with naming conflicts (pdfinfo is used by poppler-utils) and to keep the number of pdf* names under control, I’ve created a git/bzr-like wrapper for all of Calibre’s pdf manipulation tools. pdfmanipulate is the base command. A subcommand (see them all with --help) is added after.
$ pdfmanipulate --help
Usage: pdfmanipulate command ...

command can be one of the following:
[info, merge, split, trim]

Use pdfmanipulate command --help to get more information about a specific command

Manipulate a PDF.
...

$ pdfmanipulate merge --help
Usage: pdfmanipulate merge [options] file1.pdf file2.pdf ...

Merges individual PDFs. Metadata will be used from the first PDF specified.
...
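For anyone curious how a git/bzr style wrapper hangs together, here is a minimal Python sketch using argparse subparsers. It is not how pdfmanipulate is written. argparse is swapped in for the example, the `build_parser` function is hypothetical, and only the merge and info subcommands are stubbed out.

```python
import argparse


def build_parser():
    """Build a base command with subcommands, in the style of git or bzr."""
    parser = argparse.ArgumentParser(
        prog="pdfmanipulate", description="Manipulate a PDF.")
    commands = parser.add_subparsers(dest="command")

    merge = commands.add_parser("merge", help="Merges individual PDFs.")
    merge.add_argument("files", nargs="+", help="PDFs to merge, in order")

    info = commands.add_parser("info", help="Show information about a PDF.")
    info.add_argument("file", help="PDF to inspect")

    return parser


args = build_parser().parse_args(["merge", "file1.pdf", "file2.pdf"])
print(args.command, args.files)
```

Each subcommand gets its own --help for free, which keeps a single entry point instead of a growing pile of pdf* programs.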