Posts Tagged ‘fb2’
Posted on April 9th, 2011 by John. Filed under calibre.
It has been a few weeks since I’ve done a calibre week in review. This is partly because I had been working on some new features for the upcoming 0.8 release. I haven’t wanted to talk about it very much until the release gets closer. Kovid said yesterday that he will be reviewing my changes next week.
One complaint I hear often regards the inability to edit ebooks. Many people seem to think EPUB is not a good format for editing. Sigil is often the solution suggested around these parts, but some people insist on having the book contained in a single HTML file. Simply unzipping an EPUB doesn’t accomplish this because the content is usually split across multiple files.
To remedy this situation I’ve added a new output format: HTMLZ. Just like TXTZ, it is a zip file with a different extension to differentiate it. Inside is a metadata.opf file (calibre can read and write metadata to it). Images are preserved, renamed and placed in an images folder. This format is available in the 0.7.54 release.
Also inside is a single HTML file. Even if you’re converting from an EPUB that has been split into multiple parts, a conversion to HTMLZ will result in a single HTML file. To go along with this there are a number of ways to configure CSS handling. The default is to place the CSS in a separate style.css file. It can also place class-based CSS inside the head element of the HTML itself. Or you can have it write the CSS inline within each element. Finally, the last option is to remove the CSS and convert as much as possible (a very limited set right now) to HTML tags.
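To make the layout concrete, here is a minimal sketch of what an HTMLZ archive looks like inside. The file names other than metadata.opf are assumptions for illustration, not necessarily the exact names calibre emits.

```python
# Build an example HTMLZ-style archive in memory: one HTML file,
# an OPF metadata file, a separate stylesheet, and an images folder.
import io
import zipfile


def make_htmlz() -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The single HTML file (default CSS handling: external style.css).
        zf.writestr(
            "index.html",
            "<html><head><link rel='stylesheet' href='style.css'/></head>"
            "<body><p class='c1'>Hello</p></body></html>",
        )
        zf.writestr("style.css", ".c1 { text-indent: 1em }")
        # calibre reads and writes metadata through this file.
        zf.writestr("metadata.opf", "<package/>")
        # Images are renamed and placed under images/.
        zf.writestr("images/cover.jpg", b"\xff\xd8\xff")
    return buf.getvalue()
```

Because it is just a zip file, renaming book.htmlz to book.zip and extracting it gives you the editable single-file HTML directly.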
As with all of my output format attempts I believe this will have quite a few bugs. Let me know about any issues so I can fix them. I hope people find this useful for their hand editing needs.
Just a small change to FB2 output this time. Users can now select the genre for the output document. The default is antique but a list of supported genres is available to choose from.
GUI – Toolbars
theducks on MobileRead made a few requests regarding handling of toolbars. He was having trouble with the number of interface action plugins he had added to the toolbar and needed more space.
The first change removes the ‘split toolbar into two’ option and makes the second toolbar user configurable. This way you can add whatever you want, in the order you want, to the second toolbar.
Along with this, theducks also wanted to be able to remove the icons on the toolbars, so I added an off option to the toolbar icon size setting. This way icons can be removed completely. If they are disabled then the text will automatically be used, even if the toolbar text option is set to never show. This way you won’t lose your toolbar.
I also made it so that any toolbar that doesn’t have any items on it will be hidden. All of these toolbar changes are in the 0.7.54 release.
GUI – Menubar
Another change to the GUI, which won’t be out until the 0.7.55 release, is the addition of a configurable menubar. I personally don’t like the toolbar, so I added support for a menubar. It is configurable in the toolbar configuration area in Preferences. Just like the toolbars and right click menus, you can configure what is in the menu and what order the items appear in.
The main motivation for the menubar addition was the fact that I use a Mac. OS X always shows a menubar outside of the application window. Calibre never looked quite right on a Mac because it didn’t have a menu, so OS X’s menubar would always appear empty.
GUI – OS X
On OS X the menubar has a number of default items. On all other OSes the menubar is empty and hidden by default. Also, some toolbar items are not shown by default on OS X because they are available through the menubar. The idea is to provide a visually appealing default for OS X and a more intuitive experience for Mac users.
I’ve also made the toolbar and statusbar on OS X use the system type instead of the generic Qt toolbar and status bar. They look better and behave as one would expect on OS X. The hide toolbar button, for instance, now works and hides the toolbar.
Aside from my changes, I’ve been giving direction to Perkin from MobileRead for enhancements to the Textile input and output. The input changes are already in the latest (0.7.54) release. He’s still working on enhancements to Textile output to ensure it produces the same output that the input supports. He has also identified a few bugs with the current Textile output and is working to fix them too.
Posted on February 7th, 2011 by John. Filed under calibre.
Once again this is a big week with a lot of little changes. The majority of which were related to TXT input.
I was thinking about the fact that, for all of the formats I support, I use the format specification to know how the reading and writing should happen, but those specifications aren’t part of calibre proper. I have a set of documents that outline what is known about each format I handle. I say what is known because in some cases (eReader) the binary format is reverse engineered and a lot of it is guesswork. The documents are partly a collection of information available in (sometimes many) different places and partly my own work. I’ve now added these documents to calibre proper in the top level format_docs directory. Hopefully people will find them useful and they will help others work on these formats.
Recently there was a request to add auto complete (just like in tags) to the authors metadata field in the GUI. I added this a few versions ago and it caused an uproar. Many people loved the feature, but others hated how, after completing, it would add the completion character at the end of the completion. Even though the completion character is removed when you save the changes, a small group of vocal users didn’t like the way it looked while editing. Kovid changed completion so that the separator character isn’t inserted after completion and, since I as well as others liked the old behavior, said that I should re-implement it as a tweak. So in 0.7.45, set the tweak completer_append_separator to True to have it insert the separator character after completion.
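Restoring the old behavior is a one-line tweak (calibre tweaks are plain Python assignments edited under Preferences → Tweaks); the tweak name comes straight from the post above:

```python
# Insert the separator character (&) after an author completion,
# restoring the pre-0.7.45 behavior described in the post.
completer_append_separator = True
```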
Lee and I did some more work on Heuristics. Mainly he did the work. I’ve pretty much just been getting the options set up on the command line and in the GUI for him. There is a new option for replacing soft scene breaks with a hard scene break. The replacement text is user defined but the history drop down comes preloaded with a number of common cases.
I did a little heuristic work myself. Namely, I tweaked the italicize patterns to make them more robust and, in the process, I simplified them.
FB2 output was updated to handle creating soft scene breaks based on empty paragraphs and top margins. Because FB2 does not specify how the document is supposed to look (this is left to the reader software; elements only define type, not layout), I chose inserting blank lines between paragraphs to create scene breaks.
PML input had some tweaks regarding soft scene breaks. I reduced the number of empty lines between paragraphs needed to create a soft scene break. I haven’t seen any documents that need this change. However, the more I thought about how it was handled, the more I realized that a valid document can use fewer lines.
Now that PML Input retains soft scene breaks it’s only natural to have PML output write them. Empty paragraphs and margin based spacing are both accounted for. In addition I added support for left margins being retained in the resultant PML.
There was one small bug fix. Looking over the PML docs again I noticed that \c and \r codes need to be closed on the next line following their opening. I modified the output code to ensure this happens. There was some general work to produce cleaner output as well.
While I was working on the above I decided that since I previously changed PML input to create a multi-level TOC that I should also have PML output write a multi-level TOC. Currently this is based on the tags being pointed to by the TOC items and by them not being headings. Only \Cn TOC markers are supported at this time. \Xn markers are going to need a bit more work.
TXT input paragraph processing was restructured so paragraph transformations are always applied. Previously they were not applied when Markdown or Textile formatting was used. A user on MobileRead had modified their TXT file and simply added #’s in front of the headings to have them formatted in the output. The user did not make any other changes to have their document conform to Markdown, and the resultant output was not very nice. It seems very common for users to simply stick Markdown or Textile formatting into their documents and rely on calibre to clean them up.
Dehyphenation of TXT input was tweaked. It now checks whether the heuristics and dehyphenate options are enabled; in that case it will run over all TXT input, including Markdown and Textile formatted documents.
There were a few bug fixes related to various issues. Spaces at the beginning of lines were not properly preserved. Spaces within documents were getting converted to entities when they shouldn’t have been. A regression that broke block formatted paragraphs was fixed.
Print formatted documents now have their indents retained.
For people like me who do not like indented paragraphs I’ve added an option to remove indents from TXT input documents.
There was one small bug fix in TXT output and that was to have TXT output show all TOC items. Previously it was only showing top level items.
I’ve added support for a new pseudo format called TXTZ. It’s essentially just TXT files put into a zip archive with the extension .txtz. It can contain images, which should make working with Markdown and Textile formatted text easier. Also, it has metadata support via an OPF file called metadata.opf within the archive. This OPF file will be referenced for metadata reading and writing. Both input and output support for TXTZ has been added.
Posted on January 31st, 2011 by John. Filed under calibre.
This is really a two week in review because I didn’t do one last week. The past two weeks I didn’t focus on major changes. I mainly spent my time with little tweaks and closing out bugs. All of these little changes didn’t feel like I accomplished very much but getting them all together it turns out I did quite a bit over the last two weeks.
The GUI saw quite a few usability changes. I’ve added auto complete to the authors field. This works the same as the tags field but starts completion with the & character instead of the ,. There were a few issues that users pointed out relating to this change, but they have all been corrected. It turns out that the issues were present in the tags completion too, but no one had noticed.
I added a confirmation dialog when stopping a running job. There is the possibility of a job finishing before the user confirms but I don’t see that as an issue. The user either wanted to stop the job and it’s stopped or they don’t and it finished properly.
The Regex Builder window saw some changes. Search next and previous were added so the user can cycle through the matched items more easily. Also, when clearing the regex text entry the highlighting will automatically clear. There were also some tweaks to remove the delay caused when testing without any input text. I’ve also implemented caching so that each time the wizard is opened it won’t reconvert the input document; it saves the result and just displays it each time. The Search and Replace dialog also makes use of the caching across each search and replace field.
There was also some more work done on turning input fields into history entries. The Regex input fields (search and replace) now store previous entries. The filename import in Add books also saves previous regexes used for importing books.
The last GUI change was to the Send Specific Formats to Device dialog. It now only displays formats that are present or convertible. It also tells the user the number out of the total number of books that are in a particular format, and notes which formats are convertible and which are not. All items in the dialog are also sorted from most to least preferred.
Italicize common cases saw some tweaks to the matching patterns to make them more robust. I foresee this being a weekly occurrence for some time.
A number of RTF output bugs were fixed: an issue with incorrect spacing between letters, and missing spaces around italicized text. Also, the generated markup was greatly altered. It is simpler and produces more consistent results. It also allows for h tags to be turned into RTF style headings, so when converting from calibre generated RTF to, say, EPUB, the headings will carry over properly as headings. I still consider RTF output a work in progress and relegate it to experimental status.
The language is now set correctly.
Soft scene breaks are now retained. PML also saw a bug fix relating to the \T tag. The biggest change to PML input is support for a multi-level table of contents. Previously the TOC from a PML file was flattened. Now the levels are properly retained.
Like PML input, TXT input now retains soft scene breaks between paragraphs. I also changed heuristic processing on TXT input to not enable preserving whitespace. Instead, whitespace at the beginning of a paragraph is maintained by default. Also, I rewrote the preserve whitespace function to only apply when necessary instead of in place of every regular space.
Textile formatted output is now supported. This complements Markdown output and Textile input. Soft scene breaks are now detected and written. The scene breaks can either be empty paragraphs or defined by a CSS top margin.
Posted on January 17th, 2011 by John. Filed under calibre.
FB2 output had some more bug fixes. The cover image is now put inside of the coverpage element in the metadata header. This is per the FB2 spec. However, the calibre ebook-viewer does not currently display the cover image that is part of the metadata header. Calibre’s FB2 metadata reader will read the cover image.
PML input had a bug fixed dealing with the \t and \T tags. They are now handled properly and will indent the entire line. This had been somewhat fixed previously but the previous fix would only work when those tags would start and end the line.
At a user’s request I’ve reworked the author fields throughout the GUI. Authors are now auto completed using the & symbol, just like tags are auto completed using a ,. This makes adding multiple authors much easier. This change was actually fairly large and a lot of work. I refactored the auto complete classes for tags into a generic set of auto completion classes, then reworked each author field to use the new classes.
All of the above changes have made it into trunk and are either in the current release (0.7.40) or will be in the next release (0.7.41). The following changes are still being finished and will need Kovid’s review before being merged into a release.
Lee Dolsen and I worked on TXT input last week and our partnership continued this week. He had created a variety of heuristic processing functions a while back. The heuristic processing would be used when the –preprocess-html option was enabled. We’ve now broken the –preprocess-html function into individual options.
The majority of the heuristic code is his. I helped make the infrastructure changes to accommodate the options on the command line and in the GUI. I also added –italicize-common-cases as a heuristic function, removing it from only working in TXT input, and made the necessary changes to the conversion pipeline so the heuristics will run over all input types. Currently the –preprocess-html option does not run over EPUB input. Lee did all the work to change the heuristic code to work as individual options, as well as adding some extras and cleaning up some existing parts.
While Lee was making most of the heuristics changes, I took the time to rework the –remove-header and –remove-footer options. Those two, as well as their related regular expression options, have been removed. Instead I’ve created three sets of generic search and replace options. They are much more flexible and also not as misleading about what they do. My hope is to eventually have a heuristic function for removing headers and footers that does not require regular expressions.
Posted on January 9th, 2011 by John. Filed under calibre.
This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We’ve worked together before improving PDF input. Since then he’s done a lot of work with preprocessing of PDF and other not so clean input.
TXT input now auto detects the character encoding of the file. It isn’t 100% accurate but should work for the majority of cases. It’s using chardet for the detection. Unfortunately, cp1252 is the most common encoding that gives people issues and unless you’re using things like smart quotes and curly apostrophes it doesn’t always detect properly.
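calibre uses chardet for the actual detection. To illustrate why ASCII-range text is easy and cp1252 punctuation is the hard case, here is a much simpler fallback-style sketch (not calibre's code) that just tries encodings in priority order:

```python
# Simplified encoding-detection sketch for TXT input. Plain ASCII
# decodes successfully under nearly any encoding, so it never hits
# the fallbacks; cp1252 smart quotes and curly apostrophes are
# invalid UTF-8, which is what makes them detectable at all.
def decode_txt(raw: bytes) -> str:
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, just substitute replacement chars.
    return raw.decode("utf-8", errors="replace")
```

A real detector like chardet does statistical analysis over byte frequencies instead of a fixed priority list, which is why it can still guess wrong on short or punctuation-free files.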
I started getting TXT input to detect the document structure. Mainly, whether the paragraphs are arranged in block, single line, or print fashion. Lee saw the detection code and, by modifying some of his preprocessing code, was able to greatly increase the detection accuracy over my initial work. He’s also added an unformatted type that assumes the text is one big blob and tries to determine paragraphs in much the same way PDF input does: by unwrapping based upon punctuation and other factors.
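As a rough illustration of this kind of structure detection, the sketch below classifies text by blank-line and indentation ratios. The thresholds and the exact decision order are my own simplification, not calibre's actual logic:

```python
# Classify plain-text paragraph style:
#   block  - paragraphs separated by blank lines
#   print  - paragraphs start with an indented first line
#   single - one (long) line per paragraph
# Anything else falls through to "unformatted", where a real
# implementation would unwrap based on punctuation.
def detect_paragraph_type(txt: str) -> str:
    lines = txt.splitlines()
    if not lines:
        return "unformatted"
    blank = sum(1 for l in lines if not l.strip())
    indented = sum(1 for l in lines if l.startswith(("    ", "\t")))
    if blank >= len(lines) // 3:
        return "block"
    if indented >= len(lines) // 3:
        return "print"
    if all(len(l) > 80 or not l.strip() for l in lines):
        return "single"
    return "unformatted"
```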
In addition to detecting the paragraph style used in the document, TXT input now tries to detect the formatting of the text content. Markdown formatted text is detected. I’ve also added a heuristic processor which runs by default if either Markdown is not detected or if the user has not specified the formatting as none (which disables any type of formatting processing).
The heuristic processor uses some ideas from GutenMark. Specifically, italicizing common words and certain conventions used in Project Gutenberg texts that denote italics. I started working on a set of heuristics to detect chapter headings, but Lee quickly pointed out he had already created something similar using regular expressions in his preprocessing code. I quickly began using it in my heuristic processor and it’s working well. Chapter headings and subheadings are now formatted with the appropriate h tags. He has some plans to enhance the detection further using a word list.
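One common Project Gutenberg convention is wrapping a phrase in underscores to denote italics. A heuristic for that convention could look like the following sketch (the pattern is illustrative, not calibre's actual regex):

```python
import re

# Turn Gutenberg-style _emphasis_ markers into <i> tags.
# \S ensures the opening underscore is followed by a non-space,
# and the lazy [^_]*? stops at the nearest closing underscore.
ITALIC_PAT = re.compile(r"_(\S[^_]*?)_")


def italicize(text: str) -> str:
    return ITALIC_PAT.sub(r"<i>\1</i>", text)
```

A production version needs to be more careful about underscores inside words and markers spanning line breaks, which is exactly the robustness tuning described above.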
TCR, PDB PalmDoc and PDB zTXT inputs all pass the extracted text to the TXT input plugin for processing. This allows them to take advantage of all the work that’s gone into TXT input. Also, with auto detection now being part of TXT input it should allow for one time conversion instead of convert, check, tweak some options, convert again. Especially since these formats don’t make it easy to see how the text is structured within the file without first converting.
TXT input wasn’t the only part of TXT support that was touched. I spent some time cleaning up TXT output. Consistent spacing is now created around headings. Also, when using the –remove-paragraph-spacing option, headings are not indented with a tab. The output now looks much cleaner and I consider it acceptable for reading.
Not to be left out FB2 output got a small bug fix. With all the work rewriting it I broke having it read covers. If you were converting an EPUB for instance that specified the cover (or title page) in the guide rather than the spine it would not be included. Also, the –cover option was being ignored. Now that’s fixed and external covers are inserted properly.
Posted on January 1st, 2011 by John. Filed under calibre.
I did some work with PDF output. Mainly I refactored some of the output generation code to reduce redundant sections. Overall there won’t be any user visible changes.
The main reason I dove back into PDF output was because a user on OS X noted that the PDFs produced were not searchable. Windows users are getting searchable PDFs, and on Kovid’s Gentoo Linux machine he was able to get a searchable PDF. I looked into the issue and cannot get searchable PDFs on OS X. However, I can get searchable PDFs when using the ebook-viewer’s print feature via print to file. I’m not sure why this happens because the ebook-viewer and PDF output use the same technique for generating a PDF. I’ve decided not to pursue the matter further because PDF output on OS X is pretty much broken due to Qt bugs. See this for an example.
TCR compression was something I added a while ago. I was never fully happy with it because it was slow and produced low quality output. I spent a few days completely rewriting the compressor and now it performs beautifully. The new compressor is cleaner, roughly 10x faster, compresses to a much smaller size, and I would say is on par with the output of Andrew Giddings’ TCR implementation. However, calibre’s TCR compressor is a pure Python implementation and is still considerably slower than Andrew’s C implementation.
A minor bug in FB2 output was brought to my attention and fixed. Basically JPG images in the input document were not being written to the FB2 output file. This has been corrected.
Posted on December 20th, 2010 by John. Filed under calibre.
This week saw some more work on FB2 output. I’ve added support for a few formatting types from the 2.1 spec. Also, a very helpful user submitted a patch for sectionizing. It allows for sectionizing based on the file structure (based on EPUB splitting), no sectionizing, or sectionizing based on the TOC. There is one limitation with TOC sectionization: it only works when the TOC item points to an element within the document. It does not work with TOC items that point to actual pages. However, it’s a vast improvement and works very well with calibre’s auto TOC.
On the MobileRead forums a user (SweetPea) mentioned a use case that was causing her some problems. Basically, when her device is connected she would select the book in her library and press delete, thinking it would delete the book from the reader. A few other users chimed in and said that they expected the same behavior. This is a perfect example of what you as a programmer expect to happen and what the user expects being vastly different. To accommodate this case I’ve added a dialog that appears when you try to delete a book in the library that is also on the connected device. The dialog asks where you want to delete the book from: Library, Device or Both. Hopefully this reduces confusion, and I personally like this idea because it means I don’t have to switch between my device and library as much.
Posted on December 5th, 2010 by John. Filed under calibre.
I’ve become active in contributing to calibre again. So far I’ve been focusing on fixing issues related to the output formats I maintain. I’ve been focusing on FB2 and TXT output at the moment.
With FB2 output my goals were to fix as many bugs with it as possible and to produce 100% valid output. The first goal corresponds nicely to the second because most of the open bugs dealt with invalid markup.
FB2 output underwent some very large changes with a large amount of code being re-written. Also, I’ve removed a number of options. The idea is to simplify the code while working toward valid output and to remove options that were really just work arounds for invalid output in certain cases.
Overall I’m pleased with the FB2 output changes. It’s 100% valid (at least with the test book I ran through it) and the code is simpler. As always, if any issues are found with the output a ticket would be appreciated.
TXT output had one small bug fix and one major change. TXT output can now produce Markdown formatted text. However, I’m not fully satisfied with the markdown generation. I didn’t spend much time with it and as of now it doesn’t appear to be taking css styling into account. I only pushed the xhtml from the OEB intermediate stage into html2markdown. I need to spend some more time with it. My fear is I may have to abandon the use of html2markdown if it’s unable to cope with css.
One other change with me getting back into calibre development is my working branch. I’ve changed it to lp:~user-none/calibre/dev because of some issues relating to my previous branch and some failures with upgrading the branch format.
Posted on December 13th, 2009 by John. Filed under calibre.
FB2 output has been improved. It no longer generates very invalid markup. The output generator still isn’t where I want it to be though. The changes are mostly cleanup and fixes for long standing issues with the output. One major change is I reverted having <h1> tags work as section and title markers. I don’t like having this hard coded into the generator.
As far as I can tell FB2 does not support a true table of contents (TOC). What seems to happen is reader software will dynamically generate the TOC based on the appearance of <section><title><p>text</p></title> within the text. If I’m wrong about this and FB2 really does support an external TOC I would love to know. The <h1> differentiation causes the files to have these sections. I do not like how this is hard coded and dependent on this single tag, especially since calibre’s conversion process allows for an XPath expression to be specified to generate TOC points.
I would love to use the TOC sent to the FB2 output generator, but this does not seem to be feasible. The problem with the TOC that is sent to the FB2 output generator is how it corresponds to locations in the document. The OEB TOC points are text, which may or may not appear in the document, and an anchor id. The anchor id is a set point in the document. FB2 section titles are part of the text itself. I cannot use the text from the OEB TOC because it may or may not actually appear in the text. This also prevents me from determining what text in the document is supposed to be associated with the TOC point. The anchor id points to an anchor in the document, but often that point is something like <a id="blah" />. In this case there is no text associated with the anchor in the document. While I can assume the text following is part of the title, I have no foolproof way to determine where it stops.
At this point the only TOC associated with the FB2 output is the inline TOC that can be optionally generated.
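For reference, the structure that readers appear to key on for dynamic TOC generation looks roughly like this (an assumption based on observed files and the behavior described above, not a quote from the spec):

```xml
<body>
  <section>
    <title><p>Chapter One</p></title>
    <p>Text of the chapter…</p>
    <section>
      <title><p>A Subsection</p></title>
      <p>Nested sections become nested TOC levels.</p>
    </section>
  </section>
</body>
```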
Support for two new readers has been added: Ganaxa’s GeR2 and Nokia’s 770 internet tablet. The N770 should have been supported a long time ago and I apologize for how long this request has been unfulfilled. I put it at the bottom of my todo list and at some point it simply fell off and I forgot about it.
The GeR2 reader was a bit of a challenge to get supported. This reader and some models of the Cybook Gen 3 have the same vendor, product and revision (BCD) ids. On Windows and OS X this is not an issue because once the ids are matched further matches are done based on the plug and play (PNP) strings. However, on Linux only the ids are matched.
To solve this problem, matching on Linux needed some further checks. Kovid added support for libusb-1 which provides the vendor and product strings. He also added a call back that can be implemented in the device interface to implement platform and device specific checks. We did run into a few problems. The first was an easy to solve 32 vs 64 bit issue with the Python to C interface Kovid wrote for libusb-1. Once that was sorted out we ran into a larger problem. libusb-1 on Ubuntu by default is denied access to the vendor and product strings.
libusb-1, after appearing in only one release (0.6.27), has been dropped. Kovid has now written a custom device scanner for Linux that parses the devices in /sys/bus/usb to determine if a reader is connected. libusb-1 is supposed to be an easy to use library capable of providing this functionality, but unfortunately this turned out not to be the case. The custom scanner works and allowed me to implement differentiation between the GeR2 and the Cybook Gen 3, so both readers can be properly supported without conflict and with the correct device interface being used.
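For the curious, scanning /sys/bus/usb mostly boils down to reading small per-device attribute files. This is an illustrative sketch, not Kovid's actual scanner; the attribute file names are the standard Linux sysfs ones:

```python
# Walk the sysfs USB device tree and collect id and string
# attributes for each device. The manufacturer/product strings
# are what allow distinguishing devices (like the GeR2 vs the
# Cybook Gen 3) that share vendor/product/revision ids.
import os


def scan_usb(root: str = "/sys/bus/usb/devices") -> list:
    devices = []
    if not os.path.isdir(root):
        return devices
    for name in os.listdir(root):
        base = os.path.join(root, name)
        info = {}
        for attr in ("idVendor", "idProduct", "manufacturer", "product"):
            try:
                with open(os.path.join(base, attr)) as f:
                    info[attr] = f.read().strip()
            except OSError:
                continue  # interfaces and hubs lack some attributes
        if "idVendor" in info and "idProduct" in info:
            devices.append(info)
    return devices
```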
Posted on December 5th, 2009 by John. Filed under calibre.
Most of this week was spent tuning PML input and output. I also spent a bit of time bug tracking and enhancing FB2 output.
The changes for PML input are as follows. The included cover is passed along as the cover when converting (this also applies to eReader PDB). For PMLZ, images can be in the top level, an archivename_img directory or an images directory; it checks the locations in that order and, if no images are found, moves on to the next one. For PML, images can be in a pmlname_img or images directory. Footnotes and sidebars now display cleaner. They are separated better, and EPUB puts them on individual pages. They also include a return link which goes back to the place in the text where they are referenced. This assumes one footnote or sidebar per entry in the text, so if it’s referenced multiple times the return link will only go back to one of the references.
PML output now creates \a and \U codes only for supported characters. All characters that are not supported and that cannot be turned into a \a or \U code will be replaced with a ?.
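A sketch of what that replacement policy might look like. The exact cutoffs and digit counts here are assumptions based on the \a (decimal code) and \U (hex Unicode code) PML codes named above, not calibre's actual table:

```python
# Map a single character to PML-safe output: plain ASCII passes
# through, Latin-1-range characters become \aXXX decimal codes,
# other BMP characters become \UXXXX hex codes, and anything
# unrepresentable is replaced with '?'.
def pml_escape_char(ch: str) -> str:
    cp = ord(ch)
    if cp < 128:
        return ch
    if cp < 256:
        return "\\a%03d" % cp
    if cp <= 0xFFFF:
        return "\\U%04x" % cp
    return "?"
```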
Along with the changes for PML input reading the cover, covers are now read as part of the metadata. This applies to PML, PMLZ and eReader PDB files.
I’ve created a PML2PMLZ FileType plugin which runs whenever a PML file is imported into the GUI. It takes a PML file, looks for images in the above mentioned locations, and puts it all into a PMLZ archive. The PMLZ archive is then added to the library.
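The packaging step itself is simple. Here is a sketch of just that part, outside of calibre's plugin machinery; the image-directory names follow the conventions mentioned above, and the rest is an assumption for illustration:

```python
# Bundle a .pml file plus any images found in the conventional
# locations (alongside the file, in <name>_img/ or images/) into
# a .pmlz zip archive.
import os
import zipfile


def pml_to_pmlz(pml_path: str, pmlz_path: str) -> None:
    base = os.path.splitext(os.path.basename(pml_path))[0]
    root = os.path.dirname(os.path.abspath(pml_path))
    image_dirs = (root, os.path.join(root, base + "_img"),
                  os.path.join(root, "images"))
    with zipfile.ZipFile(pmlz_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(pml_path, os.path.basename(pml_path))
        for d in image_dirs:
            if not os.path.isdir(d):
                continue
            for fn in os.listdir(d):
                if fn.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
                    zf.write(os.path.join(d, fn), "images/" + fn)
```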
When I went to test the PML2PMLZ plugin I found that the GUI on my system was horribly broken. After a bit of work with Kovid, I found that calibre-parallel had to be in the path if calibre was installed in a non standard location. I install into my home directory using the develop command. Kovid has committed a fix that writes the install path to the launcher for these instances.
FB2 output now turns h1 tags into <section><title> tags to allow for TOC generation. As far as I can tell FB2 has no set TOC; instead, readers dynamically generate the TOC by looking at all of the bodies and sections and taking the text from the title tag. Right now FB2 output is limited to only converting h1 tags and cannot use a user defined TOC based on an XPath expression. I plan to fix this limitation in the future.