Posts Tagged ‘pml’
* Calibre Week in Review
Posted on February 18th, 2011 by John. Filed under calibre.
This is a short week for the week in review because I’m now doing my week from Friday to Thursday. Last week I ended my week on Monday so this review only has a few days’ worth of work.
TXTZ
I’ve added an import plugin that runs over TXT content when it is added to the library. What happens is the TXT file is scanned looking for Markdown (inline or reference) and Textile image references. It collects all of the images and adds them plus the TXT file to a TXTZ archive when the following conditions are true:
- Path must not be empty.
- Path must be a relative path.
- The mimetype of the image (based on extension) must be an OEB supported image type. (JPG, PNG, SVG, GIF).
- The image must exist relative to the TXT file’s location and the location specified by the path.
If no images are found referenced in the TXT file, or if the images found fail the above tests, then a TXTZ archive is not created and the TXT file itself is added to the library.
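The checks above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the regular expressions, the extension set, and the function name qualifying_images are all assumptions made for the example.

```python
import os
import re

# Assumed patterns (not calibre's own): Markdown inline ![alt](path),
# Markdown reference definitions "[id]: path", and Textile !path!.
MD_INLINE = re.compile(r'!\[[^\]]*\]\(([^)\s]+)')
MD_REFERENCE = re.compile(r'^\s*\[[^\]]+\]:\s*(\S+)', re.MULTILINE)
TEXTILE = re.compile(r'!([^!\s]+)!')

# Extensions standing in for the OEB image types named above (JPG, PNG, SVG, GIF).
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.svg', '.gif'}


def qualifying_images(txt_path):
    """Return the referenced image paths that pass all four conditions."""
    base = os.path.dirname(os.path.abspath(txt_path))
    with open(txt_path, encoding='utf-8') as f:
        text = f.read()
    found = []
    for pattern in (MD_INLINE, MD_REFERENCE, TEXTILE):
        for path in pattern.findall(text):
            if not path:                      # path must not be empty
                continue
            if os.path.isabs(path):           # path must be relative
                continue
            ext = os.path.splitext(path)[1].lower()
            if ext not in IMAGE_EXTENSIONS:   # type check based on extension
                continue
            if not os.path.isfile(os.path.join(base, path)):
                continue                      # image must exist next to the TXT
            if path not in found:
                found.append(path)
    return found
```

If the returned list is empty, the TXT file would be added on its own. Otherwise the TXT file and the listed images would go into the TXTZ archive.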
PML Input:
Fix a bug where TOC entries specified by \x and \X were not being included in the TOC.
Heuristics:
The italicize common cases patterns got tweaked again. One pattern (/text/) would match <br /> </… and cause issues.
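To illustrate the kind of guard involved, here is a small sketch of a /text/ pattern that refuses to treat the slashes in tags as italic markers. The regular expression and the function name italicize_slashes are made up for this example and are not the patterns calibre uses.

```python
import re

# Hypothetical pattern: the opening slash may not follow a word character
# or "<", the body may not contain "/", "<" or ">", and the closing slash
# may not be followed by a word character or ">".
SLASH_ITALIC = re.compile(r'(?<![\w<])/([^/<>\s][^/<>]*?)/(?![\w>])')


def italicize_slashes(text):
    """Wrap /text/ in <i> tags while leaving markup such as <br /> alone."""
    return SLASH_ITALIC.sub(r'<i>\1</i>', text)
```

With this sketch, "a /very/ good day" gains italics around "very", while "<br /> </p>" and "and/or" are left untouched.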
* Calibre Week in Review
Posted on February 13th, 2011 by John. Filed under calibre.
I’ve been putting up my week in reviews based on a week starting on Monday for some time now. I’ve been thinking about this and it doesn’t really make much sense. Calibre has a release pretty much every Friday now. So starting next week I’m going to change my week in review to be Friday through Thursday. This way features I talk about in my review will be in the just released version.
TXT Input
First the small changes. Heuristic processing now enables smarten punctuation to further my goal of TXT documents coming out looking great. A change was made to have hard scene breaks separated from the text to ensure they don’t accidentally get merged into the paragraph before or after. The formatting type none was renamed to plain to correspond with the formatting output option.
The only big change for TXT input was the addition of a new paragraph type option. It’s called off. When specified there will be no modifications to the paragraph structure applied to the text. This is especially useful for Markdown and Textile formatted documents. It ensures there are no changes that will cause elements to render incorrectly.
TXTZ Input
A bug caused images to not be included when converting. With Kovid’s help this has been corrected.
TXT Output
I modified Textile output to not write %’s for span tags. The span tag is superfluous in calibre’s Textile output because it does not contain any real information. The span tags are invisible when rendering the XHTML. The %’s cluttered up the resultant TXT so they were removed.
PML Input
PML input saw a lot of work relating to \t and \T tags. The entire handling of these tags was rewritten. Unfortunately, there is no way to have these two tags map one to one to XHTML so only some common cases are handled.
- \T’s that do not start the line are ignored.
- \t’s that start and end the line use a margin for the text block.
- \t’s that start a line and end another line use a margin for the text block.
- \t’s that start a line but end before a line ending will use a text-indent.
- \t’s that are in the middle of lines are ignored. Open and closed \t blocks within a line are ignored.
Heuristics
Once again the italicize common cases regex was tweaked. This time it was to fix an issue with None being inserted in the text before adjacent underscores. I’m hoping this is the last time for a while that I need to tweak them.
Kindle Interface
The work I did on the APNX format was undertaken for a very real world reason: integrating APNX generation into calibre’s Kindle device interface plugin.
The 0.7.45 release saw the initial inclusion of this feature. After I received some user feedback I’ve tweaked it for the 0.7.46 release. The 0.7.45 release included a very basic APNX file that would create pages every 1024 bytes of uncompressed HTML.
In 0.7.46 there are a lot of differences. Writing the APNX can be disabled. This is very useful for Kindle 2 users as the Kindle interface works for both the Kindle 2 and the Kindle 3.
There are now two parsers for generating pages. The default is the fast parser. It uses the uncompressed length of the MOBI HTML and creates pages every 2300 bytes. A few users complained that 1024 created too many pages, about double what you would find in an average paperback book. The 2300 number is a bit more than double 1024 and I chose 2300 after counting the number of characters in a page of an average paperback book. I counted approximately 2240 and added an additional 60 characters to account for markup per page. Thus 2300.
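The fast parser's arithmetic is simple enough to sketch in a couple of lines. The function name and the text_length parameter are hypothetical; the real plugin takes the uncompressed length from the MOBI file rather than from an argument.

```python
def fast_page_offsets(text_length, page_size=2300):
    """Return the byte offset at which each page starts.

    text_length is the uncompressed length of the MOBI HTML in bytes.
    page_size defaults to 2300: roughly 2240 characters of text plus
    about 60 bytes of markup per page.
    """
    return list(range(0, text_length, page_size))
```

For a 5000 byte text this gives page starts at 0, 2300 and 4600, so three pages. With the older 1024 byte page size the same text would be split into five.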
The other parser that can be enabled in the Kindle interface’s settings is the accurate parser. It works by decompressing the MOBI HTML and looking at the actual content. The big difference, and why I’m calling it an accurate parser, is it looks at the amount of visible text to decide when a page ends and a new one begins. The assumption is there are 30 lines per page and each line can have up to 70 characters. The parser starts a new line every time it encounters a new paragraph and every 70 characters in a paragraph.
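The line counting rule can be sketched like this. It is only an approximation of the idea: it takes a ready-made list of paragraph strings (visible text only) and returns a page count, whereas the real parser works on the decompressed MOBI HTML and records page positions. The function name, and the choice to count an empty paragraph as one line, are assumptions.

```python
import math


def accurate_page_count(paragraphs, chars_per_line=70, lines_per_page=30):
    """Estimate pages from visible text: 30 lines per page, 70 chars per line."""
    lines = 0
    for text in paragraphs:
        # Every paragraph starts a new line and wraps every 70 characters.
        # Assumption: an empty paragraph still occupies one line.
        lines += max(1, math.ceil(len(text) / chars_per_line))
    return max(1, math.ceil(lines / lines_per_page))
```

Thirty full 70 character paragraphs fit on one page, and a thirty-first pushes the count to two.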
The major disadvantage of the accurate parser and why it’s not the default is it’s slow. It requires the text to be decompressed and parsed. With a PalmDoc compressed file this can take a few seconds but with a HUFF/CDIC compressed file it can take minutes.
The other minor disadvantage of the accurate parser is it cannot work on DRM content. The fast parser can because the uncompressed text length is stored unencrypted in the MOBI header. If the accurate parser is chosen it will fall back to the fast parser for DRM content. So whenever a Mobipocket book is sent to the Kindle (AZW, MOBI, PRC) an APNX file can and will (unless disabled) be generated.
One thing I will note about the accurate parser is it currently ignores all markup and only looks at text. Meaning it can be made even more accurate by accounting for <div class="mbp_pagebreak" />, <br>, <hr>, images, margins, and font size changes. I do plan to add support for most if not all of these in the future, but since most books people read on their Kindle are pretty much all text, and because the accurate parser does a good enough job giving page numbers that correspond to the page length in a paperback book, I don’t see a pressing need to spend the time on it at this moment.
* Calibre Week in Review
Posted on February 7th, 2011 by John. Filed under calibre.
Once again this is a big week with a lot of little changes, the majority of which were related to TXT input.
Format Specifications
I was thinking about the fact that for all of the formats I support I use the format specification to know how the reading and writing should happen and how they aren’t part of calibre proper. I have a set of documents that outline what is known about each format I handle. I say what is known because in some cases (eReader) the binary format is reverse engineered and a lot of it is guess work. The documents I use are partly a collection of information available in (sometimes many) different places and some of it is my own work. I’ve now added these documents to calibre proper in the top level format_docs directory. Hopefully people will find this useful and help others work on these formats.
GUI
Recently there was a request to add auto complete (just like it is in tags) to the authors metadata field in the GUI. I added this a few versions ago and it caused an uproar. Many people loved the feature and many people hated how, after completing, it would add the completion character at the end of the completion. Even though the completion character is removed when you save the changes, a small group of vocal users didn’t like the way it looked while editing. Kovid changed completion so the separator character isn’t inserted after completion and, since I as well as others liked this behavior, said that I should re-implement it as a tweak. So in 0.7.45 set the tweak completer_append_separator to True to have it insert the separator character after completion.
Heuristic Processing
Lee and I did some more work on Heuristics. Mainly he did the work. I’ve pretty much just been getting the options set up on the command line and in the GUI for him. There is a new option for replacing soft scene breaks with a hard scene break. The replacement text is user defined but the history drop down comes preloaded with a number of common cases.
I did a little heuristic work myself. Namely, I tweaked the italicize patterns to make them more robust and in the process I simplified them.
FB2
FB2 output was updated to handle creating soft scene breaks based on empty paragraphs and top margins. Because FB2 does not specify how the document is supposed to look (this is left to the reader software, elements only define type not layout) I chose inserting blank lines between paragraphs to create scene breaks.
PML
PML Input had some tweaks regarding soft scene breaks. I reduced the number of empty lines between paragraphs needed to create a soft scene break. I haven’t seen any documents that need this change. However, the more I thought about how it was handled, I realized that a valid document can use fewer lines.
Now that PML Input retains soft scene breaks it’s only natural to have PML output write them. Empty paragraphs and margin based spacing are both accounted for. In addition I added support for left margins being retained in the resultant PML.
There was one small bug fix. Looking over the PML docs again I noticed that \c and \r codes need to be closed on the next line following their opening. I modified the output code to ensure this happens. There was some general work to produce cleaner output as well.
While I was working on the above I decided that since I previously changed PML input to create a multi-level TOC that I should also have PML output write a multi-level TOC. Currently this is based on the tags being pointed to by the TOC items and by them not being headings. Only \Cn TOC markers are supported at this time. \Xn markers are going to need a bit more work.
TXT
TXT input paragraph processing was restructured so paragraph transformations are always applied. Previously they were not being applied when Markdown or Textile formatting was used. A user on MobileRead had modified their TXT file and simply added #’s in front of the headings to have them formatted in the output. The user did not make any other changes to have their document conform to Markdown and the resultant output was not very nice. It seems very common for users to simply stick Markdown or Textile formatting into their documents and rely on calibre to clean them.
Dehyphenation of TXT input was tweaked. It now looks for the heuristics and dehyphenate options to be enabled. In this case it will be run over all TXT input including Markdown and Textile formatted documents.
There were a few bug fixes related to various issues. Spaces at the beginning of lines were not properly preserved. Spaces within documents were getting converted to entities when they shouldn’t. A regression that broke block formatted paragraphs was fixed.
Print formatted documents now have the indents retained.
For people like me who do not like indented paragraphs I’ve added an option to remove indents from TXT input documents.
There was one small bug fix in TXT output and that was to have TXT output show all TOC items. Previously it was only showing top level items.
TXTZ
I’ve added support for a new pseudo format called TXTZ. It’s essentially just TXT files put into a zip archive with the extension .txtz. It can contain images which should make working with Markdown and Textile formatted text easier. Also, it has metadata support via an OPF file called metadata.opf within the archive. This OPF file will be referenced for metadata reading and writing. Both input and output of TXTZ support has been added.
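Since a TXTZ is just a zip archive, one can be put together with nothing more than Python's zipfile module. The sketch below is an illustration of the layout described above, not calibre's writer. The function name write_txtz and its parameters are made up for the example, and the only file name taken from the description is metadata.opf.

```python
import zipfile


def write_txtz(txtz_path, txt_name, txt_bytes, images=None, opf_bytes=None):
    """Write a TXT file, optional images and optional metadata.opf to a zip.

    images maps archive-relative names to raw bytes so that relative image
    references inside the TXT keep working after extraction.
    """
    with zipfile.ZipFile(txtz_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(txt_name, txt_bytes)
        for name, data in (images or {}).items():
            zf.writestr(name, data)
        if opf_bytes is not None:
            zf.writestr('metadata.opf', opf_bytes)
```

A call such as write_txtz('book.txtz', 'index.txt', text, {'cover.png': png_data}, opf) would produce an archive with three entries. The name index.txt here is only an example.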
* Calibre Week in Review
Posted on January 31st, 2011 by John. Filed under calibre.
This is really a two week in review because I didn’t do one last week. The past two weeks I didn’t focus on major changes. I mainly spent my time with little tweaks and closing out bugs. All of these little changes didn’t feel like I accomplished very much but getting them all together it turns out I did quite a bit over the last two weeks.
GUI
The GUI saw quite a few usability changes. I’ve added auto complete to the authors field. This works the same as the tags field but starts completion with the & character instead of ,. There were a few issues that users pointed out relating to this change but they have all been corrected. It turns out that the issues were present with the tags completion but no one had noticed.
I added a confirmation dialog when stopping a running job. There is the possibility of a job finishing before the user confirms but I don’t see that as an issue. The user either wanted to stop the job and it’s stopped or they don’t and it finished properly.
The Regex Builder window saw some changes. Search next and previous were added so the user can cycle through the matched items more easily. Also, when clearing the regex text entry the highlighting will automatically clear. There were also some tweaks to remove the delay caused when testing without any input text. I’ve also implemented caching so each time the wizard is opened it won’t reconvert the input document. It saves the result and just displays it each time. The Search and Replace dialog also makes use of the caching across each search and replace field.
There was also some more work done with making entries into history entries. The Regex input fields (search and replace) now store previous entries. The filename import in Add books also saves previous regexes used for importing books.
The last GUI change was with the Send Specific Formats to Device dialog. It now only displays formats that are present and/or convertible. It also tells the user the number of books, out of the total, that are in a particular format. It also notes which formats are convertible and which are not. All items in the dialog are also sorted from most to least preferred.
Heuristics
Italicize common cases saw some tweaks to the matching patterns to make them more robust. I foresee this being a weekly occurrence for some time.
RTF Output
A number of RTF output bugs were fixed, including incorrect spacing between letters and missing spaces around italicized text. Also, the generated markup was greatly altered. It is simpler and produces more consistent results. It also allows for h tags to be turned into RTF style headings. So when converting from calibre generated RTF to, say, EPUB the headings will carry over properly as headings. I still consider RTF output a work in progress and relegate it to experimental status.
FB2 Output
The language is now set correctly.
PML Input
Soft scene breaks are now retained. PML also saw a bug fix relating to the \T tag. The biggest change to PML input is support for multi-level table of contents. Previously the toc from a PML file was flattened. Now the levels are properly retained.
TXT Input
Like PML input, TXT input now retains soft scene breaks between paragraphs. I also changed heuristic processing on TXT input to not enable preserving whitespace. Instead whitespace at the beginning of a paragraph is maintained by default. Also I rewrote the preserve whitespace function so it is only used when necessary instead of in place of every regular space.
TXT Output
Textile formatted output is now supported. This complements Markdown output and Textile input. Soft scene breaks are now detected and written. The scene breaks can either be empty paragraphs or defined by CSS top margin.
* Calibre Week in Review
Posted on January 17th, 2011 by John. Filed under calibre.
TXT input got some more work. It now supports the Textile markup language. This can be used in place of Markdown. Textile is also supported by the new auto-detection in TXT input.
FB2 output had some more bug fixes. The cover image is now put inside of the coverpage element in the metadata header. This is per the FB2 spec. However, the calibre ebook-viewer does not currently display the cover image that is part of the metadata header. Calibre’s FB2 metadata reader will read the cover image.
PML input had a bug fixed dealing with the \t and \T tags. They are now handled properly and will indent the entire line. This had been somewhat fixed previously but the previous fix would only work when those tags would start and end the line.
At a user’s request I’ve reworked the authors fields throughout the GUI. Authors are now auto completed using the & symbol just like tags are auto completed using a ,. This makes adding multiple authors much easier. This change was actually fairly large and a lot of work. I refactored the auto complete classes for tags into a generic set of auto completion classes. Then I reworked each author field to use the new classes.
All of the above changes have made it into trunk and are either in the current release (0.7.40) or will be in the next release (0.7.41). The following changes are still being finished and will need Kovid’s review before being merged into a release.
Lee Dolsen and I had worked on TXT last week and our partnership continued this week. He had created a variety of heuristic processing functions a while back. The heuristics processing would be used when the --preprocess-html option was enabled. We’ve broken the --preprocess-html function into individual options:
- --enable-heuristics
- --markup-chapter-headings
- --italicize-common-cases
- --fix-indents
- --html-unwrap-factor=HTML_UNWRAP_FACTOR
- --unwrap-lines
- --delete-blank-paragraphs
- --format-scene-breaks
- --dehyphenate
- --renumber-headings
The majority of the heuristic code is his. I helped to make the infrastructure changes to accommodate the options on the command line and in the GUI. I also added --italicize-common-cases as a heuristic function and removed it from only working in TXT Input. I also made the necessary changes to the conversion pipeline so the heuristics will run over all input types. Currently the --preprocess-html option does not run over EPUB input. Lee did all the work to change the heuristic code to work as individual options as well as adding some extras and cleaning up some existing parts.
While Lee was making most of the heuristics changes I took the time to rework the --remove-header and --remove-footer options. Those two as well as their related regular expression options have been removed. Instead I’ve created three sets of generic search and replace options. They are much more flexible and also not as misleading about what they do. My hope is to eventually have a heuristic function for removing headers and footers that does not require regular expressions.
* Calibre Week In Review
Posted on January 2nd, 2010 by John. Filed under calibre.
There was a major change this week to the device infrastructure. Kovid merged (with some modification) my changes to allow “Send to device” to use custom device paths just like “Save to disk”. Kovid’s major changes to my implementation are having a separate save template for “Send to device” and allowing for per device overrides of the template. Kovid and I spent yesterday testing and it is working well. Expect it in the next release. Oh, News and the / tags still work as expected.
I did a little bit of work on TXT and PML output. Now they both honor the “Remove spacing between paragraphs” option. Previously TXT output had TXT specific options for this behavior. I’ve removed them and just use that look and feel option. PML output previously ignored it but now it honors it. So you can have both look more like a printed book than a web page.
* Calibre Week in Review
Posted on December 28th, 2009 by John. Filed under calibre.
I spent a bit of time working on calibre this week. I worked on profiles, devices and a bug fix here and there.
A few new profiles were added. Specifically profiles for the Sony PRS 300 and 900. They have a different screen size and resolution than the 6″ models so they warranted their own profiles.
I added support for two new devices: the Airis Dbook and the Binatone Readme. Along with supporting new devices I also reorganized and renamed a few. The BeBook device interface is now called Hanlin. The BeBook is a rebranded Hanlin. The EZReader, LBook and Eco Reader are also rebranded Hanlins. They all were using the BeBook interface. So I renamed it to be a little more generic and avoid confusion.
One big bug fix this week. There was a typo in the tag map for PML output. It caused italics to be ignored. This has been corrected.
* Calibre Week In Review
Posted on December 5th, 2009 by John. Filed under calibre.
Most of this week was spent tuning PML input and output. I did a bit of bug tracking and enhancing of FB2 output as well.
The changes for PML input are as follows. Pass along the included cover as the cover when converting (also applies to eReader PDB). Allow for images to be in top level, archivename_img or images directory for PMLZ. Based on that order it will check for images and if they are not found move onto the next location. For PML, images can be in pmlname_img or images directory. Footnotes and sidebars now display cleaner. They are separated better and EPUB puts them on individual pages. They also include a return link which goes back to the place in the text they are referenced. This assumes one footnote and sidebar per entry in the text, so if it’s referenced multiple times the return link will go back to the return reference.
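The PMLZ image lookup order described above (top level, then archivename_img, then images) can be sketched as a simple ordered search. The function name and parameters are hypothetical, and the sketch assumes the archive has already been extracted to a directory.

```python
import os


def find_pmlz_image(root, archive_name, image_name):
    """Return the first existing path for image_name, or None.

    root is the directory the PMLZ was extracted to and archive_name is
    the archive's name without its extension.
    """
    candidates = [
        root,                                        # top level
        os.path.join(root, archive_name + '_img'),   # archivename_img
        os.path.join(root, 'images'),                # images directory
    ]
    for directory in candidates:
        path = os.path.join(directory, image_name)
        if os.path.isfile(path):
            return path
    return None
```

For plain PML the same loop would apply with only the pmlname_img and images directories in the candidate list.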
PML output now creates \a and \U codes only for supported characters. All characters that are not supported and that cannot be turned into a \a or \U code will be replaced with a ?.
Along with the changes for PML input reading the cover, covers are now read as part of the metadata. This applies to PML, PMLZ and eReader PDB files.
I’ve created a PML2PMLZ FileType plugin which will run whenever PML is imported into the GUI. It takes a PML file, looks for images in the above mentioned locations, and puts it all into a PMLZ archive. The PMLZ archive is then added to the library.
When I went to test the PML2PMLZ plugin I found that the GUI on my system was horribly broken. After a bit of work with Kovid, I found that calibre-parallel had to be in the path if calibre was installed in a non standard location. I install into my home directory using the develop command. Kovid has committed a fix that writes the install path to the launcher for these instances.
FB2 output now turns h1 tags into <section><title> tags to allow for TOC generation. As far as I can tell FB2 has no set TOC and instead readers dynamically generate the TOC by looking at all of the body and section elements and setting the text using the title tag. Right now the FB2 output is limited to only turning h1 tags and cannot use the user defined TOC based on an XPATH expression. I plan to fix this limitation in the future.
* Calibre Week In Review
Posted on November 30th, 2009 by John. Filed under calibre.
I spent the past week fixing as many bugs with the new PML input parser and cleaning up as much of the output as I could. I really need to thank WayneD for helping find bugs with the new code. Also, Kevin Hendricks who has been working on his own parser based on code from a tool that does some work on eReader files. He helped me formulate what output should be derived in certain cases. The new PML input parser has been released as part of calibre 0.6.25.
* Calibre Week in Review
Posted on November 22nd, 2009 by John. Filed under calibre.
PML input had some major changes this week. Thank the user WayneD for helping me out and getting me to actually do the work I’ve been putting off since I introduced PML/eReader as an input format.
There is now a metadata reader for PML and PMLZ. WayneD provided me with a set of regular expressions that can extract the metadata from a metadata comment within a PML document. I took those regexes and created a metadata plugin that supports both straight PML files as well as the PMLZ archive file.
The other major change to PML is that I’ve re-written the input parser. It is no longer based on a set of regular expressions. It is now a line oriented simple state machine. When I created the regex parser I intended to replace it at some point in the future with a true parser. The regex based one was simply a quick and dirty way to get PML supported. The new parser is much faster and produces cleaner and more accurate HTML output. It also has the added benefit of reading \CX codes and turning them into table of contents entries for PML and PMLZ input. The new parser is much better but I’m not completely finished with it. I still need to add support for \v comments (they are currently removed), \n codes, and implement font attribute tracking to condense changes (this is how \n will be handled).
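To give a feel for what a line oriented state machine looks like, here is a toy sketch. It is not the calibre parser and it handles only two codes: \i, assumed here to toggle italics, and \Cn="title", assumed here to be a chapter marker carrying a level and a title. The function name and the output shape are invented for the example.

```python
import re
from html import escape

# Assumed form of a chapter marker: \C followed by a level digit and ="title".
CHAPTER = re.compile(r'\\C(\d)="([^"]*)"')


def pml_to_html(pml):
    """Convert a tiny PML subset to HTML and collect (level, title) TOC entries."""
    toc = []
    html_lines = []
    italic = False  # state carried from one line to the next
    for line in pml.splitlines():
        for match in CHAPTER.finditer(line):
            toc.append((int(match.group(1)), match.group(2)))
        line = CHAPTER.sub('', line)
        # Reopen italics that were still active at the end of the last line.
        parts = ['<i>'] if italic else []
        i = 0
        while i < len(line):
            if line.startswith('\\i', i):
                italic = not italic
                parts.append('<i>' if italic else '</i>')
                i += 2
            else:
                parts.append(escape(line[i]))
                i += 1
        if italic:
            parts.append('</i>')  # keep each paragraph well formed
        html_lines.append('<p>' + ''.join(parts) + '</p>')
    return '\n'.join(html_lines), toc
```

The point of the state variable is that formatting opened on one line stays in effect on the next, which is awkward to express with a pile of independent regular expressions.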
WayneD did provide me with his Perl based line oriented simple state machine for PML to HTML conversion. I did use one idea from it. Turning footnote and sidebar xml syntax into custom PML tags. I had intended to port his parser to python and use it as a base but when I started looking at it I remembered I don’t know Perl at all and I can’t make heads or tails of Perl code. I have no desire or need to actually learn Perl, so I ended up writing my own parser.