Posts Tagged ‘txtz’
Posted on February 18th, 2011 by John. Filed under calibre.
This is a short week for the week in review because I'm now doing my week from Friday to Thursday. Last week I ended my week on Monday, so this review only covers a few days' worth of work.
I've added an import plugin that runs over TXT content when it is added to the library. When a TXT file is added, it is scanned for Markdown (inline or reference) and Textile image references. All of the referenced images are collected and added, along with the TXT file, to a TXTZ archive when the following conditions are true:
- Path must not be empty.
- Path must be a relative path.
- The mimetype of the image (based on extension) must be an OEB supported image type (JPG, PNG, SVG, GIF).
- The image must exist relative to the TXT file’s location and the location specified by the path.
If no images are referenced in the TXT file, or if the images found fail the above tests, then a TXTZ archive is not created and the TXT file itself is added to the library.
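To make the conditions concrete, here is a minimal sketch of that kind of check. The function name, the OEB_IMAGE_TYPES set, and the use of mimetypes.guess_type are my own illustration, not the actual plugin code:

```python
import os
from mimetypes import guess_type

# Illustrative list of OEB supported image types; anything else is skipped.
OEB_IMAGE_TYPES = {'image/jpeg', 'image/png', 'image/svg+xml', 'image/gif'}

def collect_images(txt_path, image_refs):
    '''Return the referenced images that pass all of the conditions above.

    txt_path is the TXT file being imported; image_refs are the paths pulled
    out of Markdown/Textile image references.
    '''
    base = os.path.dirname(os.path.abspath(txt_path))
    found = []
    for ref in image_refs:
        if not ref:                      # path must not be empty
            continue
        if os.path.isabs(ref):           # path must be relative
            continue
        mt = guess_type(ref)[0]          # mimetype based on extension
        if mt not in OEB_IMAGE_TYPES:
            continue
        full = os.path.join(base, ref)   # must exist relative to the TXT file
        if os.path.exists(full):
            found.append(full)
    return found                         # empty list -> no TXTZ is created
```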
Fixed a bug where TOC entries specified by \x and \X were not being included in the TOC.
The italicize common cases patterns got tweaked again. One pattern (/text/) would match markup such as <br /> </… and cause issues.
Posted on February 13th, 2011 by John. Filed under calibre.
I've been putting up my week in review posts based on a week starting on Monday for some time now. I've been thinking about this and it doesn't really make much sense. Calibre has a release pretty much every Friday now. So starting next week I'm going to change my week in review to run Friday through Thursday. This way features I talk about in my review will be in the just-released version.
First the small changes. Heuristic processing now enables smarten punctuation, to further my goal of TXT documents coming out looking great. A change was made to keep hard scene breaks separated from the text so they don't accidentally get merged into the paragraph before or after. The formatting type none was renamed to plain to correspond with the formatting output option.
The only big change for TXT input was a new paragraph type option called off. When specified, no modifications to the paragraph structure are applied to the text. This is especially useful for Markdown and Textile formatted documents because it ensures there are no changes that will cause elements to render incorrectly.
A bug caused images to not be included when converting. With Kovid’s help this has been corrected.
I modified Textile output to not write %'s for span tags. The span tag is superfluous in calibre's Textile output because it does not contain any real information; span tags are invisible when rendering the XHTML. The %'s cluttered up the resultant TXT, so they were removed.
PML input saw a lot of work relating to \t and \T tags. The entire handling of these tags was rewritten. Unfortunately, there is no way to map these two tags one to one to XHTML, so only some common cases are handled (see the sketch after this list):
- \T’s that do not start the line are ignored.
- \t’s that start and end the line use a margin for the text block.
- \t’s that start a line and end another line use a margin for the text block.
- \t’s that start a line but end before a line ending will use a text-indent.
- \t's that are in the middle of lines are ignored. Open and closed \t blocks within a line are ignored.
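Roughly, these common cases boil down to choosing between a block margin and a first-line indent. Here is a hypothetical helper illustrating that decision; the function name and the 5% values are illustrative and not taken from the actual conversion code:

```python
def style_for_t_block(starts_line, ends_at_line_end):
    '''Pick CSS for a \\t block based on where it opens and closes.

    starts_line: the opening \\t is at the start of a line.
    ends_at_line_end: the closing \\t falls at the end of a line
    (the same line or a later one).
    '''
    if not starts_line:
        return None                    # \t in the middle of a line: ignored
    if ends_at_line_end:
        return 'margin-left: 5%;'      # whole line(s) shifted: use a margin
    return 'text-indent: 5%;'          # closes mid-line: first-line indent
```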
Once again the italicize common cases regex was tweaked. This time it was to fix an issue with None being inserted in the text before adjacent underscores. I'm hoping this is the last time for a while that I need to tweak them.
The work I did on the APNX format was undertaken for a very real-world reason: integrating APNX generation into calibre's Kindle device interface plugin.
The 0.7.45 release saw the initial inclusion of this feature. After I received some user feedback, I've tweaked it for the 0.7.46 release. The 0.7.45 release included a very basic APNX file that would create pages every 1024 bytes of uncompressed HTML.
In 0.7.46 there are a lot of differences. Writing the APNX can be disabled. This is very useful for Kindle 2 users, as the Kindle interface works for both Kindle 2s and 3s.
There are now two parsers for generating pages. The default is the fast parser. It uses the uncompressed length of the MOBI HTML and creates pages every 2300 bytes. A few users complained that 1024 created too many pages, about double what you would find in an average paperback book. The 2300 number is a bit more than double 1024; I chose it after counting the number of characters in a page of an average paperback book. I counted approximately 2240 and added an additional 60 characters to account for markup per page, thus 2300.
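As a rough sketch of the fast approach (not calibre's actual code), page boundaries are simply byte offsets taken every 2300 bytes of the uncompressed length:

```python
def fast_page_offsets(uncompressed_length, page_size=2300):
    '''Byte offsets of page starts, one every page_size bytes.'''
    return list(range(0, uncompressed_length, page_size))

# e.g. a 460,000 byte book yields 200 "pages"
# >>> len(fast_page_offsets(460000))
# 200
```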
The other parser, which can be enabled in the Kindle interface's settings, is the accurate parser. It works by decompressing the MOBI HTML and looking at the actual content. The big difference, and why I'm calling it the accurate parser, is that it looks at the amount of visible text to decide when a page ends and a new one begins. The assumption is there are 30 lines per page and each line can have up to 70 characters. The parser starts a new line every time it encounters a new paragraph and every 70 characters within a paragraph.
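In the same spirit, the accurate approach can be sketched as counting lines of visible text, using the 30 line and 70 character assumptions above; the function and its input are illustrative only:

```python
def accurate_page_count(paragraphs, chars_per_line=70, lines_per_page=30):
    '''Estimate pages from visible text only.

    Each paragraph starts a new line, and every chars_per_line characters
    within a paragraph adds another line; a page is lines_per_page lines.
    '''
    lines = 0
    for text in paragraphs:
        # one line for the paragraph itself, plus one per full 70 characters
        lines += 1 + len(text) // chars_per_line
    return max(1, -(-lines // lines_per_page))  # ceiling division
```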
The major disadvantage of the accurate parser, and why it's not the default, is that it's slow. It requires the text to be decompressed and parsed. With a PalmDoc compressed file this can take a few seconds, but with a HUFF/CDIC compressed file it can take minutes.
The other minor disadvantage of the accurate parser is that it cannot work on DRM content. The fast parser can, because the uncompressed text length is stored unencrypted in the MOBI header. If the accurate parser is chosen, it will fall back to the fast parser for DRM content. So whenever a Mobipocket book is sent to the Kindle (AZW, MOBI, PRC), an APNX file can and will (unless disabled) be generated.
One thing I will note about the accurate parser is that it currently ignores all markup and only looks at text. That means it could be made even more accurate by accounting for <div class="mbp_pagebreak" />, <br>, <hr>, images, margins, and font size changes. I do plan to add support for most if not all of these in the future, but since most books people read on their Kindle are pretty much all text, and because the accurate parser does a good enough job giving page numbers that correspond to the page length of a paperback book, I don't see a pressing need to spend the time on it at this moment.
Posted on February 7th, 2011 by John. Filed under calibre.
Once again this is a big week with a lot of little changes. The majority of which were related to TXT input.
I was thinking about the fact that for all of the formats I support, I use the format specification to know how reading and writing should happen, and yet those documents weren't part of calibre proper. I have a set of documents that outline what is known about each format I handle. I say what is known because in some cases (eReader) the binary format is reverse engineered and a lot of it is guesswork. The documents are partly a collection of information available in (sometimes many) different places and partly my own work. I've now added these documents to calibre proper in the top level format_docs directory. Hopefully people will find this useful and it will help others work on these formats.
Recently there was a request to add auto complete (just like tags have) to the authors metadata field in the GUI. I added this a few versions ago and it caused an uproar. Many people loved the feature and many people hated how, after completing, it would add the completion character at the end of the completion. Even though the completion character is removed when you save the changes, a small group of vocal users didn't like the way it looked while editing. Kovid changed completion so the separator character isn't inserted after completion, and since I as well as others liked the old behavior, he said I should re-implement it as a tweak. So in 0.7.45, set the tweak completer_append_separator to True to have it insert the separator character after completion.
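For reference, since calibre's tweaks file is just Python, turning this back on should look something like the following (assuming the setting name from above):

```python
# Preferences -> Tweaks: re-insert the separator character after completion
completer_append_separator = True
```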
Lee and I did some more work on heuristics. Mainly he did the work; I've pretty much just been getting the options set up on the command line and in the GUI for him. There is a new option for replacing soft scene breaks with a hard scene break. The replacement text is user defined, but the history drop down comes preloaded with a number of common cases.
I did a little heuristic work myself. Namely, I tweaked the italicize patterns to make them more robust, and in the process I simplified them.
FB2 output was updated to handle creating soft scene breaks based on empty paragraphs and top margins. Because FB2 does not specify how the document is supposed to look (this is left to the reader software; elements only define type, not layout), I chose inserting blank lines between paragraphs to create scene breaks.
PML input had some tweaks regarding soft scene breaks. I reduced the number of empty lines between paragraphs needed to create a soft scene break. I haven't seen any documents that need this change; however, the more I thought about how it was handled, the more I realized that a valid document can use fewer lines.
Now that PML Input retains soft scene breaks it’s only natural to have PML output write them. Empty paragraphs and margin based spacing are both accounted for. In addition I added support for left margins being retained in the resultant PML.
There was one small bug fix. Looking over the PML docs again I noticed that \c and \r codes need to be closed on the next line following their opening. I modified the output code to ensure this happens. There was some general work to produce cleaner output as well.
While I was working on the above I decided that since I previously changed PML input to create a multi-level TOC that I should also have PML output write a multi-level TOC. Currently this is based on the tags being pointed to by the TOC items and by them not being headings. Only \Cn TOC markers are supported at this time. \Xn markers are going to need a bit more work.
TXT input paragraph processing was restructured so paragraph transformations are always applied. Previously they were not being applied when Markdown or Textile formatting was used. A user on MobileRead had modified their TXT file by simply adding #'s in front of the headings to have them formatted in the output. The user did not make any other changes to have their document conform to Markdown, and the resultant output was not very nice. It seems very common for users to simply stick Markdown or Textile formatting into their documents and rely on calibre to clean them up.
Dehyphenation of TXT input was tweaked. It now checks whether the heuristics and dehyphenate options are enabled; when they are, it is run over all TXT input, including Markdown and Textile formatted documents.
There were a few bug fixes related to various issues. Spaces at the beginning of lines were not being properly preserved. Spaces within documents were getting converted to entities when they shouldn't be. A regression that broke block formatted paragraphs was fixed.
Print formatted documents now have their indents retained.
For people like me who do not like indented paragraphs I’ve added an option to remove indents from TXT input documents.
There was one small bug fix in TXT output and that was to have TXT output show all TOC items. Previously it was only showing top level items.
I've added support for a new pseudo format called TXTZ. It's essentially just a TXT file put into a zip archive with the extension .txtz. It can contain images, which should make working with Markdown and Textile formatted text easier. Also, it has metadata support via an OPF file called metadata.opf within the archive. This OPF file is used for metadata reading and writing. Both input and output support for TXTZ has been added.
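Since a TXTZ is just a zip archive, one can be assembled with nothing more than the standard library. In this sketch only metadata.opf is a name the format actually calls for; the other file names are made up for illustration:

```python
import zipfile

# Build a hypothetical TXTZ: the text, any referenced images, and an OPF.
with zipfile.ZipFile('book.txtz', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('book.txt')             # the Markdown/Textile/plain text
    zf.write('images/cover.jpg')     # images referenced from the text
    zf.write('metadata.opf')         # read/written by calibre for metadata
```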