<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John&#039;s Blog &#187; tcr</title>
	<atom:link href="http://john.nachtimwald.com/tag/tcr/feed/" rel="self" type="application/rss+xml" />
	<link>http://john.nachtimwald.com</link>
	<description>My little blog</description>
	<lastBuildDate>Sun, 29 Jan 2012 21:31:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Calibre Week in Review</title>
		<link>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calibre-week-in-review-26</link>
		<comments>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/#comments</comments>
		<pubDate>Mon, 10 Jan 2011 02:39:22 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[calibre]]></category>
		<category><![CDATA[fb2]]></category>
		<category><![CDATA[heuristic]]></category>
		<category><![CDATA[palmdoc]]></category>
		<category><![CDATA[pdb]]></category>
		<category><![CDATA[tcr]]></category>
		<category><![CDATA[txt]]></category>
		<category><![CDATA[ztxt]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=497</guid>
		<description><![CDATA[This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We&#8217;ve worked together before improving PDF input. Since then he&#8217;s done a lot of work with preprocessing of PDF and other not so [...]]]></description>
			<content:encoded><![CDATA[<p>This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We&#8217;ve worked together before improving PDF input. Since then he&#8217;s done a lot of work with preprocessing of PDF and other not so clean input.</p>
<p>TXT input now auto-detects the character encoding of the file. It isn&#8217;t 100% accurate but should work for the majority of cases. It uses <a href="http://chardet.feedparser.org/">chardet</a> for the detection. Unfortunately, cp1252 is the encoding that most commonly gives people issues, and unless the text uses things like smart quotes and curly apostrophes it isn&#8217;t always detected properly.</p>
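<p>The cp1252 problem can be illustrated with the standard library alone. This is a minimal sketch, not chardet&#8217;s or calibre&#8217;s logic: <code>has_cp1252_evidence</code> is a hypothetical helper that checks for bytes in the 0x80 to 0x9F range, where cp1252 places most of its printable extras (including curly quotes) and latin-1 has only control codes. Text without such bytes encodes identically in both encodings, so a detector has nothing to tell them apart.</p>

```python
# Hypothetical helper for illustration; not part of chardet or calibre.
def has_cp1252_evidence(data: bytes) -> bool:
    """True if any byte is in 0x80-0x9F, the range where cp1252 and latin-1 differ."""
    return any(b in range(0x80, 0xA0) for b in data)


# Accented letters alone encode to the same bytes in cp1252 and latin-1.
plain = "na\u00efve caf\u00e9".encode("cp1252")

# Curly apostrophe and curly double quotes land in the 0x80-0x9F range.
curly = "It\u2019s \u201cquoted\u201d".encode("cp1252")

print(has_cp1252_evidence(plain))
print(has_cp1252_evidence(curly))
```

<p>Only the second sample carries bytes that point toward cp1252, which matches the behaviour described above.</p>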
<p>I started getting TXT input to detect the document structure, mainly whether the paragraphs are arranged in block, single line, or print fashion. Lee saw the detection code and, by modifying some of his preprocessing code, was able to greatly increase the detection accuracy over my initial work. He&#8217;s also added an unformatted type that assumes the text is one big blob and tries to determine paragraphs in much the same way PDF input does, by unwrapping based upon punctuation and other factors.</p>
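<p>A rough sketch of what this kind of structure detection could look like follows. It is not calibre&#8217;s detection code: the function name, the checks, and the thresholds are all assumptions chosen for illustration. It treats blank-line separation as block, leading indentation as print, lines that mostly end in sentence punctuation as single, and anything else as unformatted.</p>

```python
# Illustrative only: names and thresholds are assumptions, not calibre's code.
def detect_paragraph_type(text: str) -> str:
    lines = text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    if not non_blank:
        return "unformatted"
    blank_count = len(lines) - len(non_blank)
    # Lines that begin with a space or tab suggest print-style indentation.
    indented = [ln for ln in non_blank if ln[0] in " \t"]
    # Lines that end in closing punctuation suggest one paragraph per line.
    terminated = [ln for ln in non_blank if ln.rstrip()[-1] in ".!?\"'"]
    if blank_count * 10 >= len(non_blank):
        return "block"
    if len(indented) * 10 >= len(non_blank):
        return "print"
    if len(terminated) * 10 >= len(non_blank) * 8:
        return "single"
    return "unformatted"


print(detect_paragraph_type("Wrapped first\nparagraph.\n\nSecond one\nends here."))
```

<p>A real detector needs to be more forgiving than this, for example about stray blank lines inside an otherwise print-style document.</p>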
<p>In addition to detecting the paragraph style used in the document, TXT input now tries to detect the formatting of the text content. Markdown formatted text is detected. I&#8217;ve also added a heuristic processor, which runs by default when Markdown is not detected, unless the user has specified the formatting as none (which disables any type of formatting processing).</p>
<p>The heuristic processor uses some ideas from <a href="http://www.sandroid.org/GutenMark/">GutenMark</a>. Specifically, it italicizes common words and honors certain conventions used in Project Gutenberg texts to denote italics. I started working on a set of heuristics to detect chapter headings but Lee quickly pointed out he had already created something similar using regular expressions in his preprocessing code. I quickly began using it in my heuristic processor and it&#8217;s working well. Chapter headings and subheadings are now formatted with the appropriate h tags. He has some plans to enhance the detection further using a word list.</p>
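<p>To make the regular-expression approach to chapter headings concrete, here is a small sketch. The pattern and the <code>tag_lines</code> function are assumptions for illustration and are not Lee&#8217;s expressions or calibre&#8217;s code. Each non-blank line is labelled either <code>h2</code> or <code>p</code>, leaving the actual markup to a later step.</p>

```python
import re

# Assumed pattern: "Chapter", "Part" or "Book" followed by digits or a Roman numeral.
HEADING_RE = re.compile(
    r"^\s*(chapter|part|book)\s+(\d+|[ivxlc]+)\b[.:]?\s*(.*)$",
    re.IGNORECASE,
)


def tag_lines(text: str):
    """Return a list of (tag, line) pairs, where tag is 'h2' or 'p'."""
    tagged = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content of their own
        tag = "h2" if HEADING_RE.match(stripped) else "p"
        tagged.append((tag, stripped))
    return tagged


sample = "Chapter 1\nIt was a dark night.\n\nCHAPTER II: The Storm\nRain fell."
for tag, line in tag_lines(sample):
    print(tag, line)
```

<p>A pattern this simple produces false positives, such as a sentence that begins &#8220;Part I was thinking&#8221;, which is one reason a word list or a line-length check is a natural next step.</p>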
<p>TCR, PDB PalmDoc and PDB zTXT inputs all pass the extracted text to the TXT input plugin for processing. This allows them to take advantage of all the work that&#8217;s gone into TXT input. Also, with auto detection now being part of TXT input, it should allow for one-time conversion instead of convert, check, tweak some options, convert again. That matters especially for these formats, since they don&#8217;t make it easy to see how the text is structured within the file without first converting.</p>
<p>TXT input wasn&#8217;t the only part of TXT support that was touched. I spent some time cleaning up the TXT output. Consistent spacing is now created around headings. Also, when using the --remove-paragraph-spacing option, headings are not indented with a tab. The output now looks much cleaner and I consider it acceptable for reading.</p>
<p>Not to be left out, FB2 output got a small bug fix. With all the work rewriting it, I broke its handling of covers. If you were converting an EPUB, for instance, that specified the cover (or title page) in the guide rather than the spine, it would not be included. Also, the --cover option was being ignored. Now that&#8217;s fixed and external covers are inserted properly.</p>
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Calibre Week in Review</title>
		<link>http://john.nachtimwald.com/2011/01/01/calibre-week-in-review-25/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calibre-week-in-review-25</link>
		<comments>http://john.nachtimwald.com/2011/01/01/calibre-week-in-review-25/#comments</comments>
		<pubDate>Sat, 01 Jan 2011 16:52:00 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[calibre]]></category>
		<category><![CDATA[fb2]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[tcr]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=493</guid>
		<description><![CDATA[I did some work with PDF output. Mainly I refactored some of the output generation code to reduce redundant sections. Overall there won&#8217;t be any user-visible changes. The main reason I dove back into PDF output was that a user on OS X noted that the PDFs produced were not searchable. Windows users are [...]]]></description>
			<content:encoded><![CDATA[<p>I did some work with PDF output. Mainly I refactored some of the output generation code to reduce redundant sections. Overall there won&#8217;t be any user-visible changes.</p>
<p>The main reason I dove back into PDF output was that a user on OS X noted that the PDFs produced were not searchable. Windows users are getting searchable PDFs, and Kovid was able to get a searchable PDF on his Gentoo Linux machine. I looked into the issue and cannot get searchable PDFs on OS X. However, I can get searchable PDFs when using the ebook-viewer&#8217;s print feature via print to file. I&#8217;m not sure why this happens because the ebook-viewer and PDF output use the same technique for generating a PDF. I&#8217;ve decided not to pursue the matter further because PDF output on OS X is pretty much broken due to Qt bugs. See <a href="http://bugreports.qt.nokia.com/browse/QTBUG-8149">this</a> for an example.</p>
<p>TCR compression was something I added a while ago. I was never fully happy with it because it was slow and produced low-quality output. I spent a few days completely rewriting the compressor and now it performs beautifully. The new compressor is cleaner, on the order of 10x faster, compresses to a much smaller size, and I would say is on par with the output of <a href="http://www.cix.co.uk/~gidds/Software/TCR.html">Andrew Giddings&#8217; TCR</a> implementation. However, calibre&#8217;s TCR compressor is a pure Python implementation and is still considerably slower than Andrew&#8217;s C implementation.</p>
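<p>For readers curious how a dictionary for a one-byte-code format can be built, the sketch below uses greedy pair merging: the most frequent adjacent pair of codes is repeatedly replaced by an unused code. This post does not say which algorithm calibre&#8217;s compressor uses, so treat this as a generic illustration. The names <code>compress_sketch</code> and <code>expand</code>, and the 255-byte entry limit, are assumptions.</p>

```python
from collections import Counter

MAX_ENTRY = 255  # assumed limit: one length byte per dictionary entry


def compress_sketch(data: bytes):
    """Greedy pair merging. Returns (entries, coded).

    entries maps each one-byte code to the bytes it expands to.
    """
    used = set(data)
    entries = {code: bytes([code]) for code in used}
    free = [code for code in range(256) if code not in used]
    coded = list(data)
    while free:
        pairs = Counter(zip(coded, coded[1:]))
        # Most frequent adjacent pair that repeats and still fits in one entry.
        best = next(
            (pair for pair, n in pairs.most_common()
             if n >= 2 and len(entries[pair[0]]) + len(entries[pair[1]]) <= MAX_ENTRY),
            None,
        )
        if best is None:
            break
        new_code = free.pop()
        entries[new_code] = entries[best[0]] + entries[best[1]]
        # Replace occurrences of the pair, scanning left to right.
        merged, i = [], 0
        while i < len(coded):
            if tuple(coded[i:i + 2]) == best:
                merged.append(new_code)
                i += 2
            else:
                merged.append(coded[i])
                i += 1
        coded = merged
    return entries, bytes(coded)


def expand(entries, coded: bytes) -> bytes:
    """Inverse of compress_sketch: look up every code and concatenate."""
    return b"".join(entries[code] for code in coded)
```

<p>Recounting every pair on each pass is slow, which hints at why a naive pure Python compressor struggles on larger files.</p>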
<p>A minor bug in FB2 output was brought to my attention and fixed. Basically JPG images in the input document were not being written to the FB2 output file. This has been corrected.</p>
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2011/01/01/calibre-week-in-review-25/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Calibre Week in Review</title>
		<link>http://john.nachtimwald.com/2009/10/19/calibre-week-in-review-15/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calibre-week-in-review-15</link>
		<comments>http://john.nachtimwald.com/2009/10/19/calibre-week-in-review-15/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 11:35:41 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[calibre]]></category>
		<category><![CDATA[ebook]]></category>
		<category><![CDATA[tcr]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=238</guid>
		<description><![CDATA[Like every week there were miscellaneous bug fixes. However, this week I did a bit more: TCR input and output. Do be warned that the output supports multiple compression levels; the higher levels being slower than the lower. For instance a 200K TXT file as input will take around 25 seconds on the lowest level [...]]]></description>
			<content:encoded><![CDATA[<p>Like every week there were miscellaneous bug fixes. However, this week I did a bit more: TCR input and output. Do be warned that the output supports multiple compression levels; the higher levels being slower than the lower. For instance a 200K TXT file as input will take around 25 seconds on the lowest level and 3.5 minutes at the highest.</p>
<p>TCR is a compressed text format used mainly by the <a href="http://en.wikipedia.org/wiki/Psion">Psion</a> <a href="http://en.wikipedia.org/wiki/Psion_Series_3">3</a> and <a href="http://en.wikipedia.org/wiki/Psion_Series_5">5</a> series PDAs that were produced in the 90s. The compression used by TCR files is very interesting. It doesn&#8217;t have as high a compression ratio as, say, zlib, but that is a trade-off for being decompressible starting at any point in the stream. The history and more information about the format can be found at <a href="http://www.cix.co.uk/~gidds/Software/TCR.html">Andrew Giddings&#8217; TCR page</a>.</p>
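<p>The random-access property is easiest to see in a decoder. The sketch below assumes the commonly described layout: a 9-byte <code>!!8-Bit!!</code> signature, then 256 dictionary entries each prefixed by a one-byte length, then the text as a run of one-byte codes. Check that layout against the format page linked above before relying on it. The names <code>read_tcr</code> and <code>expand</code> are made up for this example.</p>

```python
import io

TCR_MAGIC = b"!!8-Bit!!"  # signature, assumed from the format description


def read_tcr(stream):
    """Return (entries, coded) from a binary stream in the assumed layout."""
    if stream.read(len(TCR_MAGIC)) != TCR_MAGIC:
        raise ValueError("missing TCR signature")
    entries = []
    for _ in range(256):
        length = stream.read(1)[0]  # one length byte, then that many bytes
        entries.append(stream.read(length))
    return entries, stream.read()


def expand(entries, coded, start=0):
    """Decode from any code offset; no earlier state is needed."""
    return b"".join(entries[code] for code in coded[start:])


# Tiny hand-built file: code 0 is "hello ", code 1 is "world", the rest are empty.
table = [b"hello ", b"world"] + [b""] * 254
raw = TCR_MAGIC + b"".join(bytes([len(e)]) + e for e in table) + bytes([0, 1, 0])
entries, coded = read_tcr(io.BytesIO(raw))
print(expand(entries, coded))
print(expand(entries, coded, start=1))
```

<p>Because every code expands independently of its neighbours, decoding can begin at any code in the stream, which is not possible with a zlib stream unless it is decompressed from the start.</p>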
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2009/10/19/calibre-week-in-review-15/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

