<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John&#039;s Blog &#187; ztxt</title>
	<atom:link href="http://john.nachtimwald.com/tag/ztxt/feed/" rel="self" type="application/rss+xml" />
	<link>http://john.nachtimwald.com</link>
	<description>My little blog</description>
	<lastBuildDate>Sun, 29 Jan 2012 21:31:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Calibre Week in Review</title>
		<link>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calibre-week-in-review-26</link>
		<comments>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/#comments</comments>
		<pubDate>Mon, 10 Jan 2011 02:39:22 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[calibre]]></category>
		<category><![CDATA[fb2]]></category>
		<category><![CDATA[heuristic]]></category>
		<category><![CDATA[palmdoc]]></category>
		<category><![CDATA[pdb]]></category>
		<category><![CDATA[tcr]]></category>
		<category><![CDATA[txt]]></category>
		<category><![CDATA[ztxt]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=497</guid>
		<description><![CDATA[This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We&#8217;ve worked together before improving PDF input. Since then he&#8217;s done a lot of work with preprocessing of PDF and other not so [...]]]></description>
			<content:encoded><![CDATA[<p>This week saw massive improvements to TXT input. I started the week with a slew of changes, and as soon as I had implemented the first of them Lee Dolsen contacted me. We&#8217;ve worked together before on improving PDF input. Since then he&#8217;s done a lot of work on preprocessing PDF and other not-so-clean input.</p>
<p>TXT input now auto-detects the character encoding of the file. It isn&#8217;t 100% accurate but should work for the majority of cases. It uses <a href="http://chardet.feedparser.org/">chardet</a> for the detection. Unfortunately, cp1252 is the encoding that most commonly gives people issues, and unless the text uses things like smart quotes and curly apostrophes it isn&#8217;t always detected properly.</p>
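<p>A minimal sketch of this kind of detection follows. It is not calibre&#8217;s actual code: the function names, the injectable <code>detector</code> parameter, and the cp1252 fallback are assumptions made for illustration. Only <code>chardet.detect</code>, which returns a dict with an <code>encoding</code> key, comes from chardet itself.</p>

```python
def detect_encoding(raw, detector=None, default="utf-8"):
    """Return a best-guess encoding name for the bytes in raw."""
    if detector is None:
        try:
            import chardet  # third-party; optional in this sketch
        except ImportError:
            return default
        detector = chardet.detect
    guess = detector(raw) or {}
    # chardet reports encoding=None when it cannot decide.
    return guess.get("encoding") or default


def decode_text(raw, detector=None):
    """Decode raw bytes using the detected encoding.

    Falls back to cp1252 (an assumption of this sketch) when the guess
    is unknown to Python or does not fit the data.
    """
    encoding = detect_encoding(raw, detector)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return raw.decode("cp1252", errors="replace")
```

<p>Calling <code>decode_text(data)</code> with the bytes of a TXT file returns a string. Passing a stub <code>detector</code> makes the fallback path easy to exercise without chardet installed.</p>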
<p>I started getting TXT input to detect the document structure, mainly whether the paragraphs are arranged in block, single line, or print fashion. Lee saw the detection code and, by modifying some of his preprocessing code, was able to greatly increase the detection accuracy over my initial work. He&#8217;s also added an unformatted type that assumes the text is one big blob and tries to determine paragraphs in much the same way PDF input does: by unwrapping based upon punctuation and other factors.</p>
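<p>To give a feel for what structure detection involves, here is a rough sketch. It is not the detection code described above. It assumes that block means paragraphs separated by blank lines, single line means one long line per paragraph, and print means paragraphs that start with a tab or at least two spaces. The function name and every threshold are made up for the example.</p>

```python
def detect_paragraph_type(text, long_line=100):
    """Guess how paragraphs are laid out in plain text.

    Returns "single", "block", "print" or "unformatted". The thresholds
    are illustrative guesses, not tuned values.
    """
    lines = text.splitlines()
    filled = [line for line in lines if line.strip()]
    if not filled:
        return "unformatted"
    total = len(filled)
    # Blank lines per non-blank line: high when blank lines split paragraphs.
    blank_ratio = (len(lines) - total) / total
    # Share of lines that open with a tab or two spaces.
    indent_ratio = sum(line.startswith(("\t", "  ")) for line in filled) / total
    # Share of lines too long to be hard-wrapped.
    long_ratio = sum(len(line) > long_line for line in filled) / total

    if long_ratio > 0.5:
        return "single"
    if blank_ratio >= 0.2:
        return "block"
    if indent_ratio >= 0.2:
        return "print"
    return "unformatted"
```

<p>The checks run from the most distinctive signal to the least, so a file of very long lines is reported as single even if it also contains blank lines.</p>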
<p>In addition to detecting the paragraph style used in the document, TXT input now tries to detect the formatting of the text content. Markdown formatted text is detected. I&#8217;ve also added a heuristic processor, which runs by default when Markdown is not detected, unless the user has specified the formatting as none (which disables any type of formatting processing).</p>
<p>The heuristic processor uses some ideas from <a href="http://www.sandroid.org/GutenMark/">GutenMark</a>, specifically italicizing common words and certain conventions used in Project Gutenberg texts that denote italics. I started working on a set of heuristics to detect chapter headings, but Lee quickly pointed out he had already created something similar using regular expressions in his preprocessing code. I quickly began using it in my heuristic processor and it&#8217;s working well. Chapter headings and subheadings are now formatted with the appropriate h tags. He has some plans to enhance the detection further using a word list.</p>
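<p>The sketch below shows the general shape of such heuristics. It is not Lee&#8217;s regular expressions or the heuristic processor itself. It assumes that italics are marked with surrounding underscores (one common plain-text convention) and that headings start with the word Chapter, Part or Book followed by a number. The names and the mapping of Part and Book to h1 and Chapter to h2 are invented for illustration.</p>

```python
import re

# _word_ style emphasis; the word boundaries keep snake_case names out.
ITALIC_RE = re.compile(r"\b_([^_\n]+)_\b")

# "Chapter 3", "CHAPTER IV", "Part 2", "Book ii" at the start of a line.
HEADING_RE = re.compile(
    r"^\s*(chapter|part|book)\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE
)


def italic_spans(text):
    """Return the pieces of text wrapped in underscores."""
    return ITALIC_RE.findall(text)


def heading_tag(line):
    """Return "h1", "h2" or None for a single line of text."""
    match = HEADING_RE.match(line)
    if match is None:
        return None
    # Assumption: parts and books outrank chapters.
    return "h1" if match.group(1).lower() in ("part", "book") else "h2"
```

<p>A pattern this loose produces false positives (a sentence beginning &#8220;Part I&#8221; would be tagged as a heading), which is one reason a word list could help.</p>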
<p>TCR, PDB PalmDoc and PDB zTXT inputs all pass the extracted text to the TXT input plugin for processing. This allows them to take advantage of all the work that&#8217;s gone into TXT input. Also, with auto detection now part of TXT input, it should allow for one-time conversion instead of convert, check, tweak some options, convert again. This matters especially because these formats don&#8217;t make it easy to see how the text is structured within the file without first converting.</p>
<p>TXT input wasn&#8217;t the only part of TXT support that was touched. I spent some time cleaning up the TXT output. Consistent spacing is now created around headings. Also, when using the --remove-paragraph-spacing option, headings are not indented with a tab. The output now looks much cleaner and I consider it acceptable for reading.</p>
<p>Not to be left out, FB2 output got a small bug fix. While rewriting it I broke cover handling. If you were converting an EPUB, for instance, that specified the cover (or title page) in the guide rather than the spine, it would not be included. Also, the --cover option was being ignored. Now that&#8217;s fixed and external covers are inserted properly.</p>
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2011/01/09/calibre-week-in-review-26/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>lebookread</title>
		<link>http://john.nachtimwald.com/2010/05/16/lebookread/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=lebookread</link>
		<comments>http://john.nachtimwald.com/2010/05/16/lebookread/#comments</comments>
		<pubDate>Mon, 17 May 2010 01:52:56 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[lebookread]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[epub]]></category>
		<category><![CDATA[palmdoc]]></category>
		<category><![CDATA[pdb]]></category>
		<category><![CDATA[qt]]></category>
		<category><![CDATA[rb]]></category>
		<category><![CDATA[ztxt]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=358</guid>
		<description><![CDATA[I have been taking a short break from blogging again. The pressure at work has only increased and is eating into a lot of my time. I haven&#8217;t been motivated to work on personal projects because, well, they are work. However, this has recently changed a bit. I&#8217;ve started a Qt-based library for reading [...]]]></description>
			<content:encoded><![CDATA[<p>I have been taking a short break from blogging again. The pressure at work has only increased and is eating into a lot of my time. I haven&#8217;t been motivated to work on personal projects because, well, they are work. However, this has recently changed a bit.</p>
<p>I&#8217;ve started a Qt-based library for reading ebooks in a generic manner. It is called <a href="https://launchpad.net/lebookread">lebookread</a>! It is in its early stages. So far it supports epub, palmdoc pdb, ztxt pdb, tcr, and rb files. I plan to support ereader pdb, mobi, and plucker files in the near future.</p>
<p>The main goal of this project is to make reading ebooks easy for Qt-based projects. I&#8217;ve chosen to write the library in C++. This is also my first attempt at writing a library and it shows. I hope that it will be used by <a href="http://code.google.com/p/sigil/">Sigil</a>.</p>
<p>The real motivation for writing lebookread is that I really want a good lightweight ebook reader. The current offerings have issues. I want something that is a bit more advanced in its rendering than <a href="http://www.fbreader.org/">FBReader</a>. I also don&#8217;t want anything with as large a dependency list as <a href="http://calibre-ebook.com/">calibre</a>. So, I plan on using lebookread to write my own ebook viewer.</p>
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2010/05/16/lebookread/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Calibre Week in Review</title>
		<link>http://john.nachtimwald.com/2009/05/10/calibre-week-in-review-4/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calibre-week-in-review-4</link>
		<comments>http://john.nachtimwald.com/2009/05/10/calibre-week-in-review-4/#comments</comments>
		<pubDate>Sun, 10 May 2009 12:29:33 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[calibre]]></category>
		<category><![CDATA[eReader]]></category>
		<category><![CDATA[GUI]]></category>
		<category><![CDATA[palmdoc]]></category>
		<category><![CDATA[pdb]]></category>
		<category><![CDATA[ztxt]]></category>

		<guid isPermaLink="false">http://john.nachtimwald.com/?p=113</guid>
		<description><![CDATA[Device interfaces can now be configured in the GUI. Also, there is a simple framework for creating plugin configuration widgets. I&#8217;ve added a metadata reader for the eReader format. However, eReader supports 3 ways to set the metadata in the file. 1) In the pdb header (only supports setting a short title). 2) In the [...]]]></description>
			<content:encoded><![CDATA[<p>Device interfaces can now be configured in the GUI. Also, there is a simple framework for creating plugin configuration widgets.</p>
<p>I&#8217;ve added a metadata reader for the eReader format. However, eReader supports 3 ways to set the metadata in the file. 1) In the pdb header (only supports setting a short title). 2) In the metadata section of the file (supports the most information: title, author, publisher, copyright, isbn). 3) Embedded in the text as a comment. 2 and 3 are only accessible if the book does not contain DRM (or has been unlocked, but Calibre does not support this). 3 is not supported at all with this metadata reader. The reader first tries 2, then falls back to 1 if the book is DRMed or if the metadata section is non-existent.</p>
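<p>The fallback order can be sketched as follows. This is only an illustration of the logic, not the reader&#8217;s code: the <code>book</code> dict and its keys (<code>drm</code>, <code>metadata_section</code>, <code>header_title</code>) are hypothetical stand-ins for an already parsed file.</p>

```python
def read_ereader_metadata(book):
    """Pick metadata from a parsed eReader book (hypothetical dict).

    Tries the metadata section first and falls back to the short
    title from the pdb header.
    """
    # Method 2: the metadata section, readable only without DRM.
    section = book.get("metadata_section")
    if section and not book.get("drm", False):
        return dict(section)
    # Method 1: the pdb header, which only carries a short title.
    return {"title": book.get("header_title", "")}
```

<p>For a DRMed book the result contains just the header title, even when a metadata section exists in the file.</p>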
<p>Two new input and output formats have been added: ztxt and palmdoc. They are both pdb formats like eReader. For input, the pdb input plugin will automatically determine the internal format and call the appropriate code path. For output, the default is palmdoc, but there is a --format option that can be used to change it to any other supported pdb output format (ztxt is currently the only other one). The format option is also available in the conversion dialog in the GUI.</p>
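<p>One way to determine the internal format is to look at the 8 byte identity (the 4 byte type followed by the 4 byte creator) stored at offset 60 of the pdb header. The sketch below assumes that approach. The lookup table and function name are illustrative rather than taken from the plugin, and the identity strings are the ones commonly associated with these formats, to the best of my knowledge.</p>

```python
# Identity = type + creator fields of the pdb header (bytes 60 to 67).
PDB_IDENTITIES = {
    b"TEXtREAd": "palmdoc",
    b"zTXTGPlm": "ztxt",
    b"PNRdPPrs": "ereader",
}


def pdb_format(header):
    """Return the internal format name for a pdb header, or None."""
    if 68 > len(header):
        raise ValueError("PDB header is truncated")
    return PDB_IDENTITIES.get(header[60:68])
```

<p>A caller would read the first 78 bytes of the file, pass them to <code>pdb_format</code>, and hand the file to the matching reader, treating <code>None</code> as an unsupported pdb format.</p>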
<p>Speaking of conversion in the GUI, it now works. There are all new dialogs for single and bulk conversion. Pretty much anything that can be done using the command line ebook-convert can be done in the GUI. Bulk, single and auto conversion are all complete and working. Auto conversion will also honor a user&#8217;s preferences for formats set for the device interface plugin.</p>
]]></content:encoded>
			<wfw:commentRss>http://john.nachtimwald.com/2009/05/10/calibre-week-in-review-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

