Posts Tagged ‘txt’

* Calibre Week in Review

Posted on September 8th, 2011 by John. Filed under calibre.


This week I finally sat down and spent some time with Markdown input and output. Both saw major changes. Markdown input was bumped to upstream version 2.0. Output was completely rewritten from scratch. Markdown output is now completely custom code (not using a third party output module like before). I based the new Markdown code on the Textile output classes I helped Perkin create.

As with all new code and major changes there are probably bugs. I tested Markdown output with a variety of test material and kept working at it until everything converted acceptably. I also used a variety of the Markdown tests provided by John Gruber to ensure my output was correct. When converting the HTML output tests back to Markdown the output is similar enough to the original that I feel it is acceptable.

The last big change I made this week was adding a new OEB transformation to unsmarten punctuation. As the name implies it changes curly quotes, apostrophes and a few other characters to their plain text, straight equivalents. It basically does the opposite of smarten punctuation. I find this especially useful when converting to formatted (Textile or Markdown) plain text files (TXT).
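As a rough illustration of what unsmartening involves, here is a minimal Python sketch. The character table is an assumption made for the example; the post does not list exactly which characters calibre's transformation converts or what it converts them to.

```python
# Hypothetical mapping: the exact characters calibre handles are not listed in the post.
UNSMARTEN_MAP = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(UNSMARTEN_MAP)


def unsmarten(text):
    """Replace curly punctuation with plain, straight equivalents."""
    return text.translate(_TABLE)


print(unsmarten("\u201cIt\u2019s done\u2026\u201d"))
```

Straight quotes matter for Markdown and Textile output because both formats are meant to be typed and edited in a plain text editor.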


* Formatting Tips: Markdown, Textile and calibre

Posted on August 30th, 2011 by John. Filed under Formatting Tips.


About Formatting Tips.

Up to this point Formatting Tips have been focused on the EPUB format and working directly with the underlying XHTML and CSS. Not everyone wants or needs this level of control over the layout of their book. Oftentimes a book only needs basic formatting such as headings, bold, and italic. There are other, easier ways to format an ebook. However, in this case simpler does mean basic.

A very easy way to format an ebook is to start with a plain text file (TXT). Then use either Markdown or Textile to add the formatting. Both Markdown and Textile allow for simple text formatting and they are designed to be converted to HTML.

By using TXT with a formatting syntax you can use pretty much any text editor you want. Markdown and Textile are very simple formats that are much easier to learn than XHTML and CSS. Adding things like *bold* is as easy as putting a * before and after a segment of text.

I recommend looking at both Markdown and Textile. There are differences in what formatting they support but they both support the basics like bold, italic, and headings. I’ve found Markdown to be easier to use but Textile offers more options.

After adding your formatting to the text it’s very easy to turn the TXT file into your desired final format (EPUB or MOBI most likely). calibre supports TXT formatted with either Markdown or Textile. However, the Textile support is more robust. Simply convert to the output format of your choosing.



* Calibre Weeks in Review

Posted on April 9th, 2011 by John. Filed under calibre.


It has been a few weeks since I’ve done a calibre week in review. This is partly because I had been working on some new features for the upcoming 0.8 release. I haven’t wanted to talk about it very much until the release gets closer. Kovid said yesterday that he will be reviewing my changes next week.

HTMLZ

One complaint I hear often is in regard to the inability to edit ebooks. Many people seem to think EPUB is not a good format for editing. Sigil is often the solution given around these parts but some people insist on the need for a book to be contained in a single HTML file. Simply unzipping an EPUB doesn’t accomplish this due to the need to split the files.

To remedy this situation I’ve added a new output format: HTMLZ. Just like TXTZ it is just a zip file with a different extension to differentiate it. Inside is a metadata.opf file (calibre can read and write metadata to it). Images are preserved, renamed and placed in an images folder. This format is available in the 0.7.54 release.

Also inside is a single HTML file. Even if you’re converting from an EPUB that has been split into multiple parts, a conversion to HTMLZ will result in a single HTML file. To go along with this there are a number of ways to configure CSS handling. The default is to place the CSS in a separate style.css file. It can also place class based CSS inside of the head element in the HTML itself. Or you can have it write the CSS inline within each element. Finally, the last option for CSS is to remove it and convert as much as possible (a very limited set right now) to HTML tags.
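Since an HTMLZ file is an ordinary zip archive, its contents can be inspected with any zip library. The sketch below uses Python's standard zipfile module. The names index.html and images/cover.png are assumptions made for illustration; the post itself only names metadata.opf, style.css and the images folder.

```python
import io
import zipfile


def summarize_htmlz(data):
    """Group the member names of an HTMLZ-style zip archive by role."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return {
        "html": [n for n in names if n.lower().endswith((".html", ".htm"))],
        "css": [n for n in names if n.lower().endswith(".css")],
        "images": [n for n in names if n.startswith("images/")],
        "has_metadata": "metadata.opf" in names,
    }


# Build a tiny stand-in archive in memory.
# index.html and images/cover.png are assumed names, not taken from calibre.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("metadata.opf", "<package/>")
    zf.writestr("index.html", "<html><body><p>Hello</p></body></html>")
    zf.writestr("style.css", "p { margin: 0 }")
    zf.writestr("images/cover.png", b"")

print(summarize_htmlz(buf.getvalue()))
```

With the default CSS option the summary should show one HTML file and one stylesheet; with the other options described above the style.css entry would presumably be absent.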

As with all of my output format attempts I believe this will have quite a few bugs. Let me know about any issues so I can fix them. I hope people find this useful for their hand editing needs.

FB2 Output

Just a small change to FB2 output this time. Users can now select the genre for the output document. The default is antique but a list of supported genres is available to choose from.

GUI – Toolbars

theducks on MobileRead made a few requests regarding handling of toolbars. He was having trouble with the number of interface action plugins he had added to the toolbar and needed more space.

The first change is removing the “split toolbar into two” option and making the second toolbar user configurable. This way you can add whatever you want, in the order you want, to the second toolbar.

Along with this, theducks also wanted to be able to remove the icons on the toolbars so I added an off option to the toolbar icon size setting. This way icons can be removed completely. If they are disabled then the text will automatically be used even if the toolbar text option is set to never show. This way you won’t lose your toolbar.

I also made it so that any toolbar that doesn’t have any items on it will be hidden. All of these toolbar changes are in the 0.7.54 release.

GUI – Menubar

Another change to the GUI which won’t be out until the 0.7.55 release is the addition of a configurable menubar. I personally don’t like the toolbar and added support for a menubar. It is configurable in the toolbar configuration area in preferences. Just like the toolbars and right click menus you can configure what is in the menu and what order the items appear in.

The main motivation of the menubar addition was the fact that I use a Mac. OS X always shows a menubar outside of the application window. Calibre never looked quite right on a Mac because it doesn’t have a menu so OS X’s menubar would always appear empty.

GUI – OS X

On OS X the menubar has a number of default items. On all other OSes the menubar is empty and hidden by default. Also, some toolbar items are not shown by default on OS X because they are available through the menubar. The idea is to provide a visually appealing default for OS X and a more intuitive experience for Mac users.

I’ve also made the toolbar and statusbar on OS X use the system type instead of the generic Qt toolbar and status bar. They look better and behave as one would expect on OS X. The hide toolbar button, for instance, now works and hides the toolbar.

Other

Aside from my changes, I’ve been giving direction to Perkin from MobileRead for enhancements to the Textile input and output. The input changes are already in the latest (0.7.54) release. He’s still working on enhancements to Textile output to ensure it produces the same output that the input supports. He has also identified a few bugs with the current Textile output and is working to fix them too.


* Calibre Week in Review

Posted on February 18th, 2011 by John. Filed under calibre.


This is a short week for the week in review because I’m now doing my week from Friday to Thursday. Last week I ended my week on Monday so this review only has a few days’ worth of work.

TXTZ

I’ve added an import plugin that runs over TXT content when it is added to the library. What happens is the TXT file is scanned looking for Markdown (inline or reference) and Textile image references. It collects all of the images and adds them plus the TXT file to a TXTZ archive when the following conditions are true:

  • Path must not be empty.
  • Path must be a relative path.
  • The mimetype of the image (based on extension) must be an OEB supported image type. (JPG, PNG, SVG, GIF).
  • The image must exist relative to the TXT file’s location and the location specified by the path.

If no images are found referenced in the TXT file, or if the images found fail the above tests, then a TXTZ archive is not created and the TXT file itself is added to the library.
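The checks listed above could be sketched roughly as follows. This is not the plugin's actual code: the regular expressions are simplified stand-ins for Markdown inline, Markdown reference and Textile image syntax, the extension set stands in for the mimetype lookup, and the function names are hypothetical.

```python
import os
import re

# Simplified, illustrative patterns (assumptions, not calibre's expressions).
MD_INLINE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")                   # ![alt](path)
MD_REFERENCE = re.compile(r"^[ \t]*\[[^\]]+\]:[ \t]*(\S+)", re.MULTILINE)  # [id]: path
TEXTILE = re.compile(r"!([^\s!(]+)(?:\([^)]*\))?!")                    # !path! or !path(alt)!

# Stand-in for "OEB supported image type" based on the extension.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".svg", ".gif"}


def find_image_paths(text):
    """Collect every candidate path matched by the three patterns."""
    paths = []
    for pattern in (MD_INLINE, MD_REFERENCE, TEXTILE):
        paths.extend(pattern.findall(text))
    return paths


def is_bundleable(path, txt_dir, exists=os.path.isfile):
    """Apply the four conditions from the list above to one path."""
    if not path:                                    # must not be empty
        return False
    if os.path.isabs(path) or "://" in path:        # must be relative
        return False
    if os.path.splitext(path)[1].lower() not in IMAGE_EXTENSIONS:
        return False                                # must be a supported image type
    return exists(os.path.join(txt_dir, path))      # must exist next to the TXT file


def images_to_bundle(text, txt_dir, exists=os.path.isfile):
    return [p for p in find_image_paths(text) if is_bundleable(p, txt_dir, exists)]
```

If `images_to_bundle` returns an empty list, the plain TXT file would be added on its own, matching the fallback described above. The `exists` parameter is only there so the check can be exercised without touching the filesystem.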

PML Input:

Fix a bug where TOC entries specified by \x and \X were not being included in the TOC.

Heuristics:

The italicize common cases patterns got tweaked again. One pattern (/text/) would match <br /> </… and cause issues.


* Calibre Week in Review

Posted on February 13th, 2011 by John. Filed under calibre.


I’ve been putting up my week in reviews based on a week starting on Monday for some time now. I’ve been thinking about this and it doesn’t really make much sense. Calibre has a release pretty much every Friday now. So starting next week I’m going to change my week in review to be Friday through Thursday. This way features I talk about in my review will be in the just released version.

TXT Input

First the small changes. Heuristic processing now enables smarten punctuation to further my goal of TXT documents coming out looking great. A change was made to have hard scene breaks separated from the text to ensure it doesn’t accidentally get merged into the paragraph before or after. The formatting type none was renamed to plain to correspond with the formatting output option.

The only big change for TXT input was the addition of a new paragraph type option. It’s called off. When specified, no modifications to the paragraph structure are applied to the text. This is especially useful for Markdown and Textile formatted documents. It ensures there are no changes that will cause elements to render incorrectly.

TXTZ Input

A bug caused images to not be included when converting. With Kovid’s help this has been corrected.

TXT Output

I modified Textile output to not write %’s for span tags. The span tag is superfluous in calibre’s Textile output because it does not contain any real information. The span tags are invisible when rendering the XHTML. The %’s cluttered up the resultant TXT so they were removed.

PML Input

PML input saw a lot of work relating to the \t and \T tags. The entire handling of these tags was rewritten. Unfortunately, there is no way to have these two tags map one to one to XHTML so only some common cases are handled.

  • \T’s that do not start the line are ignored.
  • \t’s that start and end the line use a margin for the text block.
  • \t’s that start a line and end another line use a margin for the text block.
  • \t’s that start a line but end before a line ending will use a text-indent.
  • \t’s that are in the middle of lines are ignored. Open and closed \t blocks within a line are ignored.

Heuristics

Once again the italicize common cases regex was tweaked. This time it was to fix an issue with None being inserted in the text before adjacent underscores. I’m hoping this is the last time for a while that I need to tweak them.

Kindle Interface

The work I did on the APNX format was undertaken for a very real-world reason: integrating APNX generation into calibre’s Kindle device interface plugin.

The 0.7.45 release saw the initial inclusion of this feature. After I received some user feedback I’ve tweaked it for the 0.7.46 release. The 0.7.45 release included a very basic APNX file that would create pages every 1024 bytes of uncompressed HTML.

In 0.7.46 there are a lot of differences. Writing the APNX can be disabled. This is very useful for Kindle 2 users as the Kindle interface works for both the Kindle 2 and the Kindle 3.

There are now two parsers for generating pages. The default is the fast parser. It uses the uncompressed length of the MOBI HTML and creates pages every 2300 bytes. A few users complained that 1024 created too many pages, about double what you would find in an average paperback book. The 2300 number is a bit more than double 1024, and I chose it after counting the number of characters in a page of an average paperback book. I counted approximately 2240 and added an additional 60 characters to account for markup per page. Thus 2300.
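The fixed-interval idea behind the fast parser comes down to stepping through the uncompressed length in equal strides. The following is a sketch of that arithmetic only; the function name is hypothetical and nothing here reads a real MOBI header or writes an APNX file.

```python
def fast_page_offsets(uncompressed_length, page_size=2300):
    """Return the byte offset at which each page starts.

    page_size defaults to 2300 (about 2240 characters of text plus roughly
    60 bytes of markup per page, as described above).
    """
    return list(range(0, uncompressed_length, page_size))


# A 10,000 byte book under the new and the old page size.
print(len(fast_page_offsets(10000)))        # pages at 2300 bytes each
print(len(fast_page_offsets(10000, 1024)))  # pages at 1024 bytes each
```

For a 10,000 byte text this gives 5 pages at 2300 bytes versus 10 pages at 1024 bytes, which lines up with the "about double" complaint.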

The other parser that can be enabled in the Kindle interface’s settings is the accurate parser. It works by decompressing the MOBI HTML and looking at the actual content. The big difference, and why I’m calling it an accurate parser, is that it looks at the amount of visible text to decide when a page ends and a new one begins. The assumption is there are 30 lines per page and each line can have up to 70 characters. The parser starts a new line every time it encounters a new paragraph and every 70 characters in a paragraph.
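The line-counting rule can be sketched as below. This is a simplified model, not the parser itself: it assumes the visible text has already been split into paragraphs with the markup stripped, and it reports page starts as offsets into that visible text rather than into the MOBI HTML.

```python
LINES_PER_PAGE = 30
CHARS_PER_LINE = 70


def accurate_page_starts(paragraphs):
    """Return the visible-text offset at which each page begins.

    Each paragraph starts a new line, and a paragraph longer than
    CHARS_PER_LINE wraps onto a further line every 70 characters.
    """
    starts = [0]
    lines_on_page = 0
    offset = 0  # characters of visible text consumed so far
    for para in paragraphs:
        # Ceiling division; an empty paragraph still occupies one line.
        line_count = max(1, -(-len(para) // CHARS_PER_LINE))
        for i in range(line_count):
            if lines_on_page == LINES_PER_PAGE:
                # The current page is full, so this line opens a new page.
                starts.append(offset + i * CHARS_PER_LINE)
                lines_on_page = 0
            lines_on_page += 1
        offset += len(para)
    return starts
```

Thirty-one one-character paragraphs produce a second page starting at offset 30, while a single 2101 character paragraph produces a second page starting at offset 2100.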

The major disadvantage of the accurate parser, and why it’s not the default, is that it’s slow. It requires the text to be decompressed and parsed. With a PalmDoc compressed file this can take a few seconds but with a HUFF/CDIC compressed file it can take minutes.

The other minor disadvantage of the accurate parser is it cannot work on DRM content. The fast parser can because the uncompressed text length is stored unencrypted in the MOBI header. If the accurate parser is chosen it will fall back to the fast parser for DRM content. So whenever a Mobipocket book is sent to the Kindle (AZW, MOBI, PRC) an APNX file can and will (unless disabled) be generated.

One thing I will note about the accurate parser is it currently ignores all markup and only looks at text. Meaning it can be made even more accurate by accounting for <div class="mbp_pagebreak" />, <br>, <hr>, images, margins, and font size changes. I do plan to add support for most if not all of these in the future, but since most books people read on their Kindle are pretty much all text, and because the accurate parser does a good enough job giving page numbers that correspond to the page length in a paperback book, I don’t see a pressing need to spend the time on it at this moment.



* Calibre Week in Review

Posted on February 7th, 2011 by John. Filed under calibre.


Once again this is a big week with a lot of little changes, the majority of which were related to TXT input.

Format Specifications

I was thinking about the fact that, for all of the formats I support, I use the format specification to know how reading and writing should happen, and that these specifications aren’t part of calibre proper. I have a set of documents that outline what is known about each format I handle. I say what is known because in some cases (eReader) the binary format is reverse engineered and a lot of it is guesswork. The documents I use are partly a collection of information available in (sometimes many) different places and some of it is my own work. I’ve now added these documents to calibre proper in the top level format_docs directory. Hopefully people will find this useful and it will help others work on these formats.

GUI

Recently there was a request to add auto complete (just like it is in tags) to the authors metadata field in the GUI. I added this a few versions ago and it caused an uproar. Many people loved the feature and many people hated how, after completing, it would add the completion character at the end of the completion. Even though the completion character is removed when you save the changes, a small group of vocal users didn’t like the way it looked while editing. Kovid changed completion so that the separator character isn’t inserted after completion and, since I as well as others liked the old behavior, said that I should re-implement it as a tweak. So in 0.7.45, set the tweak completer_append_separator to True to have it insert the separator character after completion.
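To make the two behaviors concrete, here is a small sketch of separator-based completion. It is not calibre's completer: the function and its `append_separator` parameter are hypothetical and only mirror the effect of the completer_append_separator tweak described above.

```python
def complete_last_item(field_text, candidates, separator=",", append_separator=False):
    """Complete the item being typed after the last separator.

    Returns the field text with the first matching candidate substituted
    for the partial item, or the text unchanged when nothing matches.
    """
    head, sep, tail = field_text.rpartition(separator)
    prefix = tail.strip()
    if not prefix:
        return field_text
    matches = [c for c in candidates if c.lower().startswith(prefix.lower())]
    if not matches:
        return field_text
    result = head + sep + " " + matches[0] if sep else matches[0]
    if append_separator:
        # Mirrors the tweak: leave the separator in place for the next item.
        result += separator
    return result


print(complete_last_item("Jane Doe & Jo", ["John Smith"], separator="&"))
print(complete_last_item("fiction, sc", ["scifi"], append_separator=True))
```

Tags use a comma and authors use an ampersand, so the same routine can serve both fields by changing the `separator` argument.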

Heuristic Processing

Lee and I did some more work on Heuristics. Mainly he did the work. I’ve pretty much just been getting the options set up on the command line and in the GUI for him. There is a new option for replacing soft scene breaks with a hard scene break. The replacement text is user defined but the history drop down comes preloaded with a number of common cases.

I did a little heuristic work myself. Namely, I tweaked the italicize patterns to make them more robust and in the process I simplified them.

FB2

FB2 output was updated to handle creating soft scene breaks based on empty paragraphs and top margins. Because FB2 does not specify how the document is supposed to look (this is left to the reader software, elements only define type not layout) I chose inserting blank lines between paragraphs to create scene breaks.

PML

PML Input had some tweaks regarding soft scene breaks. I reduced the number of empty lines between paragraphs needed to create a soft scene break. I haven’t seen any documents that need this change. However, the more I thought about how it was handled, I realized that a valid document can use fewer lines.

Now that PML Input retains soft scene breaks it’s only natural to have PML output write them. Empty paragraphs and margin based spacing are both accounted for. In addition I added support for left margins being retained in the resultant PML.

There was one small bug fix. Looking over the PML docs again I noticed that \c and \r codes need to be closed on the next line following their opening. I modified the output code to ensure this happens. There was some general work to produce cleaner output as well.

While I was working on the above I decided that since I previously changed PML input to create a multi-level TOC that I should also have PML output write a multi-level TOC. Currently this is based on the tags being pointed to by the TOC items and by them not being headings. Only \Cn TOC markers are supported at this time. \Xn markers are going to need a bit more work.

TXT

TXT input paragraph processing was restructured so paragraph transformations are always applied. Previously they were not being applied when Markdown or Textile formatting was used. A user on MobileRead had modified their TXT file and simply added #’s in front of the headings to have them formatted in the output. The user did not make any other changes to have their document conform to Markdown and the resultant output was not very nice. It seems very common for users to simply stick Markdown or Textile formatting into their documents and rely on calibre to clean them.

Dehyphenation of TXT input was tweaked. It now looks for the heuristics and dehyphenate options to be enabled. In this case it will be run over all TXT input including Markdown and Textile formatted documents.

There were a few bug fixes related to various issues. Spaces at the beginning of lines were not properly preserved. Spaces within documents were getting converted to entities when they shouldn’t. A regression that broke block formatted paragraphs was fixed.

Print formatted documents now have the indents retained.

For people like me who do not like indented paragraphs I’ve added an option to remove indents from TXT input documents.

There was one small bug fix in TXT output and that was to have TXT output show all TOC items. Previously it was only showing top level items.

TXTZ

I’ve added support for a new pseudo format called TXTZ. It’s essentially just TXT files put into a zip archive with the extension .txtz. It can contain images which should make working with Markdown and Textile formatted text easier. Also, it has metadata support via an OPF file called metadata.opf within the archive. This OPF file will be referenced for metadata reading and writing. Both input and output support for TXTZ has been added.
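Because TXTZ is just a zip archive, one can be assembled by hand. The sketch below uses Python's standard zipfile module; the member name index.txt, the image name and the minimal OPF string are assumptions for illustration, since the post only fixes the .txtz extension and the metadata.opf name.

```python
import io
import zipfile


def build_txtz(text, opf_xml, images=None):
    """Return the bytes of a TXTZ-style archive.

    'index.txt' is an assumed member name; the post only specifies that
    the archive holds the TXT content, optional images and metadata.opf.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.txt", text)
        zf.writestr("metadata.opf", opf_xml)
        for name, data in (images or {}).items():
            zf.writestr(name, data)
    return buf.getvalue()


data = build_txtz(
    "# Title\n\n![cover](images/cover.png)\n",
    "<package/>",                            # placeholder, not a valid OPF
    {"images/cover.png": b"\x89PNG"},        # placeholder bytes, not a real image
)
with open("book.txtz", "wb") as fh:
    fh.write(data)
```

Keeping the image path in the Markdown relative to the text file is what lets the reference resolve inside the archive.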



* Calibre Week in Review

Posted on January 31st, 2011 by John. Filed under calibre.


This is really a two week in review because I didn’t do one last week. The past two weeks I didn’t focus on major changes. I mainly spent my time with little tweaks and closing out bugs. All of these little changes didn’t feel like I accomplished very much but getting them all together it turns out I did quite a bit over the last two weeks.

GUI

The GUI saw quite a few usability changes. I’ve added auto complete to the authors field. This works the same as the tags field but starts completion with the & character instead of ,. There were a few issues that users pointed out relating to this change but they have all been corrected. It turns out that the issues were present with the tags completion but no one had noticed.

I added a confirmation dialog when stopping a running job. There is the possibility of a job finishing before the user confirms but I don’t see that as an issue. The user either wanted to stop the job and it’s stopped or they don’t and it finished properly.

The Regex Builder window saw some changes. Search next and previous were added so the user can cycle through the matched items more easily. Also, when clearing the regex text entry the highlighting will automatically clear. There were also some tweaks to remove the delay caused when testing without any input text. I’ve also implemented caching so each time the wizard is opened it won’t reconvert the input document. It saves the result and just displays it each time. The Search and Replace dialog also makes use of the caching across each search and replace field.

There was also some more work done with making entries into history entries. The Regex input fields (search and replace) now store previous entries. The filename import in Add books also saves previous regexes used for importing books.

The last GUI change was with the Send Specific Formats to Device dialog. It now only displays formats that are present and/or convertible. It also tells the user how many of the total number of books are in a particular format. It also notes which formats are convertible and which are not. All items in the dialog are also sorted from most to least preferred.

Heuristics

Italicize common cases saw some tweaks to the matching patterns to make them more robust. I foresee this being a weekly occurrence for some time.

RTF Output

A number of RTF output bugs were fixed, including incorrect spacing between letters and missing spaces around italicized text. Also, the generated markup was greatly altered. It is simpler and produces more consistent results. It also allows for h tags to be turned into RTF style headings. So when converting from calibre generated RTF to, say, EPUB the headings will carry over properly as headings. I still consider RTF output a work in progress and relegate it to experimental status.

FB2 Output

The language is now set correctly.

PML Input

Soft scene breaks are now retained. PML also saw a bug fix relating to the \T tag. The biggest change to PML input is support for multi-level table of contents. Previously the TOC from a PML file was flattened. Now the levels are properly retained.

TXT Input

Like PML input, TXT input now retains soft scene breaks between paragraphs. I also changed heuristic processing on TXT input to not enable preserving whitespace. Instead whitespace at the beginning of a paragraph is maintained by default. Also I rewrote the preserve whitespace function to only use &nbsp; when necessary instead of in place of every regular space.
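One way to read "only when necessary" is that a space needs protecting only where HTML rendering would otherwise collapse it: at the start of a line and in runs of two or more spaces. The sketch below implements that reading; it is an assumption about the intent, not a copy of calibre's function.

```python
import re


def preserve_spaces(line):
    """Protect only the spaces that HTML rendering would otherwise drop."""
    # Leading spaces are collapsed entirely, so protect each of them.
    line = re.sub(r"^ +", lambda m: "&nbsp;" * len(m.group(0)), line)
    # In a run of two or more spaces, keep one real space and protect the rest.
    return re.sub(r" {2,}", lambda m: " " + "&nbsp;" * (len(m.group(0)) - 1), line)


print(preserve_spaces("a b"))     # single space left untouched
print(preserve_spaces("a   b"))   # one space kept, two protected
print(preserve_spaces("  a"))     # leading indent protected
```

Leaving single spaces alone keeps normal line wrapping intact, which replacing every space with a non-breaking one would prevent.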

TXT Output

Textile formatted output is now supported. This complements Markdown output and Textile input. Soft scene breaks are now detected and written. The scene breaks can either be empty paragraphs or defined by CSS top margin.



* Calibre Week in Review

Posted on January 17th, 2011 by John. Filed under calibre.


TXT input got some more work. It now supports the Textile markup language. This can be used in place of Markdown. Textile is also supported by the new auto-detection in TXT input.

FB2 output had some more bug fixes. The cover image is now put inside of the coverpage element in the metadata header. This is per the FB2 spec. However, the calibre ebook-viewer does not currently display the cover image that is part of the metadata header. Calibre’s FB2 metadata reader will read the cover image.

PML input had a bug fixed dealing with the \t and \T tags. They are now handled properly and will indent the entire line. This had been somewhat fixed previously but the previous fix would only work when those tags would start and end the line.

At a user’s request I’ve reworked the author fields throughout the GUI. Authors are now auto completed using the & symbol just like tags are auto completed using a ,. This makes adding multiple authors much easier. This change was actually fairly large and a lot of work. I refactored the auto complete classes for tags into a generic set of auto completion classes. Then I reworked each author field to use the new classes.

All of the above changes have made it into trunk and are either in the current release (0.7.40) or will be in the next release (0.7.41). The following changes are still being finished and will need Kovid’s review before being merged into a release.

Lee Dolsen and I had worked on TXT input last week and our partnership continued this week. He had created a variety of heuristic processing functions a while back. The heuristics processing would be used when the --preprocess-html option was enabled. We’ve broken the --preprocess-html function into individual options:

  • --enable-heuristics
  • --markup-chapter-headings
  • --italicize-common-cases
  • --fix-indents
  • --html-unwrap-factor=HTML_UNWRAP_FACTOR
  • --unwrap-lines
  • --delete-blank-paragraphs
  • --format-scene-breaks
  • --dehyphenate
  • --renumber-headings

The majority of the heuristic code is his. I helped to make the infrastructure changes to accommodate the options on the command line and in the GUI. I also added --italicize-common-cases as a heuristic function and removed it from only working in TXT Input. I also made the necessary changes to the conversion pipeline so the heuristics will run over all input types. Currently the --preprocess-html option does not run over EPUB input. Lee did all the work to change the heuristic code to work as individual options as well as adding some extras and cleaning up some existing parts.

While Lee was making most of the heuristics changes I took the time to rework the --remove-header and --remove-footer options. Those two as well as their related regular expression options have been removed. Instead I’ve created three sets of generic search and replace options. They are much more flexible and also not as misleading about what they do. My hope is to eventually have a heuristic function for removing headers and footers that does not require regular expressions.
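The idea of several generic search and replace pairs applied in order can be sketched like this. The function name, the rule list and the sample patterns are all hypothetical; they only illustrate how header and footer removal becomes one use among many.

```python
import re


def apply_search_replace(text, rules):
    """Apply each (pattern, replacement) pair to the text in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Three illustrative rule sets, echoing the three option sets mentioned above.
rules = [
    (r"(?m)^Page \d+$\n?", ""),                       # drop a running header line
    (r"(?m)^[ \t]*-[ \t]*\d+[ \t]*-[ \t]*$\n?", ""),  # drop "- 3 -" style footers
    (r"\bteh\b", "the"),                              # an unrelated cleanup
]

print(apply_search_replace("Page 12\nHello teh world\n- 3 -\n", rules))
```

Nothing in the routine knows about headers or footers, which is the flexibility gained over dedicated remove-header and remove-footer options.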


* Calibre Week in Review

Posted on January 9th, 2011 by John. Filed under calibre.


This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We’ve worked together before improving PDF input. Since then he’s done a lot of work with preprocessing of PDF and other not so clean input.

TXT input now auto detects the character encoding of the file. It isn’t 100% accurate but should work for the majority of cases. It’s using chardet for the detection. Unfortunately, cp1252 is the most common encoding that gives people issues and unless you’re using things like smart quotes and curly apostrophes it doesn’t always detect properly.

I started getting TXT input to detect the document structure. Mainly, are the paragraphs arranged in block, single line, or print fashion? Lee saw the detection code and, by modifying some of his preprocessing code, was able to greatly increase the detection accuracy over my initial work. He’s also added an unformatted type that assumes the text is one big blob and tries to determine paragraphs in much the same way PDF input does, by unwrapping based upon punctuation and other factors.

In addition to detecting the paragraph style used in the document, TXT input now tries to detect the formatting of the text content. Markdown formatted text is detected. I’ve also added a heuristic processor which runs by default if either Markdown is not detected or if the user has not specified the formatting as none (which disables any type of formatting processing).

The heuristic processor uses some ideas from GutenMark. Specifically, italicizing common words and certain conventions used in Project Gutenberg texts that denote italics. I started working on a set of heuristics to detect chapter headings but Lee quickly pointed out he had already created something similar using regular expressions in his preprocessing code. I quickly began using it in my heuristic processor and it’s working well. Chapter headings and subheadings are now formatted with the appropriate h tags. He has some plans to enhance the detection further using a word list.
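As an illustration of the convention-based part, the sketch below turns _text_ and /text/ into italic markup. The two patterns are simplified examples written for this note, not the expressions calibre uses, and they deliberately refuse to match across angle brackets so that markup such as a self-closing tag is left alone.

```python
import re

# Illustrative patterns only; calibre's real italicize patterns differ.
ITALIC_PATTERNS = [
    # _text_ where the underscores are not glued to surrounding word characters
    re.compile(r"(?<![\w<])_(?P<words>[^_\n<>]+)_(?!\w)"),
    # /text/ where the slashes are not part of a tag, a URL or a word like and/or
    re.compile(r"(?<![\w<:/])/(?P<words>[^/\n<>]+)/(?![\w>])"),
]


def italicize(text):
    """Wrap text marked with underscores or slashes in <i> tags."""
    for pattern in ITALIC_PATTERNS:
        text = pattern.sub(r"<i>\g<words></i>", text)
    return text


print(italicize("He was _very_ late"))
print(italicize("a /really/ big deal"))
print(italicize("<br /> </p>"))
```

The last call shows why the character classes exclude angle brackets: without that guard a slash pattern can pair the slash in one tag with the slash in the next.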

TCR, PDB PalmDoc and PDB zTXT inputs all pass the extracted text to the TXT input plugin for processing. This allows them to take advantage of all the work that’s gone into TXT input. Also, with auto detection now being part of TXT input it should allow for one time conversion instead of convert, check, tweak some options, convert again. Especially since these formats don’t make it easy to see how the text is structured within the file without first converting.

TXT input wasn’t the only part of TXT support that was touched. I spent some time cleaning up the TXT output. Consistent spacing is now created around headings. Also, when using the --remove-paragraph-spacing option, headings are not indented with a tab. The output now looks much cleaner and I consider it acceptable for reading.

Not to be left out, FB2 output got a small bug fix. With all the work rewriting it I broke having it read covers. If you were converting an EPUB, for instance, that specified the cover (or title page) in the guide rather than the spine it would not be included. Also, the --cover option was being ignored. Now that’s fixed and external covers are inserted properly.


* Back to work on calibre

Posted on December 5th, 2010 by John. Filed under calibre.


I’ve become active in contributing to calibre again. So far I’ve been focusing on fixing issues related to the output formats I maintain. I’ve been focusing on FB2 and TXT output at the moment.

With FB2 output my goals were to fix as many bugs with it as possible and to produce 100% valid output. The first goal corresponds nicely to the second because most of the open bugs dealt with invalid markup.

FB2 output underwent some very large changes with a large amount of code being re-written. Also, I’ve removed a number of options. The idea is to simplify the code while working toward valid output and to remove options that were really just workarounds for invalid output in certain cases.

Overall I’m pleased with the FB2 output changes. It’s 100% valid (at least with the test book I ran through it) and the code is simpler. As always if any issues are found with the output a ticket would be appreciated.

TXT output had one small bug fix and one major change. TXT output can now produce Markdown formatted text. However, I’m not fully satisfied with the markdown generation. I didn’t spend much time with it and as of now it doesn’t appear to be taking css styling into account. I only pushed the xhtml from the OEB intermediate stage into html2markdown. I need to spend some more time with it. My fear is I may have to abandon the use of html2markdown if it’s unable to cope with css.

One other change with me getting back into calibre development is my working branch. I’ve changed it to lp:~user-none/calibre/dev because of some issues relating to my previous branch and some failures with upgrading the branch format.
