eBook Paragraph Formatting

Today I wrote two simple programs to help me clean up my ebooks. I prefer to keep my ebook collection as plain text files with paragraphs separated by a blank line. The first program reflows the paragraphs to put each on a single line. The second removes extraneous whitespace from the file.

The reflow is the more intensive of the two. I ran it on the largest ebook I have, Project Gutenberg’s War and Peace by Leo Tolstoy. The file is 3.1 MB.

Time to run: 7m35.494s.
Memory usage: 13.1 MB according to gnome-system-monitor.

Right now I’m loading the entire book into memory and using QStrings to work on it. Memory usage is a little over 4 times the size of the book (13.1 MB for a 3.1 MB file). Thankfully, plain text ebooks are fairly small. Later I’m going to look into optimizing it for size and, hopefully, speed.
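One way the memory use could be reduced is to process the book a paragraph at a time instead of loading it whole. The following is only a sketch under some assumptions: it uses standard C++ streams rather than Qt so it can be tried on its own, the names reflowStream and reflowString are made up for illustration, and it treats only completely empty lines as paragraph separators.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the programs in this post): copies `in`
// to `out`, joining the lines of each paragraph with a single space.
// Only one paragraph is held in memory at a time.
void reflowStream(std::istream& in, std::ostream& out)
{
    std::string line;
    std::string paragraph;

    while (std::getline(in, line)) {
        if (line.empty()) {
            // A blank line ends the current paragraph, if there is one.
            if (!paragraph.empty()) {
                out << paragraph << '\n';
                paragraph.clear();
            }
            out << '\n';
        } else {
            // Non-blank line: append it to the paragraph being built.
            if (!paragraph.empty()) {
                paragraph += ' ';
            }
            paragraph += line;
        }
    }

    // Flush the last paragraph if the input did not end with a blank line.
    if (!paragraph.empty()) {
        out << paragraph << '\n';
    }
}

// Convenience wrapper for trying the function on an in-memory string.
std::string reflowString(const std::string& text)
{
    std::istringstream in(text);
    std::ostringstream out;
    reflowStream(in, out);
    return out.str();
}
```

Because the output goes to a separate stream, a version built on this idea would write to a temporary file and swap it in afterwards rather than rewriting the ebook in place.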

Without further ado, here are the two programs. They are MIT licensed and use the Qt toolkit.

fix_paragraphs_ebook_txt.cpp

/*
Copyright (c) 2008 John Schember <john@nachtimwald.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
Reflows txt file ebook paragraphs. Paragraphs should be separated by a blank
line. Takes paragraphs that have hard breaks and puts all lines onto a single
line.

For Example:

INPUT

This is a multi line paragraph. It comprises
a few lines but has hard
breaks.

Now for the second
broken apart paragraph.

OUTPUT

This is a multi line paragraph. It comprises a few lines but has hard breaks.

Now for the second broken apart paragraph.
*/

#include <QFile>
#include <QObject>
#include <QRegExp>
#include <QString>
#include <QTextStream>

int main(int argc, char** argv)
{
    // Stream to write errors to the console.
    QTextStream errStream(stderr);

    // Regular expression to search for broken paragraphs. Works by looking
    // for char newline char. A proper ebook should have paragraphs separated
    // by a blank line meaning char newline newline char.
    QRegExp re("[^\n]\n[^\n]");
    
    // Store for the contents of the ebook.
    QString content;
    
    // We need an ebook file to work on.
    if (argc != 2) {
        errStream << QObject::tr("Error: No input file") << endl;
        return 1;
    }
    
    QFile ebook(argv[1]);
    if (!ebook.open(QIODevice::ReadWrite | QIODevice::Text)) {
        errStream << QObject::tr("Error: Could not open") << endl;
        return 1;
    }
    
    // We use a QTextStream to actually work on the file.
    QTextStream ioStream(&ebook);
    // Read the entire file contents into memory.
    content = ioStream.readAll();
    
    while (content.contains(re)) {
        // indexOf() returns the position of the character before the
        // newline, so +1 is the newline itself. Replace it with a space.
        // replace() modifies content in place.
        content.replace(content.indexOf(re) + 1, 1, " ");
    }
    
    // Truncate the ebook so we don't end up with the original contents after
    // our modified contents.  
    if (!ebook.resize(0)) {
        errStream << QObject::tr("Error: Could not truncate file") << endl;
        return 1;
    }
    
    // Store the modified content back on disk.
    ioStream.seek(0);
    ioStream << content;

    ebook.close();

    return 0;
}
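The search loop in the program above scans the text from the beginning twice for every newline it fixes (once in contains() and once in indexOf()), which is the likely reason a 3.1 MB book takes minutes. A single pass over the text should give the same result. The sketch below is an assumption-laden illustration, not the program's actual code: it uses std::string instead of QString so it compiles without Qt, and joinLoneNewlines is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical single-pass alternative to the regular expression loop:
// replaces every newline that has a non-newline character on both sides
// with a space. Newlines that are part of a blank-line separator are kept.
void joinLoneNewlines(std::string& text)
{
    for (std::size_t i = 1; i + 1 < text.size(); ++i) {
        if (text[i] == '\n' && text[i - 1] != '\n' && text[i + 1] != '\n') {
            text[i] = ' ';
        }
    }
}
```

Like the regular expression, this leaves the very first and very last characters alone, since a match needs a character on each side of the newline.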

remove_extra_whitespace_ebook_txt.cpp

/*
Copyright (c) 2008 John Schember <john@nachtimwald.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
Removes extraneous whitespace in a txt file ebook. Each line is run through
QString::simplified(), which strips leading and trailing whitespace and
replaces each run of internal whitespace (' ', '\t', '\v', '\f', '\r') with a
single space.

For Example:

INPUT

      This     is a bad                          line.

Now for  the     second     borken line.       

OUTPUT

This is a bad line.

Now for the second borken line.

*/

#include <QFile>
#include <QObject>
#include <QString>
#include <QTextStream>

int main(int argc, char **argv)
{
    // Stream to write errors to the console.
    QTextStream errStream(stderr);
    
    // Store for the contents of the ebook.
    QString content;

    // We need an ebook file to work on.
    if (argc != 2) {
        errStream << QObject::tr("Error: No input file") << endl;
        return 1;
    }
    
    QFile ebook(argv[1]);
    if (!ebook.open(QIODevice::ReadWrite | QIODevice::Text)) {
        errStream << QObject::tr("Error: Could not open") << endl;
        return 1;
    }
    
    // We use a QTextStream to actually work on the file.
    QTextStream ioStream(&ebook);
    
    // Read every line and remove the extras we don't want.
    while (!ioStream.atEnd()) {
        content += ioStream.readLine().simplified() + "\n";
    }

    // Truncate the ebook so we don't end up with the original contents after
    // our modified contents.  
    if (!ebook.resize(0)) {
        errStream << QObject::tr("Error: Could not truncate file") << endl;
        return 1;
    }
    
    // Store the modified content back on disk.
    ioStream.seek(0);
    ioStream << content;

    ebook.close();

    return 0;
}
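For anyone curious what the per-line cleanup amounts to, here is a rough standard C++ approximation of what simplified() does to each line. It is a sketch only: simplifyLine is a made-up name, and it recognizes just the ASCII whitespace characters that std::isspace knows about in the default locale, whereas QString works on Unicode text.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for QString::simplified() on a single line:
// drops leading and trailing whitespace and collapses each internal run
// of whitespace into one space.
std::string simplifyLine(const std::string& line)
{
    std::string out;
    bool pendingSpace = false;

    for (unsigned char c : line) {
        if (std::isspace(c)) {
            // Remember the gap, but only if a word has already been written.
            // This is what discards leading whitespace.
            pendingSpace = !out.empty();
        } else {
            // Emit the remembered gap only when another word follows it.
            // This is what discards trailing whitespace.
            if (pendingSpace) {
                out += ' ';
                pendingSpace = false;
            }
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

A tab or carriage return between two words becomes a single space rather than disappearing, so neighbouring words are never glued together.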