





 Summary 

Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, and unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well.


 Background 

Project Gutenberg and other etext projects turn public domain books into etexts and post them publicly on the Internet. The volunteers who make these etexts either type them in by hand, or scan them and use OCR to turn the images into text. Both typists and OCR software make mistakes.

As a Project Gutenberg helper, I double-check texts before release. At first, I looked for errors by simple reading, but as I gained experience, I saw the same types of errors occurring over and over again. I wrote editor macros to search for them, but some of the typical errors, like mismatched quotation marks, needed more logic than a simple macro offered.

I wrote gutcheck to report these kinds of issues, and added more search functions as I found more types of errors. Several PG volunteers asked for copies so that they could check their own texts, and it became a common utility within the project.
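
To give a feel for why mismatched quotation marks need more logic than an editor macro, here is a minimal sketch of one possible check. It is not taken from gutcheck.c: the function name count_odd_quote_paragraphs is hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that a paragraph with an odd number of double quotes is worth reporting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from gutcheck.c: returns the number of
   paragraphs in `text` that contain an odd number of '"' characters.
   Assumption: paragraphs are separated by one or more blank lines. */
int count_odd_quote_paragraphs(const char *text)
{
    int flagged = 0;       /* paragraphs with an odd quote count */
    int quotes = 0;        /* quotes seen in the current paragraph */
    int line_is_blank = 1; /* nothing but whitespace on this line so far */
    const char *p;

    assert(text != NULL);

    for (p = text; ; p++) {
        if (*p == '\n' || *p == '\0') {
            /* A blank line, or the end of the text, closes the paragraph. */
            if (line_is_blank || *p == '\0') {
                if (quotes % 2 != 0)
                    flagged++;
                quotes = 0;
            }
            if (*p == '\0')
                break;
            line_is_blank = 1;
        } else {
            if (*p == '"')
                quotes++;
            if (*p != ' ' && *p != '\t' && *p != '\r')
                line_is_blank = 0;
        }
    }
    return flagged;
}
```

Even this simple version would complain about texts where a speech legitimately runs over several paragraphs and only the last one carries the closing quote, which is the sort of case that makes such checks harder than they first look.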


 Download 

Choose your preferred download:

  • Click here for gutcheck 0.991 with source and Win32 executable. The source should compile on most platforms.
  • Here is a version of 0.98 with a Mac OS X (PPC) binary.

If you prefer to work in a Windows GUI, I highly recommend Thundergnat's excellent Guiguts, which includes gutcheck and a vast number of other features.


 Technical 

Gutcheck is of no technical interest whatsoever. It wasn't designed: it "just growed". Its source code is a single ANSI C file. It should compile on any *nix or MS-DOS-based platform; if it doesn't, tell me.

On a Unix-based platform, putting gutcheck.c into a directory and typing

make gutcheck

is all you should need to do to make an executable; make's built-in rules can build gutcheck from gutcheck.c without a Makefile.


 Development Plans 

Gutcheck is nearly feature-complete as a format checker. There are some things it should do, but doesn't. These include correct matching of single quotes, and better sentence-punctuation checking (a period followed by a capital letter). Ideas for these and other checks are welcome, but beware!--many checks that seem reasonable on some texts work badly on others.
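
As an illustration of how a reasonable-looking check can misfire, here is a minimal sketch of a naive sentence-punctuation test. It is not gutcheck's code: the function name count_period_lowercase is hypothetical, and the rule it applies (a period, then spaces or a line break, then a lowercase letter) is only one assumed reading of "period and a capital letter".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not from gutcheck.c: counts places where a period
   is followed by at least one space or newline and then a lowercase
   letter, i.e. where a sentence appears to start without a capital. */
int count_period_lowercase(const char *text)
{
    int hits = 0;
    size_t i;

    assert(text != NULL);

    for (i = 0; text[i] != '\0'; i++) {
        size_t j;

        if (text[i] != '.')
            continue;

        /* Skip the whitespace that follows the period, if any. */
        j = i + 1;
        while (text[j] == ' ' || text[j] == '\n')
            j++;

        /* Require some whitespace, so "3.14" and "e.g" are not hits. */
        if (j > i + 1 && islower((unsigned char)text[j]))
            hits++;
    }
    return hits;
}
```

On "It rained. we left." this reports one suspect, as intended. It also reports one on "Fruit, e.g. apples, is good." and on "Well... yes.", neither of which is an error, so abbreviations and ellipses alone would generate false positives on many texts.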

Possible sideways developments for gutcheck include a variant without all of the PG-specific constraints, for use in other etext projects.


 Non-Development Plans :-) 

I've received many suggestions about adding functions to gutcheck. Here are some roads I'm not going to take, and why:


More spell-checking

The minimal English-typo support gutcheck contains is almost accidental, growing out of a small corner of my early research and experiments for gutspell, and I've never tried to implement it seriously--I haven't even optimized typo-lookup! However, it is useful as a spellcheck-detector--if gutcheck finds typos, then the whole text needs a spellcheck.

I appreciate the offers I've received for adding spellcheck functionality, including other languages, to gutcheck, but there's no point in rewriting ispell. Anyhow, standard word-by-word spellcheck methods are not good enough for PG work, regardless of the dictionary used. Too much depends on context: the word "modem" (a scanno for "modern") in an 18th-century novel should certainly be flagged, whereas dialect words shouldn't be, since having to dismiss many flagged dialect words causes fatigue, which may lead the proofer to overlook genuine errors. I hope to make gutspell a spelling-checker that can be used for PG texts without these disadvantages, and future development will focus on gutspell rather than gutcheck. Gutspell is not released in any form yet, but now that I've cleared the decks by releasing gutcheck, I hope gutspell can progress without distraction.


Automatic fixes for errors

It's tempting, for at least some classes of errors, to offer an automatic change to the text to fix the error, in the way that many spellcheckers do, but I've decided against doing that in gutcheck. There are two main reasons. First, the "fix" would be too likely to be wrong (we all know what happens when you accept all spellcheck-suggested fixes!). Second, I find that errors often occur in clusters--if there's an error on line 1234, odds are good that there's another nearby that gutcheck didn't detect--so it's a good idea to have a look around.


Unbounded paranoia

Gutcheck is reasonably paranoid by default--enough so that it probably reports about 30% "false positives" on a typical text, where a "false positive" is a flagged item that isn't really an error. I reckon that it catches about 50% of all errors (including spelling). However, I'm balancing paranoia against false positives. More checks mean more false positives. Some users want more checks, but the usual price is an increase in the number of reports you have to ignore. On the most common type of text, simple paragraphs of prose using common words, the number of false positives generated by your idea may be low, but the same check may clutter the reports on other texts.

Before you propose a patch, test your idea on a hundred or so PG texts, including not only novels, but poetry, plays, and 19th-century scientific papers. Gutcheck isn't ideal for these at the moment, but it's a reasonable compromise, and I don't want to make things worse by requiring users to take a Doctorate in Switches before running it.



 License 

Gutcheck is Free Software distributed under the GNU General Public License (GPL).






