.lm +6
.C;Spelling Checker Program Documentation
.c;SPELL__PROGRAM.RNO
.C;30-Nov-87
.c;Jack Harvey
.c;National Data Systems
.c;299 Market Street
.c;Saddle Brook, NJ 07661
.c;201 843-5300
.TITLE Spelling Checker Program Documentation
.B2
.AP
.PARAGRAPH 0
.HL1 Introduction
This is a COBOL implementation of the Fortran spelling checker
program that has ricocheted through RSX and VMS tapes for many
years. The main dictionary is the 91,707 word version that Tom
Wolfe of Jet Propulsion Laboratory included it with his EVE based
spelling checker on the Fall '86 tape (86C). The code is all
my fault.

The major improvements (aside from being written in COBOL) are
faster performance, a very flexible dictionary extension scheme and
a standard RMS indexed file format requiring about half the storage of
most other implementations. 

The following sections discuss:
.list "o"
.le;Installation
.le;Overall Description
.le;Required Files and Cleanup
.le;Performance Strategy
.le;Logical Name Summary
.end list
.hl 1 Installation
I think the simplest initial installation is to put the entire
content of the distribution directory in your selected device and
directory and then make the edits described below.  Later, after
you get it right, you can review the cleanup section to reduce
storage requirements.  The rest of this documentation assumes you
want to put the package in $DISK1:[COB.SPEL]. Just replace this
string with your preferred device and directory. 
.hl 2 Installing _the Foreign Command
Edit your login command file to include the following lines:
.literal

$	SPEL*L :== @$DISK1:[COB.SPEL]SPELL
$	DEFINE NDS_SPELL $DISK1:[COB.SPEL]

.END LITERAL
If you are a system manager and want the spell program to
be available to all users, put these lines in your SYLOGIN.COM.
.hl 2 Avoiding Special Logical Names
Some groups have an aversion to defining willy-nilly logicals.
If yours is such, omit the second line above defining NDS__SPELL.
Instead, edit SPELL.COM to add the following at the beginning:
.literal

$	DEFINE NDS_SPELL/USER_MODE $DISK1:[COB.SPEL]

.end literal
This will define the logical name only for the single execution
of the program.  It then disappears.
.hl 2 User Documentation
The user guide is SPELL.RNO, written in Digital Standard Runoff.
To get a readable version, do RUNOFF SPELL, and then print the
resulting SPELL.MEM file on (preferably) a lower case printer. 
.hl 1 Overall Description
This section assumes you have read the user documentation and
used the program in its two modes, so that you know what to
expect of it. If you haven't done this, the following text will
self destruct. The spell program consists of four major
components: 
.list
.le;SPELL.EXE, the spelling engine.
.le;DICTIONARY.IDX, an RMS indexed file constituting a 
91,707 word dictionary.
.le;POPULAR.WORDS - a list of the 1,000 most frequently used
words. (A list of the 3,000 most frequent is used for large
documents.)
.le;SPELL.COM, the command file that does various pre-execution
checks and ties it all together.  This is what you talk to and
fiddle with if you don't like my command format and dictionary
names. 

.end list
.hl 2 The Engine
SPELL.EXE is controlled entirely by logical names defined by
the command file.  These specify the dictionaries to be used and
other options such as an output report file. 

The engine takes ^&no\& input from your keyboard (SYS$INPUT). It
has a few messages to SYS$OUTPUT which signal error counts
something extraordinary, usually a missing input file (text or
dictionary). Since SPELL.COM is supposed to check for file
existence before starting the engine, you should never get these. 

The engine has to examine the text to be checked and isolate
words for testing.  This is probably the dirtiest part of the
code, and the most controversial.  How do you define a word?
Clearly, white space at either end is a good idea, but an
embedded digit or graphic, such as $ seems to make it a non-word.
White space has to include beginning of line and end of line.
What about hyphenated words?  What about contractions (can't),
possessives (Digital's) and plural possessives?  I tried, but you
may not like it. Give me a copy of your improvements, please. 

Words must be longer than one character.  We probably should
have a switch to permit ignoring words of two letters also,
but the false alarm rate due to this is fairly low.
.hl 2 The Dictionaries
The dictionary is in two main parts: one is an 
ordinary RMS indexed file.  The other is a memory resident
array of frequently used words sorted for binary search.
 
.hl 3 Dictionary Disk File
The indexed dictionary file consists of 32 byte fixed length
records, with a single key. Each record key is the entire record.
This is relatively compact (compared to previous implementations
of this file) and very easy to implement and maintain with
the standard RMS utilities. 

And don't let the 91,000 word size of the main dictionary impress
you. Since I don't have a safe algorithm for predicting correct
spellings for suffixes and plurals, ^&all\& variations on a word
must be present as separate entries.  There are many plurals,
verb endings and other parts of English speech that do not follow
regular rules.  You will probably find a surprising number of
words missing from the dictionary that the spell program will
incorrectly report as errors. 

This is why the extension dictionaries, such as NEWWORDS.ADD and
MYWORDS.ADD are so useful. (See User documentation.) It has been
my experience that if you rigorously update these extensions, all
the words in your own active vocabulary missing from the main
dictionary are quickly added and the false alarm rate drops to an
acceptable level. 

To save space,  the original sequential dictionary from Tom Wolfe
is not included with this release. If you need it, just use the
CONVERT utility on the indexed file, specifying a sequential
output file. 

So far we have not had to do any maintenance of the dictionary,
but if we did, I think SORT/MERGE would probably be a simple
answer. This won't do deletes, but that's a most unlikely need,
unless you are a censor.  Datatrieve, of course, should be ideal
for minor maintenance. 

.hl 3 The Popular Words
If we had to go to the disk dictionary for every word, the
performance would be terrible, especially if the disk is being
accessed by several other users at the same time.  Of course, VMS
disk caching helps, but I tried it that way first and didn't like
it. By keeping the hottest 1,000 words in memory and looking for them
with a binary search, the program speeds up dramatically. 

The memory resident sorted array is built at the start of each
execution.  (I hear you screaming, but read the Performance
Strategy section.) The command file, SPELL.COM, constructs a
search list logical which includes a file containing the most
frequently used 1000 words.  The search list also has the various
possible dictionary extension files. See user documentation.  The
search list logical is then used exhaustively (rather than just
the first success) and all files placed on it by the command file
are read, sorted and merged with duplicate elimination. These are
then loaded into a memory resident array with pointers to
facilitate a binary search.  This array is searched first
for each word tested by the program 
before going to the disk resident
dictionary. 

The spelling engine doesn't give a damn how many files go into the
popular words array.  It just gobbles all the files on the search
list. SPELL.COM decides what files will go into the array and
strings them together in the search list POPULAR__WORDS. 
You can easily fiddle with that if you don't like my selections.

The array is 65,000 bytes long.  The words are sorted by size (as
the most significant part of the sort key) so that when the array
is loaded, the words are tightly packed, with zero waste space.
It is organized as sub-arrays of all two letter words, all three
letter words, etc.,  and the length of the word we are testing is
used to select the sub-array. This reduces the number of probes
required for the binary search. 

If we are using the 3,000 most popular words, with all the
extension dictionaries we have (maybe another 100 words) we use
only about 20,000 bytes of the array.  But it's a Demand Zero
allocation, so the unused space ain't real virtual memory. 

The source of the popular words is you.  I took all the
documentation text files from a previous VMS SIG tape, sorted the
words, counted them, and ranked by count.  There were about
100,000 words. So these are the words (and buzzwords and
acronyms) most popular to people contributing to the tape. 
.hl 2 Operating Modes
The engine has two modes of operation determined by whether or
not the user has specified an output file:
.list
.le;If no output file is specified, the logical SPEL__OUTPUT__FILE
is undefined and the engine will display the text file to be
checked on SYS$OUTPUT which should be a VT100 or up. Erroneous
words will be shown in reversed blinking video. (I got this
neat idea from Tom Wolfe.)
.le;If an output file is specified, the logical will define
it for the engine.  The output file will contain only text
file lines in error, numbered, and with erroneous words separated
at the beginning.  The file is suitable for hard copy printing.
.end list 
The first mode is suitable for quickie texts such as letters, while the 
second is best for long documents.

.hl 2 Rebuilding _the Engine
The .EXE file is provided for sites deprived of a COBOL compiler.
If you have one and want to tinker, SPELL.EXE is linked from
four modules: 
.list
.le;SPELL.COB - the main program which reads the input text, isolates
words and formats the output display or file.
.le;POPULAR-WORDS.COB - a subprogram which creates the memory resident
sorted array and performs the binary searches of it.
.le;DICTIONARY.COB - a subprogram which accesses the dictionary
disk file for words not found by POPULAR-WORDS.
.LE;EXTDEF.MAR - a trivial macro program defining the globals (COBOL externals)
needed by some of the system calls.
.end list
A command file, BUILD.COM, is included to compile and link these modules
to produce an executable.  

.hl 1 Required Files and Cleanup

The distribution directory is shown below.  Following each
file size are letters categorizing it.  The categories
are:
.literal

R  Required for program to run.
D  Documentation source - DSR will make readable documentation.
P  Needed to compile or link program.

BUILD.COM                       1 P - compiles and links SPELL.EXE
DICTIONARY.COB                  3 P - subprogram reads .IDX file
DICTIONARY.FDL                  4   - created by CONV, defines .IDX file
DICTIONARY.IDX               3115 R - main 91,707 word dictionary
EXTDEF.MAR                      1 P - VMS global definitions
NEWWORDS.ADD                    2 R - words not in big dictionary
POPULAR-WORDS.COB              14 P - subprogram searches memory array
POPULAR.3000                   53 R - 3,000 most frequently used words
POPULAR.WORDS                  17 R - 1,000 most frequently used words
SPELL.COB                      14 P - main program of SPELL.EXE
SPELL.COM                       5 R - command file controlling SPELL.EXE
SPELL.EXE                      29 R - the engine
SPELL.RNO                      11 D - user documentation
SPELL_PROGRAM.RNO              45 D - this documentation
SPELL_PROGRAM_RNO.ADD           1 D - extension file for above
SPELL_RNO.ADD                   1 D - user document extension.

.end literal
.page
.hl 1 Performance Strategy
In this section, I will discuss "performance" in the sense 
of basic resource use: time, I/O and storage.  There are
other important performance considerations, such as ease of
use and quality of the result, that are not covered.
.hl 2 Test Results
As a base line, I used Tom Wolfe's standalone spelling checker
from the Fall '86 VAX SIG tape, save set VAX86C.  I should point
out that Tom's main contribution was the EVE/TPU interface which
I haven't touched.  Perhaps, after EVE settles down a bit, I may
follow his lead and hook my engine to it. 

The input text being checked was a RUNOFF file of about 30 pages
containing 10,560 words.
Tom's standalone echos the entire input file to SYS$OUTPUT,
whereas my version (in large file mode) displays only text lines
containing errors.  Therefore, to even things up a bit, I
assigned SYS$OUTPUT to the null device. After all, Tom's program
could be easily equipped with a switch to skip error free lines. 

The tests were run in batch mode as the only active job on a
MicroVAX II. Test file and dictionary were on the same RA81 disk.
Swapping and paging were zero. 

Since the performance of my program is critically dependent
on the popular word dictionary, I tested it with no popular
words, and with 1,000 and 3,000 word popular lists.
.literal

                  Wolfe/JPL    No Popular        1,000           3,000
                    Version        Words         Words           Words

Elapsed time           411          541            171             125
CPU time               216          246            108              92
Setup time             1(approx)    1.5            3.9             8.9
Direct I/O count    17,637       10,181          2,086             986
I/O per word         1.670        0.964          0.198           0.093

              All time intervals in seconds.
.end literal

The figures above are all taken from the batch log statistics
report, except for "Setup time".  This is the time from the
RUN command to start of file checking.  This includes image
activation, file opening, loading of private dictionaries,
and in my case, sorting and building the popular word array.
For the tests, I instrumented my program to display this time.
For Tom's program, I estimated it from several manual tests
with a stop watch.
.hl 2  Test Results Discussion
The effect of the large popular word dictionary is pronounced.
Elapsed time with the 3,000 popular words was less than a third
of the previous standalone spelling program.  The reason is
obvious too: direct I/O, which is primarily accesses to the disk
dictionary, was less than six percent of the base line program
figure which averaged about 1.7 accesses per word. 

Somewhat surprisingly, CPU time was cut in half also.  

Of course, my program without the popular word file was
a loser.  The reason is that getting a word out of the RMS
indexed file on disk involves considerable I/O and computation.
But with the 3,000 popular words in memory, less than one word
in ten requires access to the disk dictionary.
.hl 2 Choice _of Popular Word Dictionary Size
So which should we use?  If I am checking a one paragraph letter,
I don't want to stare at my terminal for nine seconds before
anything happens.  But I sure like the performance improvement on
some of the large documentation files I maintain.

Of course, dummy, select the popular word dictionary on the basis
of the input text file size.  That's what SPELL.COM does.  It
selects either the 3,000 or 1,000 word popular files, or no file.
I ran some timing tests and selected the decision values used
by the command file to minimize total elapsed time for various
input file sizes.  (This was done rather slap dash, and the
optimum decision values can probably be fine tuned.  They
may also be system dependent, so you might profitably second
guess me here.)
.hl 2 Why Build _the Popular Word Array _at Run Time?
Why not hard code this array, or at least load a pre-sorted file?
Well, I made a judgement call.  I hope someone checks it on the
instant replay though.  I will if I get time, but my reasoning
for sorting at run time went like this: 
.list
.le;We want to be able to use several (maybe lots) of extension
dictionaries, to correct omissions in the big dictionary,
and to include group, project and document specific acronyms,
etc.
.le;These are ad hoc lists of words we can't count on being
properly sorted.  People make them with EDT, so we have to crank
up a sort anyway. 
.le;If we keep the ad hoc extensions and the popular words
in separate arrays, we must search both sometimes.  By merging
the extensions and the popular words, we have a single binary
search.
.le;The extensions are likely to be "popular" in the environment
they are used.  Acronyms are used precisely because they are
replacements for frequently used phrases. (What the hell does
VAX/VMS stand for, anyway?)
.le;By sorting at run time, it is easy to update the popular
word list and there is no concern about its sequence.  The
files I use are in frequency sequence rather than alphabetic.
.le;It was easy to do.
.end list

Guess which item above was the controlling factor.  My vindication
is the measured performance. My hope is that there is room
for improvement.
.hl 2 Storage Use
Previous implementers have usually put the big dictionary in a
relative file and then invented some scheme to find out where to
start a sequential search.  The performance has usually been
pretty good but of course it is an ad hoc file scheme and usually
has required writing a special program to create, access and
maintain the file(s). 

The relative file approach usually requires fixed length records
as big as the largest word in the dictionary.  (Variable length
records in relative files make little sense.) And as we get
larger and larger dictionaries, the size of this file can
be rather painful. Tom Wolfe's dictionary scheme requires about 6,300
blocks for the 91,000+ words.

Of course, an RMS indexed file has a lot of overhead too:
prologs, multilevel index buckets, bucket splitters and, I
suppose, power steering and call forwarding. But it has one
joyful economy: key compression which in this case turns out to
be significant. By defining the file as 32 byte fixed length
records, with a 32 byte key (the whole record) and then calling
for key compression with the .FDL file, we can about halve the
size of the file.  

The compression works on both ends of the key. Trailing spaces
are compressed, and leading characters the same as the previous
key are omitted. So even though it is a "fixed" length record,
RMS actually stores it variable length.  The result is an average
of about 17 bytes per word, including data and all prolog, index
and record header overhead. See the discussion of Prolog 3 files
in "Guide to VAX/VMS File Applications". 

An obvious way to improve the storage performance of a relative
file would be to do a better job of packing the words.   With an
average word length of 6.5 letters in our dictionary, the 32 byte
fixed length record is mostly wasted space.  One way to improve
on this would be to break the big dictionary into 32 separate
files, each having only words of a single length, tightly packed.
And with a little more brain juice, the same idea can be done
with a single file. 

So why not go all the way and use a five bit byte to represent
the alphabet?  I suppose this is the way the electric typewriters
do it.  But all of these high density packing schemes suffer from
one problem: they require fancy special software to build, access
and maintain the dictionary.  One such scheme widely distributed
on a SIG tape had a bug that caused many of the dictionary words
to be overlooked in the search. The standard RMS indexed
structure presented here can be updated with a DCL command file. 
.appendix Logical Name Utilization
.lm +6
This lists all the logical names used by this program, explains
where they are defined and what they mean or control.  Some hints
about how to tailor execution to your tastes is included.
It will also aid in altering SPELL.COM to suit your needs.

NDS__SPELL defines the device and directory containing the
required components of this spell program.  The definition
is made in one of these: your login file, the system login
file, or (as a system logical) in SYSTARTUP.COM.  This logical
name is used only by SPELL.COM so that if you prefer some
other name, you need only edit that file.

INPUT__TEXT is a user mode logical defining the input text file to
be spell checked.  It is defined by SPELL.COM from the input
parameter P1, after checking for it's existence.  The logical
name is automatically removed following execution.

POPULAR__WORDS is a logical name search list.  It consists
of ^&all\& extension dictionaries and a list of popular words
for the memory resident array.  It is defined by SPELL.COM
in user mode.  The POPULAR-WORDS module of the spelling engine
loads all files on the search list into the array.  If you
want to include other automatic extensions besides those
SPELL.COM provides, you would edit it to add them to this
search list. See "POPS" in the command file.

ENGLISH is a logical name pointing to the main 91,707
word dictionary.  It is defined by SPELL.COM in user mode
and as distributed, points to NDS__SPELL:DICTIONARY.IDX.

SPEL__POPULAR__WORD__SIZE is defined by the POPULAR-WORDS module
in the spelling engine.  It is a text string giving a one
line report of space utilization in the popular words array.
SPELL.COM contains a commented out SHOW LOGICAL to display
this report, and then deassigns it.  If you are interested
in actual size and space utilization by the popular word list,
uncomment the SHOW LOGICAL. Note that since duplicates are
eliminated, the total words shown by this report may be less
that the sum of the individual lists.

SPEL__OUTPUT__FILE is defined by SPELL.COM if P2 is not blank.
That is, if the user requests an output file by specifying it
with parameter 2, the logical name is defined in user mode and
the main module of the engine will create an output file if
errors are found. By defining it permanently, you can force the
program to run with an output file even though you omit P2 from
your command line. 

SPEL__SHOW__NON is simply tested by the engine.  If the logical
is defined, then "non-words", i.e., character strings that are not
tested against the dictionary, are displayed with reversed video.
These usually are strings which include graphics or digits, such
as SYS$LOGIN or DUA0.  If you want to see these, simply define
this logical to any equivalence string and run the program
^&with_ no_ output_ file.\&