.lm +6 .C;Spelling Checker Program Documentation .c;SPELL__PROGRAM.RNO .C;30-Nov-87 .c;Jack Harvey .c;National Data Systems .c;299 Market Street .c;Saddle Brook, NJ 07661 .c;201 843-5300 .TITLE Spelling Checker Program Documentation .B2 .AP .PARAGRAPH 0 .HL1 Introduction This is a COBOL implementation of the Fortran spelling checker program that has ricocheted through RSX and VMS tapes for many years. The main dictionary is the 91,707 word version that Tom Wolfe of Jet Propulsion Laboratory included it with his EVE based spelling checker on the Fall '86 tape (86C). The code is all my fault. The major improvements (aside from being written in COBOL) are faster performance, a very flexible dictionary extension scheme and a standard RMS indexed file format requiring about half the storage of most other implementations. The following sections discuss: .list "o" .le;Installation .le;Overall Description .le;Required Files and Cleanup .le;Performance Strategy .le;Logical Name Summary .end list .hl 1 Installation I think the simplest initial installation is to put the entire content of the distribution directory in your selected device and directory and then make the edits described below. Later, after you get it right, you can review the cleanup section to reduce storage requirements. The rest of this documentation assumes you want to put the package in $DISK1:[COB.SPEL]. Just replace this string with your preferred device and directory. .hl 2 Installing _the Foreign Command Edit your login command file to include the following lines: .literal $ SPEL*L :== @$DISK1:[COB.SPEL]SPELL $ DEFINE NDS_SPELL $DISK1:[COB.SPEL] .END LITERAL If you are a system manager and want the spell program to be available to all users, put these lines in your SYLOGIN.COM. .hl 2 Avoiding Special Logical Names Some groups have an aversion to defining willy-nilly logicals. If yours is such, omit the second line above defining NDS__SPELL. Instead, edit SPELL.COM to add the following at the beginning: .literal $ DEFINE NDS_SPELL/USER_MODE $DISK1:[COB.SPEL] .end literal This will define the logical name only for the single execution of the program. It then disappears. .hl 2 User Documentation The user guide is SPELL.RNO, written in Digital Standard Runoff. To get a readable version, do RUNOFF SPELL, and then print the resulting SPELL.MEM file on (preferably) a lower case printer. .hl 1 Overall Description This section assumes you have read the user documentation and used the program in its two modes, so that you know what to expect of it. If you haven't done this, the following text will self destruct. The spell program consists of four major components: .list .le;SPELL.EXE, the spelling engine. .le;DICTIONARY.IDX, an RMS indexed file constituting a 91,707 word dictionary. .le;POPULAR.WORDS - a list of the 1,000 most frequently used words. (A list of the 3,000 most frequent is used for large documents.) .le;SPELL.COM, the command file that does various pre-execution checks and ties it all together. This is what you talk to and fiddle with if you don't like my command format and dictionary names. .end list .hl 2 The Engine SPELL.EXE is controlled entirely by logical names defined by the command file. These specify the dictionaries to be used and other options such as an output report file. The engine takes ^&no\& input from your keyboard (SYS$INPUT). It has a few messages to SYS$OUTPUT which signal error counts something extraordinary, usually a missing input file (text or dictionary). Since SPELL.COM is supposed to check for file existence before starting the engine, you should never get these. The engine has to examine the text to be checked and isolate words for testing. This is probably the dirtiest part of the code, and the most controversial. How do you define a word? Clearly, white space at either end is a good idea, but an embedded digit or graphic, such as $ seems to make it a non-word. White space has to include beginning of line and end of line. What about hyphenated words? What about contractions (can't), possessives (Digital's) and plural possessives? I tried, but you may not like it. Give me a copy of your improvements, please. Words must be longer than one character. We probably should have a switch to permit ignoring words of two letters also, but the false alarm rate due to this is fairly low. .hl 2 The Dictionaries The dictionary is in two main parts: one is an ordinary RMS indexed file. The other is a memory resident array of frequently used words sorted for binary search. .hl 3 Dictionary Disk File The indexed dictionary file consists of 32 byte fixed length records, with a single key. Each record key is the entire record. This is relatively compact (compared to previous implementations of this file) and very easy to implement and maintain with the standard RMS utilities. And don't let the 91,000 word size of the main dictionary impress you. Since I don't have a safe algorithm for predicting correct spellings for suffixes and plurals, ^&all\& variations on a word must be present as separate entries. There are many plurals, verb endings and other parts of English speech that do not follow regular rules. You will probably find a surprising number of words missing from the dictionary that the spell program will incorrectly report as errors. This is why the extension dictionaries, such as NEWWORDS.ADD and MYWORDS.ADD are so useful. (See User documentation.) It has been my experience that if you rigorously update these extensions, all the words in your own active vocabulary missing from the main dictionary are quickly added and the false alarm rate drops to an acceptable level. To save space, the original sequential dictionary from Tom Wolfe is not included with this release. If you need it, just use the CONVERT utility on the indexed file, specifying a sequential output file. So far we have not had to do any maintenance of the dictionary, but if we did, I think SORT/MERGE would probably be a simple answer. This won't do deletes, but that's a most unlikely need, unless you are a censor. Datatrieve, of course, should be ideal for minor maintenance. .hl 3 The Popular Words If we had to go to the disk dictionary for every word, the performance would be terrible, especially if the disk is being accessed by several other users at the same time. Of course, VMS disk caching helps, but I tried it that way first and didn't like it. By keeping the hottest 1,000 words in memory and looking for them with a binary search, the program speeds up dramatically. The memory resident sorted array is built at the start of each execution. (I hear you screaming, but read the Performance Strategy section.) The command file, SPELL.COM, constructs a search list logical which includes a file containing the most frequently used 1000 words. The search list also has the various possible dictionary extension files. See user documentation. The search list logical is then used exhaustively (rather than just the first success) and all files placed on it by the command file are read, sorted and merged with duplicate elimination. These are then loaded into a memory resident array with pointers to facilitate a binary search. This array is searched first for each word tested by the program before going to the disk resident dictionary. The spelling engine doesn't give a damn how many files go into the popular words array. It just gobbles all the files on the search list. SPELL.COM decides what files will go into the array and strings them together in the search list POPULAR__WORDS. You can easily fiddle with that if you don't like my selections. The array is 65,000 bytes long. The words are sorted by size (as the most significant part of the sort key) so that when the array is loaded, the words are tightly packed, with zero waste space. It is organized as sub-arrays of all two letter words, all three letter words, etc., and the length of the word we are testing is used to select the sub-array. This reduces the number of probes required for the binary search. If we are using the 3,000 most popular words, with all the extension dictionaries we have (maybe another 100 words) we use only about 20,000 bytes of the array. But it's a Demand Zero allocation, so the unused space ain't real virtual memory. The source of the popular words is you. I took all the documentation text files from a previous VMS SIG tape, sorted the words, counted them, and ranked by count. There were about 100,000 words. So these are the words (and buzzwords and acronyms) most popular to people contributing to the tape. .hl 2 Operating Modes The engine has two modes of operation determined by whether or not the user has specified an output file: .list .le;If no output file is specified, the logical SPEL__OUTPUT__FILE is undefined and the engine will display the text file to be checked on SYS$OUTPUT which should be a VT100 or up. Erroneous words will be shown in reversed blinking video. (I got this neat idea from Tom Wolfe.) .le;If an output file is specified, the logical will define it for the engine. The output file will contain only text file lines in error, numbered, and with erroneous words separated at the beginning. The file is suitable for hard copy printing. .end list The first mode is suitable for quickie texts such as letters, while the second is best for long documents. .hl 2 Rebuilding _the Engine The .EXE file is provided for sites deprived of a COBOL compiler. If you have one and want to tinker, SPELL.EXE is linked from four modules: .list .le;SPELL.COB - the main program which reads the input text, isolates words and formats the output display or file. .le;POPULAR-WORDS.COB - a subprogram which creates the memory resident sorted array and performs the binary searches of it. .le;DICTIONARY.COB - a subprogram which accesses the dictionary disk file for words not found by POPULAR-WORDS. .LE;EXTDEF.MAR - a trivial macro program defining the globals (COBOL externals) needed by some of the system calls. .end list A command file, BUILD.COM, is included to compile and link these modules to produce an executable. .hl 1 Required Files and Cleanup The distribution directory is shown below. Following each file size are letters categorizing it. The categories are: .literal R Required for program to run. D Documentation source - DSR will make readable documentation. P Needed to compile or link program. BUILD.COM 1 P - compiles and links SPELL.EXE DICTIONARY.COB 3 P - subprogram reads .IDX file DICTIONARY.FDL 4 - created by CONV, defines .IDX file DICTIONARY.IDX 3115 R - main 91,707 word dictionary EXTDEF.MAR 1 P - VMS global definitions NEWWORDS.ADD 2 R - words not in big dictionary POPULAR-WORDS.COB 14 P - subprogram searches memory array POPULAR.3000 53 R - 3,000 most frequently used words POPULAR.WORDS 17 R - 1,000 most frequently used words SPELL.COB 14 P - main program of SPELL.EXE SPELL.COM 5 R - command file controlling SPELL.EXE SPELL.EXE 29 R - the engine SPELL.RNO 11 D - user documentation SPELL_PROGRAM.RNO 45 D - this documentation SPELL_PROGRAM_RNO.ADD 1 D - extension file for above SPELL_RNO.ADD 1 D - user document extension. .end literal .page .hl 1 Performance Strategy In this section, I will discuss "performance" in the sense of basic resource use: time, I/O and storage. There are other important performance considerations, such as ease of use and quality of the result, that are not covered. .hl 2 Test Results As a base line, I used Tom Wolfe's standalone spelling checker from the Fall '86 VAX SIG tape, save set VAX86C. I should point out that Tom's main contribution was the EVE/TPU interface which I haven't touched. Perhaps, after EVE settles down a bit, I may follow his lead and hook my engine to it. The input text being checked was a RUNOFF file of about 30 pages containing 10,560 words. Tom's standalone echos the entire input file to SYS$OUTPUT, whereas my version (in large file mode) displays only text lines containing errors. Therefore, to even things up a bit, I assigned SYS$OUTPUT to the null device. After all, Tom's program could be easily equipped with a switch to skip error free lines. The tests were run in batch mode as the only active job on a MicroVAX II. Test file and dictionary were on the same RA81 disk. Swapping and paging were zero. Since the performance of my program is critically dependent on the popular word dictionary, I tested it with no popular words, and with 1,000 and 3,000 word popular lists. .literal Wolfe/JPL No Popular 1,000 3,000 Version Words Words Words Elapsed time 411 541 171 125 CPU time 216 246 108 92 Setup time 1(approx) 1.5 3.9 8.9 Direct I/O count 17,637 10,181 2,086 986 I/O per word 1.670 0.964 0.198 0.093 All time intervals in seconds. .end literal The figures above are all taken from the batch log statistics report, except for "Setup time". This is the time from the RUN command to start of file checking. This includes image activation, file opening, loading of private dictionaries, and in my case, sorting and building the popular word array. For the tests, I instrumented my program to display this time. For Tom's program, I estimated it from several manual tests with a stop watch. .hl 2 Test Results Discussion The effect of the large popular word dictionary is pronounced. Elapsed time with the 3,000 popular words was less than a third of the previous standalone spelling program. The reason is obvious too: direct I/O, which is primarily accesses to the disk dictionary, was less than six percent of the base line program figure which averaged about 1.7 accesses per word. Somewhat surprisingly, CPU time was cut in half also. Of course, my program without the popular word file was a loser. The reason is that getting a word out of the RMS indexed file on disk involves considerable I/O and computation. But with the 3,000 popular words in memory, less than one word in ten requires access to the disk dictionary. .hl 2 Choice _of Popular Word Dictionary Size So which should we use? If I am checking a one paragraph letter, I don't want to stare at my terminal for nine seconds before anything happens. But I sure like the performance improvement on some of the large documentation files I maintain. Of course, dummy, select the popular word dictionary on the basis of the input text file size. That's what SPELL.COM does. It selects either the 3,000 or 1,000 word popular files, or no file. I ran some timing tests and selected the decision values used by the command file to minimize total elapsed time for various input file sizes. (This was done rather slap dash, and the optimum decision values can probably be fine tuned. They may also be system dependent, so you might profitably second guess me here.) .hl 2 Why Build _the Popular Word Array _at Run Time? Why not hard code this array, or at least load a pre-sorted file? Well, I made a judgement call. I hope someone checks it on the instant replay though. I will if I get time, but my reasoning for sorting at run time went like this: .list .le;We want to be able to use several (maybe lots) of extension dictionaries, to correct omissions in the big dictionary, and to include group, project and document specific acronyms, etc. .le;These are ad hoc lists of words we can't count on being properly sorted. People make them with EDT, so we have to crank up a sort anyway. .le;If we keep the ad hoc extensions and the popular words in separate arrays, we must search both sometimes. By merging the extensions and the popular words, we have a single binary search. .le;The extensions are likely to be "popular" in the environment they are used. Acronyms are used precisely because they are replacements for frequently used phrases. (What the hell does VAX/VMS stand for, anyway?) .le;By sorting at run time, it is easy to update the popular word list and there is no concern about its sequence. The files I use are in frequency sequence rather than alphabetic. .le;It was easy to do. .end list Guess which item above was the controlling factor. My vindication is the measured performance. My hope is that there is room for improvement. .hl 2 Storage Use Previous implementers have usually put the big dictionary in a relative file and then invented some scheme to find out where to start a sequential search. The performance has usually been pretty good but of course it is an ad hoc file scheme and usually has required writing a special program to create, access and maintain the file(s). The relative file approach usually requires fixed length records as big as the largest word in the dictionary. (Variable length records in relative files make little sense.) And as we get larger and larger dictionaries, the size of this file can be rather painful. Tom Wolfe's dictionary scheme requires about 6,300 blocks for the 91,000+ words. Of course, an RMS indexed file has a lot of overhead too: prologs, multilevel index buckets, bucket splitters and, I suppose, power steering and call forwarding. But it has one joyful economy: key compression which in this case turns out to be significant. By defining the file as 32 byte fixed length records, with a 32 byte key (the whole record) and then calling for key compression with the .FDL file, we can about halve the size of the file. The compression works on both ends of the key. Trailing spaces are compressed, and leading characters the same as the previous key are omitted. So even though it is a "fixed" length record, RMS actually stores it variable length. The result is an average of about 17 bytes per word, including data and all prolog, index and record header overhead. See the discussion of Prolog 3 files in "Guide to VAX/VMS File Applications". An obvious way to improve the storage performance of a relative file would be to do a better job of packing the words. With an average word length of 6.5 letters in our dictionary, the 32 byte fixed length record is mostly wasted space. One way to improve on this would be to break the big dictionary into 32 separate files, each having only words of a single length, tightly packed. And with a little more brain juice, the same idea can be done with a single file. So why not go all the way and use a five bit byte to represent the alphabet? I suppose this is the way the electric typewriters do it. But all of these high density packing schemes suffer from one problem: they require fancy special software to build, access and maintain the dictionary. One such scheme widely distributed on a SIG tape had a bug that caused many of the dictionary words to be overlooked in the search. The standard RMS indexed structure presented here can be updated with a DCL command file. .appendix Logical Name Utilization .lm +6 This lists all the logical names used by this program, explains where they are defined and what they mean or control. Some hints about how to tailor execution to your tastes is included. It will also aid in altering SPELL.COM to suit your needs. NDS__SPELL defines the device and directory containing the required components of this spell program. The definition is made in one of these: your login file, the system login file, or (as a system logical) in SYSTARTUP.COM. This logical name is used only by SPELL.COM so that if you prefer some other name, you need only edit that file. INPUT__TEXT is a user mode logical defining the input text file to be spell checked. It is defined by SPELL.COM from the input parameter P1, after checking for it's existence. The logical name is automatically removed following execution. POPULAR__WORDS is a logical name search list. It consists of ^&all\& extension dictionaries and a list of popular words for the memory resident array. It is defined by SPELL.COM in user mode. The POPULAR-WORDS module of the spelling engine loads all files on the search list into the array. If you want to include other automatic extensions besides those SPELL.COM provides, you would edit it to add them to this search list. See "POPS" in the command file. ENGLISH is a logical name pointing to the main 91,707 word dictionary. It is defined by SPELL.COM in user mode and as distributed, points to NDS__SPELL:DICTIONARY.IDX. SPEL__POPULAR__WORD__SIZE is defined by the POPULAR-WORDS module in the spelling engine. It is a text string giving a one line report of space utilization in the popular words array. SPELL.COM contains a commented out SHOW LOGICAL to display this report, and then deassigns it. If you are interested in actual size and space utilization by the popular word list, uncomment the SHOW LOGICAL. Note that since duplicates are eliminated, the total words shown by this report may be less that the sum of the individual lists. SPEL__OUTPUT__FILE is defined by SPELL.COM if P2 is not blank. That is, if the user requests an output file by specifying it with parameter 2, the logical name is defined in user mode and the main module of the engine will create an output file if errors are found. By defining it permanently, you can force the program to run with an output file even though you omit P2 from your command line. SPEL__SHOW__NON is simply tested by the engine. If the logical is defined, then "non-words", i.e., character strings that are not tested against the dictionary, are displayed with reversed video. These usually are strings which include graphics or digits, such as SYS$LOGIN or DUA0. If you want to see these, simply define this logical to any equivalence string and run the program ^&with_ no_ output_ file.\&