|
ABSTRACT
With the increase in word and text processing computer systems, programs which check and correct spelling will become more and more common. Peterson investigates the basic structure of several such existing programs and their approaches to solving the problems which arise when this type of program is created. The basic framework and background necessary to write a spelling checker or corrector are provided.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Bledsoe, W.W., and Browning, I. Pattern recognition and reading by machine. Proc. Eastern Joint Comptr. Conf., Boston, Mass., Dec. 1959, pp. 225-232. Here, the authors have used a small dictionary with the probability of each word for OCR.
|
| |
3
|
Blair, C.R. A program for correcting spelling errors. Inform. and Control 3, 1 (March 1960), 60-67. Blair weights the letters to create a four- or five-letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. The possibility (impossibility) of building in rules like "i before e except after c and when like a as in neighbor and weigh" is mentioned.
|
| |
4
|
Bourne, C.P. Frequency and impact of spelling errors in bibliographic data bases. Inform. Processing and Mgmt. 13, 1 (1977), 1-12. Bourne examines the frequency of spelling errors in a sample drawn from 11 machine-readable bibliographic databases. He finds that spelling errors are severe enough to influence the search strategy to find information in the database. Errors are not only in the input queries, but also in the database itself.
|
| |
5
|
Carlson, G. Techniques for replacing characters that are garbled on input. Proc. 1966 Spring Joint Comptr. Conf., AFIPS Press, Arlington, Va., 1966, pp. 189-192. The author uses trigrams to correct the OCR input of genealogical records.
|
| |
6
|
Cornew, R.W. A statistical method of spelling correction. Inform. and Control 12, 2 (Feb. 1968), 79-93. Cornew employs digrams first, then a dictionary search to correct one character substitutions. The dictionary already exists for a speech output problem.
|
 |
7
|
|
 |
8
|
|
| |
9
|
Fisher, E.G. The use of context in character recognition. Tech. Rep. 76-12, Dept. of Comptr. and Inform. Sci., Univ. of Massachusetts, Amherst, July 1976. Fisher considers the problems of automatically reading addresses from envelopes in the Post Office and also Morse code recognition.
|
| |
10
|
Freeman, D.N. Error correction in CORC: The Cornell computing language. Ph.D. Th., Dept. of Comptr. Sci., Cornell Univ., Ithaca, N.Y., Sept. 1963.
|
| |
11
|
Galli, E.J., and Yamada, H.M. An automatic dictionary and verification of machine-readable text. IBM Syst. J. 6, 3 (1967), 192-207. This article is a good discussion of the general problem of token identification and verification.
|
| |
12
|
Galli, E.J., and Yamada, H.M. Experimental studies in computer-assisted correction of unorthographic text. IEEE Trans. Engineering Writing and Speech EWS-11, 2 (Aug. 1968), 75-84. Galli and Yamada provide a good review and explanation of techniques and problems.
|
| |
13
|
Giangardella, J.J., Hudson, J., and Roper, R.S. Spelling correction by vector representation using a digital computer. IEEE Trans. Engineering Writing and Speech EWS-10, 2 (Dec. 1967), 57-62. The authors define hash functions to give a vector representation of a word as norm, angle, and distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings, since interchanged characters have the same norm and an extra or deleted letter falls within a fixed range.
|
 |
14
|
|
| |
15
|
Hanson, A.R., Riseman, E.M., and Fisher, E.G. Context in word recognition. Pattern Recognition 8, 1 (Jan. 1976), 35-45. The authors suggest positional binary trigrams to correct words with wrong letters. See also [33].
|
| |
16
|
Harmon, L.D. Automatic reading of cursive script. Proc. Symp. on Optical Character Recognition, Spartan Books, Washington, D.C., Jan. 1962, pp. 151-152. Harmon uses digrams and a "confusion matrix" to give the probability of letter substitutions.
|
| |
17
|
Harmon, L.D. Automatic recognition of print and script. Proc. IEEE 60, 10 (Oct. 1972), 1165-1176. Here, Harmon surveys the techniques for computer input of print, including a section on error detection and correction. He indicates that digrams can catch 70 percent of incorrect letter errors.
|
| |
18
|
James, E.B., and Partridge, D.P. Tolerance to inaccuracy in computer programs. Comptr. J. 19, 3 (Aug. 1976), 207-212. The authors describe an approach to correcting spelling errors in Fortran key words.
|
| |
19
|
|
| |
20
|
Kucera, H., and Francis, W.N. Computational Analysis of Present-Day American English. Brown Univ. Press, Providence, R.I., 1967. The authors give frequency and statistical information for the Brown Corpus of over a million tokens.
|
| |
21
|
Leslie, L.A. 20,000 Words. McGraw-Hill, N.Y., 1977. This work is representative of several books which list words.
|
 |
22
|
|
| |
23
|
McElwain, C.K., and Evans, M.E. The degarbler: a program for correcting machine-read Morse code. Inform. and Control 5, 4 (Dec. 1962), 368-384. The authors' program uses digrams, trigrams, and a dictionary to correct up to 70 percent of errors in machine-recognized Morse code. It also uses 5 special rules for the types of errors which can occur (dot interpreted as dash, etc.).
|
| |
24
|
McMahon, L.E., Cherry, L.L., and Morris, R. Statistical text processing. Bell Syst. Tech. J. 57, 6, part 2 (July-August 1978), 2137-2154. McMahon et al. provide a good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.
|
 |
25
|
|
| |
26
|
Morris, R., and Cherry, L.L. Computer detection of typographical errors. IEEE Trans. Professional Comm. PC-18, 1 (March 1975), 54-64. Morris and Cherry describe the TYPO program for the UNIX system.
|
| |
27
|
Muth, F., and Tharp, A.L. Correcting human error in alphanumeric terminal input. Inform. Processing and Mgmt. 13, 6 (1977), 329-337. This paper suggests using a tree structure (like a trie) with special search procedures to find corrections. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). (Reviewed in Comptng. Reviews 19, 6 (June 1978), 231.)
|
| |
28
|
O'Brien, J.A. Computer program for automatic spelling correction. Tech. Rep. RADC-TR-66-696, Rome Air Development Ctr., New York, March 1967. This report describes an early prototype of a spelling correction program designed to correct input from an OCR reader. It was implemented on a CDC 160A with special hardware.
|
| |
29
|
Okuda, T., Tanaka, E., and Kasai, T. A method for the correction of garbled words based on the Levenshtein metric. IEEE Trans. Comptrs. C-25, 2 (Feb. 1976), 172-177. The authors suggest a method for correcting an incorrectly spelled token.
|
| |
30
|
Partridge, D.P., and James, E.B. Natural information processing. Internat. J. Man-Machine Studies 6, 2 (March 1974), 205-235. Here, the authors use a tree structure representation of words to allow checks for incorrect input words. This is done in the context of correcting key words in a Fortran program, but more is there. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.
|
| |
31
|
Peterson, J.L. Design of a spelling program: An experiment in program design. Lecture Notes in Comptr. Sci. 96, Springer-Verlag, New York, Oct. 1980. In this work the complete top-down design of a program to detect and correct spelling errors is given. The design is machine-independent, but directly results in a Pascal program which implements the design.
|
| |
32
|
Riseman, E.M., and Ehrich, R.W. Contextual word recognition using binary digrams. IEEE Trans. Comptrs. C-20, 4 (April 1971), 397-403. The authors indicate that the important property of digrams is only their zero or nonzero nature.
|
| |
33
|
Riseman, E.M., and Hanson, A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comptrs. C-23, 5 (May 1974), 480-493. Riseman and Hanson suggest using digrams (2-grams), trigrams (3-grams), or in general n-grams, but only storing whether the probability is zero or nonzero (1 bit). They also favor positional n-grams, which means that a separate n-gram table is kept for each combination of positions (for digrams, each pair of positions i and j has its own table for the characters in position i and position j of a word).
|
| |
34
|
Rosenbaum, W.S., and Hilliard, J.J. Multifont OCR postprocessing system. IBM J. Res. and Develop. 19, 4 (July 1975), 398-421. This paper focuses very specifically on OCR problems. Searching with a match-any character is discussed.
|
 |
35
|
|
| |
36
|
Szanser, A.J. Error-correcting methods in natural language processing. Inform. Processing 68, North-Holland, Amsterdam, Aug. 1968, pp. 1412-1416. This is a confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes; it suggests using "elastic" matching.
|
| |
37
|
Szanser, A.J. Automatic error-correction in natural languages. Inform. Storage and Retrieval 5, 4 (Feb. 1970), 169-174.
|
| |
38
|
Tanaka, E., and Kasai, T. Correcting method of garbled languages using ordered key letters. Electronics and Comm. in Japan 55, 6 (1972), 127-133.
|
| |
39
|
Tenczar, P.J., and Golden, W.W. Spelling, word and concept recognition. Rep. CERL-X-35, Univ. of Illinois, Urbana, Ill., Oct. 1962.
|
| |
40
|
Thomas, R.B., and Kassler, M. Character recognition in context. Inform. and Control 10, 1 (Jan. 1967), 43-64. Thomas and Kassler consider tetragrams (sequences of 4 letters); of 27^4 possible tetragrams, only 12 percent (61,273) are legal.
|
| |
41
|
Thorelli, L. Automatic correction of errors in text. BIT 2, 1 (1962), 45-62. This paper is a kind of survey/tutorial. It mentions digrams and dictionary look-up and suggests maximizing probabilities.
|
 |
42
|
|
 |
43
|
|
 |
44
|
|
CITED BY 66
William H. Cushman, Purnendu S. Ojha, Cathleen M. Daniels, Usable OCR: what are the minimum performance requirements?, Proceedings of the SIGCHI conference on Human factors in computing systems: Empowering people, p.145-152, April 01-05, 1990, Seattle, Washington, United States
Lon-Mu Liu, Yair M. Babad, Wei Sun, Ki-Kan Chan, Adaptive post-processing of OCR text via knowledge acquisition, Proceedings of the 19th annual conference on Computer Science, p.558-569, April 1991, San Antonio, Texas, United States
A. Vagelatos, T. Triantopoulou, C. Tsalidis, D. Christodoulakis, Utilization of a lexicon for spelling correction in modern Greek, Proceedings of the 1995 ACM symposium on Applied computing, p.267-271, February 26-28, 1995, Nashville, Tennessee, United States
R. Casajuana, C. Rodríguez, L. Sopeña, C. Villar, Towards an integrated environment for Spanish document verification and composition, Proceedings of the third conference on European chapter of the Association for Computational Linguistics, p.52-55, April 01-03, 1987, Copenhagen, Denmark
Tetsuo Araki, Satoru Ikehara, Nobuyuki Tsukahara, Yasunori Komatsu, An evaluation to detect and correct erroneous characters wrongly substituted, deleted and inserted in Japanese and English sentences using Markov models, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan
E. Agirre, I. Alegria, X. Arregi, X. Artola, A. Díaz de Ilarraza, M. Maritxalar, K. Sarasola, M. Urkia, XUXEN: a spelling checker/corrector for Basque based on two-level morphology, Proceedings of the third conference on Applied natural language processing, March 31-April 03, 1992, Trento, Italy
Sargur N. Srihari, Jonathan J. Hull, Ramesh Choudhari, Integration of bottom-up and top-down contextual knowledge in text error correction, Proceedings of the June 7-10, 1982, national computer conference, June 07-10, 1982, Houston, Texas