|
ABSTRACT
The best schemes for text compression use large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling that are suitable for use in practical text compression systems.
The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
AUSLANDER, M., HARRISON, W., MILLER, V., AND WEGMAN, M. 1985. PCTERM: A terminal emulator using compression. In Proceedings of the IEEE Globecom '85. IEEE Press, pp. 860-862.
|
| |
4
|
BAUM, L. E., PETRIE, T., $OULES, G., AND WEISS, N. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Star. 41,164-171.
|
| |
5
|
BELL, T. C. 1986. Better OPM/L text compression. IEEE Trans. Commun. COM-34, 12 (Dec.), 1176-1182.
|
| |
6
|
BELL, T. C. 1987. A unifying theory and improvements for existing approaches to text compression. Ph.D. dissertation, Dept. of Computer Science, Univ. of Canterbury, New Zealand.
|
| |
7
|
BELL, T. C. 1989. Longest match string searching for Ziv-Lempel compression. Res. Rept. 6/89, Dept. of Computer Science, Univ. of Canterbury, New Zealand.
|
| |
8
|
|
| |
9
|
BELL, T. C., AND WITTEN, I. H. 1987. Greedy macro text compression. Res. Rept. 87/285/33. Department of Computer Science, University of Calgary.
|
 |
10
|
|
| |
11
|
BOOKSTEIN, A., AND FOUTY, G. 1976. A mathematical model for estimating the effectiveness of bigram coding. Inf. Process. Manage. 12.
|
| |
12
|
BRENT, R. P. 1987. A linear algorithm for data compression. Aust. Comput. J. 19, 2, 64-68.
|
| |
13
|
CAMERON, R. D. 1986. Source encoding using syntactic information source models. LCCR Tech. Rept. 86-7, Simon Fraser University.
|
| |
14
|
CLEARY, J. G. 1980. An associative and impressible computer. Ph.D. dissertation. Univ. of Canterbury, Christchurch, New Zealand.
|
| |
15
|
CLEARY, J. G., AND WITTEN, I. H. 1984a. A comparison of enumerative and adaptive codes. IEEE Trans. Inf. Theory, IT-30, 2 (Mar.), 306-315.
|
| |
16
|
CLEARY, J. G., AND WlTTEN, I. H. 1984b. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. COM- 32, 4 (Apr.), 396-402.
|
| |
17
|
COOrER, D., AND LYNCH, M. F. 1982. Text compression using variable-to-fixed-length encodings. J. Am. Soc. Inf. Sck (Jan.), 18-31.
|
| |
18
|
|
| |
19
|
|
| |
20
|
CORTESI, D. 1982. An effective text-compression algorithm. Byte 7, i (Jan.), 397-403.
|
| |
21
|
COVER, T. M., ANt) KING, R. C. 1978. A convergent gambling estimate of the entropy of English. IEEE Trans. inf. Theory IT-24, 4 (Jul.), 413-421.
|
| |
22
|
DARRAGH, J. J., WITTEN, I. H., AND CLEARY, J. G. 1983. Adaptive text compression to enhance a modem. Res. Rept. 83/132/21. Computer Science Dept., Univ. of Calgary.
|
| |
23
|
ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory IT-21, 2 (Mar.), 194-203.
|
| |
24
|
|
| |
25
|
|
| |
26
|
EVANS, T. G. 1971. Grammatical inference techniques in pattern analysis. In Software Engineering, J. Tou, Ed. Academic Press, New York, pp. 183-202.
|
| |
27
|
FALLER, N. 1973. An adaptive system for data compression. Record of the 7th Asilomar Conference on Circuits, Systems and Computers. Naval Postgraduate School, Monterey, CA, pp. 593-597.
|
 |
28
|
|
| |
29
|
|
| |
30
|
GAINES, B. R. 1976. Behaviour/structure transformations under uncertainty. Int. J. Man-Mach. Stud. 8, 337-365.
|
| |
31
|
GAINES, B. R. 1977. System identification, approximation and complexity. Int. J. General Syst. 3, 145-174.
|
| |
32
|
GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory IT-24, 6 (Nov.), 668-674.
|
| |
33
|
GOLD, E. M. 1978. On the complexity of automaton identification from given data. Inf. Control 37, 302-320.
|
 |
34
|
|
| |
35
|
GOTrLIEB, D., HAGERTH, S. A., LEHOT, P. G. H., AND RABiNOWITZ, H. S. 1975. A classification of compression methods and their usefulness for a large data processing center. National Comput. Conf. 44, 453-458.
|
| |
36
|
GUAZZO, M. 1980. A general minimum-redundancy source-coding algorithm. IEEE Trans. Inf. Theory IT-26, i (Jan.), 15-25.
|
| |
37
|
|
| |
38
|
HELMAN, D. R., AND LANGDON, G. G. 1988. Data compression. IEEE Potentials (Feb.), 25-28.
|
| |
39
|
HORSPOOL, R. N., AND CORMACK, G. V. (1983). Data compression based on token recognition. Unpublished.
|
| |
40
|
HORSPOOL, R. N., AND CORMACK, G. V. 1986. Dynamic Markov modelling--A prediction technique. In Proceedings of the International Conference on the System Sciences, Honolulu, Hi, pp. 7OO-7O7.
|
| |
41
|
HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. In Proceedings of the Institute of Electrical and Radio Engineers 40, 9 (Sept.), pp. 1098-1101.
|
| |
42
|
HUNTER, R., AND ROBINSON, A. H. 1980. International digital facsimile coding standards. In Proceedings of the Institute of Electrical and Electronic Engineers 68, 7 (Jul.), pp. 854-867.
|
| |
43
|
JAGGER, D. 1989. Fast Ziv-Lempel decoding using RISC architecture. Res. Rept., Dept. of Computer Science, Univ. of Canterbury, New Zealand.
|
| |
44
|
|
| |
45
|
JAMISON, D., AND JAMISON, K. 1968. A note on the entropy of partially-known languages. Inf. Control 12, 164-167.
|
| |
46
|
JEWELL, G. C. 1976. Text compaction for information retrieval systems. IEEE Syst., Man and Cybernetics Soc. Newsletter 5, 47.
|
 |
47
|
|
| |
48
|
KATAJAINEN, J., AND RAITA, T. 1987a. An approximation algorithm for space-optimal encoding of a text. Res. Rept., Dept. of Computer Science, Univ. of Turku, Turku, Finland.
|
| |
49
|
KATAJAINEN, J., AND RAITA, T. 1987b. An analysis of the longest match and the greedy heuristics for text encoding. Res. Rept., Dept. of Computer Science, Univ. of Turku, Turku, Finland.
|
| |
50
|
|
| |
51
|
|
| |
52
|
|
| |
53
|
LANGOON, G. G. 1983. A note on the Ziv-Lempel model for compressing individual sequences. IEEE Trans. Inf. Theory IT-29, 2 (Mar.), 284-287.
|
| |
54
|
LANGDON, G. C. 1984. An introduction to arithmetic coding, iBM J. Res. Dev. 28, 2 (Mar.), 135-149.
|
| |
55
|
LANGDON, G. G., AND RmSANrN, J. 1981. Compression of black-white images with arithmetic coding. IEEE Trans. Cornmun. COM-29, 6 (Jun.), 858-867.
|
| |
56
|
LANGDON, G. G., AND RISSANEN, J. J. 1982. A simple general binary source code. IEEE Trans. Inf. Theory IT-28 (Sept.), 800-803.
|
| |
57
|
LANGOON, G. G., ANO RISSANEN, J. J. 1983. A doubly-adaptive file compression algorithm. IEEE Trans. Commun. COM-31, 11 (Nov.), 1253-1255.
|
 |
58
|
|
| |
59
|
LEMPrL, A., AND ZlV, J. 1976. On the complexity of finite sequences. IEEE Trans. Inf. Theory IT-22, i (Jan.), 75-81.
|
| |
60
|
LEViNSON, S. E., RABINER, L. R., ANO SONOHI, M. 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62, 4 (Apr.), 1035-1074.
|
| |
61
|
|
| |
62
|
LYNCH, M. F. 1973. Compression of bibliographic files using an adaptation of run-length coding. Inf. Storage Retrieval 9, 207-214.
|
| |
63
|
|
| |
64
|
MAYNE, A., AND JAMES, E. B. 1975. Information compression by factorizing common strings. Comput. J. 18, 2, 157-160.
|
| |
65
|
G. & C. MERRIAM COMPANY 1963. Webster's Seventh New Collegiate Dictionary. Springfield, MA.
|
| |
66
|
MILLER, V. S., AND WEC, MAN, M. N. 1984. Variations on a theme by Ziv and Lempel. In Combinatorial Algorithms on Words. A. Apostolico and Z. Galil, Eds. NATO ASI Series, Vol. F12. Springer-Verlag, Berlin, pp. 131-140.
|
| |
67
|
MOrFAT, A. 1987. Word based text compression. Res. Rept., Dept. of Computer Science, Univ. of Melbourne, Victoria, Australia.
|
| |
68
|
MOFrAT, A. 1988a. A data structure for arithmetic encoding on large alphabets. In Proceedings of the 11th Australian Computer Science Conference. Brisbane, Australia (Feb.), pp. 309-317.
|
| |
69
|
MOFFAT, A. 1988b. A note on the PPM data compression algorithm. Res. Rept. 88/7, Dept. of Computer Science, Univ. of Melbourne, Victoria, Australia.
|
 |
70
|
|
 |
71
|
|
| |
72
|
OZEKI, K. 1974a. Optimal encoding of linguistic information. Systems, Computers, Controls 5, 3, 96-103. Translated from Denshi Tsushin Gakkai Ronbunshi, Vol. 57-D, No. 6, June 1974, pp. 361-368.
|
| |
73
|
OZEKI, K. 1974b. Stochastic context-free grammar and Markov chain. Systems, Computers, Controls 5, 3, 104-110. Translated from Denshi Tsushin Gakkai Ronbunshi, Vol. 57-D, No. 6, June 1974, pp. 369-375.
|
| |
74
|
OZEKI, K. 1975. Encoding of linguistic information generated by a Markov chain which is associated with a stochastic context-free grammar. Systems, Computers, Controls 6, 3, 75-80. Translated from Denshi Tsushin Gakkai Ronbunshi, Vol. 58-D, Nol. 6, June 1975, pp. 322-327.
|
| |
75
|
|
| |
76
|
PIKE, J. 1981. Text compression using a 4 bit coding system. Comput. J. 24, 4.
|
| |
77
|
RAmNER, L. R., AND JUANG, B. H. 1986. An Introduction to Hidden Markov Models. IEEE ASS P Mag. (Jan.).
|
 |
78
|
|
| |
79
|
RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20, (May), 198-203.
|
| |
80
|
RlSSANEN, J. J. 1979. Arithmetic codings as number representations. Acta Polytechnic Scandinavica, Math 31 (Dec.), 44-51.
|
| |
81
|
RISSANEN, J. 1983. A universal data compression system. IEEE Trans. Inf. Theory IT-29, 5 (Sept.), 656-664.
|
| |
82
|
RISSANEN, J., AND LANGDON, G. G. 1979. Arithmetic coding. IBM J. Res. Dev. 23, 2 (Mar.), 149-162.
|
| |
83
|
RlSSANEN, J., AND LANGDON, G. G. 1981. Universal modeling and coding. IEEE Trans. Inf. Theory IT-27, 1 (Jan.), 12-23.
|
| |
84
|
ROBERTS, M. G. 1982. Local order estimating Markovian analysis for noiseless source coding and authorship identification. Ph.D. dissertation. Stanford Univ.
|
 |
85
|
|
 |
86
|
|
| |
87
|
RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. In/. Theory IT-25, 6 (Nov.), 672-675.
|
| |
88
|
RYABKO, B. Y. 1980. Data compression by means of a "book stack." Problemy Peredachi Informatsii 16, 4.
|
| |
89
|
SCHIEBER, W. D., AND THOMAS, G. W. 1971. An algorithm for compaction of alphanumeric data. J. Library Automation 4, 198-206.
|
| |
90
|
SCHUEGRAF, E. J., AND HEAPS, H. S. 1973. Selection of equifrequent word fragments for information retrieval. Inf. Storage Retrieval 9, 697-711.
|
| |
91
|
SCHUE6RA~, E. J., AND HEAPS, H. S. 1974. A comparison of algorithms for data-base compression by use of fragments as language elements. Inf. Storage Retrieval 10, 309-319.
|
| |
92
|
SHANNON, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27 (Jul.), 398-403.
|
| |
93
|
SHANNON, C. E. 1951. Prediction and entropy of printed English. Bell Syst. Tech. J. (Jan.), 50-64.
|
| |
94
|
SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation I (Dec.), 36-40.
|
| |
95
|
STORER, J. A. 1977. NP-completeness results concerning data compression. Tech. Rept. 234. Dept. of Electrical Engineering and Computer Science, Princeton Univ., Princeton, NJ.
|
| |
96
|
|
 |
97
|
|
| |
98
|
SVANKS, M. I. 1975. Optimizing the storage of alphanumeric data. Can. Datasyst. (May), 38-40.
|
| |
99
|
TAN, C. P. 1981. On the entropy of the Malay language, iEEE Trans. Inf. Theory IT-27, 3 (May), 383-384.
|
| |
100
|
THOMAS, S. W., MCKm, J., DAVIES, S., TURKOWSKI, K., WOODS, J. A., AND OROST, J. W. 1985. Compress (Version 4.0) program and documentation. Available from joe(~petsd. UUCP.
|
| |
101
|
TISCHER, P. 1987. A modified Lempel-Ziv-Welch data compression scheme. Aust. Comp. Sck Commun. 9, 1, 262-272.
|
| |
102
|
|
| |
103
|
TROPPER, R. 1982. Binary-coded text, a compression method. Byte 7, 4 (Apr.), 398-413.
|
 |
104
|
|
 |
105
|
|
 |
106
|
|
| |
107
|
WALKER, D. E., AND AMSLER, R. A. 1986. The use of machine-readable dictionaries in sublanguage analysis. In Analysing languages in restricted domains: Sublanguage description and processing, R. Grishman and R. Kittridge, Eds. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 69-83.
|
| |
108
|
WELCH, T. A. 1984. A technique for high-performance data compression. IEEE Computer 17, 6 (Jun.), 8-19.
|
| |
109
|
WHITE, H. E. 1967. Printed English compression by dictionary encoding. In Proceedings of the Institute of Electrical and Electronic Engineers 55, 3, 390-396.
|
| |
110
|
|
| |
111
|
WlTTEN, I. H. 1979. Approximate, non-deterministic modelling of behaviour sequences. Int. J. General Systems 5 (Jan.), 1-12.
|
| |
112
|
WITTEN, I. H. 1980. Probabilistic behaviour/structure transformations using transitive Moore models. Int. J. General Syst. 6, 3, 129-137.
|
| |
113
|
WlTTEN, I. H., AND CLEARY, J. 1983. Picture coding and transmission using adaptive modelling of quad trees. In Proceedings of the International Electrical, Electronics Conference 1, Toronto, ON, pp. 222-225.
|
| |
114
|
|
 |
115
|
|
| |
116
|
WOLFF, J. G. 1978. Recoding of natural language for economy of transmission or storage. Comput. J. 21, 1, 42-44.
|
| |
117
|
YOUNG, D. M. 1985. MacWrite file formats. Wheels for the Mind (Newsletter of the Australian Apple University Consortium), University of Western Australia, Nedlands, WA 6009, Australia, p. 34.
|
| |
118
|
ZIv, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory IT-23, 3 (May), 337-343.
|
| |
119
|
ZIv, J., ANO LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory IT-24, 5 (Sept.), 530-536.
|
REVIEW
"Glen G. Langdon : Reviewer"
The first part of this survey follows the modeling and coding
approach described by Rissanen and Langdon [1], and the second part
surveys dictionary approaches, including the Ziv-Lempel approach. The
work describes general techniques for compr
more...
|