|
ABSTRACT
The use of regular expressions for text search is widely known and well understood. It is then surprising that the standard techniques and tools prove to be of limited use for searching structured text formatted with SGML or similar markup languages. Our experience with structured text search has caused us to reexamine the current practice. The generally accepted rule of “leftmost longest match” is an unfortunate choice and is at the root of the difficulties. We instead propose a rule which is semantically cleaner. This rule is generally applicable to a variety of text search applications, including source code analysis, and has interesting properties in its own right. We have written a publicly available search tool implementing the theory in the article, which has proved valuable in a variety of circumstances.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
AHO, A. V. 1980. Pattern matching in strings. In Formal Language Theory Perspectives and Open Problems, R. V. Book, Ed. Academic Press, New York, 325-344.
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
Alfred V. Aho , Ravi Sethi , Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986
|
| |
7
|
|
 |
8
|
|
| |
9
|
|
 |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
CLARKE, C. L. A. AND CORMACK, G. V. 1996. Context grep. Tech. Rep. CS-96-41, Computer Science Dept., Univ. of Waterloo, Waterloo, Canada.
|
| |
15
|
CLARKE, C. L. A., CORMACK, G. V., AND BURKOWSKI, F. J. 1995a. An algebra for structured text search and a framework for its implementation. Comput. J. 38, 1, 43-56.
|
| |
16
|
CLARKE, C. L. A., CORMACK, G. V., AND BURKOWSKI, F. J. 1995b. Schema-independent retrieval from hetrogeneous structured text. In Proceedings of the ~th Annual Symposium on Document Analysis and Information Retrieval. UNLV, Las Vegas, Nev.
|
| |
17
|
FISHER, M. AND PATTERSON, M. 1974. String matching and other products. In Proceedings of the SIAM-AMS Conference on Complexity of Computation, R. M. Karp, Ed. American Mathematical Society, Providence, R.I., 113-125.
|
| |
18
|
HARGREAVES, K. A. AND BERRY, K. 1992. Regex. Free Software Foundation, Cambridge, Mass. Available from ftp://prep.ai.mit.edu/pub/gnu.
|
 |
19
|
|
| |
20
|
|
| |
21
|
HORSPOOL, R. N. 1980. Practical fast searching in strings. Softw. Pract. F, xper. 10, 501-506.
|
| |
22
|
IEEE. 1992. Standard for information technology Portable Operating System Interface (POSIX) Part 2 (Shell and utilities) Section 2.8 (Regular expression notation). IEEE Std 1003.2, Institute of Electrical and Electronics Engineers, New York.
|
| |
23
|
ISO. 1986. Information processing Text and office systems Standard Generalized Markup Language (SGML). ISO 8879, International Standards Organization, Geneva, Switzerland.
|
| |
24
|
JAAKKOLA, J. AND KILPELAINEN, P. 1995. Sgrep home page. Dept. of Computer Science, Univ. of Helsinki, Helsinki, Finland. Available as http://www.cs.helsinki.fi/jjaakkol/sgrep.html.
|
 |
25
|
|
| |
26
|
KNIGHT, J. R. AND MYERS, E. W. 1995. Super-pattern matching. Algorithmica 13, 211-243.
|
| |
27
|
KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2 (June), 323-350.
|
| |
28
|
LESK, M. E. 1975. Lex -- A lexical analyzer generator. Computing Science Tech. Rep. 39, AT&T Bell Laboratories, Murray Hill, N.J.
|
| |
29
|
|
| |
30
|
MCNAUTHTON~ R. AND YAMADA~ H. 1960. Regular expressions and state graphs for automata. IRE Trans. Electronic Comput. EC-9, 1 (Mar.), 39-47.
|
 |
31
|
|
| |
32
|
MYERS~ E. W. AND MILLER~ W. 1989. Approximate matching of regular expressions. Bull. Math. Biol. 51, 1, 5-37.
|
| |
33
|
|
| |
34
|
PIKE~ R. 1987a. Structural regular expressions. In Proceedings of the European UNIX User's Group Conference. EUUG, Helsinki, Finland.
|
| |
35
|
|
 |
36
|
|
| |
37
|
|
| |
38
|
Wu~ S. AND MANBER~ W. 1992a. agrep A fast approximate pattern matching algorithm. In Proceedings of the USENIX Winter 1992 Technical Conference. USENIX Assoc., Berkeley, Calif, 153-162.
|
 |
39
|
|
CITED BY 5
|
|
Anthony Cox , Charles Clarke , Susan Sim, A model independent source code repository, Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, p.1, November 08-11, 1999, Mississauga, Ontario, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|