ACM Home Page
Please provide us with feedback. Feedback
On the use of regular expressions for searching text
Full text PdfPdf (222 KB)
Source ACM Transactions on Programming Languages and Systems (TOPLAS) archive
Volume 19 ,  Issue 3  (May 1997) table of contents
Pages: 413 - 426  
Year of Publication: 1997
ISSN:0164-0925
Authors
Charles L. A. Clarke  Univ. of Waterloo, Waterloo, Ont., Canada
Gordon V. Cormack  Univ. of Waterloo, Waterloo, Ont., Canada
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 118,   Citation Count: 5
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/256167.256174
What is a DOI?

ABSTRACT

The use of regular expressions for text search is widely known and well understood. It is then surprising that the standard techniques and tools prove to be of limited use for searching structured text formatted with SGML or similar markup languages. Our experience with structured text search has caused us to reexamine the current practice. The generally accepted rule of “leftmost longest match” is an unfortunate choice and is at the root of the difficulties. We instead propose a rule which is semantically cleaner. This rule is generally applicable to a variety of text search applications, including source code analysis, and has interesting properties in its own right. We have written a publicly available search tool implementing the theory in the article, which has proved valuable in a variety of circumstances.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
AHO, A. V. 1980. Pattern matching in strings. In Formal Language Theory Perspectives and Open Problems, R. V. Book, Ed. Academic Press, New York, 325-344.
 
3
4
 
5
 
6
 
7
8
 
9
10
 
11
12
 
13
 
14
CLARKE, C. L. A. AND CORMACK, G. V. 1996. Context grep. Tech. Rep. CS-96-41, Computer Science Dept., Univ. of Waterloo, Waterloo, Canada.
 
15
CLARKE, C. L. A., CORMACK, G. V., AND BURKOWSKI, F. J. 1995a. An algebra for structured text search and a framework for its implementation. Comput. J. 38, 1, 43-56.
 
16
CLARKE, C. L. A., CORMACK, G. V., AND BURKOWSKI, F. J. 1995b. Schema-independent retrieval from hetrogeneous structured text. In Proceedings of the ~th Annual Symposium on Document Analysis and Information Retrieval. UNLV, Las Vegas, Nev.
 
17
FISHER, M. AND PATTERSON, M. 1974. String matching and other products. In Proceedings of the SIAM-AMS Conference on Complexity of Computation, R. M. Karp, Ed. American Mathematical Society, Providence, R.I., 113-125.
 
18
HARGREAVES, K. A. AND BERRY, K. 1992. Regex. Free Software Foundation, Cambridge, Mass. Available from ftp://prep.ai.mit.edu/pub/gnu.
19
 
20
 
21
HORSPOOL, R. N. 1980. Practical fast searching in strings. Softw. Pract. F, xper. 10, 501-506.
 
22
IEEE. 1992. Standard for information technology Portable Operating System Interface (POSIX) Part 2 (Shell and utilities) Section 2.8 (Regular expression notation). IEEE Std 1003.2, Institute of Electrical and Electronics Engineers, New York.
 
23
ISO. 1986. Information processing Text and office systems Standard Generalized Markup Language (SGML). ISO 8879, International Standards Organization, Geneva, Switzerland.
 
24
JAAKKOLA, J. AND KILPELAINEN, P. 1995. Sgrep home page. Dept. of Computer Science, Univ. of Helsinki, Helsinki, Finland. Available as http://www.cs.helsinki.fi/jjaakkol/sgrep.html.
25
 
26
KNIGHT, J. R. AND MYERS, E. W. 1995. Super-pattern matching. Algorithmica 13, 211-243.
 
27
KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2 (June), 323-350.
 
28
LESK, M. E. 1975. Lex -- A lexical analyzer generator. Computing Science Tech. Rep. 39, AT&T Bell Laboratories, Murray Hill, N.J.
 
29
 
30
MCNAUTHTON~ R. AND YAMADA~ H. 1960. Regular expressions and state graphs for automata. IRE Trans. Electronic Comput. EC-9, 1 (Mar.), 39-47.
31
 
32
MYERS~ E. W. AND MILLER~ W. 1989. Approximate matching of regular expressions. Bull. Math. Biol. 51, 1, 5-37.
 
33
 
34
PIKE~ R. 1987a. Structural regular expressions. In Proceedings of the European UNIX User's Group Conference. EUUG, Helsinki, Finland.
 
35
36
 
37
 
38
Wu~ S. AND MANBER~ W. 1992a. agrep A fast approximate pattern matching algorithm. In Proceedings of the USENIX Winter 1992 Technical Conference. USENIX Assoc., Berkeley, Calif, 153-162.
39


Collaborative Colleagues:
Charles L. A. Clarke: colleagues
Gordon V. Cormack: colleagues