| Language classification using n-grams accelerated by FPGA-based Bloom filters |
| Source
|
Conference on High Performance Networking and Computing
Proceedings of the 1st international workshop on High-performance reconfigurable computing technology and applications: held in conjunction with SC07
Reno, Nevada
SESSION: Applications
Pages 31-37
Year of Publication: 2007
ISBN: 978-1-59593-894-7
|
|
Authors
|
|
Arpith Jacob
|
Washington University in St. Louis, St. Louis, Missouri
|
|
Maya Gokhale
|
Lawrence Livermore National Laboratory, Livermore, California
|
|
ABSTRACT
N-gram counting (tallying n-character sequences in text documents) is a well-established technique for classifying the language of a document's text. In this paper, n-gram processing is accelerated using reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels: parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm), which uses off-chip SRAM for lookups, our highly scalable implementation uses only on-chip memory blocks. Our end-to-end language classification runs at 85x the speed of comparable software and 1.45x the speed of the competing hardware design.
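The general idea named in the abstract, testing a document's n-grams against one Bloom filter per language and picking the language with the most hits, can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the authors' FPGA design: the names `BloomFilter`, `ngrams`, `train`, `classify`, and `SAMPLES` are hypothetical, the choice of n = 4, the filter size, and the BLAKE2-based hashing are assumptions, and the toy training strings stand in for language profiles that would normally be derived from a corpus.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions (sizes are assumptions)."""

    def __init__(self, num_bits=1 << 14, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed byte.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def ngrams(text, n=4):
    """Yield every n-character window of the text."""
    return (text[i:i + n] for i in range(len(text) - n + 1))


def train(samples, n=4):
    """Build one Bloom filter per language from sample text."""
    filters = {}
    for lang, text in samples.items():
        bloom = BloomFilter()
        for gram in ngrams(text.lower(), n):
            bloom.add(gram)
        filters[lang] = bloom
    return filters


def classify(text, filters, n=4):
    """Count n-gram hits per language and return (best language, hit counts)."""
    counts = {lang: 0 for lang in filters}
    for gram in ngrams(text.lower(), n):
        for lang, bloom in filters.items():
            if gram in bloom:
                counts[lang] += 1
    return max(counts, key=counts.get), counts


# Toy training text (hypothetical; a real profile would come from a corpus).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}

if __name__ == "__main__":
    label, hits = classify("the dog and the cat", train(SAMPLES))
    print(label, hits)
```

In this sketch each n-gram is checked against every language's filter in a sequential loop. The per-language lookups are independent of one another, which is the kind of work a hardware design could plausibly perform in parallel.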