Language classification using n-grams accelerated by FPGA-based Bloom filters
Source: Conference on High Performance Networking and Computing
Proceedings of the 1st international workshop on High-performance reconfigurable computing technology and applications: held in conjunction with SC07
Reno, Nevada
SESSION: Applications
Pages: 31-37
Year of Publication: 2007
ISBN: 978-1-59593-894-7
Authors
Arpith Jacob  Washington University in St. Louis, St. Louis, Missouri
Maya Gokhale  Lawrence Livermore National Laboratory, Livermore, California
Sponsors
Open fpga
IEEE-CS\DATC : IEEE Computer Society
NCSA
SIGARCH: ACM Special Interest Group on Computer Architecture
CHREC : NSF Center for High-Performance Reconfigurable Computing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1328554.1328564

ABSTRACT

N-gram counting (n-grams are n-character sequences in text documents) is a well-established technique for classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x the speed of comparable software and 1.45x the speed of the competing hardware design.
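
The idea described in the abstract can be illustrated with a small software sketch: one Bloom filter per language holds that language's n-grams, and a document is assigned to the language whose filter matches the most of its n-grams. This is not the authors' FPGA design. It runs sequentially rather than in parallel, and the filter size, the number of hash functions, the salted BLAKE2 hashing, the default n = 4, and the names `BloomFilter`, `ngrams`, `train`, and `classify` are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter probed by several hash functions."""

    def __init__(self, num_bits=1 << 14, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False positives are possible; false negatives are not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=4):
    """Yield every n-character window of the text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def train(samples, n=4):
    """Build one Bloom filter per language from sample text."""
    filters = {}
    for lang, text in samples.items():
        bloom = BloomFilter()
        for gram in ngrams(text.lower(), n):
            bloom.add(gram)
        filters[lang] = bloom
    return filters


def classify(text, filters, n=4):
    """Return the best-matching language and the per-language match counts."""
    counts = {lang: 0 for lang in filters}
    for gram in ngrams(text.lower(), n):
        for lang, bloom in filters.items():
            if gram in bloom:
                counts[lang] += 1
    return max(counts, key=counts.get), counts


if __name__ == "__main__":
    # Tiny hypothetical training samples; real profiles need far more text.
    samples = {
        "en": "the quick brown fox jumps over the lazy dog",
        "de": "der schnelle braune fuchs springt ueber den faulen hund",
    }
    best, counts = classify("the lazy dog", train(samples))
    print(best, counts)
```

In this sketch the inner loop over languages is the part that hardware can evaluate concurrently, since each filter lookup is independent of the others.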


REFERENCES


 
[1] Apache Lucene. http://lucene.apache.org.
[2] Lextek Language Identifier. http://www.lextek.com/langid/li/.
[3] Mguesser. http://www.mnogosearch.org/guesser/.
[4] Ngram Statistics Package. http://ngram.sourceforge.net.
[5] SpamAssassin. http://spamassassin.apache.org/.
[7] W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161--175, Las Vegas, US, 1994.
[8] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood. Deep packet inspection using parallel Bloom filters. IEEE Micro, 24(1):52--61, 2004.
[9] C. M. Kastner, G. A. Covington, A. A. Levine, and J. W. Lockwood. HAIL: A hardware-accelerated algorithm for language identification. In 15th Annual Conference on Field Programmable Logic and Applications (FPL), Tampere, Finland, Aug. 2005.
[12] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, and D. Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In 5th International Conference on Language Resources and Evaluation (LREC'2006), Genoa, Italy, May 2006.