ABSTRACT
In the realm of machine learning for text classification, TF-IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has been previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using BNS scaling. A wide variety of other feature representations, as well as binary features with no scaling, were later tested and found inferior. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for it.
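The abstract does not spell out the BNS formula, so the following is only a minimal sketch under stated assumptions. It uses the commonly cited definition of Bi-Normal Separation, |Φ⁻¹(tpr) − Φ⁻¹(fpr)|, where Φ⁻¹ is the inverse standard normal CDF and tpr and fpr are the fractions of positive and negative training documents containing the term. The function names `bns_weight` and `bns_weights`, the clipping constant `eps`, and the choice to represent documents as sets of terms are all illustrative assumptions, not details confirmed by this text.

```python
from collections import Counter
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bns_weight(tp, fp, pos, neg, eps=0.0005):
    """BNS weight for one term (sketch).

    tp / fp: number of positive / negative documents containing the term.
    pos / neg: total number of positive / negative documents.
    eps: assumed clipping bound that keeps the inverse CDF finite
         when a rate is exactly 0 or 1.
    """
    if pos <= 0 or neg <= 0:
        raise ValueError("need at least one positive and one negative document")
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def bns_weights(docs, labels):
    """Compute a BNS weight per term from labeled training documents.

    docs: iterable of term collections (one per document).
    labels: iterable of booleans, True for the positive class.
    Returns a dict mapping each term to its weight; unlike IDF, the
    weight depends on the class labels.
    """
    docs = [set(d) for d in docs]
    labels = [bool(y) for y in labels]
    pos = sum(labels)
    neg = len(labels) - pos
    tp, fp = Counter(), Counter()
    for terms, is_positive in zip(docs, labels):
        (tp if is_positive else fp).update(terms)
    vocab = set(tp) | set(fp)
    return {t: bns_weight(tp[t], fp[t], pos, neg) for t in vocab}
```

Under these assumptions, a term that appears equally often in both classes gets weight 0, while a term concentrated in one class gets a large weight. Scaling would then amount to multiplying each document's term feature by the corresponding weight in place of IDF.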