ABSTRACT
Weblogs, or blogs, have rapidly gained popularity over the past few years. In particular, the growth of business blogs, which are written by businesses and companies or provide commentary on them, opens up new opportunities for developing blog-specific search and mining techniques. In this paper, we propose probabilistic models for blog search and mining using two machine learning techniques, Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). We implement the models on our database of business blogs, with the aim of achieving higher precision and recall. The probabilistic model is able to segment the business blogs into separate topic areas, which is useful for keyword detection in the blogosphere. We also study various term-weighting schemes and factor values in detail, which reveals interesting patterns in our database of business blogs. From our study, we can uncover domain-driven data mining techniques that can strengthen business intelligence in complex enterprise applications.