General Method of Forum Information Extraction

Rui Liu, Wentao Tan, Yuanbin Fu, Hong Wang.
Accepted by Journal of Chinese Computer Systems (Chinese) 2018.

Abstract: Classification and information extraction for online forums are two important techniques in web data mining. Traditional web page classification methods do not take full account of the structural features of URLs. They usually rely on features of the page content, so they are susceptible to noise, inefficient, and not general enough. Traditional information extraction methods rely on text density and layout structure and ignore the semantic information of the content, which makes it difficult for them to extract content effectively from a variety of forums. This paper proposes a clustering method based on URL structure (USC) and a filtering method based on keyword scoring (KSF). Both methods need to analyze only a small number of samples from the data set to derive general rules that meet the demands of large-scale extraction. On the same data set, the F-measure of the USC method is 18.99% higher than that of the traditional classification method, and the accuracy of the KSF method is 18.46% higher than that of the traditional information extraction method.
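
To make the idea of grouping pages by URL structure concrete, here is a minimal sketch in Python. It is not the USC method itself, whose details the abstract does not give. It only illustrates the general notion that forum URLs of the same page type tend to share a structural pattern. The function names `url_pattern` and `group_by_structure`, the `<num>` placeholder, and the example URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to a coarse structural signature (hypothetical scheme).

    Numeric path tokens are replaced by "<num>", and only the names of
    query parameters are kept, not their values.
    """
    parsed = urlparse(url)
    # Split the path on common delimiters, keeping the delimiters themselves.
    tokens = re.split(r"([/\-_.])", parsed.path)
    signature = "".join("<num>" if t.isdigit() else t for t in tokens)
    # Keep sorted query parameter names so parameter order does not matter.
    keys = sorted(p.split("=")[0] for p in parsed.query.split("&") if p)
    if keys:
        signature += "?" + "&".join(keys)
    return signature


def group_by_structure(urls):
    """Group URLs that share the same structural signature."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "http://bbs.example.com/thread-123-1-1.html",
        "http://bbs.example.com/thread-456-2-1.html",
        "http://bbs.example.com/forum-12-1.html",
        "http://bbs.example.com/viewthread.php?tid=9&page=2",
    ]
    for pattern, members in group_by_structure(sample).items():
        print(pattern, len(members))
```

In this sketch, the two thread pages fall into one group because they differ only in numeric identifiers, while the board listing page and the query-string page each form their own group. A real system would likely need a softer similarity measure than exact signature matching.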

Download: [PDF]