A DATA-DRIVEN DOCUMENT SIMILARITY MEASURE BASED ON CLASSIFICATION ALGORITHMS

Authors

  • Su Gon Cho Korea University
  • Seoung Bum Kim Korea University

DOI:

https://doi.org/10.23055/ijietap.2017.24.3.2451

Keywords:

classification, sentence-term matrix, document similarity measure, text mining

Abstract

Measuring document similarity is fundamental to many text mining applications. This paper proposes a new method, based on classification algorithms, for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of each term in a collection of sentences, is constructed and used to measure the classification accuracy between the two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, which should lead to low classification accuracy between them. In comparative experiments against several widely used document similarity measures, results on real data from the Machine Learning Repository at the University of California, Irvine demonstrate that the proposed method outperformed the existing similarity measures across the entire range of term selection filters.
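
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier, a validation scheme, or a tokenizer, so the multinomial naive Bayes classifier, the leave-one-out evaluation, the regex-based sentence splitting, and all function names below are assumptions chosen only for illustration. The sketch builds a sentence-term matrix from two texts, labels each sentence by its source text, and reports how accurately the two sources can be told apart; lower accuracy suggests more similar texts.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split text into lowercase token lists, one per sentence (assumed tokenizer)."""
    sentences = re.split(r"[.!?]+", text)
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    return [tokens for tokens in tokenized if tokens]


def sentence_term_matrix(sentences, vocabulary):
    """Rows are sentences; columns are term frequencies over the vocabulary."""
    rows = []
    for tokens in sentences:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return rows


def nb_predict(train_rows, train_labels, row, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing (assumed classifier)."""
    n_terms = len(row)
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_labels)):
        class_rows = [r for r, y in zip(train_rows, train_labels) if y == label]
        term_totals = [sum(col) for col in zip(*class_rows)]
        total = sum(term_totals)
        # Log prior plus log likelihood of the terms present in the sentence.
        score = math.log(len(class_rows) / len(train_rows))
        for count, term_total in zip(row, term_totals):
            if count:
                score += count * math.log(
                    (term_total + alpha) / (total + alpha * n_terms)
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def classification_accuracy(text_a, text_b):
    """Leave-one-out accuracy of telling sentences of text_a from those of text_b."""
    sents_a, sents_b = split_sentences(text_a), split_sentences(text_b)
    sentences = sents_a + sents_b
    labels = [0] * len(sents_a) + [1] * len(sents_b)
    vocabulary = sorted({t for tokens in sentences for t in tokens})
    matrix = sentence_term_matrix(sentences, vocabulary)
    correct = 0
    for i, (row, label) in enumerate(zip(matrix, labels)):
        # Hold out sentence i, train on the rest, and check the prediction.
        train_rows = matrix[:i] + matrix[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        correct += nb_predict(train_rows, train_labels, row) == label
    return correct / len(matrix)


if __name__ == "__main__":
    animals = "Cats chase mice. Cats like milk. Mice fear cats."
    finance = "Stocks fell sharply. Bonds rose today. Stocks and bonds diverged."
    print("different topics:", classification_accuracy(animals, finance))
    print("identical texts: ", classification_accuracy(animals, animals))
```

In this toy run, the two texts on unrelated topics are separated more accurately than a text compared with a copy of itself, which matches the intuition stated in the abstract. How the accuracy is converted into a final similarity score, and which term selection filters are applied, are left open here because the abstract does not specify them.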

Author Biographies

Su Gon Cho, Korea University

Ph.D. Candidate in the Department of Industrial Management Engineering 

Seoung Bum Kim, Korea University

Professor in the Department of Industrial Management Engineering 

Published

2017-10-29

How to Cite

Cho, S. G., & Kim, S. B. (2017). A DATA-DRIVEN DOCUMENT SIMILARITY MEASURE BASED ON CLASSIFICATION ALGORITHMS. International Journal of Industrial Engineering: Theory, Applications and Practice, 24(3). https://doi.org/10.23055/ijietap.2017.24.3.2451

Issue

Section

Data Sciences and Computational Intelligence