TY - GEN
T1 - Vector representation of words for plagiarism detection based on string matching
AU - Baba, Kensuke
AU - Nakatoh, Tetsuya
AU - Minami, Toshiro
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Number JP15K00310.
Publisher Copyright:
© Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
N2 - Plagiarism detection in documents requires an appropriate definition of document similarity and an efficient way to compute it. This paper evaluates the validity of using vector representations of words to define document similarity, in terms of processing time and accuracy in plagiarism detection. It proposes a plagiarism detection algorithm based on the score vector weighted by vector representations of words. The score vector between two documents represents the number of matches between corresponding words for every possible gap between the starting positions of the documents. The score vector and its weighted version can be computed efficiently using convolutions. Two types of vector representation of words are evaluated with the proposed algorithm: randomly generated vectors, and a distributed representation generated from training data by a neural network-based method. The experimental results show that using the weighted score vector instead of the normal one can reduce the processing time with a slight decrease in accuracy, and that the randomly generated vector representation is more suitable for the algorithm than the distributed representation in terms of the tradeoff between processing time and accuracy.
KW - Document similarity
KW - Plagiarism detection
KW - Score vector
KW - Text processing
KW - Vector representation of words
UR - http://www.scopus.com/inward/record.url?scp=85025178102&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85025178102&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-58524-6_28
DO - 10.1007/978-3-319-58524-6_28
M3 - Conference contribution
AN - SCOPUS:85025178102
SN - 9783319585239
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 341
EP - 350
BT - Human Interface and the Management of Information
A2 - Yamamoto, Sakae
PB - Springer Verlag
T2 - Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017
Y2 - 9 July 2017 through 14 July 2017
ER -