Vector representation of words for plagiarism detection based on string matching

Kensuke Baba, Tetsuya Nakatoh, Toshiro Minami

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Plagiarism detection in documents requires an appropriate definition of document similarity and an efficient way to compute it. This paper evaluates the validity of using vector representations of words to define document similarity, in terms of processing time and accuracy in plagiarism detection. It proposes a plagiarism detection algorithm based on the score vector weighted by vector representations of words. The score vector between two documents gives, for every possible offset between the starting positions of the documents, the number of matches between corresponding words. The score vector and its weighted version can be computed efficiently using convolutions. Two types of vector representation of words are evaluated with the proposed algorithm: randomly generated vectors, and a distributed representation generated from training data by a neural network-based method. The experimental results show that using the weighted score vector instead of the unweighted one can reduce the processing time at the cost of a slight decrease in accuracy, and that the randomly generated vector representation suits the algorithm better than the distributed representation in terms of the tradeoff between processing time and accuracy.
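The score vector described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`score_vector`, `weighted_score_vector`, `random_embedding`), the use of NumPy's FFT, and the choice of unit-norm random sign vectors are all assumptions made for illustration. The unweighted version runs one correlation per distinct word, while the weighted version runs one per embedding dimension.

```python
import numpy as np

def _cross_correlate(x, y):
    """Full cross-correlation via FFT; entry k pairs x[i] with y[i + k - (len(x) - 1)]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    size = len(x) + len(y) - 1
    # Correlation is convolution with one input reversed.
    fx = np.fft.rfft(x[::-1], size)
    fy = np.fft.rfft(y, size)
    return np.fft.irfft(fx * fy, size)

def score_vector(a, b):
    """Unweighted score vector: entry k counts i with a[i] == b[i + k - (len(a) - 1)]."""
    total = np.zeros(len(a) + len(b) - 1)
    for word in set(a) | set(b):
        # One correlation of 0/1 indicator sequences per distinct word.
        ind_a = [1.0 if w == word else 0.0 for w in a]
        ind_b = [1.0 if w == word else 0.0 for w in b]
        total += _cross_correlate(ind_a, ind_b)
    return np.rint(total).astype(int)

def weighted_score_vector(a, b, embed):
    """Weighted score vector: sums inner products of word vectors at each offset."""
    va = np.array([embed[w] for w in a])  # shape (len(a), dim)
    vb = np.array([embed[w] for w in b])  # shape (len(b), dim)
    total = np.zeros(len(a) + len(b) - 1)
    for j in range(va.shape[1]):
        # One correlation per embedding dimension instead of per word.
        total += _cross_correlate(va[:, j], vb[:, j])
    return total

def random_embedding(vocab, dim, seed=0):
    """Hypothetical helper: assigns each word a random sign vector of unit norm."""
    rng = np.random.default_rng(seed)
    return {w: rng.choice([-1.0, 1.0], size=dim) / np.sqrt(dim) for w in sorted(vocab)}
```

With `a = "the cat sat".split()` and `b = "a the cat sat down".split()`, the unweighted score vector peaks at the offset where `a` lines up with its copy inside `b`. With the random vectors, matching words contribute exactly 1 and non-matching words contribute noise that shrinks as `dim` grows, which is one way to read the tradeoff between processing time and accuracy mentioned in the abstract.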

Original language: English
Title of host publication: Human Interface and the Management of Information
Subtitle of host publication: Supporting Learning, Decision-Making and Collaboration - 19th International Conference, HCI International 2017, Proceedings
Editors: Sakae Yamamoto
Publisher: Springer Verlag
Pages: 341-350
Number of pages: 10
ISBN (Print): 9783319585239
DOIs
Publication status: Published - 2017
Externally published: Yes
Event: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017 - Vancouver, Canada
Duration: Jul 9 2017 to Jul 14 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10274 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017
Country/Territory: Canada
City: Vancouver
Period: 7/9/17 to 7/14/17

Keywords

  • Document similarity
  • Plagiarism detection
  • Score vector
  • Text processing
  • Vector representation of words

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
