Integrating State-of-the-art NLP Tools into Existing Methods to Address Current Challenges in Plagiarism Detection
Ughakepen Thompson, Victor (2023) Integrating State-of-the-art NLP Tools into Existing Methods to Address Current Challenges in Plagiarism Detection. Doctoral thesis, UNSPECIFIED.
Item Type: Thesis (Doctoral)
Abstract
Paraphrase plagiarism occurs when text is deliberately obfuscated to evade detection. Such deliberate alteration increases the complexity of plagiarism and makes it harder to detect. In paraphrase plagiarism, copied texts often contain few or no matching words, and conventional plagiarism detectors, most of which are designed to detect matching strings, are ineffective under such conditions. The problem of plagiarism detection has been widely researched in recent years, with significant progress made, particularly through the PAN@CLEF competition on plagiarism detection. However, further research is required, specifically in the area of paraphrase and translation (obfuscation) plagiarism detection, as studies show that the state of the art is unsatisfactory. A rational solution to the problem is to apply models that detect plagiarism using semantic features in texts rather than matching strings. Deep contextualised learning models (DCLMs) can learn deep textual features that can be used to compare texts for semantic similarity. They have been remarkably effective in many natural language processing (NLP) tasks, but have not yet been tested in paraphrase plagiarism detection. The second problem facing conventional plagiarism detection is translation plagiarism, which occurs when copied text is translated into a different language, sometimes paraphrased, and used without acknowledging the original sources. The most common method for detecting cross-lingual plagiarism (CLP) requires internet translation services, which limits the detection process in many ways. A rational solution to this problem is to use detection models that do not rely on internet translation services. In this thesis we addressed these ongoing challenges facing conventional plagiarism detection by applying some of the most advanced methods in NLP, which include contextualised and non-contextualised deep learning models.
To address the problem of paraphrase plagiarism, we proposed a novel paraphrase plagiarism detector that integrates deep contextualised learning (DCL) into a generic plagiarism detection framework. Evaluation results revealed that our proposed paraphrase detector outperformed a state-of-the-art model and a number of standard baselines in the task of paraphrase plagiarism detection. With respect to cross-lingual plagiarism detection (CLPD), we proposed a novel multilingual translation model (MTM) based on the Word2Vec (word embedding) model that can effectively translate text across a number of languages. It is independent of the internet and performs comparably to, and in many cases better than, a common cross-lingual plagiarism detection model that relies on an online machine translator. The MTM does not require parallel or comparable corpora and is therefore designed to resolve the problem of CLPD in low-resource languages. The solutions provided in this research advance the state of the art and contribute to the existing body of knowledge in plagiarism detection, and would also have a positive impact on academic integrity, which has been under threat from plagiarism for some time.
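The contrast drawn in the abstract between matching strings and comparing semantic features can be illustrated with a minimal sketch. This is not the detector or the MTM described in the thesis: the word vectors below are hypothetical, hand-made values chosen only for illustration, and the helper names (`jaccard`, `mean_vector`, `cosine`) are assumptions introduced here. A real system would obtain its vectors from a trained model such as Word2Vec or a contextualised encoder.

```python
import math

# Hypothetical toy 3-dimensional word vectors, invented for illustration only.
# They are NOT taken from the thesis or from any trained model.
TOY_VECTORS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.88, 0.12, 0.0),
    "fast": (0.1, 0.9, 0.0),
    "quick": (0.12, 0.88, 0.0),
    "banana": (0.0, 0.0, 1.0),
    "yellow": (0.0, 0.1, 0.9),
}


def jaccard(tokens_a, tokens_b):
    """String-matching baseline: shared words divided by all distinct words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


def mean_vector(tokens):
    """Represent a text as the component-wise average of its word vectors."""
    vectors = [TOY_VECTORS[token] for token in tokens]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


source = ["fast", "car"]
paraphrase = ["quick", "automobile"]  # same meaning, no shared words
unrelated = ["yellow", "banana"]

overlap = jaccard(source, paraphrase)
sim_paraphrase = cosine(mean_vector(source), mean_vector(paraphrase))
sim_unrelated = cosine(mean_vector(source), mean_vector(unrelated))

print(f"word overlap with paraphrase:   {overlap:.2f}")
print(f"vector similarity (paraphrase): {sim_paraphrase:.2f}")
print(f"vector similarity (unrelated):  {sim_unrelated:.2f}")
```

In this toy setting the paraphrase shares no words with the source, so the string-matching score is zero, while the vector-based score stays high for the paraphrase and low for the unrelated text. Averaging static word vectors is the simplest possible choice and ignores word order and context, which is one motivation for contextualised models.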
Microsoft Word: REC DC - 16341.docx - Supplemental Material (138kB), restricted to repository staff only
PDF (Thesis): PhD thesis_with title page.pdf - Accepted Version (2MB)
More Information
Depositing User: Nicola Jackson
Identifiers
Item ID: 16341
URI: http://sure.sunderland.ac.uk/id/eprint/16341
Catalogue record
Date Deposited: 07 Jul 2023 14:07
Last Modified: 11 Jul 2023 08:00
Author: Victor Ughakepen Thompson
University Divisions
Collections > Theses