Intelligent Technology for Detecting Text-Based Deepfakes Using Large Language Models

V. B. Mokin; B. Yu. Varer; S. M. Levitskyi

doi:10.31649/1997-9266-2024-172-1-110-120

Authors

V. B. Mokin, Vinnytsia National Technical University
B. Yu. Varer, Vinnytsia National Technical University
S. M. Levitskyi, Vinnytsia National Technical University

DOI:

https://doi.org/10.31649/1997-9266-2024-172-1-110-120

Keywords:

text deepfakes, misinformation, artificial intelligence, large language models, identification of synthesized texts, Kaggle, intelligent technology, chat-bots

Abstract

The rapid development of large language models in recent years has created a significant problem: a growing volume of synthesized texts in the information environment, which threatens to spread misinformation. Improving technologies for detecting such texts is therefore a relevant task.

This article proposes an intelligent technology for the automatic identification of texts generated by artificial intelligence, in particular by large language models. The research is based on an analysis of solutions from the "LLM - Detect AI Generated Text" competition on the Kaggle platform. For this purpose, a dataset was constructed from publicly available data, containing example texts of two classes: those written by humans and those generated by large language models. An exploratory data analysis was also conducted to show the main features of the prepared dataset.
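As a rough illustration of the dataset step described above, the sketch below merges human-written and LLM-generated texts into one labeled set, drops exact duplicates, and computes basic exploratory statistics. The sample texts, the `build_dataset` and `summarize` helpers, and the 0/1 label convention are assumptions made for illustration; they are not taken from the article.

```python
from collections import Counter
from statistics import mean


def build_dataset(human_texts, generated_texts):
    """Merge two sources into (text, label) pairs.

    Assumed label convention: 0 = written by a human, 1 = generated by an LLM.
    Exact duplicates (after whitespace normalization) are kept only once;
    if the same text appears in both sources, the first occurrence wins.
    """
    seen = set()
    dataset = []
    for label, texts in ((0, human_texts), (1, generated_texts)):
        for text in texts:
            normalized = " ".join(text.split())  # collapse runs of whitespace
            if normalized and normalized not in seen:
                seen.add(normalized)
                dataset.append((normalized, label))
    return dataset


def summarize(dataset):
    """Return per-class example counts and mean word counts."""
    counts = Counter(label for _, label in dataset)
    mean_words = {
        label: mean(len(text.split()) for text, lab in dataset if lab == label)
        for label in counts
    }
    return counts, mean_words


# Hypothetical sample texts, used only to exercise the helpers.
human = [
    "Cars should be limited in city centers.",
    "Cars should be limited  in city centers.",  # duplicate up to whitespace
]
generated = ["In conclusion, limiting car usage offers numerous benefits."]

data = build_dataset(human, generated)
counts, mean_words = summarize(data)
print(counts, mean_words)
```

Class balance and text length are only two of the features an exploratory analysis might report; the article's own analysis may cover others.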

The article analyzes popular solutions to the problem of identifying texts generated by large language models within the Kaggle competition. It formalizes the general structure of a solution and justifies the main factors that affect the accuracy of identifying texts generated by artificial intelligence. An algorithm was developed to increase the accuracy of the solution through pre-processing and post-processing operations, improvement of the training dataset, and optimized selection of models and of the method for ensembling them, among other measures. Experiments were conducted that demonstrate the effectiveness of the proposed intelligent technology.
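The general structure outlined above (pre-processing, several models, ensembling, post-processing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the two toy detectors, their weights, and all helper names are hypothetical, and a weighted average of scores stands in for whichever ensemble method the article actually uses.

```python
import re


def preprocess(text):
    """Pre-processing: collapse whitespace, trim, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def marker_detector(text):
    """Toy detector (hypothetical): share of stock phrases found in the text."""
    markers = ("in conclusion", "furthermore", "overall")
    return sum(m in text for m in markers) / len(markers)


def diversity_detector(text):
    """Toy detector (hypothetical): lower vocabulary diversity gives a higher score."""
    words = text.split()
    if not words:
        return 0.5  # no evidence either way
    return 1.0 - len(set(words)) / len(words)


def ensemble_score(text, detectors, weights):
    """Weighted average of detector scores on the pre-processed text."""
    clean = preprocess(text)
    total = sum(weights)
    score = sum(w * d(clean) for d, w in zip(detectors, weights)) / total
    # Post-processing: clip so the result is always a valid probability.
    return min(1.0, max(0.0, score))


detectors = [marker_detector, diversity_detector]
weights = [0.7, 0.3]  # assumed weights; in practice they would be tuned on validation data
sample = "In conclusion,  the plan is good. Furthermore, the plan is good overall."
score = ensemble_score(sample, detectors, weights)
print(round(score, 3))
```

In a real system the toy detectors would be replaced by trained models, and the weights would be selected by validation rather than fixed by hand.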

This research contributes to the development of technologies for combating misinformation and highlights the importance of finding new methods for detecting artificially created texts in the modern information environment.

Author Biographies

V. B. Mokin, Vinnytsia National Technical University

Dr. Sc. (Eng.), Professor, Head of the Chair of System Analysis and Information Technology

B. Yu. Varer, Vinnytsia National Technical University

Post-Graduate Student of the Chair of System Analysis and Information Technology

S. M. Levitskyi, Vinnytsia National Technical University

Post-Graduate Student of the Chair of System Analysis and Information Technology
