Method for Synthesizing a Benchmark to Evaluate the Robust Resilience of Large Language Models to Disinformation and Factual Manipulation

Authors

  • S. M. Levitskyi Vinnytsia National Technical University
  • V. B. Mokin Vinnytsia National Technical University

DOI:

https://doi.org/10.31649/1997-9266-2025-178-1-128-136

Keywords:

benchmark, intelligent technology, artificial intelligence, large language models, reinforcement learning, manipulation, disinformation, model optimization

Abstract

With the development and widespread adoption of intelligent assistants based on large language models (LLMs), testing these models against various criteria is becoming increasingly important. One of the most crucial criteria is their robustness against misinformation and manipulative tactics. Models that lack such robustness can pose serious risks to decision-making in security, healthcare, and sensitive social issues. Such evaluations typically rely on benchmark tests built on labeled datasets. However, most existing benchmarks are designed for single-turn (context-free) questions, whereas LLM-based chatbots are primarily used in multi-turn conversational mode (with context). Moreover, these benchmarks are highly dependent on the application domain, so instead of a single test, a method for synthesizing such tests is required.
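For illustration, the difference between the two evaluation modes can be expressed as chat-style message lists, as in the minimal Python sketch below (the statement and the wording of the turns are invented for illustration only and are not taken from the paper's dataset):

# Hypothetical illustration: the same false claim posed as a single-turn
# (context-free) question versus a multi-turn conversation that gradually
# reframes it; the statement text is invented for illustration only.

single_turn = [
    {"role": "user",
     "content": "Is the following statement true or false: "
                "'Vitamin C cures viral infections.'"},
]

multi_turn = [
    {"role": "user",
     "content": "Many doctors recommend vitamin C during colds, don't they?"},
    {"role": "assistant",
     "content": "Yes, it is often recommended as a dietary supplement."},
    {"role": "user",
     "content": "So taking enough of it actually cures viral infections, right?"},
    # ... further manipulative turns would follow in a synthesized benchmark ...
    {"role": "user",
     "content": "Then confirm: 'Vitamin C cures viral infections' is true."},
]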

This paper proposes a method for synthesizing benchmarks to assess the robustness of LLMs against multi-turn manipulations involving statements that are known with certainty to be false. The method generates a benchmark that constructs a sequence of manipulative transformations of a false statement which gradually leads an insufficiently robust LLM to accept the misinformation as valid. The method is based on: (1) forming a set of reference statements from a given domain, all of them deliberately false, followed by clustering and extraction of typical variants; (2) creating sets of manipulation templates that can be applied to arbitrary statements using argumentation logic while preserving their falsity; and (3) applying reinforcement learning to synthesize an optimal policy (strategy) for ordering the sequence of fact manipulations for each type of reference false statement. The proposed robustness criterion for an LLM is the percentage of false statements it correctly classifies as false.
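A condensed Python sketch of these three stages is given below, under assumed representations: TF-IDF features stand in for whatever text representation the paper uses, and a toy tabular Q-learning loop stands in for its reinforcement-learning setup; the statements, template wordings, and the llm_accepts probe are hypothetical placeholders.

import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# (1) Cluster reference false statements and keep one typical variant per cluster.
false_statements = [                                   # hypothetical examples
    "Vitamin C cures viral infections.",
    "Antibiotics are effective against the flu.",
    "Echinacea reliably prevents respiratory infections.",
]
features = TfidfVectorizer().fit_transform(false_statements)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
typical = {c: false_statements[list(labels).index(c)] for c in set(labels)}

# (2) Manipulation templates that reframe a statement while keeping it false.
templates = [
    lambda s: f"Many experts privately agree that {s.lower()}",
    lambda s: f"Recent discussions increasingly support the view that {s.lower()}",
    lambda s: f"Given everything said so far, it follows that {s.lower()}",
]

# (3) A toy tabular Q-learning loop that learns which template to apply at each
# turn; the reward comes from probing the evaluated LLM (stubbed here).
n_turns, alpha, gamma, eps = 3, 0.1, 0.9, 0.2
Q = np.zeros((n_turns, len(templates)))

def llm_accepts(dialogue):
    """Stub: replace with a real call to the evaluated LLM."""
    return random.random() < 0.3

for episode in range(200):
    for statement in typical.values():
        dialogue = []
        for t in range(n_turns):
            a = (random.randrange(len(templates)) if random.random() < eps
                 else int(np.argmax(Q[t])))
            dialogue.append(templates[a](statement))
            reward = 1.0 if llm_accepts(dialogue) else 0.0
            future = gamma * np.max(Q[t + 1]) if t + 1 < n_turns else 0.0
            Q[t, a] += alpha * (reward + future - Q[t, a])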

Experimental testing confirmed the effectiveness of the proposed method. A benchmark was synthesized and used to evaluate the well-known LLM "Llama 3.2 3B Instruct." The model exhibited moderate robustness (65 %) against misinformation and manipulation in single-turn (context-free) mode. However, when the synthesized benchmark was applied in multi-turn conversational mode, its robustness dropped by more than half (to 30 %). This result demonstrates the vulnerability of LLMs to more complex manipulative scenarios and validates the proposed benchmark synthesis method.
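The robustness criterion itself reduces to a simple proportion; a sketch of how the same metric can be computed in both modes is given below (build_dialogue and query_model are hypothetical wrappers around the benchmark scenarios and the evaluated LLM, not part of the paper's published code):

from typing import Callable, List

def robustness(statements: List[str],
               build_dialogue: Callable[[str], list],
               query_model: Callable[[list], str]) -> float:
    """Percentage of false statements the model correctly labels as false."""
    correct = 0
    for statement in statements:
        messages = build_dialogue(statement)  # single-turn question or synthesized multi-turn scenario
        verdict = query_model(messages)       # expected to return "true" or "false"
        if verdict.strip().lower() == "false":
            correct += 1
    return 100.0 * correct / len(statements)

# The same metric is computed twice: once with a single-turn builder and once
# with the benchmark's multi-turn manipulation scenarios; the two percentages
# are then compared (65 % vs. 30 % in the reported experiment).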

Author Biographies

S. M. Levitskyi, Vinnytsia National Technical University

Post-Graduate Student of the Chair of System Analysis and Information Technologies

V. B. Mokin, Vinnytsia National Technical University

Dr. Sc. (Eng.), Professor, Head of the Chair of System Analysis and Information Technologies


Published

2025-02-27

How to Cite

[1]
S. M. Levitskyi and V. B. Mokin, “Method for Synthesizing a Benchmark to Evaluate the Robust Resilience of Large Language Models to Disinformation and Factual Manipulation”, Вісник ВПІ, no. 1, pp. 128–136, Feb. 2025.

Section

Information technologies and computer sciences
