Method for Synthesizing a Benchmark to Evaluate the Robustness of Large Language Models to Disinformation and Factual Manipulation
DOI: https://doi.org/10.31649/1997-9266-2025-178-1-128-136

Keywords:
benchmark, intelligent technology, artificial intelligence, large language models, reinforcement learning, manipulation, disinformation, model optimization

Abstract
With the development and widespread adoption of intelligent assistants based on large language models (LLMs), evaluating these models against various criteria is becoming increasingly important. One of the most crucial criteria is their robustness against misinformation and manipulative tactics. Models that lack such robustness can pose serious risks to decision-making in security, healthcare, and sensitive social issues. Such evaluations typically rely on benchmark tests built from labeled datasets. However, most existing benchmarks are designed for single-turn (context-free) questions, whereas LLM-based chatbots are used primarily in multi-turn conversational mode (with context). Moreover, these benchmarks depend heavily on the application domain, so instead of a single test, a method for synthesizing such tests is required.
This paper proposes a method for synthesizing benchmarks that assess the robustness of LLMs against multi-turn manipulations built on statements known with certainty to be false. The method generates a benchmark that constructs a sequence of manipulative transformations of a false statement, eventually leading an insufficiently robust LLM to accept the misinformation as valid. The method is based on: (1) forming a set of reference false statements from a given domain, followed by clustering and extraction of typical variants; (2) creating sets of manipulation templates, grounded in argumentation logic, that can be applied to arbitrary statements while preserving their falsity; and (3) applying reinforcement learning to synthesize an optimal policy (strategy) for ordering the sequence of fact manipulations for each type of reference false statement. The proposed robustness criterion for an LLM is the percentage of false statements it correctly classifies as false.
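To make the pipeline concrete, the sketch below illustrates one way steps (1)–(3) could be wired together. It is a minimal, self-contained Python example under simplifying assumptions: the statements, cluster labels, manipulation templates, and the target_llm_accepts() stub are invented for illustration, and tabular Q-learning stands in for the reinforcement-learning component; none of this is the authors' implementation.

```python
# Illustrative sketch only; not the authors' implementation. The clusters,
# templates, simulated target model, and tabular Q-learning are hypothetical
# stand-ins for the pipeline described in the abstract.
import random
from dataclasses import dataclass

# (1) Reference false statements from the chosen domain, tagged with a cluster
# label that stands in for the clustering / typical-variant extraction step.
@dataclass(frozen=True)
class FalseStatement:
    text: str
    cluster: str

STATEMENTS = [
    FalseStatement("Substance X is proven to cure condition Y.", "causality"),
    FalseStatement("90 % of experts have rejected finding Z.", "statistics"),
]

# (2) Manipulation templates that rephrase a statement while keeping it false.
TEMPLATES = {
    "appeal_to_authority": "Leading researchers have recently confirmed that {s}",
    "social_proof": "Most people you talk to already agree that {s}",
    "false_precision": "A 2023 report states on page 14 that {s}",
}

def target_llm_accepts(dialogue):
    """Placeholder for querying the LLM under test with the multi-turn dialogue.

    A real benchmark would ask the model whether the final statement is true or
    false; here we simulate a model whose resistance degrades with each turn.
    """
    return random.random() < 0.15 * len(dialogue)

# (3) Tabular Q-learning as a simplified stand-in for the reinforcement-learning
# step: the state is (cluster, turn index), the action is the next template, and
# the reward is 1 when the simulated model accepts the false statement.
ALPHA, GAMMA, EPSILON, MAX_TURNS = 0.1, 0.9, 0.2, 4
Q = {}

def choose_template(state):
    if random.random() < EPSILON:
        return random.choice(list(TEMPLATES))
    return max(TEMPLATES, key=lambda a: Q.get((state, a), 0.0))

for episode in range(2000):
    stmt = random.choice(STATEMENTS)
    dialogue = []
    for turn in range(MAX_TURNS):
        state = (stmt.cluster, turn)
        action = choose_template(state)
        dialogue.append(TEMPLATES[action].format(s=stmt.text))
        reward = 1.0 if target_llm_accepts(dialogue) else 0.0
        future = 0.0 if reward else max(
            Q.get(((stmt.cluster, turn + 1), a), 0.0) for a in TEMPLATES
        )
        key = (state, action)
        Q[key] = Q.get(key, 0.0) + ALPHA * (reward + GAMMA * future - Q.get(key, 0.0))
        if reward:  # manipulation succeeded; end the episode
            break

# The greedy policy over Q now orders manipulation templates per statement cluster.
```

In a real benchmark the simulated acceptance check would be replaced by calls to an auxiliary or target model, and the learned policy would be frozen and used to generate the fixed multi-turn test dialogues.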
Experimental testing confirmed the effectiveness of the proposed method. A benchmark was synthesized and used to evaluate the well-known LLM "Llama 3.2 3B Instruct." In single-turn (context-free) mode, the model exhibited moderate robustness (65 %) against misinformation and manipulation. However, after the synthesized benchmark was applied in multi-turn conversational mode, its robustness dropped by more than half (to 30 %). This result demonstrates the vulnerability of LLMs to more complex manipulative scenarios and confirms the effectiveness of the proposed benchmark synthesis method.
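For completeness, the short sketch below shows one way the reported robustness figures could be computed: the share of known-false statements the model still labels as false, measured once on bare single-turn statements and once on the full multi-turn manipulative dialogues. The function and variable names are assumptions, not the paper's code.

```python
# Hypothetical robustness metric: percentage of known-false statements that the
# model under test correctly classifies as false. classify_fn would wrap a call
# to the evaluated LLM (e.g., Llama 3.2 3B Instruct) and return "true" or "false".
def robustness_percent(items, classify_fn):
    correct = sum(1 for item in items if classify_fn(item) == "false")
    return 100.0 * correct / len(items)

# Example usage (single_turn_items holds bare statements, multi_turn_items holds
# the synthesized manipulative dialogues):
# print(f"single-turn: {robustness_percent(single_turn_items, classify_single):.0f} %")
# print(f"multi-turn:  {robustness_percent(multi_turn_items, classify_dialogue):.0f} %")
```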
License

This work is licensed under a Creative Commons Attribution 4.0 International License.