Automated Approach for Dating English Text Using Transformer Neural Networks
DOI: https://doi.org/10.31649/1997-9266-2025-180-3-133-139

Keywords: software, natural language processing (NLP), machine learning, transformer neural networks (TNN), transfer learning, BERT, text dating, stylometry, historical text analysis

Abstract
The paper examines existing methods of text dating using neural networks, highlighting their advantages and limitations. Text dating is a crucial task in fields such as history, archival studies, linguistics, and forensic science: accurately determining when a document was created can help verify its authenticity, establish authorship, and detect forgeries. Traditional methods based on stylometric or statistical approaches, however, often lack accuracy, especially on large volumes of text data. This study proposes an approach for dating English-language texts using transformer neural networks. The model achieves an accuracy of 85% within a 30-year range for texts written between the 15th and 20th centuries, outperforming existing models applied to English text. The core idea of the proposed automated approach is to use transfer learning to fine-tune a pre-trained transformer neural network, optimizing it for the classification of text fragments by decade. One key advantage of this approach is the transformer architecture itself, whose self-attention mechanism effectively captures complex relationships within a text. Another significant benefit is transfer learning, which reduces training time and computational resources compared to training a model from scratch. The approach was implemented in Python using the transformers library for training and testing the neural network, the datasets library for handling the corpus, and numpy for numerical computations. Experimental results demonstrated high accuracy: 86% within a 30-year range and 73% within a 20-year range on the test dataset. For the 19th and 20th centuries the model achieved accuracies of 89% and 90%, respectively, while accuracy for earlier centuries was lower, averaging around 30%. The research also examines the possibility of identifying features that tie a text to a specific period by extracting the words with the highest attention scores. Future research will focus on improving accuracy for underrepresented historical periods by expanding and refining the dataset. Further gains may come from optimizing model hyperparameters and experimenting with alternative neural network architectures. Another direction for future research is to explore methods for identifying the linguistic or stylistic features that mark a text as belonging to a certain historical period, making the neural network's results more interpretable for the user. The proposed approach has potential applications in historical research, document authentication, plagiarism detection, literary studies, and forensic analysis.
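To make the fine-tuning step concrete, the following is a minimal sketch of how a pre-trained BERT can be adapted for decade classification with the transformers and datasets libraries. The checkpoint name, the number of decade classes, and the toy in-memory dataset are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal fine-tuning sketch: adapting pre-trained BERT to classify text
# fragments by decade. Checkpoint, label count, and data are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

NUM_DECADES = 60  # assumption: 15th-20th centuries as 10-year classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_DECADES)

# Toy stand-in for the Project Gutenberg fragments used in the paper.
data = Dataset.from_dict({
    "text": ["It is a truth universally acknowledged...",
             "Whan that Aprille with his shoures soote..."],
    "labels": [36, 0],  # hypothetical decade indices
})

def tokenize(batch):
    # Truncate each fragment to BERT's 512-token input limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="decade-bert", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```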
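The reported "accuracy within an N-year range" can be computed from decade-level predictions with numpy. The sketch below assumes the plain reading that a prediction counts as correct when its decade lies at most N years from the true decade; the paper's exact windowing convention may differ.

```python
# Tolerance-based accuracy over decade indices (assumed metric definition).
import numpy as np

def accuracy_within_years(pred_decades, true_decades, years):
    """Fraction of predictions whose decade is at most `years` from the truth."""
    pred = np.asarray(pred_decades)
    true = np.asarray(true_decades)
    return float(np.mean(np.abs(pred - true) * 10 <= years))

# Example: predictions off by 0, 20, and 40 years against a 30-year window.
print(accuracy_within_years([10, 12, 14], [10, 10, 10], 30))  # -> 0.666...
```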
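For the interpretability experiment, words with the highest attention scores can be surfaced roughly as follows. This sketch reuses `model` and `tokenizer` from the fine-tuning sketch above and averages the attention each token receives across heads and query positions in the final layer; it is one common heuristic, not necessarily the paper's exact procedure.

```python
# Heuristic sketch: rank tokens by the attention they receive in the final
# layer; `model` and `tokenizer` come from the fine-tuning sketch above.
import torch

text = "Thou art as fair as any rose that bloometh in the spring"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions[-1] has shape (batch, heads, queries, keys) for the last layer;
# average over heads and query positions to score each token as a key.
received = out.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for idx in received.topk(5).indices.tolist():
    print(tokens[idx], round(received[idx].item(), 4))
```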
References
Y. Assael, T. Sommerschield, et al., "Restoring and attributing ancient texts using deep neural networks," Nature, vol. 603, pp. 280-283, 2022. https://doi.org/10.1038/s41586-022-04448-z
S. Vashishth, S. S. Dasgupta, S. N. Ray, and P. Talukdar, "Dating Documents using Graph Convolution Networks," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 2018, pp. 1605-1615. https://doi.org/10.18653/v1/P18-1149
F. Wahlberg, T. Wilkinson, and A. Brun, "Historical Manuscript Production Date Estimation Using Deep Convolutional Neural Networks," in Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016. https://doi.org/10.1109/ICFHR.2016.0048
O. Hellwig, "Dating Sanskrit texts using linguistic features and neural networks," 2019. [Electronic resource]. Available: https://www.academia.edu/53885816/Dating_Sanskrit_texts_using_linguistic_features_and_neural_networks
A. Vaswani, et al., "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 2017, pp. 6000-6010. [Electronic resource]. Available: https://dl.acm.org/doi/10.5555/3295222.3295349
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2018. https://doi.org/10.48550/arXiv.1810.04805
Project Gutenberg — English Language eBooks. [Electronic resource]. Available: https://huggingface.co/datasets/sedthh/gutenberg_english
