ANATOMY OF A LITERARY TEXT: A PERSPECTIVE THROUGH THE LENS OF DIGITAL PHILOLOGY (A CASE STUDY OF T. DREISER’S “THE FINANCIER”)
DOI: https://doi.org/10.32782/2412-933X/2025-XXIV-19
Keywords: topic modeling, Natural Language Processing, Latent Dirichlet Allocation, Bidirectional Encoder Representations from Transformers, quantitative analysis, semantic structure, literary text
Abstract
The article explores the potential of topic modeling as a tool for the quantitative analysis of literary texts in the context of contemporary digital philology. The aim of the study is to compare the effectiveness of two approaches to automatic topic extraction in literary discourse: the classical Latent Dirichlet Allocation (LDA) algorithm and the modern BERTopic model, which relies on contextual vector representations and is capable of capturing deeper semantic relationships within the text. The material of the study is Theodore Dreiser’s novel “The Financier”, on the basis of which a corpus for topic analysis was created. The research discusses text preprocessing strategies, including standard cleaning procedures, lemmatization, and part-of-speech filtering using NLP tools. The text was segmented, and optimal parameters were selected for each model. During modeling, the number of extracted topics, their coherence, semantic richness, interpretability, and correspondence to the literary content were analyzed. Special attention was given to model tuning methods (determining the number of topics, part-of-speech filtering, handling of proper nouns), visualization of results for improved interpretation, and the impact of preprocessing on topic quality. The comparative analysis revealed that BERTopic captures deeper semantic connections than LDA and, in most cases, generates topics that more accurately reflect the text’s semantic dominants. It demonstrates a higher capacity for constructing semantically coherent thematic structures while preserving contextual relationships between words. The results of the study may be used for further exploration of the semantic structure of literary works and for developing methods of automated literary discourse analysis. The prospect of further research is seen in conducting a comparative topic analysis of multiple novels by the same author to trace the evolution of semantic dominants within the author’s style.
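The segmentation and coherence steps mentioned above can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not state the segment length or name the coherence measure, so fixed-size token windows and a UMass-style score are used here purely for illustration. The function names `segment` and `umass_coherence` are hypothetical, and the toy tokens are invented rather than taken from the novel.

```python
import math


def segment(tokens, size):
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def umass_coherence(top_words, segments):
    """UMass-style coherence of one topic from segment-level co-occurrence.

    ASSUMPTION: the study's actual coherence measure is not specified;
    this measure is chosen only as an example.
    """
    doc_sets = [set(seg) for seg in segments]

    def doc_freq(*words):
        # Number of segments that contain every one of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word absent from the corpus: skip to avoid division by zero
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Toy, invented tokens (assumed already lowercased and lemmatized).
tokens = "bank money loan bank money stock street house carriage".split()
segments = segment(tokens, 3)

print(umass_coherence(["bank", "money"], segments))
print(umass_coherence(["bank", "street"], segments))
```

In this toy corpus, a topic whose top words share segments ("bank", "money") scores higher than one whose words never co-occur ("bank", "street"). That is the intuition behind using coherence to compare the topics produced by different models.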