Workshop Day (19 October 2025)
Computational Stylometry for Deepening Scholarly Engagements in the Humanities: Intelligent Cyberinfrastructure Resources, Community-development Tools, and Learning Applications
Background: Stylometry and its relation to Computational Humanities
Although significant advances have been made in computational linguistics, natural language processing, and text mining, and a variety of associated applications have been demonstrated in the domain of computational humanities[1], the closely allied area of stylometry and its relevance to computational humanities have not received sufficient attention. Stylometry’s modern roots can be traced back to the late 15th century, when scholars relied on manual, labor-intensive methods of literary analysis and counting. It was not until the late 19th century that statistical methods started to be incorporated into stylometry, with quantitative analysis methods proposed and demonstrated as a means of author profiling [Calle-Martin & Miranda-Garcia, 2012]. In the mid-20th century, the development of computational stylometry was strongly influenced by early computational humanities projects such as the one jointly led by the Jesuit priest Roberto Busa and IBM, which created an index of all the works by St. Thomas Aquinas [Busa, 1951].
A stylome, as related to human authorship, has been described as a unique signature that identifies an author based on characteristic patterns or motifs that appear in the author’s writing [Hai-Jew, 2015]. Recently, authorship determination based on stylometric methods has gained new relevance and even urgency with the increase in fraudulently produced documents and with the growing popularity of publication forums that encourage “paper mill” enterprises [Odri and Yoon, 2023].
Stylometric methods, however, offer numerous other, more positive uses beyond their founding goal of fraud detection. Particularly interesting is the promise of analyzing creative artifacts, especially texts, to understand styles. Stylometry can be highly useful for unraveling the structure of core stylistic motifs, such as the use of figures of speech to produce a particular effect or impression.
For example, scholars have applied basic log-linear classifiers to recognize patterns of word repetition, called chiasmus, epanaphora, and epiphora, that are used to achieve different types of effects [Dubremetz and Nivre, 2018]. Another branch of stylometry analyzes the stylome at the level of the individual author and of the author cohort (or peer group), based on a variety of textual attributes in lexical, morphological, syntactic, and semantic categories. For example, in one study, individual authorship motifs were established in terms of the usage of non-content-bearing words such as “by,” “to,” and “or” and content-bearing words such as “war,” “innovation,” and “commonly” to distinguish between Madison’s and Hamilton’s contributions to the Federalist Papers [Holmes, 1998]. In stylometric studies that draw upon machine-learning methods, the text features are converted to numerical vectors, and supervised learning methods are then used to train classifiers that automatically categorize (or predict) the authorship or influences of individuals. One example of a supervised approach is the application of support vector machines (SVMs) to a set of collaboratively produced texts, with the goal of identifying shifts in authorial influence across individual works [Maciej, 2016]. With similar vector representations of stylistic features, unsupervised methods such as clustering can detect author cohorts, that is, groups of authors with similar styles [Hai-Jew, 2015]. It is expected that authors from relatively narrow time intervals will demonstrate certain similarities in style. However, a study covering a long historical interval has shown that the length of such homogeneous periods has gradually shortened in the recent era, due to the volume and diversity of scholarly texts produced now as compared to the past [Hughes et al., 2012].
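The function-word approach to authorship described above can be illustrated with a minimal Python sketch. It is not the method of any study cited here: the word list, the toy texts, and the function names (`profile`, `cosine`, `attribute`) are assumptions made for illustration, and plain cosine similarity stands in for the log-linear, SVM, and clustering methods mentioned in the text.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption, not taken from any cited study).
FUNCTION_WORDS = ["by", "to", "and", "or", "the", "of", "upon", "on"]

# Toy texts invented for this sketch; real studies use full-length documents.
AUTHOR_A = "upon the whole, upon reflection, the matter rests upon the judgment of the people"
AUTHOR_B = "the power to tax and to spend and to borrow belongs to the states or to the union"
DISPUTED = "upon this point the answer depends upon the facts"


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(disputed, candidates):
    """Return the candidate whose function-word profile is closest to `disputed`."""
    target = profile(disputed)
    return max(candidates, key=lambda name: cosine(target, profile(candidates[name])))


print(attribute(DISPUTED, {"Author A": AUTHOR_A, "Author B": AUTHOR_B}))
```

On these toy texts the disputed passage is attributed to "Author A", because both rely on “upon” and “the” rather than “to” and “and”. The same frequency vectors could, in principle, be passed to a supervised classifier or a clustering routine in place of the nearest-profile rule used here.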
Large-scale applications of stylometry such as the study by Hughes et al. have become possible due to the creation and availability of big data sets and open resources, for example Project Gutenberg, which offer the opportunity to analyze the evolution of authorship styles over time and shifting cultural trends.
With the advent of generative AI (GA) methods, computational stylometry has found a new value and application domain: detection of fraudulent synthetic content [Odri and Yoon, 2023]. Viewed from a positive angle, however, GA opens the possibility of engaging with ancient or classic texts in creative and exciting ways. Using GA, Martin Puchner, a Harvard scholar, has developed novel methods for engaging with classic texts and literature, based on dialogical interactions with historically critical figures, for example, Socrates, Aristotle, Nietzsche, Montaigne, Du Bois, and Virginia Woolf, who are represented as bots [Sachi, 2025]. The stylometric methods used to build such bots require identifying and incorporating textual attributes that make the dialog (i.e., the language of the speaker) age- or time-sensitive, indicate emphasis or de-emphasis, express emotions, and individualize the articulation based on the personality of the historical figure[2]. The attribute elucidation and specification required in training the bots offer new opportunities for students to understand important thinkers, philosophers, and writers more deeply and to advance computational stylometry methods.
Footnotes
[1] We will encourage one or more groups to conduct systematic reviews of the field of computational stylometry (or its major components).
[2] The stylometric attributes identified for the various critical humanities figures need to be made explicit and they need to be deliberately manipulated for such bots to be effective. The careful manipulation of the attributes associated with each humanities figure is important due to two factors: 1) to capture the “voice” or the personality of the humanities figure and 2) to personalize the experience for the interlocutors so that they remain “convinced” and engaged in the dialog.
Computational Stylometry: An Opportunity to Invigorate Interest in the Humanities
Surprisingly, at a time in history when we are witnessing rapid growth in the digital storytelling, gaming, and animation industries, we are also experiencing a significant drop in interest in the humanities and consequent reductions in humanities offerings in higher education [Schmidt, 2018]. There is a deep contradiction in this situation, which deserves a broader and more serious examination. Here, however, we have a more modest proposal. Based on a survey of recent advances, we see strong potential for encouraging learners to engage in the humanities by exposing them to cutting-edge computational stylometry methods. Therefore, we aim to explore the opportunities and barriers associated with applying computational stylometry methods to deepen interest and advance learning in the humanities.
The 2025 Documentsociety conference[1] has received commitments from a highly knowledgeable set of educators, scholars, and academic professionals who are engaged in humanities endeavors. Several of the scholars have a solid reservoir of knowledge of and interest in stylometry, and several key participants are involved in campus-wide projects that aim to develop stylometry applications for humanities research and learning (as part of university institutes, consortia, and programs/units in research libraries).
On October 19th, 2025, the day before the main event, we will hold a separate session, called Document Society Computational Stylometry (DCS), focused on methods, tools, and applications of stylometry for deepening engagement with and promoting learning in the humanities (i.e., the theme of this proposal). Beyond the current invitees, we are in contact with additional experts with strong backgrounds in stylometry and its applications in pedagogy, learning, and research, and we will add them to the roster for the DCS forum as panelists, speakers, and participants. A three-pronged approach will be taken to define the critical opportunities and challenges. The first dimension is tools and cyberinfrastructure for supporting computational stylometry learning. The second dimension is sustaining the initiative beyond the first DCS forum, in the form of a consortium focused exclusively on the ongoing nurturing and support of computational stylometry learning. The third dimension is the pedagogy and learning of computational stylometry.
The Oct. 19th, 2025, Document Society Computational Stylometry (DCS) meeting will be a full-day event. DCS will be widely advertised through key scholarly conference platforms and social media channels. All registered and invited participants will be asked to prepare a short, 2–3-page position paper and submit it about a month before the meeting. A selected set of the position papers will be featured as “keynote” presentations, and all participants will be given an opportunity to share their ongoing work on computational stylometry in a lightning round session. The day will conclude with two sessions: 1) an hour-long panel session that examines the three dimensions in more depth, with audience participation, and 2) a manuscript “workshopping” session in which each DCS participant who contributed a position paper and is interested in submitting a journal paper will receive feedback from experts on expanding their work. To make DCS a more engaging and relevant forum, we are exploring a potential special themed issue with the editors of Computational Humanities Research. A set of concrete areas and topics will form the scope of both DCS and the special themed issue. They are:
1. Methods and tools for defining features/attributes, annotating corpora and maintaining open computational stylometry data sets from specific humanities domains and associated use-cases. Methods based on human expertise, machine learning or GA, and hybrid approaches will be considered.
2. Secure and scalable cyberinfrastructure for supporting online development, testing, sharing, and open publishing of computational stylometry software and data.
3. Establishment of “gold standard” training data sets and metric-driven evaluation protocols for computational stylometry that are audited and maintained through fully automated and semi-automated means.
4. Development and deployment of “community” tools for exchanging computational stylometry data, software, and information among learners engaged in computational stylometry projects or courses.
5. Landscape analyses, systematic reviews, or surveys of state-of-the-art computational stylometry methods[1] and their applications in humanities domains.
6. Learning and/or pedagogical strategies for integrating computational stylometry into humanities curricula in university-level courses.
Main Conference (20 October 2025)
Documentality as a Lens for Analyzing Scholarly Practices in the age of AI: Perspectives from the Humanities, Social Sciences, and Information Science
Goals of the Event: Examine Three Aspects of Documentality
The critical processes carried out during the creation of documents and the processes executed to engage with documents can be succinctly described as documentality. A major goal of the planned event is to consider three primary aspects of documentality from the perspectives of adoption, use, and manipulation of digital platforms in handling documents. We describe the three critical aspects below, drawing upon the interplay among the critical areas in the humanities, social sciences, and information science fields.
Representation
How is the representation of a document’s content interpreted and how does the content representation influence its receiver?
The core issues from the humanistic and social science contexts that are relevant here have to do with style, creativity, authenticity, authority, and trust. Areas such as poesis and hermeneutics from the humanities, stylometry and knowledge classification from information science, and semiotics from the social sciences can expand our understanding of the representational aspects of documentality.
Coordination
What are the organizational and human coordination-level activities that are affected by and in turn affect documents?
The human-centric disciplinary perspectives that are of concern here have to do with socialization and the social dimensions of document use, and with frameworks and theories for understanding the agency and power associated with documentality[4]. From information science, the emerging and growing areas of computer-supported collaborative work (CSCW) and human-information interaction could provide helpful concepts and frameworks for understanding coordination.
Transformation
When and why do humans transform objects into meaning-bearing or emotion-generating artifacts?
A closely associated question is how humans interpret objects as documents and what role context plays. With regard to the former question, areas such as anthropology, archaeology, science & technology studies, and library and archival practices associated with specialized scholarly collections (e.g., geological or archaeological evidence) could be highly beneficial in identifying answers. The obvious human-centric discipline that can expand our understanding of the second question is psychology (particularly cognitive psychology). Other areas, such as 3-D rendering, virtual and augmented reality, chatbot design, and digital twinning, could aid in understanding transformations as they relate to digital platforms.
For the planned conference, we will be seeking short position papers, 2-3 pages long, which we will discuss at the meeting. Some of the key contributors will then be invited to develop their core ideas further into book chapters. The chapters will be collected in an edited volume (we are currently discussing a partnership with an academic publisher), which will be a key outcome of the meeting.