Rolling Stylometry and Machine Learning Analyze Patterns in QAnon Texts

7 Dec 2024

Authors:

(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;

(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), Ecole nationale des chartes, Universite Paris, Sciences & Lettres.

Abstract and Introduction

Why work on QAnon? Specificities and social impact

Who is Q? The theories put to test

Authorship attribution

Results

Discussion

Corpus constitution

Quotes of authors outside of the corpus have been

Definition of two subcorpora: dealing with generic difference and an imbalanced dataset

The genre of “Q drops”: a methodological challenge

Detecting style changes: rolling stylometry

Ethical statement, Acknowledgements, and References

Detecting style changes: rolling stylometry

Collaborative writing is not necessarily easy to handle. The scenario in which authors simply took turns and divided the work between themselves is already complicated to address. But when the collaboration is more complex, especially when the various authors contribute together to the same passages, the style of the original authors can be hard to recognize. The collaboration then results in a new style that does not match the style of any of the individual authors (Kestemont et al., 2015).

The principle of rolling stylometry (Eder, 2016) is simple: rather than attributing a whole text at once, we arbitrarily decompose it into a series of overlapping smaller parts: from the 1st word to the 1000th, from the 2nd to the 1001st, and so on. Then, we attribute each of these parts to a certain author. We only have to define the length of these parts, and by how much they overlap.
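To make this windowing step concrete, here is a minimal Python sketch (the function name and the toy text are our own illustrative assumptions, not the SuperStyl code used in the study; the study itself uses 1000-word windows with a 200-word step, as described below):

def rolling_windows(words, size=1000, step=200):
    # Yield (start_index, window_text) for overlapping slices of `size` words,
    # advancing by `step` words each time.
    if len(words) <= size:
        yield 0, " ".join(words)
        return
    for start in range(0, len(words) - size + 1, step):
        yield start, " ".join(words[start:start + size])

# Toy usage: a 5000-token text yields 21 overlapping 1000-word windows.
words = ("example token " * 2500).split()
windows = list(rolling_windows(words))

Each window can then be attributed independently, and the succession of attributions traces potential style changes along the text.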

Rolling stylometry has been successfully implemented in a wide variety of settings. With Burrows’ delta, it has for instance been used to assess Ford’s claims about his involvement in collaborations with Joseph Conrad (Rybicki et al., 2014), to determine the beginning of Vostaert’s intervention in the Dutch Arthurian novel Roman van Walewein (van Dalen-Oskam and Van Zundert, 2007), or to understand Lovecraft’s and Eddy’s respective contributions to The Loved Dead (Gladwin et al., 2017). Using support-vector machines, rolling stylometry more recently helped to confirm Fletcher and Shakespeare’s collaboration on Henry VIII (Plecháč, 2020) or Molière and Corneille’s collaboration on Psyché (Cafiero and Camps, 2021).

Support Vector Machine

We choose to train linear Support Vector Classifiers (SVC) to identify the style of each potential candidate. The family of Support Vector Machine algorithms has been widely and successfully used for authorship attribution in a variety of settings and languages, and for very diverse sources, ranging from e-mails or blogs to Shakespeare plays (De Vel et al., 2001; Diederich et al., 2003; Mikros, 2012; Ouamour and Sayoud, 2012; Marukatat et al., 2014; Plecháč, 2020). At the PAN competition, a reference for digital text forensics and stylometry, it also served as a baseline for the “cross-domain authorship attribution” tasks the last times they were proposed, in 2018 (Stamatatos et al., 2018) and 2019 (Kestemont et al., 2019). Since the Q drops form a sort of “domain” of their own, our task can itself be considered a cross-domain authorship attribution task. Other classifiers could have been used, but they would not offer the interpretability we need for this kind of task. In such a delicate context, being able to get a simple and clear intuition of which features the classifier relies on is crucial. If the selected features turned out to be related to the content of the texts rather than to properly linguistic properties of a person’s discourse, we would need to be aware of it and train a new, more appropriate classifier.

To determine the choice of features and the size of the training samples, we are constrained by two antagonistic goals: the shorter the samples, the more detailed and precise the attribution results will be; yet the longer the samples, the more statistically reliable. In particular, authorship attribution has proven to require relatively high amounts of data, with a floor for reliable authorship attribution between 1000 and 3000 words, depending on genre and language (Eder, 2015, 2017). The question of sample length is also linked to the difficulty of the attribution task; cross-domain attribution with multiple candidates presents a challenge in this regard.

On the other hand, the features we retained, character 3-grams, could increase robustness, as they are known to reduce sparsity and to perform well in attribution studies (Kestemont, 2014; Sapkota et al., 2015). While punctuation can strongly reflect authorial signature (Sapkota et al., 2015), we were led to remove it because the variety of platforms from which the data were recovered could cause inconsistencies in the use of signs that can be encoded in different fashions, e.g., apostrophes.

For these reasons, we retain a setup that is a compromise between reliability and finer-grained analysis:

Sample length: 1000 words;

Features: character 3-grams (punctuation excluded, as explained above).
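As an illustration of this setup, the following sketch (our own, with hypothetical toy data; the study relies on the SuperStyl package rather than this exact code) builds character 3-gram features, trains a linear SVC, and shows how its coefficients can be read back as the 3-grams most indicative of each candidate, which is the kind of interpretability discussed above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the 1000-word training slices and their candidate labels.
samples = ["first training slice of candidate one ...",
           "a slice written by candidate two ...",
           "another slice from candidate three ..."]
labels = ["RonW", "PaulF", "MichaelF"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
classifier = LinearSVC()
pipeline = make_pipeline(vectorizer, classifier)
pipeline.fit(samples, labels)

# Interpretability: the largest positive coefficients of the linear model point
# to the character 3-grams most strongly associated with each candidate.
features = vectorizer.get_feature_names_out()
for author, coefs in zip(classifier.classes_, classifier.coef_):
    top = np.argsort(coefs)[-5:][::-1]
    print(author, [features[i] for i in top])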

To evaluate our setups, we opt for a leave-one-out cross-validation on the training corpus (Table 1), for a combination of reasons. First, some of the samples are relatively small in the context of training an efficient classifier; a leave-one-out procedure helps us use the maximal amount of data. The constitution of relevant and coherent training sets is also a question in our case. In that context, leave-one-out evaluation can provide a more robust estimate of the model’s performance, as it accounts for the potential variability caused by the specific sample chosen for testing. Finally, leave-one-out can help us avoid overfitting (Ng et al., 1997; Ghojogh and Crowley, 2019), which is a danger in our study: we do not want the classification to be based on some specific piece of news of particular interest to one of the candidates, but to obtain a reliable classification based on linguistic features appearing in a large range of contexts.
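A minimal scikit-learn sketch of this evaluation scheme (with random toy data standing in for the actual character 3-gram matrix and candidate labels) would be:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((30, 50))                          # 30 training samples, 50 features (toy)
y = np.repeat(["RonW", "PaulF", "MichaelF"], 10)  # toy candidate labels

# Each sample is held out once; the mean over folds is the leave-one-out accuracy.
scores = cross_val_score(LinearSVC(), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())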

The confusion matrix gives more information on the nature of the small number of classification errors (Table 2). As can be expected, performance is slightly lower for the authors for whom training material is very limited (ColemanR, CourtneyT, RogerS). For the others, it is above 95%, except for a few confusions between Michael F. and Roger S. (on the large corpus only). These can be explained by the limited size of Roger S.’s training data, by thematic attraction (he talks about Flynn more or less directly), but also probably by a generational (age) bias.

We then apply our models to all successive overlapping slices of the Q drops, arranged in chronological order, with a window size of 1000 words and a step of 200 words. We then plot the resulting decision functions for each classifier. The higher the value, the more likely the attribution of a sample to a given author.
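Building on the sketches above (again, an illustrative reimplementation rather than the SuperStyl internals), this amounts to scoring every window with the fitted pipeline and plotting one decision-function curve per candidate:

import matplotlib.pyplot as plt

# `pipeline` is the fitted char-3-gram + LinearSVC pipeline and `windows` the
# list of (start, text) slices from the earlier sketches (both are assumptions here).
starts = [start for start, _ in windows]
texts = [text for _, text in windows]
scores = pipeline.decision_function(texts)   # shape: (n_windows, n_candidates)

for idx, author in enumerate(pipeline.classes_):
    plt.plot(starts, scores[:, idx], label=author)
plt.xlabel("position in the concatenated Q drops (words)")
plt.ylabel("decision function")
plt.legend()
plt.show()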

All analyses are implemented in Python, inside the SuperStyl package (Camps et al., 2021), and internally use the SVM and pipeline facilities provided by scikit-learn (Pedregosa et al., 2011). Plots are created using R and Python (matplotlib).

Correspondence analysis

Correspondence analysis (Benzécri, 1973) has been performed on a contingency table of the 1000-word samples by RonW and PaulF, based on 4105 character 3-grams selected for statistical reliability following a procedure previously described (Cafiero and Camps, 2019). The Q drops have then been inserted as supplementary rows, using the implementation in the R package FactoMineR (Lê et al., 2008). The significance of the first two axes remains relatively low, in part because of the high dimensionality of the input table, but the data clouds for PaulF and RonW are clearly separated, and the Q drops appear in an intermediary position.
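For readers who prefer Python, the computation behind this procedure can be sketched in numpy as follows (the paper itself uses FactoMineR’s CA in R; the contingency tables below are random toy data standing in for the 3-gram counts of the RonW/PaulF samples and of the Q drops):

import numpy as np

def correspondence_analysis(N):
    # Correspondence analysis of a contingency table N (active rows x character 3-grams).
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]          # row principal coordinates
    col_std = Vt.T / np.sqrt(c)[:, None]                 # column standard coordinates
    return row_coords, col_std * sv, col_std, c

def project_supplementary_rows(N_sup, col_masses, col_std):
    # Project supplementary rows (here, the Q drops) into the space built from
    # the active rows, using the usual transition formula.
    profiles = N_sup / N_sup.sum(axis=1, keepdims=True)
    return (profiles - col_masses) @ col_std

rng = np.random.default_rng(0)
N = rng.integers(1, 20, size=(40, 60)).astype(float)      # toy RonW/PaulF samples
N_sup = rng.integers(1, 20, size=(5, 60)).astype(float)   # toy Q-drop rows
row_coords, col_coords, col_std, c = correspondence_analysis(N)
print(project_supplementary_rows(N_sup, c, col_std)[:, :2])  # positions on the first two axes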

This paper is available on arxiv under CC BY 4.0 DEED license.