Authors:
(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;
(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), École nationale des chartes, Université Paris, Sciences & Lettres.
Table of Links
Why work on QAnon? Specificities and social impact
Who is Q? The theories put to test
Definition of two subcorpus: dealing with generic difference and an imbalanced dataset
The genre of “Q drops”: a methodological challenge
Detecting style changes: rolling stylometry
Ethical statement, Acknowledgements, and References
Dealing with quotations and copy/paste
Quotations from authors outside the corpus were excluded as far as possible through close reading. In particular, quotations from Q, Wikipedia, the Stanford Encyclopedia of Philosophy, Abraham Lincoln, the Intelligence Resource Program (irp-fas), Steve Scully’s biography, etc. were all removed.
Direct quotations (with or without quotation marks) and copy/paste between the writings of the different candidates can also occur. A good many of them quote Donald, Eric or Melania T., as does Q. There are also a number of quotations from Q by the others (Paul F., for instance). This could introduce small biases into the idiolectal profiles. To avoid this, we systematically detected citations between the candidates themselves. Since direct pairwise comparison is computationally too costly for a corpus of this size, we used a Locality-Sensitive Hashing (LSH) algorithm, as implemented in the open source TextReuse package (Mullen, 2020). The corpus was tokenised into sentences, and word skip bi-grams were counted (with a skip of 1, that is, allowing any one word to be inserted between the two words of a bi-gram). For all pairs of sentences, a Jaccard similarity score was computed. Let A and B be two samples considered as sets of bi-grams; the Jaccard similarity is computed as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
All pairs of sentences with a Jaccard similarity score greater than or equal to 0.5 (i.e., whose shared bi-grams make up at least half of the distinct bi-grams found in the pair) were examined by a human expert, and quotations were removed.
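The scoring and filtering step can be illustrated with a minimal Python sketch. It is not the TextReuse implementation used for the study: it compares every pair of sentences exhaustively instead of using LSH, which is only practical for a handful of sentences. The function names (`skip_bigrams`, `jaccard`, `flag_pairs`), the regular-expression tokeniser and the lower-casing are assumptions made for illustration, as is the choice to count adjacent bi-grams together with those that skip one word.

```python
import re
from itertools import combinations


def skip_bigrams(sentence, max_skip=1):
    """Set of word bi-grams with up to `max_skip` words between the two members.

    Assumption: words are lower-cased and punctuation is dropped.
    """
    words = re.findall(r"\w+", sentence.lower())
    grams = set()
    for i, left in enumerate(words):
        for gap in range(max_skip + 1):
            j = i + 1 + gap
            if j < len(words):
                grams.add((left, words[j]))
    return grams


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_pairs(sentences, threshold=0.5):
    """Return (i, j, score) for every sentence pair scoring at or above the threshold.

    Exhaustive comparison: a stand-in for the LSH candidate search on a large corpus.
    """
    profiles = [skip_bigrams(s) for s in sentences]
    flagged = []
    for i, j in combinations(range(len(sentences)), 2):
        score = jaccard(profiles[i], profiles[j])
        if score >= threshold:
            flagged.append((i, j, score))
    return flagged


sentences = [
    "We are all in this together",
    "we are all in this together!",
    "The world is watching",
]
flagged = flag_pairs(sentences)
```

Here the first two sentences share all their bi-grams once case and punctuation are ignored, so they are the only pair flagged. As in the procedure described above, a flagged pair is only a candidate: deciding whether it is an actual quotation remains a manual step.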
Even for J = 1, we sometimes encountered false positives. Dan S. and Melania T. each use the sentence “we are all in this together” once, without directly citing each other. We thus left this passage in both their texts. The sentence “the American people are not stupid”, although rarely used, nevertheless appears in different texts. It was kept in the texts studied, as were other simple sentences (“thank you for your service”, etc.).
Other situations were trickier to address. For instance, Dan S. uses the sentence “the best is yet to come” once. It is used five times by Q, who is himself quoting former President Donald Trump. This sentence could be used by anyone without directly quoting Q or Donald Trump. Yet, as its use by Dan S. starts with “As the President says...”, we considered it a direct quotation and deleted it from Dan S.’s text. We did not, however, delete it from Q’s own writing, as it is never used there as an explicit quotation: the sentence could be used in another context, the person(s) writing the Q drops containing this sentence could be trying to impersonate Donald Trump, etc. In any of these cases, it would be legitimate to leave the information in. The same goes for expressions such as “the world is watching” or “make America great again”, used by Donald Trump, but also by Q and some of the potential candidates here.
This paper is available on arXiv under the CC BY 4.0 DEED license.