An authorship analysis of the Jack the Ripper letters

The Whitechapel murders that terrorized London in 1888 are still remembered to this day, thanks to the legend of their unapprehended perpetrator, Jack the Ripper. In addition to the gruesomeness of the murders, the name and the persona of the killer have been popularized by the more than 200 letters signed as ‘Jack the Ripper’ that were received following the murders. The most supported theory on the authorship of these letters is that some of the earliest key texts were written by journalists to sell more newspapers and that the same person is responsible for writing the two most iconic earliest letters. The present article reports on an authorship clustering/verification analysis of the Jack the Ripper letters with a view to detecting the presence of one writer for the earliest and most historically important texts. After compiling the ‘Jack the Ripper Corpus’ consisting of the 209 letters linked to the case, a cluster analysis of the letters is carried out using the Jaccard distance of word 2-grams. The quantitative results and the discovery of certain shared distinctive lexicogrammatical structures support the hypothesis that the two most iconic texts responsible for the creation of the persona of Jack the Ripper were written by the same person. In addition, there is also evidence that a link exists between these texts and another of the key texts in the case, the Moab and Midian letter.

1 Introduction

On 31 August 1888, the murder of a prostitute in the Whitechapel area of London started a series of homicides that would be remembered for over a century: the Whitechapel murders. These murders were characterized by mutilations of increasing gruesomeness, such as disembowelment or removal of organs. Experts believe that between five and six murders were committed, culminating with the most violent one on 9 November 1888. Although the killings are traditionally attributed to a single person, commonly known as Jack the Ripper, there has never been definitive evidence to exclude the possibility that the murders were unconnected events, despite some modern research that has found the set of shared behavioural characteristics of the murders to be distinctive ( Keppel et al., 2005).

Besides the investigative aspect, the Whitechapel murders case and the legend of Jack the Ripper have an important socio-cultural dimension. The mystery surrounding the identity of the killer has led to incredible and often unlikely speculations and, even though the Whitechapel murders happened more than a century ago, the mystery has created a business that is still alive and generating revenue in the form of media products, books, and tours. These elements have contributed to the engraving of the mythology of Jack the Ripper into modern Western culture far more than the murders themselves. Several academic works have explored the sociological dimension of the mythology of Jack the Ripper to shed light on 19th century England and the beginning of the modern era (Walkowitz, 1982; Perry Curtis, 2001; Haggard, 2007), while others have identified links between Jack the Ripper and Victorian literature (Tropp, 1999; Eighteen-Bisang, 2005; Storey, 2012).

The origin of the mythology of Jack the Ripper lies in the communication that the killer allegedly sent to the police or media during the time of the murders and in the following months and years. Although there is no evidence that the real killer was involved in the production of any of them, the more than 200 Jack the Ripper letters significantly contributed to the creation and popularization of the name and persona of Jack the Ripper. However, despite the large number of texts involved in the case, only a small number of the Jack the Ripper letters received substantial investigative or socio-cultural importance at the time.

Probably the most important text in the case is the ‘Dear Boss’ letter, which was received on 27 September 1888 by the Central News Agency of London. This letter is the first ever signed as ‘Jack the Ripper’ and it is responsible for the creation of the pseudonym. The letter claimed responsibility for the murder of Annie Chapman on 8 September 1888 and mentioned that an ear would be cut off from the next victim and sent to the police. Indeed, the murder of Chapman was followed by another murder in which part of one of the ears of the victim was removed, although this was never sent to the police. Because of this fact and its style and content, the letter was considered to be genuine and it became famous for introducing the persona of Jack the Ripper and for providing a name that the press could use to refer to the killer.

The second most important text is the ‘Saucy Jacky’ postcard, which was received on 1 October 1888 by the Central News Agency of London, signed again as ‘Jack the Ripper’. The postcard claimed responsibility for the double murder of Elizabeth Stride and Catherine Eddowes on the night of 30 September 1888. The postcard did not threaten future murders and presented an apology for not having sent an ear to the police. Together with the ‘Dear Boss’ letter, this postcard has also become iconic in the portrayal of Jack the Ripper and was taken more seriously than other letters because of the short window between the murders and the time the postcard was sent ( Begg, 2004).

The police took these two texts seriously enough to produce and post copies outside of police stations on 3 October 1888 ( Rumbelow, 1979; Sugden, 2002). Following that, on 4 October, the two texts were also published in many newspapers ( Sugden, 2002), even though some newspapers had obtained the information of the name ‘Jack the Ripper’ and part of the texts already by 1 October ( Perry Curtis, 2001).

Although much less popular than the other two texts, on 5 October the Central News Agency also received a third text, commonly known by experts as the ‘Moab and Midian’ letter. This text announced a triple event and justified the murders with religious motives. The peculiarity of this letter is that the original was never sent to the police, as the journalist Tom Bulling of the Central News Agency decided to copy the text and send only the envelope to the police. The reasons behind this choice were not explained and remain unknown to date.

Besides the three texts delivered to the Central News Agency, a large number of other letters and postcards were sent to several other recipients such as the press or the police between October 1888 and November 1888, that is, after the two iconic texts were made public by the police. During this period, 130 letters allegedly written by the killer were received, and the flow of letters continued for ten more years. Among these letters, another text that has become iconic and that was judged as important during the case is the ‘From Hell’ letter, which was received on 16 October by George Lusk, head of the Whitechapel Vigilance Committee, together with half of a kidney ( Rumbelow, 1979).

In most of the letters, the author(s) mimicked the original ‘Dear Boss’ letter and ‘Saucy Jacky’ postcard in terms of taunting the police and using salient stylistic features, such as the laughter ‘ha ha’, or the salutation ‘Dear Boss’. Some of the letters were almost exact copies of ‘Dear Boss’, especially the ones that were received a year or more later, in conjunction with the anniversary of the murders or in conjunction with new murders in Whitechapel.

Since it is quite unlikely that the same person produced hundreds of letters spanning decades and sent from different places across the UK, it is commonly assumed that most of the letters were written by different individuals, who possibly had not been involved with any of the killings. Particularly interesting is the case of Maria Coroner, a 21-year-old girl who was caught sending one of those letters (Evans and Skinner, 2001). When questioned, she explained that she did so as she was fascinated by the case. It is likely that many of the writers of these letters acted for similar reasons, although the motives behind such actions will probably never be established. These hoax letters themselves represent an interesting mirror into the fears and problems of the people who wrote them (Remington, 2004). More importantly, these letters still exercise an impact on modern times. The Yorkshire Ripper hoaxer, for example, sent letters that borrowed several linguistic elements from the ‘Dear Boss’ letter (Ellis, 1994; Lewis, 1994).

Such a collection of letters also represents an invaluable data set for forensic linguistics and for authorship analysis. Linguistic analyses of the letters can be useful to provide new evidence for the Whitechapel murders case, since, as opposed to other sources of evidence nowadays corrupted by time, the language of the letters has reached us unchanged. The question of the authorship of the letters mostly focuses on the early ones, such as the ‘Dear Boss’ and ‘Saucy Jacky’ texts. The most common theory about the authorship of these texts is that journalists fabricated them to increase newspaper sales. The ‘enterprising journalist’ theory, more specifically, suggests that letters such as the ‘Dear Boss’ letter were actually works of fiction skilfully created to generate shock and ‘keep the business alive’ ( Begg, 2004; Begg and Bennett, 2013). Evidence for the ‘enterprising journalist’ theory comes from the ‘Littlechild’ letter, in which Detective Chief Inspector John George Littlechild mentions that at Scotland Yard virtually everyone knew that the ‘Dear Boss’ letter was fabricated by Tom Bulling, a journalist of the Central News Agency itself, in collaboration with his manager ( Rumbelow, 1979; Begg, 2004). At the time, the Central News Agency had been in a fierce competition with other news agencies and had a reputation of fabricating or embellishing news ( Evans and Skinner, 2001; Begg, 2004). Another theory proposed by Cook (2009) suggests that a journalist named Frederick Best from the tabloid newspaper The Star was the actual author of the ‘Dear Boss’ letter.

As a first step to shed light on the authorship question of the Jack the Ripper letters, the present article reports on an authorship analysis of the texts received during and after the Whitechapel murders case that are connected to Jack the Ripper. The available data set lends itself to several authorship questions, such as the profiling of the anonymous author(s), or the comparison between some key letters and Bulling's and Best's writings. In the present article, an initial exploration of the Jack the Ripper letters is performed with the general aim of finding out for which of the hundreds of texts there is evidence of common authorship, with special attention to the most important texts in the case mentioned above and to those earliest texts received before 1 October 1888, that is, before the ‘Dear Boss’ letter and the ‘Saucy Jacky’ postcard entered the public domain.

Establishing whether some of the Jack the Ripper texts could be written by the same person is an important preliminary step as any future study, either involving profiling or comparison, would benefit from knowing if a number of questioned texts can be clustered together. In this sense, the authorship question tackled in the present study constitutes a useful starting point for any future authorship study on the Jack the Ripper letters.

2 Data

The data set used in the present study is a corpus that includes the texts connected to the Whitechapel murders: the Jack the Ripper Corpus (JRC) (see Supplementary Material). This corpus consists of the letters or postcards found and transcribed in the Appendix of Evans and Skinner (2001), who claim to have collected all of the texts involved in the Whitechapel murders related to Jack the Ripper from the Metropolitan Police files. These letters were OCR-scanned from the book and the scans were manually checked for scanning errors. The corpus consists of 209 texts and 17,463 word tokens. The average length of a text in the corpus is eighty-three tokens (min = 7, max = 648, SD = 67.4).

The peculiarity of the JRC is that almost all of the texts in the corpus are comparable in terms of their broad situational parameters (Biber, 1994), as they are almost all written letters or postcards with similar linguistic purposes. For example, in terms of addressee, 67% of the texts were addressed to Scotland Yard; Sir Charles Warren, the head of London Metropolitan Police during that time; Inspector Abberline; or other law enforcement units. The remaining 33% were either of unknown addressee (13%), or were addressed to common citizens or to newspapers, news agencies, schools, or private firms (20%). The vast majority of the letters were postmarked or found in London, although other letters were postmarked or found in places all over the UK, such as Birmingham, Bradford, Dublin, Edinburgh, Liverpool, Manchester, or Plymouth. All of the letters were handwritten and a minority of them (4%) included drawings of various items, such as knives, skulls, or coffins. Finally, a large number of the letters (75%) were indeed signed as ‘Jack the Ripper’ or with variants of the name, such as ‘Jack the Whitechapel Ripper’, or ‘JR’, or ‘jack ripper and son’. Some other letters were not signed (11%) while the remaining letters used other pseudonyms, such as ‘Jim the Cutter’, ‘The Whore Killer’, or ‘Bill the Boweler’.

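The corpus ranges from 24 September 1888 to 14 October 1896, thus spanning more than 10 years after the murders. However, the majority of the texts, that is 62% of the corpus, were received during the period between October 1888 and November 1888. Four texts were received before the ‘Dear Boss’ letter and the ‘Saucy Jacky’ postcard were made public and are referred to here as the pre-publication texts: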

Text 1 (24 September, 128 word tokens): In this text the author admits to the killing of Chapman and presents the intention to stop killing. The letter is unsigned;
Text 2 (27 September, 244 word tokens): The ‘Dear Boss’ letter;
Text 3 (1 October, 57 word tokens): The ‘Saucy Jacky’ postcard; and
Text 4 (1 October, 88 word tokens): This text threatens more murders and is signed as ‘Ripper’.

3 Methodology

The authorship question considered for this study concerns finding out which texts in a corpus are likely to be written by the same author. Recently, this task has been called ‘author clustering’ and it has been tackled using hierarchical cluster analysis on frequencies of features ( Gómez-Adorno et al., 2017). This authorship problem could be considered, however, just as a special case of ‘authorship verification’, a problem that has received considerable attention in the literature ( Koppel and Schler, 2004; Koppel et al., 2012; Brocardo et al., 2013; Koppel, Schler and Argamon, 2013). The best solutions proposed to solve this type of problem involve the addition of distractor texts belonging to similar registers and the use of similarity metrics applied to feature sets consisting of frequencies of linguistic features.

The problem in applying any of these techniques to the JRC is that its texts are too short to produce reliable frequencies, as the average text length for the corpus is only eighty-three word tokens. For this reason, it is necessary to adopt a method that does not involve the computation of frequencies.

A solution to the problem of analysing short texts within a forensic linguistic context, which considers the presence or absence of features as opposed to their frequencies, was initially proposed by Grant (2010) and then further described in Grant (2013) for text messages. Inspired by research on similarity between species in biology and ecology, and already applied to assess similarity in crime types, this approach consists in quantifying the similarity between two texts using the Jaccard coefficient, that is, the number of features shared by the two texts divided by the total number of distinct features found in either text (Jaccard, 1912):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
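As an illustration of the coefficient, the following is a minimal Python sketch, not the implementation used in the study. The function name and the toy feature sets are invented for the example, and the handling of two empty sets is an assumption.

```python
def jaccard_coefficient(features_a, features_b):
    """Shared feature types divided by all feature types found in either text."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Assumption: two empty feature sets are treated as sharing nothing.
        return 0.0
    return len(a & b) / len(union)


# Toy feature sets, invented for illustration only.
text_a = {"w x", "x y", "y z"}
text_b = {"x y", "y z", "z q", "q r"}

# Two shared types out of five distinct types overall.
print(jaccard_coefficient(text_a, text_b))  # 0.4
```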

After being successfully applied to text messages, methods using the Jaccard coefficient have been applied with good results to other registers, including newspaper articles (Juola, 2013), short emails (Johnson and Wright, 2014; Wright, 2017), and elicited personal narratives (Larner, 2014). These studies have analysed the presence or absence of combinations of words, mostly looking at word n-grams, that is, strings of words of length n collected using a moving window.
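The moving-window extraction of word n-gram types can be sketched as follows. This is only an illustration: the tokenization rule (lowercasing and keeping runs of letters and apostrophes) is an assumption rather than the procedure used in the studies cited, and the example sentence is invented.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of word n-gram types collected with a moving window."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


sentence = "The cat sat on the mat"
print(sorted(word_ngrams(sentence, 2)))
# ['cat sat', 'on the', 'sat on', 'the cat', 'the mat']
```

Because the function returns a set, each n-gram is recorded once per text regardless of how often it occurs, which matches a presence/absence approach.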

Within plagiarism detection research, word n-gram techniques based on similar mathematical principles are very common ( Oakes, 2014, p. 65) on the grounds that the more shared strings there are in two documents, the more there is shared similarity of encoding of meanings and therefore the less likely it is that the documents are independent from each other, as explained by Coulthard (2004).

Word n-grams have been extensively adopted as linguistic features in traditional frequency-based stylometric methods for authorship attribution, although they are not deemed the best stylometric features, as they are often surpassed in efficacy by function words, simple word frequency, and, above all, character n-grams (Grieve, 2007; Stamatatos, 2009). Although word n-grams might not be extremely good features when frequency is taken into consideration, for a method involving presence/absence these features are much better than single words or function words because word strings are rarer and the power of a presence/absence method lies in the measurement and comparison of the linguistic uniqueness of each author on rare features. Character n-grams could also be good features but they are less amenable to interpretation, which can be a drawback depending on the ultimate goal of the research.

In addition to these methodological advantages, the use of word n-grams as features has theoretical support. Corpus linguistics (Sinclair, 1991; Biber, Conrad and Cortes, 2004; Hoey, 2005) and psycholinguistics/cognitive linguistics (Langacker, 1987; Barlow and Kemmer, 2000; Schmitt, 2004; Wray, 2005; Schmid, 2016) have long theorized that combinations of words are at the core of language processing and empirical support has been found for these theories (Ellis and Simpson-Vlach, 2009; Tremblay et al., 2009).

Furthermore, there is also empirical support for a strong idiolectal effect in the production and processing of word combinations (Mollin, 2009; Barlow, 2013; Schmid and Mantlik, 2015; Günther, 2016). Wright (2017) reveals the idiolectal nature of certain word n-grams by taking one specific speech act as constant and then analysing how different authors realize this act, uncovering that each author resorts to their own idiosyncratic set of lexical choices to perform the same act.

In the present study, for the reasons explained above, the set of features that is taken into consideration is word n-grams, as the ultimate goal is to discover possible idiolectal encoding in the JRC letters. Because the JRC texts are short, presence or absence of word n-grams is considered, as opposed to their frequency. Among all the possible sizes of n-grams, word 2-grams are chosen as any n-gram of n > 2 is ultimately made up of n-grams of n = 2, meaning that word 2-grams return the most complete picture of the shared word combinations in two sets. Presence or absence of word n-grams is quantified using the Jaccard ‘distance’ rather than the coefficient. The distance can be defined as:

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$

and which returns values between 0, or absolute identity, and 1, or absolute distance. The Jaccard distance is used so that a hierarchical cluster analysis can then be carried out. In this way, it is possible to first find out the major groups of texts that are more similar to each other, and then it is possible to zoom in and explore smaller groups of letters, such as the pre-publication letters.
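A pairwise distance matrix of this kind could be computed along the following lines. This is a minimal sketch under stated assumptions: the function names, the dictionary layout, and the toy 2-gram sets are hypothetical and do not reproduce the JRC data or the software used in the study.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard coefficient of two sets of 2-gram types."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as maximally distant.
        return 1.0
    return 1 - len(a & b) / len(union)


def distance_matrix(feature_sets):
    """Distances for every pair of texts; feature_sets maps labels to sets."""
    labels = sorted(feature_sets)
    return {
        (x, y): jaccard_distance(feature_sets[x], feature_sets[y])
        for x, y in combinations(labels, 2)
    }


# Toy 2-gram sets, invented for illustration only.
toy = {
    "t1": {"a b", "b c", "c d"},
    "t2": {"b c", "c d", "d e"},
    "t3": {"x y"},
}
for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```

In the toy data, t1 and t2 share two of four distinct 2-gram types and so have a distance of 0.5, while t3 shares nothing with either and has a distance of 1.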

However, evidence of common authorship of two sets of documents can come not only from finding similarity but also from establishing that this similarity is distinctive (Grant, 2010, 2013). Although it is difficult to establish a universal threshold for distinctiveness, it is safe to assume that if a particular n-gram or lexicogrammatical structure does not occur at all or occurs extremely infrequently in a comparable reference corpus then this n-gram or structure is distinctive. The reference corpora considered for this purpose are:

The 132 million word 19th century section of the Corpus of Historical American English (COHA);
The 34 million word Corpus of Late Modern English Texts 3 (CLMET3), spanning from 1710 to 1920;
The 19 million word Extended Old Bailey Corpus (EOBC), including the proceedings of the Old Bailey from 1720 to 1913.
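A check of this kind amounts to counting how often a word n-gram occurs in a reference corpus and normalizing by corpus size. The sketch below is illustrative only: the helper names are hypothetical, the reference corpus is represented as a plain list of tokens, and no particular cut-off for ‘extremely infrequent’ is implied, since none is given above.

```python
def ngram_count(tokens, ngram):
    """Count occurrences of a word n-gram (a tuple of words) in a token list."""
    n = len(ngram)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == ngram
    )


def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million word tokens."""
    return count * 1_000_000 / corpus_size


# Toy reference corpus, invented for illustration only.
reference_tokens = "a b c a b d a c".split()
hits = ngram_count(reference_tokens, ("a", "b"))
print(hits, per_million(hits, len(reference_tokens)))
```

An n-gram with zero hits in a reference corpus of many millions of words would be a candidate distinctive feature in the sense described above.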

In addition, since the analysis involves word n-gram ‘types’, the method faces problems when dealing with texts of different lengths, as the likelihood of any word or n-gram type being observed is correlated with text length. However, provided that the shared n-grams found are also highly distinctive, the evidence of common authorship is nonetheless valid despite differences in text length.

4 Results

Figure 1 reveals that the relationship between the percentage of texts using a 2-gram (occurring in at least two texts) and its frequency rank forms a Zipfian shape, as expected (Zipf, 1935). The graph shows that the top eight 2-grams appear in at least 20% of the corpus. Some of these are very frequent because they reflect common grammatical structures of English, such as ‘I am’, ‘I have’, ‘I will’. Two 2-grams reflect the influence of the signature and salutation of the ‘Dear Boss’ letter on the rest of the corpus: ‘jack the’ and ‘dear boss’. Finally, the high incidence of the 2-grams ‘I shall’ and ‘yours truly’ is probably explained by both the influence of the ‘Dear Boss’ letter and the register of the letters.

Figure 1. Relationship between rank and percentage of occurrence for each word 2-gram in the JRC occurring in at least two texts

Because of their frequent occurrence and thus reduced discriminatory power, these top eight 2-grams were excluded from further analysis.
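The feature selection described here (keeping 2-grams that occur in at least two texts and discarding the most widespread ones) could be sketched as follows. The function names and parameters are hypothetical, the toy data are invented, and the alphabetical tie-breaking at the cut-off is an assumption not taken from the study.

```python
from collections import Counter


def document_frequency(feature_sets):
    """Number of texts in which each 2-gram type occurs."""
    counts = Counter()
    for features in feature_sets:
        counts.update(features)  # a set contributes each type once per text
    return counts


def select_features(feature_sets, min_texts=2, drop_top=8):
    """Keep 2-grams found in at least min_texts texts, minus the top-ranked ones."""
    df = document_frequency(feature_sets)
    shared = [(gram, n) for gram, n in df.items() if n >= min_texts]
    # Rank by number of texts, most widespread first.
    # Assumption: ties are broken alphabetically.
    ranked = sorted(shared, key=lambda item: (-item[1], item[0]))
    return {gram for gram, _ in ranked[drop_top:]}


# Toy 2-gram sets for three texts, invented for illustration only.
texts = [
    {"i am", "a b", "c d"},
    {"i am", "a b", "e f"},
    {"i am", "c d"},
]
print(sorted(select_features(texts, min_texts=2, drop_top=1)))  # ['a b', 'c d']
```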

The distance between each pair of texts was quantified using the Jaccard distance based on the presence or absence of the remaining 1541 word 2-grams and a distance matrix was therefore generated. Figure 2 shows a histogram and boxplot of the Jaccard distances for all possible pairs of texts in the JRC.

Figure 2. Histogram and boxplot showing the distribution of Jaccard distance values for all possible pairs of texts in the JRC

As the histogram of Fig. 2 shows, the most frequent Jaccard distance and also the median distance is approximately 1, which generally speaking means that the texts in the JRC are not very similar to each other. Only 25% of the scores are lower than 0.98, which is marked in Fig. 2 by the leftmost edge of the boxplot, and only 6% of the scores are lower than 0.95, that is, the outliers in the boxplot of Fig. 2 indicated by circles.
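Summary figures of this kind are simple proportions over the list of pairwise distances. The sketch below shows one way to compute them; the toy distances are invented for illustration and are not the JRC values.

```python
from statistics import median


def share_below(distances, threshold):
    """Proportion of pairwise distances strictly lower than a threshold."""
    distances = list(distances)
    return sum(d < threshold for d in distances) / len(distances)


# Toy pairwise distances, invented for illustration only.
toy_distances = [0.90, 0.94, 0.97, 0.99, 1.0, 1.0, 1.0, 1.0]

print(median(toy_distances))
print(share_below(toy_distances, 0.98))  # 3 of 8 pairs -> 0.375
print(share_below(toy_distances, 0.95))  # 2 of 8 pairs -> 0.25
```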

The distance matrix was then used for a hierarchical cluster analysis that can be visualized through the radial dendrogram in Fig. 3.

Figure 3. Radial dendrogram displaying the results of a hierarchical cluster analysis of the JRC using the Ward method based on Jaccard distances. The name of each text is a code starting with two letters from the signature and followed by the date on which it was received. The texts mentioned in the introduction, including the pre-publication texts, contain their name in addition to the code
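For readers unfamiliar with the procedure, agglomerative clustering with Ward's criterion can be sketched in plain Python using the Lance-Williams update formula. This is an illustrative sketch only: this section does not specify the software used, the function name and input layout are hypothetical, and the toy distances are invented. Ward's criterion is formally defined for Euclidean distances, so its use with Jaccard distances should be read as a heuristic.

```python
from itertools import combinations
from math import sqrt


def ward_clustering(distances):
    """Merge clusters bottom-up; distances maps (label_a, label_b) to a distance.

    Returns the merges in order as (members_a, members_b, height) tuples.
    """
    d = {frozenset([(a,), (b,)]): v for (a, b), v in distances.items()}
    clusters = {c for pair in d for c in pair}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one found in case of ties).
        i, j = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        merged = tuple(sorted(i + j))
        clusters -= {i, j}
        for k in clusters:
            total = len(i) + len(j) + len(k)
            # Lance-Williams update for Ward's criterion on squared distances.
            squared = (
                (len(i) + len(k)) * d[frozenset((i, k))] ** 2
                + (len(j) + len(k)) * d[frozenset((j, k))] ** 2
                - len(k) * d_ij ** 2
            ) / total
            d[frozenset((merged, k))] = sqrt(max(0.0, squared))
        clusters.add(merged)
        merges.append((i, j, d_ij))
    return merges


# Toy distances between four texts, invented for illustration only.
toy = {
    ("A", "B"): 0.2, ("C", "D"): 0.3,
    ("A", "C"): 0.9, ("A", "D"): 0.9, ("B", "C"): 0.9, ("B", "D"): 0.9,
}
for left, right, height in ward_clustering(toy):
    print(left, right, round(height, 3))
```

In the toy data, A and B are merged first, then C and D, and finally the two pairs are joined at a much greater height, which is the pattern a dendrogram would display as two tight sub-branches under one long branch.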

Three main branches stem from the centre of the graph in Fig. 3, corresponding to the three main clusters found. On the right, there are two main clusters, one of which includes only two texts. The remaining texts are all classified into another cluster whose branch points to the left and that further splits into two other clusters that roughly correspond to the two hemispheres of the graph. The most historically interesting texts, including the pre-publication texts, are all grouped in the cluster spanning the top hemisphere of the graph and therefore the rest of the article will focus on this cluster. Although it would be interesting to explore the other clusters, this is beyond the scope of this study. The branch leading to the top hemisphere then splits even further into two more sub-branches, one developing to the right containing the ‘From Hell’ letter, and one to the left where all the other historically important letters, including the pre-publication texts, are grouped. The split at this level suggests that the ‘From Hell’ letter is rather linguistically dissimilar to the other famous letters, at least in terms of word 2-gram use. The left branch then splits into two more clusters, with the rightmost one splitting again into two large clusters. One of these two contains the ‘Dear Boss’ letter, the ‘Saucy Jacky’ postcard, and the ‘Moab and Midian’ letter, while the one next to it contains two of the pre-publication letters. Therefore, among the pre-publication JRC texts, ‘Dear Boss’ and ‘Saucy Jacky’ are the most similar to each other, with the ‘Moab and Midian’ letter being the most similar to them among all the historically important texts.

4.1 The pre-publication texts

Let us therefore examine the pre-publication texts using a network graph as in Fig. 4, in which each circle represents a text with a size proportional to the text’s length in total word tokens and each link represents an overlapping word 2-gram type. The graph also reports the Jaccard distance for each pair of texts.

As the cluster analysis already suggested, it is evident that the two pre-publication texts that are most similar to each other are the ‘Dear Boss’ letter and the ‘Saucy Jacky’ postcard. Additionally, these two texts have a Jaccard distance of 0.93, which is a degree of dissimilarity that can be found in less than 5% of the pairs of texts in the JRC. The amount of shared language is striking considering the fact that the ‘Saucy Jacky’ postcard is very short and does not share any linguistic link with either the 24 September text or the 1 October text. Although the ‘Dear Boss’ letter shares a number of 2-grams with both Text 1 and Text 4, the Jaccard score for both pairs is about average for the corpus.

Excluding the 3-gram ‘Jack the Ripper’, which refers to the signatures of the two texts, Table 1 below presents the concordances of their overlapping 2-grams, with an analysis of their syntactic structure.

Table 1. Syntactic analysis of the concordances for the 2-grams in common between Dear Boss and Saucy Jacky