Mapping Twitter’s Foulest Language

Forensic Linguist Dr Jack Grieve is using geo-tagged data from Twitter to analyse swearing on both sides of the Atlantic.

Dr Jack Grieve

Dr Jack Grieve types a word into his computer and points to a heat-map of the United States. “Talking about swearing makes sense on Twitter because it’s really informal,” he says. “If you wanted to look at the use of scientific vocabulary, for example, Twitter would probably not be the right place for it. If you want to look at swearing and new-word slang creation it makes sense, right?”

Can something positive come out of Twitter’s enormous scope for foul language? Surprisingly, yes. A Senior Lecturer in Forensic Linguistics at Aston University, Dr Grieve has been working on a unique data-mining project - supported by ESRC and AHRC via Digging into Data - which maps words using Twitter’s gigantic corpus of geo-tagged tweets. In the process he has created a free app, Word Mapper, which reveals regional patterns across the United States. Work on mapping British data is also underway and his techniques are proving fruitful in other work supporting the police with profiling suspects in murder and paedophile cases.

Nevertheless, the media focus has - predictably enough - been on swear words. His Twitter project has not only attracted the attention of linguistics bloggers, but he has appeared in the Huffington Post, The Independent, Daily Mail, The Guardian and countless other media outlets; he has been invited to get involved in the American Dialect Society’s Word of the Year and has been asked to write a book on swearing (the latter is unlikely, he says, though he has recently published Regional Variation in Written American English - the first study of its kind). How did he get the idea to look at swear words in the first place? “We have another project where we map the top 10,000 words in American English. My post-doctoral student was looking at it and he said ‘These swear words are new’, so I posted the maps on Twitter and people seemed to love them.”

The Challenges of Big Data

It’s not often that academic research goes viral, but despite its playfulness, Dr Grieve’s project is underpinned by some serious challenges. The biggest was wrangling such a large amount of data. As he explains, it was the first time in his career that he couldn’t use a standard PC to handle the data but had to start using servers and thinking in terms of parallel processing. Digging into Data - a scheme that supports international data-mining projects in the social sciences - funded the servers and the dataset was built by a data-mining geographer at the University of South Carolina. Although data-driven linguistics is not a new thing, the use of Twitter, which provides access to 10 billion geo-coded words drawn from about 1 billion tweets, is relatively novel and is in stark contrast to the traditional methods such as conducting interviews or running surveys. There have been criticisms - such as the fact that Twitter’s user-base is not representative or that Twitter does not help to analyse the spoken vernacular - but Dr Grieve takes a phlegmatic attitude to these points. 

“I’m a corpus linguist and come from a different tradition,” he explains, “I would argue there is no such thing as the spoken vernacular, there are all sorts of different types of speech, all sorts of different types of writing, and Twitter is pretty informal in many ways. Yes, we are looking at Twitter but we’re doing that because it’s the only thing that gives us this much data. I’d be happy to work with a 10-billion-word corpus of conversations, but that doesn’t exist. In the American dataset from 2015 we got 20 million different user accounts - some of them will be duplicate accounts, but that’s the size of Canada! There are 320 million people in the US so 20 million is not a bad sample. And Twitter isn’t as skewed as people might think. It’s obviously skewed towards young people and towards African-Americans, but it is what it is. Twitter has a demographic profile and we’re analysing Twitter - we’re not generalising directly past it. Having said that, though, I think the patterns are broadly true, but nobody knows for sure because this is the first time we’ve had this kind of data.”

Map showing the relative frequency of the use of the word "darn" in the USA

As Dr Grieve types more examples into Word Mapper, it is clear that looking up swear words is in fact the least of the software’s capabilities. A search for the word “but” and a search for the word “and” create complementary heat-maps of the US: an intriguing result that Dr Grieve has yet to explain. He has also discovered cultural patterns, such as more talk about family on the East Coast compared to more about work and travel on the West Coast. Another interesting outcome has been the difference between African-American English and white American English. “Clearly African-American English is, if not the main source for new words, the single biggest source of lexical innovations. That’s not entirely surprising when you think of things like hip-hop, which is very influential,” he adds.
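The statistic behind heat maps like these is simple: count how often a word occurs in each region, normalised by the total number of words tweeted there. A minimal sketch in Python illustrates the idea - the data layout and function name here are illustrative, not Word Mapper’s actual pipeline:

```python
from collections import Counter

def relative_frequency_by_region(tweets, target):
    """Per-region relative frequency of `target`, expressed as
    occurrences per million words -- the statistic a heat map plots."""
    word_counts = Counter()  # total words seen per region
    hits = Counter()         # occurrences of the target word per region
    for region, text in tweets:
        tokens = text.lower().split()
        word_counts[region] += len(tokens)
        hits[region] += tokens.count(target)
    return {r: 1_000_000 * hits[r] / word_counts[r] for r in word_counts}

# Toy geo-tagged data, standing in for billions of real tweets
sample = [
    ("TX", "well darn that is a darn shame"),
    ("TX", "headed to the game tonight"),
    ("NY", "trains are late again"),
]
freqs = relative_frequency_by_region(sample, "darn")
```

Normalising per million words is what makes regions with very different tweet volumes comparable on one map; raw counts would simply highlight the most populous states.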

Police Investigations

But the real benefits of the project have been how this Big Data approach can be applied in a forensic linguistics context. As part of his work at Aston’s Centre for Forensic Linguistics, Dr Grieve is involved in supporting police investigations by analysing a range of written texts - from emails to social media posts - to establish authorship or gain clues about a suspect’s background. One area where his digital mapping may be valuable is in identifying participants in paedophile rings, where written conversations form the bulk of the evidence. He is also starting to look at British data for the first time to see if he can develop methods of geographical profiling.

“It seems to me that Brits are tweeting a bit more than Americans. I was a little bit worried that we would not have enough data for Britain but we have a lot, so the outcome should be really strong. Although we really like those authorship problems when the police give us three authors and ask ‘Which one wrote this text?’, most times it’s not like that. It’s more likely that we will get a text and be asked what we can tell about the person who wrote it. We can make guesses on education; gender is very hard to guess. But the geographical data really helps. You can see how we could map words using this technology and get some very strong evidence.”

This article first appeared in the Sep 2016 edition of Aston in Touch.

Find out more about Dr Jack Grieve