Say what? Analysing healthcare conversations with corpus linguistics – a Q&A with Dr Gavin Brookes

One of our PSG groups is planning to investigate the health care experiences of breastfeeding mothers; how do they experience the advice? Are there any themes? 

The group is keen to find out – how will we analyse the data once we’ve asked several mothers to share their experiences?

Gavin Brookes is a Research Fellow at the University of Nottingham and Assistant Editor of the International Journal of Corpus Linguistics; he came to talk to us about one possibility that sounds like it might be just the thing…


Gavin Brookes: I’m a Research Fellow who currently works at the University of Nottingham, though I’ll be moving to Lancaster University in July. I’m a linguist who is particularly interested in health-related communication. In my research I take an approach called corpus linguistics – which I think we’ll be getting on to in a bit – though I also use other methods from time to time.

PSG A: I see you examined patient feedback on health care services for NHS England and the UK Care Quality Commission. Please could you tell us a bit about that?

Please could I ask, what format was the feedback in? And how did you analyse it?

Gavin: Okay, so every year the NHS collects lots of feedback from its patients about its services. Most of this is quantitative, involving tick boxes and asking patients to rate their experiences from 1 to 5, for example. However, the online forms that patients fill out also include a ‘free text’ box at the bottom, where patients can leave additional comments. As you might imagine, this amounts to lots of data (29 million words over 2.5 years).

However, the NHS didn’t know how to analyse this, so all of this feedback was just sitting there. In this project, the NHS gave us access to this qualitative feedback, which we analysed using corpus linguistics techniques, and then reported back to them on a range of questions broadly concerned with identifying key areas of feedback – what patients did and didn’t like about their experiences, etc.

PSG A: Wow, that’s a lot of data!

What is Corpus Linguistics?

PSG B: Are you able to discuss the details of this technique? Could you give us a beginner’s description of corpus linguistics? I know that it’s using a computer to analyse text, but that is the extent of my knowledge!

Gavin: So, corpus linguistics essentially refers to a collection of methods that use specialist computer programs to study a digitised (or ‘machine-readable’) collection of texts. This collection of texts is known as a ‘corpus’ (Latin for ‘body’ – as in, ‘body of texts’). The computer programs can perform a wide range of procedures on the corpus – such as: telling us which words or chains of words are most frequent (frequency); telling us which words tend to occur together frequently (collocation); and even telling us which words occur significantly more often in our corpus when it is compared against another corpus (keywords).
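To make the frequency and collocation ideas concrete, here is a minimal sketch in Python using only the standard library. The tiny ‘corpus’ is invented for illustration – a real corpus would run to millions of words, and real tools score collocation with statistical measures rather than a raw neighbour tally.

```python
# A toy illustration of two corpus techniques: frequency and collocation.
# The "corpus" below is invented purely for demonstration.
import re
from collections import Counter

corpus = (
    "the midwife said breastfeeding would get easier "
    "and the nurse said breastfeeding takes practice"
)
tokens = re.findall(r"[a-z']+", corpus.lower())

# Frequency: which words occur most often?
freq = Counter(tokens)
print(freq.most_common(3))

# Collocation (crude version): which words appear immediately
# before or after the node word 'said'?
neighbours = Counter()
for i, tok in enumerate(tokens):
    if tok == "said":
        neighbours.update(tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2])
print(neighbours.most_common())
```

Even this toy example shows the pattern a real tool would surface: ‘breastfeeding’ collocates with ‘said’, hinting at reported advice.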

The beauty of corpus linguistics methods is that, because they use computers, we can study much larger numbers of texts and bigger amounts of language than we would be able to by hand. So, corpora (the plural of ‘corpus’) tend to be extremely large – often amounting to millions (sometimes billions!) of words. And this means that you can base your findings on more, and more widely representative, data. That said, you can also do some pretty cool stuff with smaller, more specialised corpora, too.

PSG B: So interesting. How do you then draw messages from the results?

PSG A: What kind of results can you come up with? What can Corpus Linguistics tell us about the content – I’m guessing it’s more meaningful than simply a tally of individual words?

Gavin: A word frequency list is simply a tally of words. But frequency is only ever part of the picture. Even if certain words or chains of words (and the concepts they refer to) are flagged up as frequent, it is then up to the human analyst to dig a little deeper and examine why that might be the case.

Here, the collocation technique I mentioned above can be useful, but the most useful technique is one known as ‘concordancing’, which allows us to look at all of the instances of a particular word or phrase in extended contexts of use within the corpus. So this is the point at which the analysis can become more qualitative, theory-informed and, in my view, most interesting!
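Concordancing is often called KWIC (Key Word In Context): every hit for a search word, displayed with a window of surrounding words. A minimal sketch of the idea, with invented toy stories standing in for real submissions:

```python
# A toy KWIC concordancer: find every occurrence of a node word and
# show it with a window of context words on each side.
# The stories below are invented for illustration.
import re

stories = [
    "the midwife told me to supplement with formula straight away",
    "my health visitor told me breastfeeding would hurt at first",
]

def concordance(texts, node, window=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(i - window, 0):i])
                right = " ".join(tokens[i + 1:i + window + 1])
                lines.append((left, tok, right))
    return lines

for left, node, right in concordance(stories, "told"):
    print(f"{left:>30} | {node} | {right}")
```

Reading down the aligned column of hits is what lets the human analyst spot recurring patterns of use.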

It is never sufficient to just report frequency trends (or keywords, for that matter). Instead, we have to carry out this more qualitative work to find out why a word might be used a lot, or at least more frequently than we might expect.

PSG D: Wow – the inner stats geek in me loves this! What’s the next step? Is it as simple as saying things like “Group A uses this keyword more than Group B, so this idea is more important to them”? I’m guessing that you can draw some more nuanced conclusions…

Gavin: Glad it sounds interesting. I have sort of hinted at this in my response above. Keywords will show differences. However, we then have to engage with the data more qualitatively to interpret and explain particular patterns of use. This is where corpus linguistics differs from (and, in my view, is superior to) other digital humanities methods – the human interpretation of results is essential.

How much does it cost to use corpus linguistics software?

PSG C: Can I ask – would we as a group have access to the type of programmes that Gavin uses?
I ask just because I am aware of the costs of analysis programmes…

PSG A: The answer at this point is… that depends! If it’s something we’re interested in pursuing we can look into what’s possible…

What can you tell us about the costs of analysis programmes?

PSG D: We have a little money… Gavin – how do people access these sorts of programmes? Suppose we did know what we were doing with this (ha ha!!), is it then a case of buying licences to use software, or booking time on servers…?

Gavin: There is a free program known as AntConc.

PSG A: Excellent!

What about other automated methods?

PSG D: Could you tell us very briefly about some of the other digital humanities methods? What are their advantages and disadvantages?

Gavin: There are probably too many for me to mention here, but a couple of popular ones are Culturomics and Topic Modelling. However, I would want to steer you clear of these. They are totally automated (relying on slim-to-no human input) and the results they yield are not particularly interesting nor, more importantly, valid from a linguistic perspective.

Computers are good at counting things and showing trends, but they can’t explain why they occur. So these other methods, which lack linguistic sophistication, provide lots of stats but always beg the question: so what?

PSG D: That’s great – thanks for cutting down the options for us!

How does corpus linguistics work? Some more detail…

PSG E: Late to the party! So we want to analyse people’s experiences of their interactions with HCPs around breastfeeding. Would their (inherently biased) accounts be enough to create a specialised corpus? How many stories would we need? And would it be possible to compare with other corpora? Possibly even the NHS feedback?

PSG D: So many good questions!

Gavin: The fact that their accounts are biased is no problem at all from a methodological perspective. In fact, analysing this type of naturalistic data is just what corpus linguistics is designed for.

In terms of the size of the data, there is no standard. I suppose that you would want broad coverage in terms of the number of stories (so, as many as possible). Likewise, collect as many words as possible (the more the better!).

While there is no standard, I would say that you should aim for a minimum of 100k to do really interesting things. But you can still analyse corpora that are smaller than this. I suppose that the answer is that the data should be big enough for you to require corpus assistance. Analysing 2 or 3 stories using a computer might be like trying to kill a fly with a cannon!

What number of submissions would you be happy with, for a study, very roughly? Or to put it this way – when submissions are coming in – at what point would you think – ah, OK, this is working, we’re doing well here?

PSG A: 100,000 words or submissions?

Gavin: 100,000 words.

PSG A: Phew!!!

PSG E: Given that this corpus is likely to resemble the NHS one more than, say, a corpus of Jane Austen’s work, what proportion of it is likely to be useful, and what proportion high-frequency words that are grammatically necessary but don’t add much? Although even the use of pronouns here is going to tell us a lot about perspectives…

Gavin: That’s a good point. So, word frequency tallies will always show lots of grammatical words like ‘the’, ‘is’, ‘a’, etc. at the top of the list – as these are the most common words found in most genres. We can analyse these if they are helpful. However, it is also standard procedure to filter out grammatical words and focus on lexical (or ‘content’) words.
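The ‘filter out grammatical words’ step can be sketched very simply: drop a stoplist of function words from the frequency tally so the content words surface. The toy feedback sentence and the (deliberately tiny) stoplist below are invented for illustration – real tools ship much fuller stoplists.

```python
# A toy stoplist filter: remove common function words from a frequency
# tally so that content words rise to the top.
# The feedback text and stoplist are invented for illustration.
import re
from collections import Counter

feedback = "the staff were kind and the advice on latching was clear"
stoplist = {"the", "and", "on", "was", "were", "a", "is", "to", "of"}

tokens = re.findall(r"[a-z']+", feedback.lower())
content = Counter(t for t in tokens if t not in stoplist)
print(content.most_common())  # function words like 'the' no longer dominate
```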

A more sophisticated and statistically robust route to these words is to use the keywords technique I mentioned earlier. So, you compare your corpus against a corpus that represents general language. Now, because these grammatical words will occur with a high frequency in both corpora, they are disregarded by the computer as part of this procedure. Here, the computer will give you ‘keywords’ which represent words that occur a lot in your data and not very often in the corpus you’re comparing it against.

I hope this makes sense…

So, in other words, the keywords occur with a ‘marked’ or unusually high frequency in your corpus compared against some standard or benchmark (as represented by the corpus you choose to compare it against).
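The keyness idea can be sketched in a few lines. Corpus tools commonly rank candidate keywords with a log-likelihood statistic comparing observed against expected frequencies across the two corpora; the version below is a simplified form of that calculation, and the two toy corpora are invented for illustration.

```python
# A toy keywords calculation: compare a small "target" corpus against a
# "reference" corpus using a simplified log-likelihood score.
# Both corpora are invented for illustration.
import math
import re
from collections import Counter

def tally(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

target = tally("breastfeeding advice varied and breastfeeding support was patchy")
reference = tally("the weather was mild and the trains ran on time as usual")

t_total = sum(target.values())
r_total = sum(reference.values())

def log_likelihood(word):
    a, b = target[word], reference[word]
    # Expected frequencies if the word were spread evenly across both corpora
    e1 = t_total * (a + b) / (t_total + r_total)
    e2 = r_total * (a + b) / (t_total + r_total)
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed:
            ll += observed * math.log(observed / expected)
    return 2 * ll

keywords = sorted(target, key=log_likelihood, reverse=True)
print(keywords[:3])
```

Note how grammatical words like ‘and’ and ‘was’ score low, because they are frequent in both corpora, while ‘breastfeeding’ ranks highest – exactly the behaviour described above.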

PSG F: I’m really interested in the gap between qualitative and quantitative analysis of unstructured data. I’m a qualitative market researcher (commercial side), and my company have an analytics team working on what they call text analytics… will these be similar things to corpus linguistics? The techniques, as far as I understand them, involve creating taxonomies for topics and feeding them through machine learning/AI tools that spit out the data they then work with.

Gavin: Text analytics is different to corpus linguistics, as it relies on machine learning. In corpus linguistics, the computer doesn’t ‘do’ the analysis for us, but can point us to interesting trends across lots of data and then provides us with the tools to explore why they might be occurring or interesting. I have recently carried out some research using machine learning and topic-modelling and I have to say that I think its reliability is at best mixed and I wouldn’t recommend using it in research.

PSG F: really interesting – I guess it’s the difference between reliability needed for academic vs commercial research?

Gavin: Yes, I would say so. But, at the same time, I would still advise commercial researchers to use corpus linguistics before machine learning methods. I suppose the benefit of ML over CL is that you don’t need linguistics expertise to analyse the data, as the computer ‘does’ it for you.

PSG A: Would Corpus Linguistics be a good tool to find themes in advice given to patients – as reported by the patients in written submissions?

Gavin: Yes, absolutely – and I can think of lots of routes to getting at this. You could even search for certain words – quotatives like ‘said’, ‘told’ and ‘advised’ – and look at what it was that people said they were told to do.
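A rough sketch of that quotative idea: pull out what follows reporting verbs like ‘said’, ‘told’ and ‘advised’ in each story. The two stories below (and the verb list) are invented for illustration – a real analysis would use concordance lines and a fuller set of quotatives.

```python
# A toy quotative search: find a reporting verb in each story and
# extract the advice that follows it.
# The stories and verb list are invented for illustration.
import re

stories = [
    "The midwife advised me to feed on demand.",
    "The health visitor said I should top up with formula.",
]

quotative = re.compile(r"\b(said|told|advised)\b(.*)", re.IGNORECASE)

for story in stories:
    match = quotative.search(story)
    if match:
        print(f"{match.group(1)}: {match.group(2).strip()}")
```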

What are the shortcomings of corpus linguistics?

PSG G: One thing I’m wondering is, we are looking at people’s experiences with breastfeeding advice from HCPs. I can imagine two people telling very similar stories (e.g. being pushed to supplement with formula early) using different words, but the underlying stories could be very similar. How does corpus linguistics help with that? Are there other ways of analysing texts that we should consider?

Or, to put it another way, what are the shortcomings of this method, if any?

Gavin: This is a good question and something we encountered when looking at the patient feedback data. This is where the human element of the analysis is key. You can explore lots of patterns and even apply a bit of introspection to find words or phrases that you think are saying the same thing. As a human analyst, you can look at two patterns in the data and know that they are doing the same job, then report their individual frequencies as part of a broader overall trend. Machine learning methods lack this sophistication, and the computer would treat those two patterns as two discrete things.

There are shortcomings to every method and, in the case of corpus linguistics, the main shortcoming is that you are usually dealing with so much data that it becomes decontextualised. In other words, you cannot talk in huge detail about what is happening in or around any one text. However, with a smaller corpus such as the one I think you’re planning to build, I don’t think this would be much of an issue. The other limitation is having the time to train to use the programs (and to know the best way to combine the procedures). This isn’t too hard, but I realise that I speak as someone who has received training in this.

What’s involved in the human steps of corpus linguistics analysis?

PSG A: Please could you tell us a bit about the human factor at the end. So far it sounds easy peasy! Put the data in a machine, out come the frequencies!

But what happens then – is the subsequent data analysis done by people something we could attempt as citizen scientists with a bit of training – or is it too specialist to learn in a short time?

Gavin: It depends entirely on what you want to get out of it. The key is to have a clear idea about what you want to investigate – what specific questions do you want to answer – before you collect your data. The human-led analysis can be as complicated or uncomplicated as you like.

PSG A: If you were analysing 100,000 words, how long (very approximately!) would you expect the data analysis to take, after you’d put it through the corpus linguistics tool?

Gavin: Difficult question to answer. Again, it depends on what you want to do. But you can usually find some interesting patterns within the first day. The more detailed you go, the longer it takes. The most time-consuming part is collecting the data and then – if it is not readily available in a digital format – transcribing it into a plain text file so that it can be processed by the computer.

PSG C: Do you use other methods (discourse analysis, thematic analysis, etc.) to do the interpretation sections, or does CL have its own method for that bit too?

Gavin: Yes, most CL practitioners would use Discourse Analysis (as I do), but some people also use thematic analysis. Having a qualitative approach at that human-led phase of the analysis can be really useful to get to grips with the data.

PSG C: So as a group we have established that several of us have research training/knowledge/qualifications. But if we take on those bits, that isn’t really a different way of doing science.

Do you have any suggestions for what method might work for us as a whole group to undertake the interpretation section of CL, with variable levels of previous research knowledge/experience?

Gavin: I think starting with discourse analysis would be a good bet. The book ‘Using Corpora in Discourse Analysis’ by Paul Baker is fantastic and very accessible.

PSG H: This is all very interesting! I am not really sure if I can explain my question but here goes… would it make a difference how we collected stories? E.g. would it make a difference to the analysis if we just collected the advice given but no context, or would we get more out of this method if we gathered stories with context – e.g. whole interactions with HCPs, including issues and problems that may have been present?

Gavin: In my view, you will gain much more mileage out of doing the latter!

Gavin: I really enjoyed the chat and I hope you found it useful. I’d be interested to hear more about your project and, if you think it would be useful, I could find some time to get involved on the corpus side of things at least. My upcoming post has a corpus linguistics and health focus specifically, so it could fit with my workload fairly easily. Just let me know if this is something that could be useful.

PSG G: That sounds amazing, thank you!
