In this sequence of posts, which I began here, I’m exploring how far the English Language GCSE is fit for purpose.
One issue for the qualification is the long list of purposes it’s meant to fulfil. In Part 1, I looked at the GCSE’s relationship with the curriculum. This time, we’ll look at its connections with assessment and explore how well it assesses performance relating to a proportion of the content of the National Curriculum for English at Key Stage 4.
The way that we walk, the way that we talk, so easily caught.
In terms of the first of these two purposes, it’s worth reminding ourselves that the government publishes both a National Curriculum document and a document which identifies the Subject Content and Assessment Objectives for Key Stage 4. Exam boards then produce their specifications, such as this one for AQA English Language, and their examination papers are based on these. Here are the specimen exams for AQA Paper 1 and Paper 2. In the case of most English exam boards, and certainly in the case of all of AQA’s sample papers, as well as this year’s first proper exam paper, the format and sequence of the questions is formulaic. In addition, the exam boards provide training to help teachers support their students in answering these formulaic question types. Teachers become examiners with the dual purpose of making a little more money for the summer and understanding how best to answer the questions so that they can share this with their students the following year. Schools will arrange for Edexcel to share their students’ marked scripts this year so that their future students get better grades than those of schools which have not. We do this not because we want to know how to teach the English curriculum better, but because we want to teach the exam better so that our students get better results. At the moment, that means being better than the other students in their cohort.
Most, if not all, of us are complicit in this. We want our students to do well in their exams for all kinds of reasons. Perhaps it’s because the results will have an impact on their life chances. Perhaps it’s because the results affect their self-esteem or our own sense of self-worth. Perhaps it’s because the results are a measure of our department or our school. Perhaps it’s due to performance-related pay.
Water’s running in the wrong direction, got a feeling it’s a mixed up sign.
In order to more fully understand why this is important in relation to assessment, we need to remind ourselves of three key pieces of terminology relevant to the field. The following are adapted from my reading of Daisy Christodoulou’s latest book. If you’ve not yet read it, then you should.
The domain is the entirety of the knowledge from which an exam/assessment could draw to test a student’s understanding/ability. In the case of Key Stage 4 English, this is defined by the two governmental documents mentioned above and the exam board specification. However, there are also vast expanses of knowledge from previous Key Stages, as well as life generally, which aren’t listed in the specification but still form part of the domain for English Language. The subject is currently an aleph – a point at which all other subjects meet.
The sample indicates the parts of the domain which are assessed in a specific task or exam. It’s rare that we’d assess the whole of a domain, as the assessment would be overly cumbersome. Well-designed assessments are carefully thought through. Samples should represent the domain effectively so that valid inferences can be made based on the data gained from the assessment. The sample in English Language is defined each year through the choice of texts and tasks which are set in the exams. For example, if we take AQA’s Specimen Paper 2, we can see that the texts and questions sample (to name but a few) students’ knowledge of:
- Vocabulary, including “remorselessly,” “obliged,” “confidentially” and “endeavour.”
- Father-son relationships and educational experiences in social and historical contexts which may be vastly different to those in their own lives.
- Linguistics and literary terminology.
- Typical linguistic and structural features of a broadsheet newspaper article.
In addition, though, because the question types are formulaic and because AQA and other exam boards produce booklets like this, there is a swathe of procedural knowledge which students appear to need in order to respond to each task type. This domain is sampled too. In the worst case, students need to know that AQA Paper 1 Question 4 asks them to evaluate but doesn’t actually want them to evaluate.
The validity of an assessment relates to how useful it is in allowing us to make the inferences we’d wish to draw from it. “A test may provide good support for one inference, but weak support for another.” (Koretz D, Measuring Up) We do not describe a test as valid or invalid, but rather the inferences which we draw from it.
I think the main inference we draw from someone’s grade in an English Language GCSE is that they have a certain level of proficiency in reading and writing. This level of proficiency could be measured against a set of criteria, it could be established through comparison with a cohort of other students, through a combination of the two, or through an odd statistical fix. At a student level, we measure these proficiencies in order to decide whether students:
- Have the ability to take a qualification at a higher level.
- Are able to communicate at a level appropriate to the job for which they’ve applied.
In order to be able to make these inferences, education institutions, training providers and employers need to have a shared understanding of the relationship between grades and competency or ability levels.
Problems arise here. Firstly, these three groups want different parts of the domain emphasised in the sample. Many employers would want to know that applicants who have a grade 4 have basic literacy skills. In this report from 2016, though the overall tone is one of positivity about the improvements in schools, the CBI report that 50% of the employers they’d surveyed said “school and college is not equipping all young people with…skills in communication.” Further to this, 38% of respondents said “There should be a [greater] focus in this [secondary] phase on developing pupils’ core skills such as communication.” 35% said the same of literacy and numeracy. When employers talk of communication or literacy skills, they tend to mean competency in reading for meaning, fluency of speech and clarity of writing in terms of spelling, punctuation, grammar and structure. In addition to these things, educational establishments – particularly where the student is applying for further qualifications in English Language – are likely to want to know how well the student can analyse language and write creatively. As students only receive a single grade for all of these things, it’s possible that a student has performed adequately in terms of language analysis but not in the other aspects, or vice versa. This could lead to invalid inferences being made.
Further problems occur because what you can infer from the grades depends on whether your results are criterion-referenced or norm-referenced, as well as how far the system of comparable outcomes, which we now use, is understood by those making the inferences. In the 2017 CBI report, we’re told that “more than a third of businesses (35%) are wholly unaware of the GCSE grading reform in England.” In addition, there is a concern raised that, “Many young people still leave school without the solid educational foundations needed for success in work and life: on the academic side alone, more than a third of candidates did not achieve a grade C or better in GCSE English (39.8%)” Given that comparable outcomes deliberately hold the distribution of results stable from year to year, it is unlikely that this percentage will change dramatically. In making this statement in the Executive Summary, the CBI seem unaware that this is the case. It’s clear that either further explanation needs to be provided to employers (and likely educators too) or another set of changes needs to be made to the qualification if more valid inferences are to be drawn from results over the coming years.
I couldn’t help but realise the truth was being compromised. I could see it in your eyes.
More often than not, when English teachers talk about the reliability of the GCSE, they are thinking about the ten students whose papers they requested a remark on, some of which were successfully upgraded. In the circles inhabited by assessment experts, the term reliability tends to be used in relation to specific assessments rather than qualifications which, like the English GCSE, rely on different assessments each year and are made up of multiple components consisting of a number of different questions or tasks. If an assessment is deemed reliable, it would “show little inconsistency between one measurement and the next.” (Making Good Progress? – Daisy Christodoulou). 100% reliability at qualification level is a utopian concept, as it would require each element of the qualification to be absolutely reliable. Task, component and qualification reliability can be affected by marking, but it can also be impacted by sampling and a range of factors relating to students.
As we’ve already established, most tests don’t directly measure a whole domain; they only sample from it as the domain is too big. If the sample is too narrow, the assessment can become unreliable. Moreover, if the sample is always the same, teachers can fall into the trap of strategically teaching to the test to seemingly improve student performance.
The domain from which the English Language GCSE texts can be sampled is huge. It could be argued that this is problematic, as students who have a broader general knowledge, much of which they’ve been exposed to outside of school through access to a wider range of books or experiences, are at an advantage.
Of equal concern is the limited sample of question styles. As the question styles remain the same and the range of texts which can be drawn from is so huge, teachers will look for the aspects of the exams over which their students can have most control and confidence whilst in the exam hall. This heightens the extent to which teachers focus on how to respond to the question types rather than on how to make students better communicators through building their knowledge.
Marking and grading:
The two biggest issues in terms of marking in English are that:
- Different markers may apply the mark scheme rubric differently.
- One marker’s standards may fluctuate over time during a marking period.
On page 25 of Ofqual’s Marking Consistency Metrics is a graph which highlights the relative consistency in marking of different subjects. The report states “the quality of marking for physics components is higher than that for the more ‘subjective’ English language or history components.” Interestingly, though perhaps unsurprisingly given that it’s marked in an even more subjective way, English literature is gauged as less consistent.
What we have to ask ourselves here, though, is whether we’re happy with this level of reliability given the number of remarks we might request each year; whether we want an English qualification which takes the same objective approach as GCSE physics in order to increase reliability; or whether there might be a third way, perhaps in the form of comparative judgement, which might increase reliability while maintaining opportunities for extended, open-answer questions.
Students’ performance can vary from one day to another, as well as between the start and end of a test. They can perform differently due to illness, the time of day, whether they have eaten, and as a result of the emotional impact of life experiences. However, it is difficult to apportion any blame for this to the test itself.
Let me see you through. I’ve seen the dark side too.
There are numerous issues with the GCSE in terms of assessment. The domain is enormous; the sample is skewed by the formulaic nature of the assessment; the reliability of marking is affected by marking rubrics and marker accuracy; and the inferences we can draw are blurry in their validity. Despite this, the nature of the assessments themselves is, in spirit, closer to the study of English than what we might end up with were we to seek a more traditionally reliable method. It’s clear that, even if we’re not happy with what we have at present, we need to be careful what we wish for in terms of a change.