25th January 2018

Round-table discussion: ‘Making the Grade? Exam accuracy and its implications’

The Education Policy Institute held a round-table discussion on exam grade accuracy and its implications on Thursday 25th January 2018.

The event was supported by the Association of School and College Leaders (ASCL) and by the Headmasters’ and Headmistresses’ Conference (HMC) – representatives of both organisations made some opening comments on the subject, followed by a presentation from Ofqual. A wide-ranging discussion then took place under the Chatham House Rule.

In the discussion, it was noted that exam grade reliability is important both for its implications for individual students and their progression routes, and for institution-level accountability. Some participants considered that exam grades have become more important and “high stakes” over time, as reliance on them (as opposed to wider judgements of student competence) may have increased.

In November 2016, Ofqual published “Marking Consistency Metrics”, which sets out the probability of a candidate being awarded the definitive grade in certain GCSE and A level subject modules/units. The definitive grade is the grade that would have been awarded if the candidate had received the definitive mark – the mark awarded to the candidate’s work in the exam boards’ quality assurance processes. This research highlighted that in some subjects grade reliability is quite high (approaching 90%), while in others the probability of agreement with the definitive grade is much lower (for example, just over 50% for English Literature). The Ofqual report did not include the relevant probability for GCSE maths, but Ofqual indicated that it is towards the top end of the reliability scale.
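
The mechanism can be illustrated with a simple simulation, sketched below: mark-level marking differences translate into a probability of receiving the definitive grade. The mark scale, grade boundaries and spread of marking differences used here are hypothetical assumptions for illustration, not figures from the Ofqual research.

```python
# Toy model: Gaussian marking differences around the definitive mark.
# BOUNDARIES and MARK_SD are hypothetical, illustrative values.
import random

BOUNDARIES = [20, 30, 40, 50, 60, 70]  # hypothetical grade boundaries on an 80-mark unit
MARK_SD = 3.0                          # hypothetical SD of examiner marking differences

def grade(mark: float) -> int:
    """Count how many boundaries the mark clears (0 = lowest grade)."""
    return sum(mark >= b for b in BOUNDARIES)

def agreement_probability(definitive_mark: float, trials: int = 100_000) -> float:
    """Estimate P(awarded grade == definitive grade) under Gaussian marking differences."""
    target = grade(definitive_mark)
    hits = sum(grade(random.gauss(definitive_mark, MARK_SD)) == target
               for _ in range(trials))
    return hits / trials

print(agreement_probability(55))  # mid-band candidate: agreement around 0.9
print(agreement_probability(49))  # candidate just below a boundary: around 0.63
```

In a model of this kind, the agreement probability is driven largely by how close the definitive mark sits to a grade boundary relative to the spread of marking differences.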

Ofqual confirmed that this analysis would be updated to show the probability of being awarded the definitive grade in the overall qualification – rather than in its units/modules. It was agreed that this would be helpful as it would indicate the real impact of the issue on overall exam grades, and the approximate number of students affected. It is likely that the probability of agreement with the definitive grade will be materially higher at a qualification level than at the published unit/module level.
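
The intuition behind that expectation can be illustrated with a toy extension of the sketch above. All numbers remain hypothetical, and the simulation assumes equally weighted units with independent marking differences; real aggregation rules are more complex.

```python
# Toy model: with several equally weighted units, independent marking
# differences partially cancel in the aggregate, so agreement with the
# definitive grade is higher for the qualification than for a single unit.
# UNITS, MARK_SD and the boundaries are hypothetical, illustrative values.
import random

UNITS = 4                                   # hypothetical number of units
MARK_SD = 3.0                               # hypothetical per-unit marking SD
UNIT_BOUNDARIES = [20, 30, 40, 50, 60, 70]
QUAL_BOUNDARIES = [b * UNITS for b in UNIT_BOUNDARIES]  # simple summed boundaries

def grade(mark: float, boundaries) -> int:
    return sum(mark >= b for b in boundaries)

def agreement(definitive_unit_mark: float = 49.0, trials: int = 100_000):
    unit_target = grade(definitive_unit_mark, UNIT_BOUNDARIES)
    qual_target = grade(definitive_unit_mark * UNITS, QUAL_BOUNDARIES)
    unit_hits = qual_hits = 0
    for _ in range(trials):
        awarded = [random.gauss(definitive_unit_mark, MARK_SD) for _ in range(UNITS)]
        unit_hits += grade(awarded[0], UNIT_BOUNDARIES) == unit_target
        qual_hits += grade(sum(awarded), QUAL_BOUNDARIES) == qual_target
    return unit_hits / trials, qual_hits / trials

print(agreement())  # qualification-level agreement exceeds unit-level agreement
```

With the illustrative numbers used here, unit-level agreement for a candidate one mark below a boundary is around 0.63, while qualification-level agreement is around 0.75: the distance to the aggregate boundary grows in proportion to the number of units, while the spread of the aggregated marking differences grows only with its square root.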

Ofqual explained that there is a range of reasons for variation in marking, which can be categorised into four types: procedural error (e.g. not marking all the pages of an answer); attentional error (concentration lapses by examiners); inferential uncertainty (insufficient evidence provided by the candidate for the examiner to reach a definitive judgement); and definitional uncertainty (a range of legitimate marks is allowed by the mark scheme because the construct to be rewarded is not tightly defined).

Ofqual noted that marking differences in the English public examinations system fall within the bands observed internationally.

Ofqual also noted that its marking reliability work in England does not currently cover vocational qualifications.

Participants welcomed the work conducted by Ofqual to improve understanding of this issue.

Some participants felt that, while marking can never be “perfect”, the high probability in some subject units of not being awarded the definitive grade was a cause for concern – particularly when the differences between grade boundaries could have significant consequences, for example having to re-take certain subjects post-16, or being unable to access subject or institution choices.

There was particular concern about the impact of unreliable grading on GCSE resits and FE funding/delivery. It was noted that funding for maths and English GCSE resits comes out of the current 600 funded hours for FE students. Accordingly, any unnecessary resits create pressure on the funding for the chosen post-16 pathway.

Some participants felt that overall assessment quality could be improved through a different assessment system, and that extended responses could give more scope for accurate judgements to be made.

Other participants highlighted the risk that extended responses would make it more difficult to award a definitive mark.

It was felt by some that the aspiration to improve marking/grading reliability could drive learning and examinations towards an increasingly narrow focus on ease and consistency of judgement, which could create bad incentives for teaching and learning – on this view, some degree of grading unreliability might be a price worth paying for an education system that encourages deep thinking.

There was some discussion about whether the current grading structure should be reviewed, with a move to publishing scaled scores for GCSEs – giving more information and placing less weight on the existing high-stakes boundaries. However, some participants felt that high-stakes thresholds would still be likely to emerge.

Some also held the view that the use of grades by education institutions and employers needed to change, to reduce the weight placed on a few borderlines in a small number of “high stakes” subjects. One proposal was to encourage greater use of measures such as Attainment 8 to help determine post-16 pathways.

It was noted that even when marking reliability is very strong, there is still a question of how reliable/robust a student’s overall grade will be, since it depends on the individual student’s performance on the questions that happen to be asked in any particular exam paper. The same student, perfectly marked, would be highly likely to secure different marks on different exam papers, depending on the questions selected by examiners.
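
A minimal sketch of this question-sampling effect is given below, under the purely hypothetical assumption that a candidate answers each one-mark question correctly with a fixed probability; the paper length and success probability are illustrative, not figures from the discussion or the Ofqual research.

```python
# Toy model of score variation with perfect marking: the score still varies
# from paper to paper because each paper samples different questions.
# QUESTIONS_PER_PAPER and P_CORRECT are hypothetical, illustrative values.
import random

QUESTIONS_PER_PAPER = 50  # one mark per question
P_CORRECT = 0.7           # candidate's chance of answering any given question

def sit_paper() -> int:
    """Score on one perfectly marked paper: number of questions answered correctly."""
    return sum(random.random() < P_CORRECT for _ in range(QUESTIONS_PER_PAPER))

scores = sorted(sit_paper() for _ in range(10))
print(scores)  # typically spans several marks across ten "papers"
```

Even with zero marking error, the spread of scores in this model (a standard deviation of roughly three marks under these assumptions) can be enough to move a candidate across a grade boundary.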

It was agreed that future Ofqual data on overall grade-level reliability for reformed qualifications will need to be carefully assessed before further consideration of the extent of this issue, and of the case for changes either to improve grading reliability or to reduce its potential negative consequences.