It may sometimes feel – and sometimes be the case – that the nature and structure of modern higher education lead students to focus on performance, and that we dedicate ourselves to teaching in ways designed, at least in part, to produce short-term student achievement. We therefore spend time finding ways to allow students to circumvent difficulty, or otherwise provide support and structure so that their confrontation with difficult experiences is cushioned and, as far as possible, negated. However, in pursuit of these short-term, performance-orientated goals, we may in fact be doing a disservice to long-term student learning; by mitigating student encounters with difficulty we may be robbing them of invaluable and essential learning experiences: so-called desirable difficulties (Bjork 1994).
This is the argument underpinning the papers considered in this SoTL Seminar, and the desirable difficulty that the authors advocate exposing students to is (more) testing.
Testing is obviously an integral aspect of higher education. It is the key means by which we determine how well students have learned; it allows us to confer a mark that reflects that learning and ultimately underpins the degree classification with which a student will graduate from university. There is also a recognition that testing, or assessment, can serve a formative function, i.e. as a means of enabling both the student and the teacher to gauge how far learning has progressed, what the strengths and weaknesses are, and what needs to be done to improve. This is a mediated effect in that it potentially changes future behaviour. However, can testing/assessment also have an unmediated – a direct – effect on learning? Is testing/assessment a learning event in itself and one, potentially, that is more effective than (re)studying?
The two papers chosen for consideration in this seminar are the result of experimental cognitive psychology research. Both purport to show that the act of taking a test improves long-term retention of material, although intensive study (cramming) may be more effective for promoting short-term retention. The reason for this, it is argued, is that the act of testing itself – of requiring students to recall information – is not a passive process but one that actually affects how well a particular piece of information can be recalled in the future – it “modifies its representation in memory”. It is further argued that learning without being tested can lead to complacency in that we become reassured that we know and understand something – that we are ‘fluent’ in it – whereas the reality is that we may only have a superficial or incomplete understanding, particularly if our learning and study methods reinforce a false sense of confidence. For example, have you ever highlighted something because you feel it is important, but simply relied on the act of highlighting to ensure that you have really learned it?
The two papers chosen for this seminar explore the ‘testing effect’ through distinct testing formats. Roediger and Karpicke use prose passages, asking participants to free-recall as much of the text as they can between five minutes and one week after initially engaging with it. They conclude that testing (and re-testing) is markedly more effective in promoting long-term retention of material than (re)study. Bjork et al investigate the ‘testing effect’ with multiple choice questions (MCQs) in both laboratory and classroom-based settings. They argue that if students are tested using carefully designed MCQs – that is, where all answer options are credible – after being exposed to material, their performance (learning) in a later test is better than if they had not taken the earlier MCQ test at all. This holds whether the students are re-tested on the same questions or on related, but not previously tested, material. It is further argued that an MCQ test taken before exposure to material (pre-testing) can improve learning of material that is subsequently introduced and then tested.
- Do you find the research discussed in these papers, the methods, results and the conclusions drawn from them credible and convincing?
- The research focuses on the recall and retention of factual information. Do you think the use of testing as envisaged in these articles could be adapted to address learning aimed at developing other skills such as critical thinking or analysis? We might, for example, recall Marton and Säljö’s observations on the different levels of understanding that students can take away from reading a text in relation to Experiment 2 in Roediger et al.
- The implication of this research is that we should increase the amount we test students – testing should not just be of learning, or even for learning, but should be as learning. What barriers, challenges or drawbacks might there be in adopting this view and approach?
- What impact might this research have on your own teaching?
The authors of the papers we looked at in this session include two key figures working in cognitive psychology on the science of memory. Partly as a result of his contribution, with Brown and McDaniel, to the accessible Make It Stick: The Science of Successful Learning (2014), Henry L. Roediger has gained some traction in recent higher education debates on effective study practice to improve student learning; not least here at York, where their work has been cited as evidence in support of the curriculum design enhancements encouraged by the York Pedagogy (Robinson 2015). As with Dweck’s Self-Theories (1999), which we discussed last year and with which it engages liberally, Make It Stick is a culmination, summarised for a wider readership, of extensive original research developed since the 1970s. Likewise, Elizabeth Bjork co-founded the Bjork Learning and Forgetting Lab at UCLA, building on work produced with Robert A. Bjork on the New Theory of Disuse, first published in 1992. Their particular interests are in the relationship between memory input, storage and retrieval processes in the context of learning. Research into human learning and memory continues to develop through more recent work by figures such as J. D. Karpicke and Bjork’s colleagues at the Learning and Forgetting Lab. The shared core argument of these researchers is that the approaches to learning adopted by students (and often encouraged by teachers) tend to be ineffective and transient, relying as they do on techniques such as re-reading, blocked practice (focusing on one key idea at a time) and massed practice (practising the same thing repeatedly until learned). Instead, they propose that learning is more effective using strategies of interleaved practice: switching between topics and leaving intervals before returning to key ideas.
They also propose that good learning is effortful, requiring us to think as teachers and learners in terms of producing, in Bjork’s term, “desirable difficulty.”
The papers we looked at in this session aimed to examine in more detail the argument, evidence and methodology of some of this research, focusing specifically on its implications for assessment practices generally, and for the use of multiple-choice questions (MCQs) specifically.
Roediger and Karpicke’s paper (2006) described two experiments comparing the effects of repeated studying versus repeated testing on recall of a factual prose passage five minutes, two days and one week after a first reading. Their findings convincingly demonstrate that, although recall is higher amongst restudy groups when assessed after five minutes, the results reverse after two days and one week, with repeatedly tested students demonstrating significantly better recall. They also drew on a participant questionnaire to demonstrate that students nevertheless, and contrary to the evidence, self-report higher confidence in future recall following restudy than following repeated testing. Their strong recommendation is that frequent testing – at least weekly, if not after every teaching session – should be built into learning design.
Bjork, Soderstrom and Little’s paper (2015) summarised a number of laboratory and classroom experiments they conducted on testing and pre-testing with MCQs, investigating the assumption that MCQs do not support recall and learning as well as other modes of assessment. A further concern was evidence from previous research that MCQs might risk impairing recall of related information that had been studied but not assessed, through a process called “retrieval-induced forgetting” (RIF) – i.e. if answer A was incorrect in Test 1 but correct in Test 2, then the fact that students had previously and accurately rejected A as incorrect in Test 1 could mean they erroneously do so again in Test 2. Their conclusion was that MCQs can support recall and potentiate learning so long as they are carefully designed to include, amongst the options, plausible incorrect answers on related material that has been studied (“competitive incorrect alternatives”); and, moreover, that the incidence of RIF with this design of MCQ was actually lower than with cued-recall tests.
In general, students formatively tested with MCQs did better even when the subsequent assessment included related material that they had studied but not been pre-tested on. The authors’ hypothesis for why this occurs is that answering MCQs with plausible incorrect answers forced students to think through their reasons for rejecting the wrong answers as well as for choosing the right one. In the process they were cognitively attempting to retrieve all the information they had learned in order to rule out incorrect choices, hence inducing “desirable difficulty.” The authors note that this is an effective metacognitive test strategy that only 30% of students routinely engage in, and propose that pre-testing with MCQs might therefore have the additional benefit of increasing students’ ability to self-assess their learning more accurately.
Methodology and research validity
Overall, the SoTL Network members at the session were convinced by the validity of the research and methodologies outlined in the papers. Some expressed uncertainty about its applicability in practice, although we agreed that the idea that testing could support as well as assess learning was very appealing. The general view on the limitations of the research was that the authors’ conclusions were inferred from observation and that it was difficult to determine their overarching view of how learning works. One member questioned the underlying tendency in this strand of cognitive psychology, which emerged in the decades corresponding with the inception of computer technology, to view the brain, analogously to early computers, as an input/storage/output information-processing system. In the case of these two papers, had the authors’ thinking on this underlying model kept pace with more recent developments in the interplay between cognitive science, neuroscience and computer science that complicate this mechanistic model? Another member suggested the research could be further strengthened with the addition of an eye-tracking strand, complementing the use of MCQs with a more nuanced analysis of student attention while studying and of recall processes during testing; particularly in terms of evaluating answer selection between the non-competitive and competitive incorrect answer options on MCQs discussed by Bjork et al (2015).
What is “good” learning?
Much of the debate at the session circled around what, if anything, the research revealed about what constituted “good learning” generally, and what kind of learning was being assessed by, and supported with, MCQs specifically. Is good learning all about memory recall of learned information, for example? Is it not as, if not more, important to learn how to find, evaluate and apply information effectively, so that one can stay abreast of rapidly changing knowledge and adapt to changeable and diverse application contexts? Reflecting on our existing views, professional experiences and assumptions, it was clear that most felt MCQs assessed the learning of facts and “how to” processes where there was (relatively) high academic consensus on what constituted correct and incorrect information. The rarity of MCQs on Arts and Humanities programmes was noted, the assumption among members working within these disciplines being that MCQs were unsuited to disciplines that emphasised informed critical argument and debate on subjects where there is less knowledge consensus (i.e. divergent thinking). One member wondered whether the performance improvement measured in the experiments was due to pre-testing encouraging active learning, which could as easily be achieved through other study practices, such as writing summaries or “minute papers” of learned material, rather than more passively re-reading it.
Many working in the Sciences and Social Sciences, where MCQs are more likely to be routinely used, were conscious of debates on their usefulness for assessment, particularly in relation to the deep/surface learning paradigm that has challenged their use on the grounds that they encourage strategic, surface, short-term memorisation and regurgitation of facts at the expense of deeper retention, understanding and application of knowledge in new contexts. Some in the group did not necessarily have a problem with this criticism, proposing that some disciplines required foundational memorisation of existing core information to scaffold the subsequent development of deeper, creative, problem-solving abilities (i.e. convergent thinking). MCQs, they argued, worked well to assess the former, with other modes of assessment being used to evaluate the latter at appropriate stages on the programme. Others argued that carefully designed MCQs can already challenge students to consider issues of relatedness between a range of facts and processes, particularly if designed around the “competitive incorrect alternative” model recommended by Bjork et al (2015). The implication in the paper that this question model was novel or rarely used was noted with some bemusement by regular users of MCQs, as it was strongly felt that this approach was commonplace when robust and thoughtful question design was applied.
While the claim in the research that regular use of MCQs encourages learning and learning fluency through regular assessment was of interest to many, few in the group seemed entirely convinced by it. Many expressed an overarching concern about summative over-assessment creating unwanted and excessive anxiety amongst students on the one hand, and the common student complaint that formative assessment “doesn’t count” towards their final grade on the other. The discussion recalled our previous consideration of the Alverno model of continuous assessment, which we had concluded would require a seismic shift in thinking about higher education curriculum design to be applied effectively in the UK. There was also some doubt about Bjork et al’s suggestion that students need to be trained in taking competitive incorrect alternative MCQs to improve performance, on the grounds that this might encourage strategic “test wiseness” over substantive understanding.
In conclusion, while the papers offered rich food for thought, provoked an interesting and lively debate, and gave a fascinating insight into this area of cognitive psychology, there was little final consensus in the group on the benefits of actually implementing regular MCQ testing in practice as a means of supporting student learning.
Bjork, Elizabeth Ligon, Soderstrom, Nicholas C. and Little, Jeri L. (2015) Can multiple-choice testing induce desirable difficulties? Evidence from the laboratory and the classroom. The American Journal of Psychology 128(2), pp. 229–239.
Bjork, R. A. (1994) Memory and metamemory considerations in the training of human beings. In J. Metcalfe and A. Shimamura (eds.) Metacognition: Knowing About Knowing. Cambridge, MA: MIT Press, pp. 185–205.
Brown, Peter C., Roediger III, Henry L. and McDaniel, Mark A. (2014) Make It Stick: The Science of Successful Learning. Cambridge, MA: Harvard University Press.
Roediger, Henry L. III and Karpicke, Jeffrey D. (2006) Test-enhanced learning: taking memory tests improves long-term retention. Psychological Science 17(3), pp. 249–255.