‘Is our assessment reliable and valid?’ asks Dylan Wiliam

Emeritus Professor Dylan Wiliam has been asking difficult questions about assessment for the past 25 years, trying to unpack the “black box” and discover what assessment can and cannot do.

Dylan claims that our concerns about validity and reliability in assessment are not necessarily helpful, as they don’t guarantee that we are meeting the intended purpose of assessment. For him, asking about reliability and validity is asking the wrong question.

Assessment is just a process. It’s the purpose of that process that counts.

Assessment is a procedure for drawing inferences.

Assessment is often described as being of a particular ‘type’, such as summative, formative, evaluative, etc. These are not actually types of assessment, but purposes for gathering information through assessment (Wiliam & Black, 1996, as cited in Wiliam, 2000, p. 3).

Common purposes for collecting assessment data include:

  • Establishing learner understanding and progress (Griffin & Care, 2009; Masters, 2013; Newstead, 2003; Newton, 2007)
  • Determining student transition, progress and success, and establishing learner accountability (Wiliam, 2011)
  • Motivating students (Newstead, 2003; Masters, 2013)
  • Meeting system requirements such as grading, ranking and reporting on student learning (Newstead, 2003)
  • Determining the allocation of funding, and the intervention, monitoring and evaluation of programs (Newton, 2007)

When we design assessment, there are certain purposes we need to meet because our educational systems and culture demand it. There are other purposes that we choose ourselves (either consciously or unconsciously). It’s important to acknowledge that we can’t meet all of these needs at the same time and, sometimes, we need to prioritise them. For example, certain tasks might be designed to provide feedback for students to improve their future learning, while others might be designed to provide grades that allow the completion of a degree or registration in a profession. Developing assessment tasks requires decision-making about which parts of the student experience are ‘valuable’ and ‘worth’ marks in the academic grading system, which parts are ‘valuable’ for the learning experience, and how to navigate the myriad choices involved in achieving these.

Am I using the right assessment?

At the heart of effective assessment is knowing what it is that you want to assess.

If the point of assessment is to draw conclusions, or make inferences, about learning, then validity and reliability are not properties of the task you set. No type of assessment can be considered more or less valid or reliable than another, because those characteristics relate to the use we make of the data we collect.

For example, an essay or an examination can’t be inherently more reliable or valid than other tasks; the validity and reliability of any task depend on how well it matches the intended purpose. The type of task itself is relevant only in terms of its ability to collect the evidence that purpose requires.

Any assessment is the right assessment if it measures what we intended it to measure and if the evidence we collect meets the purpose of the assessment.

Am I assessing the right things?

Assessments cannot measure everything. 

Once we have decided what could be assessed (which can sometimes be the most difficult part!), we need to make decisions about which aspects we will assess, as it is impossible to assess all things all of the time. Often assessment design becomes a discussion space for defining and redefining the “construct” (what we want to assess) through calibration and moderation processes. If you are struggling to design an assessment, first ask whether the underlying construct you are trying to assess has been defined.

Consider the frequency, breadth and criteria of assessments, as these each have implications:

  • You can make assessment more reliable by assessing the things that are easy to assess and easy to mark, but this may make the assessment less valid in terms of its purpose.
  • You can increase reliability by assessing more frequently or by setting longer assessments, but this increases the time and cost of marking and can narrow the scope of the construct being assessed (see the formula sketched after this list).
  • Sometimes the “variable” in an assessment is not the concept or skill the assessment is designed to measure; it is the system or chance (e.g., did the student just happen to study the right thing?).
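
The talk itself does not go into the psychometrics, but the classical Spearman–Brown prophecy formula (a standard psychometric result, not something specific to Wiliam’s argument) is one way to see why lengthening an assessment tends to raise reliability:

$$\rho^{*} = \frac{k\rho}{1 + (k - 1)\rho}$$

Here $\rho$ is the reliability of the current assessment and $k$ is the factor by which it is lengthened with comparable items. Doubling ($k = 2$) an assessment whose reliability is 0.6 predicts $\rho^{*} = 1.2 / 1.6 = 0.75$. The prediction only holds, however, if the added items measure the same construct, which is exactly the validity trade-off described in the list above.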

Agonising over whether students should submit an essay or a report, or whether an examination should be invigilated or take-home, is not the most important part of assessment design. Rather, these are conditions that arise from earlier decisions we have made (either consciously or unconsciously) about the purpose of the assessment in the first place.

Before we choose the ‘vehicle’ for evaluating student learning, let’s ensure that we know the ‘destination’ and understand the purpose of our assessments separately from the tasks themselves.

We need to recognise that the design decisions we make will have an impact on our students and on the evidence we gather. For example, asking students to ride a bike to the Moon isn’t going to help us understand whether they can enter a freeway in a sedan!

Being clear about the performance we need to assess will help us to know whether a report, an essay, an examination or any other task type will give us the evidence of learning we are looking for. To be effective, we need to choose the right vehicles for our destination; that is, select tasks based on their intended purpose and recognise their limitations.



About the author

Rosie Mackay has worked in education for the past 15 years, leading teams and learning in schools and leading educational change to improve teaching and learning in the domestic and transnational education sectors. She is currently Senior Specialist, Learning and Teaching at the Monash Education Academy in the Portfolio of the Deputy Vice-Chancellor Education, where she works across institutional projects relating to teaching and learning quality, academic integrity and assessment security.


References and Further Reading

Dylan Wiliam’s website: https://www.dylanwiliam.org/Dylan_Wiliams_website/Welcome.html

Griffin, P., & Care, E. (2009). Assessment is for teaching. Independence Journal, 34(2), 56-59. Retrieved from http://www.arc-ots.com/alp/resources/M1_reading.pdf

Masters, G. (2013). Reforming educational assessment: Imperatives, principles and challenges (Australian Education Review No. 57, pp. 9-31). Retrieved from Australian Council for Educational Research website: http://research.acer.edu.au/aer/12/

Newstead, S. (2003). The purposes of assessment. Psychology Learning and Teaching, 3(2), 97-101.

Newton, P. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy & Practice, 14(2), 149-170.

Wiliam, D. (2015). Principled assessment design. Hawker Brownlow Education.

Wiliam, D. (2011). What assessment can, and cannot do (originally published as Bryggan mellan undervisning och lärande). Pedagogiska Magasinet. Retrieved from http://www.dylanwiliam.org/Dylan_Wiliams_website/Papers.html

Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 92(1). https://doi.org/10.1177/003172171009200119

Wiliam, D. (2006). Assessment for learning: Why, what and how. Paper presented at the Cambridge Assessment Network, Cambridge, United Kingdom. Retrieved from http://www.dylanwiliam.org/Dylan_Wiliams_website/Papers.html

Wiliam, D., & Black, P. (1996). Meanings and consequences: A basis for distinguishing formative and summative functions of assessment? British Educational Research Journal, 22(5), 537-548.