The five characteristics that make an effective language test
This post comes at the request of teachers asking about tips and guidance on how to design useful classroom language tests. The post is quite thorough and supported with examples in order to help the reader understand the most important concepts that relate to language testing design.
Part of our job as teachers is to assess students’ learning and academic achievement. Language assessment is an important phase of our instruction. There are several reasons why we, teachers, have to assess students. We assess for pedagogical reasons when the primary purpose is to help students improve, promote their learning, and make progress. We also assess for administrative reasons when our purpose is to make judgments about students’ learning. This is the type of assessment that usually comes at the end of a unit or semester, and which we use to report students’ results to the administration.
Regardless of its purpose, language assessment remains one of the most challenging tasks for language teachers to undertake. When we design language tests, we, teachers, tend to ask questions like:
- Do our tests reflect the principles of test design?
- Are the results of our tests valid?
- Are the results of our tests fair to students?
- Do students consider test scores fair?
The primary goal of this post is to draw on the theoretical research of language test design and share the criteria and principles that classroom language teachers have to abide by when designing tests for their students.
The first principle of effective language tests: Validity
Validity is one of the most central terms in psychological measurement and language testing, and so it comes at the beginning of this post. Validity is quite difficult to understand, so let us start with a hypothetical situation. Let us suppose that a teacher of English administered a language proficiency test to their students (a test that is intended to measure students’ overall language proficiency in English). Let us suppose one of the students got 19.5/20 on the test.
What does this score mean? What inferences can the teacher make based on this score?
For example, let us assume that the obtained score means that the student has developed proficiency in English and has the ability to use English communicatively. Let us also suppose that one of the decisions that one can take on the basis of this inference is that the student can join a post-high school institution or university to pursue their studies in English. In this case, there are three important questions to ask:
- Can the teacher prove that the score really means what it is taken to mean?
- Can the teacher justify the inferences and interpretations (that the student has developed proficiency in English)?
- Can the teacher provide evidence that the decisions made based on the test score are fair?
To answer these three questions, the teacher has to argue for the validity of the test. The related literature shows that the concept of validity has evolved over time, and there are several definitions of it. For example, Brown (2004) defines it as the extent to which the test measures what it is intended to measure. The Standards for Educational and Psychological Testing defines it as “the degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test” (APA, AERA, NCME, 2014, p. 14).
To argue that the test is valid, the test designers (teachers) have to provide evidence that the test measures the intended content and the intended construct. Two types of evidence are discussed here: content-related evidence and construct-related evidence.
Content validity (content-related evidence)
Content validity refers to the degree to which the content of the test represents the content of the syllabus, what is taught, or what is intended to be tested (Brown, 2004). For example, if a teacher teaches five units and is supposed to test all of them but provides a test that measures only two units, the test lacks content validity. Another example is a global test that includes only the content covered in the last weeks and excludes the content the students studied at the beginning of the semester. Such a test is not valid in terms of content because it does not assess all the intended learning objectives.
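One rough way to check content coverage is to tag each test item with the unit it draws on and compare those tags with the units actually taught. The sketch below is only an illustration in Python: the function name `content_coverage` and the sample data are hypothetical, and counting units is just one piece of content-related evidence, not a full validity argument.

```python
def content_coverage(taught_units, item_units):
    """Return the share of taught units sampled by the test and the units left out."""
    taught = set(taught_units)
    if not taught:
        raise ValueError("at least one taught unit is required")
    tested = set(item_units) & taught
    missing = sorted(taught - tested)
    return len(tested) / len(taught), missing


# Hypothetical test: five units taught, but every item targets unit 4 or unit 5.
taught_units = [1, 2, 3, 4, 5]
item_units = [4, 4, 5, 5, 5, 4]  # the unit each test item draws on

coverage, missing = content_coverage(taught_units, item_units)
print(f"Coverage: {coverage:.0%}, untested units: {missing}")
```

With these made-up figures the test samples two of the five units taught, so coverage is 40% and units 1 to 3 go untested, which mirrors the five-unit example above.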
Construct validity (construct-related evidence)
Construct validity refers to the degree to which the test measures the intended construct, that is, language ability (Brown, 2004). This concept is problematic because the term construct is very difficult to define. A construct is a term from psychology that refers to an abstract entity believed to exist inside the mind of students. For instance, anxiety and intelligence are psychological constructs, while proficiency, listening ability, and reading ability are examples of language constructs.
A test is said to be valid if it measures the intended construct. In other words, if a test that was specifically designed to measure grammar is found to measure vocabulary knowledge as well, it lacks construct validity because it assesses something other than the intended construct.
The second principle of effective language tests: Reliability
Davidson and Fulcher defined reliability as the consistency of measurement or scores (2007, p. 15). A reliable test will yield similar scores on two different occasions. For example, if a teacher administers a test today and administers the same test to the same group of learners a couple of days later, it should yield roughly the same scores. If the teacher gets consistent scores, the test is reliable.
A reliable test will also yield similar scores across different forms of the test. If a teacher administers two variants of the same test to the same group of students (the two variants are intended to measure the same content and to be used interchangeably), the teacher should get consistent scores on both. It should make no difference to the students which variant they take.
If the test scores are inconsistent, then the two variants are not reliable (they are unreliable indicators of the language ability the teacher wants to measure). Reliability is a very important characteristic of the test. Unless the test scores are relatively consistent, they do not provide us with any information about students’ language abilities. It is very difficult to eliminate score inconsistencies, and there will always be a degree of unreliability. What teachers can do instead is to minimize the potential sources of unreliability.
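One rough way to put a number on the consistency described above is to correlate the two sets of scores, whether they come from the same test given twice or from two variants of it. The sketch below assumes a Pearson correlation, which is a common choice but not one this post prescribes; the function name `pearson_r` and the score lists are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical scores (out of 20) for the same six students on two sittings.
first_sitting = [12, 15, 9, 18, 14, 11]
second_sitting = [13, 14, 10, 18, 15, 10]

r = pearson_r(first_sitting, second_sitting)
print(f"Correlation between the two sittings: {r:.2f}")
```

A value close to 1 suggests that students are ranked in much the same way on both occasions; a noticeably lower value is a signal to look for sources of unreliability. With a class-sized group the figure is only a rough indicator.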
The reliability of the test might be affected by several factors:
1 – Student-related reliability:
This includes issues such as temporary illness, fatigue, test anxiety, and other psychological and physical factors. These factors may affect the scores of the students on a test.
2 – Rater-related reliability:
This relates to the person who scores the test. Rater-related unreliability may happen because of human errors, subjectivity, and bias.
Inter-rater reliability:
This problem arises when two raters give inconsistent scores to the same test, possibly because of inattention to the scoring criteria, inexperience, or bias. This kind of unreliability is less common in classrooms because teachers usually correct their own tests.
Intra-rater reliability:
This kind of issue is common among teachers. It tends to happen when teachers have many students in the classroom, say 45 students in one class. The way they score the first ten papers may differ considerably from the way they score the last ones. They may go easy on the first students and be strict with the last ones, or vice versa. This kind of unreliability might be caused by bias towards students perceived as good or bad, carelessness, or unclear scoring criteria.
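Rater consistency can also be checked roughly. The sketch below assumes that scripts are scored in bands (here A, B, C) and uses Cohen's kappa, one common agreement statistic that this post does not itself prescribe. The same function works for two different raters or for one teacher who re-scores the same scripts after a break; the function name `cohen_kappa` and the band data are hypothetical.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Cohen's kappa for two sets of categorical scores given to the same scripts."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both score lists must be non-empty and equally long")
    n = len(scores_a)
    # Proportion of scripts that received the same band both times.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Agreement expected by chance, from how often each band was used.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[band] * freq_b[band] for band in freq_a) / (n * n)
    if expected == 1:  # only one band was ever used, so agreement is trivially perfect
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical bands given to the same ten essays on two scoring rounds.
first_round = ["A", "B", "B", "C", "A", "B", "C", "C", "B", "A"]
second_round = ["A", "B", "C", "C", "A", "B", "C", "B", "B", "A"]

kappa = cohen_kappa(first_round, second_round)
print(f"Kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement that would be expected by chance, so it is usually lower than the simple percentage of matching scores. A low value is a prompt to revisit the scoring criteria rather than a verdict on the rater.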
3 – Test-administration reliability:
Unreliability may come from the conditions in which the test is administered (noisy classroom, unclear items, the setting, the light in the room, and the conditions of chairs and desks).
4 – Test reliability:
Sometimes, the nature of the test itself can cause measurement errors. If the test is too long, the students might become tired. If the items are unclear or ambiguous, they may affect students’ performance, and as a result, scores may become inconsistent.
The third principle of effective language tests: Authenticity
Authenticity is defined as the degree of correspondence of the characteristics of a given language test task to the features of the target language use context (Bachman and Palmer, 1996, p. 23). The target language use context is the context in which the test takers are expected to use the target language outside the classroom.
Teachers teach English because they want their students to be able to use English in real-life situations. So when they test them, they want to measure their ability to use the language in real life, beyond the testing situation itself.
Authenticity is a critical aspect of language testing because it links the test task to the domain to which teachers want their score interpretations to generalize. It helps them claim generalization beyond the performance on the language test. An authentic test has the following features:
- The language in the test is natural,
- The items are contextualized,
- The topics are meaningful, relevant, and interesting,
- The items are organized thematically, and
- The tasks simulate real-world tasks.
Teachers, as language test developers, have to consider authenticity in designing their own tests. For instance, in developing a reading test, teachers are likely to choose a passage whose topical content matches the kind of topics and material they think the test taker may read outside the testing situation.
The fourth principle of effective language tests: Practicality
The fourth characteristic of language tests is practicality. It refers to the relationship between the resources that will be needed in the design, development, and use of the test and the resources that are actually available for the testing activity (Bachman and Palmer, 1996). There are two different types of resources that teachers have to take into consideration:
1)- Material resources
This refers to the space or rooms where the test will take place and to equipment such as typewriters, computers, tape materials, and recorders.
2)- Time
This includes the time to design and develop the tests, time for administering the test, time for scoring the test, and time for analyzing the test scores.
We can achieve practicality when the available resources meet the demands of the test specifications. If the resources required by the test specifications do not exceed the available resources at any stage, then the test is practical.
If the available resources are exceeded at any stage, then the test is impractical. When designing classroom tests, teachers have to think of the practicality of the test.
- A practical test is easy to design, administer and score.
- It is not excessively expensive.
- The layout is easy to follow and understand.
- Students can complete it within a reasonable time.
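The comparison between required and available resources can be sketched as a simple stage-by-stage check. This is a minimal illustration, not a procedure taken from Bachman and Palmer (1996): the function name `practicality_report`, the stage names, and the hour figures are all hypothetical.

```python
def practicality_report(required, available):
    """Compare required and available resources per stage (same unit, e.g. hours).

    Returns (is_practical, shortfalls), where shortfalls maps each stage that
    needs more than is available to the size of the gap.
    """
    shortfalls = {
        stage: need - available.get(stage, 0)
        for stage, need in required.items()
        if need > available.get(stage, 0)
    }
    return len(shortfalls) == 0, shortfalls


# Hypothetical teacher hours for one classroom test.
required_hours = {"design": 4, "administration": 1, "scoring": 6, "analysis": 2}
available_hours = {"design": 5, "administration": 1, "scoring": 3, "analysis": 2}

is_practical, shortfalls = practicality_report(required_hours, available_hours)
print(is_practical, shortfalls)
```

With these made-up figures the test is not practical, because scoring needs three more hours than are available. The check treats the test as impractical as soon as a single stage falls short, which follows the "at any stage" condition stated above.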
The fifth principle of effective language tests: Washback
Washback refers to “the effect of testing on teaching and learning” (Hughes, 1989, p. 1). Studies have found that tests, especially high-stakes ones, may have deleterious effects on teachers and students. There are three main aspects of the test that may affect students.
1. The experience of taking and preparing for the test
2. The feedback they receive about their performance
3. The decisions made about them based on the test score.
1. The experience of taking and preparing for the test
The experience of preparing for and taking the test may have a direct effect on students. In high-stakes testing contexts where public exams are operating, students spend several weeks preparing for such exams because the results are used to make important decisions about their education and future. This preparation can be superficial and come at the expense of true learning.
The experience of taking the test may affect students’ language abilities. For many students, the test can provide some confirmation or disconfirmation of their perceptions of their language abilities. Just imagine a student who got a low grade on the first test, a low grade on the second test, and a low grade on the third test. This student may change their perception of their language ability. They may attribute their failure on the test to their lack of ability rather than any other factor.
2. The feedback they receive about their performance
The types of feedback students receive about their performance are likely to affect them. Feedback should be meaningful, relevant, and useful. Feedback should help students answer three main questions:
- Where am I going?
- How am I going?
- How to get there?
Unfortunately, a common practice among teachers is to give feedback only in the form of a grade, which does not explain to students how to improve. Teachers need to consider additional types of feedback, such as verbal descriptions that encourage students to improve and go beyond their actual performance.
3. The decisions made about them based on the test score
The decisions that are made about students based on their performance on the test may directly affect them. Decisions such as acceptance or non-acceptance into a program, passing or failing a school grade, or passing or failing a proficiency test are important and may have serious consequences for students. Therefore, teachers need to consider the fairness of the decisions they make. Fair decisions are those that are equally appropriate regardless of students’ group membership, ethnicity, race, etc. Teachers need to ask whether the decisions, procedures, and criteria are applied uniformly to all students.
How to achieve positive washback from classroom tests
Hughes (2003) shared practical suggestions on how to promote beneficial washback from classroom tests.
1. Test the abilities whose development you want to encourage.
If teachers want their students to develop speaking skills, then they have to test speaking ability. If they want their students to develop grammatical knowledge, then they have to test grammatical knowledge. This seems obvious, yet it is often not done. Unfortunately, there is a conspicuous tendency to test what is easy to test rather than what is important to test. There are a couple of reasons why certain abilities are not tested. Teachers would say that they are prioritizing practicality and reliability. For instance, many teachers struggle to test listening skills simply because the resources are not available for such a procedure. Additionally, testing listening may introduce reliability problems in the testing setting.
2. Sample widely and unpredictably
Normally, a test can measure only a sample of everything in the specifications (or what the teacher taught to students). It is important that the sample of the test represents the content teachers want to measure. If the sample focuses on limited content, washback is likely to be felt only in that area. For example, if a writing test repeatedly includes the same two types of tasks, say writing an email and a paragraph, the outcome is that the students will spend most of their time and preparation on these two areas.
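One possible way to keep the sample wide and hard to predict is to draw the task types for each test at random from the full list in the specifications, instead of reusing the same two every time. The sketch below is only an illustration: the function name `draw_tasks` and the list of task types are hypothetical, and in practice a teacher would still check that the drawn tasks suit the objectives of the test.

```python
import random


def draw_tasks(task_pool, n_tasks, seed=None):
    """Pick n_tasks distinct task types at random from the full pool."""
    if n_tasks > len(task_pool):
        raise ValueError("cannot draw more tasks than the pool contains")
    rng = random.Random(seed)  # pass a seed only to reproduce a particular draw
    return rng.sample(task_pool, n_tasks)


# Hypothetical writing task types listed in the test specifications.
task_pool = ["email", "paragraph", "story", "report", "review", "letter of complaint"]

tasks = draw_tasks(task_pool, 2)
print(tasks)
```

Because any task type in the pool may come up, students have a reason to prepare across the whole specification rather than for two familiar formats.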
3. Use direct testing
This refers to testing performance and skills. Direct testing requires students to perform the target task. If teachers want their students to learn to write compositions, they must test them by having them write compositions. If they want their students to perform a business phone call, the test should measure students’ performance in making a call.
4. Base achievement tests on objectives
If teachers’ achievement tests are based on standards and instructional objectives, they will provide a clear picture of what has actually been achieved. Therefore, teachers and learners can evaluate learning and teaching against those objectives.
5. Ensure that the students know and understand the test
Teachers have to make sure that students know and understand the following aspects:
- The objectives of the test,
- The format of the test,
- The construct and content that the teacher wants to measure,
- Examples of test items and tasks.
Share the post if you think it is useful. Share your thoughts and perspectives in the comments below.
References
- Alderson, J. C., and Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), pp. 115-
- Bachman, L. and Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
- Bachman, L.F. and Palmer, A. (2010) Language assessment in practice: developing language assessments and justifying their use in the real world, Oxford: Oxford University Press.
- Bailey, K. M. (1996). Working for washback: A review of the washback concept in language testing. Language Testing, 13, pp. 257-279.
- Hughes, A. (2003). Testing for language teachers, 2nd ed. Cambridge: Cambridge University Press.
- Ghaicha, A, and Oufela, Y. (2020). Backwash in Higher Education: Calibrating assessment and swinging the pendulum From Summative Assessment. Canadian Social Science, 16(11), pp. 1-6.
- Messick, S. (1995). Standards of Validity and the Validity of Standardizing performance Assessment. Educational Measurement: Issues and Practice, 14(4), pp. 5-8.
- Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), pp. 241-256.
- Messick, S. (1998). Test Validity: A Matter of Consequence. Social Indicators Research, 45, pp.35-44.
- Van der Walt, J.L. and Steyn, H.S. (2008). The validation of language tests. Stellenbosch Papers in Linguistics, 38, pp. 191-204.