Language Testing
Shared by Nong Hien Huong | 12/10/2018
Structure
Test Development: Types of tests, Qualities of a good test
Issues Specific to Language Tests
Developing Item Specifications
Language Test Development:
From Test Specification to Test Use
What makes a good test?
Validity:
the test fulfils its purpose,
the test gives you the information you want,
the test enables you to make well-founded decisions
Reliability:
the test is precise enough for its purpose
Practicality:
the test can be administered and scored in a reasonable amount of time and with reasonable use of resources
Fairness:
students know the purpose of the test,
results are only used for decisions that they can reasonably inform
What test for what purpose?
proficiency test:
assesses students' knowledge of a language in general, without reference to a curriculum or syllabus,
usually ranks students in relation to each other (a norm-referenced test),
one of the main considerations in constructing them is discrimination: use a mix of easy, medium, and difficult items
this will make it possible to distinguish between students at different levels
if the test consisted only of easy items, there would be no way to tell a medium-ability student from a high-ability student
Examples: TOEFL, TOEIC, IELTS, university admission tests
achievement test:
assesses what students have learned,
ranks students in terms of their degree of mastery of a curriculum or syllabus (a criterion-referenced test),
discrimination may or may not be important for achievement tests:
if a very well-defined body of knowledge is tested (e.g., a set of vocabulary words), it’s fine to just randomly sample from the possibilities and not worry about discrimination
if a more abstract construct is tested (e.g., reading comprehension), discrimination is important because only by including items of different levels of difficulty can different levels of knowledge in the students be distinguished
Examples: mid-term and final tests in schools and universities
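Discrimination, which matters for both test types above, is usually quantified with item statistics. Below is a minimal sketch of the classic discrimination index in Python; the function name, the invented scores, and the 27% grouping convention are illustrative choices, not a fixed standard:

```python
# Discrimination index D: proportion correct among high scorers minus
# proportion correct among low scorers. D near 0 means the item cannot
# separate ability levels; D near 1 means it separates them well.

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """item_scores: 1/0 per test taker for one item; total_scores: test totals."""
    n = len(total_scores)
    k = max(1, int(n * fraction))           # size of the top / bottom groups
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top, bottom = ranked[:k], ranked[-k:]
    p_top = sum(item_scores[i] for i in top) / k
    p_bottom = sum(item_scores[i] for i in bottom) / k
    return p_top - p_bottom

totals = [9, 8, 8, 7, 6, 5, 4, 3, 2, 1]     # invented total scores
easy_item = [1] * 10                         # everyone answers correctly
hard_item = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # only high scorers succeed
print(discrimination_index(easy_item, totals))  # 0.0 -- tells us nothing
print(discrimination_index(hard_item, totals))  # 1.0 -- discriminates well
```

An all-easy test yields items like the first one: everyone passes, so medium- and high-ability students are indistinguishable, which is exactly the point made above.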
Know your construct - Validity
Construct: the invisible, intangible attribute about which testers are trying to collect information, e.g., English language proficiency, intelligence, religious devotion, suitability as a pilot etc.
the problem of constructs is that they are not directly observable, so testers must gather observable performance and then draw conclusions about the construct
in other words, testers must collect the right kind of evidence to make statements about the construct
the construct – evidence connection must be theoretically and empirically defensible, e.g.,
a test taker's time in the 100-yard dash (evidence) has no conceivable connection to their ability to comprehend spoken English (construct) => this performance provides no useful information
a test taker's score on a test of English listening comprehension with taped dialogs and multiple-choice questions (evidence) has a much stronger connection to their ability to comprehend spoken English (construct) => this performance provides useful information
Threats to validity
construct underrepresentation: only an aspect of the construct is tested, not everything, e.g., a writing test where students only produce individual sentences, not extended texts
construct-irrelevant variance: factors that might influence the measurement but that are not the object of the measurement, e.g., in a listening comprehension test with a tape and multiple choice questions, reading ability influences the result, possibly also topic knowledge
Sources of construct-irrelevant variance
Consider what might introduce construct-irrelevant variance in each of the following:
A test of ESL listening, where test takers listen to a short conversation and then answer multiple-choice questions
A test of ESL writing, where test takers have 30 minutes to produce a brief essay on a general topic
A test of ESL speaking, where test takers role-play a situation with the tester
A test of ESL reading, where test takers read the text and then answer brief-response questions about the main point, specific information, and the author's stance
Construct Validity
Construct Validity: A test has construct validity if it is a way to gain useful information about the construct and therefore inferences and decisions based on test scores are justifiable and defensible
To make sure you make construct-valid tests, work backwards from decisions
Decisions: What decisions will you make based on the scores?
Construct: What construct underlies these decisions?
Evidence: What evidence / information do you need to find out about the strength of the construct in a test taker?
Measurement procedures: What kinds of testing procedures will help you gather that information?
Example: Constructing a test for assessing students' learning after a semester of ESL
Decision: Is the student ready for the next level?
Construct: English proficiency
Evidence: test takers' comprehension and production of academic English in the oral and written mode
Measurement procedures: brief-response listening comprehension tests with lecture stimuli, multiple-choice reading test with academic texts, oral interview, writing sample
Test items: Reliability and Practicality
Reliability: The precision / consistency with which the test measures.
Reliability is a necessary condition for validity: an imprecise test cannot elicit useful information
Reliability is not a sufficient condition for validity: a test may be highly precise but may measure something entirely different than the construct (e.g., 100-yard dash as a measure of ESL vocabulary knowledge)
the more items measure the same attribute, the more precise the measurement will be => higher reliability!
to increase reliability, many short items are better than a few long items; essays are the worst for reliability
however, certain abilities can only be measured with long items, e.g., essay writing ability => sometimes faithful representation of the construct means a loss of reliability
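The link between the number of items and reliability sketched above is often estimated with the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened (or shortened) with comparable items. A minimal sketch with made-up numbers:

```python
# Spearman-Brown prophecy formula: predicted reliability after changing
# test length by a factor k, assuming the added items are comparable
# to the existing ones (a strong assumption in practice).

def spearman_brown(reliability, length_factor):
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# A test with reliability .70, doubled in length with comparable items:
print(round(spearman_brown(0.70, 2), 3))    # 0.824 -- more items, higher reliability
# The same test cut to half its length:
print(round(spearman_brown(0.70, 0.5), 3))  # 0.538 -- fewer items, lower reliability
```

The formula also quantifies the trade-off mentioned above: replacing many short items with one long essay task shrinks the effective length factor and thus the predicted reliability.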
Practicality
Practicality: The ratio between resources available and resources necessary to administer and score the test.
A highly practical test requires few resources and is likely to be used whereas an impractical test requires many resources and is much less likely to be used
in reality, practicality is a trade-off between measurement precision and construct validity on the one hand and real-world constraints on the other:
major considerations in practicality:
preparation: a test must be written, assembled, piloted etc., so if items are easy to write and can be re-used, the test becomes more practical
length: a test cannot be so long that test takers get tired and lose concentration; maximum length depends on test takers' proficiency, but a 4-hour test is the absolute maximum
medium: a paper-and-pencil test is much cheaper to produce than a computer-based test
scoring: dichotomous items (multiple choice, true / false) are much easier to score than extended writing (essay) items or speaking tests (Oral Proficiency Interviews); a test that can be scored by a machine is very practical with regard to scoring
what if reliable measurement of the construct would require so many items that the test would become too long and not practical? => limit the scope of the construct, limit the inferences drawn from scores
item specifications are necessary to make sure items are produced in a systematic fashion and have predictable measurement properties
even with item specifications there will still be "rogue" items that measure something other than the construct under investigation
for advice on how to write items of different types, cf. Brown (2005) or Hughes (2003)
Validation
once the test is built, it needs to be piloted with a small group to make sure the instructions are comprehensible and it can be done in the time allotted
once piloted, it needs to be revised and then run experimentally with a larger population
validation is the collection of evidence to ensure that the test measures the construct it is supposed to measure and inferences drawn from scores are defensible
Consequences: Fairness, Ethics and Inferences
Scores from tests are used to make inferences about the strength of the construct in test takers, e.g., their English proficiency, and these inferences lead to decisions, e.g., whether to admit the test taker to a university program or not
for the test to be fair, it is important to avoid test bias, i.e., test taker characteristics other than the construct influencing the scores
test bias is present if items are easier for one group of test takers than for another, e.g., different genders, different races, different ages, different socio-economic backgrounds
certain background characteristics are likely to coincide with certain constructs, e.g., test takers from rich families may have attended better schools, had better ESL instruction, have higher English proficiency, and therefore score higher on an English test: whether this is bias or not is a judgment call
fairness increases the less judgment is involved in scores: "objectively" scored items (like multiple-choice) are best, "subjectively" scored items (like oral proficiency interviews or essays) are more problematic but can be improved by having scoring guidelines, scorer training, and multiple scorers
ideally, scoring would be anonymous but that is not always possible
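The value of scoring guidelines, scorer training, and multiple scorers can be checked empirically with an inter-rater agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch, with invented ratings from two hypothetical essay raters:

```python
# Cohen's kappa for two raters: (observed agreement - chance agreement)
# divided by (1 - chance agreement). 1.0 = perfect agreement,
# 0.0 = no better than chance.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    # Chance agreement from each rater's marginal category frequencies
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

rater1 = ["pass", "pass", "fail", "pass", "fail", "pass"]  # invented data
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.667
```

A low kappa after scorer training suggests the scoring guidelines are ambiguous and need revision.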
Issues Specific to Language Tests
Competence and Performance
Competence is idealized knowledge: what someone would be able to do under ideal conditions (no fatigue or distraction, full concentration)
Performance is actual production: what someone does in a real-world situation
We can only ever observe performance and try to infer competence from it
Performance assessments try to avoid this inference by having test takers do real-world (like) tasks
Controlled and automatic processing
Automatic processing is fast and effortless, and it is unavoidable in listening and speaking
In conversations, people have to comprehend and contribute quickly, in real time, otherwise they get lost
Controlled processing is slower and takes more effort
It’s possible in reading and writing where there’s no pressure of an ongoing interaction (however, internet chat might be different)
Testing Language Skills
Learners’ L2 competence can be divided in various ways for the purposes of assessment, for example:
“building blocks”: Grammar, phonemes, vocabulary
Skills: listening, reading, writing, speaking
Notions & functions: complaining, describing, negotiating
Situations: on the phone, in a shop, in a lecture
Genres: letter writing, giving a speech, making small talk…
Testing skills and the language code in isolation is context-free, which is unnatural (real language use always happens in context)
However, skills are possibly applicable across contexts
Testing functions & notions or situations / genres is more contextualized and closer to actual language use
However, it requires knowing in advance how & where learners will use the language (needs analysis)
Unlike skills, there’s an unlimited number of notions, functions, and situations
Issues in Skills Assessments
Cross-contamination: most tests assess several skills at the same time, e.g.,
a writing test where the instructions are written out also assesses reading to an extent
a listening test with multiple-choice questions also assesses reading
any speaking test done through an oral interview assesses listening
a reading test with brief-response prompts also assesses writing
Pervasive language code effects: low grammatical competence or lack of vocabulary will affect a test taker’s performance in all four skills
This leads to higher correlations between test sections but it is also inefficient because the same attribute is measured several times
It may also be unfair because a test taker is punished for lack of ability in one area multiple times
Item Specifications
item specs provide a blueprint of an item (type)
this helps create more similar items (increases reliability!) or replace overused / retiring items
Components of item specs
GD (General Description): general description of the item,
PA (Prompt Attributes): description of prompt attributes: what will the test question look like?
RA (Response Attributes): description of the response attributes: what will the test taker have to do?
SI (Sample Item): sample item: an example item
[SS: specification supplement: additional information]
General Description (GD)
concise summary of what the spec is about
sketch out the ability or criterion for which the spec is supposed to produce tasks
GDs can be quite general…
(R1) This item tests reading comprehension. Test takers will understand the gist of a non-technical text.
(W1) This item tests writing. Test takers will be able to write a summary of a genuine academic lecture on a social science topic.
… or more specific…
(R2) This item tests reading comprehension. The test takers will demonstrate their in-depth understanding of a non-technical text, specifically:
they will understand the gist of the text,
they will be able to extract specific information,
they will understand the author's stance towards the issue,
they will understand the logical structure of the text.
(S1) This item tests speaking. Test takers will be able to bargain for a lower price in a shop setting.
(G1) This item tests grammar. Test takers will be able to recognize the correct tense for talking about the past, present, and future.
Prompt Attributes
describes the task / stimulus / elicitation procedure
summarizes what the test taker will have to do
(R1) The test taker will read a complete, self-contained non-technical text of between 300 and 500 words. The text should deal with an academic topic (e.g., social science, the environment, psychology, political science, business) but should be written for a non-specialist audience.
Articles from Time or Newsweek are often suitable. The question will be asked in a multiple-choice format with a one-sentence item stem and 4 response options:
the correct answer
an answer focusing on a minor point in the text
an answer overreaching the text's main point
an answer claiming the opposite of the main point
(S1) The test taker will be interviewed by one tester. The test taker will be given a role-play card explaining the situation of bargaining in a shop. The text on the role play card should specify what the object is and that the goal of the interaction is to reduce the price substantially. The tester will assume the role of the shopkeeper.
The object to be bargained about can be a vase, clothing, etc., and objects in the test environment can be used in place of the imaginary object. The tester will open the interaction by quoting a price and in the course of the interaction will try to keep the price as high as possible while keeping the interaction going. The interaction will not take more than 5 minutes.
Response Attributes (RA)
RAs describe what the test taker will have to do
(R1) The test taker will mark the best answer on the answer sheet.
(S1) The test taker will claim that the price of the object is too high and will try to bargain for a lower price. S/he will respond to tester reactions and obtain a significantly reduced price. S/he will be comprehensible and use appropriate language, although minor errors that do not interfere with comprehension are acceptable.
Sample Item (SI)
(R1)
[text on test validation]
What is the main point of the article?
1. Validation involves the collection of evidence of test usefulness.
2. Validation depends crucially on high reliability.
3. Validation is a political-educational enterprise.
4. Validation can be done through one of a number of methods.
Problems / Challenges
GD and PA: don't specify the prompt attributes in the general description, e.g.,
This item tests reading comprehension. The test taker will summarize in one paragraph the content of a non-technical text from a popular science journal containing at least one graph or diagram.
the GD should describe the criterion or ability—it should be a reflection of the specific part of the construct that is getting tested
the GD should focus on general description, not the specifics of test materials and responses
PA and RA: the PA should focus on the test materials and their effects, whereas
the RA should focus on the test takers' actions or interactions with the materials