SUMMARY OF LANGUAGE ASSESSMENT
Submitted in Partial Fulfillment of the Language Assessment Course
Lectured by
Dr. Gunadi Harry Sulistyo, MA
by
Risca Dwiaryanti
2121040256
ENGLISH EDUCATION DEPARTMENT
GRADUATE PROGRAM
ISLAMIC UNIVERSITY OF MALANG
2014
I. TEST, MEASUREMENT, EVALUATION, AND ASSESSMENT
Test
A test can be defined as a procedure designed to elicit certain behavior from
which one can make inferences about certain characteristics of an individual.
Measurement
Measurement can also be defined as a process of quantifying communicative
competence by using numbers. When we measure, we must consider what kind of
numbers we are using (e.g., numbers as labels). For example: (0) for female and (1) for male.
The processes of measurement or levels of measurements are as follows:
1. Nominal scales
As its name suggests, a nominal scale consists of numbers that are used to
‘name’ the classes or categories of a given attribute.
2. Ordinal scale
An ordinal scale comprises the numbering of different levels of an attribute that
are ordered with respect to each other; the numbers indicate order only and support no
mathematical operations.
3. Interval scale.
An interval scale is a numbering of different levels in which the distances, or
intervals, between the levels are equal.
4. Ratio scale
The ratio scale is the highest level, possessing all four properties and thus
capable of providing the greatest amount of information.
Evaluation
Evaluation can be defined as the systematic gathering of information for
the purpose of making decisions. In evaluation process, there are some procedures
as follows:
1. Keeping the scores, for example mid-test and final test scores,
2. Collecting the scores,
3. Synthesizing or combining the scores,
4. Comparing the synthesized scores, and
5. Making decisions.
Assessment
Assessment is essentially a data-gathering activity in which the teacher
interacts with the learner in order to clarify what that learner needs.
In doing assessment, an assessor must combine techniques of collecting
data and use many kinds of settings, instruments, and so on, such as observation
(non-test), quizzes, and self-assessment. Non-tests, at the level of authentic
assessment, can take the forms of:
- Portfolio
- Project
- Observation and report
- Extended response
- Interview
- Performances such as drama, play, and simulation.
- Checklist
- Rating scale
To give a better understanding of how these terms are distinguished, the
following figure shows the relation among assessment, tests, measurement, and
evaluation.
[Figure: the relationship among assessment, evaluation, measurement, and tests]
The relationships among measurement, tests, and evaluation are illustrated
above. An example of evaluation that does not involve either tests or measures
(area ‘1’) is the use of qualitative descriptions of student performance for
diagnosing learning problems.
II. FUNCTION OF TEST
The functions of a test are to discover or elicit students' competence, i.e.
the communicative competence within the students' minds. The second
function of tests is to make hidden or latent attributes (competence)
observable.
It is a fact that in English lessons, there are four skills that must be tested.
They are speaking and writing (productive skills), and listening and reading (receptive
skills). In language assessment, all of those skills are known as communicative
competence - the final goal of teaching. The competence must include knowledge,
attitude and skills. In other words, those four skills must be supported by
knowledge, application (skill) as well as attitude.
III. STANDARD STAGES FOR ASSESSMENT TOOL
CONSTRUCTION
a. Standard Stages
It is important to pay attention to the following aspects in constructing an
assessment tool. This is because assessments are employed for different purposes,
and one purpose can conflict with another. Therefore, it is necessary to follow
the stages below:
1. Thinking about the objective of the assessment or test
2. Identifying the competencies in the area of the assessment, such as the
dimensions, variables, sub-variables, and indicators
3. Making up the blueprint
4. Reviewing the blueprint
5. Revising based on the feedback
6. Item writing
7. Reviewing the items in the instruments
8. Revision
9. Try out
10. Banking the items for future use
b. The Purposes of Assessment
There are many kinds of assessment based on the purpose of assessment as
listed below:
1. Aptitude assessment
This kind of test is categorized as future-oriented. The components to be
assessed are the concepts of general language ability, including language
systems such as the phonological, morphological, syntactical, and semantic
systems. These are structural components.
2. Screening
Screening, selection, admission, or entrance assessment is based on future
orientation. The components to be assessed depend on the needs of the
assessment, for example a job analysis.
3. Placement
This type of test is focused on future orientation. The components to be
assessed are essential competencies required in the class. This type of test is
usually used in courses.
4. Achievement
This is a past-oriented test used to assess the competences in the curriculum.
Examples of this type of test are the Ujian Nasional (UN), mid-term tests, and the
like.
5. Diagnostic
This past-oriented test is used to diagnose the competences in the curriculum.
6. Proficiency
A proficiency test has the same components to be assessed as a screening test,
depending on the needs of the assessment. Some examples of
proficiency tests are IELTS, TOEFL, TOEIC, TOEP, and BULATS. A proficiency
test is considered a future-oriented test.
7. Research purposes
The components to be assessed in this type of test are the concepts of general
language ability, and it is categorized as a future-oriented test.
8. Program evaluation
Program evaluation assesses the past intake, input, process, output, and
outcome. This kind of evaluation is commonly used to evaluate existing
programs. Examples of program evaluation are RSBI and MGMP in
Indonesia.
IV. TYPES OF LANGUAGE ASSESSMENT BASED ON APPROACH TO
LANGUAGE TESTING
Five Main Approaches to Language Testing
1. The Essay-Translation Approach
A. Characteristics and Types of Tests in Essay-Translation Approach
1. This is commonly referred to as the pre-scientific stage of language
testing.
2. No special skill or expertise in testing is required.
3. Tests usually consist of essay writing, translation, and grammatical analysis.
4. Tests have a heavy literary and cultural bias.
5. Public examinations resulting from tests using this approach
sometimes have an oral component at the upper-intermediate and
advanced levels.
B. Strengths of Essay-Translation Approach
This approach is easy to follow because teachers will simply use their
subjective judgement.
The essay-translation approach may be used for testing any level of
examinees.
The test model can easily be modified based on the essentials of
the tests.
C. Weaknesses of Essay-Translation Approach
Subjective judgement of teachers tends to be biased.
As mentioned, the tests have a heavy literary and cultural bias.
D. Types of Test in Essay-Translation Approach
Tests used in this approach are essay-writing, translation, and
grammatical analysis
2. The Structuralist Approach
A. Characteristics and Types of Tests in Structuralist Approach
1. This approach views that language learning is chiefly concerned with
systematic acquisition of a set of habits.
2. The structuralist approach involves structural linguistics, which stresses
the importance of contrastive analysis and the need to identify and
measure the learners' mastery of the separate elements of the target
language, such as phonology, vocabulary, and grammar.
3. The skills of listening, speaking, reading, and writing are tested separately
from one another as much as possible.
4. The psychometric approach to measurement with its emphasis on
reliability and objectivity forms an integral part of structuralist testing.
B. Strengths of Structuralist Approach
This approach can be used by testers to assess students' ability
objectively and consistently.
Many test items can be covered in a short time.
Using this approach in testing will help students find their strengths
and weaknesses in every skill they study.
C. Weaknesses of Structuralist Approach
It tends to be a complicated job for teachers to prepare test items
using this approach.
This approach emphasizes measuring non-integrated skills more than
integrated skills.
D. Types of Test in Structuralist Approach
The test type used in the structuralist approach is multiple-choice.
3. The Integrative Approach
A. Characteristics and Types of Tests in Integrative Approach
1. This approach involves the testing of language in context and is thus
concerned primarily with meaning and the total communicative effect
of discourse.
2. Integrative tests are concerned with a global view of proficiency.
3. Integrative testing involves functional language but not the use of
functional language.
4. The use of cloze tests, dictation, oral interviews, translation, and essay
writing is included in many integrative tests.
B. Strengths of Integrative Approach
The focus on meaning and the total communicative effect of
discourse is very useful for students being tested.
This approach can view students’ proficiency with a global view.
A model cloze test used in this approach measures the reader’s ability
to decode ‘interrupted’ and ‘mutilated’ messages by making the most
acceptable substitutions from all the contextual clues available.
Dictation, another type using this approach, was regarded solely as a
means of measuring students’ skills of listening comprehension.
C. Weakness of Integrative Approach
Even if many think that measuring integrated skills is better, sometimes
there is a need to consider the importance of measuring skills based on
students’ need, such as writing only, speaking only, etc.
D. Types of Test in Integrative Approach
1. Cloze test
2. Dictation
3. Translation, oral interview and composition writing
4. The Communicative Approach
Communicative language tests are intended to measure how well students
are able to use language in real-life situations.
A. Characteristics and Types of Tests in Communicative Approach
It tests how language is used in communication
It tests language use
It tests language usage
It tests language skills both integratively and separately
It suggests contextualized testing
It tests cultural understanding
B. Strengths of Communicative Approach
Communicative tests are able to measure all integrated skills of
students.
Tests using this approach place students in real-life situations, which
is very useful for them.
Because a communicative test can measure all language skills, it can
give students credit across skills. Consider students who have poor
ability in using spoken language but may score quite highly on tests of
reading.
Detailed statements of each performance level serve to increase the
reliability of the scoring by enabling the examiner to make decisions
according to carefully drawn-up and well-established criteria.
C. Weaknesses of Communicative Approach
Unlike the structuralist approach, this approach does not emphasize
learning structural grammar, yet it may be difficult to achieve
communicative competence without a considerable mastery of the
grammar of a language.
It is possible for cultural bias to affect the reliability of the tests being
administered.
D. Types of Test in Communicative Approach
Tests used in this approach are real-life tasks.
5. The Performance-Based Approach
The summary below compares the way of judgment, the components of tests, the
skills aimed to be tested, and the aim of testing in the Essay-Translation,
Structuralist, Integrative, and Communicative Approaches, to make their
characteristics clearer.

Essay-Translation Approach
- Way of judgment: subjective
- Components of tests: essay writing, translation, grammatical analysis
- Skills aimed to be tested: writing, reading, grammar and usage; oral and aural skills in later stages
- Aim of testing: literary understanding of language

Structuralist Approach
- Way of judgment: objective
- Components of tests: multiple-choice
- Skills aimed to be tested: phonology, vocabulary, and grammar, and the skills separately
- Aim of testing: testing language skills separately in an objective manner

Integrative Approach
- Way of judgment: objective and subjective
- Components of tests: cloze tests, translation, oral interviews, composition writing
- Skills aimed to be tested: all skills integratively
- Aim of testing: testing the language skills in context, including all language skills

Communicative Approach
- Way of judgment: objective (quantitative) and subjective (qualitative)
- Components of tests: real-life tasks
- Skills aimed to be tested: all skills and areas, both interactively and separately
- Aim of testing: communicative skills in context
In any case, assessment can be done in many ways, and testing is
only one of them. When tests have to be used in assessment, they must
always follow a set of principles which guarantee assessment validity
(communication resembling real life) and reliability. Varying test formats
according to the particular assessment purposes and contexts helps to make
testing fairer, more reliable, and more authentic.
V. VALIDITY
Validity is a unitary concept. Although evidence may be
accumulated in many ways, validity always refers to the degree to which
that evidence supports the inferences that are made from the scores. The
inferences regarding specific uses of a test are validated, not the test itself.
In validation, on the other hand, we must consider other sources of
variance, and must utilize a theory of abilities to identify these sources.
That is, in order to examine validity, we need a theory that specifies the
language abilities that we hypothesize will affect test performance. The
process of validation thus must look beyond reliability and examine the
relationship between test performance and factors outside the test itself.
Reliability is agreement between similar measures of the same trait
(for example, correlation between scores on parallel tests). Validity is
agreement between different measures of the same trait (for example,
correlation between scores on a multiple-choice test of grammar and
ratings of grammar on an oral interview).
The evidential basis of validity
In defining validity as ‘the appropriateness, meaningfulness, and
usefulness of the specific inferences made from test scores’ (American
Psychological Association 1985: 9), the measurement profession has
clearly linked validity to the inferences that are made on the basis of test
scores. The process of validation, therefore, 'starts with the inferences that
are drawn and the uses that are made of scores'.
Content relevance and content coverage (content validity)
One of the first characteristics of a test that we, as prospective test
users, examine is its content. If we cannot examine an actual copy of the
test, we would generally like to see a table of specifications and example
items, or at least a listing of the content areas covered, and the number of
items, or relative importance of each area. Likewise, in developing a test,
we begin with a definition of the content or ability domain, or at the very
least, with a list of content areas, from which we generate items, or test
tasks. The consideration of test content is thus an important part of both
test development and test use. Demonstrating that a test is relevant to and
covers a given area of content or ability is therefore a necessary part of validation.
Criterion relatedness (criterion validity)
Another kind of information we may gather in the validation
process is that which demonstrates a relationship between test scores and
some criterion which we believe is also an indicator of the ability tested.
This ‘criterion’ may be level of ability as defined by group membership,
individuals’ performance on another test of the ability in question, or their
relative success in performing some task that involves this ability. In some
cases this criterion behavior may be concurrent with, or occur nearly
simultaneously with the administration of the test, while in other cases, it
may be some future behavior that we want to predict.
Concurrent criterion relatedness (concurrent validity)
Information on concurrent criterion relatedness is undoubtedly the
most commonly used in language testing. Such information typically takes
one of two forms: (1) examining differences in test performance among
groups of individuals at different levels of language ability, or (2)
examining correlations among various measures of a given ability. If we
can identify groups of individuals that are at different levels on the ability
in which we are interested, we can investigate the degree to which a test of
this ability accurately discriminates between these groups of individuals.
Typical groups that have been used for such comparisons are ‘native
speakers’ and ‘non-native speakers’ of the language (for example, Chihara
et al. 1977; Alderson 1980; Bachman 1985).
Predictive utility (predictive validity)
Another use of information on criterion relatedness is in
determining how well test scores predict some future behavior. Suppose
we wanted to use language test scores to predict satisfactory job
performance or successful achievement in a course of instruction.
Construct validation
Construct validity concerns the extent to which performance on
tests is consistent with predictions that we make on the basis of a theory of
abilities, or constructs. Construct validity is thus seen as a unifying
concept, and construct validation as a process that incorporates all the
evidential bases for validity discussed thus far: Construct validity is indeed
the unifying concept that integrates criterion and content considerations
into a common framework for testing rational hypotheses about
theoretically relevant relationships. (Messick 1980: 1015)
Evidence supporting construct validity
While the examples above involve correlations among test scores, in
examining the relationships among different observations of language
performance, not all of which will be tests, the test developer involved in
the process of construct validation is likely to
collect several types of empirical evidence. These may include any or all
of the following: (1) the examination of patterns of correlations among
item scores and test scores, and between characteristics of items and tests
and scores on items and tests; (2) analyses and modeling of the processes
underlying test performance; (3) studies of group differences; (4) studies
of changes over time, or (5) investigation of the effects of experimental
treatment (Messick 1989). Two
of these types of evidence, correlational and experimental, are particularly
powerful.
Correlational evidence
Correlational evidence is derived from a family of statistical
procedures that examine the relationships among variables, or measures.
The approach that has been used most extensively in
construct validation studies of language tests is to examine patterns of
correlations among test scores, either directly, or, for correlations among
large numbers of test scores, through factor analysis.
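As a simple illustration of such correlational evidence, the following Python sketch (using hypothetical, illustrative scores rather than data from any real study) computes the pattern of correlations among three measures assumed to tap related abilities.

```python
# A minimal sketch of correlational evidence for construct validation,
# assuming hypothetical scores from three tests of related abilities.
import numpy as np

# Hypothetical scores for five test takers on three measures (illustrative only).
cloze = [42, 35, 50, 28, 45]
dictation = [38, 30, 47, 25, 44]
grammar_mc = [40, 33, 49, 30, 41]

scores = np.array([cloze, dictation, grammar_mc])

# Pearson correlations among the three sets of scores; a pattern of high
# correlations between measures hypothesized to tap the same ability is one
# piece of correlational evidence for construct validity.
print(np.corrcoef(scores).round(2))
```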
Postmortem: face validity
Face validity is a term that is bandied about in the field of test
construction until it seems about to become part of accepted terminology.
The frequency of its use and the emotional reaction which it arouses -
ranging almost from complete contempt to highest approbation - make it
desirable to examine its meaning more closely. (Mosier 1947: 191)
Mosier discusses three meanings of face validity that were then current:
‘validity by assumption’, ‘validity by definition’, and ‘appearance of
validity’. Validity by definition is a special case of what we now call
content coverage, in which the tasks required in the test are identical to
those that define the domain. Mosier considers the appearance of validity
an added attribute that is important with regard to the use of the test, but
unrelated to validity. He reserves his harshest criticism for validity by
assumption.
VI. RELIABILITY
Reliability is the most important characteristic of evaluation
results. It refers to the consistency of measurement, i.e. how
consistent test scores or other evaluation results are from one
measurement to another. It provides the consistency that makes
validity possible and indicates how much confidence we can place
in our tests. Some general points about reliability are as follows:
1. It refers to the results obtained with an evaluation instrument and
not to the instrument itself. Any particular instrument may have a
number of different reliabilities, depending on the group involved and
the situation in which it is used. It is more appropriate to speak of the
reliability of the "test scores" or of the "measurement" than of the "test"
or the "instrument".
2. Reliability is a necessary but not a sufficient condition for
validity. It merely provides the consistency that makes validity
possible.
3. If a test is not reliable, the actual scores of many
individuals are likely to be quite different from their true scores.
4. A reliability coefficient is a correlation coefficient that indicates
the degree of relationship between two sets of measures obtained
from the same instrument or procedure.
There are several methods of estimating reliability. Different
types of consistency are determined by the different methods. The
reliability coefficient resulting from each method must be interpreted
according to the type of consistency being investigated.
METHODS OF ESTIMATING RELIABILITY
The main methods of estimating reliability are:
1. Test-retest
2. Equivalent-forms
3. Test-retest with equivalent forms
4. Split-half method
5. Kuder-Richardson
Test-Retest Method
To estimate reliability by means of the test-retest method, the same
test is administered twice to the same group of pupils with a given
time interval between the two administrations. The resulting test
scores are correlated, and this correlation coefficient provides a measure
of stability, i.e. indicating how stable the test results are over the given
period of time. If the results are highly stable, the pupils who are high
in one administration of the test will tend to be high on the other
administration, and the remaining pupils will tend to stay in the same
relative position on both administrations.
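As an illustration, the following Python sketch estimates test-retest reliability by correlating scores from two administrations of the same test; the pupil scores are hypothetical.

```python
# A minimal sketch of the test-retest method, assuming hypothetical scores
# from the same group of pupils on two administrations of the same test.
from statistics import correlation  # requires Python 3.10+

first_administration = [72, 65, 80, 58, 90, 77]
second_administration = [70, 68, 78, 60, 88, 75]

# The correlation between the two sets of scores is the coefficient of stability.
stability = correlation(first_administration, second_administration)
print(f"Test-retest reliability (stability): {stability:.2f}")
```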
Equivalent-Forms Method
Estimating reliability by means of the equivalent-forms method uses two
different but equivalent forms of the test (parallel/alternate forms). The
two forms of the test are administered to the same group of pupils in close
succession, and the resulting test scores are correlated. This correlation
coefficient provides a measure of equivalence, which indicates the
degree to which both forms of the test are measuring the same aspect of
behavior.
The equivalent-forms method is sometimes used with a time interval
between the administration of the two forms of the test. Under this test-
retest condition, the resulting reliability coefficient provides a measure of
stability and equivalence.
Split-Half Method
The reliability of a test can also be estimated from a single
administration of a single form of the test. The test is administered to a
group of pupils in the usual manner and then is divided in half for
scoring purposes. To split the test into halves which are most
equivalent, the procedure is to score the even-numbered and the odd-
numbered items separately. This produces two scores for each pupil,
which, when correlated, provide a measure of internal consistency. To
estimate the reliability of scores on the full-length test, the Spearman-
Brown formula is used as follows:
Reliability on full test = (2 x reliability on ½ test) / (1 + reliability on ½ test)

The simplicity of the formula can be seen in the following example, in
which the correlation coefficient between the test's two halves is .60:

Reliability on full test = (2 x .60) / (1 + .60) = 1.20 / 1.60 = .75
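The following Python sketch illustrates the split-half procedure together with the Spearman-Brown correction; the 0/1 item scores are invented for illustration only.

```python
# A minimal sketch of the split-half method with the Spearman-Brown correction,
# assuming hypothetical 0/1 item scores for a short test (illustrative data).
from statistics import correlation  # requires Python 3.10+

# Each row is one pupil's item scores (1 = correct, 0 = incorrect).
item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
]

# Score the odd-numbered and even-numbered items separately for each pupil.
odd_half = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]

half_reliability = correlation(odd_half, even_half)

# Spearman-Brown formula: estimated reliability of the full-length test.
full_reliability = (2 * half_reliability) / (1 + half_reliability)
print(f"Half-test correlation: {half_reliability:.2f}")
print(f"Full-test reliability (Spearman-Brown): {full_reliability:.2f}")
```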
Kuder-Richardson Method
Kuder and Richardson developed a method of estimating the
reliability of test scores from a single administration of a single form of
a test. It provides a measure of internal consistency but does not require
splitting the test in half for scoring. One of the formulas is Kuder-
Richardson Formula 20, which is based on the proportion of persons
passing each item and the standard deviation of the total scores. A less
accurate but simpler formula is Kuder-Richardson Formula 21, which can be
applied to the results of any test that has been scored on the basis of the
number of correct answers.
Reliability estimate (KR-21) = (K / (K - 1)) x (1 - M(K - M) / (K x s^2))

where K = the number of items in the test, M = the mean (arithmetic average)
of the test scores, and s = the standard deviation of the test scores.
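The KR-21 computation can be sketched in Python as follows; the total scores are hypothetical, and the population standard deviation is used here (treatments differ on whether the sample or population formula is intended).

```python
# A minimal sketch of the Kuder-Richardson Formula 21 estimate, assuming
# hypothetical total scores on a test scored as number of correct answers.
from statistics import mean, pstdev

def kr21(num_items: int, scores: list) -> float:
    """KR-21 = (K / (K - 1)) * (1 - M * (K - M) / (K * s^2))."""
    k = num_items
    m = mean(scores)
    s = pstdev(scores)  # standard deviation of the total scores
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * s ** 2))

# Hypothetical data: a 40-item test taken by six pupils.
total_scores = [32, 28, 35, 22, 30, 25]
print(f"KR-21 reliability estimate: {kr21(40, total_scores):.2f}")
```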
Comparing the Methods
Each method of estimating reliability provides different information
about the consistency of test results. The test-retest method, without a
time interval, considers only the consistency of the testing procedure
and the short-term consistency of the response. If a time interval is used
between the tests, the consistency of the characteristics of the pupils from
day to day is also included. The equivalent-forms method without a time
interval, the split-half method, and the Kuder-Richardson method all
take into account the consistency of testing procedures and the
consistency of results over different samples of items.
Type of consistency taken into account by each method:
- Test-retest (immediate): testing procedure only
- Test-retest (time interval): testing procedure and pupil characteristics
- Equivalent-forms (immediate): testing procedure and different samples of items
- Equivalent-forms (time interval): testing procedure, pupil characteristics, and different samples of items
- Split-half: testing procedure and different samples of items
- Kuder-Richardson: testing procedure and different samples of items
From the table, only the equivalent-forms method with a time
interval takes into account all three types of consistency, which makes it
the most useful estimate of test reliability.
VII. Item Analysis
Item analysis is a general term that refers to the specific methods used in
education to evaluate test items, typically for the purpose of test
construction and revision. Regarded as one of the most important aspects
of test construction and increasingly receiving attention, it is an approach
incorporated into item response theory (IRT), which serves as an
alternative to classical measurement theory (CMT) or classical test theory
(CTT).
THE PURPOSE OF ITEM ANALYSIS
There must be a match between what is taught and what is
assessed. However, there must also be an effort to test for more complex
levels of understanding, with care taken to avoid over-sampling items that
assess only basic levels of knowledge. Tests that are too difficult (and
have an insufficient floor) tend to lead to frustration and lead to deflated
scores, whereas tests that are too easy (and have an insufficient ceiling)
facilitate a decline in motivation and lead to inflated scores. Tests can be
improved by maintaining and developing a pool of valid items from which
future tests can be drawn and that cover a reasonable span of difficulty
levels.
ITEM DIFFICULTY
In test construction, item difficulty is determined by the number of
people who answer a particular test item correctly. For example, if the first
question on a test was answered correctly by 76% of the class, then the
difficulty level (p or percentage passing) for that question is p = .76. If the
second question on a test was answered correctly by only 48% of the class,
then the difficulty level for that question is p = .48. The higher the
percentage of people who answer correctly, the easier the item, so that a
difficulty level of .48 indicates that question two was more difficult than
question one, which had a difficulty level of .76.
Many test developers aim for an average item difficulty of about
p = .60 or higher, with a number of items in the p > .90 range to enhance
motivation and test for mastery of certain essential concepts.
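As a minimal illustration, the following Python sketch reproduces the two difficulty values mentioned above from hypothetical response patterns.

```python
# A minimal sketch of computing item difficulty (p, the proportion answering
# correctly), assuming hypothetical 0/1 responses to two items.
def item_difficulty(responses: list) -> float:
    """Proportion of test takers who answered the item correctly."""
    return sum(responses) / len(responses)

# Hypothetical class responses (1 = correct, 0 = incorrect).
question_1 = [1] * 19 + [0] * 6   # 19 of 25 correct
question_2 = [1] * 12 + [0] * 13  # 12 of 25 correct

print(f"p for question 1: {item_difficulty(question_1):.2f}")  # 0.76
print(f"p for question 2: {item_difficulty(question_2):.2f}")  # 0.48
```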
DISCRIMINATION INDEX
According to Wilson (2005), item difficulty is the most essential
component of item analysis. However, it is not the only way to evaluate
test items. Discrimination goes beyond determining the proportion of
people who answer correctly and looks more specifically at who answers
correctly. In other words, item discrimination determines whether those
who did well on the entire test did well on a particular item.
In Figure 1, it can be seen that Item 1 discriminates well: those in the
top-performing group obtained the correct response far more often (p = .92)
than those in the low-performing group (p = .40), resulting in a
discrimination index of .52 (i.e., .92 - .40 = .52). Item 2, with a
discrimination index of only .04, does not discriminate well, meaning this
particular item was not useful in discriminating between the high- and
low-scoring individuals. Finally, Item 3 is in need of revision or discarding
as it discriminates negatively, meaning low-performing group members
actually obtained the correct keyed answer more often than high-performing
group members.
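The following Python sketch illustrates the extreme-group discrimination index; only Item 1's group proportions (.92 and .40) come from the discussion above, while the proportions used for Items 2 and 3 are invented so that the indices behave as described.

```python
# A minimal sketch of the extreme-group discrimination index, assuming
# hypothetical proportions correct in the top- and low-performing groups.
def discrimination_index(p_top: float, p_low: float) -> float:
    """Difference between the proportion correct in the top and low groups."""
    return p_top - p_low

# Item 1 uses the values given in the text; Items 2 and 3 are illustrative only.
items = {"Item 1": (0.92, 0.40), "Item 2": (0.52, 0.48), "Item 3": (0.30, 0.55)}

for name, (p_top, p_low) in items.items():
    d = discrimination_index(p_top, p_low)
    print(f"{name}: D = {d:+.2f}")
```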
In Figure 2, the point-biserial correlation between item score and
total score is evaluated similarly to the extreme-group discrimination
index. If the resulting value is negative or low, the item should be revised
or discarded. The closer the value is to 1.0, the stronger the item's
discrimination power; the closer the value is to 0, the weaker the item's
discrimination power.
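As an illustration, the point-biserial correlation can be computed as an ordinary Pearson correlation between a dichotomous (0/1) item score and the total score; the data below are hypothetical.

```python
# A minimal sketch of the point-biserial correlation between item score (0/1)
# and total test score, using hypothetical data for one item.
from statistics import correlation  # requires Python 3.10+

# Hypothetical: whether each of eight students got the item right, and their totals.
item = [1, 1, 0, 1, 0, 1, 0, 1]
total = [45, 40, 22, 38, 25, 42, 30, 36]

# With a dichotomous (0/1) variable, the Pearson correlation is the
# point-biserial correlation; values near 1.0 indicate strong discrimination,
# values near 0 indicate weak discrimination, and negative values flag items
# that should be revised or discarded.
r_pb = correlation(item, total)
print(f"Point-biserial correlation: {r_pb:.2f}")
```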
CHARACTERISTIC CURVE
A third parameter used to conduct item analysis is known as the
item characteristic curve (ICC). This is a graphical or pictorial depiction of
the characteristics of a particular item, or taken collectively, can be
representative of the entire test. In the item characteristic curve the total
test score is represented on the horizontal axis and the proportion of test
takers passing the item within that range of test scores is scaled along the
vertical axis.
Item Statistics
Item statistics are used to assess the performance of individual test
items on the assumption that the overall quality of a test derives from the
quality of its items. The ScorePak® item analysis report provides the
following item information:
Item Number
This is the question number taken from the student answer sheet, and the
ScorePak® Key Sheet. Up to 150 items can be scored on the Standard Answer
Sheet.
Mean and Standard Deviation
The mean is the "average" student response to an item. It is computed by
adding up the number of points earned by all students on the item, and
dividing that total by the number of students.
Item Difficulty
For items with one correct alternative worth a single point, the item
difficulty is simply the percentage of students who answer an item correctly.
In this case, it is also equal to the item mean. The item difficulty index ranges
from 0 to 100; the higher the value, the easier the question. When an
alternative is worth other than a single point, or when there is more than one
correct alternative per question, the item difficulty is the average score on that
item divided by the highest number of points for any one alternative. Item
difficulty is relevant for determining whether students have learned the
concept being tested. It also plays an important role in the ability of an item to
discriminate between students who know the tested material and those who do
not. The item will have low discrimination if it is so difficult that almost
everyone gets it wrong or guesses, or so easy that almost everyone gets it
right.
VIII. SCORE INTERPRETATION
A. TYPES OF SCORE
• A raw score is the number of items a student answered correctly. For
example: on a 50-item multiple-choice test, a student correctly answers 37
items out of 50, so his/her raw score is 37.
• Percent correct is a more useful measure of performance than a raw
score, for example when comparing performance across different assessments.
To calculate the percent correct score: percent correct = 37/50 = 74%.
Given a student's performance on two different assessments, a midterm
and a final: the student had a raw score of 35 on the midterm and 70 on
the final. It might be assumed that he/she performed better on the final.
But what if the midterm contained 50 items and the final contained 100?
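A minimal Python sketch of this comparison, using the raw scores and test lengths given above:

```python
# Percent correct puts the midterm and final raw scores on a common scale.
def percent_correct(raw_score: int, num_items: int) -> float:
    return 100 * raw_score / num_items

midterm = percent_correct(35, 50)   # 70.0
final = percent_correct(70, 100)    # 70.0

print(f"Midterm: {midterm:.0f}%  Final: {final:.0f}%")
# Both are 70%, so the raw scores of 35 and 70 represent the same level of performance.
```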
It turns out that neither the raw score nor the percent correct reveals
anything about how one student performs relative to another student,
group of students, or all students who took the same test.
• Percentile rank is the most typical norm-referenced test score. It refers
not to the percentage of items answered correctly but rather to a student's
performance relative to all other test takers.
When a student performs at the 60th percentile, it means he/she performed
better than 60 percent of the students in the norm sample.
Percentile ranks are widely used because they are easy to understand and interpret.
Verbal Scores of Tenth-Grade Students with Percentile Rank
Verbal Score Percentile Rank Verbal Score Percentile Rank
790 99 570 50
710 95 560 45
670 90 550 40
660 85 530 35
640 80 520 30
630 75 510 25
620 70 500 20
610 65 480 15
600 60 470 10
580 55 420 5
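As a minimal illustration, the following Python sketch computes a percentile rank under one common definition (the percentage of scores in the norm sample falling below the student's score); the norm sample is hypothetical, not the tenth-grade data above.

```python
# A minimal sketch of a percentile rank: the percentage of test takers in the
# norm sample who scored below a given student (hypothetical norm data).
def percentile_rank(score: float, norm_sample: list) -> float:
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)

# Hypothetical norm sample of ten scores.
norm_sample = [40, 45, 48, 52, 55, 58, 62, 65, 70, 75]

# A student scoring 61 performed better than 6 of the 10 students in the sample,
# i.e. at the 60th percentile.
print(f"Percentile rank: {percentile_rank(61, norm_sample):.0f}")
```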
• Stanine scores ("stanines") are short for "standard nine."
• They indicate where student performance falls on a scale of 1 to 9.
• A simple system with only a few numbers, stanines can be especially easy to explain.
• Stanine 1 represents the lowest scores; stanine 9 represents the highest scores.
• On a standardized test, most test takers fall in the average range, stanines 4
to 6. Therefore, stanine 7 or above could be considered above average, and
stanine 3 or below would be considered below average.
The Meaning of Stanines and Their Relationship to Percentile Rank
Percentile rank 96-99: stanine 9 (well above average)
Percentile rank 89-95: stanine 8 (above average)
Percentile rank 77-88: stanine 7 (above average)
Percentile rank 60-76: stanine 6 (average)
Percentile rank 40-59: stanine 5 (average)
Percentile rank 23-39: stanine 4 (average)
Percentile rank 11-22: stanine 3 (below average)
Percentile rank 4-10: stanine 2 (below average)
Percentile rank less than 4: stanine 1 (well below average)
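The following Python sketch maps percentile ranks to stanines using the ranges listed above.

```python
# A minimal sketch that converts a percentile rank to a stanine using the
# ranges in the table above.
def stanine(percentile: float) -> int:
    bands = [(96, 9), (89, 8), (77, 7), (60, 6), (40, 5), (23, 4), (11, 3), (4, 2)]
    for lower_bound, value in bands:
        if percentile >= lower_bound:
            return value
    return 1  # percentile ranks below 4 fall in stanine 1

for pr in (99, 85, 50, 15, 2):
    print(f"Percentile rank {pr:>2} -> stanine {stanine(pr)}")
```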
• Grade equivalents (GE) are considered easy to understand and easy to
misinterpret. They communicate a level of performance relative to test takers
in the same grade.
A GE is represented by two numbers separated by a decimal. The first
number represents a grade in school, and the second a tenth of a school
year, or about a month.
Example: a GE score of 5.2 represents the expected performance of a
student in the second month of fifth grade.
Most assessment experts argue that stanines and percentile ranks are
more informative and precise.
IX. AUTHENTIC ASSESSMENT
A. ASSESSMENT
1. PORTFOLIO
A portfolio is a collection of student work with a common theme or purpose.
The use of portfolios is not new. Portfolios have been common in the fine and
performing arts for years in seeking support for one’s work, to document change
or improvement in style and performance, or to gain admission to special schools.
Principles
1. Assessment provides feedback for learning and is intertwined with
curriculum and instruction
2. Graduation involves completing work across the curriculum as shown by
portfolios that are evaluated by teams of teachers and outside experts
3. Professional development focuses on student work, and thus involves
assessment, while teacher collaboration informs the assessment and
graduation processes
4. Parents are expected to have a real involvement in the school, and at times
they participate in the portfolio review process
5. Members of the wider community also participate in portfolio reviews and
in evaluating student work
2. PROJECT
Projects: Students work with other students as a team to create a project that
often involves multimedia production, oral and written presentations, and a
display.
The Assessment Project aimed to:
- ensure engagement among students
- assess student learning and, therefore, school effectiveness authentically
- raise the stakes for students and a school in ways that support and
advance an Essential school's goals
- develop innovative, high-quality, on-demand assessment tasks for the
students
- enhance the capacity of teachers to assess students and inform teachers'
planning, teaching, and reporting in areas of the curriculum where quality
assessment resources are limited
3. EXTENDED RESPONSE
Extended-response questions are writing prompts or questions that give students
the opportunity to prepare a written answer, often a short phrase, a list, or a more
substantial composition such as a multipage essay. Extended-response questions
are usually open-ended so that students can demonstrate the extent of their
knowledge and skills in the area under assessment.
Characteristics
1. Give students great freedom, which allows problem formulation,
organization, and originality
2. Assess broader learning outcomes
3. Contain subjectivity in scoring
4. Ideally these items focus on major concepts of the content unit and demand
higher level thinking
Strengths
1. Emphasize integration and application of high-level skills
2. Easy to construct
3. Contribute to student learning, directly and indirectly
Weaknesses
1. Unreliability of scoring
2. Time-consuming to score
3. There is no single correct or best answer to an essay question
4. Responses must be scored subjectively by content experts
EXAMPLE:
A. Extended response in speaking
1. Oral presentation
2. Picture-Cued Story-Telling
3. Retelling a Story, News Event
4. Translation (of extended prose)
B. Extended response in Reading
1. Skimming tasks
2. Summarizing and Responding
3. Note-taking and Outlining
C. Extended Response in Writing
1. Paraphrasing
2. Guided Question and answer
4. OBSERVATION
Observation assessment is exactly as the name suggests: the assessors
observe the students performing the assessment and see whether they have the ability to
perform it properly. Observation is a direct means for learning about students,
including what they do or do not know and can or cannot do. This information
makes it possible for teachers to plan ways to encourage students' strengths and
to work on their weaknesses.
Observation tools are instruments and techniques that help teachers record
useful data about students' learning in a systematic way. Some observation tools
include:
a. Anecdotal notes
Short notes written during a lesson, as students work in groups or
individually, or after a lesson.
b. Anecdotal notebook
A notebook where the teacher records his or her observations, with an index on the
side organized by either student name or behaviour.
c. Anecdotal note cards
An alternative system to an anecdotal notebook, in which the teacher
records observations using one card per child. One way to facilitate this
process is to select five children per day for observation. The cards can be
kept together on a ring.
d. Labels/adhesive notes
Like note cards, the use of these small adhesive notes frees the teacher
from having to carry a notebook around the classroom. After the observation
is complete, the teacher can adhere the notes into his or her filing system.
The advantages
1. Observation may sometimes be the only assessment method possible.
2. There can be no plagiarism or false reporting.
3. It is a great way to assess practical skills.
The disadvantages
1. It does not assess the higher-order levels of learning outcomes, and it is often
not adequate for a full assessment; oral questioning or other supplementary
assessment may be required.
2. Observation assessment requires a lot of time to prepare and to assess; thus,
it is an expensive way of assessing.
3. The presence of the observer can change students' performance, as being
watched can be intimidating for many students.
B. NON TEST
1. Questionnaire
Writing a questionnaire can be as simple or as complicated as you want it
to be. It really depends on the information you want to collect and the significance
of the decision that it’s going to help you make. For example, if you are planning
to open a new shop based on your research results, you should be as rigorous as
possible in conducting your research.
The criteria for a good questionnaire:
1. Validity
The degree to which an instrument measures what it is supposed to
measure
2. Reliability
The degree to which an instrument produces consistent results, including
consistent results on different occasions when there is no evidence of
change
3. Responsiveness
An instrument's ability to detect change over time
The main types of questions are:
a. Open ended
eg: Do you have any suggestions about how we could improve our
customer service? (indicate below)
_________________________________________________________
_________________________________________________________
b. Closed
eg: Do you own a car? (select one response)
YES
NO
c. Multiple choice
eg: Which of these media do you get your news from?
(select more than one response)
a) Newspaper
b) Radio
c) TV
d) Internet
d. Scaled
eg: Using the scale below, how would you rank the following?
1 = excellent 3 = satisfactory 5 = poor
2 = good 4 = fair
Our prices
Our customer service
Our product range
Our location
The quality of our products
2. Checklist
Checklists contain a list of behaviors or specific steps, which can be
marked as Present/Absent, Complete/Incomplete, Yes /No, etc. In some cases, a
teacher will use a checklist to observe the students. In other cases, students use
checklists to ensure that they have completed all of the steps and considered all of
the possibilities. For example, the checklist at the Blue Rubric from the Center for
technology in learning is an observation checklist a teacher could use to assess
students' science experiments, whereas the Multimedia Mania Checklist is
designed for self-assessment by students.
Figure 1. Language and Literacy
Each item is rated Not Yet, In Process, or Proficient in the fall (F), winter (W),
and spring (S).
A. Listening
1. Listens for meaning in discussions and conversations
2. Follows directions that involve a series of actions
B. Speaking
1. Speaks easily, conveying ideas in discussions and conversations
2. Uses language for a variety of purposes
From the Work Sampling System Developmental Checklist, First Grade.
Figure 2. Mathematical Thinking
Each item is rated Not Yet, In Process, or Proficient in the fall (F), winter (W),
and spring (S).
A. Approach to mathematical thinking
1. Uses strategies flexibly to solve mathematical problems
2. Communicates mathematical thinking using oral or written language
B. Patterns and relationships
1. Uses the concept of patterning to make predictions and draw conclusions
2. Uses sorting, classifying, and comparing to analyze data
From the Work Sampling System Developmental Checklist, First Grade.
3. Observation sheet
An observation sheet is a document used to make recordings for the purpose
of analysis. Observation sheets are of many varieties. They could be in the
form of a questionnaire with questions to be answered, or a checklist in which
one has to confirm the presence or absence of a certain feature.
4. Self assessment
Every teacher uses learning intentions, and children fill in learning logs at
the end of the week that match the particular learning intentions. The intentions
themselves have clear success criteria, which are co-constructed with the students.
The students and the teachers reflect on whether they’ve met the criteria or not,
and if they need to do something more or different next time.
The example is a student self-assessment form. By filling out this form, students
learn to reflect on their day. Students learn that their success is determined by the
effort they put into their day.
5. Peer assessment
For pair or group activities, students can be asked to rate each other as well as
their functioning as a group. Underhill (1987) suggests that peer assessment is an
authentic assessment approach because peers are asked to rate the effectiveness of
communication by others. An example of peer assessment to determine the
effectiveness of using oral language to explain a process has been modified from
an instructional activity designed by an ESL teacher.
6. Rating scales
A rating scale is a tool used for assessing the performance of tasks, skill
levels, procedures, processes, qualities, quantities, or end products, such as
reports, drawings, and computer programs. These are judged at a defined level
within a stated range. Rating scales are similar to checklists except that they
indicate the degree of accomplishment rather than just yes or no.
Rating scales list performance statements in one column and the range of
accomplishment in descriptive words, with or without numbers, in other columns.
These other columns form “the scale” and can indicate a range of achievement,
such as from poor to excellent, never to always, beginning to exemplary, or
strongly disagree to strongly agree.
Some tasks, such as procedures and processes, need to be observed in order to be
assessed.
Characteristics of rating scales
Rating scales should:
• have criteria for success based on expected outcomes
• have clearly defined, detailed statements
This gives more reliable results.
For assessing end products, it can sometimes help to have a set of photographs
or real samples that show the different levels of achievement. Students can
visually compare their work to the standards provided.
• have statements that are chunked into logical sections or flow sequentially
• include clear wording with numbers when a number scale is used
As an example,
when the performance statement describes a behaviour or quality, 1 = poor
through to 5 = excellent is better than 1 = lowest through to 5 = highest or
simply 1 through 5.
The range of numbers should be the same for all rows within a section (such as
all being from 1 to 5).
The range of numbers should always increase or always decrease. For example,
if the last number is the highest achievement in one section, the last number
should be the highest achievement in the other sections.
• have specific, clearly distinguishable terms
Using good then excellent is better than good then very good because it is hard
to distinguish between good and very good. Some terms, such as often or
sometimes, are less clear than numbers, such as 80% of the time.
• be short enough to be practical
• highlight critical tasks or skills
• indicate levels of success required before proceeding further, if applicable
• sometimes have a column or space for providing additional feedback
• have space for other information such as the student’s name, date, course,
examiner, and overall result
• be reviewed by other instructors
Considerations for numeric rating scales
If you assign numbers to each column for marks, consider the following:
• What should the first number be? If 0, does the student deserve 0%? If 1, does
the student deserve 20% (assuming 5 is the top mark) even if he/she has done
extremely poorly?
• What should the second number be? If 2 (assuming 5 is the top mark), does the
person really deserve a failing mark (40%)? This would mean that the first two
or three columns represent different degrees of failure.
• Consider variations in the value of each column. Assuming 5 is the top mark, the
columns could be valued at 0, 2.5, 3, 4, and 5.
• Consider the weighting for each row. For example, for rating a student’s report,
should the introduction, main body, and summary be proportionally rated the
same? Perhaps, the main body should be valued at five times the amount of the
introduction and summary. A multiplier or weight can be put in another column
for calculating a total mark in the last column.
Expected learning outcome: The student will demonstrate professionalism and
high-quality work during the practicum.
Criteria for success: A maximum of one item is rated as “Needs improvement”
in each section
Each performance area is rated as Needs improvement, Average, or Above
average, with a column for comments.
A. Attitude
- Punctual
- Respectful of equipment
- Uses supplies conscientiously
B. Quality of work done
- ...
Above average = Performance is above the expectations stated in the
outcomes.
Average = Performance meets the expectations stated in the
outcomes.
Needs improvement = Performance does not meet the expectations stated
in the outcomes.
Tools handling assessment
Expected learning outcome: The student will select the proper tool for each task
and use it both skillfully and safely.
Criteria for success: All skills must be performed “Average” or better.
Each skill is rated as Unacceptable (0), Weak (2.5), Average (3), Good (4), or
Excellent (5), with columns for Weight and Score.
- Selects the proper tool (weight 1)
- Uses the tool skillfully (weight 5)
- Uses the tool safely (weight 2)
- Total
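As an illustration of how the weights in the tools handling scale can be combined into a total mark, the following Python sketch uses the weights from the table above (1, 5, and 2) with hypothetical ratings on the 0-5 scale.

```python
# A minimal sketch of a weighted total for the tools handling rating scale,
# assuming hypothetical ratings for one student on the 0-5 scale.
skills = [
    # (skill, weight, rating on the 0-5 scale)
    ("Selects the proper tool", 1, 4),
    ("Uses the tool skillfully", 5, 3),
    ("Uses the tool safely", 2, 5),
]

total = sum(weight * rating for _, weight, rating in skills)
maximum = sum(weight * 5 for _, weight, _ in skills)

print(f"Total mark: {total} out of {maximum}")  # 29 out of 40
```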
Checklist for developing a rating scale
In developing a rating scale, use the following checklist:
1. Review the learning outcome and associated criteria for success.
2. Determine the scale to use (words, or words with numbers) to represent the levels
of success.
3. Write a description for the meaning of each point on the scale, as needed.
4. List the categories of performance to be assessed, as needed.
5. Clearly describe each skill.
6. Arrange the skills in a logical order, if you can.
7. Highlight the critical steps, checkpoints, or indicators of success.
8. Write clear instructions for the observer.
9. Review the rating scale for details and clarity.
10. Format the scale.
11. Ask for feedback from other instructors before using it with students.
7. Semantic differential scale
Semantic Differential is a measurement technique and linguistic tool
designed to measure attitudes towards a topic, event, object, activity, person or
concept, revealing the deeper meanings that are attached to an individual
experience, culture and belief system. It is a common (perhaps too common) tool
in market research, although originally developed as a social research tool, with
its popularity based on its simple and easy format. The tool is particularly
powerful in revealing cross-cultural differences in attitudes and beliefs and
reflects the common interests of linguistics and psychology.
Questions simply ask for an indication of where on a continuum, such as good to
bad, active to passive or strong to weak, a concept is most accurately described.
The approach was developed by Charles Osgood, George Suci and Percy
Tannenbaum in the 1950s (see reference) and has been a mainstay of survey
research ever since. Care should be taken in developing the right items for an
effective survey, including defining the right concepts, developing the right word
pairs, using the most effective scales, and grouping concepts together.
Concept stimuli can be events, experiences, topics, objects, activities, people, or
general concepts, but they must be meaningful to those surveyed in order to produce
meaningful answers.
Each semantic differential uses a bipolar word pair, usually a pair of antonyms
which define two ends of a continuum (e.g., good - bad). The antonyms can be
complementary (pleasant-unpleasant) or more nuanced (the opposite of friendly
can be unfriendly, or shy, guarded, etc.). [For a good source of antonyms, check
Roget's Thesaurus.] The poles of the differentials are randomised so that positive and
negative meanings don't always fall on the same side of the scale.