All Achievement Tests are Not Created Equal

15 Pages Posted: 15 Jul 2009

Date Written: January 29, 2009


Why We Were Interested: Standardized achievement tests have come to be recognized as the "one and only" acceptable means of measuring how well kids, teachers, and schools are doing. This is despite the fact that the tests don't in any way match the instruction that kids in any given class receive. Moreover, they don't even try to determine what kids have learned, but only how kids stack up against other kids at the same grade level. People inside and outside of testing recognize these fatal flaws, but they don't consider them fatal. The excuse is, "They're the best we have. There's no better way." We thought there might indeed be a better way.

What We Did: We had access to around 200,000 kids in about 450 schools in rural districts in 19 states in the US that were participating in a structured program in reading and math in grades 1-6. The instruction cut across grade levels to concentrate on matters the kids had not learned, irrespective of their grade. At the end of the year, we gave a large sample of the kids tests to get data on four areas of instruction: reading decoding and math computation, which we termed "definite instruction," and reading comprehension and math concepts, which we termed "indefinite instruction." We used three different kinds of tests in each instructional area: off-the-shelf standardized achievement tests, tests that matched the general curriculum emphases at a given grade, and tests that matched the actual instruction the kids had received.

We tested each kid at a level of the test that most closely matched the level of instruction received during the year (referred to as "at instruction") and also the next higher level for each kind of test (referred to as "above instruction").

So we were able to look at what happened in the two subjects (reading and math), in the two areas of instruction (definite and indefinite), on the three kinds of tests, which varied in the degree to which they departed from the instruction received, and at two test levels: at the level of the instruction received or above it. Quite a bit to look at.
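To make the size of that design concrete, here is a minimal sketch that enumerates its cells. The four instructional areas and the two test levels come from the description above; the short labels for the three kinds of tests, and the names `AREAS`, `TEST_KINDS`, `LEVELS`, and `design_cells`, are our own shorthand for illustration, not terms from the study.

```python
from itertools import product

# The four instructional areas named in the study, mapped to
# (subject, kind of instruction).
AREAS = {
    "reading decoding": ("reading", "definite"),
    "math computation": ("math", "definite"),
    "reading comprehension": ("reading", "indefinite"),
    "math concepts": ("math", "indefinite"),
}

# Shorthand labels (an assumption of this sketch) for the three kinds of tests:
# off-the-shelf standardized, matched to general curriculum emphases,
# and matched to the instruction actually received.
TEST_KINDS = ("standardized", "curriculum-referenced", "instruction-referenced")

# Each kind of test was given at two levels.
LEVELS = ("at instruction", "above instruction")


def design_cells():
    """Return one record per combination of area, kind of test, and level."""
    return [
        {
            "area": area,
            "subject": subject,
            "instruction": kind,
            "test": test,
            "level": level,
        }
        for (area, (subject, kind)), test, level in product(
            AREAS.items(), TEST_KINDS, LEVELS
        )
    ]


cells = design_cells()
print(len(cells))  # 4 areas x 3 kinds of tests x 2 levels = 24
```

Each record describes one set of results to compare, which is why there was "quite a bit to look at."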

What We Found Out: First we did the standard analyses of how reliable the tests were and how they interrelated. It turned out that each test was acceptably reliable, and the tests in each subject were highly correlated. Had we stopped here we would have concluded, "It really doesn't matter which kind of test you use. And since standardized tests are traditional, they win." That is about what the testing experts say and what the public accepts. We went on to look at the information yielded by the standardized achievement test (SAT) Grade Equivalents. The SATs that most closely matched the instruction the kids received produced low Grade Equivalents, but these Grade Equivalents reflected the level of the instruction at which the kids were working. SATs, however, are not administered at the grade level at which the kid is (or should be) working, but rather at the grade level in which the kid is enrolled. Looking at the "above level" results, which reflect this practice, we find higher Grade Equivalents (although still somewhat "behind") despite the fact that the percentage of items answered correctly was lower than for the "at level" test.
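One way to see why a high correlation between tests does not settle which test to use: correlation measures whether two tests rank kids the same way, not whether they report the same level of performance. The sketch below uses made-up percent-correct scores (not data from the study) for two hypothetical tests, `test_a` and `test_b`, that rank five kids identically but differ by 30 points.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical percent-correct scores for five kids (illustration only).
test_a = [60, 70, 80, 90, 95]
# Same kids, same ranking, but every score is 30 points lower.
test_b = [score - 30 for score in test_a]

r = pearson(test_a, test_b)
gap = mean(test_a) - mean(test_b)

print(f"correlation: {r:.2f}")       # the tests agree on who ranks where
print(f"mean gap: {gap:.1f} points")  # yet they report very different levels
```

The two tests correlate perfectly, yet one reports performance 30 points lower than the other. That is why we went beyond reliability and correlation to look at what each test actually reports.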

Tests referenced to the general curriculum showed both a high level of performance "at level," on matters the kids had been taught, and a lower level of performance "above level," on matters they hadn't yet been taught.

Finally, we looked at the differences between the results of "definite" and "indefinite" instruction. First, the results on the "definite" instruction were consistently higher than on the "indefinite" instruction. Second, with the exception of Reading Decoding at Grade 3 (which is where the remaining complicated and infrequently encountered words are the focus), performance on the "definite" instruction increases from grade to grade. With "indefinite" instruction the pattern is different. Here the performance rises to a peak and then declines! This is obviously a function of the nature of the indefinite instruction rather than the nature of the kids. Kids don't become weaker conceptually or comprehend less well with further instruction. It just appears that way when the definition of what is meant by the instructional rubric changes from grade to grade. "It's not fair," as the kids might accurately say.

Bottom Line: If you want to find out what kids have learned, test them on the instruction they've received. The further you depart from this common-sense conclusion, the more misleading the results will be. It happens, however, that current testing practice departs as far from the conclusion as possible. Since people learn what they are taught (that's the point of teaching, right?), if that's what you test, the results will be a good deal more positive than what we are accustomed to seeing. Some kids learn more than others, but it does little good to rank them against each other. The trick is to determine what a kid has learned, in order to use these assets as a basis for further instruction. The more clearly you define the structure of an instructional matter, the more effective the instruction will be. The results of instruction that is "all over the place" will reflect the impoverished thought given to the structure, even though today the results are commonly attributed to the poverty of kids.

Keywords: Achievement tests, Standardized Achievement Tests, Educational Measurement, Reading Tests, Mathematics Tests

Suggested Citation

Hanson, Ralph A. and Schutz, Dick, All Achievement Tests are Not Created Equal (January 29, 2009). Available at SSRN: or

Ralph A. Hanson

3RsPlus, Inc.

Long Beach, CA 90807
United States

Dick Schutz (Contact Author)

3RsPlus, Inc.

Long Beach, CA 90807
United States
562 427-5949 (Phone)
