By Brian M. Stecher and Laura S. Hamilton(*)
From Foresight, Vol. 9, No. 2
published 2002
Now that the president has signed the No Child Left Behind Act, every state must develop a plan to begin testing all students in reading and math in grades 3 through 8 and in high school. Cash and other rewards could be conferred upon districts and schools with high scores, and tough sanctions will be imposed on the schools with persistently low scores. However, there is no guarantee that the strict accountability provisions of the new law will promote student achievement or improve poor schools. In fact, it is quite possible that the new accountability systems may produce some negative results. Therefore, it is important for states to design their accountability systems to prevent any unintended negative results.
During the 1990s, a number of states adopted these kinds of “high-stakes” accountability systems—whereby schools, districts, students, teachers, and administrators were held accountable in various ways for the scores of students on achievement tests. Test-based accountability formed the cornerstone of the bipartisan No Child Left Behind Act, which Congress passed in December and the president signed in January.
Test-based accountability systems embody the belief that public education can be improved through a simple strategy: Test all students, and reward or sanction schools and districts based on the scores. Rewards can include formal public recognition and cash for teachers and schools. Sanctions can include progressively more severe interventions into school operations.
As stipulated in the new legislation, the interventions begin with external experts being assigned to meet with the administrators of schools and districts to help them improve. The interventions can escalate to include mandatory supplemental instruction for students and blanket permission for parents to enroll their children in other schools. If these interventions still fail to improve the test scores at a school, it can then be reconstituted (with the administrators being removed and the state taking over). Many state and federal policymakers have come to regard such test-based accountability as the most promising policy for improving education.
Nonetheless, the evidence has yet to justify the expectations. The initial evidence is, at best, mixed. On the plus side, students and teachers seem to respond to the incentives created by the accountability systems, and test scores generally rise. Yet how this occurs is puzzling. It could be the result of students working harder, of teachers adopting better strategies, and of everyone focusing on the desired subject matter. On the minus side, it is unclear if the test score gains reflect meaningful improvements in student learning or, rather, artificial score inflation caused by excessive coaching or other kinds of narrow test preparation. If the test scores are indeed inflated, then they send misleading signals about student performance. The accountability systems can also lead to academically undesirable changes in curriculum and instruction, such as emphasizing some subjects or topics at the expense of others.
If the accountability systems have the power to change behavior, as the early evidence indicates, then we need to ensure that the systems change behavior in the correct ways. We can structure the accountability systems to maximize educational improvement and to minimize negative consequences, but a bit of pedagogical perspective is in order first.
Much of the impetus for accountability in schools has come from beliefs about accountability that were nurtured outside the educational sphere—mostly in the business world. The business model of setting clear targets, attaching incentives to the attainment of those targets, and rewarding those responsible for reaching the targets has proven successful in a wide range of business enterprises. But there is no evidence that these accountability principles will work well in an educational context, and there are many reasons to doubt that the principles can be applied without significant adaptation.
In the industrial sector, production is easily quantified, and output can be translated into a single measure: profits. In education, there are multiple desired outcomes, and only a subset of them can be measured by tests. Schools are also expected to foster positive interpersonal relations, enhance citizenship, improve physical development, and promote general reasoning skills. Even in terms of achievement, where tests can measure output, we test only a portion of the subjects that are taught. Under the new law, the states will be required to test reading, math, and (eventually) science, but not writing, history, government, music, art, or other subjects.
Education differs from the manufacturing sector in other important ways. To begin with, schools have little control over the “inputs.” In other words, students enter school with widely diverse skills and experiences. Test scores are further influenced by factors that prevail outside of school, and we have been unable to accurately adjust for these factors in estimating school effectiveness. Schools themselves differ greatly in their capacity to effect change. They cannot be “retooled” as easily as a factory. These differences suggest that the industrial model of accountability may not work equally well for education.
An additional impetus for the accountability systems has come from the well-publicized experiences of a few states (e.g., Florida, Kentucky, Maryland, and Texas) where test scores rose when rewards and sanctions were attached to state-administered achievement tests. Proponents of accountability attribute the improved scores in these states to clearer expectations, greater motivation on the part of the students and teachers, a focused curriculum, and more-effective instruction. However, there is little or no research to substantiate these positive changes or their effects on scores.
Meanwhile, there is moderate evidence of some negative changes, such as reduced attention to nontested curriculum, excessive narrow test preparation, and occasional cheating on the part of school personnel. Some schools have reassigned “better” teachers to the accountability grades or hired commercial test preparation companies. Some of these changes may ultimately prove harmful to overall student achievement. At the same time, inflated scores can create an illusion of progress and can lead policymakers to identify the wrong programs as effective.
The new federal legislation requires that each state create a test-based accountability system. To a limited extent, each state may customize its system to be responsive to local conditions, but the systems will generally be used to help make the following important decisions:
whether recognition and cash bonuses will be awarded to teachers, administrators, or schools;
which schools enter and exit from mandatory school-improvement programs; and
whether parents can transfer students from their home school to another school.
The lack of strong evidence regarding the design and effectiveness of accountability systems hampers policymaking at a critical juncture. Under the new federal law, states will have to select tests, set performance standards, and implement reward systems that have “teeth.” Yet the states at the forefront of the accountability movement are just beginning to observe the unintended consequences of score inflation and curriculum narrowing that often accompany high-stakes testing. Based on the evidence now emerging from California, Florida, Kentucky, Texas, Vermont, and other states, we can predict what will likely happen as the remaining states implement their new, tougher testing policies.
First, we can expect average test scores to rise each year for the first three or four years. Teachers and administrators at both low-scoring and high-scoring schools will shift their instruction in ways that will produce higher scores. Every state that has implemented test-based accountability has seen its scores rise. In some cases, the rises have been dramatic.
Second, we know that, to some extent, the initial large gains in test scores may not be indicative of real gains in the knowledge and skills that the tests were designed to measure. There is extensive evidence that the scores on high-stakes tests rise faster than the scores on other standardized tests that are given to the same students at the same time to measure aptitude in the same subjects. It appears that students do not know as much as we think they know based on only the high-stakes test scores. Therefore, another ironic but likely result of the accountability systems is that the test scores themselves will be less accurate than they were prior to the attachment of high stakes.
The most common way to detect score inflation is to compare the scores of the same students on two separate tests. The logic of this approach is that valid gains on a high-stakes test ought to be reflected in similar gains on other tests of similar subjects. In Kentucky, we found that the gains in mathematics during the 1990s on a statewide high-stakes achievement test were nearly four times as large as the gains registered by the state’s students on a national achievement test. The latter test, known as the National Assessment of Educational Progress (NAEP), was designed solely with a monitoring role in mind and carries no rewards or punishments. In 2000, we found similarly divergent results in Texas, a state with an accountability system that is often considered a model for other states to follow. As in Kentucky, the gains on the statewide Texas Assessment of Academic Skills (TAAS) were much larger than the gains on the NAEP.
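The comparison of gains on two tests can be illustrated with a short calculation. The following Python sketch is only an illustration: it assumes that gains on differently scaled tests are compared by expressing each in units of the baseline standard deviation, and the function names and all of the numbers are hypothetical, not results from Kentucky, Texas, or any other state.

```python
def standardized_gain(start_mean: float, end_mean: float, baseline_sd: float) -> float:
    """Return the change in mean score in units of the baseline standard deviation."""
    if baseline_sd <= 0:
        raise ValueError("baseline_sd must be positive")
    return (end_mean - start_mean) / baseline_sd


def gain_ratio(high_stakes_gain: float, audit_gain: float) -> float:
    """Return how many times larger the high-stakes gain is than the audit-test gain."""
    if audit_gain == 0:
        raise ValueError("audit_gain is zero, so the ratio is undefined")
    return high_stakes_gain / audit_gain


# Hypothetical figures for illustration only; they are not actual state results.
state_test_gain = standardized_gain(start_mean=500.0, end_mean=525.0, baseline_sd=50.0)
audit_test_gain = standardized_gain(start_mean=220.0, end_mean=228.0, baseline_sd=32.0)
ratio = gain_ratio(state_test_gain, audit_test_gain)

print(f"Gain on the high-stakes test: {state_test_gain:.2f} SD")
print(f"Gain on the audit test:       {audit_test_gain:.2f} SD")
print(f"Ratio of gains:               {ratio:.1f}")
```

A ratio well above 1 would not prove inflation on its own, but it would flag a divergence that deserves closer study.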
Third, we are likely to see more emphasis on tested subjects and less emphasis on nontested subjects. Our research has clearly demonstrated that teachers shift classroom time toward the subjects that are tested at the expense of those that are not.
One of the earliest studies on the effects of testing (conducted in two Arizona schools in the late 1980s) showed that teachers reduced their emphasis on important, nontested material. Teachers neglected subjects—such as science, social studies, and writing—that were not part of the mandated testing program. Similar declines in instructional time for nontested subjects have also been observed in Kentucky, Maryland, and Washington. The figure below shows the shifts in instructional emphasis reported by fourth grade teachers in Washington. Although state educational standards cover all eight of the subjects shown, the high-stakes testing was conducted only in the top four subjects.
Figure: Teachers Increase Time on Tested Subjects, Decrease Time on Other Subjects
Fourth, there is likely to be an increase in undesirable test-related behaviors. These behaviors include narrowly focused test-preparation activities that take further time away from normal instruction. Our research has clearly demonstrated that teachers change their instructional emphasis to mimic the formats used in the state tests. For instance, if states adopt multiple-choice tests (which are the most economical among the alternatives), less attention will likely be paid to the elements of math and reading that do not lend themselves to multiple-choice questions. On rare but well-publicized occasions, the undesirable behaviors have also included instances of cheating.
Fifth, we can expect large annual fluctuations in the scores for many schools. Some schools that make the greatest gains one year will see the gains evaporate the next year. Schools whose teachers earn large bonuses one year may have stagnant scores the next, as has occurred in California. This volatility in scores results from a variety of factors, such as student mobility, different cohorts of students taking the tests, measurement error, and other transitory conditions.
Sixth, the sanctions imposed on low-performing schools will not guarantee that the students in those schools avoid being “left behind.” Sanctions often include external consultants and ultimately staff reassignment and school takeover. The record of success of such sanctions is mixed, and there is no guarantee that they will result in improved educational environments for students.
The table below provides a partial list of the potentially positive and negative impacts of high-stakes tests on students, teachers, administrators, and policymakers. More research is needed to understand the prevalence (and balance) of these positive and negative effects.
Table: High-Stakes Testing Could Potentially Have Positive and Negative Effects
There are a number of steps that states can take to maximize the benefits and minimize the harm of test-based accountability systems. The following recommendations are not exhaustive, but they address the major concerns raised above.
First, states should monitor the extent of score inflation. The amount of inflation is likely to depend on the specific features of each state’s testing program, such as whether the same test items are used year after year. Fortunately, states are required to participate in the NAEP nationwide testing of grades 4 and 8 every other year. The NAEP tests provide a good starting point for examining score inflation. Each state needs to establish a plan for comparing the NAEP results to the state results. Each state should then consider supplementing the NAEP scores with other comparative measures in other subjects and other grade levels.
Second, states should consider expanding “what counts” in their accountability systems to include more than just reading and math. Other subjects could be tested without overburdening the system. The overall testing burden could be limited by varying the subjects and grade levels over time and by using sampling approaches that do not require every student to take every test or answer every question. States should also consider measuring what is taught and how it is taught. Gathering this information could reveal shifts in instructional practices while also sending the signal that all subjects are important. (Unfortunately, it is not yet clear whether the No Child Left Behind Act will encourage or discourage the expansion of “what counts” at the state level. There could be disincentives to adopt this approach, depending on how the specific guidelines for implementing the federal law are written.)
Third, states should create student information systems to track the test scores of individual students over time. Such data can allow the states to monitor the progress of individuals, whether they remain in the same schools or transfer to different schools. This approach has two important strengths. It can help identify which teachers are effective, and it can correct for the effects of student background characteristics that are beyond the control of the schools. The data can also be used to understand what happens to students in low-performing schools.
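As a rough illustration of what such a system would make possible, the Python sketch below follows individual students across years and computes each student's year-to-year gain, even when the student changes schools. The record layout, the function name, and the sample data are all hypothetical assumptions made for this example; a real state system would be far more elaborate.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoreRecord(NamedTuple):
    """One test score for one student in one year (hypothetical layout)."""
    student_id: str
    year: int
    school: str
    score: float


def yearly_gains(records):
    """Map each student ID to a list of (year, school, gain over the prior record)."""
    by_student = defaultdict(list)
    for record in records:
        by_student[record.student_id].append(record)

    gains = {}
    for student_id, history in by_student.items():
        history.sort(key=lambda r: r.year)  # order each student's scores by year
        gains[student_id] = [
            (later.year, later.school, later.score - earlier.score)
            for earlier, later in zip(history, history[1:])
        ]
    return gains


# Hypothetical records; student "A" transfers from one school to another.
records = [
    ScoreRecord("A", 2001, "North Elementary", 40.0),
    ScoreRecord("A", 2002, "South Elementary", 46.0),
    ScoreRecord("B", 2001, "North Elementary", 55.0),
    ScoreRecord("B", 2002, "North Elementary", 58.0),
]
gains = yearly_gains(records)
print(gains["A"])
print(gains["B"])
```

Because the gains are attached to the student rather than to the school, a transfer does not break the record of progress.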
Fourth, states should base their rewards and sanctions on changes in multiyear averages of scores rather than on single-year fluctuations. This change would help ensure that rewards and sanctions reflect real changes in student achievement.
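The effect of averaging can be seen in a small example. The Python sketch below assumes a simple three-year moving average, which is only one of several ways a state might smooth scores; the function names and the school's scores are hypothetical.

```python
def multiyear_averages(scores, window=3):
    """Return the moving averages of the scores over the given number of years."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def changes(values):
    """Return the change from each value to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical annual scores for one school; not data from any real school.
annual_scores = [60, 72, 63, 75, 66, 78]

single_year_changes = changes(annual_scores)
averages = multiyear_averages(annual_scores, window=3)
average_changes = changes(averages)

print("Single-year changes:     ", single_year_changes)
print("Three-year averages:     ", averages)
print("Changes in the averages: ", average_changes)
```

In this made-up series, the single-year changes swing between gains of 12 points and losses of 9 points, while the changes in the three-year averages stay between a gain of 5 and a loss of 2. Rewards or sanctions tied to the averages would be less likely to reflect transitory noise.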
Fifth, states should monitor the progress of schools that are subject to such interventions as mandatory external consultants, supplemental instructional service, or parent transfer rights. By monitoring the changes that occur in these schools, the states can help to make sure that the sanctions will result in better learning environments for the students.
The new federal law has many attractive features, but it contains inadequate provisions for review and improvement to help it perform as intended. Fifty states will be struggling with the new federal requirements and with very little guidance about how to proceed. To make sure that no child is left behind and to make the accountability systems work better, the systems themselves need to be monitored for failure or success.
One of the good features of the new law is the requirement that states promote scientifically based instructional methods—that is, methods that have been evaluated and have produced strong evidence of success. We believe that this same emphasis on scientific legitimacy should also be applied to the provisions of the law itself. Test-based accountability systems will work better if we acknowledge how little we know about them, if the federal government devotes appropriate resources to studying them, and if the states make ongoing efforts to improve them.
© RAND 2002, reprinted with permission.
Brian M. Stecher, Sheila I. Barron, Tammi J. Chun, and Karen Ross, The Effects of the Washington State Education Reform on Schools and Classrooms: Initial Findings, RAND/DB-309-EDU, 2000.
Stephen P. Klein and Laura S. Hamilton, Large-Scale Testing: Current Practices and New Directions, RAND/IP-182, 1999.
Hamilton, Stecher, and Klein, eds., Making Sense of Test-Based Accountability in Education, forthcoming.
Stecher and Barron, “Quadrennial Milepost Accountability Testing in Kentucky,” in CRESST, National Center for Research on Evaluation, Standards, and Student Testing, Center for the Study of Evaluation Technical Report 505, June 1999, pp. 1-35. Also available as RAND/RP-984.
Test-Based Accountability Systems: Lessons of Kentucky’s Experiment, RAND/RB-8017, 1999.
Hamilton, Klein, and William Lorie, Using Web-Based Testing for Large-Scale Assessment, RAND/IP-196, 2000.
Daniel M. Koretz and Barron, The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS), RAND/MR-1014-EDU, 1998.
Klein, Hamilton, Daniel F. McCaffrey, and Stecher, What Do Test Scores in Texas Tell Us? RAND/IP-202, 2000.
* Brian Stecher, a senior social scientist at RAND, evaluates state testing programs and new forms of educational assessment. Laura Hamilton, a behavioral scientist at RAND, conducts research on the validity of scores and gains on high-stakes tests.
This article first appeared in the Spring 2002 issue of RAND Review, a publication of RAND, a nonprofit research institution based in Santa Monica, California. A larger study by the authors is forthcoming.