Tuesday, 21 June 2016
The trend towards private schooling has largely been a phenomenon of industrialised countries’ education systems, starting with charter schools in the USA and spreading to other countries such as England, where Government policy announced in 2016 is to convert all schools into ‘academies’ run by so-called multi-academy chains. In these systems the commercial returns from privatisation are often indirect, being expressed through the letting of contracts for support services and the like.
In developing countries, however, privatisation is often directly commercially driven: for-profit companies set up or take over schools and charge parents for the education provided. The following commentary looks at the case of one corporation that is involved in several African countries and makes claims for the superiority of the education it provides. Specifically, Bridge International Academies (BIA) has recently published a report comparing its schools in Kenya with neighbouring State schools and claiming greater learning gains. The report can be viewed at:
A detailed commentary and critique of this report has been compiled by Graham Brown-Martin and can be accessed at the following site.
Brown-Martin makes reference to some of my own remarks on the report and what follows is a more detailed quantitative critique of the report.
The study sampled 42 Bridge Academy schools and 42 ‘geographically close’ State schools, and carried out testing in the early grades of primary schooling on approximately 2,700 pupils, who were followed up one year later, with just under half being lost through attrition. On the basis of its analyses the report claims:
“a Bridge effect of .31 standard deviations in English. This is equivalent to 64 additional days of schooling in one academic year. In maths, the Bridge effect is .09 standard deviations, or 26 additional days of schooling in one academic year.”
Such effect sizes are large, but there are serious problems with the analysis carried out.
First, and most importantly, parents pay to send their children to Bridge schools: $6 a month per student, which represents a large percentage of the income of poor parents with several children, where the daily income per household can fall below $2. So some adjustment for ‘ability to pay’ is needed, yet this is not attempted, presumably because such data are very difficult to obtain. Presumably those with higher incomes can also support out-of-school learning; does this occur?
Instead the report uses factors such as whether the family has electricity or a TV, but these are relatively poor surrogates for income. Yet the report makes no mention of this problem.
Some of the State schools approached to participate refused and were replaced by others, but there is no comparison of the characteristics of the included schools with those of all non-Bridge schools. Likewise we know little about the students who left the study (relatively more from the Bridge schools) after the initial assessment. Were these pupils who were ‘failing’? For example, did parents with children ‘failing’ at Bridge schools withdraw them more often, or did parents who could barely afford the school fee tend to withdraw their children more often? What is the policy of Bridge schools towards pupils who fall behind? Are they retained a year or otherwise treated so that they are not included in the follow-up? Such policies, if different in Bridge and State schools, would lead to potentially large biases. To be fair, section VII does look at whether differential attrition could affect results, suggests that it might, and recommends further work. In these circumstances one might expect to see, for example, some kind of propensity score analysis, whereby a model predicting the propensity to leave, using all available data including school characteristics, would yield individual probabilities of leaving that can be used as weights in the statistical modelling of outcomes. Without such an analysis, apart from the other problems, it is difficult to place much reliance on the results.
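The logic of such a weighting scheme can be sketched with simulated data. This is a minimal illustration, not the analysis the report should have run: the dropout probabilities below are known by construction, whereas in practice they would have to be estimated from a model (for example, a logistic regression on baseline scores and school characteristics).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Simulated standardised baseline and follow-up scores for one cohort.
baseline = rng.normal(0.0, 1.0, n)
followup = 0.7 * baseline + rng.normal(0.0, 0.5, n)

# Suppose lower-scoring pupils are more likely to leave before follow-up.
p_leave = 1.0 / (1.0 + np.exp(baseline))   # higher baseline -> less attrition
stayed = rng.random(n) > p_leave

# The naive mean over completers is biased upwards ...
naive_mean = followup[stayed].mean()

# ... but weighting each completer by 1 / P(stay) recovers the
# full-cohort mean, the essence of inverse-probability weighting.
weights = 1.0 / (1.0 - p_leave[stayed])
ipw_mean = np.average(followup[stayed], weights=weights)

true_mean = followup.mean()
print(round(naive_mean, 3), round(ipw_mean, 3), round(true_mean, 3))
```

With attrition related to attainment, the unweighted completer mean overstates performance, while the weighted mean sits close to the full-cohort value.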
The difference-in-differences (DiD) model is the principal model used throughout the report, yet it has serious flaws which are not mentioned. The first problem is that it is scale dependent: any monotone (order-preserving) transformation of the test scale will produce different estimates, so at the very least different scalings need to be tried. Since all educational tests are on arbitrary scales anyway, this is an issue that needs to be addressed, especially where the treatment groups (Bridge and non-Bridge schools) have very different student test score distributions.
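A hypothetical numerical example makes the scale dependence concrete. The group means below are invented for illustration, and for simplicity the transformation is applied to the means themselves; the point is only that an order-preserving rescaling can reverse the sign of the DiD estimate.

```python
import math

# Invented group means on an arbitrary raw test scale.
treat_pre, treat_post = 60.0, 80.0
ctrl_pre, ctrl_post = 40.0, 55.0

def did(t0, t1, c0, c1):
    """Difference in differences: (treatment gain) minus (control gain)."""
    return (t1 - t0) - (c1 - c0)

# On the raw scale the treatment group appears to gain more ...
raw = did(treat_pre, treat_post, ctrl_pre, ctrl_post)   # +5.0

# ... but after an order-preserving log transform (an equally arbitrary
# scaling) the control group gains more, so the DiD changes sign:
# 80/60 = 1.33 versus 55/40 = 1.38 in proportional terms.
logged = did(math.log(treat_pre), math.log(treat_post),
             math.log(ctrl_pre), math.log(ctrl_post))

print(raw, round(logged, 4))   # 5.0 -0.0308
```

The same data thus ‘show’ a positive effect on one defensible scale and a negative one on another, which is why the report needed to examine alternative scalings.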
Secondly, even ignoring scale dependency, the differences across time may in fact be (and usually are) a function of the initial test score, so the latter needs to be included in the model; otherwise the DiD will simply reflect the average difference in gains. If, as is the case here, the baseline score is higher for Bridge schools, and on the scale chosen the higher-scoring pupils at baseline tend to make more progress, then the DiD will automatically favour the Bridge schools.
Thirdly, the claim that DiD effectively adjusts for confounders is only true if there are no interactions between such confounders and treatment. This point does appear to be understood, but nevertheless remains relevant and is not properly pursued in the report.
The report does carry out an analysis using a regression model which, in principle, is more secure than the DiD model. This requires allowing a nonlinear relationship with the baseline score, which is done, but also allowing possible interactions with the covariates, which is not done. Even more important is that there needs to be an adjustment for the reliability of the baseline test, which is likely to be low for such early-years tests. If the baseline test reliability is low, say less than 0.8, then inferences will be greatly changed, and the common finding in other research around this age is that the apparent treatment effect is weakened (Goldstein, 2015).
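The consequence of ignoring baseline unreliability can be illustrated with a simulated example. In this sketch the true treatment effect is zero and the baseline test is given an assumed reliability of 0.7; a simple regression-calibration correction (shrinking each observed score towards its group mean by the reliability) removes the spurious effect that the naive adjustment produces.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
rel = 0.7                                  # assumed baseline test reliability

treat = np.repeat([1.0, 0.0], n // 2)
true_base = rng.normal(0.0, 1.0, n) + 0.5 * treat   # treated pupils start higher
# Observed baseline = true score + measurement error, giving reliability 0.7.
obs_base = true_base + rng.normal(0.0, np.sqrt((1 - rel) / rel), n)

# No true treatment effect: the outcome depends only on the TRUE baseline.
outcome = 0.7 * true_base + rng.normal(0.0, 0.5, n)

def treat_coeff(base):
    """OLS treatment coefficient, adjusting for the given baseline measure."""
    X = np.column_stack([np.ones(n), base, treat])
    return np.linalg.lstsq(X, outcome, rcond=None)[0][2]

naive_effect = treat_coeff(obs_base)       # spuriously positive: under-adjusted

# Regression calibration: shrink each observed score towards its own
# group mean by the reliability before adjusting.
means = np.where(treat == 1, obs_base[treat == 1].mean(),
                 obs_base[treat == 0].mean())
calibrated = means + rel * (obs_base - means)
corrected_effect = treat_coeff(calibrated)

print(round(naive_effect, 3), round(corrected_effect, 3))
```

Because measurement error attenuates the baseline slope, the naive regression under-adjusts and credits part of the groups’ baseline gap to the ‘treatment’; the corrected estimate is close to the true value of zero.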
Table 15 is especially difficult to interpret. It essentially looks at what happens to the lower-achieving group at time 1, defined using a common cut-off score. Yet this group is overall even lower achieving in the State schools than in the Bridge schools, so it will be easier, on average, for pupils in this group in Bridge schools to move out of the category. The evidence from these comparisons is therefore even less reliable than that from the above analyses and can be discounted as providing anything useful. Surprisingly, this point appears to be understood, yet the comparison is still used as ‘evidence’.
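This regression-to-the-mean problem with a common cut-off can likewise be shown by simulation. In the invented example below, neither group improves at all between the two occasions, yet more of the higher-scoring group’s ‘low achievers’ escape the category simply because they sit closer to the cut-off.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
cut = 0.0                                  # common cut-off score

true_bridge = rng.normal(0.3, 1.0, n)      # 'Bridge' pupils score higher overall
true_state = rng.normal(-0.2, 1.0, n)

def escape_rate(true):
    """Share of pupils below the cut-off at time 1 who are above it at time 2,
    when BOTH scores are just noisy measurements of an unchanged true score."""
    t1 = true + rng.normal(0.0, 0.5, n)
    t2 = true + rng.normal(0.0, 0.5, n)
    low = t1 < cut
    return (t2[low] > cut).mean()

bridge_rate = escape_rate(true_bridge)
state_rate = escape_rate(true_state)
print(round(bridge_rate, 3), round(state_rate, 3))
```

With no learning gains whatever, the higher-scoring group still shows a larger ‘escape’ rate from the low category, which is exactly the artefact that undermines the Table 15 comparison.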
There is a section in the report on cross-country comparisons. The problem is that country assessments are fundamentally different and comparability is a very slippery concept; this section’s results are highly unreliable and really should be ignored.
In short, this report has such considerable weaknesses that its claims need to be treated with scepticism. It also appears to be authored by people associated with BIA, and hence presumably with a certain vested interest. The issue of whether private education can deliver ‘superior’ education remains an interesting and open question.
Goldstein, H. (2015), Jumping to the wrong conclusions. Significance, 12: 18–21. doi: 10.1111/j.1740-9713.2015.00853.x
University of Bristol
21 June 2016
Sunday, 13 April 2014
Do league tables really improve test scores?
There is still an argument about whether school league tables, despite their well-known side effects, actually improve the performance of pupils by ‘holding schools to account’. This is despite careful analysis of the extensive data collected by Government on all pupils in England via the National Pupil Database (NPD); details can be found at
This research, the latest in a sequence, shows that the uncertainty surrounding value-added scores for schools is so large that not only can most schools not be reliably separated, the scores are also of hardly any use to parents choosing schools for their children, in terms of predicting future school performance.
Yet there are still reports that claim to demonstrate that public league tables do in fact raise pupil performance. So, despite the fact that they appear to be unreliable measures of real school performance, they nevertheless, somehow, manage to improve overall results. We need, therefore, to take the claims seriously, especially when they are published in peer-refereed journals. One of the more recent, by Professor Simon Burgess and colleagues,
compares public examination score trends over time in Wales, where league tables were dropped in 2001, and England, which has continued to publish them. The authors conclude that “If uniform national test results exist, publishing these in a locally comparative format appears to be an extremely cost-effective policy for raising attainment and reducing inequalities in attainment.” Of course, schools are about much more than test and exam scores, and this does need to be borne in mind.
Essentially what the authors do is to compare the difference in GCSE results between Wales and England over the period 2002–2008 and show that the gap between England and Wales increases steadily over time. This contrasts with the period before 2002, when there were no differential trends. The authors are careful to try to rule out causes other than the abolition of league tables in Wales for this trend, testing a series of assumptions and using a series of carefully constructed and complex statistical models, but are left concluding that it is indeed the abolition of the league tables that has placed Wales at an increasing disadvantage compared to England.
As the authors recognise, a major problem with studies of this kind, sometimes rather misleadingly (as is done here) referred to as ‘natural experiments’, is that in attempting to infer causation from correlated trends across time, so many other things are also changing over the period that one can never be sure that important alternative explanations have been ruled out. In this note I want to suggest that there are indeed alternative explanations, other than the abolition of league tables, for this increasing difference, and that the authors’ conclusion as quoted above simply is not justified by the evidence. First let me deal with a few ‘technical’ issues that have been overlooked by the authors.
A basic assumption when public examination scores are used for comparisons is that there is an equivalence of marking and grading standards across different examination boards, or at least, in this case, that any differences are constant over time. Yet this is problematic. When the league tables and the associated key stage testing were abolished in Wales, there was no longer any satisfactory way that such common tests could be used to establish and monitor exam standards in Wales, where most pupils sat the Welsh Joint Education Committee (WJEC) exams, compared to England, where only a small minority took WJEC exams. There is therefore some concern that comparability may have changed over time. The authors of the paper, unfortunately, do not take this problem very seriously and merely state that
“The National Qualifications Framework ensured that qualifications attained by pupils across the countries were comparable during this period”
One way in which the authors might have tested their ‘causal’ hypothesis is by dividing England into regions and studying comparative trends in each, in order to see whether Wales really was different, but this does not seem to have occurred to them. The major omission in the paper, however, is that the authors fail to mention that at the same time as the league tables were stopped, so was the testing, and because pupils became less exposed to tests they were arguably less well equipped for the examinations too. This is, admittedly, somewhat speculative, but we do know that the ability to do well on tests is in part strongly related to the amount of practice pupils have been given, and it would be somewhat surprising if this did not also extend to the public exams. Interestingly, when piloting for the reintroduction of regular testing in Wales took place in 2012, there was evidence that the performance of pupils had deteriorated as a result of not being tested intensively during their schooling. It has also been suggested that this lack of exposure to testing is associated with a relative decline in PISA test scores.
So here we have a very plausible mechanism, ignored by the authors, that, if you believe it to be real, explains the relative Welsh decline in exam results. It may have nothing to do with publishing league tables, but rather with the lack of test practice. Of course, if this is in fact the case, it may have useful implications for schools in terms of how they prepare pupils for exams.
I would argue, therefore, that we should not take seriously these claims for the efficacy of league tables. I believe that there is no strong evidence that their publication in any way enhances pupil performance and furthermore that their well-understood drawbacks remain a powerful argument against their use.
Friday, 13 December 2013
Educational policy: what can PISA tell us?
For over a decade the OECD has been promoting its Programme for International Student Assessment (PISA) as a robust means of comparing the performance of educational systems in a range of countries. The latest results of tests on 15-year-olds were published in early December and the British government, along with many others in Europe and elsewhere, reacted as usual with shock and horror. Welsh politicians immediately, but with no evidence, put the blame for their ‘slide down the international league table’ on their abandonment of testing eight years ago. Both Labour and Coalition spokespeople predictably blamed their rivals’ policies for England’s ‘mediocre’ performance, again with no possible evidence.
What has often been termed ‘PISA shock’, or more accurately ‘PISA panic’, has accompanied past releases, and politicians of all persuasions, in many countries, have used the ‘evidence’ about movements up and down the tables to justify changes to their own curricula or assessment systems. So Finland, which consistently comes towards the top, and continues to do so, has been held up as a model to follow: if you come from the right you are likely to emphasise the ‘formality’ of the Finnish curriculum to justify ‘traditional’ approaches, and if you hail from the left you can find yourself pointing to the comprehensive nature of the Finnish system to justify reinstating comprehensivisation in England. The reality, of course, is that we simply do not know what characteristics of the Finnish system may be responsible for its performance, nor indeed whether we should take much notice of these comparisons, given the weaknesses that I shall point out. Similar remarks apply to Shanghai, whose performance is hardly believable.
I don’t want to go into detail about the technical controversies that surround the PISA data. Suffice it to say that there is an increasing literature pointing out that PISA presents a vastly oversimplified view of what counts as performance in reading, maths and science. There is research showing that countries cannot be ranked unequivocally along a single scale and that they differ along many dimensions. Thus, in a comparison of France and England, colleagues and I were able to show that different factors were at work in each system. This is further complicated by the ways the systems are structured, with up to a third of pupils in French schools repeating a year at some stage, compared with very few in England.
There is good evidence that the process of translating the PISA tests from one language to another is problematic, so that there is no assurance that the ‘same things’ are being assessed in different educational systems. Detailed analysis of the International Adult Literacy Survey has shown how much translation can depend upon context, and that in many cases it is virtually impossible to achieve comparability of difficulty for translated items. PISA does in fact attempt to eliminate items that appear to be very discrepant in terms of how pupils respond to them in different countries. The problem with this, however, is that it will tend to leave a kind of ‘lowest common denominator’ set of items that fails to reflect the unique characteristics of different educational systems.
Most importantly, PISA is a one-off ‘cross-sectional’ snapshot in which each 15-year-old pupil in the sample is tested at a single point in time. No attempt is made (except in a few isolated countries) to relate pupil test scores to earlier scores so that progress through the educational system can be studied. This is a severe handicap when it comes to making any ‘causal’ inferences about reasons for country differences, and in particular comparing educational systems in terms of how much pupils progress over time given their attainments when they start school. Often known as ‘value added’ analysis, this provides a much more secure basis for making any kind of causal attribution. The OECD has in the past refused to implement any kind of ‘longitudinal’ linking of pupils’ data across time, although this may be changing.
PISA still talks about using the data to inform policymakers about which educational policies may be best. Yet the OECD itself points out that PISA is designed to measure not merely the results of different curricula but to make a more general statement about the performance of fifteen-year-olds, and that such performance will be influenced by many factors outside the educational system as such, including economic and cultural ones.
It is also worth pointing out that researchers who are interested in evaluating PISA claims by reanalysing the data, are severely handicapped by the fact that, apart from a small handful, it is impossible to obtain details of the tasks that are given to the pupils. These are kept ‘secure’ because, OECD argues, they may be reused for purposes of attempting to make comparisons across time. This is, in fact, a rather feeble excuse and not a procedure that is adopted in other large scale repeated surveys of performance. It offends against openness and freedom of information, and obstructs users of the data from properly understanding the nature of the results and what they actually refer to. Again, OECD has been resistant to moving on this issue.
So, given all these caveats, is there anything that PISA can tell us that will justify the expense of the studies and the effort that goes into their use? The answer is perhaps a qualified yes. The efforts that have gone into studying translation issues have given insights into its difficulties and provided pointers to the reservations which need to be borne in mind when interpreting the results. This is not something highlighted by OECD, since it would somewhat detract from the need to provide simple country rankings, but it could nevertheless be valuable. The extensiveness of the data collected, including background socio-economic characteristics of the pupils and information about curriculum and schools, is impressive, and with the addition of longitudinal follow-up data could be quite valuable. What is needed, however, is a change of focus by both OECD and the governments that sign up to PISA. As a suitably enhanced research exercise devoted to understanding how different educational systems function, what the unique characteristics of each one are, and how far it may be legitimate to attribute any differences to particular system features, PISA has some justification. If its major function is to produce country league tables, however, it is uninformative, misleading, very expensive and difficult to justify.
The best thing to do when the results are published would be for policymakers to shrug their shoulders, ignore the simplistic comparisons that the media will undoubtedly make, and try to work towards making PISA, and other similar studies, such as TIMSS, more useful and better value for money.
Arffman, I. (2013). Problems and issues in translating international educational achievement tests. Educational Measurement: Issues and Practice, 32(2), 2–14.
Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education, 11(3), 319–330.
University of Bristol