Reforms in the health-care system usually apply to several levels so that the impact of single actions is hardly measurable in an isolated manner: Tanzanian surgical nurse.
GIZ’s declared goal is to make a positive and lasting impact in partner countries. Its evaluation system, which measures and assesses results, is meant to serve this goal. Systematic evaluation helps the agency to learn from mistakes as well as successes, and contributes to better work at the project and institutional levels in the future. Tracking positive and negative impacts is no simple task, however. GIZ’s monitoring and evaluation unit therefore keeps testing innovative approaches. The aim is to better understand what works, how and why.
Since 2009, our intensive work on randomised controlled trials (RCTs) has taught us several lessons. This evaluation design is currently attracting a lot of media attention and is sometimes considered the gold standard in impact assessment. The most prominent proponents of this approach are Esther Duflo and Abhijit Banerjee of the Massachusetts Institute of Technology. Their bestseller “Poor Economics” (2011) deals with RCT studies.
RCTs are based on the idea that the impact of an intervention can be determined if you know what would have happened without that intervention. To find this out, a group that is participating in a given project is compared with a control group that is not participating. This is how pharmaceutical research is done. The participants are randomly assigned to either group before the intervention starts. Doing so largely ensures that the differences measured are truly caused by the intervention and nothing else.
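The logic of random assignment and group comparison can be illustrated with a minimal sketch. It is not taken from any GIZ or RWI study: the function names, the ten hypothetical households and the simulated outcome values are assumptions chosen purely for illustration.

```python
import random
from statistics import mean

def assign_groups(unit_ids, seed=0):
    """Randomly split units into a target and a control group of near-equal size."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])

def estimated_effect(outcomes, target, control):
    """Difference between the mean outcome of the target and the control group."""
    return mean(outcomes[u] for u in target) - mean(outcomes[u] for u in control)

# Hypothetical example: ten households, assigned before the intervention starts.
households = range(10)
target, control = assign_groups(households, seed=42)

# Simulated outcome (assumption): 10 units of fuel without the intervention, 7 with it.
outcomes = {h: 7.0 if h in target else 10.0 for h in households}
effect = estimated_effect(outcomes, target, control)
print(effect)
```

In a real trial the outcomes would of course be measured rather than simulated, and a statistical test would be needed to judge whether the difference could have arisen by chance.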
In Senegal in 2010, the Rheinisch-Westfälische Institut für Wirtschaftsforschung (RWI) conducted the first RCT for GTZ (which is now GIZ). The task was to find out what improved cooking stoves meant in terms of firewood consumption, health problems and the time and money invested in procuring fuel and cooking. To begin with, the evaluators collected data on the everyday cooking routines of 253 families. Next, each household was randomly allotted either a stove (target group) or a sack of rice (control group).
To check how the stoves were used and whether there were any technical problems, three field visits were carried out. They showed that 87 % of the target group was using the new stoves. The results were statistically significant: the consumption of firewood went down by 30 %, the time spent on cooking was reduced by 70 minutes a day, and there were fewer eye infections and respiratory problems. These developments were not seen in the control group, so the new stoves were definitely the cause. On the basis of these findings, more stoves are now being distributed. The clear proof of causality is what makes RCTs attractive at first glance. By comparing target and control groups, alternative explanations for a particular impact can be eliminated to a great extent.
The design has inherent weaknesses, however. For instance, one cannot assume that findings from one social setting tell us much about what would happen in a different setting. It is far from obvious, for example, that problems in India can be solved by relying on the findings of experiments conducted in Africa. Such generalisations are nonetheless quite common, even though they are profoundly unscientific.
There is also the risk of evaluators having a very narrow view of reality. If, for example, data is collected via standardised questionnaires, participants can only refer to aspects that matter to them if those aspects are included in the questionnaire. This makes it difficult or even impossible to detect unexpected impacts. The problem can be solved by including open questions that allow various answers.
It is similarly important to note that many econometric methods only show whether a certain impact is made, but not why. If no impact is found, it may remain unclear, for example, whether a project was badly planned or merely badly implemented.
RCTs are often not applicable in practice for other reasons. One of the scientific prerequisites is that neither the members of the target and the control groups nor the project staff know who is in which group. In technical terms this is called “double blindness”. In the reality of development cooperation, this rule typically cannot be met.
Spill-over effects are another challenge. It is difficult to rule out that the control group is affected in some way or another by the intervention in the target group. The control group may learn about positive or negative impacts and change its behaviour accordingly. In Senegal, for instance, it turned out that some members of the control group had actually borrowed improved stoves from members of the target group.
Other factors similarly limit the use of RCTs for evaluating GIZ programmes. For good reason, development agencies tend to choose their target groups with care. After all, they need committed persons. The random selection of target groups will often not bring about the desired results.
In other cases, reforms are made at the national level, so everybody is affected by a certain measure (such as new legislation) and it is impossible to establish any meaningful control group. The same is true of macroeconomic policies which, by definition, affect the entire economy.
Many development efforts, moreover, are part of complex programmes that intervene at various levels of society. When a country’s health-care system is being reformed, for example, many things need to be tackled. Among others, they include:
- ensuring that wide sections of the people get access to hospitals and health centres,
- improving the qualification of personnel and
- distributing drugs more efficiently et cetera.
Multi-pronged approaches are necessary, but they make it difficult to measure the impact of individual interventions. If there are synergies of mutually reinforcing interventions, moreover, it does not even make sense to try to assess the precise impact of only one of them.
In some sectors, quantified assessment of impacts is more difficult than in others. The quality of governance, for instance, is harder to measure than the quality of vocational training.
For these reasons, RCTs do not fit most GIZ projects and programmes. Things are no different for many other implementing agencies. The modalities of development cooperation have changed in recent years. Typically, development programmes nowadays involve interventions at several levels; stand-alone projects have become rare.
Given that RCTs are expensive and require a lot of effort, one must diligently weigh the costs and benefits. After all, the money spent on evaluations is not available for development action.
In view of the limited applicability of RCTs, GIZ and the agencies that preceded it have been testing “everyday rigorous impact evaluations” in recent years. The aim was to understand direct and indirect impacts of specific interventions as precisely as possible, but to stick to a reasonable budget and stay within a sensible time frame too.
An evaluation concept that suits everyday use was developed in cooperation with Professor Reinhard Stockmann, the director of the Centre for Evaluation at the Universität des Saarlandes (CEval). It includes the following steps:
- First of all, all expected impacts of a specific measure are spelled out in theory, and suitable indicators to measure them are chosen accordingly.
- Comparison groups are not formed before the actual evaluation takes place (“quasi experimental evaluation design”). Different statistical procedures are available to define comparison groups in a coherent manner.
- A mix of different methods to collect and analyse data is used (“triangulation”). This approach allows for balancing the weaknesses of one method with the strengths of others. For instance, triangulation can mean not only collecting quantitative data, but also conducting in-depth interviews with open answers.
- Data is collected in two phases. During a pre-mission on site, the formation of comparison groups as well as the type and quality of the available data are reviewed. Based on the pre-mission insights, the evaluation design is adjusted to close data and knowledge gaps during the main mission.
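To give a rough idea of how a comparison group might be defined after the fact, the sketch below uses simple nearest-neighbour matching on a single characteristic. This is only one of many possible statistical procedures and is not necessarily the one used by GIZ or CEval. The function name, the unit labels and the numbers are hypothetical.

```python
def match_comparison_group(participants, candidates):
    """For each participant, pick the unused candidate with the closest covariate value.

    Both arguments map a unit id to one numeric characteristic
    (for example household size). Assumes there are at least as many
    candidates as participants.
    """
    available = dict(candidates)
    matches = {}
    for pid, value in sorted(participants.items()):
        # Closest remaining candidate; ties are broken by candidate id.
        best = min(available, key=lambda cid: (abs(available[cid] - value), cid))
        matches[pid] = best
        del available[best]   # each candidate is matched at most once
    return matches

# Hypothetical data: two participants, three non-participants to choose from.
participants = {"p1": 4, "p2": 9}
candidates = {"c1": 10, "c2": 3, "c3": 6}
matches = match_comparison_group(participants, candidates)
print(matches)
```

Real evaluations usually match on several characteristics at once and check afterwards how similar the resulting groups actually are.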
GIZ has taken this approach for several evaluations. It will continue to do so in the future. Especially suitable programmes are those that:
- have a big budget,
- serve as pilot schemes or
- have a high strategic relevance for GIZ.
Unfortunately, this everyday evaluation concept is not applicable to many complex programmes which tackle different levels of society. Accordingly, GIZ is also working with other innovative evaluation approaches like contribution analysis and developmental evaluation. Contribution analysis is about assessing on the basis of plausibility as well as empirical data why a certain impact was made (or not), what factors played a role and exactly what contribution was made by the intervention itself. Developmental evaluation according to Michael Quinn Patton (2010) is about an evaluator observing an intervention from beginning to end in order to help the implementing team to re-assess and, if necessary, modify its action continuously.
GIZ will continue to use RCTs in specific situations, provided they are feasible, reasonable and financially affordable. The real challenge is to figure out what method serves best to evaluate the impacts of any specific project or programme. Since all approaches have specific strengths and weaknesses, there is no gold standard. Each case must be analysed to see which evaluation approach is most suited. In most cases, a mix of methods will be best.
Sabine Dinges works in GIZ’s monitoring & evaluation unit. She is preparing her PhD thesis on the effects of evaluations at the University of Bradford.
Sylvia Schweitzer also works in GIZ’s monitoring & evaluation unit. She earned her PhD from the University of Bochum with a dissertation on resource scarcity and resource conflict.