In one large-scale impact evaluation, the realization rate varied from 0.27 to 0.87 depending on the inclusion of fewer than ten buildings out of a sample of more than a thousand. While the concerned parties can debate the merits of keeping, removing, or downweighting these buildings, none of these options is necessarily the "correct" answer. Assuming that the outliers have not been identified as some type of data error, the real issue is whether the SAE model is viable as specified. Such tremendously influential outliers usually indicate a problem with the model specification. An examination of that model reveals a questionable specification that also undoubtedly suffers from heteroscedasticity (which may be responsible for some of these outliers). Only after such a model is re-estimated with a better specification should the issue of removing or keeping outliers be addressed.
If the model is still sensitive to a few observations, then one needs to assess whether these buildings' influence on savings is appropriate (based on participant population characteristics). If a building was expected to save a significant fraction of the total savings of a program, then a large effect on the impact estimate may be acceptable. If a building with only 1% of the predicted program impact changes the savings estimate by 30%, then the observation may have too much influence.
There are many techniques for assessing outliers and influence, and different analysts may have different favored approaches. However, it may be wise to set some minimum analysis and reporting requirements; otherwise, the resulting array of approaches may make it difficult to interpret studies or compare them to each other. For standard SAE models, three approaches are proposed as a useful minimum requirement.
First, studentized residuals should be calculated to identify outliers. Observations with studentized residuals greater than 3 in absolute value are usually worth further investigation (since their values should be distributed approximately as a t-statistic, there should only be about 3 such observations in a sample of a thousand). The evaluator should identify and describe the observations with the five largest values above this cut-off. The impact on savings estimates from removing all such observations should be presented. If more than about 1% of the observations are in this category (as in the Monte Carlo analysis in the previous section), then the model is probably suffering from uncorrected heteroscedasticity or other problems.
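The studentized-residual screen can be sketched in a few lines. This is a minimal illustration, not the procedure used in any cited evaluation: it assumes a one-regressor model with an intercept (observed savings against the engineering estimate), and the function name and data are hypothetical. Residuals are externally studentized, so each observation is scaled by an error variance estimated without it.

```python
import math

def studentized_residuals(x, y):
    """Externally studentized residuals for the simple fit y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    result = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx             # leverage of this point
        s2_deleted = (sse - e * e / (1.0 - h)) / (n - 3)  # error variance without it
        result.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return result

# Hypothetical data: observed savings track 60% of the engineering estimate,
# with small deterministic noise and one building (index 9) far off the line.
noise = [0.3, -0.2, 0.1, -0.4, 0.2]
est = [float(i) for i in range(1, 21)]
obs = [0.6 * e + noise[i % 5] for i, e in enumerate(est)]
obs[9] += 8.0

t_vals = studentized_residuals(est, obs)
flagged = [i for i, t in enumerate(t_vals) if abs(t) > 3.0]
print(flagged)
```

In a real SAE model with several regressors, the leverage comes from the diagonal of the hat matrix rather than the one-variable formula used here; most statistics packages report it directly.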
Second, df-betas should be calculated to identify observations with large influence. Df-betas are valuable diagnostic statistics for assessing SAE models because they directly measure how much each observation affects the realization rate. Observations which affect the realization rate by more than a few percent may be worth investigating (the cut-off number should depend on the size of the sample and the relative predicted impact of the observation, but three to six percent may be a reasonable starting point). Again, the observations with the five largest df-betas should be identified, and the impact of removing all observations beyond the cut-off reported.
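As a rough sketch of the influence screen, the change in the realization rate from dropping each observation can be computed by brute-force refitting. The example assumes a deliberately simplified one-regressor model through the origin (observed savings = rate x engineering estimate), reports the unscaled change as a percentage of the full-sample rate, and uses a 5% cut-off chosen from within the three-to-six percent range suggested above. The function names and data are hypothetical.

```python
def realization_rate(est, obs):
    """Least-squares slope through the origin: obs ~ rate * est."""
    return sum(e * o for e, o in zip(est, obs)) / sum(e * e for e in est)

def dfbeta_percent(est, obs):
    """Percent change in the realization rate attributable to each observation."""
    full = realization_rate(est, obs)
    changes = []
    for i in range(len(est)):
        est_wo = est[:i] + est[i + 1:]   # drop observation i
        obs_wo = obs[:i] + obs[i + 1:]
        changes.append(100.0 * (full - realization_rate(est_wo, obs_wo)) / full)
    return full, changes

# Hypothetical data: 19 buildings realize 60% of predicted savings;
# one large building realizes almost none of its prediction.
est = [10.0] * 19 + [40.0]
obs = [6.0] * 19 + [2.0]

full, changes = dfbeta_percent(est, obs)
flagged = [i for i, c in enumerate(changes) if abs(c) > 5.0]
share = est[19] / sum(est)   # this building's share of predicted savings
print(full, flagged, share)
```

Comparing each flagged observation's percentage effect with its share of predicted savings gives a direct reading of whether its influence is out of proportion to its expected contribution.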
Third, robust regression (e.g., bi-weighted least squares or least absolute values) should be used to re-estimate the model in order to assess the overall quality of the fit to the data. If a robust fit gives very different impact estimates, it implies that the model does not fit the bulk of the data very well and needs further investigation. Discrepancies need to be explained. The Monte Carlo simulations described in the previous section also included use of least absolute values (LAV) regression to estimate the SAE model.
Due to its greater resistance to outliers, the LAV model proved to be almost twice as accurate as the standard SAE estimate and comparable to the simple pre/post. In addition, the LAV model properly covered its confidence interval (when standard errors were estimated through bootstrapping, but not when estimated analytically). These results do not necessarily mean that the LAV model should be used instead of ordinary least squares, but do tend to endorse the principle of using LAV estimates as a cross-check on the standard results.
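The LAV cross-check with bootstrapped standard errors can be illustrated with a small sketch. It assumes the same simplified one-regressor model through the origin, for which the LAV slope is exactly the weighted median of the ratios obs/est with weights |est|; a full SAE model would need a general LAV or quantile-regression routine instead. Function names, data, and the number of bootstrap replications are hypothetical.

```python
import random

def lav_rate(est, obs):
    """LAV slope through the origin.

    Minimizing sum |obs - b*est| is the same as minimizing
    sum |est| * |obs/est - b|, which is solved by a weighted median
    of the ratios obs/est with weights |est|. Observations with a
    zero estimate do not depend on b and are skipped.
    """
    pairs = sorted((o / e, abs(e)) for e, o in zip(est, obs) if e != 0)
    half = sum(w for _, w in pairs) / 2.0
    running = 0.0
    for ratio, w in pairs:
        running += w
        if running >= half:
            return ratio

def bootstrap_se(est, obs, reps=500, seed=0):
    """Standard error of the LAV rate from resampling buildings with replacement."""
    rng = random.Random(seed)
    n = len(est)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(lav_rate([est[i] for i in idx], [obs[i] for i in idx]))
    mean = sum(draws) / reps
    return (sum((d - mean) ** 2 for d in draws) / (reps - 1)) ** 0.5

# Hypothetical data: one large building with almost no realized savings.
est = [10.0] * 19 + [40.0]
obs = [6.0] * 19 + [2.0]

ols = sum(e * o for e, o in zip(est, obs)) / sum(e * e for e in est)
lav = lav_rate(est, obs)
se = bootstrap_se(est, obs)
print(ols, lav, se)
```

In this constructed case the least-squares rate is pulled well below the rate that describes 19 of the 20 buildings, while the LAV rate is not. A gap of that size is the kind of discrepancy that needs to be explained.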
The particular cut-offs for studentized residuals and df-betas cited above are suggestions based on theory and experience. Reasonable arguments could be made for using different values. However, some values need to be selected in order to provide consistency between evaluations and to help the evaluation field develop a better sense for what values may be typical and how large a problem outliers and influence points may be. Evaluators should feel free to supplement any minimum requirements with additional preferred tests or approaches. Regardless of the particular approach taken, quality evaluations provide information on the potential extent of outlier and influence problems and show the impacts of any analytical choices on the results.
If both samples are perfectly representative, then a simple pre/post treatment/comparison evaluation design provides unbiased results. One of the prime justifications for SAE models is that unbiased samples are very hard to find and a regression model is one way to try to control for these problems. SAE models, like all evaluation methods, still depend on the representativeness of the participant sample. While a model may be able to capture some confounding factors affecting changes in energy usage, the sample must still represent the population in terms of specific technologies and applications of measures and their impacts. SAE models also depend on the non-participant sample to represent how the participants' usage would have changed if subject to the same factors included in the model.
Potential sources of bias include systematic differences between the groups which aren't captured in the model and/or differences which are captured in the model but affect the participants and non-participants differently. Because SAE models require data from surveys, they may add to sample problems due to survey non-response bias.
In a more sophisticated attempt to deal with comparison group representativeness, and particularly self-selection bias, many evaluations combine a logit participation model with an SAE model. These models are subject to many of their own problems, including poor predictive ability. Participation models also make certain assumptions about sample representativeness and in many cases may be trying to correct for sample differences which are really due to differential non-response biases between the samples. In addition, there has been considerable debate about what exactly the nested logit/SAE approach is really trying to accomplish (see reference 5 below).
Even when the modeling approach works properly, it may be estimating the wrong thing -- what the savings would have been if everyone were forced to participate, as opposed to removing the effect of what the participants would have saved if they hadn't participated.
Sample problems are frequently downplayed in DSM evaluations. For example, a recent residential evaluation found that the surveyed non-participant sample used 30% more energy in the pre-program year than either the participant population or the participant sample. The text noted that the usage is "moderately higher" and then stated "These differences are controlled for in both the participation decision and energy impact model". In the same report, evaluating another customer sector, a table of summary statistics reveals that the surveyed participant sample used 20% less electricity in the pre-program period than the full participant population or the non-participant sample.
The text accompanying the table stated "The three groups are roughly similar in terms of initial consumption". However, the differences were highly statistically significant and clearly of practical significance. In addition, a simple pre/post savings calculation indicates that the participant sample had apparent net savings 40% lower than the participant population. A logit participation model indicated that pre-program energy usage is the only significant determinant of participation.
This "finding" was then incorporated into an SAE model, attempting to adjust the impact estimates for differences between the participants and non-participants that may only exist due to non-response bias, not actual population differences. Ironically, the net result of the two-stage modeling process was a savings estimate indistinguishable (within 5 kWh/yr.) from the simple pre/post comparison of the analysis samples. It is not clear what relation this savings estimate bears to the actual program impact, although one could make a reasonable argument that it is 40% too low given the sample bias.
Because of the underlying assumptions about sample representativeness which all evaluation methods rely upon to some extent, a detailed assessment of sample representativeness should be an integral component of all evaluations. This assessment needs to go beyond simple means and t-tests (which are commonly misinterpreted as proving that the two groups are the same, and are subject to type II error).
Comparisons should be made between all relevant groups (populations, initial samples, and final analysis samples) on all available variables (e.g., sector, building size, occupancy, major end uses, energy usage, measure types, predicted savings, and all variables used in statistical models). The comparisons should include an analysis of the similarity of the distributions (not just means) of the variables, particularly including the "tails" (regression models tend to give the greatest weight to extreme observations, making the representativeness of such values in the sample critical).
Graphical approaches (e.g., histograms or quantile-quantile plots) and/or simple reporting of percentiles (e.g., min, 1st, 5th, 10th, 25th, 50th, etc.) could be used. For variables in regression models which frequently take on zero values (e.g., dummy variables), the proportion of zeros and the distribution of the remaining values (if they vary) should be reported.
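A percentile report of this kind takes very little code. The sketch below is one possible layout, not a prescribed format: it assumes linear interpolation between order statistics (packages differ on this convention), and the function names, percentile points, and data are hypothetical. Running it once per group (population, initial sample, analysis sample) and per variable gives rows that can be set side by side.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (0 <= p <= 100) of a sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def distribution_summary(values, points=(0, 1, 5, 10, 25, 50, 75, 90, 95, 99, 100)):
    """Share of zeros plus percentiles of the non-zero values."""
    nonzero = sorted(v for v in values if v != 0)
    summary = {"share_zero": 1.0 - len(nonzero) / len(values)}
    for p in points:
        summary[f"p{p}"] = percentile(nonzero, p) if nonzero else None
    return summary

# Hypothetical variable: predicted lighting savings, zero for half the sample.
sample = [0, 0, 0, 0, 0, 10, 20, 30, 40, 50]
report = distribution_summary(sample)
for key, value in report.items():
    print(key, value)
```

For a variable with no zeros the share is simply 0 and the percentiles describe the whole distribution, so the same function covers both cases.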
While these proposed requirements may seem onerous in comparison to typical practices, compliance should be relatively easy since almost any statistics package can produce these results. The sheer quantity of information may be overwhelming in some cases and will require a clear presentation format and a useful narrative. In addition to assessing representativeness, the resulting information should also provide greater insight into participant characteristics.
The reporting on the model is often brief and provides few analyses or discussions such as described in this paper. Many evaluations don't provide even basic summary statistics on the variables in the model. In addition to what isn't reported, much of what is reported is unsupported in the data provided or indicative of a statistical misinterpretation. A typical example, from a peer-reviewed evaluation paper, presented the usual SAE model output table. The narrative with the model stated that the model was good because the r-squared was high, most variables were statistically significant, and all but one variable had the right sign. There are several problems with this narrative.
First, r-squared is a poor indicator of model performance, particularly for SAE models, because the dependent variable is post-program usage, not savings, and pre- and post-program usage are highly correlated. Therefore, the r-squared will typically be very high (> 0.9) even if the only explanatory variable is pre-program usage. The value of r-squared will be dominated by this underlying correlation, regardless of the quality of the model. In practice, few SAE models provide a substantial increase in this already high r-squared.
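A small simulation illustrates the point. All numbers here are assumptions chosen for illustration (usage range, a 5% average reduction, the noise level, the seed); none come from an actual evaluation. The same pre-program usage variable explains almost all of the variation in post-program usage but very little of the variation in the change in usage, which is the quantity of interest.

```python
import random

def r_squared(x, y):
    """R-squared of the simple regression y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)

rng = random.Random(1)  # fixed seed so the illustration is repeatable

# Assumed annual kWh: post-program usage is about 95% of pre-program usage
# plus noise unrelated to anything in the "model".
pre = [rng.uniform(5000, 30000) for _ in range(500)]
post = [0.95 * p + rng.gauss(0, 1500) for p in pre]
change = [p0 - p1 for p0, p1 in zip(pre, post)]

r2_level = r_squared(pre, post)     # dependent variable: post-program usage
r2_change = r_squared(pre, change)  # dependent variable: change in usage
print(round(r2_level, 3), round(r2_change, 3))
```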
Second, if most variables in a model have t-statistics greater than 2 it doesn't mean that the model specification is correct, or that the t-statistics are correct, or that these statistically significant factors have any practical significance in understanding usage variations (particularly true when dummy variables take on almost all zero values). Many evaluators mistake statistical significance for accuracy. For example, in another recent paper the authors stated that the "analysis produced very accurate (i.e., statistically significant) results".
The problems with internal measures of uncertainty such as t-statistics were summarized quite well by famed quality expert and statistician W.E. Deming, who noted, "Statistical 'significance' by itself is not a rational basis for action" (see reference 6 below). There are many threats to the validity of standard errors from SAE-type regression models (only some of which have been described in this paper) and therefore it may not be wise to rely upon them to assess uncertainty.
The third problem with the example narrative is the claim that all but one variable has the anticipated sign. A brief examination of the coefficients indicates that of the 11 "control" variables in the model which actually have an anticipated sign and supposedly had the right sign, four clearly have incorrect signs (often two variables indicating opposite responses to a question both had the same sign, e.g. floor space increased and floor space decreased were both associated with increased usage).
Unfortunately, the problems with model assessment and interpretation in this example are not uncommon. It is unusual to find a discussion which explains what the model accomplished, why it makes sense, which variables affected the results, what other specifications were tried, why the presented model was selected, whether there were any problems with outliers, the extent to which assumptions were violated, analytical choices made and their impacts on the results, etc. Quality evaluations include these analyses and provide this level of detail because they recognize the many potential threats to validity.
Given this political climate, regulators need to adopt minimum analysis and reporting requirements to change the status quo. The added cost of compliance should be modest for evaluators who follow principles of sound data analysis, because all of the proposed analyses need to be performed anyway. The only added costs are in presentation of results and more detailed explanations of the analysis process.






