Project Report


The Need for Statistical Analysis
and Reporting Requirements:
Some Suggestions for Regression Models

Paper Presented at 1995 Energy Program Evaluation Conference, Chicago
Dated: August, 1995
By: Michael Blasnik, Proctor Engineering Group, Boston, MA


Abstract

Several regulatory bodies have developed DSM impact evaluation standards. The technical aspects of these standards have primarily focused on sample sizes, methodology choices, and internal measures of uncertainty. While these issues are important, greater challenges to evaluation reliability may come from external sources of uncertainty such as sample bias, reliance on statistical methods whose underlying assumptions are not met, and choices made during the analysis which are not fully explained or justified.

This paper describes some of these threats to reliability and provides examples which indicate the potential magnitude of their impact on results, particularly focusing on SAE-type regression models. Reporting and analysis requirements are proposed which may help in identifying and assessing these potential problems.

The general approach proposed is based on the idea that quality evaluations describe and test key assumptions to the extent possible, accompany the analysis with a well-reasoned narrative which explains the role and impact of analytical choices and statistical models, provide readers with sufficient information to assess the conclusions drawn, and include appropriate caveats.

The proposed requirements are assessed in terms of some of the arguments against evaluation standards (excessive cost of compliance, stifling of innovation). While it is hoped that the proposals will enable evaluation consumers to become better informed of the true uncertainties and analytical choices involved in impact evaluations, they should not be interpreted as providing "quality assurance".

The recommendations are only intended to help uncover some of the more common problems; no assurances can be given that a model which appears "OK" actually provides a reliable estimate. As a professional courtesy, and in some cases to maintain confidentiality, examples from actual evaluations and published papers are generally used without citation.


Background

Impact evaluations of DSM programs have grown in importance as regulators in some jurisdictions have tied shareholder incentives to measured results. Regulators in several states have developed protocols for conducting impact evaluations in an effort to produce more reliable estimates of measured savings. In addition to addressing the basic evaluation issues of what, when, and how often, evaluation protocols have devoted a good deal of attention to the selection of precision requirements and evaluation methodologies.

Precision requirements have been the subject of ongoing debate in the DSM field. However, one critical issue missing from the discussion has been the reliability of the reported precision; i.e., does the reported confidence interval accurately reflect the uncertainty in the savings estimate? It is often assumed that the reported standard errors and confidence intervals are accurate reflections of uncertainty. However, these measures of precision are themselves statistical estimates subject to potential bias and are predicated on the assumption that the impact estimate is unbiased. The accuracy of uncertainty estimates needs to be assessed if the debate over precision levels is to be truly useful.

Multiple regression models, such as SAE or CDA models, have been the primary approach promoted by the evaluation industry and by evaluation protocols for estimating kWh impacts of major programs. Regression approaches have been considered superior to simpler analyses because they are designed to control for confounding non-program effects and potentially biased comparison groups. The implicit assumption is that they will work as designed and provide more precise and less biased estimates of program impacts. Until recently, little attention has been given to assessing the reliability or stability of these models.

It is the author's belief, based on experience working as an evaluator and as an evaluation reviewer on behalf of regulators, intervenors, and implementors, that many of the statistical analyses, and most of the regression models, presented in evaluation reports are subject to a number of potentially significant threats to their ability to provide reliable impact estimates. The proposals in this paper are based on the observation that these problems are often not adequately identified or addressed.

Potential problems with the quality of statistical analyses have also been recognized in California and led to the recent development of quality assurance guidelines (which are currently under review). These guidelines address many of the key issues that may undermine the reliability of impact estimates. The approach taken by the guidelines is to require evaluators to describe how they have dealt with certain common analysis problems, particularly those related to regression models.

The guidelines are not prescriptive in that they do not require specific methods for identifying problems and do not indicate how problems should be resolved. The rationale for this approach is that it provides evaluators with flexibility to choose among a number of legitimate methodological choices. One disadvantage of this flexibility is the difficulty it may create for readers, both in comparing different evaluations and in becoming familiar enough with the variety of techniques to properly assess them. An alternative approach, taken in this paper, is to create minimum requirements which evaluators would be free to expand upon or add to as they see fit.


General Approach to Evaluation

Because they are intended to evaluate the effectiveness of an operating program, DSM evaluations are seldom based on a true experimental design with random assignment of treatments and substantial control over experimental conditions. Instead, they are observational studies of a complex system of engineering and behavioral effects which tend to be based on "quasi-experimental" designs. The problems with observational studies are well known in the statistical literature. For example, noted statistician William Cochran has stated (see reference 1. below) that an investigator

"may do well to adopt the attitude that, in general, estimates of the effect of a treatment or program from observational studies are likely to be biased. The number of variables affecting y on which control can be attempted is limited, and the controls may be only partially effective on these variables. One consequence of this vulnerability to bias is that the results of observational studies are open to dispute."

Most advanced impact evaluation techniques are intended to reduce bias by controlling for as many potentially confounding factors as possible. But, as Cochran points out, one can never be certain whether all or even most of the important sources of bias have been identified. Even for those sources properly identified, the effectiveness of the techniques which attempt to deal with them is uncertain. A quality evaluation recognizes the existence of this fundamental challenge and attempts to identify and address threats to reliability through a combination of careful data analysis (employing multiple approaches where feasible), well reasoned conclusions, and appropriate caveats.

In contrast, DSM impact evaluations often display a great deal of confidence in the approaches and results. For example, many SAE-based evaluations include a statement such as "SAE models are able to control for confounding factors that affect energy use" (emphasis added). Such statements are not confined to evaluation reports but also appear in DSM evaluation handbooks; one such example is "regression models can control most of the confounding factors that determine energy usage, so the evaluation researcher can be certain that the effects being measured are due to the DSM program and not to other, non-program, factors" (emphasis added). Caveats about potential biases or modeling problems are relatively rare.

For reasons explained by Cochran above, and elaborated upon in this paper, the optimism displayed in many impact evaluations may be unfounded. Instead, when performing or assessing an evaluation, it is generally wise to assume that all samples are biased and that the data fail to meet the underlying assumptions behind the statistical analyses performed. The burden of proof rests upon the evaluator to investigate identifiable threats to validity and to provide supporting evidence that the conclusions drawn are reasonable.


Regression Analysis: Some Potential Problems

Regression analysis has been termed the most used and most abused statistical tool (see reference 2. below). When it works as intended, regression is a powerful tool for analyzing data and uncovering relationships. Yet, the reliability of a regression model is dependent upon many assumptions which are virtually never fully satisfied in practice, particularly for observational studies.

Quality evaluations recognize the assumptions which the analysis methods rely upon and, to the extent possible, test the degree to which they are satisfied. Because of their sensitivity to certain violations of assumptions, regression models are particularly challenging to employ successfully without being misled by faulty analysis or interpretation. Meeting this challenge usually requires a healthy degree of skepticism combined with considerable expertise about not only data analysis, but about the subject being evaluated.

SAE Models

SAE-type regression models are intended to control for non-program factors which influence energy usage and therefore improve precision and/or reduce bias in savings estimates. The typical model specification attempts to predict post-program energy usage as a function of pre-program usage, engineering-based predicted savings, and a variety of survey responses to questions concerning changes in facility use, business activity and equipment level. Models may also include some demographic variables and often incorporate a variable derived from a logit participation model. The coefficient on the predicted savings is interpreted as the "realization rate", which is meant to represent the average proportion of predicted savings actually realized by the program.

Because an SAE model is based on the change in energy usage (since pre-program usage is included as an explanatory variable), it may be seen as a way of adjusting a simple pre/post treatment/comparison savings estimate for differences captured in the variables representing non-program effects. In fact, one can view the simple comparison approach as a regression model of change in usage as a function of a constant and a dummy variable indicating participation.
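The equivalence between the simple comparison and a dummy-variable regression can be checked numerically. The following Python sketch uses a handful of made-up changes in usage (illustrative values only, not data from any evaluation) and the closed-form formulas for a one-variable least squares fit. The helper name dummy_regression is hypothetical.

```python
from statistics import mean

def dummy_regression(change, participant):
    """OLS of change in usage on a constant and a 0/1 participation dummy.

    Returns (intercept, slope) from the closed-form simple-regression formulas.
    """
    x_bar = mean(participant)
    y_bar = mean(change)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(participant, change))
    sxx = sum((x - x_bar) ** 2 for x in participant)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical changes in annual usage (post minus pre, kWh); illustrative only.
change = [-900.0, -1200.0, -700.0, -1000.0, -100.0, 200.0, -300.0, 0.0]
participant = [1, 1, 1, 1, 0, 0, 0, 0]

intercept, slope = dummy_regression(change, participant)

# The same quantities computed as plain group means.
part_mean = mean(c for c, p in zip(change, participant) if p == 1)
comp_mean = mean(c for c, p in zip(change, participant) if p == 0)

print("regression intercept:", intercept, " comparison-group mean change:", comp_mean)
print("regression slope:", slope, " difference in mean change:", part_mean - comp_mean)
```

The intercept reproduces the comparison group's mean change and the dummy coefficient reproduces the participant-minus-comparison difference, which is the simple pre/post savings estimate with the sign reversed.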

SAE models attempt to improve this simple model by including more variables to explain changes in usage. If there are no systematic differences between the participant and comparison groups, then no adjustment to simple pre/post results is needed and the SAE model should produce essentially the same savings estimate as the simple comparison but with greater precision (because of usage variations "explained" by the variables in the model). However, if the comparison group differs from the participant group, then the regression model attempts to control for these differences and adjusts the savings estimates to account for this bias.

If an SAE model does not properly control for non-program effects, the savings estimate may be adjusted inappropriately. Given this possibility, a quality evaluation provides a narrative which explains what the model accomplished (or attempted to accomplish). This narrative would include a discussion of how and why the results differ from a simpler pre/post analysis and would describe any important confounding factors which were identified and how they affected the results. Without such a narrative, the reader is not given enough information to assess whether the model is believable.

Model Specification Issues

The most fundamental assumption made by regression modeling is that the model is "correct" -- it includes all of the variables which influence the dependent variable and the functional form of the relationship is properly specified. Of course, few evaluators would claim that their SAE model includes all factors influencing energy usage. However, some would point out that the model does not necessarily have to be correct for the impact estimate to be unbiased. This statement is true if all of the variables which are omitted from the model are unrelated to the variable representing the impact estimate (i.e., predicted savings in SAE models).

There is no method available which can prove that this is the case for a particular model, although there are some tests for omitted variables which may disprove it. This fundamental threat to the reliability of regression coefficients is well known in the statistical literature (see reference 3. below), yet it has received little recognition, and has even been disputed, in the DSM impact evaluation field.

A simple example may help illustrate this problem. Using data from a residential conservation program, the author fit a regression model of pre-program energy usage in terms of house airtightness (measured in CFM50 by a blower door). The model indicated that each CFM50 increased gas usage by 0.12 ccf/yr (+/- 0.02 at 90% confidence). Engineering algorithms indicated that the impact should only be half as large, yet this value is far outside the confidence interval. The discrepancy is due to omitted variable bias -- there are factors correlated with airtightness that also affect energy usage. The airtightness of the building acts as a proxy for related omitted variables, most obviously the size of the house. When the model was re-estimated including the area of the house, the new coefficient on CFM50 was 0.06 (+/- 0.01) ccf/yr. This new value is consistent with expectations and is statistically significantly different from the initial model's coefficient.

The model with the omitted variable produced a biased coefficient and the confidence interval provided no indication of a potential problem. In fact, the model indicated that the coefficient estimate was very precise, yet it was precisely wrong. This example is simple and perhaps obvious to many readers. Unfortunately, the problem that it illustrates is often not obvious and quite difficult to detect in practical applications of SAE models which are considerably more complex.
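The mechanism can be reproduced with synthetic data. The Python sketch below is not the author's data set: every number in the data-generating process is an assumption, chosen only so that the true CFM50 effect is 0.06 ccf/yr and house area is correlated with airtightness. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (all values are assumptions).
area = rng.normal(1500.0, 400.0, n)               # floor area, sq ft
cfm50 = 1.5 * area + rng.normal(0.0, 500.0, n)    # bigger houses tend to leak more
true_cfm_effect = 0.06                            # ccf/yr per CFM50
usage = (200.0 + true_cfm_effect * cfm50 + 0.15 * area
         + rng.normal(0.0, 100.0, n))             # annual gas usage, ccf

def ols(y, *columns):
    """Least squares fit of y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

short = ols(usage, cfm50)         # house area omitted
full = ols(usage, cfm50, area)    # house area included

print(f"CFM50 coefficient, area omitted:  {short[1]:.3f}")
print(f"CFM50 coefficient, area included: {full[1]:.3f}")
```

With area omitted, the CFM50 coefficient absorbs part of the area effect and lands well above 0.06. Including area brings it back near the value built into the synthetic data.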

Potential problems with model specification threaten the reliability of all regression models. This statement should not be interpreted as suggesting that all SAE models give "bad" answers or that regression should be abandoned or that simple pre/post comparisons will give better answers. The intent is that such model results should be presented with this point in mind and that evaluations which rely solely upon a single regression model coefficient are at risk of providing misleading answers. Quality evaluations are cautious in drawing conclusions about regression coefficients, attempt to estimate savings using multiple approaches, and compare results to related studies.

In addition to potential omitted variable bias, SAE modelers need to be aware of a variety of other specification issues, including collinearity and model selection subjectivity.

Collinearity can cause problems in SAE models when "control" variables accidentally capture program effects. If an SAE model includes a variable (or set of variables) which is strongly related to participation, then such a variable may absorb some of the program impact and reduce the estimated realization rate. Regression models have difficulty fully distinguishing the separate impacts of correlated explanatory variables. In extreme situations (unlikely to occur in most SAE models, but common in CDA models), coefficients are poorly determined because two or more variables are highly correlated.

A variety of approaches are available for identifying such extreme situations (e.g., variance inflation factors, condition indices) and several remedies may be pursued (e.g., dropping a variable, ridge regression).
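As an illustration of one of these diagnostics, the Python sketch below computes variance inflation factors from their textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing explanatory variable j on the others. The function name and the synthetic variables are assumptions made for the example. NumPy is assumed to be available.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (n rows, k columns, no constant column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on a constant plus all of the other columns.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic explanatory variables (illustrative only).
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)   # nearly a copy of a
c = rng.normal(size=n)             # unrelated to a and b

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print("VIFs for a, b, c:", np.round(vifs, 1))
```

The two nearly duplicate variables show very large factors while the unrelated variable stays close to 1. A common rule of thumb treats values above roughly 10 as a sign of serious collinearity, although any such cutoff is a convention rather than a test.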

In the context of SAE models, the problem is usually not as severe, but the impact on savings estimates may be substantial. One approach for detecting potential problems is to examine the correlation matrix of the estimated coefficients. Coefficients which are strongly correlated with the realization rate coefficient may deserve further scrutiny. The related coefficients may not cause a problem and, indeed, are considered quite valuable as they represent the confounding factors which SAE models are meant to control for. However, their impact on savings estimates needs to be assessed and explained.
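The coefficient correlation matrix follows from the usual OLS covariance estimate, s^2 (X'X)^-1, rescaled by the coefficient standard errors. The Python sketch below applies it to a synthetic SAE-style data set in which a hypothetical "control" variable has been constructed to track predicted savings. All variable names and values are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def coefficient_correlations(X, y):
    """OLS fit; returns (beta, correlation matrix of the coefficient estimates)."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (X.shape[0] - X.shape[1])
    cov = sigma2 * xtx_inv                 # classical OLS covariance estimate
    se = np.sqrt(np.diag(cov))
    return beta, cov / np.outer(se, se)

# Synthetic data in MWh (illustrative only).
rng = np.random.default_rng(3)
n = 1000
pre = rng.normal(100.0, 20.0, n)
predicted = rng.normal(15.0, 5.0, n)
# Hypothetical control variable built to be strongly related to predicted savings.
control = 0.9 * (predicted - 15.0) / 5.0 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
post = pre - 0.75 * predicted + 2.0 * control + rng.normal(0.0, 5.0, n)

X = np.column_stack([np.ones(n), pre, predicted, control])
beta, corr = coefficient_correlations(X, post)

names = ["const", "pre", "predicted", "control"]
for name, r in zip(names, corr[2]):
    print(f"corr(b_predicted, b_{name}) = {r:+.2f}")
```

A large off-diagonal entry in the predicted-savings row flags a variable whose inclusion is likely to move the realization rate, which is the cue to assess and explain its effect.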

Another technique, which can help identify which variables most affect the savings estimate and may also provide a better understanding of the model, is to re-fit the model in steps. One can start by fitting the simplest model and then examine how the savings estimate changes as additional terms are added. For example, a model of change in usage with just a constant and a participation dummy variable can provide a baseline equivalent to a simple pre/post analysis.

The participation variable can then be changed to predicted savings with pre-program usage added, then the other variables can be added in order of perceived importance. This exercise can help identify which variables affect impact estimates the most and may help the evaluator to describe what the SAE model actually accomplished. If the savings estimate is stable under a variety of specifications, then the SAE model is not adjusting for sample biases; it is merely attempting to improve precision. If the savings estimate varies dramatically when a particular variable is included, then an explanation can be sought.

For example, if including a variable which is intended to reflect changes in business activity shifts the impact estimate upward, an examination of the data may reveal that the non-participant sample was more likely to be downsizing and therefore their consumption declined at a greater rate than would have happened to participants without the program. While the accuracy of these explanations cannot be tested, the fact that the evaluator can create a sensible narrative which describes how and why the model affected the impact estimates and why the final model is reasonable can provide crucial supporting evidence for their conclusions.

If such a "story" cannot be created, then the evaluator needs to look more closely at the model and perhaps consult with program implementors for potential theories. (Note: fitting a model in steps can be quite sensitive to the order in which the variables are entered, although in the author's experience it is often quite useful for SAE models.)
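The step-wise re-fitting described above can be sketched in Python as follows. The data are synthetic and every parameter is an assumption: the comparison group is given a decline in business activity that participants do not share, and the true realization rate is set to 0.75. NumPy is assumed to be available.

```python
import numpy as np

def ols(y, *columns):
    """Least squares fit of y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic program in MWh (illustrative only).
rng = np.random.default_rng(2)
n = 2000
participant = (rng.random(n) < 0.5).astype(float)
pre = rng.normal(100.0, 20.0, n)
predicted = participant * rng.normal(15.0, 3.0, n)     # zero for non-participants
# Hypothetical activity effect: non-participants are downsizing on average.
activity = rng.normal(0.0, 3.0, n) - 5.0 * (1.0 - participant)
true_rr = 0.75
post = pre - true_rr * predicted + activity + rng.normal(0.0, 5.0, n)

# Step 1: change in usage on a constant and a participation dummy.
step1 = ols(post - pre, participant)
# Step 2: post usage on pre usage and predicted savings.
step2 = ols(post, pre, predicted)
# Step 3: add the business-activity variable.
step3 = ols(post, pre, predicted, activity)

# Savings lower usage, so signs are flipped to report positive savings.
savings_step1 = -step1[1]
rr_step2 = -step2[2]
rr_step3 = -step3[2]

print(f"step 1 savings per participant: {savings_step1:.2f} MWh")
print(f"step 2 realization rate:        {rr_step2:.2f}")
print(f"step 3 realization rate:        {rr_step3:.2f}")
```

In this constructed case the estimate shifts upward only when the activity variable enters, which points the evaluator to the difference in downsizing between the two groups as the explanation to investigate and report.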

In addition to technical problems such as collinearity, SAE modelers need to be aware of potential biases which may be introduced in the model building process. In the course of performing an evaluation, the search for the "best" model is typically a key part of the analysis. The process of fitting and comparing different models is considered by many an "art", which renders it subjective and open to potential manipulation.

Experienced evaluators know that, by choice of model specification, they can usually have a meaningful effect on the impact estimates. Decisions concerning data screening and sample selection can exert similar influence on virtually all analysis methods. Because these threats to unbiased results usually cannot be eliminated, they need to be addressed through reporting. Quality evaluations document key decisions which are likely to affect impact estimates (data screening, sample selection, and model specification choices) and provide a rationale for the particular choices made.

The impact of such decisions on the final results is provided and compared to other reasonable choices that could have been made. The range of values for the realization rate under the different model specifications tested is often a useful part of this reporting.

Heteroscedasticity

Regression models, and particularly the estimated standard errors, rely upon the assumption that the residuals are independently and identically distributed. Two common violations of this assumption are serial correlation and heteroscedasticity. For the typical SAE model based on annual (not monthly) pre- and post-program consumption, serial correlation is not a significant issue. However, heteroscedasticity, which refers to non-constant error variance, is a common problem in SAE models, especially those applied to the commercial and industrial sector.

Because high use buildings tend to have more variable energy usage than low use buildings (in absolute kWh), SAE models which include buildings of widely varying usage levels experience heteroscedasticity. The primary effect attributed to heteroscedasticity is that it biases the estimated standard errors.

However, it can also exacerbate other model problems leading to substantial changes in the estimated realization rate and reduced model accuracy. Heteroscedasticity often reveals itself through large influential outliers because modest usage variations for large facilities appear as tremendous changes in usage compared to the variations seen in the majority of (smaller) buildings in the sample. These high use facilities with large influence may substantially reduce the accuracy of an SAE model, while the standard errors indicate that the model is quite accurate. An example using synthetic data may help demonstrate the potential importance of heteroscedasticity and its relation to outlier problems in SAE models.

Monte Carlo simulations allow one to create a synthetic world where the true answers are known and all sources of variability are specified. Repeated replications of this known world allow one to assess the performance of different statistical estimators under the given assumptions. The author performed a Monte Carlo analysis of a simple commercial DSM program. The mean values and variability in usage, predicted savings, realization rates, and post-program usage were specified as follows: pre-program usage was log-normally distributed (where log(usage-20,000) has mean 4.8 and std. dev. of 0.42), predicted savings averaged 15% of this usage (with 5% std. dev.), the average true realization rate was 75% (with 15% std. dev.), and post-program usage averaged pre-program usage minus true savings with an added random variation of 10% of usage.

These values provide a relatively well-behaved data set with fairly tight distributions and no sources of bias. The log-normal usage distribution leads to a ratio of about 100:1 for largest to smallest usage rate. The 10% random variation in post-program usage is the source of the heteroscedasticity since it makes the standard deviation proportional to usage. This assumption is believed to be more realistic than the constant kWh value assumed by ordinary regression.

The Monte Carlo analysis involved generating the values of all variables using these specifications for each of 1000 buildings with half the buildings randomly declared non-participants. The resulting data set was analyzed using a simple SAE model of post-program usage with pre-program usage and predicted savings as the explanatory variables. A simple pre/post analysis was also performed. This entire 1000-building data generation and analysis process was replicated 1000 times and the resulting 1000 realization rate estimates and confidence intervals were compiled.
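A simulation of this kind can be sketched in Python as shown below. It is a reconstruction under stated assumptions, not the author's code: the logarithm is taken to be base 10 with usage in kWh, non-participants are given zero predicted savings, the pre/post realization rate is the difference in mean change divided by mean predicted savings with an unequal-variance standard error, and only 200 replications are run to keep it quick. NumPy is assumed to be available, and the output should not be expected to match the paper's figures exactly.

```python
import numpy as np

Z90 = 1.645  # two-sided 90% critical value, normal approximation

def one_replication(rng, n=1000, true_rr=0.75):
    """Generate one synthetic program; return (rr_sae, hit_sae, rr_pp, hit_pp)."""
    # Usage in MWh; assumes the log in the text is base 10 and usage is in kWh.
    pre = (20_000 + 10 ** rng.normal(4.8, 0.42, n)) / 1000.0
    part = rng.permutation(n) < n // 2             # random half participate
    predicted = np.where(part, pre * rng.normal(0.15, 0.05, n), 0.0)
    rr = rng.normal(true_rr, 0.15, n)              # building-level realization rates
    post = pre - rr * predicted + rng.normal(0.0, 0.10 * pre)

    # SAE-style model: post ~ constant + pre + predicted, classical OLS errors.
    X = np.column_stack([np.ones(n), pre, predicted])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ post
    resid = post - X @ beta
    se = np.sqrt((resid @ resid) / (n - 3) * xtx_inv[2, 2])
    rr_sae = -beta[2]                              # savings lower usage: flip sign

    # Simple pre/post with comparison group, scaled by mean predicted savings.
    change = post - pre
    saved = change[~part].mean() - change[part].mean()
    se_saved = np.sqrt(change[part].var(ddof=1) / part.sum()
                       + change[~part].var(ddof=1) / (~part).sum())
    mean_pred = predicted[part].mean()
    rr_pp, se_pp = saved / mean_pred, se_saved / mean_pred

    return (rr_sae, abs(rr_sae - true_rr) <= Z90 * se,
            rr_pp, abs(rr_pp - true_rr) <= Z90 * se_pp)

rng = np.random.default_rng(1995)
results = np.array([one_replication(rng) for _ in range(200)], dtype=float)

coverage_sae, coverage_pp = results[:, 1].mean(), results[:, 3].mean()
median_err_sae = np.median(np.abs(results[:, 0] - 0.75))
median_err_pp = np.median(np.abs(results[:, 2] - 0.75))
print(f"SAE:      coverage {coverage_sae:.2f}, median error {median_err_sae:.3f}")
print(f"pre/post: coverage {coverage_pp:.2f}, median error {median_err_pp:.3f}")
```

Coverage is the share of replications whose nominal 90% interval contains the true realization rate, so a value well below 0.90 indicates that the reported standard errors understate the real uncertainty.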

The analysis revealed that the 90% confidence interval from the SAE model included the true realization rate only 35% of the time! The true uncertainty in the SAE realization rate was three and a half times greater than reported. In contrast, the 90% confidence interval from the simple pre/post analysis included the true value 94% of the time. In addition to providing a conservative confidence interval, the simple pre/post analysis proved to be more than twice as accurate at estimating the realization rate as the SAE model (as measured by the median discrepancy between the estimate and the true value, which was 0.09 for the SAE model and 0.04 for the simple pre/post).

Overall, the SAE model claimed to be about twice as precise as the simple pre/post but was only about half as precise. The failure of the SAE model's confidence interval to achieve its nominal coverage is an expected result of heteroscedasticity. The relatively poor accuracy of the SAE model is also due to heteroscedasticity, as the greater absolute usage variations in high use buildings lead to relatively wide fluctuations in the estimated realization rate because of their large influence on the model fit.

This problem also manifested itself through apparent outliers. An average of 20 observations per replication (2% of the sample) had studentized residuals greater than three in absolute value, while one should only expect about 2 such observations in a sample of 1000 (see next section). Additional simulations performed using different specifications (including different usage distributions) found varying but similar results for all but the homoscedastic error case (where the SAE model properly covered its confidence interval and was slightly more accurate than the simple pre/post).

In addition to the Monte Carlo findings, heteroscedasticity can lead to other problems under more complex situations. Heteroscedasticity can be viewed as improper weighting of the observations. If realization rates are thought to vary across facility or measure types and the model is attempting to estimate the "average" rate, then one result of this improper weighting may be incorrect "averaging" of these realization rates. Fixing heteroscedasticity involves downweighting observations with higher variability.

Such fixes may be at odds with efforts to properly weight samples for representativeness. For example, if much of a program's predicted impacts occur in very large facilities then one would want these facilities to have greater weight in the impact estimate. But if usage rates are more variable in large facilities, then correcting for heteroscedasticity would involve downweighting these facilities.

Given the issues described above, principles of sound data analysis dictate testing for heteroscedasticity in all regression models, particularly those involving samples with a large range of usage rates. There are a variety of tests available (e.g., Breusch-Pagan, White, Cook-Weisberg). When applied to C&I SAE models which use ordinary least squares, the tests virtually always indicate a problem. When heteroscedasticity is found, the estimates are suspect and the standard errors are invalid.
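As one illustration of such a test, the Python sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic, LM = n * R^2, where R^2 comes from regressing the squared OLS residuals on the explanatory variables. The synthetic data, variable names, and the choice of this particular variant are assumptions made for the example. NumPy is assumed to be available.

```python
import numpy as np

def studentized_breusch_pagan(X, y):
    """Koenker's studentized Breusch-Pagan statistic, LM = n * R^2.

    X must include a constant column. R^2 comes from regressing the squared
    OLS residuals on X. Under homoscedasticity LM is approximately chi-square
    with (number of columns - 1) degrees of freedom.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return len(y) * (1.0 - ss_res / ss_tot)

# Synthetic usage data (illustrative only).
rng = np.random.default_rng(4)
n = 1000
usage = rng.lognormal(mean=4.0, sigma=0.8, size=n)
X = np.column_stack([np.ones(n), usage])

y_const = 5.0 + 0.9 * usage + rng.normal(0.0, 10.0, n)        # constant error spread
y_prop = 5.0 + 0.9 * usage + rng.normal(0.0, 0.10 * usage)    # spread grows with usage

CHI2_95_1DF = 3.841  # 5% critical value, chi-square with 1 degree of freedom
lm_const = studentized_breusch_pagan(X, y_const)
lm_prop = studentized_breusch_pagan(X, y_prop)
print(f"constant-variance errors: LM = {lm_const:.1f}")
print(f"proportional errors:      LM = {lm_prop:.1f} (critical value {CHI2_95_1DF})")
```

A statistic above the critical value rejects constant error variance. The proportional-error case mirrors the situation described for samples spanning a wide range of usage levels.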

The situation may be improved by respecifying the model, using stratification or weighted least squares, or calculating standard errors that do not depend on homoscedasticity. The rationale for the selected approach needs to be stated and the impact reported.
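Two of these remedies can be sketched in Python: heteroscedasticity-consistent standard errors (the White "HC0" form) and weighted least squares. The data are synthetic, and the weights assume the error standard deviation is proportional to usage, which is an assumption that would need to be justified in a real evaluation. NumPy is assumed to be available.

```python
import numpy as np

def ols_with_errors(X, y):
    """Return (beta, classical SEs, White HC0 robust SEs) for an OLS fit."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    classical = np.sqrt(np.diag(xtx_inv) * (resid @ resid) / (n - k))
    meat = X.T @ (X * resid[:, None] ** 2)          # sum of e_i^2 * x_i x_i'
    robust = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, classical, robust

def wls(X, y, weights):
    """Weighted least squares: minimizes sum(weights * residual**2)."""
    w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

# Synthetic data with error spread proportional to usage (illustrative only).
rng = np.random.default_rng(5)
n = 1000
usage = rng.lognormal(mean=4.0, sigma=0.8, size=n)
y = 5.0 + 0.9 * usage + rng.normal(0.0, 0.10 * usage)
X = np.column_stack([np.ones(n), usage])

beta_ols, se_classical, se_robust = ols_with_errors(X, y)
beta_wls = wls(X, y, 1.0 / usage ** 2)   # weights = 1 / assumed error variance

print(f"OLS slope {beta_ols[1]:.3f}: classical SE {se_classical[1]:.4f}, "
      f"robust SE {se_robust[1]:.4f}")
print(f"WLS slope {beta_wls[1]:.3f}")
```

Robust standard errors leave the coefficient unchanged and only correct the reported uncertainty, whereas weighted least squares changes the estimate itself by downweighting high-usage observations, which is the tension with representativeness weighting noted earlier.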
