Panel Data Analysis Tutorial - Appendix

Created by Steve Hoover, Modified on Thu, Dec 12, 2024 at 10:34 AM by Steve Hoover

Technical Notes

Panel data analytics is a vast topic that could (and should) be a separate course by itself. For our purposes, we can think of panel data analytics as a sophisticated version of the linear regression model. Let y_it be the observed value of the dependent variable (e.g., sales or number of conversions) for entity i at time (or replication) t. Using standard notation, we represent the panel model estimated by Enginius as:

y_it = X'_itβ + c_i + ε_it

Where X_it denotes a set of independent variables associated with different entities when they are observed in different contexts or replications t, β are the coefficients that denote the effects of the X variables on y, c_i is an entity-specific additive effect on y, and ε_it denotes the error term. Note that c_i does not vary with t and that β is the same for all entities i and replications t.

The main difference between this model and a standard regression model is that c_i represents an entity-specific unobserved characteristic (e.g., the unknown or hidden characteristic of a keyword, or the hidden talent or personality of an individual), which makes c_i a random variable. If we can obtain statistically valid estimates for c_i, they not only provide useful information about those entity-specific effects but also ensure that the estimated effects of the X variables (i.e., β) are statistically valid. This is the primary benefit of using panel data models instead of the standard regression model: the ability to account for, and measure, entity-specific effects that cannot be detected by the standard regression model.

The potential complexities associated with estimating the above panel regression model have to do with the assumptions we make about the nature of the errors (ε_it). For our purposes, we will consider three possibilities for estimating c_i with the usual simplifying assumptions about ε_it. For further technical details, especially for advanced users, please review the appropriate chapters in Greene (2017, Chapter 11), Pesaran (2015, Chapter 26), or other econometric textbooks.
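
As a rough illustration of the model equation, the following sketch generates a small panel in "long" format (one row per entity and time period) with a single independent variable. The function name simulate_panel, the parameter values, and the normal distributions are assumptions chosen for the example; they are not part of Enginius.

```python
import random


def simulate_panel(n_entities=3, n_periods=4, beta=2.0, seed=0):
    """Generate rows from y_it = beta * x_it + c_i + eps_it (one regressor)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_entities):
        # The entity effect is drawn once per entity, so it is constant over t.
        c_i = rng.gauss(0.0, 1.0)
        for t in range(n_periods):
            x_it = rng.gauss(0.0, 1.0)    # observed independent variable
            eps_it = rng.gauss(0.0, 0.1)  # idiosyncratic error term
            y_it = beta * x_it + c_i + eps_it  # same beta for every i and t
            rows.append({"entity": i, "time": t, "x": x_it, "y": y_it, "c": c_i})
    return rows


panel = simulate_panel()
for row in panel[:4]:
    print(row)
```

In real data the column c is not observed; it is kept here only to make visible that c_i stays fixed within an entity while x and y change with t.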

Pooled regression model: Here c_i has the same value for all entities (i.e., c_i = c for all entities i), and the error term satisfies all the requirements of the Ordinary Least Squares (OLS) model for each entity. In this case, the results obtained from model estimation will be comparable to those obtained from OLS; any differences are likely to be minor and arise from the more robust way the errors are handled given the panel structure of the data.
Fixed effects model: This model allows c_i to be correlated with X_it while still providing statistically valid estimates of c_i and β. This is a critical feature of the fixed effects model: it allows for the possibility that the unobserved characteristics of an entity influence the observed characteristics X_it, and that the two together determine the dependent variable y_it. The key downside of this model is that it cannot estimate the effects of any observed entity-specific characteristics X_i that do not vary across t.
Random effects model: This model requires c_i to be uncorrelated with X_it in order to provide statistically valid estimates of c_i and β. If this requirement is met, this model offers two advantages over the fixed effects model: (i) it allows estimation of the effects of observed entity-specific characteristics X_i that do not vary across t, and (ii) the estimates have higher statistical efficiency (i.e., they are estimated with greater precision than in an equivalent fixed effects model).
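
The difference between the pooled and fixed effects models can be sketched with a small constructed data set in which c_i is deliberately correlated with the single independent variable x. The sketch below assumes the fixed effects model is estimated with the within (entity-demeaning) estimator, a standard textbook approach; the data values and the helper names ols_slope, entity_means, and within_slope are hypothetical. The random effects estimator is not shown.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs (with intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def entity_means(rows):
    """Return {entity: (mean of x, mean of y)} for rows of (entity, x, y)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for i, x, y in rows:
        sums[i][0] += x
        sums[i][1] += y
        sums[i][2] += 1
    return {i: (sx / n, sy / n) for i, (sx, sy, n) in sums.items()}


def within_slope(rows):
    """Fixed effects (within) slope: regress demeaned y on demeaned x."""
    means = entity_means(rows)
    xs = [x - means[i][0] for i, x, _ in rows]
    ys = [y - means[i][1] for i, _, y in rows]
    return ols_slope(xs, ys)


# Constructed example: x_it = c_i + t, so x is strongly correlated with c_i.
TRUE_BETA = 2.0
true_effects = {0: 0.0, 1: 10.0, 2: 20.0}
rows = [(i, c + t, TRUE_BETA * (c + t) + c)
        for i, c in true_effects.items() for t in range(4)]

beta_pooled = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])
beta_within = within_slope(rows)

# Recover each entity effect as (mean of y) - beta * (mean of x).
estimated_effects = {i: my - beta_within * mx
                     for i, (mx, my) in entity_means(rows).items()}

print("pooled slope:", beta_pooled)
print("within slope:", beta_within)
print("estimated entity effects:", estimated_effects)
```

In this constructed data set the within slope equals the true value of 2 and the entity effects are recovered, while the pooled slope is pulled upward because x and c_i move together.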

The Enginius software includes an option to automatically determine which of the three models is most appropriate for a given data set. Here is one way to think about choosing between the three. The simplest and most parsimonious option is the pooled regression model (equivalent to ordinary least squares regression), if the appropriate statistical test indicates that it applies to the data set. The next most parsimonious is the random effects model, if it is supported by the structure of the data set and the appropriate statistical test. An advantage of the random effects model is that it can estimate the effects of observed independent variables that do not vary across time or replications (e.g., race, education level, risk tolerance, or the state or country where a firm is incorporated). The most comprehensive model is the fixed effects model, if it is appropriate for the data set, again as determined from the structure of the data and the appropriate statistical tests. With the fixed effects model, however, we cannot estimate the effects of independent variables that do not vary across time or replications; the effects of those variables are absorbed into the fixed effects c_i.
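
One way to see why time-invariant variables are absorbed into the fixed effects is the within transformation, a standard textbook device for fixed effects estimation (shown here as an illustration, not as a description of Enginius internals). Suppose the model also contained time-invariant observed variables, written here as Z_i with coefficients γ; both symbols are introduced only for this sketch. Subtracting each entity's average over t gives:

```latex
\begin{aligned}
y_{it} &= X_{it}'\beta + Z_i'\gamma + c_i + \varepsilon_{it} \\
\bar{y}_i &= \bar{X}_i'\beta + Z_i'\gamma + c_i + \bar{\varepsilon}_i \\
y_{it} - \bar{y}_i &= (X_{it} - \bar{X}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)
\end{aligned}
```

Both c_i and Z_i'γ are constant over t, so they cancel in the last line. The transformed equation identifies β but carries no information about γ, and the combined term Z_i'γ + c_i is what the estimated fixed effect for entity i picks up.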

Technical Details about the Panel Data Analytics used in Enginius

The main outputs of panel data analysis are generated using the plm package available in R. The Automatic option first runs a pooling test of the null hypothesis that the pooled OLS model is adequate for the data. If poolability is not rejected, the pooled regression model is retained. If it is rejected, the Automatic option applies the Hausman test to check whether a random effects model is appropriate. If the Hausman null hypothesis (that c_i is uncorrelated with X_it, so the random effects estimates are consistent) is rejected, the fixed effects model is recommended; otherwise, the Automatic option recommends the random effects model. For the fixed effects model, the fixed effects are obtained via the fixef function; for the random effects model, the random effects are obtained via the ranef function. For the fixed effects model, the sum of the fixed effects is set to 0 for estimation.
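
The selection sequence described above can be sketched as a small decision function that takes the p-values of the two tests as inputs. This is only an outline of the logic: the function names, the 0.05 significance level, and the assumption that the pooled model is kept when poolability is not rejected are illustrative choices, not documented Enginius behavior. The second helper shows one simple way to express a set of fixed effects so that they sum to 0, by subtracting their mean.

```python
def choose_model(p_pooling, p_hausman, alpha=0.05):
    """Pick a panel model from two test p-values (alpha is an assumed level)."""
    if p_pooling >= alpha:
        # Poolability not rejected: keep the pooled regression model.
        return "pooled"
    if p_hausman < alpha:
        # Hausman null rejected: c_i appears correlated with X_it.
        return "fixed"
    # Hausman null not rejected: random effects is consistent and more efficient.
    return "random"


def center_effects(effects):
    """Shift entity effects so that they sum to zero."""
    mean = sum(effects.values()) / len(effects)
    return {entity: value - mean for entity, value in effects.items()}


print(choose_model(p_pooling=0.40, p_hausman=0.01))
print(choose_model(p_pooling=0.01, p_hausman=0.01))
print(choose_model(p_pooling=0.01, p_hausman=0.40))

centered = center_effects({"a": 1.0, "b": 2.0, "c": 6.0})
print(centered)
```

Note that the Hausman p-value is only consulted after poolability has been rejected, which mirrors the order of the tests in the description above.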

References

Greene, William H. (2017), Econometric Analysis (Eighth edition), New York, Pearson.

Pesaran, M. Hashem (2015), Time Series and Panel Data Econometrics, Oxford, Oxford University Press.