Panel Data Analysis Tutorial

Created by Steve Hoover, Modified on Fri, Apr 19 at 1:06 PM by Steve Hoover

Background

Panel data denotes data sets where the same entities (e.g., people, countries, firms, keywords, products) have been observed at multiple times or in different contexts.  Each entity has a unique panel ID, and the multiple observations of the entities could occur across time (say, product sales data obtained in daily intervals), across geographies (say, product sales in different cities), or across other contexts.  The usual structure for panel data is a set of observations of the same entities at different times; such data are referred to as cross-sectional time-series data, where the cross-sections refer to the different entities, and the multiple observations of those entities occur at different times.  Panel data is now readily available in many areas of marketing because of continuous data collection through web logs, sensors, transactions, mobile apps, etc.  Because we have multiple observations of the same entities, more information is available to obtain more precise estimates of the effects of marketing variables.

The use of panel data analytics (also known as panel regression) can help us answer questions such as the following:

  • What is the incremental effect of each keyword on conversions after accounting for the effects of other variables such as number of impressions, clicks, and the average position of the keyword in paid search advertising?
  • What is the effect of advertising on sales after accounting for the effects of unobserved characteristics of each ad copy?

Entering Your Data

You can upload your own data or use one of the pre-loaded data sets associated with the business cases distributed with Enginius. Load your own data by cutting and pasting from Excel or by uploading an Excel file using the “Load” button on your Dashboard. For Enginius to analyze the data, please format your data so that the first column is an ID for each row, and include a column heading for each column. It is important that the data set contains at least 4 variables (columns) in addition to the ID for each row. Below is an example of the format required for Panel Regression analysis. This is the sample OfficeStar data set included when opening the Panel Regression tutorial; it contains a total of 7 variables available for analysis.

It is important that the data set contain a column with the panel variable, which contains a unique numerical or textual identifier for each entity in the data.  Here the panel variable is Keyword, which contains the keywords used in a paid campaign.  There should be at least two observations per panel ID (i.e., per unique keyword), and ideally, the number of panel IDs should be reasonably large, say 10 or more.  The data set should also contain a column for time or replication, which contains a unique numerical or text identifier for each time or replication.  Here the replication variable is Campaign.  The number of replications or time periods per panel ID can be the same (called balanced panel), or they can be different (called unbalanced panel).  Finally, the dependent variable should be such that it could be treated as a continuous variable.  Here either Conversions, Clicks, or Impressions could be used as the dependent variable.  The remaining variables could be the independent variables in the panel regression model.  In a model in which Conversions is the dependent variable, Impressions, Clicks, Cost, and Avg Position or a subset of them could be the independent variables.  In a model in which Impressions is the dependent variable, the independent variables could be Cost and/or Avg Position.  Currently, it is not feasible to analyze two or more panel variables simultaneously. 
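The structural requirements described above (at least two observations per panel ID, ideally 10 or more panel IDs, balanced versus unbalanced panels) can be checked before uploading. The following is a minimal sketch in Python, not part of Enginius; the function name `summarize_panel` and the sample values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def summarize_panel(rows, panel_col):
    """Count observations per panel ID and report basic panel structure."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[panel_col]] += 1
    # Panel IDs with fewer than two observations do not meet the stated minimum.
    too_few = sorted(pid for pid, n in counts.items() if n < 2)
    # A panel is balanced when every panel ID has the same number of observations.
    balanced = len(set(counts.values())) == 1
    return {
        "n_panels": len(counts),
        "counts": dict(counts),
        "too_few": too_few,
        "balanced": balanced,
    }

# Hypothetical rows; the numbers are invented for illustration only.
rows = [
    {"Keyword": "office star", "Campaign": 1, "Conversions": 12},
    {"Keyword": "office star", "Campaign": 2, "Conversions": 15},
    {"Keyword": "office supplies", "Campaign": 1, "Conversions": 3},
    {"Keyword": "office supplies", "Campaign": 2, "Conversions": 9},
    {"Keyword": "office supplies", "Campaign": 3, "Conversions": 1},
]

summary = summarize_panel(rows, "Keyword")
print(summary)
```

In this invented example the two keywords have different numbers of campaigns, so the panel is unbalanced, which the tutorial notes is acceptable.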

Run Analysis

To run panel regression on the data you have loaded, click on the Panel Regression icon on the left side of the Enginius dashboard. The following example shows the data blocks contained in the OfficeStar data set associated with the Panel Regression tutorial.

The above dialog box will allow you to specify the parameters for the analysis you are about to run.

  • Panel data: the name of the data block in Enginius that you will be analyzing.
  • Target variable: allows you to select the dependent variable for your analysis. In our example, we have chosen Conversions as the target.
  • Panel variable: allows you to specify the panel variable containing the IDs for the entities being modeled.  Here, the panel variable is Keyword.
  • Time or Replication variable: allows you to specify the variable containing IDs for the multiple observations pertaining to each entity or panel ID.  Here, the Time or Replication variable is Campaign.

The other variables in the data set will be used as independent variables in the panel regression model. 

Finally, Model allows you to specify the model type that you want to use for the estimation.  Our recommendation is that you first use the Automatic option to determine the statistically most appropriate model for your data.  Once you explore these results, you can then decide to re-run the analysis with your own choice of the Model options.  The three models available are the Pooled OLS model, the Fixed-effects model, and the Random-effects model. These models are described in greater detail in the Appendix.

Next, select the output format for your analysis by clicking on the world icon on the Run button.  If you wish to manipulate the output further or make additional charts from it, select the Microsoft Excel format option.

 

Reminder: Clicking the world icon beside the “Run” option will allow you to choose a different output format for the report.

 

 

Click “Run” to generate the report in the format of your choice.

 

At this stage, if any errors are encountered (for example, if the data or model options are incorrectly specified), you may get an error message.  If you get an error message, please check your data to make sure there are no non-numerical values or other errors where they are not expected (i.e., in variables that are not specified as the Panel variable or the Time/Replication variable), and ensure that the correct set of variables is specified in the Panel Regression dialog box.
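One way to locate unexpected non-numerical values before running the analysis is sketched below in Python. It is an illustration only: the function name `find_non_numeric` and the sample rows are hypothetical, and the check simply assumes that every column other than the ID columns should parse as a number.

```python
def find_non_numeric(rows, id_columns):
    """Return (row index, column, value) for cells that do not parse as numbers."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if column in id_columns:
                continue  # panel and time/replication IDs may be text
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append((index, column, value))
    return problems

# Hypothetical rows as they might look after pasting from a spreadsheet.
rows = [
    {"Keyword": "office star", "Campaign": "1", "Clicks": "120", "Cost": "45.5"},
    {"Keyword": "office supplies", "Campaign": "2", "Clicks": "n/a", "Cost": "30"},
]

problems = find_non_numeric(rows, id_columns={"Keyword", "Campaign"})
print(problems)
```

Each reported tuple points to a cell worth correcting in the source spreadsheet before reloading the data.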

 

Interpreting the Results

Data plots

The first set of outputs simply plots the data for each panel ID.  Only the data for the first 15 panel IDs are plotted.  These plots show the distribution of the dependent variable values (Conversions) observed for each specific panel ID.  If you need to see the plots for other panel IDs, re-order the input data so that the panel IDs of interest to you appear first in the data set.

It may be useful to check the data for panel IDs with a large variance (e.g., “office supplies”) to make sure there are no data errors.  Here, the large variance in the performance of the keyword “office supplies” could be because it is a generic keyword that might attract a broad audience, only some of whom purchase from the company. Further, people searching with this keyword are more likely to explore products from different competitors.  Depending on the specific promotions being offered by competitors at the time a specific campaign is being run, they may choose to purchase from different competitors in different time periods.

Statistics associated with the selected model

The first output is optional and will appear depending on which model is selected by the Automatic modeling option.  Here the Automatic option selected the Random-effects model as the most appropriate one for the OfficeStar data.  The Pooled-OLS model was rejected (i.e., it was inferior to both the Fixed-effects model and the Random-effects model). Then the Hausman test was applied to check whether the Fixed-effects model would be needed instead of the Random-effects model.  Here the p-value of the test is much greater than 0.05, so the null hypothesis that the Random-effects model is appropriate is not rejected.  The preferred model is therefore the Random-effects model.

The next set of results pertains to the estimated coefficients from the selected model.  Here, the results of the Random-effects model are summarized.  Had the Automatic option selected the Pooled-OLS or the Fixed-effects model, those results would be displayed instead.

This model has a good fit (R-squared is significantly higher than 0).  The results suggest that the number of Clicks and the total expenditures (Cost) have a positive influence on generating Conversions, even after accounting for the effects of unobserved characteristics associated with the keywords (Panel IDs). The significant positive coefficients are highlighted in green and the significant negative coefficients are highlighted in red.  Interestingly, for OfficeStar, neither the number of Impressions generated by a keyword nor the Avg Position at which the keyword advertising appeared seems to influence Conversions. (An ad that appears at the top, or first, position has a position value of 1; an ad in the second position has a value of 2, and so on.)

In addition, for the Random-effects model, the output provides the estimated distribution of the random effects across all panel IDs.  Here the distribution is a bit left-skewed.  The two vertical lines correspond to the mean and standard deviation of the distribution. The random effects are the incremental effects on Conversions due to the unobserved characteristics associated with a Panel ID (Keyword).

Sizes of the random or fixed effects associated with each panel ID

The next set of outputs summarizes the sizes of the fixed effects or the random effects, depending on the model selected for analysis (this table will not appear for the Pooled-OLS model).  The reported effects are the deviations for a specific panel ID from the overall average (equal to 0).

We can see that the branded keyword “office star” has a strong positive impact on Conversions over and above the effects due to the number of Clicks garnered by that keyword, or the Cost (amount spent) on that Keyword.  This is to be expected because those searching using a branded keyword are already familiar with the OfficeStar brand and may have favorable impressions about that brand.  On the other hand, the keyword “office star coupons” generates a negative effect on Conversions (after accounting for the effects of the other variables included in the panel regression model).  This result may be surprising and should warrant closer examination.  Perhaps this result occurs because those searching for coupons may be price sensitive and may be disappointed when they find that OfficeStar does not offer any discount coupons, or those searchers may find the prices of products sold by OfficeStar to be at a higher price point than what they expected.   

A related set of outputs plots the random or fixed effects, ordered from the largest to the smallest based on the absolute magnitudes of these effects.  When there are many panel IDs, these plots simplify interpretation by focusing attention on the most important effects.
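The ordering used in these plots can be reproduced on exported results with a simple sort on absolute value. The sketch below uses Python; the effect sizes and the keyword “desk chairs” are invented for illustration and are not the OfficeStar estimates.

```python
# Hypothetical panel-ID effects (deviations from the overall average of 0).
effects = {
    "office star": 4.2,
    "office star coupons": -3.1,
    "office supplies": 0.4,
    "desk chairs": -0.9,
}

# Sort by absolute magnitude, largest first, keeping the sign for interpretation.
ranked = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)

for keyword, effect in ranked:
    print(f"{keyword:22s} {effect:+.1f}")
```

Sorting on the absolute value places large negative effects next to large positive ones, so both kinds of unusual keywords stand out at the top of the list.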

Correlation matrix of coefficients

The final output contains the correlations amongst the coefficients of the independent variables. The values range from 1 for perfectly correlated coefficients to -1 for perfectly negatively correlated coefficients. A value of 0 means the two estimated coefficients are not correlated.

It is useful to check whether the correlations make sense, especially regarding potential multicollinearity, which makes the coefficient estimates unreliable.  Here, there is a strong positive correlation between the independent variables Clicks and Cost, which can be computed to be equal to 0.92.  This correlation makes sense: the data is from a paid search campaign where the costs are incurred through a “pay-per-click” model, which means the company incurs higher costs when there are more clicks.  But the correlation between their coefficient estimates is negative.  This result is somewhat surprising but is likely to be due to branded keywords costing less money but generating more clicks and conversions, compared to unbranded keywords that cost more (i.e., generate more clicks) but result in fewer conversions.  A similar process may be at play regarding the positive correlation between Avg Position and Cost: this company is spending money on competitive keywords, thereby incurring higher costs (i.e., the company needs to bid higher amounts for the same position), which do not necessarily result in a higher placement in the listing (note that the lower the Avg Position, the better the placement of the ad on the paid listing).
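The paragraph above distinguishes two different quantities: the correlation between two independent variables in the data, and the correlation between their estimated coefficients. The Python sketch below illustrates both calculations under simple assumptions. The function names and all numbers are hypothetical; the coefficient covariance matrix is invented rather than taken from the OfficeStar output.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two data columns of equal length."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def cov_to_corr(cov):
    """Convert a coefficient covariance matrix into a correlation matrix."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [
        [cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
        for i in range(len(cov))
    ]

# Correlation between two data columns (invented values).
clicks = [10, 20, 30, 40]
cost = [5.0, 9.0, 16.0, 20.0]
data_corr = pearson(clicks, cost)

# Correlation between two coefficient estimates (invented covariance matrix).
coef_cov = [[4.0, -3.0],
            [-3.0, 9.0]]
coef_corr = cov_to_corr(coef_cov)

print(round(data_corr, 3), coef_corr[0][1])
```

As in the discussion above, the two variables can be positively correlated in the data while their coefficient estimates are negatively correlated.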

 

 

Appendix

Panel data analytics is a vast topic that could (and should) be a separate course by itself.  For our purpose, we can think of panel data analytics as a sophisticated version of the linear regression model.  Let $y_{it}$ be the observed value of the dependent variable (e.g., Sales, number of conversions) for entity i at time (or replication) t.  Using standard notation, we represent the panel model estimated by Enginius as:

$$y_{it} = \alpha_i + X_{it}\beta + \varepsilon_{it}$$

where $X_{it}$ denotes a set of independent variables associated with entity i when it is observed in context or replication t, $\beta$ are the coefficients that denote the effects of the X variables on y, $\alpha_i$ is an entity-specific additive effect on y, and $\varepsilon_{it}$ denotes the error term.  Note that $\alpha_i$ does not vary with t and that $\beta$ is the same for all entities i and replications t.  The main difference of this model from a standard regression model is that $\alpha_i$ represents an entity-specific unobserved characteristic (e.g., the unknown or hidden characteristic of a keyword, or the hidden talent or personality of an individual), which makes $\alpha_i$ a random variable.  If we can obtain statistically valid estimates for $\alpha_i$, that will not only provide us useful information about those entity-specific effects, but it will also ensure that the estimated effects of the X variables (i.e., $\beta$) are also statistically valid.  This is the primary benefit of using panel data models instead of the standard regression model: the ability to account for, and measure, entity-specific effects that cannot be detected by the standard regression model.  The potential complexities associated with estimating the above panel regression model have to do with the assumptions we make about the nature of the errors ($\varepsilon_{it}$).  For our purpose, we will consider three possibilities for estimating $\alpha_i$ with the usual simplifying assumptions about $\varepsilon_{it}$.  For further technical details, especially for advanced users, please review the appropriate chapters in Greene (2017, Chapter 11), Pesaran (2015, Chapter 26), or other econometric textbooks.

(1) Pooled regression model:  Here $\alpha_i$ has the same value for all entities (i.e., $\alpha_i = \alpha$ for all entities i), and the error term $\varepsilon_{it}$ satisfies all the requirements of the Ordinary Least Squares (OLS) model for each entity.  In this case, the results obtained from model estimation will be comparable to those obtained from OLS; any differences in the results are likely to be minor, and occur because of the more robust ways the errors are handled due to the panel structure of the data.

(2) Fixed effects model: This model allows $\alpha_i$ to be correlated with $X_{it}$ and still provides statistically valid estimates of $\alpha_i$ and $\beta$.  This is a critical feature of the Fixed Effects model in that it allows for the possibility that the unobserved hidden characteristics of an entity ($\alpha_i$) could influence the observed characteristics $X_{it}$, and for the two of them together to jointly determine the dependent variable $y_{it}$.  The key downside of this model is that it does not allow the estimation of the effects of any observed entity-specific characteristics that do not vary across t.

(3) Random effects model: This model requires $\alpha_i$ to be uncorrelated with $X_{it}$ but still provides statistically valid estimates of $\alpha_i$ and $\beta$.  If this requirement is met, this model offers two advantages over the fixed effects model: (i) it allows for estimation of the effects of observed entity-specific characteristics that do not vary across t, and (ii) the estimates have higher statistical efficiency (i.e., they are estimated with greater precision than in an equivalent fixed-effects model).
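To make the difference between the pooled and fixed effects models concrete, the following Python sketch works through a tiny synthetic example. It assumes a single independent variable x and a model of the form y = (entity effect) + (slope) × x with no noise. The function names and the data are hypothetical and do not reproduce the Enginius implementation, which relies on the plm package in R. The fixed effects slope is obtained here by the standard within transformation: subtract each entity's own mean from x and y, then run OLS on the demeaned values.

```python
from collections import defaultdict

def pooled_slope(data):
    """OLS slope of y on x, ignoring which entity each row belongs to."""
    xs = [x for _, x, _ in data]
    ys = [y for _, _, y in data]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def within_estimator(data):
    """Fixed effects (within) slope and per-entity intercepts, one regressor."""
    groups = defaultdict(list)
    for entity, x, y in data:
        groups[entity].append((x, y))
    # Entity-level means of x and y.
    means = {
        e: (sum(x for x, _ in obs) / len(obs), sum(y for _, y in obs) / len(obs))
        for e, obs in groups.items()
    }
    # OLS on deviations from each entity's own means.
    sxy = sxx = 0.0
    for e, obs in groups.items():
        x_bar, y_bar = means[e]
        for x, y in obs:
            sxy += (x - x_bar) * (y - y_bar)
            sxx += (x - x_bar) ** 2
    beta = sxy / sxx
    # Entity intercepts implied by the common slope.
    alphas = {e: y_bar - beta * x_bar for e, (x_bar, y_bar) in means.items()}
    return beta, alphas

# Synthetic data: y = effect + 2 * x, with effect 10 for entity A and 0 for B.
# Entity A has low x values and a high effect, so the effect is correlated with x.
data = [("A", 1, 12), ("A", 2, 14), ("A", 3, 16),
        ("B", 6, 12), ("B", 7, 14), ("B", 8, 16)]

pooled = pooled_slope(data)
beta, alphas = within_estimator(data)

# Assumption for illustration: report effects as deviations from their simple mean.
mean_alpha = sum(alphas.values()) / len(alphas)
deviations = {e: a - mean_alpha for e, a in alphas.items()}

print(round(pooled, 3), beta, alphas, deviations)
```

In this constructed case the pooled slope is far below the true value of 2 because the entity effects are correlated with x, while the within estimator recovers both the slope and the entity effects.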

The Enginius software is structured so that one of the options available is for the software to automatically determine which of the three models would be the most appropriate one to use for a given data set. Here is one way to think about how to choose between the three model options. The most parsimonious and simplest option is the pooled regression model (equivalent to ordinary least squares regression), if that applies to the data set, as determined from the appropriate statistical test. The next most parsimonious is the random-effects model, as determined by the structure of the data set and the appropriate statistical test.  An advantage of the random-effects model is that you can estimate the effects of observed independent variables within the model that do not vary across time or replications (e.g., race, education level, risk tolerance, the state or country where a firm is incorporated, etc.). The most comprehensive model is the fixed-effects model, if it is appropriate for the data set, again as determined from the structure of the data and the appropriate statistical tests. However, with the fixed-effects model we cannot estimate the effects of independent variables that do not vary across time or replications, because the effects of those independent variables are absorbed into the fixed effects.

Technical details about the Panel Data Analytics used in Enginius

The main outputs of panel data analysis are generated using the plm package available in R.  The Automatic option first tests the null hypothesis that the Pooled OLS model fits the data best by executing a pooling test.  If the poolability hypothesis is rejected, then the Automatic option tests whether a random-effects model is appropriate by applying the Hausman test.  If the null hypothesis of the Hausman test is rejected, then the Fixed-effects model is recommended.  Otherwise, the Automatic option recommends the Random-effects model.  For the fixed-effects model, the fixed effects are obtained via the fixef function; for the random-effects model, the random effects are obtained via the ranef function.  For the fixed-effects model, the sum of the fixed effects is set to 0 for estimation.
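The decision sequence described above can be summarized as a small function. This is a sketch in Python of the logic as stated in this section, not the Enginius source code. The function name, the argument names, and the example p-values are hypothetical, and the 0.05 threshold is assumed from the significance level mentioned in the results discussion.

```python
def choose_model(pooling_p_value, hausman_p_value, alpha=0.05):
    """Pick a panel model from the p-values of the two tests described above."""
    # Step 1: pooling test. Null hypothesis: the Pooled OLS model is adequate.
    if pooling_p_value >= alpha:
        return "Pooled OLS"
    # Step 2: Hausman test. Null hypothesis: the Random-effects model is appropriate.
    if hausman_p_value < alpha:
        return "Fixed effects"
    return "Random effects"

# Hypothetical p-values: poolability rejected, Hausman null not rejected.
print(choose_model(pooling_p_value=0.001, hausman_p_value=0.62))
```

With these invented p-values the function returns the Random-effects model, the same pattern of results described for the OfficeStar example.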

 

References

Greene, William H. (2017), Econometric Analysis (Eighth edition), New York, Pearson.

Pesaran, M. Hashem (2015), Time Series and Panel Data Econometrics, Oxford, Oxford University Press.  
