Predictive Modeling Tutorial - Appendix: Variable Transformation Techniques

Created by Steve Hoover, Modified on Tue, Jan 14 at 2:28 PM by Steve Hoover

Box-Cox transform the predictors

In predictive models, the distribution of some variables can be highly skewed. The number of past transactions or purchases per customer is a typical example: many customers have made just 1 purchase, some have made around 10, and a handful have made 100 or more. The same problem often arises with purchase amounts, income, etc.

Since many predictive models (such as linear and logistic regressions) work best when predictors and target variables follow a more Normal-like distribution, the Box-Cox transformation rescales skewed variables so that their distribution becomes more symmetric.

A Box-Cox transformation will automatically transform a variable X into a new variable Y. Even though there is an assignment from every X to Y (i.e., X -> Y), the same may not be true for Y -> X. For this reason, while a Box-Cox transform can be applied to predictors, it cannot be applied to the target variable. For target variables, only log-transforms are available.
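To illustrate how a skewed predictor becomes more balanced, here is a minimal sketch of the textbook one-parameter Box-Cox transform, Y = (X^λ - 1)/λ for λ ≠ 0 and Y = ln(X) for λ = 0, with λ chosen by a simple grid search on the Normal log-likelihood. The function names, the λ grid, and the sample purchase counts are illustrative assumptions; how Enginius selects λ internally is not described here.

```python
import math

def box_cox(values, lam):
    """Textbook Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are Normal."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid (an assumption) that maximizes the likelihood."""
    grid = grid or [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))

# Hypothetical purchase counts: mostly 1-3, some around 10, a handful above 100.
purchases = [1] * 12 + [2] * 6 + [3] * 4 + [10] * 5 + [12] * 2 + [100, 150]
lam = best_lambda(purchases)
transformed = box_cox(purchases, lam)

print("lambda:", lam)
print("skewness before:", round(skewness(purchases), 2))
print("skewness after:", round(skewness(transformed), 2))
```

The skewness of the transformed values should be much closer to zero than that of the raw counts, which is the sense in which the variable becomes "more balanced".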

Log transform the target variable

When the target variable is Continuous or Discrete-continuous, the log transformation can be used to make highly skewed distributions less skewed. This can make patterns in the data easier to interpret.
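As a small illustration, the sketch below log-transforms a skewed target and maps values back to the original scale with the exponential. It assumes strictly positive target values and uses the natural logarithm; the function names and sample amounts are hypothetical, and whether Enginius applies an offset for zero values is not stated here.

```python
import math

def log_transform(target_values):
    """Natural-log transform; assumes every target value is strictly positive."""
    if any(v <= 0 for v in target_values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in target_values]

def back_transform(log_values):
    """Map values on the log scale back to the original scale."""
    return [math.exp(v) for v in log_values]

# Hypothetical purchase amounts with one very large value.
amounts = [5.0, 20.0, 35.0, 120.0, 2500.0]
log_amounts = log_transform(amounts)
recovered = back_transform(log_amounts)

print([round(v, 2) for v in log_amounts])
print([round(v, 2) for v in recovered])
```

On the log scale, the gap between the largest amount and the rest shrinks considerably, and the exponential recovers the original amounts exactly.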

Cross-validation

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. In the case of a 10-fold cross-validation, for instance, the model is estimated on 90% of the data set and tested on the remaining 10%. The operation is repeated 10 times, with a different test set each time, so that every subsample serves as the test set exactly once.
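The partitioning described above can be sketched as follows. This is a generic k-fold split over row indices, not Enginius's own code; the function name, the fixed random seed, and the sample size of 100 are assumptions made for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 10-fold cross-validation on a hypothetical data set of 100 rows.
splits = list(k_fold_splits(100, 10))
for fold_number, (train, test) in enumerate(splits, start=1):
    # A model would be estimated on `train` and evaluated on `test` here.
    print(fold_number, len(train), len(test))
```

Each of the 10 iterations uses 90 rows for estimation and 10 rows for testing, and every row appears in exactly one test set.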

Continuous elasticities of the Conditional Logit Model

(Menu option: “Choice between multiple alternatives, one line per alternative (0/0/1)”).

Elasticities are computed analytically using the formula η_ijk = β̂_j X_ijk (1 − P_ik), where i represents a choice set (or customer), j represents a variable (for example, price), and k represents a choice alternative (for example, a brand). η_ijk is the elasticity denoting the % change in choice probability for alternative k in choice set i for a 1% increase in variable j. β̂_j is the estimated coefficient corresponding to variable j, X_ijk is the value of variable j for alternative k in choice set i, and P_ik is the estimated choice probability for alternative k in choice set i. Enginius averages the elasticities across the choice sets and reports the average elasticities, η̄_jk = (1/N) Σ_i η_ijk, where N is the number of choice sets. These averages are on the diagonals of the elasticity matrices, with one matrix for each variable j.

Cross elasticities are computed using the formula η_ijkh = −β̂_j X_ijh P_ih, where k and h denote two different choice alternatives. Here, η_ijkh is the cross elasticity denoting the % change in the choice probability of alternative k when the value X_ijh of variable j for alternative h in choice set i changes by 1%. Note that η_ijkh is the same for all k when a variable corresponding to another alternative h changes. This is a manifestation of what is referred to as the IIA (Independence of Irrelevant Alternatives) property of the multinomial logit model. Enginius averages the cross elasticities across the choice sets and reports the averages, η̄_jkh = (1/N) Σ_i η_ijkh, where N is the number of choice sets. These averages are the off-diagonal elements in the elasticity matrices, with one matrix for each variable j.
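To make the structure of the elasticity matrices concrete, the sketch below builds one matrix for a single variable (price) and averages it over choice sets. It uses the cross-elasticity formula given above together with the standard conditional logit own elasticity β̂_j X_ijk (1 − P_ik) on the diagonal. The coefficient, the prices, the single-variable utility, and the row/column orientation (row k = alternative whose probability changes, column h = alternative whose variable changes) are assumptions chosen for illustration, not Enginius output.

```python
import math

def choice_probabilities(utilities):
    """Conditional logit probabilities within one choice set (softmax of utilities)."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def elasticity_matrix(beta_j, x_j, probs):
    """matrix[k][h]: % change in P_k for a 1% increase in variable j of alternative h."""
    n_alt = len(probs)
    matrix = [[0.0] * n_alt for _ in range(n_alt)]
    for k in range(n_alt):
        for h in range(n_alt):
            if k == h:
                # Own elasticity (diagonal).
                matrix[k][h] = beta_j * x_j[k] * (1.0 - probs[k])
            else:
                # Cross elasticity (off-diagonal); does not depend on k.
                matrix[k][h] = -beta_j * x_j[h] * probs[h]
    return matrix

def average_matrices(matrices):
    """Element-wise average of the per-choice-set matrices."""
    n = len(matrices)
    n_alt = len(matrices[0])
    return [[sum(m[k][h] for m in matrices) / n for h in range(n_alt)]
            for k in range(n_alt)]

# Hypothetical inputs: one price coefficient, two choice sets, three brands each.
beta_price = -0.8
prices_by_choice_set = [[2.0, 2.5, 3.0], [1.5, 2.5, 2.0]]

matrices = []
for prices in prices_by_choice_set:
    # Assumption: utility depends on price only, to keep the example short.
    probs = choice_probabilities([beta_price * p for p in prices])
    matrices.append(elasticity_matrix(beta_price, prices, probs))

avg = average_matrices(matrices)
for row in avg:
    print([round(v, 3) for v in row])
```

With a negative price coefficient, the diagonal entries are negative (a brand loses share when its own price rises) and the off-diagonal entries are positive. Within each column, all off-diagonal entries are identical, which is the IIA pattern described above.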