Source can be viewed on GitHub
Setup and loading the data
Begin by loading the data used for the visualizations in part 1.
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
source("Rscript1b - derived variables.R")
The data frame mderived now contains the fields that may be useful for modeling.
Quick review
My goal is to find a way to accurately forecast fundraising revenue. The current model – “JMod” – discounts solicitations to a fixed percentage of their expected value based on their current stage and how long it is until the end of the fiscal year.
| Stage     |     |     |     |     |
|-----------|-----|-----|-----|-----|
| Plan      | 1/6 | 1/8 | 0   | 0   |
| Clear     | 1/3 | 1/6 | 1/8 | 0   |
| Ask       | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral      | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork | 1   | 1   | 1   | 1   |
This makes use of three pieces of information: the current solicitation stage \(S_{i}\), today’s date \(t_{i}\), and the expected close date.
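To make the lookup concrete, here is a minimal sketch of JMod-style discounting in R. The stage-by-bucket factors are copied from the table above; the name JModDiscount and the idea of indexing columns by a time bucket are illustrative assumptions, not the actual JMod code.
# Minimal sketch of JMod-style discounting (illustrative, not the production code)
# Rows are solicitation stages; the four columns are the time buckets from the
# table above, in the order shown
jmod.table <- rbind(
  Plan      = c(1/6, 1/8, 0,   0),
  Clear     = c(1/3, 1/6, 1/8, 0),
  Ask       = c(2/3, 1/3, 1/6, 1/8),
  Oral      = c(2/3, 1/3, 1/6, 1/8),
  Paperwork = c(1,   1,   1,   1)
)
# Hypothetical helper: discount an expected amount given stage and time bucket
JModDiscount <- function(stage, bucket, expected.amt) {
  expected.amt * jmod.table[stage, bucket]
}
JModDiscount("Ask", 2, 50000)  # a $50,000 ask in the second bucket, about $16,667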
Last time, I ran the data through random forests and found several additional predictive variables: Solicitation.Type.Desc, Planned.Amt, Expected.Amt, and Ask.Amt.
Logistic regression, take two
Main effects model, raw data
What happens if we throw all the main effects in without any transformation?
# Plan model, main effects only
glm.pln <- glm(FY.Plan.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Plan.Future + planning.fiscal.mo, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr <- glm(FY.Clear.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Clear.Future + clear.fiscal.mo, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask <- glm(FY.Ask.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Ask.Future + ask.fiscal.mo, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora <- glm(FY.Oral.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Oral.Future + oral.fiscal.mo, data=mderived, family=binomial())
# Put together the error rates
modnames <- c("Plan", "Clear", "Ask", "Oral")
confuse <- ConfusionMatrix(list(glm.pln, glm.clr, glm.ask, glm.ora), modnames=modnames, counts=F)
# Row to add to the error table
dat1 <- CalcErrorRates(confuse, model.name="GLM main effects notrans", modnames)
# Print the error table
kable(rbind(errors, dat1), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
That’s a shockingly large improvement whether we use the original GLM or JMod as the baseline. These in-sample error rates aren’t directly comparable to the out-of-bag Random forest 3 estimates, though, so I’ll use a cross-validated error estimate after I’m happy with the final model form.
Variable distributions
Dollar amounts
Before looking at the data, I know that Planned.Amt, Expected.Amt, and Ask.Amt are continuous variables theoretically bounded by \([0,\infty)\), but I expect them to be very unevenly distributed – because of solicitation strategy they will tend to cluster around nice, round numbers like $100, $25,000, and so on. To maintain interpretability I don’t want to use complex transformations (mixture or spline bases), and because the lower bound is exactly 0, some simple transformations, like taking logarithms directly, won’t work.
Here’s how the raw data look:
There’s a very small number of very large ask amounts. It’s a little hard to see, so here’s what the data look like on a \(\log_{10}\) scale. So that we don’t have to drop the 0s, define the transformation:
\[ x^\text{*} = \log_{10}(x + \alpha)\]
The smallest non-zero ask amount is:
(alpha <- min(na.omit(mderived$Ask.Amt[mderived$Ask.Amt != 0])))
## [1] 100
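As a quick sanity check on the transformation (the dollar amounts here are just illustrative round numbers), $0 maps to 2, $100 to about 2.3, and $25,000 to about 4.4:
# Illustrative: the transformation at a few round amounts, with alpha = 100
round(log10(c(0, 100, 25000) + alpha), 2)
## [1] 2.0 2.3 4.4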
Plotting the data under this transformation.
That’s actually not too far from symmetric.
Interactions
Are there interactions worth looking at between Solicitation.Type.Desc and the others?
table(mderived$Solicitation.Type.Desc)
##
## Bequest Expectancy Combination Deferred Gift
## 20 2 5
## Grant Outright Gift Standard Pledge
## 2 501 779
Okay, Solicitation.Type.Desc clearly needs to be aggregated, so let’s go with Other, Outright Gift, and Standard Pledge. By default, the alphabetically first factor level is the baseline under the treatment constraint, so the model will show contrasts between Other and Outright Gift, and Other and Standard Pledge, which is nice for interpretability.
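The aggregated variable Sol.Type.Agg is used in the models below; its construction isn’t shown in this post, but a minimal sketch of the recoding (an illustration, not necessarily the exact code in the derived-variables script) might look like:
# Illustrative recoding: collapse the rare solicitation types into "Other"
sol.type.agg.sketch <- with(mderived, factor(
  ifelse(Solicitation.Type.Desc %in% c("Outright Gift", "Standard Pledge"),
         as.character(Solicitation.Type.Desc),
         "Other")))
# "Other" sorts first alphabetically, so it becomes the baseline level
levels(sol.type.agg.sketch)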
Look at Ask.Amt by Sol.Type.Agg:
Other and Standard Pledge look more nearly flat than Outright Gift, so it could be worth checking out.
Also check ask.fiscal.mo by Sol.Type.Agg:
Outright gift asks have more success by far in December. Could be worth checking out.
Based on previous findings I expect the Annual Fund group is mostly responsible for plummeting closed solicitation rates toward the end of the fiscal year. Let’s check.
Confirmed – the AF group gives on their schedule, not ours.
Additional considerations – does it make sense to remove 0s before the \(\log_{10}\) transformation as opposed to adding a constant \(\alpha\)?
- Planned.Amt = $0 is an error
- Ask.Amt = $0 is an error
- Expected.Amt = $0 is not an error; it reflects what the solicitation manager thought was most likely at the time
I think it makes sense to keep the $0 values and transform them appropriately. If necessary, Ask.Amt = $0 can be filled in with Planned.Amt and vice versa.
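If that fill-in becomes necessary, a minimal sketch (a hypothetical helper, not part of the current pipeline) could look like this:
# Illustrative fill-in: where one amount is recorded as $0 but the other is
# positive, borrow the other field's value; NAs are left untouched
FillZero <- function(x, y) {ifelse(!is.na(x) & x == 0 & !is.na(y) & y > 0, y, x)}
mderived.filled <- mderived %>%
  mutate(Ask.Amt.fill = FillZero(Ask.Amt, Planned.Amt),
         Planned.Amt.fill = FillZero(Planned.Amt, Ask.Amt))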
Main effects model with transformed data
# Function to transform dollar amounts as above
LogTrans <- function(x, a=alpha) {log10(x + a)}
# Function to turn fiscal month into a factor
FacMonth <- function(x) {factor(x, labels=fiscal.month.name)}
# New variables
mderived <- mderived %>% mutate(
  lt.Planned.Amt = LogTrans(Planned.Amt),
  lt.Expected.Amt = LogTrans(Expected.Amt),
  lt.Ask.Amt = LogTrans(Ask.Amt),
  fm.planning = FacMonth(planning.fiscal.mo),
  fm.clear = FacMonth(clear.fiscal.mo),
  fm.ask = FacMonth(ask.fiscal.mo),
  fm.oral = FacMonth(oral.fiscal.mo))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future + fm.planning, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future + fm.clear, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future + fm.ask, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future + fm.oral, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat2 <- CalcErrorRates(confuse, model.name="GLM main effects trans", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
| GLM main effects trans   | 0.20 | 0.25  | 0.24 | 0.08 |
Interestingly, this gives essentially the same error rates as the untransformed data. Let’s look at each of these models.
summary(glm.pln.t)
##
## Call:
## glm(formula = FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt +
## lt.Expected.Amt + Plan.Future + fm.planning, family = binomial(),
## data = mderived)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2149 -0.4128 -0.2284 0.7647 2.8061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.19175 0.77209 1.544 0.122699
## Sol.Type.AggOutright Gift 0.44469 0.55453 0.802 0.422601
## Sol.Type.AggStandard Pledge 0.11605 0.53804 0.216 0.829234
## lt.Planned.Amt -0.95618 0.19441 -4.918 8.72e-07 ***
## lt.Expected.Amt 0.71511 0.18283 3.911 9.18e-05 ***
## Plan.FutureTRUE -3.54825 0.23961 -14.809 < 2e-16 ***
## fm.planningAug 1.06579 0.37654 2.830 0.004648 **
## fm.planningSep 1.24399 0.37043 3.358 0.000784 ***
## fm.planningOct 0.26686 0.31737 0.841 0.400419
## fm.planningNov 0.48326 0.34860 1.386 0.165666
## fm.planningDec 0.33666 0.33551 1.003 0.315668
## fm.planningJan -0.09131 0.37938 -0.241 0.809806
## fm.planningFeb 1.35247 0.44837 3.016 0.002558 **
## fm.planningMar 1.33966 0.41448 3.232 0.001229 **
## fm.planningApr 0.38552 0.39483 0.976 0.328854
## fm.planningMay 0.64025 0.41574 1.540 0.123552
## fm.planningJun 0.57245 0.43996 1.301 0.193206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1792.3 on 1308 degrees of freedom
## Residual deviance: 1110.8 on 1292 degrees of freedom
## AIC: 1144.8
##
## Number of Fisher Scoring iterations: 5
Planned and Expected seem to cancel each other out to some degree, but just including one increases the error rate:
# Error comparison
rbind(dat2, CalcErrorRates(ConfusionMatrix(list(
update(glm.pln.t, . ~ . -lt.Expected.Amt),
update(glm.clr.t, . ~ . -lt.Expected.Amt),
update(glm.ask.t, . ~ . -lt.Expected.Amt),
update(glm.ora.t, . ~ . -lt.Expected.Amt)
), modnames=modnames, counts=F
), model.name="No Expected.Amt", modnames))
## Model Plan Clear Ask Oral
## 1 GLM main effects trans 0.198 0.253 0.242 0.075
## 2 No Expected.Amt 0.199 0.269 0.255 0.075
Month, it turns out, doesn’t make much of a difference in the end either (the next table confirms this), which is intellectually unsatisfying, but perhaps an inevitable result of optimizing for probability to close (too noisy). The final models can drop the month factors.
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat3 <- CalcErrorRates(confuse, model.name="GLM trans no month", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2, dat3), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
| GLM main effects trans   | 0.20 | 0.25  | 0.24 | 0.08 |
| GLM trans no month       | 0.21 | 0.25  | 0.25 | 0.07 |
Cross-validated error estimate
In-sample error estimates are biased, so let’s see what the 5-fold cross-validated error is. Note that it’s important that the coefficients be re-estimated separately for each training set, so the held-out observations never influence their own predictions.
## Randomly shuffle the data
set.seed(57141)
k <- 5
rand.ind <- sample(1:nrow(mderived), nrow(mderived))
n.size <- ceiling(nrow(mderived)/k)
folds <- list()
# Grab the next n.size shuffled row indices; na.omit drops the out-of-range
# (NA) indices that appear in the last, shorter fold
for (i in 1:k) {folds[[i]] <- na.omit(rand.ind[(n.size * (i - 1) + 1):(n.size * i)])}
## Cross-validation
# Vector to store the errors
error.list <- NULL
for (i in 1:k) {
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived[-folds[[i]],], family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived[-folds[[i]],], family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived[-folds[[i]],], family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived[-folds[[i]],], family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived[folds[[i]],], modnames=modnames, counts=T)
# Row to add to the error table
error.list <- rbind(error.list, CalcErrorRates(confuse, model.name="GLM trans no month, xval", modnames))
}
## Add averaged error to the table
# Because the fold-level tables hold counts (counts=T above), summing across
# folds and dividing by the number of modeled observations gives the
# cross-validated error rate
dat4 <- error.list %>% group_by(Model) %>%
  summarise(Plan = sum(Plan)/nrow(model.matrix(glm.pln)),
            Clear = sum(Clear)/nrow(model.matrix(glm.clr)),
            Ask = sum(Ask)/nrow(model.matrix(glm.ask)),
            Oral = sum(Oral)/nrow(model.matrix(glm.ora)))
kable(rbind(errors, dat3, dat4), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM trans no month       | 0.21 | 0.25  | 0.25 | 0.07 |
| GLM trans no month, xval | 0.21 | 0.26  | 0.25 | 0.08 |
The cross-validated error comes out essentially the same as the in-sample estimate.
Of course, these error rates and models need to be compared to JMod in its native environment, namely predicted dollar amounts.
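As a preview of how that comparison might work (a sketch under assumptions, not the comparison itself), a natural way to turn the GLM output into dollars is to weight each solicitation’s expected amount by its predicted probability of booking this fiscal year, and then line the total up against JMod’s forecast:
# Sketch only: probability-weighted dollar forecast from the ask-stage model.
# predict() gives each solicitation's fitted probability of booking this FY;
# weighting Expected.Amt by that probability yields an expected-revenue total.
p.ask <- predict(glm.ask.t, newdata = mderived, type = "response")
forecast.ask <- sum(p.ask * mderived$Expected.Amt, na.rm = TRUE)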
Packages used
session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.0 (2016-05-03)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Chicago
## date 2016-06-03
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
## DBI 0.4-1 2016-05-08 CRAN (R 3.3.0)
## devtools * 1.11.1 2016-04-21 CRAN (R 3.3.0)
## digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.3.0)
## evaluate 0.9 2016-04-29 CRAN (R 3.3.0)
## formatR 1.4 2016-05-09 CRAN (R 3.3.0)
## ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.3.0)
## gridExtra * 2.2.1 2016-02-29 CRAN (R 3.3.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## highr 0.6 2016-05-09 CRAN (R 3.3.0)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## knitr * 1.13 2016-05-09 CRAN (R 3.3.0)
## labeling 0.3 2014-08-23 CRAN (R 3.3.0)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.3.0)
## lubridate * 1.5.6 2016-04-06 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.3.0)
## R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
## Rcpp 0.12.4 2016-03-26 CRAN (R 3.3.0)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.3.0)
## rmarkdown 0.9.6 2016-05-01 CRAN (R 3.3.0)
## scales * 0.4.0 2016-02-26 CRAN (R 3.3.0)
## stringi 1.0-1 2015-10-22 CRAN (R 3.3.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.3.0)
## tidyr * 0.4.1 2016-02-05 CRAN (R 3.3.0)
## withr 1.0.1 2016-02-04 CRAN (R 3.3.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)