Source can be viewed on GitHub
Setup and loading the data
Begin by loading the data used for the visualizations in part 1.
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
source("Rscript1b - derived variables.R")
The data frame mderived now contains the fields that may be useful for modeling.
Quick review
My goal is to find a way to accurately forecast fundraising revenue. The current model – “JMod” – discounts solicitations to a fixed percentage of their expected value based on their current stage and how long it is until the end of the fiscal year.
| Stage     |     |     |     |     |
|-----------|-----|-----|-----|-----|
| Plan      | 1/6 | 1/8 | 0   | 0   |
| Clear     | 1/3 | 1/6 | 1/8 | 0   |
| Ask       | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral      | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork | 1   | 1   | 1   | 1   |
This makes use of three pieces of information: the current solicitation stage \(S_{i}\), today’s date \(t_{i}\), and the expected close date.
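To make the lookup concrete, here is a minimal sketch of JMod-style discounting in R. The stage-by-bucket factors are copied from the table above; the name JModDiscount and the idea of indexing columns by a time bucket are illustrative assumptions, not the actual JMod code.
# Minimal sketch of JMod-style discounting (illustrative, not the production code)
# Rows are solicitation stages; the four columns are the time buckets from the
# table above, in the order shown
jmod.table <- rbind(
  Plan      = c(1/6, 1/8, 0,   0),
  Clear     = c(1/3, 1/6, 1/8, 0),
  Ask       = c(2/3, 1/3, 1/6, 1/8),
  Oral      = c(2/3, 1/3, 1/6, 1/8),
  Paperwork = c(1,   1,   1,   1)
)
# Hypothetical helper: discount an expected amount given stage and time bucket
JModDiscount <- function(stage, bucket, expected.amt) {
  expected.amt * jmod.table[stage, bucket]
}
JModDiscount("Ask", 2, 50000)  # a $50,000 ask in the second bucket, about $16,667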
Last time, I ran the data through random forests and found several additional predictive variables: Solicitation.Type.Desc, Planned.Amt, Expected.Amt, and Ask.Amt.
Logistic regression, take two
Main effects model, raw data
What happens if we throw all the main effects in without any transformation?
# Plan model, main effects only
glm.pln <- glm(FY.Plan.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Plan.Future + planning.fiscal.mo, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr <- glm(FY.Clear.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Clear.Future + clear.fiscal.mo, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask <- glm(FY.Ask.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Ask.Future + ask.fiscal.mo, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora <- glm(FY.Oral.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Oral.Future + oral.fiscal.mo, data=mderived, family=binomial())
# Put together the error rates
modnames <- c("Plan", "Clear", "Ask", "Oral")
confuse <- ConfusionMatrix(list(glm.pln, glm.clr, glm.ask, glm.ora), modnames=modnames, counts=F)
# Row to add to the error table
dat1 <- CalcErrorRates(confuse, model.name="GLM main effects notrans", modnames)
# Print the error table
kable(rbind(errors, dat1), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
That’s a shockingly large improvement whether we use the original GLM or JMod as the baseline. These in-sample error rates aren’t directly comparable to the out-of-bag Random forest 3 estimates, though, so I’ll use a cross-validated error estimate after I’m happy with the final model form.
Variable distributions
Dollar amounts
Before looking at the data, I know that Planned.Amt, Expected.Amt, and Ask.Amt are continuous variables theoretically bounded by \([0,\infty)\), but I expect them to be very unevenly distributed – because of solicitation strategy they will tend to cluster around nice, round numbers like $100, $25,000, and so on. To maintain interpretability I don’t want to use complex transformations (mixture or spline bases), and because the lower bound is exactly 0, some simple transformations, like taking logarithms directly, won’t work.
Here’s how the raw data look:
There’s a very small number of very large ask amounts. It’s a little hard to see, so here’s what the data look like on a \(\log_{10}\) scale. So that we don’t have to drop the 0s, define the transformation:
\[ x^\text{*} = \log_{10}(x + \alpha)\]
The smallest non-zero ask amount is:
(alpha <- min(na.omit(mderived$Ask.Amt[mderived$Ask.Amt != 0])))
## [1] 100
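As a quick sanity check on the transformation (the dollar amounts here are just illustrative round numbers), $0 maps to 2, $100 to about 2.3, and $25,000 to about 4.4:
# Illustrative: the transformation at a few round amounts, with alpha = 100
round(log10(c(0, 100, 25000) + alpha), 2)
## [1] 2.0 2.3 4.4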
Plotting the data under this transformation.
That’s actually not too far from symmetric.
Interactions
Are there interactions worth looking at between Solicitation.Type.Desc and the others?
table(mderived$Solicitation.Type.Desc)
##
## Bequest Expectancy Combination Deferred Gift
## 20 2 5
## Grant Outright Gift Standard Pledge
## 2 501 779
Okay, Solicitation.Type.Desc clearly needs to be aggregated, so let’s go with Other, Outright Gift, and Standard Pledge. By default, the alphabetically first factor level is the baseline under the treatment constraint, so the model will show contrasts between Other and Outright Gift, and Other and Standard Pledge, which is nice for interpretability.
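The aggregated variable Sol.Type.Agg is used in the models below; its construction isn’t shown in this post, but a minimal sketch of the recoding (an illustration, not necessarily the exact code in the derived-variables script) might look like:
# Illustrative recoding: collapse the rare solicitation types into "Other"
sol.type.agg.sketch <- with(mderived, factor(
  ifelse(Solicitation.Type.Desc %in% c("Outright Gift", "Standard Pledge"),
         as.character(Solicitation.Type.Desc),
         "Other")))
# "Other" sorts first alphabetically, so it becomes the baseline level
levels(sol.type.agg.sketch)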
Look at Ask.Amt by Sol.Type.Agg:
Other and Standard Pledge look more nearly flat than Outright Gift, so it could be worth checking out.
Also check ask.fiscal.mo by Sol.Type.Agg:
Outright gift asks have more success by far in December. Could be worth checking out.
Based on previous findings I expect the Annual Fund group is mostly responsible for plummeting closed solicitation rates toward the end of the fiscal year. Let’s check.
Confirmed – the AF group gives on their schedule, not ours.
Additional considerations – does it make sense to remove 0s before the \(\log_{10}\) transformation as opposed to adding a constant \(\alpha\)?
- Planned.Amt = $0 is an error
- Ask.Amt = $0 is an error
- Expected.Amt = $0 is not an error; it reflects what the solicitation manager thought was most likely at the time
I think it makes sense to keep the $0 values and transform them appropriately. If necessary, Ask.Amt = $0 can be filled in with Planned.Amt and vice versa.
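If that fill-in becomes necessary, a minimal sketch (a hypothetical helper, not part of the current pipeline) could look like this:
# Illustrative fill-in: where one amount is recorded as $0 but the other is
# positive, borrow the other field's value; NAs are left untouched
FillZero <- function(x, y) {ifelse(!is.na(x) & x == 0 & !is.na(y) & y > 0, y, x)}
mderived.filled <- mderived %>%
  mutate(Ask.Amt.fill = FillZero(Ask.Amt, Planned.Amt),
         Planned.Amt.fill = FillZero(Planned.Amt, Ask.Amt))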
Main effects model with transformed data
# Function to transform dollar amounts as above
LogTrans <- function(x, a=alpha) {log10(x + a)}
# Function to turn fiscal month into a factor
FacMonth <- function(x) {factor(x, labels=fiscal.month.name)}
# New variables
mderived <- mderived %>% mutate(
  lt.Planned.Amt = LogTrans(Planned.Amt),
  lt.Expected.Amt = LogTrans(Expected.Amt),
  lt.Ask.Amt = LogTrans(Ask.Amt),
  fm.planning = FacMonth(planning.fiscal.mo),
  fm.clear = FacMonth(clear.fiscal.mo),
  fm.ask = FacMonth(ask.fiscal.mo),
  fm.oral = FacMonth(oral.fiscal.mo))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future + fm.planning, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future + fm.clear, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future + fm.ask, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future + fm.oral, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat2 <- CalcErrorRates(confuse, model.name="GLM main effects trans", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
| GLM main effects trans   | 0.20 | 0.25  | 0.24 | 0.08 |
Interestingly, this gives essentially the same error rates as the untransformed data. Let’s look at each of these models.
summary(glm.pln.t)
##
## Call:
## glm(formula = FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt +
## lt.Expected.Amt + Plan.Future + fm.planning, family = binomial(),
## data = mderived)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2149 -0.4128 -0.2284 0.7647 2.8061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.19175 0.77209 1.544 0.122699
## Sol.Type.AggOutright Gift 0.44469 0.55453 0.802 0.422601
## Sol.Type.AggStandard Pledge 0.11605 0.53804 0.216 0.829234
## lt.Planned.Amt -0.95618 0.19441 -4.918 8.72e-07 ***
## lt.Expected.Amt 0.71511 0.18283 3.911 9.18e-05 ***
## Plan.FutureTRUE -3.54825 0.23961 -14.809 < 2e-16 ***
## fm.planningAug 1.06579 0.37654 2.830 0.004648 **
## fm.planningSep 1.24399 0.37043 3.358 0.000784 ***
## fm.planningOct 0.26686 0.31737 0.841 0.400419
## fm.planningNov 0.48326 0.34860 1.386 0.165666
## fm.planningDec 0.33666 0.33551 1.003 0.315668
## fm.planningJan -0.09131 0.37938 -0.241 0.809806
## fm.planningFeb 1.35247 0.44837 3.016 0.002558 **
## fm.planningMar 1.33966 0.41448 3.232 0.001229 **
## fm.planningApr 0.38552 0.39483 0.976 0.328854
## fm.planningMay 0.64025 0.41574 1.540 0.123552
## fm.planningJun 0.57245 0.43996 1.301 0.193206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1792.3 on 1308 degrees of freedom
## Residual deviance: 1110.8 on 1292 degrees of freedom
## AIC: 1144.8
##
## Number of Fisher Scoring iterations: 5
Planned and Expected seem to cancel each other out to some degree, but just including one increases the error rate:
# Error comparison
rbind(dat2, CalcErrorRates(ConfusionMatrix(list(
update(glm.pln.t, . ~ . -lt.Expected.Amt),
update(glm.clr.t, . ~ . -lt.Expected.Amt),
update(glm.ask.t, . ~ . -lt.Expected.Amt),
update(glm.ora.t, . ~ . -lt.Expected.Amt)
), modnames=modnames, counts=F
), model.name="No Expected.Amt", modnames))
## Model Plan Clear Ask Oral
## 1 GLM main effects trans 0.198 0.253 0.242 0.075
## 2 No Expected.Amt 0.199 0.269 0.255 0.075
Month, it turns out, doesn’t make much of a difference in the end either (the next table confirms this), which is intellectually unsatisfying, but perhaps an inevitable result of optimizing for probability to close (too noisy). The final models can drop the month factors.
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat3 <- CalcErrorRates(confuse, model.name="GLM trans no month", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2, dat3), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM main effects notrans | 0.20 | 0.26  | 0.25 | 0.07 |
| GLM main effects trans   | 0.20 | 0.25  | 0.24 | 0.08 |
| GLM trans no month       | 0.21 | 0.25  | 0.25 | 0.07 |
Cross-validated error estimate
In-sample error estimates are biased, so let’s see what the 5-fold cross-validated error is. Note that it’s important that the coefficients be re-estimated separately for each training set, so the held-out observations never influence their own predictions.
## Randomly shuffle the data
set.seed(57141)
k <- 5
rand.ind <- sample(1:nrow(mderived), nrow(mderived))
n.size <- ceiling(nrow(mderived)/k)
folds <- list()
# Grab the next n.size shuffled row indices; na.omit drops the out-of-range
# (NA) indices that appear in the last, shorter fold
for (i in 1:k) {folds[[i]] <- na.omit(rand.ind[(n.size * (i - 1) + 1):(n.size * i)])}
## Cross-validation
# Vector to store the errors
error.list <- NULL
for (i in 1:k) {
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived[-folds[[i]],], family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived[-folds[[i]],], family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived[-folds[[i]],], family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived[-folds[[i]],], family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived[folds[[i]],], modnames=modnames, counts=T)
# Row to add to the error table
error.list <- rbind(error.list, CalcErrorRates(confuse, model.name="GLM trans no month, xval", modnames))
}
## Add averaged error to the table
# Because the fold-level tables hold counts (counts=T above), summing across
# folds and dividing by the number of modeled observations gives the
# cross-validated error rate
dat4 <- error.list %>% group_by(Model) %>%
  summarise(Plan = sum(Plan)/nrow(model.matrix(glm.pln)),
            Clear = sum(Clear)/nrow(model.matrix(glm.clr)),
            Ask = sum(Ask)/nrow(model.matrix(glm.ask)),
            Oral = sum(Oral)/nrow(model.matrix(glm.ora)))
kable(rbind(errors, dat3, dat4), digits=2)
| Model                    | Plan | Clear | Ask  | Oral |
|--------------------------|------|-------|------|------|
| GLM                      | 0.34 | 0.41  | 0.42 | 0.14 |
| JMod                     | 0.40 | 0.46  | 0.40 | 0.43 |
| Random forest 3          | 0.17 | 0.21  | 0.23 | 0.10 |
| GLM trans no month       | 0.21 | 0.25  | 0.25 | 0.07 |
| GLM trans no month, xval | 0.21 | 0.26  | 0.25 | 0.08 |
The cross-validated error comes out essentially the same as the in-sample estimate.
Of course, these error rates and models need to be compared to JMod in its native environment, namely predicted dollar amounts.
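As a preview of how that comparison might work (a sketch under assumptions, not the comparison itself), a natural way to turn the GLM output into dollars is to weight each solicitation’s expected amount by its predicted probability of booking this fiscal year, and then line the total up against JMod’s forecast:
# Sketch only: probability-weighted dollar forecast from the ask-stage model.
# predict() gives each solicitation's fitted probability of booking this FY;
# weighting Expected.Amt by that probability yields an expected-revenue total.
p.ask <- predict(glm.ask.t, newdata = mderived, type = "response")
forecast.ask <- sum(p.ask * mderived$Expected.Amt, na.rm = TRUE)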
Packages used
session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.0 (2016-05-03)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Chicago
## date 2016-06-03
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
## DBI 0.4-1 2016-05-08 CRAN (R 3.3.0)
## devtools * 1.11.1 2016-04-21 CRAN (R 3.3.0)
## digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.3.0)
## evaluate 0.9 2016-04-29 CRAN (R 3.3.0)
## formatR 1.4 2016-05-09 CRAN (R 3.3.0)
## ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.3.0)
## gridExtra * 2.2.1 2016-02-29 CRAN (R 3.3.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## highr 0.6 2016-05-09 CRAN (R 3.3.0)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## knitr * 1.13 2016-05-09 CRAN (R 3.3.0)
## labeling 0.3 2014-08-23 CRAN (R 3.3.0)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.3.0)
## lubridate * 1.5.6 2016-04-06 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.3.0)
## R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
## Rcpp 0.12.4 2016-03-26 CRAN (R 3.3.0)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.3.0)
## rmarkdown 0.9.6 2016-05-01 CRAN (R 3.3.0)
## scales * 0.4.0 2016-02-26 CRAN (R 3.3.0)
## stringi 1.0-1 2015-10-22 CRAN (R 3.3.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.3.0)
## tidyr * 0.4.1 2016-02-05 CRAN (R 3.3.0)
## withr 1.0.1 2016-02-04 CRAN (R 3.3.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)