Source can be viewed on GitHub.

Problem statement

My goal is to find a way to accurately forecast fundraising revenue.

The existing model (“JMod”) has two components. First, the probability that a solicitation \(S_{i}\) comes in before the end of the fiscal year is a function of the current stage progress \(c_{i}\) and the time remaining until the end of the year \(t\), \(P(S_{i}=1)=f(c_{i},t)\). Second, the expected FRP in the current fiscal year is this probability times the expected amount, \(E(FRP)=P(S_{i}=1)E(S_{i})\).
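
As a concrete illustration, JMod’s fiscal-year forecast is just the sum of probability-weighted expected amounts over the open solicitations. A minimal sketch, using a made-up data frame open_sols with hypothetical columns p_close (the output of \(f(c_{i},t)\)) and Expected.Amt:

#### JMod-style expected FRP rollup (sketch; open_sols and p_close are hypothetical) ----
open_sols <- data.frame(
  p_close      = c(0.80, 0.50, 0.20),    # P(S_i = 1) from f(c_i, t)
  Expected.Amt = c(100000, 50000, 25000)
)
# E(FRP) = sum over i of P(S_i = 1) * E(S_i)
sum(open_sols$p_close * open_sols$Expected.Amt)
## [1] 110000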

Focus on the probability model. I assume that there should also be some effect due to the average close rate \(\mu\), season \(t_{i}^{*}\) (e.g. people give more in December), planned ask amount \(a_{j}^{*}\), actual ask amount \(a_{j}\), and potentially interactions. Something like this:

\[P(S_{ij}=1)=f(c_{i},t,\mu,t_{i}^{*},a_{j}^{*},a_{j})\]

It makes sense to start with a few straightforward classifiers, but I’d also like to look into nesting models.
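
As a concrete starting point, a first-pass classifier could be a plain logistic regression on roughly those terms. A rough sketch against the cleaned data file defined below; months.to.FYE (time remaining in the fiscal year) and ask.month (season) are hypothetical derived columns, not fields in mdat:

#### First-pass logistic classifier (sketch; months.to.FYE and ask.month are hypothetical) ----
fit.glm <- glm(FY.Ask.Book ~ months.to.FYE + factor(ask.month) +
                 log10(Planned.Amt) + log10(Ask.Amt),
               data = mdat, family = binomial())  # assumes FY.Ask.Book is a 0/1 or logical indicator
summary(fit.glm)
predict(fit.glm, type = "response")  # fitted P(S_ij = 1) for each solicitation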

Visualizations

The dataset includes all closed (successful = booked, or unsuccessful = not booked) solicitations with an add date on or after 7/1/2011. Data is obtained from the first tab of the “Booth solicitation history” report (saved as a .csv).

#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
#### Run Rscript1 to load data ----
source("Rscript1 - data load and transform.R")

Begin with some visualizations of time in each stage:

There’s extremely obvious bimodality in most cases. Note in particular that while PG (defined as $5M+) is skewed positive overall, there seem to be two distinct groups: takes a while and takes a long while. The purple “Days from ask to actual” is about as perfect a mixture of two \(N(\mu_{k},\sigma^{2}_{k})\) as I’ve ever seen. Might be interesting to see if a certain population just takes longer in general.
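
For reference, that two-group structure could be checked directly by fitting a two-component normal mixture with EM. A rough sketch using the mixtools package, which is an assumed extra dependency (it is not among the packages loaded above):

#### Two-component normal mixture on ask-to-actual days (sketch; mixtools assumed) ----
library(mixtools)
x <- na.omit(mdat$ask2actual)
x <- x[x > 0]                      # drop negative data-issue values
fit.mix <- normalmixEM(x, k = 2)   # EM fit of a 2-component Gaussian mixture
summary(fit.mix)                   # component means, sds, and mixing proportions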

There isn’t much data in the oral pledge stage so things are a bit bumpier here.

Here’s the cleaned-up data file:

#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
colnames(mdat)
##  [1] "Solicitation.ID"        "Solicitation.Type.Desc"
##  [3] "Final.Sol.Stage.Dt"     "Final.Sol.Stage"       
##  [5] "Booked"                 "Final.Sol.Mgr"         
##  [7] "Solicit.Dt.Added"       "Planned.Dt"            
##  [9] "Planned.Amt"            "Expected.Dt"           
## [11] "Expected.Amt"           "Planning.Dt"           
## [13] "Clear.Dt"               "Ask.Dt"                
## [15] "Ask.Amt"                "Oral.Dt"               
## [17] "Actual.Dt"              "Actual.Amt"            
## [19] "plan2actual"            "clear2actual"          
## [21] "ask2actual"             "oral2actual"           
## [23] "FY.plan"                "FY.clear"              
## [25] "FY.ask"                 "FY.oral"               
## [27] "FY.Plan.Book"           "FY.Clear.Book"         
## [29] "FY.Ask.Book"            "FY.Oral.Book"

Booked indicates whether the solicitation came in at \(>\$0\). FY.plan, FY.clear, FY.ask, and FY.oral indicate whether the solicitation closed in the same fiscal year in which it reached that stage, while the corresponding FY.[Stage].Book indicates that it was booked in the same fiscal year it reached that stage. Combinations of these indicators will be the dependent variables. plan2actual, clear2actual, ask2actual, and oral2actual are measured in days. Planned.Dt is the planned ask date, while Planning.Dt is the date the solicitation entered the planning stage. The other covariates should be self-explanatory.

Variable exploration

Dependent variables

Do I want to model time to close from each current stage \(c_{i}\) as opposed to \(P(S_{ij}=1)\)?

        plan2actual  clear2actual  ask2actual  oral2actual
Min.            1.0         -8.25      -35.21       -33.25
1st Qu.        45.0         28.75       22.75        12.75
Median        126.0         91.75       66.75        30.79
Mean          195.8        142.96      114.12        64.33
3rd Qu.       283.0        209.79      162.79        75.75
Max.         1426.0       1426.75      919.75       919.75
NA's                          116         172          947

Negative values indicate a data issue or update. Decidedly right-skewed. Try a square root transformation:

Not bad. How’s log look?

Might be a good idea to try e.g. Box-Cox transformations to see how they compare.
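
One way to compare them is to profile the Box-Cox log-likelihood over \(\lambda\) (a value near 0.5 favors the square root, near 0 the log). A sketch using MASS, which ships with R but isn’t loaded above; nonpositive durations are dropped first:

#### Box-Cox profile for ask-to-actual days (sketch; MASS assumed) ----
library(MASS)
y <- na.omit(mdat$ask2actual)
y <- y[y > 0]                                      # Box-Cox needs strictly positive values
boxcox(lm(y ~ 1), lambda = seq(-1, 1, by = 0.1))   # plots the profile log-likelihood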

Data counts - do I have enough to do what I want? Is it remotely reasonable to split by month or quarter?

Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun
Plan 83 98 115 159 106 125 82 65 100 133 137 106
Clear 64 82 142 164 139 145 92 76 79 79 72 59
Ask 50 65 90 133 163 156 74 71 118 67 78 72
Oral 28 11 21 34 47 37 33 25 18 33 37 38

Peak ask occurs around Nov-Dec and Mar each year. Not nearly as much oral pledge data; usually oral to booked is fast enough that the date isn’t captured in the system. (JMod rolls ask and oral together.)
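
For reference, a table like the one above can be rebuilt from the stage dates with lubridate (already loaded); I’m assuming here that the Plan row is keyed on Planning.Dt and that the stage dates are stored as Date values:

#### Stage counts by calendar month, July-June order (sketch) ----
fy.months <- c("Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
               "Jan", "Feb", "Mar", "Apr", "May", "Jun")
month.ct <- function(d) table(factor(month(d, label = TRUE, abbr = TRUE), levels = fy.months))
rbind(Plan  = month.ct(mdat$Planning.Dt),
      Clear = month.ct(mdat$Clear.Dt),
      Ask   = month.ct(mdat$Ask.Dt),
      Oral  = month.ct(mdat$Oral.Dt))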

Finally, let’s take a look at the booked versus declined solicitations by stage by fiscal year closed.

Confidence bands are based on the standard error of a binomial proportion, \(\sqrt{\frac{p(1-p)}{n}}\) (assuming independence).
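
As a quick worked example of the band calculation (made-up counts, not the figure data; \(\pm 2\) standard errors is roughly a 95% band):

#### Binomial standard error band (sketch with illustrative counts) ----
booked <- 62; n <- 100
p.hat  <- booked / n
se     <- sqrt(p.hat * (1 - p.hat) / n)
c(lower = p.hat - 2 * se, estimate = p.hat, upper = p.hat + 2 * se)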

This is fascinating; a solicitation entering plan or clear in January, or plan in April-July, is more likely to be refused than booked. Is there a specific year driving that?

Note that this is fiscal, not calendar, year. FY2011 didn’t have much stage progress, as it is the first year of data.

Looks like 2013 was a particularly good year, but it’s normal to have lots of duds in those winter months.

Discrete covariates

I’m thinking time to close is not the natural outcome measure; a model that predicts time to close can’t be dropped in directly for the corresponding probability term in JMod. Recall from above:

For the probability model, I assume that there should also be some effect due to the average close rate \(\mu\), season \(t_{i}^{*}\) (e.g. people give more in December), planned ask amount \(a_{j}^{*}\), actual ask amount \(a_{j}\), and potentially interactions. Something like this:

\[P(S_{ij}=1)=f(c_{i},t,\mu,t_{i}^{*},a_{j}^{*},a_{j})\]

Take a look at each of these additional variables against Booked.

Ask Amount

Histogram and smoothed density estimate. The close rate is above 50% at every level, but it is poorest between the $100k and $1M levels. There’s obvious nonlinearity; perhaps a cubic polynomial would be a good approximation?
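
A quick way to try that, sketched as a logistic fit with an orthogonal cubic polynomial in log ask amount (the log10 scale and treating Booked as a 0/1 indicator are assumptions):

#### Cubic polynomial in log10 ask amount (sketch) ----
fit.cubic <- glm(Booked ~ poly(log10(Ask.Amt), 3),
                 data = subset(mdat, !is.na(Ask.Amt) & Ask.Amt > 0),
                 family = binomial())
summary(fit.cubic)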

Ask Amount/Planned Amount

I think differences in planned ask versus actual ask amounts might indicate trouble. How does the ratio of actual ask over planned ask look?

Too choppy to really see much of anything; most solicitations are asked at the planned amount and there are so few in the tails that there isn’t much interesting to say. What if we just look at a few binned thresholds?

If someone asks for slightly more than the planned amount, it’s a good sign that the talks were going well; otherwise there doesn’t seem to be much to see here. The 25%-50% bin might also be statistically significant; I don’t know about the others given the small sample sizes.
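
For reference, a sketch of how such bins could be built from the ratio of actual to planned ask with dplyr (the breakpoints here are illustrative, not the exact bins in the figure):

#### Close rate by binned ask-to-plan ratio (sketch; breakpoints are illustrative) ----
mdat %>%
  mutate(ask.ratio = Ask.Amt / Planned.Amt,
         ratio.bin = cut(ask.ratio, breaks = c(0, 0.75, 1, 1.25, 1.5, Inf), right = FALSE)) %>%
  filter(!is.na(ratio.bin)) %>%
  group_by(ratio.bin) %>%
  summarise(n = n(),
            close.rate = mean(Booked, na.rm = TRUE))  # assumes Booked is 0/1 or logical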

Stage progression within a single year

Given that a solicitation reaches a certain stage, how likely is it to close that year, whether or not it’s booked?

Stage Closed Count Proportion
Plan FALSE 553 0.42
Plan TRUE 756 0.58
Clear FALSE 331 0.28
Clear TRUE 862 0.72
Ask FALSE 284 0.25
Ask TRUE 853 0.75
Oral FALSE 60 0.17
Oral TRUE 302 0.83

Ask and Clear are closer than I would have thought. What if we instead look at whether they were both closed and booked that year?

Stage Booked.In.FY Count Proportion
Plan FALSE 767 0.59
Plan TRUE 542 0.41
Clear FALSE 580 0.49
Clear TRUE 613 0.51
Ask FALSE 516 0.45
Ask TRUE 621 0.55
Oral FALSE 71 0.20
Oral TRUE 291 0.80
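
A sketch of how one of these rows could be computed with dplyr, shown for the ask stage only (assuming a non-missing Ask.Dt marks having reached that stage):

#### Booked-in-FY proportion for the ask stage (sketch) ----
mdat %>%
  filter(!is.na(Ask.Dt)) %>%
  group_by(FY.Ask.Book) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = round(Count / sum(Count), 2))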

Curious; there really doesn’t seem to be a discernible difference between clear and ask. The last thing to check is these hybrid “reached stage and closed” indicators by month.

There are a couple of crossover points: January for Plan and Clear, and January for Ask (though it interestingly crosses back in March). Can’t count on oral pledges made in May or June coming in before the end of the year. Let’s focus on asks in the MG range, \(x\in[\$25\text{k},\$5\text{M})\):

Ha, that’s more like it. Plan has a negative trend after Feb; Clear is surprisingly close to flat; Ask has a negative trend after March; Oral can’t be counted on in May and June.
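
For reference, the MG-range subset could be constructed like this (a sketch; the plotting code itself is omitted):

#### MG-range asks, $25k up to but not including $5M (sketch) ----
mdat.mg <- mdat %>% filter(!is.na(Ask.Amt), Ask.Amt >= 25e3, Ask.Amt < 5e6)
nrow(mdat.mg)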

Packages used

session_info()
## Session info --------------------------------------------------------------
##  setting  value                                      
##  version  R version 3.2.4 Revised (2016-03-16 r70336)
##  system   x86_64, mingw32                            
##  ui       RTerm                                      
##  language (EN)                                       
##  collate  English_United States.1252                 
##  tz       America/Chicago                            
##  date     2016-05-09
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  assertthat   0.1     2013-12-06 CRAN (R 3.2.2)
##  colorspace   1.2-6   2015-03-11 CRAN (R 3.2.2)
##  DBI          0.3.1   2014-09-24 CRAN (R 3.2.2)
##  devtools   * 1.11.0  2016-04-12 CRAN (R 3.2.4)
##  digest       0.6.8   2014-12-31 CRAN (R 3.2.2)
##  dplyr      * 0.4.3   2015-09-01 CRAN (R 3.2.3)
##  evaluate     0.8.3   2016-03-05 CRAN (R 3.2.4)
##  formatR      1.3     2016-03-05 CRAN (R 3.2.4)
##  ggplot2    * 2.0.0   2015-12-18 CRAN (R 3.2.3)
##  gridExtra  * 2.0.0   2015-07-14 CRAN (R 3.2.3)
##  gtable       0.1.2   2012-12-05 CRAN (R 3.2.2)
##  highr        0.5.1   2015-09-18 CRAN (R 3.2.4)
##  htmltools    0.3     2015-12-29 CRAN (R 3.2.3)
##  knitr      * 1.12.3  2016-01-22 CRAN (R 3.2.4)
##  labeling     0.3     2014-08-23 CRAN (R 3.2.2)
##  lazyeval     0.1.10  2015-01-02 CRAN (R 3.2.2)
##  lubridate  * 1.5.0   2015-12-03 CRAN (R 3.2.3)
##  magrittr     1.5     2014-11-22 CRAN (R 3.2.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.4)
##  munsell      0.4.2   2013-07-11 CRAN (R 3.2.2)
##  plyr         1.8.3   2015-06-12 CRAN (R 3.2.2)
##  R6           2.1.1   2015-08-19 CRAN (R 3.2.2)
##  Rcpp         0.12.1  2015-09-10 CRAN (R 3.2.2)
##  reshape2     1.4.1   2014-12-06 CRAN (R 3.2.2)
##  rmarkdown    0.9.5   2016-02-22 CRAN (R 3.2.3)
##  scales     * 0.3.0   2015-08-25 CRAN (R 3.2.2)
##  stringi      0.5-5   2015-06-29 CRAN (R 3.2.2)
##  stringr      1.0.0   2015-04-30 CRAN (R 3.2.2)
##  tidyr      * 0.3.1   2015-09-10 CRAN (R 3.2.2)
##  withr        1.0.1   2016-02-04 CRAN (R 3.2.4)
##  yaml         2.1.13  2014-06-12 CRAN (R 3.2.3)