Source can be viewed on GitHub.
My goal is to find a way to accurately forecast fundraising revenue.
The existing model (“JMod”) has two components. First, the probability a solicitation \(S_{i}\) comes in before the end of the fiscal year is a function of the current stage progress \(c\) and time until the end of the year \(t\), \(P(S_{i}=1)=f(c_{i},t)\). The expected FRP in the current fiscal year is this probability times the expected amount, \(E(FRP)=P(S_{i}=1)E(S_{i})\).
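As a minimal sketch of that structure (both function names here are hypothetical stand-ins, not JMod's actual implementation):

# Sketch of the JMod skeleton. jmod_close_prob() is a toy placeholder
# for f(c_i, t), not the real function.
jmod_close_prob <- function(stage_progress, days_to_fye) {
  # toy placeholder: more stage progress => higher close probability
  plogis(3 * stage_progress - days_to_fye / 365)
}
expected_frp <- function(stage_progress, days_to_fye, expected_amt) {
  jmod_close_prob(stage_progress, days_to_fye) * expected_amt  # P(S_i = 1) * E(S_i)
}
expected_frp(stage_progress = 0.75, days_to_fye = 90, expected_amt = 100e3)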
Focus on the probability model. I assume there should also be some effect of the average close rate \(\mu\), season \(t_{i}^{*}\) (e.g. people give more in December), planned ask amount \(a_{j}^{*}\), actual ask amount \(a_{j}\), and potentially interactions. Something like this:
\[P(S_{ij}=1)=f(c_{i},t,\mu,t_{i}^{*},a_{j}^{*},a_{j})\]
It makes sense to start with a few straightforward classifiers, but I’d also like to look into nesting models.
The dataset includes all closed (successful = booked, or unsuccessful = not booked) solicitations with an add date on or after 7/1/2011. Data is obtained from the first tab of the “Booth solicitation history” report (saved as a .csv).
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
#### Run Rscript1 to load data ----
source("Rscript1 - data load and transform.R")
Begin with some visualizations of time in each stage:
There’s extremely obvious bimodality in most cases. Note in particular that while PG (defined as $5M+) is skewed positive overall, there seem to be two distinct groups: takes a while and takes a long while. The purple “Days from ask to actual” is about as perfect a mixture of two \(N(\mu_{k},\sigma^{2}_{k})\) as I’ve ever seen. Might be interesting to see if a certain population just takes longer in general.
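As a check on that mixture claim, something like mixtools::normalmixEM could recover the two components. This is an assumption on my part; mixtools isn't among the packages Rscript0 loads, and it uses the ask2actual column created in the cleanup step below:

library(mixtools)  # assumption: not loaded by Rscript0
# Fit a two-component normal mixture to ask-to-actual days via EM
mix <- normalmixEM(na.omit(mdat$ask2actual), k = 2)
mix$mu      # component means
mix$sigma   # component standard deviations
mix$lambda  # mixing proportions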
There isn’t much data in the oral pledge stage so things are a bit bumpier here.
Here’s the cleaned-up data file:
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
colnames(mdat)
## [1] "Solicitation.ID" "Solicitation.Type.Desc"
## [3] "Final.Sol.Stage.Dt" "Final.Sol.Stage"
## [5] "Booked" "Final.Sol.Mgr"
## [7] "Solicit.Dt.Added" "Planned.Dt"
## [9] "Planned.Amt" "Expected.Dt"
## [11] "Expected.Amt" "Planning.Dt"
## [13] "Clear.Dt" "Ask.Dt"
## [15] "Ask.Amt" "Oral.Dt"
## [17] "Actual.Dt" "Actual.Amt"
## [19] "plan2actual" "clear2actual"
## [21] "ask2actual" "oral2actual"
## [23] "FY.plan" "FY.clear"
## [25] "FY.ask" "FY.oral"
## [27] "FY.Plan.Book" "FY.Clear.Book"
## [29] "FY.Ask.Book" "FY.Oral.Book"
Booked indicates whether the solicitation came in at \(>\)$0. FY.plan, FY.clear, FY.ask, and FY.oral indicate whether the solicitation closed in the same fiscal year that stage was reached, while the corresponding FY.[Stage].Book indicates it was booked in the same fiscal year it reached that stage; combinations of these indicators will be the dependent variables. plan2actual, clear2actual, ask2actual, and oral2actual are measured in days. Planned.Dt is the planned ask date, while Planning.Dt is the date the solicitation entered the planning stage. The other covariates should be self-explanatory.
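As a first pass at the straightforward classifiers mentioned above, a logistic regression on a few of these columns is the obvious sketch. The covariate choices are placeholders; the month-of-ask factor is my rough stand-in for the seasonality term:

# Logistic regression for booked-in-FY-of-ask; a sketch, not a final model
m0 <- glm(FY.Ask.Book ~ Ask.Amt + Planned.Amt + factor(month(Ask.Dt)),
          data = mdat, family = binomial)
summary(m0)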
Do I want to model time to close from each current stage \(c_{i}\) as opposed to \(P(S_{ij}=1)\)?
| | plan2actual | clear2actual | ask2actual | oral2actual |
|---|---|---|---|---|
| Min. | 1.0 | -8.25 | -35.21 | -33.25 |
| 1st Qu. | 45.0 | 28.75 | 22.75 | 12.75 |
| Median | 126.0 | 91.75 | 66.75 | 30.79 |
| Mean | 195.8 | 142.96 | 114.12 | 64.33 |
| 3rd Qu. | 283.0 | 209.79 | 162.79 | 75.75 |
| Max. | 1426.0 | 1426.75 | 919.75 | 919.75 |
| NA's | | 116 | 172 | 947 |
Negative values indicate a data issue or update. Decidedly right-skewed. Try a square root transformation:
Not bad. How’s log look?
Might be a good idea to try e.g. Box-Cox transformations to see how they compare.
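A sketch of that comparison with MASS::boxcox (MASS ships with R but isn't loaded by Rscript0, and it masks dplyr::select):

library(MASS)
# boxcox() needs a strictly positive response; plan2actual qualifies
# (its minimum is 1 day), while the other durations would need a shift
# past their negative values.
boxcox(lm(plan2actual ~ 1, data = mdat), lambda = seq(-1, 1, by = 0.05))
# a peak near lambda = 0.5 favors the square root; near 0, the log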
Data counts - do I have enough to do what I want? Is it remotely reasonable to split by month or quarter?
| Stage | Jul | Aug | Sep | Oct | Nov | Dec | Jan | Feb | Mar | Apr | May | Jun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plan | 83 | 98 | 115 | 159 | 106 | 125 | 82 | 65 | 100 | 133 | 137 | 106 |
| Clear | 64 | 82 | 142 | 164 | 139 | 145 | 92 | 76 | 79 | 79 | 72 | 59 |
| Ask | 50 | 65 | 90 | 133 | 163 | 156 | 74 | 71 | 118 | 67 | 78 | 72 |
| Oral | 28 | 11 | 21 | 34 | 47 | 37 | 33 | 25 | 18 | 33 | 37 | 38 |
Peak ask occurs around Nov-Dec and Mar each year. Not nearly as much oral pledge data; usually oral to booked is fast enough that the date isn’t captured in the system. (JMod rolls ask and oral together.)
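For reference, a tabulation like the one above can be produced along these lines (a sketch; months come out in calendar rather than fiscal Jul-Jun order):

# Count solicitations by the month each stage was reached
t(sapply(list(Plan  = mdat$Planning.Dt, Clear = mdat$Clear.Dt,
              Ask   = mdat$Ask.Dt,      Oral  = mdat$Oral.Dt),
         function(d) table(month(d, label = TRUE))))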
Finally, let’s take a look at the booked versus declined solicitations by stage by fiscal year closed.
Confidence bands are based on the standard error of a binomial proportion, \(\sqrt{\frac{p(1-p)}{n}}\) (assuming independence).
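Concretely, each band is computed along these lines (the multiplier of 2, for roughly 95% coverage, is my assumption):

# Wald-style band: p +/- z * sqrt(p * (1 - p) / n)
binom_band <- function(k, n, z = 2) {
  p  <- k / n
  se <- sqrt(p * (1 - p) / n)
  c(lower = p - z * se, estimate = p, upper = p + z * se)
}
binom_band(k = 302, n = 302 + 60)  # e.g. oral-stage solicitations closing in-FY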
This is fascinating; a solicitation entering plan or clear in January or plan in April-July is likely to be refused rather than booked. Is there a specific year driving that?
Note that this is fiscal, not calendar, year. FY2011 didn’t have much stage progress, as it is the first year of data.
Looks like 2013 was a particularly good year, but it’s normal to have lots of duds in those winter months.
I’m thinking time to close is not the natural outcome measure; a model predicting time to close can’t be swapped in directly for the corresponding probability terms in JMod. Recall from above:
For the probability model, I assume there should also be some effect of the average close rate \(\mu\), season \(t_{i}^{*}\) (e.g. people give more in December), planned ask amount \(a_{j}^{*}\), actual ask amount \(a_{j}\), and potentially interactions. Something like this:
\[P(S_{ij}=1)=f(c_{i},t,\mu,t_{i}^{*},a_{j}^{*},a_{j})\]
Take a look at each of these additional variables against Booked.
Histogram and smoothed density estimate. The close rate is above 50% at every level, but it’s poorest between the $100k and $1M levels. There’s obvious nonlinearity; a cubic polynomial might be a good approximation?
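Sketching the cubic idea as a logistic fit (the log10 scale is my choice, since ask amounts span several orders of magnitude):

# P(Booked) as a cubic polynomial in log10 ask amount; filter to
# positive amounts so log10() and poly() behave
m_amt <- glm(Booked ~ poly(log10(Ask.Amt), 3),
             data = subset(mdat, Ask.Amt > 0), family = binomial)
summary(m_amt)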
I think differences in planned ask versus actual ask amounts might indicate trouble. How does the ratio of actual ask over planned ask look?
Too choppy to really see much of anything; most solicitations are asked at the planned amount and there are so few in the tails that there isn’t much interesting to say. What if we just look at a few binned thresholds?
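A binning sketch; the break points are my guesses at round thresholds worth looking at:

# Bin the actual-to-planned ask ratio and cross-tabulate against Booked
ratio_bin <- cut(mdat$Ask.Amt / mdat$Planned.Amt,
                 breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, Inf),
                 include.lowest = TRUE)
table(ratio_bin, mdat$Booked)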
If someone asks slightly more than the planned amount, it’s a good sign the talks were going well; otherwise there doesn’t seem to be much to see here. The 25%-50% bin might also be statistically significant; I don’t know about the others given the small sample sizes.
Given that a solicitation reaches a certain stage, how likely is it to close that year, whether or not it’s booked?
| Stage | Closed.In.FY | Count | Proportion |
|---|---|---|---|
| Plan | FALSE | 553 | 0.42 |
| Plan | TRUE | 756 | 0.58 |
| Clear | FALSE | 331 | 0.28 |
| Clear | TRUE | 862 | 0.72 |
| Ask | FALSE | 284 | 0.25 |
| Ask | TRUE | 853 | 0.75 |
| Oral | FALSE | 60 | 0.17 |
| Oral | TRUE | 302 | 0.83 |
Ask and Clear are closer than I would have thought. What about if we look at whether they were both closed and booked that year?
| Stage | Booked.In.FY | Count | Proportion |
|---|---|---|---|
| Plan | FALSE | 767 | 0.59 |
| Plan | TRUE | 542 | 0.41 |
| Clear | FALSE | 580 | 0.49 |
| Clear | TRUE | 613 | 0.51 |
| Ask | FALSE | 516 | 0.45 |
| Ask | TRUE | 621 | 0.55 |
| Oral | FALSE | 71 | 0.20 |
| Oral | TRUE | 291 | 0.80 |
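A quick two-sample proportion test on these counts (Clear: 613 booked of 1193; Ask: 621 of 1137) makes the Clear-versus-Ask comparison formal:

# Compare booked-in-FY rates for Clear vs. Ask using the table counts
prop.test(x = c(613, 621), n = c(580 + 613, 516 + 621))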
Curious; there really doesn’t seem to be a discernible difference between clear and ask. The last thing to check is these hybrid “reached stage and closed” indicators by month.
There are a couple of crossover points: January for Plan and Clear, and January for Ask (though it interestingly crosses back in March). Can’t count on oral pledges made in May or June coming in before the end of the year. Let’s focus on asks in the MG range, \(x\in[\$25\text{k},\$5\text{M})\):
Ha, that’s more like it. Plan has a negative trend after Feb; Clear is surprisingly close to flat; Ask has a negative trend after March; Oral can’t be counted on in May and June.
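For reference, the MG restriction behind these last plots amounts to something like:

# Major gift range: $25k up to (but not including) $5M
mg <- subset(mdat, Ask.Amt >= 25e3 & Ask.Amt < 5e6)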
session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.4 Revised (2016-03-16 r70336)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Chicago
## date 2016-05-09
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.2)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.2)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.2)
## devtools * 1.11.0 2016-04-12 CRAN (R 3.2.4)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.2)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.3)
## evaluate 0.8.3 2016-03-05 CRAN (R 3.2.4)
## formatR 1.3 2016-03-05 CRAN (R 3.2.4)
## ggplot2 * 2.0.0 2015-12-18 CRAN (R 3.2.3)
## gridExtra * 2.0.0 2015-07-14 CRAN (R 3.2.3)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.2)
## highr 0.5.1 2015-09-18 CRAN (R 3.2.4)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## knitr * 1.12.3 2016-01-22 CRAN (R 3.2.4)
## labeling 0.3 2014-08-23 CRAN (R 3.2.2)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.2)
## lubridate * 1.5.0 2015-12-03 CRAN (R 3.2.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.2)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.4)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.2)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.2)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.2)
## Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.2)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.2)
## rmarkdown 0.9.5 2016-02-22 CRAN (R 3.2.3)
## scales * 0.3.0 2015-08-25 CRAN (R 3.2.2)
## stringi 0.5-5 2015-06-29 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.2)
## tidyr * 0.3.1 2015-09-10 CRAN (R 3.2.2)
## withr 1.0.1 2016-02-04 CRAN (R 3.2.4)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.3)