Those engaged in social science – a category that includes most forms of risk modeling – are often accused of being bad scientists. The accusations generally come from physical scientists, who enjoy the luxury of access to repeatable experiments with few real-world consequences. Risk modelers have no such luxury: most people would rightly frown on a bank offering loans to those who can’t afford them just to gather data to improve its models for future use.
Though non-experimental risk modeling is, in my opinion, much harder than rocket science, we should nevertheless aspire to become better scientists.
The reason I’m jumping on this particular soapbox is that in mid-2019, editors of the influential science journal Nature questioned whether the concept of statistical significance should be ditched in its entirety. Contemporaneously, a special issue of The American Statistician was published containing 43 papers hammering essentially the same point: that p-values are afforded too much salience in statistics and should be downgraded or set aside.
The main problem with significance testing is its arbitrary, dichotomous nature. In the standard application, a p-value of 0.04 indicates a successful study – a “significant” finding no less – while a 0.06 constitutes an abject failure. In reality, there’s little practical difference between the two results: one may simply be a little more precisely measured or have access to a smidgen more data. Applying a hard and fast 5% rule seems, well, unscientific.
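To make the point concrete, here is a minimal sketch that backs out the confidence intervals implied by those two p-values. It rests on assumptions that are mine rather than part of any particular study: a two-sided test, a normally distributed test statistic, a positive estimate and the same standard error of 1.0 in both cases. The function name interval_from_p is purely illustrative.

```python
from statistics import NormalDist


def interval_from_p(p_value, std_error=1.0, level=0.95):
    """Back out the estimate and confidence interval implied by a two-sided p-value.

    Assumes a normal test statistic and a positive estimate (illustrative only).
    """
    std_normal = NormalDist()
    # z-statistic that would produce this two-sided p-value
    z = std_normal.inv_cdf(1 - p_value / 2)
    estimate = z * std_error
    # Half-width of the confidence interval at the requested level
    half_width = std_normal.inv_cdf(0.5 + level / 2) * std_error
    return estimate, estimate - half_width, estimate + half_width


for p in (0.04, 0.06):
    est, lower, upper = interval_from_p(p)
    print(f"p = {p:.2f}: estimate = {est:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the two intervals overlap almost entirely. The only difference is that one lower bound sits just above zero and the other just below it, which is a thin basis for calling one study a success and the other a failure.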
I’m writing this during the holidays, so allow me to indulge in a personal anecdote. Back in my academic days, I sourced some data that allowed me to test whether a “bad year” for a certain premium wine label affected the auction prices of subsequent vintages.
This was technically very interesting, because it suggested a panel data model with lagged dependent variables in two different time dimensions – vintage and auction date. I spent six months working out the theory and carefully estimating the model. Though this was 20 years ago, I still remember the p-value on the key coefficient: 0.063. The paper remains unpublished.
Setting aside my own academic disappointments, we must remember that the Nature editors are mainly working in the experimental realm, where nuisance factors can literally be controlled. In the observational world that risk modelers inhabit, the difficulties associated with significance tests only multiply.
In my wine paper, the task at hand was to test a single hypothesis – that the parameter on the vintage-lagged dependent variable was, under the null, zero. Most risk modeling studies you see have a far broader set of aims. Generally, the analyst is asked to assess the myriad risk factors that may affect the behavior of the portfolio being analyzed, and tell their bosses which ones are important. It is more of a voyage of discovery than a pure hypothesis testing exercise.
The model used to these ends does not hatch from an egg on a mountaintop – it is built, most likely using a large number of statistical tests. These pre-tests, which are usually unavoidable, distort the properties of the model estimates and standard errors – you think you’re testing the coefficients of the final specification at the 5% level, but the actual size of the test may be 1%, 5% or 50%. It’s very hard to know for sure.
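A small simulation gives a feel for how far the true size can drift. This is a deliberately stylized sketch, not a description of any real model build: the response is pure noise, there are ten candidate risk factors that are independent of it and of one another, the analyst keeps whichever factor looks strongest in a one-variable regression, and then tests that factor at a nominal 5% level. The sample size, the number of candidates, the seed and the critical value of 2.01 (approximately the two-sided 5% point of a t distribution with 48 degrees of freedom) are all assumptions chosen for illustration.

```python
import math
import random


def t_stat(x, y):
    """t-statistic on the slope of a one-variable regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def pretest_rejection_rate(n_obs=50, n_candidates=10, n_sims=2000,
                           crit=2.01, seed=0):
    """Share of simulations in which the best-looking candidate is 'significant'.

    Every candidate is unrelated to y by construction, so each rejection
    is a false positive.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        # Pre-test: keep only the candidate with the largest |t|
        best = max(
            abs(t_stat([rng.gauss(0, 1) for _ in range(n_obs)], y))
            for _ in range(n_candidates)
        )
        if best > crit:
            rejections += 1
    return rejections / n_sims


rate = pretest_rejection_rate()
print(f"Nominal size: 5%, simulated size after selection: {rate:.0%}")
```

With ten independent candidates, the chance that at least one clears a 5% hurdle by luck alone is roughly 1 - 0.95^10, or about 40%, and the simulated rate should land in that neighborhood. Real model builds involve correlated factors and more elaborate selection rules, so the actual distortion could be larger or smaller, which is exactly the problem.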
Reasonable Adjustments: Debunking Misconceptions and Embracing Uncertainty
The p-value detractors know they are fighting an uphill battle to change perceptions of statistical testing. It is so easy, and so readily accepted, to blindly apply the 5% rule that proposing a radical alternative is probably futile. The adjustments they propose are therefore very reasonable and accommodative.
One interesting suggestion from the American Statistician contributors is that academic journals agree to publish the findings of pre-registered research projects, irrespective of the statistical findings they uncover. This would mean that surprising results that happen to be negative will gain more attention, possibly spurring fruitful future research projects.
In the context of risk, this suggestion is especially useful. If analysts report suspected risk factors that ultimately prove to be statistically insignificant, this will provide a great deal of insight to model users. It seems to me that debunking bankers’ misconceptions should be a core duty of empirical risk modelers; giving prominence to negative results should enable this to happen far more frequently than it does at present.
The second point concerns the language of significance testing. Rather than using a strict dichotomy between significant and insignificant results, analysts should instead “embrace uncertainty” and try to express the spectrum of outcomes that occur when testing. This means that we should avoid saying that there is “no association” upon finding an insignificant p-value and instead use terms like “insufficient evidence” of a relationship or, better yet, use confidence intervals to express the results.
This seems like a small shift, and something we could easily embrace without sparking much backlash.
The third suggestion concerns repeatability – the situation where multiple independent researchers find essentially the same result if they go looking. For the academic world, the authors suggest that crowdsourcing is a good way to check this. If many researchers consider a particular question and a plurality of opinion emerges, the result is far more likely to stand up to rigorous scrutiny.
This option is probably not available in the financial world. I can’t really imagine a bank posting its data on the web with a view to crowdsourcing the final specification. That said, Fannie Mae and Freddie Mac released their data a few years ago, which sparked a huge amount of research in the private sector and academia. One imagines executives at financial institutions voraciously consuming this research and using it productively whenever the results are compelling.
I’ve always thought that banks should use lots of models, reflecting the fact that one model will rarely answer all the questions you might want to ask. Crowdsourcing, even if the crowd is internal, is a great way to achieve this outcome.
I’m not sure if risk modelers, or scientists more generally for that matter, will ever be able to wean themselves off p-values. That said, there are some simple steps we can take to improve communication, and these should be tried.
I’m at least 95% sure that this is the case.