A few years ago, I was lucky to hire an excellent summer intern from a leading economics PhD program in Europe. At the time, Lending Club made its historical performance data public, and each loan record in the file included a brief written request (likely penned by the prospective borrower) urging investors to fund the loan. I asked my intern to explore whether a quantitative treatment of this text would be useful in assessing the subsequent credit risk of the observed consumers.
Cutting a long story short, it was.

We explored several different words that you would expect people to use when requesting loans through a social lending platform. We found that words, phrases and clauses suggesting human-capital-increasing activities (e.g., “my daughter is getting married”) tended to clearly reduce credit risk. The use of words indicating “desperation” or “panic,” meanwhile, increased observed default rates sharply, even after controlling for credit scores and other commonly reported borrower attributes.
Some results were rather sad – health problems increase the likelihood of default – and some were uplifting. We found, for example, that people who were newly unemployed, but borrowing to do something about it, were more likely to remain current on their obligations than those in the baseline group.
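To make the idea concrete, here is a minimal sketch of how a loan request might be turned into simple theme indicators. The theme names and phrase lists below are hypothetical, invented purely for illustration; they are not the lists used in the analysis described above.

```python
import re

# Hypothetical phrase lists, invented for illustration only.
THEMES = {
    "investment": ["tuition", "training", "certification", "start a business"],
    "distress": ["desperate", "urgent", "emergency", "last resort"],
    "health": ["medical", "hospital", "surgery"],
}


def theme_flags(request_text: str) -> dict:
    """Return a 0/1 flag per theme if any of its phrases appears as a whole word."""
    text = request_text.lower()
    return {
        theme: int(
            any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)
        )
        for theme, phrases in THEMES.items()
    }


if __name__ == "__main__":
    print(theme_flags("I need this loan for tuition and a certification course."))
```

Flags like these could then sit alongside the credit score and other borrower attributes in a default model, which is one way to check whether the text adds information beyond the usual inputs.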
We also found that people who write at a higher level, as measured by the Flesch-Kincaid scale, defaulted at a lower rate than those with poorer writing skills.
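For readers curious how such a score is produced, the sketch below computes the two standard Flesch measures: the Reading Ease score (higher means easier to read) and the Flesch-Kincaid grade level (higher means more advanced). The coefficients are the published ones. The syllable counter is a crude vowel-group heuristic assumed here for brevity, so results will differ somewhat from those of dedicated tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return the Flesch Reading Ease and Flesch-Kincaid grade level of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "grade_level": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    print(readability("I need a loan to pay for school. I will pay it back."))
```

Both measures depend only on sentence length and word length, so they capture writing complexity rather than anything about what the borrower is actually saying.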
Text-Mining Do’s and Don’ts
The upshot of all this is that text-based analysis holds a lot of potential to improve the quality of credit risk models in common use.
In our analysis, one key point to note is that the text could be directly associated with the motivations of the borrower in seeking a loan. (We assumed, of course, that the requests were made honestly.) Had the text been on an irrelevant topic – like the historical significance of Hannibal’s campaigns on the Italian peninsula – the text mining would likely have proved ineffective. We might have been able to deduce the applicant’s level of education from such a screed, but little else of any possible relevance for credit assessment.
The most readily available source of a potential borrower’s writing, of course, is social media, but we would argue that this is not the route that lenders should pursue when making underwriting decisions. While it may be possible to scrape these communication channels for clues as to a person’s core creditworthiness – whether they have strong societal bonds or family connections, for example – these are likely to be less useful than the mission-critical paragraphs we were able to access in the Lending Club data.
Tying credit availability to social media activity would also change the nature of online society, triggering what statisticians call a “Hawthorne Effect.” Who, after all, would be willing to share the cute cat video with their friends if doing so would increase their mortgage payment by $50 a month?
Impact on Commercial Credit, Privately-Held Businesses and B2Cs
One area where text mining will be more useful, and its use potentially less corrosive, is in commercial credit. One can reasonably suppose that any document or electronic communication made public by a company would be relevant to its business success and pertinent in a full consideration of the institution’s credit risk. This is certainly true of official public filings, like those made to the SEC, but it is also true of every communication that defines a company’s public face.
For large public companies, any published document that bears on performance is likely, in most cases, to move stock and bond prices. Existing credit models based on observed financial market data will therefore quickly reflect the changed circumstances caused by the damaging (or helpful) text, leaving little for text mining to add.
Where text mining is likely to be more effective is for smaller, privately-held businesses whose market value is opaque and whose financial statements are not always available. It is in more rarefied data environments, like this one, that non-traditional sources are most highly prized. The smaller the enterprise, the more valuable text-based data are ultimately likely to be.
In addition, text-mining assessments of B2C businesses are likely to be more fruitful than those performed for B2Bs. Consumers may react to a scandal in an emotional or political manner, punishing the business even if they previously derived value from buying the product on offer. In the business world, on the other hand, one suspects that corporations will take a purely pragmatic approach and continue to use a supplier – unless doing so directly harms their bottom line or reputation.
Parting Thoughts
When pondering the overall impact of text mining, we need to recognize that text-based AI algorithms may have a limited shelf life in the lending business, especially if they are built using chance-based, data-mined correlations. If, for example, a higher number of commas on a webpage is one day found to be associated with higher credit risk, you can bet the house that companies will instantly drop them.
When using any AI-based tool, it is critical to ensure that the metrics used as inputs actually do pertain to the underlying creditworthiness of the target institution.
By the way, this article scores a 45.1 on the Flesch Reading Ease scale (the companion measure to the Flesch-Kincaid grade level), meaning it is aimed at a college-educated readership. Can I have my loan now?