“First we have to get our internal data in order”. It’s something we hear frequently. Why invest externally when there is an underutilized resource on hand? Today’s article breaks down the dimensions of when to focus on external data and why, and discusses why the demand for external data is expanding fast.
Commercial impact from analytics, i.e. improvement over the status quo, is driven fundamentally by data and algorithms. Within data, one can simplistically think of rows and columns. More rows (observations, typically internal), more columns (features – typically through external data), and a better analytical algorithm multiply together to drive value, as depicted below.
In today’s competitive landscape, extracting information content from data is crucial to customer acquisition and better risk selection. Doing so via internal data and improved AI alone yields rapidly diminishing returns. Internal data and advanced algorithms are table stakes. The source of analytical improvement and the basis of competition is shifting to features; to columns; to external data. Enterprises seeking a step change in model quality have an imperative to understand, assess, and capture value from the emerging external data ecosystem.
And those that build the muscle to harness external data are achieving outsized returns. A sample of Demyst platform users shows that leading organizations are not just exploring but actually consuming and deriving value from over 9 times as many external data sources as the market average.
Diminishing returns – Rows
The last decade has delivered a renaissance in big data technologies that amass huge amounts of data. Drilling deeper, most applications of these technologies involve “tall and skinny” datasets that have many rows but relatively few columns. Examples include ad-tech applications, which hold massive quantities of cookies but relatively little data about each. Cyber and web log ingestion platforms, sitting in Spark/Hadoop clusters, contain billions of observations but limited information about each one.
There is value in this to be sure; however, the challenge is that most analytical applications and algorithms plateau at scale, i.e. orders of magnitude more data are required in order to deliver lift in modeling algorithms. This can present practical challenges:
- Many enterprises just don’t have the data, e.g. there aren’t enough responders, fraudsters, etc – pick your response – so the quality of the solution is capped by definition
- Expanding data introduces bias through exposure to exogenous shocks. E.g. insurers would like to expand their platforms to use years of data, but everything from competitor pricing to weather changes affects model bias
- There are material costs in harnessing internal data. Historically this was in storage; nowadays the primary cost is often implementation and maintenance – handling a data lake at 10x or 100x the scale introduces rigidity in all aspects of managing the ETL life-cycle
Enterprises that haven’t leveraged and modeled internal data will benefit from doing so to be sure, but this isn’t enough.
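One way to see why more rows plateau: for a simple estimate such as a response rate, sampling error shrinks only with the square root of the number of observations. The sketch below is an illustration under stated assumptions, not a Demyst analysis. The 2% response rate and the row counts are assumed values, and `standard_error` is a hypothetical helper.

```python
import math

def standard_error(rate: float, n_rows: int) -> float:
    """Standard error of an observed proportion under simple random sampling."""
    return math.sqrt(rate * (1.0 - rate) / n_rows)

# Assumed for illustration: a 2% response rate and three data volumes.
ASSUMED_RATE = 0.02
for n_rows in (100_000, 1_000_000, 10_000_000):
    se = standard_error(ASSUMED_RATE, n_rows)
    print(f"{n_rows:>10,} rows -> standard error {se:.6f}")

# Error ratio for a 10x increase in rows: sqrt(10), about 3.16x, not 10x.
improvement = (standard_error(ASSUMED_RATE, 100_000)
               / standard_error(ASSUMED_RATE, 1_000_000))
print(f"10x more rows shrinks the error by {improvement:.2f}x")
```

Under these assumptions, halving the error takes roughly four times the data, which is why lift from rows alone tends to flatten out.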
Modeling fatigue – AI
Nonparametric modeling techniques have been used for a long time – KNN, neural nets, decision trees and forests, gradient boosting techniques; all offer the opportunity to discover and leverage deep interaction effects and tease more signal from big data noise to deliver outsized impacts. More recently, however, with advances in compute capacity, frontend tools like DataRobot and H2O, and frameworks like TensorFlow and scikit-learn, nonparametric and ensemble techniques have become mainstream.
Coase famously said “if you torture data long enough, it will confess”. A sample only contains a limited representation of the population statistics, no matter how sophisticated the modeling algorithm. Furthermore, brute force modeling platforms can exacerbate overfitting against biased or limited input data. I love a good Kaggle or DataRobot leaderboard as much as the next person – there are very few things more satisfying – but how often does that ensemble XGBoost make its way into production vs a tried and tested logistic regression? Infrequently; because, while marginally better, such models are often far more complex, unstable, and prone to deployment error.
Organizations are adopting AI in a big way; however, as with observations, there are diminishing returns, since orders of magnitude more capacity, sophistication and effort are required to extract marginal gains.
The impact of external data – Columns
It’s not either/or, but rather “and”. One needs observations, attributes, and AI to capture model value.
While data lakes and AI operate at massive scale, there remains major low-hanging fruit in adding columns. Going from 50 attributes to 500 can be a lot more impactful and practical than going from 100k records to 1m.
For example, if we are predicting compliance risks within a bank, expanding to hundreds of millions of daily transactions but still mining only basic signals such as transaction to/from, amount vs norms, and static lists is unlikely to yield much. However, running every name and company against news articles and blogs, running emails against social media presence, triangulating PII across third party sources to identify inconsistencies, looking up employer profiles, and triangulating device locations can all yield major lift even without going beyond the current sample sizes.
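The rows-versus-columns comparison above (50 to 500 attributes versus 100k to 1m records) can be made concrete with a stylized calculation. The sketch below is a toy model, not real data: it assumes the target is driven by 500 independent attributes of equal strength plus noise, and uses the standard Gaussian random-design result for the out-of-sample error of ordinary least squares. The function `expected_r2` and all parameter values are hypothetical choices for illustration.

```python
def expected_r2(n_rows: int, n_observed: int, n_total: int = 500,
                signal_per_column: float = 1.0,
                noise_var: float = 100.0) -> float:
    """Expected out-of-sample R^2 of least squares in a toy model.

    Toy model (an assumption, not real data): the target is the sum of
    n_total independent standard-normal attributes, each contributing
    signal_per_column of variance, plus noise of variance noise_var.
    Only n_observed of the attributes are available to the model.
    """
    if not 0 < n_observed <= n_total:
        raise ValueError("n_observed must be between 1 and n_total")
    if n_rows <= n_observed + 1:
        raise ValueError("need more rows than observed columns + 1")
    total_var = n_total * signal_per_column + noise_var
    # Attributes the model cannot see behave like additional noise.
    residual_var = (n_total - n_observed) * signal_per_column + noise_var
    # Penalty for estimating n_observed coefficients from n_rows rows.
    mse = residual_var * (1.0 + n_observed / (n_rows - n_observed - 1))
    return 1.0 - mse / total_var

baseline = expected_r2(n_rows=100_000, n_observed=50)
more_rows = expected_r2(n_rows=1_000_000, n_observed=50)
more_columns = expected_r2(n_rows=100_000, n_observed=500)

print(f"50 columns, 100k rows:  {baseline:.3f}")
print(f"50 columns, 1m rows:    {more_rows:.3f}")
print(f"500 columns, 100k rows: {more_columns:.3f}")
```

In this toy setting, ten times the rows barely moves the result, because the model is capped by what its columns can explain, while ten times the columns lifts it substantially. The equal, independent contribution of every attribute is a strong simplification; real external attributes are often collinear and noisy, so actual gains will be smaller and need to be tested.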
Harnessing more features is typically but not always an external data challenge. Most organizations know only a small fraction of information about their own customers versus what is available.
What’s more, capturing data externally unlocks opportunities that aren’t solely about better predictions. Pre-fill and customer journey optimization are only possible through external data – i.e. without creating undue friction for the customer. Protecting against adverse selection in risk use cases is best done through dimensions, not rows.
Why now? And why has this opportunity slid down the priority list? We are at an inflection point where disruptors creating better digital journeys are capturing customers more effectively, and big platform players are entering every vertical. The fight for the customer has never been more intense, and with it the need to optimize every customer interaction with data. However, until now the market for external data has not made the life of the data scientist particularly easy. It’s fragmented. Noisy. Collinear. The value is unpredictable. Compliance and information security are hard, increasingly so. It’s impossible to know what matters until it’s tested. So it is perhaps not surprising that this is left until later.
Demyst is investing heavily in solving this. Our vision is that expanding the cube to the right – learning more about what attributes matter and harnessing them within your analytical workflows – is just a few lines of code in a seamless API. We abstract away the discovery, data compliance, contracting, types and schema, and other mechanical aspects of getting the data to where it needs to be to capture value.
Investing in lift against all 3 dimensions is no longer optional. In the search for competitive edge and model quality, fueling your next gen engines with more columns is the logical area of strategic focus for analytical leaders and CDOs. Doubling down on the same data with the same models isn’t going to delight your customers, but delivering exactly the right products in the right way will – that’s the external data imperative.