We all know at 30,000 feet that data generation is exploding. Telcos, social media, medical records, open banking platforms, marketing insights providers, credit bureaus, and government APIs to name a few – the information of life and commerce is being digitized. But, at 5,000 feet, what is out there? Today’s data scientist, long before the science itself, is tasked with the upstream activity of sourcing relevant data to improve their models. Today’s article seeks to demystify the first steps.
Sources, Entities, or Use Cases?
Broadly speaking, inside organizations we hear three fundamentally distinct framings of the questions posed to the data scientist: by source, by entity, or by use case. We recommend an entity-first data search for maximum impact and re-use.
Understandably, by far the most frequent challenges we hear in the market are around use cases. Analytical initiatives are typically triggered by senior executives focusing on whichever business metric currently has the most organizational attention. Those executives are often laser focused on execution, not foundational data activities. Typically this takes the form of an episodic top-down mandate to fix a fundamental metric such as conversion, fraud, disruption, or growth. That in turn triggers an analytical initiative to build a new workflow, model, or solution, which in turn requires someone to “find me the data that solves use case X”.
However, we recommend that data scientists respond to a use case challenge with an entity-first approach to data sourcing. This is best illustrated through an example. Let’s say the business challenge is how to cut consumer fraud. An entity-first approach will lead to:
- Better models / data : Under a use case approach, the data scientist will immediately try to source fraud blacklists and nefarious trigger information. Under an entity approach, she takes a less constrained view and may look for consumer identity information, consumer demographics and footprint, and consumer product holdings. She will also look beyond the source and application systems inside the organization’s four walls to find rich customer-level information across the landscape which, even if not tagged for fraud, can produce a higher Gini. This opens the lens to more data and better models.
- More re-use : By asking about the consumer data (broader) rather than the problem, data scientists can re-use existing assets and resources from entirely separate groups. E.g. the sales and marketing team often has great consumer information that can help with fraud. The same is true on the back end: the data scientist leaves behind a legacy of re-usable incremental consumer data (not consumer fraud data). Use cases are by definition solving a marginal problem not previously solved, and can be very challenging. However, by packaging the data around the entities, projects can happen in a fraction of the time.
Of course the communication back to the business stakeholder may still be entirely about the use case and business value, but that is a separate issue.
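As a rough illustration of what “higher Gini” means in the fraud example, the sketch below computes the Gini coefficient as it is commonly used in risk and fraud modelling, i.e. 2 × AUC − 1. The function name `gini`, the labels, and both score lists are made up purely for illustration; they are not results from any real model or dataset.

```python
def gini(labels, scores):
    """Gini coefficient of a scoring model, computed as 2 * AUC - 1.

    AUC is obtained by comparing every (fraud, non-fraud) pair:
    a pair counts 1 if the fraud case scores higher, 0.5 on a tie.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5

    auc = wins / (len(pos) * len(neg))
    return 2.0 * auc - 1.0


# Hypothetical toy data: 1 = fraud, 0 = not fraud.
labels = [1, 0, 1, 0, 0, 1]

# Scores from an imagined model that only knows about a blacklist hit.
blacklist_only_scores = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1]

# Scores from an imagined model that also uses broader consumer attributes.
entity_scores = [0.9, 0.2, 0.7, 0.3, 0.1, 0.6]

print("blacklist only:", round(gini(labels, blacklist_only_scores), 3))
print("entity first:  ", round(gini(labels, entity_scores), 3))
```

The pairwise loop is quadratic and only suitable for small samples; it is used here because it makes the definition easy to read.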
So, what external data is there on [Entity X]?
The depth of information available, and the nuances and complexities of how to get it, are far more than we can cover in a short post. That said, there are some fundamental discoverable truths, depending on the entity, that fuel a lot of value.
Whether it’s digital workflows, fraud, risk, pricing, or marketing, the analytical questions one asks to serve a human being are not that fundamentally different. E.g.
- Identifiers : E.g. phone number, email, name, address, employer
- Demographics : Life stage, family situation, education history
- Health, wellness, and responsibility : Medical events, health indicators from images, pharmaceutical usage, participation in sporting events, organization membership
- Operations : What industry are they in. What products. How many staff. What mix of staff. What customers do they have. What suppliers. Where are their headquarters and other offices. How long has it been actively trading. Import/Export activity
- Images : Photos of operations, products, and staff
- Financial health : Cash incomings / outgoings. Job and visa listings. Major funding and client events. Macroeconomic indicators
- Events : M&A events, new contracts and client situations, fundraise and partnership announcements, regulatory mentions. Key executive and director events (new appointments, criminal activities and keyword association, death and disability)
- Sentiment : Level of customer and partner opinion and support, from formal trade references to public social media complaints
- Dwelling / building makeup : Construction types, Street view, Aerial images, Security systems, Quality of upkeep and maintenance, Number of bedrooms/bathrooms, land area
- Geo demographics : Nearby permits, footfall traffic, average annual income, most popular occupations, most common household compositions and unemployment rate for the region
- Geology : Weather patterns, extreme event frequency
- Accessibility : Travel times and distances
When sourcing entity-related information, the data scientist should ask two basic, unconstrained questions. First, what fundamental information would someone want to know about a counterparty, i.e. what could she see, touch, feel and understand about that entity? Second, what digital trace would the entity likely leave, i.e. what CRM or other systems, devices and IoT objects relate to that entity, and what would they logically store?
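One way to document the answers to these two questions, independent of any use case, is a simple attribute wishlist keyed by entity. The sketch below is only an illustration: the class names `Attribute` and `EntityWishlist`, the `question` labels, and the example entries are all hypothetical, not part of any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The two unconstrained questions, used as labels (hypothetical naming).
QUESTIONS = ("observable", "digital_trace")


@dataclass
class Attribute:
    name: str                 # e.g. "phone number"
    category: str             # e.g. "Identifiers", "Demographics"
    question: str             # which of the two questions surfaced it
    likely_systems: List[str] = field(default_factory=list)


@dataclass
class EntityWishlist:
    entity: str
    attributes: List[Attribute] = field(default_factory=list)

    def add(self, name, category, question, likely_systems=()):
        """Record a wanted attribute; no use case is attached to it."""
        if question not in QUESTIONS:
            raise ValueError(f"question must be one of {QUESTIONS}")
        self.attributes.append(
            Attribute(name, category, question, list(likely_systems))
        )

    def by_category(self) -> Dict[str, List[str]]:
        """Group attribute names by category, in insertion order."""
        grouped: Dict[str, List[str]] = {}
        for attr in self.attributes:
            grouped.setdefault(attr.category, []).append(attr.name)
        return grouped


# Illustrative entries only.
consumer = EntityWishlist("consumer")
consumer.add("phone number", "Identifiers", "observable")
consumer.add("life stage", "Demographics", "observable")
consumer.add("email", "Identifiers", "digital_trace", ["CRM"])

print(consumer.by_category())
```

Because nothing in the wishlist mentions fraud, marketing, or pricing, the same list can be handed to whichever team picks up the next use case for that entity.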
The next step
Once you know what entity-related information you want, the next step is the sourcing itself. This involves searching for, and working directly with, data vendors and platforms to access the required information with the appropriate consent. Procurement teams often have indexes of data vendors, but vendors tend to package their products by use case, which can make searching a little inefficient. Then there is a vendor onboarding and evaluation process, which is the topic of another post.
DemystData provides an External Data Access platform which addresses the sourcing directly. Engage us or try out our toolkit to discover entity-related (or, for that matter, use case or source-related) information in one place.
The first step in the external data journey is realizing that almost any attribute about an entity that you think should exist, does exist. Taking a moment to open the aperture to what you want to know about the entity, and documenting it, independent of the use case, will unlock next generation models and customer workflows with improved project outcomes.