Combining single-platform data access and one-click model building is instant gratification for data scientists. DemystData grants users access to hundreds of real-time data sources – that’s the single-platform data access piece. DataRobot runs a model horserace with just a CSV – that’s the one-click model building piece. Together, they can execute any statistical experiment that you design.
Of course, a model sitting on a data scientist’s laptop can’t solve a business problem in practice. That is why DemystData developed Data Functions, a tool to handle deployment so you can get your hard work into production.
The goal of this post is to build, deploy, and execute a predictive model. Readers will come away with an understanding of where DataRobot and DemystData fit into the data science lifecycle, how to use both tools, and how well they work together.
While this post describes and solves a specific problem – namely, commercial credit risk – it is important to keep in mind the flexibility of DemystData and DataRobot. Commercial credit is just one use case. Data scientists on other teams and in other sectors are tackling fraud, targeting customer segments, or pricing insurance, and they can all use the same core process to achieve their goals.
To get an even more in-depth understanding, readers can follow along using the DemystData platform, the DataRobot platform, and a Jupyter Notebook made for this post. To access the two platforms and the notebook you'll need:
1) A DemystData account
2) A DataRobot account
3) Python 3.5 or greater and Jupyter to follow the code from this gist
For those who do not have any of the above, reading along without reproducing is okay too. We will be using Pandas and the DemystData Python package, demyst-analytics.
Here is an outline of the steps:
Designing the Experiment by gathering materials and defining parameters
Preparing Input File to ensure the append goes smoothly
Appending external data using the DemystData platform and python package
Building the model using DataRobot
Exporting the Model as python code from DataRobot
Deploying the model as a DemystData Data Function
Each step is brief, taking only about two minutes to read through or five minutes to reproduce.
Designing The Experiment
Defining a sample, a response variable, and a list of data sources is the first step for any data study. This study uses the following:
The sample is a subset of a list of a few hundred thousand 7(a) SBA loan recipients between 2010 and today. The sample size is 5,000 rows (those following along should feel free to decrease the sample size for faster appends and lower data cost).
Next comes the response variable – a flag saying whether a business is a credit risk, i.e., the piece of information that the model will use other data to predict. It is derived from a column already on the file called LoanStatus. Two of the possible values of LoanStatus are PIF (paid in full) and CHGOFF (charge off). After filtering out all other values, that column becomes a binary response variable.
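For readers following along, that derivation is a few lines of pandas. The toy frame and the flag name is_risk below are illustrative; the real file has many more columns:

```python
import pandas as pd

# Toy stand-in for the SBA loan file; the real file has ~35 columns.
loans = pd.DataFrame({
    "BorrName": ["Acme Co", "Bolt LLC", "Cafe 9", "Dyno Inc"],
    "LoanStatus": ["PIF", "CHGOFF", "EXEMPT", "PIF"],
})

# Keep only resolved loans, then encode CHGOFF (charge off) as 1 and PIF (paid in full) as 0.
resolved = loans[loans["LoanStatus"].isin(["PIF", "CHGOFF"])].copy()
resolved["is_risk"] = (resolved["LoanStatus"] == "CHGOFF").astype(int)

print(resolved["is_risk"].tolist())  # [0, 1, 0]
```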
Finally, appending data from a list of providers in DemystData's catalog widens the data set and adds features. Small business data and property data are two categories that apply to the input set and have the potential to provide lift to a commercial credit risk model. DemystData has dozens of small business and property data providers integrated, so browsing the catalog, downloading data dictionaries, and consulting with expert data scientists at DemystData all help in choosing exactly which ones to use.
The full source list used here is Attom, InfoGroup, Google, Infutor, UtilityScore, DnB, Acxiom, Walk Score, HouseCanary, and Bing. Of those, Google, Attom, UtilityScore, Walk Score, HouseCanary, and Infutor are property data providers. The type of building, the risks or benefits of the location, and a property value estimation are potentially predictive of commercial credit risk. Acxiom, DnB, and InfoGroup return information directly about a business. Business type, number of employees, and credit assessments or scores are extremely useful, and come in various forms from these data providers. Bing is more exploratory, and has the potential for customization with the raw search result data it returns.
With both a sample of inputs and a list of data sources, it is time to append the data!
Preparing Input File
Before uploading the full 5,000 records, it’s a good idea to run one row, or a few, to see if the append goes smoothly.
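Trimming the sample down is a one-liner in pandas. The frame below is a stand-in for the real sample file:

```python
import pandas as pd

# Hypothetical stand-in for the 5,000-row sample file.
sample = pd.DataFrame({
    "BorrName": [f"Business {i}" for i in range(5000)],
    "BorrZip": ["10001"] * 5000,
})

# Append just a few rows first; input problems surface at a fraction of the data cost.
trial = sample.head(3)
print(len(trial))  # 3
```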
That did not go smoothly at all! Right away, DemystData responds with messages for all of the providers, saying input fields are missing. So, what is wrong with the inputs? What is missing, and what is the fix?
At a high level, DemystData is strict about requiring specific input headers for each data provider. Usually, these headers are names of Demyst types, and are standardized across data providers (like city or email). Sometimes, they are specific to a given data provider (like we will see with attom_id). Each provider’s inputs are well documented and catalogued in the Demyst platform, so it is straightforward to figure out what is missing.
A good first step to fixing the input file is to convert each column header to a Demyst type if applicable. The current headers are:
['AsOfDate', 'Program', 'BorrName', 'BorrStreet', 'BorrCity', 'BorrState', 'BorrZip', 'BankName', 'BankStreet', 'BankCity', 'BankState', 'BankZip', 'GrossApproval', 'SBAGuaranteedApproval', 'ApprovalDate', 'ApprovalFiscalYear', 'FirstDisbursementDate', 'DeliveryMethod', 'subpgmdesc', 'InitialInterestRate', 'TermInMonths', 'NaicsCode', 'NaicsDescription', 'FranchiseCode', 'FranchiseName', 'ProjectCounty', 'ProjectState', 'SBADistrictOffice', 'CongressionalDistrict', 'BusinessType', 'LoanStatus', 'ChargeOffDate', 'GrossChargeOffAmount', 'RevolverStatus', 'JobsSupported']
The headers that start with Borr are the important ones since they identify the businesses in question. Here is a quick header transformation (again, the headers are being converted to Demyst types, detailed here):
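A sketch of that transformation in pandas, assuming the Demyst type names business_name, street, city, state, and post_code (these names are taken from later in the post; confirm the exact names against the catalog):

```python
import pandas as pd

# Minimal frame with just the borrower columns; the real file has ~35 columns.
df = pd.DataFrame([{
    "BorrName": "Acme Co",
    "BorrStreet": "123 Main St",
    "BorrCity": "Springfield",
    "BorrState": "IL",
    "BorrZip": "62701",
}])

# Map the Borr* headers onto Demyst types so providers can recognize them.
df = df.rename(columns={
    "BorrName": "business_name",
    "BorrStreet": "street",
    "BorrCity": "city",
    "BorrState": "state",
    "BorrZip": "post_code",
})

print(list(df.columns))  # ['business_name', 'street', 'city', 'state', 'post_code']
```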
There are a couple of functions in the demyst-analytics Python package that double-check the validity of input files. The validate2 function scans the values in the columns and verifies that they conform to their types.
Taking a look at the output of validate2, the feedback is really good. Four of the six columns transformed above are All Valid, so those don't need any more work. The state column indicates that seven (out of the 5,000) rows have issues. One example of a type mismatch flagged in the feedback is the value VI. All of that is okay: seven out of 5,000 is a very low percentage, and US territories like the Virgin Islands (VI) will probably work as valid states anyway. The only real issue is in the post_code column, but the remedy suggested by the notebook output – converting it to a string – is easy. We'll do it here:
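A sketch of that remedy in pandas; the zfill guards against ZIP codes that lost leading zeros when the column was read as integers:

```python
import pandas as pd

df = pd.DataFrame({"post_code": [62701, 10001, 501]})  # read as integers

# Cast to string and left-pad so ZIPs like 00501 keep their leading zeros.
df["post_code"] = df["post_code"].astype(str).str.zfill(5)

print(df["post_code"].tolist())  # ['62701', '10001', '00501']
```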
The next function we'll use from the demyst-analytics package is the analytics function. It is a useful next step: it takes measure of which providers are ready to run and which are still missing inputs. Notice the providers argument above. The function needs that information so it can compute a boolean value describing whether each provider in the list will run, based on the inputs provided as the first argument. See the output below:
Attom, Walkscore, and Bing are still invalid – they won’t return data with the current inputs. Now we need to figure out what they’re missing. This is where using the DemystData catalog comes in handy. A quick look at the catalog will show what input columns each provider is missing. We will go through each next.
Every Attom provider needs an attom_id as an input:
The provider, Attom Expanded Profile, turns an address into an attom_id:
This append will add attom_id to the sample:
It is not uncommon for providers to be set up this way. Many others involve a chained lookup as well: the first step takes raw data and turns it into a unique ID, and the second step appends more interesting data using that ID.
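The chained-lookup pattern can be sketched with a plain pandas merge. The expanded frame and attom_id values below are mocked; on the platform, the first append produces them:

```python
import pandas as pd

# Original sample, keyed by address.
sample = pd.DataFrame({
    "street": ["123 Main St", "9 Elm Ave"],
    "city": ["Springfield", "Shelbyville"],
    "state": ["IL", "IL"],
})

# Hypothetical output of the first (ID-lookup) append: one attom_id per address.
expanded = pd.DataFrame({
    "street": ["123 Main St", "9 Elm Ave"],
    "attom_id": ["A-1001", "A-2002"],
})

# Merge the ID back so the second-step providers have the input they need.
sample = sample.merge(expanded, on="street", how="left")
print(sample["attom_id"].tolist())  # ['A-1001', 'A-2002']
```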
Walkscore requires a latitude and longitude. Because the original file did not include latitude and longitude, they need to be appended now. Google LatLon, a source in our list already, returns latitude and longitude as part of its output, and it takes an address as an input. Google LatLon is a good tool to have since this issue pops up from time to time. The following append will add latitude and longitude to the sample so that Walkscore’s inputs are satisfied:
Bing requires a query, which is open-ended and can be created by combining the business_name with the city and state.
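Constructing that query column is a one-liner in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "business_name": ["Acme Co", "Cafe 9"],
    "city": ["Springfield", "Shelbyville"],
    "state": ["IL", "IL"],
})

# Bing's open-ended query: business name plus location.
df["query"] = df["business_name"] + " " + df["city"] + " " + df["state"]
print(df["query"].tolist())  # ['Acme Co Springfield IL', 'Cafe 9 Shelbyville IL']
```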
Appending External Data
Now that the input file is properly set up, all that's left to do is run the data providers. Running sixteen at a time is the same command as running a single provider.
Because the code snippet is so short, it’s easy to lose track of what is actually happening. We are sending each row of our sample to each of these APIs, getting the response bodies flattened and serialized back into a CSV and then turned into a DataFrame.
Next up, there is a little bit of post-processing to do. Remember, Google LatLon ran earlier and was not included in the most recent append, so it needs to be merged in. Also, there is a pattern with the six Attom providers – they come in pairs: Attom AVM and Attom Commercial AVM, Attom Pre Foreclosure and Attom Commercial Pre Foreclosure, Attom Tax Assessor and Attom Commercial Tax Assessor. The non-commercial version of each pair provides data on residential property. With a sample of small businesses, however, some business addresses may be categorized as residential, so the pairs need to be combined column by column.
This script accomplishes both of those tasks:
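Since the script itself lives in the gist, here is a minimal pandas sketch of both tasks. The column and key names below are illustrative, not the actual provider output headers:

```python
import pandas as pd

# Hypothetical appended frame, plus the earlier Google LatLon result.
appended = pd.DataFrame({
    "row_id": [1, 2],
    "attom_avm.value": [None, 350000],             # residential AVM
    "attom_commercial_avm.value": [500000, None],  # commercial AVM
})
latlon = pd.DataFrame({
    "row_id": [1, 2],
    "google_latlon.latitude": [39.8, 39.4],
    "google_latlon.longitude": [-89.6, -88.8],
})

# Task 1: merge the Google LatLon columns back in.
appended = appended.merge(latlon, on="row_id", how="left")

# Task 2: coalesce each residential/commercial Attom pair, keeping whichever matched.
appended["attom_avm.value"] = appended["attom_avm.value"].combine_first(
    appended["attom_commercial_avm.value"])

print(appended["attom_avm.value"].tolist())  # [500000.0, 350000.0]
```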
Now, we will have a large brick of data in a DataFrame. After saving it as a CSV, it is ready for upload into DataRobot.
Building The Model
Just as accessing data is what DemystData was built for, building models is what DataRobot was built for. A data scientist using DataRobot does not have to do much actual building – the DataRobot software itself does most of the hard work. The process is also very intuitive: there are only two or three steps before models start churning out.
First, DataRobot prompts a user to upload a dataset.
Next, it asks for a response variable. DataRobot autofills the variable name nicely:
DataRobot then shows a small breakdown of the response variable distribution. The big Start button in the middle of the screen is unmissable.
Before triggering the Start button, a user can optionally exclude some fields from the uploaded file. In this case, some of the appended data includes secondary or third place matches to a given query, so it’s good to filter those out at this time. Any known unique identifiers should also be filtered out. It is easy to use the +Create Feature List button, checkboxes next to column names, and search to filter out the undesired fields.
When a new feature list is created at the start, DataRobot automatically selects that feature list for use. It is time to build some models. The model horse race will begin at the click of the Start button.
Immediately, visualizations of processing models appear. DataRobot displays the types of models being built and the models in the queue. Once models finish building, they are added to the leaderboard, where users can explore each model's performance and metadata.
One of the coolest things to check out on a DataRobot model is the feature impact view! It shows which features were weighted most in the model.
There are tons of metrics and charts to examine for each DataRobot model. It would take hours to explore them all.
Exporting The Model
Exporting a complex model as a simple code file isn't always a simple task, but DataRobot Prime gives users the ability to easily export models as an executable Python script. Selecting Run DataRobot Prime from one of the model drop-downs generates another model approximating the existing one.
DataRobot lists the new exportable model and names it DataRobot Prime. Once the DataRobot Prime processing completes, the Generate and Download Code link becomes available.
Once downloaded, the code inside the file shows how simple it is to run the model from the command line – giving insight into the files and formats involved.
DemystData’s Data Functions are easily editable and deployable scripts that execute Python code. In this case the Data Function executes the DataRobot model, but in other cases Data Functions can perform business decision logic or simply format data. Data Functions can even use the DemystData platform to access more data, for use cases like chained or conditional queries.
The create command from the demyst-df Python package registers a Data Function and downloads a template. The create command also takes the desired name – commercial_credit_risk_v1.
The template is populated from the start with some example code. The main file to view is function.py, within the function directory. The full, completed function.py file is available in this gist. There is a lot of metadata specified below, like output. Defining those can be very useful, but it is all optional. The most important part is the actual executable code, which lives at the top of the file in the function named data_function. That function executes when the Data Function itself is called (via API), so in this case it needs to append data and execute the DataRobot model.
data_function takes a single argument named df. The argument df is a dict with several interesting tools attached to it. For example, df.inputs is the inputs object sent via the API. df.connectors is a class that helps access external data from multiple sources: df.connectors.fetch starts the asynchronous process of fetching data from multiple connectors, and df.connectors.get accesses that data. Because it can traverse the tree of the providers' results, df.connectors.get is able to reach each data point that the model needs. After the data is plucked out, it is fed into the DataRobot code that was pasted in. In the end, the function returns a dict with the result.
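Putting those pieces together, here is a minimal sketch of the shape of data_function. The stub classes exist only so the example runs standalone; on the platform, df is supplied by DemystData, and score stands in for the pasted DataRobot Prime code:

```python
# Stubs standing in for the platform-supplied `df` object.
class StubConnectors:
    def fetch(self, providers, inputs):
        # Pretend this kicks off the asynchronous appends.
        self._results = {"acxiom": {"employee_count": 12}}

    def get(self, provider, path):
        # Traverse the provider's result tree, e.g. get("acxiom", ["employee_count"]).
        node = self._results[provider]
        for key in path:
            node = node[key]
        return node

class StubDF(dict):
    def __init__(self, inputs):
        self.inputs = inputs
        self.connectors = StubConnectors()

def score(features):
    # Placeholder for the exported DataRobot Prime scoring code.
    return 0.42

def data_function(df):
    df.connectors.fetch(["acxiom"], df.inputs)               # start the appends
    employees = df.connectors.get("acxiom", ["employee_count"])  # pluck a feature
    return {"risk_score": score({"employee_count": employees})}  # dict result

result = data_function(StubDF({"business_name": "Acme Co"}))
print(result)  # {'risk_score': 0.42}
```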
Now that the Data Function is constructed, the deploy command pushes everything up and makes it available for execution.
Inside the file df.config, there is a variable named DATA_FUNCTION_NAME. Copy and save that value for when we execute the Data Function.
It is time to execute the Data Function via an API request. We’ve come all the way from raw data to a deployed model, and it is time to test it out! Here is the request as a cURL command:
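The same request can be sketched in Python. The endpoint URL, header names, and API key below are placeholders, not the actual DemystData API; substitute your own DATA_FUNCTION_NAME and credentials:

```python
import json

# Placeholder endpoint and key -- not the real DemystData API surface.
url = "https://console.demystdata.com/data_function/commercial_credit_risk_v1/execute"
headers = {"Content-Type": "application/json", "api-key": "YOUR_API_KEY"}

# Inputs mirror the Demyst-typed columns prepared earlier in the post.
payload = {"inputs": {
    "business_name": "Acme Co",
    "street": "123 Main St",
    "city": "Springfield",
    "state": "IL",
    "post_code": "62701",
}}

body = json.dumps(payload)
# requests.post(url, headers=headers, data=body)  # uncomment to actually send it
print(body)
```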
We have now achieved our goal of producing, deploying, and executing a commercial credit risk model using DemystData and DataRobot! This process is an example of how an actual lender might automate commercial credit risk checks. I hope you will use this as a guide to build your own versions and explore your own use cases.