Follow along with this sample notebook.
What improvements can a data platform make to better serve its users? The easy answer is to add more data: the more sources a user can reach, the greater the opportunity to find interesting data points and lift-producing features. However, more data also brings challenges. What happens when the number of new data sources added to a platform outpaces the user’s ability to test that data? At that point, platforms need to expand their technology to bridge the gap.
Data platform developers can take different approaches. One possible answer to this growing volume of data is better UI tools. At Demyst, we are improving search and filter capabilities, expanding metadata, and experimenting with recommendation engines; those features all fall in the “UI” bucket. A second answer is automated evaluation, which bypasses the need to browse by automatically testing all newly integrated data sources. I recently built a proof of concept that accomplishes that automation and fits it into a data science workflow. Here, I’ll explain how it works.
The automated process is simple and consists of two core parts: first, identify new sources and append data from those sources; second, build models and evaluate the data. The two toolsets used for the automation are the DemystData Python package and the DataRobot API (complete with docs). Prerequisites for implementation are access to those two platforms and an input dataset. DemystData’s sources need inputs to query in order to return data. The inputs are records that can identify a business or a consumer: name or business name, address, phone number, and so on. As with any data science project, those businesses or consumers need to be associated with a target variable.
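Concretely, an input dataset might look like the following sketch. The column names here are illustrative assumptions, not a required schema; the only structural requirement from the text is identifying fields plus a target variable.

```python
import pandas as pd

# Illustrative input file: each record identifies a business (or consumer)
# and carries the target variable the models will predict.
# Column names are assumptions for this sketch, not a required schema.
inputs = pd.DataFrame(
    {
        "business_name": ["Acme Corp", "Globex LLC"],
        "street": ["123 Main St", "456 Oak Ave"],
        "city": ["Springfield", "Shelbyville"],
        "state": ["IL", "IL"],
        "post_code": ["62701", "62565"],
        "phone": ["217-555-0101", "217-555-0102"],
        "target": [1, 0],  # e.g. churned / did not churn
    }
)

inputs.to_csv("inputs.csv", index=False)
```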
The first step of the program leverages two functions from the DemystData Python package to acquire new data. The first is “search,” which accepts an input file as a parameter and returns all data products able to return data for those inputs. The second is “enrich_and_download,” which queries the new data sources and organizes the appended results into DataFrames. So the program executes a search, then, with the newly identified data sources, executes an enrich_and_download with the input data. The results are stored locally, where they remain available for future modeling as well as manual inspection and evaluation.
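A minimal sketch of that acquisition step, wrapped in one function. The `Analytics` client and the exact signatures of `search` and `enrich_and_download` are assumptions based on the function names above; check the DemystData package docs before running this against real credentials.

```python
import pandas as pd


def acquire_new_data(input_csv, known_providers):
    """Search for data products that match the inputs, then enrich.

    The demyst.analytics import and the call signatures below are
    assumptions sketched from the package's documented function names.
    """
    # Imported lazily so the sketch can be defined without the package.
    from demyst.analytics import Analytics  # pip install demyst-analytics

    analytics = Analytics()
    inputs = pd.read_csv(input_csv)

    # Find every data product that can return data for these inputs.
    products = analytics.search(inputs)

    # Keep only the sources we have not already evaluated.
    new_products = [p for p in products if p not in known_providers]

    # Query the new sources and collect the appended columns.
    results = analytics.enrich_and_download(new_products, inputs)

    # Store locally for later modeling and manual inspection.
    results.to_csv("enriched.csv", index=False)
    return results
```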
The next section of the script passes the new data, along with any previously known impactful data, to DataRobot via its API. DataRobot’s Python client lets users easily create DataRobot projects and issue instructions to them. So the program creates a project and kicks off the autopilot process for that project.
Once autopilot completes, the API gives the script programmatic access to the top model and all of that model’s metadata, including the model’s top factors and the other metrics that are also available through the UI.
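The project-creation and inspection steps might look like this sketch against the DataRobot Python client. The specific calls (`Project.create`, `set_target`, `wait_for_autopilot`, `get_or_request_feature_impact`) should be verified against the client docs for your installed version, and the target column name is an assumption.

```python
def build_and_inspect(df, project_name, target="target"):
    """Create a DataRobot project, run autopilot, and pull the top model.

    A sketch against the DataRobot Python client; verify call names
    and defaults against the client docs for your installed version.
    """
    # Imported lazily so the sketch can be defined without the package.
    import datarobot as dr  # pip install datarobot

    # Create a project from the enriched DataFrame and start autopilot.
    project = dr.Project.create(df, project_name=project_name)
    project.set_target(target=target, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
    project.wait_for_autopilot()

    # Models come back ranked by the project metric; take the leader.
    top_model = project.get_models()[0]

    # Feature impact is the API's view of the model's "top factors".
    impact = top_model.get_or_request_feature_impact()
    return project, top_model, impact
```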
Because training on the top factors alone can sometimes produce a better model than the full feature list, the program repeats the DataRobot process with just the top 10 factors.
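Selecting those top factors is plain list manipulation. This runnable sketch assumes the feature-impact records use `featureName`/`impactNormalized` keys, which should be checked against your DataRobot client version; the impact scores shown are hypothetical.

```python
def top_n_features(feature_impact, n=10):
    """Return the n most impactful feature names.

    `feature_impact` is a list of dicts shaped like the DataRobot
    client's feature-impact output; the key names are assumptions
    to check against your client version.
    """
    ranked = sorted(
        feature_impact,
        key=lambda rec: rec["impactNormalized"],
        reverse=True,
    )
    return [rec["featureName"] for rec in ranked[:n]]


# Hypothetical impact scores for illustration.
impact = [
    {"featureName": "revenue", "impactNormalized": 0.4},
    {"featureName": "employee_count", "impactNormalized": 1.0},
    {"featureName": "years_in_business", "impactNormalized": 0.7},
]

print(top_n_features(impact, n=2))  # → ['employee_count', 'years_in_business']
```

The resulting names could then seed a new feature list for the second autopilot run.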
Finally, the program has everything it needs to choose the best model: is it the first model built with the larger feature set, the second model built with the top ten factors, or a model built in the past (the case where the new data provided no lift)? Using evaluation metrics from the DataRobot API, such as AUC, the program chooses one model and records it as the top model. All of the models also remain available in the DataRobot UI, where a data scientist can perform additional evaluation.
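That final comparison reduces to picking the candidate with the best score. A self-contained sketch with hypothetical AUC values; with the DataRobot client the scores could come from each model's `metrics`, but plain floats keep the example runnable.

```python
def pick_best(candidates, metric="AUC"):
    """Choose the candidate with the highest metric value.

    `candidates` maps a model label to its score for `metric`
    (higher is better, as with AUC). Labels and values here
    are hypothetical.
    """
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


# Hypothetical validation AUCs for the three candidates.
candidates = {
    "all_new_features": 0.81,
    "top_ten_features": 0.84,
    "previous_top_model": 0.79,
}

print(pick_best(candidates))  # → ('top_ten_features', 0.84)
```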
The process above can be repeated again and again to improve models and identify which data points improve on the previous status quo model. Packaged as a Python script, it runs without any manual work and leaves a rich data trail. The proof of concept and the general ideas laid out here are a starting point that engineers, data scientists, and project teams can optimize and customize.