Automating Enrichment Jobs
Install
First, make sure the Analytics package is installed. If you aren't sure, run this:
:::bash
pip install demyst-analytics
Test Data
First, let's create some test data to use in this example. In an IPython environment or in a Python script, execute this code:
:::python
import pandas as pd
test_df = pd.DataFrame({'email_address': ['test@test.com', 'test2@test.com']})
test_df.to_csv("inputs.csv", index=False, sep=',', encoding='utf-8')
You should end up with a file called inputs.csv that looks like this:
email_address
test@test.com
test2@test.com
Automation
Now that we have some test data, let's build a script to enrich our input file using the Demyst platform. For this test we will use the domain_from_email data product, a test product Demyst offers that simply splits each email_address value sent to it into its user and host parts.
Let's start by importing the necessary packages.
:::python
import pandas as pd
from demyst.analytics import Analytics
You will need a production API Key from the Demyst Console.
analytics = Analytics(key='XXXXXX')
If you don't have an API Key yet, you can test using your Username and Password by leaving out the key parameter.
analytics = Analytics()
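For an unattended job, it is usually preferable to keep the key out of the script itself. The following is a minimal sketch, assuming the key is stored in an environment variable. The variable name DEMYST_API_KEY and the helper load_api_key are hypothetical names chosen for this example; they are not part of the Demyst package.

```python
import os


def load_api_key(var_name="DEMYST_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice for this example, not
    something the Demyst package defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage, with the constructor shown above:
#   from demyst.analytics import Analytics
#   analytics = Analytics(key=load_api_key())
```

Failing with a clear error when the variable is missing keeps a scheduled job from silently falling back to another way of authenticating.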
Now let's read in our inputs file. Because the header of our CSV file, email_address, is one the Demyst platform understands, the file can be used as a dataframe without modification.
inputs = pd.read_csv('inputs.csv')
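In an automated job it can help to fail fast when the input file does not have the header you expect. The following is a minimal sketch using only the standard library. The helper name missing_columns is hypothetical, and the check covers only the email_address column used in this example, not the full set of column names the platform recognizes.

```python
import csv


def missing_columns(csv_path, required=("email_address",)):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Only the first row is read, so large files are not loaded.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

A script could call missing_columns('inputs.csv') before submitting the job and stop with an error if the returned list is not empty.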
To enrich the file, we pass the list of providers along with the input dataframe to the enrich function.
job_id = analytics.enrich(['domain_from_email'], inputs, validate=False)
The enrich_download function will block until the job is complete and return a dataframe:
outputs = analytics.enrich_download(job_id)
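Because the download call blocks, a scheduled job may benefit from logging when the job was submitted and how long it took. The following is a minimal sketch that wraps the two calls shown above. The function name run_enrichment and the logger name are hypothetical, and the analytics argument is assumed to be an object offering the enrich and enrich_download methods used in this guide.

```python
import logging
import time

logger = logging.getLogger("enrichment")


def run_enrichment(analytics, providers, inputs):
    """Submit an enrichment job, wait for it, and return the result.

    `analytics` is assumed to offer the enrich and enrich_download
    methods used in this guide.
    """
    started = time.monotonic()
    job_id = analytics.enrich(providers, inputs, validate=False)
    logger.info("Submitted job %s for providers %s", job_id, providers)
    # Blocks until the job is complete.
    outputs = analytics.enrich_download(job_id)
    elapsed = time.monotonic() - started
    logger.info("Job %s finished after %.1f s", job_id, elapsed)
    return outputs
```

Passing the analytics object in as an argument, rather than creating it inside the function, makes it possible to exercise the wrapper with a stand-in object in tests.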
Lastly, we can take the resulting output dataframe and write it to a file.
outputs.to_csv('outputs.csv', index=False, sep=',', encoding='utf-8')
The output of this script will be a file called outputs.csv which should look like this:
inputs.email_address,domain_from_email.row_id,domain_from_email.client_id,domain_from_email.host,domain_from_email.user,domain_from_email.error
test@test.com,0,,test.com,test,
test2@test.com,1,,test.com,test2,
This output could feed the next stage of an ETL pipeline, or it could be imported into a modeling tool.
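Before handing the file to a later stage, a pipeline might separate rows that came back with an error. The following is a minimal sketch using only the standard library. It assumes that a non-empty domain_from_email.error cell marks a failed row, which is an inference from the sample output above rather than documented behavior, and the helper name split_by_error is hypothetical.

```python
import csv


def split_by_error(csv_path, error_column="domain_from_email.error"):
    """Return (succeeded, failed) lists of rows from an output CSV.

    Assumption: a row counts as failed when its error column is non-empty.
    """
    succeeded, failed = [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if (row.get(error_column) or "").strip():
                failed.append(row)
            else:
                succeeded.append(row)
    return succeeded, failed
```

Each row is a dictionary keyed by the column names in the header, so a later stage can read values such as row['domain_from_email.host'] directly.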
The full solution is provided below. If you need help automating a production job, don't hesitate to reach out to support@demystdata.com.
:::python
import pandas as pd
from demyst.analytics import Analytics
analytics = Analytics()
inputs = pd.read_csv('inputs.csv')
job_id = analytics.enrich(['domain_from_email'],
                          inputs,
                          validate=False)
outputs = analytics.enrich_download(job_id)
outputs.to_csv('outputs.csv',
               index=False,
               sep=',',
               encoding='utf-8')