Dewey Data Bulk Download in Python (API v2)

UPDATE: There is a new version of the API (v3). Please find the Python tutorial for that API here: Bulk Data Downloading in Python (API v3)


Dewey launched a new data deployment system on August 23, 2023. Here is a step-by-step guide to downloading data in bulk or reading sample data in Python. (This is the Python version of this R guide.)

1. Create API Key

In the system, click Connections → Add Connection to create your API key.


As the message says, make a copy of your API key and store it somewhere safe. Also, be sure to hit the Save button before using the key.

2. Get a product path
Choose your product and click Get / Subscribe → Connect to API to get the API endpoint (product path). Make a copy of it.

3. Python Code
Here is the Python code for file downloads.

import requests
import pandas as pd
import sys
import gzip
from io import BytesIO

# Define a function to extract the last part of a string
# n_part = -1: file name
#          -2: data folder date
def extract_part_from(x, n_part = -1):
    parts = x.split('/')
    return parts[n_part]

# Get the list of files on server
def get_file_list(apikey, product_path, print_info = True):
  # print("Collecting files information----------------------------")

  try:
    response = requests.get(url=product_path,
                            headers={"X-API-KEY": apikey,
                                     "accept": "application/json"})
  except Exception as e:
    print("Error in requests.get")
    print(e)
    print(" ")
    return None

  #print(response)
  
  if response.status_code != 200:
    print(response)
    return None
  
  res_json = response.json()
  num_files = res_json["metadata"]["num_files"]
  total_size_mb = res_json["metadata"]["total_size_mb"]
  avg_file_size_mb = res_json["metadata"]["avg_file_size_mb"]
  expires_at = res_json["metadata"]["expires_at"]
  
  if print_info:
    print("Files information---------------------------------------")
    print(f"Number of files: {num_files}")
    print(f"Total size (MB): {total_size_mb}")
    print(f"Average file size (MB): {avg_file_size_mb}")
    print(f"Link expires: {expires_at}")
    print("--------------------------------------------------------")
    sys.stdout.flush()
  
  files_df = pd.DataFrame(res_json["download_links"], columns = ["download_link"])
  split_links = files_df['download_link'].str.split('?', expand=True)
  files_df["file_link"] = split_links[0]
  files_df = files_df.sort_values(by="file_link")
  files_df.reset_index(drop=True, inplace=True)
  
  # Extract the file name
  files_df["file_name"] = files_df["file_link"].apply(extract_part_from, n_part = -1)
  
  return files_df

# Read URL data into memory
def read_sample_data(url, nrows = 100):
  
  # Create a response object from the URL
  response = requests.get(url)
  
  try:
    df = pd.read_csv(BytesIO(response.content), compression="gzip", nrows = nrows)
  except gzip.BadGzipFile:  # not a gzip file; try a plain csv
    df = pd.read_csv(BytesIO(response.content), nrows = nrows)
  except Exception:
    print("Could not read the data. Can only open gzip csv or plain csv files.")
    return None
  
  return df


# Read first file data into memory
def read_sample_data0(apikey, product_path, nrows = 100):
  files_df = get_file_list(apikey, product_path, print_info = True)
  print("    ")
  
  if (files_df is not None) and (files_df.shape[0] > 0):
    return read_sample_data(files_df["download_link"][0], nrows)
  else:
    return None


# Download files from file list to a destination folder
def download_files(files_df, dest_folder, filename_prefix = ""):
  dest_folder = dest_folder.replace("\\", "/")
  if not dest_folder.endswith("/"):
    dest_folder = dest_folder + "/"

  files_df.reset_index(drop=True, inplace=True)
  
  # number of files
  num_files = files_df.shape[0]
  
  for i in range(0, num_files):
    print(f"Downloading {i +1}/{num_files}")
  
    download_link = files_df["download_link"][i]
    
    file_name = filename_prefix + files_df["file_name"][i]
    dest_path = dest_folder + file_name
    print(f"Writing {dest_path}")
    print("Please be patient. It may take a while...")
    sys.stdout.flush()

    data = requests.get(download_link)
    with open(dest_path, 'wb') as f:
      f.write(data.content)
    print("   ")
    sys.stdout.flush()

# Download files with apikey and product path to a destination folder
def download_files0(apikey, product_path, dest_folder, filename_prefix = ""):
  files_df = get_file_list(apikey, product_path, print_info = True)
  print("   ")
  
  if files_df is None:
    return
  
  download_files(files_df, dest_folder, filename_prefix)

# Slice files_df for specific data in dates_str for Advan monthly and weekly
# patterns.
# For example, dates_str = ["2023-08-14", "2023-08-21"]  
def slice_file_list_advan(files_df, dates_str):
  file_date = files_df["file_link"].apply(extract_part_from, n_part = -2)
  dates_idx = file_date.str.contains("|".join(dates_str))
  
  sliced_df = files_df[dates_idx].copy()
  sliced_df.reset_index(drop=True, inplace=True)
  
  return sliced_df

It has the following functions:

  • get_file_list: gets the list of files as a DataFrame
  • read_sample_data: reads a sample of data from a file download URL
  • read_sample_data0: reads a sample of data from the first file, given an API key and product path
  • download_files: downloads the files in a file list to a destination folder
  • download_files0: downloads files given an API key and product path, to a destination folder
  • slice_file_list_advan: slices the files_df from get_file_list to specific dates in dates_str, for Advan monthly and weekly patterns. For example, dates_str = ["2023-08-14", "2023-08-21"].

4. Examples
I am going to use the Advan monthly patterns as an example.

# API Key
apikey_ = "Paste your API key from step 1 here."

# Advan product path
product_path_= "Paste product path from step 2 here."

You will have only one API key, but a different product path for each product.
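If you work with several products, one way to keep them organized (a minimal sketch; the dictionary keys and placeholder paths below are hypothetical) is:

# One API key, several product paths (replace the placeholders with your own)
product_paths = {
    "advan_monthly_patterns": "Paste the Advan monthly patterns product path here.",
    "advan_weekly_patterns": "Paste the Advan weekly patterns product path here.",
}

# e.g., files_df = get_file_list(apikey_, product_paths["advan_monthly_patterns"])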

You can now see the list of files to download by

files_df = get_file_list(apikey_, product_path_, print_info = True)
files_df

Setting print_info = True prints the meta information of the files, like below:


Advan has 2,560 files in total, with an average file size of 197.9 MB.

files_df is a DataFrame of the file list, with columns download_link, file_link, and file_name.
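You can peek at the first few rows and confirm the columns with:

files_df.head()
list(files_df.columns)   # ['download_link', 'file_link', 'file_name']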

You can quickly load and inspect sample data by

sample_data = read_sample_data(files_df["download_link"][0], nrows = 100)

This will load the first 100 rows of the first file in files_df (files_df["download_link"][0]). You can sample any file in the list this way.
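For example, to sample a different file, just change the index (the index 9 below is only for illustration):

sample_data_10 = read_sample_data(files_df["download_link"][9], nrows = 100)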

If you want to see the first n rows of the first file without calling get_file_list yourself, you can use

sample_data = read_sample_data0(apikey_, product_path_, nrows = 1000)

This will load the first 1,000 rows of the first file of the Advan data.

Now it’s time to download data to your local drive. First, you can download all the files by

download_files0(apikey_, product_path_, "E:/temp", "advan_mp_")

The third parameter is your destination folder ("E:/temp"), and the last parameter ("advan_mp_") is the filename prefix, so all the files will be saved as "advan_mp_xxxxxxx.csv.gz", etc. You can leave the prefix empty ("") to save the files under their original names.
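For example, to download without a prefix, simply omit the last argument (it defaults to an empty string):

download_files0(apikey_, product_path_, "E:/temp")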

The second approach to download files is to pass files_df:

download_files(files_df, "E:/temp", "advan_mp_")

Sometimes the download may stop or fail partway through. If you want to resume from specific files, you can pass a slice of files_df:

download_files(files_df.iloc[4:7, :], "E:/temp", "advan_mp_")

You can slice files_df to specific data periods for the Advan monthly and weekly patterns. Advan file links include a date folder such as "2023-08-21".

The slice_file_list_advan function slices the file list based on those dates, so you can download data for specific periods, as in the example below. This assumes your Advan weekly patterns product path is stored in product_path_ and you want to download the weeks of "2023-08-07" and "2023-08-21".

files_df = get_file_list(apikey_, product_path_)
dates_str = ["2023-08-07", "2023-08-21"]
weeks_files_df = slice_file_list_advan(files_df, dates_str)

download_files(weeks_files_df, "E:/temp", "partial_advan_wp_")

You can use this for Advan monthly patterns as well.
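For example, a minimal sketch for monthly patterns, assuming the monthly date folders look like "2023-07-01" (check your own file links for the exact format):

files_df = get_file_list(apikey_, product_path_)
month_files_df = slice_file_list_advan(files_df, ["2023-07-01"])

download_files(month_files_df, "E:/temp", "partial_advan_mp_")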

Donn


Hello, I'm a member of Dewey. I used the code above to access the Advan Weekly Patterns, but I only get <Response [401]>. Do you know how to fix it?

Are you calling get_file_list?
One possibility is that your API key (the apikey parameter) is incorrect.

Yes, I'm calling get_file_list.
Yes, I'm doing exactly the same thing as above.
I still get the error. Why?

Can you share your code? You can send me a message instead of posting here.

Hi @Ryan_Zhenqi_Zhou_SUNY_Buffalo , not sure if you and @InvTech resolved this already, but check out this post – it seems another user had 401 errors caused by not saving the generated API key

Per @Ryan_Kruse_MN_State’s comment,

Would you please copy the API key and then hit the Save button before use and try again?

Let me know if it works.

Thank you so much, Dongshin and Ryan, it works.


Hi,

I'm downloading the Advan Weekly Patterns data on the new platform, but I only see patterns_weekly_csv_gz. I'm wondering where the other files are, such as "visit_panel_summary.csv", "normalization_stats.csv", and "home_panel_summary.csv". The old platform used to have them all.

Hi @Ryan_Zhenqi_Zhou_SUNY_Buffalo , I found the visit panel summary, normalization stats, and home panel summary on other product pages on the Dewey site. They were linked at the bottom of the Weekly Patterns page in the “Related” section.

Does that give you what you need?

Is there a recommended way for downloading incremental changes from one release to the next? For example, the Weekly Patterns - Foot Traffic is 1.6TB and refreshes every Wednesday. After an initial download of the 1.6TB data, what is the best way to fetch just the newest data? Are the file suffixes (e.g., 03_6_5) stable between data releases, such that the presence of a 03_6_5 would allow me to safely skip downloading this file again after the data refreshes?

Any guidance appreciated!

Hi @dhoconno. I updated the code in response to your question. Please download the entire Python code again and read the last part of the tutorial.

Hope this helps.
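For example, a minimal sketch of fetching only the newest release with the functions above, assuming the newest weekly date folder is named "2023-08-28":

files_df = get_file_list(apikey_, product_path_)
new_week_df = slice_file_list_advan(files_df, ["2023-08-28"])

download_files(new_week_df, "E:/temp", "advan_wp_")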

Thanks - this is exactly what I needed to avoid hammering the download server.

I have one more related question – I’m looking at the Weekly Patterns - Foot Traffic dataset. When I filter this on a specific site, I get weekly data through 2019 and then sparse data after that. I thought my download of the dataset was complete and I think my filtering is correct. Before I spend a lot of time trying to troubleshoot, is there supposed to be weekly data for every week between 2019 and 2023 in this dataset? Or is the more recent data supposed to be more sparse?

Thanks (an example of the dates from a single site is shown below)

10/14/19 12:00 AM
10/21/19 12:00 AM
11/4/19 12:00 AM
11/18/19 12:00 AM
11/25/19 12:00 AM
12/2/19 12:00 AM
12/9/19 12:00 AM
12/23/19 12:00 AM
12/30/19 12:00 AM
1/6/20 12:00 AM
2/3/20 12:00 AM
2/17/20 12:00 AM
2/24/20 12:00 AM
5/25/20 12:00 AM
6/1/20 12:00 AM
6/29/20 12:00 AM
7/20/20 12:00 AM
7/27/20 12:00 AM
8/10/20 12:00 AM
8/31/20 12:00 AM
10/26/20 12:00 AM
11/16/20 12:00 AM
1/4/21 12:00 AM
4/5/21 12:00 AM
4/12/21 12:00 AM
7/26/21 12:00 AM
9/20/21 12:00 AM
12/20/21 12:00 AM
3/7/22 12:00 AM
4/4/22 12:00 AM
4/18/22 12:00 AM
5/16/22 12:00 AM
4/3/23 12:00 AM
5/29/23 12:00 AM
8/7/23 12:00 AM