Starting from November 12, 2023, the deweydatapy Python library is available to install from GitHub, and this tutorial is deprecated. Please find the new tutorial here (Bulk Data Downloading in Python (API v3) - Help / Python - Dewey Community (deweydata.io)).
The Dewey API was upgraded from v2 to v3. This tutorial reflects the v3 API with additional convenience functions, and I tried to keep it as close to the v2 tutorial as possible. Generic v3 API Python example code can be found here. The R version of this tutorial is available here.
1. Create API Key
In the system, click Connections → Add Connection to create your API key.
As the message says, please make a copy of your API key and store it somewhere safe. Also, please hit the Save button before use.
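If you prefer not to paste the key directly into scripts, one option is to store it in an environment variable and read it at run time. Here is a minimal sketch, assuming a hypothetical variable name DEWEY_API_KEY:

import os

# DEWEY_API_KEY is a hypothetical environment variable name; set it to the key you copied above
apikey_ = os.environ.get("DEWEY_API_KEY")
if apikey_ is None:
    raise RuntimeError("DEWEY_API_KEY environment variable is not set.")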
2. Get a product path
Choose your product and click Get / Subscribe → Connect to API; you can then get the API endpoint (product path). Make a copy of it. You will notice that the path now includes “v3” instead of “v2”.
3. Python Code
Here is the Python code for file downloads (for API v3).
import requests
import pandas as pd
from datetime import datetime
import sys
import os
import gzip
from io import BytesIO
def make_api_endpoint(path):
    # Build the full v3 files endpoint if only a product path/ID is given
    if not path.startswith("https://"):
        api_endpoint = f"https://app.deweydata.io/external-api/v3/products/{path}/files"
        return api_endpoint
    else:
        return path
def get_file_list(apikey, product_path, start_page=1, end_page=float('inf'), print_info=True):
    product_path = make_api_endpoint(product_path)

    data_meta = None
    page_meta = None
    files_df = None

    page = start_page
    while True:
        try:
            response = requests.get(url=product_path,
                                    params={'page': page},
                                    headers={'X-API-KEY': apikey,
                                             'accept': 'application/json'})
        except Exception as e:
            print("Error in requests.get")
            print(e)
            print(" ")
            return None

        if response is None:
            return None
        elif response.status_code == 401:
            print(response)
            return None

        res_json = response.json()

        if res_json['page'] == start_page:
            data_meta = pd.DataFrame({
                'total_files': [res_json['total_files']],
                'total_pages': [res_json['total_pages']],
                'total_size': [res_json['total_size'] / 1000000],
                'expires_at': [res_json['expires_at']]
            })

        print(f"Collecting files information for page {res_json['page']}/{res_json['total_pages']}...")

        page_meta = pd.concat([
            page_meta,
            pd.DataFrame({
                'page': [res_json['page']],
                'number_of_files_for_page': [res_json['number_of_files_for_page']],
                'avg_file_size_for_page': [res_json['avg_file_size_for_page'] / 1000000],
                'partition_column': [res_json['partition_column']]
            })], ignore_index=True)

        page_files_df = pd.DataFrame(res_json['download_links'])
        page_files_df.insert(loc=0, column='page', value=res_json['page'])

        files_df = pd.concat([files_df, page_files_df], ignore_index=True)

        page = res_json['page'] + 1
        sys.stdout.flush()

        if page > res_json['total_pages'] or page > end_page:
            print("Files information collection completed.")
            sys.stdout.flush()
            break

    # Backward compatibility
    files_df['download_link'] = files_df['link']
    # Attach index
    files_df.insert(loc=0, column='index', value=range(0, files_df.shape[0]))

    if print_info:
        print("\nFiles information summary ---------------------------------------")
        print(f"Total number of pages: {data_meta['total_pages'].values[0]}")
        print(f"Total number of files: {data_meta['total_files'].values[0]}")
        print(f"Total files size (MB): {round(data_meta['total_size'].values[0], 2)}")
        print(f"Average single file size (MB): {round(page_meta['avg_file_size_for_page'].mean(), 2)}")
        print(f"Date partition column: {page_meta['partition_column'].values[0]}")
        print(f"Expires at: {data_meta['expires_at'].values[0]}")
        print("-----------------------------------------------------------------\n")
        sys.stdout.flush()

    return files_df
def read_sample_data(url, nrows=100):
    # Create a response object from the URL
    response = requests.get(url)
    try:
        # Most files are gzip-compressed CSVs
        df = pd.read_csv(BytesIO(response.content), compression="gzip", nrows=nrows)
    except gzip.BadGzipFile:  # not a gzip file; try plain CSV
        df = pd.read_csv(BytesIO(response.content), nrows=nrows)
    except Exception:
        print("Could not read the data. Can only open gzip csv file or csv file.")
        return None

    return df
# Read first file data into memory
def read_sample_data0(apikey, product_path, nrows=100):
    files_df = get_file_list(apikey, product_path, start_page=1, end_page=1, print_info=True)
    print(" ")

    if files_df is not None and files_df.shape[0] > 0:
        return read_sample_data(files_df["link"][0], nrows)
    else:
        return None
def read_local_data(path, nrows=None):
    df = pd.read_csv(path, nrows=nrows)
    return df
# Download files from file list to a destination folder
def download_files(files_df, dest_folder, filename_prefix="", skip_exists=True):
    if filename_prefix is None:
        filename_prefix = ""

    dest_folder = dest_folder.replace("\\", "/")
    if not dest_folder.endswith("/"):
        dest_folder = dest_folder + "/"

    files_df = files_df.reset_index(drop=True)

    # number of files
    num_files = files_df.shape[0]
    for i in range(0, num_files):
        print(f"Downloading {i + 1}/{num_files} (file index = {files_df['index'][i]})")

        file_name = filename_prefix + files_df['file_name'][i]
        dest_path = dest_folder + file_name

        if os.path.exists(dest_path) and skip_exists:
            print(f"File {dest_path} already exists. Skipping...")
            continue

        print(f"Writing {dest_path}")
        print("Please be patient. It may take a while...")
        sys.stdout.flush()

        response = requests.get(files_df['link'][i])
        with open(dest_path, 'wb') as f:
            f.write(response.content)

        print(" ")
        sys.stdout.flush()
def download_files0(apikey, product_path, dest_folder, filename_prefix=None):
    files_df = get_file_list(apikey, product_path, print_info=True)
    print(" ")

    if files_df is not None and len(files_df) > 0:
        download_files(files_df, dest_folder, filename_prefix)
# Slice files_df (from get_file_list) by date range on partition_key
def slice_files_df(files_df, start_date, end_date=None):
    if end_date is None:
        sliced_df = files_df[start_date <= files_df['partition_key']]
    else:
        sliced_df = files_df[(start_date <= files_df['partition_key']) &
                             (files_df['partition_key'] <= end_date)]

    return sliced_df
It has the following functions:

get_file_list: gets the list of files as a DataFrame
read_sample_data: reads a sample of the data from a file download URL
read_sample_data0: reads a sample of the data for the first file, given an API key and product path
read_local_data: reads data from a locally saved csv.gz file
download_files: downloads files from the file list to a destination folder
download_files0: downloads files to a destination folder, given an API key and product path
slice_files_df: slices files_df (retrieved by get_file_list) by date range
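Note that the product path argument in these functions can be either the full v3 endpoint URL you copied in step 2 or just the product path/ID portion of it; make_api_endpoint expands a bare path into the full URL. A quick illustration with a placeholder ID:

# "xxxx-xxxx" is a placeholder product ID, for illustration only
print(make_api_endpoint("xxxx-xxxx"))
# -> https://app.deweydata.io/external-api/v3/products/xxxx-xxxx/files
print(make_api_endpoint("https://app.deweydata.io/external-api/v3/products/xxxx-xxxx/files"))
# -> returned unchanged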
4. Examples
I am going to use the Advan weekly patterns data as an example.
# API Key
apikey_ = "Paste your API key from step 1 here."
# Advan product path
product_path_= "Paste product path from step 2 here."
You will only have one API key, but a different product path for each product.
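For example, you might keep the single key and one path variable per subscribed product (the variable names and values below are placeholders):

# A single API key works across products; only the product path changes (placeholder names)
product_path_advan_wp = "Paste the Advan weekly patterns product path here."
product_path_another  = "Paste another product's path here."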
You can now see the list of files to download by
files_df = get_file_list(apikey_, product_path_, print_info = True);
files_df
Setting print_info = True prints the meta information of the files, like below:
The Advan weekly patterns data has a total of 8,848 files over 9 pages, a total size of 1.8 TB, and an average file size of 206.81 MB.
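Based on the print statements in get_file_list, the summary looks roughly like this (illustrative: the figures come from the totals just mentioned, the partition column is assumed to be DATE_RANGE_START for this product, and your expiration timestamp will differ):

Files information summary ---------------------------------------
Total number of pages: 9
Total number of files: 8848
Total files size (MB): 1800000.0
Average single file size (MB): 206.81
Date partition column: DATE_RANGE_START
Expires at: ...
-----------------------------------------------------------------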
API v3 introduced the “page” concept: files are delivered over multiple pages, each containing about 1,000 files. So, if the data has 8,848 files, there will be 8 pages with 1,000 files each and a 9th page with 848 files. Thus, if you want to collect the files on pages 2 and 3, you can run
files_df = get_file_list(apikey_, product_path_,
start_page = 2, end_page = 3, print_info = True);
Also, you can do this to get the files from page 8 through the last page:
files_df = get_file_list(apikey_, product_path_,
start_page = 8, print_info = True);
files_df includes a file list (DataFrame) like below:
The DataFrame has index (file index, ranging from 0 to the number of files minus one), page, link (the file download URL), partition_key (to subselect files based on dates), file_name, and download_link, which is the same as link (download_link is kept for consistency with the v2 tutorial).
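You can check the columns and peek at the first few rows yourself:

print(files_df.columns.tolist())
files_df.head()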
You can quickly load and inspect sample data with
sample_data = read_sample_data(files_df['link'][0], nrows = 100)
This loads the first 100 rows of the first file in files_df (files_df['link'][0]). You can do the same for any file in the list.
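For example, to sample an arbitrary file from the list (index 10 here is just an illustration):

# Sample the first 100 rows of the file in row 10 of files_df; any valid row works
sample_data_10 = read_sample_data(files_df['link'][10], nrows = 100)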
If you want to see the first n rows of the first file without calling get_file_list yourself, you can use
sample_data = read_sample_data0(apikey_, product_path_, nrows = 100);
This will load the first 100 rows for the first file of Advan data.
Now it’s time to download data to your local drive. First, you can download all the files by
download_files0(apikey_, product_path_, "E:/temp", "advan_wp_")
The third parameter is your destination folder (“E:/temp”), and the last parameter (“advan_wp_”) is the filename prefix. So, all the files will be saved as “advan_wp_xxxxxxx.csv.gz”, etc. You can leave this empty (“”) or pass None to have no prefix.
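For example, to download everything without a filename prefix, simply omit the last argument:

# No filename prefix; files keep their original names
download_files0(apikey_, product_path_, "E:/temp")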
The second approach to downloading files is to pass files_df directly:
download_files(files_df, "E:/temp", "advan_wp_")
This will show the progress of your file download like below:
If some of the files are already downloaded and you want to skip them, call download_files(files_df, "E:/temp", "advan_wp_", skip_exists = True) (note that skip_exists = True is the default).
Sometimes the download may stop or fail in the middle for any reason. If you want to resume from the last failure, you can pass a slice of files_df. For example, if the progress showed file index = 2608 when the process failed, you can resume from that file with
download_files(files_df[files_df['index']>=2608], "E:/temp", "advan_wp_")
Also, you may want to download only the incremental files, not all the files from the beginning. In that case, you can slice the file list by date range. For example, to get the file list that falls between 2023-09-01 and 2023-09-10:
sliced_files_df = slice_files_df(files_df, "2023-09-01", "2023-09-10")
and to get all files from 2023-09-01 onward:
sliced_files_df = slice_files_df(files_df, "2023-09-01")
and then run
download_files(sliced_files_df, "E:/temp2")
You can quickly open a downloaded local file by
sample_local = read_local_data("E:/temp2/advan_wp_Weekly_Patterns_Foot_Traffic-0-DATE_RANGE_START-2019-01-07.csv.gz",
nrows = 100)
Thanks