Bulk Downloading Data Using the v3 API with Python

Here are some examples of how to bulk download a data product using Python.

1. Create API Key

In the system, click Connections → Add Connection to create your API key.

As the message says, make a copy of your API key and store it somewhere safe. Also, be sure to hit the Save button before using the key.
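If you prefer not to paste the key directly into your scripts, one option is to read it from an environment variable instead. This is only a sketch; the variable name DEWEY_API_KEY is an arbitrary example, not something the platform requires.

# minimal sketch: read the API key from an environment variable
# (DEWEY_API_KEY is an arbitrary example name, not required by the platform)
import os

API_KEY = os.environ.get("DEWEY_API_KEY")
if not API_KEY:
    raise RuntimeError("Please set the DEWEY_API_KEY environment variable first.")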

2. Get a product path

Choose your product and click Get / Subscribe → Connect to API to get the API endpoint (product path). Make a copy of it.

3. Call the API

Example 1 - Download first page of files (up to 1000 files)

# import requests library to call API endpoint
import requests

# set key and API endpoint variables
API_KEY = '<key generated by step 1>'
PRODUCT_API_PATH = '<URL generated by step 2>'

# get results from API endpoint, using API key for authentication
results = requests.get(url=PRODUCT_API_PATH,
                       params={'page': 1}, # only getting 1st page of results
                       headers={'X-API-KEY': API_KEY,
                                'accept': 'application/json'
                               })

# loop through download links and save to your computer
for i, link_data in enumerate(results.json()["download_links"]):
    print(f"Downloading file {i}...")
    data = requests.get(link_data["link"])
    with open(link_data["file_name"], 'wb') as file:
        file.write(data.content)
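If the key or endpoint is wrong, results.json() will not contain download_links and the loop above fails with an unhelpful error. An optional check placed right after the requests.get call makes that failure explicit; this is just a sketch using requests' built-in status check:

# optional: raise a clear HTTP error (e.g. 401/403) instead of failing later
# on a missing 'download_links' key
results.raise_for_status()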

Example 2 - Download all files

Filtering Parameters:
page: Page number (the API returns a maximum of 1,000 files per page); defaults to 1.

# import requests library to call API endpoint
import requests

# set key and API endpoint variables
API_KEY = '<key generated by step 1>'
PRODUCT_API_PATH = '<URL generated by step 2>'

# loop through all API result pages, keeping track of number of downloaded files
page = 1
download_count = 0
while True:
    # get results from API endpoint, using API key for authentication, for a specific page
    results = requests.get(url=PRODUCT_API_PATH,
                           params={'page': page},
                           headers={'X-API-KEY': API_KEY,
                                    'accept': 'application/json'
                                   })
    response_json = results.json()

    # for each result page, loop through download links and save to your computer
    for link_data in response_json['download_links']:
        print(f"Downloading file {link_data['file_name']}...")
        data = requests.get(link_data['link'])
        with open(link_data['file_name'], 'wb') as file:
            file.write(data.content)
        download_count += 1

    # only continue if there are more result pages to process
    total_pages = response_json['total_pages']
    if page >= total_pages:
        break
    page += 1

print(f"Successfully downloaded {download_count} files.")

Example 3 - Download specific files using date filters

Filtering Parameters:
partition_key_after: Retrieve file links for data at or after the specified date in YYYY-MM-DD format
partition_key_before: Retrieve file links for data at or before the specified date in YYYY-MM-DD format

# import requests library to call API endpoint
import requests

# set key and API endpoint variables
API_KEY = '<key generated by step 1>'
PRODUCT_API_PATH = '<URL generated by step 2>'

# loop through all API result pages, keeping track of number of downloaded files
page = 1
download_count = 0
while True:
    # get results from API endpoint, using API key for authentication
    results = requests.get(url=PRODUCT_API_PATH,
                           params={'page': page,
                                   'partition_key_after': 'YYYY-MM-DD',   # optionally set date value here
                                   'partition_key_before': 'YYYY-MM-DD'}, # optionally set date value here
                           headers={'X-API-KEY': API_KEY,
                                    'accept': 'application/json'
                                   })
    response_json = results.json()

    # for each result page, loop through download links and save to your computer
    for link_data in response_json['download_links']:
        print(f"Downloading file {link_data['file_name']}...")
        data = requests.get(link_data['link'])
        with open(link_data['file_name'], 'wb') as file:
            file.write(data.content)
        download_count += 1

    # only continue if there are more result pages to process
    total_pages = response_json['total_pages']
    if page >= total_pages:
        break
    page += 1

print(f"Successfully downloaded {download_count} files.")

Thank you for sharing!
I have a question about setting the params for a specific date range. If I want to retrieve the whole month of August, should I set the start and end dates like this: 'partition_key_after': '2023-07-31', 'partition_key_before': '2023-09-01', or on the exact dates: 'partition_key_after': '2023-08-01', 'partition_key_before': '2023-08-30'?

The dates are inclusive, so you would use '2023-08-01' to '2023-08-31'.
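In terms of the Example 3 parameters, a full August 2023 would therefore look like this:

# both bounds are inclusive, so this covers all of August 2023
params = {'page': 1,
          'partition_key_after': '2023-08-01',
          'partition_key_before': '2023-08-31'}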


The filename, e.g.:
'file_name': 'Weekly_Patterns_Foot_Traffic-0-DATE_RANGE_START-2022-09-19.csv.gz',
will change in every download session. Is there a way to keep the file name the same, so that users do not need to download the files again and again because they are not sure which file is the same as one they previously downloaded?

Thanks!

Dear all,

Newbie here! I was able to download the first page (1,000 files) of Advan monthly patterns following this page, but when I moved on to pages 2 and 3, the downloads stalled (no files were downloaded after hours). I was running the code on a high-performance computing server and didn't receive errors, but the download process goes nowhere after the first page. Are there any tricks or suggestions?

Thank you so much!

Related Python access question: given the size of the data dump, I plan to convert .gz to .csv on a server using the following code.

import gzip
import csv
import os

def convert_csv_gz_to_csv(input_csv_gz_filename, output_csv_filename):
    try:
        # read the compressed file as text and copy it row by row into a plain .csv
        with gzip.open(input_csv_gz_filename, 'rt') as gz_file:
            csv_reader = csv.reader(gz_file)
            with open(output_csv_filename, 'w', newline='') as csv_file:
                csv_writer = csv.writer(csv_file)
                for row in csv_reader:
                    csv_writer.writerow(row)
        print(f'Conversion completed: {input_csv_gz_filename} to {output_csv_filename}')
    except Exception as e:
        print(f'Error: {str(e)}')

def convert_all_csv_gz_to_csv(input_dir, output_dir):
    for filename in os.listdir(input_dir):
        if filename.endswith('.csv.gz'):
            input_csv_gz_filename = os.path.join(input_dir, filename)
            output_csv_filename = os.path.join(output_dir, filename.replace('.csv.gz', '.csv'))
            convert_csv_gz_to_csv(input_csv_gz_filename, output_csv_filename)

# Input and output directories (raw strings so backslashes are not treated as escapes)
input_dir = r'E:\PROJECTS\SGtest\GZ'   # the directory containing your .csv.gz files
output_dir = r'E:\PROJECTS\SGtest\CSV' # the desired output directory for .csv files

# Convert all .csv.gz files to .csv
convert_all_csv_gz_to_csv(input_dir, output_dir)

I tested the code off the server on three files, which seems to work (but very slowly), but when running it on my server I received warnings (see the snapshot below).

Some files still get converted, but their size is much smaller than what the 7-Zip file manager produces, so I'd appreciate suggestions on adjusting the code to better accommodate Dewey file sizes. Thanks!

Thanks @xiaoluwang - we’re looking into this.

One suggestion for now is to look at the R notebook for examples of starting on specific pages and resuming after an error. You may find helpful ideas there for adjusting your code.
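For a Python version of the same idea, here is a rough sketch that starts at a chosen page and retries a page a few times before giving up. The start_page and max_retries values are arbitrary illustrations, and API_KEY and PRODUCT_API_PATH are assumed to be set as in the examples above:

# sketch: resume from a specific page and retry failed requests
# (assumes API_KEY and PRODUCT_API_PATH are set as in the examples above)
import time
import requests

start_page = 2    # resume from this page; arbitrary example value
max_retries = 3   # arbitrary retry budget per page

page = start_page
while True:
    for attempt in range(max_retries):
        try:
            results = requests.get(url=PRODUCT_API_PATH,
                                   params={'page': page},
                                   headers={'X-API-KEY': API_KEY,
                                            'accept': 'application/json'},
                                   timeout=60)
            results.raise_for_status()
            break
        except requests.RequestException as e:
            print(f"Page {page} failed (attempt {attempt + 1}): {e}")
            time.sleep(10)  # short pause before retrying
    else:
        raise RuntimeError(f"Giving up on page {page} after {max_retries} attempts")

    response_json = results.json()
    for link_data in response_json['download_links']:
        print(f"Downloading file {link_data['file_name']}...")
        data = requests.get(link_data['link'], timeout=300)
        data.raise_for_status()
        with open(link_data['file_name'], 'wb') as file:
            file.write(data.content)

    if page >= response_json['total_pages']:
        break
    page += 1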

Thanks to @evan-barry-dewey for checking on this! I tried again this evening, and after more than an hour of waiting without any downloads, the API access worked again, and I'm now on track to bulk download the 2nd page. So hopefully this was a temporary API access issue. Meanwhile, it would be great to get your (team's) help with tips on large-scale Python gz → csv conversion. I guess some encoding format or restrictions may need to be specified, but without knowing all the details of the compression process I'm not able to work that out properly. 🙂

@xiaoluwang - glad the API is working again for now - not sure if there was any network restriction involved. One note: the download links are only valid for 24 hours, after which you'll need to repull them from the API endpoint, so that could also cause issues if you were running things for a long time.

RE: the gz → csv, reading all the files row by row and writing them out again, as you are, will likely be too slow. You could try using the gzip command-line tool to do it faster if needed (example in this StackOverflow post).
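For example, something along these lines could drive the gzip tool from Python. This is only a sketch: it assumes gzip (with the -k flag) is available on your PATH, and it writes each .csv next to its .gz rather than to a separate output directory.

# sketch: call the gzip command-line tool instead of re-parsing CSV rows
# (-d decompresses, -k keeps the original .gz; requires gzip on your PATH)
import os
import subprocess

input_dir = r'E:\PROJECTS\SGtest\GZ'  # same example directory as in the script above

for filename in os.listdir(input_dir):
    if filename.endswith('.csv.gz'):
        subprocess.run(['gzip', '-dk', os.path.join(input_dir, filename)], check=True)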

However, what program are you using to process the data after unzipping it? If you are using pandas or another Python library, it can usually read the .gz files directly without needing to unzip them.
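For instance, pandas infers the compression from the file extension, so a .csv.gz can be read without a separate conversion step (the file name below is just the example from earlier in the thread):

# pandas detects the .gz extension and decompresses on the fly
import pandas as pd

df = pd.read_csv('Weekly_Patterns_Foot_Traffic-0-DATE_RANGE_START-2022-09-19.csv.gz')
print(df.shape)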

Hi Ameya (@ameya),

I followed your suggestion and used the gzip package. I wrote a .py script to unzip all the .gz files in one folder to .csv in another folder, and it worked for the POI data, but I encountered issues when I tried the same script on the POI * Geometry data. The warnings are below and I have no clue what is wrong, as the script appears to work for some files but not others (I haven't tested it on the monthly patterns data given its size, but wanted to use these smaller files to test things out). FWIW, I need to convert at least a large sample to CSV because my coauthor needs a copy and she prefers not to use Python/R. For analyses, I also prefer having flat input files as safe copies.

Unfortunately, given the size of the files here, we would recommend using an analysis method that can read the compressed files directly.
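If memory is the concern for the larger files, one option is to read the compressed file in chunks rather than all at once. A sketch with pandas follows; the file name is hypothetical and the chunk size is an arbitrary choice:

# sketch: process a large .csv.gz in chunks instead of loading it all into memory
import pandas as pd

row_count = 0
for chunk in pd.read_csv('monthly_patterns_example.csv.gz', chunksize=100_000):
    # replace this with your actual per-chunk processing
    row_count += len(chunk)
print(f"Processed {row_count} rows.")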

Hi @ameya - thanks for the note, but can you hint at what the issue with the problem above is? I sense it may be something with the data structure rather than the data size. Also, in another parallel task, I think the encoding is UTF-8 (a question I asked before but didn't get answered directly), which is helpful to know for certain cases.

Even when directly reading into R, there are certain columns that get warnings (for certain files), but as I currently process them in large chuck it’s difficult if some file-specific input issues need to be inspected individually. For example, in the above case, I can’t tell which is the file that triggered the problem (which is not encountered for other Dewey data I processed, which I sense is a problem about input data that won’t be resolved whether I read gz directly or not). Thanks for helping out!