Reading In a Portion of Advan Data

A1 · July 24, 2023, 5:36pm

Hi,

When I try to download a sample dataset from Advan (weekly patterns, to be exact), the dataset is so big that I can’t open it. Is there a way to read only the first 100 rows or so, so that I can see what the data looks like and practice with it?

Any step-by-step instructions on how to download the data properly, plus sample code for viewing a portion of the sample data, would help!

evan-barry-dewey · July 24, 2023, 6:06pm

We have this tutorial for taking a quick look at the data without downloading it.

We’re also launching a large upgrade to our product in a few weeks that will allow you to see sample rows and all of the column attributes for every dataset on the platform without downloading the data.

For now, you can find the Advan attributes here: Documentation (Public) - Google Drive

InvTech · July 27, 2023, 11:52pm

This will help you use Python code directly in R without needing to convert it to R code:

Donn.

A1 · August 11, 2023, 5:51pm

Okay, thanks. So all I need to do is copy this code exactly into Python, changing only the URL?

InvTech · August 15, 2023, 9:39am

Yes, that should work.

However, the current function reads an entire csv file (.gz) into your computer’s memory, and a single file can be ~280MB. I think that should be okay, but if you really want to read only the first 100 or so rows, add the nrows = 100 parameter to the pd.read_csv calls, as below.

import gzip
from io import BytesIO

import pandas as pd
import requests

# DEWEY_MP_ROOT (the API root URL) is assumed to be defined already, as in the original code.

# Directly read csv data from gz file (first 100 rows only)
def read_data(token, file_url, timeout = 300):
    # Example arguments:
    # token = tkn
    # file_url = "/api/data/v2/data/2022/12/26/ADVAN/WP/20221226-advan_wp_pat_part99_0"
    # timeout = 300
    src_url = DEWEY_MP_ROOT + file_url
    response = requests.get(src_url, headers={"Authorization": "Bearer " + token}, timeout=timeout)
    csv_df = None  # stays None if the file cannot be parsed
    try:
        csv_df = pd.read_csv(BytesIO(response.content), nrows = 100, compression="gzip")
    except gzip.BadGzipFile: # not a gzip file; try plain csv
        csv_df = pd.read_csv(BytesIO(response.content), nrows = 100)
    except Exception:
        print("Could not read the data. Can only open gzip csv file or csv file.")
    return csv_df
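
One caveat with the function above: requests.get without stream=True still downloads the whole file before pd.read_csv sees it, so nrows = 100 limits only how much is parsed. If the download itself is the concern, a streaming variant along the following lines might help. This is a minimal sketch, not part of the tutorial: read_head and read_head_from_url are hypothetical names, src_url stands for the full URL (DEWEY_MP_ROOT + file_url), and it assumes the server returns the .gz bytes unchanged, so there is no fallback to plain csv.

```python
import gzip
from io import BytesIO

import pandas as pd
import requests


def read_head(stream, nrows=100):
    # Parse only the first `nrows` rows from a gzip-compressed csv byte stream
    return pd.read_csv(stream, compression="gzip", nrows=nrows)


def read_head_from_url(token, src_url, nrows=100, timeout=300):
    # stream=True defers the body download; pandas then pulls bytes from
    # response.raw as it needs them, and the connection is closed on exit
    headers = {"Authorization": "Bearer " + token}
    with requests.get(src_url, headers=headers, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return read_head(response.raw, nrows=nrows)


# Offline check with a small in-memory gzip csv (made-up data, not Advan data)
sample_csv = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(500))
sample_df = read_head(BytesIO(gzip.compress(sample_csv.encode("utf-8"))), nrows=100)
print(sample_df.shape)  # (100, 2)
```

With a valid token, the call would look like read_head_from_url(tkn, DEWEY_MP_ROOT + file_url), and sample_df.head() or sample_df.columns is then enough to see how the data looks.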