How to extract specific POIs from historical Patterns on Dewey?

I am a long-time watcher of SafeGraph data but a first-time user. It seems that every time I got close to using the data, the process for accessing it changed; this happened again this fall with the transition to Dewey.

I’m trying to figure out if there are any tricks to subsetting the historical Patterns data to a small list of known POIs (either by address or lat/long). From reading other posts, it seems there is a Placekey API that may be able to return that identifier. Can anyone confirm whether I would need to download the files for every month I want to include in my analysis and subset each one to the same list of POIs?

The old SafeGraph Shop had ways to subset records in various ways AND specify the range of months that you wanted to download. Unfortunately, not knowing about all the impending changes to their data and access, I didn’t download any data while this was easier.

Now that the data is only available on Dewey, it seems that Dewey needs to provide new documentation about best practices for getting to the data you need. So far, I’ve only seen exhortations to “only download data you plan to use,” which isn’t terribly helpful if it isn’t clear how to limit or subset data BEFORE downloading (if that is even possible).

Thank you for humoring what may be a basic question from a novice data scientist.

Hi @evan_nielsen! Glad you’re getting the chance to work with the data. Don’t worry, I’ll help you get up and running. I’ve built a few tools to help users get around some of the download constraints.

Currently, the SafeGraph data available via Dewey must be downloaded as a whole, but I’ve built a workaround in the form of a Google Colab notebook. The tool takes advantage of the Dewey API and lets the user filter and otherwise process the data in a notebook cloud environment prior to downloading locally. It’s a very user-friendly setup in which you’ll only need to change a few lines of code to get what you want.

Additionally, there is an analogous tool in R that was shared by another user. The code from both the Python (Google Colab) and R tools can be used in other environments as desired.

Please let me know how either of these tools works for you. I’ll be happy to help troubleshoot any issues that come up! Your feedback is also always welcome as we build more tools to improve the user experience.

Hi @ryank ,

Thanks so much for the pointer (I had actually found that Notebook through some searching too, so it is out there). I was finally able to spend some time getting it to work, and it did! I only have a little experience with Python, so I had to do some googling to find how to do what I needed to do in the processing function. (I created a list of about 30 placekeys and filtered each file to only records matching those ids.) So my only suggestion would be to add some example code for some different types of processing one might want to do: filtering to a known list, like I was; filtering to types of POIs; limiting to a certain geographic extent; etc.
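A minimal pandas sketch of those three kinds of filtering might look like the following. The function names are hypothetical and not part of the notebook, and the column names (placekey, top_category, latitude, longitude) are assumptions about the Patterns schema that should be checked against the actual file headers. The sample rows are made up purely for illustration.

```python
import pandas as pd


def filter_by_placekeys(df, placekeys):
    """Keep only rows whose placekey appears in a known list."""
    return df[df["placekey"].isin(list(placekeys))]


def filter_by_category(df, categories):
    """Keep only rows whose top_category is one of the given POI types."""
    return df[df["top_category"].isin(list(categories))]


def filter_by_bounding_box(df, min_lat, max_lat, min_lon, max_lon):
    """Keep only rows whose coordinates fall inside a lat/long box (inclusive)."""
    in_lat = df["latitude"].between(min_lat, max_lat)
    in_lon = df["longitude"].between(min_lon, max_lon)
    return df[in_lat & in_lon]


# Made-up rows standing in for one downloaded Patterns file (assumed columns).
sample = pd.DataFrame(
    {
        "placekey": [
            "222-222@abc-def-ghi",
            "223-222@abc-def-ghj",
            "224-222@abc-def-ghk",
        ],
        "top_category": [
            "Restaurants and Other Eating Places",
            "Grocery Stores",
            "Restaurants and Other Eating Places",
        ],
        "latitude": [40.0, 41.5, 45.0],
        "longitude": [-75.0, -74.0, -93.0],
    }
)

known = filter_by_placekeys(sample, ["222-222@abc-def-ghi"])
restaurants = filter_by_category(sample, ["Restaurants and Other Eating Places"])
nearby = filter_by_bounding_box(sample, 39.0, 42.0, -76.0, -73.0)
```

Any of these could be called from inside the notebook's processing function so that each file is reduced before it is saved.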

I also appreciated the experience of using Google Colab for the first time. It is really amazing to be able to filter files and records BEFORE downloading anything to my local machine.

That’s great to hear, @evan_nielsen!

Thanks for the feedback. I’ll be sure to include a few examples in the next release of the tool to help users more easily generalize to their needs.

If anything else comes up, don’t hesitate to reach out!

Two more things:

  1. After appending all of my data and reviewing, I’ve realized that among the records I was filtering for, the minority that had placekey formats of xxx@aaa-bbb-ccc did not return any records (while all of the placekeys with xxx-yyy@aaa-bbb-ccc format did). I used the placekey generator at placekey.io, and the three-character “what” is what it gave me for those few POIs. If you are familiar with SafeGraph data, could you clarify if this is expected? Or if not, can you point me to someone at SafeGraph to ask?

  2. One more suggestion for your tool: since many of the files are split into parts, it would be nice to append all of those pieces together after filtering to the recordset of interest. I used the following:

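import glob
import os

import pandas as pd

# Work in the Google Drive folder that holds the filtered CSV parts.
os.chdir("drive/MyDrive/SafeGraphData/")
output_name = "combined_monthly_patterns.csv"
# Collect every CSV part, skipping the combined file from any previous run.
all_filenames = sorted(f for f in glob.glob("*.csv") if f != output_name)
# Stack the parts into one DataFrame and write a single combined file.
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
combined_csv.to_csv(output_name, index=False, encoding="utf-8-sig")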

@ryank, an update on question #1 above: I extracted every POI from a single ZIP code within my study area (more than 300 POIs), and every one had xxx-yyy before the at-sign. That leads me to believe that if the placekey generator only gives you xxx@aaa-bbb-ccc, that POI is not going to show up in the SafeGraph data. Their documentation about placekeys indicates that the second half of the “what” identifies a specific POI, as opposed to just an address. I assume that is helpful when a new POI (e.g., a business or restaurant) occupies a location where a different POI used to be.
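A quick way to spot which keys in a list lack the POI half is to look for a hyphen before the at-sign. This is only a sketch: the function name is hypothetical, the example keys are made up, and it assumes the two formats described in this thread (xxx-yyy@aaa-bbb-ccc and xxx@aaa-bbb-ccc).

```python
def has_poi_component(placekey):
    """Return True for xxx-yyy@... keys and False for address-only xxx@... keys."""
    what, sep, _where = placekey.partition("@")
    if not sep:
        raise ValueError(f"not a placekey (missing '@'): {placekey!r}")
    # Assumption: a hyphen in the "what" part means the POI half is present.
    return "-" in what


# Made-up keys for illustration only.
my_keys = ["222-223@abc-def-ghi", "224@abc-def-ghj", "225-222@abc-def-ghk"]
address_only = [key for key in my_keys if not has_poi_component(key)]
```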

Hi @evan_nielsen, SafeGraph Placekeys should always have the full xxx-yyy before the “where” component. When you append Placekeys to your dataset, the Placekey API tries to be as precise as possible, but sometimes it’s unable to match the given information with a specific POI, which results in the xxx@aaa-bbb-ccc format. Those partial Placekeys correspond to an address, which in my experience can still be quite useful for joining datasets.

I would suggest trying to match on the partial Placekeys, and perhaps manually reviewing the results to ensure those rows meet expectations.
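One way to try that match is sketched below, with hypothetical function names and made-up keys. It rests on an assumption that should be verified against the Placekey documentation: that the first group of the “what” in a full key (the xxx in xxx-yyy) is the address encoding, so dropping the POI half yields the same address-only key the generator returns.

```python
def address_level_key(placekey):
    """Reduce xxx-yyy@where (or xxx@where) to the address-only form xxx@where."""
    what, sep, where = placekey.partition("@")
    if not sep:
        raise ValueError(f"not a placekey (missing '@'): {placekey!r}")
    # Assumption: the first group of the "what" part encodes the address.
    address_part = what.split("-")[0]
    return f"{address_part}@{where}"


def match_partial(partial_keys, safegraph_keys):
    """Map each address-only key to the full SafeGraph keys at that address."""
    by_address = {}
    for key in safegraph_keys:
        by_address.setdefault(address_level_key(key), []).append(key)
    return {p: by_address.get(address_level_key(p), []) for p in partial_keys}


# Made-up keys for illustration only.
matches = match_partial(
    ["222@abc-def-ghi"],
    ["222-223@abc-def-ghi", "222-224@abc-def-ghi", "225-222@zzz-yyy-xxx"],
)
```

An address can host more than one POI, so each partial key may map to several rows; that is where the manual review comes in.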

Also, thanks for the code snippet! I will add something similar to the next notebook as an optional way to reduce file clutter.
