Wondering whether there are any explanations for the downloading directory of the neighborhood patterns data?

Ruoran_Lin_NYC_DCP · September 1, 2021, 1:58pm

Hi group, wondering whether there are any explanations for the downloading directory of the neighborhood patterns data? I attempted to bulk download the neighborhood patterns files using s3fs (Python) but it is hard to figure out the file paths or naming rules for these files. Thank you!

This topic was automatically generated from Slack. You can find the original thread here.

Niki_Kaz · September 1, 2021, 1:58pm

Hi, thanks for reaching out! We are looking into it and will get back to you once we have an answer.

Ruoran_Lin_NYC_DCP · September 1, 2021, 1:58pm

Thanks Niki!

Niki_Kaz · September 1, 2021, 1:58pm

Hey - this might be a better question for our Product team. I’ve looped them into this conversation, and they will get back to you soon on this! Thanks, Ruoran!

Ruoran_Lin_NYC_DCP · September 1, 2021, 1:58pm

Thank you! Looking forward to it!

Ruoran_Lin_NYC_DCP · September 1, 2021, 1:58pm

Hi, I have more questions regarding the normalization of neighborhood patterns data. 1. Which versions of the neighborhood patterns home panel summary files should be used to normalize neighborhood patterns for April 2019, April 2020, and April 2021? 2. How should they be normalized? In the neighborhood patterns files, “raw_device_counts” stands for the number of devices visiting/stopping at the CBGs. In the home panel summary files, “number_devices_residing” means the number of devices residing in the CBGs. I am wondering about the normalization algorithms. Thanks!

Niki_Kaz · September 1, 2021, 1:58pm

Hi again! Thanks for your patience on this.

Regarding your initial question, the folder structure for Neighborhood Patterns is similar to that of Monthly Patterns. Therefore, I would recommend reviewing the documentation on Monthly Patterns first. Here is a link to our documentation on the S3 Bucket Configuration. While it does not specifically outline Neighborhood Patterns, it does indicate the year/month/day folder structure.
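To make the year/month/day layout concrete, here is a minimal sketch of how a release date could be turned into a key prefix and used to filter a file listing. The bucket name, root prefix, and file names below are hypothetical placeholders, not the real Neighborhood Patterns path; take the actual path from the S3 documentation and the announcement.

```python
from datetime import date

# Hypothetical root; the real bucket and prefix come from the SafeGraph docs.
ROOT = "example-bucket/neighborhood-patterns"


def release_prefix(root: str, release: date) -> str:
    """Build a year/month/day prefix such as <root>/2021/05/07/."""
    return f"{root}/{release:%Y/%m/%d}/"


def keys_for_release(keys, root, release):
    """Keep only the keys that sit under the given release folder."""
    prefix = release_prefix(root, release)
    return [key for key in keys if key.startswith(prefix)]


# Made-up listing, standing in for what a bucket listing might return.
listing = [
    "example-bucket/neighborhood-patterns/2021/05/07/part-0.csv.gz",
    "example-bucket/neighborhood-patterns/2021/05/07/part-1.csv.gz",
    "example-bucket/neighborhood-patterns/2021/06/04/part-0.csv.gz",
]

may_files = keys_for_release(listing, ROOT, date(2021, 5, 7))
print(may_files)
```

With s3fs, a prefix built this way could presumably be passed to a listing call such as `S3FileSystem.find` to get the keys in place of the made-up `listing` above.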

Another resource I recommend keeping handy is the recent announcement from Jeff on the most recent Neighborhood Patterns S3 path.

For your question on normalization, check out our Data Science Resources under Normalization. There is a Colab notebook titled Simple Methods for Normalizing SafeGraph Patterns Data Over Time. While it focuses on normalizing Monthly and Weekly Patterns, the same principles apply. The simplest approach is to take raw_stop_counts or raw_device_counts and divide by number_devices_residing. As for which files to use, there is one home panel summary file per month, so if you are normalizing April 2019, you should use the number_devices_residing for that month.
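As a rough illustration of the month-matched division described above, the sketch below joins patterns rows to home panel rows on month and CBG. The row layout, the `month` and `cbg` keys, the output field name, and all values are assumptions made for the example; only raw_device_counts and number_devices_residing come from the data itself.

```python
# Made-up rows; in practice these would be read from the downloaded files.
patterns = [
    {"month": "2019-04", "cbg": "360610001001", "raw_device_counts": 120},
    {"month": "2020-04", "cbg": "360610001001", "raw_device_counts": 30},
    {"month": "2019-04", "cbg": "360610002001", "raw_device_counts": 55},
]
home_panel = [
    {"month": "2019-04", "cbg": "360610001001", "number_devices_residing": 40},
    {"month": "2020-04", "cbg": "360610001001", "number_devices_residing": 60},
]


def normalize(patterns_rows, panel_rows):
    """Divide raw_device_counts by the same month's number_devices_residing."""
    panel = {
        (row["month"], row["cbg"]): row["number_devices_residing"]
        for row in panel_rows
    }
    result = []
    for row in patterns_rows:
        denominator = panel.get((row["month"], row["cbg"]))
        # Leave the ratio empty when the panel row is missing or zero.
        ratio = row["raw_device_counts"] / denominator if denominator else None
        result.append({**row, "normalized_device_counts": ratio})
    return result


normalized = normalize(patterns, home_panel)
for row in normalized:
    print(row["month"], row["cbg"], row["normalized_device_counts"])
```

Keying the lookup on both month and CBG is what keeps an April 2019 count from being divided by a panel size from a different month.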

https://docs.safegraph.com/docs/data-science-resources#section-panel-normalization-for-longitudinal-[…]s-sampling-bias-corrections-and-extrapolation

Ruoran_Lin_NYC_DCP · September 1, 2021, 1:58pm

Thanks for your response! I might need to provide more background on my research question. Let us simplify it to: how many people visited the neighborhoods in NYC in April 2019? Is the simplest way to normalize the neighborhood patterns data, in each CBG, to take raw_device_counts in this CBG, divide by number_devices_primary_daytime in this CBG, and then multiply by “population” in this CBG or ground-truth “visitors” in this CBG?

Jeff_Ho_SafeGraph · September 1, 2021, 1:58pm

It sounds like you are looking for a method to upsample our raw_device_counts number into an estimated “true” number of visitors.

You are right - The simplest is normalizing by the geographic area sampling rate (you’ve said CBG here, but you could do it by the state sampling rate as well). An example is located in the notebook Niki pointed you to above (here).

state_scaled_visits = raw_device_counts * state_population / number_devices_residing

Because you’re interested in daytime visitors, you can substitute number_devices_primary_daytime as you’ve specified, and substitute the CBG population and CBG number_devices_residing as well.

All that said, I want to caveat that this approach can amplify noise in the CBG sampling rates and can sometimes lead to unintuitive results. I’d recommend you try the approach and compare against other ground-truth data you have to test your intuition.
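To see why small CBG panels can amplify noise, here is a small sketch of the scaling formula applied at the CBG level. The function name and all numbers are invented for illustration; the point is only how sensitive the estimate becomes when the panel count in the denominator is tiny.

```python
def scaled_visitors(raw_device_counts, population, panel_devices):
    """Upsample a device count by the inverse of the area's sampling rate.

    Mirrors raw_device_counts * population / panel_devices, where
    panel_devices is number_devices_residing (or a substituted panel count).
    Returns None when there is no panel to scale by.
    """
    if not panel_devices:
        return None
    return raw_device_counts * population / panel_devices


# Same raw count and population, different panel sizes (made-up values).
well_sampled = scaled_visitors(50, 1500, 100)
thin_panel = scaled_visitors(50, 1500, 2)
thin_panel_plus_one = scaled_visitors(50, 1500, 3)

print(well_sampled, thin_panel, thin_panel_plus_one)
```

In the thinly sampled case, a single extra panel device moves the estimate by a third, which is the kind of unintuitive swing worth checking against ground-truth data.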

Because of this, we also recommend moving away from trying to upsample to “true” visitors and instead generating indices that are proxies of such things. E.g.,

true_visitor_proxy = raw_device_counts / number_devices_primary_daytime

Which is the proportion of our daytime panel that visited the CBG, and can be a more robust proxy.

Let me know if that makes sense!

Niki_Kaz · September 1, 2021, 1:58pm

Thanks!

Hi - looks like Jeff was able to get you squared away. As this thread is over a week old, I’m going to close it out to avoid any follow-up questions from being overlooked. If you have any more questions or follow-up questions, we’re always here to help! Just be sure to make a new post to safegraphdata, as we aren’t monitoring old threads at this time. Thanks!