I have a few questions about popularity_by_each_hour in the Neighborhood Patterns data

Vincent_Rollet_MIT · October 28, 2021, 6:13pm

Hello,
I am writing to learn more about the popularity_by_each_hour in the Neighborhood Patterns data. Specifically, I am thinking about the following potential users and how their movements would be recorded in the data:
1/ a person enters a CBG at 3:20, goes to a shop within the CBG for a bit (over 1 minute), then goes to another shop for a bit (over 1 minute) and leaves the CBG at 3:50. Would this be recorded as one stop in the CBG or several?
2/ If the same person leaves the CBG at 4:10 instead, how will this be recorded in popularity_by_each_hour?
3/ If someone goes back home at 8pm and leaves home at 7am, how is this recorded in popularity_by_each_hour?
4/ If someone walks in a census block group and spends a total of 10 minutes in the census block group, how will this be recorded if the person never stops walking?
5/ If that person stops at one point to talk with someone in the street or sits on a bench for a few minutes, how will this be recorded?
Thank you!

This topic was automatically generated from Slack. You can find the original thread here.

Niki_Kaz · October 28, 2021, 6:13pm

Hey @Vincent_Rollet_MIT ! Appreciate the detailed questions! Let me dig in a little further and verify with our team on a few of these. We’ll circle back with you on this!

Jeff_Ho_SafeGraph · October 28, 2021, 6:13pm

Hi Vincent! Love these questions. The key difference for Neighborhood Patterns is that each cluster within a CBG with duration > 1 minute will become a stop, whereas for regular Patterns a cluster must have duration > 4 minutes and also be assigned to a POI as per our Visits Attribution Model.

With that in mind,

1/ If a person goes to two shops within a CBG, each for > 1 minute, it would most likely be recorded as several stops since this would generate multiple clusters of pings, one at each shop. This would only become one giant stop if the stores were enmeshed together such that the clustering algorithm considered them one giant cluster (unlikely, but possible).
2/ If the stops bled over into the next hour, this would increment the next hour in popularity_by_hour, but not stops_by_hour. popularity_by_hour increments for each hour a stop spans (e.g., a 3-hour stop would count in all three hours) whereas stops_by_hour only increments once per stop (e.g., a 3-hour stop would only count in the first hour).
3/ For a device at home overnight, assuming pings were received continuously, this should generate two long stops since stops can’t span multiple days (8pm-midnight on day 1, then midnight-7am on day 2). For every hour between 8pm and 7am, popularity_by_hour would increase by 1 due to these stops.
4/ If the person walks in an absolute straight line, their 10-minute stop would most likely get filtered out and never form a cluster. The filter is for driving and pass-through, but it isn’t perfect.
5/ If that person stops moving, then most likely the pings would remain and pass through the clustering algorithm, which would then become a stop.
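To make the counting rules above concrete, here is a minimal Python sketch. It is only an illustration of the behavior described in this reply, not SafeGraph's actual pipeline. The function names `split_at_midnight` and `hourly_counts` are hypothetical, stops are assumed to be already-clustered (start, end) intervals for one CBG, and a stop ending exactly on an hour boundary is assumed not to count in the following hour.

```python
from collections import Counter
from datetime import datetime, time, timedelta


def split_at_midnight(start, end):
    """Split one interval into pieces that never span two calendar days."""
    pieces = []
    while start < end:
        next_midnight = datetime.combine(start.date(), time.min) + timedelta(days=1)
        piece_end = min(end, next_midnight)
        pieces.append((start, piece_end))
        start = piece_end
    return pieces


def hourly_counts(stops):
    """Return (stops_by_hour, popularity_by_hour), keyed by the start of each hour.

    stops_by_hour counts a stop once, in the hour it begins.
    popularity_by_hour counts a stop in every hour it overlaps.
    """
    stops_by_hour = Counter()
    popularity_by_hour = Counter()
    for start, end in stops:
        first_hour = start.replace(minute=0, second=0, microsecond=0)
        stops_by_hour[first_hour] += 1
        hour = first_hour
        while hour < end:  # assumption: a stop ending at hh:00 does not touch hour hh
            popularity_by_hour[hour] += 1
            hour += timedelta(hours=1)
    return stops_by_hour, popularity_by_hour


# Question 2: two shop stops, the second one bleeding into the next hour.
shops = [
    (datetime(2021, 10, 28, 15, 20), datetime(2021, 10, 28, 15, 30)),
    (datetime(2021, 10, 28, 15, 35), datetime(2021, 10, 28, 16, 10)),
]
shop_stops, shop_popularity = hourly_counts(shops)

# Question 3: at home from 8pm to 7am, split into two stops at midnight.
overnight = split_at_midnight(datetime(2021, 10, 28, 20), datetime(2021, 10, 29, 7))
night_stops, night_popularity = hourly_counts(overnight)
```

Under these assumptions, the two shop stops both count in the 3pm hour for each variable, and only popularity_by_hour also counts the 4pm hour. The overnight stay becomes two stops and adds 1 to popularity_by_hour in each of the eleven hours from 8pm to 7am.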
Here are some further discussions which you may find useful:
• How SafeGraph handles long visits, with a section on popularity_by_hour: SafeGraph Documentation on how Patterns handles long visits - Google Docs (this doc is linked in Patterns > Visit Nuances in our docs, but applies to Neighborhood Patterns as well)
• What happens for stops where someone is walking: Workspace Deleted | Slack
• A thread about nuances in the clustering algorithm: Workspace Deleted | Slack
Feel free to follow up if you have any questions!

Vincent_Rollet_MIT · October 28, 2021, 6:13pm

Thank you Jeff for this detailed answer! I now better understand the process you go through to create this variable. For my research purposes, I need a measure of the number of people in a CBG at a given time, which can be proxied by the number of devices in a CBG. Given the information you gave me, I am worried that the popularity_by_each_hour variable might be biased in a systematic way to measure density: in CBGs with many places to visit, the popularity_by_each_hour variable will give me an overestimate of the number of people in the area (as people may make several stops within an hour) while in CBGs with fewer places to visit and where people walk but do not stop, I will have an underestimate of the number of people in the CBG.
Are there other variables that you create that could get me closer to a measure of the number of people currently located in a CBG?
Thank you very much!

Jeff_Ho_SafeGraph · October 28, 2021, 6:13pm

Unfortunately I don’t think so. The data are best for measuring when people stop and do things (when they “stop”), not for when they are just driving through (or walking through) a CBG. That said, the data will still give you a very good proxy of the number of people in a CBG at a given time. Although I don’t know your specific application, I personally wouldn’t be so concerned about the walking people being filtered out. We only provide the linear path filter to remove noisy data from driving, and it’s relatively uncommon that someone walks in a perfectly straight line.

Vincent_Rollet_MIT · October 28, 2021, 6:13pm

Thanks for these additional details! It is helpful to know that if people walk in a more erratic manner (e.g., walking around in a park), that would be counted as a stop. As a last question, could you please tell how such a walk around a park would be plausibly recorded in your data?
Thanks again!

Jeff_Ho_SafeGraph · October 28, 2021, 6:13pm

The pings would most likely be grouped together by our clustering algorithm, and then they would be counted as a stop in the CBG.

Vincent_Rollet_MIT · October 28, 2021, 6:13pm

Perfect! Thanks