Any insights/suggestions in dealing with differences in mobility in the SDM and Neighborhood Patterns datasets?

Hi everyone. We are trying to estimate intercounty mobility from Safegraph data. Till recently, we had been using the Social Distancing Metrics dataset to aggregate cbg-to-cbg mobility up to the county-to-county scale. Given the sunsetting of the SDM dataset, we have now been looking to do this with the Neighborhood Patterns dataset as it also provides cbg-to-cbg visit counts. However, we are finding that the mobility from the two datasets doesn’t line up. There is high correlation between the two, but on average the SDM estimated mobility is much larger than the NP estimated mobility. Has anyone looked into this comparison before? Any insights/suggestions from the community/Safegraph colleagues would be great! Thanks.


This topic was automatically generated from Slack. You can find the original thread here.

Hey @Shweta_Bansal_Georgetown - I will let others in the Community chime in, but will also consult our Patterns Product Manager about this. Will circle back later!

Thank you so much, @Niki_Kaz ! I appreciate it. Here are a couple of potential explanations that I’d love thoughts on:
• the SDM dataset is daily while NP is monthly, so I wonder if the discrepancy comes from the scale at which deduplication is done to define “unique devices”
• My understanding is that the SDM dataset includes devices that did not leave home in the destination_cbgs. Does the NP dataset also include these devices in the device_home_areas? If not, this could be part of the issue though it’s not a complete explanation

Just to add more info, here’s the plot of the comparison (each dot is the visits between a pair of counties, normalized by the number of total visits from the origin county):

Hi @Niki_Kaz , I have a follow up on this issue and a question for you. I’ve been able to estimate that the difference between the SDM and NP datasets may well be the inclusion of devices completely at home in the SDM dataset but not in the NP dataset. I’d love confirmation of that assumption, and I’d also appreciate any pointers on updated data on devices completely at home. I know the SDM dataset provided this, but do any of the current datasets also have this information? Thanks!

Hi Shweta, just to confirm - how are you aggregating the two together to compare on the same timescale? Because SDM data is daily and NP is monthly, if you’re just summing device_counts from SDM up to the monthly level, there will indeed be duplicates (i.e., the same device could make multiple stops and appear in multiple days’ data from SDM)

Hi @Jeff_Ho_SafeGraph , thanks for your response!
Right. To compare the two datasets, I am aggregating SDM to the monthly scale by summing device counts for all days across a month. The reason I expect the two datasets to be somewhat comparable is that I normalize each. So, the mobility between county i and j is defined as the number of visits from i to j divided by the total number of visits from i. “Visits” means different things in the two datasets, but because the numerator and denominator are defined in the same way within each dataset, the two values should be comparable.
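For readers following along, here is a minimal sketch of the normalization described above: roll CBG-to-CBG counts up to counties and divide each county pair by the origin county's total. The function names and the counts are made up for illustration. The only outside fact it relies on is that a 12-digit CBG code begins with the 5-digit state-plus-county FIPS code.

```python
from collections import defaultdict


def county_of(cbg: str) -> str:
    """Return the 5-digit county FIPS prefix of a 12-digit CBG code."""
    return cbg[:5]


def normalized_county_flows(cbg_flows):
    """cbg_flows: iterable of (origin_cbg, dest_cbg, count) tuples.

    Returns {(origin_county, dest_county): share of the origin county's total}.
    """
    flows = defaultdict(float)
    totals = defaultdict(float)
    for origin, dest, count in cbg_flows:
        o, d = county_of(origin), county_of(dest)
        flows[(o, d)] += count
        totals[o] += count
    return {
        pair: n / totals[pair[0]]
        for pair, n in flows.items()
        if totals[pair[0]] > 0
    }


# Made-up counts, for illustration only.
example = [
    ("360470043003", "360470043003", 60),
    ("360470043003", "360610001001", 30),
    ("360470043001", "340170001001", 10),
]
shares = normalized_county_flows(example)
```

The same function can be applied to either dataset once it is in (origin, destination, count) form, which keeps the numerator and denominator consistent within each dataset.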

Do you think the duplication is really to blame for the discrepancy or do you think it’s the difference in inclusion of individuals completely at home (as described in my message above)?

Thanks!

Thanks for explaining more! Yes, I would expect that the biggest explanation is that summing device_counts from each day double-counts devices that show up multiple times in a month. These would otherwise be deduplicated when we do the monthly aggregation for Neighborhood Patterns.
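A toy sketch of the double-counting point, using hypothetical device IDs (neither dataset exposes device IDs, so this only illustrates the arithmetic). It compares summing daily unique-device counts against deduplicating over the whole period, and shows that the normalized share need not come out the same when repeat visits are more common for some destinations than for others.

```python
# daily[dest] = list of per-day sets of (hypothetical) device ids from
# origin county "A" that were seen in destination county `dest`.
daily = {
    "A": [{"d1", "d2", "d3"}, {"d1", "d2", "d3"}, {"d1", "d2", "d3"}],
    "B": [{"d1"}, set(), set()],
}

# Summing daily counts: a device seen on three days is counted three times.
summed = {dest: sum(len(day) for day in days) for dest, days in daily.items()}

# Deduplicating over the whole period: each device is counted once.
unique = {dest: len(set().union(*days)) for dest, days in daily.items()}

# Share of A's "visits" that go to B under each counting rule.
share_summed = summed["B"] / sum(summed.values())
share_unique = unique["B"] / sum(unique.values())

print(summed, unique, share_summed, share_unique)
```

In this made-up case the two shares differ (0.1 versus 0.25). Which way the gap goes in real data depends on how often devices repeat each origin-destination pair, so the numbers here say nothing about the direction of the SDM versus NP discrepancy.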

Devices that do not go anywhere are, I believe, included in Neighborhood Patterns, as we record “stops” by devices to their home area in raw_stop_counts in NP:
> Number of stops by devices in our panel to this area during the date range. A stop must have a minimum duration of 1 minute to be included. The count includes stops by devices whose home area is the same as this area.

Thanks @Jeff_Ho_SafeGraph .
The main variable I’m using from the NP dataset is device_home_areas since that provides mobility between cbgs. Does device_home_areas also count devices who stay completely within their home area? Or, is there a way to get to that information using device_home_areas and raw_device_counts ?

Yup, device_home_areas will count devices whose stops are in the same area. For example, for the row in NP corresponding to CBG 360470043003 (conveniently the location of my favorite pizza place in NY), “360470043003” will appear as a key in device_home_areas, representing the number of devices whose home area is 360470043003 that also stopped in 360470043003 (whether they also went elsewhere or not).
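A small sketch of reading that value out of one NP row. It assumes device_home_areas arrives as a JSON string mapping home CBG to device count and that the row's own CBG is in a column called `area`; the counts and the other two CBG codes are invented for illustration.

```python
import json

# Hypothetical NP row. The `area` column name, the JSON-string encoding,
# and all counts are assumptions for illustration.
row = {
    "area": "360470043003",
    "raw_device_counts": 120,
    "device_home_areas": (
        '{"360470043003": 70, "360470043001": 25, "360610001001": 15}'
    ),
}

home_areas = json.loads(row["device_home_areas"])

# Devices whose home area is this same CBG and that also stopped here.
same_area = home_areas.get(row["area"], 0)

# Devices that stopped here but live in a different CBG.
from_elsewhere = sum(n for cbg, n in home_areas.items() if cbg != row["area"])

print(same_area, from_elsewhere)
```

Using `.get(..., 0)` avoids a KeyError for rows where the area's own CBG does not appear as a key.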

Thanks @Jeff_Ho_SafeGraph . One more clarification please: Would NP’s device_home_areas or raw_device_counts also include individuals who were completely at home (as defined by SD as not leaving the geohash-7 of their home location) or do they have to stop at a POI within their home CBG to appear in the NP dataset?

They don’t have to stop at a POI; they can be anywhere in the CBG. So as long as the device is on and registering pings, it would appear in the NP dataset. See also this question: Workspace Deleted | Slack

Thanks @Jeff_Ho_SafeGraph . That makes sense about raw_device_counts. However, I’m finding that the sum of all values in device_home_areas is always smaller (not by a lot) than raw_device_counts. Previously, I was assuming that the difference was the number of devices that didn’t leave home. But based on your answer, that’s not it. Could you help me understand what this difference might be?

I don’t want to interrupt this great conversation. Will pipe in very quickly, but feel free to kick me out :slightly_smiling_face:

Any columns in our datasets that rely on estimates of device home location, like device_home_areas, are only surfaced if we feel confident about a device’s home location. Devices that do not have a high confidence home location are treated as if the home location is unknown. Linking our docs about how home location is determined.

Home Algo v2 "Incremental Updates" is a really helpful section.

Seems like that’s why you’re seeing totals for device_home_areas and raw_device_counts being fairly close but device_home_areas being a little lower. Difference would be the devices with an unknown home location.

Thanks Niki! Niki is correct. The way we estimate home locations (so that they would show up in device_home_areas ) depends on our confidence in the common nighttime location of the device over the recent time period. So it’s common that device_home_areas doesn’t sum up to raw_device_counts for that reason. So it isn’t due to the devices that didn’t leave home, as you originally suspected.
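To make the gap concrete, here is a minimal sketch that treats the difference between raw_device_counts and the sum of device_home_areas as devices without a confident home location, per the explanation above. The helper name and the numbers are hypothetical, and it again assumes device_home_areas is a JSON string.

```python
import json


def unknown_home_devices(raw_device_counts: int, device_home_areas_json: str) -> int:
    """Devices counted in the area that have no confident home location.

    Hypothetical helper: the difference between the row's raw device count
    and the devices that could be attributed to some home area.
    """
    known_home = sum(json.loads(device_home_areas_json).values())
    return raw_device_counts - known_home


# Made-up values for illustration.
gap = unknown_home_devices(120, '{"360470043003": 70, "360470043001": 25, "360610001001": 15}')
print(gap)
```

If this gap is small relative to raw_device_counts across rows, that would be consistent with the "fairly close but a little lower" pattern described earlier in the thread.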

Got it. Thank you both!

No problem! To prevent any further questions from being overlooked, I’ll go ahead and close this thread out. If you have any more questions or follow-ups, we’re always here to help! Just be sure to make a new post, as we aren’t monitoring old threads at this time. Thanks!