I observe strange behavior in micro-normalizing Neighborhood Patterns data between January and March 2018

The method I employ is to break out the visitor_home_cbg column into CBG-visit pairs, compute a scaling factor for each home CBG (census population divided by the SafeGraph sample size for that CBG), and multiply the number of visits by this scaling factor.
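
For readers following along, the method described above can be sketched roughly as follows. This is only an illustration under assumptions: it treats visitor_home_cbg as a JSON object mapping a home CBG to a visit count, uses devices_residing as the SafeGraph sample size, and the function name, the "area" field, and all numbers are hypothetical.

```python
import json


def micro_normalize(rows, census_pop, devices_residing):
    """Expand visitor_home_cbg into CBG-visit pairs and scale each count.

    Assumption: visitor_home_cbg is a JSON string of {home_cbg: visits}.
    Assumption: devices_residing[home_cbg] is the panel sample size.
    """
    pairs = []
    for row in rows:
        home_counts = json.loads(row["visitor_home_cbg"])
        for home_cbg, visits in home_counts.items():
            population = census_pop.get(home_cbg)
            sample = devices_residing.get(home_cbg, 0)
            if population is None or sample == 0:
                continue  # scaling factor is undefined for this home CBG
            scale = population / sample  # census population / sample size
            pairs.append({
                "area": row["area"],
                "home_cbg": home_cbg,
                "visits": visits,
                "scaled_visits": visits * scale,
            })
    return pairs


# Illustrative inputs only, not real data.
rows = [{"area": "A", "visitor_home_cbg": '{"A": 10, "B": 4}'}]
census_pop = {"A": 1500, "B": 2000}
devices_residing = {"A": 100, "B": 50}
result = micro_normalize(rows, census_pop, devices_residing)
```

Because the scaling factor has devices_residing in the denominator, a month in which devices_residing falls for a home CBG produces a larger factor for the same raw visit count.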

I noted in (3) in this post that there is a substantial drop in devices_residing across all CBGs in Los Angeles in February 2018, and a coincident spike in normalized visits, which is a little odd.

I checked whether this was due to differential privacy by creating lower and upper bound estimates of the normalized visitor counts (where the lower bound estimate assumes 4 visits really means 2 visits, and the upper bound estimate assumes 4 visits really means 4 visits). The leap in visitors persists under both assumptions.
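
A minimal sketch of that bounds check, assuming (as described above) that a reported count of 4 could stand for anything from 2 to 4 and that larger counts are reported as-is. The helper names and the example numbers are hypothetical.

```python
def visit_bounds(reported, floor_value=4, true_min=2):
    """Return (lower, upper) for one reported count.

    Only a count equal to floor_value is ambiguous; it may really be
    as low as true_min. Other counts are taken at face value.
    """
    if reported == floor_value:
        return true_min, floor_value
    return reported, reported


def scaled_bounds(home_counts, scale):
    """Sum lower and upper bound scaled visits over all home CBGs."""
    lower = 0.0
    upper = 0.0
    for home_cbg, reported in home_counts.items():
        lo, hi = visit_bounds(reported)
        lower += lo * scale[home_cbg]
        upper += hi * scale[home_cbg]
    return lower, upper


# Illustrative inputs only: scaling factors per home CBG.
bounds = scaled_bounds({"A": 4, "B": 10}, {"A": 40.0, "B": 15.0})
```

Here the lower bound is 2 * 40 + 10 * 15 = 230 and the upper bound is 4 * 40 + 10 * 15 = 310; only the ambiguous count contributes to the gap.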

I then chose several major US cities to see if this was a nationwide phenomenon. In all 7 cities, there is a drop in devices_residing and a spike in normalized visit counts in February 2018 (see attached plots). The normalized visit counts alone are implausible to me, and it is also striking that these counts coincide with a marked decrease in devices_residing. I note that this behavior occurs in both the July 2020 and Dec 2020 releases of Neighborhood Patterns, which use different Home Device location algorithms.

Any ideas on what might be driving this behavior?

This topic was automatically generated from Slack. You can find the original thread here.

Hey! Thanks for the question. Going to loop in our Product team on this. We’ll circle back with you soon! Thanks!

Hi - good investigation on upper and lower bounds of normalized visitor counts. I would, however, expect the leap in micro-normalized visitors to persist whether you treat every reported 4 as a 2 or as a 4, because either choice applies the same adjustment to all CBGs and so implies the same level of over/undersampling across them. The tricky part of working with data that has had differential privacy applied is that we don’t know which reported 4s are truly 2s and which are truly 4s, so the sampling rate by CBG (and therefore the scaling factor) could vary more than either uniform assumption suggests.

While there was a change in the Home Device Location algorithms between the two releases of Neighborhood Patterns, I don’t think it would affect 2018 data; and regardless, it wouldn’t affect Feb 2018 differently from Jan or Mar 2018. What seems most plausible to me is that our panel just happened to see fewer devices that month, as was sometimes the case before our panel really stabilized from late 2018 onward.

I think what you’re seeing here is ultimately a limitation of the micro-normalization technique. It attempts to correct for over/undersampling by CBG, but in the case of Feb 2018 it appears to overcorrect. You could try normalizing by total visits, or by total visitors, as we recommend here.
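
As a rough illustration of the alternative, here is one possible reading of "normalizing by total visits": express each month's raw visits as a share of all visits the panel recorded that month. This is a sketch under that assumption, not necessarily the exact procedure in the linked recommendation; the function name and all numbers are hypothetical.

```python
def normalize_by_total(raw_by_month, panel_total_by_month):
    """Return each month's raw visits as a share of panel-wide visits.

    Months with a missing or zero panel total are skipped, since the
    share is undefined there.
    """
    shares = {}
    for month, raw in raw_by_month.items():
        total = panel_total_by_month.get(month)
        if not total:
            continue
        shares[month] = raw / total
    return shares


# Illustrative inputs only: the panel shrinks in Feb 2018, and raw
# visits to the area of interest shrink in proportion.
raw = {"2018-01": 1000, "2018-02": 900, "2018-03": 1050}
totals = {"2018-01": 100000, "2018-02": 90000, "2018-03": 105000}
shares = normalize_by_total(raw, totals)
```

In this made-up example the share stays at 1% in all three months, so a month with a smaller panel does not by itself produce a spike. Normalizing by total visitors would follow the same pattern with visitor counts in place of visit counts.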

Hi! Just confirming that we answered your question. I’m going to go ahead and close this thread out. If you have any more questions or follow-ups, we’re always here to help! Just be sure to make a new post to safegraphdata, as we aren’t monitoring old threads at this time. Thanks!