I’m working on normalizing some safegraph data by adapting the script to normalize visitor home cbgs. Two questions came to mind

Hi! I’m working on normalizing some SafeGraph data by adapting the script (Google Colab) to normalize visitor home CBGs. Two things came to mind:

  1. Summing the visitor_home_cbgs column doesn’t add up to raw_visitor_counts. Should this add up? If not, why not? And how could visitor_home_cbgs be adjusted to reflect the total raw_visitor_counts?
  2. If I’m trying to normalize the visitor_home_cbgs, should I use the POI state multiplier? I’m realizing that for MSAs that straddle state boundaries, traffic could frequently come from one state into another, and was wondering how to address this.

This topic was automatically generated from Slack. You can find the original thread here.

Hey @system !

For your first question:

The definition for each of these is as follows:

raw_visitor_counts: Number of unique visitors from our panel to this POI during the date range.

visitor_home_cbgs: A mapping of census block groups to the number of visitors to the POI whose home is in that census block group.

With this information, we see that raw_visitor_counts accounts for all unique visitors to a POI, whereas visitor_home_cbgs only counts, for each CBG, the visitors to the POI whose home is in that CBG (essentially, panel members who live there).

At times, apparent “discrepancies”, like the ones you noticed, may appear. These are explained by our privacy noise algorithm: we add extra noise to the individual visitor_home_cbgs counts to help protect privacy. This jittering can, in some cases, cause the sum of visitor_home_cbgs to exceed raw_visitor_counts.

raw_visitor_counts is not jittered and is not “adjusted” based on the noise added to visitor_home_cbgs.

Additionally, we only map a CBG to a visitor if we were able to confidently assign a home location to the device. See here for more details on how we determine device home locations: Patterns | SafeGraph Docs
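
To see how large the gap is for a given POI, you can parse the visitor_home_cbgs JSON string and compare its sum with raw_visitor_counts. Below is a minimal sketch on a made-up row. The helper names (cbg_counts, rescale_to_raw) are hypothetical, and the proportional rescaling is only one possible assumption for making the CBG counts add up to raw_visitor_counts, not something prescribed by the docs.

```python
import json


def cbg_counts(visitor_home_cbgs):
    """Parse the visitor_home_cbgs JSON string into {cbg: visitor_count}."""
    return json.loads(visitor_home_cbgs) if visitor_home_cbgs else {}


def rescale_to_raw(counts, raw_visitor_counts):
    """Scale CBG counts so that they sum to raw_visitor_counts.

    Assumption (not from the docs): visitors without an assigned home CBG
    are distributed across CBGs in the same proportions as those with one.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    factor = raw_visitor_counts / total
    return {cbg: n * factor for cbg, n in counts.items()}


# Made-up example row; the CBG IDs and counts are illustrative only.
row = {
    "raw_visitor_counts": 100,
    "visitor_home_cbgs": '{"010010201001": 12, "010010201002": 8, "010010202001": 20}',
}

counts = cbg_counts(row["visitor_home_cbgs"])
attributed = sum(counts.values())  # visitors with a home CBG (after jitter)
rescaled = rescale_to_raw(counts, row["raw_visitor_counts"])

print("attributed:", attributed, "raw:", row["raw_visitor_counts"])
print("rescaled:", rescaled)
```

Keep in mind that the rescaled values still carry the privacy noise, so small CBG counts remain unreliable after scaling.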

For your second question:

Wondering if this might be a better notebook for help normalizing visitor_home_cbgs:

It uses the visitor_home_cbgs column to adjust each POI’s monthly visitor count separately for each origin CBG visiting that POI, based on the known sample size for each CBG reported in home_panel_summary.csv and the true Census populations (available from Open Census Data). Because the adjustment is made per home CBG rather than per POI state, visitors who cross a state boundary are scaled by where they live.
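
The per-CBG adjustment described above can be sketched roughly as follows: each origin CBG gets its own multiplier (Census population divided by the number of panel devices residing in that CBG), and that multiplier is applied to the visitors from that CBG. This is a simplified illustration, not the notebook’s code. The function name and the input values are made up, and it assumes the device counts come from the number_devices_residing column of home_panel_summary.csv for the same month and the populations from Open Census Data.

```python
import json


def normalize_by_home_cbg(visitor_home_cbgs, devices_residing, census_population):
    """Estimate true visitors per origin CBG for one POI and one month.

    devices_residing:   {cbg: panel devices whose home is in that CBG}
    census_population:  {cbg: Census population of that CBG}
    """
    counts = json.loads(visitor_home_cbgs)
    estimated = {}
    for cbg, visitors in counts.items():
        devices = devices_residing.get(cbg)
        population = census_population.get(cbg)
        if not devices or population is None:
            continue  # no usable sample size or population for this CBG
        estimated[cbg] = visitors * population / devices
    return estimated


# Made-up inputs: one POI drawing visitors from CBGs in two different states.
visitor_home_cbgs = '{"240317001001": 5, "110010001001": 10}'
devices_residing = {"240317001001": 50, "110010001001": 100}
census_population = {"240317001001": 1000, "110010001001": 1500}

estimated = normalize_by_home_cbg(visitor_home_cbgs, devices_residing, census_population)
print(estimated)
print("estimated total visitors:", sum(estimated.values()))
```

CBGs with very few panel devices produce large, unstable multipliers, so it may be worth flagging or excluding them before summing.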