Questions regarding visitor_home_cbgs

henryadams9102 · May 22, 2022, 8:15pm

Hello there! I am trying to look at the origin locations of visitors to all POIs in my region of interest. I have the following questions:

1. When doing the ‘Micro’ normalization of visitor_home_cbgs to estimate the true number of visitors to each POI from each origin CBG, I see that a Colab tutorial drops every visitor_home_cbgs entry with a count below 5. The idea is to eliminate any CBG reported with 4 visitors, because that value could be anywhere from 2 to 4 actual sampled visitors due to the noise added for privacy. So my question is: if we filter out all visitor_home_cbgs entries <5 and then apply the equation
   (SG CBG visitor raw count / SG CBG sample size) x CBG population
   to estimate the number of visitors to that POI from that specific CBG, how can we rely on the result when we are filtering out so many of each POI’s visitors? How meaningful would an analysis be if every CBG with fewer than 5 visitors is filtered out?
2. Similar to Q1, I am comparing the number of visitor_home_cbgs entries with a count of 4 in a monthly patterns dataset vs. a weekly patterns dataset. Since weekly patterns cover roughly 1/4 of the sampling period, wouldn’t we expect a much larger share of visitor_home_cbgs counts to equal 4 and thus be filtered out? I’m wondering whether a study like this would produce vastly different answers for monthly vs. weekly patterns because of the relatively large number of counts equal to 4.
3. Also, has anyone tried to do this for thousands of POIs in a city to generate a map of origin locations? The process of exploding visitor_home_cbgs, then summing each CBG’s visitor count and adding it to a master list of all CBGs in the country, seems like it would be a nested-loop nightmare. I haven’t been able to find any resources from someone who has tried this.
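
To make the calculation in question 1 concrete, here is a small Python sketch of the filter-then-scale step. The function name, argument names, and all numbers are made up for illustration, and it assumes that the "SG CBG sample size" is the number of panel devices residing in the home CBG.

```python
def estimate_visitors(raw_count, devices_residing, cbg_population, floor=5):
    """Scale one home CBG's raw visitor count up to a population estimate.

    Returns None when the count is below `floor`, mirroring the filter
    described above. All argument names are hypothetical.
    """
    if raw_count < floor:
        return None  # counts at the privacy floor are dropped, not scaled
    # (raw count / sample size) x CBG population, written to divide last
    return raw_count * cbg_population / devices_residing


# 12 visitors seen, 60 panel devices live in the CBG, 1,500 residents
print(estimate_visitors(12, 60, 1500))  # 300.0

# A reported 4 is filtered out by default
print(estimate_visitors(4, 40, 1200))  # None

# Why it is filtered: a reported 4 may hide 2, 3, or 4 sampled visitors,
# so the scaled value would be ambiguous by a factor of two.
print([n * 1200 / 40 for n in (2, 3, 4)])  # [60.0, 90.0, 120.0]
```

The last line shows the spread that the filter is meant to avoid, at the cost of discarding those CBGs entirely.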

Sorry if this was confusing, let me know if I should restate my questions more succinctly!

Link to the Colab notebook explaining the process for estimating population counts of visitors from each CBG: Google Colab

Jeff_Ho_SafeGraph · June 1, 2022, 7:02pm

1. You are right that filtering out home CBGs with <5 visitors removes a lot of data. But that filter is recommended because it removes the noise introduced by the visitor floor. Depending on the goals of your analysis, removing that noise can be worth more than keeping all of the data.
   Feel free to do it both ways and compare your results! CBG counts can also be noisier in general due to variable precision in assigning home CBGs (see this notebook). Because of that, we recommend normalizing by something larger, like the state population, to produce more robust results. Or just use the pre-computed normalized_visits_by_state_scaling column.
2. You are correct that weekly patterns will have more 4s, so if you filter, you’ll filter out more CBGs. It probably will affect the results, yes. However, if you find the same results using both methods, you can be fairly confident that your answer is robust!
   If you are worried, you can also use visitor_home_aggregation, which is for census tracts rather than CBGs.
3. People do this all the time with our data. I would not recommend using a for loop to do this though. Here are some code snippets (in pandas, R, Scala, Spark) that should help.
   Thousands of POIs is actually fine (we’ve found up to 20k is usually bearable). Once you get to millions though, you’ll probably want to look for a bigger data solution.
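
As a rough illustration of the loop-free approach (not the linked snippets themselves), here is a minimal pandas sketch. It assumes visitor_home_cbgs arrives as one JSON string per POI mapping home CBG to visitor count; the toy rows and the cbg_visitor_totals helper are made up for the example.

```python
import json

import pandas as pd

# Toy stand-in for a patterns file; placekeys and counts are invented.
patterns = pd.DataFrame({
    "placekey": ["poi_a", "poi_b", "poi_c"],
    "visitor_home_cbgs": [
        '{"010010201001": 12, "010010201002": 4}',
        '{"010010201001": 7}',
        '{}',
    ],
})


def cbg_visitor_totals(patterns, floor=5):
    """Sum visitors per home CBG across all POIs, dropping counts below `floor`."""
    # Parse each JSON string into a list of (cbg, count) pairs.
    pairs = patterns["visitor_home_cbgs"].map(lambda s: list(json.loads(s).items()))
    long = (
        patterns[["placekey"]]
        .assign(pair=pairs)
        .explode("pair")          # one row per (POI, home CBG)
        .dropna(subset=["pair"])  # POIs with no home CBGs explode to NaN
    )
    long["home_cbg"] = long["pair"].str[0]
    long["visitors"] = long["pair"].str[1].astype(int)
    long = long[long["visitors"] >= floor]  # optional privacy-floor filter
    return long.groupby("home_cbg")["visitors"].sum()


totals = cbg_visitor_totals(patterns)
print(totals)
```

Calling cbg_visitor_totals(patterns, floor=0) keeps the 4s, which makes it easy to run the analysis both ways and compare. The resulting Series can then be reindexed against a full list of CBGs if a master table is needed.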

Good luck! And please report back what you find!

henryadams9102 · June 15, 2022, 6:18pm

Awesome, thank you Jeff!