Hi all!
I’m wondering if someone can help me think through censoring in the Patterns data.
As a bit of context, I want to model the number of visitors to a site based on the characteristics of the block groups from which those visitors originated. In the Patterns documentation, I saw the note that “only [visitor block groups] with at least 2 devices are shown and cbgs with less than 5 devices are reported as 4.” If my read of this is correct, the data are both truncated and censored.
Handling the censoring (i.e., 4 standing in for anything in the interval from 2 to 4) is straightforward enough. But I’m stuck on the truncation part (i.e., block groups not appearing in the data if they sent fewer than 2 devices to a site). In particular, there appear to be some fairly large discrepancies between raw visitor counts and summed counts from the block groups that do appear in the data. E.g., take a look at the following data set covering all visitors to one particular site during March 2019:
# A tibble: 11 x 9
# Groups:   fips [11]
   date_start date_end   month fips         count raw_visitor_counts region naics_code  test
   <date>     <date>     <ord> <chr>        <dbl>              <dbl> <chr>       <dbl> <dbl>
 1 2019-03-01 2019-04-01 Mar   360550096012     8                 82 NY         712190    48
 2 2019-03-01 2019-04-01 Mar   360550049002     4                 82 NY         712190    48
 3 2019-03-01 2019-04-01 Mar   360550132062     4                 82 NY         712190    48
 4 2019-03-01 2019-04-01 Mar   360550109024     4                 82 NY         712190    48
 5 2019-03-01 2019-04-01 Mar   360550142024     4                 82 NY         712190    48
 6 2019-03-01 2019-04-01 Mar   360550066002     4                 82 NY         712190    48
 7 2019-03-01 2019-04-01 Mar   120910201003     4                 82 NY         712190    48
 8 2019-03-01 2019-04-01 Mar   540610120001     4                 82 NY         712190    48
 9 2019-03-01 2019-04-01 Mar   360550047021     4                 82 NY         712190    48
10 2019-03-01 2019-04-01 Mar   484279504013     4                 82 NY         712190    48
11 2019-03-01 2019-04-01 Mar   360550135032     4                 82 NY         712190    48
The column “count” is the number of visitors to that site from a particular block group (Row 1, for instance, shows that this site received 8 visitors from block group 360550096012 in March 2019), and “raw_visitor_counts” is the SafeGraph-produced estimate of the total number of people, across all block groups, who visited this particular site in March 2019. Now, even if we assume that all of those censored 4s are true 4s, the total obtained by summing the “count” column falls well short of the visitor count provided by SafeGraph (48 vs 82). I’m attributing this difference to a bunch of “invisible block groups” that are sending visitors to the site in question but aren’t observed in the data because they sent fewer than 2 individuals to the site in March.
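To make the arithmetic concrete, here is a rough back-of-envelope sketch in Python (my own, not anything from SafeGraph). It bounds the visible total by letting each reported 4 stand for anything from 2 to 4, then bounds the gap to the raw count. The variable and function names are made up for illustration, and it assumes the entire gap comes from truncated block groups, which may well not hold.

```python
# Values copied from the "count" column for this site (March 2019).
reported_counts = [8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
raw_visitor_count = 82  # SafeGraph's total for the site

CENSOR_VALUE = 4  # reported value for any block group with 2 to 4 devices
CENSOR_MIN = 2    # smallest true count a reported 4 can stand for


def visible_bounds(counts):
    """Return (lowest, highest) possible visitor total from visible block groups."""
    # Lowest case: every reported 4 is really a 2.
    low = sum(CENSOR_MIN if c == CENSOR_VALUE else c for c in counts)
    # Highest case: every reported 4 is a true 4.
    high = sum(counts)
    return low, high


low, high = visible_bounds(reported_counts)

# Assumption: the whole shortfall is due to block groups dropped by truncation.
gap_min = raw_visitor_count - high
gap_max = raw_visitor_count - low

print(f"visible visitors: between {low} and {high}")
print(f"unaccounted visitors: between {gap_min} and {gap_max}")
```

For this site that works out to between 28 and 48 visible visitors, leaving between 34 and 54 unaccounted for. If the documented threshold is taken literally, each hidden block group would contribute a single device, so under that (strong) assumption the gap would also be a rough count of the hidden block groups.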
With all that said, I could use some help thinking through this and possible “solutions.” As a starting point, I’m wondering if someone could provide some bounds on the extent of the truncation? (I.e., on average, in a given month, how many block groups send fewer than 2 people to a given site?) And if anyone else has a creative solution for dealing w/ this, I’d appreciate hearing it! THANKS!