I’m wondering if someone can help me think through censoring in the Patterns data?

Michael_Esposito_UMich · June 19, 2021, 12:00am

Hi all!

I’m wondering if someone can help me think through censoring in the Patterns data.

As a bit of context, I want to model the number of visitors to a site based on the characteristics of the block-groups from which said visitors originated. In the Patterns documentation, I saw the note that “only [visitor block groups] with at least 2 devices are shown and cbgs with less than 5 devices are reported as 4.” If my read of this is correct, the data are both truncated and censored.
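To make that reading concrete, here is a minimal Python sketch of the rule as quoted. `reported_count` is a hypothetical helper written for illustration, not anything SafeGraph ships, and the thresholds are taken only from the sentence above.

```python
from typing import Optional


def reported_count(true_devices: int) -> Optional[int]:
    """Map a true device count to what the quoted rule would publish."""
    if true_devices < 2:
        return None  # truncated: the block group is dropped entirely
    if true_devices < 5:
        return 4  # censored: 2, 3 and 4 are all published as 4
    return true_devices  # 5 or more: published as-is


for n in range(0, 7):
    print(n, "->", reported_count(n))
```

Under this reading, a published 4 only tells you the true value was 2, 3 or 4, and a missing block group only tells you it was 0 or 1.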

Handling the censoring (re: 4 being a stand-in for anything in the interval from 2-4) is straightforward enough. But I’m stuck on the truncation part (re: block groups not appearing in the data if they sent fewer than 2 devices to a site). In particular, there appear to be some fairly large discrepancies between raw visitor counts and summed counts from block groups that appear in the data. E.g., take a look at the following data set with coverage of all visitors to one particular site during March 2019:

# A tibble: 11 x 9
# Groups:   fips [11]
   date_start date_end   month fips         count raw_visitor_counts region naics_code  test
   <date>     <date>     <ord> <chr>        <dbl>              <dbl> <chr>       <dbl> <dbl>
 1 2019-03-01 2019-04-01 Mar   360550096012     8                 82 NY         712190    48
 2 2019-03-01 2019-04-01 Mar   360550049002     4                 82 NY         712190    48
 3 2019-03-01 2019-04-01 Mar   360550132062     4                 82 NY         712190    48
 4 2019-03-01 2019-04-01 Mar   360550109024     4                 82 NY         712190    48
 5 2019-03-01 2019-04-01 Mar   360550142024     4                 82 NY         712190    48
 6 2019-03-01 2019-04-01 Mar   360550066002     4                 82 NY         712190    48
 7 2019-03-01 2019-04-01 Mar   120910201003     4                 82 NY         712190    48
 8 2019-03-01 2019-04-01 Mar   540610120001     4                 82 NY         712190    48
 9 2019-03-01 2019-04-01 Mar   360550047021     4                 82 NY         712190    48
10 2019-03-01 2019-04-01 Mar   484279504013     4                 82 NY         712190    48
11 2019-03-01 2019-04-01 Mar   360550135032     4                 82 NY         712190    48

The column “count” is the number of visitors to that site from a particular block group (Row 1, for instance, shows that “this site received 8 visitors from block-group 360550096012 in March 2019”), and “raw_visitor_counts” is the SafeGraph-produced estimate of the total number of people, across all block groups, who visited this particular site in March 2019. Now, even if we assume that all of those censored 4s are true 4s, the total from summing the “count” column falls far short of the visitor count provided by SafeGraph (48 vs 82). I’m attributing this difference to a bunch of “invisible block-groups” that are sending visitors to the site in question, but aren’t observed in the data because they sent fewer than 2 individuals to the site in March.
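The arithmetic behind that gap can be checked with a short Python sketch. The counts are copied from the table above; the only assumption is that every reported 4 stands for a true value of 2, 3 or 4, as the documentation note suggests.

```python
# Counts copied from the "count" column of the table above.
counts = [8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
raw_visitor_counts = 82

# A reported 4 may stand for 2, 3 or 4 devices, so bound the visible total.
upper_visible = sum(counts)                              # every 4 is a true 4
lower_visible = sum(2 if c == 4 else c for c in counts)  # every 4 is really a 2

print("visible total between", lower_visible, "and", upper_visible)
print("unexplained visitors between",
      raw_visitor_counts - upper_visible, "and",
      raw_visitor_counts - lower_visible)
```

So the 48 is the most generous reading of the visible block groups, and the unexplained remainder for this site is at least 34 visitors.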

With all that said, I could use some help thinking through this and possible “solutions.” As a starting point, I’m wondering if someone could provide some bounds around the scope/extent of the truncation? (Re: on average, in a given month, for a given site, how many block-groups send just 1 person to a site?) And if anyone else has a creative solution for dealing w/ this, I’d appreciate hearing it! THANKS!

Pranav_Thaenraj_UCSD · June 21, 2021, 3:25pm

Hello @Michael_Esposito_UMich, it may also be worth noting that the visitor_daytime_cbgs column only covers visitors during “daytime hours” (9AM - 5PM). Could the 34 visitors that you’re missing be explained by visits that happened before or after that window? This matters because you are using raw_visitor_counts, which counts the number of unique visitors to the POI. So even if an individual came to the POI at 8:50 AM (say the POI is an office building and the individual is coming to work 10 minutes early), that visit would still not be counted as falling between 9AM and 5PM.

I’m not sure there is a solution to the issue you’re facing; it’s simply the way the visitor_daytime_cbgs column is structured. One thing you could do is explode the visitor_home_cbgs column to see exactly which block groups are not present in visitor_daytime_cbgs.
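A minimal sketch of what “exploding” could look like in plain Python, assuming the column arrives as a JSON string that maps block-group FIPS codes to counts. The `row` dictionary, its `placekey` value, and the `explode_cbgs` helper are all made up for illustration.

```python
import json

# Hypothetical single Patterns row; the JSON string mimics visitor_home_cbgs.
row = {
    "placekey": "example-poi",  # made-up identifier for illustration
    "raw_visitor_counts": 82,
    "visitor_home_cbgs": '{"360550096012": 8, "360550049002": 4, "120910201003": 4}',
}


def explode_cbgs(row: dict, column: str) -> list:
    """Turn one row's JSON dict of {cbg: count} into one record per block group."""
    cbg_counts = json.loads(row[column] or "{}")
    return [
        {"placekey": row["placekey"], "fips": cbg, "count": count}
        for cbg, count in cbg_counts.items()
    ]


long_rows = explode_cbgs(row, "visitor_home_cbgs")
for record in long_rows:
    print(record)
```

Running the same helper on both columns gives two long tables whose block groups can then be compared directly.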

Let me know if you have any other questions.

Michael_Esposito_UMich · June 21, 2021, 4:14pm

Thanks folks! This is all super helpful. Just to check my understanding/provide some extra detail, the variable I’m using to calculate count is visitor_home_cbgs, which summarizes the number of visitors w/ ‘homes’ in [block = b] to [poi = p] for [time = t]. visitor_daytime_cbgs instead summarizes the number of visitors w/ ‘daytime locations’ in [b], to [p], at [t] (so more of a measure of things like, “where folks work”), yeah? I hadn’t used the latter variable in constructing these counts, because I’m generally interested in asking how many folks that lived in this neighborhood visited this type of POI. I could be misunderstanding this/selecting the wrong variable. But if not, the 9-5 timing issue shouldn’t be a source of discrepancy b/t visitor_home_cbgs and raw_visitor_counts, right?

However that shakes out, I think that a way to work around this is to expand my set to include all potential senders (e.g., all block groups in NYC, for a study of park use in NYC) and just sweep blocks that didn’t appear in the data for a given [p,t] combo into the censoring, such that everything from 0-4 is considered left censored. That’s a lil expensive, but seems worth it overall.
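Roughly, that idea could be sketched in Python as follows. This is only an illustration under loud assumptions: a Poisson model with one shared `rate` (a real model would let the rate depend on block-group characteristics), made-up inputs (`360550000000` is not a real block group from the data), and hypothetical helper names throughout.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate ** k / math.factorial(k)


def poisson_cdf(k: int, rate: float) -> float:
    return sum(poisson_pmf(j, rate) for j in range(k + 1))


def build_panel(all_senders, observed):
    """One row per potential sender; unseen or reported-as-4 rows are flagged censored."""
    panel = []
    for fips in all_senders:
        count = observed.get(fips)  # None when the block group never appears
        censored = count is None or count <= 4
        panel.append({"fips": fips, "count": count, "censored": censored})
    return panel


def log_likelihood(panel, rate: float) -> float:
    """Poisson log-likelihood where censored rows contribute log P(Y <= 4)."""
    total = 0.0
    for row in panel:
        if row["censored"]:
            total += math.log(poisson_cdf(4, rate))
        else:
            total += math.log(poisson_pmf(row["count"], rate))
    return total


# Hypothetical inputs: three potential senders, one of which never appears.
all_senders = ["360550096012", "360550049002", "360550000000"]
observed = {"360550096012": 8, "360550049002": 4}

panel = build_panel(all_senders, observed)
print(panel)
print(log_likelihood(panel, rate=3.0))
```

The expense comes from `all_senders`: every [p,t] combo gets one row per potential block group, whether or not it ever sent a visitor.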

Jeff_Ho_SafeGraph · June 21, 2021, 4:28pm

Your understanding of visitor_home_cbgs (where folks live) and visitor_daytime_cbgs (where folks work) sounds right to me. Based on your research question, visitor_home_cbgs is the right column to use, and you’re right that 9-5 timing shouldn’t be an issue.

I think your solution to use all potential senders works as well, especially if you want to really include as many block groups as possible in your research. We have had some people only include uncensored visitor_home_cbgs (i.e., >4) as well. If you’re worried about compute, you could always try with uncensored cbgs first, and then expand after.

Michael_Esposito_UMich · June 21, 2021, 4:38pm

Awesome!! That should more or less do it. As one last question, regarding Jeff’s point #1: do you have a rough idea of how many “missing home address visitors” a typical POI receives per month? (Or I suppose better would be something like, “on average, what percentage of visits to a POI are from folks w/ no home address?”) It would be nice/fun to play around w/ ways to incorporate measurement error into the model, maybe around the idea of how far off from the true count of visitors from [block = b] might we expect the observed count of visitors from [block = b] to be, based off of error due to missing home address individuals? (This isn’t essential, so no worries if that info isn’t readily available!)

Jeff_Ho_SafeGraph · June 21, 2021, 5:27pm

I would think it varies by POI, so I don’t have any stats off the top of my head, unfortunately. However, I think you can calculate this for the whole dataset and at the state level by comparing number_devices_residing from the home panel summary to num_unique_visitors from visit_panel_summary. That should give you a sense of how many missing home address visitors there are!