Has anybody tried to combine Neighborhood Patterns (NP) and Weekly Patterns (WP) to estimate “non-POI visits”?

Demo included in this thread. In short, my thinking on the most straightforward version of this is:

• Sum up visits_by_each_hour from weekly patterns for POIs in the same CBG.*
• Sum up stops_by_each_hour from neighborhood patterns by CBG
• Line these up so you can subtract one from the other
Both types of visits are defined as a device stopping within some boundary for more than a minute. So, assuming the CBG boundary used in NP fully contains the boundaries of the POIs assigned to that CBG in WP, I would not expect WP_counts > NP_counts to be possible for the same hour (especially since NP includes home devices staying home). But this happens regularly (one CBG has 323 more device stops in WP than in NP for a specific hour). Besides deeper data issues, the only explanation I can think of at the moment is that in WP, a device can visit 3 different POIs within an hour and be counted 3 times, while in NP, a device can only ever be counted once for a given CBG in a given hour.
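
To make the double-counting explanation concrete, here is a minimal sketch with made-up data. The `stops` table, its column names, and the device IDs are all hypothetical and not part of the SafeGraph schema. It counts one hour of stops in one CBG two ways: once per device-POI visit, as WP would under this theory, and once per device, as NP would.

```r
# Hypothetical stops within one CBG during one hour (toy data, not SafeGraph output)
stops <- data.frame(
  device = c("a", "a", "a", "b", "c"),
  poi    = c("cafe", "gym", "shop", "cafe", NA),  # NA = stop outside any POI
  stringsAsFactors = FALSE
)

# WP-style: every device-POI visit is counted
wp_count <- sum(!is.na(stops$poi))

# NP-style: each device is counted at most once for the CBG in that hour
np_count <- length(unique(stops$device))

# Positive when WP exceeds NP
wp_count - np_count
```

Device `a` contributes three WP visits but only one NP stop, so the WP total can exceed the NP total even though device `c` never visits a POI.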

One obvious use of a non-POI visit estimate is to get at residential transmission as opposed to POI transmission.

Any thoughts greatly appreciated!

Here’s code to replicate:

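`wp` is weekly patterns for the week of 2020-11-02, just Santa Clara County POIs.

```
library(dplyr)
library(tidyr)

# Sum hourly POI visits per CBG, negated so they subtract when combined with NP
wp_cbg_summary <- wp %>% 
  transmute(
    cbg = poi_cbg, 
    # drop the first and last character (the enclosing brackets of the array string)
    visits_by_each_hour = substr(visits_by_each_hour,2,nchar(visits_by_each_hour)-1)
  ) %>% 
  # one column per hour of the week, named "1" to "168"
  separate(
    visits_by_each_hour,
    c(as.character(1:168)),
    sep = ","
  ) %>% 
  mutate(across(
    -cbg,
    ~(as.numeric(.)*-1)
  )) %>% 
  group_by(cbg) %>% 
  summarize_all(sum) %>% 
  filter(cbg %in% scc_blockgroups$origin_census_block_group)
```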
`np` is neighborhood patterns for November, just SCC CBGs. In this case, to line up with my test week of `wp`, I need to skip the first 24 values in `stops_by_each_hour` and then take the next 168.

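```
np_cbg_summary <- np %>% 
  transmute(
    cbg = area, 
    # drop the first and last character (the enclosing brackets of the array string)
    stops_by_each_hour = substr(stops_by_each_hour,2,nchar(stops_by_each_hour)-1)
  ) %>% 
  # NA names discard the first 24 hours (Nov 1); extra = "drop" silently
  # discards the hours after the 168-hour test week
  separate(
    stops_by_each_hour,
    c(rep(NA,24),as.character(1:168)),
    sep = ",",
    extra = "drop"
  ) %>% 
  mutate(across(
    -cbg,
    ~(as.numeric(.))
  ))
```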
I then combine the two and sum (which is why I made the `wp` visit counts negative).

```
# Stack both summaries; WP counts are negative, so the sum is NP minus WP per CBG-hour
cbg_non_poi <- 
  rbind(wp_cbg_summary, np_cbg_summary) %>% 
  group_by(cbg) %>% 
  summarize_all(sum)
```
To inspect some outlier cases:

```
# Minimum (most negative) NP-minus-WP value across CBGs for each of the 168 hours
mins <- cbg_non_poi %>% 
  select(-cbg) %>% 
  summarize_all(min) %>% 
  t()

summary(mins)
```
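
Beyond the per-hour minimums, it may help to know how widespread the negative cells are and where the worst one sits. The following is only a sketch on a toy table shaped like `cbg_non_poi` (one row per CBG, one column per hour); the name `toy_non_poi` and all values except the -323 quoted above are invented for illustration.

```r
# Toy stand-in for cbg_non_poi: NP minus WP, one row per CBG, one column per hour
toy_non_poi <- data.frame(
  cbg = c("A", "B", "C"),
  `1` = c(10, -3, 5),
  `2` = c(-1, -323, 8),
  `3` = c(4, 2, 0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

hours <- as.matrix(toy_non_poi[, -1])
rownames(hours) <- toy_non_poi$cbg

# Share of CBG-hours where WP visits exceed NP stops
share_negative <- mean(hours < 0)

# Locate the single worst CBG-hour
worst <- which(hours == min(hours), arr.ind = TRUE)
worst_cbg  <- rownames(hours)[worst[1, "row"]]
worst_hour <- colnames(hours)[worst[1, "col"]]
```

On the real table, the same steps should work after converting the tibble to a matrix in the same way, assuming the hour columns are all numeric.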

I have not dug into this, but the most likely reason that occurs to me is that for NP, we need a stop of at least 1 minute, meaning we need at least 2 pings from a device. For WP, we would count a single-ping visit, especially if we had wifi info.

Interesting, I didn’t think there would be such a fundamental difference in how pings are registered, so good to know as another caveat. Can you confirm that for WP, a single ping would count, but you still check that the ping is within the boundaries of the POI geometry?

We’re using a model that takes into account a number of features – distance from centroid, distance from boundary of polygon, wifi, category and time of day, etc.

Does that mean that the model could potentially have accounted for outdoor dining for POIs throughout 2020, if that outdoor dining physically was in a fixed location a few feet outside of a building geometry?

yes. we don’t just do point in polygon

A follow-up observation that I’m confused by: Under device_home_areas, as described in the schema, I am always seeing the area itself listed, with, as expected, the highest device count. But why do I see the device counts for the area itself generally drop significantly, and sometimes not show up at all, in the other device_home_area fields like weekday_device_home_areas? If, for example, the area itself had 454 devices identified in the device_home_areas field, I would expect that for weekday_device_home_areas there is a number close to 454 for the same CBG, since it would be common sense that a home device generally shows up in its own CBG on a weekday. But in this example, the number in that field is just 6. (the example is Nov 2020, area = 060374034022). And in the weekend_device_home_areas field, there’s no value at all for the area CBG, meaning it was 0 or 1, and that leaves 447 devices unaccounted for, which were “spotted” per device_home_areas but not on a weekday or a weekend. Something like this happens with almost every record. Unless I’m misunderstanding the schema, this would seem to be a big issue with those supplementary device_home_area fields. Let me know if I’m missing something! @Nate_Ramos_Stanford @Julia_Wagenfehr_Stanford @Angie_Peng @Caci_Jiang
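
For anyone who wants to reproduce this check on a single record, here is a rough sketch. It assumes the three fields are JSON strings mapping home CBG to device count and that the `jsonlite` package is available. Only the 454 and 6 for area 060374034022 come from the example above; every other key and count is invented, and the helper `own_count` is a hypothetical name.

```r
library(jsonlite)

area <- "060374034022"

# Illustrative JSON strings; only the 454 and 6 for `area` come from the post,
# every other key and count is invented
device_home_areas         <- '{"060374034022": 454, "060374034011": 12}'
weekday_device_home_areas <- '{"060374034022": 6, "060374034011": 9}'
weekend_device_home_areas <- '{"060374034011": 5}'

# Return the count recorded for `cbg`, or NA when the key is absent
own_count <- function(json, cbg) {
  counts <- fromJSON(json)
  if (cbg %in% names(counts)) counts[[cbg]] else NA_integer_
}

total   <- own_count(device_home_areas, area)
weekday <- own_count(weekday_device_home_areas, area)
weekend <- own_count(weekend_device_home_areas, area)

# Devices in the total that neither sub-field accounts for
# (assumes an absent key means 0; it may instead mean a suppressed small count)
unaccounted <- total - sum(weekday, weekend, na.rm = TRUE)
```

With the missing weekend value treated as 0, this gives 448 unaccounted devices, or 447 if it is taken to be 1.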

@Lauren_Spiegel_SafeGraph any thoughts/guidance on this?

Hi Derek, I am checking with the engineers. I think it might be that we are only checking the start time of a cluster of pings but want to confirm with them.

my theory was wrong. how often are you seeing this? we are applying differential privacy (DP) to these counts, so a few examples should look odd, but we wouldn’t expect this generally.

I can have my team do a more formal assessment to share tomorrow, but my own experience was seeing this issue in the vast majority of cases

@Lauren_Spiegel_SafeGraph I took a closer look at the prevalence of the discrepancy Derek mentioned above. Here are a few notes:
• I found that in 60% of observations, the unpacked weekday_device_home_areas and weekend_device_home_areas values did not sum to the corresponding device_home_areas values
• I specifically examined census block groups (area) in San Jose, CA for the period of 12/1/2020 - 1/1/2021.

thanks. i wouldn’t expect them to sum up exactly because of the differential privacy. do you have any stats on how big the deviations are and for how many CBGs? if not, not a big deal. we will check.
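
One possible way to summarize the size of the deviations, sketched on invented numbers: the `home_counts` table, its column names, and the 0.5 cutoff are all assumptions for illustration, with each row standing for one CBG's own-area counts after unpacking the three fields.

```r
# Hypothetical per-CBG own-area counts after unpacking the three fields (toy values)
home_counts <- data.frame(
  cbg     = c("A", "B", "C", "D"),
  total   = c(454, 120, 80, 200),
  weekday = c(6, 110, 70, 150),
  weekend = c(0, 40, 5, 30),
  stringsAsFactors = FALSE
)

# Ratio of (weekday + weekend) to the total; values well below 1 are the concern
home_counts$ratio <- (home_counts$weekday + home_counts$weekend) / home_counts$total

# Distribution of the ratio and the share of CBGs below an arbitrary 0.5 cutoff
ratio_quantiles <- quantile(home_counts$ratio, c(0.1, 0.5, 0.9))
share_far_below <- mean(home_counts$ratio < 0.5)
```

Ratios a little above 1 would be consistent with devices seen on both weekdays and weekends, while ratios near 0 are the cases that differential privacy noise alone seems unlikely to explain.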

@Lauren_Spiegel_SafeGraph agree, these shouldn’t sum up in principle, but if the sum of weekend and weekday device counts is far below the total in device_home_areas, that’s our concern. the sum being a bit over would make sense, if a device is seen both on weekdays and weekends. We’ll follow up with some more summary stats, but I would say at this point that this is a widespread issue that (hopefully) is a fixable bug, and not evidence of a deeper issue with NP.

noted. we will dig in further

Thanks. Seems like the team is looking into the issue. While the hourly spike may or may not be related to the monthly aggregation, the differences between the home devices and home_wk + home_wkend go in both directions and do not make sense in magnitude either. Looking forward to learning what the data team discovers. Thanks again for the response!