Advan patterns data compatibility with SafeGraphR package

lme56 · November 1, 2023, 10:42pm

I have R code that uses the expand_cat_json function from the SafeGraphR package, which worked with the SafeGraph patterns data. I’ve attempted to run the same code on the VISITOR_HOME_CBGS variable from the Advan patterns data but get the following error: “lexical error: invalid char in json.” From what I can tell, it occurs when VISITOR_HOME_CBGS is blank. Is the expand_cat_json function usable on the Advan data? If not, is anyone aware of a workaround or any functions that do the same thing? Thank you!

InvTech · November 1, 2023, 11:53pm

Hi @lme56, it worked well when I tested it with my sample. Could you provide the patterns file name (.csv.gz) and the part of your code where you call expand_cat_json?

lme56 · November 2, 2023, 1:49am

Thank you for your reply, @InvTech! The file name is Monthly_Patterns_Foot_Traffic-82-DATE_RANGE_START-2019-01-01.csv.gz. The code works for all patterns files until this one.

The relevant code is below, where I’m trying to loop through patterns files. The expand_cat_json line is near the bottom of the loop. The merge_with_POI is an additional dataframe containing POI info for what we’re interested in. Please let me know if you need more info! Thank you!

library(data.table)  # provides fread(), setkey(), setnames()

visits <- lapply(file_names[83], function(file) {

  # progress counter; assumes `counter` was initialised before the loop
  counter <<- counter + 1
  print(paste0("reading in file ", counter))

  patterns_data <- fread(file)

  # keyed join: keep only the patterns rows that match a POI of interest
  setkey(patterns_data, PLACEKEY)
  merged <- patterns_data[merge_with_POI, nomatch = 0]

  if (nrow(merged) > 1) {

    merged <- merged[, .(PLACEKEY, PARENT_PLACEKEY, LOCATION_NAME, Label,
                         STREET_ADDRESS, CITY, LONGITUDE, LATITUDE, DATE_RANGE_START,
                         DATE_RANGE_END, RAW_VISIT_COUNTS,
                         RAW_VISITOR_COUNTS, POI_CBG, VISITOR_HOME_CBGS)]

    # pivot longer: one row per POI and visitor home CBG,
    # grouping by every column except the JSON column itself
    pivoted <- SafeGraphR::expand_cat_json(merged, expand = 'VISITOR_HOME_CBGS',
                                           by = setdiff(names(merged), 'VISITOR_HOME_CBGS'))
    setnames(pivoted, c("index", "VISITOR_HOME_CBGS"), c("VISITOR_CBGS", "VISITS"))

    return(pivoted)

  } else {
    # too few matching rows in this file: return NA
    return(NA)
  }
})

InvTech · November 3, 2023, 7:51pm

Unfortunately, that file does not exist on the download platform, which suggests the files have been updated since you downloaded them. I think it’s worth downloading the data again. The error message indicates that some of your current data points may contain invalid JSON text.

lme56 · November 4, 2023, 8:31pm

Thank you for pointing that out. I’ll download the updated data and post an update here if there are still issues. Thank you for your help!

lme56 · November 17, 2023, 8:44pm

Hello, I was able to try my code on the updated patterns data and ran into the same issue using the expand_cat_json function with the following file: Monthly_Patterns_Foot_Traffic-34-DATE_RANGE_START-2019-01-01.csv.gz. Any suggestions you may have would be greatly appreciated. Thank you!

InvTech · November 18, 2023, 2:45am

SafeGraphR uses jsonlite::fromJSON in the expand_cat_json function. I also ran into a case where fromJSON throws an error when the Advan data contains an empty string ("") in that column. You can remove the rows where the value is "" first and then run it again.
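
For anyone landing here later, a minimal sketch of that filtering step might look like the following. The toy table and its values are made up for illustration, and `has_json`, `dropped`, and `merged_ok` are hypothetical names, not part of SafeGraphR. The sketch also treats NA and whitespace-only values as blank, which is an assumption about how the field may arrive after reading the file.

```r
library(data.table)

# Toy stand-in for the merged patterns table (values are invented).
merged <- data.table(
  PLACEKEY = c("aaa", "bbb", "ccc"),
  VISITOR_HOME_CBGS = c('{"010010201001":5,"010010201002":4}',
                        "",
                        '{"010010202001":7}')
)

# TRUE where the JSON column is neither NA nor an empty/whitespace-only string.
has_json <- !is.na(merged$VISITOR_HOME_CBGS) &
  nzchar(trimws(merged$VISITOR_HOME_CBGS))

dropped   <- merged[!has_json]  # kept aside so the dropped POIs can be inspected
merged_ok <- merged[has_json]

message(nrow(dropped), " row(s) with a blank VISITOR_HOME_CBGS set aside")

# Expand only the rows that actually hold JSON (runs only if SafeGraphR is installed).
if (requireNamespace("SafeGraphR", quietly = TRUE)) {
  pivoted <- SafeGraphR::expand_cat_json(merged_ok,
                                         expand = "VISITOR_HOME_CBGS",
                                         by = "PLACEKEY")
}
```

Keeping `dropped` around rather than discarding it makes it easier to check later which POIs were excluded and how many visits they account for.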

Thanks,

lme56 · November 19, 2023, 1:26am

This works, thank you!

lme56 · November 27, 2023, 8:01pm

Hi @InvTech, just following up on this: would you happen to know why VISITOR_HOME_CBGS contains ""? One possible reason I’ve come across is in @evan-barry-dewey’s post here, which I’ve quoted below. I’m trying to understand whether dropping these observations will introduce bias into our analysis. Thanks!

Number of CBGs and Census Tracts in Trade Areas.
Advan cuts the number of CBGs in a trade area to the top 1,000 and number of tracts in a trade area to the top 400. SafeGraph did not. This is a temporary measure to limit the size of the data and make it easier to ingest. Advan generates home/work trade areas as 4 fields - geohash 6 (i.e., g6), g5, g4, and g3, so the more distant areas have lower granularity. Advan reserves the right to modify the schema in the future to similarly reduce the overall data size without losing granularity at the local level.

Effect: the visitor columns (visitor_home_cbgs, visitor_home_aggregation, etc.) will contain a much smaller number of CBGs / Census Tracts. Additionally, CBGs / Tracts that are distant from the POI (and less likely to have significant visitation to the POI) will be missing.

evan-barry-dewey · November 30, 2023, 5:47pm

@lme56 I updated the methodology doc with a few things after hearing back from the Advan team.

Number of CBGs and Census Tracts in Trade Areas.
Advan cuts the number of CBGs in a trade area to the top 1,000 and number of tracts in a trade area to the top 400 for each POI. SafeGraph did not. This is a temporary measure to limit the size of the data and make it easier to ingest. Advan generates home/work trade areas as 4 fields - geohash 6 (i.e., g6), g5, g4, and g3, so the more distant areas have lower granularity. Advan reserves the right to modify the schema in the future to similarly reduce the overall data size without losing granularity at the local level.

Effect: the visitor columns (visitor_home_cbgs, visitor_home_aggregation, etc.) will contain a much smaller number of CBGs / Census Tracts. Additionally, CBGs / Tracts that are distant from the POI (and less likely to have significant visitation to the POI) will be missing. Advan also filters CBGs that did not have enough visitors, so if there are very few visitors and there are 1 or 2 visitors per CBG then these will not show up; in extreme cases that can lead to a blank field.
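
Since blank fields are described above as arising for POIs with very few visitors, one way to gauge how much dropping them matters is to compare the blank and non-blank rows directly. The following is only a rough sketch on invented toy data; `blank_cbgs` and `summary_by_blank` are hypothetical names, and it assumes the RAW_VISIT_COUNTS and RAW_VISITOR_COUNTS columns are present as in the code earlier in the thread.

```r
library(data.table)

# Toy stand-in for the merged patterns table (values are invented).
merged <- data.table(
  PLACEKEY           = c("aaa", "bbb", "ccc", "ddd"),
  RAW_VISIT_COUNTS   = c(120L, 4L, 310L, 6L),
  RAW_VISITOR_COUNTS = c(80L, 3L, 200L, 5L),
  VISITOR_HOME_CBGS  = c('{"010010201001":5}', "", '{"010010202001":7}', "")
)

# Flag rows whose home-CBG field is NA, empty, or whitespace-only.
merged[, blank_cbgs := is.na(VISITOR_HOME_CBGS) |
         !nzchar(trimws(VISITOR_HOME_CBGS))]

# Compare the two groups: how many POIs, how large they are, and
# what share of all raw visits would be lost by dropping the blank rows.
summary_by_blank <- merged[, .(
  n_poi           = .N,
  median_visitors = as.numeric(median(RAW_VISITOR_COUNTS)),
  total_visits    = sum(RAW_VISIT_COUNTS)
), by = blank_cbgs]

summary_by_blank[, share_of_visits := total_visits / sum(total_visits)]

print(summary_by_blank)
```

If the blank rows turn out to be concentrated among low-traffic POIs, dropping them shifts the sample toward busier locations, which may or may not matter depending on the analysis.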

Panel Details
Note that currently the trade area data use a different panel than the one used for visits, so visits and trade areas do not go hand in hand. Advan is planning on changing that in a future release (no ETA yet). This allows Advan to use a larger panel for the home/work data as a way to analyze a larger and more detailed demographic area. However, because this panel is not sourced as consistently as the panel they use to count visits, it may vary more from month to month.

They’ve recently added a new panel provider for the home/work data that is much more stable. The month over month volatility of the trade area panels will be significantly reduced going forward (starting with the November 2023 data).

https://community.deweydata.io/t/safegraph-advan-methodology-differences/26163

lme56 · December 1, 2023, 7:16pm

Thanks @evan-barry-dewey! This is good to know.