Error with Arrow package to read in multiple SafeGraph csv.gz files

gabriella.palomo · March 12, 2023, 11:31pm

I would like to use the arrow package to work with csv.gz files, but I am getting the error below. I downloaded the SafeGraph monthly patterns for 2020. My directory has a folder for 2020 with 12 subfolders, one per month, for example: data/2020/1

Arrow opens the dataset with all the files, but when I try to collect it I get the following error. Does anyone know what I’m doing wrong? Any help or guidance will be greatly appreciated!

Error in `compute.arrow_dplyr_query()`:
! Invalid: straddling object straddles two block boundaries (try to increase block size?)

This is my code

> data_path <- "safegraph_data/2020"
> files <- list.files(data_path, recursive = TRUE, full.names = TRUE)
> length(files)
[1] 348
> dat <- open_dataset(data_path, format='csv', partitioning = c('month'))
> dat
FileSystemDataset with 348 csv files
placekey: string
parent_placekey: string
safegraph_brand_ids: string
location_name: string
brands: string
store_id: string
top_category: string
sub_category: string
naics_code: int64
latitude: double
longitude: double
street_address: string
city: string
region: string
postal_code: string
open_hours: string
category_tags: string
opened_on: string
closed_on: string
tracking_closed_since: string
websites: string
geometry_type: string
polygon_wkt: string
polygon_class: string
enclosed: bool
phone_number: string
is_synthetic: bool
includes_parking_lot: bool
iso_country_code: string
wkt_area_sq_meters: int64
date_range_start: timestamp[s, tz=UTC]
date_range_end: timestamp[s, tz=UTC]
raw_visit_counts: int64
raw_visitor_counts: int64
visits_by_day: string
poi_cbg: int64
visitor_home_cbgs: string
visitor_home_aggregation: string
visitor_daytime_cbgs: string
visitor_country_of_origin: string
distance_from_home: int64
median_dwell: double
bucketed_dwell_times: string
related_same_day_brand: string
related_same_month_brand: string
popularity_by_hour: string
popularity_by_day: string
device_type: string
normalized_visits_by_state_scaling: double
normalized_visits_by_region_naics_visits: double
normalized_visits_by_region_naics_visitors: double
normalized_visits_by_total_visits: double
normalized_visits_by_total_visitors: double
month: int32
> dat %>% 
+   group_by(month, region) %>%
+   summarise(mean_visitors = mean(raw_visitor_counts)) %>% 
+   select(month, region, mean_visitors) %>%  
+   collect() ->tst
Error in `compute.arrow_dplyr_query()`:
! Invalid: straddling object straddles two block boundaries (try to increase block size?)
Run `rlang::last_error()` to see where the error occurred.

evan-barry-dewey · March 22, 2023, 3:31pm

@Christian_Gunning_University_of_Georgia put together this great tutorial on processing SafeGraph in R with Arrow. Hope this helps!

scott.stetkiewicz · April 4, 2023, 8:48pm

I hit the same issue. It seems that rows with very long fields (e.g. geometries) need more space than the default block size allows. You can adjust the block_size parameter inline:

dat <- open_dataset(data_path, format='csv', partitioning = c('month'), block_size=1e9)

1e9 (roughly 1 GB per block) did the trick for me, though it is probably worth experimenting to find the smallest value that works for your files, since larger blocks use more memory.
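For anyone who wants to try a smaller value first, here is a minimal sketch that passes the block size through an explicit read-options object instead of the inline argument. The toy folder, the toy columns, and the 64 MiB figure are assumptions for illustration only, not values from the real SafeGraph files. I believe the default block size is about 1 MB, so anything well above the longest row should be a reasonable starting point.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for one monthly folder (e.g. 2020/1).
tmp <- file.path(tempdir(), "block_size_demo")
dir.create(file.path(tmp, "1"), recursive = TRUE, showWarnings = FALSE)
toy <- data.frame(
  region = c("GA", "GA", "FL"),
  raw_visitor_counts = c(10L, 20L, 30L),
  polygon_wkt = strrep("0 ", 1000)  # a long text field, recycled to every row
)
write.csv(toy, file.path(tmp, "1", "part.csv"), row.names = FALSE)

# Assumed starting point: 64 MiB per block; raise it if the error persists.
read_opts <- CsvReadOptions$create(block_size = as.integer(64 * 1024^2))

dat <- open_dataset(
  tmp,
  format = "csv",
  partitioning = "month",
  read_options = read_opts
)

# Same aggregation as in the original question.
tst <- dat %>%
  group_by(month, region) %>%
  summarise(mean_visitors = mean(raw_visitor_counts)) %>%
  collect()
```

If the straddling error comes back on the real data, increase block_size step by step until collect() succeeds.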
