Tutorial: processing SafeGraph in R with Apache Arrow

Christian_Gunning_University_of_Georgia · January 15, 2023, 5:57am

Hi all,

I just worked through an example of processing SafeGraph csv files using R’s arrow package, and I thought this might be of general interest. Arrow allows many files to be processed as a set without loading all of the files into memory. The result is fast, and can operate on datasets much larger than could fit in RAM. Arrow can also auto-magically read compressed CSV files (see installation guide for more details), and write out files by a grouping value (state, month, etc.).

In this example, I read a set of SafeGraph monthly home panels (freely available) and summarize device counts by county. You could also filter() rows, etc. Arrow does some optimization behind the scenes to help speed things up.

Note: I’m using the new-style “|>” pipe that was introduced in R 4.1. For earlier versions, replace with the familiar “%>%” version.

library(arrow)
library(dplyr)
library(readr)
library(fst)

## directory contains multiple SafeGraph home_panel csv files
## (one per month)
.dir <- './data/home_panel/'

## define column names/types
.schema <- schema(
    year=uint16(), month=uint8(), 
    region=utf8(), iso_country_code=null(), census_block_group=utf8(), 
    number_devices_residing=uint64(),number_devices_primary_daytime=uint64()
)

dat.panel <- open_dataset(.dir, format='csv', schema=.schema, skip=1)

## or without schema: infer column type, use header
# dat.panel <- open_dataset(.dir, format='csv')

## sum devices by county
dat.panel.county <- (
    dat.panel
    ## convenience: rename later
    |> rename(
        .reside=number_devices_residing,
        .day=number_devices_primary_daytime
    )
    ## truncate block_group to county
    |> mutate(
        county=substr(as.character(census_block_group), 1,5),
    )
    ## id columns, sort
    |> group_by(county, year, month, region)
    |> arrange(.by_group=T)
    ## aggregate by county, keep id cols
    |> summarize(devices_reside=sum(.reside), devices_day=sum(.day))
    ## final output: select columns
    |> select(year, month, region, county, devices_reside, devices_day)
    |> collect()
)

## widely used, good compression
write_parquet(
    dat.panel.county,
    paste0(.dir,'../home_panel.parq')
)

## good for R
write.fst(
    dat.panel.county,
    paste0(.dir,'../home_panel.fst')
)
    
## base R / most portable?
write_csv(
    dat.panel.county,
    file=gzfile(paste0(.dir,'../home_panel.csv.gz'))
)

Please let me know if I missed anything.
I hope you found this helpful!

Refs

Intro: Integration to Apache Arrow • Arrow R Package
Data type mapping: Using the Arrow C++ Library in R
Cookbook: 7 Manipulating Data - Tables | Apache Arrow R Cookbook
R Installation: Installing the Arrow Package on Linux

Christian_Gunning_University_of_Georgia · December 6, 2023, 7:01pm

For errors with wide input files, see also the block_size parameter :