Tutorial: processing SafeGraph in R with Apache Arrow

Hi all,

I just worked through an example of processing SafeGraph csv files using R’s arrow package, and I thought this might be of general interest. Arrow allows many files to be processed as a set without loading all of the files into memory. The result is fast, and can operate on datasets much larger than could fit in RAM. Arrow can also auto-magically read compressed CSV files (see installation guide for more details), and write out files by a grouping value (state, month, etc.).

In this example, I read a set of SafeGraph monthly home panels (freely available) and summarize device counts by county. You could also filter() rows, etc. Arrow does some optimization behind the scenes to help speed things up.

Note: I’m using the new-style “|>” pipe that was introduced in R 4.1. For earlier versions, replace with the familiar “%>%” version.

library(arrow)
library(dplyr)
library(readr)
library(fst)

## directory contains multiple SafeGraph home_panel csv files
## (one per month)
.dir <- './data/home_panel/'

## define column names/types
.schema <- schema(
    year=uint16(), month=uint8(), 
    region=utf8(), iso_country_code=null(), census_block_group=utf8(), 
    number_devices_residing=uint64(),number_devices_primary_daytime=uint64()
)

dat.panel <- open_dataset(.dir, format='csv', schema=.schema, skip=1)

## or without schema: infer column type, use header
# dat.panel <- open_dataset(.dir, format='csv')

## sum devices by county
dat.panel.county <- (
    dat.panel
    ## convenience: rename later
    |> rename(
        .reside=number_devices_residing,
        .day=number_devices_primary_daytime
    )
    ## truncate block_group to county
    |> mutate(
        county=substr(as.character(census_block_group), 1,5),
    )
    ## id columns, sort
    |> group_by(county, year, month, region)
    |> arrange(.by_group=T)
    ## aggregate by county, keep id cols
    |> summarize(devices_reside=sum(.reside), devices_day=sum(.day))
    ## final output: select columns
    |> select(year, month, region, county, devices_reside, devices_day)
    |> collect()
)

## widely used, good compression
write_parquet(
    dat.panel.county,
    paste0(.dir,'../home_panel.parq')
)

## good for R
write.fst(
    dat.panel.county,
    paste0(.dir,'../home_panel.fst')
)
    
## base R / most portable?
write_csv(
    dat.panel.county,
    file=gzfile(paste0(.dir,'../home_panel.csv.gz'))
)

Please let me know if I missed anything.
I hope you found this helpful!

Refs

3 Likes

For errors with wide input files, see also the block_size parameter :