Hello everyone, I have a question about using R for SafeGraph data.
Am I the only one running into errors caused by the huge data size? My workspace grows to about 100GB when I download the SDM / Patterns datasets for 2020, and then RStudio frequently fails (Error 502, or sometimes gets stuck 'resuming' forever) while loading the large workspace. It often happens when I close the session and come back to that workspace the next morning. I was curious how other people deal with these large datasets. I am using my workplace VPN and the Bear Cluster to run things more efficiently, but it still gives me trouble quite often.
It would be great if anyone could share their experience or tips on how they deal with the large SafeGraph data.
Thank you and stay warm!
So I don’t know much about the SafeGraph data or the way it is accessed, but this is a very clear sign that you need to be working with a database, or something else that can manage the compute for you. If you can download the data as chunked CSVs and then load them into a SQLite or Postgres database, that’ll likely be your best bet.
It looks like the read_core() function reads the already-compressed CSV files into a data.table and brings that into memory: SafeGraphR/read_core.R at 64746acc9e7e665397980aa6d6f739b650621e70 · SafeGraphInc/SafeGraphR · GitHub
It may make sense to provide a way to download these files rather than bring them all into memory.
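Even without a database, you can cut memory use a lot by reading one file at a time and keeping only the columns you need. A minimal sketch, assuming compressed core files sitting in a local folder (the folder and column names here are just examples, not anything SafeGraph-specific):

```r
# Read one compressed core file at a time with a column subset, instead of
# loading everything into memory at once.
library(data.table)  # fread() can read .csv.gz directly if R.utils is installed

core_files <- list.files("core_poi", pattern = "\\.csv\\.gz$", full.names = TRUE)

for (f in core_files) {
  # Only this one file's selected columns are in memory at a time
  core_chunk <- fread(f, select = c("placekey", "top_category", "region"))
  # ... process or write out core_chunk here, then let it be overwritten ...
}
```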
If I were approaching this, I’d likely:
• make a connection to a database
• read 1 month of data, then append it to a table
◦ make sure you don’t keep this object around in your R session, otherwise you’ll fill up your memory quickly
◦ rinse and repeat for your date range
• Connect to the database with DBI, then run SQL queries against it with dbplyr (see the sketch after this list)
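A rough sketch of that workflow, assuming monthly patterns files in a local folder. The paths, table name, and column names (raw_visit_counts, region) are placeholders for whatever is in your files, and you could swap RSQLite for RPostgres if you'd rather use Postgres:

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(data.table)

con <- dbConnect(RSQLite::SQLite(), "safegraph.sqlite")

# One file (roughly one time period) at a time: read, append, discard.
pattern_files <- list.files("patterns_2020", pattern = "\\.csv\\.gz$", full.names = TRUE)

for (f in pattern_files) {
  chunk <- fread(f)                           # only this one chunk lives in memory
  dbWriteTable(con, "patterns", chunk, append = TRUE)
  rm(chunk); gc()                             # release the chunk before the next file
}

# Query lazily with dbplyr; nothing comes into R until collect().
patterns <- tbl(con, "patterns")
visits_by_region <- patterns %>%
  group_by(region) %>%
  summarise(total_visits = sum(raw_visit_counts, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```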
The core files are small enough that you can open them all without running out of memory on a typical machine. That’s not the case for patterns, though, if you’re covering a wide time range. If you’re planning to do any aggregation of the data, read_many_patterns() can process and aggregate each time period separately before bringing everything together, which keeps memory sizes reasonable. You can generally fit a few months in memory at once at a reasonable aggregation level (say, by-brand-by-county-by-day). If you really want all the raw data at once, then yes, you’re looking at setting up a database. But you should think carefully about whether you actually need that.
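A hedged sketch of that aggregate-as-you-read approach. The argument names here (gen_fips, by, expand_int, select, filter) are based on my reading of the SafeGraphR docs and the folder name is a placeholder, so check ?read_many_patterns for the exact interface in your installed version:

```r
library(SafeGraphR)

patterns <- read_many_patterns(
  dir        = "patterns_2020",                           # folder of monthly patterns files
  gen_fips   = TRUE,                                      # derive state/county FIPS from poi_cbg
  by         = c("state_fips", "county_fips", "brands"),  # aggregation level
  expand_int = "visits_by_day",                           # unpack the daily-visits JSON column
  select     = "raw_visit_counts",                        # only read the columns you need
  filter     = "brands != ''"                             # drop unbranded POIs before aggregating
)
```

Each file is read, filtered, and collapsed to the by-brand-by-county-by-day level before the next one is touched, which is what keeps the memory footprint manageable.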