A few general questions for the folks in the community who are accustomed to using {data.table} with other parallel computing packages (e.g., {doParallel}, {doMC}, {future}) with large volumes of data

This is a more general question for the folks in the consortium who are accustomed to using {data.table} with other parallel computing packages (e.g., {doParallel}, {doMC}, {future}) with large volumes of data.

  1. My understanding is that data.table disables its default multi-threaded processing when called within another parallelized operation, to avoid dangerously nested parallelism. Is that correct? (See the sketch after these questions for a quick way to check.)
  2. Heuristically, when is it more efficient to proceed with data.table’s built-in multi-threading versus parallelizing a higher-level function that includes data.table functions?
  3. If you call data.table functions within a parallelized function, do those single-threaded, sequential data.table calls still outperform, e.g., {readr} or {dplyr}?
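One way to sanity-check question 1 is to ask data.table how many threads it will use inside a forked worker. A minimal sketch using base {parallel} (fork-based, so not on Windows); the exact default thread count is platform- and version-dependent:

```r
library(data.table)
library(parallel)

getDTthreads()  # data.table's default thread count for this session

# Inside a forked child, data.table detects the fork and drops to a single
# thread, which is the behavior described in question 1. On a platform that
# supports forking, this should return a list of 1s.
mclapply(1:2, function(i) data.table::getDTthreads(), mc.cores = 2)

# For the trade-off in question 2, you can pin the thread count explicitly
# before timing each strategy:
setDTthreads(1)  # force single-threaded data.table
setDTthreads(0)  # 0 means use all available threads (see ?setDTthreads)
```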

Following. My only exposure to big data in R is with {sparklyr}, which uses dplyr syntax. Also, I didn't know that about data.table, but it explains its speedy compute times.

@Joe_Wasserman_RTI_International My suspicion is that you’ll want to profile. I second @Jessica_Williams-Holt’s suggestion to try out Spark if you get some free time. You’ll get the most bang for your buck by using a small cluster on GCP or AWS, but you can test-drive locally. Note that Spark is not guaranteed to be faster, but the dataset sizes you can reasonably work with interactively are nearly unbounded.
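If you do want to test-drive Spark locally before spinning up a cluster, {sparklyr} makes that a few lines. A hedged sketch, where the CSV path and column names (flights.csv, carrier, dep_delay) are hypothetical placeholders:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # single-machine Spark; no cluster needed

# Spark reads the file itself, so the full dataset never has to fit in R's memory
flights <- spark_read_csv(sc, name = "flights", path = "flights.csv")

flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()  # pull only the small aggregated result back into R

spark_disconnect(sc)
```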

Another option is BigQuery, which can do a fair amount of analysis in parallel, at scale, and cheaply, again using dplyr syntax. Loading data is extremely fast and nearly automatic once CSV files are in cloud storage.
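For the BigQuery route, {bigrquery} plugs into dplyr via {dbplyr}, so the heavy lifting happens inside BigQuery and only the result is downloaded. A minimal sketch; the project, dataset, and table names here are hypothetical:

```r
library(bigrquery)
library(DBI)
library(dplyr)

con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = "my-gcp-project",  # hypothetical GCP project id
  dataset = "analytics"        # hypothetical dataset
)

events <- tbl(con, "events")   # lazy table reference; nothing downloaded yet

events %>%
  count(event_type) %>%        # translated to SQL and executed inside BigQuery
  collect()                    # only the aggregated counts come back to R

DBI::dbDisconnect(con)
```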