A few general questions for the folks in the community who are accustomed to using {data.table} with other parallel computing packages (e.g., {doParallel}, {doMC}, {future}) with large volumes of data

This is a more general question for the folks in the consortium who are accustomed to using {data.table} with other parallel computing packages (e.g., {doParallel}, {doMC}, {future}) with large volumes of data.

  1. My understanding is that data.table disables its default multi-threaded processing when called within another parallelized operation, to avoid dangerously nested parallelism. Is that correct? (See the sketch after these questions for a quick way to check.)
  2. Heuristically, when is it more efficient to proceed with data.table’s built-in multi-threading versus parallelizing a higher-level function that includes data.table functions?
  3. If you call data.table functions within a parallelized function, do those single-threaded, sequential data.table calls still outperform, e.g., {readr} or {dplyr}?
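One way to sanity-check question 1 is to ask data.table how many threads it will use inside a forked worker. A minimal sketch using base {parallel} (fork-based, so not on Windows); the exact default thread count is platform- and version-dependent:

```r
library(data.table)
library(parallel)

getDTthreads()  # data.table's default thread count for this session

# Inside a forked child, data.table detects the fork and drops to a single
# thread, which is the behavior described in question 1. On a platform that
# supports forking, this should return a list of 1s.
mclapply(1:2, function(i) data.table::getDTthreads(), mc.cores = 2)

# For the trade-off in question 2, you can pin the thread count explicitly
# before timing each strategy:
setDTthreads(1)  # force single-threaded data.table
setDTthreads(0)  # 0 means use all available threads (see ?setDTthreads)
```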

Following. My only exposure to big data in R is with {sparklyr}, which uses dplyr syntax. Also, I didn't know that about data.table, but it explains its speedy compute times.

@Joe_Wasserman_RTI_International My suspicion is that you’ll want to profile. I second @Jessica_Williams-Holt’s suggestion to try out Spark if you get some free time. You’ll get the most bang for your buck by using a small cluster on GCP or AWS, but you can test-drive locally. Note that Spark is not guaranteed to be faster, but the dataset sizes you can reasonably work with interactively are nearly unbounded.
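If you do want to test-drive Spark locally before spinning up a cluster, {sparklyr} makes that a few lines. A hedged sketch, where the CSV path and column names (flights.csv, carrier, dep_delay) are hypothetical placeholders:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # single-machine Spark; no cluster needed

# Spark reads the file itself, so the full dataset never has to fit in R's memory
flights <- spark_read_csv(sc, name = "flights", path = "flights.csv")

flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()  # pull only the small aggregated result back into R

spark_disconnect(sc)
```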

Another option is BigQuery, which can do a fair amount of analysis in parallel, at scale, and cheaply, again using dplyr syntax. Loading data is extremely fast and nearly automatic once CSV files are in cloud storage.
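For the BigQuery route, {bigrquery} plugs into dplyr via {dbplyr}, so the heavy lifting happens inside BigQuery and only the result is downloaded. A minimal sketch; the project, dataset, and table names here are hypothetical:

```r
library(bigrquery)
library(DBI)
library(dplyr)

con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = "my-gcp-project",  # hypothetical GCP project id
  dataset = "analytics"        # hypothetical dataset
)

events <- tbl(con, "events")   # lazy table reference; nothing downloaded yet

events %>%
  count(event_type) %>%        # translated to SQL and executed inside BigQuery
  collect()                    # only the aggregated counts come back to R

DBI::dbDisconnect(con)
```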