I'm curious what tools people are using to process large data sets. Any Dask users? Spark? Ray?

Hi @todd_hendricks, I really like the idea of Dask. I was able to link up two old laptops and my main rig and get quite a bit of power running. I'm not sure how useful Dask is on a single machine, but I'd be interested to see.
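
For the single-machine case, Dask can still spread work across your cores with a LocalCluster. Rough sketch of what I mean (the file pattern and column names are made up):

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# One "cluster" made of worker processes on a single box
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

# Lazily read a folder of CSVs, then aggregate without loading it all into RAM
df = dd.read_csv("data/patterns-*.csv")
result = df.groupby("brand")["visit_count"].sum().compute()
print(result.head())
```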

I want to say Ray only runs on Unix-like systems (i.e. Linux and macOS), but I am not 100% sure on that.

I think Spark is amazing, but running a cluster costs money, which makes it less amazing, haha.

Something I would be interested in figuring out:

So, legend has it that Mac OSX will turn extra hard drive space into super-slow RAM when you exceed your memory limits (i.e. swap). Essentially, a Mac will almost never throw OOM errors, because the likelihood of using 250 GB+ of RAM-plus-swap is slim to none. It may take forever, because instead of running DDR4 at 3000 MHz you're throttled to whatever your hard drive can manage (probably faster with modern NVMe M.2 storage).

So, do you think it's possible to get a PC to perform like a Mac? Maybe there's a library out there that works like a wrapper for pandas or something? I haven't looked into it, but when I learned that about Macs my mind was blown. They actually can't turn that feature off; it's hard-coded in, haha.
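
(For what it's worth, any OS will eventually spill to swap; if you want to force the "disk as slow RAM" trick on a PC from Python, a memory-mapped array is one explicit way to do it. Rough sketch only, the path and sizes are made up:)

```python
import numpy as np

# A ~20 GB float32 array backed by a file on disk instead of RAM
big = np.memmap("scratch/big_array.dat", dtype="float32", mode="w+",
                shape=(100_000, 50_000))

# Reads/writes page in from disk on demand: slow, but it won't OOM
big[:100, :] = np.random.rand(100, 50_000).astype("float32")
print(big[:100, :].mean(axis=0)[:5])
```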

I use PySpark at SafeGraph, but most of the time I pre-filter to a subset based on geography or brand and then do my analysis locally on the subset, so luckily I don't have to do truly big-data stuff most of the time.
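
Roughly what that pre-filtering looks like, if it helps (the paths and column names here are just placeholders, not the official schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prefilter").getOrCreate()

patterns = spark.read.csv("s3://my-bucket/patterns/*.csv.gz", header=True)

# Cut down to one brand in one state before doing anything expensive
subset = patterns.filter((F.col("brand") == "Starbucks") & (F.col("region") == "CA"))

# Small enough now to pull down and analyze locally
subset_pd = subset.toPandas()
```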

@Ryan_Fox_Squire_SafeGraph are you on a mac or a windows machine?

@Jack_Lindsay_Kraken1 macbook pro

have you ever hit a memory limit?

@Jack_Lindsay_Kraken1 No, but when I'm working on my MacBook Pro I've usually pre-filtered already, either by pulling a subset from shop.safegraph.com or by using a cloud Spark cluster and PySpark.

The only time things will really explode is if you try to unpack some of the JSON columns like visitor_home_cbg nationwide or across many months. Those are the cases where I usually need to pre-filter to a subset by geography or brand first.

@Ryan_Fox_Squire_SafeGraph I see. I need to borrow someone's Mac one day to see if the rumors are true about converting hard drive storage into "slow RAM" haha

From what I gather, AWS EMR is a low-cost option (< $5 or so) if you get your cluster up and down reasonably fast.

Using PySpark to explode the JSON columns is truly, magically fast.
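
Something along these lines, for anyone curious; the schema and column names are my guesses, not the official ones:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

spark = SparkSession.builder.appName("explode_json").getOrCreate()
patterns = spark.read.csv("patterns.csv", header=True)

# visitor_home_cbg arrives as a JSON string like {"060750201001": 5, ...}
parsed = patterns.withColumn(
    "visitor_home_cbg_map",
    F.from_json("visitor_home_cbg", MapType(StringType(), IntegerType())),
)

# One row per (home CBG, visitor count) pair -- this is where row counts blow up
exploded = parsed.select(
    "placekey",
    F.explode("visitor_home_cbg_map").alias("home_cbg", "visitor_count"),
)
exploded.show(5)
```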

If you’re trying to make me and Jack jealous, you’re winning :stuck_out_tongue_closed_eyes:

Haha, is it the Mac or PySpark? That is a causal inference question.

I also don't actually process truly big datasets, and usually filter down to a specific region. I tried buying Colab Pro to run some JSON explodes before, but it didn't do well on a single weekly dataset unless it was broken down into chunks, so I wouldn't really recommend it for JSON parsing that creates tens of millions of rows.
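
The chunked workaround I ended up with looked roughly like this (plain pandas; the paths and column names are hypothetical):

```python
import json
import pandas as pd

pieces = []
for chunk in pd.read_csv("patterns_weekly.csv", chunksize=200_000):
    # Parse the JSON dict column, then emit one row per (cbg, count) pair
    parsed = chunk["visitor_home_cbg"].dropna().apply(json.loads)
    long = parsed.apply(lambda d: list(d.items())).explode().dropna()
    piece = pd.DataFrame(long.tolist(), index=long.index,
                         columns=["home_cbg", "visitor_count"])
    piece.insert(0, "placekey", chunk.loc[long.index, "placekey"].to_numpy())
    pieces.append(piece)

result = pd.concat(pieces, ignore_index=True)
```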

Colab Pro is the single best investment I have made in 2020.

@Jack_Lindsay_Kraken1 What are the best/smoothest features you get out of Colab Pro? Does it really help with processing large datasets? I just wanted to solve that memory-crash problem with JSON explodes, but it unfortunately didn't work for me.

Well, you have to remember Colab defaults to the standard memory allocation even with Pro; you have to manually set it to high-RAM every time, and that doubles the RAM, which is crazy helpful. Also, for future applications, it allows access to GPUs, which means cloud-based CUDA processing is possible (wayyyyyy faster than any CPU dreamed of being).
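
One thing I do is sanity-check that the high-RAM runtime actually kicked in before running anything heavy (from memory it's roughly 12 GB standard vs roughly 25 GB high-RAM, but don't quote me on the exact numbers):

```python
import psutil

# Total RAM visible to the Colab runtime
total_gb = psutil.virtual_memory().total / 1e9
print(f"Runtime RAM: {total_gb:.1f} GB")
```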

Good to know! I should have probably dug in a bit more with it. :slightly_smiling_face:

I've been using Ray pretty successfully on Windows on a single machine, and performance is pretty good. It's better than standard multiprocessing, since Ray does a decent job of deciding how to split your data, which can significantly speed things up. It can have some inconvenient syntax at times, but making a function run in parallel can be done in one line. I haven't tested it on multiple machines, so I'm not sure about task distribution.
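
For anyone curious, the "one line" is literally just the decorator; a minimal sketch:

```python
import ray

ray.init()  # uses all local cores by default

@ray.remote              # the one line that makes the function parallelizable
def process(chunk):
    return sum(x * x for x in chunk)

chunks = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]
futures = [process.remote(c) for c in chunks]  # schedules 8 tasks across cores
print(sum(ray.get(futures)))                   # blocks until all tasks finish
```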

I've gotten Dask to work using 6 beefy PCs, which was really fast, but I never got a perfect setup. I often encountered strange errors on some of my machines, despite using identical environments. My experience with Dask is mostly frustration, but if you have a lot of computers on hand, then it's probably worth the trouble.
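
In case it helps anyone trying the same thing, the wiring was basically this (addresses and paths are placeholders):

```python
# On the head node:    dask-scheduler
# On each worker PC:   dask-worker tcp://HEAD_NODE_IP:8786
from dask.distributed import Client
import dask.dataframe as dd

client = Client("tcp://HEAD_NODE_IP:8786")  # connect from your notebook/script
print(client)                               # shows how many workers/cores/RAM you got

# The data path has to be reachable from every worker (shared drive, S3, etc.)
df = dd.read_csv("/shared/data/*.csv")
print(df["visit_count"].mean().compute())
```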

I still like SAS. It’s fast with large datasets, but costs $12k (at a minimum) per year.