People Data Labs Company Data: Reading in PySpark

dvdijcke · February 8, 2024, 4:31pm

Just in case it’s useful for anyone: the People Data Labs Company data has a slightly tricky encoding, with fields that span multiple lines and non-escaped quotes inside quoted fields, which really trips PySpark up. (Pandas handles it fine, but it’s a pretty huge dataset, so that wasn’t an option for me.)

This way of reading the data seems to parse everything correctly:

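# Assumes an existing SparkSession named `spark`; path_to_company_data is a
# placeholder for the path to the company data file.
df = (
    spark.read
    .option("header", "true")     # first row holds the column names
    .option("multiLine", "true")  # quoted fields may contain line breaks
    .option("quote", '"')         # fields are wrapped in double quotes
    .option("escape", '"')        # use the quote character, not a backslash, as the escape
    .csv(path_to_company_data)
)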
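
For anyone wondering what this kind of encoding looks like, here is a minimal sketch using Python's built-in csv module. The sample string is made up for illustration (it is not a real row from the People Data Labs data); it just has a quoted field containing both a line break and doubled quotes, which is my assumption about the sort of record that causes the trouble.

```python
import csv
import io

# Made-up sample, not real People Data Labs rows: the second field of the
# data row contains a line break and doubled quotes inside a quoted value.
sample = 'name,description\n"Acme","Makes ""widgets""\nand more"\n'

# By default csv.reader reads "" inside a quoted field as one literal quote
# (doublequote=True) and keeps embedded newlines as part of the field.
rows = list(csv.reader(io.StringIO(sample)))

print(len(rows))   # header row plus one data row
print(rows[1][1])  # the multi-line description with literal quotes
```

Python's csv module copes with this out of the box, much as Pandas does. Spark's CSV reader, by contrast, defaults to a backslash escape character and single-line records, which is presumably why the "escape" and "multiLine" options are needed there.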