Just in case it’s useful for anyone: the People Data Labs Company data has a slightly tricky encoding, with fields that span multiple lines and non-escaped quotes inside quoted fields, which really trips PySpark up. (Pandas handles it fine, but it’s a pretty huge dataset, so that wasn’t an option for me.)
This way of reading the data seems to parse everything correctly:
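# `spark` is an existing SparkSession.
# path_to_company_data is a placeholder: set it to the location of your company data file.
df = spark.read.option("header", "true") \
    .option("multiLine", "true") \
    .option("escape", "\"") \
    .option("quote", '"') \
    .csv(path_to_company_data)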
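For anyone wondering what those options are working around, here is a rough sketch using only Python's standard csv module. The sample row is made up (it is not real People Data Labs data), and I'm assuming the file writes a quote inside a quoted field as a doubled quote, which is what setting "escape" to the quote character tells Spark to expect. The second field also contains a line break, which is the case "multiLine" covers.

```python
import csv
import io

# Made-up sample, NOT real People Data Labs rows: the description field
# contains doubled quotes and an embedded newline inside a quoted field.
sample = 'name,description\n"Acme","Makes ""smart"" widgets\nand gadgets"\n'

# The csv module treats "" inside a quoted field as one literal quote by
# default, and keeps the line break as part of the field.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for row in rows:
    print(row)
```

With Spark's defaults the escape character is a backslash and each physical line is read as its own record, so a row like this would likely get split or mangled, which would explain why the extra options are needed.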