Do we want the explode_json_array function to work with the destination_cbgs column in the social distancing data?

Do we want the explode_json_array function to work with the destination_cbgs column in the social distancing data? I just found out the pandas explode function doesnt work on that column for some reason, but I have a simple work around if we want it to work. The only reason I can think of to not include it is because how large the data might grow.

@Brian_Blakely_BGSU good catch. Why doesn’t it work? On the SafeGraph side, our intention was for the data to be structured in the same manner so that the same function should work to unpack it. Did SafeGraph do something wrong?

@Ryan_Fox_Squire_SafeGraph I don’t think SafeGraph did anything wrong. Pandas built in explode function seems to fail when using that column, not too sure why to be honest. It’s possible that it tries to unpack the CBG’s as integers and might throw errors at integers that start with 0.

@Ryan_Fox_Squire_SafeGraph The work around is super simple and uses our existing functions to first unpack the json and then merge on the same key

@Brian_Blakely_BGSU hm, weird. can you elaborate on the workaround?

It would be nice if we had one function that handled both cases in some simple way. what does the work around look like?

@Ryan_Fox_Squire_SafeGraph It’s kind of just a hard coded catch if you are trying to explode ‘destination_cbgs’

        unpack_df = unpack_json(df[[array_column, place_key,file_key]],array_column, index_name = place_key)
        return df.merge(unpack_df,on=[place_key,file_key])```

@Ryan_Fox_Squire_SafeGraph The only issue with this approach is needing to slightly modify unpack_json in order to keep ‘date_start_range’ information, so we can merge on that and origin_cbg afterwards

@Brian_Blakely_BGSU maybe we could add this as an issue.

I feel like there is a deeper issue that we should be able to uncover for a more robust general solution, but until then you could at least post your work around in the issue for people to see. And if we aren’t able to come up with a more robust general solution, then we can always fall back to this suggestion.

@Ryan_Fox_Squire_SafeGraph Hard coding a work around is kind of lazy since we don’t solve the root of the issue, which is that exploding fails when you want to explode a json that has “string” : integer. I’m pretty sure that’s the cause of this error too, but it’s easier to see when you call explode on ‘bucketed_distance_traveled’

Right. So the question is how do we engineer a good solution for the root issue. I don’t know off top of my head, but I’m sure we can figure it out. having an “issue” would at least let us capture thoughts until one of us has time to work on it

I opened an issue for it

thank you! here it is: explode_json_array error on destination_cbgs · Issue #29 · SafeGraphInc/safegraph_py · GitHub

@Brian_Blakely_BGSU Sorry I do not understand what you mean by “The error is because the function is being called on a dictionary, not a list.”

My question is why are you (or the user) trying to use explode_json_array on the destination_cbg column? That’s just not the intended use. The intended use is to use unpack_json() or unpack_json_and_merge()

What am I not getting?

@Ryan_Fox_Squire_SafeGraph There is no real reason to be calling explode_json_array instead of unpack_json_and_merge(). Someone was trying to get my help using social distancing data and was coming to me with this problem and I at the time had forgotten that it applies only to lists. In any case, I think implementing a tiny patch to fix this from potentially confusing people by automatically calling the correct function might be a good idea.

@Brian_Blakely_BGSU

sorry for the confusion, and sorry that in your original post I didn’t realize what you were asking, because the super easy fast answer would have been "No, you should use unpack_json instead. "

@Ryan_Fox_Squire_SafeGraph All good, I should have realized it too as I am one of the people who helped make it. However, do you think avoiding the error is better for people who might have misunderstood the docs?

Or maybe put in the docs “if you receive this error it’s because x and you should use y instead”

I do not want to add code to the repo for the purpose of handling the case of people misunderstanding which function to use, and catching that and redirecting them to the correct function on the backend.

We should expect people to use the functions for their intended purposes. If people are not sure which function to use, then then it is our failure in documentation/readme. That problem is better solved with improved documentation, not more code.

I think we should consider additional documentation, possibly in the readme, to help people use the right function.

thanks for getting to the bottom of this with me @Brian_Blakely_BGSU

No problem, sorry for the confusion at first