Is there any dataset for mapping the census block groups to zip code?

@Ryan_Fox_Squire_SafeGraph Is there any dataset for mapping the census block groups to zip code?

@Deeksha_MIT I just sent you my code in a message - I hope I got it right! I used HUDUser data. I would love to hear if there is an official way to do it.
This seems to be an important piece of code and I was surprised not to find it - so spent half of my Saturday coding it… :face_palm:

@Dana_Turjeman_UMich_Ross would you be willing to share it with everyone? You can post it in this thread.

So is there really no such code available elsewhere? (I am shocked, and also relieved that I didn't just spend half my Saturday for nothing :slightly_smiling_face: ).
I am not sure about its quality, but here it is. I would appreciate anyone's comments if you find bugs. ZIP-TRACT data can be found here: HUD USPS ZIP Code Crosswalk Files | HUD USER

tks @Dana_Turjeman_UMich_Ross

@Dana_Turjeman_UMich_Ross thx, extremely helpful!

In case this is useful for others: to speed up processing the files, use fread and fwrite from the data.table package.

Replace read_csv(…) with

fread(cmd = paste0("gzip -dc ./sg-social-distancing/", directory_name, file_name, ".gz"), keepLeadingZeros = TRUE)

and write.csv(…) with

fwrite(x = tract_sd,
file = paste0("./sg-social-distancing-tract/", file_name))

Thank you @Andrey_Simonov_Columbia_GSB! Gosh, I really need to get my github going for these kinds of improvements! (but maybe after the manuscript is in? :confused: )

Hi, @Andrey_Simonov_Columbia_GSB - I see that the keepLeadingZeros=TRUE doesn’t really keep the leading zeros of the FIPS with 4 digits for some reason. I am using the read_csv function and that’s OK, albeit slower.
Also, I found a minor issue in the "to tract" part of the code: the line that aggregates distance_traveled_from_home should be: distance_traveled_from_home = sum(device_count*distance_traveled_from_home) / sum(device_count), #weighted mean
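For anyone following along outside R, here is a rough Python sketch of the two points above: restoring the leading zero on a FIPS code and taking the device-weighted mean when rolling CBG rows up to tracts. It is not the code from this thread. The function names and the sample rows are made up for illustration, and it assumes the standard 12-digit CBG code whose first 11 digits are the tract.

```python
from collections import defaultdict


def cbg_to_tract(cbg_code):
    """Pad a CBG FIPS code back to 12 digits and return its 11-digit tract prefix."""
    return str(cbg_code).zfill(12)[:11]


def aggregate_to_tract(rows):
    """Aggregate (cbg, device_count, distance_traveled_from_home) rows to tracts.

    The distance is averaged with device_count as the weight, i.e.
    sum(device_count * distance) / sum(device_count).
    """
    devices = defaultdict(int)
    weighted_sum = defaultdict(float)
    for cbg, device_count, distance in rows:
        tract = cbg_to_tract(cbg)
        devices[tract] += device_count
        weighted_sum[tract] += device_count * distance
    return {
        tract: {
            "device_count": devices[tract],
            "distance_traveled_from_home": (
                weighted_sum[tract] / devices[tract] if devices[tract] else None
            ),
        }
        for tract in devices
    }


# Made-up rows; the first code lost its leading zero by being read as a number.
rows = [
    (10010201001, 10, 1000.0),
    ("010010201002", 30, 2000.0),
    ("360610001001", 5, 500.0),
]
tract_sd = aggregate_to_tract(rows)
```

An unweighted mean of the first two rows would give 1500, while the weighted version gives 1750 because the second CBG has three times as many devices.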

@Andrey_Simonov_Columbia_GSB @Dana_Turjeman_UMich_Ross have either of you found time to put this up on github? Maybe someone else in the community could volunteer to do that for you?

cc: @Jessica_Williams-Holt

Yes, I can help!

Reviving this because we still don't have the final code (we ended up not using it in our paper), and also because I can make some clarifications that (I hope) will help others: to run the code from this thread, I used HUDUser data, which includes the TRACT to ZIP mapping you're looking for.

The reason it's hard to match TRACT to ZIP is that the relationship is not 1:1 or even m:1; there are ZIP codes that span multiple TRACTs and TRACTs that span multiple ZIPs :face_palm: However, if you REALLY want to, the code I posted can be OK (note the correction in a later message), and another way is simply to take, for each TRACT, the ZIP that holds the largest residential portion of it. This is the HUDUser link: HUD USPS ZIP Code Crosswalk Files | HUD USER. In the HUDUser documentation, you can find an explanation of the "simpler" practice of matching each TRACT to the ZIP with the maximum residential portion. I hope this helps. :pray:
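A minimal Python sketch of the "simpler" option, keeping for each TRACT the single ZIP with the largest residential ratio. It assumes crosswalk rows of (tract, zip, res_ratio), where res_ratio is the share of the tract's residential addresses in that ZIP, as I understand the HUD TRACT-ZIP file. The function name and the rows are invented for illustration, so check the column meanings against the HUD documentation before relying on it.

```python
def dominant_zip_per_tract(crosswalk):
    """Return {tract: zip} keeping, per tract, the ZIP with the highest res_ratio.

    Ties keep the first row encountered.
    """
    best = {}
    for tract, zip_code, res_ratio in crosswalk:
        if tract not in best or res_ratio > best[tract][1]:
            best[tract] = (zip_code, res_ratio)
    return {tract: zip_code for tract, (zip_code, _) in best.items()}


# Invented example rows: (tract, zip, res_ratio)
crosswalk = [
    ("01001020100", "36067", 0.85),
    ("01001020100", "36066", 0.15),
    ("01001020200", "36067", 1.00),
]
tract_to_zip = dominant_zip_per_tract(crosswalk)
```

This turns the m:m crosswalk into an m:1 lookup, at the cost of ignoring the minority share of any tract that is split across ZIPs.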

@Dana_Turjeman_UMich_Ross is the mapping the only blocker, or are there other blockers as well?

The only "blocker" is the mapping. Since it's an m:m mapping, there isn't a good way I know of to perfectly map between TRACT and ZIP codes. However, if we find a better mapping between, say, CBG and ZIP that is m:1 (because CBGs are much smaller, so hopefully each CBG will not be split across ZIPs), then the mapping will not be controversial. The reason I don't think ANYONE should be blocked by this is that weighting the mapping by the residential ratio, or mapping each TRACT to the ZIP with the maximum residential coverage, is good enough, as far as I understand from the HUDUser documentation. The documentation explains under what conditions these kinds of mapping might be problematic. I attach the documentation here for reference, but the source is the same HUDUser website.
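For the weighted option, here is a hedged Python sketch that splits each TRACT's count across ZIPs in proportion to the residential ratio. It assumes the same (tract, zip, res_ratio) row layout, with ratios summing to 1 within a tract. The names and the toy numbers are hypothetical, and this only makes sense for additive quantities such as device counts.

```python
from collections import defaultdict


def allocate_to_zip(tract_counts, crosswalk):
    """Distribute tract-level counts to ZIPs, weighting by res_ratio."""
    zip_counts = defaultdict(float)
    for tract, zip_code, res_ratio in crosswalk:
        if tract in tract_counts:
            zip_counts[zip_code] += tract_counts[tract] * res_ratio
    return dict(zip_counts)


# Toy data with placeholder IDs: tract T1 is split 75/25 between Z1 and Z2.
tract_counts = {"T1": 100, "T2": 40}
crosswalk = [
    ("T1", "Z1", 0.75),
    ("T1", "Z2", 0.25),
    ("T2", "Z2", 1.0),
]
zip_counts = allocate_to_zip(tract_counts, crosswalk)
```

Unlike the maximum-coverage shortcut, the totals are preserved here: the 140 devices across the two tracts still add up to 140 across the two ZIPs.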

cc @Roshan_George_SafeGraph. @Dana_Turjeman_UMich_Ross this is great, we had gotten started on a crosswalk doc here as well - How to Join X to Y. Free resources to connect together IDs, boundaries, and indexes from disparate systems. - Google Sheets

This may or may not help, but: I found a very quick way to merge CBG data with ZIP data to be via GeoPandas .sjoin function with CBG point data (lat,long) and ZCTA polygons using the ‘within’ option. For robustness, I plan to do a more proper intersection of CBG/ZCTA polygons (distributing CBG counts by proportion of overlapping area with ZCTA) but the mismatch in Census and ZIP boundaries makes all solutions bad solutions in one way or another.
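A small sketch of the point-in-polygon join described above, using GeoPandas with toy geometries standing in for CBG points and ZCTA polygons. The column names and coordinates are invented, real inputs would come from Census shapefiles, and the `predicate` keyword assumes a reasonably recent GeoPandas (older releases called it `op`).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy CBG points (one lat/long point per CBG); the IDs are placeholders.
cbg_points = gpd.GeoDataFrame(
    {"cbg": ["c1", "c2", "c3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)

# Toy ZCTA polygons: two adjacent unit squares.
zcta = gpd.GeoDataFrame(
    {"zcta": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Keep each CBG point that falls within a ZCTA polygon; c3 lies outside both
# squares and is dropped by the inner join.
joined = gpd.sjoin(cbg_points, zcta, how="inner", predicate="within")
cbg_to_zcta = dict(zip(joined["cbg"], joined["zcta"]))
```

Each CBG gets at most one ZCTA this way, which is fast but ignores CBGs whose area straddles a ZCTA boundary.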

@Christopher_Michael_Graziul_University_of_Chicago Many people want to join CBG to ZIP, do you have any code or additional documentation you could share?

Where did you obtain the ZCTA polygons? and the CBG centroids?

Note: This may be less useful, but I am fairly certain there exist ZCTA-based Census TIGER/Line files that contain at least a subset of Census variables. They are careful to caveat their methodology but these data may be helpful for checking robustness of certain kinds of analysis.

@Christopher_Michael_Graziul_University_of_Chicago consider adding your code to the new SafeGraph Awesome Data Science list

Cleaned it up a little and submitted as an issue. Hope it helps!