Are home CBGs adjusted in the downloaded data? How are sampling bias demographics collected?

I have two questions.

  1. SafeGraph says that 30-40% of total devices have troublesome home CBGs, and it proposes some methods to deal with CBG sampling-bias issues, such as Post-hoc Stratification Re-Weighting. I wonder whether our downloaded data has already gone through those adjustments. For example, in the Patterns dataset, are the home CBGs behind visitor_home_cbgs and distance_from_home already adjusted? Also, since home is only identified at the CBG level, how is the distance from home to a POI measured? Is the home taken to be some geometric center of the CBG?

  2. SafeGraph’s 2019 sampling-bias report (Google Colab) shows sampling biases by race, income, and education. I wonder how those demographics were collected, given that the data are anonymous? Is this 2019 report the most up-to-date file on the sampling-bias information?

Thank you in advance.

Hi @KANGLIN_CHEN_University_of_Florida , thanks for your questions!

  1. Can you link me to where SafeGraph says 30-40% of total devices have troublesome home CBGs and the proposed methods to fix them? I believe the data is available “raw,” in the sense that you would need to apply those methods yourself, but I can confirm once you link me to the source. As for the second part of the question, according to this section of the SafeGraph docs, the haversine distance is measured between the visitor’s home geohash-7 and the POI. From what I’ve found, geohash-7 precision gives an area of about 152 m x 152 m (roughly the size of a square-ish city block). Another part of the documentation that might be helpful is the Determining Home Location section.

  2. I believe the report you referenced is their most recent internal report. There is a slightly more recent academic paper that worked with SafeGraph to assess bias. Here is a webinar where the authors shared their research. I’m not sure if the authors did their analysis with or without the Post-hoc Stratification Re-Weighting (or any other normalization method), but that should be clear from the paper.
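To make the geohash-7 point concrete, here is a rough pure-Python sketch (not SafeGraph’s actual code) that decodes a geohash cell to its center point and computes the haversine distance to a POI. The geohash string and POI coordinates are made-up examples:

```python
import math

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_decode(gh: str) -> tuple[float, float]:
    """Return the (lat, lon) center of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # geohash bits alternate, starting with longitude
    for ch in gh:
        bits = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (bits >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical home geohash-7 cell and POI location (Gainesville, FL area):
home_lat, home_lon = geohash_decode("dhwfe4v")
poi_lat, poi_lon = 29.6516, -82.3248
print(round(haversine_m(home_lat, home_lon, poi_lat, poi_lon)), "meters")
```

So the distance is effectively measured from the center of a ~152 m cell rather than the exact home, which bounds the per-visitor error at roughly half a block.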

Hi Ryan, thank you for the kind reply!

  1. In the same Google Colab file, there are two statements: “These troublesome CBGs contain a relatively large % of the total panel devices (30-40% of total devices),” and “~10% of census block groups (CBGs) show disproportionately and sometimes impossibly high number_devices_residing in the sample panel compared to the true census_population.” I wonder whether, in our downloaded data, the home CBGs still carry such a high rate of inaccuracy in visitor_home_cbgs, distance_from_home, and the provided panel data (i.e., number_devices_residing) at the CBG level.
    Meanwhile, under other questions, I saw a reply saying that SafeGraph’s panel data is frequently updated over time (how frequently?). I am working with county-level monthly Patterns data for 2019-2020 and 2021-2022. Do you think it is “quite important” for me to normalize the number of visits by the panel data (e.g., using residing devices to compute sampling rates in each county at different times)?

  2. I also wonder how the demographic biases (i.e., by race, income, and education) were assessed.

  3. I have a follow-up question in another thread about sampling rates that has not been answered; it is related to my first question here. Could you please also take a look at it? Questions about normalized_visits_ by_state_scaling - #7 by Connie_Chen_University_of_Florida
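For concreteness, this is roughly what I mean by county sampling rates, as a sketch with invented numbers (field names mirror SafeGraph’s Home Panel Summary, but nothing here is real data):

```python
from collections import defaultdict

# (month, CBG) -> number_devices_residing, as in the Home Panel Summary.
# All values below are invented for illustration.
panel_rows = [
    ("2019-01", "120010002001", 50),
    ("2019-01", "120010002002", 70),
    ("2019-02", "120010002001", 50),
    ("2019-02", "120010002002", 60),
]
county_of = {  # CBG GEOID -> county FIPS (first 5 digits of the GEOID)
    "120010002001": "12001",
    "120010002002": "12001",
}
county_population = {"12001": 10_000}  # hypothetical Census estimate

def sampling_rates(rows, county_of, population):
    """Return {(month, county): devices_residing / census_population}."""
    devices = defaultdict(int)
    for month, cbg, n in rows:
        devices[(month, county_of[cbg])] += n  # aggregate CBGs up to county
    return {key: n / population[key[1]] for key, n in devices.items()}

rates = sampling_rates(panel_rows, county_of, county_population)
print(rates[("2019-01", "12001")])  # → 0.012
print(rates[("2019-02", "12001")])  # → 0.011
```

Dividing each month’s raw visit counts by that month’s rate would then put the two study periods on a comparable footing.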

Hi @Connie_Chen_University_of_Florida , you are welcome! I’ll try to provide some clarity.

  1. For the first question, it is possible the troublesome CBGs are present in the data you downloaded. The “CBG sink” issue appears several times on SafeGraph’s Known Issues or Data Artifacts page. CBG-level normalization (or micro-normalization) should help with the issue, as it scales the raw number_devices_residing by the Census population estimates and then uses visitor_home_cbgs to more accurately attribute visits.
    a. For the second question, SafeGraph’s panel is updated roughly every month. I believe it changes partly due to changes in reality (e.g., people moving from place to place) and partly because of improved/updated source data and processes on their side. For best results, I would suggest using the same month’s panel data to normalize the visits (e.g., the June 2022 panel should be used to normalize June 2022 Patterns).

  2. I believe a great resource for this is the paper I referenced earlier. I don’t think the authors applied any normalization, which could have a major impact on the measured bias.

  3. I will take a look at this other post later today when I get a chance!
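As a minimal sketch of the micro-normalization described in item 1 (all numbers invented, field names only mirroring SafeGraph’s schema), each home CBG’s visitor count is weighted by the inverse of that CBG’s sampling rate:

```python
panel = {  # CBG -> number_devices_residing (Home Panel Summary); invented
    "120010002001": 50,
    "120010002002": 400,  # a "CBG sink": implausibly many resident devices
}
census_population = {  # CBG -> Census population estimate; invented
    "120010002001": 1_000,
    "120010002002": 1_200,
}
visitor_home_cbgs = {  # one POI's visitors by home CBG (Patterns); invented
    "120010002001": 8,
    "120010002002": 40,
}

def normalized_visits(visitors, panel, population):
    """Weight each CBG's visitor count by its inverse sampling rate and sum."""
    total = 0.0
    for cbg, v in visitors.items():
        # inverse sampling rate = census_population / number_devices_residing
        total += v * population[cbg] / panel[cbg]
    return total

print(normalized_visits(visitor_home_cbgs, panel, census_population))  # → 280.0
```

The raw visitor total here is 48; the weighting scales the under-sampled CBG up and the over-sampled “sink” CBG down, which is exactly what the sink issue calls for.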