SafeGraph <> Advan Research Methodology Differences

evan-barry-dewey · January 23, 2023, 7:12pm

Now that Advan is available on the platform, I wanted to share some methodology differences between the Advan Patterns datasets and the one previously made available by SafeGraph.

No Modeling of Visits/Visitors in Shared Geometries
Advan computes the visits/visitors and other metrics inside a POI using the POI’s geometry.

If a POI has shared geometry, SafeGraph assigned a subset of the above visits to each POI in the Shared Polygon. Additional historical visitation techniques are listed here.

Example: a Pizza Hut and a Taco Bell may have a shared geometry (they both operate at the same location, e.g., inside a mall). If this polygon had 1,000 visits, SafeGraph would “assign” each of those 1,000 visits to either Pizza Hut or Taco Bell. Advan will report 1,000 visits to each of the two.

Effect: the majority of POIs with Shared Polygons will show much more traffic, so total visits and visitors summed across all POIs will be much higher, on the order of 10x (although the median visits will only be 25% higher). The important thing to understand, however, is that the historical trends (year/year changes, etc.) will be more consistent than before, as there is no scaling step adding a layer of uncertainty / fluctuation. If you need a list of the Placekeys with a shared polygon, you can find it here.

Recommended actions: if you are measuring year/year changes please filter out Shared Polygons from your computations. Advan provides a list of Placekeys pertaining to Shared Polygons that need to be filtered out.
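For anyone who wants to see what that filtering might look like, here is a minimal sketch. The function name, the dict-based input shape, and the Placekeys are all made up for illustration; the only thing taken from the post is the idea of dropping every Placekey on Advan's shared-polygon list before computing a year/year change.

```python
def yoy_change_excluding_shared(visits_prev, visits_curr, shared_placekeys):
    """Year/year change in total visits, skipping shared-polygon POIs.

    visits_prev / visits_curr: dict mapping Placekey -> visit count for the
    same period in two consecutive years (hypothetical input shape).
    shared_placekeys: Placekeys taken from the shared-polygon list.
    """
    # Compare only POIs present in both years and not in a shared polygon.
    keep = (visits_prev.keys() & visits_curr.keys()) - set(shared_placekeys)
    total_prev = sum(visits_prev[pk] for pk in keep)
    total_curr = sum(visits_curr[pk] for pk in keep)
    if total_prev == 0:
        raise ValueError("no prior-year visits left after filtering")
    return (total_curr - total_prev) / total_prev


# Made-up Placekeys and counts, for illustration only.
prev = {"pk-a": 1000, "pk-b": 1000, "pk-c": 400}
curr = {"pk-a": 1200, "pk-b": 1200, "pk-c": 500}
shared = {"pk-a", "pk-b"}  # e.g., two brands reported in one polygon

change = yoy_change_excluding_shared(prev, curr, shared)
print(f"year/year change: {change:.0%}")
```

In this toy data only `pk-c` survives the filter, so the change is computed from 400 to 500 visits.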

No estimation of Visits/Stops.
Advan computes visits by measuring the pings inside a POI’s polygon. It does not apply any dwell time or any concept of “stops”; it relies on the polygon for accuracy. Advan has tested its own data on 1,500 publicly traded tickers versus (a) top line revenue as reported from the companies and (b) credit card transaction counts on physical locations, and has determined consistently that in the vast majority of cases filtering for dwell time reduces the signal and makes the correlation/forecasting worse.
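To make "pings inside a POI's polygon" concrete, here is a rough sketch using a standard ray-casting point-in-polygon test. This is not Advan's implementation: the function names and the toy square are assumptions, and the post does not say how pings are deduplicated into visits or visitors, so the sketch only counts raw pings and applies no dwell filter.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) falls inside the polygon.

    polygon: list of (lon, lat) vertices in order, first vertex not repeated.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_pings_in_polygon(pings, polygon):
    """Count pings inside the polygon, with no dwell-time filter applied."""
    return sum(point_in_polygon(lon, lat, polygon) for lon, lat in pings)


# Toy unit square standing in for a POI polygon (not real coordinates).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pings = [(0.5, 0.5), (0.2, 0.9), (1.5, 0.5), (-0.1, 0.1)]

n_inside = count_pings_in_polygon(pings, square)
print(n_inside)
```

Because everything hinges on the polygon, a ping that lands just outside the boundary is simply not counted, which is why the post stresses polygon accuracy.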

SafeGraph first computed “stops”; then it compared the stops to the POIs within a 90 meter radius; then assigned a device to one of those POIs using an algorithm that takes into account hour of day, day of week, distance from the POI centroid, etc.

Effect: Advan’s visitation counts are a median of 25% higher (i.e., the typical location has 25% more devices observed in it). Additionally, as long as a POI’s polygon remains consistent, visit counts over time will be significantly more stable and there is less risk of visit cannibalization from neighboring POIs.

Panel details
Advan did not experience the panel changes or mobility data provider disruption that SafeGraph did in May; therefore, Advan’s visitation counts did not have large swings in 2022 and will be more consistent on a year over year basis. Advan is also less likely to encounter bugs like the normalized_visits_by_state_scaling issue posted here: SafeGraph data issue

Note that the trade area data currently use a different panel than the one used for visits, so visits and trade areas do not go hand in hand. Advan is planning on changing that in a future release (no ETA yet). Keeping the panels separate allows Advan to use a larger panel for the home/work data, and so to analyze a larger and more detailed demographic area. However, because this panel is not sourced as consistently as the panel used to count visits, it may vary more from month to month.

They’ve recently added a new panel provider for the home/work data that is much more stable. The month over month volatility of the trade area panels will be significantly reduced going forward (starting with the November 2023 data).

Number of CBGs and Census Tracts in Trade Areas.
Advan cuts the number of CBGs in a trade area to the top 1,000 and number of tracts in a trade area to the top 400 for each POI. SafeGraph did not. Advan generates home/work trade areas as 4 fields - geohash 6 (i.e., g6), g5, g4, and g3, so the more distant areas have lower granularity. Advan reserves the right to modify the schema in the future to similarly reduce the overall data size without losing granularity at the local level.

Effect: the visitor columns (visitor_home_cbgs, visitor_home_aggregation, etc.) will contain a much smaller number of CBGs / Census Tracts. Additionally, CBGs / Tracts that are distant from the POI (and less likely to have significant visitation to the POI) will be missing. Advan also filters out CBGs that did not have enough visitors, so if a POI has very few visitors, spread at 1 or 2 visitors per CBG, those CBGs will not show up; in extreme cases that can lead to a blank field.
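Here is a small sketch of how the truncation and the low-count filter could combine to produce a blank field. The `min_visitors` cutoff of 3 is a hypothetical value (the post does not state Advan's actual threshold), the order of the two steps is an assumption, and the CBG ids are placeholders.

```python
def truncate_trade_area(visitors_by_cbg, top_n=1000, min_visitors=3):
    """Keep the top_n CBGs by visitor count after dropping low-count CBGs.

    top_n=1000 matches the CBG cap mentioned in the post.
    min_visitors is a hypothetical threshold, not Advan's documented value.
    """
    # Drop CBGs below the (assumed) minimum visitor count.
    kept = {cbg: n for cbg, n in visitors_by_cbg.items() if n >= min_visitors}
    # Rank the remaining CBGs by visitors, highest first, and cap the list.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_n])


# Placeholder CBG ids and counts, for illustration only.
sparse_poi = {"cbg-1": 1, "cbg-2": 2, "cbg-3": 1}
busy_poi = {"cbg-1": 40, "cbg-2": 2, "cbg-3": 15, "cbg-4": 7}

blank_field = truncate_trade_area(sparse_poi)
top_two = truncate_trade_area(busy_poi, top_n=2)
print(blank_field, top_two)
```

The sparse POI ends up with an empty result even though it had visitors, which is the "blank field" case described above.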

Normalization.
Advan calculates the column normalized_visits_by_state_scaling by dividing the US adult population by the sum of unique visitors seen daily, multiplying that scale by the daily raw_visitor_count per POI and then summing this value over the respective time period. This data has been tested against “ground truth” and has proven to be robust in capturing true visit trends. Advan will be computing the remaining normalization columns using the methodology listed in SafeGraph’s documentation.
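As I read the description above, the Advan calculation works out to a per-day scale factor applied to each POI's daily count, then summed. A minimal sketch under that reading follows; the function name, the date-keyed dicts, and all the numbers (including the population figure) are made up for illustration.

```python
def advan_style_normalized_visits(daily_poi_visitors, daily_panel_visitors,
                                  us_adult_population):
    """Sum over days of (population / panel visitors that day) * POI count.

    daily_poi_visitors: dict day -> raw_visitor_count for one POI.
    daily_panel_visitors: dict day -> unique visitors seen in the panel.
    """
    total = 0.0
    for day, poi_visitors in daily_poi_visitors.items():
        # Daily scale: how many adults each observed device stands for.
        scale = us_adult_population / daily_panel_visitors[day]
        total += scale * poi_visitors
    return total


# Illustrative figures only; the panel doubles in size on the second day.
daily_poi = {"2023-01-01": 10, "2023-01-02": 20}
daily_panel = {"2023-01-01": 1_000_000, "2023-01-02": 2_000_000}
adults = 250_000_000  # round placeholder, not an official statistic

normalized = advan_style_normalized_visits(daily_poi, daily_panel, adults)
print(normalized)
```

In the toy data the raw count doubles only because the panel doubled, and each day contributes the same normalized amount, which is the panel-size correction the column is meant to provide.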

SafeGraph calculated the column normalized_visits_by_state_scaling by dividing the regional population by the unique visitors seen in the region (US State or CA Province) in the respective time period and multiplying that scale by the raw_visit_counts per POI.
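For comparison, the SafeGraph description above reduces to a single regional scale factor for the whole period. Again this is only a sketch with a hypothetical function name and invented numbers, not SafeGraph's code.

```python
def safegraph_style_normalized_visits(raw_visit_counts, region_population,
                                      region_panel_visitors):
    """Scale a POI's raw visits by (regional population / regional panel).

    region_* values refer to the POI's US state or Canadian province over
    the same time period as raw_visit_counts.
    """
    scale = region_population / region_panel_visitors
    return raw_visit_counts * scale


# Illustrative figures only.
normalized_sg = safegraph_style_normalized_visits(
    raw_visit_counts=300,
    region_population=10_000_000,
    region_panel_visitors=500_000,
)
print(normalized_sg)
```

The practical difference from the Advan description is the granularity of the scale: one regional factor per period here, versus a national factor recomputed every day.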

Effect: Both Advan’s and SafeGraph’s methods correct for differences in panel size and can be used to understand visits longitudinally.

Dwell computations.
When measuring the median dwell time in a location, Advan filters out any devices that have no dwell time. This is very similar to what SafeGraph had been doing (filtering for devices that “stopped”, i.e., with at least 1 minute dwell), as the majority of devices either have a dwell time of more than 1 minute or no dwell time at all; very few fall in the (0,1) bucket of dwell times.

Effect: Advan’s and SafeGraph’s methods for dwell metrics are substantially the same and the data will not change. However, for purposes of computing devices in each dwell bucket, Advan uses all the observations (whether the device dwelled in a location or not); this results in Sum(visitors on each dwell bucket) = all visitors. However, because the median dwell time is computed using only dwelled devices, Sum(visitors on each dwell bucket * median dwell time for the bucket) < all visitors * median dwell. This is by design, as the dwell buckets are measuring different things than the median dwell time.
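A toy example may help show why the buckets and the median do not line up. The bucket edges and dwell values below are invented (the post does not list the actual bucket boundaries); the only points taken from the text are that buckets count every device, while the median uses only devices with a non-zero dwell.

```python
from statistics import median


def dwell_summary(dwell_minutes, bucket_edges=(0, 5, 20, 60)):
    """Return (bucket counts over all devices, median over dwelled devices).

    bucket_edges are hypothetical lower bounds in minutes; the last bucket
    is open-ended. Devices with zero dwell land in the first bucket.
    """
    counts = [0] * len(bucket_edges)
    for d in dwell_minutes:
        # Index of the last bucket whose lower edge is <= d.
        idx = max(i for i, edge in enumerate(bucket_edges) if d >= edge)
        counts[idx] += 1
    # The median ignores devices that never dwelled.
    dwelled = [d for d in dwell_minutes if d > 0]
    median_dwell = median(dwelled) if dwelled else None
    return counts, median_dwell


# Made-up dwell times in minutes; three devices passed through with no dwell.
dwell = [0, 0, 0, 3, 8, 12, 30, 90]
bucket_counts, median_dwell = dwell_summary(dwell)

print(bucket_counts, sum(bucket_counts), median_dwell, median(dwell))
```

The bucket counts add up to all 8 devices, but the reported median comes from only the 5 that dwelled, so it is higher than a median over every device would be.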