What is the recommended approach to dealing with something like this (it looks to be very common)

what is the recommended approach to dealing with something like this (it looks to be very common), where there are multiple SafeGraph IDs for what appear, at least from the data in core_pois, to be exactly the same place? It seems wrong to sum the visits together, but there doesn't seem to be a clear way to choose which ID to treat as the "true" POI. For example, the three POIs
00579e722c8e48178ef0c66a7c91f92c
1e1c36608ab849a082e60e9583758715
d30e185c9ae040ecba3cac6a8c1b4e62
are all labeled as Temple University, and all have the same or almost exactly the same lat/long. The one ending in 92c appears to stop in April, e62 runs through all time but is essentially zero until May, and 715 doesn't start until June.
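(For anyone hitting the same thing, here is a minimal sketch of one ad-hoc workaround, not an official SafeGraph recommendation. It assumes each duplicate ID carries essentially zero visits outside its active window, in which case taking the per-month maximum across the duplicate IDs stitches one continuous series without double counting. The month labels and counts below are invented for illustration.)

```python
# Ad-hoc stitch of duplicate-POI visit series (illustrative only).
# Assumption: each duplicate ID is essentially zero outside its active
# window, so a per-month max avoids the double counting a sum would cause.
from collections import defaultdict

# Hypothetical monthly visit counts keyed by (shortened) SafeGraph ID,
# mimicking the Temple University pattern described above.
visits = {
    "92c": {"2020-03": 410, "2020-04": 380},                        # stops in April
    "e62": {"2020-03": 2, "2020-04": 1, "2020-05": 350, "2020-06": 390},
    "715": {"2020-06": 5, "2020-07": 420},                          # starts in June
}

def stitch(series_by_id):
    """Merge duplicate series by taking the max count per month."""
    merged = defaultdict(int)
    for series in series_by_id.values():
        for month, count in series.items():
            merged[month] = max(merged[month], count)
    return dict(sorted(merged.items()))

combined = stitch(visits)
```

This only works if the "phantom" duplicates really are near-zero outside their window; if two IDs each capture a meaningful share of real traffic in the same month, neither max nor sum is obviously right.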

Hi @Dan_Moulton_Federal_Reserve_Bank_of_Philadelphia, I am looking into this now

thank you

Hi Dan, if you’re looking for general guidance, I’d offer three resources: first, SafeGraph offers a matching service, where we help deduplicate POIs (https://docs.safegraph.com/docs/matching-service-overview); second, if you’re using a SQL interface, we have a collection of queries to help you identify “diffs” between releases (Data Science Resources | SafeGraph Docs); lastly, keep an eye on the changelog. SGPID churn is an important metric we use to monitor data quality. To be sure, it is a high-level metric, but we strive to provide as much detail and transparency as possible to help folks anticipate issues. (Changelog)

thanks for the links, yes I’m interacting with the data in Hive SQL and PySpark. It’s not so much identifying the diffs as choosing which ID to use when several represent the same POI. Is the idea that I should only use IDs in the latest Core POI file? That seems to throw away a lot of history. For example, among the Temple University POIs above, one of the IDs goes back to 2018, but the last Core POI file it shows up in is the May one

thanks for the link to the matching service, I’ll give that a go

Hi @Dan_Moulton_Federal_Reserve_Bank_of_Philadelphia, SafeGraph recommends using only the most recent Core and the most up-to-date data. The problem I see is the multiple SGPIDs; that should never happen for a single location. @todd_hendricks is taking a deeper look into this, but that should not be the case.

Hi @Dan_Moulton_Federal_Reserve_Bank_of_Philadelphia have you arrived at a resolution here?

Sorry for the glacial response, I was on vacation. No, I have not reached a resolution here. Still the same issue

@Dan_Moulton_Federal_Reserve_Bank_of_Philadelphia Apologies for the stupid question here, but which product/file are you using? I’m looking at the Core Places (US Only) files in the safegraph_places_id column and I’m not finding the IDs you listed above.

sg:1e1c36608ab849a082e60e9583758715 and sg:d30e185c9ae040ecba3cac6a8c1b4e62 are both from the September file

the first one I gave you is from the May file

but if I don’t use that ID, then I don’t have a complete history of Temple’s foot traffic

here is another example that was returned directly by SafeGraph’s matching service:
sg:2ceae1f4f92144ceadc3cbf9613ab346
sg:120f81665c6d4db5b4db37e8aeb46572

both are “Coastal Alabama Community College Bay Minette Campus”

they have different NAICS codes in that particular case

but trivially different

6113 vs 6112

@Dan_Moulton_Federal_Reserve_Bank_of_Philadelphia Yes, I am seeing the Coastal Alabama CC examples. What is the issue you see / what was your expectation? Those look right to me.

Also, Temple and Coastal Alabama highlight the unique geospatial challenge of universities in particular. I found this thread from earlier in the year to be informative, and I hope you do too: Workspace Deleted | Slack