There seem to be a lot fewer devices counted in there than usual (~2.7 million). Is this a result of a methodology change?

Jason_Kao_Columbia_University · December 10, 2020, 12:00am

Hi, I just pulled down the latest home summary file. There seem to be a lot fewer devices counted in there than usual (~2.7 million). Is this a result of a methodology change? Or is this to be expected?

I summed the number_devices_residing field with awk and got these results:
• The output of awk -F',' 'FNR > 1 { s += $5} END { print s }' home-panel/2020/12/09/18/home_panel_summary.csv is 15652746.
• The output of awk -F',' 'FNR > 1 { s += $5} END { print s }' home-panel/2020/12/02/19/home_panel_summary.csv is 18317362.
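For anyone who prefers Python to awk, the same column sum can be sketched roughly as below. It assumes the CSV header contains number_devices_residing (the awk commands address it as field 5); the helper name sum_devices is hypothetical. The last lines only difference the two totals reported above.

```python
import csv

def sum_devices(path, column="number_devices_residing"):
    """Sum one integer column of a CSV file, using the header row for lookup."""
    total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += int(row[column])
    return total

# Totals reported above for the two releases.
dec_02 = 18317362
dec_09 = 15652746
drop = dec_02 - dec_09
print(drop, round(100 * drop / dec_02, 1))  # 2664616 14.5
```

So the ~2.7 million figure is the week-over-week decline, roughly 14.5% of the earlier total.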

Ryan_Kruse_MN_State · December 11, 2020, 5:29am

Hi @Jason_Kao_Columbia_University, just to verify that we’re on the same page, which weeks are you seeing ~2.7 million for? And do the suspiciously low weeks only include the 2020/12/09 and 2020/12/02 releases? Also, have you looked at any of the other summary stats, like total_visits ?

Bruce_Mizrach_Rutgers_University · December 17, 2020, 3:08am

I sum total devices in the US from the home summary file, and there has been a big decline in the last two datasets.

Ryan_Kruse_MN_State · December 18, 2020, 4:30pm

@Bruce_Mizrach_Rutgers_University @Jason_Kao_Columbia_University This is helpful, thanks. I will investigate further, then get back to you

Bruce_Mizrach_Rutgers_University · December 18, 2020, 4:32pm

Jason/Ryan: Have you been using other (potentially more stable) normalization stats? I had been using the devices seen (successfully) for almost a year.

Ryan_Kruse_MN_State · December 22, 2020, 1:10am

@Bruce_Mizrach_Rutgers_University The most consistently successful normalization technique in my opinion is home_panel_summary’s devices_residing (monthly). I’m sorry–I haven’t gotten the chance to investigate this myself yet

Bruce_Mizrach_Rutgers_University · December 22, 2020, 1:42am

Ryan: that is what I am reporting in the table. Was very surprised to see this change so dramatically in a very short time

Ryan_Fox_Squire_SafeGraph · December 22, 2020, 3:04am

@Ryan_Kruse_MN_State so you are pulling total_devices_seen daily from normalization_stats (summed across all states), and the number_devices_residing from home_panel_summary.csv (summed across all CBGs), is that right?

Ryan_Kruse_MN_State · December 22, 2020, 3:05am

@Ryan_Fox_Squire_SafeGraph Correct

Ryan_Kruse_MN_State · December 22, 2020, 3:08am

@Ryan_Fox_Squire_SafeGraph I believe I misspoke when I said total_devices_seen should be greater than the total of number_devices_residing, but there still seems to be a significant decrease in the total of number_devices_residing. Perhaps it is related to Thanksgiving travel? Maybe that would reduce the confidence in some people’s assigned home CBG?

Bruce_Mizrach_Rutgers_University · December 22, 2020, 3:08am

import pandas as pd

# weekly_dates (a DataFrame with a 'week' column) is assumed to be defined earlier.
old_string='2020-11-30'
end_string='2020-12-07'
# Process only the most recent week.
for j in range(len(weekly_dates)-1,len(weekly_dates)):
    sdate=str(weekly_dates['week'].iloc[j])
    syear=sdate[0:4]
    infile='T:/AltData/Safegraph/weekly_patterns/home-summary-file/' + syear + '/' + sdate[0:10] + '-home-panel-summary.zip'
    panel_df=pd.read_csv(infile)
    print(infile)
    panel_df['state']=panel_df['state'].apply(lambda x: x.upper())
    # Sum number_devices_residing over all CBGs in each state.
    panel_df_state=panel_df.groupby('state',as_index=False).agg({'number_devices_residing':'sum'})
    panel_state=panel_df_state.set_index('state').transpose()
    panel_state['sdate']=sdate[0:10]
    # Add this week's state totals to the running history file.
    oldfile='T:/AltData/Safegraph/weekly_patterns/home-summary-file/2018-12-31_' + old_string +'adjustment_factors.xlsx'
    old_df=pd.read_excel(oldfile)
    new_df=pd.concat([old_df,panel_state])  # DataFrame.append was removed in pandas 2.0
    outfile='T:/AltData/Safegraph/weekly_patterns/home-summary-file/2018-12-31_' + end_string +'adjustment_factors.xlsx'
    print(outfile)
    full_adj2=new_df.set_index('sdate')
    full_adj2.to_excel(outfile)

# Aggregate the state totals to a US total and a deflator relative to the first week.
infile='T:/AltData/Safegraph/weekly_patterns/home-summary-file/2018-12-31_' + end_string +'adjustment_factors.xlsx'
df=pd.read_excel(infile)
df2=df.set_index('sdate')
df2.head()
df2['US_devices']= df2.sum(axis=1)
df3=df2.reset_index()
df4=df3[['sdate','US_devices']].copy()  # copy to avoid SettingWithCopyWarning
df4['US_deflator']=df4['US_devices']/df4['US_devices'].iloc[0]
outfile='T:/AltData/Safegraph/weekly_patterns/US/Agg/2018-12-31' + end_string + '_deflators_us.xlsx'
df4.to_excel(outfile,index=False)
df4.tail()
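
To make the final step concrete: the deflator divides each week’s US device total by the first week’s total, so the first value is always 1.0 and a sudden drop in the panel shows up as a drop in the deflator. A minimal sketch with made-up totals (the function name and the numbers are illustrative only, not SafeGraph data):

```python
def deflators(totals):
    """Scale each weekly total by the first week's total."""
    base = totals[0]
    return [t / base for t in totals]

# Hypothetical weekly US device totals.
weekly_totals = [1000, 1100, 900]
print(deflators(weekly_totals))  # [1.0, 1.1, 0.9]
```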

Bruce_Mizrach_Rutgers_University · December 22, 2020, 3:09am

Here is the Python. It just sums the home summary file for the week. (I can send it to you in some machine-readable form.)

Ryan_Fox_Squire_SafeGraph · December 22, 2020, 3:10am

I’ve reported this internally at SafeGraph, and someone from the product team will follow up with you. I am not sure whether this is expected for any reason, but it looks suspicious.

Bruce_Mizrach_Rutgers_University · December 22, 2020, 3:11am

The output is above. In the last two weeks, the US devices fell by 27,000. At first, I thought it was a Thanksgiving seasonal, but it persisted into the following week.
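
One rough way to tell a one-week holiday dip from a persistent shift is to check whether the total stays below the pre-drop level for consecutive weeks. The sketch below is only an illustration: the 10% threshold, the two-week window, the function name, and both sample series are assumptions, not SafeGraph values.

```python
def persistent_drop(totals, threshold=0.10, min_weeks=2):
    """Return the index of the first week that starts a run of `min_weeks`
    totals all at least `threshold` below the preceding week, else None."""
    for i in range(1, len(totals) - min_weeks + 1):
        baseline = totals[i - 1]
        window = totals[i:i + min_weeks]
        if all(t <= baseline * (1 - threshold) for t in window):
            return i
    return None

persistent = [100, 101, 99, 85, 84]   # hypothetical: the drop persists
one_week = [100, 101, 85, 100, 99]    # hypothetical: the dip recovers
print(persistent_drop(persistent))    # 3
print(persistent_drop(one_week))      # None
```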

Bruce_Mizrach_Rutgers_University · December 22, 2020, 3:11am

I can send the Python to you by e-mail.

Ryan_Kruse_MN_State · December 22, 2020, 3:17am

Thanks @Bruce_Mizrach_Rutgers_University. I think we will be okay without your code because I was able to reproduce it. For the time being, I’d suggest omitting these weeks from analysis or trying to use another approach for normalization.

Bruce_Mizrach_Rutgers_University · December 22, 2020, 6:59pm

Hard to complain because the data is free, but if I was paying for the data, this would be unacceptable. WE discovered the error, not Safegraph, and if I had relied on this data, I would have predicted (incorrectly) a huge surge in economic activity. That might have led someone like me to, for example, testify that the stimulus bill just passed could be scaled down.

Ryan_Fox_Squire_SafeGraph · December 22, 2020, 7:25pm

@Bruce_Mizrach_Rutgers_University absolutely understand. We strive to have wholly reliable data, and proactively discover issues, and this is unsatisfactory. We are actively investigating the root issue and will try to resolve asap.

Ryan_Fox_Squire_SafeGraph · December 22, 2020, 10:37pm

I’m also able to easily reproduce this by reading the data located in this path:

'<s3://sg-c19-response/weekly-patterns-delivery/weekly/home_panel_summary/2020/*/*/*/*.csv>'

Ryan_Fox_Squire_SafeGraph · December 22, 2020, 10:47pm

@Bruce_Mizrach_Rutgers_University @Jason_Kao_Columbia_University @Ryan_Kruse_MN_State

I think I’ve gotten to the root of the issue, and it has to do with confusion, poor communication, and poor documentation on SafeGraph’s part around the latest backfill delivery.

The immediate fix to your problem is to update the paths and logic you are using to read the data.

As of the beginning of Dec we updated our algorithms and data and created a backfill:

The best, most correct version of historical data we have available is located at these paths:
• <s3://sg-c19-response/weekly-patterns-delivery/weekly/patterns_backfill/>
• <s3://sg-c19-response/weekly-patterns-delivery/weekly/home_panel_summary_backfill/>
• <s3://sg-c19-response/weekly-patterns-delivery/weekly/normalization_stats_backfill/>
This historical data goes back to 2018 and stops at the end of November 2020.

For all data after Nov 2020, you should use the following paths:

• <s3://sg-c19-response/weekly-patterns-delivery/weekly/patterns/>
• <s3://sg-c19-response/weekly-patterns-delivery/weekly/home_panel_summary/>
• <s3://sg-c19-response/weekly-patterns-delivery/weekly/normalization_stats/>
• AND YOU SHOULD IGNORE ALL DATA BEFORE DECEMBER 2020 IN THESE PATHS (because it is from the previous version).
We are working right now on updating the documentation in our catalog to make this clearer, and we apologize for the confusion.

If you read the historical data before Dec 2020 from the backfill paths, and read only Dec 2020 onward from the original paths, then things should look continuous and non-anomalous.
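
The path rule above can be sketched as a small helper for the two summary datasets. This is only an illustration: the function name, the exact cutover date of 2020-12-01, and how that boundary maps onto individual weekly releases are assumptions; only the bucket prefixes come from the lists above.

```python
from datetime import date

BASE = "s3://sg-c19-response/weekly-patterns-delivery/weekly/"
# Assumed boundary: the backfill covers data through the end of November 2020.
CUTOVER = date(2020, 12, 1)

def prefix_for(dataset, day):
    """Return the S3 prefix to read for `dataset` on `day`.

    dataset: 'home_panel_summary' or 'normalization_stats'.
    Dates before the cutover use the *_backfill prefix; later dates use
    the original prefix.
    """
    suffix = "_backfill" if day < CUTOVER else ""
    return f"{BASE}{dataset}{suffix}/"

print(prefix_for("home_panel_summary", date(2020, 11, 23)))
print(prefix_for("home_panel_summary", date(2020, 12, 9)))
```

Reading through a helper like this keeps the pre-December files in the original paths, which are from the previous version, out of the series.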