Spend Patterns -- large outliers, look wrong

In the SafeGraph spend patterns there seem to be erroneous entries. Here is an example of what I'm seeing: in rows 1 and 21 there are massive spikes in spending (and in customer counts). There are other examples of this throughout the data. Does anyone know what's going on, or have advice on how to deal with these observations?

   placekey            date       month_spend month_cust med_spend_cust location
   <chr>               <date>           <dbl>      <int>          <dbl> <chr>   
 1 225-223@5x3-fhg-p35 2019-03-01     19021.        1246           10   Omak, WA
 2 225-223@5x3-fhg-p35 2019-04-01        50            2           25   Omak, WA
 3 225-223@5x3-fhg-p35 2019-08-01       125            6           22.5 Omak, WA
 4 225-223@5x3-fhg-p35 2019-09-01       314.          10           25   Omak, WA
 5 225-223@5x3-fhg-p35 2019-10-01       170            6           20   Omak, WA
 6 225-223@5x3-fhg-p35 2019-11-01       205            7           15   Omak, WA
 7 225-223@5x3-fhg-p35 2019-12-01       140            3           55   Omak, WA
 8 225-223@5x3-fhg-p35 2020-01-01       110            3           40   Omak, WA
 9 225-223@5x3-fhg-p35 2020-02-01       135            6           20   Omak, WA
10 225-223@5x3-fhg-p35 2020-03-01        90            4           20   Omak, WA
11 225-223@5x3-fhg-p35 2020-05-01        30            3           10   Omak, WA
12 225-223@5x3-fhg-p35 2020-06-01       100            7           10   Omak, WA
13 225-223@5x3-fhg-p35 2020-07-01        55            2           27.5 Omak, WA
14 225-223@5x3-fhg-p35 2020-08-01        60            4           12.5 Omak, WA
15 225-223@5x3-fhg-p35 2020-09-01       167.           7           10   Omak, WA
16 225-223@5x3-fhg-p35 2020-10-01        80            3           15   Omak, WA
17 225-223@5x3-fhg-p35 2020-11-01        81.1          4           17.5 Omak, WA
18 225-223@5x3-fhg-p35 2020-12-01        77.3          4           12.7 Omak, WA
19 225-223@5x3-fhg-p35 2021-01-01        73.0          5           11.2 Omak, WA
20 225-223@5x3-fhg-p35 2021-02-01        35.2          2           17.6 Omak, WA
21 225-223@5x3-fhg-p35 2021-03-01     39040.        2230           15   Omak, WA
22 225-223@5x3-fhg-p35 2021-04-01        94.8          4           25   Omak, WA
23 225-223@5x3-fhg-p35 2021-05-01        72.0          5           10   Omak, WA

Is this data reliable? I suppose slight fluctuations here and there are possible (e.g., a store closes for renovation and so has no customers), but these huge changes appear all over the data. Here is an example of a very low outlier (row 16):

  placekey            date       month_spend month_cust med_spend_cust location   
   <chr>               <date>           <dbl>      <int>          <dbl> <chr>      
 1 222-222@5vh-rc9-yy9 2019-01-01      2531.         175          10.7  Modesto, CA
 2 222-222@5vh-rc9-yy9 2019-02-01      2623.         191           9.5  Modesto, CA
 3 222-222@5vh-rc9-yy9 2019-03-01      2657.         188           9.35 Modesto, CA
 4 222-222@5vh-rc9-yy9 2019-04-01      2450.         219           8    Modesto, CA
 5 222-222@5vh-rc9-yy9 2019-05-01      3040.         241           8.9  Modesto, CA
 6 222-222@5vh-rc9-yy9 2019-06-01      2480.         193           9.5  Modesto, CA
 7 222-222@5vh-rc9-yy9 2019-07-01      2474.         204           8.83 Modesto, CA
 8 222-222@5vh-rc9-yy9 2019-08-01      2389.         202           9.7  Modesto, CA
 9 222-222@5vh-rc9-yy9 2019-09-01      2652.         191           9.8  Modesto, CA
10 222-222@5vh-rc9-yy9 2019-10-01      2986.         222           8.58 Modesto, CA
11 222-222@5vh-rc9-yy9 2019-11-01      2669.         187          10.4  Modesto, CA
12 222-222@5vh-rc9-yy9 2019-12-01      3121.         208           9.95 Modesto, CA
13 222-222@5vh-rc9-yy9 2020-01-01      2051.         162           8.85 Modesto, CA
14 222-222@5vh-rc9-yy9 2020-02-01      2210.         184           8.98 Modesto, CA
15 222-222@5vh-rc9-yy9 2020-03-01      2198.         170           9.15 Modesto, CA
16 222-222@5vh-rc9-yy9 2020-04-01        20.1          2          10.0  Modesto, CA
17 222-222@5vh-rc9-yy9 2020-05-01      1243.          90          10.3  Modesto, CA
18 222-222@5vh-rc9-yy9 2020-06-01      1576.         128           9.4  Modesto, CA

@fullagar I’ll float this by the SafeGraph team for feedback.

Generally, the data is subject to more fluctuation at the POI level for a variety of reasons that are hard to define or adjust for (holidays, renovations, large purchases, etc.). In the past, SafeGraph has suggested aggregating this data at various levels (brand, NAICS, location, etc.) to smooth out those fluctuations.
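As a rough sketch of that aggregation idea in pandas: roll POI-month rows up to brand-month totals so single-store spikes average out. The column names (`placekey`, `date`, `month_spend`, `month_cust`) mirror the sample printed above, but the `brand` column and the toy values are assumptions for illustration, not your actual extract.

```python
import pandas as pd

# Toy POI-month rows shaped like the sample above; the "brand" column is
# hypothetical -- join it in from Core Places or your own mapping.
df = pd.DataFrame({
    "placekey": ["225-223@5x3-fhg-p35", "225-223@5x3-fhg-p35",
                 "222-222@5vh-rc9-yy9"],
    "brand": ["Starbucks", "Starbucks", "Starbucks"],
    "date": pd.to_datetime(["2019-03-01", "2019-04-01", "2019-03-01"]),
    "month_spend": [19021.0, 50.0, 2657.0],
    "month_cust": [1246, 2, 188],
})

# Aggregate POI-level spend up to brand-month; idiosyncratic single-store
# spikes are diluted across the many locations in each group.
brand_monthly = (
    df.groupby(["brand", "date"], as_index=False)
      .agg(total_spend=("month_spend", "sum"),
           total_cust=("month_cust", "sum"),
           n_pois=("placekey", "nunique"))
)
```

The same pattern works for NAICS-level or CBSA-level grouping; the more POIs per group, the smoother the series.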

I don’t know what type of POI this is, but in your second example it is quite plausible that the POI simply had very few transactions during the first month of the pandemic.

Hi @evan-barry-dewey. Thanks for the reply. So these fluctuations are not mismeasurements but likely capture real variation driven by random events (holidays, renovations, large purchases, etc.)? I am looking at Starbucks stores, so I agree with your pandemic conjecture. However, these fluctuations occur at other times too. Is it that people are spending money at these places but the spending isn’t being captured by the data due to sampling?

It’s always a good idea to gut-check the data and report extreme outliers; I’ve sent these back to SafeGraph for review. In the past, they’ve also suggested filtering out outliers beyond a certain min/max threshold, depending on the use case. This helps remove locations where coverage is too low to be considered reliable.

One thing to consider is fluctuation in panel size, so it’s also worth normalizing the data against the number of transactions and customers.
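A minimal sketch of that normalization, using the column names from the samples above: dividing spend by the sampled customer count gives a per-customer figure that is less sensitive to panel-size swings. Notably, applying this to the values in row 21 of the first table (39040. spend, 2230 customers) gives roughly 17.5 per customer, close to the neighboring months, which suggests that spike is driven by sample size rather than by behavior.

```python
import pandas as pd

# Two months from the Omak sample above: row 20 (normal) and row 21 (spike).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2021-03-01"]),
    "month_spend": [35.2, 39040.0],
    "month_cust": [2, 2230],
})

# Per-customer spend: panel-size swings cancel out of the ratio.
df["spend_per_cust"] = df["month_spend"] / df["month_cust"]
```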

There is a notebook on sampling bias that was published previously and might be helpful to review.