Hi, our team has detected duplicated rows in the Patterns data, specifically in the Patterns data for all Subway locations in Houston from 2020 until now. We find the duplicated rows by subsetting on “placekey”, “date_range_start”, and “date_range_end”.

Hi @Angel_Langdon, thank you for posting this.

  1. Are you using Weekly or Monthly Patterns?
  2. Are the entire rows the same or just the placekey/dates?
  3. How many times is a given row duplicated? (Are there just two copies or are there more?)
  4. Can you provide a screenshot of some of the duplicated rows?

  1. We are using Monthly Patterns.
  2. No, the entire rows are not the same; only the placekey and dates are duplicated.
  3. Each duplicated row appears exactly twice (two copies).
  4. Yes, here it is:

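Here is the Python code to get the duplicated rows:
# Column list taken from the description above; keep=False (assumed) keeps every copy of a duplicated key
df = df[df.duplicated(subset=["placekey", "date_range_start", "date_range_end"], keep=False)]
# Sort so that the copies of each duplicated key sit next to each other
df = df.sort_values(by=["placekey", "date_range_start", "date_range_end"])
df.to_excel("duplicated_rows.xlsx", index=False)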
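
Since only the placekey and dates match between the copies, it may help to see which columns actually differ within each duplicated pair. The following is a minimal sketch, not part of the original code: the helper name `differing_columns`, the sample values, and the two non-key columns are made up for illustration, and it assumes a pandas version that supports `dropna=` in `groupby` (1.1 or later).

```python
import pandas as pd

KEY = ["placekey", "date_range_start", "date_range_end"]


def differing_columns(df, key=KEY):
    """Map each duplicated key to the columns whose values differ between its copies."""
    # Keep every copy of any key that appears more than once.
    dupes = df[df.duplicated(subset=key, keep=False)]
    # nunique(dropna=False) counts a missing value as its own value, so a column
    # that is filled in one copy and empty in the other is reported as differing.
    counts = dupes.groupby(key, dropna=False).nunique(dropna=False)
    return {
        k: [col for col in counts.columns if row[col] > 1]
        for k, row in counts.iterrows()
    }


# Made-up sample rows: "aaa" appears twice, and one copy lacks location_name.
sample = pd.DataFrame({
    "placekey": ["aaa", "aaa", "bbb"],
    "date_range_start": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "date_range_end": ["2020-02-01", "2020-02-01", "2020-02-01"],
    "raw_visit_counts": [10, 10, 7],
    "location_name": ["Subway", None, "Subway"],
})
result = differing_columns(sample)
print(result)
```

If the differing columns turn out to be the same set for every pair, that would suggest a systematic cause rather than genuinely conflicting records.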

Hi @Angel_Langdon, I was unable to recreate this issue, but I think I know what’s going on. The rows with missing information seem to just have the columns from the Patterns files, but not the Places files. I believe at some point in the processing, the Patterns files are getting duplicated. When I download using the tool we talked about in the other thread, I am not seeing any duplicated Subway rows in Houston.
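
If the explanation above holds (one copy of each pair carries only the Patterns columns and lacks the Places columns), a workaround until the root cause is found could be to keep the more complete copy of each key. This is only a sketch under that assumption: `PLACES_COLS` is a hypothetical stand-in for whichever Places columns your files contain, and `drop_incomplete_copies` and the sample rows are made up for illustration.

```python
import pandas as pd

KEY = ["placekey", "date_range_start", "date_range_end"]
# Hypothetical stand-in: replace with the Places columns present in your data.
PLACES_COLS = ["location_name", "street_address"]


def drop_incomplete_copies(df, key=KEY, places_cols=PLACES_COLS):
    """Keep one row per key, preferring a copy that has Places columns filled in."""
    # True for rows where every Places column is missing.
    missing_places = df[places_cols].isna().all(axis=1)
    # Stable sort puts complete rows (False) ahead of incomplete ones (True)
    # without otherwise reordering the data.
    ordered = df.assign(_missing=missing_places).sort_values("_missing", kind="stable")
    deduped = ordered.drop_duplicates(subset=key, keep="first")
    # Restore the original row order and drop the helper column.
    return deduped.drop(columns="_missing").sort_index()


# Made-up sample rows: the first copy of "aaa" has no Places information.
patterns = pd.DataFrame({
    "placekey": ["aaa", "aaa", "bbb"],
    "date_range_start": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "date_range_end": ["2020-02-01", "2020-02-01", "2020-02-01"],
    "location_name": [None, "Subway", "Subway"],
    "street_address": [None, "1 Main St", "2 Main St"],
})
cleaned = drop_incomplete_copies(patterns)
print(cleaned)
```

Note that this only hides the symptom. If both copies of a key lack Places columns, one of them is still kept, so it is worth checking where in the processing the Patterns rows get duplicated.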

Okay @Ryan_Kruse_MN_State, thank you for your response. I suppose we have a bug somewhere.