Our team has detected duplicated data in patterns data

Hi, our team has detected duplicated data in the Patterns data, specifically in the Patterns data for all Subways in Houston from 2020 until now. We get duplicated rows when checking for duplicates on the subset “placekey”, “date_range_start”, “date_range_end”.

Hi @Angel_Langdon thank you for posting this.

  1. Are you using Weekly or Monthly Patterns?
  2. Are the entire rows the same or just the placekey/dates?
  3. How many times is a given row duplicated? (Are there just two copies or are there more?)
  4. Can you provide a screenshot of some of the duplicated rows?
Thanks

  1. We are using Monthly Patterns
  2. No, the entire rows are not the same; only the placekey/dates are duplicated
  3. Each duplicated row appears exactly twice (two copies)
  4. Yes, here it is:

Here is the Python code to get the duplicated rows:
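# Keep every row whose placekey/date range combination appears more than once
df = df[df.duplicated(subset=["placekey",
                              "date_range_start",
                              "date_range_end"],
                      keep=False)]
# Sort so that the copies of each row end up next to each other
df = df.sort_values(by=["placekey",
                        "date_range_start",
                        "latitude"])
df.to_excel("duplicated_rows.xlsx", index=False)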
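
For anyone who wants to answer questions 2 and 3 above in code rather than by eye, something like the following might help. It is only a minimal sketch: `summarize_duplicates` is a hypothetical helper name, the toy rows are invented for illustration, and the only column names assumed are the ones already used in this thread.

```python
import pandas as pd

KEYS = ["placekey", "date_range_start", "date_range_end"]

def summarize_duplicates(df, keys=KEYS):
    """Return (copies per duplicated key, count of fully identical rows)."""
    # Count how many rows share each key combination, keeping keys seen more than once.
    copies = df.groupby(keys, dropna=False).size()
    copies = copies[copies > 1]
    # Count rows that are identical in every column, not only in the key columns.
    full_dupes = int(df.duplicated(keep=False).sum())
    return copies, full_dupes

# Toy data, invented for illustration (not real Patterns rows).
toy = pd.DataFrame({
    "placekey": ["aaa@111", "aaa@111", "bbb@222"],
    "date_range_start": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "date_range_end": ["2020-02-01", "2020-02-01", "2020-02-01"],
    "latitude": [29.76, None, 29.70],
})

copies, full_dupes = summarize_duplicates(toy)
print(copies)      # one entry per duplicated key, with its number of copies
print(full_dupes)  # number of rows that are identical across all columns
```

If `full_dupes` is zero while `copies` is not empty, the duplication is limited to the key columns, which matches what is described above.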

Hi @Angel_Langdon, I was unable to recreate this issue, but I think I know what’s going on. The rows with missing information seem to just have the columns from the Patterns files, but not the Places files. I believe at some point in the processing, the Patterns files are getting duplicated. When I download using the tool we talked about in the other thread, I am not seeing any duplicated Subway rows in Houston.
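
If that is what is happening, one possible way to check it (and work around it until the processing bug is found) is sketched below. This is an assumption-heavy illustration, not a confirmed fix: `drop_incomplete_copies` is a hypothetical helper, the toy rows are invented, and it assumes that `latitude` comes from the Places files and is therefore empty on the rows that only carry Patterns columns.

```python
import pandas as pd

KEYS = ["placekey", "date_range_start", "date_range_end"]

def drop_incomplete_copies(df, keys=KEYS, places_col="latitude"):
    """Drop rows lacking Places data when a complete copy with the same key exists."""
    # Assumption: places_col is null exactly on the rows missing the Places columns.
    complete = df[places_col].notna()
    # For every row, check whether its key group has at least one complete row.
    has_complete_copy = (
        df.assign(_complete=complete)
          .groupby(keys)["_complete"]
          .transform("any")
    )
    # Keep complete rows, plus incomplete rows that have no complete sibling.
    return df[complete | ~has_complete_copy]

# Toy data, invented for illustration (not real Patterns rows).
toy = pd.DataFrame({
    "placekey": ["aaa@111", "aaa@111", "bbb@222"],
    "date_range_start": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "date_range_end": ["2020-02-01", "2020-02-01", "2020-02-01"],
    "latitude": [29.76, None, None],
})

cleaned = drop_incomplete_copies(toy)
print(cleaned)
```

Rows that are incomplete but have no complete copy are kept on purpose, so nothing is lost if the Places data is simply unavailable for a location.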

Okay @Ryan_Kruse_MN_State, thank you for your response. I suppose we have a bug somewhere.