Hi, our team has detected duplicated rows in the Patterns data, specifically for all Subway locations in Houston from 2020 until now. We find the duplicates by subsetting on “placekey”, “date_range_start”, and “date_range_end”.
Hi @Angel_Langdon thank you for posting this.
- Are you using Weekly or Monthly Patterns?
- Are the entire rows the same or just the placekey/dates?
- How many times is a given row duplicated? (Are there just two copies or are there more?)
- Can you provide a screenshot of some of the duplicated rows?
Thanks
- We are using Monthly Patterns
- No, the entire rows are not the same, only placekey/dates are duplicated
- Each duplicated row appears exactly twice (two copies)
- Yes, here it is:
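Here is the Python code to get the duplicated rows:
# df is the Monthly Patterns DataFrame loaded earlier
# Keep every row whose key (placekey + date range) appears more than once
df = df[df.duplicated(subset=["placekey",
                              "date_range_start",
                              "date_range_end"],
                      keep=False)]
# Sort so the copies of each key sit next to each other
df = df.sort_values(by=["placekey",
                        "date_range_start",
                        "latitude"])
df.to_excel("duplicated_rows.xlsx", index=False)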
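To double-check the "two copies only" answer, a rough sketch like the following could count how many rows share each key. The helper name `copies_per_key` and the toy values are made up for illustration; only the three key columns and `latitude` come from the code above.

```python
import pandas as pd

# Key columns used above to detect duplicates
KEY = ["placekey", "date_range_start", "date_range_end"]


def copies_per_key(df: pd.DataFrame) -> pd.Series:
    """Return the row count for every key that occurs more than once."""
    counts = df.groupby(KEY).size()
    return counts[counts > 1]


# Toy data (invented values, not real Patterns rows)
toy = pd.DataFrame({
    "placekey": ["aaa", "aaa", "bbb"],
    "date_range_start": ["2020-01-01"] * 3,
    "date_range_end": ["2020-02-01"] * 3,
    "latitude": [29.76, None, 29.80],
})

dup_counts = copies_per_key(toy)
print(dup_counts)
```

If every value in the result is 2, each duplicated key has exactly two copies.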
Hi @Angel_Langdon, I was unable to recreate this issue, but I think I know what’s going on. The rows with missing information seem to just have the columns from the Patterns files, but not the Places files. I believe at some point in the processing, the Patterns files are getting duplicated. When I download using the tool we talked about in the other thread, I am not seeing any duplicated Subway rows in Houston.
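If the duplicates do turn out to be one complete row plus one row missing the Places columns, something like the sketch below might work as a stopgap while the source of the duplication is tracked down. It assumes the less complete copy is the one to discard, which has not been confirmed here. The helper name `keep_most_complete` and the toy values are hypothetical, and `raw_visit_counts` is only a stand-in for a Patterns column.

```python
import pandas as pd

# Key columns that should identify a single Patterns row
KEY = ["placekey", "date_range_start", "date_range_end"]


def keep_most_complete(df: pd.DataFrame) -> pd.DataFrame:
    """For each key, keep the one row with the most non-null values."""
    # Number of non-null cells in each row
    filled = df.notna().sum(axis=1)
    # Put the most complete rows first, then keep the first row per key
    ordered = df.assign(_filled=filled).sort_values(
        "_filled", ascending=False, kind="stable"
    )
    deduped = ordered.drop_duplicates(subset=KEY, keep="first")
    # Restore the original row order and drop the helper column
    return deduped.sort_index().drop(columns="_filled").reset_index(drop=True)


# Toy data (invented values): the first "aaa" row lacks the Places column
toy = pd.DataFrame({
    "placekey": ["aaa", "aaa", "bbb"],
    "date_range_start": ["2020-01-01"] * 3,
    "date_range_end": ["2020-02-01"] * 3,
    "raw_visit_counts": [120, 120, 80],
    "latitude": [None, 29.76, 29.80],
})

clean = keep_most_complete(toy)
print(clean)
```

This only hides the symptom. If the two copies ever differ in their Patterns columns as well, they would need to be compared by hand rather than dropped automatically.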