Hi! I’m trying to analyze the Monthly Places Patterns (aka “Patterns”) Jan 2018 - Apr 2020 data. When I try to download it shows up as 28 separate files. Am I doing something wrong? How can I access the data as just one file?
Hi Erin. Each file contains one month of Patterns data, so there are 28 files (12 + 12 + 4).
Do you have any recommendations on how to combine the 28 files into one?
I think it is hard to combine the 28 datasets into one, since each dataset is very big. For my research, I extracted my targeted POIs first, then combined the 28 filtered datasets (containing only my targeted POIs) into one.
Hi @Erin_Brown_Purdue_University, I would agree with @Yun_Liang_Penn_State_University here: filtering before merging is typically the best approach. There are some functions in SafeGraph_py that will merge all of those files together for you, but you will likely hit a memory error after 5-10 files if they aren’t filtered first.
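For what it’s worth, the filter-then-merge approach can be sketched in plain pandas. Everything below is illustrative, not the actual SafeGraph layout: the directory name, file names, chunk size, and the tiny stand-in data are all made up so the sketch runs end to end.

```python
import os
import pandas as pd

PATTERN_DIR = "patterns"  # hypothetical folder of monthly files
TARGET = ["Restaurants and Other Eating Places"]

os.makedirs(PATTERN_DIR, exist_ok=True)
# Tiny stand-in for two monthly files, just so the sketch is runnable.
pd.DataFrame({"top_category": ["Restaurants and Other Eating Places", "Gasoline Stations"],
              "raw_visit_counts": [10, 5]}).to_csv(f"{PATTERN_DIR}/2018-01.csv", index=False)
pd.DataFrame({"top_category": ["Restaurants and Other Eating Places"],
              "raw_visit_counts": [7]}).to_csv(f"{PATTERN_DIR}/2018-02.csv", index=False)

frames = []
for name in sorted(os.listdir(PATTERN_DIR)):
    # Reading in chunks keeps memory flat even on multi-GB monthly files;
    # only the filtered rows from each chunk are kept in memory.
    for chunk in pd.read_csv(os.path.join(PATTERN_DIR, name), chunksize=100_000):
        frames.append(chunk[chunk["top_category"].isin(TARGET)])

combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # only the filtered rows from every month survive
```

The point of the chunked read is that the full monthly file never sits in memory at once, which is what makes the filter-before-merge order workable.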
Alternatively, @Ryan_Kruse_MN_State has come up with a nifty notebook that will walk you through an all-in-one process to pull, filter, and save just the data you want! You can check it out HERE
So I tried following that notebook, but in Step 4 I got the error `TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]`.
I am trying to filter the data to get just restaurants in the US, but I think I set the filters wrong. This is what I have: `df = chunk[(chunk.top_category.isin('Restaurants and Other Eating Place')) & (chunk.country.isin('US'))]`, but I have a feeling I am doing the filter wrong. How can I filter this correctly? @Ryan_Kruse_MN_State @Jack_Lindsay_Kraken1
Hi @Erin_Brown_Purdue_University, I believe you can switch it to `df = chunk[(chunk.top_category.isin(['Restaurants and Other Eating Place'])) & (chunk.country.isin(['US']))]`
and it should work for you. When you use `.isin()`, the input has to be a list, so I just made the following changes:
- `'Restaurants and Other Eating Place'` to `['Restaurants and Other Eating Place']`
- `'US'` to `['US']`
Let me know if that makes sense/works!
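For context, the error comes from pandas itself: `Series.isin()` requires a list-like argument, and a bare string raises exactly the `TypeError` you saw. A minimal illustration (the sample DataFrame is made up):

```python
import pandas as pd

chunk = pd.DataFrame({"top_category": ["Restaurants and Other Eating Places",
                                       "Gasoline Stations"]})

# Passing a bare string raises the TypeError from Step 4:
try:
    chunk["top_category"].isin("Restaurants and Other Eating Places")
except TypeError as exc:
    print(exc)  # only list-like objects are allowed to be passed to isin() ...

# Wrapping the value in a list works and returns a boolean mask:
mask = chunk["top_category"].isin(["Restaurants and Other Eating Places"])
print(mask.tolist())  # [True, False]
```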
That fixed my initial problem, and I was able to complete the entire process. But when I opened the folder in my Google Drive there was no data, only the column names in `combined_core_poi-patterns.csv`. Do you know why this happened and how to fix it? @Ryan_Kruse_MN_State
Hi @Erin_Brown_Purdue_University, I think I’ve identified the problem. Looking at the column names in one of the `combined_core_poi-patterns.csv` files, I don’t think you will find a `country` column, which is why the data is filtering down to nothing. There may be an `iso_country_code` column or something similar, but for Monthly Patterns all the POIs are in the US anyway, so you really don’t need to filter by country at all.
I believe if you just filter by `top_category`, you will get data back. Something like
> df = chunk[chunk.top_category.isin(['Restaurants and Other Eating Place'])]
I would suggest trying with one month to make sure the filter is working as expected. Please let me know how it goes and if any issues arise!
Note 1: You may have to delete all the folders created in Drive from when you ran it last time. The tool is somewhat barebones, so it’s not super robust.
Note 2: There is a Canada Weekly Patterns product. However, the Canada and US Patterns data are in separate datasets, so there’s no need to filter by country.
I made the change to the filter and deleted the folders created in Drive. Now the brand data downloads fully, but it is not filtered, and there is a core_poi.csv file that only contains headers. I am also getting this error:
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-c62671c712f8> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '\nprint("getCoreFile")\ngetCoreFile()\nprint("getPatternsFiles")\ngetPatternsFiles()\nprint("mergeCorePOIandPatterns")\nmergeCorePOIandPatterns()\nprint("disseminateBrandInfo")\ndisseminateBrandInfo()')

3 frames
<decorator-gen-53> in time(self, line, cell, local_ns)
<timed exec> in <module>()
<ipython-input-13-4ab9656eb13a> in getPatternsFiles(destination, months)
     32     if x_dir not in os.listdir(destination):
     33         os.mkdir(destination + "/" + x_dir)
---> 34     for f in date_dict[x]:
     35         if f.split('/')[-1] not in os.listdir(destination + '/' + x_dir): #do not download the file if it is already there
     36             print(bucket, f, '/'.join([destination, x_dir, f.split('/')[-1]]))

KeyError: '2'
```
@Ryan_Kruse_MN_State Do you have any theories to why this might be happening?
@Erin_Brown_Purdue_University Sorry for the delay getting back to you. I have some answers for you!
- You’ll need to change `df = chunk[chunk.top_category.isin(['Restaurants and Other Eating Place'])]` to `df = chunk[chunk.top_category.isin(['Restaurants and Other Eating Places'])]`. Previously, the string did not match any of the categories, which is why nothing was returned.
- The `months` variable has to be a list. So change `months = sorted(date_dict.keys())[0]` to `months = sorted(date_dict.keys())[0:1]`. This will make `months = ['2018-01']` instead of `months = '2018-01'`.
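A quick illustration of why that slice matters (the `date_dict` below is just a stand-in for the notebook’s dictionary): indexing with `[0]` returns a single string, and iterating a string yields its individual characters, starting with `'2'`, which is presumably where the earlier `KeyError: '2'` came from when the loop tried `date_dict['2']`. Slicing with `[0:1]` returns a one-element list instead.

```python
# Stand-in for the notebook's month-keyed dictionary.
date_dict = {"2018-01": [], "2018-02": [], "2018-03": []}

first = sorted(date_dict.keys())[0]     # '2018-01' -- a str
months = sorted(date_dict.keys())[0:1]  # ['2018-01'] -- a one-element list

# Iterating the string walks character by character: '2', '0', '1', ...
chars = [x for x in first]
print(chars[0])    # '2' -- looking this up in date_dict raises KeyError: '2'
print(months)      # ['2018-01'] -- iterating this yields whole month keys
```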
I ran this, and the code ended with this error:
```
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-12-c62671c712f8> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '\nprint("getCoreFile")\ngetCoreFile()\nprint("getPatternsFiles")\ngetPatternsFiles()\nprint("mergeCorePOIandPatterns")\nmergeCorePOIandPatterns()\nprint("disseminateBrandInfo")\ndisseminateBrandInfo()')

8 frames
<decorator-gen-53> in time(self, line, cell, local_ns)
<timed exec> in <module>()

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   2008         kwds["usecols"] = self.usecols
   2009
-> 2010         self._reader = parsers.TextReader(src, **kwds)
   2011         self.unnamed_cols = self._reader.unnamed_cols
   2012

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] No such file or directory: '/content/gdrive/My Drive/SafeGraph/Monthly Patterns Test/2-patterns/patterns.csv'
```
Files were saved to my Google Drive, but one is a 1.48 GB CSV file for the patterns data. I’m trying to open it to see if the data is correct and is what I’m looking for, but given the size, and the fact that it is taking over 20 minutes to open a file that should only contain one month of data, I think there must be something else I am doing wrong. I’m not sure if this has anything to do with it, since the rest of the notebook runs fine, but I noticed this error after running the first `!pip install boto3` command:

```
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.21.0,>=1.20.19->boto3) (1.15.0)
ERROR: requests 2.23.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.3 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: urllib3, jmespath, botocore, s3transfer, boto3
  Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
Successfully installed boto3-1.17.19 botocore-1.20.19 jmespath-0.10.0 s3transfer-0.3.4 urllib3-1.26.3
```
@Erin_Brown_Purdue_University I think the file size is within expectations; the Patterns files are quite big. However, let me know if the data is not filtered properly for some reason. How are you opening the data? Some tools aren’t well-equipped for working with files that large, so they may take a long time.
I believe the FileNotFoundError may be a result of a previous errored run of the program. When you ran it with `months` as a string instead of a list, it created the folder “2-patterns”. Then when you ran the tool again, it saw the folder “2-patterns” and expected data to be in it. One of the shortcomings of this lightweight tool is that the destination folder needs to be empty; otherwise you’ll run into this type of error.
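If it helps, a small helper cell (not part of the notebook; the path is just an example) can wipe and recreate the destination before each run, so leftovers like the stray “2-patterns” folder can’t trip up the next run:

```python
import os
import shutil

# Example path only -- point this at your actual Drive destination.
destination = "Monthly Patterns Test"

def reset_destination(path):
    """Delete the destination folder if it exists, then recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)  # removes the folder and everything inside it
    os.makedirs(path)

reset_destination(destination)
print(os.listdir(destination))  # [] -- a clean, empty destination
```

Be careful with `shutil.rmtree`: it deletes everything under the path, so double-check the variable before running it against a Drive folder.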
Something you can do to decrease the file size is filter to just the columns you want to work with. This can be done by adjusting the code in the Colab notebook in the step where it saves the final dataframe to a CSV in Drive.
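As a sketch of that adjustment, you can select a subset of columns just before writing the CSV. The column names and sample row below are illustrative, not the notebook’s actual variables:

```python
import pandas as pd

# Hypothetical subset -- keep only the columns the analysis needs.
KEEP = ["safegraph_place_id", "top_category", "raw_visit_counts"]

df = pd.DataFrame({
    "safegraph_place_id": ["sg:1"],
    "top_category": ["Restaurants and Other Eating Places"],
    "raw_visit_counts": [12],
    "visitor_home_cbgs": ['{"123": 4}'],  # a wide column we drop to save space
})

# Writing df[KEEP] instead of df shrinks the output file considerably.
df[KEEP].to_csv("combined_core_poi-patterns.csv", index=False)
print(list(pd.read_csv("combined_core_poi-patterns.csv").columns))
```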
My Mac automatically tried to open the file with Numbers. I just wanted to preview the file, so I didn’t think what I opened it with would matter much, but considering it still isn’t open, I should probably use different software. Do you have any suggestions? Once I have all the data I want to use it in Tableau, but right now I’m just checking whether the filtering worked.
I opened the file with Excel and it worked! Thank you so much for all your help. I would have been so lost without it
Oh, great! You’re welcome, I’m glad you got it working. Please let me know if there’s anything else I can help with. And if you start a new thread, feel free to ping me!