Unable to read/download data using new system

Joel_Han_Loyola_University_Chicago · August 25, 2023, 5:59am

Hi,

I am trying to download the Advan neighborhood patterns data, following the instructions in the recent post: Dewey Data Bulk Download in R (new system)

The command get_file_list (defined in the linked post) appears to be working as intended, producing a table with download links. When I try to run read_sample_data or download_files, I get the following error message:

Error in make.names(col.names, unique = TRUE) : 
  invalid multibyte string 1
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls

Here is an example of code that returns the above error.

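library(httr)     # GET, add_headers, content
library(dplyr)    # filter
library(stringr)  # str_detect

# Get the list of files on the server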
get_file_list = function(apikey, product_path, print_info = T) {
  response = tryCatch(
    {
    GET(url=product_path,
                 add_headers(.headers = c("X-API-KEY" = apikey,
                                          "accept" = "application/json")))
    }, warning = function(cond) {
      message("Warning in GET.")
      message(cond)
      message("")
    }, error = function(cond) {
      message("Error in GET.")
      message(cond)
      message("")
    }
  )

  if(is.null(response)) {
    return(NULL)
  } else if(response$status_code == 401) {
    print(response);
    return(NULL);
  }

  res_json = content(response)
  num_files = res_json$metadata$num_files
  total_size_mb = res_json$metadata$total_size_mb
  avg_file_size_mb = res_json$metadata$avg_file_size_mb
  expires_at = res_json$metadata$expires_at

  if(print_info) {
    message("Files information---------------------------------------")
    message(paste0("Number of files: ", num_files))
    message(paste0("Total size (MB): ", total_size_mb))
    message(paste0("Average file size (MB): ", avg_file_size_mb))
    message(paste0("Link expires: ", expires_at))
    message("--------------------------------------------------------")
  }

  files_df = data.frame(download_link = unlist(res_json$download_links))
  split_links  = do.call(rbind.data.frame,
                         strsplit(files_df$download_link, "?", fixed = T))
  files_df$file_link = split_links[, 1];
  files_df = files_df[order(files_df$file_link), ];

  # Extract the file name (the last path component of each link)
  file_names = apply(data.frame(files_df$file_link), 1,
                     function(x) tail(unlist(strsplit(x, "/")), n = 1));
  files_df$file_name  = file_names;

  return(files_df);
}

# Read URL data into memory
read_sample_data = function(url, nrows = 100) {
  if(nrows > 1000) {
    message("Warning: set nrows no greater than 1000.");
    nrows = 1000;
  }

  df = read.csv(file = url, nrows = nrows);
  return(df);
}

api_key = (removed)
endpoint = (removed)
product_id = (removed)

files_df = get_file_list(api_key,endpoint, print_info = T)
jan2023 = filter(files_df, str_detect(file_link, '/2023-01-01/'))
sample_data = read_sample_data(jan2023$download_link[1], nrows = 100);

Appreciate any help you can provide. Thanks!

Joel

InvTech · August 25, 2023, 4:43pm

Could you let me know which dataset you are working with?

Joel_Han_Loyola_University_Chicago · August 25, 2023, 6:53pm

Hi Dongshin,

I am working with the Advan neighborhood patterns data. I mention this in the first sentence of my post. Is that the information you are looking for, or did I misunderstand?

Thanks!

Joel

InvTech · August 26, 2023, 12:56am

Oh, I missed that part.

I ran your code, and it works fine on my computer for all 44 files. Could you please try adding skipNul = TRUE and encoding = "UTF-8" to the read.csv call in the read_sample_data function, as follows?

  df = read.csv(url_con, nrows = nrows, skipNul = TRUE, encoding = "UTF-8");

Thanks,

Donn

Joel_Han_Loyola_University_Chicago · August 26, 2023, 3:32am

Donn,

That change removes the warning message about embedded nulls, but the error in make.names remains.

I tried using the download.file function: download.file(files_df$download_link[1], destfile). This worked, but I then encountered an error when unpacking the .gz file. Does that help to narrow down the problem?
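For reference, here is a minimal sketch of the download-then-read route described above. It rests on an assumption worth ruling out rather than a confirmed diagnosis: that the unpacking error comes from a transfer that was not done in binary mode. As far as I recall from the download.file documentation, binary mode is only chosen automatically when the URL ends in an extension such as .gz, and these download links carry a query string after the file name. The helper name download_and_read is hypothetical.

```r
# Hypothetical helper: download one file in binary mode, then read the
# first rows straight from the compressed file.
download_and_read = function(link, destfile, nrows = 100) {
  # mode = "wb" requests a byte-for-byte transfer, which a .gz file needs.
  status = download.file(link, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("Download failed with status ", status)
  }
  # gzfile() decompresses while reading, so no separate unpacking step.
  gz_con = gzfile(destfile, "rt")
  on.exit(close(gz_con))
  read.csv(gz_con, nrows = nrows)
}
```

With the files_df table built earlier, a call would look like download_and_read(files_df$download_link[1], "sample.csv.gz"), where the destination file name is only an example.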

Also, you said the code worked on all 44 files when you ran it. My table jan2023 contains 321 rows; is there an issue here?

Thanks!

Joel

InvTech · August 28, 2023, 10:57pm

Hi, Joel.

Oh, I found the issue!
Please replace read_sample_data with the version below, or get a copy of the entire new code from here.

# Read URL data into memory
read_sample_data = function(url, nrows = 100) {
  # if(nrows > 1000) {
  #   message("Warning: set nrows no greater than 1000.");
  #   nrows = 1000;
  # }
  
  # Files are now gzip-compressed (csv.gz), so decompress while reading
  url_con = gzcon(url(url), text = TRUE);
  df = read.csv(url_con, nrows = nrows);

  return(df);
}

There was another system update after I posted the tutorial (the files changed from csv to csv.gz), and I updated the R code accordingly. You downloaded the code before that update!
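To illustrate why the csv to csv.gz change produces the errors reported at the top of this thread, the sketch below writes a small gzip-compressed CSV to a temporary file, shows that it starts with the gzip magic bytes rather than text, and reads it back through gzcon. The local temporary file and the column names are illustrative assumptions standing in for a real download link; only base R is used.

```r
# Illustrative stand-in for a downloaded file: a tiny gzip-compressed CSV.
gz_path = tempfile(fileext = ".csv.gz")
out_con = gzfile(gz_path, "w")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")), out_con,
          row.names = FALSE)
close(out_con)

# A gzip stream starts with the bytes 1f 8b, not with a CSV header line.
# Parsing such bytes as plain text is what leads to complaints about
# embedded nulls and invalid multibyte strings.
magic = readBin(gz_path, what = "raw", n = 2)
print(magic)

# Reading through gzcon() decompresses on the fly. text = TRUE makes the
# connection usable by read.csv(), as in the updated read_sample_data.
in_con = gzcon(file(gz_path, "rb"), text = TRUE)
sample_df = read.csv(in_con, nrows = 2)
close(in_con)
print(sample_df)
```

The same idea applies to a remote file: wrapping url(...) instead of file(...) in gzcon gives the version of read_sample_data shown above.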

Also, I was looking at a different dataset earlier. Now I see 321 files, and all of them open with read_sample_data without problems.

Donn