In July 2023, the USDA released a public data set of Crop Sequence Boundaries in the United States. The data was distributed in a heavy file format (.gdb) that consumed over 25GB of memory when the entire file was read into either R or Python.
In order to work with this data in either of those two open source programming languages, it needed to be ported to a different file format. The py directory in this repository contains an example script for ingesting the .gdb file and writing it out to .parquet format every n rows (in our example, we used n = 1,000,000).
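The repository's ingestion script lives in the py directory and is written in Python, but the same chunked-export idea can be sketched in R. The sketch below is illustrative only: the geodatabase path, the layer name, and the decision to drop the geometry column are assumptions for the example, not the repository's actual approach. It reads n rows at a time via an OGR SQL query and writes each chunk to its own Parquet file, so the full table never has to fit in memory.

library(sf)
library(arrow)

gdb_path   <- "path/to/csb_2022.gdb"   # hypothetical path to the downloaded geodatabase
layer_name <- "CSB2022"                # hypothetical layer name; check sf::st_layers(gdb_path)
chunk_size <- 1000000                  # n = 1,000,000 rows per Parquet file

i <- 0
repeat {

  # Read only chunk_size rows at a time with an OGR SQL query
  chunk <- sf::st_read(
    dsn   = gdb_path,
    query = sprintf(
      "SELECT * FROM \"%s\" LIMIT %d OFFSET %d",
      layer_name, chunk_size, i * chunk_size
    ),
    quiet = TRUE
  )

  # Drop the geometry column (for this sketch) and write the chunk to Parquet
  chunk |>
    sf::st_drop_geometry() |>
    arrow::write_parquet(sprintf("csb_2022_part_%03d.parquet", i))

  # Stop once the final, partial chunk has been written
  if (nrow(chunk) < chunk_size) break
  i <- i + 1
}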
Next, the R directory in this repository contains an example script for re-partitioning the Parquet data by the unique values in a column in the data (in our example, we re-partitioned the data by STATEFIPS code).
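To illustrate what that re-partitioning step looks like, here is a minimal sketch using arrow. It assumes the chunked Parquet files from the previous step sit in a local csb_2022/ directory (a hypothetical path) and that the data contains a STATEFIPS column.

library(arrow)

# Open the chunked Parquet files as a single dataset (lazily, without reading them all)
arrow::open_dataset("csb_2022/") |>
  # Re-write the data, hive-partitioned by the unique values of STATEFIPS
  arrow::write_dataset(
    path         = "csb_2022_partitioned/",
    format       = "parquet",
    partitioning = "STATEFIPS"
  )

Each distinct STATEFIPS value ends up in its own STATEFIPS=XX/ sub-directory, which is what makes per-state filtering of the partitioned data fast.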
Lastly, the data directory in this repository contains .csv files that can be used to convert integer codes in the CSB data to their plain-English equivalents.
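Decoding a column is then a simple join against one of those lookup files once you have a collected data frame of CSB records. In the sketch below, the lookup file name (data/crop_codes.csv), its column names (code, crop_name), and the CSB column being decoded (CDL2022) are hypothetical placeholders; check the files in the data directory for the real names.

library(dplyr)
library(readr)

# Read one of the lookup tables shipped in the data directory (hypothetical file name)
crop_codes <- readr::read_csv("data/crop_codes.csv")

# csb_df is assumed to be a collected data frame of CSB records;
# the join adds the plain-English crop_name for each integer code
csb_decoded <- csb_df |>
  dplyr::left_join(crop_codes, by = c("CDL2022" = "code"))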
Ketchbrook Analytics hosts the 2022 CSB data in a public AWS S3 bucket (in .parquet format, partitioned by year and STATEFIPS code). This blog post walks you through how to connect to this data yourself using R. If you can't wait to read the blog post, the following R code should help you get started:
library(arrow)
library(dplyr)
# Confirm connecting to s3 buckets is enabled on your machine
arrow::arrow_with_s3()
# Specify the s3 bucket containing the CSB .parquet data
bucket <- arrow::s3_bucket("ketchbrook-public-usda-nass-csb")
# List the directories and files in the bucket
bucket$ls("year=2022", recursive = TRUE)
# Calculate the average number of crop sequence boundary acres in Fairfield County, Connecticut
arrow::open_dataset(bucket) |>
  dplyr::filter(
    STATEFIPS == 09,
    CNTY == "Fairfield"
  ) |>
  dplyr::summarise(
    mean_CSBACRES = mean(CSBACRES, na.rm = TRUE)
  ) |>
  dplyr::collect()