ketchbrookanalytics/usda-csb-data

USDA Crop Sequence Boundaries Data

In July 2023, the USDA released a public data set of Crop Sequence Boundaries (CSB) in the United States. The data was distributed in a heavy file format (.gdb) that consumed over 25 GB of memory when we tried to read the entire .gdb file in either R or Python.

To work with this data in either of those two open source programming languages, we needed to port it to a different file format. The py directory in this repository contains an example script that ingests the .gdb file and writes it out in .parquet format every n rows (in our example, we used n = 1,000,000).
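The chunked conversion described above can be sketched roughly as follows. This is a minimal illustration, not the script in the py directory: the function names, the `src_path`, `layer`, and `out_dir` parameters, and the `part-0000.parquet` naming scheme are all assumptions, and it presumes that geopandas, pyogrio, and pyarrow are installed.

```python
def chunk_ranges(total_rows, n):
    """Yield (start, stop) row offsets that cover total_rows in chunks of n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    for start in range(0, total_rows, n):
        yield start, min(start + n, total_rows)


def convert_in_chunks(src_path, layer, out_dir, n=1_000_000):
    """Read one layer n rows at a time and write each chunk to its own Parquet file.

    Hypothetical helper: the names and the output layout are illustrative only.
    """
    # Third-party imports stay local so chunk_ranges() works without them.
    import os
    import geopandas as gpd
    import pyogrio

    os.makedirs(out_dir, exist_ok=True)
    # read_info() reports the layer's feature (row) count without loading the data.
    total_rows = pyogrio.read_info(src_path, layer=layer)["features"]
    for i, (start, stop) in enumerate(chunk_ranges(total_rows, n)):
        # rows=slice(...) restricts the read to one chunk, keeping memory bounded.
        chunk = gpd.read_file(src_path, layer=layer, rows=slice(start, stop))
        chunk.to_parquet(os.path.join(out_dir, f"part-{i:04d}.parquet"))


# The offsets a 2.5-million-row layer would be split into with n = 1,000,000.
ranges = list(chunk_ranges(2_500_000, 1_000_000))
```

Only one chunk is held in memory at a time, so peak memory depends on n rather than on the size of the whole file.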

Next, the R directory in this repository contains an example script for re-partitioning the Parquet data by the unique values of one of its columns (in our example, we re-partitioned the data by STATEFIPS code).
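To show what re-partitioning by a column means, here is a small Python sketch (the repository's own script for this step is written in R, so this is an alternative illustration rather than the authors' method). The `partition_rows` and `repartition_parquet` names, the directory arguments, and the sample rows are made up for the example. The second function assumes pyarrow is installed and that Hive-style `column=value` directories are the desired layout.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows into Hive-style 'column=value' buckets, one per unique value."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{column}={row[column]}"].append(row)
    return dict(groups)


def repartition_parquet(in_dir, out_dir, column="STATEFIPS"):
    """Rewrite a directory of Parquet files as one sub-directory per column value.

    Hypothetical helper; assumes pyarrow is available.
    """
    import pyarrow.dataset as ds

    dataset = ds.dataset(in_dir, format="parquet")
    ds.write_dataset(
        dataset,
        out_dir,
        format="parquet",
        partitioning=[column],
        partitioning_flavor="hive",
    )


# Made-up rows, only to show the grouping.
sample_rows = [
    {"STATEFIPS": "09", "CSBACRES": 12.5},
    {"STATEFIPS": "25", "CSBACRES": 40.0},
    {"STATEFIPS": "09", "CSBACRES": 7.25},
]
buckets = partition_rows(sample_rows, "STATEFIPS")
```

With this layout, a query that filters on the partition column only has to read the files in the matching directory.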

Lastly, the data directory in this repository contains .csv files that can be used to convert integer codes in the CSB data to their plain-English equivalents.
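A lookup of this kind might be used as in the sketch below. The column names `code` and `description` and the inline sample rows are assumptions for illustration; check the actual .csv files in the data directory for their real headers and contents.

```python
import csv
import io

# Illustrative stand-in for one of the lookup files; the real headers may differ.
SAMPLE_LOOKUP_CSV = """code,description
1,Corn
5,Soybeans
24,Winter Wheat
"""


def load_lookup(csv_file, code_col="code", label_col="description"):
    """Build a {integer code: label} dictionary from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {int(row[code_col]): row[label_col] for row in reader}


lookup = load_lookup(io.StringIO(SAMPLE_LOOKUP_CSV))

# Translate a sequence of codes, falling back to "Unknown" for unmapped values.
labels = [lookup.get(code, "Unknown") for code in [1, 5, 999]]
```

To read a file from disk instead, pass `open(path, newline="")` in place of the `io.StringIO` object.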

How to Access this Data Yourself

Ketchbrook Analytics has hosted the 2022 CSB data in a public AWS S3 bucket (in .parquet format, partitioned by year and STATEFIPS code). This blog post can walk you through how to connect to this data yourself, using R. If you can't wait to read the blog post, this R code should help you get started:

library(arrow)
library(dplyr)

# Confirm connecting to s3 buckets is enabled on your machine
arrow::arrow_with_s3()

# Specify the s3 bucket containing the CSB .parquet data
bucket <- arrow::s3_bucket("ketchbrook-public-usda-nass-csb")

# List the directories and files in the bucket
bucket$ls("year=2022", recursive = TRUE)

# Calculate the average number of crop sequence boundary acres in Fairfield County, Connecticut
arrow::open_dataset(bucket) |>
  dplyr::filter(
    STATEFIPS == 09,
    CNTY == "Fairfield"
  ) |>
  dplyr::summarise(
    mean_CSBACRES = mean(CSBACRES, na.rm = TRUE)
  ) |>
  dplyr::collect()

Useful Links
