USDA Crop Sequence Boundaries Data

In July 2023, the USDA released a public data set of Crop Sequence Boundaries (CSB) in the United States. The data was distributed in a heavy file format (.gdb), and reading the entire .gdb file in either R or Python consumed over 25GB of memory.

To work with this data in either of those two open source programming languages, it needed to be ported to a different file format. The py directory in this repository contains an example script that ingests the .gdb file and writes it out in .parquet format in chunks of n rows (in our example, n = 1,000,000).
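
The chunked export described above can be sketched in Python roughly as follows. This is a minimal illustration and not the script in the py directory: the function names `iter_chunks`, `write_parquet_chunk`, and `export_in_chunks`, and the `part-00000.parquet` naming scheme, are assumptions made for the example. `rows` can be any iterable of dict records (for instance, features read lazily from the source file), and writing relies on pyarrow, which must be installed separately.

```python
from itertools import islice
from pathlib import Path


def iter_chunks(rows, n):
    """Yield successive lists of at most n items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def write_parquet_chunk(chunk, path):
    """Write one chunk (a list of dicts) to a Parquet file using pyarrow."""
    # Imported lazily so the chunking logic can be used without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(chunk), str(path))


def export_in_chunks(rows, out_dir, n=1_000_000, write_chunk=write_parquet_chunk):
    """Write rows to out_dir as part-00000.parquet, part-00001.parquet, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, chunk in enumerate(iter_chunks(rows, n)):
        path = out_dir / f"part-{i:05d}.parquet"
        write_chunk(chunk, path)
        written.append(path)
    return written
```

Because only one chunk is held in memory at a time, peak memory use is bounded by n rather than by the size of the full data set.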

Next, the R directory in this repository contains an example script for re-partitioning the Parquet data by the unique values of a column (in our example, we re-partitioned the data by STATEFIPS code).
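
Re-partitioning by a column amounts to grouping rows on that column's unique values and writing each group under its own `key=value` directory, the Hive-style layout that Arrow can read back as a partitioned dataset. The sketch below shows only that grouping step in plain Python. It is not the R script from this repository, and `partition_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, keys):
    """Group dict rows by the given key columns.

    Returns a mapping from a Hive-style relative directory
    (e.g. "year=2022/STATEFIPS=09") to the rows that belong in it.
    The partition columns are dropped from each row, since in this
    layout their values are encoded in the directory name.
    """
    groups = defaultdict(list)
    for row in rows:
        directory = "/".join(f"{k}={row[k]}" for k in keys)
        groups[directory].append({c: v for c, v in row.items() if c not in keys})
    return dict(groups)


# Made-up sample rows, for illustration only.
rows = [
    {"year": 2022, "STATEFIPS": "09", "CSBACRES": 12.5},
    {"year": 2022, "STATEFIPS": "09", "CSBACRES": 3.0},
    {"year": 2022, "STATEFIPS": "25", "CSBACRES": 40.0},
]

for directory, group in partition_rows(rows, ["year", "STATEFIPS"]).items():
    print(directory, len(group))
# year=2022/STATEFIPS=09 2
# year=2022/STATEFIPS=25 1
```

With pyarrow installed, `pyarrow.dataset.write_dataset` with `partitioning=[...]` and `partitioning_flavor="hive"` should produce this kind of directory layout directly.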

Lastly, the data directory in this repository contains .csv files that can be used to convert integer codes in the CSB data to their plain-English equivalents.
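
As a rough illustration of how such a lookup file might be used, the Python sketch below loads a code-to-label CSV into a dictionary and applies it to a list of integer codes. The column names `code` and `label`, the helper names, and the inline sample CSV are assumptions for the example; check the actual headers of the files in the data directory before adapting it.

```python
import csv
import io


def load_lookup(csv_file, code_col="code", label_col="label"):
    """Read a CSV with one row per code and return {int code: label}."""
    reader = csv.DictReader(csv_file)
    return {int(row[code_col]): row[label_col] for row in reader}


def decode(codes, lookup, missing="Unknown"):
    """Translate integer codes to labels, falling back to `missing`."""
    return [lookup.get(code, missing) for code in codes]


# Stand-in for an opened lookup file; the contents are illustrative only.
sample_csv = io.StringIO("code,label\n1,Corn\n5,Soybeans\n")

lookup = load_lookup(sample_csv)
print(decode([1, 5, 999], lookup))  # ['Corn', 'Soybeans', 'Unknown']
```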

How to Access this Data Yourself

Ketchbrook Analytics has hosted the 2022 CSB data in a public AWS S3 bucket (in .parquet format, partitioned by year and STATEFIPS code). This blog post can walk you through how to connect to this data yourself, using R. If you can't wait to read the blog post, this R code should help you get started:

library(arrow)
library(dplyr)

# Confirm connecting to s3 buckets is enabled on your machine
arrow::arrow_with_s3()

# Specify the s3 bucket containing the CSB .parquet data
bucket <- arrow::s3_bucket("ketchbrook-public-usda-nass-csb")

# List the directories and files in the bucket
bucket$ls("year=2022", recursive = TRUE)

# Calculate the average number of crop sequence boundary acres in Fairfield County, Connecticut
arrow::open_dataset(bucket) |>
  dplyr::filter(
    STATEFIPS == 09,
    CNTY == "Fairfield"
  ) |>
  dplyr::summarise(
    mean_CSBACRES = mean(CSBACRES, na.rm = TRUE)
  ) |>
  dplyr::collect()
