Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Athena tables should be automatically generated from a list of capabilities #243

Open
travis opened this issue Oct 18, 2023 · 0 comments
Open

Comments

@travis
Copy link
Member

travis commented Oct 18, 2023

@web3-storage/capabilities exposes Ucanto Capability objects that capture schema information for nb fields, so we should have everything we need to generate Glue table schemas automatically, which would make it easier to make data available to Athena for querying.

The idea is that something like:

generateTable(Store.add)

would result in something like the following SST code:

  const storeAddTableName = getCdkNames('store-add-table', app.stage)
  const storeAddTable = new glue.CfnTable(stack, storeAddTableName, {
    catalogId: Aws.ACCOUNT_ID,
    databaseName,
    tableInput: {
      name: storeAddTableName,
      partitionKeys: [
        { name: 'day', type: 'date' }
      ],
      parameters: {
        classification: "json",
        typeOfData: "file",
        // @see https://docs.aws.amazon.com/athena/latest/ug/partition-projection-kinesis-firehose-example.html for more information on projection
        // configuration - this should match the "day" parameter and S3 prefix configured in the delivery stream
        "projection.enabled": "true",
        "projection.day.type": "date",
        "projection.day.format": "yyyy-MM-dd",
        "projection.day.range": "2023-01-01,NOW",
        "projection.day.interval": "1",
        "projection.day.interval.unit": "DAYS",
        "storage.location.template": `s3://${streamLogBucket.bucketName}/logs/receipt/store_add/\${day}/`
      },
      storageDescriptor: {
        location: `s3://${streamLogBucket.bucketName}/logs/receipt/store_add/`,
        columns: [
          { name: 'carcid', type: 'string' },
          // STRUCT here refers to the Apache Hive STRUCT datatype - see https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/
          { name: 'value', type: 'STRUCT<att:ARRAY<struct<can:STRING,with:STRING,nb:STRUCT<size:BIGINT,link:STRUCT<_cid_slash:STRING>>>>,iss:STRING,aud:STRING>' },
          { name: "out", type: "STRUCT<error:STRUCT<name:STRING>,ok:STRUCT<status:STRING,link:STRUCT<_cid_slash:STRING>>>" },
          { name: "ts", type: "timestamp" }
        ],
        inputFormat: 'org.apache.hadoop.mapred.TextInputFormat',
        outputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
        serdeInfo: {
          serializationLibrary: 'org.openx.data.jsonserde.JsonSerDe',
          parameters: {
            // see https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/
            'mapping._cid_slash': '/'
          }
        }
      }
    }
  })
  storeAddTable.addDependsOn(glueDatabase)

where the Ucanto Schema of Store.add is used to generate the type field of the out column of the storageDescriptor.

@web3-storage/capabilities does also expose a comprehensive list of the names of capabilities it defines (as abilitiesAsStrings) so we could even automatically generate tables for all known capabilities. I don't think this would be too costly on the AWS side but it would be a good idea to verify that.

travis added a commit that referenced this issue Oct 18, 2023
Add two more capability tables to Glue so Athena can query them.

I'd like to start doing this in an automated way at some point, but that doesn't feel high priority enough for the moment so I've filed an issue here:

#243
travis added a commit that referenced this issue Oct 19, 2023
Add two more capability tables to Glue so Athena can query them.

I'd like to start doing this in an automated way at some point, but that
doesn't feel high priority enough for the moment so I've filed an issue
here:

#243

---------

Co-authored-by: Vasco Santos <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant