Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write to TimescaleDB #9

Open
3 of 5 tasks
pablojimpas opened this issue May 3, 2022 · 5 comments
Open
3 of 5 tasks

Write to TimescaleDB #9

pablojimpas opened this issue May 3, 2022 · 5 comments
Labels
area:processor enhancement New feature or request help wanted Asking the community to help with this

Comments

@pablojimpas
Copy link
Contributor

Summary

It will be great to support an option to write uplinks to a time-series database such as TimescaleDB.

TimescaleDB it's just an extension of PostgreSQL so all the pieces needed are in place since Benthos' sql_insert already supports a postgres driver option.

Use Case

This will be useful for almost any IoT use case concerned with monitoring since a time-series database some nice properties for such scenarios.

Implementation

No response

Testing

No response

Contributing

  • I can do more research
  • I can implement the feature if I get some help
  • I can implement the feature without help (but I'll wait until the feature request is accepted)
  • I can test the feature before it's released

Code of Conduct

  • I agree to follow TTN's Community Code of Conduct
@pablojimpas pablojimpas added enhancement New feature or request triage labels May 3, 2022
@htdvisser
Copy link
Contributor

Thanks for submitting this issue, @pablojimpas!

Maybe you've already seen the ClickHouse processor, which also inserts traffic in a SQL database. A Timescale processor wouldn't look much different.

I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?

We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:

  • Storing RX metadata directly in the tts_uplink_messages. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
  • Storing decoded payloads directly in the tts_uplink_messages. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
  • Inserts into the "main" table automatically getting transformed and inserted into the tts_uplink_message_decoded_payloads and tts_uplink_message_rx_metadata tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.

Let me know if you can pick this up.

@htdvisser htdvisser added help wanted Asking the community to help with this area:processor and removed triage labels May 6, 2022
@pablojimpas
Copy link
Contributor Author

I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?

Me neither, but I'm currently evaluating the use of Timescale for an integration and it's quite interesting. In summary, as far as I know now, all the standard SQL features should work since in the end it's just PostgreSQL, the interesting specific feature is that Timescale introduces the concept of "hypertables", optimized for time-series data. I'll continue learning about it this weekend, they have really good docs so it shouldn't be too hard.

We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:

* Storing RX metadata directly in the `tts_uplink_messages`. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Storing decoded payloads directly in the `tts_uplink_messages`. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Inserts into the "main" table automatically getting transformed and inserted into the `tts_uplink_message_decoded_payloads` and `tts_uplink_message_rx_metadata` tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.

Let me know if you can pick this up.

I don't know the specifics of ClickHouse either, it will be easier for me to port this if it was plain SQL. I will investigate further and see if I can make a solution work, I'll let you know what my findings are and contribute the implementation if I'm successful...I can definitely use some help though :)

@pablojimpas
Copy link
Contributor Author

pablojimpas commented May 9, 2022

FYI: So far I've tried using the data type JSONB for the columns correlation_ids, rx_metadata and decoded_payload and it seems to do the trick. I've a pretty very basic example working (not comparable to the ClickHouse one) but if you wan't I can make a PR to add it.

@htdvisser
Copy link
Contributor

Thanks for looking into this. I have a couple of thoughts on it:

  1. What do you think about using character varying[] type for the correlation IDs instead of JSONB?

  2. How efficient is it to run queries over JSONB rx_metadata compared to having the processor insert the RX metadata to a separate table and querying from there? Or do you plan to do use a continuous query to extract the metadata from JSON into a separate table?

  3. Same question for decoded_payload.

  4. One concern I'd have with using JSONB for decoded_payload is related to encoding:

    RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by \uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow \u). However, the input function for jsonb is stricter: it disallows Unicode escapes for characters that cannot be represented in the database encoding. The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes are converted to the equivalent single character for storage; this includes folding surrogate pairs into a single character.

    (source: https://www.postgresql.org/docs/current/datatype-json.html)

    It's not uncommon for (decoded) payloads to contain \u0000, so if we'd be using JSONB we wouldn't be able to store those payloads. If TimescaleDB would be just as happy ingesting JSON I think it's better to use that (at least for decoded_payload).

    Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values.

@pablojimpas
Copy link
Contributor Author

  1. What do you think about using character varying[] type for the correlation IDs instead of JSONB?

You're right, using an array would be a better fit for that field.

  • How efficient is it to run queries over JSONB rx_metadata compared to having the processor insert the RX metadata to a separate table and querying from there? Or do you plan to do use a continuous query to extract the metadata from JSON into a separate table?

  • Same question for decoded_payload.

I haven't run any benchmarks, I know the JSONB type adds some overhead on inserts but then it's supposed to be quite efficient, you can even create indexes with them. However, I think this wouldn't be an issue for my use case at least, I plan to store the decoded payload as is, then I'll query the table to get hole JSONB field and the business logic will figure out which json fields to look for and how to interpret them (I don't want to put that logic inside a benthos processor).

4. It's not uncommon for (decoded) payloads to contain \u0000, so if we'd be using JSONB we wouldn't be able to store those payloads. If TimescaleDB would be just as happy ingesting JSON I think it's better to use that (at least for decoded_payload).
Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values.

I'm not sure I understand this issue, so far I haven't run into problems with encoding, the decoded payloads that I've tested so far look like this:

{
  "soilK": 74,
  "soilN": 22,
  "soilP": 29,
  "soilEC": 306,
  "soilPH": 5.83,
  "soilMoisture": 16.55,
  "soilTemperature": 23.19
}

I'm not doing any fancy processing in the benthos processor like in the case of ClickHouse, I'm just storing the raw decoded payload doing:
uplink_message.decoded_payload.string()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:processor enhancement New feature or request help wanted Asking the community to help with this
Projects
None yet
Development

No branches or pull requests

2 participants