Thoughts about inserting #13

Open
waralex opened this issue Aug 30, 2020 · 2 comments

Comments

@waralex
Collaborator

waralex commented Aug 30, 2020

So far just thoughts for discussion

  • insert requires that all of the table's columns are passed. This requirement is clearly redundant: ClickHouse handles inserts with missing columns without any problems

  • This needs to be checked separately, but it seems to me that ClickHouse converts data on insert. If that is the case, then perhaps we should derive the type in the native protocol from the Julia type rather than from the type in the table schema. This would save us a lot of work on converting types

  • We clearly need a way to insert from a row format
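The second point could be sketched roughly as follows. This is a hypothetical illustration in Python rather than Julia, and the type names and mapping are assumptions for the sketch, not the client's actual wire format:

```python
# Hypothetical sketch: derive the native-protocol column type from the
# values' language-level type instead of looking it up in the table schema.
# The mapping below is illustrative only, not the client's real type table.
TYPE_MAP = {bool: "UInt8", int: "Int64", float: "Float64", str: "String"}

def wire_type(column_values):
    """Pick a wire type for a homogeneous column of values."""
    kinds = {type(v) for v in column_values}
    if len(kinds) != 1:
        raise ValueError("column must be homogeneous to infer a wire type")
    return TYPE_MAP[kinds.pop()]

print(wire_type([1, 2, 3]))   # Int64
print(wire_type(["a", "b"]))  # String
```

Whether this is safe hinges on the open question above: the server would have to accept a block whose declared column types differ from the schema and convert them itself.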

@athre0z
Member

athre0z commented Aug 30, 2020

Some thoughts of mine in response to your points:

insert requires that all of the table's columns are passed. This requirement is clearly redundant: ClickHouse handles inserts with missing columns without any problems

I implemented this because the ClickHouse server version that I developed this client against had a tendency to SIGSEGV on both missing and duplicated columns, which was hard to troubleshoot. I remember that this has since been fixed, so we could indeed remove the check now.

It's worth noting that the current check will probably prevent users from making use of a DEFAULT x clause they may have specified when creating the table, so we should definitely change this in some way.

I'm not that much of a fan of implicit defaulting for columns that don't actually have a default or aren't nullable. ClickHouse essentially initializes such columns with values similar to C++'s default-initialization scheme for heap-allocated objects (0, empty string, ...), which hardly ever does what I want. I personally find that design pretty error-prone and would prefer the server to reject such inserts. On the other hand, this is how ClickHouse designed it, and a client for the database should probably mirror what the database and the official client do.

It's also worth pointing out that, last time I checked, the server will silently ignore columns in a block that don't correspond to any database columns, which might be pretty confusing for users.

Summarizing, I guess I'd be okay with removing the check if you still think it's the best thing to do after considering my arguments. The check is pretty much free from a performance perspective, but if we keep it, we'd have to solve the problem of allowing users to explicitly request defaults, perhaps by having them set the column to nothing explicitly. Let me know what you think!
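The "explicitly default" idea from the last paragraph could look something like this. This is a minimal Python sketch of the design, not the client's API; the sentinel stands in for Julia's `nothing`, and the dict-based row is an assumption for illustration:

```python
# Hypothetical sentinel meaning "omit this column so the server applies
# its DEFAULT expression", while the user still lists every column
# explicitly and the all-columns check stays in place.
SERVER_DEFAULT = object()

def effective_columns(row):
    """Drop sentinel-valued entries; only the remaining columns are sent,
    and the server fills in defaults for the omitted ones."""
    return {name: value for name, value in row.items()
            if value is not SERVER_DEFAULT}

row = {"id": 1, "created_at": SERVER_DEFAULT}
print(effective_columns(row))  # {'id': 1}
```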

This needs to be checked separately, but it seems to me that ClickHouse converts data on insert. If that is the case, then perhaps we should derive the type in the native protocol from the Julia type rather than from the type in the table schema. This would save us a lot of work on converting types

Yes, this is an interesting question. A somewhat related observation is that the command-line client also allows insert queries with pretty much arbitrary expressions, like:

CREATE TABLE blah
(
    `aaa` UInt32
)
ENGINE = Memory;

INSERT INTO blah (aaa) VALUES 
    (toUInt32(toDecimal64(234, 0))), 
    (1234), 
    (234 + 2934728394);

and I wonder how this actually works. It might evaluate these expressions in the client, which wouldn't be very helpful for us, or it might send such inserts as a string. We could just use tcpdump to spy on the C++ command-line client and check what it sends.

We clearly need a way to insert from a row format

Yes. However, I think we should design it in a way that doesn't tempt the user to hit the database once per inserted row, but instead offers a more convenient way for users to build and insert blocks, or to automatically construct a stream of appropriately sized blocks from a sequence / iterator / channel of rows. Something in this direction.
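The "stream of blocks from a sequence of rows" idea could be sketched as follows. Python is used here for brevity (the client is Julia), and the function name and block size are made up for the sketch:

```python
from itertools import islice

def iter_blocks(rows, block_size=65536):
    """Group an arbitrary iterator of rows into lists of at most
    block_size rows, each of which would be sent to the server as one
    insert block instead of hitting the database once per row."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

print(list(iter_blocks(range(10), block_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```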

It would probably also be good to design it so that the user defines a struct / named tuple type for their row format instead of using Dicts, because that would reduce runtime lookups significantly. We might be able to make the whole row struct -> block conversion a super-efficient @generated function, which would make row-based inserts pretty much as efficient as block-based ones.
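At its core, that conversion is a row-major to column-major transpose. A plain Python equivalent of the idea (the row type here is hypothetical; in Julia it would be a struct or NamedTuple, and the per-field loop is what an @generated function could unroll at compile time):

```python
from collections import namedtuple

# Hypothetical row type; in Julia this would be a struct or NamedTuple.
Row = namedtuple("Row", ["id", "name"])

def rows_to_block(rows, row_type):
    """Transpose row-major data into the column-major block layout that
    the native protocol works with. The field names come from the row
    type itself, so no per-row dict lookups are needed."""
    return {field: [getattr(row, field) for row in rows]
            for field in row_type._fields}

block = rows_to_block([Row(1, "a"), Row(2, "b")], Row)
print(block)  # {'id': [1, 2], 'name': ['a', 'b']}
```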

Just some thoughts, nothing really fleshed out yet.

@waralex
Collaborator Author

waralex commented Aug 30, 2020

I personally find that design pretty error-prone and would prefer the server to reject such inserts.

From my experience, we use this scheme when adding columns to analytics. First the table is altered, while the analytics code continues to send events without the new column. Some time after the ALTER has gone through, new code that collects analytics with the new column is deployed to production. If the server rejected requests that don't include all columns, we would have a huge problem keeping the ALTER and the production code in sync.
