Thoughts about inserting #13

Open
waralex opened this issue Aug 30, 2020 · 2 comments

Comments

@waralex
Collaborator

waralex commented Aug 30, 2020

So far just thoughts for discussion

  • insert requires that all of the table's columns are passed. This requirement is clearly redundant: ClickHouse handles inserts with missing columns without any problems

  • This needs to be checked separately, but it seems to me that ClickHouse converts data on insert. If that is the case, then perhaps we should derive the type in the native protocol from the Julia type rather than from the type in the table schema. This would save us a lot of work on converting types

  • We clearly need a way to insert from a row format
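The second point could be sketched roughly as follows. This is a hypothetical illustration in Python rather than Julia, and the type names and mapping are assumptions for the sketch, not the client's actual wire format:

```python
# Hypothetical sketch: derive the native-protocol column type from the
# values' language-level type instead of looking it up in the table schema.
# The mapping below is illustrative only, not the client's real type table.
TYPE_MAP = {bool: "UInt8", int: "Int64", float: "Float64", str: "String"}

def wire_type(column_values):
    """Pick a wire type for a homogeneous column of values."""
    kinds = {type(v) for v in column_values}
    if len(kinds) != 1:
        raise ValueError("column must be homogeneous to infer a wire type")
    return TYPE_MAP[kinds.pop()]

print(wire_type([1, 2, 3]))   # Int64
print(wire_type(["a", "b"]))  # String
```

Whether this is safe hinges on the open question above: the server would have to accept a block whose declared column types differ from the schema and convert them itself.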

@athre0z
Member

athre0z commented Aug 30, 2020

Some thoughts of mine in response to your points:

insert requires that all of the table's columns are passed. This requirement is clearly redundant: ClickHouse handles inserts with missing columns without any problems

I implemented this because the ClickHouse server version that I developed this client against had a tendency to SIGSEGV on both missing and duplicated columns, which was hard to troubleshoot. I remember that this has since been fixed, so we could indeed remove the check now.

It's worth noting that the current check will probably prevent users from making use of a DEFAULT x clause they may have specified when creating the table, so we should definitely change this in some way.

I'm not that much of a fan of implicit defaulting for columns that don't actually have a default or aren't nullable. ClickHouse essentially initializes such columns with values similar to C++'s default-initialization scheme for heap-allocated objects (0, empty string, ...), which hardly ever does what I want. I personally find that design pretty error-prone and would prefer the server to reject such inserts. On the other hand, this is how ClickHouse designed it, and a client for the database should probably mirror what the database and the official client do.

It's also worth pointing out that, last time I checked, the server will silently ignore columns in a block that don't correspond to any database columns, which might be pretty confusing for users.

Summarizing, I guess I'd be okay with removing the check if you still think it's the best thing to do after considering my arguments. The check is pretty much free from a performance perspective, but if we keep it, we'd have to solve the problem of allowing users to explicitly request defaults, perhaps by having them set the column to nothing explicitly. Let me know what you think!
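The "explicitly default" idea from the last paragraph could look something like this. This is a minimal Python sketch of the design, not the client's API; the sentinel stands in for Julia's `nothing`, and the dict-based row is an assumption for illustration:

```python
# Hypothetical sentinel meaning "omit this column so the server applies
# its DEFAULT expression", while the user still lists every column
# explicitly and the all-columns check stays in place.
SERVER_DEFAULT = object()

def effective_columns(row):
    """Drop sentinel-valued entries; only the remaining columns are sent,
    and the server fills in defaults for the omitted ones."""
    return {name: value for name, value in row.items()
            if value is not SERVER_DEFAULT}

row = {"id": 1, "created_at": SERVER_DEFAULT}
print(effective_columns(row))  # {'id': 1}
```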

This needs to be checked separately, but it seems to me that ClickHouse converts data on insert. If that is the case, then perhaps we should derive the type in the native protocol from the Julia type rather than from the type in the table schema. This would save us a lot of work on converting types

Yes, this is an interesting question. A somewhat related observation is that the command-line client also allows insert queries with pretty much arbitrary expressions, like:

CREATE TABLE blah
(
    `aaa` UInt32
)
ENGINE = Memory;

INSERT INTO blah (aaa) VALUES 
    (toUInt32(toDecimal64(234, 0))), 
    (1234), 
    (234 + 2934728394);

and I wonder how this actually works. It might evaluate these expressions in the client, which wouldn't be very helpful for us, or it might send such inserts as a string. We could just use tcpdump to spy on the C++ command-line client and check what it sends.

We clearly need a way to insert from a row format

Yes. However, I think we should design it in a way that doesn't tempt the user to hit the database once per inserted row, but instead offers a more convenient way for users to build and insert blocks, or to automatically construct a stream of appropriately sized blocks from a sequence / iterator / channel of rows. Something in this direction.
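The "stream of blocks from a sequence of rows" idea could be sketched as follows. Python is used here for brevity (the client is Julia), and the function name and block size are made up for the sketch:

```python
from itertools import islice

def iter_blocks(rows, block_size=65536):
    """Group an arbitrary iterator of rows into lists of at most
    block_size rows, each of which would be sent to the server as one
    insert block instead of hitting the database once per row."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

print(list(iter_blocks(range(10), block_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```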

It would probably also be good to design it so that the user defines a struct / named tuple type for their row format instead of using Dicts, because that would reduce runtime lookups significantly. We might be able to make the whole row struct -> block conversion a super-efficient @generated function, which would make row-based inserts pretty much as efficient as block-based ones.
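At its core, that conversion is a row-major to column-major transpose. A plain Python equivalent of the idea (the row type here is hypothetical; in Julia it would be a struct or NamedTuple, and the per-field loop is what an @generated function could unroll at compile time):

```python
from collections import namedtuple

# Hypothetical row type; in Julia this would be a struct or NamedTuple.
Row = namedtuple("Row", ["id", "name"])

def rows_to_block(rows, row_type):
    """Transpose row-major data into the column-major block layout that
    the native protocol works with. The field names come from the row
    type itself, so no per-row dict lookups are needed."""
    return {field: [getattr(row, field) for row in rows]
            for field in row_type._fields}

block = rows_to_block([Row(1, "a"), Row(2, "b")], Row)
print(block)  # {'id': [1, 2], 'name': ['a', 'b']}
```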

Just some thoughts, nothing really fleshed out yet.

@waralex
Collaborator Author

waralex commented Aug 30, 2020

I personally find that design pretty error-prone and would prefer the server to reject such inserts.

From my experience, we use this scheme when adding columns to analytics. First the table is altered, while the analytics code continues to send events without the new column. Some time after the ALTER has gone through, new code that collects analytics with the new column is deployed to production. If the server rejected requests that don't include all columns, we would have a huge problem keeping the ALTER and the production code in sync.
