From 8be48879df747ebeb7e88c2c418a61c1c3f8f8b1 Mon Sep 17 00:00:00 2001 From: liujiwen-up Date: Thu, 23 Jan 2025 13:25:45 +0800 Subject: [PATCH 1/2] Fix routine load statements documentation docs --- .../load-and-export/ALTER-ROUTINE-LOAD.md | 178 ++-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 770 +++++++++--------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 50 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 49 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 38 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 71 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 150 ++-- .../load-and-export/STOP-ROUTINE-LOAD.md | 45 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 100 +-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 758 +++++++++-------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 45 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 46 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 34 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 75 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 133 +-- .../load-and-export/STOP-ROUTINE-LOAD.md | 44 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 100 +-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 762 +++++++++-------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 44 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 45 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 35 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 74 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 132 +-- .../load-and-export/STOP-ROUTINE-LOAD.md | 43 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 100 +-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 758 +++++++++-------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 45 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 46 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 34 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 75 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 133 +-- .../load-and-export/STOP-ROUTINE-LOAD.md | 44 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 147 ++-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 767 +++++++++-------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 50 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 50 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 39 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 71 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 188 +++-- .../load-and-export/STOP-ROUTINE-LOAD.md | 46 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 178 ++-- .../load-and-export/CREATE-ROUTINE-LOAD.md | 770 +++++++++--------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 50 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 49 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 38 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 71 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 150 ++-- .../load-and-export/STOP-ROUTINE-LOAD.md | 45 +- 48 files changed, 4108 insertions(+), 3657 deletions(-) diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 4c6631ae77ab1..0167c8afcfabc 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -27,98 +27,96 @@ under the License. ## Description -This syntax is used to modify an already created routine import job. +This syntax is used to modify an existing routine load job. Only jobs in PAUSED state can be modified. -Only jobs in the PAUSED state can be modified. 
- -grammar: +## Syntax ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - Specifies the job name to modify. - -2. `tbl_name` - - Specifies the name of the table to be imported. - -3. `job_properties` - - Specifies the job parameters that need to be modified. Currently, only the modification of the following parameters is supported: - - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. `timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` - - -4. `data_source` - - The type of data source. Currently supports: - - KAFKA - -5. `data_source_properties` - - Relevant properties of the data source. Currently only supports: - - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. Custom properties, such as `property.group.id` - - Note: - - 1. `kafka_partitions` and `kafka_offsets` are used to modify the offset of the kafka partition to be consumed, only the currently consumed partition can be modified. Cannot add partition. - -## Example - -1. Change `desired_concurrent_number` to 1 - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "1" - ); - ``` - -2. Modify `desired_concurrent_number` to 10, modify the offset of the partition, and modify the group id. - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "10" - ) - FROM kafka - ( - "kafka_partitions" = "0, 1, 2", - "kafka_offsets" = "100, 200, 100", - "property.group.id" = "new_group" - ); - ``` - -## Keywords - - ALTER, ROUTINE, LOAD - -## Best Practice - +## Required Parameters + +**1. `[db.]job_name`** + +> Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. +> +> The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. + +**2. `job_properties`** + +> Specifies the job parameters to be modified. Currently supported parameters include: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio + +**3. `data_source`** + +> The type of data source. Currently supports: +> +> - KAFKA + +**4. `data_source_properties`** + +> Properties related to the data source. Currently supports: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - Custom properties, such as property.group.id + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | + +## Notes + +- `kafka_partitions` and `kafka_offsets` are used to modify the offset of kafka partitions to be consumed, and can only modify currently consumed partitions. New partitions cannot be added. 
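  For instance, assuming a paused job `db1.label1` that currently consumes partitions 0 and 1, a minimal sketch that only rewinds the offsets of those already-consumed partitions (without adding any new partition) could look like the statement below; per the syntax above, the PROPERTIES clause can be omitted when only data source properties change:

  ```sql
  ALTER ROUTINE LOAD FOR db1.label1
  FROM kafka
  (
      "kafka_partitions" = "0, 1",
      "kafka_offsets" = "100, 200"
  );
  ```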
+ +## Examples + +- Modify `desired_concurrent_number` to 1 + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "1" + ); + ``` + +- Modify `desired_concurrent_number` to 10, modify partition offsets, and modify group id + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "10" + ) + FROM kafka + ( + "kafka_partitions" = "0, 1, 2", + "kafka_offsets" = "100, 200, 100", + "property.group.id" = "new_group" + ); + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 1d9d0e7a2481a..e0eec79594f64 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -27,357 +27,381 @@ under the License. ## Description -The Routine Load function allows users to submit a resident import task, and import data into Doris by continuously reading data from a specified data source. +The Routine Load feature allows users to submit a resident import task that continuously reads data from a specified data source and imports it into Doris. -Currently, only data in CSV or Json format can be imported from Kakfa through unauthenticated or SSL authentication. [Example of importing data in Json format](../../../../data-operate/import/import-way/routine-load-manual.md#Example_of_importing_data_in_Json_format) +Currently, it only supports importing CSV or Json format data from Kafka through unauthenticated or SSL authentication methods. [Example of importing Json format data](../../../../data-operate/import/import-way/routine-load-manual.md#Example-of-importing-Json-format-data) -grammar: +## Syntax ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` -- `[db.]job_name` - - The name of the import job. Within the same database, only one job with the same name can be running. - -- `tbl_name` - - Specifies the name of the table to be imported.Optional parameter, If not specified, the dynamic table method will - be used, which requires the data in Kafka to contain table name information. Currently, only the table name can be - obtained from the Kafka value, and it needs to conform to the format of "table_name|{"col1": "val1", "col2": "val2"}" - for JSON data. The "tbl_name" represents the table name, and "|" is used as the delimiter between the table name and - the table data. The same format applies to CSV data, such as "table_name|val1,val2,val3". It is important to note that - the "table_name" must be consistent with the table name in Doris, otherwise it may cause import failures. - - Tips: The `columns_mapping` parameter is not supported for dynamic tables. If your table structure is consistent with - the table structure in Doris and there is a large amount of table information to be imported, this method will be the - best choice. - -- `merge_type` - - Data merge type. The default is APPEND, which means that the imported data are ordinary append write operations. The MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. 
The DELETE type means that all imported data are deleted data. - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - -- load_properties - - Used to describe imported data. The composition is as follows: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - Specifies the column separator, defaults to `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - It is used to specify the mapping relationship between file columns and columns in the table, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - Tips: Dynamic multiple tables are not supported. - - - `preceding_filter` - - Filter raw data. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - Tips: Dynamic multiple tables are not supported. - - - `where_predicates` - - Filter imported data based on conditions. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `WHERE k1 > 100 and k2 = 1000` - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - - - `partitions` - - Specify in which partitions of the import destination table. If not specified, it will be automatically imported into the corresponding partition. - - `PARTITION(p1, p2, p3)` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `DELETE ON` - - It needs to be used with the MEREGE import mode, only for the table of the Unique Key model. Used to specify the columns and calculated relationships in the imported data that represent the Delete Flag. - - `DELETE ON v3 >100` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `ORDER BY` - - Tables only for the Unique Key model. Used to specify the column in the imported data that represents the Sequence Col. Mainly used to ensure data order when importing. - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - -- `job_properties` - - Common parameters for specifying routine import jobs. - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - Currently we support the following parameters: - - 1. `desired_concurrent_number` - - Desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies the maximum number of tasks a job can execute concurrently. Must be greater than 0. Default is 5. - - This degree of concurrency is not the actual degree of concurrency. The actual degree of concurrency will be comprehensively considered by the number of nodes in the cluster, the load situation, and the situation of the data source. - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - These three parameters represent: - - 1. 
The maximum execution time of each subtask, in seconds. Must be greater than or equal to 1. The default is 10. - 2. The maximum number of lines read by each subtask. Must be greater than or equal to 200000. The default is 200000. - 3. The maximum number of bytes read by each subtask. The unit is bytes and the range is 100MB to 10GB. The default is 100MB. - - These three parameters are used to control the execution time and processing volume of a subtask. When either one reaches the threshold, the task ends. - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - The maximum number of error lines allowed within the sampling window. Must be greater than or equal to 0. The default is 0, which means no error lines are allowed. - - The sampling window is `max_batch_rows * 10`. That is, if the number of error lines is greater than `max_error_number` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 4. `strict_mode` - - Whether to enable strict mode, the default is off. If enabled, the column type conversion of non-null raw data will be filtered if the result is NULL. Specify as: - - `"strict_mode" = "true"` - - The strict mode mode means strict filtering of column type conversions during the load process. The strict filtering strategy is as follows: - - 1. For column type conversion, if strict mode is true, the wrong data will be filtered. The error data here refers to the fact that the original data is not null, and the result is a null value after participating in the column type conversion. - 2. When a loaded column is generated by a function transformation, strict mode has no effect on it. - 3. For a column type loaded with a range limit, if the original data can pass the type conversion normally, but cannot pass the range limit, strict mode will not affect it. For example, if the type is decimal(1,0) and the original data is 10, it is eligible for type conversion but not for column declarations. This data strict has no effect on it. - - **strict mode and load relationship of source data** - - Here is an example of a column type of TinyInt. - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - Here the column type is Decimal(1,0) - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > Note: 10 Although it is a value that is out of range, because its type meets the requirements of decimal, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows. But it will not be filtered by strict mode. - - 5. 
`timezone` - - Specifies the time zone used by the import job. The default is to use the Session's timezone parameter. This parameter affects the results of all time zone-related functions involved in the import. - - 6. `format` - - Specify the import data format, the default is csv, and the json format is supported. - - 7. `jsonpaths` - - When the imported data format is json, the fields in the Json data can be extracted by specifying jsonpaths. - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - - 8. `strip_outer_array` - - When the imported data format is json, strip_outer_array is true, indicating that the Json data is displayed in the form of an array, and each element in the data will be regarded as a row of data. The default value is false. - - `-H "strip_outer_array: true"` - - 9. `json_root` - - When the import data format is json, you can specify the root node of the Json data through json_root. Doris will extract the elements of the root node through json_root for parsing. Default is empty. - - `-H "json_root: $.RECORDS"` - 10. `send_batch_parallelism` - - Integer, Used to set the default parallelism for sending batch, if the value for parallelism exceed `max_send_batch_parallelism_per_job` in BE config, then the coordinator BE will use the value of `max_send_batch_parallelism_per_job`. - - 11. `load_to_single_tablet` - Boolean type, True means that one task can only load data to one tablet in the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into the OLAP table with random bucketing. - - 12. `partial_columns` - Boolean type, True means that use partial column update, the default value is false, this parameter is only allowed to be set when the table model is Unique and Merge on Write is used. Multi-table does not support this parameter. - - 13. `max_filter_ratio` - The maximum allowed filtering rate within the sampling window. Must be between 0 and 1. The default value is 0. - - The sampling window is `max_batch_rows * 10`. That is, if the number of error lines / total lines is greater than `max_filter_ratio` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 14. `enclose` - When the csv data field contains row delimiters or column delimiters, to prevent accidental truncation, single-byte characters can be specified as brackets for protection. For example, the column separator is ",", the bracket is "'", and the data is "a,'b,c'", then "b,c" will be parsed as a field. - Note: when the bracket is `"`, `trim\_double\_quotes` must be set to true. - - 15. `escape` - Used to escape characters that appear in a csv field identical to the enclosing characters. For example, if the data is "a,'b,'c'", enclose is "'", and you want "b,'c to be parsed as a field, you need to specify a single-byte escape character, such as `\`, and then modify the data to `a,' b,\'c'`. - -- `FROM data_source [data_source_properties]` - - The type of data source. Currently supports: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` supports the following data source properties: - - 1. `kafka_broker_list` - - Kafka's broker connection information. The format is ip:host. Separate multiple brokers with commas. - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - Specifies the Kafka topic to subscribe to. 
- - `"kafka_topic" = "my_topic"` - - 3. `kafka_partitions/kafka_offsets` - - Specify the kafka partition to be subscribed to, and the corresponding starting offset of each partition. If a time is specified, consumption will start at the nearest offset greater than or equal to the time. - - offset can specify a specific offset from 0 or greater, or: - - - `OFFSET_BEGINNING`: Start subscription from where there is data. - - `OFFSET_END`: subscribe from the end. - - Time format, such as: "2021-05-22 11:00:00" - - If not specified, all partitions under topic will be subscribed from `OFFSET_END` by default. - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - Note that the time format cannot be mixed with the OFFSET format. - - 4. `property` - - Specify custom kafka parameters. The function is equivalent to the "--property" parameter in the kafka shell. - - When the value of the parameter is a file, you need to add the keyword: "FILE:" before the value. - - For how to create a file, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. - - For more supported custom parameters, please refer to the configuration items on the client side in the official CONFIGURATION document of librdkafka. Such as: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. When connecting to Kafka using SSL, you need to specify the following parameters: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - in: - - `property.security.protocol` and `property.ssl.ca.location` are required to indicate the connection method is SSL and the location of the CA certificate. - - If client authentication is enabled on the Kafka server side, thenAlso set: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - They are used to specify the client's public key, private key, and password for the private key, respectively. - - 2. Specify the default starting offset of the kafka partition - - If `kafka_partitions/kafka_offsets` is not specified, all partitions are consumed by default. - - At this point, you can specify `kafka_default_offsets` to specify the starting offset. Defaults to `OFFSET_END`, i.e. subscribes from the end. - - Example: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - Comment for the routine load job. - -:::tip Tips -This feature is supported since the Apache Doris 1.2.3 version -::: - -<<<<<<<< HEAD:versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md -## Example -======== +## Required Parameters + +**1. `[db.]job_name`** + +> The name of the import job. Within the same database, only one job with the same name can be running. + +**2. `FROM data_source`** + +> The type of data source. Currently supports: KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` +> +> 2. 
`kafka_topic` +> +> Specifies the Kafka topic to subscribe to. +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## Optional Parameters + +**1. `tbl_name`** + +> Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. +> +> Currently, only supports getting table names from Kafka's Value, and it needs to follow this format: for json example: `table_name|{"col1": "val1", "col2": "val2"}`, +> where `tbl_name` is the table name, with `|` as the separator between table name and table data. +> +> For csv format data, it's similar: `table_name|val1,val2,val3`. Note that `table_name` here must match the table name in Doris, otherwise the import will fail. +> +> Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. + +**2. `merge_type`** + +> Data merge type. Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. +> +> Tips: When using dynamic multiple tables, please note that this parameter should be consistent with each dynamic table's type, otherwise it will result in import failure. + +**3. `load_properties`** + +> Used to describe imported data. The composition is as follows: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> Specifies the column separator, defaults to `\t` +> +> `COLUMNS TERMINATED BY ","` +> +> 2. `columns_mapping` +> +> Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> Tips: Dynamic tables do not support this parameter. +> +> 3. `preceding_filter` +> +> Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: Dynamic tables do not support this parameter. +> +> 4. `where_predicates` +> +> Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. +> +> 5. `partitions` +> +> Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. +> +> `PARTITION(p1, p2, p3)` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 6. `DELETE ON` +> +> Must be used with MERGE import mode, only applicable to Unique Key model tables. 
Used to specify the Delete Flag column and calculation relationship in the imported data. +> +> `DELETE ON v3 >100` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 7. `ORDER BY` +> +> Only applicable to Unique Key model tables. Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. + +**4. `job_properties`** + +> Used to specify general parameters for routine import jobs. +> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> Currently, we support the following parameters: +> +> 1. `desired_concurrent_number` +> +> The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. +> +> This concurrency is not the actual concurrency. The actual concurrency will be determined by considering the number of cluster nodes, load conditions, and data source conditions. +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> These three parameters represent: +> +> 1. Maximum execution time for each subtask, in seconds. Must be greater than or equal to 1. Default is 10. +> 2. Maximum number of rows to read for each subtask. Must be greater than or equal to 200000. Default is 20000000. +> 3. Maximum number of bytes to read for each subtask. Unit is bytes, range is 100MB to 10GB. Default is 1G. +> +> These three parameters are used to control the execution time and processing volume of a subtask. When any one reaches the threshold, the task ends. +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. Default is 0, meaning no error rows are allowed. +> +> The sampling window is `max_batch_rows * 10`. If the number of error rows within the sampling window exceeds `max_error_number`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 4. `strict_mode` +> +> Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: +> +> `"strict_mode" = "true"` +> +> Strict mode means: strictly filter column type conversions during the import process. The strict filtering strategy is as follows: +> +> 1. For column type conversion, if strict mode is true, erroneous data will be filtered. Here, erroneous data refers to: original data that is not null but results in null value after column type conversion. +> 2. For columns generated by function transformation during import, strict mode has no effect. +> 3. For columns with range restrictions, if the original data can pass type conversion but cannot pass range restrictions, strict mode has no effect. For example: if the type is decimal(1,0) and the original data is 10, it can pass type conversion but is outside the column's declared range. Strict mode has no effect on such data. 
+> +> **Relationship between strict mode and source data import** +> +> Here's an example using TinyInt column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> Here's an example using Decimal(1,0) column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. +> +> 5. `timezone` +> +> Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> Specifies the import data format, default is csv, json format is supported. +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> When importing json format data, jsonpaths can be used to specify fields to extract from Json data. +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. +> +> `-H "json_root: $.RECORDS"` +> +> 10. `send_batch_parallelism` +> +> Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. 
+> +> The sampling window is `max_batch_rows * 10`. If within the sampling window, error rows/total rows exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 14. `enclose` +> +> Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. +> +> Note: When enclose is set to `"`, trim_double_quotes must be set to true. +> +> 15. `escape` +> +> Escape character. Used to escape characters in csv fields that are the same as the enclosure character. For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. +> +**5. Optional properties in `data_source_properties`** + +> 1. `kafka_partitions/kafka_offsets` +> +> Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. +> +> offset can be specified as a specific offset greater than or equal to 0, or: +> +> - `OFFSET_BEGINNING`: Start subscribing from where data exists. +> - `OFFSET_END`: Start subscribing from the end. +> - Time format, such as: "2021-05-22 11:00:00" +> +> If not specified, defaults to subscribing to all partitions under the topic from `OFFSET_END`. +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> Note: Time format cannot be mixed with OFFSET format. +> +> 2. `property` +> +> Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. +> +> When the value of a parameter is a file, the keyword "FILE:" needs to be added before the value. +> +> For information about how to create files, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. +> +> For more supported custom parameters, please refer to the client configuration items in the official CONFIGURATION documentation of librdkafka. For example: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> Among them: +> +> `property.security.protocol` and `property.ssl.ca.location` are required, used to specify the connection method as SSL and the location of the CA certificate. +> +> If client authentication is enabled on the Kafka server side, the following also need to be set: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> Used to specify the client's public key, private key, and private key password respectively. +> +> 2. 
Specify default starting offset for kafka partitions +> +> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> +> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> +> Example: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. `COMMENT`** + +> Comment information for the routine load task. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | CREATE ROUTINE LOAD belongs to table LOAD operation | + +## Notes + +- Dynamic tables do not support the `columns_mapping` parameter +- When using dynamic multiple tables, parameters like merge_type, where_predicates, etc., need to conform to each dynamic table's requirements +- Time format cannot be mixed with OFFSET format +- `kafka_partitions` and `kafka_offsets` must correspond one-to-one +- When `enclose` is set to `"`, `trim_double_quotes` must be set to true. ## Examples ->>>>>>>> ac43c88d43b68f907eafd82a2629cea01b097093:versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md - -1. Create a Kafka routine import task named test1 for example_tbl of example_db. Specify the column separator and group.id and client.id, and automatically consume all partitions by default, and start subscribing from the location where there is data (OFFSET_BEGINNING) - +- Create a Kafka routine load task named test1 for example_tbl in example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -401,9 +425,9 @@ This feature is supported since the Apache Doris 1.2.3 version ); ``` -2. Create a Kafka routine dynamic multiple tables import task named "test1" for the "example_db". Specify the column delimiter, group.id, and client.id, and automatically consume all partitions, subscribing from the position with data (OFFSET_BEGINNING). +- Create a Kafka routine dynamic multi-table load task named test1 for example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) -Assuming that we need to import data from Kafka into tables "test1" and "test2" in the "example_db", we create a routine import task named "test1". At the same time, we write the data in "test1" and "test2" to a Kafka topic named "my_topic" so that data from Kafka can be imported into both tables through a routine import task. + Assuming we need to import data from Kafka into test1 and test2 tables in example_db, we create a routine load task named test1, and write data from test1 and test2 to a Kafka topic named `my_topic`. This way, we can import data from Kafka into two tables through one routine load task. ```sql CREATE ROUTINE LOAD example_db.test1 @@ -425,9 +449,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -3. Create a Kafka routine import task named test1 for example_tbl of example_db. Import tasks are in strict mode. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db. The import task is in strict mode. 
```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -451,9 +473,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -4. Import data from the Kafka cluster through SSL authentication. Also set the client.id parameter. The import task is in non-strict mode and the time zone is Africa/Abidjan - - +- Import data from Kafka cluster using SSL authentication. Also set client.id parameter. Import task is in non-strict mode, timezone is Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -481,9 +501,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -5. Import data in Json format. By default, the field name in Json is used as the column name mapping. Specify to import three partitions 0, 1, and 2, and the starting offsets are all 0 - - +- Import Json format data. Use field names in Json as column name mapping by default. Specify importing partitions 0,1,2, all starting offsets are 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -506,9 +524,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -6. Import Json data, extract fields through Jsonpaths, and specify the root node of the Json document - - +- Import Json data, extract fields through Jsonpaths, and specify Json document root node ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -534,9 +550,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -7. Create a Kafka routine import task named test1 for example_tbl of example_db. And use conditional filtering. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db with condition filtering. ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -561,9 +575,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -8. Import data to Unique with sequence column Key model table - - +- Import data into a Unique Key model table containing sequence columns ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -585,9 +597,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -9. Consume from a specified point in time - - +- Start consuming from a specified time point ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -603,30 +613,4 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## Keywords - - CREATE, ROUTINE, LOAD, CREATE LOAD - -## Best Practice - -Partition and Offset for specified consumption - -Doris supports the specified Partition and Offset to start consumption, and also supports the function of consumption at a specified time point. The configuration relationship of the corresponding parameters is described here. - -There are three relevant parameters: - -- `kafka_partitions`: Specify a list of partitions to be consumed, such as "0, 1, 2, 3". -- `kafka_offsets`: Specify the starting offset of each partition, which must correspond to the number of `kafka_partitions` list. For example: "1000, 1000, 2000, 2000" -- `property.kafka_default_offsets`: Specifies the default starting offset of the partition. 
- -When creating an import job, these three parameters can have the following combinations: - -| Composition | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | Behavior | -| ----------- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | The system will automatically find all partitions corresponding to the topic and start consumption from OFFSET_END | -| 2 | No | No | Yes | The system will automatically find all partitions corresponding to the topic and start consumption from the location specified by default offset | -| 3 | Yes | No | No | The system will start consumption from OFFSET_END of the specified partition | -| 4 | Yes | Yes | No | The system will start consumption from the specified offset of the specified partition | -| 5 | Yes | No | Yes | The system will start consumption from the specified partition, the location specified by default offset | + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 5b93ad79b926c..ebc06d9229ec1 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,31 +25,51 @@ under the License. --> -## Description +## Description## Description -Used to pause a Routine Load job. A suspended job can be rerun with the RESUME command. +This syntax is used to pause one or all Routine Load jobs. Paused jobs can be restarted using the RESUME command. + +## Syntax ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to pause. If ALL is specified, job_name is not required. + +## Optional Parameters + +**1. `[ALL]`** + +> Optional parameter. If ALL is specified, it indicates pausing all routine load jobs. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: -1. Pause the routine import job named test1. +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | - ```sql - PAUSE ROUTINE LOAD FOR test1; - ``` +## Notes -2. Pause all routine import jobs. +- After a job is paused, it can be restarted using the RESUME command +- The pause operation will not affect tasks that have already been dispatched to BE, these tasks will continue to complete - ```sql - PAUSE ALL ROUTINE LOAD; - ``` +## Examples -## Keywords +- Pause a routine load job named test1. - PAUSE, ROUTINE, LOAD + ```sql + PAUSE ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Pause all routine load jobs. + ```sql + PAUSE ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index da72700bdcb82..7974c1c3542d6 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -28,29 +28,50 @@ under the License. ## Description -Used to restart a suspended Routine Load job. 
The restarted job will continue to consume from the previously consumed offset. +This syntax is used to restart one or all paused Routine Load jobs. The restarted job will continue consuming from the previously consumed offset. + +## Syntax ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to restart. If ALL is specified, job_name is not required. + +## Optional Parameters + +**1. `[ALL]`** + +> Optional parameter. If ALL is specified, it indicates restarting all paused routine load jobs. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: -1. Restart the routine import job named test1. +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | - ```sql - RESUME ROUTINE LOAD FOR test1; - ``` +## Notes -2. Restart all routine import jobs. +- Only jobs in PAUSED state can be restarted +- Restarted jobs will continue consuming data from the last consumed position +- If a job has been paused for too long, the restart may fail due to expired Kafka data - ```sql - RESUME ALL ROUTINE LOAD; - ``` +## Examples -## Keywords +- Restart a routine load job named test1. - RESUME, ROUTINE, LOAD + ```sql + RESUME ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Restart all routine load jobs. + ```sql + RESUME ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 8880986f281b5..fc079f8901cbc 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -29,32 +29,40 @@ under the License. ## Description -This statement is used to demonstrate the creation statement of a routine import job. +This statement is used to display the creation statement of a routine load job. -The kafka partition and offset in the result show the currently consumed partition and the corresponding offset to be consumed. +The result shows the current consuming Kafka partitions and their corresponding offsets to be consumed. -grammar: +## Syntax ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -illustrate: +## Required Parameters -1. `ALL`: optional parameter, which means to get all jobs, including historical jobs -2. `load_name`: routine import job name +**1. `load_name`** -## Example +> The name of the routine load job -1. Show the creation statement of the specified routine import job under the default db +## Optional Parameters - ```sql - SHOW CREATE ROUTINE LOAD for test_load - ``` +**1. 
`[ALL]`** -## Keywords +> Optional parameter that represents retrieving all jobs, including historical jobs - SHOW, CREATE, ROUTINE, LOAD +## Permission Control -## Best Practice +Users executing this SQL command must have at least the following permission: +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Examples + +- Show the creation statement of a specified routine load job in the default database + + ```sql + SHOW CREATE ROUTINE LOAD for test_load + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index c275b9cd2ae1f..6f7d84542242a 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -28,52 +28,55 @@ under the License. ## Description -View the currently running subtasks of a specified Routine Load job. - +This syntax is used to view the currently running subtasks of a specified Routine Load job. +## Syntax ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -The returned results are as follows: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## Required Parameters + +**1. `job_name`** + +> The name of the routine load job to view. + +## Return Results + +The return results include the following fields: -- `TaskId`: The unique ID of the subtask. -- `TxnId`: The import transaction ID corresponding to the subtask. -- `TxnStatus`: The import transaction status corresponding to the subtask. When TxnStatus is null, it means that the subtask has not yet started scheduling. -- `JobId`: The job ID corresponding to the subtask. -- `CreateTime`: The creation time of the subtask. -- `ExecuteStartTime`: The time when the subtask is scheduled to be executed, usually later than the creation time. -- `Timeout`: Subtask timeout, usually twice the `max_batch_interval` set by the job. -- `BeId`: The ID of the BE node executing this subtask. -- `DataSourceProperties`: The starting offset of the Kafka Partition that the subtask is ready to consume. is a Json format string. Key is Partition Id. Value is the starting offset of consumption. +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| TaskId | Unique ID of the subtask | +| TxnId | Import transaction ID corresponding to the subtask | +| TxnStatus | Import transaction status of the subtask. Null indicates the subtask has not yet been scheduled | +| JobId | Job ID corresponding to the subtask | +| CreateTime | Creation time of the subtask | +| ExecuteStartTime | Time when the subtask was scheduled for execution, typically later than creation time | +| Timeout | Subtask timeout, typically twice the `max_batch_interval` set in the job | +| BeId | BE node ID executing this subtask | +| DataSourceProperties | Starting offset of Kafka Partition that the subtask is preparing to consume. It's a Json format string. 
Key is Partition Id, Value is the starting offset for consumption | -## Example +## Privilege Control -1. Display the subtask information of the routine import task named test1. +Users executing this SQL command must have at least the following privileges: - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD TASK requires LOAD privilege on the table | -## Keywords +## Notes - SHOW, ROUTINE, LOAD, TASK +- A null TxnStatus doesn't indicate task error, it may mean the task hasn't been scheduled yet +- The offset information in DataSourceProperties can be used to track data consumption progress +- When Timeout is reached, the task will automatically end regardless of whether data consumption is complete -## Best Practice +## Examples -With this command, you can view how many subtasks are currently running in a Routine Load job, and which BE node is running on. +- Show subtask information for a routine load task named test1. + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 201a06b0bfe9c..2d502603eedb6 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -27,99 +27,115 @@ under the License. ## Description -This statement is used to display the running status of the Routine Load job +This statement is used to display the running status of Routine Load jobs. You can view the status information of either a specific job or all jobs. -grammar: +## Syntax ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; +SHOW [] ROUTINE LOAD [FOR ]; ``` -Result description: +## Optional Parameters -``` - Id: job ID - Name: job name - CreateTime: job creation time - PauseTime: The last job pause time - EndTime: Job end time - DbName: corresponding database name - TableName: The name of the corresponding table (In the case of multiple tables, since it is a dynamic table, the specific table name is not displayed, and we uniformly display it as "multi-table"). - IsMultiTbl: Indicates whether it is a multi-table - State: job running state - DataSourceType: Data source type: KAFKA - CurrentTaskNum: The current number of subtasks - JobProperties: Job configuration details -DataSourceProperties: Data source configuration details - CustomProperties: custom configuration - Statistic: Job running status statistics - Progress: job running progress - Lag: job delay status -ReasonOfStateChanged: The reason for the job state change - ErrorLogUrls: The viewing address of the filtered unqualified data - OtherMsg: other error messages -``` +**1. `[ALL]`** + +> Optional parameter. If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. + +**2. `[FOR jobName]`** -* State +> Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. 
+> +> Supports the following formats: +> +> - `job_name`: Shows the job with the specified name in the current database +> - `db_name.job_name`: Shows the job with the specified name in the specified database - There are the following 5 states: - * NEED_SCHEDULE: The job is waiting to be scheduled - * RUNNING: The job is running - * PAUSED: The job is paused - * STOPPED: The job has ended - * CANCELLED: The job was canceled +## Return Results -* Progress +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| Id | Job ID | +| Name | Job name | +| CreateTime | Job creation time | +| PauseTime | Most recent job pause time | +| EndTime | Job end time | +| DbName | Corresponding database name | +| TableName | Corresponding table name (shows 'multi-table' for multiple tables) | +| IsMultiTbl | Whether it's a multi-table job | +| State | Job running status | +| DataSourceType | Data source type: KAFKA | +| CurrentTaskNum | Current number of subtasks | +| JobProperties | Job configuration details | +| DataSourceProperties | Data source configuration details | +| CustomProperties | Custom configurations | +| Statistic | Job running statistics | +| Progress | Job running progress | +| Lag | Job delay status | +| ReasonOfStateChanged | Reason for job state change | +| ErrorLogUrls | URLs to view filtered data that failed quality checks | +| OtherMsg | Other error messages | - For Kafka data sources, displays the currently consumed offset for each partition. For example, {"0":"2"} indicates that the consumption progress of Kafka partition 0 is 2. +## Permission Control -*Lag +Users executing this SQL command must have at least the following permission: - For Kafka data sources, shows the consumption latency of each partition. For example, {"0":10} means that the consumption delay of Kafka partition 0 is 10. +| Privilege | Object | Notes | +| :----------- | :----- | :----------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | -## Example +## Notes -1. Show all routine import jobs named test1 (including stopped or canceled jobs). The result is one or more lines. +- State descriptions: + - NEED_SCHEDULE: Job is waiting to be scheduled + - RUNNING: Job is running + - PAUSED: Job is paused + - STOPPED: Job has ended + - CANCELLED: Job has been cancelled - ```sql - SHOW ALL ROUTINE LOAD FOR test1; - ``` +- Progress description: + - For Kafka data source, shows the consumed offset for each partition + - For example, {"0":"2"} means the consumption progress of Kafka partition 0 is 2 -2. Show the currently running routine import job named test1 +- Lag description: + - For Kafka data source, shows the consumption delay for each partition + - For example, {"0":10} means the consumption lag of Kafka partition 0 is 10 - ```sql - SHOW ROUTINE LOAD FOR test1; - ``` +## Examples -3. Display all routine import jobs (including stopped or canceled jobs) under example_db. The result is one or more lines. +- Show all routine load jobs (including stopped or cancelled ones) named test1 - ```sql - use example_db; - SHOW ALL ROUTINE LOAD; - ``` + ```sql + SHOW ALL ROUTINE LOAD FOR test1; + ``` -4. Display all running routine import jobs under example_db +- Show currently running routine load jobs named test1 - ```sql - use example_db; - SHOW ROUTINE LOAD; - ``` + ```sql + SHOW ROUTINE LOAD FOR test1; + ``` -5. 
Display the currently running routine import job named test1 under example_db +- Show all routine load jobs (including stopped or cancelled ones) in example_db. Results can be one or multiple rows. - ```sql - SHOW ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ALL ROUTINE LOAD; + ``` -6. Displays all routine import jobs named test1 under example_db (including stopped or canceled jobs). The result is one or more lines. +- Show all currently running routine load jobs in example_db - ```sql - SHOW ALL ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ROUTINE LOAD; + ``` -## Keywords +- Show currently running routine load job named test1 in example_db - SHOW, ROUTINE, LOAD + ```sql + SHOW ROUTINE LOAD FOR example_db.test1; + ``` -## Best Practice +- Show all routine load jobs (including stopped or cancelled ones) named test1 in example_db. Results can be one or multiple rows. + ```sql + SHOW ALL ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 36fb589789c71..ffea7818a144e 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -28,23 +28,48 @@ under the License. ## Description -User stops a Routine Load job. A stopped job cannot be rerun. +This syntax is used to stop a Routine Load job. Unlike the PAUSE command, stopped jobs cannot be restarted. If you need to import data again, you'll need to create a new import job. + +## Syntax ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to stop. It can be in the following formats: +> +> - `job_name`: Stop a job with the specified name in the current database +> - `db_name.job_name`: Stop a job with the specified name in the specified database + +## Permission Control + +Users executing this SQL command must have at least the following permission: + +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Notes -1. Stop the routine import job named test1. 
+- The stop operation is irreversible; stopped jobs cannot be restarted using the RESUME command +- The stop operation takes effect immediately, and running tasks will be interrupted +- It's recommended to check the job status using the SHOW ROUTINE LOAD command before stopping a job +- If you only want to temporarily pause a job, use the PAUSE command instead - ```sql - STOP ROUTINE LOAD FOR test1; - ``` +## Examples -## Keywords +- Stop a routine load job named test1 - STOP, ROUTINE, LOAD + ```sql + STOP ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Stop a routine load job in a specified database + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 3cb042cbb68c7..68c1c2760d937 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -28,70 +28,75 @@ under the License. ## 描述 -该语法用于修改已经创建的例行导入作业。 +该语法用于修改已经创建的例行导入作业。只能修改处于 PAUSED 状态的作业。 -只能修改处于 PAUSED 状态的作业。 - -语法: +## 语法 ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - 指定要修改的作业名称。 +## 必选参数 -2. `tbl_name` +**1. `[db.]job_name`** - 指定需要导入的表的名称。 +> 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 +> +> 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -3. `job_properties` +**2. `job_properties`** - 指定需要修改的作业参数。目前仅支持如下参数的修改: +> 指定需要修改的作业参数。目前支持修改的参数包括: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. `timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` +**3. `data_source`** +> 数据源的类型。当前支持: +> +> - KAFKA -4. `data_source` +**4. `data_source_properties`** - 数据源的类型。当前支持: +> 数据源的相关属性。目前支持: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - 自定义 property,如 property.group.id - KAFKA +## 权限控制 -5. `data_source_properties` +执行此 SQL 命令的用户必须至少具有以下权限: - 数据源的相关属性。目前仅支持: +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. 自定义 property,如 `property.group.id` +## 注意事项 - 注: - - 1. `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 +- `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 ## 示例 -1. 
将 `desired_concurrent_number` 修改为 1 +- 将 `desired_concurrent_number` 修改为 1 ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -101,7 +106,7 @@ FROM data_source ); ``` -2. 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id。 +- 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -115,10 +120,5 @@ FROM data_source "kafka_offsets" = "100, 200, 100", "property.group.id" = "new_group" ); - -## 关键词 - - ALTER, ROUTINE, LOAD - -### 最佳实践 + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 4e343b1d485b8..4ac39f6df86a4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -31,346 +31,383 @@ under the License. ## 描述 -例行导入(Routine Load)功能,支持用户提交一个常驻的导入任务,通过不断的从指定的数据源读取数据,将数据导入到 Doris 中。 +例行导入(Routine Load)功能支持用户提交一个常驻的导入任务,通过不断地从指定的数据源读取数据,将数据导入到 Doris 中。 -目前仅支持通过无认证或者 SSL 认证方式,从 Kakfa 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) +目前仅支持通过无认证或者 SSL 认证方式,从 Kafka 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) -语法: +## 语法 ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` -``` - -- `[db.]job_name` - - 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 - -- `tbl_name` - - 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 - 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, - 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 - `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败. 
- - tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 - -- `merge_type` - - 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 - -- load_properties - - 用于描述导入数据。组成如下: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - 指定列分隔符,默认为 `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - tips: 动态表不支持此参数。 - - - `preceding_filter` - - 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - tips: 动态表不支持此参数。 - - - `where_predicates` - - 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `WHERE k1 > 100 and k2 = 1000` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 - - - `partitions` - - 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 - - `PARTITION(p1, p2, p3)` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `DELETE ON` - - 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 - - `DELETE ON v3 >100` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `ORDER BY` - - 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - -- `job_properties` - - 用于指定例行导入作业的通用参数。 - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - 目前我们支持以下参数: - - 1. `desired_concurrent_number` - - 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 - - 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - 这三个参数分别表示: - - 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 - 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 20000000。 - 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 1G。 - - 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 4. `strict_mode` - - 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: - - `"strict_mode" = "true"` - - strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: - - 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 - 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 - 3. 
对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 - - **strict mode 与 source data 的导入关系** - - 这里以列类型为 TinyInt 来举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - 这里以列类型为 Decimal(1,0) 举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 - - 5. `timezone` - - 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 - - 6. `format` - - 指定导入数据格式,默认是 csv,支持 json 格式。 - - 7. `jsonpaths` - - 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - 8. `strip_outer_array` +## 必选参数 + +**1. `[db.]job_name`** + +> 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 + +**2. `FROM data_source`** + +> 数据源的类型。当前支持:KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` + +> 2. `kafka_topic` +> +> 指定要订阅的 Kafka 的 topic。 +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## 可选参数 + +**1. `tbl_name`** + +> 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 +> +> 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, +> 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。 + +> csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败。 +> +> tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 + + +**2. `merge_type`** + +> 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 + +**3. `load_properties`** + +> 用于描述导入数据。组成如下: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> 指定列分隔符,默认为 `\t` +> +> `COLUMNS TERMINATED BY ","` + +> 2. `columns_mapping` +> +> 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> tips: 动态表不支持此参数。 + +> 3. `preceding_filter` +> +> 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 动态表不支持此参数。 +> +> 4. `where_predicates` +> +> 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 + +> 5. `partitions` +> +> 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 +> +> `PARTITION(p1, p2, p3)` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 6. 
`DELETE ON` +> +> 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 +> +> `DELETE ON v3 >100` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 7. `ORDER BY` +> +> 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +**4. `job_properties`** + +> 用于指定例行导入作业的通用参数。 +> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> 目前我们支持以下参数: +> +> 1. `desired_concurrent_number` +> +> 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 +> +> 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> 这三个参数分别表示: +> +> 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 +> 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 20000000。 +> 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 1G。 +> +> 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 4. `strict_mode` +> +> 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: +> +> `"strict_mode" = "true"` +> +> strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: +> +> 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 +> 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 +> 3. 对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 +> +> **strict mode 与 source data 的导入关系** +> +> 这里以列类型为 TinyInt 来举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> 这里以列类型为 Decimal(1,0) 举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 +> +> 5. `timezone` +> +> 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> 指定导入数据格式,默认是 csv,支持 json 格式。 +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 +> +> `-H "json_root: $.RECORDS"` +> +> 10. 
`send_batch_parallelism` +> +> 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 14. `enclose` +> +> 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 +> +> 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 +> +> 15. `escape` +> +> 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 +> +**5. `data_source_properties` 中的可选属性** + +> 1. `kafka_partitions/kafka_offsets` +> +> 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 +> +> offset 可以指定从大于等于 0 的具体 offset,或者: +> +> - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 +> - `OFFSET_END`: 从末尾开始订阅。 +> - 时间格式,如:"2021-05-22 11:00:00" +> +> 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> 注意,时间格式不能和 OFFSET 格式混用。 +> +> 2. `property` +> +> 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 +> +> 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 +> +> 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 +> +> 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> 其中: +> +> `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 +> +> 如果 Kafka server 端开启了 client 认证,则还需设置: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> 分别用于指定 client 的 public key,private key 以及 private key 的密码。 +> +> 2. 指定 kafka partition 的默认起始 offset +> +> 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 +> +> 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 +> +> 示例: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. 
`COMMENT`** + +> 例行导入任务的注释信息。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | CREATE ROUTINE LOAD 属于表 LOAD 操作 | + +## 注意事项 + +- 动态表不支持 `columns_mapping` 参数 +- 使用动态多表时,merge_type、where_predicates 等参数需要符合每张动态表的要求 +- 时间格式不能和 OFFSET 格式混用 +- `kafka_partitions` 和 `kafka_offsets` 必须一一对应 +- 当 `enclose` 设置为`"`时,`trim_double_quotes` 一定要设置为 true。 - 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 - - `-H "strip_outer_array: true"` - - 9. `json_root` - - 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 - - `-H "json_root: $.RECORDS"` - - 10. `send_batch_parallelism` - - 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 - - 11. `load_to_single_tablet` - - 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 - - 12. `partial_columns` - 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 - - 13. `max_filter_ratio` - - 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 14. `enclose` - 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 - 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 - - 15. `escape` - 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 - -- `FROM data_source [data_source_properties]` - - 数据源的类型。当前支持: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` 支持如下数据源属性: - - 1. `kafka_broker_list` - - Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - 指定要订阅的 Kafka 的 topic。 - - `"kafka_topic" = "my_topic"` - - 3. `kafka_partitions/kafka_offsets` - - 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 - - offset 可以指定从大于等于 0 的具体 offset,或者: - - - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 - - `OFFSET_END`: 从末尾开始订阅。 - - 时间格式,如:"2021-05-22 11:00:00" - - 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - 注意,时间格式不能和 OFFSET 格式混用。 - - 4. `property` - - 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 - - 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 - - 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 - - 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. 
使用 SSL 连接 Kafka 时,需要指定以下参数: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - 其中: - - `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 - - 如果 Kafka server 端开启了 client 认证,则还需设置: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - 分别用于指定 client 的 public key,private key 以及 private key 的密码。 - - 2. 指定 kafka partition 的默认起始 offset - - 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 - - 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 - - 示例: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - 例行导入任务的注释信息。 ## 示例 -1. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -394,7 +431,7 @@ FROM data_source [data_source_properties] ); ``` -2. 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, +- 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, 且从有数据的位置(OFFSET_BEGINNING)开始订阅 我们假设需要将 Kafka 中的数据导入到 example_db 中的 test1 以及 test2 表中,我们创建了一个名为 test1 的例行导入任务,同时将 test1 和 @@ -420,9 +457,7 @@ FROM data_source [data_source_properties] ); ``` -3. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -446,9 +481,7 @@ FROM data_source [data_source_properties] ); ``` -4. 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan - - +- 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -476,9 +509,7 @@ FROM data_source [data_source_properties] ); ``` -5. 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 - - +- 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -501,9 +532,7 @@ FROM data_source [data_source_properties] ); ``` -6. 导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 - - +- 导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -529,9 +558,7 @@ FROM data_source [data_source_properties] ); ``` -7. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -556,9 +583,7 @@ FROM data_source [data_source_properties] ); ``` -8. 导入数据到含有 sequence 列的 Unique Key 模型表中 - - +- 导入数据到含有 sequence 列的 Unique Key 模型表中 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -580,9 +605,7 @@ FROM data_source [data_source_properties] ); ``` -9. 
从指定的时间点开始消费 - - +- 从指定的时间点开始消费 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -598,31 +621,4 @@ FROM data_source [data_source_properties] "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## 关键词 - - CREATE, ROUTINE, LOAD, CREATE LOAD - -### 最佳实践 - -关于指定消费的 Partition 和 Offset - -Doris 支持指定 Partition 和 Offset 开始消费,还支持了指定时间点进行消费的功能。这里说明下对应参数的配置关系。 - -有三个相关参数: - -- `kafka_partitions`:指定待消费的 partition 列表,如:"0, 1, 2, 3"。 -- `kafka_offsets`:指定每个分区的起始 offset,必须和 `kafka_partitions` 列表个数对应。如:"1000, 1000, 2000, 2000" -- `property.kafka_default_offsets:指定分区默认的起始 offset。 - -在创建导入作业时,这三个参数可以有以下组合: - -| 组合 | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | 行为 | -| ---- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | 系统会自动查找 topic 对应的所有分区并从 OFFSET_END 开始消费 | -| 2 | No | No | Yes | 系统会自动查找 topic 对应的所有分区并从 default offset 指定的位置开始消费 | -| 3 | Yes | No | No | 系统会从指定分区的 OFFSET_END 开始消费 | -| 4 | Yes | Yes | No | 系统会从指定分区的指定 offset 处开始消费 | -| 5 | Yes | No | Yes | 系统会从指定分区,default offset 指定的位置开始消费 | - + ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 20a9e5f8c187e..4653c3b3c7ed8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -24,34 +24,51 @@ specific language governing permissions and limitations under the License. --> - - - ## 描述 -用于暂停一个 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 +该语法用于暂停一个或所有 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 + +## 语法 ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示暂停所有例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 作业被暂停后,可以通过 RESUME 命令重新启动 +- 暂停操作不会影响已经下发到 BE 的任务,这些任务会继续执行完成 + ## 示例 -1. 暂停名称为 test1 的例行导入作业。 +- 暂停名称为 test1 的例行导入作业。 ```sql PAUSE ROUTINE LOAD FOR test1; ``` -2. 暂停所有例行导入作业。 +- 暂停所有例行导入作业。 ```sql PAUSE ALL ROUTINE LOAD; ``` - -## 关键词 - - PAUSE, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index d5d1d861f79c6..3371f39a0d5e8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -24,34 +24,52 @@ specific language governing permissions and limitations under the License. 
--> - - - ## 描述 -用于重启一个被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 +该语法用于重启一个或所有被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 + +## 语法 ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示重启所有被暂停的例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 只能重启处于 PAUSED 状态的作业 +- 重启后的作业会从上次消费的位置继续消费数据 +- 如果作业被暂停时间过长,可能会因为 Kafka 数据过期导致重启失败 + ## 示例 -1. 重启名称为 test1 的例行导入作业。 +- 重启名称为 test1 的例行导入作业。 ```sql RESUME ROUTINE LOAD FOR test1; ``` -2. 重启所有例行导入作业。 +- 重启所有例行导入作业。 ```sql RESUME ALL ROUTINE LOAD; ``` - -## 关键词 - - RESUME, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index d7730520a3984..9931c745df012 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -33,27 +33,37 @@ under the License. 结果中的 kafka partition 和 offset 展示的当前消费的 partition,以及对应的待消费的 offset。 -语法: +## 语法 ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -说明: - 1. `ALL`: 可选参数,代表获取所有作业,包括历史作业 - 2. `load_name`: 例行导入作业名称 +## 必选参数 + +**1. `load_name`** + +> 例行导入作业名称 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数,代表获取所有作业,包括历史作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | ## 示例 -1. 展示默认 db 下指定例行导入作业的创建语句 +- 展示默认 db 下指定例行导入作业的创建语句 ```sql SHOW CREATE ROUTINE LOAD for test_load ``` -## 关键词 - - SHOW, CREATE, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 5fe3e02b87a32..252791c572402 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -24,54 +24,57 @@ specific language governing permissions and limitations under the License. 
--> - - - ## 描述 -查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 +该语法用于查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 + +## 语法 ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -返回结果如下: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## 必选参数 -- `TaskId`:子任务的唯一 ID。 -- `TxnId`:子任务对应的导入事务 ID。 -- `TxnStatus`:子任务对应的导入事务状态。为 null 时表示子任务还未开始调度。 -- `JobId`:子任务对应的作业 ID。 -- `CreateTime`:子任务的创建时间。 -- `ExecuteStartTime`:子任务被调度执行的时间,通常晚于创建时间。 -- `Timeout`:子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍。 -- `BeId`:执行这个子任务的 BE 节点 ID。 -- `DataSourceProperties`:子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id。Value 为消费的起始 offset。 +**1. `job_name`** -## 示例 +> 要查看的例行导入作业名称。 -1. 展示名为 test1 的例行导入任务的子任务信息。 +## 返回结果 - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +返回结果包含以下字段: -## 关键词 +| 字段名 | 说明 | +| :------------------- | :---------------------------------------------------------- | +| TaskId | 子任务的唯一 ID | +| TxnId | 子任务对应的导入事务 ID | +| TxnStatus | 子任务对应的导入事务状态。为 null 时表示子任务还未开始调度 | +| JobId | 子任务对应的作业 ID | +| CreateTime | 子任务的创建时间 | +| ExecuteStartTime | 子任务被调度执行的时间,通常晚于创建时间 | +| Timeout | 子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍 | +| BeId | 执行这个子任务的 BE 节点 ID | +| DataSourceProperties | 子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id,Value 为消费的起始 offset | - SHOW, ROUTINE, LOAD, TASK +## 权限控制 -### 最佳实践 +执行此 SQL 命令的用户必须至少具有以下权限: -通过这个命令,可以查看一个 Routine Load 作业当前有多少子任务在运行,具体运行在哪个 BE 节点上。 +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD TASK 需要对表有LOAD权限 | + +## 注意事项 + +- TxnStatus 为 null 不代表任务出错,可能是任务还未开始调度 +- DataSourceProperties 中的 offset 信息可用于追踪数据消费进度 +- Timeout 时间到达后,任务会自动结束,无论是否完成数据消费 + +## 示例 + +- 展示名为 test1 的例行导入任务的子任务信息。 + + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 44d5a18b7d58b..24da597680db2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -24,105 +24,118 @@ specific language governing permissions and limitations under the License. 
--> - - - - ## 描述 -该语句用于展示 Routine Load 作业运行状态 +该语句用于展示 Routine Load 作业运行状态。可以查看指定作业或所有作业的状态信息。 -语法: +## 语法 ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; +SHOW [] ROUTINE LOAD [FOR ]; ``` -结果说明: - -``` - Id: 作业ID - Name: 作业名称 - CreateTime: 作业创建时间 - PauseTime: 最近一次作业暂停时间 - EndTime: 作业结束时间 - DbName: 对应数据库名称 - TableName: 对应表名称 (多表的情况下由于是动态表,因此不显示具体表名,我们统一显示 multi-table ) - IsMultiTbl: 是否为多表 - State: 作业运行状态 - DataSourceType: 数据源类型:KAFKA - CurrentTaskNum: 当前子任务数量 - JobProperties: 作业配置详情 -DataSourceProperties: 数据源配置详情 - CustomProperties: 自定义配置 - Statistic: 作业运行状态统计信息 - Progress: 作业运行进度 - Lag: 作业延迟状态 -ReasonOfStateChanged: 作业状态变更的原因 - ErrorLogUrls: 被过滤的质量不合格的数据的查看地址 - OtherMsg: 其他错误信息 -``` - -* State - - 有以下 5 种 State: - * NEED_SCHEDULE:作业等待被调度 - * RUNNING:作业运行中 - * PAUSED:作业被暂停 - * STOPPED:作业已结束 - * CANCELLED:作业已取消 - -* Progress - - 对于 Kafka 数据源,显示每个分区当前已消费的 offset。如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2。 - -* Lag - - 对于 Kafka 数据源,显示每个分区的消费延迟。如{"0":10} 表示 Kafka 分区 0 的消费延迟为 10。 +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 + +**2. `[FOR jobName]`** + +> 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 +> +> 支持以下形式: +> +> - `job_name`: 显示当前数据库下指定名称的作业 +> - `db_name.job_name`: 显示指定数据库下指定名称的作业 + +## 返回结果 + +| 字段名 | 说明 | +| :-------------------- | :---------------------------------------------------------- | +| Id | 作业ID | +| Name | 作业名称 | +| CreateTime | 作业创建时间 | +| PauseTime | 最近一次作业暂停时间 | +| EndTime | 作业结束时间 | +| DbName | 对应数据库名称 | +| TableName | 对应表名称(多表情况下显示 multi-table) | +| IsMultiTbl | 是否为多表 | +| State | 作业运行状态 | +| DataSourceType | 数据源类型:KAFKA | +| CurrentTaskNum | 当前子任务数量 | +| JobProperties | 作业配置详情 | +| DataSourceProperties | 数据源配置详情 | +| CustomProperties | 自定义配置 | +| Statistic | 作业运行状态统计信息 | +| Progress | 作业运行进度 | +| Lag | 作业延迟状态 | +| ReasonOfStateChanged | 作业状态变更的原因 | +| ErrorLogUrls | 被过滤的质量不合格的数据的查看地址 | +| OtherMsg | 其他错误信息 | + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- State 状态说明: + - NEED_SCHEDULE:作业等待被调度 + - RUNNING:作业运行中 + - PAUSED:作业被暂停 + - STOPPED:作业已结束 + - CANCELLED:作业已取消 + +- Progress 说明: + - 对于 Kafka 数据源,显示每个分区当前已消费的 offset + - 例如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2 + +- Lag 说明: + - 对于 Kafka 数据源,显示每个分区的消费延迟 + - 例如 {"0":10} 表示 Kafka 分区 0 的消费延迟为 10 ## 示例 -1. 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业) ```sql SHOW ALL ROUTINE LOAD FOR test1; ``` -2. 展示名称为 test1 的当前正在运行的例行导入作业 +- 展示名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR test1; ``` -3. 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql use example_db; SHOW ALL ROUTINE LOAD; ``` -4. 显示 example_db 下,所有正在运行的例行导入作业 +- 显示 example_db 下,所有正在运行的例行导入作业 ```sql use example_db; SHOW ROUTINE LOAD; ``` -5. 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 +- 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR example_db.test1; ``` -6. 
显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql SHOW ALL ROUTINE LOAD FOR example_db.test1; ``` -## 关键词 - - SHOW, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index f5fe6269abd6a..57c5f8abfecee 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -24,28 +24,50 @@ specific language governing permissions and limitations under the License. --> - - - ## 描述 -用户停止一个 Routine Load 作业。被停止的作业无法再重新运行。 +该语法用于停止一个 Routine Load 作业。被停止的作业无法再重新运行,这与 PAUSE 命令不同。如果需要重新导入数据,需要创建新的导入作业。 + +## 语法 ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` +## 必选参数 + +**1. `job_name`** + +> 指定要停止的作业名称。可以是以下形式: +> +> - `job_name`: 停止当前数据库下指定名称的作业 +> - `db_name.job_name`: 停止指定数据库下指定名称的作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 停止操作是不可逆的,被停止的作业无法通过 RESUME 命令重新启动 +- 停止操作会立即生效,正在执行的任务会被中断 +- 建议在停止作业前先通过 SHOW ROUTINE LOAD 命令检查作业状态 +- 如果只是临时暂停作业,建议使用 PAUSE 命令 + ## 示例 -1. 停止名称为 test1 的例行导入作业。 +- 停止名称为 test1 的例行导入作业。 ```sql STOP ROUTINE LOAD FOR test1; ``` -## 关键词 - - STOP, ROUTINE, LOAD - -### 最佳实践 +- 停止指定数据库下的例行导入作业。 + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 9518c07fd2544..68c1c2760d937 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -28,70 +28,75 @@ under the License. ## 描述 -该语法用于修改已经创建的例行导入作业。 +该语法用于修改已经创建的例行导入作业。只能修改处于 PAUSED 状态的作业。 -只能修改处于 PAUSED 状态的作业。 - -语法: +## 语法 ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - 指定要修改的作业名称。 +## 必选参数 -2. `tbl_name` +**1. `[db.]job_name`** - 指定需要导入的表的名称。 +> 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 +> +> 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -3. `job_properties` +**2. `job_properties`** - 指定需要修改的作业参数。目前仅支持如下参数的修改: +> 指定需要修改的作业参数。目前支持修改的参数包括: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. 
`timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` +**3. `data_source`** +> 数据源的类型。当前支持: +> +> - KAFKA -4. `data_source` +**4. `data_source_properties`** - 数据源的类型。当前支持: +> 数据源的相关属性。目前支持: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - 自定义 property,如 property.group.id - KAFKA +## 权限控制 -5. `data_source_properties` +执行此 SQL 命令的用户必须至少具有以下权限: - 数据源的相关属性。目前仅支持: +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. 自定义 property,如 `property.group.id` +## 注意事项 - 注: - - 1. `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 +- `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 ## 示例 -1. 将 `desired_concurrent_number` 修改为 1 +- 将 `desired_concurrent_number` 修改为 1 ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -101,7 +106,7 @@ FROM data_source ); ``` -2. 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id。 +- 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -115,10 +120,5 @@ FROM data_source "kafka_offsets" = "100, 200, 100", "property.group.id" = "new_group" ); - -## 关键词 - - ALTER, ROUTINE, LOAD - -## 最佳实践 + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 00b89ebcd1462..4ac39f6df86a4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -27,350 +27,387 @@ under the License. + + ## 描述 -例行导入(Routine Load)功能,支持用户提交一个常驻的导入任务,通过不断的从指定的数据源读取数据,将数据导入到 Doris 中。 +例行导入(Routine Load)功能支持用户提交一个常驻的导入任务,通过不断地从指定的数据源读取数据,将数据导入到 Doris 中。 -目前仅支持通过无认证或者 SSL 认证方式,从 Kakfa 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) +目前仅支持通过无认证或者 SSL 认证方式,从 Kafka 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) -语法: +## 语法 ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` +## 必选参数 + +**1. `[db.]job_name`** + +> 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 + +**2. `FROM data_source`** + +> 数据源的类型。当前支持:KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` + +> 2. `kafka_topic` +> +> 指定要订阅的 Kafka 的 topic。 +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## 可选参数 + +**1. 
`tbl_name`** + +> 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 +> +> 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, +> 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。 + +> csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败。 +> +> tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 + + +**2. `merge_type`** + +> 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 + +**3. `load_properties`** + +> 用于描述导入数据。组成如下: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> 指定列分隔符,默认为 `\t` +> +> `COLUMNS TERMINATED BY ","` + +> 2. `columns_mapping` +> +> 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> tips: 动态表不支持此参数。 + +> 3. `preceding_filter` +> +> 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 动态表不支持此参数。 +> +> 4. `where_predicates` +> +> 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 + +> 5. `partitions` +> +> 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 +> +> `PARTITION(p1, p2, p3)` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 6. `DELETE ON` +> +> 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 +> +> `DELETE ON v3 >100` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 7. `ORDER BY` +> +> 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +**4. `job_properties`** + +> 用于指定例行导入作业的通用参数。 +> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> 目前我们支持以下参数: +> +> 1. `desired_concurrent_number` +> +> 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 +> +> 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> 这三个参数分别表示: +> +> 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 +> 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 20000000。 +> 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 1G。 +> +> 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 4. `strict_mode` +> +> 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: +> +> `"strict_mode" = "true"` +> +> strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: +> +> 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 +> 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 +> 3. 
对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 +> +> **strict mode 与 source data 的导入关系** +> +> 这里以列类型为 TinyInt 来举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> 这里以列类型为 Decimal(1,0) 举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 +> +> 5. `timezone` +> +> 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> 指定导入数据格式,默认是 csv,支持 json 格式。 +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 +> +> `-H "json_root: $.RECORDS"` +> +> 10. `send_batch_parallelism` +> +> 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 14. `enclose` +> +> 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 +> +> 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 +> +> 15. `escape` +> +> 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 +> +**5. `data_source_properties` 中的可选属性** + +> 1. 
`kafka_partitions/kafka_offsets` +> +> 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 +> +> offset 可以指定从大于等于 0 的具体 offset,或者: +> +> - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 +> - `OFFSET_END`: 从末尾开始订阅。 +> - 时间格式,如:"2021-05-22 11:00:00" +> +> 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> 注意,时间格式不能和 OFFSET 格式混用。 +> +> 2. `property` +> +> 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 +> +> 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 +> +> 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 +> +> 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> 其中: +> +> `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 +> +> 如果 Kafka server 端开启了 client 认证,则还需设置: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> 分别用于指定 client 的 public key,private key 以及 private key 的密码。 +> +> 2. 指定 kafka partition 的默认起始 offset +> +> 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 +> +> 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 +> +> 示例: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. 
`COMMENT`** + +> 例行导入任务的注释信息。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | CREATE ROUTINE LOAD 属于表 LOAD 操作 | + +## 注意事项 + +- 动态表不支持 `columns_mapping` 参数 +- 使用动态多表时,merge_type、where_predicates 等参数需要符合每张动态表的要求 +- 时间格式不能和 OFFSET 格式混用 +- `kafka_partitions` 和 `kafka_offsets` 必须一一对应 +- 当 `enclose` 设置为`"`时,`trim_double_quotes` 一定要设置为 true。 -- `[db.]job_name` - - 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 - -- `tbl_name` - - 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 - 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, - 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 - `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败。 - - tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 - -- `merge_type` - - 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 - -- load_properties - - 用于描述导入数据。组成如下: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - 指定列分隔符,默认为 `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - tips: 动态表不支持此参数。 - - - `preceding_filter` - - 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - tips: 动态表不支持此参数。 - - - `where_predicates` - - 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `WHERE k1 > 100 and k2 = 1000` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 - - - `partitions` - - 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 - - `PARTITION(p1, p2, p3)` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `DELETE ON` - - 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 - - `DELETE ON v3 >100` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `ORDER BY` - - 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - -- `job_properties` - - 用于指定例行导入作业的通用参数。 - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - 目前我们支持以下参数: - - 1. `desired_concurrent_number` - - 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 - - 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - 这三个参数分别表示: - - 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 - 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 200000(2.1.5 及更高版本为 20000000)。 - 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 100MB(2.1.5 及更高版本为 1G)。 - - 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 4. `strict_mode` - - 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: - - `"strict_mode" = "true"` - - strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: - - 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 - 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 - 3. 
对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 - - **strict mode 与 source data 的导入关系** - - 这里以列类型为 TinyInt 来举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - 这里以列类型为 Decimal(1,0) 举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 - - 5. `timezone` - - 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 - - 6. `format` - - 指定导入数据格式,默认是 csv,支持 json 格式。 - - 7. `jsonpaths` - - 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - - 8. `strip_outer_array` - - 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 - - `-H "strip_outer_array: true"` - - 9. `json_root` - - 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 - - `-H "json_root: $.RECORDS"` - - 10. `send_batch_parallelism` - - 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 - - 11. `load_to_single_tablet` - - 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 - - 12. `partial_columns` - 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 - - 13. `max_filter_ratio` - - 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 14. `enclose` - 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 - 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 - - 15. `escape` - 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 - -- `FROM data_source [data_source_properties]` - - 数据源的类型。当前支持: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` 支持如下数据源属性: - - 1. `kafka_broker_list` - - Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - 指定要订阅的 Kafka 的 topic。 - - `"kafka_topic" = "my_topic"` - - 3. 
`kafka_partitions/kafka_offsets` - - 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 - - offset 可以指定从大于等于 0 的具体 offset,或者: - - - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 - - `OFFSET_END`: 从末尾开始订阅。 - - 时间格式,如:"2021-05-22 11:00:00" - - 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - 注意,时间格式不能和 OFFSET 格式混用。 - - 4. `property` - - 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 - - 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 - - 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 - - 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - 其中: - - `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 - - 如果 Kafka server 端开启了 client 认证,则还需设置: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - 分别用于指定 client 的 public key,private key 以及 private key 的密码。 - - 2. 指定 kafka partition 的默认起始 offset - - 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 - - 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 - - 示例: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - 例行导入任务的注释信息。 - - 该功能自 Apache Doris 1.2.3 版本起支持 -::: ## 示例 -1. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -394,7 +431,7 @@ FROM data_source [data_source_properties] ); ``` -2. 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, +- 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, 且从有数据的位置(OFFSET_BEGINNING)开始订阅 我们假设需要将 Kafka 中的数据导入到 example_db 中的 test1 以及 test2 表中,我们创建了一个名为 test1 的例行导入任务,同时将 test1 和 @@ -420,9 +457,7 @@ FROM data_source [data_source_properties] ); ``` -3. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -446,9 +481,7 @@ FROM data_source [data_source_properties] ); ``` -4. 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan - - +- 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -476,9 +509,7 @@ FROM data_source [data_source_properties] ); ``` -5. 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 - - +- 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -501,9 +532,7 @@ FROM data_source [data_source_properties] ); ``` -6. 
导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 - - +- 导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -529,9 +558,7 @@ FROM data_source [data_source_properties] ); ``` -7. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -556,9 +583,7 @@ FROM data_source [data_source_properties] ); ``` -8. 导入数据到含有 sequence 列的 Unique Key 模型表中 - - +- 导入数据到含有 sequence 列的 Unique Key 模型表中 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -580,9 +605,7 @@ FROM data_source [data_source_properties] ); ``` -9. 从指定的时间点开始消费 - - +- 从指定的时间点开始消费 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -598,31 +621,4 @@ FROM data_source [data_source_properties] "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## 关键词 - - CREATE, ROUTINE, LOAD, CREATE LOAD - -## 最佳实践 - -关于指定消费的 Partition 和 Offset - -Doris 支持指定 Partition 和 Offset 开始消费,还支持了指定时间点进行消费的功能。这里说明下对应参数的配置关系。 - -有三个相关参数: - -- `kafka_partitions`:指定待消费的 partition 列表,如:"0, 1, 2, 3"。 -- `kafka_offsets`:指定每个分区的起始 offset,必须和 `kafka_partitions` 列表个数对应。如:"1000, 1000, 2000, 2000" -- `property.kafka_default_offsets:指定分区默认的起始 offset。 - -在创建导入作业时,这三个参数可以有以下组合: - -| 组合 | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | 行为 | -| ---- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | 系统会自动查找 topic 对应的所有分区并从 OFFSET_END 开始消费 | -| 2 | No | No | Yes | 系统会自动查找 topic 对应的所有分区并从 default offset 指定的位置开始消费 | -| 3 | Yes | No | No | 系统会从指定分区的 OFFSET_END 开始消费 | -| 4 | Yes | Yes | No | 系统会从指定分区的指定 offset 处开始消费 | -| 5 | Yes | No | Yes | 系统会从指定分区,default offset 指定的位置开始消费 | - + ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 319c470a3e680..4653c3b3c7ed8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -24,33 +24,51 @@ specific language governing permissions and limitations under the License. --> - - ## 描述 -用于暂停一个 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 +该语法用于暂停一个或所有 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 + +## 语法 ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示暂停所有例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 作业被暂停后,可以通过 RESUME 命令重新启动 +- 暂停操作不会影响已经下发到 BE 的任务,这些任务会继续执行完成 + ## 示例 -1. 暂停名称为 test1 的例行导入作业。 +- 暂停名称为 test1 的例行导入作业。 ```sql PAUSE ROUTINE LOAD FOR test1; ``` -2. 
暂停所有例行导入作业。 +- 暂停所有例行导入作业。 ```sql PAUSE ALL ROUTINE LOAD; ``` - -## 关键词 - - PAUSE, ROUTINE, LOAD - -## 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 13ce13b37e1e9..3371f39a0d5e8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -24,33 +24,52 @@ specific language governing permissions and limitations under the License. --> - - ## 描述 -用于重启一个被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 +该语法用于重启一个或所有被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 + +## 语法 ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示重启所有被暂停的例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 只能重启处于 PAUSED 状态的作业 +- 重启后的作业会从上次消费的位置继续消费数据 +- 如果作业被暂停时间过长,可能会因为 Kafka 数据过期导致重启失败 + ## 示例 -1. 重启名称为 test1 的例行导入作业。 +- 重启名称为 test1 的例行导入作业。 ```sql RESUME ROUTINE LOAD FOR test1; ``` -2. 重启所有例行导入作业。 +- 重启所有例行导入作业。 ```sql RESUME ALL ROUTINE LOAD; ``` - -## 关键词 - - RESUME, ROUTINE, LOAD - -## 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 61ccd0c43ae8b..9931c745df012 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -26,33 +26,44 @@ under the License. + ## 描述 该语句用于展示例行导入作业的创建语句。 结果中的 kafka partition 和 offset 展示的当前消费的 partition,以及对应的待消费的 offset。 -语法: +## 语法 ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -说明: - 1. `ALL`: 可选参数,代表获取所有作业,包括历史作业 - 2. `load_name`: 例行导入作业名称 +## 必选参数 + +**1. `load_name`** + +> 例行导入作业名称 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数,代表获取所有作业,包括历史作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | ## 示例 -1. 
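- 展示包含历史作业在内的指定例行导入作业的创建语句(test_load 为示例作业名,仅为示意)

    ```sql
    SHOW ALL CREATE ROUTINE LOAD for test_load;
    ```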
展示默认 db 下指定例行导入作业的创建语句 +- 展示默认 db 下指定例行导入作业的创建语句 ```sql SHOW CREATE ROUTINE LOAD for test_load ``` -## 关键词 - - SHOW, CREATE, ROUTINE, LOAD - -## 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 2caad50e02bbd..252791c572402 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -24,53 +24,57 @@ specific language governing permissions and limitations under the License. --> - - ## 描述 -查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 +该语法用于查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 + +## 语法 ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -返回结果如下: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## 必选参数 -- `TaskId`:子任务的唯一 ID。 -- `TxnId`:子任务对应的导入事务 ID。 -- `TxnStatus`:子任务对应的导入事务状态。为 null 时表示子任务还未开始调度。 -- `JobId`:子任务对应的作业 ID。 -- `CreateTime`:子任务的创建时间。 -- `ExecuteStartTime`:子任务被调度执行的时间,通常晚于创建时间。 -- `Timeout`:子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍。 -- `BeId`:执行这个子任务的 BE 节点 ID。 -- `DataSourceProperties`:子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id。Value 为消费的起始 offset。 +**1. `job_name`** -## 示例 +> 要查看的例行导入作业名称。 -1. 展示名为 test1 的例行导入任务的子任务信息。 +## 返回结果 - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +返回结果包含以下字段: + +| 字段名 | 说明 | +| :------------------- | :---------------------------------------------------------- | +| TaskId | 子任务的唯一 ID | +| TxnId | 子任务对应的导入事务 ID | +| TxnStatus | 子任务对应的导入事务状态。为 null 时表示子任务还未开始调度 | +| JobId | 子任务对应的作业 ID | +| CreateTime | 子任务的创建时间 | +| ExecuteStartTime | 子任务被调度执行的时间,通常晚于创建时间 | +| Timeout | 子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍 | +| BeId | 执行这个子任务的 BE 节点 ID | +| DataSourceProperties | 子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id,Value 为消费的起始 offset | -## 关键词 +## 权限控制 - SHOW, ROUTINE, LOAD, TASK +执行此 SQL 命令的用户必须至少具有以下权限: -## 最佳实践 +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD TASK 需要对表有LOAD权限 | -通过这个命令,可以查看一个 Routine Load 作业当前有多少子任务在运行,具体运行在哪个 BE 节点上。 +## 注意事项 + +- TxnStatus 为 null 不代表任务出错,可能是任务还未开始调度 +- DataSourceProperties 中的 offset 信息可用于追踪数据消费进度 +- Timeout 时间到达后,任务会自动结束,无论是否完成数据消费 + +## 示例 + +- 展示名为 test1 的例行导入任务的子任务信息。 + + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 3ee7c0fd2ccfb..24da597680db2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ 
-24,104 +24,118 @@ specific language governing permissions and limitations under the License. --> - - ## 描述 -该语句用于展示 Routine Load 作业运行状态 +该语句用于展示 Routine Load 作业运行状态。可以查看指定作业或所有作业的状态信息。 -语法: +## 语法 ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; -``` - -结果说明: - -``` - Id: 作业ID - Name: 作业名称 - CreateTime: 作业创建时间 - PauseTime: 最近一次作业暂停时间 - EndTime: 作业结束时间 - DbName: 对应数据库名称 - TableName: 对应表名称 (多表的情况下由于是动态表,因此不显示具体表名,我们统一显示 multi-table ) - IsMultiTbl: 是否为多表 - State: 作业运行状态 - DataSourceType: 数据源类型:KAFKA - CurrentTaskNum: 当前子任务数量 - JobProperties: 作业配置详情 -DataSourceProperties: 数据源配置详情 - CustomProperties: 自定义配置 - Statistic: 作业运行状态统计信息 - Progress: 作业运行进度 - Lag: 作业延迟状态 -ReasonOfStateChanged: 作业状态变更的原因 - ErrorLogUrls: 被过滤的质量不合格的数据的查看地址 - OtherMsg: 其他错误信息 +SHOW [] ROUTINE LOAD [FOR ]; ``` -* State - - 有以下 5 种 State: - - * NEED_SCHEDULE:作业等待被调度 - * RUNNING:作业运行中 - * PAUSED:作业被暂停 - * STOPPED:作业已结束 - * CANCELLED:作业已取消 - -* Progress - - 对于 Kafka 数据源,显示每个分区当前已消费的 offset。如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2。 - -* Lag - - 对于 Kafka 数据源,显示每个分区的消费延迟。如{"0":10} 表示 Kafka 分区 0 的消费延迟为 10。 +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 + +**2. `[FOR jobName]`** + +> 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 +> +> 支持以下形式: +> +> - `job_name`: 显示当前数据库下指定名称的作业 +> - `db_name.job_name`: 显示指定数据库下指定名称的作业 + +## 返回结果 + +| 字段名 | 说明 | +| :-------------------- | :---------------------------------------------------------- | +| Id | 作业ID | +| Name | 作业名称 | +| CreateTime | 作业创建时间 | +| PauseTime | 最近一次作业暂停时间 | +| EndTime | 作业结束时间 | +| DbName | 对应数据库名称 | +| TableName | 对应表名称(多表情况下显示 multi-table) | +| IsMultiTbl | 是否为多表 | +| State | 作业运行状态 | +| DataSourceType | 数据源类型:KAFKA | +| CurrentTaskNum | 当前子任务数量 | +| JobProperties | 作业配置详情 | +| DataSourceProperties | 数据源配置详情 | +| CustomProperties | 自定义配置 | +| Statistic | 作业运行状态统计信息 | +| Progress | 作业运行进度 | +| Lag | 作业延迟状态 | +| ReasonOfStateChanged | 作业状态变更的原因 | +| ErrorLogUrls | 被过滤的质量不合格的数据的查看地址 | +| OtherMsg | 其他错误信息 | + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- State 状态说明: + - NEED_SCHEDULE:作业等待被调度 + - RUNNING:作业运行中 + - PAUSED:作业被暂停 + - STOPPED:作业已结束 + - CANCELLED:作业已取消 + +- Progress 说明: + - 对于 Kafka 数据源,显示每个分区当前已消费的 offset + - 例如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2 + +- Lag 说明: + - 对于 Kafka 数据源,显示每个分区的消费延迟 + - 例如 {"0":10} 表示 Kafka 分区 0 的消费延迟为 10 ## 示例 -1. 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业) ```sql SHOW ALL ROUTINE LOAD FOR test1; ``` -2. 展示名称为 test1 的当前正在运行的例行导入作业 +- 展示名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR test1; ``` -3. 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql use example_db; SHOW ALL ROUTINE LOAD; ``` -4. 显示 example_db 下,所有正在运行的例行导入作业 +- 显示 example_db 下,所有正在运行的例行导入作业 ```sql use example_db; SHOW ROUTINE LOAD; ``` -5. 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 +- 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR example_db.test1; ``` -6. 
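- 一个常见的排查流程示意(test1 为示例作业名):先查看作业状态,若 State 为 PAUSED,可根据 ReasonOfStateChanged 与 ErrorLogUrls 排查问题后再恢复作业

    ```sql
    SHOW ROUTINE LOAD FOR test1;
    -- 确认问题已处理后恢复作业
    RESUME ROUTINE LOAD FOR test1;
    ```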
显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql SHOW ALL ROUTINE LOAD FOR example_db.test1; ``` -## 关键词 - - SHOW, ROUTINE, LOAD - -## 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index ad41882a42e4e..57c5f8abfecee 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -24,27 +24,50 @@ specific language governing permissions and limitations under the License. --> - - ## 描述 -用户停止一个 Routine Load 作业。被停止的作业无法再重新运行。 +该语法用于停止一个 Routine Load 作业。被停止的作业无法再重新运行,这与 PAUSE 命令不同。如果需要重新导入数据,需要创建新的导入作业。 + +## 语法 ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` +## 必选参数 + +**1. `job_name`** + +> 指定要停止的作业名称。可以是以下形式: +> +> - `job_name`: 停止当前数据库下指定名称的作业 +> - `db_name.job_name`: 停止指定数据库下指定名称的作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 停止操作是不可逆的,被停止的作业无法通过 RESUME 命令重新启动 +- 停止操作会立即生效,正在执行的任务会被中断 +- 建议在停止作业前先通过 SHOW ROUTINE LOAD 命令检查作业状态 +- 如果只是临时暂停作业,建议使用 PAUSE 命令 + ## 示例 -1. 停止名称为 test1 的例行导入作业。 +- 停止名称为 test1 的例行导入作业。 ```sql STOP ROUTINE LOAD FOR test1; ``` -## 关键词 - - STOP, ROUTINE, LOAD - -## 最佳实践 +- 停止指定数据库下的例行导入作业。 + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 3cb042cbb68c7..68c1c2760d937 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -28,70 +28,75 @@ under the License. ## 描述 -该语法用于修改已经创建的例行导入作业。 +该语法用于修改已经创建的例行导入作业。只能修改处于 PAUSED 状态的作业。 -只能修改处于 PAUSED 状态的作业。 - -语法: +## 语法 ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - 指定要修改的作业名称。 +## 必选参数 -2. `tbl_name` +**1. `[db.]job_name`** - 指定需要导入的表的名称。 +> 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 +> +> 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -3. `job_properties` +**2. `job_properties`** - 指定需要修改的作业参数。目前仅支持如下参数的修改: +> 指定需要修改的作业参数。目前支持修改的参数包括: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. 
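下面补充一个调整子任务批次参数的示意(沿用本文示例中的作业名 db1.label1,参数取值仅为示意;执行前需先将该作业置于 PAUSED 状态):

```sql
ALTER ROUTINE LOAD FOR db1.label1
PROPERTIES
(
    "max_batch_interval" = "30",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200"
);
```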
`timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` +**3. `data_source`** +> 数据源的类型。当前支持: +> +> - KAFKA -4. `data_source` +**4. `data_source_properties`** - 数据源的类型。当前支持: +> 数据源的相关属性。目前支持: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - 自定义 property,如 property.group.id - KAFKA +## 权限控制 -5. `data_source_properties` +执行此 SQL 命令的用户必须至少具有以下权限: - 数据源的相关属性。目前仅支持: +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. 自定义 property,如 `property.group.id` +## 注意事项 - 注: - - 1. `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 +- `kafka_partitions` 和 `kafka_offsets` 用于修改待消费的 kafka partition 的 offset,仅能修改当前已经消费的 partition。不能新增 partition。 ## 示例 -1. 将 `desired_concurrent_number` 修改为 1 +- 将 `desired_concurrent_number` 修改为 1 ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -101,7 +106,7 @@ FROM data_source ); ``` -2. 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id。 +- 将 `desired_concurrent_number` 修改为 10,修改 partition 的 offset,修改 group id ```sql ALTER ROUTINE LOAD FOR db1.label1 @@ -115,10 +120,5 @@ FROM data_source "kafka_offsets" = "100, 200, 100", "property.group.id" = "new_group" ); - -## 关键词 - - ALTER, ROUTINE, LOAD - -### 最佳实践 + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 4e343b1d485b8..4ac39f6df86a4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -31,346 +31,383 @@ under the License. ## 描述 -例行导入(Routine Load)功能,支持用户提交一个常驻的导入任务,通过不断的从指定的数据源读取数据,将数据导入到 Doris 中。 +例行导入(Routine Load)功能支持用户提交一个常驻的导入任务,通过不断地从指定的数据源读取数据,将数据导入到 Doris 中。 -目前仅支持通过无认证或者 SSL 认证方式,从 Kakfa 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) +目前仅支持通过无认证或者 SSL 认证方式,从 Kafka 导入 CSV 或 Json 格式的数据。 [导入 Json 格式数据使用示例](../../../../data-operate/import/import-way/routine-load-manual.md#导入Json格式数据使用示例) -语法: +## 语法 ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` -``` - -- `[db.]job_name` - - 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 - -- `tbl_name` - - 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 - 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, - 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 - `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败. 
- - tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 - -- `merge_type` - - 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 - -- load_properties - - 用于描述导入数据。组成如下: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - 指定列分隔符,默认为 `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - tips: 动态表不支持此参数。 - - - `preceding_filter` - - 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - tips: 动态表不支持此参数。 - - - `where_predicates` - - 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 - - `WHERE k1 > 100 and k2 = 1000` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 - - - `partitions` - - 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 - - `PARTITION(p1, p2, p3)` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `DELETE ON` - - 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 - - `DELETE ON v3 >100` - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - - - `ORDER BY` - - 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 - - tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 - -- `job_properties` - - 用于指定例行导入作业的通用参数。 - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - 目前我们支持以下参数: - - 1. `desired_concurrent_number` - - 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 - - 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - 这三个参数分别表示: - - 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 - 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 20000000。 - 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 1G。 - - 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 4. `strict_mode` - - 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: - - `"strict_mode" = "true"` - - strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: - - 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 - 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 - 3. 
对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 - - **strict mode 与 source data 的导入关系** - - 这里以列类型为 TinyInt 来举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - 这里以列类型为 Decimal(1,0) 举例 - - > 注:当表中的列允许导入空值时 - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | 空值 | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 - - 5. `timezone` - - 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 - - 6. `format` - - 指定导入数据格式,默认是 csv,支持 json 格式。 - - 7. `jsonpaths` - - 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - 8. `strip_outer_array` +## 必选参数 + +**1. `[db.]job_name`** + +> 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 + +**2. `FROM data_source`** + +> 数据源的类型。当前支持:KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` + +> 2. `kafka_topic` +> +> 指定要订阅的 Kafka 的 topic。 +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## 可选参数 + +**1. `tbl_name`** + +> 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 +> +> 目前仅支持从 Kafka 的 Value 中获取表名,且需要符合这种格式:以 json 为例:`table_name|{"col1": "val1", "col2": "val2"}`, +> 其中 `tbl_name` 为表名,以 `|` 作为表名和表数据的分隔符。 + +> csv 格式的数据也是类似的,如:`table_name|val1,val2,val3`。注意,这里的 `table_name` 必须和 Doris 中的表名一致,否则会导致导入失败。 +> +> tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 + + +**2. `merge_type`** + +> 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 + +**3. `load_properties`** + +> 用于描述导入数据。组成如下: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> 指定列分隔符,默认为 `\t` +> +> `COLUMNS TERMINATED BY ","` + +> 2. `columns_mapping` +> +> 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> tips: 动态表不支持此参数。 + +> 3. `preceding_filter` +> +> 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 动态表不支持此参数。 +> +> 4. `where_predicates` +> +> 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 +> +> `WHERE k1 > 100 and k2 = 1000` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 + +> 5. `partitions` +> +> 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 +> +> `PARTITION(p1, p2, p3)` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 6. 
`DELETE ON` +> +> 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 +> +> `DELETE ON v3 >100` +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +> 7. `ORDER BY` +> +> 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 +> +> tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 + +**4. `job_properties`** + +> 用于指定例行导入作业的通用参数。 +> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> 目前我们支持以下参数: +> +> 1. `desired_concurrent_number` +> +> 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 +> +> 这个并发度并不是实际的并发度,实际的并发度,会通过集群的节点数、负载情况,以及数据源的情况综合考虑。 +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> 这三个参数分别表示: +> +> 1. 每个子任务最大执行时间,单位是秒。必须大于等于 1。默认为 10。 +> 2. 每个子任务最多读取的行数。必须大于等于 200000。默认是 20000000。 +> 3. 每个子任务最多读取的字节数。单位是字节,范围是 100MB 到 10GB。默认是 1G。 +> +> 这三个参数,用于控制一个子任务的执行时间和处理量。当任意一个达到阈值,则任务结束。 +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数大于 `max_error_number`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 4. `strict_mode` +> +> 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: +> +> `"strict_mode" = "true"` +> +> strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: +> +> 1. 对于列类型转换来说,如果 strict mode 为 true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 +> 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 +> 3. 对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 +> +> **strict mode 与 source data 的导入关系** +> +> 这里以列类型为 TinyInt 来举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> 这里以列类型为 Decimal(1,0) 举例 +> +> 注:当表中的列允许导入空值时 +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | 空值 | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 +> +> 5. `timezone` +> +> 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> 指定导入数据格式,默认是 csv,支持 json 格式。 +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 +> +> `-H "json_root: $.RECORDS"` +> +> 10. 
`send_batch_parallelism` +> +> 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 +> +> 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 +> +> 被 where 条件过滤掉的行不算错误行。 +> +> 14. `enclose` +> +> 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 +> +> 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 +> +> 15. `escape` +> +> 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 +> +**5. `data_source_properties` 中的可选属性** + +> 1. `kafka_partitions/kafka_offsets` +> +> 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 +> +> offset 可以指定从大于等于 0 的具体 offset,或者: +> +> - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 +> - `OFFSET_END`: 从末尾开始订阅。 +> - 时间格式,如:"2021-05-22 11:00:00" +> +> 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> 注意,时间格式不能和 OFFSET 格式混用。 +> +> 2. `property` +> +> 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 +> +> 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 +> +> 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 +> +> 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> 其中: +> +> `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 +> +> 如果 Kafka server 端开启了 client 认证,则还需设置: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> 分别用于指定 client 的 public key,private key 以及 private key 的密码。 +> +> 2. 指定 kafka partition 的默认起始 offset +> +> 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 +> +> 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 +> +> 示例: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. 
`COMMENT`** + +> 例行导入任务的注释信息。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | CREATE ROUTINE LOAD 属于表 LOAD 操作 | + +## 注意事项 + +- 动态表不支持 `columns_mapping` 参数 +- 使用动态多表时,merge_type、where_predicates 等参数需要符合每张动态表的要求 +- 时间格式不能和 OFFSET 格式混用 +- `kafka_partitions` 和 `kafka_offsets` 必须一一对应 +- 当 `enclose` 设置为`"`时,`trim_double_quotes` 一定要设置为 true。 - 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 - - `-H "strip_outer_array: true"` - - 9. `json_root` - - 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 - - `-H "json_root: $.RECORDS"` - - 10. `send_batch_parallelism` - - 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 - - 11. `load_to_single_tablet` - - 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 - - 12. `partial_columns` - 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 - - 13. `max_filter_ratio` - - 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 - - 采样窗口为 `max_batch_rows * 10`。即如果在采样窗口内,错误行数/总行数大于 `max_filter_ratio`,则会导致例行作业被暂停,需要人工介入检查数据质量问题。 - - 被 where 条件过滤掉的行不算错误行。 - - 14. `enclose` - 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 - 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 - - 15. `escape` - 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 - -- `FROM data_source [data_source_properties]` - - 数据源的类型。当前支持: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` 支持如下数据源属性: - - 1. `kafka_broker_list` - - Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - 指定要订阅的 Kafka 的 topic。 - - `"kafka_topic" = "my_topic"` - - 3. `kafka_partitions/kafka_offsets` - - 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 - - offset 可以指定从大于等于 0 的具体 offset,或者: - - - `OFFSET_BEGINNING`: 从有数据的位置开始订阅。 - - `OFFSET_END`: 从末尾开始订阅。 - - 时间格式,如:"2021-05-22 11:00:00" - - 如果没有指定,则默认从 `OFFSET_END` 开始订阅 topic 下的所有 partition。 - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - 注意,时间格式不能和 OFFSET 格式混用。 - - 4. `property` - - 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 - - 当参数的 value 为一个文件时,需要在 value 前加上关键词:"FILE:"。 - - 关于如何创建文件,请参阅 [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) 命令文档。 - - 更多支持的自定义参数,请参阅 librdkafka 的官方 CONFIGURATION 文档中,client 端的配置项。如: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. 
使用 SSL 连接 Kafka 时,需要指定以下参数: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - 其中: - - `property.security.protocol` 和 `property.ssl.ca.location` 为必须,用于指明连接方式为 SSL,以及 CA 证书的位置。 - - 如果 Kafka server 端开启了 client 认证,则还需设置: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - 分别用于指定 client 的 public key,private key 以及 private key 的密码。 - - 2. 指定 kafka partition 的默认起始 offset - - 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 - - 此时可以指定 `kafka_default_offsets` 指定起始 offset。默认为 `OFFSET_END`,即从末尾开始订阅。 - - 示例: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - 例行导入任务的注释信息。 ## 示例 -1. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区,且从有数据的位置(OFFSET_BEGINNING)开始订阅 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -394,7 +431,7 @@ FROM data_source [data_source_properties] ); ``` -2. 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, +- 为 example_db 创建一个名为 test1 的 Kafka 例行动态多表导入任务。指定列分隔符和 group.id 和 client.id,并且自动默认消费所有分区, 且从有数据的位置(OFFSET_BEGINNING)开始订阅 我们假设需要将 Kafka 中的数据导入到 example_db 中的 test1 以及 test2 表中,我们创建了一个名为 test1 的例行导入任务,同时将 test1 和 @@ -420,9 +457,7 @@ FROM data_source [data_source_properties] ); ``` -3. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。导入任务为严格模式。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -446,9 +481,7 @@ FROM data_source [data_source_properties] ); ``` -4. 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan - - +- 通过 SSL 认证方式,从 Kafka 集群导入数据。同时设置 client.id 参数。导入任务为非严格模式,时区为 Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -476,9 +509,7 @@ FROM data_source [data_source_properties] ); ``` -5. 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 - - +- 导入 Json 格式数据。默认使用 Json 中的字段名作为列名映射。指定导入 0,1,2 三个分区,起始 offset 都为 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -501,9 +532,7 @@ FROM data_source [data_source_properties] ); ``` -6. 导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 - - +- 导入 Json 数据,并通过 Jsonpaths 抽取字段,并指定 Json 文档根节点 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -529,9 +558,7 @@ FROM data_source [data_source_properties] ); ``` -7. 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 - - +- 为 example_db 的 example_tbl 创建一个名为 test1 的 Kafka 例行导入任务。并且使用条件过滤。 ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -556,9 +583,7 @@ FROM data_source [data_source_properties] ); ``` -8. 导入数据到含有 sequence 列的 Unique Key 模型表中 - - +- 导入数据到含有 sequence 列的 Unique Key 模型表中 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -580,9 +605,7 @@ FROM data_source [data_source_properties] ); ``` -9. 
从指定的时间点开始消费 - - +- 从指定的时间点开始消费 ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -598,31 +621,4 @@ FROM data_source [data_source_properties] "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## 关键词 - - CREATE, ROUTINE, LOAD, CREATE LOAD - -### 最佳实践 - -关于指定消费的 Partition 和 Offset - -Doris 支持指定 Partition 和 Offset 开始消费,还支持了指定时间点进行消费的功能。这里说明下对应参数的配置关系。 - -有三个相关参数: - -- `kafka_partitions`:指定待消费的 partition 列表,如:"0, 1, 2, 3"。 -- `kafka_offsets`:指定每个分区的起始 offset,必须和 `kafka_partitions` 列表个数对应。如:"1000, 1000, 2000, 2000" -- `property.kafka_default_offsets:指定分区默认的起始 offset。 - -在创建导入作业时,这三个参数可以有以下组合: - -| 组合 | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | 行为 | -| ---- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | 系统会自动查找 topic 对应的所有分区并从 OFFSET_END 开始消费 | -| 2 | No | No | Yes | 系统会自动查找 topic 对应的所有分区并从 default offset 指定的位置开始消费 | -| 3 | Yes | No | No | 系统会从指定分区的 OFFSET_END 开始消费 | -| 4 | Yes | Yes | No | 系统会从指定分区的指定 offset 处开始消费 | -| 5 | Yes | No | Yes | 系统会从指定分区,default offset 指定的位置开始消费 | - + ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 20a9e5f8c187e..4653c3b3c7ed8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -24,34 +24,51 @@ specific language governing permissions and limitations under the License. --> - - - ## 描述 -用于暂停一个 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 +该语法用于暂停一个或所有 Routine Load 作业。被暂停的作业可以通过 RESUME 命令重新运行。 + +## 语法 ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示暂停所有例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 作业被暂停后,可以通过 RESUME 命令重新启动 +- 暂停操作不会影响已经下发到 BE 的任务,这些任务会继续执行完成 + ## 示例 -1. 暂停名称为 test1 的例行导入作业。 +- 暂停名称为 test1 的例行导入作业。 ```sql PAUSE ROUTINE LOAD FOR test1; ``` -2. 暂停所有例行导入作业。 +- 暂停所有例行导入作业。 ```sql PAUSE ALL ROUTINE LOAD; ``` - -## 关键词 - - PAUSE, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index d5d1d861f79c6..3371f39a0d5e8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -24,34 +24,52 @@ specific language governing permissions and limitations under the License. 
--> - - - ## 描述 -用于重启一个被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 +该语法用于重启一个或所有被暂停的 Routine Load 作业。重启的作业,将继续从之前已消费的 offset 继续消费。 + +## 语法 ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` +## 必选参数 + +**1. `job_name`** + +> 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定 ALL,则表示重启所有被暂停的例行导入作业。 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 只能重启处于 PAUSED 状态的作业 +- 重启后的作业会从上次消费的位置继续消费数据 +- 如果作业被暂停时间过长,可能会因为 Kafka 数据过期导致重启失败 + ## 示例 -1. 重启名称为 test1 的例行导入作业。 +- 重启名称为 test1 的例行导入作业。 ```sql RESUME ROUTINE LOAD FOR test1; ``` -2. 重启所有例行导入作业。 +- 重启所有例行导入作业。 ```sql RESUME ALL ROUTINE LOAD; ``` - -## 关键词 - - RESUME, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index d7730520a3984..9931c745df012 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -33,27 +33,37 @@ under the License. 结果中的 kafka partition 和 offset 展示的当前消费的 partition,以及对应的待消费的 offset。 -语法: +## 语法 ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -说明: - 1. `ALL`: 可选参数,代表获取所有作业,包括历史作业 - 2. `load_name`: 例行导入作业名称 +## 必选参数 + +**1. `load_name`** + +> 例行导入作业名称 + +## 可选参数 + +**1. `[ALL]`** + +> 可选参数,代表获取所有作业,包括历史作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | ## 示例 -1. 展示默认 db 下指定例行导入作业的创建语句 +- 展示默认 db 下指定例行导入作业的创建语句 ```sql SHOW CREATE ROUTINE LOAD for test_load ``` -## 关键词 - - SHOW, CREATE, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 5fe3e02b87a32..252791c572402 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -24,54 +24,57 @@ specific language governing permissions and limitations under the License. 
--> - - - ## 描述 -查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 +该语法用于查看一个指定的 Routine Load 作业的当前正在运行的子任务情况。 + +## 语法 ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -返回结果如下: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## 必选参数 -- `TaskId`:子任务的唯一 ID。 -- `TxnId`:子任务对应的导入事务 ID。 -- `TxnStatus`:子任务对应的导入事务状态。为 null 时表示子任务还未开始调度。 -- `JobId`:子任务对应的作业 ID。 -- `CreateTime`:子任务的创建时间。 -- `ExecuteStartTime`:子任务被调度执行的时间,通常晚于创建时间。 -- `Timeout`:子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍。 -- `BeId`:执行这个子任务的 BE 节点 ID。 -- `DataSourceProperties`:子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id。Value 为消费的起始 offset。 +**1. `job_name`** -## 示例 +> 要查看的例行导入作业名称。 -1. 展示名为 test1 的例行导入任务的子任务信息。 +## 返回结果 - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +返回结果包含以下字段: -## 关键词 +| 字段名 | 说明 | +| :------------------- | :---------------------------------------------------------- | +| TaskId | 子任务的唯一 ID | +| TxnId | 子任务对应的导入事务 ID | +| TxnStatus | 子任务对应的导入事务状态。为 null 时表示子任务还未开始调度 | +| JobId | 子任务对应的作业 ID | +| CreateTime | 子任务的创建时间 | +| ExecuteStartTime | 子任务被调度执行的时间,通常晚于创建时间 | +| Timeout | 子任务超时时间,通常是作业设置的 `max_batch_interval` 的两倍 | +| BeId | 执行这个子任务的 BE 节点 ID | +| DataSourceProperties | 子任务准备消费的 Kafka Partition 的起始 offset。是一个 Json 格式字符串。Key 为 Partition Id,Value 为消费的起始 offset | - SHOW, ROUTINE, LOAD, TASK +## 权限控制 -### 最佳实践 +执行此 SQL 命令的用户必须至少具有以下权限: -通过这个命令,可以查看一个 Routine Load 作业当前有多少子任务在运行,具体运行在哪个 BE 节点上。 +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD TASK 需要对表有LOAD权限 | + +## 注意事项 + +- TxnStatus 为 null 不代表任务出错,可能是任务还未开始调度 +- DataSourceProperties 中的 offset 信息可用于追踪数据消费进度 +- Timeout 时间到达后,任务会自动结束,无论是否完成数据消费 + +## 示例 + +- 展示名为 test1 的例行导入任务的子任务信息。 + + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 44d5a18b7d58b..24da597680db2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -24,105 +24,118 @@ specific language governing permissions and limitations under the License. 
--> - - - - ## 描述 -该语句用于展示 Routine Load 作业运行状态 +该语句用于展示 Routine Load 作业运行状态。可以查看指定作业或所有作业的状态信息。 -语法: +## 语法 ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; +SHOW [] ROUTINE LOAD [FOR ]; ``` -结果说明: - -``` - Id: 作业ID - Name: 作业名称 - CreateTime: 作业创建时间 - PauseTime: 最近一次作业暂停时间 - EndTime: 作业结束时间 - DbName: 对应数据库名称 - TableName: 对应表名称 (多表的情况下由于是动态表,因此不显示具体表名,我们统一显示 multi-table ) - IsMultiTbl: 是否为多表 - State: 作业运行状态 - DataSourceType: 数据源类型:KAFKA - CurrentTaskNum: 当前子任务数量 - JobProperties: 作业配置详情 -DataSourceProperties: 数据源配置详情 - CustomProperties: 自定义配置 - Statistic: 作业运行状态统计信息 - Progress: 作业运行进度 - Lag: 作业延迟状态 -ReasonOfStateChanged: 作业状态变更的原因 - ErrorLogUrls: 被过滤的质量不合格的数据的查看地址 - OtherMsg: 其他错误信息 -``` - -* State - - 有以下 5 种 State: - * NEED_SCHEDULE:作业等待被调度 - * RUNNING:作业运行中 - * PAUSED:作业被暂停 - * STOPPED:作业已结束 - * CANCELLED:作业已取消 - -* Progress - - 对于 Kafka 数据源,显示每个分区当前已消费的 offset。如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2。 - -* Lag - - 对于 Kafka 数据源,显示每个分区的消费延迟。如{"0":10} 表示 Kafka 分区 0 的消费延迟为 10。 +## 可选参数 + +**1. `[ALL]`** + +> 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 + +**2. `[FOR jobName]`** + +> 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 +> +> 支持以下形式: +> +> - `job_name`: 显示当前数据库下指定名称的作业 +> - `db_name.job_name`: 显示指定数据库下指定名称的作业 + +## 返回结果 + +| 字段名 | 说明 | +| :-------------------- | :---------------------------------------------------------- | +| Id | 作业ID | +| Name | 作业名称 | +| CreateTime | 作业创建时间 | +| PauseTime | 最近一次作业暂停时间 | +| EndTime | 作业结束时间 | +| DbName | 对应数据库名称 | +| TableName | 对应表名称(多表情况下显示 multi-table) | +| IsMultiTbl | 是否为多表 | +| State | 作业运行状态 | +| DataSourceType | 数据源类型:KAFKA | +| CurrentTaskNum | 当前子任务数量 | +| JobProperties | 作业配置详情 | +| DataSourceProperties | 数据源配置详情 | +| CustomProperties | 自定义配置 | +| Statistic | 作业运行状态统计信息 | +| Progress | 作业运行进度 | +| Lag | 作业延迟状态 | +| ReasonOfStateChanged | 作业状态变更的原因 | +| ErrorLogUrls | 被过滤的质量不合格的数据的查看地址 | +| OtherMsg | 其他错误信息 | + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- State 状态说明: + - NEED_SCHEDULE:作业等待被调度 + - RUNNING:作业运行中 + - PAUSED:作业被暂停 + - STOPPED:作业已结束 + - CANCELLED:作业已取消 + +- Progress 说明: + - 对于 Kafka 数据源,显示每个分区当前已消费的 offset + - 例如 {"0":"2"} 表示 Kafka 分区 0 的消费进度为 2 + +- Lag 说明: + - 对于 Kafka 数据源,显示每个分区的消费延迟 + - 例如 {"0":10} 表示 Kafka 分区 0 的消费延迟为 10 ## 示例 -1. 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 展示名称为 test1 的所有例行导入作业(包括已停止或取消的作业) ```sql SHOW ALL ROUTINE LOAD FOR test1; ``` -2. 展示名称为 test1 的当前正在运行的例行导入作业 +- 展示名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR test1; ``` -3. 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,所有的例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql use example_db; SHOW ALL ROUTINE LOAD; ``` -4. 显示 example_db 下,所有正在运行的例行导入作业 +- 显示 example_db 下,所有正在运行的例行导入作业 ```sql use example_db; SHOW ROUTINE LOAD; ``` -5. 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 +- 显示 example_db 下,名称为 test1 的当前正在运行的例行导入作业 ```sql SHOW ROUTINE LOAD FOR example_db.test1; ``` -6. 
显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 +- 显示 example_db 下,名称为 test1 的所有例行导入作业(包括已停止或取消的作业)。结果为一行或多行。 ```sql SHOW ALL ROUTINE LOAD FOR example_db.test1; ``` -## 关键词 - - SHOW, ROUTINE, LOAD - -### 最佳实践 - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index f5fe6269abd6a..57c5f8abfecee 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -24,28 +24,50 @@ specific language governing permissions and limitations under the License. --> - - - ## 描述 -用户停止一个 Routine Load 作业。被停止的作业无法再重新运行。 +该语法用于停止一个 Routine Load 作业。被停止的作业无法再重新运行,这与 PAUSE 命令不同。如果需要重新导入数据,需要创建新的导入作业。 + +## 语法 ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` +## 必选参数 + +**1. `job_name`** + +> 指定要停止的作业名称。可以是以下形式: +> +> - `job_name`: 停止当前数据库下指定名称的作业 +> - `db_name.job_name`: 停止指定数据库下指定名称的作业 + +## 权限控制 + +执行此 SQL 命令的用户必须至少具有以下权限: + +| 权限(Privilege) | 对象(Object) | 说明(Notes) | +| :---------------- | :------------- | :---------------------------- | +| LOAD_PRIV | 表(Table) | SHOW ROUTINE LOAD 需要对表有LOAD权限 | + +## 注意事项 + +- 停止操作是不可逆的,被停止的作业无法通过 RESUME 命令重新启动 +- 停止操作会立即生效,正在执行的任务会被中断 +- 建议在停止作业前先通过 SHOW ROUTINE LOAD 命令检查作业状态 +- 如果只是临时暂停作业,建议使用 PAUSE 命令 + ## 示例 -1. 停止名称为 test1 的例行导入作业。 +- 停止名称为 test1 的例行导入作业。 ```sql STOP ROUTINE LOAD FOR test1; ``` -## 关键词 - - STOP, ROUTINE, LOAD - -### 最佳实践 +- 停止指定数据库下的例行导入作业。 + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 69ddcd8cefba0..0167c8afcfabc 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -25,101 +25,98 @@ under the License. --> - ## Description -This syntax is used to modify an already created routine import job. - -Only jobs in the PAUSED state can be modified. +This syntax is used to modify an existing routine load job. Only jobs in PAUSED state can be modified. -grammar: +## Syntax ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - Specifies the job name to modify. +## Required Parameters -2. `tbl_name` +**1. `[db.]job_name`** - Specifies the name of the table to be imported. +> Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. +> +> The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. -3. `job_properties` +**2. `job_properties`** - Specifies the job parameters that need to be modified. Currently, only the modification of the following parameters is supported: +> Specifies the job parameters to be modified. 
Currently supported parameters include: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. `timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` +**3. `data_source`** +> The type of data source. Currently supports: +> +> - KAFKA -4. `data_source` +**4. `data_source_properties`** - The type of data source. Currently supports: +> Properties related to the data source. Currently supports: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - Custom properties, such as property.group.id - KAFKA +## Privilege Control -5. `data_source_properties` +Users executing this SQL command must have at least the following privileges: - Relevant properties of the data source. Currently only supports: +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. Custom properties, such as `property.group.id` +## Notes - Note: - - 1. `kafka_partitions` and `kafka_offsets` are used to modify the offset of the kafka partition to be consumed, only the currently consumed partition can be modified. Cannot add partition. +- `kafka_partitions` and `kafka_offsets` are used to modify the offset of kafka partitions to be consumed, and can only modify currently consumed partitions. New partitions cannot be added. ## Examples -1. Change `desired_concurrent_number` to 1 - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "1" - ); - ``` - -2. Modify `desired_concurrent_number` to 10, modify the offset of the partition, and modify the group id. - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "10" - ) - FROM kafka - ( - "kafka_partitions" = "0, 1, 2", - "kafka_offsets" = "100, 200, 100", - "property.group.id" = "new_group" - ); - ``` - -## Keywords - - ALTER, ROUTINE, LOAD - -## Best Practice - +- Modify `desired_concurrent_number` to 1 + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "1" + ); + ``` + +- Modify `desired_concurrent_number` to 10, modify partition offsets, and modify group id + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "10" + ) + FROM kafka + ( + "kafka_partitions" = "0, 1, 2", + "kafka_offsets" = "100, 200, 100", + "property.group.id" = "new_group" + ); + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 570a63fa35150..e0eec79594f64 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -25,356 +25,383 @@ under the License. 
--> - ## Description -The Routine Load function allows users to submit a resident import task, and import data into Doris by continuously reading data from a specified data source. +The Routine Load feature allows users to submit a resident import task that continuously reads data from a specified data source and imports it into Doris. -Currently, only data in CSV or Json format can be imported from Kakfa through unauthenticated or SSL authentication. [Example of importing data in Json format](../../../../data-operate/import/import-way/routine-load-manual.md#Example_of_importing_data_in_Json_format) +Currently, it only supports importing CSV or Json format data from Kafka through unauthenticated or SSL authentication methods. [Example of importing Json format data](../../../../data-operate/import/import-way/routine-load-manual.md#Example-of-importing-Json-format-data) -grammar: +## Syntax ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` -- `[db.]job_name` - - The name of the import job. Within the same database, only one job with the same name can be running. - -- `tbl_name` - - Specifies the name of the table to be imported.Optional parameter, If not specified, the dynamic table method will - be used, which requires the data in Kafka to contain table name information. Currently, only the table name can be - obtained from the Kafka value, and it needs to conform to the format of "table_name|{"col1": "val1", "col2": "val2"{" - for JSON data. The "tbl_name" represents the table name, and "|" is used as the delimiter between the table name and - the table data. The same format applies to CSV data, such as "table_name|val1,val2,val3". It is important to note that - the "table_name" must be consistent with the table name in Doris, otherwise it may cause import failures. - - Tips: The `columns_mapping` parameter is not supported for dynamic tables. If your table structure is consistent with - the table structure in Doris and there is a large amount of table information to be imported, this method will be the - best choice. - -- `merge_type` - - Data merge type. The default is APPEND, which means that the imported data are ordinary append write operations. The MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - -- load_properties - - Used to describe imported data. The composition is as follows: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - Specifies the column separator, defaults to `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - It is used to specify the mapping relationship between file columns and columns in the table, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - Tips: Dynamic multiple tables are not supported. - - - `preceding_filter` - - Filter raw data. 
For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - Tips: Dynamic multiple tables are not supported. - - - `where_predicates` - - Filter imported data based on conditions. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `WHERE k1 > 100 and k2 = 1000` - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - - - `partitions` - - Specify in which partitions of the import destination table. If not specified, it will be automatically imported into the corresponding partition. - - `PARTITION(p1, p2, p3)` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `DELETE ON` - - It needs to be used with the MEREGE import mode, only for the table of the Unique Key model. Used to specify the columns and calculated relationships in the imported data that represent the Delete Flag. - - `DELETE ON v3 >100` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `ORDER BY` - - Tables only for the Unique Key model. Used to specify the column in the imported data that represents the Sequence Col. Mainly used to ensure data order when importing. - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - -- `job_properties` - - Common parameters for specifying routine import jobs. - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - Currently we support the following parameters: - - 1. `desired_concurrent_number` - - Desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies the maximum number of tasks a job can execute concurrently. Must be greater than 0. Default is 5. - - This degree of concurrency is not the actual degree of concurrency. The actual degree of concurrency will be comprehensively considered by the number of nodes in the cluster, the load situation, and the situation of the data source. - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - These three parameters represent: - - 1. The maximum execution time of each subtask, in seconds. Must be greater than or equal to 1. The default is 10. - 2. The maximum number of lines read by each subtask. Must be greater than or equal to 200000. The default is 200000. - 3. The maximum number of bytes read by each subtask. The unit is bytes and the range is 100MB to 10GB. The default is 100MB. - - These three parameters are used to control the execution time and processing volume of a subtask. When either one reaches the threshold, the task ends. - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - The maximum number of error lines allowed within the sampling window. Must be greater than or equal to 0. The default is 0, which means no error lines are allowed. - - The sampling window is `max_batch_rows * 10`. 
That is, if the number of error lines is greater than `max_error_number` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 4. `strict_mode` - - Whether to enable strict mode, the default is off. If enabled, the column type conversion of non-null raw data will be filtered if the result is NULL. Specify as: - - `"strict_mode" = "true"` - - The strict mode mode means strict filtering of column type conversions during the load process. The strict filtering strategy is as follows: - - 1. For column type conversion, if strict mode is true, the wrong data will be filtered. The error data here refers to the fact that the original data is not null, and the result is a null value after participating in the column type conversion. - 2. When a loaded column is generated by a function transformation, strict mode has no effect on it. - 3. For a column type loaded with a range limit, if the original data can pass the type conversion normally, but cannot pass the range limit, strict mode will not affect it. For example, if the type is decimal(1,0) and the original data is 10, it is eligible for type conversion but not for column declarations. This data strict has no effect on it. - - **strict mode and load relationship of source data** - - Here is an example of a column type of TinyInt. - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - Here the column type is Decimal(1,0) - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > Note: 10 Although it is a value that is out of range, because its type meets the requirements of decimal, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows. But it will not be filtered by strict mode. - - 5. `timezone` - - Specifies the time zone used by the import job. The default is to use the Session's timezone parameter. This parameter affects the results of all time zone-related functions involved in the import. - - 6. `format` - - Specify the import data format, the default is csv, and the json format is supported. - - 7. `jsonpaths` - - When the imported data format is json, the fields in the Json data can be extracted by specifying jsonpaths. - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - - 8. `strip_outer_array` - - When the imported data format is json, strip_outer_array is true, indicating that the Json data is displayed in the form of an array, and each element in the data will be regarded as a row of data. The default value is false. - - `-H "strip_outer_array: true"` - - 9. 
`json_root` - - When the import data format is json, you can specify the root node of the Json data through json_root. Doris will extract the elements of the root node through json_root for parsing. Default is empty. - - `-H "json_root: $.RECORDS"` - 10. `send_batch_parallelism` - - Integer, Used to set the default parallelism for sending batch, if the value for parallelism exceed `max_send_batch_parallelism_per_job` in BE config, then the coordinator BE will use the value of `max_send_batch_parallelism_per_job`. - - 11. `load_to_single_tablet` - Boolean type, True means that one task can only load data to one tablet in the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into the OLAP table with random bucketing. - - 12. `partial_columns` - Boolean type, True means that use partial column update, the default value is false, this parameter is only allowed to be set when the table model is Unique and Merge on Write is used. Multi-table does not support this parameter. - - 13. `max_filter_ratio` - The maximum allowed filtering rate within the sampling window. Must be between 0 and 1. The default value is 0. - - The sampling window is `max_batch_rows * 10`. That is, if the number of error lines / total lines is greater than `max_filter_ratio` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 14. `enclose` - When the csv data field contains row delimiters or column delimiters, to prevent accidental truncation, single-byte characters can be specified as brackets for protection. For example, the column separator is ",", the bracket is "'", and the data is "a,'b,c'", then "b,c" will be parsed as a field. - Note: when the bracket is `"`, `trim\_double\_quotes` must be set to true. - - 15. `escape` - Used to escape characters that appear in a csv field identical to the enclosing characters. For example, if the data is "a,'b,'c'", enclose is "'", and you want "b,'c to be parsed as a field, you need to specify a single-byte escape character, such as `\`, and then modify the data to `a,' b,\'c'`. - -- `FROM data_source [data_source_properties]` - - The type of data source. Currently supports: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` supports the following data source properties: - - 1. `kafka_broker_list` - - Kafka's broker connection information. The format is ip:host. Separate multiple brokers with commas. - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - Specifies the Kafka topic to subscribe to. - - `"kafka_topic" = "my_topic"` - - 3. `kafka_partitions/kafka_offsets` - - Specify the kafka partition to be subscribed to, and the corresponding starting offset of each partition. If a time is specified, consumption will start at the nearest offset greater than or equal to the time. - - offset can specify a specific offset from 0 or greater, or: - - - `OFFSET_BEGINNING`: Start subscription from where there is data. - - `OFFSET_END`: subscribe from the end. - - Time format, such as: "2021-05-22 11:00:00" - - If not specified, all partitions under topic will be subscribed from `OFFSET_END` by default. 
- - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - Note that the time format cannot be mixed with the OFFSET format. - - 4. `property` - - Specify custom kafka parameters. The function is equivalent to the "--property" parameter in the kafka shell. - - When the value of the parameter is a file, you need to add the keyword: "FILE:" before the value. - - For how to create a file, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. - - For more supported custom parameters, please refer to the configuration items on the client side in the official CONFIGURATION document of librdkafka. Such as: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. When connecting to Kafka using SSL, you need to specify the following parameters: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - in: - - `property.security.protocol` and `property.ssl.ca.location` are required to indicate the connection method is SSL and the location of the CA certificate. - - If client authentication is enabled on the Kafka server side, thenAlso set: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - They are used to specify the client's public key, private key, and password for the private key, respectively. - - 2. Specify the default starting offset of the kafka partition - - If `kafka_partitions/kafka_offsets` is not specified, all partitions are consumed by default. - - At this point, you can specify `kafka_default_offsets` to specify the starting offset. Defaults to `OFFSET_END`, i.e. subscribes from the end. - - Example: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - Comment for the routine load job. - -:::tip Tips -This feature is supported since the Apache Doris 1.2.3 version -::: - +## Required Parameters + +**1. `[db.]job_name`** + +> The name of the import job. Within the same database, only one job with the same name can be running. + +**2. `FROM data_source`** + +> The type of data source. Currently supports: KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` +> +> 2. `kafka_topic` +> +> Specifies the Kafka topic to subscribe to. +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## Optional Parameters + +**1. `tbl_name`** + +> Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. +> +> Currently, only supports getting table names from Kafka's Value, and it needs to follow this format: for json example: `table_name|{"col1": "val1", "col2": "val2"}`, +> where `tbl_name` is the table name, with `|` as the separator between table name and table data. +> +> For csv format data, it's similar: `table_name|val1,val2,val3`. 
Note that `table_name` here must match the table name in Doris, otherwise the import will fail. +> +> Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. + +**2. `merge_type`** + +> Data merge type. Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. +> +> Tips: When using dynamic multiple tables, please note that this parameter should be consistent with each dynamic table's type, otherwise it will result in import failure. + +**3. `load_properties`** + +> Used to describe imported data. The composition is as follows: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> Specifies the column separator, defaults to `\t` +> +> `COLUMNS TERMINATED BY ","` +> +> 2. `columns_mapping` +> +> Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> Tips: Dynamic tables do not support this parameter. +> +> 3. `preceding_filter` +> +> Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: Dynamic tables do not support this parameter. +> +> 4. `where_predicates` +> +> Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. +> +> 5. `partitions` +> +> Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. +> +> `PARTITION(p1, p2, p3)` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 6. `DELETE ON` +> +> Must be used with MERGE import mode, only applicable to Unique Key model tables. Used to specify the Delete Flag column and calculation relationship in the imported data. +> +> `DELETE ON v3 >100` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 7. `ORDER BY` +> +> Only applicable to Unique Key model tables. Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. + +**4. `job_properties`** + +> Used to specify general parameters for routine import jobs. 
+> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> Currently, we support the following parameters: +> +> 1. `desired_concurrent_number` +> +> The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. +> +> This concurrency is not the actual concurrency. The actual concurrency will be determined by considering the number of cluster nodes, load conditions, and data source conditions. +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> These three parameters represent: +> +> 1. Maximum execution time for each subtask, in seconds. Must be greater than or equal to 1. Default is 10. +> 2. Maximum number of rows to read for each subtask. Must be greater than or equal to 200000. Default is 20000000. +> 3. Maximum number of bytes to read for each subtask. Unit is bytes, range is 100MB to 10GB. Default is 1G. +> +> These three parameters are used to control the execution time and processing volume of a subtask. When any one reaches the threshold, the task ends. +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. Default is 0, meaning no error rows are allowed. +> +> The sampling window is `max_batch_rows * 10`. If the number of error rows within the sampling window exceeds `max_error_number`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 4. `strict_mode` +> +> Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: +> +> `"strict_mode" = "true"` +> +> Strict mode means: strictly filter column type conversions during the import process. The strict filtering strategy is as follows: +> +> 1. For column type conversion, if strict mode is true, erroneous data will be filtered. Here, erroneous data refers to: original data that is not null but results in null value after column type conversion. +> 2. For columns generated by function transformation during import, strict mode has no effect. +> 3. For columns with range restrictions, if the original data can pass type conversion but cannot pass range restrictions, strict mode has no effect. For example: if the type is decimal(1,0) and the original data is 10, it can pass type conversion but is outside the column's declared range. Strict mode has no effect on such data. 
+> +> **Relationship between strict mode and source data import** +> +> Here's an example using TinyInt column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> Here's an example using Decimal(1,0) column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. +> +> 5. `timezone` +> +> Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> Specifies the import data format, default is csv, json format is supported. +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> When importing json format data, jsonpaths can be used to specify fields to extract from Json data. +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. +> +> `-H "json_root: $.RECORDS"` +> +> 10. `send_batch_parallelism` +> +> Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. 
+> +> The sampling window is `max_batch_rows * 10`. If within the sampling window, error rows/total rows exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 14. `enclose` +> +> Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. +> +> Note: When enclose is set to `"`, trim_double_quotes must be set to true. +> +> 15. `escape` +> +> Escape character. Used to escape characters in csv fields that are the same as the enclosure character. For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. +> +**5. Optional properties in `data_source_properties`** + +> 1. `kafka_partitions/kafka_offsets` +> +> Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. +> +> offset can be specified as a specific offset greater than or equal to 0, or: +> +> - `OFFSET_BEGINNING`: Start subscribing from where data exists. +> - `OFFSET_END`: Start subscribing from the end. +> - Time format, such as: "2021-05-22 11:00:00" +> +> If not specified, defaults to subscribing to all partitions under the topic from `OFFSET_END`. +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> Note: Time format cannot be mixed with OFFSET format. +> +> 2. `property` +> +> Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. +> +> When the value of a parameter is a file, the keyword "FILE:" needs to be added before the value. +> +> For information about how to create files, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. +> +> For more supported custom parameters, please refer to the client configuration items in the official CONFIGURATION documentation of librdkafka. For example: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> Among them: +> +> `property.security.protocol` and `property.ssl.ca.location` are required, used to specify the connection method as SSL and the location of the CA certificate. +> +> If client authentication is enabled on the Kafka server side, the following also need to be set: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> Used to specify the client's public key, private key, and private key password respectively. +> +> 2. 
Specify default starting offset for kafka partitions +> +> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> +> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> +> Example: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. `COMMENT`** + +> Comment information for the routine load task. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | CREATE ROUTINE LOAD belongs to table LOAD operation | + +## Notes + +- Dynamic tables do not support the `columns_mapping` parameter +- When using dynamic multiple tables, parameters like merge_type, where_predicates, etc., need to conform to each dynamic table's requirements +- Time format cannot be mixed with OFFSET format +- `kafka_partitions` and `kafka_offsets` must correspond one-to-one +- When `enclose` is set to `"`, `trim_double_quotes` must be set to true. ## Examples -1. Create a Kafka routine import task named test1 for example_tbl of example_db. Specify the column separator and group.id and client.id, and automatically consume all partitions by default, and start subscribing from the location where there is data (OFFSET_BEGINNING) - - +- Create a Kafka routine load task named test1 for example_tbl in example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -398,9 +425,9 @@ This feature is supported since the Apache Doris 1.2.3 version ); ``` -2. Create a Kafka routine dynamic multiple tables import task named "test1" for the "example_db". Specify the column delimiter, group.id, and client.id, and automatically consume all partitions, subscribing from the position with data (OFFSET_BEGINNING). +- Create a Kafka routine dynamic multi-table load task named test1 for example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) -Assuming that we need to import data from Kafka into tables "test1" and "test2" in the "example_db", we create a routine import task named "test1". At the same time, we write the data in "test1" and "test2" to a Kafka topic named "my_topic" so that data from Kafka can be imported into both tables through a routine import task. + Assuming we need to import data from Kafka into test1 and test2 tables in example_db, we create a routine load task named test1, and write data from test1 and test2 to a Kafka topic named `my_topic`. This way, we can import data from Kafka into two tables through one routine load task. ```sql CREATE ROUTINE LOAD example_db.test1 @@ -422,9 +449,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -3. Create a Kafka routine import task named test1 for example_tbl of example_db. Import tasks are in strict mode. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db. The import task is in strict mode. ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -448,9 +473,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -4. 
Import data from the Kafka cluster through SSL authentication. Also set the client.id parameter. The import task is in non-strict mode and the time zone is Africa/Abidjan - - +- Import data from Kafka cluster using SSL authentication. Also set client.id parameter. Import task is in non-strict mode, timezone is Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -478,9 +501,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -5. Import data in Json format. By default, the field name in Json is used as the column name mapping. Specify to import three partitions 0, 1, and 2, and the starting offsets are all 0 - - +- Import Json format data. Use field names in Json as column name mapping by default. Specify importing partitions 0,1,2, all starting offsets are 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -503,9 +524,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -6. Import Json data, extract fields through Jsonpaths, and specify the root node of the Json document - - +- Import Json data, extract fields through Jsonpaths, and specify Json document root node ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -531,9 +550,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -7. Create a Kafka routine import task named test1 for example_tbl of example_db. And use conditional filtering. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db with condition filtering. ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -558,9 +575,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -8. Import data to Unique with sequence column Key model table - - +- Import data into a Unique Key model table containing sequence columns ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -582,9 +597,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -9. Consume from a specified point in time - - +- Start consuming from a specified time point ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -600,30 +613,4 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## Keywords - - CREATE, ROUTINE, LOAD, CREATE LOAD - -## Best Practice - -Partition and Offset for specified consumption - -Doris supports the specified Partition and Offset to start consumption, and also supports the function of consumption at a specified time point. The configuration relationship of the corresponding parameters is described here. - -There are three relevant parameters: - -- `kafka_partitions`: Specify a list of partitions to be consumed, such as "0, 1, 2, 3". -- `kafka_offsets`: Specify the starting offset of each partition, which must correspond to the number of `kafka_partitions` list. For example: "1000, 1000, 2000, 2000" -- `property.kafka_default_offsets`: Specifies the default starting offset of the partition. 
- -When creating an import job, these three parameters can have the following combinations: - -| Composition | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | Behavior | -| ----------- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | The system will automatically find all partitions corresponding to the topic and start consumption from OFFSET_END | -| 2 | No | No | Yes | The system will automatically find all partitions corresponding to the topic and start consumption from the location specified by default offset | -| 3 | Yes | No | No | The system will start consumption from OFFSET_END of the specified partition | -| 4 | Yes | Yes | No | The system will start consumption from the specified offset of the specified partition | -| 5 | Yes | No | Yes | The system will start consumption from the specified partition, the location specified by default offset | + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index f135093ce8cb7..ebc06d9229ec1 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,33 +25,51 @@ under the License. --> +## Description## Description +This syntax is used to pause one or all Routine Load jobs. Paused jobs can be restarted using the RESUME command. -## Description - -Used to pause a Routine Load job. A suspended job can be rerun with the RESUME command. +## Syntax ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` -## Examples +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to pause. If ALL is specified, job_name is not required. + +## Optional Parameters -1. Pause the routine import job named test1. +**1. `[ALL]`** - ```sql - PAUSE ROUTINE LOAD FOR test1; - ``` +> Optional parameter. If ALL is specified, it indicates pausing all routine load jobs. -2. Pause all routine import jobs. +## Privilege Control - ```sql - PAUSE ALL ROUTINE LOAD; - ``` +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | + +## Notes + +- After a job is paused, it can be restarted using the RESUME command +- The pause operation will not affect tasks that have already been dispatched to BE, these tasks will continue to complete + +## Examples -## Keywords +- Pause a routine load job named test1. - PAUSE, ROUTINE, LOAD + ```sql + PAUSE ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Pause all routine load jobs. 
+ ```sql + PAUSE ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index de682d409516d..7974c1c3542d6 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -26,32 +26,52 @@ under the License. - ## Description -Used to restart a suspended Routine Load job. The restarted job will continue to consume from the previously consumed offset. +This syntax is used to restart one or all paused Routine Load jobs. The restarted job will continue consuming from the previously consumed offset. + +## Syntax ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` -## Examples +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to restart. If ALL is specified, job_name is not required. + +## Optional Parameters -1. Restart the routine import job named test1. +**1. `[ALL]`** - ```sql - RESUME ROUTINE LOAD FOR test1; - ``` +> Optional parameter. If ALL is specified, it indicates restarting all paused routine load jobs. -2. Restart all routine import jobs. +## Privilege Control - ```sql - RESUME ALL ROUTINE LOAD; - ``` +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | + +## Notes + +- Only jobs in PAUSED state can be restarted +- Restarted jobs will continue consuming data from the last consumed position +- If a job has been paused for too long, the restart may fail due to expired Kafka data + +## Examples -## Keywords +- Restart a routine load job named test1. - RESUME, ROUTINE, LOAD + ```sql + RESUME ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Restart all routine load jobs. + ```sql + RESUME ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 5b3f6f73eb79f..fc079f8901cbc 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -26,34 +26,43 @@ under the License. + ## Description -This statement is used to demonstrate the creation statement of a routine import job. +This statement is used to display the creation statement of a routine load job. -The kafka partition and offset in the result show the currently consumed partition and the corresponding offset to be consumed. +The result shows the current consuming Kafka partitions and their corresponding offsets to be consumed. -grammar: +## Syntax ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -illustrate: +## Required Parameters -1. `ALL`: optional parameter, which means to get all jobs, including historical jobs -2. `load_name`: routine import job name +**1. 
`load_name`** -## Examples +> The name of the routine load job + +## Optional Parameters -1. Show the creation statement of the specified routine import job under the default db +**1. `[ALL]`** - ```sql - SHOW CREATE ROUTINE LOAD for test_load - ``` +> Optional parameter that represents retrieving all jobs, including historical jobs -## Keywords +## Permission Control - SHOW, CREATE, ROUTINE, LOAD +Users executing this SQL command must have at least the following permission: + +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Examples -## Best Practice +- Show the creation statement of a specified routine load job in the default database + ```sql + SHOW CREATE ROUTINE LOAD for test_load + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 837924d5397f2..6f7d84542242a 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -28,52 +28,55 @@ under the License. ## Description -View the currently running subtasks of a specified Routine Load job. - +This syntax is used to view the currently running subtasks of a specified Routine Load job. +## Syntax ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -The returned results are as follows: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## Required Parameters -- `TaskId`: The unique ID of the subtask. -- `TxnId`: The import transaction ID corresponding to the subtask. -- `TxnStatus`: The import transaction status corresponding to the subtask. When TxnStatus is null, it means that the subtask has not yet started scheduling. -- `JobId`: The job ID corresponding to the subtask. -- `CreateTime`: The creation time of the subtask. -- `ExecuteStartTime`: The time when the subtask is scheduled to be executed, usually later than the creation time. -- `Timeout`: Subtask timeout, usually twice the `max_batch_interval` set by the job. -- `BeId`: The ID of the BE node executing this subtask. -- `DataSourceProperties`: The starting offset of the Kafka Partition that the subtask is ready to consume. is a Json format string. Key is Partition Id. Value is the starting offset of consumption. +**1. `job_name`** -## Examples +> The name of the routine load job to view. + +## Return Results -1. Display the subtask information of the routine import task named test1. +The return results include the following fields: - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| TaskId | Unique ID of the subtask | +| TxnId | Import transaction ID corresponding to the subtask | +| TxnStatus | Import transaction status of the subtask. 
Null indicates the subtask has not yet been scheduled | +| JobId | Job ID corresponding to the subtask | +| CreateTime | Creation time of the subtask | +| ExecuteStartTime | Time when the subtask was scheduled for execution, typically later than creation time | +| Timeout | Subtask timeout, typically twice the `max_batch_interval` set in the job | +| BeId | BE node ID executing this subtask | +| DataSourceProperties | Starting offset of Kafka Partition that the subtask is preparing to consume. It's a Json format string. Key is Partition Id, Value is the starting offset for consumption | -## Keywords +## Privilege Control - SHOW, ROUTINE, LOAD, TASK +Users executing this SQL command must have at least the following privileges: -## Best Practice +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD TASK requires LOAD privilege on the table | + +## Notes + +- A null TxnStatus doesn't indicate task error, it may mean the task hasn't been scheduled yet +- The offset information in DataSourceProperties can be used to track data consumption progress +- When Timeout is reached, the task will automatically end regardless of whether data consumption is complete + +## Examples -With this command, you can view how many subtasks are currently running in a Routine Load job, and which BE node is running on. +- Show subtask information for a routine load task named test1. + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 32db7ac2cc873..2d502603eedb6 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -27,117 +27,115 @@ under the License. ## Description -This statement is used to display the running status of the Routine Load job +This statement is used to display the running status of Routine Load jobs. You can view the status information of either a specific job or all jobs. -grammar: +## Syntax ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; +SHOW [] ROUTINE LOAD [FOR ]; ``` -Result description: +## Optional Parameters + +**1. `[ALL]`** + +> Optional parameter. If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. + +**2. `[FOR jobName]`** + +> Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. 
+> +> Supports the following formats: +> +> - `job_name`: Shows the job with the specified name in the current database +> - `db_name.job_name`: Shows the job with the specified name in the specified database + +## Return Results + +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| Id | Job ID | +| Name | Job name | +| CreateTime | Job creation time | +| PauseTime | Most recent job pause time | +| EndTime | Job end time | +| DbName | Corresponding database name | +| TableName | Corresponding table name (shows 'multi-table' for multiple tables) | +| IsMultiTbl | Whether it's a multi-table job | +| State | Job running status | +| DataSourceType | Data source type: KAFKA | +| CurrentTaskNum | Current number of subtasks | +| JobProperties | Job configuration details | +| DataSourceProperties | Data source configuration details | +| CustomProperties | Custom configurations | +| Statistic | Job running statistics | +| Progress | Job running progress | +| Lag | Job delay status | +| ReasonOfStateChanged | Reason for job state change | +| ErrorLogUrls | URLs to view filtered data that failed quality checks | +| OtherMsg | Other error messages | + +## Permission Control + +Users executing this SQL command must have at least the following permission: + +| Privilege | Object | Notes | +| :----------- | :----- | :----------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Notes + +- State descriptions: + - NEED_SCHEDULE: Job is waiting to be scheduled + - RUNNING: Job is running + - PAUSED: Job is paused + - STOPPED: Job has ended + - CANCELLED: Job has been cancelled + +- Progress description: + - For Kafka data source, shows the consumed offset for each partition + - For example, {"0":"2"} means the consumption progress of Kafka partition 0 is 2 + +- Lag description: + - For Kafka data source, shows the consumption delay for each partition + - For example, {"0":10} means the consumption lag of Kafka partition 0 is 10 -``` - Id: job ID - Name: job name - CreateTime: job creation time - PauseTime: The last job pause time - EndTime: Job end time - DbName: corresponding database name - TableName: The name of the corresponding table (In the case of multiple tables, since it is a dynamic table, the specific table name is not displayed, and we uniformly display it as "multi-table"). - IsMultiTbl: Indicates whether it is a multi-table - State: job running state - DataSourceType: Data source type: KAFKA - CurrentTaskNum: The current number of subtasks - JobProperties: Job configuration details -DataSourceProperties: Data source configuration details - CustomProperties: custom configuration - Statistic: Job running status statistics - Progress: job running progress - Lag: job delay status -ReasonOfStateChanged: The reason for the job state change - ErrorLogUrls: The viewing address of the filtered unqualified data - OtherMsg: other error messages -``` - -* State - - There are the following 5 states: - * NEED_SCHEDULE: The job is waiting to be scheduled - * RUNNING: The job is running - * PAUSED: The job is paused - * STOPPED: The job has ended - * CANCELLED: The job was canceled - -* Progress - -<<<<<<< HEAD - For Kafka data sources, displays the currently consumed offset for each partition. For example, {"0":"2"{ indicates that the consumption progress of Kafka partition 0 is 2. - -* Lag - - For Kafka data sources, shows the consumption latency of each partition. 
For example, {"0":10{ means that the consumption delay of Kafka partition 0 is 10. - -<<<<<<<< HEAD:versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md ## Examples -======== -## Example ->>>>>>>> ac43c88d43b68f907eafd82a2629cea01b097093:versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md -======= - For Kafka data sources, displays the currently consumed offset for each partition. For example, {"0":"2"} indicates that the consumption progress of Kafka partition 0 is 2. - -*Lag - - For Kafka data sources, shows the consumption latency of each partition. For example, {"0":10} means that the consumption delay of Kafka partition 0 is 10. - -<<<<<<<< HEAD:versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md -## Example -======== -## Examples ->>>>>>>> ac43c88d43b68f907eafd82a2629cea01b097093:versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md ->>>>>>> ac43c88d43b68f907eafd82a2629cea01b097093 - -1. Show all routine import jobs named test1 (including stopped or canceled jobs). The result is one or more lines. - - ```sql - SHOW ALL ROUTINE LOAD FOR test1; - ``` - -2. Show the currently running routine import job named test1 - - ```sql - SHOW ROUTINE LOAD FOR test1; - ``` -3. Display all routine import jobs (including stopped or canceled jobs) under example_db. The result is one or more lines. +- Show all routine load jobs (including stopped or cancelled ones) named test1 - ```sql - use example_db; - SHOW ALL ROUTINE LOAD; - ``` + ```sql + SHOW ALL ROUTINE LOAD FOR test1; + ``` -4. Display all running routine import jobs under example_db +- Show currently running routine load jobs named test1 - ```sql - use example_db; - SHOW ROUTINE LOAD; - ``` + ```sql + SHOW ROUTINE LOAD FOR test1; + ``` -5. Display the currently running routine import job named test1 under example_db +- Show all routine load jobs (including stopped or cancelled ones) in example_db. Results can be one or multiple rows. - ```sql - SHOW ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ALL ROUTINE LOAD; + ``` -6. Displays all routine import jobs named test1 under example_db (including stopped or canceled jobs). The result is one or more lines. +- Show all currently running routine load jobs in example_db - ```sql - SHOW ALL ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ROUTINE LOAD; + ``` -## Keywords +- Show currently running routine load job named test1 in example_db - SHOW, ROUTINE, LOAD + ```sql + SHOW ROUTINE LOAD FOR example_db.test1; + ``` -## Best Practice +- Show all routine load jobs (including stopped or cancelled ones) named test1 in example_db. Results can be one or multiple rows. + ```sql + SHOW ALL ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 8a8cf7cc57257..ffea7818a144e 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -25,25 +25,51 @@ under the License. 
--> + ## Description -User stops a Routine Load job. A stopped job cannot be rerun. +This syntax is used to stop a Routine Load job. Unlike the PAUSE command, stopped jobs cannot be restarted. If you need to import data again, you'll need to create a new import job. + +## Syntax ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` -## Examples +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to stop. It can be in the following formats: +> +> - `job_name`: Stop a job with the specified name in the current database +> - `db_name.job_name`: Stop a job with the specified name in the specified database -1. Stop the routine import job named test1. +## Permission Control - ```sql - STOP ROUTINE LOAD FOR test1; - ``` +Users executing this SQL command must have at least the following permission: + +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Notes + +- The stop operation is irreversible; stopped jobs cannot be restarted using the RESUME command +- The stop operation takes effect immediately, and running tasks will be interrupted +- It's recommended to check the job status using the SHOW ROUTINE LOAD command before stopping a job +- If you only want to temporarily pause a job, use the PAUSE command instead + +## Examples -## Keywords +- Stop a routine load job named test1 - STOP, ROUTINE, LOAD + ```sql + STOP ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Stop a routine load job in a specified database + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 4c6631ae77ab1..0167c8afcfabc 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -27,98 +27,96 @@ under the License. ## Description -This syntax is used to modify an already created routine import job. +This syntax is used to modify an existing routine load job. Only jobs in PAUSED state can be modified. -Only jobs in the PAUSED state can be modified. - -grammar: +## Syntax ```sql -ALTER ROUTINE LOAD FOR [db.]job_name -[job_properties] -FROM data_source -[data_source_properties] +ALTER ROUTINE LOAD FOR [] +[] +FROM +[] ``` -1. `[db.]job_name` - - Specifies the job name to modify. - -2. `tbl_name` - - Specifies the name of the table to be imported. - -3. `job_properties` - - Specifies the job parameters that need to be modified. Currently, only the modification of the following parameters is supported: - - 1. `desired_concurrent_number` - 2. `max_error_number` - 3. `max_batch_interval` - 4. `max_batch_rows` - 5. `max_batch_size` - 6. `jsonpaths` - 7. `json_root` - 8. `strip_outer_array` - 9. `strict_mode` - 10. `timezone` - 11. `num_as_string` - 12. `fuzzy_parse` - 13. `partial_columns` - 14. `max_filter_ratio` - - -4. `data_source` - - The type of data source. Currently supports: - - KAFKA - -5. `data_source_properties` - - Relevant properties of the data source. Currently only supports: - - 1. `kafka_partitions` - 2. `kafka_offsets` - 3. `kafka_broker_list` - 4. `kafka_topic` - 5. 
Custom properties, such as `property.group.id` - - Note: - - 1. `kafka_partitions` and `kafka_offsets` are used to modify the offset of the kafka partition to be consumed, only the currently consumed partition can be modified. Cannot add partition. - -## Example - -1. Change `desired_concurrent_number` to 1 - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "1" - ); - ``` - -2. Modify `desired_concurrent_number` to 10, modify the offset of the partition, and modify the group id. - - ```sql - ALTER ROUTINE LOAD FOR db1.label1 - PROPERTIES - ( - "desired_concurrent_number" = "10" - ) - FROM kafka - ( - "kafka_partitions" = "0, 1, 2", - "kafka_offsets" = "100, 200, 100", - "property.group.id" = "new_group" - ); - ``` - -## Keywords - - ALTER, ROUTINE, LOAD - -## Best Practice - +## Required Parameters + +**1. `[db.]job_name`** + +> Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. +> +> The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. + +**2. `job_properties`** + +> Specifies the job parameters to be modified. Currently supported parameters include: +> +> - desired_concurrent_number +> - max_error_number +> - max_batch_interval +> - max_batch_rows +> - max_batch_size +> - jsonpaths +> - json_root +> - strip_outer_array +> - strict_mode +> - timezone +> - num_as_string +> - fuzzy_parse +> - partial_columns +> - max_filter_ratio + +**3. `data_source`** + +> The type of data source. Currently supports: +> +> - KAFKA + +**4. `data_source_properties`** + +> Properties related to the data source. Currently supports: +> +> - kafka_partitions +> - kafka_offsets +> - kafka_broker_list +> - kafka_topic +> - Custom properties, such as property.group.id + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | + +## Notes + +- `kafka_partitions` and `kafka_offsets` are used to modify the offset of kafka partitions to be consumed, and can only modify currently consumed partitions. New partitions cannot be added. + +## Examples + +- Modify `desired_concurrent_number` to 1 + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "1" + ); + ``` + +- Modify `desired_concurrent_number` to 10, modify partition offsets, and modify group id + + ```sql + ALTER ROUTINE LOAD FOR db1.label1 + PROPERTIES + ( + "desired_concurrent_number" = "10" + ) + FROM kafka + ( + "kafka_partitions" = "0, 1, 2", + "kafka_offsets" = "100, 200, 100", + "property.group.id" = "new_group" + ); + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 1d9d0e7a2481a..e0eec79594f64 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -27,357 +27,381 @@ under the License. 
## Description -The Routine Load function allows users to submit a resident import task, and import data into Doris by continuously reading data from a specified data source. +The Routine Load feature allows users to submit a resident import task that continuously reads data from a specified data source and imports it into Doris. -Currently, only data in CSV or Json format can be imported from Kakfa through unauthenticated or SSL authentication. [Example of importing data in Json format](../../../../data-operate/import/import-way/routine-load-manual.md#Example_of_importing_data_in_Json_format) +Currently, it only supports importing CSV or Json format data from Kafka through unauthenticated or SSL authentication methods. [Example of importing Json format data](../../../../data-operate/import/import-way/routine-load-manual.md#Example-of-importing-Json-format-data) -grammar: +## Syntax ```sql -CREATE ROUTINE LOAD [db.]job_name [ON tbl_name] -[merge_type] -[load_properties] -[job_properties] -FROM data_source [data_source_properties] -[COMMENT "comment"] +CREATE ROUTINE LOAD [] [ON ] +[] +[] +[] +FROM [] +[] ``` -- `[db.]job_name` - - The name of the import job. Within the same database, only one job with the same name can be running. - -- `tbl_name` - - Specifies the name of the table to be imported.Optional parameter, If not specified, the dynamic table method will - be used, which requires the data in Kafka to contain table name information. Currently, only the table name can be - obtained from the Kafka value, and it needs to conform to the format of "table_name|{"col1": "val1", "col2": "val2"}" - for JSON data. The "tbl_name" represents the table name, and "|" is used as the delimiter between the table name and - the table data. The same format applies to CSV data, such as "table_name|val1,val2,val3". It is important to note that - the "table_name" must be consistent with the table name in Doris, otherwise it may cause import failures. - - Tips: The `columns_mapping` parameter is not supported for dynamic tables. If your table structure is consistent with - the table structure in Doris and there is a large amount of table information to be imported, this method will be the - best choice. - -- `merge_type` - - Data merge type. The default is APPEND, which means that the imported data are ordinary append write operations. The MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - -- load_properties - - Used to describe imported data. The composition is as follows: - - ```SQL - [column_separator], - [columns_mapping], - [preceding_filter], - [where_predicates], - [partitions], - [DELETE ON], - [ORDER BY] - ``` - - - `column_separator` - - Specifies the column separator, defaults to `\t` - - `COLUMNS TERMINATED BY ","` - - - `columns_mapping` - - It is used to specify the mapping relationship between file columns and columns in the table, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `(k1, k2, tmpk1, k3 = tmpk1 + 1)` - - Tips: Dynamic multiple tables are not supported. - - - `preceding_filter` - - Filter raw data. 
For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - Tips: Dynamic multiple tables are not supported. - - - `where_predicates` - - Filter imported data based on conditions. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. - - `WHERE k1 > 100 and k2 = 1000` - - Tips: When using dynamic multiple tables, please note that this parameter should be consistent with the type of each dynamic table, otherwise it will result in import failure. - - - `partitions` - - Specify in which partitions of the import destination table. If not specified, it will be automatically imported into the corresponding partition. - - `PARTITION(p1, p2, p3)` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `DELETE ON` - - It needs to be used with the MEREGE import mode, only for the table of the Unique Key model. Used to specify the columns and calculated relationships in the imported data that represent the Delete Flag. - - `DELETE ON v3 >100` - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - - - `ORDER BY` - - Tables only for the Unique Key model. Used to specify the column in the imported data that represents the Sequence Col. Mainly used to ensure data order when importing. - - Tips: When using dynamic multiple tables, please note that this parameter should conform to each dynamic table, otherwise it may cause import failure. - -- `job_properties` - - Common parameters for specifying routine import jobs. - - ```text - PROPERTIES ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - Currently we support the following parameters: - - 1. `desired_concurrent_number` - - Desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies the maximum number of tasks a job can execute concurrently. Must be greater than 0. Default is 5. - - This degree of concurrency is not the actual degree of concurrency. The actual degree of concurrency will be comprehensively considered by the number of nodes in the cluster, the load situation, and the situation of the data source. - - `"desired_concurrent_number" = "3"` - - 2. `max_batch_interval/max_batch_rows/max_batch_size` - - These three parameters represent: - - 1. The maximum execution time of each subtask, in seconds. Must be greater than or equal to 1. The default is 10. - 2. The maximum number of lines read by each subtask. Must be greater than or equal to 200000. The default is 200000. - 3. The maximum number of bytes read by each subtask. The unit is bytes and the range is 100MB to 10GB. The default is 100MB. - - These three parameters are used to control the execution time and processing volume of a subtask. When either one reaches the threshold, the task ends. - - ```text - "max_batch_interval" = "20", - "max_batch_rows" = "300000", - "max_batch_size" = "209715200" - ``` - - 3. `max_error_number` - - The maximum number of error lines allowed within the sampling window. Must be greater than or equal to 0. The default is 0, which means no error lines are allowed. - - The sampling window is `max_batch_rows * 10`. 
That is, if the number of error lines is greater than `max_error_number` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 4. `strict_mode` - - Whether to enable strict mode, the default is off. If enabled, the column type conversion of non-null raw data will be filtered if the result is NULL. Specify as: - - `"strict_mode" = "true"` - - The strict mode mode means strict filtering of column type conversions during the load process. The strict filtering strategy is as follows: - - 1. For column type conversion, if strict mode is true, the wrong data will be filtered. The error data here refers to the fact that the original data is not null, and the result is a null value after participating in the column type conversion. - 2. When a loaded column is generated by a function transformation, strict mode has no effect on it. - 3. For a column type loaded with a range limit, if the original data can pass the type conversion normally, but cannot pass the range limit, strict mode will not affect it. For example, if the type is decimal(1,0) and the original data is 10, it is eligible for type conversion but not for column declarations. This data strict has no effect on it. - - **strict mode and load relationship of source data** - - Here is an example of a column type of TinyInt. - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa or 2000 | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 | 1 | true or false | correct data | - - Here the column type is Decimal(1,0) - - > Note: When a column in a table allows a null value to be loaded - - | source data | source data example | string to int | strict_mode | result | - | ----------- | ------------------- | ------------- | ------------- | ---------------------- | - | null | `\N` | N/A | true or false | NULL | - | not null | aaa | NULL | true | invalid data(filtered) | - | not null | aaa | NULL | false | NULL | - | not null | 1 or 10 | 1 | true or false | correct data | - - > Note: 10 Although it is a value that is out of range, because its type meets the requirements of decimal, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows. But it will not be filtered by strict mode. - - 5. `timezone` - - Specifies the time zone used by the import job. The default is to use the Session's timezone parameter. This parameter affects the results of all time zone-related functions involved in the import. - - 6. `format` - - Specify the import data format, the default is csv, and the json format is supported. - - 7. `jsonpaths` - - When the imported data format is json, the fields in the Json data can be extracted by specifying jsonpaths. - - `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` - - 8. `strip_outer_array` - - When the imported data format is json, strip_outer_array is true, indicating that the Json data is displayed in the form of an array, and each element in the data will be regarded as a row of data. The default value is false. - - `-H "strip_outer_array: true"` - - 9. 
`json_root` - - When the import data format is json, you can specify the root node of the Json data through json_root. Doris will extract the elements of the root node through json_root for parsing. Default is empty. - - `-H "json_root: $.RECORDS"` - 10. `send_batch_parallelism` - - Integer, Used to set the default parallelism for sending batch, if the value for parallelism exceed `max_send_batch_parallelism_per_job` in BE config, then the coordinator BE will use the value of `max_send_batch_parallelism_per_job`. - - 11. `load_to_single_tablet` - Boolean type, True means that one task can only load data to one tablet in the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into the OLAP table with random bucketing. - - 12. `partial_columns` - Boolean type, True means that use partial column update, the default value is false, this parameter is only allowed to be set when the table model is Unique and Merge on Write is used. Multi-table does not support this parameter. - - 13. `max_filter_ratio` - The maximum allowed filtering rate within the sampling window. Must be between 0 and 1. The default value is 0. - - The sampling window is `max_batch_rows * 10`. That is, if the number of error lines / total lines is greater than `max_filter_ratio` within the sampling window, the routine operation will be suspended, requiring manual intervention to check data quality problems. - - Rows that are filtered out by where conditions are not considered error rows. - - 14. `enclose` - When the csv data field contains row delimiters or column delimiters, to prevent accidental truncation, single-byte characters can be specified as brackets for protection. For example, the column separator is ",", the bracket is "'", and the data is "a,'b,c'", then "b,c" will be parsed as a field. - Note: when the bracket is `"`, `trim\_double\_quotes` must be set to true. - - 15. `escape` - Used to escape characters that appear in a csv field identical to the enclosing characters. For example, if the data is "a,'b,'c'", enclose is "'", and you want "b,'c to be parsed as a field, you need to specify a single-byte escape character, such as `\`, and then modify the data to `a,' b,\'c'`. - -- `FROM data_source [data_source_properties]` - - The type of data source. Currently supports: - - ```text - FROM KAFKA - ( - "key1" = "val1", - "key2" = "val2" - ) - ``` - - `data_source_properties` supports the following data source properties: - - 1. `kafka_broker_list` - - Kafka's broker connection information. The format is ip:host. Separate multiple brokers with commas. - - `"kafka_broker_list" = "broker1:9092,broker2:9092"` - - 2. `kafka_topic` - - Specifies the Kafka topic to subscribe to. - - `"kafka_topic" = "my_topic"` - - 3. `kafka_partitions/kafka_offsets` - - Specify the kafka partition to be subscribed to, and the corresponding starting offset of each partition. If a time is specified, consumption will start at the nearest offset greater than or equal to the time. - - offset can specify a specific offset from 0 or greater, or: - - - `OFFSET_BEGINNING`: Start subscription from where there is data. - - `OFFSET_END`: subscribe from the end. - - Time format, such as: "2021-05-22 11:00:00" - - If not specified, all partitions under topic will be subscribed from `OFFSET_END` by default. 
- - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" - ``` - - ```text - "kafka_partitions" = "0,1,2,3", - "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" - ``` - - Note that the time format cannot be mixed with the OFFSET format. - - 4. `property` - - Specify custom kafka parameters. The function is equivalent to the "--property" parameter in the kafka shell. - - When the value of the parameter is a file, you need to add the keyword: "FILE:" before the value. - - For how to create a file, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. - - For more supported custom parameters, please refer to the configuration items on the client side in the official CONFIGURATION document of librdkafka. Such as: - - ```text - "property.client.id" = "12345", - "property.ssl.ca.location" = "FILE:ca.pem" - ``` - - 1. When connecting to Kafka using SSL, you need to specify the following parameters: - - ```text - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ``` - - in: - - `property.security.protocol` and `property.ssl.ca.location` are required to indicate the connection method is SSL and the location of the CA certificate. - - If client authentication is enabled on the Kafka server side, thenAlso set: - - ```text - "property.ssl.certificate.location" - "property.ssl.key.location" - "property.ssl.key.password" - ``` - - They are used to specify the client's public key, private key, and password for the private key, respectively. - - 2. Specify the default starting offset of the kafka partition - - If `kafka_partitions/kafka_offsets` is not specified, all partitions are consumed by default. - - At this point, you can specify `kafka_default_offsets` to specify the starting offset. Defaults to `OFFSET_END`, i.e. subscribes from the end. - - Example: - - ```text - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` -- comment - - Comment for the routine load job. - -:::tip Tips -This feature is supported since the Apache Doris 1.2.3 version -::: - -<<<<<<<< HEAD:versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md -## Example -======== +## Required Parameters + +**1. `[db.]job_name`** + +> The name of the import job. Within the same database, only one job with the same name can be running. + +**2. `FROM data_source`** + +> The type of data source. Currently supports: KAFKA + +**3. `data_source_properties`** + +> 1. `kafka_broker_list` +> +> Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. +> +> ```text +> "kafka_broker_list" = "broker1:9092,broker2:9092" +> ``` +> +> 2. `kafka_topic` +> +> Specifies the Kafka topic to subscribe to. +> ```text +> "kafka_topic" = "my_topic" +> ``` + +## Optional Parameters + +**1. `tbl_name`** + +> Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. 
+> +> Currently, only supports getting table names from Kafka's Value, and it needs to follow this format: for json example: `table_name|{"col1": "val1", "col2": "val2"}`, +> where `tbl_name` is the table name, with `|` as the separator between table name and table data. +> +> For csv format data, it's similar: `table_name|val1,val2,val3`. Note that `table_name` here must match the table name in Doris, otherwise the import will fail. +> +> Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. + +**2. `merge_type`** + +> Data merge type. Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. +> +> Tips: When using dynamic multiple tables, please note that this parameter should be consistent with each dynamic table's type, otherwise it will result in import failure. + +**3. `load_properties`** + +> Used to describe imported data. The composition is as follows: +> +> ```SQL +> [column_separator], +> [columns_mapping], +> [preceding_filter], +> [where_predicates], +> [partitions], +> [DELETE ON], +> [ORDER BY] +> ``` +> +> 1. `column_separator` +> +> Specifies the column separator, defaults to `\t` +> +> `COLUMNS TERMINATED BY ","` +> +> 2. `columns_mapping` +> +> Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. +> +> `(k1, k2, tmpk1, k3 = tmpk1 + 1)` +> +> Tips: Dynamic tables do not support this parameter. +> +> 3. `preceding_filter` +> +> Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: Dynamic tables do not support this parameter. +> +> 4. `where_predicates` +> +> Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. +> +> `WHERE k1 > 100 and k2 = 1000` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. +> +> 5. `partitions` +> +> Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. +> +> `PARTITION(p1, p2, p3)` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 6. `DELETE ON` +> +> Must be used with MERGE import mode, only applicable to Unique Key model tables. Used to specify the Delete Flag column and calculation relationship in the imported data. +> +> `DELETE ON v3 >100` +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. +> +> 7. `ORDER BY` +> +> Only applicable to Unique Key model tables. 
Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. +> +> Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. + +**4. `job_properties`** + +> Used to specify general parameters for routine import jobs. +> +> ```text +> PROPERTIES ( +> "key1" = "val1", +> "key2" = "val2" +> ) +> ``` +> +> Currently, we support the following parameters: +> +> 1. `desired_concurrent_number` +> +> The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. +> +> This concurrency is not the actual concurrency. The actual concurrency will be determined by considering the number of cluster nodes, load conditions, and data source conditions. +> +> `"desired_concurrent_number" = "3"` +> +> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> +> These three parameters represent: +> +> 1. Maximum execution time for each subtask, in seconds. Must be greater than or equal to 1. Default is 10. +> 2. Maximum number of rows to read for each subtask. Must be greater than or equal to 200000. Default is 20000000. +> 3. Maximum number of bytes to read for each subtask. Unit is bytes, range is 100MB to 10GB. Default is 1G. +> +> These three parameters are used to control the execution time and processing volume of a subtask. When any one reaches the threshold, the task ends. +> +> ```text +> "max_batch_interval" = "20", +> "max_batch_rows" = "300000", +> "max_batch_size" = "209715200" +> ``` +> +> 3. `max_error_number` +> +> Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. Default is 0, meaning no error rows are allowed. +> +> The sampling window is `max_batch_rows * 10`. If the number of error rows within the sampling window exceeds `max_error_number`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 4. `strict_mode` +> +> Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: +> +> `"strict_mode" = "true"` +> +> Strict mode means: strictly filter column type conversions during the import process. The strict filtering strategy is as follows: +> +> 1. For column type conversion, if strict mode is true, erroneous data will be filtered. Here, erroneous data refers to: original data that is not null but results in null value after column type conversion. +> 2. For columns generated by function transformation during import, strict mode has no effect. +> 3. For columns with range restrictions, if the original data can pass type conversion but cannot pass range restrictions, strict mode has no effect. For example: if the type is decimal(1,0) and the original data is 10, it can pass type conversion but is outside the column's declared range. Strict mode has no effect on such data. 
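+>
+>    As a minimal illustration (a sketch only; the property values below are examples and are not taken from the original text), strict mode is typically enabled together with an error tolerance so that occasional unconvertible rows do not immediately pause the job:
+>
+>    ```text
+>    "strict_mode" = "true",
+>    "max_error_number" = "100"
+>    ```
+>
+>    With this combination, a non-null source value that becomes NULL after column type conversion is filtered and counted as an error row, and the job is only paused once the error count in the sampling window exceeds `max_error_number`.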
+> +> **Relationship between strict mode and source data import** +> +> Here's an example using TinyInt column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa or 2000 | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 | 1 | true or false | correct data | +> +> Here's an example using Decimal(1,0) column type +> +> Note: When columns in the table allow null values +> +> | source data | source data example | string to int | strict_mode | result | +> | ----------- | ------------------- | ------------- | ------------- | ---------------------- | +> | null | `\N` | N/A | true or false | NULL | +> | not null | aaa | NULL | true | invalid data(filtered) | +> | not null | aaa | NULL | false | NULL | +> | not null | 1 or 10 | 1 | true or false | correct data | +> +> Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. +> +> 5. `timezone` +> +> Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. +> +> `"timezone" = "Asia/Shanghai"` +> +> 6. `format` +> +> Specifies the import data format, default is csv, json format is supported. +> +> `"format" = "json"` +> +> 7. `jsonpaths` +> +> When importing json format data, jsonpaths can be used to specify fields to extract from Json data. +> +> `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` +> +> 8. `strip_outer_array` +> +> When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. +> +> `-H "strip_outer_array: true"` +> +> 9. `json_root` +> +> When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. +> +> `-H "json_root: $.RECORDS"` +> +> 10. `send_batch_parallelism` +> +> Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. +> +> `"send_batch_parallelism" = "10"` +> +> 11. `load_to_single_tablet` +> +> Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. +> +> `"load_to_single_tablet" = "true"` +> +> 12. `partial_columns` +> +> Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. +> +> `"partial_columns" = "true"` +> +> 13. `max_filter_ratio` +> +> Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. 
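+>
+>    For example (an illustrative value, not taken from the original text):
+>
+>    `"max_filter_ratio" = "0.1"`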
+> +> The sampling window is `max_batch_rows * 10`. If within the sampling window, error rows/total rows exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. +> +> Rows filtered by where conditions are not counted as error rows. +> +> 14. `enclose` +> +> Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. +> +> Note: When enclose is set to `"`, trim_double_quotes must be set to true. +> +> 15. `escape` +> +> Escape character. Used to escape characters in csv fields that are the same as the enclosure character. For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. +> +**5. Optional properties in `data_source_properties`** + +> 1. `kafka_partitions/kafka_offsets` +> +> Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. +> +> offset can be specified as a specific offset greater than or equal to 0, or: +> +> - `OFFSET_BEGINNING`: Start subscribing from where data exists. +> - `OFFSET_END`: Start subscribing from the end. +> - Time format, such as: "2021-05-22 11:00:00" +> +> If not specified, defaults to subscribing to all partitions under the topic from `OFFSET_END`. +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END" +> ``` +> +> ```text +> "kafka_partitions" = "0,1,2,3", +> "kafka_offsets" = "2021-05-22 11:00:00,2021-05-22 11:00:00,2021-05-22 11:00:00" +> ``` +> +> Note: Time format cannot be mixed with OFFSET format. +> +> 2. `property` +> +> Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. +> +> When the value of a parameter is a file, the keyword "FILE:" needs to be added before the value. +> +> For information about how to create files, please refer to the [CREATE FILE](../../../Data-Definition-Statements/Create/CREATE-FILE) command documentation. +> +> For more supported custom parameters, please refer to the client configuration items in the official CONFIGURATION documentation of librdkafka. For example: +> +> ```text +> "property.client.id" = "12345", +> "property.ssl.ca.location" = "FILE:ca.pem" +> ``` +> +> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> +> ```text +> "property.security.protocol" = "ssl", +> "property.ssl.ca.location" = "FILE:ca.pem", +> "property.ssl.certificate.location" = "FILE:client.pem", +> "property.ssl.key.location" = "FILE:client.key", +> "property.ssl.key.password" = "abcdefg" +> ``` +> +> Among them: +> +> `property.security.protocol` and `property.ssl.ca.location` are required, used to specify the connection method as SSL and the location of the CA certificate. +> +> If client authentication is enabled on the Kafka server side, the following also need to be set: +> +> ```text +> "property.ssl.certificate.location" +> "property.ssl.key.location" +> "property.ssl.key.password" +> ``` +> +> Used to specify the client's public key, private key, and private key password respectively. +> +> 2. 
Specify default starting offset for kafka partitions +> +> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> +> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> +> Example: +> +> ```text +> "property.kafka_default_offsets" = "OFFSET_BEGINNING" +> ``` + +**6. `COMMENT`** + +> Comment information for the routine load task. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: + +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | CREATE ROUTINE LOAD belongs to table LOAD operation | + +## Notes + +- Dynamic tables do not support the `columns_mapping` parameter +- When using dynamic multiple tables, parameters like merge_type, where_predicates, etc., need to conform to each dynamic table's requirements +- Time format cannot be mixed with OFFSET format +- `kafka_partitions` and `kafka_offsets` must correspond one-to-one +- When `enclose` is set to `"`, `trim_double_quotes` must be set to true. ## Examples ->>>>>>>> ac43c88d43b68f907eafd82a2629cea01b097093:versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md - -1. Create a Kafka routine import task named test1 for example_tbl of example_db. Specify the column separator and group.id and client.id, and automatically consume all partitions by default, and start subscribing from the location where there is data (OFFSET_BEGINNING) - +- Create a Kafka routine load task named test1 for example_tbl in example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -401,9 +425,9 @@ This feature is supported since the Apache Doris 1.2.3 version ); ``` -2. Create a Kafka routine dynamic multiple tables import task named "test1" for the "example_db". Specify the column delimiter, group.id, and client.id, and automatically consume all partitions, subscribing from the position with data (OFFSET_BEGINNING). +- Create a Kafka routine dynamic multi-table load task named test1 for example_db. Specify column separator, group.id and client.id, and automatically consume all partitions by default, starting subscription from where data exists (OFFSET_BEGINNING) -Assuming that we need to import data from Kafka into tables "test1" and "test2" in the "example_db", we create a routine import task named "test1". At the same time, we write the data in "test1" and "test2" to a Kafka topic named "my_topic" so that data from Kafka can be imported into both tables through a routine import task. + Assuming we need to import data from Kafka into test1 and test2 tables in example_db, we create a routine load task named test1, and write data from test1 and test2 to a Kafka topic named `my_topic`. This way, we can import data from Kafka into two tables through one routine load task. ```sql CREATE ROUTINE LOAD example_db.test1 @@ -425,9 +449,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -3. Create a Kafka routine import task named test1 for example_tbl of example_db. Import tasks are in strict mode. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db. The import task is in strict mode. 
```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -451,9 +473,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -4. Import data from the Kafka cluster through SSL authentication. Also set the client.id parameter. The import task is in non-strict mode and the time zone is Africa/Abidjan - - +- Import data from Kafka cluster using SSL authentication. Also set client.id parameter. Import task is in non-strict mode, timezone is Africa/Abidjan ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -481,9 +501,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -5. Import data in Json format. By default, the field name in Json is used as the column name mapping. Specify to import three partitions 0, 1, and 2, and the starting offsets are all 0 - - +- Import Json format data. Use field names in Json as column name mapping by default. Specify importing partitions 0,1,2, all starting offsets are 0 ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 @@ -506,9 +524,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -6. Import Json data, extract fields through Jsonpaths, and specify the root node of the Json document - - +- Import Json data, extract fields through Jsonpaths, and specify Json document root node ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -534,9 +550,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -7. Create a Kafka routine import task named test1 for example_tbl of example_db. And use conditional filtering. - - +- Create a Kafka routine load task named test1 for example_tbl in example_db with condition filtering. ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl @@ -561,9 +575,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -8. Import data to Unique with sequence column Key model table - - +- Import data into a Unique Key model table containing sequence columns ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -585,9 +597,7 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" ); ``` -9. Consume from a specified point in time - - +- Start consuming from a specified time point ```sql CREATE ROUTINE LOAD example_db.test_job ON example_tbl @@ -603,30 +613,4 @@ Assuming that we need to import data from Kafka into tables "test1" and "test2" "kafka_topic" = "my_topic", "kafka_default_offsets" = "2021-05-21 10:00:00" ); - ``` - -## Keywords - - CREATE, ROUTINE, LOAD, CREATE LOAD - -## Best Practice - -Partition and Offset for specified consumption - -Doris supports the specified Partition and Offset to start consumption, and also supports the function of consumption at a specified time point. The configuration relationship of the corresponding parameters is described here. - -There are three relevant parameters: - -- `kafka_partitions`: Specify a list of partitions to be consumed, such as "0, 1, 2, 3". -- `kafka_offsets`: Specify the starting offset of each partition, which must correspond to the number of `kafka_partitions` list. For example: "1000, 1000, 2000, 2000" -- `property.kafka_default_offsets`: Specifies the default starting offset of the partition. 
- -When creating an import job, these three parameters can have the following combinations: - -| Composition | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offsets` | Behavior | -| ----------- | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | -| 1 | No | No | No | The system will automatically find all partitions corresponding to the topic and start consumption from OFFSET_END | -| 2 | No | No | Yes | The system will automatically find all partitions corresponding to the topic and start consumption from the location specified by default offset | -| 3 | Yes | No | No | The system will start consumption from OFFSET_END of the specified partition | -| 4 | Yes | Yes | No | The system will start consumption from the specified offset of the specified partition | -| 5 | Yes | No | Yes | The system will start consumption from the specified partition, the location specified by default offset | + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 5b93ad79b926c..ebc06d9229ec1 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,31 +25,51 @@ under the License. --> -## Description +## Description## Description -Used to pause a Routine Load job. A suspended job can be rerun with the RESUME command. +This syntax is used to pause one or all Routine Load jobs. Paused jobs can be restarted using the RESUME command. + +## Syntax ```sql -PAUSE [ALL] ROUTINE LOAD FOR job_name +PAUSE [] ROUTINE LOAD FOR ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to pause. If ALL is specified, job_name is not required. + +## Optional Parameters + +**1. `[ALL]`** + +> Optional parameter. If ALL is specified, it indicates pausing all routine load jobs. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: -1. Pause the routine import job named test1. +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | - ```sql - PAUSE ROUTINE LOAD FOR test1; - ``` +## Notes -2. Pause all routine import jobs. +- After a job is paused, it can be restarted using the RESUME command +- The pause operation will not affect tasks that have already been dispatched to BE, these tasks will continue to complete - ```sql - PAUSE ALL ROUTINE LOAD; - ``` +## Examples -## Keywords +- Pause a routine load job named test1. - PAUSE, ROUTINE, LOAD + ```sql + PAUSE ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Pause all routine load jobs. 
+ ```sql + PAUSE ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index da72700bdcb82..7974c1c3542d6 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -28,29 +28,50 @@ under the License. ## Description -Used to restart a suspended Routine Load job. The restarted job will continue to consume from the previously consumed offset. +This syntax is used to restart one or all paused Routine Load jobs. The restarted job will continue consuming from the previously consumed offset. + +## Syntax ```sql -RESUME [ALL] ROUTINE LOAD FOR job_name +RESUME [] ROUTINE LOAD FOR ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to restart. If ALL is specified, job_name is not required. + +## Optional Parameters + +**1. `[ALL]`** + +> Optional parameter. If ALL is specified, it indicates restarting all paused routine load jobs. + +## Privilege Control + +Users executing this SQL command must have at least the following privileges: -1. Restart the routine import job named test1. +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD privilege on the table | - ```sql - RESUME ROUTINE LOAD FOR test1; - ``` +## Notes -2. Restart all routine import jobs. +- Only jobs in PAUSED state can be restarted +- Restarted jobs will continue consuming data from the last consumed position +- If a job has been paused for too long, the restart may fail due to expired Kafka data - ```sql - RESUME ALL ROUTINE LOAD; - ``` +## Examples -## Keywords +- Restart a routine load job named test1. - RESUME, ROUTINE, LOAD + ```sql + RESUME ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Restart all routine load jobs. + ```sql + RESUME ALL ROUTINE LOAD; + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 8880986f281b5..fc079f8901cbc 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -29,32 +29,40 @@ under the License. ## Description -This statement is used to demonstrate the creation statement of a routine import job. +This statement is used to display the creation statement of a routine load job. -The kafka partition and offset in the result show the currently consumed partition and the corresponding offset to be consumed. +The result shows the current consuming Kafka partitions and their corresponding offsets to be consumed. -grammar: +## Syntax ```sql -SHOW [ALL] CREATE ROUTINE LOAD for load_name; +SHOW [] CREATE ROUTINE LOAD for ; ``` -illustrate: +## Required Parameters -1. `ALL`: optional parameter, which means to get all jobs, including historical jobs -2. `load_name`: routine import job name +**1. `load_name`** -## Example +> The name of the routine load job -1. 
Show the creation statement of the specified routine import job under the default db +## Optional Parameters - ```sql - SHOW CREATE ROUTINE LOAD for test_load - ``` +**1. `[ALL]`** -## Keywords +> Optional parameter that represents retrieving all jobs, including historical jobs - SHOW, CREATE, ROUTINE, LOAD +## Permission Control -## Best Practice +Users executing this SQL command must have at least the following permission: +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Examples + +- Show the creation statement of a specified routine load job in the default database + + ```sql + SHOW CREATE ROUTINE LOAD for test_load + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index c275b9cd2ae1f..6f7d84542242a 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -28,52 +28,55 @@ under the License. ## Description -View the currently running subtasks of a specified Routine Load job. - +This syntax is used to view the currently running subtasks of a specified Routine Load job. +## Syntax ```sql SHOW ROUTINE LOAD TASK -WHERE JobName = "job_name"; +WHERE JobName = ; ``` -The returned results are as follows: - -```text - TaskId: d67ce537f1be4b86-abf47530b79ab8e6 - TxnId: 4 - TxnStatus: UNKNOWN - JobId: 10280 - CreateTime: 2020-12-12 20:29:48 - ExecuteStartTime: 2020-12-12 20:29:48 - Timeout: 20 - BeId: 10002 -DataSourceProperties: {"0":19} -``` +## Required Parameters + +**1. `job_name`** + +> The name of the routine load job to view. + +## Return Results + +The return results include the following fields: -- `TaskId`: The unique ID of the subtask. -- `TxnId`: The import transaction ID corresponding to the subtask. -- `TxnStatus`: The import transaction status corresponding to the subtask. When TxnStatus is null, it means that the subtask has not yet started scheduling. -- `JobId`: The job ID corresponding to the subtask. -- `CreateTime`: The creation time of the subtask. -- `ExecuteStartTime`: The time when the subtask is scheduled to be executed, usually later than the creation time. -- `Timeout`: Subtask timeout, usually twice the `max_batch_interval` set by the job. -- `BeId`: The ID of the BE node executing this subtask. -- `DataSourceProperties`: The starting offset of the Kafka Partition that the subtask is ready to consume. is a Json format string. Key is Partition Id. Value is the starting offset of consumption. +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| TaskId | Unique ID of the subtask | +| TxnId | Import transaction ID corresponding to the subtask | +| TxnStatus | Import transaction status of the subtask. 
Null indicates the subtask has not yet been scheduled | +| JobId | Job ID corresponding to the subtask | +| CreateTime | Creation time of the subtask | +| ExecuteStartTime | Time when the subtask was scheduled for execution, typically later than creation time | +| Timeout | Subtask timeout, typically twice the `max_batch_interval` set in the job | +| BeId | BE node ID executing this subtask | +| DataSourceProperties | Starting offset of Kafka Partition that the subtask is preparing to consume. It's a Json format string. Key is Partition Id, Value is the starting offset for consumption | -## Example +## Privilege Control -1. Display the subtask information of the routine import task named test1. +Users executing this SQL command must have at least the following privileges: - ```sql - SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; - ``` +| Privilege | Object | Notes | +| :-------- | :----- | :---- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD TASK requires LOAD privilege on the table | -## Keywords +## Notes - SHOW, ROUTINE, LOAD, TASK +- A null TxnStatus doesn't indicate task error, it may mean the task hasn't been scheduled yet +- The offset information in DataSourceProperties can be used to track data consumption progress +- When Timeout is reached, the task will automatically end regardless of whether data consumption is complete -## Best Practice +## Examples -With this command, you can view how many subtasks are currently running in a Routine Load job, and which BE node is running on. +- Show subtask information for a routine load task named test1. + ```sql + SHOW ROUTINE LOAD TASK WHERE JobName = "test1"; + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 201a06b0bfe9c..2d502603eedb6 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -27,99 +27,115 @@ under the License. ## Description -This statement is used to display the running status of the Routine Load job +This statement is used to display the running status of Routine Load jobs. You can view the status information of either a specific job or all jobs. -grammar: +## Syntax ```sql -SHOW [ALL] ROUTINE LOAD [FOR jobName]; +SHOW [] ROUTINE LOAD [FOR ]; ``` -Result description: +## Optional Parameters -``` - Id: job ID - Name: job name - CreateTime: job creation time - PauseTime: The last job pause time - EndTime: Job end time - DbName: corresponding database name - TableName: The name of the corresponding table (In the case of multiple tables, since it is a dynamic table, the specific table name is not displayed, and we uniformly display it as "multi-table"). - IsMultiTbl: Indicates whether it is a multi-table - State: job running state - DataSourceType: Data source type: KAFKA - CurrentTaskNum: The current number of subtasks - JobProperties: Job configuration details -DataSourceProperties: Data source configuration details - CustomProperties: custom configuration - Statistic: Job running status statistics - Progress: job running progress - Lag: job delay status -ReasonOfStateChanged: The reason for the job state change - ErrorLogUrls: The viewing address of the filtered unqualified data - OtherMsg: other error messages -``` +**1. 
`[ALL]`** + +> Optional parameter. If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. + +**2. `[FOR jobName]`** -* State +> Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. +> +> Supports the following formats: +> +> - `job_name`: Shows the job with the specified name in the current database +> - `db_name.job_name`: Shows the job with the specified name in the specified database - There are the following 5 states: - * NEED_SCHEDULE: The job is waiting to be scheduled - * RUNNING: The job is running - * PAUSED: The job is paused - * STOPPED: The job has ended - * CANCELLED: The job was canceled +## Return Results -* Progress +| Field Name | Description | +| :------------------- | :---------------------------------------------------------- | +| Id | Job ID | +| Name | Job name | +| CreateTime | Job creation time | +| PauseTime | Most recent job pause time | +| EndTime | Job end time | +| DbName | Corresponding database name | +| TableName | Corresponding table name (shows 'multi-table' for multiple tables) | +| IsMultiTbl | Whether it's a multi-table job | +| State | Job running status | +| DataSourceType | Data source type: KAFKA | +| CurrentTaskNum | Current number of subtasks | +| JobProperties | Job configuration details | +| DataSourceProperties | Data source configuration details | +| CustomProperties | Custom configurations | +| Statistic | Job running statistics | +| Progress | Job running progress | +| Lag | Job delay status | +| ReasonOfStateChanged | Reason for job state change | +| ErrorLogUrls | URLs to view filtered data that failed quality checks | +| OtherMsg | Other error messages | - For Kafka data sources, displays the currently consumed offset for each partition. For example, {"0":"2"} indicates that the consumption progress of Kafka partition 0 is 2. +## Permission Control -*Lag +Users executing this SQL command must have at least the following permission: - For Kafka data sources, shows the consumption latency of each partition. For example, {"0":10} means that the consumption delay of Kafka partition 0 is 10. +| Privilege | Object | Notes | +| :----------- | :----- | :----------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | -## Example +## Notes -1. Show all routine import jobs named test1 (including stopped or canceled jobs). The result is one or more lines. +- State descriptions: + - NEED_SCHEDULE: Job is waiting to be scheduled + - RUNNING: Job is running + - PAUSED: Job is paused + - STOPPED: Job has ended + - CANCELLED: Job has been cancelled - ```sql - SHOW ALL ROUTINE LOAD FOR test1; - ``` +- Progress description: + - For Kafka data source, shows the consumed offset for each partition + - For example, {"0":"2"} means the consumption progress of Kafka partition 0 is 2 -2. Show the currently running routine import job named test1 +- Lag description: + - For Kafka data source, shows the consumption delay for each partition + - For example, {"0":10} means the consumption lag of Kafka partition 0 is 10 - ```sql - SHOW ROUTINE LOAD FOR test1; - ``` +## Examples -3. Display all routine import jobs (including stopped or canceled jobs) under example_db. The result is one or more lines. 
+- Show all routine load jobs (including stopped or cancelled ones) named test1 - ```sql - use example_db; - SHOW ALL ROUTINE LOAD; - ``` + ```sql + SHOW ALL ROUTINE LOAD FOR test1; + ``` -4. Display all running routine import jobs under example_db +- Show currently running routine load jobs named test1 - ```sql - use example_db; - SHOW ROUTINE LOAD; - ``` + ```sql + SHOW ROUTINE LOAD FOR test1; + ``` -5. Display the currently running routine import job named test1 under example_db +- Show all routine load jobs (including stopped or cancelled ones) in example_db. Results can be one or multiple rows. - ```sql - SHOW ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ALL ROUTINE LOAD; + ``` -6. Displays all routine import jobs named test1 under example_db (including stopped or canceled jobs). The result is one or more lines. +- Show all currently running routine load jobs in example_db - ```sql - SHOW ALL ROUTINE LOAD FOR example_db.test1; - ``` + ```sql + use example_db; + SHOW ROUTINE LOAD; + ``` -## Keywords +- Show currently running routine load job named test1 in example_db - SHOW, ROUTINE, LOAD + ```sql + SHOW ROUTINE LOAD FOR example_db.test1; + ``` -## Best Practice +- Show all routine load jobs (including stopped or cancelled ones) named test1 in example_db. Results can be one or multiple rows. + ```sql + SHOW ALL ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 36fb589789c71..ffea7818a144e 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -28,23 +28,48 @@ under the License. ## Description -User stops a Routine Load job. A stopped job cannot be rerun. +This syntax is used to stop a Routine Load job. Unlike the PAUSE command, stopped jobs cannot be restarted. If you need to import data again, you'll need to create a new import job. + +## Syntax ```sql -STOP ROUTINE LOAD FOR job_name; +STOP ROUTINE LOAD FOR ; ``` -## Example +## Required Parameters + +**1. `job_name`** + +> Specifies the name of the job to stop. It can be in the following formats: +> +> - `job_name`: Stop a job with the specified name in the current database +> - `db_name.job_name`: Stop a job with the specified name in the specified database + +## Permission Control + +Users executing this SQL command must have at least the following permission: + +| Privilege | Object | Notes | +| :--------- | :----- | :------------------------------------------------------- | +| LOAD_PRIV | Table | SHOW ROUTINE LOAD requires LOAD permission on the table | + +## Notes -1. Stop the routine import job named test1. 
+- The stop operation is irreversible; stopped jobs cannot be restarted using the RESUME command +- The stop operation takes effect immediately, and running tasks will be interrupted +- It's recommended to check the job status using the SHOW ROUTINE LOAD command before stopping a job +- If you only want to temporarily pause a job, use the PAUSE command instead - ```sql - STOP ROUTINE LOAD FOR test1; - ``` +## Examples -## Keywords +- Stop a routine load job named test1 - STOP, ROUTINE, LOAD + ```sql + STOP ROUTINE LOAD FOR test1; + ``` -## Best Practice +- Stop a routine load job in a specified database + ```sql + STOP ROUTINE LOAD FOR example_db.test1; + ``` \ No newline at end of file From d8a60b25ba9a624906d1eed4c82950ef81ec0cd2 Mon Sep 17 00:00:00 2001 From: liujiwen-up Date: Thu, 13 Feb 2025 22:23:59 +0800 Subject: [PATCH 2/2] Fix routine load statements documentation docs --- .../load-and-export/ALTER-ROUTINE-LOAD.md | 30 +++---- .../load-and-export/CREATE-ROUTINE-LOAD.md | 76 +++++++++--------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 6 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 8 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 24 +++--- .../load-and-export/CREATE-ROUTINE-LOAD.md | 78 +++++++++---------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 4 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 4 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 24 +++--- .../load-and-export/CREATE-ROUTINE-LOAD.md | 78 +++++++++---------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 4 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 4 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 24 +++--- .../load-and-export/CREATE-ROUTINE-LOAD.md | 78 +++++++++---------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 4 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 4 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 30 +++---- .../load-and-export/CREATE-ROUTINE-LOAD.md | 76 +++++++++--------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 6 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 8 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- .../load-and-export/ALTER-ROUTINE-LOAD.md | 30 +++---- .../load-and-export/CREATE-ROUTINE-LOAD.md | 76 +++++++++--------- .../load-and-export/PAUSE-ROUTINE-LOAD.md | 6 +- .../load-and-export/RESUME-ROUTINE-LOAD.md | 4 +- .../SHOW-CREATE-ROUTINE-LOAD.md | 4 +- .../load-and-export/SHOW-ROUTINE-LOAD-TASK.md | 5 +- .../load-and-export/SHOW-ROUTINE-LOAD.md | 8 +- .../load-and-export/STOP-ROUTINE-LOAD.md | 6 +- 48 files changed, 405 insertions(+), 399 deletions(-) diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md 
b/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 0167c8afcfabc..a5f3a3621d67d 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -32,21 +32,23 @@ This syntax is used to modify an existing routine load job. Only jobs in PAUSED ## Syntax ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. `[.]`** > Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. > > The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. -**2. `job_properties`** +## Optional Parameters + +**1. ``** > Specifies the job parameters to be modified. Currently supported parameters include: > @@ -65,21 +67,21 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** +**2. ``** -> The type of data source. Currently supports: +> Properties related to the data source. Currently supports: > -> - KAFKA +> - `` +> - `` +> - `` +> - `` +> - Custom properties, such as `` -**4. `data_source_properties`** +**3. ``** -> Properties related to the data source. Currently supports: +> The type of data source. Currently supports: > -> - kafka_partitions -> - kafka_offsets -> - kafka_broker_list -> - kafka_topic -> - Custom properties, such as property.group.id +> - KAFKA ## Privilege Control diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index e0eec79594f64..c92f27a7c375f 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -34,27 +34,27 @@ Currently, it only supports importing CSV or Json format data from Kafka through ## Syntax ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. `[.]`** > The name of the import job. Within the same database, only one job with the same name can be running. -**2. `FROM data_source`** +**2. `FROM `** > The type of data source. Currently supports: KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. > @@ -62,7 +62,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` > -> 2. `kafka_topic` +> 2. `` > > Specifies the Kafka topic to subscribe to. > ```text @@ -71,7 +71,7 @@ FROM [] ## Optional Parameters -**1. `tbl_name`** +**1. ``** > Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. > @@ -82,7 +82,7 @@ FROM [] > > Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. -**2. `merge_type`** +**2. ``** > Data merge type. 
Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. > @@ -102,13 +102,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > Specifies the column separator, defaults to `\t` > > `COLUMNS TERMINATED BY ","` > -> 2. `columns_mapping` +> 2. `` > > Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. > @@ -116,7 +116,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 3. `preceding_filter` +> 3. `` > > Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. > @@ -124,7 +124,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 4. `where_predicates` +> 4. `` > > Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. > @@ -132,7 +132,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. > -> 5. `partitions` +> 5. `` > > Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. > @@ -140,7 +140,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 6. `DELETE ON` +> 6. `` > > Must be used with MERGE import mode, only applicable to Unique Key model tables. Used to specify the Delete Flag column and calculation relationship in the imported data. > @@ -148,13 +148,13 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 7. `ORDER BY` +> 7. `` > > Only applicable to Unique Key model tables. Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. -**4. `job_properties`** +**4. ``** > Used to specify general parameters for routine import jobs. > @@ -167,7 +167,7 @@ FROM [] > > Currently, we support the following parameters: > -> 1. `desired_concurrent_number` +> 1. `` > > The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. > @@ -175,7 +175,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > These three parameters represent: > @@ -191,7 +191,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. 
Default is 0, meaning no error rows are allowed. > @@ -199,7 +199,7 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 4. `strict_mode` +> 4. `` > > Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: > @@ -237,55 +237,55 @@ FROM [] > > Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. > -> 5. `timezone` +> 5. `` > > Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > Specifies the import data format, default is csv, json format is supported. > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > When importing json format data, jsonpaths can be used to specify fields to extract from Json data. > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. `` > > When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. `` > > Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. > @@ -293,19 +293,19 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 14. `enclose` +> 14. `` > > Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. > > Note: When enclose is set to `"`, trim_double_quotes must be set to true. > -> 15. `escape` +> 15. `` > > Escape character. Used to escape characters in csv fields that are the same as the enclosure character. 
For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. > **5. Optional properties in `data_source_properties`** -> 1. `kafka_partitions/kafka_offsets` +> 1. `/` > > Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. > @@ -329,7 +329,7 @@ FROM [] > > Note: Time format cannot be mixed with OFFSET format. > -> 2. `property` +> 2. `` > > Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. > @@ -344,7 +344,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> 2.1 When using SSL to connect to Kafka, the following parameters need to be specified: > > ```text > "property.security.protocol" = "ssl", @@ -368,11 +368,11 @@ FROM [] > > Used to specify the client's public key, private key, and private key password respectively. > -> 2. Specify default starting offset for kafka partitions +> 2.2 Specify default starting offset for kafka partitions > -> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> If `/` is not specified, all partitions will be consumed by default. > -> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> In this case, `` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. > > Example: > diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index ebc06d9229ec1..9f3a53d0ad9fe 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,19 +25,19 @@ under the License. --> -## Description## Description +## Description This syntax is used to pause one or all Routine Load jobs. Paused jobs can be restarted using the RESUME command. ## Syntax ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to pause. If ALL is specified, job_name is not required. diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 7974c1c3542d6..90039afe32140 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -33,12 +33,12 @@ This syntax is used to restart one or all paused Routine Load jobs. The restarte ## Syntax ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to restart. If ALL is specified, job_name is not required. 
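The `property.kafka_default_offsets` behaviour described in the CREATE ROUTINE LOAD changes above (every partition of the topic is subscribed and starts from a chosen default position when no explicit `kafka_partitions`/`kafka_offsets` are given) is only stated in prose there; a minimal sketch of such a job follows. This is an illustrative sketch only: the job, table, broker and topic names (`example_db.test_default_offset_job`, `example_tbl`, `broker1:9092`, `my_topic`) are placeholders and do not come from this patch.

```sql
-- Hedged sketch: consume all partitions of the topic from the beginning by
-- setting property.kafka_default_offsets instead of kafka_partitions/kafka_offsets.
-- All object names below are placeholders, not values taken from this patch.
CREATE ROUTINE LOAD example_db.test_default_offset_job ON example_tbl
PROPERTIES
(
    "desired_concurrent_number" = "1"
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "my_topic",
    -- no kafka_partitions/kafka_offsets given: every partition is subscribed,
    -- each starting from the default offset below (the default would be OFFSET_END)
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```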
diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index fc079f8901cbc..50f8edf5eec2f 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ The result shows the current consuming Kafka partitions and their corresponding ## Syntax ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## Required Parameters -**1. `load_name`** +**1. ``** > The name of the routine load job diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 6f7d84542242a..70d0a7bcc600f 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -33,13 +33,12 @@ This syntax is used to view the currently running subtasks of a specified Routin ## Syntax ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## Required Parameters -**1. `job_name`** +**1. ``** > The name of the routine load job to view. diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 2d502603eedb6..528a56c412acd 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -32,7 +32,7 @@ This statement is used to display the running status of Routine Load jobs. You c ## Syntax ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## Optional Parameters @@ -41,14 +41,14 @@ SHOW [] ROUTINE LOAD [FOR ]; > Optional parameter. If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. -**2. `[FOR jobName]`** +**2. `[FOR ]`** > Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. > > Supports the following formats: > -> - `job_name`: Shows the job with the specified name in the current database -> - `db_name.job_name`: Shows the job with the specified name in the specified database +> - ``: Shows the job with the specified name in the current database +> - `.`: Shows the job with the specified name in the specified database ## Return Results diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index ffea7818a144e..efa758e2b49a6 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -38,12 +38,12 @@ STOP ROUTINE LOAD FOR ; ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to stop. 
It can be in the following formats: > -> - `job_name`: Stop a job with the specified name in the current database -> - `db_name.job_name`: Stop a job with the specified name in the specified database +> - ``: Stop a job with the specified name in the current database +> - `.`: Stop a job with the specified name in the specified database ## Permission Control diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 68c1c2760d937..0fe9afb2b838d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -33,21 +33,23 @@ under the License. ## 语法 ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 > > 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -**2. `job_properties`** +## 可选参数 + +**1. ``** > 指定需要修改的作业参数。目前支持修改的参数包括: > @@ -66,13 +68,7 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** - -> 数据源的类型。当前支持: -> -> - KAFKA - -**4. `data_source_properties`** +**2. ``** > 数据源的相关属性。目前支持: > @@ -82,6 +78,12 @@ FROM > - kafka_topic > - 自定义 property,如 property.group.id +**3. ``** + +> 数据源的类型。当前支持: +> +> - KAFKA + ## 权限控制 执行此 SQL 命令的用户必须至少具有以下权限: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 4ac39f6df86a4..55eb01d0788d1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -38,27 +38,27 @@ under the License. ## 语法 ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 -**2. `FROM data_source`** +**2. `FROM `** > 数据源的类型。当前支持:KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 > @@ -66,7 +66,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` -> 2. `kafka_topic` +> 2. `` > > 指定要订阅的 Kafka 的 topic。 > ```text @@ -75,7 +75,7 @@ FROM [] ## 可选参数 -**1. `tbl_name`** +**1. ``** > 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 > @@ -87,13 +87,13 @@ FROM [] > tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 -**2. `merge_type`** +**2. ``** > 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 -**3. `load_properties`** +**3. ``** > 用于描述导入数据。组成如下: > @@ -107,13 +107,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > 指定列分隔符,默认为 `\t` > > `COLUMNS TERMINATED BY ","` -> 2. `columns_mapping` +> 2. 
`` > > 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -121,7 +121,7 @@ FROM [] > > tips: 动态表不支持此参数。 -> 3. `preceding_filter` +> 3. `` > > 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -129,7 +129,7 @@ FROM [] > > tips: 动态表不支持此参数。 > -> 4. `where_predicates` +> 4. `` > > 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -137,7 +137,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 -> 5. `partitions` +> 5. `` > > 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 > @@ -145,7 +145,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 6. `DELETE ON` +> 6. `` > > 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 > @@ -153,13 +153,13 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 7. `ORDER BY` +> 7. `` > > 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -**4. `job_properties`** +**4. ``** > 用于指定例行导入作业的通用参数。 > @@ -172,7 +172,7 @@ FROM [] > > 目前我们支持以下参数: > -> 1. `desired_concurrent_number` +> 1. `` > > 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 > @@ -180,7 +180,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > 这三个参数分别表示: > @@ -196,7 +196,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 > @@ -204,7 +204,7 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 4. `strict_mode` +> 4. `` > > 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: > @@ -242,55 +242,55 @@ FROM [] > > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 > -> 5. `timezone` +> 5. `` > > 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > 指定导入数据格式,默认是 csv,支持 json 格式。 > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. `` > > 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. `` > > 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 > @@ -298,19 +298,19 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 14. `enclose` +> 14. `` > > 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 > > 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 > -> 15. `escape` +> 15. `` > > 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 > -**5. `data_source_properties` 中的可选属性** +**5. `` 中的可选属性** -> 1. `kafka_partitions/kafka_offsets` +> 1. 
`/` > > 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 > @@ -334,7 +334,7 @@ FROM [] > > 注意,时间格式不能和 OFFSET 格式混用。 > -> 2. `property` +> 2. `` > > 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 > @@ -349,7 +349,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> 2.1 使用 SSL 连接 Kafka 时,需要指定以下参数: > > ```text > "property.security.protocol" = "ssl", @@ -373,7 +373,7 @@ FROM [] > > 分别用于指定 client 的 public key,private key 以及 private key 的密码。 > -> 2. 指定 kafka partition 的默认起始 offset +> 2.2 指定 kafka partition 的默认起始 offset > > 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 > @@ -385,7 +385,7 @@ FROM [] > "property.kafka_default_offsets" = "OFFSET_BEGINNING" > ``` -**6. `COMMENT`** +**6. ``** > 例行导入任务的注释信息。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 4653c3b3c7ed8..71e5926fd0e40 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 3371f39a0d5e8..51a4ef065f541 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 9931c745df012..dddb0d3c5ee5b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ under the License. ## 语法 ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## 必选参数 -**1. `load_name`** +**1. 
``** > 例行导入作业名称 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 252791c572402..3c7edd55587c1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -31,13 +31,12 @@ under the License. ## 语法 ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 要查看的例行导入作业名称。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 24da597680db2..023bd2eaa9d77 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -31,7 +31,7 @@ under the License. ## 语法 ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## 可选参数 @@ -40,7 +40,7 @@ SHOW [] ROUTINE LOAD [FOR ]; > 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 -**2. `[FOR jobName]`** +**2. `[FOR ]`** > 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 > diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 57c5f8abfecee..718ff57774474 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ STOP ROUTINE LOAD FOR ; ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要停止的作业名称。可以是以下形式: > -> - `job_name`: 停止当前数据库下指定名称的作业 -> - `db_name.job_name`: 停止指定数据库下指定名称的作业 +> - ``: 停止当前数据库下指定名称的作业 +> - `.`: 停止指定数据库下指定名称的作业 ## 权限控制 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 68c1c2760d937..0fe9afb2b838d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -33,21 +33,23 @@ under the License. ## 语法 ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 > > 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -**2. `job_properties`** +## 可选参数 + +**1. 
``** > 指定需要修改的作业参数。目前支持修改的参数包括: > @@ -66,13 +68,7 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** - -> 数据源的类型。当前支持: -> -> - KAFKA - -**4. `data_source_properties`** +**2. ``** > 数据源的相关属性。目前支持: > @@ -82,6 +78,12 @@ FROM > - kafka_topic > - 自定义 property,如 property.group.id +**3. ``** + +> 数据源的类型。当前支持: +> +> - KAFKA + ## 权限控制 执行此 SQL 命令的用户必须至少具有以下权限: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 4ac39f6df86a4..55eb01d0788d1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -38,27 +38,27 @@ under the License. ## 语法 ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 -**2. `FROM data_source`** +**2. `FROM `** > 数据源的类型。当前支持:KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 > @@ -66,7 +66,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` -> 2. `kafka_topic` +> 2. `` > > 指定要订阅的 Kafka 的 topic。 > ```text @@ -75,7 +75,7 @@ FROM [] ## 可选参数 -**1. `tbl_name`** +**1. ``** > 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 > @@ -87,13 +87,13 @@ FROM [] > tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 -**2. `merge_type`** +**2. ``** > 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 -**3. `load_properties`** +**3. ``** > 用于描述导入数据。组成如下: > @@ -107,13 +107,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > 指定列分隔符,默认为 `\t` > > `COLUMNS TERMINATED BY ","` -> 2. `columns_mapping` +> 2. `` > > 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -121,7 +121,7 @@ FROM [] > > tips: 动态表不支持此参数。 -> 3. `preceding_filter` +> 3. `` > > 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -129,7 +129,7 @@ FROM [] > > tips: 动态表不支持此参数。 > -> 4. `where_predicates` +> 4. `` > > 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -137,7 +137,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 -> 5. `partitions` +> 5. `` > > 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 > @@ -145,7 +145,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 6. `DELETE ON` +> 6. `` > > 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 > @@ -153,13 +153,13 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 7. `ORDER BY` +> 7. `` > > 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -**4. `job_properties`** +**4. ``** > 用于指定例行导入作业的通用参数。 > @@ -172,7 +172,7 @@ FROM [] > > 目前我们支持以下参数: > -> 1. `desired_concurrent_number` +> 1. `` > > 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 > @@ -180,7 +180,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. 
`max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > 这三个参数分别表示: > @@ -196,7 +196,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 > @@ -204,7 +204,7 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 4. `strict_mode` +> 4. `` > > 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: > @@ -242,55 +242,55 @@ FROM [] > > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 > -> 5. `timezone` +> 5. `` > > 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > 指定导入数据格式,默认是 csv,支持 json 格式。 > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. `` > > 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. `` > > 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 > @@ -298,19 +298,19 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 14. `enclose` +> 14. `` > > 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 > > 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 > -> 15. `escape` +> 15. `` > > 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 > -**5. `data_source_properties` 中的可选属性** +**5. `` 中的可选属性** -> 1. `kafka_partitions/kafka_offsets` +> 1. `/` > > 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 > @@ -334,7 +334,7 @@ FROM [] > > 注意,时间格式不能和 OFFSET 格式混用。 > -> 2. `property` +> 2. `` > > 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 > @@ -349,7 +349,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> 2.1 使用 SSL 连接 Kafka 时,需要指定以下参数: > > ```text > "property.security.protocol" = "ssl", @@ -373,7 +373,7 @@ FROM [] > > 分别用于指定 client 的 public key,private key 以及 private key 的密码。 > -> 2. 指定 kafka partition 的默认起始 offset +> 2.2 指定 kafka partition 的默认起始 offset > > 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 > @@ -385,7 +385,7 @@ FROM [] > "property.kafka_default_offsets" = "OFFSET_BEGINNING" > ``` -**6. `COMMENT`** +**6. 
``** > 例行导入任务的注释信息。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 4653c3b3c7ed8..71e5926fd0e40 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 3371f39a0d5e8..51a4ef065f541 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 9931c745df012..dddb0d3c5ee5b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ under the License. ## 语法 ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## 必选参数 -**1. `load_name`** +**1. ``** > 例行导入作业名称 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 252791c572402..3c7edd55587c1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -31,13 +31,12 @@ under the License. ## 语法 ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## 必选参数 -**1. `job_name`** +**1. 
``** > 要查看的例行导入作业名称。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 24da597680db2..023bd2eaa9d77 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -31,7 +31,7 @@ under the License. ## 语法 ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## 可选参数 @@ -40,7 +40,7 @@ SHOW [] ROUTINE LOAD [FOR ]; > 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 -**2. `[FOR jobName]`** +**2. `[FOR ]`** > 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 > diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 57c5f8abfecee..718ff57774474 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ STOP ROUTINE LOAD FOR ; ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要停止的作业名称。可以是以下形式: > -> - `job_name`: 停止当前数据库下指定名称的作业 -> - `db_name.job_name`: 停止指定数据库下指定名称的作业 +> - ``: 停止当前数据库下指定名称的作业 +> - `.`: 停止指定数据库下指定名称的作业 ## 权限控制 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 68c1c2760d937..0fe9afb2b838d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -33,21 +33,23 @@ under the License. ## 语法 ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 指定要修改的作业名称。标识符必须以字母字符开头,并且不能包含空格或特殊字符,除非整个标识符字符串用反引号括起来。 > > 标识符不能使用保留关键字。有关更多详细信息,请参阅标识符要求和保留关键字。 -**2. `job_properties`** +## 可选参数 + +**1. ``** > 指定需要修改的作业参数。目前支持修改的参数包括: > @@ -66,13 +68,7 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** - -> 数据源的类型。当前支持: -> -> - KAFKA - -**4. `data_source_properties`** +**2. ``** > 数据源的相关属性。目前支持: > @@ -82,6 +78,12 @@ FROM > - kafka_topic > - 自定义 property,如 property.group.id +**3. 
``** + +> 数据源的类型。当前支持: +> +> - KAFKA + ## 权限控制 执行此 SQL 命令的用户必须至少具有以下权限: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index 4ac39f6df86a4..55eb01d0788d1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -38,27 +38,27 @@ under the License. ## 语法 ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## 必选参数 -**1. `[db.]job_name`** +**1. `[.]`** > 导入作业的名称,在同一个 database 内,相同名称只能有一个 job 在运行。 -**2. `FROM data_source`** +**2. `FROM `** > 数据源的类型。当前支持:KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka 的 broker 连接信息。格式为 ip:host。多个 broker 之间以逗号分隔。 > @@ -66,7 +66,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` -> 2. `kafka_topic` +> 2. `` > > 指定要订阅的 Kafka 的 topic。 > ```text @@ -75,7 +75,7 @@ FROM [] ## 可选参数 -**1. `tbl_name`** +**1. ``** > 指定需要导入的表的名称,可选参数,如果不指定,则采用动态表的方式,这个时候需要 Kafka 中的数据包含表名的信息。 > @@ -87,13 +87,13 @@ FROM [] > tips: 动态表不支持 `columns_mapping` 参数。如果你的表结构和 Doris 中的表结构一致,且存在大量的表信息需要导入,那么这种方式将是不二选择。 -**2. `merge_type`** +**2. ``** > 数据合并类型。默认为 APPEND,表示导入的数据都是普通的追加写操作。MERGE 和 DELETE 类型仅适用于 Unique Key 模型表。其中 MERGE 类型需要配合 [DELETE ON] 语句使用,以标注 Delete Flag 列。而 DELETE 类型则表示导入的所有数据皆为删除数据。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的类型,否则会导致导入失败。 -**3. `load_properties`** +**3. ``** > 用于描述导入数据。组成如下: > @@ -107,13 +107,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > 指定列分隔符,默认为 `\t` > > `COLUMNS TERMINATED BY ","` -> 2. `columns_mapping` +> 2. `` > > 用于指定文件列和表中列的映射关系,以及各种列转换等。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -121,7 +121,7 @@ FROM [] > > tips: 动态表不支持此参数。 -> 3. `preceding_filter` +> 3. `` > > 过滤原始数据。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -129,7 +129,7 @@ FROM [] > > tips: 动态表不支持此参数。 > -> 4. `where_predicates` +> 4. `` > > 根据条件对导入的数据进行过滤。关于这部分详细介绍,可以参阅 [列的映射,转换与过滤] 文档。 > @@ -137,7 +137,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表的列,否则会导致导入失败。通常在使用动态多表的时候,我们仅建议通用公共列使用此参数。 -> 5. `partitions` +> 5. `` > > 指定导入目的表的哪些 partition 中。如果不指定,则会自动导入到对应的 partition 中。 > @@ -145,7 +145,7 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 6. `DELETE ON` +> 6. `` > > 需配合 MEREGE 导入模式一起使用,仅针对 Unique Key 模型的表。用于指定导入数据中表示 Delete Flag 的列和计算关系。 > @@ -153,13 +153,13 @@ FROM [] > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -> 7. `ORDER BY` +> 7. `` > > 仅针对 Unique Key 模型的表。用于指定导入数据中表示 Sequence Col 的列。主要用于导入时保证数据顺序。 > > tips: 当使用动态多表的时候,请注意此参数应该符合每张动态表,否则会导致导入失败。 -**4. `job_properties`** +**4. ``** > 用于指定例行导入作业的通用参数。 > @@ -172,7 +172,7 @@ FROM [] > > 目前我们支持以下参数: > -> 1. `desired_concurrent_number` +> 1. `` > > 期望的并发度。一个例行导入作业会被分成多个子任务执行。这个参数指定一个作业最多有多少任务可以同时执行。必须大于 0。默认为 5。 > @@ -180,7 +180,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > 这三个参数分别表示: > @@ -196,7 +196,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > 采样窗口内,允许的最大错误行数。必须大于等于 0。默认是 0,即不允许有错误行。 > @@ -204,7 +204,7 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 4. `strict_mode` +> 4. 
`` > > 是否开启严格模式,默认为关闭。如果开启后,非空原始数据的列类型变换如果结果为 NULL,则会被过滤。指定方式为: > @@ -242,55 +242,55 @@ FROM [] > > 注意:10 虽然是一个超过范围的值,但是因为其类型符合 decimal 的要求,所以 strict mode 对其不产生影响。10 最后会在其他 ETL 处理流程中被过滤。但不会被 strict mode 过滤。 > -> 5. `timezone` +> 5. `` > > 指定导入作业所使用的时区。默认为使用 Session 的 timezone 参数。该参数会影响所有导入涉及的和时区有关的函数结果。 > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > 指定导入数据格式,默认是 csv,支持 json 格式。 > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > 当导入数据格式为 json 时,可以通过 jsonpaths 指定抽取 Json 数据中的字段。 > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. `` > > 当导入数据格式为 json 时,strip_outer_array 为 true 表示 Json 数据以数组的形式展现,数据中的每一个元素将被视为一行数据。默认值是 false。 > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > 当导入数据格式为 json 时,可以通过 json_root 指定 Json 数据的根节点。Doris 将通过 json_root 抽取根节点的元素进行解析。默认为空。 > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > 整型,用于设置发送批处理数据的并行度,如果并行度的值超过 BE 配置中的 `max_send_batch_parallelism_per_job`,那么作为协调点的 BE 将使用 `max_send_batch_parallelism_per_job` 的值。 > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > 布尔类型,为 true 表示支持一个任务只导入数据到对应分区的一个 tablet,默认值为 false,该参数只允许在对带有 random 分桶的 olap 表导数的时候设置。 > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. `` > > 布尔类型,为 true 表示使用部分列更新,默认值为 false,该参数只允许在表模型为 Unique 且采用 Merge on Write 时设置。一流多表不支持此参数。 > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > 采样窗口内,允许的最大过滤率。必须在大于等于0到小于等于1之间。默认值是 0。 > @@ -298,19 +298,19 @@ FROM [] > > 被 where 条件过滤掉的行不算错误行。 > -> 14. `enclose` +> 14. `` > > 包围符。当 csv 数据字段中含有行分隔符或列分隔符时,为防止意外截断,可指定单字节字符作为包围符起到保护作用。例如列分隔符为",",包围符为"'",数据为"a,'b,c'",则"b,c"会被解析为一个字段。 > > 注意:当 enclose 设置为`"`时,trim_double_quotes 一定要设置为 true。 > -> 15. `escape` +> 15. `` > > 转义符。用于转义在csv字段中出现的与包围符相同的字符。例如数据为"a,'b,'c'",包围符为"'",希望"b,'c被作为一个字段解析,则需要指定单字节转义符,例如 `\`,然后将数据修改为 `a,'b,\'c'`。 > -**5. `data_source_properties` 中的可选属性** +**5. `` 中的可选属性** -> 1. `kafka_partitions/kafka_offsets` +> 1. `/` > > 指定需要订阅的 kafka partition,以及对应的每个 partition 的起始 offset。如果指定时间,则会从大于等于该时间的最近一个 offset 处开始消费。 > @@ -334,7 +334,7 @@ FROM [] > > 注意,时间格式不能和 OFFSET 格式混用。 > -> 2. `property` +> 2. `` > > 指定自定义 kafka 参数。功能等同于 kafka shell 中 "--property" 参数。 > @@ -349,7 +349,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. 使用 SSL 连接 Kafka 时,需要指定以下参数: +> 2.1 使用 SSL 连接 Kafka 时,需要指定以下参数: > > ```text > "property.security.protocol" = "ssl", @@ -373,7 +373,7 @@ FROM [] > > 分别用于指定 client 的 public key,private key 以及 private key 的密码。 > -> 2. 指定 kafka partition 的默认起始 offset +> 2.2 指定 kafka partition 的默认起始 offset > > 如果没有指定 `kafka_partitions/kafka_offsets`,默认消费所有分区。 > @@ -385,7 +385,7 @@ FROM [] > "property.kafka_default_offsets" = "OFFSET_BEGINNING" > ``` -**6. `COMMENT`** +**6. ``** > 例行导入任务的注释信息。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index 4653c3b3c7ed8..71e5926fd0e40 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. 
`job_name`** +**1. ``** > 指定要暂停的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 3371f39a0d5e8..51a4ef065f541 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -31,12 +31,12 @@ under the License. ## 语法 ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要重启的作业名称。如果指定了 ALL,则无需指定 job_name。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index 9931c745df012..dddb0d3c5ee5b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ under the License. ## 语法 ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## 必选参数 -**1. `load_name`** +**1. ``** > 例行导入作业名称 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 252791c572402..3c7edd55587c1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -31,13 +31,12 @@ under the License. ## 语法 ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## 必选参数 -**1. `job_name`** +**1. ``** > 要查看的例行导入作业名称。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 24da597680db2..023bd2eaa9d77 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -31,7 +31,7 @@ under the License. ## 语法 ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## 可选参数 @@ -40,7 +40,7 @@ SHOW [] ROUTINE LOAD [FOR ]; > 可选参数。如果指定,则会显示所有作业(包括已停止或取消的作业)。否则只显示当前正在运行的作业。 -**2. `[FOR jobName]`** +**2. 
`[FOR ]`** > 可选参数。指定要查看的作业名称。如果不指定,则显示当前数据库下的所有作业。 > diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index 57c5f8abfecee..718ff57774474 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ STOP ROUTINE LOAD FOR ; ## 必选参数 -**1. `job_name`** +**1. ``** > 指定要停止的作业名称。可以是以下形式: > -> - `job_name`: 停止当前数据库下指定名称的作业 -> - `db_name.job_name`: 停止指定数据库下指定名称的作业 +> - ``: 停止当前数据库下指定名称的作业 +> - `.`: 停止指定数据库下指定名称的作业 ## 权限控制 diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 0167c8afcfabc..a5f3a3621d67d 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -32,21 +32,23 @@ This syntax is used to modify an existing routine load job. Only jobs in PAUSED ## Syntax ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. `[.]`** > Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. > > The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. -**2. `job_properties`** +## Optional Parameters + +**1. ``** > Specifies the job parameters to be modified. Currently supported parameters include: > @@ -65,21 +67,21 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** +**2. ``** -> The type of data source. Currently supports: +> Properties related to the data source. Currently supports: > -> - KAFKA +> - `` +> - `` +> - `` +> - `` +> - Custom properties, such as `` -**4. `data_source_properties`** +**3. ``** -> Properties related to the data source. Currently supports: +> The type of data source. Currently supports: > -> - kafka_partitions -> - kafka_offsets -> - kafka_broker_list -> - kafka_topic -> - Custom properties, such as property.group.id +> - KAFKA ## Privilege Control diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index e0eec79594f64..c92f27a7c375f 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -34,27 +34,27 @@ Currently, it only supports importing CSV or Json format data from Kafka through ## Syntax ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. 
`[.]`** > The name of the import job. Within the same database, only one job with the same name can be running. -**2. `FROM data_source`** +**2. `FROM `** > The type of data source. Currently supports: KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. > @@ -62,7 +62,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` > -> 2. `kafka_topic` +> 2. `` > > Specifies the Kafka topic to subscribe to. > ```text @@ -71,7 +71,7 @@ FROM [] ## Optional Parameters -**1. `tbl_name`** +**1. ``** > Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. > @@ -82,7 +82,7 @@ FROM [] > > Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. -**2. `merge_type`** +**2. ``** > Data merge type. Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. > @@ -102,13 +102,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > Specifies the column separator, defaults to `\t` > > `COLUMNS TERMINATED BY ","` > -> 2. `columns_mapping` +> 2. `` > > Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. > @@ -116,7 +116,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 3. `preceding_filter` +> 3. `` > > Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. > @@ -124,7 +124,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 4. `where_predicates` +> 4. `` > > Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. > @@ -132,7 +132,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. > -> 5. `partitions` +> 5. `` > > Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. > @@ -140,7 +140,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 6. `DELETE ON` +> 6. `` > > Must be used with MERGE import mode, only applicable to Unique Key model tables. Used to specify the Delete Flag column and calculation relationship in the imported data. > @@ -148,13 +148,13 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 7. `ORDER BY` +> 7. 
`` > > Only applicable to Unique Key model tables. Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. -**4. `job_properties`** +**4. ``** > Used to specify general parameters for routine import jobs. > @@ -167,7 +167,7 @@ FROM [] > > Currently, we support the following parameters: > -> 1. `desired_concurrent_number` +> 1. `` > > The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. > @@ -175,7 +175,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > These three parameters represent: > @@ -191,7 +191,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. Default is 0, meaning no error rows are allowed. > @@ -199,7 +199,7 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 4. `strict_mode` +> 4. `` > > Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: > @@ -237,55 +237,55 @@ FROM [] > > Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. > -> 5. `timezone` +> 5. `` > > Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > Specifies the import data format, default is csv, json format is supported. > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > When importing json format data, jsonpaths can be used to specify fields to extract from Json data. > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. `` > > When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. 
`` > > Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. > @@ -293,19 +293,19 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 14. `enclose` +> 14. `` > > Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. > > Note: When enclose is set to `"`, trim_double_quotes must be set to true. > -> 15. `escape` +> 15. `` > > Escape character. Used to escape characters in csv fields that are the same as the enclosure character. For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. > **5. Optional properties in `data_source_properties`** -> 1. `kafka_partitions/kafka_offsets` +> 1. `/` > > Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. > @@ -329,7 +329,7 @@ FROM [] > > Note: Time format cannot be mixed with OFFSET format. > -> 2. `property` +> 2. `` > > Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. > @@ -344,7 +344,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> 2.1 When using SSL to connect to Kafka, the following parameters need to be specified: > > ```text > "property.security.protocol" = "ssl", @@ -368,11 +368,11 @@ FROM [] > > Used to specify the client's public key, private key, and private key password respectively. > -> 2. Specify default starting offset for kafka partitions +> 2.2 Specify default starting offset for kafka partitions > -> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> If `/` is not specified, all partitions will be consumed by default. > -> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> In this case, `` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. > > Example: > diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index ebc06d9229ec1..9f3a53d0ad9fe 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,19 +25,19 @@ under the License. --> -## Description## Description +## Description This syntax is used to pause one or all Routine Load jobs. 
Paused jobs can be restarted using the RESUME command. ## Syntax ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to pause. If ALL is specified, job_name is not required. diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 7974c1c3542d6..90039afe32140 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -33,12 +33,12 @@ This syntax is used to restart one or all paused Routine Load jobs. The restarte ## Syntax ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to restart. If ALL is specified, job_name is not required. diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index fc079f8901cbc..50f8edf5eec2f 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ The result shows the current consuming Kafka partitions and their corresponding ## Syntax ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## Required Parameters -**1. `load_name`** +**1. ``** > The name of the routine load job diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 6f7d84542242a..70d0a7bcc600f 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -33,13 +33,12 @@ This syntax is used to view the currently running subtasks of a specified Routin ## Syntax ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## Required Parameters -**1. `job_name`** +**1. ``** > The name of the routine load job to view. diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 2d502603eedb6..528a56c412acd 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -32,7 +32,7 @@ This statement is used to display the running status of Routine Load jobs. You c ## Syntax ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## Optional Parameters @@ -41,14 +41,14 @@ SHOW [] ROUTINE LOAD [FOR ]; > Optional parameter. 
If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. -**2. `[FOR jobName]`** +**2. `[FOR ]`** > Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. > > Supports the following formats: > -> - `job_name`: Shows the job with the specified name in the current database -> - `db_name.job_name`: Shows the job with the specified name in the specified database +> - ``: Shows the job with the specified name in the current database +> - `.`: Shows the job with the specified name in the specified database ## Return Results diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index ffea7818a144e..efa758e2b49a6 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -38,12 +38,12 @@ STOP ROUTINE LOAD FOR ; ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to stop. It can be in the following formats: > -> - `job_name`: Stop a job with the specified name in the current database -> - `db_name.job_name`: Stop a job with the specified name in the specified database +> - ``: Stop a job with the specified name in the current database +> - `.`: Stop a job with the specified name in the specified database ## Permission Control diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md index 0167c8afcfabc..a5f3a3621d67d 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/ALTER-ROUTINE-LOAD.md @@ -32,21 +32,23 @@ This syntax is used to modify an existing routine load job. Only jobs in PAUSED ## Syntax ```sql -ALTER ROUTINE LOAD FOR [] +ALTER ROUTINE LOAD FOR [.] [] -FROM +FROM [] [] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. `[.]`** > Specifies the name of the job to be modified. The identifier must begin with a letter character and cannot contain spaces or special characters unless the entire identifier string is enclosed in backticks. > > The identifier cannot use reserved keywords. For more details, please refer to identifier requirements and reserved keywords. -**2. `job_properties`** +## Optional Parameters + +**1. ``** > Specifies the job parameters to be modified. Currently supported parameters include: > @@ -65,21 +67,21 @@ FROM > - partial_columns > - max_filter_ratio -**3. `data_source`** +**2. ``** -> The type of data source. Currently supports: +> Properties related to the data source. Currently supports: > -> - KAFKA +> - `` +> - `` +> - `` +> - `` +> - Custom properties, such as `` -**4. `data_source_properties`** +**3. ``** -> Properties related to the data source. Currently supports: +> The type of data source. 
Currently supports: > -> - kafka_partitions -> - kafka_offsets -> - kafka_broker_list -> - kafka_topic -> - Custom properties, such as property.group.id +> - KAFKA ## Privilege Control diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md index e0eec79594f64..c92f27a7c375f 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/CREATE-ROUTINE-LOAD.md @@ -34,27 +34,27 @@ Currently, it only supports importing CSV or Json format data from Kafka through ## Syntax ```sql -CREATE ROUTINE LOAD [] [ON ] +CREATE ROUTINE LOAD [.] [ON ] [] [] [] FROM [] -[] +[COMMENT ""] ``` ## Required Parameters -**1. `[db.]job_name`** +**1. `[.]`** > The name of the import job. Within the same database, only one job with the same name can be running. -**2. `FROM data_source`** +**2. `FROM `** > The type of data source. Currently supports: KAFKA -**3. `data_source_properties`** +**3. ``** -> 1. `kafka_broker_list` +> 1. `` > > Kafka broker connection information. Format is ip:host. Multiple brokers are separated by commas. > @@ -62,7 +62,7 @@ FROM [] > "kafka_broker_list" = "broker1:9092,broker2:9092" > ``` > -> 2. `kafka_topic` +> 2. `` > > Specifies the Kafka topic to subscribe to. > ```text @@ -71,7 +71,7 @@ FROM [] ## Optional Parameters -**1. `tbl_name`** +**1. ``** > Specifies the name of the table to import into. This is an optional parameter. If not specified, the dynamic table method is used, which requires the data in Kafka to contain table name information. > @@ -82,7 +82,7 @@ FROM [] > > Tips: Dynamic tables do not support the `columns_mapping` parameter. If your table structure matches the table structure in Doris and there is a large amount of table information to import, this method will be the best choice. -**2. `merge_type`** +**2. ``** > Data merge type. Default is APPEND, which means the imported data are ordinary append write operations. MERGE and DELETE types are only available for Unique Key model tables. The MERGE type needs to be used with the [DELETE ON] statement to mark the Delete Flag column. The DELETE type means that all imported data are deleted data. > @@ -102,13 +102,13 @@ FROM [] > [ORDER BY] > ``` > -> 1. `column_separator` +> 1. `` > > Specifies the column separator, defaults to `\t` > > `COLUMNS TERMINATED BY ","` > -> 2. `columns_mapping` +> 2. `` > > Used to specify the mapping relationship between file columns and table columns, as well as various column transformations. For a detailed introduction to this part, you can refer to the [Column Mapping, Transformation and Filtering] document. > @@ -116,7 +116,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 3. `preceding_filter` +> 3. `` > > Filter raw data. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. > @@ -124,7 +124,7 @@ FROM [] > > Tips: Dynamic tables do not support this parameter. > -> 4. `where_predicates` +> 4. `` > > Filter imported data based on conditions. For detailed information about this part, please refer to the [Column Mapping, Transformation and Filtering] document. 
> @@ -132,7 +132,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match the columns of each dynamic table, otherwise the import will fail. When using dynamic multiple tables, we only recommend using this parameter for common public columns. > -> 5. `partitions` +> 5. `` > > Specify which partitions of the destination table to import into. If not specified, data will be automatically imported into the corresponding partitions. > @@ -140,7 +140,7 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 6. `DELETE ON` +> 6. `` > > Must be used with MERGE import mode, only applicable to Unique Key model tables. Used to specify the Delete Flag column and calculation relationship in the imported data. > @@ -148,13 +148,13 @@ FROM [] > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. > -> 7. `ORDER BY` +> 7. `` > > Only applicable to Unique Key model tables. Used to specify the Sequence Col column in the imported data. Mainly used to ensure data order during import. > > Tips: When using dynamic multiple tables, please note that this parameter should match each dynamic table, otherwise the import will fail. -**4. `job_properties`** +**4. ``** > Used to specify general parameters for routine import jobs. > @@ -167,7 +167,7 @@ FROM [] > > Currently, we support the following parameters: > -> 1. `desired_concurrent_number` +> 1. `` > > The desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks can run simultaneously for a job. Must be greater than 0. Default is 5. > @@ -175,7 +175,7 @@ FROM [] > > `"desired_concurrent_number" = "3"` > -> 2. `max_batch_interval/max_batch_rows/max_batch_size` +> 2. `//` > > These three parameters represent: > @@ -191,7 +191,7 @@ FROM [] > "max_batch_size" = "209715200" > ``` > -> 3. `max_error_number` +> 3. `` > > Maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. Default is 0, meaning no error rows are allowed. > @@ -199,7 +199,7 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 4. `strict_mode` +> 4. `` > > Whether to enable strict mode, default is off. If enabled, when non-null original data's column type conversion results in NULL, it will be filtered. Specified as: > @@ -237,55 +237,55 @@ FROM [] > > Note: Although 10 is a value exceeding the range, because its type meets decimal requirements, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows, but won't be filtered by strict mode. > -> 5. `timezone` +> 5. `` > > Specifies the timezone used for the import job. Defaults to the Session's timezone parameter. This parameter affects all timezone-related function results involved in the import. > > `"timezone" = "Asia/Shanghai"` > -> 6. `format` +> 6. `` > > Specifies the import data format, default is csv, json format is supported. > > `"format" = "json"` > -> 7. `jsonpaths` +> 7. `` > > When importing json format data, jsonpaths can be used to specify fields to extract from Json data. > > `-H "jsonpaths: [\"$.k2\", \"$.k1\"]"` > -> 8. `strip_outer_array` +> 8. 
`` > > When importing json format data, strip_outer_array set to true indicates that Json data is presented as an array, where each element in the data will be treated as a row. Default value is false. > > `-H "strip_outer_array: true"` > -> 9. `json_root` +> 9. `` > > When importing json format data, json_root can be used to specify the root node of Json data. Doris will parse elements extracted from the root node through json_root. Default is empty. > > `-H "json_root: $.RECORDS"` > -> 10. `send_batch_parallelism` +> 10. `` > > Integer type, used to set the parallelism of sending batch data. If the parallelism value exceeds `max_send_batch_parallelism_per_job` in BE configuration, the BE serving as the coordination point will use the value of `max_send_batch_parallelism_per_job`. > > `"send_batch_parallelism" = "10"` > -> 11. `load_to_single_tablet` +> 11. `` > > Boolean type, true indicates support for a task to import data to only one tablet of the corresponding partition, default value is false. This parameter is only allowed to be set when importing data to olap tables with random bucketing. > > `"load_to_single_tablet" = "true"` > -> 12. `partial_columns` +> 12. `` > > Boolean type, true indicates using partial column updates, default value is false. This parameter is only allowed to be set when the table model is Unique and uses Merge on Write. Dynamic multiple tables do not support this parameter. > > `"partial_columns" = "true"` > -> 13. `max_filter_ratio` +> 13. `` > > Maximum filter ratio allowed within the sampling window. Must be between greater than or equal to 0 and less than or equal to 1. Default value is 0. > @@ -293,19 +293,19 @@ FROM [] > > Rows filtered by where conditions are not counted as error rows. > -> 14. `enclose` +> 14. `` > > Enclosure character. When csv data fields contain row or column separators, to prevent accidental truncation, a single-byte character can be specified as an enclosure for protection. For example, if the column separator is "," and the enclosure is "'", for data "a,'b,c'", "b,c" will be parsed as one field. > > Note: When enclose is set to `"`, trim_double_quotes must be set to true. > -> 15. `escape` +> 15. `` > > Escape character. Used to escape characters in csv fields that are the same as the enclosure character. For example, if the data is "a,'b,'c'", enclosure is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. > **5. Optional properties in `data_source_properties`** -> 1. `kafka_partitions/kafka_offsets` +> 1. `/` > > Specifies the kafka partitions to subscribe to and the starting offset for each partition. If a time is specified, consumption will start from the nearest offset greater than or equal to that time. > @@ -329,7 +329,7 @@ FROM [] > > Note: Time format cannot be mixed with OFFSET format. > -> 2. `property` +> 2. `` > > Specifies custom kafka parameters. Functions the same as the "--property" parameter in kafka shell. > @@ -344,7 +344,7 @@ FROM [] > "property.ssl.ca.location" = "FILE:ca.pem" > ``` > -> 1. When using SSL to connect to Kafka, the following parameters need to be specified: +> 2.1 When using SSL to connect to Kafka, the following parameters need to be specified: > > ```text > "property.security.protocol" = "ssl", @@ -368,11 +368,11 @@ FROM [] > > Used to specify the client's public key, private key, and private key password respectively. > -> 2. 
Specify default starting offset for kafka partitions +> 2.2 Specify default starting offset for kafka partitions > -> If `kafka_partitions/kafka_offsets` is not specified, all partitions will be consumed by default. +> If `/` is not specified, all partitions will be consumed by default. > -> In this case, `kafka_default_offsets` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. +> In this case, `` can be specified to set the starting offset. Default is `OFFSET_END`, meaning subscription starts from the end. > > Example: > diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md index ebc06d9229ec1..9f3a53d0ad9fe 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/PAUSE-ROUTINE-LOAD.md @@ -25,19 +25,19 @@ under the License. --> -## Description## Description +## Description This syntax is used to pause one or all Routine Load jobs. Paused jobs can be restarted using the RESUME command. ## Syntax ```sql -PAUSE [] ROUTINE LOAD FOR +PAUSE [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to pause. If ALL is specified, job_name is not required. diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md index 7974c1c3542d6..90039afe32140 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/RESUME-ROUTINE-LOAD.md @@ -33,12 +33,12 @@ This syntax is used to restart one or all paused Routine Load jobs. The restarte ## Syntax ```sql -RESUME [] ROUTINE LOAD FOR +RESUME [ALL] ROUTINE LOAD FOR ``` ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to restart. If ALL is specified, job_name is not required. diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md index fc079f8901cbc..50f8edf5eec2f 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-CREATE-ROUTINE-LOAD.md @@ -36,12 +36,12 @@ The result shows the current consuming Kafka partitions and their corresponding ## Syntax ```sql -SHOW [] CREATE ROUTINE LOAD for ; +SHOW [ALL] CREATE ROUTINE LOAD for ; ``` ## Required Parameters -**1. `load_name`** +**1. 
``** > The name of the routine load job diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md index 6f7d84542242a..70d0a7bcc600f 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD-TASK.md @@ -33,13 +33,12 @@ This syntax is used to view the currently running subtasks of a specified Routin ## Syntax ```sql -SHOW ROUTINE LOAD TASK -WHERE JobName = ; +SHOW ROUTINE LOAD TASK WHERE JobName = ; ``` ## Required Parameters -**1. `job_name`** +**1. ``** > The name of the routine load job to view. diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md index 2d502603eedb6..528a56c412acd 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-ROUTINE-LOAD.md @@ -32,7 +32,7 @@ This statement is used to display the running status of Routine Load jobs. You c ## Syntax ```sql -SHOW [] ROUTINE LOAD [FOR ]; +SHOW [ALL] ROUTINE LOAD [FOR ]; ``` ## Optional Parameters @@ -41,14 +41,14 @@ SHOW [] ROUTINE LOAD [FOR ]; > Optional parameter. If specified, all jobs (including stopped or cancelled jobs) will be displayed. Otherwise, only currently running jobs will be shown. -**2. `[FOR jobName]`** +**2. `[FOR ]`** > Optional parameter. Specifies the job name to view. If not specified, all jobs under the current database will be displayed. > > Supports the following formats: > -> - `job_name`: Shows the job with the specified name in the current database -> - `db_name.job_name`: Shows the job with the specified name in the specified database +> - ``: Shows the job with the specified name in the current database +> - `.`: Shows the job with the specified name in the specified database ## Return Results diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md index ffea7818a144e..efa758e2b49a6 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/STOP-ROUTINE-LOAD.md @@ -38,12 +38,12 @@ STOP ROUTINE LOAD FOR ; ## Required Parameters -**1. `job_name`** +**1. ``** > Specifies the name of the job to stop. It can be in the following formats: > -> - `job_name`: Stop a job with the specified name in the current database -> - `db_name.job_name`: Stop a job with the specified name in the specified database +> - ``: Stop a job with the specified name in the current database +> - `.`: Stop a job with the specified name in the specified database ## Permission Control
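
A minimal sketch of how the load privilege that these routine load statements depend on could be granted in advance. The user `example_user`, the host wildcard, and the table `example_db.example_tbl` are hypothetical placeholders; adapt the grant target and scope to your own deployment.

```sql
-- Hypothetical example: give example_user the table-level load privilege
-- required to run the routine load statements documented above.
GRANT LOAD_PRIV ON example_db.example_tbl TO 'example_user'@'%';
```

Where one user operates routine load jobs across many tables, the same privilege can instead be granted at the database level (for example `ON example_db.*`).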