Squall Local Configs
We will explain the content of the config file at squall-$VERSION/test/squall/confs/local/0_01G_hyracks_ncl:
DIP_DISTRIBUTED false
DIP_QUERY_NAME hyracks
DIP_TOPOLOGY_NAME_PREFIX username
DIP_DATA_ROOT ../test/data/tpch/
DIP_SQL_ROOT ../test/squall/sql_queries/
DIP_SCHEMA_PATH ../test/squall/schemas/tpch.txt
DIP_RESULT_ROOT ../test/results/
# DIP_DB_SIZE is in GBs
DIP_DB_SIZE 0.01
########################################
#DIP_OPTIMIZER_TYPE INDEX_SIMPLE
#DIP_MAX_SRC_PAR 1
#DIP_OPTIMIZER_TYPE INDEX_RULE_BUSHY
#DIP_MAX_SRC_PAR 1
#DIP_OPTIMIZER_TYPE NAME_MANUAL_PAR_LEFTY
#DIP_PLAN CUSTOMER:2,ORDERS:3:4
#DIP_OPTIMIZER_TYPE NAME_MANUAL_COST_LEFTY
#DIP_PLAN CUSTOMER,ORDERS
#DIP_TOTAL_SRC_PAR 20
#DIP_OPTIMIZER_TYPE NAME_RULE_LEFTY
#DIP_TOTAL_SRC_PAR 20
DIP_OPTIMIZER_TYPE NAME_COST_LEFTY
DIP_TOTAL_SRC_PAR 20
########################################
#below are unlikely to change
DIP_EXTENSION .tbl
DIP_READ_SPLIT_DELIMITER \|
DIP_GLOBAL_ADD_DELIMITER |
DIP_GLOBAL_SPLIT_DELIMITER \|
DIP_ACK_EVERY_TUPLE false
DIP_KILL_AT_THE_END true
# Storage manager parameters
# Storage directory for local runs
STORAGE_LOCAL_DIR /tmp/ramdisk
# Storage directory for cluster runs
STORAGE_DIP_DIR /export/home/squalldata/storage
STORAGE_COLD_START true
MEMORY_SIZE_MB 4096
To distinguish Squall parameters from Storm parameters, we use the prefix DIP for Squall, which is short for Distributed Incremental Processing. DIP_DISTRIBUTED must be false to execute the query plan in Local Mode. DIP_QUERY_NAME must correspond to a query from DIP_SQL_ROOT (which is set to squall-$VERSION/test/squall/sql_queries/). In this case, DIP_QUERY_NAME = hyracks corresponds to the SQL query in squall-$VERSION/test/squall/sql_queries/hyracks.sql. The topology name is built by concatenating DIP_TOPOLOGY_NAME_PREFIX and DIP_TOPOLOGY_NAME. DIP_TOPOLOGY_NAME_PREFIX is there to distinguish different users.
The database path is built by concatenating DIP_DATA_ROOT, DIP_DB_SIZE and the string G. DIP_DB_SIZE is kept as a separate parameter because our optimizer uses it when allocating parallelism to Storm components.
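The concatenation described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not Squall's own code; the function name database_path is hypothetical.

```python
def database_path(data_root: str, db_size: str) -> str:
    """Concatenate DIP_DATA_ROOT, DIP_DB_SIZE and the literal string 'G'.

    Hypothetical helper for illustration; Squall builds this path internally.
    """
    return f"{data_root}{db_size}G"


# Values taken from the config file shown above.
print(database_path("../test/data/tpch/", "0.01"))  # ../test/data/tpch/0.01G
```

Note that DIP_DATA_ROOT ends with a slash in the sample config, so the size directory is appended directly after it.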
Query optimizers are described in detail at Query Optimizers. Only one optimizer can be selected at a time. To select one, uncomment its DIP_OPTIMIZER_TYPE line together with the one or two parameter lines directly below it, and comment out all the others.
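As a quick sanity check before running, one could verify that exactly one DIP_OPTIMIZER_TYPE line is active. The sketch below assumes the format visible in the sample config (one whitespace-separated key and value per line, with # starting a comment); parse_config and selected_optimizer are hypothetical helpers, not part of Squall.

```python
def parse_config(text: str) -> dict[str, list[str]]:
    """Collect active (non-comment) key/value lines; a key may appear more than once."""
    active: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out parameters
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        active.setdefault(parts[0], []).append(value)
    return active


def selected_optimizer(text: str) -> str:
    """Return the single active optimizer, or raise if zero or several are active."""
    optimizers = parse_config(text).get("DIP_OPTIMIZER_TYPE", [])
    if len(optimizers) != 1:
        raise ValueError(
            f"expected exactly one active DIP_OPTIMIZER_TYPE, found {len(optimizers)}"
        )
    return optimizers[0]


SAMPLE = """\
#DIP_OPTIMIZER_TYPE NAME_RULE_LEFTY
#DIP_TOTAL_SRC_PAR 20
DIP_OPTIMIZER_TYPE NAME_COST_LEFTY
DIP_TOTAL_SRC_PAR 20
"""

print(selected_optimizer(SAMPLE))  # NAME_COST_LEFTY
```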
A user controls only the parallelism of spouts. The exact setting depends on the optimizer type. DIP_MAX_SRC_PAR assigns the specified parallelism to each relation, except for those which contain fewer than 100 tuples; in that case, parallelism is set to 1. DIP_TOTAL_SRC_PAR refers to the total parallelism of all spouts. The total parallelism is partitioned among spouts based on the number of tuples sent down the hierarchy.
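The DIP_MAX_SRC_PAR rule can be illustrated with a small Python sketch. It covers only the per-relation rule stated above; the exact way DIP_TOTAL_SRC_PAR is partitioned is not specified here, so it is left out. The function name and the cardinalities are illustrative assumptions, not values read from Squall's schema file.

```python
SMALL_RELATION_THRESHOLD = 100  # from the text: relations with fewer than 100 tuples


def spout_parallelism(cardinalities: dict[str, int], max_src_par: int) -> dict[str, int]:
    """Assign max_src_par to every relation, except small ones, which get 1.

    Hypothetical helper; Squall applies this rule inside its optimizers.
    """
    return {
        name: 1 if tuples < SMALL_RELATION_THRESHOLD else max_src_par
        for name, tuples in cardinalities.items()
    }


# Illustrative cardinalities only.
cards = {"CUSTOMER": 1500, "ORDERS": 15000, "NATION": 25}
print(spout_parallelism(cards, 4))  # {'CUSTOMER': 4, 'ORDERS': 4, 'NATION': 1}
```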
The parallelism for Bolts is set automatically, taking into account the position of a component in the query plan, such that there is no bottleneck with the minimal number of nodes used.
Due to main memory constraints, you cannot run an arbitrarily large database with small component parallelism. For information on detecting this behavior, please consult Squall query plans vs Storm topologies, section How to know we run out of memory?.
DIP_SQL_ROOT is an absolute or relative path to the SQL queries on your local machine. DIP_ACK_EVERY_TUPLE determines how we detect that processing is done, so that the final result and the full execution time can be acquired. If the parameter is set to true, we ack each and every tuple. If it is set to false, each Spout sends a special message as the last tuple. For more information about the implications of this parameter, please consult Squall query plans vs Storm topologies, section To ack or not to ack?.
DIP_SCHEMA_PATH is an absolute or relative path to the schema information (including known cardinalities) on your local machine. By appending to DIP_RESULT_ROOT the last part of DIP_DATA_ROOT (in this case tpch), the database size information (DIP_DB_SIZE followed by G), the query name and the suffix .result, we obtain the full path to the expected result file. If this file does not exist, the comparison of actual and expected results will not take place.
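The expected result path can be sketched as follows. The text does not say which separator joins the parts; this sketch assumes each part is a directory level joined with /, which is an assumption rather than confirmed Squall behavior. The function name expected_result_path is hypothetical.

```python
import posixpath


def expected_result_path(result_root: str, data_root: str, db_size: str, query_name: str) -> str:
    """Build the expected-result path from the parts listed in the text.

    Assumption: parts are joined as directory levels with '/'.
    """
    dataset = posixpath.basename(data_root.rstrip("/"))  # last part of DIP_DATA_ROOT
    return posixpath.join(result_root, dataset, f"{db_size}G", f"{query_name}.result")


# Values taken from the config file shown above.
print(expected_result_path("../test/results/", "../test/data/tpch/", "0.01", "hyracks"))
# ../test/results/tpch/0.01G/hyracks.result
```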
Now we explain the parameters you most likely will not need to change. DIP_EXTENSION refers to the file extension of your database files. In our case, the database files are named customer.tbl, orders.tbl, etc. DIP_READ_SPLIT_DELIMITER is a regular expression used for delimiting the columns of a tuple in a database file. DIP_GLOBAL_ADD_DELIMITER and DIP_GLOBAL_SPLIT_DELIMITER are used internally in Squall for serializing and deserializing tuples between different components. DIP_KILL_AT_THE_END ensures your topology is killed after the final result is written to a file. If you set this to false, your topology will execute forever, consuming resources that could be used by other topologies executing at the same time.
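The split delimiters are written as \| because an unescaped | is the alternation operator in a regular expression, whereas the add delimiter is a plain string. The sketch below shows the distinction using Python's re module; Squall itself is not written in Python, and the sample line and helper names are illustrative assumptions.

```python
import re

READ_SPLIT_DELIMITER = r"\|"  # regular expression matching a literal pipe
GLOBAL_ADD_DELIMITER = "|"    # plain string used when joining columns


def split_tuple(line: str) -> list[str]:
    """Split one line of a database file into its columns."""
    return re.split(READ_SPLIT_DELIMITER, line)


def join_tuple(fields: list[str]) -> str:
    """Serialize columns back into a single delimited string."""
    return GLOBAL_ADD_DELIMITER.join(fields)


line = "1|Customer#000000001|BUILDING"  # illustrative row, not real data
fields = split_tuple(line)
print(fields)              # ['1', 'Customer#000000001', 'BUILDING']
print(join_tuple(fields))  # 1|Customer#000000001|BUILDING
```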
Thus, to change the database size, we have to modify the DIP_DB_SIZE parameter, and to change the query we have to change DIP_QUERY_NAME. Please note that only the 0.01 scaling factor TPC-H database is bundled with Squall. For generating TPC-H databases of different sizes, please consult the TPC-H documentation. You can find more examples of config files in squall-$VERSION/test/squall/confs, but for Local Mode only those in the directory local are applicable. You can also write config files from scratch, but make sure you put them in squall-$VERSION/test/squall/confs/. You can run Squall with an arbitrary config file, as long as you specify the correct path to it:
cd squall-$VERSION/bin
./squall_local.sh $CONFIG_FILE_PATH