Squall Local Configs

Aleksandar Vitorovic edited this page Jan 9, 2015 · 11 revisions

We will explain the contents of the config file at squall-$VERSION/test/squall/confs/local/0_01G_hyracks_ncl:

DIP_DISTRIBUTED false
DIP_QUERY_NAME hyracks

DIP_TOPOLOGY_NAME_PREFIX username
DIP_DATA_ROOT ../test/data/tpch/
DIP_SQL_ROOT ../test/squall/sql_queries/
DIP_SCHEMA_PATH ../test/squall/schemas/tpch.txt
DIP_RESULT_ROOT ../test/results/

# DIP_DB_SIZE is in GBs
DIP_DB_SIZE 0.01

########################################
#DIP_OPTIMIZER_TYPE INDEX_SIMPLE
#DIP_MAX_SRC_PAR 1

#DIP_OPTIMIZER_TYPE INDEX_RULE_BUSHY
#DIP_MAX_SRC_PAR 1

#DIP_OPTIMIZER_TYPE NAME_MANUAL_PAR_LEFTY
#DIP_PLAN CUSTOMER:2,ORDERS:3:4

#DIP_OPTIMIZER_TYPE NAME_MANUAL_COST_LEFTY
#DIP_PLAN CUSTOMER,ORDERS
#DIP_TOTAL_SRC_PAR 20

#DIP_OPTIMIZER_TYPE NAME_RULE_LEFTY
#DIP_TOTAL_SRC_PAR 20

DIP_OPTIMIZER_TYPE NAME_COST_LEFTY
DIP_TOTAL_SRC_PAR 20

########################################

#below are unlikely to change
DIP_EXTENSION .tbl
DIP_READ_SPLIT_DELIMITER \|
DIP_GLOBAL_ADD_DELIMITER |
DIP_GLOBAL_SPLIT_DELIMITER \|

DIP_ACK_EVERY_TUPLE false
DIP_KILL_AT_THE_END true

# Storage manager parameters
# Storage directory for local runs
STORAGE_LOCAL_DIR /tmp/ramdisk
# Storage directory for cluster runs
STORAGE_DIP_DIR /export/home/squalldata/storage 
STORAGE_COLD_START true
MEMORY_SIZE_MB 4096

To distinguish Squall parameters from Storm parameters, we use the prefix DIP for Squall, which is short for Distributed Incremental Processing. DIP_DISTRIBUTED must be false to execute the query plan in Local mode. DIP_QUERY_NAME must correspond to a query from DIP_SQL_ROOT (which is set to squall-$VERSION/test/squall/sql_queries/). In this case, DIP_QUERY_NAME = hyracks corresponds to the SQL query in squall-$VERSION/test/squall/sql_queries/hyracks.sql. The topology name is built by concatenating DIP_TOPOLOGY_NAME_PREFIX and DIP_TOPOLOGY_NAME. DIP_TOPOLOGY_NAME_PREFIX is there to distinguish different users.

The database path is built by concatenating DIP_DATA_ROOT, DIP_DB_SIZE and the string G. We keep DIP_DB_SIZE as a separate parameter because our optimizer uses it when allocating parallelism to Storm components.
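The concatenation described above can be sketched in a few lines of Python. This is only an illustration, assuming plain string concatenation with no extra separator; the function name database_path is hypothetical and is not part of Squall.

```python
def database_path(data_root: str, db_size: str) -> str:
    """Concatenate DIP_DATA_ROOT, DIP_DB_SIZE and the literal string "G".

    db_size is kept as a string so it appears exactly as written in the
    config file (assumption: no reformatting of the number takes place).
    """
    return f"{data_root}{db_size}G"


# Values taken from the config file shown earlier on this page.
print(database_path("../test/data/tpch/", "0.01"))
```

With the values from the config above, the sketch yields a directory name ending in 0.01G under ../test/data/tpch/.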

Query optimizers are described in detail at Query Optimizers. You can select only one optimizer at a time. To do so, uncomment one of the DIP_OPTIMIZER_TYPE lines together with the one or two parameter lines directly below it, and comment out all the others.
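A quick way to check that exactly one optimizer is selected is to scan the config for active DIP_OPTIMIZER_TYPE lines. The sketch below is not Squall's own parser; it assumes, based on the sample file, that each setting is a key and a value separated by whitespace and that lines starting with # are comments. The function names are hypothetical.

```python
def active_settings(config_text: str) -> list:
    """Return (key, value) pairs from lines that are neither blank nor commented out."""
    settings = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.append((parts[0], value))
    return settings


def selected_optimizer(config_text: str) -> str:
    """Return the single active DIP_OPTIMIZER_TYPE, or raise if there is not exactly one."""
    chosen = [v for k, v in active_settings(config_text) if k == "DIP_OPTIMIZER_TYPE"]
    if len(chosen) != 1:
        raise ValueError(
            f"expected exactly one active DIP_OPTIMIZER_TYPE, found {len(chosen)}"
        )
    return chosen[0]


sample = """\
#DIP_OPTIMIZER_TYPE NAME_RULE_LEFTY
#DIP_TOTAL_SRC_PAR 20

DIP_OPTIMIZER_TYPE NAME_COST_LEFTY
DIP_TOTAL_SRC_PAR 20
"""
print(selected_optimizer(sample))
```

Running it on a file where two optimizer lines are left uncommented raises a ValueError, which is the mistake this check is meant to catch.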

A user has control only over the parallelism of spouts. The exact setting depends on the optimizer type. DIP_MAX_SRC_PAR assigns the specified parallelism to each relation, except for those which contain fewer than 100 tuples; in that case, parallelism is set to 1. DIP_TOTAL_SRC_PAR refers to the total parallelism of all spouts. The total parallelism is partitioned among spouts based on the number of tuples sent down the hierarchy.
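The DIP_MAX_SRC_PAR rule can be illustrated with a small Python sketch. It covers only that rule; how DIP_TOTAL_SRC_PAR is divided among spouts is not modelled here, since the exact formula is not given on this page. The function name and the relation cardinalities are illustrative assumptions, not values read from Squall.

```python
SMALL_RELATION_TUPLES = 100  # threshold stated in the text above


def spout_parallelism(cardinalities: dict, max_src_par: int) -> dict:
    """Assign max_src_par to every relation, or 1 if it has fewer than 100 tuples."""
    return {
        relation: 1 if tuples < SMALL_RELATION_TUPLES else max_src_par
        for relation, tuples in cardinalities.items()
    }


# Illustrative cardinalities only (hypothetical input, not read from a schema file).
example = {"NATION": 25, "CUSTOMER": 1500, "ORDERS": 15000}
print(spout_parallelism(example, 4))
```

Here NATION falls under the threshold and gets parallelism 1, while CUSTOMER and ORDERS both get the configured value.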

The parallelism for Bolts is set automatically, taking into account the position of a component in the query plan, so that no component becomes a bottleneck while the minimal number of nodes is used.

Due to main memory constraints, you cannot run an arbitrarily large database with small component parallelism. For information on detecting this behavior, please consult Squall query plans vs Storm topologies, section How to know we run out of memory?.

DIP_SQL_ROOT is an absolute or relative path to the SQL queries on your local machine. DIP_ACK_EVERY_TUPLE determines how we ensure that processing is done, so that the final result and the full execution time can be acquired. If the parameter is set to true, we ack each and every tuple. If the parameter is set to false, each Spout sends a special message as the last tuple. For more information about the implications of this parameter, please consult Squall query plans vs Storm topologies, section To ack or not to ack?.

DIP_SCHEMA_PATH is an absolute or relative path to the schema information (including known cardinalities) on your local machine. By appending to DIP_RESULT_ROOT the last part of DIP_DATA_ROOT (in this case tpch), the database size information (DIP_DB_SIZE followed by G), the query name and the suffix .result, we obtain the full path to the expected result file. If this file does not exist, the comparison of actual and expected results will not take place.

Now we explain the parameters you most likely will not need to change. DIP_EXTENSION refers to the file extension in your database. In our case, the names of the database files were customer.tbl, orders.tbl, etc. DIP_READ_SPLIT_DELIMITER is a regular expression used for delimiting the columns of a tuple in a database file. DIP_GLOBAL_ADD_DELIMITER and DIP_GLOBAL_SPLIT_DELIMITER are used internally in Squall for serializing and deserializing tuples between different components. DIP_KILL_AT_THE_END ensures your topology is killed after the final result is written to a file. If you set this to false, your topology will execute forever, consuming resources that could be used by other topologies executing at the same time.
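The reason the split delimiters are written as \| while the add delimiter is a plain | is that the split values are regular expressions, where an unescaped | means alternation. The sketch below shows the idea using Python's re module as a stand-in; Squall has its own implementation, and the sample row and helper names here are made up for illustration.

```python
import re

READ_SPLIT_DELIMITER = r"\|"  # regex: the pipe is escaped so it matches a literal "|"
GLOBAL_ADD_DELIMITER = "|"    # plain string used when joining fields back together


def split_fields(line: str, delimiter_regex: str) -> list:
    """Split one row of a database file into its columns."""
    return re.split(delimiter_regex, line)


def join_fields(fields: list, delimiter: str) -> str:
    """Serialize columns back into a single delimited string."""
    return delimiter.join(fields)


row = "1|Customer#000000001|BUILDING"  # hypothetical row in .tbl style
fields = split_fields(row, READ_SPLIT_DELIMITER)
print(fields)
print(join_fields(fields, GLOBAL_ADD_DELIMITER) == row)
```

Splitting with the escaped pattern and joining with the literal delimiter round-trips the row, which is why the two settings have to stay consistent with each other.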

Thus, to change the database size, we have to modify the DIP_DB_SIZE parameter, and to change the query we have to change DIP_QUERY_NAME. Please note that only the TPC-H database with scaling factor 0.01 is bundled in Squall. For generating TPC-H databases of different sizes, please consult the TPC-H documentation. You can find more examples of config files in squall-$VERSION/test/squall/confs, but for Local Mode only those in the directory local are applicable. You can also write config files from scratch, but make sure you put them in squall-$VERSION/test/squall/confs/. You can run Squall with an arbitrary config file, as long as you specify the correct path to it:

cd squall-$VERSION/bin
./squall_local.sh $CONFIG_FILE_PATH