-
Notifications
You must be signed in to change notification settings - Fork 96
Rationale
Squall aims to support online analytics expressed in SQL. "Online" means that the final result is constantly updated as new tuples arrive into the system. At each step, the system represents an eventually consistent and correct query result for all the tuples seen so far. In contrast to batch processing, all the components are active and are processing tuples all the time. In other words, data is not processed in batches and there is no stage that waits for the completion of its precursor stage.
The Squall engine is meant to support three kinds of query processing problems:
-
Online stream processing: data arrives in streaming fashion; the state of the system is part of the recently seen data (e.g. a sliding or tumbling window over the stream), and we compute queries over the stream through window semantics.
-
Incremental query evaluation: We materialize a view (expressed as a query) of a database. Whenever an update to the database arrives, we want to quickly refresh the materialized view. The challenge here is to avoid recomputing the view from scratch every time an update arrives.
-
Online aggregation: There is a large, conceptually static database or data warehouse and we want to evaluate a query on it and see a continuously improving approximation of the query result --defined within confidence error bounds-- while the computation of the query result is executing. Typically, the data would be read out of the database and fed into the Storm topology in random order to allow computing error bounds using statistical machinery (conceptually by sampling, but for performance reasons, in practice, the scheme starts with a database whose entries have been randomly reshuffled offline to allow for an efficient scan at query processing time).
To best of our knowledge, Squall is the first system to provide online processing across many nodes. For example, the state-of-the-art online aggregation system DBO [1] works only on a single node.
[1] Christopher Jermaine, Subramanian Arumugam, Abhijit Pol, and Alin Dobra. 2007. Scalable approximate query processing with the DBO engine. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD '07). ACM, New York, NY, USA, 725-736.