-
Notifications
You must be signed in to change notification settings - Fork 96
Rationale
Squall aims to support online analytics expressed in SQL. "Online" means that the final result is constantly updated as new tuples arrive into the system. At each step, the system represents an eventually consistent and correct query result for all the tuples seen so far. In contrast to batch processing, all the components are active and are processing tuples simultaneously at all the time. In other words, data is not processed in batches and there are no stages that wait for the completion of other stages, e.g, Synchronized Staging in iterative Hadoop jobs or the need for Map
stages to finish before Reduce
stages.
The Squall engine is meant to support three kinds of query processing problems:
-
Online Stream Query Processing: Data arrives in a streaming fashion. The state of the system is part of the recently seen data (e.g. time/count-based sliding or tumbling windows over the stream) where continuous queries are computed over the stream through window semantics.
-
Incremental Query Evaluation: We materialize a view expressed as a query over the database. Whenever an update to the database arrives, we want to quickly refresh the materialized view and thus the query results. The challenge here is to avoid recomputing the view from scratch every time an update arrives.
-
Online Aggregation: There is a large, conceptually static database or data warehouse and we want to evaluate a query on it and see a continuously improving approximation of the query result --defined within confidence error bounds-- while the computation of the query result is executing. Typically, the data would be read out of the database and fed into the Storm topology in random order to allow computing error bounds using statistical estimation theory tools (conceptually by sampling, but for performance reasons, in practice, the scheme starts with a database whose entries have been randomly reshuffled offline to allow for an efficient scan at query processing time). To best of our knowledge, Squall is the first system that permits support for online aggregation while processing across many nodes. For example, the state-of-the-art online aggregation system DBO [1] works only on a single node.
[1] Christopher Jermaine, Subramanian Arumugam, Abhijit Pol, and Alin Dobra. 2007. Scalable approximate query processing with the DBO engine. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD '07). ACM, New York, NY, USA, 725-736.