Rationale
Squall aims to support online analytics expressed in SQL. "Online" means that the final result is continuously updated as new tuples arrive in the system. At each step, the system holds an eventually consistent and correct query result for all the tuples seen so far. In contrast to batch processing, all operators are active all the time and process tuples simultaneously. In other words, data is not processed in batches and there are no stages that wait for the completion of other stages, e.g., Synchronized Staging in iterative Hadoop jobs or the need for Map stages to finish before Reduce stages. The sketch below illustrates this processing model.
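To make this concrete, here is a minimal, hypothetical sketch (plain Java, not Squall's operator API; the class and method names are illustrative) of an online SUM-by-key operator: every arriving tuple immediately refreshes the running result, which is readable at any point without waiting for a stage to complete.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Squall code: an online SUM-by-key operator.
// Every incoming tuple immediately updates the running aggregate, so the
// result is valid for all tuples seen so far at any point in time,
// with no batch or stage boundaries to wait for.
public class OnlineSumOperator {
    private final Map<String, Double> sums = new HashMap<>();

    // Called once per arriving tuple; the result is refreshed immediately.
    public void process(String groupKey, double value) {
        sums.merge(groupKey, value, Double::sum);
    }

    // The current query result over all tuples seen so far.
    public Map<String, Double> currentResult() {
        return sums;
    }

    public static void main(String[] args) {
        OnlineSumOperator op = new OnlineSumOperator();
        op.process("EU", 10.0);
        System.out.println(op.currentResult()); // running sums: EU=10.0
        op.process("US", 5.0);
        op.process("EU", 2.5);
        System.out.println(op.currentResult()); // running sums: EU=12.5, US=5.0 (order may vary)
    }
}
```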
The Squall engine is meant to support three kinds of query processing problems:
- Incremental Query Evaluation: We materialize a view expressed as a query over the database. Whenever an update to the database arrives, we want to quickly refresh the materialized view and thus the query result. The challenge is to avoid recomputing the view from scratch every time an update arrives; a join-count sketch of this idea follows the list.
- Online Stream Processing: Data arrives in a streaming fashion. The state of the system is a portion of the recently seen data (e.g., time- or count-based sliding or tumbling windows over the stream), and continuous queries are computed over that state (see the sliding-window sketch below).
- Online Aggregation: There is a large, conceptually static database or data warehouse, and we want to evaluate a query on it while watching a continuously improving approximation of the query result (qualified by confidence error bounds) as the computation executes. Typically, the data would be read out of the database and fed into the Storm topology in random order, which allows computing error bounds using tools from statistical estimation theory (conceptually this is sampling; for performance reasons, in practice the scheme starts with a database whose entries have been randomly reshuffled offline, allowing an efficient scan at query processing time). A running-average sketch with a simple confidence bound closes this list.
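For incremental query evaluation, a minimal, hypothetical sketch (not Squall code; names are illustrative) of maintaining the view SELECT COUNT(*) FROM R JOIN S ON R.k = S.k: each insertion touches only the counters of its join key, so the view is refreshed without rejoining the relations from scratch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Squall code: incremental maintenance of
// the view "SELECT COUNT(*) FROM R JOIN S ON R.k = S.k".
public class IncrementalJoinCount {
    private final Map<String, Long> rCounts = new HashMap<>();
    private final Map<String, Long> sCounts = new HashMap<>();
    private long joinCount = 0;

    public void insertR(String key) {
        // Each existing S tuple with this key forms one new join result.
        joinCount += sCounts.getOrDefault(key, 0L);
        rCounts.merge(key, 1L, Long::sum);
    }

    public void insertS(String key) {
        // Symmetric: pair the new S tuple with all existing R tuples on this key.
        joinCount += rCounts.getOrDefault(key, 0L);
        sCounts.merge(key, 1L, Long::sum);
    }

    // The refreshed view after every update, with no recomputation from scratch.
    public long currentCount() {
        return joinCount;
    }
}
```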
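For online stream processing, a minimal, hypothetical sketch of a time-based sliding window: only the tuples from the last windowMillis milliseconds are kept as state, and the continuous query (here, a count) is answered from that bounded state.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Squall code: a time-based sliding-window count.
// Assumes tuples arrive in non-decreasing timestamp order.
public class SlidingWindowCount {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowCount(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Called for every arriving tuple with its event timestamp.
    public void onTuple(long timestampMillis) {
        timestamps.addLast(timestampMillis);
        evictExpired(timestampMillis);
    }

    // The continuous query result: number of tuples inside the current window.
    public long count(long nowMillis) {
        evictExpired(nowMillis);
        return timestamps.size();
    }

    private void evictExpired(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }
}
```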
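For online aggregation, a minimal, hypothetical sketch of estimating AVG(x) over a table scanned in random order: after each tuple, the running mean is an unbiased estimate of the final answer, and an approximate 95% confidence half-width follows from the standard error of the mean (normal approximation), with the variance tracked via Welford's algorithm.

```java
// Hypothetical sketch, not Squall code: online aggregation of AVG(x) over a
// table fed in random order, with an approximate 95% confidence bound.
public class OnlineAverage {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations (Welford's algorithm)

    // Called once per tuple read from the randomly ordered scan.
    public void onTuple(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Current estimate of the final AVG(x); improves as more tuples arrive.
    public double estimate() {
        return mean;
    }

    // Half-width of an approximate 95% confidence interval around the estimate.
    public double errorBound() {
        if (n < 2) return Double.POSITIVE_INFINITY;
        double sampleVariance = m2 / (n - 1);
        return 1.96 * Math.sqrt(sampleVariance / n);
    }
}
```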