This repository contains the SIFT codebase. SIFT is a scalable architecture for extracting data from Google Trends. It was published at IMC '22 in a work studying user-affecting Internet outages:
@inproceedings{10.1145/3517745.3561428,
author = {Kirci, Ege Cem and Vahlensieck, Martin and Vanbever, Laurent},
title = {"Is My Internet down?": Sifting through User-Affecting Outages with Google Trends},
year = {2022},
isbn = {9781450392594},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3517745.3561428},
doi = {10.1145/3517745.3561428},
abstract = {What are the worst outages for Internet users? How long do they last, and how wide are they? Such questions are hard to answer via traditional outage detection and analysis techniques, as they conventionally rely on network-level signals and do not necessarily represent users' perceptions of connectivity. We present SIFT, a detection and analysis tool for capturing user-affecting Internet outages. SIFT leverages users' aggregated web search activity to detect outages. Specifically, SIFT starts by building a timeline of users' interests in outage-related search queries. It then analyzes this timeline looking for spikes of user interest. Finally, SIFT characterizes these spikes in duration, geographical extent, and simultaneously trending search terms which may help understand root causes, such as power outages or associated ISPs. We use SIFT to collect more than 49 000 Internet outages in the United States over the last two years. Among others, SIFT reveals that user-affecting outages: (i) do not happen uniformly: half of them originate from 10 states only; (ii) can affect users for a long time: 10\% of them last at least 3 hours; and (iii) can have a broad impact: 11\% of them simultaneously affect at least 10 distinct states. SIFT annotations also reveal a perhaps overlooked fact: outages are often caused by climate and/or power-related issues.},
booktitle = {Proceedings of the 22nd ACM Internet Measurement Conference},
pages = {290–297},
numpages = {8},
keywords = {Google trends, internet outages, data mining, anomaly detection},
location = {Nice, France},
series = {IMC '22}
}
Please send us an email at ekirci-at-ethz-dot-ch or martinva-at-student-dot-ethz-dot-ch if:
- You are interested in collaborating on the project.
- You are having issues trying to run SIFT.
- You happen to find a bug.
- You have any questions or concerns specific to the paper.
Thank you for your interest!
SIFT consists of five commands:
- sift_queue queues requests, which are then executed by sift_dispatcher.
- sift_dispatcher executes the requests using one or more fetchers.
- sift_fetcher is the part that interacts with Trends. It is executed by the dispatcher.
- sift_tool can be used to add new keywords and tags.
- sift_stitch is used to create a SQLite3 database with a stitched time series.
Requests and the returned data are stored inside a PostgreSQL database.
- PostgreSQL (tested with version 12). Must be running locally as all tools use authentication via UNIX sockets.
- Python 3
First, set up PostgreSQL. SIFT connects to PostgreSQL using UNIX
sockets. For that, create a database and a user named after the user
name under which you want to run SIFT (in the following we use the
user sift):
CREATE DATABASE sift;
CREATE USER sift;
To test the setup, switch to the sift user and run
psql
You should get a prompt without any error messages and without having to enter a password.
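If you prefer to do all of this from the shell, the following sketch (assuming a typical installation with a postgres superuser account) creates the role and database and verifies peer authentication:
sudo -u postgres createuser sift          # role named after the OS user running SIFT
sudo -u postgres createdb -O sift sift    # database owned by that role
sudo -u sift psql -c 'SELECT 1;'          # should print a result without asking for a password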
Now, as the sift user, get a copy of the source code, either by cloning
this repository or by extracting a tarball. Change into the directory and
create and activate a virtual environment using:
python3 -m venv venv
. venv/bin/activate
Then install the dependencies using
pip install -r requirements.txt
Note: You might need to install additional system packages such as python3-dev and libpq-dev.
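On Debian/Ubuntu-based systems, for example, these could be installed as follows (package names may differ on other distributions):
sudo apt-get install python3-dev python3-venv libpq-dev build-essential   # headers and compilers for building the PostgreSQL bindings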
After that install SIFT using
pip install .
Initialize the database using
psql < sql/schema.sql
and apply all the migrations in sql/migrations.
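The migrations are plain SQL files; one way to apply them all (in lexical order, which is assumed here to match the intended order) is:
for f in sql/migrations/*.sql; do
    psql < "$f"
done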
If you want to start fetching right away, you can load the database with the internet outage topic keyword and US states using
psql < sql/quickstart.sql
Then add some requests using sift_queue and start sift_dispatcher
with the --local option to execute the fetcher under the same user
ID as the dispatcher:
sift_dispatcher --local
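A minimal local quickstart might therefore look like this (the sift_queue arguments are omitted because they depend on the keywords and locations you want to query):
psql < sql/quickstart.sql     # load the Internet outage topic and US states
sift_queue                    # add some requests (arguments depend on your setup)
sift_dispatcher --local       # execute them with a fetcher running under the same user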
When not running with the --local flag, sift_dispatcher uses a file
called fetchers.json to get a list of all fetchers. To get started,
copy the file fetchers.json.example to fetchers.json and edit it
accordingly.
The main feature of SIFT is the ability to run fetchers on remote machines.
To do that, you first need to copy scripts/sift_fetcher.sh and the fetcher
itself (bin/sift_fetcher) to the machine you wish to use.
Next, generate an SSH key for the sift user and add it to the
authorized_keys file on the remote machine with the following options to
restrict its scope:
from="<ip-of-dispatcher>",restrict,command="/bin/sh <remote-user-home>/sift_fetcher.sh ssh" ssh-ed25519 AAAA...
You can read about the individual options in the sshd(8) man page. Replace
<ip-of-dispatcher> with the IP from which the dispatcher contacts the fetcher
and <remote-user-home> with the directory into which you copied
sift_fetcher.sh and sift_fetcher.
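As a sketch, assuming a remote user fetcher on a host fetcher1.example.org (both hypothetical), the setup could look like this:
ssh-keygen -t ed25519                                             # on the dispatcher machine, as the sift user
scp scripts/sift_fetcher.sh bin/sift_fetcher fetcher@fetcher1.example.org:
# Then append the restricted key line shown above (with your own public key,
# dispatcher IP, and remote home directory filled in) to
# ~fetcher/.ssh/authorized_keys on the remote machine.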
To initialize the required virtual environment run
sh sift_fetcher.sh init
The entry in fetchers.json should look something like this:
[
...
{"active": true, "type": "ssh", "user": "<remote-user>", "host": "<remote-host>"},
...
]
Under the hood, sift_dispatcher executes ssh via Popen and joins the user
and host with an @. This means that, for example, .ssh/config is honored
and can be used to add additional options, such as a different port or a jump host.
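For example, to reach a fetcher behind a jump host or on a non-standard port, the sift user's SSH client configuration could contain an entry like this (host names and port are hypothetical):
cat >> ~/.ssh/config <<'EOF'
Host fetcher1.example.org
    Port 2222
    ProxyJump jump.example.org
EOF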
It is also possible to execute the fetcher on the same machine but under a different user using
sudo(8). For this, copy the files scripts/sift_fetcher.sh and bin/sift_fetcher to the home
directory of the target user (as for remote fetching) and add a line like this to sudoers(5):
dispatcher ALL=(fetcher:fetcher) NOPASSWD: /bin/sh /home/fetcher/sift_fetcher.sh fetch *
Replace dispatcher and fetcher appropriately.
The entry in fetchers.json should then look like this:
[
...
{"type": "sudo", "user": "fetcher", "group": "fetcher", "script": "/home/fetcher/sift_fetcher.sh"},
...
]
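To verify the sudoers entry, you can list the commands the dispatcher user is allowed to run:
sudo -l    # run as the dispatcher user; the output should include the sift_fetcher.sh fetch rule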
The different fetcher implementations are in bin/sift_dispatcher and should be easy enough to extend.
The dispatcher crashes quite easily. This is intentional, to avoid overloading Trends in case of a bug.
SIFT has a web interface to browse the data. It is located in web_interface/.
The dependencies can be installed by switching to web_interface/ and executing
pip install -r requirements.txt
Then a local development server (again in the web_interface directory) can be started with
FLASK_APP=vis flask run
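Putting these steps together (the host and port below are simply Flask's defaults made explicit):
cd web_interface
pip install -r requirements.txt
FLASK_APP=vis flask run --host 127.0.0.1 --port 5000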
Stitched time series are not generated by the web interface but must be
precomputed using sift_stitch. sift_stitch creates a database
with the stitched time series called time_series.db. This file must
be present in the directory from which you run the web interface.
sift_stitch requires that the requests have so-called resolution tags.
These can be added with sift_cli add-resolution-tags.
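A typical stitching run could therefore look like this (assuming sift_stitch needs no further arguments; check its built-in help otherwise):
sift_cli add-resolution-tags    # tag the requests with their resolution
sift_stitch                     # writes time_series.db into the current directory
ls -l time_series.db            # the web interface expects this file here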
By default, the web interface connects to the database via UNIX sockets.
To change that, modify DATABASE_URL in web_interface/config.py.
For increased security it is possible to run the web interface under a different user. To do that, create a new PostgreSQL user (e.g. vis) and grant SELECT rights on the appropriate tables and views:
GRANT SELECT ON TABLE request_who, request_types, request_api, locations,
request_status, fetchers, requests, keyword_topics, keywords, keywords_info,
keywords_in_request, keywords_related, trends_time, trends_geo_scopes,
trends_geo, log, tags, keywords_tags, peaks_detailed, peaks_detailed_requests,
keywords_ignore, rtags, request_tags, keywords_and_topics,
raw_fetcher_output_count TO vis;
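Creating the role itself is a single statement; for example, from the shell:
sudo -u postgres psql -d sift -c 'CREATE USER vis;'   # create the read-only role
# Then run the GRANT statement above, e.g. via psql as the sift user.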
Don't forget to update DATABASE_URL in config.py.
You can stop a dispatcher by pressing Ctrl-C. If possible, press it
while the dispatcher is in its wait period. You can also pass the
--exit flag, which causes the dispatcher to exit as soon as there is
nothing to do (which does not mean that there are no more requests).
It is safe to run multiple dispatchers in parallel using different fetchers as each dispatcher locks a request by setting its status to running before executing.
In some error cases a request can be left in the 'running' state even
though it is no longer being fetched. When that happens, the request
has to be manually reset to 'open' using the following query (replacing
rid-of-the-request):
UPDATE requests SET r_status = (SELECT rs_id FROM request_status WHERE rs_name = 'open')
WHERE r_status = (SELECT rs_id FROM request_status WHERE rs_name = 'running')
AND r_id = rid-of-the-request;
You should stop all dispatchers first to ensure that no dispatcher is still trying to fetch the request.
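To see which requests are currently marked as running (and are therefore candidates for such a reset), you can run, for example:
psql -c "SELECT r_id FROM requests
         WHERE r_status = (SELECT rs_id FROM request_status WHERE rs_name = 'running');"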