Static mirroring utility for the Open Science Framework, maintained by the Center for Open Science.
Visit the COS GitHub for more innovations in the openness, integrity, and reproducibility of scientific research.
This software requires Python 3.5 for the aiohttp library. If desired, create a virtualenv:

```bash
pip install virtualenv
pip install virtualenvwrapper
mkvirtualenv rosie --python=python3.5
workon rosie
```
```bash
git clone https://github.com/zamattiac/ROSIEBot.git
```

Navigate into the new folder (`cd ROSIEBot`), then run

```bash
pip install -r requirements.txt
```

to install dependency libraries in the virtualenv.
Enter and exit the virtualenv with `workon rosie` / `deactivate`.
| Project | Registration | User | Institution |
|---|---|---|---|
| Dashboard | Dashboard | Profile | Dashboard |
| Files | Files | | |
| Wiki | Wiki | | |
| Analytics | Analytics | | |
| Registrations | | | |
| Forks | Forks | | |
ROSIEBot runs in one of five modes:

- Crawling: getting lists of all the URLs to visit
- Scraping: visiting all those URLs and saving their content to the mirror
- Resuming: continuing the crawl/scrape process if it stops in the middle
- Verifying: making sure all the files are present and in acceptable condition
- Compiling active: getting a list of existing pages from the API
The Python file `cli.py` must be run from the command line inside the rosie virtualenv. This project is optimized for macOS.
Every command consists of the following plus the flag for one mode:

```bash
python cli.py
```

See `python cli.py --help` for further usage assistance.
#### --compile_active

Make a taskfile of all the currently active pages on the OSF. This is useful primarily for `--delete`, which requires such a file to remove no-longer-existent pages from the mirror.
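A minimal invocation, assuming the mode flag matches the mode name:

```bash
python cli.py --compile_active
```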
#### --scrape
Crawl and scrape the site. Must include the date marker `--dm=<DATE>`, where `<DATE>` is the date of the last scrape in the form YYYY-MM-DDTHH:MM:SS.000, e.g. 1970-06-15T00:00:00.000.
One must specify which categories to scrape:

- `--nodes` (projects)
- `--registrations`
- `--users`
- `--institutions`

Any or all can be added.
If the nodes flag is used, one must specify which project pages to include (a full example follows this list):

- `-d`: dashboard
- `-f`: files page
- `-w`: wiki pages
- `-a`: analytics
- `-r`: list of registrations of the project
- `-k`: list of forks of the project
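For example, a scrape of project dashboards, files, and wikis plus all user pages since the epoch might look like this (the date and flag combination are illustrative, not prescriptive):

```bash
python cli.py --scrape --dm=1970-06-15T00:00:00.000 --nodes -d -f -w --users
```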
#### --resume

Pick up where a normal process left off in case of an unfortunate halt. The normal process creates and updates a .json taskfile with its status, which must be passed with the flag `--tf=<FILENAME>`. The filename will be of the form YYYYMMDDHHMM.json and should be visible in the ROSIEBot directory.
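For instance, with a hypothetical taskfile name in the YYYYMMDDHHMM.json form:

```bash
python cli.py --resume --tf=201606150000.json
```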
#### --verify

Verify the completeness of the mirror. See below for steps; an example invocation follows them. This process also requires a .json taskfile in the form described in the resume step, plus `--rn=<INT>`, where `<INT>` is the desired number of retries.
1. Verify that each URL found by the crawler has a corresponding file on the mirror.
2. Compare the size of each file to the minimum possible size for a complete page.
3. Rescrape failed pages and try again.
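An example invocation, again with a hypothetical taskfile name and three retries:

```bash
python cli.py --verify --tf=201606150000.json --rn=3
```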
#### --delete

Remove anything inside a category folder that isn't listed on the API. Requires a taskfile produced by `--compile_active`:

```bash
python cli.py --delete --ctf=<TASKFILE>
```
#### --index

Creates a search engine index.

Note: do not run until the static folder is in place in the archive.

Using search: the search button on each page should be replaced with a link to /search.html.

Scraped pages require a static folder inside the mirror. Please get a fresh copy from the OSF repo and place it directly inside archive/. Once static is in place, run

```bash
python cli.py --index
```

to set up the search utility.
Simple local server setup (does not preserve the original archive organization, but does use OSF organization): this option creates a flat copy of the archive without categorical folders; otherwise, Nginx configuration is required. Make sure whatever utilities you desire (e.g. verify, index) have been run before the copy is made.

Run

```bash
bash scripts/host_locally.sh
```

from the ROSIEBot root. Here is your mirror.
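If you prefer to serve the flat copy by hand, here is a minimal sketch using Python's built-in server, assuming the script produced a flat-archive/ directory (as the zip command below suggests):

```bash
cd flat-archive
python -m http.server 8000  # then browse http://localhost:8000
```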
To package either copy of the mirror for distribution:

```bash
zip -r archive.zip archive/
zip -r flat-archive.zip flat-archive/
```
Including the following location lines provides the necessary routing for a non-flat mirror. See "How to set up prerender" step 2 for Nginx information, bearing in mind that some parts do not apply.
```nginx
server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    root /path/to/archive;
    # index index.html index.htm;

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        try_files $uri $uri/ /registration/$uri/ /profile/$uri/ /project/$uri/ /project/$uri/home /registration/$uri/home =404;
        # index index.html index.htm;
        # Uncomment to enable naxsi on this location
        # include /etc/nginx/naxsi.rules
    }

    location /static/ {
        alias /path/to/archive/static/;
    }
}
```
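After editing the configuration, it is worth validating it and reloading Nginx; the exact service command depends on your distribution:

```bash
sudo nginx -t                 # check the configuration for syntax errors
sudo systemctl reload nginx   # or: sudo service nginx reload
```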
(Future)