So that when soup.io hits rock bottom, you don't have to.
After 11 years on soup.io, I am losing the portal that connected me to so many beautiful people and gave me some actual friends. I am a hoarder here. I don't wanna lose everything.
This made me try to hack this script together super fast in one evening of despair and sadness.
The output format of this script is supported by loforo.com without modifications; you will just have to share (host?) the files with them somehow. As of 23.07.2020, the JSON alone will be enough. Example soup imported from this script's export: http://ikari.loforo.com/
This is a Python tool to back up your soup.io account by creating a comprehensive JSON archive: a well-formed, parsed, trash-free list of your posts and their metadata.
It collects each post's publication time, content, who it was reposted from, etc.
Generating a dump of 10 years of soup takes about 12 hours... Mine came out to 48 MB of JSON, and one post from 2013 is somehow "broken" because it has a very different HTML layout.
- Get reactions and what you reacted to (not just the links)
- Once we get the dump, it should be able to download all attachments; for now, use the shell commands provided (actually more reliable)
- Converting it later to an export format other services will understand; honestly I'm good with just not losing the data for now.
- Dumping friends and followers lists </3...
-
You need Python 3.
-
Install requirements in a virtualenv:
pip3 install virtualenv           # virtualenv is a python package to separate projects and their requirements
virtualenv -p python3 .env        # create one for this project
source .env/bin/activate          # and use it
pip install -r requirements.txt   # and install the requirements (one is called BeautifulSoup)
-
Check usage:
python -m downsouper.download ikari.soup.io # use -c for continuing broken backups later
You can check which options are meant to be supported (not all of them necessarily work yet):
# python -m downsouper.download --help
usage: download.py [-h] [-a] [-o OUTPUT] [-r RETRIES] [-c] url
The tool to download soup before it's down
positional arguments:
url
optional arguments:
-h, --help show this help message and exit
-a, --attachments Download attachments
-o OUTPUT, --output OUTPUT
Output JSON file path/name, where the data will be
stored
-r RETRIES, --retries RETRIES
How many times to retry if there's a 50X error
(defaults to 0, forever)
-c, --continue Detect oldest 'since' in output file and add older
entries
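For example, a run that continues a previously interrupted dump stored in a custom output file and retries 50X errors up to 5 times could look like this (flags taken from the help above; adjust to taste):
python -m downsouper.download ikari.soup.io -o ikari.json -r 5 -c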
I'm sorry.
The examples here are stripped down to the most interesting fields.
{
"kind": "image",
"source": [
{
"url": "https://parkaboy.soup.io/post/656261571/Image",
"author": "parkaboy"
},
{
"url": "https://kundel.soup.io/post/658268509/Image",
"author": "kundel"
}
],
"permalink": "https://ikari.soup.io/post/658270875/Image",
"title": "(Image)",
"is_repost": true,
"id": "post658270875",
"timestamp": "Jun 20 2018 12:48:37 UTC",
"content": {
"full_res_images": [
"https://asset.soup.io/asset/14385/6111_ca55.jpeg"
],
"images": [
{
"width": 539,
"src": "https://asset.soup.io/asset/14385/6111_ca55.jpeg",
"ratio": 0.9472759226713533,
"height": 569
}
]
}
}
{
"kind": "quote",
"source": [
{
"url": "https://fajnychnielubie.soup.io/post/658082306/I-write-differently-from-what-I-speak",
"author": "fajnychnielubie"
}
],
"is_repost": true,
"timestamp": "Jun 19 2018 10:25:45 UTC",
"content": {
"images": [],
"body": " I write differently from what I speak, I speak differently from what I think, I think differently from the way I ought to think, and so it all proceeds into deepest darkness.\n",
"cite": "\u2014 Kafka, Letters to Ottla and the Family"
}
}
{
"kind": "regular",
"permalink": "https://ikari.soup.io/post/679864405/let-me-just-lay-here-with-the",
"author": "ikari",
"id": "post679864405",
"original_post": {
"url": "https://my-great-inspirations.soup.io/post/676707442/Image",
"author": "my-great-inspirations"
},
"timestamp": "Feb 05 2020 21:06:00 UTC",
"content": {
"body": " let me just... lay here... with the books\n",
"images": []
},
"is_reaction": true
}
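Once you have a dump, a quick sanity check is to count posts by kind (this uses jq, which is introduced in the next section; the file name matches the examples below):
cat ikari.soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].kind" | sort | uniq -c | sort -rn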
So I figured: what if I want to get the files as fast as possible?
Install jq
for fancy json querying, then:
# getting all video urls from a dump
cat ikari.soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].content.video[0].src" | grep -vP "^null$"
Go a step further and download all the videos:
mkdir assets
cd assets
cat ../ikari.soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].content.video[0].src" | grep -vP "^null$" | xargs wget -nc
Download all images in their full size (notice that, when making the dump, the script drops the _xxx resizing suffix and puts a nice full_res_images list on posts where possible):
mkdir ikari
cd ikari
cat ../$(basename $(pwd)).soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].content.full_res_images[]?" | grep -vP "^null$" | xargs wget -nc
Be careful, it will take a lot of space. Dump of images from ikari.soup.io takes 19GiB of disk space.
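There is no audio example above, but assuming audio posts store their file under content.audio the same way video posts use content.video (a guess; adjust the path if your dump differs), the same pattern should grab them:
cat ikari.soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].content.audio[0].src" | grep -vP "^null$" | xargs wget -nc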
-
Why is the code so ugly?
Because I'm in a hurry. I need to download before soup gets DDoSed by the angry users. We only have a week left.
-
I am getting error 429 and soup doesn't work
You got banned for making too many requests. Interestingly, this happens to me only when I open the web version of soup, not when testing the script. It turns out my browser makes a million retries on failing requests (and does so immediately). A buggy script or a browser extension would be to blame.
-
I have the json, but I need the pictures
Download all the full_res_images, video.src and audio.src before soup goes down. On a fast connection this worked surprisingly well (downloading 19 GB of my soup wasn't a problem on the server side). You can do parallel requests; see the example below.
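A minimal way to parallelize, assuming your xargs supports the -P flag (GNU and macOS versions do): let xargs run several wget processes at once, here 8 of them with 4 URLs each:
cat ikari.soup.io.json | jq -r "keys[] as \$k | .[(\$k)].posts[].content.full_res_images[]?" | grep -vP "^null$" | xargs -P 8 -n 4 wget -nc
-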
Why is it so slow?
It downloads your soup one page (screen) at a time, like you would in the browser. Each page takes 3-15 seconds to generate; soup is under heavy stress now. My soup was 2151 pages long, which means 2151 requests, each taking ~10 seconds, so it should finish within 6 hours. In reality it took a little more.
-
How to convert it to a Wordpress archive / tumblr??
Well, this is a problem to solve once we have the backup. The dump has enough metadata to do so.
-
What's the file structure?
It's one huge JSON grouped into "chunks", which are exactly the pages as they loaded one by one. This also means adding new posts shifts EVERYTHING within the chunks. I tried to deal with it with the new --newposts option, but it's not well tested at all. On the other hand, this layout helps group the results into smaller pages and made it easy for me to continue from where a dump was previously interrupted. Each chunk ID relates to one /since/{chunk} request. If you prefer a single flat list of posts, see the one-liner below.
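To flatten all chunks into a single array of posts (a sketch using jq; pick whatever output name you like):
jq "[ .[] | .posts[] ]" ikari.soup.io.json > ikari.flat.json
-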
What is content.unkown?
If a rare type of post appeared, I just dumped everything to be parseable later (with the --fix option).
-
I cannot into computers, make it a button
I want to, but they didn't give us enough time. I can't right now. I have a job as well.
-
Why are all timestamps "?"?
You have to enable showing them on your soup first, if you want them included.
(this entire section is stolen from a sister project, nathell/soupscraper)
Soup.io is officially dead in new hands now, but the old servers haven't been turned off yet. So apparently you still have a chance of backing up your soup.
Here’s how:
- Edit your hosts file (instructions for macOS) and add the following entries:
45.153.143.247 soup.io
45.153.143.247 www.soup.io
45.153.143.247 YOURSOUP.soup.io
45.153.143.248 asset.soup.io
Put in your soup's name in place of YOURSOUP.
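To check that the override is active (a quick sanity check, nothing more), ping your soup and make sure it resolves to the address above:
ping -c 1 YOURSOUP.soup.io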
The servers only throw 503 errors now.