feat: Added local storage cleanup
db0 committed Aug 31, 2023
1 parent c68c70d commit b724ce7
Showing 9 changed files with 287 additions and 104 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,7 @@
# 1.1.0

Support for deleting potential CSAM from pict-rs Local Storage

# 1.0.0

Support for deleting potential CSAM from pict-rs Object Storage
23 changes: 17 additions & 6 deletions README.md
@@ -1,5 +1,5 @@
# lemmy-safety
This is a tool for Lemmy Administrators to easily check and clean all images in the pict-rs object storage for illegal or unethical content
This is a tool for Lemmy Administrators to easily check and clean all images in the pict-rs storage for illegal or unethical content

Note, this script **does not save any images locally** and it **does not send images to any external services**. All images are stored in RAM only, checked and then forgotten.

@@ -10,7 +10,7 @@ There's two big potential problems:
1. Malicious users can simply open a new post, upload an image and cancel the new post, and that image will then be invisibly hosted by the instance among thousands of others, with a URL known only to the malicious user. That user could then anonymously forward that URL to the instance's provider and try to take the lemmy instance down
2. Users on different instances with looser controls can upload CSAM posts, and if those instances are subscribed to by any user in your own instance, those image thumbnails will be cached to your own instance. Even if the relevant CSAM post is deleted, such images will persist in your object storage.

The lemmy safety will go directly through your object storage and scan each image for potential CSAM and automatically delete it. Covering both those problems in one go. You can also run this script constantly, to ensure no new such images can survive.
The lemmy safety will go directly through your pict-rs storage (either object storage or filesystem) and scan each image for potential CSAM and automatically delete it. Covering both those problems in one go. You can also run this script constantly, to ensure no new such images can survive.

The results will also be written in an sqlite DB, which can then be used to follow-up and discover the user and instances uploading them.

@@ -26,20 +26,31 @@ This means you need a GPU and the more powerful your GPU, the faster you can pro

* Install python>=3.10
* install requirements: `python -m pip install -r requirements.txt`
* Copy `env_example` to `.env`, then edit `.env` and add your Object Storage credentials and connection info
* Start the script
* Copy `env_example` to `.env`, then edit `.env` following instructions below based on the type of storage your pict-rs is using

## Object Storage

* Add your Object Storage credentials and connection info to `.env`
* Start the script `lemmy_safety_object_storage.py`

## Local Storage

* Add your pict-rs server ssh credentials and pict-rs paths to `.env`
* Start the script `lemmy_safety_local_storage.py`

## Run Types

The script will record all images checked in an sqlite db called `lemmy_safety.db` which will prevent it from checking the same image twice.
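
The `lemmy_safety.database` module is not part of this diff. As a minimal sketch of how such a checked-image record could work, assuming a hypothetical single-table schema (`checked_images`, with `key` and `csam` columns, all invented for illustration), the two calls the scripts rely on might look like this:

```python
import sqlite3

# Hypothetical schema and helpers: the real lemmy_safety.database module is not shown in this commit.
def open_db(path="lemmy_safety.db"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checked_images ("
        "key TEXT PRIMARY KEY, csam INTEGER)"
    )
    return conn

def is_image_checked(conn, key):
    # True once the key has been recorded, whatever the verdict was
    row = conn.execute(
        "SELECT 1 FROM checked_images WHERE key = ?", (key,)
    ).fetchone()
    return row is not None

def record_image(conn, key, csam):
    # csam may be True, False, or None (the image could not be loaded)
    conn.execute(
        "INSERT OR REPLACE INTO checked_images (key, csam) VALUES (?, ?)",
        (key, None if csam is None else int(csam)),
    )
    conn.commit()
```

Using the storage key as the primary key is what would make a second pass skip images that were already checked.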

The script has two methods: `all` and `daemon`

## All
### All

Running with the cli arg `--all` will loop through all the images in your object storage and check each of them for CSAM.

Any potential image will be automatically deleted and its ID recorded in the DB for potential follow-up.

## Daemon
### Daemon

Running without the `--all` arg will make the script run constantly and check all images uploaded in the past 20 minutes (can be changed using `--minutes`).
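
For filesystem storage, "uploaded in the past 20 minutes" appears to be decided by comparing each file's modification time against a UTC cutoff. A minimal sketch of that comparison, using only the standard library (`cutoff_for` and `is_recent` are hypothetical helper names, not functions of the tool):

```python
from datetime import datetime, timedelta, timezone

def cutoff_for(minutes, now=None):
    """Return the earliest UTC modification time that still counts as recent."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(minutes=minutes)

def is_recent(mtime_epoch, cutoff):
    """Compare a POSIX mtime (seconds since epoch) against a timezone-aware cutoff."""
    modified = datetime.fromtimestamp(mtime_epoch, tz=timezone.utc)
    return modified >= cutoff
```

Both sides of the comparison are timezone-aware, which avoids the `TypeError` Python raises when a naive and an aware datetime are compared.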

15 changes: 10 additions & 5 deletions env_example
@@ -1,6 +1,11 @@
## Make a copy of this file into .env and change the below fields
OBJECT_STORAGE_ENDPOINT="https://eu2.example.com"
PICTRS_BUCKET="pictrs"
AWS_ACCESS_KEY_ID=1234asdf5678zxxcvb890qwerty
AWS_SECRET_ACCESS_KEY=1234567890qwertyuiopasdfghjkl
AWS_DEFAULT_REGION=auto
OBJECT_STORAGE_ENDPOINT="https://eu2.example.com" # Fill in when using object storage
PICTRS_BUCKET="pictrs" # Fill in when using object storage
AWS_ACCESS_KEY_ID=1234asdf5678zxxcvb890qwerty # Fill in when using object storage
AWS_SECRET_ACCESS_KEY=1234567890qwertyuiopasdfghjkl # Fill in when using object storage
AWS_DEFAULT_REGION=auto # Fill in when using object storage
SSH_HOSTNAME="127.0.0.1" # Fill in when using filesystem storage
SSH_PORT=22 # Fill in when using filesystem storage
SSH_USERNAME="root" # This user should have read/write access to your pict-rs files
SSH_PRIVKEY="/home/username/.ssh/id_rsa" # Path to your private key file
SSH_PICTRS_FILES_DIRECTORY="/lemmy/lemmy.example.com/volumes/pictrs/files" # Path to your pictrs files directory
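
All five `SSH_*` settings are needed when using filesystem storage. A minimal sketch of checking them up front, assuming the values have already been loaded into the process environment (for example by python-dotenv); `require_env` and `SSH_VARS` are hypothetical names, not part of the tool:

```python
import os

# Variable names taken from env_example; the helper itself is illustrative only.
SSH_VARS = [
    "SSH_HOSTNAME",
    "SSH_PORT",
    "SSH_USERNAME",
    "SSH_PRIVKEY",
    "SSH_PICTRS_FILES_DIRECTORY",
]

def require_env(names, environ=os.environ):
    """Return the requested settings, or raise naming every one that is missing or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing in .env: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Reporting every missing name in one error saves the operator from fixing `.env` one variable at a time.
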
78 changes: 0 additions & 78 deletions lemmy_safety.py

This file was deleted.

16 changes: 1 addition & 15 deletions lemmy_safety/check.py
@@ -1,23 +1,13 @@

from dotenv import load_dotenv
from loguru import logger
import PIL.Image

from horde_safety.csam_checker import check_for_csam
from horde_safety.interrogate import get_interrogator_no_blip
from lemmy_safety import object_storage
from PIL import UnidentifiedImageError

interrogator = get_interrogator_no_blip()

def check_image(key):
try:
image: PIL.Image.Image = object_storage.download_image(key)
except UnidentifiedImageError:
logger.warning("Image could not be read. Returning it as CSAM to be sure.")
return True
if not image:
return None
def check_image(image):
try:
is_csam, results, info = check_for_csam(
interrogator=interrogator,
@@ -28,8 +18,4 @@ def check_image(key):
except OSError:
logger.warning("Image could not be read. Returning it as CSAM to be sure.")
return True
if is_csam:
logger.warning(f"{key} rejected as CSAM")
else:
logger.info(f"{key} is OK")
return is_csam
86 changes: 86 additions & 0 deletions lemmy_safety/local_storage.py
@@ -0,0 +1,86 @@
import os
import datetime
import paramiko
import sys
from getpass import getpass
from PIL import Image
from io import BytesIO
import stat
from pathlib import Path
from loguru import logger
import pytz

# Each setting is validated against its own variable, so a missing value is reported by name
hostname = os.getenv("SSH_HOSTNAME")
if hostname is None:
    logger.error("You need to provide an SSH_HOSTNAME var in your .env file")
    sys.exit(1)
port = os.getenv("SSH_PORT")
if port is None:
    logger.error("You need to provide an SSH_PORT var in your .env file")
    sys.exit(1)
port = int(port)
username = os.getenv("SSH_USERNAME")
if username is None:
    logger.error("You need to provide an SSH_USERNAME var in your .env file")
    sys.exit(1)
private_key_path = os.getenv("SSH_PRIVKEY")
if private_key_path is None:
    logger.error("You need to provide an SSH_PRIVKEY var in your .env file")
    sys.exit(1)
remote_base_directory = os.getenv("SSH_PICTRS_FILES_DIRECTORY")
if remote_base_directory is None:
    logger.error("You need to provide an SSH_PICTRS_FILES_DIRECTORY var in your .env file")
    sys.exit(1)

private_key_passphrase = getpass(prompt="Enter passphrase for private key: ")
private_key = paramiko.RSAKey(filename=private_key_path, password=private_key_passphrase)

def get_connection():
    # I can't re-use the same connection when using threading
    # So we have to initiate a new connection per thread
    transport = paramiko.Transport((hostname, port))
    transport.connect(username=username, pkey=private_key)
    sftp = paramiko.SFTPClient.from_transport(transport)
    return sftp

def get_all_images(min_date=None):
    sftp = get_connection()
    filelist = []

    def list_files_recursively(remote_directory):
        files = sftp.listdir_attr(remote_directory)
        for file_info in files:
            file_path = os.path.join(remote_directory, file_info.filename)
            if stat.S_ISREG(file_info.st_mode):  # Check if it's a regular file
                modify_time = datetime.datetime.fromtimestamp(file_info.st_mtime, tz=pytz.UTC)
                if min_date is None or modify_time >= min_date:
                    filelist.append(
                        {
                            "key": str(Path(file_path).relative_to(Path(remote_base_directory))),
                            "filepath": Path(file_path),
                            "mtime": modify_time,
                        }
                    )
            elif stat.S_ISDIR(file_info.st_mode):  # Check if it's a directory
                list_files_recursively(file_path)

    list_files_recursively(remote_base_directory)
    return filelist

def download_image(remote_path):
    sftp = get_connection()
    remote_file = sftp.open(remote_path, "rb")
    image_bytes = remote_file.read()
    remote_file.close()
    image_pil = Image.open(BytesIO(image_bytes))
    return image_pil


def delete_image(remote_path):
    sftp = get_connection()
    try:
        sftp.remove(remote_path)
    except FileNotFoundError:
        logger.error(f"File not found: {remote_path}")
    except Exception as e:
        logger.error(f"Error deleting file {remote_path}: {e}")
69 changes: 69 additions & 0 deletions lemmy_safety_local_storage.py
@@ -0,0 +1,69 @@
import time
import logging
from datetime import datetime, timedelta, timezone
from concurrent.futures import ThreadPoolExecutor
import argparse
import PIL.Image

from loguru import logger
import sys

from lemmy_safety.check import check_image
from lemmy_safety import local_storage
from lemmy_safety import database
from PIL import UnidentifiedImageError

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(module)s:%(lineno)d - %(message)s', level=logging.WARNING)


arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('--all', action="store_true", required=False, default=False, help="Check all images in the storage account")
arg_parser.add_argument('-t', '--threads', action="store", required=False, default=100, type=int, help="How many threads to use. The more threads, the more VRAM requirements, but the faster the processing.")
arg_parser.add_argument('-m', '--minutes', action="store", required=False, default=20, type=int, help="The images of the past how many minutes to check.")
arg_parser.add_argument('--dry_run', action="store_true", required=False, default=False, help="Will check and report but will not delete")
args = arg_parser.parse_args()


def check_and_delete_filename(file_details):
    # None means the image could not be loaded at all
    is_csam = None
    try:
        image: PIL.Image.Image = local_storage.download_image(str(file_details["filepath"]))
    except UnidentifiedImageError:
        logger.warning("Image could not be read. Returning it as CSAM to be sure.")
        # `image` is unbound on this path, so the check below must be skipped
        is_csam = True
    else:
        if image:
            is_csam = check_image(image)
    if is_csam and not args.dry_run:
        local_storage.delete_image(str(file_details["filepath"]))
    return is_csam, file_details

def run_cleanup(cutoff_time = None):
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for file_details in local_storage.get_all_images(cutoff_time):
            if not database.is_image_checked(file_details["key"]):
                futures.append(executor.submit(check_and_delete_filename, file_details))
            if len(futures) >= args.threads:
                for future in futures:
                    result, fdetails = future.result()
                    database.record_image(fdetails["key"],csam=result)
                logger.info(f"Safety Checked Images: {len(futures)}")
                futures = []
        for future in futures:
            result, fdetails = future.result()
            database.record_image(fdetails["key"],csam=result)
        logger.info(f"Safety Checked Images: {len(futures)}")

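if __name__ == "__main__":
    if args.all:
        run_cleanup()
    else:
        while True:
            try:
                cutoff_time = datetime.now(timezone.utc) - timedelta(minutes=args.minutes)
                run_cleanup(cutoff_time)
                time.sleep(30)
            except Exception as err:
                # Log the failure instead of hiding it; a bare except would also swallow Ctrl+C
                logger.error(f"Cleanup pass failed, retrying in 30 seconds: {err}")
                time.sleep(30)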
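
The batching pattern in `run_cleanup` can be isolated as a small standard-library sketch. `process_in_batches` and its parameters are hypothetical names chosen for illustration, and `worker` stands in for `check_and_delete_filename`:

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_batches(items, worker, batch_size, max_workers):
    """Submit items to a thread pool and drain results every batch_size submissions."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for item in items:
            futures.append(executor.submit(worker, item))
            if len(futures) >= batch_size:
                # Block until the whole batch has finished before submitting more
                results.extend(future.result() for future in futures)
                futures = []
        # Drain whatever is left after the final partial batch
        results.extend(future.result() for future in futures)
    return results
```

In this shape `max_workers` bounds how many items run at once, while `batch_size` only bounds how many pending futures are held before results are collected. In the script above the pool is created with `max_workers=10` and `--threads` is used as the batch size, so raising `--threads` alone may not increase concurrency.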
