
Add Purge Threshold #256

Open · wants to merge 3 commits into main
Conversation

@ChrisLeinbach commented Dec 13, 2024

Problem

Currently, the threshold at which files are purged from disk is hard-coded at 95%. Depending on the user and system, this can be either too high or too low. Adjusting the threshold to suit user preferences currently requires editing the script where the disk cleanup runs. This is less than ideal because it puts the control out of reach for less technical users, and the edit would be wiped out, or cause issues, during updates.

Proposed Change

Add a purge threshold setting. This setting allows the user to control the disk-usage percentage at which the purge activities run.

Detailed Description of Changes

  1. Adds a configuration item to advanced.php to set the purge threshold. Also adds a relevant description and warnings.

    • The current setting allows a range of 20-99 percent. These limits felt sensible to me, but I'm open to changing them.
    • Note: The purge threshold is still active when the user has set Keep mode. This presents a risk that a user will set a low threshold while in Keep mode and have the services shut down sooner than they expect. This may warrant a more advanced implementation if that risk is judged sufficiently large.
  2. Adds a snippet to update_birdnet_snippets.sh to set the default purge threshold. The default matches the threshold that was previously hard-coded.

  3. Changes the disk-space check in disk_check.sh to use the newly defined purge threshold (a sketch of this follows the list).

  4. Adds the purge threshold to install_config.sh
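
For illustration, a minimal sketch of how disk_check.sh could consume the new setting, assuming it is exposed in birdnet.conf as a PURGE_THRESHOLD variable (the variable name and paths here are illustrative; see the diff for the actual names):

#!/bin/bash
# Sketch only: assumes update_birdnet_snippets.sh writes a default
# PURGE_THRESHOLD=95 line into birdnet.conf (the name is illustrative).
source /etc/birdnet/birdnet.conf

# Fall back to the previously hard-coded 95% if the setting is absent.
threshold="${PURGE_THRESHOLD:-95}"

# Current usage of the recordings partition, as an integer percentage.
used=$(df --output=pcent "$HOME/BirdSongs" | tail -n1 | tr -dc '0-9')

if [ "$used" -ge "$threshold" ]; then
    echo "Disk ${used}% full (threshold ${threshold}%), running purge"
    # ... the existing purge logic in disk_check.sh would run here ...
fi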

Adds a purge threshold setting. This setting allows the user to
control the disk-usage percentage at which the purge activities
run. This replaces the default, hard-coded 95% point.

The purge threshold defaults to the original 95%. I set a minimum
of 20 and a maximum of 99 because those values felt sensible, but
I'm open to changing them based on feedback.

Note: The purge threshold is still active when the keep option is
set. I added a note for this, but it still presents some risk that
users who change this while in Keep mode could have their services
shut down earlier than they expect.

Patch: Fix a couple of typos in initial changes and improve
formatting.
@alexbelgium

Hi, this indeed looks useful for people who want to avoid running up against the limits of their disks, or who share their system's storage with another app.

@Nachtzuster (Owner)

the 'Amount of files to keep for each species' setting is meant to further limit the space used if that 95% safety valve does not suffice.
Have you considered using that?

@ChrisLeinbach (Author)

the 'Amount of files to keep for each species' setting is meant to further limit the space used if that 95% safety valve does not suffice.

Have you considered using that?

I had tried it. Admittedly I wasn't very patient, nor did I take the time to debug it, but for whatever reason it didn't clear up my usage, at least not as fast as I had expected. By the time this drew my attention it was already a problem for my Pi, so I was trying to take the most direct path.

I'm going to change my threshold back to 95%, set it to keep all files, and let it build up some data over the next few days. I'll report back as an issue if my debugging shows the 'keep x files' functionality truly isn't working as it should.

Having said that, it feels a bit off to me that two parallel but unlinked methods of managing disk space are implemented. I'm wondering if there is appetite for a rewrite of the disk management scripts into a single authoritative script?

@alexbelgium commented Jan 3, 2025

the 'Amount of files to keep for each species' setting is meant to further limit the space used if that 95% safety valve does not suffice.
Have you considered using that?

I had tried it. Admittedly I wasn't very patient, nor did I take the time to debug it, but for whatever reason it didn't clear up my usage, at least not as fast as I had expected. By the time this drew my attention it was already a problem for my Pi, so I was trying to take the most direct path.

Hi, the cron job only runs once a day, at 2am. In theory this should be enough. Watch out, though: it all depends on the number of birds. If you have 100 species with 100 MP3 recordings of 4 MB each, that makes 100 × 100 × 4 MB = 40 GB :-)

Execute this script and it will show you the number of recordings available for each of your species. Keep in mind that you will often have more files than the minimum you set: files from the last 30 days are protected, as are all files with a no-purge tag.

#!/bin/bash

source /etc/birdnet/birdnet.conf
base_dir="$HOME/BirdSongs/Extracted/By_Date"
cd "$base_dir" || true

# Get unique species
bird_names=$(
    sqlite3 -readonly "$HOME"/BirdNET-Pi/scripts/birds.db <<EOF
.mode column
.headers off
SELECT DISTINCT Com_Name FROM detections;
.quit
EOF
)

# Sanitize the bird names (remove single quotes and replace spaces with underscores)
sanitized_names="$(echo "$bird_names" | tr ' ' '_' | tr -d "'" | grep '[[:alnum:]]')"
# Remove trailing underscores
sanitized_names=$(echo "$sanitized_names" | sed 's/_*$//')

# Create an associative array to store species and their file counts
declare -A species_file_counts

# Read each line from the variable and count the files for each species
while read -r species; do
    # Suppress find errors for species that have no directory yet.
    file_count=$(find */"$species" -type f -name "*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*.*" \
        -not -name "*.png" 2>/dev/null | wc -l)
    species_file_counts["$species"]=$file_count
done <<<"$sanitized_names"

# Sort the species by file count in descending order and print them
for species in "${!species_file_counts[@]}"; do
    echo "$species : ${species_file_counts[$species]}"
done | sort -t ':' -k2 -nr
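
For example, saving the script as count_recordings.sh (a hypothetical name) and running it as the user that owns the recordings:

bash count_recordings.sh

It prints one "Species_Name : count" line per species, sorted with the most-recorded species first.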

@ChrisLeinbach (Author)

Hi, the cron job only runs once a day, at 2am

This is the part I was missing. For whatever reason I assumed that it ran as frequently as the disk check script. I switched to this repo after discovering the issue and realizing mcguirepr89's was unmaintained, so based on my sequence of events the 'keep x files' job would never have run.

alexbelgium added a commit to alexbelgium/BirdNET-Pi that referenced this pull request Jan 8, 2025
@Nachtzuster (Owner)

Though I'm not against setting the threshold per se, it kind of bothers me that 'Amount of files...' selects different files to delete than disk_check.sh does - oh well

@alexbelgium commented Jan 19, 2025

Though I'm not against setting the threshold per se, it kind of bothers me that 'Amount of files...' selects different files to delete than disk_check.sh does - oh well

Would you prefer to adapt disk_check to the "Amount of files" logic? I.e.: protect files from the last 7 days, protect files marked as locked, and delete files in order of priority, lowest confidence first, until the threshold is reached.

Currently, the disk-size logic seems to just delete unlocked files older than 90 days, irrespective of their confidence. That could lead to inconsistent outcomes (if lots of files were generated in recent days, it could fail to reach the threshold) and could leave tons of one common bird while removing the few older recordings of rarer species. We could use the same logic as "Amount of files", with the exception that the threshold to reach would be based on disk space rather than file count. Another alternative is to just run the "Amount of files" logic with a lowered "files to keep" value at each iteration until enough disk space has been freed. This would spread file deletion better across species.

Edit: it could actually be simple: if we modify the line max_files_species="${MAX_FILES_SPECIES:-1000}" in disk_species_clean.sh to also accept a value passed as an argument to the script, we could call it iteratively from disk_check.sh, starting with a high max_files_species value and decreasing it until an acceptable disk usage is reached. For example, when we hit the threshold, the objective could be to free down to 0.8x of that value (so 76% disk used when the threshold is 95%, or 40% when the threshold is 50%), then execute disk_species_clean.sh until that value is reached. A sketch of this loop follows.
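
A rough sketch of that loop, assuming disk_species_clean.sh is changed to accept max_files_species as an optional first argument (that argument, and the PURGE_THRESHOLD variable from this PR, are proposed rather than existing behavior):

#!/bin/bash
# Sketch only: the max_files_species argument below is the proposed
# change to disk_species_clean.sh, not something it supports today.
source /etc/birdnet/birdnet.conf

threshold="${PURGE_THRESHOLD:-95}"  # purge trigger proposed in this PR
target=$((threshold * 80 / 100))    # free down to 0.8x, e.g. 95% -> 76%

disk_used() {
    # Current usage of the recordings partition, as an integer percentage.
    df --output=pcent "$HOME/BirdSongs" | tail -n1 | tr -dc '0-9'
}

max_files=1000
while [ "$(disk_used)" -gt "$target" ] && [ "$max_files" -ge 10 ]; do
    # Each pass trims every species to at most $max_files recordings,
    # spreading deletions across species instead of exhausting one bird.
    "$HOME/BirdNET-Pi/scripts/disk_species_clean.sh" "$max_files"
    max_files=$((max_files / 2))
done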
