Commit

pretty output
e3rd committed Mar 8, 2024
1 parent 2c79d19 commit d577060
Showing 3 changed files with 118 additions and 37 deletions.
75 changes: 74 additions & 1 deletion README.md
@@ -2,9 +2,12 @@

Yet another file deduplicator.

# About

## What are the use cases?
* I have downloaded photos and videos from the cloud. Oh, both Google Photos and YouTube shrink the files and change the format. Moreover, they have shortened the file names to 47 characters and capitalized the extensions. So how do I know that I have them all backed up offline?
* My disk is cluttered with several backups and I'd like to be sure these are all just copies.
* I merge data from multiple sources. Some files in the backup might still have the original file modification date, which I might wish to restore.

## What is compared?

@@ -29,4 +32,74 @@ These imply the folders have the same structure. Deduplidog is tolerant towards

## Doubts?

The program does not write anything to the disk unless `execute=True` is set. Feel free to launch it just to inspect the recommended actions. Or set `bashify=True` to output bash commands you may launch after a thorough examination.

# Examples

It works great when launched from a [Jupyter Notebook](https://jupyter.org/).

```python3
import logging
from deduplidog import Deduplidog

Deduplidog("/home/user/duplicates", "/media/disk/origs", ignore_date=True, rename=True)
```

```
Find files by size, ignoring: date, crc32
Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with ✓).
Number of originals: 38
* /home/user/duplicates/foo.txt
/media/disk/origs/foo.txt
🔨home: renamable
📄media: DATE WARNING + a day
Affectable: 38/38
Affected size: 59.9 kB
Warnings: 1
```

We found that all the files in the *duplicates* folder seem to be useless, except one: its date is earlier than that of the original. Let's look at the full log.

```python3
Deduplidog("/home/user/duplicates", "/media/disk/origs", ignore_date=True, rename=True, set_both_to_older_date=True, logging_level=logging.INFO)
```

```
Find files by size, ignoring: date, crc32
Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with ✓).
Original file mtime date might be set backwards to the duplicate file.
Number of originals: 38
* /home/user/duplicates/foo.txt
/media/disk/origs/foo.txt
🔨home: renamable
📄media: redatable 2022-04-28 16:58:56 -> 2020-04-26 16:58:00
* /home/user/duplicates/bar.txt
/media/disk/origs/bar.txt
🔨home: renamable
* /home/user/duplicates/third.txt
/media/disk/origs/third.txt
🔨home: renamable
...
Affectable: 38/38
Affected size: 59.9 kB
```

As you can see, the log is as brief as possible, yet transparent. Files to be affected in the work folder are marked with the 🔨 icon, whereas those affected in the original folder use the 📄 icon. We might add the `execute=True` parameter to perform the actions, or use `bashify=True` to inspect them.

```python3
Deduplidog("/home/user/duplicates", "/media/disk/origs", ignore_date=True, rename=True, set_both_to_older_date=True, bashify=True)
```

Setting `bashify=True` just prints the commands we might run.

```bash
touch -t 1524754680.0 /media/disk/origs/foo.txt
mv -n /home/user/duplicates/foo.txt /home/user/duplicates/✓foo.txt
mv -n /home/user/duplicates/bar.txt /home/user/duplicates/✓bar.txt
mv -n /home/user/duplicates/third.txt /home/user/duplicates/✓third.txt
```
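
If you wanted to generate similar commands in your own script, a minimal sketch might look like the following. The helpers `touch_stamp`, `touch_command` and `rename_command` are hypothetical and not part of Deduplidog. The sketch assumes a POSIX `touch`, whose `-t` option takes a `[[CC]YY]MMDDhhmm[.SS]` stamp rather than a raw epoch value, and it quotes paths with `shlex.quote`.

```python
import shlex
from datetime import datetime
from pathlib import Path


def touch_stamp(timestamp: float) -> str:
    """Format an epoch timestamp (in local time) as CCYYMMDDhhmm.SS for `touch -t`."""
    return datetime.fromtimestamp(timestamp).strftime("%Y%m%d%H%M.%S")


def touch_command(path: Path, timestamp: float) -> str:
    """Build a shell command that sets the file times of `path` (hypothetical helper)."""
    return f"touch -t {touch_stamp(timestamp)} {shlex.quote(str(path))}"


def rename_command(path: Path, prefix: str = "✓") -> str:
    """Build a shell command that prefixes the file name, never overwriting (mv -n)."""
    target = path.with_name(prefix + path.name)
    return f"mv -n {shlex.quote(str(path))} {shlex.quote(str(target))}"
```

`shlex.quote` wraps any path containing characters outside a small ASCII-safe set in single quotes, so paths with spaces or the ✓ prefix stay intact when pasted into a shell.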

# Documentation – `Deduplidog` class

Find the duplicates. Normally, the files must have the same size, date and name. (The name might be just similar if parameters like `strip_end_counter` are set.) If `media_magic=True`, media files follow different rules: neither the size nor the date is compared. See its help.
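
To illustrate the default rule, here is a minimal sketch, not the library's implementation. `same_file_candidate` is a hypothetical helper; it assumes that "date" means the modification time, that names must match exactly, and that sub-second differences do not matter.

```python
from pathlib import Path


def same_file_candidate(work_file: Path, original: Path, ignore_date: bool = False) -> bool:
    """Return True if both files share the name, the size and (unless ignored) the mtime."""
    if work_file.name != original.name:
        return False
    work_stat, orig_stat = work_file.stat(), original.stat()
    if work_stat.st_size != orig_stat.st_size:
        return False
    # Assumption: compare whole seconds only.
    return ignore_date or int(work_stat.st_mtime) == int(orig_stat.st_mtime)
```

Passing `ignore_date=True` here mirrors the idea behind the `ignore_date=True` parameter used in the examples: only the name and the size are checked.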

78 changes: 43 additions & 35 deletions deduplidog/deduplidog.py
@@ -26,10 +26,8 @@
MEDIA_SUFFIXES = IMAGE_SUFFIXES + VIDEO_SUFFIXES

logger = logging.getLogger(__name__)
Status = dict[Path, list[str | datetime]]
"Lists changes performed/suggested to given path"
Change = tuple[Path, Path, Status]
"Work file, original file, change status"
Change = dict[Path, list[str | datetime]]
"Lists changes performed/suggested to given path. First entry is the work file, the second is the original file."


@dataclass
@@ -105,7 +103,7 @@ class Deduplidog:
accepted_img_hash_diff: int = 1
"Used only when media_magic is True"
img_compare_date = False
"If True and media_magic=True, aby se obrázek považoval za duplikát, musí mít podobný čas v EXIFu či souboru. Used only when media_magic is True."
"If True and media_magic=True, the file date or the EXIF date must match."

file_list: list[Path] = None
"Use original file list. If none, a new is generated or a cached version is used."
@@ -136,6 +134,8 @@ def __post_init__(self):
"stats counter"
self.affected_count = 0
"stats counter"
self.warning_count = 0
"stats counter"
self.ignored_count = 0
"Files skipped because previously renamed with deduplidog"
self.having_multiple_candidates: dict[Path, list[Path]] = {}
@@ -198,10 +198,12 @@ def perform(self):
raise
finally:
if self.bar:
print(f"{'Affected' if self.execute else 'Affectable'}: {self.affected_count}/{self.bar.total- self.ignored_count}", end="")
print(f"{'Affected' if self.execute else 'Affectable'}: {self.affected_count}/{len(self.file_list)- self.ignored_count}", end="")
if self.ignored_count:
print(f" ({self.ignored_count} ignored)", end="")
print("\nAffected size:", naturalsize(self.size_affected))
if self.warning_count:
print(f"Warnings: {self.warning_count}")
if self.having_multiple_candidates:
print("Unsuccessful files having multiple candidates length:", len(self.having_multiple_candidates))
# self.queue.put(None)
@@ -282,9 +284,8 @@ def _loop_files(self):
continue
except KeyboardInterrupt:
print(f"Interrupted. You may proceed where you left with the skip={skip+bar.n} parameter.")
return bar
return
break
return bar

def _process_file(self, work_file: Path, bar: tqdm):
# work file name transformation
@@ -341,7 +342,7 @@ def _process_file(self, work_file: Path, bar: tqdm):

def _affect(self, work_file: Path, original: Path):
# which file will be affected? The work file or the mistakenly original file?
status = {work_file: [], original: []}
change = {work_file: [], original: []}
affected_file, other_file = work_file, original
warning = False
if affected_file == other_file:
@@ -355,7 +356,7 @@ def _affect(self, work_file: Path, original: Path):
case True, True:
affected_file, other_file = original, work_file
case False, True:
status[work_file].append(f"SIZE WARNING {naturalsize(work_size-orig_size)}")
change[work_file].append(f"SIZE WARNING {naturalsize(work_size-orig_size)}")
warning = True
if self.affect_only_if_smaller and affected_file.stat().st_size >= other_file.stat().st_size:
logger.debug("Skipping %s as it is not smaller than %s", affected_file, other_file) # TODO check
@@ -371,13 +372,13 @@ def _affect(self, work_file: Path, original: Path):
case True, True:
# dates are not the same and we want change them
if other_date < affected_date:
self._change_file_date(affected_file, affected_date, other_date, status)
self._change_file_date(affected_file, affected_date, other_date, change)
elif other_date > affected_date:
self._change_file_date(other_file, other_date, affected_date, status)
self._change_file_date(other_file, other_date, affected_date, change)
case False, True if (other_date > affected_date):
# attention, we do not want to tamper with dates; however, the file marked as duplicate has
# a lower timestamp (which might be genuine)
status[other_file].append(f"DATE WARNING + {naturaldelta(other_date-affected_date)}")
change[other_file].append(f"DATE WARNING + {naturaldelta(other_date-affected_date)}")
warning = True

# renaming
@@ -399,7 +400,7 @@ def _affect(self, work_file: Path, original: Path):
if self.bashify:
print(f"mv -n {_qp(affected_file)} {_qp(target_path)}") # TODO check
self.passed_away.add(affected_file)
status[affected_file].append(status_)
change[affected_file].append(status_)
if self.replace_with_original:
status_ = "replacable"
if other_file.name == affected_file.name:
@@ -416,24 +417,28 @@ def _affect(self, work_file: Path, original: Path):
if self.bashify:
# TODO check
print(f"cp --preserve {_qp(other_file)} {_qp(affected_file.parent)} && rm {_qp(affected_file)}")
status[affected_file].append(status_)

self.changes.append((work_file, original, status))
suffix = " (affected):" if affected_file is original else ":"
getattr(logger, "warning" if warning else "info")("Original%s %s %s",
suffix, self._path(original), " ".join(str(s) for s in status[original]))
getattr(logger, "warning" if warning else "info")(
"Work file: %s %s", self._path(work_file), " ".join(str(s) for s in status[work_file]))

def _change_file_date(self, path, old_date, new_date, status):
change[affected_file].append(status_)

self.changes.append(change)
if warning:
self.warning_count += 1
if (warning and self.logging_level <= logging.WARNING) or (self.logging_level <= logging.INFO):
self._print_change(change)
# suffix = " (affected):" if affected_file is original else ":"
# getattr(logger, "warning" if warning else "info")("Original%s %s %s",
# suffix, self._path(original), " ".join(str(s) for s in change[original]))
# getattr(logger, "warning" if warning else "info")(
# "Work file: %s %s", self._path(work_file), " ".join(str(s) for s in change[work_file]))

def _change_file_date(self, path, old_date, new_date, change: Change):
# Consider the following use case:
# Duplicated file 1, date 14:06
# Duplicated file 2, date 15:06
# Original file, date 18:00.
# The status message will mistakenly tell that we change the Original date to 14:06 (good), then to 15:06 (bad).
# However, these are just the status messages. As we resolve the dates at launch time,
# the original date will end up as 14:06 because 15:06 will be later.
status[path].extend(("redating" if self.execute else 'redatable',
change[path].extend(("redating" if self.execute else 'redatable',
datetime.fromtimestamp(old_date), "->", datetime.fromtimestamp(new_date)))
if self.execute:
os.utime(path, (new_date,)*2) # change access time, modification time
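
The use case in the comment above can be reproduced with a small standalone sketch. `sync_to_older_mtime` is a hypothetical helper, not the method from this commit; it assumes that the file times are read at the moment of the call and that the newer file always receives the older file's time.

```python
import os
from pathlib import Path


def sync_to_older_mtime(work_file: Path, original: Path) -> float:
    """Set the newer file's access and modification time to the older mtime; return it."""
    work_date, orig_date = work_file.stat().st_mtime, original.stat().st_mtime
    older = min(work_date, orig_date)
    for path, date in ((work_file, work_date), (original, orig_date)):
        if date > older:
            os.utime(path, (older, older))  # (access time, modification time)
    return older
```

Because the dates are re-read on every call, processing two duplicates against the same original leaves the original with the oldest of the three dates, whichever duplicate is handled first.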
@@ -522,17 +527,18 @@ def image_similar(self, original: Path, work_file: Path, work_pil: Image, ref_ti
def build_originals(original_dir: str | Path, suffixes: bool | tuple[str]):
return [p for p in tqdm(Path(original_dir).rglob("*"), desc="Caching original files") if p.is_file() and not p.is_symlink() and (not suffixes or p.suffix.lower() in suffixes)]

def print_changes(mdf):
def print_changes(self):
"Prints performed/suggested changes to be inspected in a human readable form."
work, orig = "🔨", "📄"
for work_file, original, status in mdf.changes:
print("*", work_file)
print(" ", original)
for path, changes in status.items():
if not len(changes):
continue
print(f" {work}{mdf.work_dir_name}:" if path ==
work_file else f" {orig}{mdf.original_dir_name}:", *(str(s) for s in changes))
[self._print_change(change) for change in self.changes]

def _print_change(self, change: Change):
wicon, oicon = "🔨", "📄"
wf, of = change
print("*", wf)
print(" ", of)
[print(text, *(str(s) for s in changes))
for text, changes in zip((f" {wicon}{self.work_dir_name}:",
f" {oicon}{self.original_dir_name}:"), change.values()) if len(changes)]


@cache
@@ -556,6 +562,8 @@ def _qp(path: Path):
return f'"{s}"' if " " in s else s


# TODO: below are some functions that should be converted into documented utils or removed

def remove_prefix_in_workdir(work_dir: str):
""" Removes the prefix ✓ recursively from all the files.
The prefix might have been previously given by the deduplidog. """
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -3,7 +3,7 @@ requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

[tool.poetry]
version = "0.1.0"
version = "0.5.0"
description = "Deduplicate folders"
authors = ["Edvard Rejthar <[email protected]>"]
license = "GPL-3.0-or-later"