Improve procedure for finding duplicates (may not find all duplicates) #13

Open
sypets opened this issue Jan 21, 2025 · 0 comments · May be fixed by #14

sypets commented Jan 21, 2025

Originally, the DB query also returned pseudo-duplicates: records whose identifiers differ only in case and which are therefore not duplicates under a case-sensitive comparison. For example, with the query

select identifier,storage,count(*) as c from sys_file group by identifier,storage having count(*) > 1

the records

uid  identifier  storage
1    /ABC.jpg    1
2    /abc.jpg    1

would be reported as duplicates.

The query still behaves this way, but the code now performs an additional check and skips records whose identifier is not exactly the same:

if ($masterFileIdentifier !== $identifier) {
   // identifier is not the same, skip this one (may happen because of case-insensitive DB queries)
   continue;
}

https://github.com/ElementareTeilchen/unduplicator/blob/main/Classes/Command/UnduplicateCommand.php#L182

The problem

The first hit (ordered by uid ASC) is treated as the "master", and all following hits are compared against it. However, consider a case like this:

uid  identifier  storage
1    /ABC.jpg    1
2    /abc.jpg    1
3    /abc.jpg    1

The identifiers of the records with uid 2 and uid 3 are each compared with the identifier of record 1. Both differ from it, so both iterations are skipped via continue. Records 2 and 3 are never compared with each other, so this real duplicate is not found.
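The behaviour described above can be reproduced with a small, self-contained sketch. The function name findDuplicatesOfMaster and the array-based rows are assumptions made for illustration; this is not the actual code of UnduplicateCommand, only a stand-in for its comparison loop.

```php
<?php
// Hypothetical stand-in for the comparison loop: the row with the lowest uid
// is the "master", every other row is compared only against that master.
function findDuplicatesOfMaster(array $rows): array
{
    usort($rows, static fn(array $a, array $b): int => $a['uid'] <=> $b['uid']);
    $master = array_shift($rows);
    if ($master === null) {
        return [];
    }

    $duplicateUids = [];
    foreach ($rows as $row) {
        if ($master['identifier'] !== $row['identifier']) {
            // identifier is not the same as the master's, skip this one
            continue;
        }
        $duplicateUids[] = $row['uid'];
    }
    return $duplicateUids;
}

// The three rows a case-insensitive GROUP BY would put into one group.
$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg', 'storage' => 1],
    ['uid' => 2, 'identifier' => '/abc.jpg', 'storage' => 1],
    ['uid' => 3, 'identifier' => '/abc.jpg', 'storage' => 1],
];

$found = findDuplicatesOfMaster($rows);
echo json_encode($found), PHP_EOL; // [] : uid 2 and 3 are never paired
```

With these rows, the result is empty even though uid 2 and uid 3 share exactly the same identifier.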

Possible solution

We could change the check in the loop. However, that check is awkward anyway, so it might be better to change the DB query.
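For comparison, changing the check in the loop could look roughly like the following sketch: instead of comparing every hit with a single master, the hits are grouped by their exact (case-sensitive) identifier and storage. The function name groupExactDuplicates and the row layout are hypothetical and not taken from the extension.

```php
<?php
// Hypothetical helper: groups rows by exact identifier and storage and keeps
// only groups with more than one record. PHP string array keys are compared
// byte by byte, so '/ABC.jpg' and '/abc.jpg' end up in different groups.
function groupExactDuplicates(array $rows): array
{
    // Sort by uid so that the first uid of each group can serve as master.
    usort($rows, static fn(array $a, array $b): int => $a['uid'] <=> $b['uid']);

    $groups = [];
    foreach ($rows as $row) {
        $key = $row['storage'] . ':' . $row['identifier'];
        $groups[$key][] = $row['uid'];
    }

    // A group with a single uid is not a duplicate.
    return array_filter($groups, static fn(array $uids): bool => count($uids) > 1);
}

$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg', 'storage' => 1],
    ['uid' => 2, 'identifier' => '/abc.jpg', 'storage' => 1],
    ['uid' => 3, 'identifier' => '/abc.jpg', 'storage' => 1],
];

$groups = groupExactDuplicates($rows);
print_r($groups);
```

For the example rows this yields a single group for /abc.jpg in storage 1 containing uid 2 and uid 3, with uid 2 as the candidate master. The grouping happens in PHP here, which is the awkward part; doing it in the DB query avoids fetching the pseudo-duplicates in the first place.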

We originally did not change the DB query because we did not see an easy way to do this with the QueryBuilder based on doctrine/dbal.

A query which checks case-sensitively could look like this:

select storage,MAX(identifier),COUNT(*) AS c from sys_file group by md5(identifier),storage having count(*) > 1;

or

select storage,identifier,COUNT(*) AS c from sys_file group by BINARY identifier,storage having count(*) > 1;
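The difference between the two kinds of grouping can be tried out without a MySQL server. The sketch below uses an in-memory SQLite database via PDO, which is a swapped-in technique: the NOCASE column collation stands in for a case-insensitive MySQL collation, and COLLATE BINARY stands in for BINARY or md5(). The table layout is reduced to the three columns used in this issue, and the pdo_sqlite extension is assumed to be available.

```php
<?php
// Assumption: SQLite replaces MySQL here, only to illustrate the grouping.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// NOCASE makes comparisons on identifier case-insensitive by default.
$pdo->exec(
    'CREATE TABLE sys_file (
        uid INTEGER PRIMARY KEY,
        identifier TEXT COLLATE NOCASE,
        storage INTEGER
    )'
);
$pdo->exec(
    "INSERT INTO sys_file (uid, identifier, storage) VALUES
        (1, '/ABC.jpg', 1),
        (2, '/abc.jpg', 1),
        (3, '/abc.jpg', 1)"
);

// Case-insensitive grouping: all three records fall into one group.
$caseInsensitive = $pdo->query(
    'SELECT storage, COUNT(*) AS c FROM sys_file
     GROUP BY identifier, storage HAVING COUNT(*) > 1'
)->fetchAll(PDO::FETCH_ASSOC);

// Case-sensitive grouping: only the two exact matches are grouped.
$caseSensitive = $pdo->query(
    'SELECT storage, identifier, COUNT(*) AS c FROM sys_file
     GROUP BY identifier COLLATE BINARY, storage HAVING COUNT(*) > 1'
)->fetchAll(PDO::FETCH_ASSOC);

print_r($caseInsensitive);
print_r($caseSensitive);
```

In this setup the first query reports one group of three records, while the second reports only /abc.jpg with a count of two. Whether the MySQL variants above behave the same way on a given installation depends on the collation of sys_file.identifier and should be checked there.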

The query potentially needs to be corrected in:

  • UnduplicateCommand::findDuplicates: GROUP BY statement
  • UnduplicateCommand::findDuplicateFilesForIdentifier: WHERE statement

Resources

sypets added a commit to sypets/unduplicator that referenced this issue Jan 21, 2025
sypets linked a pull request Jan 21, 2025 that will close this issue