Improve procedure for finding duplicates (may not find all duplicates) #13

sypets · 2025-01-21T08:00:30Z

Originally, the DB query reported pseudo-duplicates: records whose identifiers differ only in case, which are not duplicates under a case-sensitive comparison. For example, given these records:

uid	identifier	storage
1	/ABC.jpg	1
2	/abc.jpg	1

the query

select identifier,storage,count(*) as c from sys_file group by identifier,storage having count(*) > 1

reports them as duplicates if the comparison is case-insensitive.

This is still the case, but the code now performs an additional check to see whether the identifiers are really identical:

if ($masterFileIdentifier !== $identifier) {
   // identifier is not the same, skip this one (may happen because of case-insensitive DB queries)
   continue;
}

https://github.com/ElementareTeilchen/unduplicator/blob/main/Classes/Command/UnduplicateCommand.php#L182

The problem

The first hit (ordered by uid ascending) is treated as the "master", and each following hit is compared against it. However, if we have something like this:

uid	identifier	storage
1	/ABC.jpg	1
2	/abc.jpg	1
3	/abc.jpg	1

The identifiers of the records with uid 2 and uid 3 are each compared with the identifier of record 1. Both differ from it, so each iteration is skipped via continue. Records 2 and 3 are never compared with each other, so this real duplicate is not detected.
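
To make the failure concrete, here is a minimal sketch that models the loop in plain PHP. It is not the actual code of UnduplicateCommand: the function name findDuplicatesAgainstMaster and the row layout are assumptions made for illustration, and only the identifier comparison is taken from the snippet above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified model of the current loop: the first row is
 * the master, and every later row is only compared against it.
 *
 * @param array<int, array{uid: int, identifier: string}> $rows ordered by uid ASC
 * @return int[] uids that would be treated as duplicates of the master
 */
function findDuplicatesAgainstMaster(array $rows): array
{
    $duplicateUids = [];
    $masterFileIdentifier = null;
    foreach ($rows as $row) {
        if ($masterFileIdentifier === null) {
            // first hit becomes the master
            $masterFileIdentifier = $row['identifier'];
            continue;
        }
        if ($masterFileIdentifier !== $row['identifier']) {
            // identifier is not the same as the master, skip this one
            continue;
        }
        $duplicateUids[] = $row['uid'];
    }
    return $duplicateUids;
}

$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg'],
    ['uid' => 2, 'identifier' => '/abc.jpg'],
    ['uid' => 3, 'identifier' => '/abc.jpg'],
];

// Records 2 and 3 share an identifier, but neither matches the master,
// so the result is empty and the real duplicate goes unnoticed.
$result = findDuplicatesAgainstMaster($rows);
```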

Possible solution

We could change the check in the loop. However, that approach is awkward anyway; it might be better to change the DB query.
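
For comparison, changing the check in the loop could look roughly like the sketch below: instead of comparing every row with the first hit, the rows are grouped by storage and exact identifier. The function name groupExactDuplicates and the row layout are hypothetical and not part of the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: groups rows by storage and exact (case-sensitive)
 * identifier and keeps only the groups that contain more than one record.
 *
 * @param array<int, array{uid: int, identifier: string, storage: int}> $rows
 * @return array<string, int[]> uids per "storage:identifier" key
 */
function groupExactDuplicates(array $rows): array
{
    $groups = [];
    foreach ($rows as $row) {
        // PHP string keys are compared byte-wise, so '/ABC.jpg' and
        // '/abc.jpg' end up in separate groups.
        $key = $row['storage'] . ':' . $row['identifier'];
        $groups[$key][] = $row['uid'];
    }
    // array_filter preserves the keys of the remaining groups
    return array_filter($groups, static fn(array $uids): bool => count($uids) > 1);
}

$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg', 'storage' => 1],
    ['uid' => 2, 'identifier' => '/abc.jpg', 'storage' => 1],
    ['uid' => 3, 'identifier' => '/abc.jpg', 'storage' => 1],
];

$duplicates = groupExactDuplicates($rows);
```

With the example records, only uid 2 and uid 3 end up in a common group. This still relies on the DB returning all case-insensitive candidates first, which is one reason why fixing the query itself may be the cleaner option.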

We originally did not change the DB query because we did not see an easy way to do this with the QueryBuilder based on doctrine/dbal.

A query which checks case-sensitively could look like this:

select storage,MAX(identifier),COUNT(*) AS c from sys_file group by md5(identifier),storage having count(*) > 1;

or

select storage,identifier,COUNT(*) AS c from sys_file group by BINARY identifier,storage having count(*) > 1;

We potentially need to correct this in:

UnduplicateCommand::findDuplicates: GROUP BY statement
UnduplicateCommand::findDuplicateFilesForIdentifier: WHERE statement

Resources

see https://stackoverflow.com/questions/64348050/typo3-case-sensitive-search-with-binary-in-doctrine-where-clause

sypets added a commit to sypets/unduplicator that referenced this issue Jan 21, 2025

Improve db check for duplicates

876fb39

Resolves: ElementareTeilchen#13

sypets linked a pull request Jan 21, 2025 that will close this issue

Improve db check for duplicates #14

Draft

sypets added a commit to sypets/unduplicator that referenced this issue Jan 21, 2025

Improve db check for duplicates

a88e37d

Resolves: ElementareTeilchen#13

sypets added a commit to sypets/unduplicator that referenced this issue Jan 21, 2025

Improve db check for duplicates

37ab27e

Resolves: ElementareTeilchen#13
