Originally, the DB query found pseudo-duplicates: records that are not duplicates when compared case-sensitively. For example, with a case-insensitive collation, the query
select identifier,storage,count(*) as c from sys_file group by identifier,storage having count(*) > 1
reports the following two records as duplicates:
| uid | identifier | storage |
| --- | --- | --- |
| 1 | /ABC.jpg | 1 |
| 2 | /abc.jpg | 1 |
This is still the case, but the code now additionally checks whether the identifier really differs from the master's and, if so, skips the record:
if ($masterFileIdentifier !== $identifier) {
    // identifier is not the same, skip this one (may happen because of case-insensitive DB queries)
    continue;
}
The problem
The first hit (ordered by uid ASC) is treated as the "master", and each following hit is compared against it. However, suppose we have something like this:
| uid | identifier | storage |
| --- | --- | --- |
| 1 | /ABC.jpg | 1 |
| 2 | /abc.jpg | 1 |
| 3 | /abc.jpg | 1 |
The identifiers of records 2 and 3 are each compared with the identifier of record 1. Both differ from it, so both iterations are skipped via continue. Records 2 and 3 are never compared with each other, so this real duplicate goes undetected.
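To make the gap concrete, here is a minimal sketch that replays the comparison on the three example rows. It is not the extension's code: the function name `findDuplicatesOfMaster` and the in-memory row arrays are assumptions for illustration, and only the `!==` check mirrors the snippet quoted above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the loop described above: the first row
 * (lowest uid) is the master, and every later row is compared only with it.
 *
 * @param array<int, array{uid: int, identifier: string, storage: int}> $rows sorted by uid ASC
 * @return int[] uids detected as case-sensitive duplicates of the master
 */
function findDuplicatesOfMaster(array $rows): array
{
    $master = array_shift($rows);
    $duplicateUids = [];
    foreach ($rows as $row) {
        if ($master['identifier'] !== $row['identifier']) {
            // identifier differs from the master's (e.g. only in case), skip this row
            continue;
        }
        $duplicateUids[] = $row['uid'];
    }
    return $duplicateUids;
}

// The three rows a case-insensitive GROUP BY would return as one group.
$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg', 'storage' => 1],
    ['uid' => 2, 'identifier' => '/abc.jpg', 'storage' => 1],
    ['uid' => 3, 'identifier' => '/abc.jpg', 'storage' => 1],
];

// Empty result: uid 2 and uid 3 match each other but are never compared.
$found = findDuplicatesOfMaster($rows);
```

With these rows the result is an empty array, even though records 2 and 3 share exactly the same identifier and storage.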
Possible solution
We could change the check in the loop. However, that check is awkward anyway, so it might be better to change the DB query.
We originally did not change the DB query because we saw no easy way to do this with the doctrine/dbal-based QueryBuilder.
A query which checks case-sensitively could look like this:
select storage,MAX(identifier),COUNT(*) AS c from sys_file group by md5(identifier),storage having count(*) > 1;
or
select storage,identifier,COUNT(*) AS c from sys_file group by BINARY identifier,storage having count(*) > 1;
We need to potentially correct this in:
UnduplicateCommand::findDuplicates: GROUP BY statement
UnduplicateCommand::findDuplicateFilesForIdentifier: WHERE statement
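If the check in the loop were changed instead of the query, one option would be to bucket the hits by their exact identifier rather than comparing everything with a single master. The following is only a sketch of that idea under stated assumptions: `groupByExactIdentifier` and the row arrays are hypothetical and do not exist in the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: groups rows by storage and exact (case-sensitive)
 * identifier and keeps only the groups with more than one record.
 *
 * @param array<int, array{uid: int, identifier: string, storage: int}> $rows
 * @return array<string, int[]> uids per "storage:identifier" key
 */
function groupByExactIdentifier(array $rows): array
{
    $groups = [];
    foreach ($rows as $row) {
        // PHP string keys are matched byte by byte, so '/ABC.jpg' and
        // '/abc.jpg' end up in separate buckets.
        $key = $row['storage'] . ':' . $row['identifier'];
        $groups[$key][] = $row['uid'];
    }
    // A bucket with a single uid is not a duplicate.
    return array_filter($groups, static fn (array $uids): bool => count($uids) > 1);
}

$rows = [
    ['uid' => 1, 'identifier' => '/ABC.jpg', 'storage' => 1],
    ['uid' => 2, 'identifier' => '/abc.jpg', 'storage' => 1],
    ['uid' => 3, 'identifier' => '/abc.jpg', 'storage' => 1],
];

$duplicates = groupByExactIdentifier($rows);
// $duplicates === ['1:/abc.jpg' => [2, 3]]
```

Each remaining bucket could then be processed with its lowest uid as master, so records 2 and 3 are handled as a pair while record 1 is left alone. Changing the query as described above would make this regrouping unnecessary.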
The identifier check quoted above is located at https://github.com/ElementareTeilchen/unduplicator/blob/main/Classes/Command/UnduplicateCommand.php#L182