shared cache search and searchByMime don't scale well #10077
I would like to add a question about the performance of searchByMime() as well as search() outside of the shared folder. If the performance at the file cache level is OK for local storage, then why not simply call the same methods when we are searching a shared folder? From the share table we know the starting point in the file cache, so we could simply ask the local implementation of search() to look for a certain pattern beginning at the root folder provided by the sharing table. |
The local cache uses an SQL LIKE to compare all files in the local storage. We cannot simply delegate to the underlying storage because, as you pointed out correctly, we need to limit the results to shared entries. Hmm, that might actually work if we first resolve the shared file name to the path in the share owner's filespace and then search in that view. The path would be prefixed with the path in the owner's view and we could add another LIKE pattern to the query... We would need to make the local storage allow searches in subfolders only. We could add a $path parameter to search() and then use it to limit the results in the SQL query:
public function search($pattern, $path = '') {
    // normalize pattern
    $pattern = $this->normalize($pattern);
    $sql = '
        SELECT `fileid`, `storage`, `path`, `parent`, `name`,
            `mimetype`, `mimepart`, `size`, `mtime`, `encrypted`,
            `unencrypted_size`, `etag`, `permissions`
        FROM `*PREFIX*filecache`
        WHERE `storage` = ? AND ';
    $params = array($this->getNumericStorageId());
    // limit the results to the given path and its children
    if ($path !== '') {
        $sql .= '`path` LIKE ? AND ';
        $params[] = $path . '%';
    }
    // case-insensitive name matching differs per database backend
    $dbtype = \OC_Config::getValue('dbtype', 'sqlite');
    if ($dbtype === 'oci') {
        // Oracle has no case-insensitive LIKE; convert the pattern to an anchored regex
        $pattern = '^' . str_replace('%', '.*', $pattern) . '$';
        $sql .= 'REGEXP_LIKE(`name`, ?, \'i\')';
    } else if ($dbtype === 'pgsql') {
        $sql .= '`name` ILIKE ?';
    } else if ($dbtype === 'mysql') {
        $sql .= '`name` COLLATE utf8_general_ci LIKE ?';
    } else {
        $sql .= '`name` LIKE ?';
    }
    $params[] = $pattern;
    $result = \OC_DB::executeAudited($sql, $params);
    // collect the matching cache entries (as the existing local cache search() does)
    $files = array();
    while ($row = $result->fetchRow()) {
        // map the numeric mimetype ids back to their string form
        $row['mimetype'] = $this->getMimetype($row['mimetype']);
        $row['mimepart'] = $this->getMimetype($row['mimepart']);
        $files[] = $row;
    }
    return $files;
}
@icewind1991 input or feedback? |
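For illustration, this is roughly the single query such a path-limited search could end up issuing once the shared cache has resolved a shared folder to the owner's path; the storage id 42, the path 'files/Photos/' and the pattern '%holiday%' are made-up values, and the MySQL variant is shown:
-- one query for the whole shared subtree instead of one or two queries per folder
SELECT `fileid`, `storage`, `path`, `parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`
FROM `oc_filecache`
WHERE `storage` = 42                       -- numeric storage id of the share owner
  AND `path` LIKE 'files/Photos/%'         -- subtree of the shared folder in the owner's view
  AND `name` COLLATE utf8_general_ci LIKE '%holiday%';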
My idea (for a while now) is to keep track of the (grand)parent<->child relation in a table.
This makes it easy and cheap to get all children of a folder (useful for delete, move and searching) and all parents of a file (useful for etag and mtime updating), and it can be used by the sharing cache instead of the table proposed by @butonic |
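A minimal sketch of what such an ancestor/descendant relation table could look like; the table and column names are assumptions, not an agreed schema:
-- one row per (ancestor, descendant) pair, maintained alongside the filecache
CREATE TABLE `oc_filecache_relations` (
  `parentid` BIGINT NOT NULL,   -- fileid of the (grand)parent folder
  `childid`  BIGINT NOT NULL,   -- fileid of the descendant file or folder
  PRIMARY KEY (`parentid`, `childid`)
);

-- all descendants of a folder in a single query (useful for delete, move and search)
SELECT f.*
FROM `oc_filecache` f
JOIN `oc_filecache_relations` r ON r.`childid` = f.`fileid`
WHERE r.`parentid` = :folder_fileid;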
Might be useful. Independently of that, I would like to go this way:
This was exactly my idea. We have the file_cache (and in the future maybe the table suggested by @icewind1991) to keep track of the file system and I think we should use it to search within the file system. I don't see a reason why every storage should implement its own search logic. This also has the advantage that we have one code path and if we optimize the search for the file cache all storages will benefit. |
I agree, and I would also like a more advanced method that allows searching by an arbitrary combination of name, mimetype, parent, etc. I'll see if I can put together a proof of concept soon |
It's crucial for searchByMime() to have a limit argument so that apps which need to collect references to files of a certain type can do it in batches instead of having to send an array containing potentially thousands of entries to the browser. |
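As a rough sketch of the batching idea (not an agreed API), the underlying query could page through results instead of returning everything at once, and the PHP method would simply expose the limit and offset; the batch size of 500 and the parameter names are made-up:
-- fetch one batch of, say, 500 image entries instead of the whole result set (MySQL syntax)
SELECT `fileid`, `path`, `name`, `mimetype`
FROM `oc_filecache`
WHERE `storage` = :storage
  AND `mimepart` = :mimepart    -- e.g. the numeric id of the 'image' mime part
ORDER BY `fileid`
LIMIT 500 OFFSET 0;             -- next batches: OFFSET 500, 1000, ...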
getDirectoryListing() is also affected, but it will probably die once search supports a path and limits. |
@DeepDiver1975 just curious if this got on your radar at some point (or shouldn't be) ;-) /me considers himself an interested party as his gallery app doesn't work with the amount of pics he has :( |
This makes some things really cumbersome to use; here is one example: @cmonteroluque please re-evaluate whether this really should be backlog. This makes, for example, the gallery pretty much unusable on S3. Try to open the gallery on S3 (especially the "Shared" folder) and you will see what I mean 😉 |
Ok. I think in Gallery this is caused by something else, probably file locking: https://blackfire.io/profiles/f1d6349f-4ab3-4190-beab-ccbd59527844/graph @icewind1991 Ideas? |
Moving that one out to #19888 |
00004526 |
@felixboehm @bboule @butonic Is there any chance to fix this in an earlier version than 9.0? |
@cdamken can you email the link to the SFDC ticket associated with this? |
Sure! done! |
Alternative and/or additional tables to implement the file cache and/or sharing are definitely out of scope for 9.0. We need an architectural discussion about this -> 9.1 candidate. @cmonteroluque @karlitschek |
This is a case where each search on shared files took around 1 minute. As this is still in the planning phase it's unlikely that this will be in 9.0, or am I wrong? @cdamken is just asking for a quick fix/workaround. cc @MTRichards @karlitschek @cmonteroluque |
@butonic would the full text search app be a workaround? |
There have been changes in the 9.0 implementation of the shared cache search methods, haven't done any performance measurements though |
@felixboehm Only the Elasticsearch-based one for EE, because it can search in shared files. But it has never been in production AFAIK, so it needs an update and thorough testing. |
using the search elastic app now. |
@butonic close this? |
Already moved twice, no feedback, am closing this now. Please reopen if important. |
Still a problem without search_elastic. search_lucene could be used to provide a solution for small installations; it would need some work, see owncloud-archive/search_lucene#10. Or we could add this as another performance test via smashbox, cc @mrow4a |
@butonic any estimate / breakdown of tasks to improve this? So far it feels like this is a rather big project, moving to backlog. |
@cdamken FYI ^ |
The current implementation does not scale because searchByMime() as well as search() use dozens of SQL queries when walking the directory tree: getFolderContents() first looks up the source cache via the shares table, and that then looks up the files in the filecache. That is at least two queries per folder from what I can tell. Have a look at the following query log: http://pastebin.com/6V0M2NAg (I already stripped it of completely unneeded duplicate queries for user, group and whatnot, but that is yet another issue).
What I propose is keeping the current filecache and share tables and adding a relation table called 'oc_shares_files' that has the fileid and the shareid as foreign keys. To search through shared files we then pass in the pattern and the user; the query could be optimized to also show files owned by the user (see the sketch below).
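A minimal sketch of the proposed table and query, assuming the oc_filecache and oc_share columns roughly as they exist today; the relation table layout and the parameter names are assumptions, not an agreed design:
-- proposed relation table: one row per shared cache entry,
-- linking every filecache row below a share to the share that exposes it
CREATE TABLE `oc_shares_files` (
  `shareid` BIGINT NOT NULL,    -- references oc_share.id
  `fileid`  BIGINT NOT NULL,    -- references oc_filecache.fileid
  PRIMARY KEY (`shareid`, `fileid`)
);

-- search everything shared with a user in a single query
SELECT f.`fileid`, f.`path`, f.`name`, f.`mimetype`
FROM `oc_filecache` f
JOIN `oc_shares_files` sf ON sf.`fileid` = f.`fileid`
JOIN `oc_share` s ON s.`id` = sf.`shareid`
WHERE s.`share_with` = :user
  AND f.`name` LIKE :pattern;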
Maintaining the relations is not that hard: rows are added when a share is created (or when a file appears below a shared folder) and removed when the share or the file goes away; a sketch follows below.
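A minimal sketch of that bookkeeping, reusing the illustrative oc_shares_files table from above; the statements and parameter names are assumptions, not an agreed design:
-- when a folder is shared: link every cache entry at or below the shared node
INSERT INTO `oc_shares_files` (`shareid`, `fileid`)
SELECT :shareid, f.`fileid`
FROM `oc_filecache` f
WHERE f.`storage` = :owner_storage
  AND (f.`path` = :shared_path OR f.`path` LIKE :shared_path_prefix);  -- e.g. 'files/Photos/%'

-- when a share is removed: drop its rows
DELETE FROM `oc_shares_files` WHERE `shareid` = :shareid;

-- when a file is deleted: drop its rows
DELETE FROM `oc_shares_files` WHERE `fileid` = :fileid;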
All this happens inside the database. A quick test on an Oracle DB that had to insert 7736 rows took less than a second; honestly, this is what an RDBMS is made for. I ran a comparable test on my production MySQL machine, so ... IMHO performance is clearly not an issue here.
Other operations can now also be offloaded to the database in the same way.
If we were using foreign keys we could even use the cascade option to remove entries from the relation table.
Migration to a relations table is easy because it only adds the relation table.
It is basically a tradeoff: we maintain the relations of the shared files in the database in order to speed up all queries related to file shares, because we can let the database handle permissions and don't have to go through PHP and the connection for every folder.
Open questions pointed out by @schiesbn:
What happens when a different storage is mounted inside the shared folder?
My initial thought is that, since the whole storage is shared, it makes sense to have an oc_shares_storages table to track this case.
Let me take a step back and think out loud here. Our oc_filecache should contain every file only once. The oc_storages table is basically used to grant access to a subset of these files to users. The oc_share table actually does the same. The storages table also serves another purpose: naming the physical mount points in the virtual directory tree. By creating a share we effectively create a new storage that contains a subtree of the virtual directory tree. This subtree can contain any number of storages: physical mount points as well as other virtual storages. ... mind blown ... Mathematically speaking we are dealing with sets: sets of existing files (storages) and sets of accessible files, which are subsets of the filecache. The root of each accessible set is either the root of a storage or a share entry. The difference is that with storages we have a way of joining or intersecting the sets of files (because they share the same storage id), whereas with shares we cannot use joins or intersections, because they only contain the root element. All of this gets even more interesting when a folder has been shared twice to the same user (via a->b, a->c, b->d, c->d).
It seems to me we are hiding access controls in storages and shares. Maybe we should make them explicit. I need to give this some more thought ... input is welcome, maybe from someone who is much more familiar with set theory than I am.