Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shared cache search and searchByMime don't scale well #10077

Closed
butonic opened this issue Jul 31, 2014 · 34 comments
Closed

shared cache search and searchByMime don't scale well #10077

butonic opened this issue Jul 31, 2014 · 34 comments

Comments

@butonic
Copy link
Member

butonic commented Jul 31, 2014

The current implementation does not scale because searchByMime() as well as search() uses dozens of sql queries when walking the directory tree: getFolderContents first looks up the source cache via the shares table and that then looks up the files in the filecache. That is at least two queries per folder from what I can tell. Have a look at the following query log: http://pastebin.com/6V0M2NAg

And I already stripped it of completely unneeded duplicate queries for user group and what not. But that is yet another issue.

What I propose is having the current filecache and share tables and an additional relation table called 'oc_shares_files' having the fileid and the shareid as foreign keys:

CREATE TABLE `oc_shares_files` (`shareid` int NOT NULL , `fileid` int NOT NULL , PRIMARY KEY (`shareid`, `fileid`));

-- We can then join the share table via the relations to search files
SELECT "sf"."shareid", "fc"."fileid"
FROM "oc_filecache" "fc"
JOIN "oc_share_files" "sf" ON "sf"."fileid"="fc"."fileid"
JOIN "oc_share" "s" ON "sf"."shareid"="s"."id"
 WHERE "fc"."name" LIKE ?
AND "s"."share_with" = ?

pass in the pattern and the user searching through shared files. The query could be optimized to also show files owned by the user.

Maintaining the relations is not that hard:

-- add all children i nthe same storage when a new share is created: params are shareid, fileid of the shared file and path of the shared file with % appended,
INSERT INTO  `oc_shares_files` (`shareid`,`fileid`)
  SELECT ?, `fc`.`fileid` 
  FROM `oc_filecache` `fc` 
    JOIN `oc_filecache` `fc2` ON `fc`.`fileid` = `fc`.`fileid`
  WHERE `fc`.`storage` = `fc2`.`storage`
    AND `fc2`.`fileid` = ?
    AND `fc`.`path` LIKE ?

All this happens inside the database. A quick test on an oracle db that had to insert 7736 rows took less than a second. Honestly, this is what RDBMS are made for. On my productive mysql machine I did:

SELECT count(*) FROM `oc_filecache` WHERE `path` LIKE 'files%';
+----------+
| count(*) |
+----------+
|    21750 |
+----------+
1 row in set (0.19 sec)

INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT 1, `fc`.`fileid`    FROM `oc_filecache` `fc`      JOIN `oc_filecache` `fc2` ON `fc`.`fileid` = `fc`.`fileid`   WHERE `fc`.`storage` = `fc2`.`storage`     AND `fc2`.`fileid` = 2938     AND `fc`.`path` LIKE 'files%';
Query OK, 17261 rows affected (0.04 sec)
Records: 17261  Duplicates: 0  Warnings: 0

so ... IMHO performance is clearly not an issue here.

Other operations can now also be offloaded to the database:

-- on a create hook add the file to the shares_files relation by looking up the parent in the share_files table
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT `sf`.`shareid`, `fc`.`fileid` FROM `oc_filecache` `fc` JOIN `oc_share_files` `sf` ON `sf`.`fileid`=`fc`.`parent` WHERE `fc`.`fileid` = ?

-- on a delete hook remove the file or all files of a share from the shares_files relation
-- when deleting a file (or folder)
DELETE FROM `oc_shares_files` WHERE `fileid` IN (SELECT `fileid` FROM `oc_filecache` WHERE `path` LIKE ? )
-- when deleting a share
DELETE FROM `oc_shares_files` WHERE `shareid` = ?

If we were using foreign keys we could even use the cascade option to remove entries from the relation table.

Migration to a relations table is easy because it only adds the relation table.

It is basically a speed tradeoff where we maintain the relations of the shared files in the database to speed up all queries related to file shares, because we can let the database handle permissions and don't have to go through php and the connection.

Open questions pointed out by @schiesbn:
What happens when a different storage is mounted inside the shared folder?
My initial thought is that, since the whole storage is shared, it makes sens to have a oc_shares_storages table to track this case.

Let me take a step back and think out loud here. Our oc_filecache should contain every file only once. The oc_storages is basically used to grant access to a subset of these files to users. The oc_share table actually does the same. The storages table also serves another purpose: naming the physical mount points in the virtual directory tree. By creating a share we effectively create a new storage that contains a subtree of the virtual directory tree. This subtree can contain any number of storages: physical mount poins as well as other virtual storages. ... mind blown ... Mathemathically speaking we are dealing with sets. Sets of existing files (storages) and sets of accessible files which is a subset of the filecache. The root for each accessible set is either the root of a storage or a share entry. The diffeerence is that with storages we have a way of joining or intersecting the set of files (because they share the same storage id) whereas with shares we cannot use joins or intersections, because they only contain the root element.

All of this gets even more interesting when a folder has been shared twice to the same user (via a->b, a->c, b->d, c->d)

It seems to me we are hiding access controls in storages and shares. Maybe we should make them explicit. I need to give this some more thought ... input is welcome, maybe from someone who is much more familiar with set theory than I am.

@butonic butonic changed the title shared cache search and searchByMime don't scale shared cache search and searchByMime don't scale well Jul 31, 2014
@schiessle
Copy link
Contributor

I would like to add the question about the performance of searchByMime() as well as search() outside of the shared folder? Is the performance on file cache level for the local storage OK, then why not simple call the same methods if we are searching a shared folder? From the share table we know the starting point at the file cache, so we could simply ask the local implementation of search() to search for a certain pattern beginning at the root folder provided by the sharing table.

@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

The local cache uses an SQL LIKE to compare all file in the local storage. We cannot simply delegate to the underlying storage because, as you pointed out correctly, we need to limit the results to shared entries. Hm that might actually work when first resolving the shared file name to the path in the share owners filespace and then searching in that view. The path would be prefixed with the path in the owners view and we could add another LIKE pattern to the query... We would need to make the local storage allow searches in subfolders only.

we could add a $path param to search(): and then use that to limit the results in the sql query:

    public function search($pattern, $path = '') {

        // normalize pattern
        $pattern = $this->normalize($pattern);

        $sql = '
            SELECT `fileid`, `storage`, `path`, `parent`, `name`,
                `mimetype`, `mimepart`, `size`, `mtime`, `encrypted`,
                `unencrypted_size`, `etag`, `permissions`
            FROM `*PREFIX*filecache`
            WHERE `storage` = ? AND ';

        $params = array($this->getNumericStorageId());

        // limit to path or children 
        if ($path !== '') {
            $sql .= '`path` LIKE ? AND ';
            $params[] = $path.'%';
        }

        $dbtype = \OC_Config::getValue( 'dbtype', 'sqlite' );
        if($dbtype === 'oci') {
            //remove starting and ending % from the pattern
            $pattern = '^'.str_replace('%', '.*', $pattern).'$';
            $sql .= 'REGEXP_LIKE(`name`, ?, \'i\')';
        } else if($dbtype === 'pgsql') {
            $sql .= '`name` ILIKE ?';
        } else if ($dbtype === 'mysql') {
            $sql .= '`name` COLLATE utf8_general_ci LIKE ?';
        } else {
            $sql .= '`name` LIKE ?';
        }
        $params[] = $pattern;

        $result = \OC_DB::executeAudited($sql, $params);

@icewind1991 input or feedback?

@icewind1991
Copy link
Contributor

My idea (for a while now) is to keep track of the (grand)parent<->child relation in a table.

CREATE TABLE `oc_files_tree` (`parentid` int NOT NULL , `childid` int NOT NULL , PRIMARY KEY (`parentid`, `childid`));

This makes it easy and cheap to get all childs of a folder (usefull for delete, move and searching) and all parents of a file (usefull for etag and mtime updating) and can be used by the sharing cache instead of the table proposed by @butonic

@schiessle
Copy link
Contributor

My idea (for a while now) is to keep track of the (grand)parent<->child relation in a table.

Might be useful.

Independent from it I would like to go this way:

Hm that might actually work when first resolving the shared file name to the path in the share owners filespace and then searching in that view. The path would be prefixed with the path in the owners view and we could add another LIKE pattern to the query... We would need to make the local storage allow searches in subfolders only.

This was exactly my idea. We have the file_cache (and in the future maybe the table suggested by @icewind1991) to keep track of the file system and I think we should use it to search within the file system. I don't see a reason why every storage should implement its own search logic. This also has the advantage that we have one code path and if we optimize the search for the file cache all storages will benefit.

@icewind1991
Copy link
Contributor

I agree, I would also like a more advanced method that allows searching using an arbitrary combination of name, mimetype, parent, etc

I'll see if I can make a proof of concept soon

@oparoz
Copy link
Contributor

oparoz commented Jan 4, 2015

It's crucial for searchByMime() to have a limit argument so that apps which need to collect references to files of a certain type can do it in batch instead of having to send an array containing potentially thousands of entries to the browser.

@oparoz
Copy link
Contributor

oparoz commented Feb 13, 2015

getDirectoryListing() is also affected, but it will probably die once search supports a path and limits.

@jospoortvliet
Copy link

@DeepDiver1975 just curious if this got on your radar at some point (or shouldn't be) ;-)

/me considers himself an interested party as his gallery app doesn't work with the amount of pics he has :(

@DeepDiver1975 DeepDiver1975 modified the milestone: backlog Mar 21, 2015
@LukasReschke
Copy link
Member

This makes some stuff really cumbersome to use, just a decent example:

2015-10-20_11-46-30

@cmonteroluque Please re-evaluate whether this really should be backlog. This makes for example the gallery pretty much unusable on S3. Try to open the gallery on S3 (especially the "Shared" folder) and you will see what I mean 😉

@LukasReschke
Copy link
Member

Ok. I think in Gallery this is caused by something else, probably file locking: https://blackfire.io/profiles/f1d6349f-4ab3-4190-beab-ccbd59527844/graph

2015-10-20_12-15-02

@icewind1991 Ideas?

@LukasReschke
Copy link
Member

Moving that one out to #19888

@ghost ghost modified the milestones: 9.0-next, backlog Oct 20, 2015
@cdamken
Copy link
Contributor

cdamken commented Jan 14, 2016

@MorrisJobke @butonic

00004526

@cdamken
Copy link
Contributor

cdamken commented Jan 18, 2016

@felixboehm @bboule @butonic Is there any change to fix in a earlier version than 9.0?

@bboule
Copy link

bboule commented Jan 18, 2016

@cdamken can you email the link to the SFDC ticket associated with this?

@cdamken
Copy link
Contributor

cdamken commented Jan 18, 2016

Sure! done!

@DeepDiver1975
Copy link
Member

Alternative and/or additional tables to implement the file cache and/or sharing is definitly out of scope for 9.0. We need an architectural discussion upon this -> 9.1 candidate.

@cmonteroluque @karlitschek

@MorrisJobke
Copy link
Contributor

00004526

This is a case where each search on shared files took around 1 minute. As this is still in planning phase it's unlikely that this will be in 9.0, or am I wrong? @cdamken is just asking for a quick fix/work around.

cc @MTRichards @karlitschek @cmonteroluque

@butonic butonic removed this from the 9.0-current milestone Feb 11, 2016
@felixboehm
Copy link
Contributor

@butonic would the full text search app be a workaround?

@icewind1991
Copy link
Contributor

There have been changes in the 9.0 implementation of the shared cache search methods, haven't done any performance measurements though

@butonic
Copy link
Member Author

butonic commented Feb 17, 2016

@felixboehm only the elastic search based one for ee because it can search in shared files. but it has never been in production AFAIK. so needs an update and thorough testing.

@felixboehm
Copy link
Contributor

using the search elastic app now.

@PVince81 PVince81 modified the milestones: 9.1, 9.1.1 Jul 18, 2016
@felixboehm
Copy link
Contributor

@butonic close this?

@PVince81 PVince81 modified the milestones: 9.1.2, 9.1.1 Sep 21, 2016
@PVince81 PVince81 modified the milestones: 9.1.3, 9.1.2 Oct 20, 2016
@PVince81
Copy link
Contributor

Already moved twice, no feedback, am closing this now. Please reopen if important.

@butonic
Copy link
Member Author

butonic commented Nov 30, 2016

Still a problem without search_elastic. search_lucene could be used to provide a solution for small installations. would need some work, owncloud-archive/search_lucene#10.

or we could add this as another performance test via smashbox, cc @mrow4a

@PVince81 PVince81 reopened this Nov 30, 2016
@PVince81 PVince81 modified the milestones: 9.1.4, 9.1.3 Nov 30, 2016
@PVince81 PVince81 modified the milestones: 9.1.5, 9.1.4 Feb 6, 2017
@PVince81
Copy link
Contributor

@butonic any estimate / breakdown of tasks to improve this ?

So far it feels like this is a rather big project, moving to backlog.

@PVince81 PVince81 modified the milestones: backlog, 9.1.5 Apr 13, 2017
@PVince81
Copy link
Contributor

@cdamken FYI ^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests