
Reimplemented xlocate as xbps-locate #585

Open
wants to merge 12 commits into base: master

Conversation

@friedelschoen (Author)

I've implemented a new xbps tool, xbps-locate! xbps-rindex now collects not only package data into index.plist inside *-repodata, but also file lists into files.plist. xbps-locate fetches files.plist from the repository pool and searches it for the desired file. I cannot test this against a full binary repository myself.

I've also added a TODO item: the cleanup mode of xbps-rindex doesn't clean files.plist yet.

I've also changed repo_open_* in lib/repo.c so that the archive iterator no longer just assumes the files are in a fixed order (they are still written in order for compatibility) but checks the actual filename of each entry.

On my computer I have to manually disable _BSD_SOURCE and _SVID_SOURCE, so there is a commit for that; I don't know whether it's specific to my machine. (Void Linux x86_64-musl)

Thanks for looking into my code!

@classabbyamp (Member) commented Jan 11, 2024

how does this affect repodata size?

why not make it part of xbps-query?

this also means x(bps-)locate loses the power of PCRE and delta-updating of the index

@classabbyamp (Member)

_BSD_SOURCE was fixed in 48c9879; please rebase

@friedelschoen (Author)

Making some calculations: about 60 bytes per file path plus some overhead, so let's say 100 bytes per file. void-packages has about 13,000 packages with roughly 50 files each (a rough guess, just assuming right now).

100 × 50 × 13,000 ≈ 65 MB uncompressed.
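The estimate above can be checked with a quick shell calculation (all three figures are the assumed ones from this comment, not measurements):

```shell
bytes_per_file=100   # ~60-byte path plus plist overhead (assumed)
files_per_pkg=50     # rough guess per package (assumed)
pkgs=13000           # approximate void-packages count (assumed)
total=$((bytes_per_file * files_per_pkg * pkgs))
echo "$total bytes"              # 65000000 bytes
echo "$((total / 1000000)) MB"   # 65 MB uncompressed
```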

Maybe using an extra file like x86_64-files, which would be downloaded independently, is an option if the overhead is too much.

You're right about loosing the power of PCRE then, maybe a third-party library?

@classabbyamp (Member) commented Jan 12, 2024

> Making some calculations: about 60bytes filepath and some overhead, let's talk about 100bytes per file. void-packages got about 13000 packages with each about 50 files (I guess, just assuming right now).
>
> 100x50x13'000 ≈ 60mb uncompressed.

$ git -C .cache/xlocate.git/ grep '.' @ | wc -l
3528851
$ git -C .cache/xlocate.git/ grep '.' @ | cut -d: -f3- | wc -c
235757398

so at the very minimum 235 MB assuming single-byte ASCII characters only and no plist overhead
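Dividing the two measured numbers gives the average path length actually in the index, which can be set against the 60-byte guess above:

```shell
lines=3528851     # paths in the xlocate index (measured above)
bytes=235757398   # total bytes of those paths (measured above)
echo "$((bytes / lines)) bytes average per path"   # 66 bytes, before any plist markup
```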

@friedelschoen (Author)

Okay! I wasn't aware it would be that much overhead to include it directly in *-repodata, so I'll look into implementing something like x86_64-files and leave the original *-repodata alone.

@0x5c commented Jan 12, 2024

It's worth noting that the existing xlocate index is already large enough to warrant being kept in git, where it still takes ages to download if you don't already have a clone and can't take advantage of the delta updating that git provides.
235 MB is well within the territory of a download that can take ages on a mediocre internet connection; 5 years ago this would have been a 30~40 minute project on the connection I had.
Even with my current, much better, internet connection, that's still an unreasonable amount of delay when simply trying to sync the repodata.

@friedelschoen (Author) commented Jan 12, 2024

After some research, here is a script that builds a plist with all the files in xlocate.git:

% cat ../make-plist.sh
#!/bin/sh
# Rough plist-shaped output, just for size estimation; run from a checkout
# of xlocate.git. printf is used because echo's handling of "\t" escapes
# is not portable across shells.
printf '<plist>\n'
printf '\t<dict>\n'

for pkg in *; do
	printf '\t\t<key>%s</key>\n' "$pkg"
	printf '\t\t<array>\n'
	awk '{print $1}' "$pkg" | while read -r file; do
		printf '\t\t\t<string>%s</string>\n' "$file"
	done
	printf '\t\t</array>\n'
done
printf '\t</dict>\n'
printf '</plist>\n'
% sh ../make-plist.sh | zstd -f9o ../files.zstd
/*stdin*\            :  5.21%   (   238 MiB =>   12.4 MiB, ../files.zstd)     
% find * -print -exec cat {} \; | zstd -f9o ../files.zstd
/*stdin*\            :  6.58%   (   197 MiB =>   13.0 MiB, ../files.zstd)  

13 MiB is still a lot to include directly in *-repodata, so I would put it into a separate *-files file which is fetched individually.

Downloading gcc-fortran, which is about 13 MB, takes 5.3 s; cloning xlocate.git takes about 11 s. Updating the git clone is surely faster, but how often is that needed if file lists don't really change with every version?

I cannot tell how accurate this comparison is or how linearly it behaves on slower networks. Please correct me if I'm wrong.

% time wget -O /dev/null https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
--2024-01-12 16:20:32--  https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
Resolving repo-default.voidlinux.org (repo-default.voidlinux.org)... 2a01:4f9:4b:42dc::d01, 65.21.160.177
Connecting to repo-default.voidlinux.org (repo-default.voidlinux.org)|2a01:4f9:4b:42dc::d01|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13833815 (13M) [application/octet-stream]
Saving to: ‘/dev/null’

100%[==============================================================================>]  13.19M  2.52MB/s    in 5.0s    

2024-01-12 16:20:37 (2.64 MB/s) - ‘/dev/null’ saved [13833815/13833815]


________________________________________________________
Executed in    5.30 secs      fish           external
   usr time  215.72 millis    0.10 millis  215.62 millis
   sys time  271.42 millis    1.01 millis  270.41 millis

% time git clone https://repo-default.voidlinux.org/xlocate/xlocate.git test
Cloning into 'test'...
Fetching objects: 18387, done.
Updating files: 100% (18503/18503), done.

________________________________________________________
Executed in   11.12 secs    fish           external
   usr time    2.86 secs    0.32 millis    2.86 secs
   sys time    1.43 secs    1.06 millis    1.43 secs

@0x5c commented Jan 12, 2024

> Then updating the git is for sure faster, but how often is that needed if files-lists don't really change with every version.

That's the reason git is used: it provides a mechanism to download only the new parts of the index ("delta updating"), keeping existing objects as they are.

@friedelschoen (Author) commented Jan 22, 2024

I've now re-implemented xlocate inside xbps-query (-o and --ownedhash) for better integration. From there you can still search by file or symlink, but also by hash! Every file hash is included in *arch*-files, so if you are looking for exactly one file you can find it without knowing its name or actual location. *arch*-files still shouldn't be too heavy. Maybe someone can have a look 👍🏼
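To illustrate the idea behind hash-based lookup, here is a minimal local sketch: it builds a tiny invented hash-to-path index and resolves a file purely by its checksum. The index format, file names, and paths here are made up for illustration; this is not the actual xbps-query/--ownedhash implementation.

```shell
# Create two sample files in a temporary directory (names are hypothetical).
tmp=$(mktemp -d)
echo "hello" > "$tmp/libfoo.so"
echo "world" > "$tmp/bar.conf"

# Build a toy "hash path" index line by line.
index="$tmp/index.txt"
for f in "$tmp/libfoo.so" "$tmp/bar.conf"; do
    printf '%s %s\n' "$(sha256sum "$f" | cut -d' ' -f1)" "$f"
done > "$index"

# Resolve a file by hash alone, without knowing its name or location.
want=$(sha256sum "$tmp/libfoo.so" | cut -d' ' -f1)
found=$(grep "^$want " "$index" | cut -d' ' -f2-)
echo "$found"   # prints the path ending in libfoo.so
```

The same principle lets a repository-wide *arch*-files index answer "which package owns the file with this checksum" without a filename search.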

Also, could someone with a binary repo build an index file with xbps-rindex and compare its speed with xlocate? I don't have the capacity to download a binary repo to test.

@friedelschoen friedelschoen force-pushed the xbps-locate branch 2 times, most recently from 196d5ff to 4ac9897 Compare January 23, 2024 12:27
3 participants