get_new_commits crashes #2

Open
audrism opened this issue Nov 16, 2018 · 120 comments

Comments

audrism commented Nov 16, 2018

echo tst,https://github.com/ssc-oscar/tst,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7 | /usr/bin/get_new_commits
path: tst
url: https://github.com/ssc-oscar/tst
new head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
old head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
no update!
Segmentation fault

KayGau commented Nov 16, 2018

The problem is caused by incorrectly freeing a null pointer at line 102. I'll fix the bug and commit a new version.
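A guarded cleanup helper along these lines avoids this class of crash (a minimal sketch; `safe_free` is an illustrative name, not the actual get_new_commits code, and note that `free(NULL)` is itself defined to be a no-op, so the real culprit is more likely an uninitialized or already-freed pointer reached on the "no update!" early-exit path):

```c
#include <stdlib.h>

/* Illustrative sketch: make the cleanup path safe to reach on every
 * branch by initializing pointers to NULL and clearing them after free,
 * so a repeated cleanup pass cannot free the same pointer twice. */
static void safe_free(char **p) {
    if (p == NULL || *p == NULL)
        return;       /* nothing to release */
    free(*p);
    *p = NULL;        /* prevent a later double free */
}
```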

audrism commented Nov 16, 2018

txs.

audrism commented Nov 16, 2018

Btw, the current fetch does a full clone of the repo. In other words, it does not save on network traffic at all. It would help to fetch only the objects associated with new commits.

KayGau commented Nov 20, 2018

Hi, Professor! I ran a test of the fetch operation. I added some code to observe how many objects were fetched. First, I created an empty directory, then used git_remote_fetch to fetch from the remote, and received 21 objects. Then I modified a remote file, used git_remote_fetch again, and received 3 objects. It looks like the fetch operation only fetches objects associated with new commits.

audrism commented Nov 20, 2018

Great, that's how fetch works with a populated git database. But we have an empty git database, with the objects stored elsewhere. Is it possible to modify fetch so that it looks up objects that do not need to be fetched in an external database?

KayGau commented Nov 20, 2018

I'm sorry, Professor, I don't get the point. To get new commit objects, we have to fetch from the remote repository. Then what does "looks up objects that do not need to be fetched in an external database" mean? Could you please describe it in more detail?

audrism commented Nov 20, 2018

Let's consider the example you have above:

  1. "First, I create an empty directory" (an empty git repo, I presume?),

  2. "then use the git_remote_fetch to fetch from remote, and I received 21 objects."

  3. Now let's take all these 21 objects and store their hashes in some database, e.g., a flat file.
    You can use gitListSimp.sh to do that.

  4. Create another empty git repo.

  5. Run fetch on it, but use the objects in the external database so that this fetch
    only receives the 3 new objects, not all 24 objects?
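Step 3's flat-file store could be sketched as follows (illustrative C only; `have_object` and the one-hash-per-line layout are assumptions for the example, not the actual tooling):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative "flat file database": one 40-character sha1 per line.
 * have_object() answers whether a hash is already stored, i.e. whether
 * the corresponding object does not need to be fetched again. */
static int have_object(const char *dbpath, const char *sha1) {
    char line[128];
    FILE *f = fopen(dbpath, "r");
    if (!f)
        return 0;                          /* no database: nothing stored */
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';  /* strip trailing newline */
        if (strcmp(line, sha1) == 0) {
            fclose(f);
            return 1;
        }
    }
    fclose(f);
    return 0;
}
```

A real external store at scale would of course index the hashes rather than scan a file linearly.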

KayGau commented Nov 20, 2018

Hi, Professor!

  1. In the first step, it indeed creates a directory (an empty git repo) to store the .git directory. I didn't delete the directory when the program finished.
  2. In the third step, I tried gitListSimp.sh; in the shell, I typed the following command: bash gitListSimp.sh /home/kgao/Desktop/test/.git
    and it prints all the commits.
    After running the script, the directory created in the first step still exists and the .git is still there. So there is no need to create another empty git repo in step 4; just use the directory created in step 1, run fetch on it, and use the objects in its .git.

audrism commented Nov 20, 2018

The use case is that I have a database of over 11B git objects and would like to update it. In the considered example, all 21 objects obtained in the first step are
imported into that database, but the repo /home/kgao/Desktop/test/.git is no longer there.
I'd like to retrieve only the three new objects, not all 24 objects.

KayGau commented Nov 21, 2018

Oh, now I see the key point. You have collected the commits but have no local repository, so when I fetch, there is no local repo to compare with; I need to compare with the commits in the database. I thought you had a local repo. Then I'll need to modify the fetch logic. I'll try!

audrism commented Nov 21, 2018

Thank you. You can assume that the hashes and content of any existing objects can be obtained from the database.

KayGau commented Nov 23, 2018

I traced the execution of fetch and found that the git_smart__negotiate_fetch function in smart_protocol.c tells the remote what we want and what we have. I modified it to tell the remote that we have the commits we got last time, but later, when execution reached the git_smart__download_pack function in smart_protocol.c, the fetch operation raised an error! I checked the output and found the remote indeed knew that we have local commits. Then I checked the function and found that when downloading from the remote, it first writes the repo's odb to a writepack. The odb is the object database in the repo, which means that to correctly download the new commits we must have a local repo...

audrism commented Nov 23, 2018

Yes, we need to have a local repo. The question is whether we just need to prepopulate it with the objects we have, or whether we can replace it with an alternative database, e.g., https://github.com/libgit2/libgit2-backends

KayGau commented Nov 23, 2018

OK. By the way, what type of database is used? MySQL, Redis, or SQLite?

audrism commented Nov 23, 2018

It's a custom database based on tokyocabinet, since none of the above works at that scale. You can assume an arbitrary interface to it. The difference from the git native object store is that you need to pass the repo url (to get the heads associated with a specific repo).

KayGau commented Nov 23, 2018

OK. Thank you, Professor!

audrism commented Nov 23, 2018

Also, the type of the object (tag/tree/blob/commit) needs to be passed as well as the repo.

KayGau commented Nov 29, 2018

Hi, Professor! I wrote 4 source files these days to populate blob, tree, commit, and tag objects into the odb individually. Then I fetched from the remote. The result is that it still fetches all objects, including those we populated into the odb... It seems that populating the odb alone doesn't work.

audrism commented Nov 29, 2018

Have you checked them in? So what is the difference between a fetch done on a naturally created repo and a prepopulated one? Have you set the heads properly? For example:
cat gather/.git/refs/heads/master
e01c6718c3b9264926886d6190b05bfa1d069167

When prepopulating the object store, heads would not necessarily be set. It would be best to compare the exact differences between the two repos; that will hint at why fetch behaves differently.

KayGau commented Dec 2, 2018

I figured out the cause: when creating the first commit, libgit2 won't create a branch, so I need to create the master branch manually.
By the way, when populating a binary blob (such as an executable file) into the odb, libgit2 only offers the API git_blob_create_fromdisk() to read a file from the filesystem and write its content to the odb. So I am wondering whether the binary blob's content is stored in the database as a file?

audrism commented Dec 2, 2018

Great! It checks .git/refs/heads/* for the latest commit on each branch and compares them to what is on the server. That way only diffs need to be sent back.

Can you pre-populate the repo only with commits, or, preferably, only the last commit?
Re-creating a large repo from the database may be time-consuming; perhaps you can reverse-engineer what happens when fetch is done on a non-empty master?

KayGau commented Dec 2, 2018

Indeed, populating all objects is time-consuming. I'll have a try.

KayGau commented Dec 6, 2018

Hi, Professor! I tried to populate only commit objects these days. I found that the function git_commit_create_with_signature() could help, so I modified its logic to populate only commit objects. It works: it can populate commit objects, and it can also populate only the last commit.

However, an error happened similar to the one when I earlier tried to modify the fetch logic to look up objects that do not need to be fetched in an external database.
When I used the database containing only the last commit, or all the commits, to fetch (git_remote_fetch) from the remote, a fetch error occurred. I checked .git/objects and found that it has a damaged pack file.
I repeated this with the git fetch command. I printed the fetch process, which showed that the remote indeed knew that we have the last commit, but when downloading from the remote, it ignored some objects. I found that these ignored objects are related to the objects missing in the database, such as the blob and tree objects.

audrism commented Dec 6, 2018

Great.
What happens is that first the additional commits are determined on the server, but in addition to commits, the new trees (and subtrees) associated with the new commits need to be sent. In order to determine what new trees and associated blobs need to be sent, a diff is done to calculate the delta. It seems that part of the diff may be done on the client side or, more likely, the loose objects are combined into the compressed storage (pack file), and the prepopulated commit does not have its tree, hence the crash.

Can you track the attempts to access local objects during fetch, e.g., using trace or by adding prints to the relevant functions?
Once such calls are identified, we can replace them with calls to an external database.

KayGau commented Dec 6, 2018

Oh, yes. I'll try to find the corresponding logic. Thank you for your hint!

KayGau commented Feb 19, 2019

Hi, Professor!
When only the latest commit was populated, fetch failed because earlier objects were missing. So I modified the logic to resolve that and succeeded.
Then another error occurred: missing delta bases. I printed out the missing object and found it is the latest commit's tree. I tried to populate the latest commit's tree into the odb, but it doesn't work, so I tried to figure out the reason. The fetch protocol downloads data, a packfile, which is binary data organized relative to earlier objects called "delta bases"; using those delta bases, the packfile is decompressed. The "missing delta bases" error occurs when the odb lacks those earlier objects; only when all of those objects are in the odb does the error not occur. Thus, we must have access to the external database. So I want to ask whether the external database is complete, for example including binary files, pictures, and so on. If not, the error will happen again.

audrism commented Feb 19, 2019

The external database contains the shas of all objects.
It stores the content of all tags, trees, and commits, but of blobs only the text ones.

Can you stop fetch once it downloads the pack file?
All the needed objects should be in the pack file, and the binary blobs don't need to
be extracted. The pack file format is not too complicated; there is no need to use git
to unpack it.

This talks a bit about the pack file format:
https://git-scm.com/book/en/v2/Git-Internals-Packfiles

In other words, can you write a function that
a) allows fetch to download the pack file,
b) stops there?

I think the packfile may contain all that is needed.
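For reference, the fixed 12-byte header at the start of every pack file can be validated in a few lines of C (a sketch based on the format documentation linked above, not libgit2 code):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Read a 4-byte big-endian integer, as used in the pack header. */
static uint32_t be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the 12-byte packfile header: 4-byte magic "PACK", 4-byte
 * big-endian version (2 or 3), 4-byte big-endian object count.
 * Returns 0 on success and fills version/nobj; -1 if not a packfile. */
static int parse_pack_header(const unsigned char *buf, size_t len,
                             uint32_t *version, uint32_t *nobj) {
    if (len < 12 || memcmp(buf, "PACK", 4) != 0)
        return -1;
    *version = be32(buf + 4);
    if (*version != 2 && *version != 3)
        return -1;
    *nobj = be32(buf + 8);
    return 0;
}
```

This is enough to confirm that the captured download really is a packfile and to see how many objects it claims to carry.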

KayGau commented Feb 21, 2019

I get the point. I need to capture what the fetch operation receives from the remote. I think maybe I can print the data to a file.
But I am not sure whether the received data is consistent with the packfile format, so that needs to be confirmed. This talks about the format in detail:
https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt
I will then validate the received data format. If it matches, I will try to decompress it. This talks about that: https://codewords.recurse.com/issues/three/unpacking-git-packfiles

audrism commented Feb 21, 2019

Try to save it first (it might actually save it on its own). I can help you with
decoding if it is complicated.

KayGau commented Feb 24, 2019

I checked the library code, but I didn't find code that saves it. When fetching, libgit2 receives data from the remote; the data is stored in a struct git_pkt_data. libgit2 keeps receiving until all the data has been downloaded, and then decompresses it. "Missing delta bases" occurred while decompressing.
I then tried to write the data in the struct git_pkt_data to a file and succeeded. Is this OK?
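Dumping the received chunks could look roughly like this (a sketch; `append_chunk` is a hypothetical helper, and git_pkt_data's exact layout is not reproduced here):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical capture helper: append one received chunk (e.g. the
 * payload of a git_pkt_data packet) to a capture file. Chunks arrive
 * in order, so appending reconstructs the packfile byte stream. */
static int append_chunk(const char *path, const char *data, size_t len) {
    FILE *f = fopen(path, "ab");   /* append in binary mode */
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, len, f);
    fclose(f);
    return written == len ? 0 : -1;
}
```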

KayGau commented Apr 10, 2019

I found the bug. It is because of the input: the input should obey the following format:
hash;head's name;offset;length
In tst/linux-stable.idx, the last field is not the length but the end offset. So it should look like this:
0f0910a100951204a48052ce62ca72915511ecc6;master;0;1632
8433e5c9c8304b750c519ce3e0940dab675f6573;linux-3.18.y;1632;302
8433e5c9c8304b750c519ce3e0940dab675f6573;linux-3.18.y-queue;1934;302
0199619b21f7320482e8a2db14cf8bc974a7766a;linux-4.1.y;2236;301
623dfab42becf5c56c9a31b7eaf90cb6eb86459f;linux-4.1.y-queue;2537;1110
Then it can successfully fetch. In this case, it fetches 2,825,657 objects, which is still a lot, but I think that is because some branches are still not provided.
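Parsing that input format can be sketched like this (illustrative C; `parse_idx_line` and the struct are assumptions for the example, not the actual batch_fetch parser):

```c
#include <stdio.h>
#include <string.h>

/* One entry of the index format discussed above:
 *   sha1;ref name;offset;length
 * Ref names may contain '/' (e.g. for-greg/3.18-2) but not ';'. */
struct idx_entry {
    char sha1[41];    /* 40 hex chars + NUL */
    char name[256];
    long offset;
    long length;
};

/* Returns 0 on success, -1 on a malformed line. */
static int parse_idx_line(const char *line, struct idx_entry *e) {
    /* %40[^;] caps the sha at 40 chars; %255[^;] caps the ref name */
    if (sscanf(line, "%40[^;];%255[^;];%ld;%ld",
               e->sha1, e->name, &e->offset, &e->length) != 4)
        return -1;
    return strlen(e->sha1) == 40 ? 0 : -1;
}
```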

audrism commented Apr 24, 2019

I created a realistic test case of updating one linux kernel repo:
tst/ls.idx has commits from an earlier state of the repo (tst/linux-stable.heads.1555007357) and I am trying to update to the current version (tst/linux-stable.heads.1556119740)

Btw, not all commits in tst/linux-stable.heads.1555007357 are in the cloned repo.

cat tst/ls.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/ls.bin tst/ls.pack
tree ada630e1da499723c827ba0ff1084f93daf9ed9c
parent b89e3859db0658df57abfb1396ebad8d1f4580bb
au
  thor Steve F ;ch <stf`        @microsoft.com> 1552856318 -0500
Result of SHA1 : 8ee9a2d029c9980a3545c2acbeaa8def113f5b88
Segmentation fault

KayGau commented Apr 25, 2019

According to the output, I think it is because of the ls.bin file. I downloaded it but cannot open it locally. I converted it to UTF-8, but it shows garbled text (the cat command shows the same). So I am wondering whether it was correctly produced? The former .bin file (linux-stable.bin) can be opened correctly.
I am a little confused about why "not all commits in tst/linux-stable.heads.1555007357 are in the cloned repo". I checked a case locally: when I download a repo, find .git/objects covers all the commits, but find .git/refs shows only the master branch.

audrism commented Apr 25, 2019

  1. Correct, the batch wants uncompressed commits, but it would make sense to read compressed data. Can you change batch to read compressed data? (It's the zlib format that git uses to compress objects/create pack files.)

  2. No, there are several objects missing. For example,
    these commits are in tst/linux-stable.heads.1555007357
    but are not in the currently cloned version of https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable

2c56cc648c953f4c55d215bda8894d2d1af083d0
363a48d6a7728ae167d82c96b8198f30df245e73
3d9c55a7eaf3ab272e3931c0d43a0082d594f745
4015156a8247efed44281818a518c95e37323593
435c43a7aaa2eb50996391fa7ec11945c341d71d
4974ffe3e72f4a065a9b8f01661a378156b94bd8
64c0aa2ee0e25f49da0ff7aaf04595de61f23306
78d31592da78ad793dba5a289d3c93d0edbb58c0
9a7c9255ec3851bb32ced8dbd271acc3ad125bc5
9eeacd838a2e7d3d838b4b4a0808056383d121ae
a017fd9894843a081fe409688ae4d02e907cfbe1
a9ef068a445f4897b8ebe3a1c42c0e7f25d1bb53
aecbffbf4512172fc26f835affbd8c963585f944
b6f2e7667ad28631b07a70e3c45ac089f5db593f
e3a3a99e4c112ec0bf891cdca2a15c068ce7a0de
eb1937f6e059b8e15f3ec51d57549f962e91c01d
edd999994aac4fa9336d3d9a140908f355364d08
f471faf0251d0b8660d0a4c9f8f709183054bbca

audrism commented Apr 25, 2019

Here are the results on the uncompressed version:

cat tst/tst.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/tst.bin tst/ls.pack
.....
filter_wants head=1359 local=0 id=8b27b23bbd78750df2eb8a5c59ad067acbb0d273 name=refs/tags/v4.1.48 /home/audris/swsc/libgit2/src/fetch.c:141
filter_wants head=1360 local=0 id=0199619b21f7320482e8a2db14cf8bc974a7766a name=refs/tags/v4.1.48^{} /home/audris/swsc/libgit2/src/fetch.c:141
git_fetch_negotiate need 85 /home/audris/swsc/libgit2/src/fetch.c:177
fetch error: Object not found - no match for id (51a60126aea86f259169d74fb1de5ca3d6f6481b)

KayGau commented Apr 26, 2019

OK, I'll try to find the error.

audrism commented Apr 26, 2019

Here is an update (nothing should be retrieved as of now), as
tst/1556204601.* has all the commits and head labels:

cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack
tree aca070ed8a759fe2d241e188964c7a43190466dd
parent cc59bae2c6b2ab2bc277fbcf09944f03c8d5a8ed
author Tadeusz Struk <[email protected]> 1553711558 -0700
committer Sasha Levin <[email protected]> 1556069556 -0400

tpm: fix an invalid condition in tpm_common_poll

[ Upstream commit 7110629263469b4664d00b38ef80a656eddf3637 ]

The poll condition should only check response_length,
because reads should only be issued if there is data to read.
The response_read flag only prevents double writes.
The problem was that the write set the response_read to false,
enqued a tpm job, and returned. Then application called poll
which checked the response_read flag and returned EPOLLIN.
Then the application called read, but got nothing.
After all that the async_work kicked in.
Added also mutex_lock around the poll check to prevent
other possible race conditions.

Fixes: 9488585b21bef0df12 ("tpm: add support for partial reads")
Reported-by: Mantas Mikulėnas <[email protected]>
Tested-by: Mantas Mikulėnas <[email protected]>
Signed-off-by: Tadeusz Struk <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: James Morris <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>

Result of SHA1 : ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f
tree ada630e1da499723c827ba0ff1084f93daf9ed9c
parent b89e3859db0658df57abfb1396ebad8d1f4580bb
author Steve French <[email protected]> 1552856318 -0500
committer Sasha Levin <[email protected]> 1553731709 -0400

fix incorrect error code mapping for OBJECTID_NOT_FOUND

[ Upstream commit 85f9987b236cf46e06ffdb5c225cf1f3c0acb789 ]

It was mapped to EIO which can be confusing when user space
queries for an object GUID for an object for which the server
file system doesn't support (or hasn't saved one).

As Amir Goldstein suggested this is similar to ENOATTR
(equivalently ENODATA in Linux errno definitions) so
changing NT STATUS code mapping for OBJECTID_NOT_FOUND
to ENODATA.

Signed-off-by: Steve French <[email protected]>
CC: Amir Goldstein <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>

Result of SHA1 : 51a60126aea86f259169d74fb1de5ca3d6f6481b
Segmentation fault

KayGau commented May 3, 2019

  • The segmentation fault is caused by the fopen function. Git allows a branch name to contain a forward slash, i.e. '/', but the filesystem regards '/' as a directory separator. For the 51a60126aea86f259169d74fb1de5ca3d6f6481b branch, the branch name is for-greg/3.18-2, so when writing to refs/heads, a segmentation fault occurred. I also checked that a directory called for-greg then exists in refs/heads. So I wrote a function that fopens a file even though its directory doesn't exist yet.

  • To change batch to read compressed data, can you upload a test case including a file that contains the compressed data and a file that contains the index? Thx.
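Such a function might look roughly like this (a hypothetical sketch on POSIX, not the actual batch_fetch code):

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <errno.h>

/* Create every missing parent directory of `path`, then fopen it.
 * Needed because a ref like refs/heads/for-greg/3.18-2 contains '/',
 * which the filesystem treats as a directory separator. */
static FILE *fopen_mkdirs(const char *path, const char *mode) {
    char buf[4096];
    if (strlen(path) >= sizeof(buf))
        return NULL;
    strcpy(buf, path);
    for (char *p = buf + 1; *p; p++) {   /* buf + 1 skips a leading '/' */
        if (*p == '/') {
            *p = '\0';                   /* temporarily terminate prefix */
            if (mkdir(buf, 0777) != 0 && errno != EEXIST)
                return NULL;             /* parent could not be created */
            *p = '/';
        }
    }
    return fopen(path, mode);
}
```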

audrism commented May 3, 2019

No seg fault, but still does not work:

cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack

....
filter_wants head=1383 local=0 id=0199619b21f7320482e8a2db14cf8bc974a7766a name=refs/tags/v4.1.48^{} /home/audris/docker/libgit2/src/fetch.c:141
git_fetch_negotiate need 96 /home/audris/docker/libgit2/src/fetch.c:177
fetch error: Object not found - no match for id (ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f)

however ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f is in tst/1556204601.bin

(btw, tst/1556204601.bin is uncompressed, should it be compressed now?)

KayGau commented May 4, 2019

  • I haven't changed batch_fetch to a compressed version yet.

  • I checked tst/1556204601.idx; its format is not correct: the branch name should follow the sha1. Did you change it locally? If so, can you upload it? I tested it using

ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f;queue-5.0;0;1313
51a60126aea86f259169d74fb1de5ca3d6f6481b;for-greg/3.18-2;1313;825
022ee96afba9847ce136484d3a23cf82820e09a4;for-greg/3.18-4;2138;737
f0910a100951204a48052ce62ca72915511ecc6;master;26341;1632
and it works well, downloading about 400MB of data.

audrism commented May 4, 2019

I committed fixed 1556204601.idx and tst.idx
It works for both, though the packfile has a lot of objects:
nObj=2732403

KayGau commented May 4, 2019

Glad to hear that it works. Next, I will change batch_fetch to read compressed data, so can you upload a test case, i.e., a file containing the compressed data and a file containing the indexes?

audrism commented May 5, 2019

Another issue:
cat tst/ls.1556989710.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/ls.1556989710.bin tst/ls.1556989710.pack

does nothing, even though the upstream repo has new commits, e.g.,
this is new:
9e75e9b555146056d400d1685b7fe35294ea5c46 refs/heads/for-greg/3.18-7
for
4d70cdadcd344d56e8a13ea5571299993c8f0918 refs/heads/for-greg/3.18-7

audrism commented May 29, 2019

@KayGau Here is a test case that crashes
cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack
...
Result of SHA1 : 51a60126aea86f259169d74fb1de5ca3d6f6481b
Segmentation fault

Can you take a look at what is going on here?

KayGau commented May 30, 2019

OK. I am sorry that I was busy with my graduation project in the last few weeks. I will fix it as soon as possible!

KayGau commented Jun 10, 2019

Hi, Audris.

@KayGau Here is a test case that crashes
cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack
...
Result of SHA1 : 51a60126aea86f259169d74fb1de5ca3d6f6481b
Segmentation fault

Can you take a look at what is going on here?

The error was because I used the wrong method to write the sha1 value into the .git/refs/heads directory. I have fixed the bug.

KayGau commented Jun 10, 2019

Another issue:
cat tst/ls.1556989710.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/ls.1556989710.bin tst/ls.1556989710.pack

does nothing, even though the upstream repo has new commits, e.g.,
this is new:
9e75e9b555146056d400d1685b7fe35294ea5c46 refs/heads/for-greg/3.18-7
for
4d70cdadcd344d56e8a13ea5571299993c8f0918 refs/heads/for-greg/3.18-7

I also checked this issue. Using the latest batch_fetch, it can correctly populate all objects provided by ls.1556989710.bin into an empty repository. But an error occurred:

fetch error: The given reference name 'refs/heads/refs/tags/for-greg-3.18-01102018^{}' is not valid

I found that it was not a head. I remember that all these 'heads' were obtained using the git_get_last function. I tested this function and found it returns all of the remote's refs, not only heads. In general, they include the following:

  1. heads, stored in the .git/refs/heads directory; each contains the latest commit SHA of a remote branch. For example, refs/heads/for-greg/4.14-1 in ls.heads.1556989710
  2. tags, stored in the .git/refs/tags directory; each contains a remote tag's SHA. For example, refs/tags/for-greg-3.18-01102018 in ls.heads.1556989710
  3. commits pointed to by tags. For example, refs/tags/for-greg-3.18-01102018^{} in ls.heads.1556989710; such an entry follows the tag that points to it, with ^{} appended to the tag name
  4. the remote's own remote refs. For example, refs/remotes/sstable/for-greg/3.18-5 in ls.heads.1556989710
  5. notes, a git feature. For example, refs/notes/stable in ls.heads.1556989710

I reread libgit2's fetch implementation. It first receives all remote refs and checks whether they are in the local repo. So I think I should modify batch_fetch to:

  1. populate all objects pointed to by remote refs into an empty repo. Here we should query WoC to get all ref-pointed objects, i.e., commits and tags
  2. populate all commit and tag sha1 values into the .git/refs directory

audrism commented Jun 11, 2019

Wonderful investigation!

I am currently selecting heads via git ls-remote, and no longer using git_get_last.

  1. I can simply exclude refs with grep -v refs/tags | grep -v refs/remotes | grep -v refs/notes;
    however, as you noted, we need to store them all, otherwise it will try to get them again: is that correct?

  2. Or can we modify fetch to exclude remote refs that are not heads from the update?

To test the first approach I can produce an ls.heads.1556989710 that contains not just the commits but also all the tags.

KayGau commented Jun 11, 2019

For approach 1: if we exclude them, it will try to get them again.
According to libgit2, when fetching, the remote sends all of its refs to the local side. The local side then checks whether they exist in the local repo; if they do, it marks them as not needing an update. Then the local side sends all these checked refs plus the local refs to the remote, and the remote calculates what the local side needs.
Yes, we need to include not only commits but also tags in ls.heads.1556989710, i.e., all remote ref objects.
Then I will store them all (heads, tags, remotes, notes) in the correct positions.
BTW, could you normalize all these ref names to the following format: remove 'refs/' and keep the rest. For example: heads/for-greg/4.14-1 in ls.heads.1556989710, tags/for-greg-3.18-01102018, remotes/sstable/for-greg/3.18-5, tags/for-greg-3.18-01102018^{}, notes/stable
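The requested normalization is a simple prefix strip, e.g. (an illustrative helper, not part of the actual tooling):

```c
#include <string.h>

/* Drop a leading "refs/" so "refs/heads/for-greg/4.14-1" becomes
 * "heads/for-greg/4.14-1"; names without the prefix pass through. */
static const char *strip_refs_prefix(const char *name) {
    const char *prefix = "refs/";
    size_t n = strlen(prefix);
    return strncmp(name, prefix, n) == 0 ? name + n : name;
}
```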

audrism commented Jun 11, 2019

So you don't think you can change git fetch to send back the tags/remotes/notes it receives from the remote together with the local heads?

Otherwise, what needs to be stored for the tags/remotes/notes: just the sha1 or the content itself?

@KayGau
Copy link

KayGau commented Jun 30, 2019

Libgit2 will send back the tags/remotes/notes it receives from the remote together with the local heads after checking.
We need the sha1, the content, and the ref names, just as in the former format:
an index file:
sha1;name;offset;length
and a bin file:
containing the commit and tag object content

audrism commented Jun 30, 2019

Would git ls-remote give at least the sha1's of all the needed pieces?
What git object represents "notes" and "remotes", or are these just strings?

audrism commented Jul 1, 2019

Ok, I found a description of git notes:
"Commit notes are blobs containing extra information about an object (usually information to supplement a commit’s message). These blobs are taken from notes refs. A notes ref is usually a branch which contains "files" whose paths are the object names for the objects they describe, with some directory separators included for performance reasons [1]."

I don't think there is a need to store them, as they can be obtained from scratch during a fetch.
Similarly for remotes.

In other words, prepopulate the repo only with commits and, perhaps, tags.
Would that work?

KayGau commented Jul 3, 2019

According to libgit2, it will first check all the refs the remote sends back and mark the refs the local side doesn't have. Then it sends all the checked remote refs to the remote along with the local refs. I tried prepopulating only commit objects; it works most of the time, but there are some cases where it behaves wrongly (I constructed these cases manually):

  1. Considering only one branch: at time A, the latest ref points to commit a. After time A, I tag commit a's earlier commit b, and there is no new commit. At time B, I run our git fetch: it fetches nothing, but it should fetch the new tag.
  2. I tested our git fetch on a real project: terminal. The remote sent back some commit objects that are pull requests. When populating only commit objects, it fetches all of a branch's git objects.

audrism commented Jul 3, 2019

Does that mean that if both commits and tags are pre-populated it would work fine? I am not sure how to prepopulate notes and remotes.

KayGau commented Jul 3, 2019

I will try to pre-populate commits and tags to see if it works. Prepopulating notes and remotes is the same as prepopulating commits and tags.

KayGau commented Nov 26, 2019

Audris, libgit2's fetch is more complicated than I thought. Below is what I found:

  1. git_remote_fetch first queries all of the remote's refs (including refs/heads/, refs/remotes/, refs/notes/, refs/pull/, and many other refs in the .git/refs directory)
  2. Then git_remote_fetch checks only those remote refs with the prefix refs/heads/, marks them, and sends them to the remote
  3. git_remote_fetch organizes all local refs with the prefixes refs/heads/, refs/remotes/, and refs/notes/ by time and sends them to the remote one by one (but will send at most 256; see the comment in src/transports/smart_protocol.c, git_smart__negotiate_fetch(): "Our support for ACK extensions is ...")
  4. Then it waits for the remote to calculate what the local side needs and receives a packfile
    So, according to the procedure stated above, we need all refs with the prefixes refs/heads/, refs/remotes/, and refs/notes/.
    But I don't understand why libgit2 sends at most 256 local refs. In my opinion, the remote should know all of the local refs to calculate what the local side needs, though there are seldom 256 active branches at the same time. Maybe some investigation is needed.

audrism commented Nov 26, 2019

Thank you for clarifying. I am storing the entire packed-refs file since only that part of the git repo is retrieved via

git clone --mirror

In that case, the following command is used to update:

git remote update

I am not sure whether a similar command exists in libgit2 or whether it works differently from fetch.

KayGau commented Nov 27, 2019

git clone --mirror sets up a local mirror of the remote repository, including all refs, according to the link for git clone --mirror.
git remote update is equivalent to git fetch --all, according to the links for git remote update and git fetch.
I just searched libgit2 and didn't find similar functions.
What we use in libgit2 is the git_remote_fetch function, with the default fetch options (GIT_FETCH_OPTIONS_INIT). I tested it on a small two-branch repo, and it fetched both branches' updated objects. So I think git_remote_fetch can do what git remote update does.

KayGau commented Dec 3, 2019

I have figured out how to populate refs/notes/ and refs/remotes/, and I will complete a new batch_fetch.c soon.
Btw, I'm sorry to report that 2 disks that we brought from the USA were accidentally damaged, and some blob files are missing: blob_{10, 20, 25-29, 50, 60, 61, 83-89, 109}.bin. The following maps are also missing on our server:

  • {a2b, a2c, a2f, a2p, b2a, b2c, b2f, c2b, c2h, c2f, c2p, f2b, f2c, p2a, p2c}FullP.{0..31}.tch
  • da0: /da0_data/play/PYthruMaps/c2bPtaPkgPPY.*.gz/
  • da0: /da0_data/play/ipythruMaps/c2bPtaPkgPipy.*.gz/

So could you please copy these blob files and maps? We would appreciate it very much!
