-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow serialization for MemoryFile #22277
base: master
Are you sure you want to change the base?
Allow serialization for MemoryFile #22277
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
@jwnimmer-tri I'd like you to take a look at this.
First, while cpp_yaml has Binary
it is not a broadly supported type. It can only be used directly with an Emitter
and not the EmitFromEvents
we use. The reading has similar issues. Seeing the global !!binary
tag does not lead to the creation of a Binary
(or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.
Several open questions I'd like feedback on.
- Are we good with the logic regarding
std::vector<uint8_t>
as maybe producing!!binary ....
? - Where should that get documented?
- If there's a typo on the
!!binary
tag (e.g., a local tag!binary
) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for!binary
. We simply ignore it (treating it as if it weren't there). - Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
@jwnimmer-tri I'd like you to take a look at this.
First, while cpp_yaml has
Binary
it is not a broadly supported type. It can only be used directly with anEmitter
and not theEmitFromEvents
we use. The reading has similar issues. Seeing the global!!binary
tag does not lead to the creation of aBinary
(or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.Several open questions I'd like feedback on.
- Are we good with the logic regarding
std::vector<uint8_t>
as maybe producing!!binary ....
?- Where should that get documented?
- If there's a typo on the
!!binary
tag (e.g., a local tag!binary
) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for!binary
. We simply ignore it (treating it as if it weren't there).- Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.
I suspect the answer to (2) is probably yaml_doxygen.h
, so I'll also peruse that at the same time I'm dealing with the python tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 13 files at r1.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
I suspect the answer to (2) is probably
yaml_doxygen.h
, so I'll also peruse that at the same time I'm dealing with the python tests.
(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:
(a) When reading yaml and we encounter a !!str
but our Python field schema says bytes
, is that an error or should we automatically call value.encode('utf-8')
to convert the unicode string into bytes
as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ?
then (how) do we resolve the tag?
(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t>
can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq
of !!int
. The options are either to key on std::vector<std::byte>
as the static type, or to add a little flag to MakeNameValue
to say that that the T = std::string
is not unicode.
(3) If this is not unique to !binary
(and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.
(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.
We'll need to agree on the full contract of (1a) first before we can write out that test matrix.
common/test/memory_file_test.cc
line 126 at r1 (raw file):
/** Confirm that this can be serialized appropriately. */ GTEST_TEST(MemoryFileTest, Serialization) { // Emtpy file.
typo
common/test/memory_file_test.cc
line 181 at r1 (raw file):
// Invalid characters in base64 are not ignored; they become zeros in the // decoding. We'll put *four* invalid characters in; that leads to exactly // five null bytes before we get back into the expected string.
Skimming the yaml-cpp code, its YAML::DecodeBase64
function does seem to detect errors properly, so we should probably just use that instead of CRU?
common/test/memory_file_test.cc
line 181 at r1 (raw file):
// Invalid characters in base64 are not ignored; they become zeros in the // decoding. We'll put *four* invalid characters in; that leads to exactly // five null bytes before we get back into the expected string.
In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)
common/yaml/yaml_node.h
line 174 at r1 (raw file):
static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"}; // https://yaml.org/spec/1.2.2/#24-tags (see Example 2.23).
The example is good, but isn't really a specification. I think we should link to either only the spec (https://yaml.org/type/binary.html) or else both the spec and the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:
(a) When reading yaml and we encounter a
!!str
but our Python field schema saysbytes
, is that an error or should we automatically callvalue.encode('utf-8')
to convert the unicode string intobytes
as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag?
then (how) do we resolve the tag?(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think
std::vector<uint8_t>
can cut it -- we need to allow for actual lists of (small) integers to output as a yaml!!seq
of!!int
. The options are either to key onstd::vector<std::byte>
as the static type, or to add a little flag toMakeNameValue
to say that that theT = std::string
is not unicode.
(3) If this is not unique to
!binary
(and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.
(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.
We'll need to agree on the full contract of (1a) first before we can write out that test matrix.
For now I'm going to do the following:
- Strip out
std::filesystem::path
support into its own PR. - Make
MemoryFile::Serialize()
simply throw for now. We can resolve the question of how to serialize its contents later.
common/test/memory_file_test.cc
line 181 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Skimming the yaml-cpp code, its
YAML::DecodeBase64
function does seem to detect errors properly, so we should probably just use that instead of CRU?
Right. In that case, it simply returns a zero-length vector.
common/test/memory_file_test.cc
line 181 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)
A relic from when the encoding/decoding happened in MemoryFile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
For now I'm going to do the following:
- Strip out
std::filesystem::path
support into its own PR.- Make
MemoryFile::Serialize()
simply throw for now. We can resolve the question of how to serialize its contents later.
Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.
P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.
I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
I hoped I'd simplified it to work nicely. The idea was that using In C++ it's simple because the only type that gets passed to the static types is The writing is really just frosting. Representing a string in yaml defaults to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
BTW this is probably ready for rework now, to get |
5979ec1
to
3619733
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With everything else making its own way to Drake, this PR reduces to a very simple change.
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
I hoped I'd simplified it to work nicely. The idea was that using
!!binary
was just a representation thing. Every value in yaml is fundamentally a string. The goal was to give an alternate string encoding in the yaml. And between parsing the yaml and assigning the type, we'd convert!!binary
to the equivalent "string" and then the types would deal with the result.In C++ it's simple because the only type that gets passed to the static types is
std::string
and the only time we convert from string is when assigning it to a static type. It's murkier in python because the yaml "value" may be astr
and may bebytes
. And some values get converted prior to Drake's typed yaml code. I still hoped the logic for mappingstr
/bytes
to type would happen in the Drake code. But reading is always the challenge.The writing is really just frosting. Representing a string in yaml defaults to
!!binary
if it has "ugly" bytes. Obviously, I'm still missing something fundamental. So, we'll chat. I'm available first thing in the morning but will be afk from about 7:15-8:30 while I get my daughter to school.
I think this is largely closed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need pydrake-flavored yaml reading and writing tests, too (the bindings are missing). Bindings when we do the Serialize-with-copying-tricks are sometimes difficult, so feel free to tag me if you'd like help with the glue code.
Reviewed 10 of 13 files at r2.
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had wondered about that. I glanced at a few other types i knew we could serialize in python (e.g., CameraConfig
) and could see no explicit artifacts to enable the binding nor tests on successful (de)serialization. I also looked at Rgba
(which does do funny redirects on serialization) and didn't see anything explicit there either.
So, given all of that, I'd definitely welcome some pointers towards what I'm missing.
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers
The starting point would be copying the new C++ unit test over into the bindings (and updating function names, unittest asserts, etc), so we have something to aim for. |
The MemoryFile's contents are serialized as a std::vector<std::byte> to account for the fact that std::string can include arbitrary byte values. (In yaml, serializing std::vector<std::byte>> produces a base64-encoded octet stream with the tag "!!binary".)
3619733
to
d2db893
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come we don't have that for other serializable types? Am I missing something? Or is this something new?
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers
In many cases a serializeable object is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not getting it. Creating bindings by using the archive pattern is surely only proof that the archive pattern was correctly implemented. That's not proof that calling it plays nicely with python, is it? It's been a while since I've gone digging into the archive-enabled binding infrastructure, so I could easily have forgotten a critical detail.
And, again, I'd point out Rgba
as a proper analogy for MemoryFile
. Not a struct
, does some fun redirection for one of its values, no serialization testing in python.
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers
It's sufficient evidence that the struct itself is precisely correct in terms of its bindings. You're right that it does not acceptance-test whether all of the field types of struct are themselves also correctly bound transitively, but in terms of "bindings code to add onto the struct" it is all we need. Any bugs in python land would fall outside of any code in that struct or its bindings.
It has serialization testing in Really all I'm saying here is that on paper I don't imagine that MemoryFile would work correctly when (de)serialized in Python, so we should have at least one test case to sanity check it (and prevent regressions). The |
Ahh....I'd missed |
The bindings we'll need will look a little bit like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers
common/memory_file.h
line 106 at r3 (raw file):
a->Visit(MakeNameValue("contents", &bytes)); contents_ = std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());
Here we also need to reset the checksum.
Path member can now be serialized like other basic member types.
Serializing something of type vector<uint8_t> is interpreted as a byte string. If all bytes are printable, it is equivalent to a string. If there are any non-printable characters, it is converted to base64 and prefixed with YAML's !!binary tag.
MemoryFile makes use of this to serialize its file contents (it creates a temporary vector<uint8_t> for serialization to trigger the potential binary yaml encoding.
This change is