Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow serialization for MemoryFile #22277

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SeanCurtis-TRI
Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI commented Dec 5, 2024

Path member can now be serialized like other basic member types.

Serializing something of type vector<uint8_t> is interpreted as a byte string. If all bytes are printable, it is equivalent to a string. If there are any non-printable characters, it is converted to base64 and prefixed with YAML's !!binary tag.

MemoryFile makes use of this to serialize its file contents (it creates a temporary vector<uint8_t> for serialization to trigger the potential binary yaml encoding.


This change is Reviewable

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
@jwnimmer-tri I'd like you to take a look at this.

First, while cpp_yaml has Binary it is not a broadly supported type. It can only be used directly with an Emitter and not the EmitFromEvents we use. The reading has similar issues. Seeing the global !!binary tag does not lead to the creation of a Binary (or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.

Several open questions I'd like feedback on.

  1. Are we good with the logic regarding std::vector<uint8_t> as maybe producing !!binary ....?
  2. Where should that get documented?
  3. If there's a typo on the !!binary tag (e.g., a local tag !binary) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for !binary. We simply ignore it (treating it as if it weren't there).
  4. Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

@jwnimmer-tri I'd like you to take a look at this.

First, while cpp_yaml has Binary it is not a broadly supported type. It can only be used directly with an Emitter and not the EmitFromEvents we use. The reading has similar issues. Seeing the global !!binary tag does not lead to the creation of a Binary (or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.

Several open questions I'd like feedback on.

  1. Are we good with the logic regarding std::vector<uint8_t> as maybe producing !!binary ....?
  2. Where should that get documented?
  3. If there's a typo on the !!binary tag (e.g., a local tag !binary) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for !binary. We simply ignore it (treating it as if it weren't there).
  4. Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.

I suspect the answer to (2) is probably yaml_doxygen.h, so I'll also peruse that at the same time I'm dealing with the python tests.

@jwnimmer-tri jwnimmer-tri self-assigned this Dec 5, 2024
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 13 files at r1.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I suspect the answer to (2) is probably yaml_doxygen.h, so I'll also peruse that at the same time I'm dealing with the python tests.

(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:

(a) When reading yaml and we encounter a !!str but our Python field schema says bytes, is that an error or should we automatically call value.encode('utf-8') to convert the unicode string into bytes as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ? then (how) do we resolve the tag?

(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t> can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq of !!int. The options are either to key on std::vector<std::byte> as the static type, or to add a little flag to MakeNameValue to say that that the T = std::string is not unicode.


(3) If this is not unique to !binary (and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.


(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.

We'll need to agree on the full contract of (1a) first before we can write out that test matrix.


common/test/memory_file_test.cc line 126 at r1 (raw file):

/** Confirm that this can be serialized appropriately. */
GTEST_TEST(MemoryFileTest, Serialization) {
  // Emtpy file.

typo


common/test/memory_file_test.cc line 181 at r1 (raw file):

  // Invalid characters in base64 are not ignored; they become zeros in the
  // decoding. We'll put *four* invalid characters in; that leads to exactly
  // five null bytes before we get back into the expected string.

Skimming the yaml-cpp code, its YAML::DecodeBase64 function does seem to detect errors properly, so we should probably just use that instead of CRU?


common/test/memory_file_test.cc line 181 at r1 (raw file):

  // Invalid characters in base64 are not ignored; they become zeros in the
  // decoding. We'll put *four* invalid characters in; that leads to exactly
  // five null bytes before we get back into the expected string.

In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)


common/yaml/yaml_node.h line 174 at r1 (raw file):

  static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"};

  // https://yaml.org/spec/1.2.2/#24-tags (see Example 2.23).

The example is good, but isn't really a specification. I think we should link to either only the spec (https://yaml.org/type/binary.html) or else both the spec and the example.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:

(a) When reading yaml and we encounter a !!str but our Python field schema says bytes, is that an error or should we automatically call value.encode('utf-8') to convert the unicode string into bytes as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ? then (how) do we resolve the tag?

(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t> can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq of !!int. The options are either to key on std::vector<std::byte> as the static type, or to add a little flag to MakeNameValue to say that that the T = std::string is not unicode.


(3) If this is not unique to !binary (and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.


(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.

We'll need to agree on the full contract of (1a) first before we can write out that test matrix.

For now I'm going to do the following:

  1. Strip out std::filesystem::path support into its own PR.
  2. Make MemoryFile::Serialize() simply throw for now. We can resolve the question of how to serialize its contents later.

common/test/memory_file_test.cc line 181 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Skimming the yaml-cpp code, its YAML::DecodeBase64 function does seem to detect errors properly, so we should probably just use that instead of CRU?

Right. In that case, it simply returns a zero-length vector.


common/test/memory_file_test.cc line 181 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)

A relic from when the encoding/decoding happened in MemoryFile.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

For now I'm going to do the following:

  1. Strip out std::filesystem::path support into its own PR.
  2. Make MemoryFile::Serialize() simply throw for now. We can resolve the question of how to serialize its contents later.

Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.

P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.

I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.

@SeanCurtis-TRI
Copy link
Contributor Author

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.

I hoped I'd simplified it to work nicely. The idea was that using !!binary was just a representation thing. Every value in yaml is fundamentally a string. The goal was to give an alternate string encoding in the yaml. And between parsing the yaml and assigning the type, we'd convert !!binary to the equivalent "string" and then the types would deal with the result.

In C++ it's simple because the only type that gets passed to the static types is std::string and the only time we convert from string is when assigning it to a static type. It's murkier in python because the yaml "value" may be a str and may be bytes. And some values get converted prior to Drake's typed yaml code. I still hoped the logic for mapping str/bytes to type would happen in the Drake code. But reading is always the challenge.

The writing is really just frosting. Representing a string in yaml defaults to !!binary if it has "ugly" bytes. Obviously, I'm still missing something fundamental. So, we'll chat. I'm available first thing in the morning but will be afk from about 7:15-8:30 while I get my daughter to school.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)

@jwnimmer-tri
Copy link
Collaborator

BTW this is probably ready for rework now, to get MemoryFile up and running.

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_path_and_bytestring branch 2 times, most recently from 5979ec1 to 3619733 Compare January 8, 2025 23:43
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With everything else making its own way to Drake, this PR reduces to a very simple change.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I hoped I'd simplified it to work nicely. The idea was that using !!binary was just a representation thing. Every value in yaml is fundamentally a string. The goal was to give an alternate string encoding in the yaml. And between parsing the yaml and assigning the type, we'd convert !!binary to the equivalent "string" and then the types would deal with the result.

In C++ it's simple because the only type that gets passed to the static types is std::string and the only time we convert from string is when assigning it to a static type. It's murkier in python because the yaml "value" may be a str and may be bytes. And some values get converted prior to Drake's typed yaml code. I still hoped the logic for mapping str/bytes to type would happen in the Drake code. But reading is always the challenge.

The writing is really just frosting. Representing a string in yaml defaults to !!binary if it has "ugly" bytes. Obviously, I'm still missing something fundamental. So, we'll chat. I'm available first thing in the morning but will be afk from about 7:15-8:30 while I get my daughter to school.

I think this is largely closed out.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need pydrake-flavored yaml reading and writing tests, too (the bindings are missing). Bindings when we do the Serialize-with-copying-tricks are sometimes difficult, so feel free to tag me if you'd like help with the glue code.

Reviewed 10 of 13 files at r2.
Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)

@jwnimmer-tri jwnimmer-tri added the release notes: feature This pull request contains a new feature label Jan 8, 2025
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wondered about that. I glanced at a few other types i knew we could serialize in python (e.g., CameraConfig) and could see no explicit artifacts to enable the binding nor tests on successful (de)serialization. I also looked at Rgba (which does do funny redirects on serialization) and didn't see anything explicit there either.

So, given all of that, I'd definitely welcome some pointers towards what I'm missing.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

@jwnimmer-tri
Copy link
Collaborator

The starting point would be copying the new C++ unit test over into the bindings (and updating function names, unittest asserts, etc), so we have something to aim for.

The MemoryFile's contents are serialized as a std::vector<std::byte> to
account for the fact that std::string can include arbitrary byte values.

(In yaml, serializing std::vector<std::byte>> produces a base64-encoded
octet stream with the tag "!!binary".)
@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_path_and_bytestring branch from 3619733 to d2db893 Compare January 9, 2025 15:45
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't have that for other serializable types? Am I missing something? Or is this something new?

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

@jwnimmer-tri
Copy link
Collaborator

In many cases a serializeable object is a struct with public fields, and is bound using DefAttributesUsingSerialize which case unit testing simple attribute access is also good enough for covering how YAML works. Since this is a class with a magic Serialize it deserves more careful testing.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not getting it. Creating bindings by using the archive pattern is surely only proof that the archive pattern was correctly implemented. That's not proof that calling it plays nicely with python, is it? It's been a while since I've gone digging into the archive-enabled binding infrastructure, so I could easily have forgotten a critical detail.

And, again, I'd point out Rgba as a proper analogy for MemoryFile. Not a struct, does some fun redirection for one of its values, no serialization testing in python.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

@jwnimmer-tri
Copy link
Collaborator

That's not proof that calling it plays nicely with python, is it?

It's sufficient evidence that the struct itself is precisely correct in terms of its bindings. You're right that it does not acceptance-test whether all of the field types of struct are themselves also correctly bound transitively, but in terms of "bindings code to add onto the struct" it is all we need. Any bugs in python land would fall outside of any code in that struct or its bindings.

And, again, I'd point out Rgba as a proper analogy for MemoryFile. Not a struct, does some fun redirection for one of its values, no serialization testing in python.

It has serialization testing in test_rgba_yaml in common_test.py.


Really all I'm saying here is that on paper I don't imagine that MemoryFile would work correctly when (de)serialized in Python, so we should have at least one test case to sanity check it (and prevent regressions).

The Serialize function is not bound automatically for us, we need to add code to enable it. (This is true even for boring Serialize functions on plain structs.) Basically the common/module_py.cc stanza for using Class = MemoryFile; is missing new bindings for serialize, and we want to have a test which shows it's missing.

@SeanCurtis-TRI
Copy link
Contributor Author

Ahh....I'd missed test_rgba_yaml. :) Thanks. I'll investigate. Stay tuned.

@jwnimmer-tri
Copy link
Collaborator

jwnimmer-tri commented Jan 9, 2025

The bindings we'll need will look a little bit like BindPiecewisePolynomialSerialize except in that case the problem is even more challenging because of access control and nested types. For our case we'll need at least the def_property_readonly_static("__fields__" ...) and something for writing back into the fields. I can help once we have some tests to guide it.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers


common/memory_file.h line 106 at r3 (raw file):

    a->Visit(MakeNameValue("contents", &bytes));
    contents_ =
        std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());

Here we also need to reset the checksum.

@jwnimmer-tri jwnimmer-tri changed the title Yaml serialization includes std::filesystem::path and vector<uint8_t> Allow serialization for MemoryFile Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes: feature This pull request contains a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants