[Java] MapVector cannot be loaded via IPC #42218

Open

vibhatha opened this issue Jun 20, 2024 · 8 comments
@vibhatha
Collaborator

Describe the bug, including details regarding any error messages, version, and platform.

Referring to the issue filed on Stack Overflow: https://stackoverflow.com/questions/77878272/apache-arrow-not-all-nodes-and-buffers-were-consumed-error-when-writing-a-map

The following code yields an error:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.complex.MapVector;
import org.apache.arrow.vector.complex.impl.UnionMapWriter;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowBlock;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

File file = new File("test.arrow");

// Map<Int64, Int64> schema: the entry struct has a non-nullable "id" key
// and a nullable "value" value.
Field keyField = new Field("id", FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field("value", FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry", FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));
Field mapIntToIntField = new Field("mapFieldIntToInt", FieldType.notNullable(new ArrowType.Map(false)), List.of(structField));

Schema schema = new Schema(Arrays.asList(mapIntToIntField));

System.out.println("Writing...");

try (BufferAllocator allocator = new RootAllocator()) {
  try (VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schema, allocator);
      MapVector mapVector = (MapVector) vectorSchemaRoot.getVector("mapFieldIntToInt")) {
    // Write a single row holding a map of three entries.
    UnionMapWriter mapWriter = mapVector.getWriter();
    mapWriter.setPosition(0);
    mapWriter.startMap();
    for (int i = 0; i < 3; i++) {
      mapWriter.startEntry();
      mapWriter.key().bigInt().writeBigInt(i);
      mapWriter.value().bigInt().writeBigInt(i * 7);
      mapWriter.endEntry();
    }
    mapWriter.endMap();
    mapWriter.setValueCount(1);
    vectorSchemaRoot.setRowCount(1);

    System.out.println(vectorSchemaRoot.getFieldVectors().size());
    System.out.println("vectorSchemaRoot.getVector(0): " + vectorSchemaRoot.getVector(0));

    try (FileOutputStream fileOutputStream = new FileOutputStream(file);
        ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())) {
      writer.start();
      writer.writeBatch();
      writer.end();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

System.out.println("Reading...");

try (BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)) {
  System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
  for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
    boolean loaded = reader.loadRecordBatch(arrowBlock);
    System.out.println(loaded);
    VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
    System.out.print(vectorSchemaRootRecover.contentToTSVString());
  }
} catch (IOException e) {
  e.printStackTrace();
}

Error

Exception in thread "main" java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed. nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[24], address:123230812873128, capacity:1, ArrowBuf[25], address:123230812873136, capacity:24] variadicBufferCounts: []
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
	at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:214)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:192)

Component(s)

Java

@vibhatha
Collaborator Author

It seems that the validity buffer of the key is not properly written. It is all null.
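For reference, a minimal way to observe this after the write (a sketch against the reproduction above; the child-dump loop is added for illustration and assumes imports for StructVector and FieldVector):

// Sketch: after mapWriter.setValueCount(1), dump each child of the map's
// entry struct with its null count; if the key's validity buffer was not
// written, every key slot reports as null.
StructVector entries = (StructVector) mapVector.getDataVector();
for (FieldVector child : entries.getChildrenFromFields()) {
  System.out.println(child.getName() + ": nullCount=" + child.getNullCount());
}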

@llama90
Contributor

llama90 commented Jun 20, 2024

@vibhatha Is this unrelated to this comment?

UPDATE:

I don't think it's related to that. I will create an issue and resolve it.

@vibhatha
Collaborator Author

@llama90 this is a much older issue that I am trying to solve.

@vibhatha vibhatha self-assigned this Jun 21, 2024
@vibhatha
Collaborator Author

@lidavidm a question:

Field keyField = new Field("id", FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field("value", FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry", FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));
Field mapIntToIntField = new Field("mapFieldIntToInt", FieldType.notNullable(new ArrowType.Map(false)), List.of(structField));

After debugging, this is what I think is happening. We have given the key field the name id and the value field the name value. When we try to write to the vectors, the StructVector (within the MapVector) already has two children, i.e.
mapVector.getChildrenFromFields().get(0).getChildrenFromFields().get(0).getField() -> key: Int(64, true) not null and mapVector.getChildrenFromFields().get(0).getChildrenFromFields().get(1).getField() -> value: Int(64, true).

But when we go to write the data, we hit this path:

@Override
public BigIntWriter bigInt() {
  switch (mode) {
    case KEY:
      return entryWriter.bigInt(MapVector.KEY_NAME);
    case VALUE:
      return entryWriter.bigInt(MapVector.VALUE_NAME);
    default:
      return this;
  }
}

This is the regular check we have, and KEY_NAME and VALUE_NAME are hardcoded as key and value respectively. They are not updated by looking into the given struct. Thus, at writing time, an additional vector is introduced alongside the one named id, which leaves the key's node and buffers unconsumed. At least, that is what is happening at a high level. If I rename id to key, the code works.

On the reading side, the schema ends up incorrect. In the worst case, let's say we get the schema from the vector itself; we still have two idle vectors whenever users use different names. Shouldn't we update KEY_NAME and VALUE_NAME properly? Or am I misreading this?
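For reference, the workaround mentioned above looks like this (a sketch only, not the fix: it simply adopts the conventional names that the writer hardcodes):

// Workaround sketch: name the entry struct's children with the conventional
// MapVector.KEY_NAME ("key") and MapVector.VALUE_NAME ("value") so that
// UnionMapWriter resolves the existing vectors instead of creating new ones.
Field keyField = new Field(MapVector.KEY_NAME,
    FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field(MapVector.VALUE_NAME,
    FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry",
    FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));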

@lidavidm
Member

We should get it from the vector, yes. They are recommended to be "key" and "value", but it is not meant to be required.

arrow/format/Schema.fbs

Lines 126 to 129 in d28078d

/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.

@vibhatha
Collaborator Author

So should we fix it, or enforce the key/value usage?

@lidavidm
Member

Fix it; the spec explicitly says not to enforce key/value.

@felipecrv
Contributor

felipecrv commented Jul 9, 2024

We should get it from the vector, yes. They are recommended to be "key" and "value" but it is not meant to be required

arrow/format/Schema.fbs

Lines 126 to 129 in d28078d

/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.

"this is not enforced" doesn't mean it's not assumed in a lot of places.

It's one of those cases where, in theory, you can use any name, but in practice there is a de facto standard.

Postel's Law applies, so we should advise users to avoid using different names while also making the official Arrow implementations robust (i.e. accepting of custom names).
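A minimal sketch of what "accepting of custom names" could mean on the Java side (illustrative only, under the assumption that the key and value are always the first and second children of the entry struct; this is not the actual patch):

// Sketch: resolve the entry struct's key and value children positionally
// instead of through the hardcoded "key"/"value" names, so a schema using
// e.g. "id" for the key still round-trips.
StructVector entries = (StructVector) mapVector.getDataVector();
FieldVector keyVector = entries.getChildrenFromFields().get(0);
FieldVector valueVector = entries.getChildrenFromFields().get(1);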
