[Java] MapVector cannot be loaded via IPC #42218

Open

vibhatha opened this issue Jun 20, 2024 · 8 comments
@vibhatha
Collaborator

Describe the bug, including details regarding any error messages, version, and platform.

Referring to the issue filed on Stack Overflow: https://stackoverflow.com/questions/77878272/apache-arrow-not-all-nodes-and-buffers-were-consumed-error-when-writing-a-map

The following code yields an error:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.complex.MapVector;
import org.apache.arrow.vector.complex.impl.UnionMapWriter;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowBlock;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

File file = new File("test.arrow");

// Map<Int64, Int64> schema: the entry struct has a non-nullable "id" key
// and a nullable "value" value.
Field keyField = new Field("id", FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field("value", FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry", FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));
Field mapIntToIntField = new Field("mapFieldIntToInt", FieldType.notNullable(new ArrowType.Map(false)), List.of(structField));

Schema schema = new Schema(Arrays.asList(mapIntToIntField));

System.out.println("Writing...");

try (BufferAllocator allocator = new RootAllocator()) {
  try (VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schema, allocator);
      MapVector mapVector = (MapVector) vectorSchemaRoot.getVector("mapFieldIntToInt")) {
    // Write a single row holding a map of three entries.
    UnionMapWriter mapWriter = mapVector.getWriter();
    mapWriter.setPosition(0);
    mapWriter.startMap();
    for (int i = 0; i < 3; i++) {
      mapWriter.startEntry();
      mapWriter.key().bigInt().writeBigInt(i);
      mapWriter.value().bigInt().writeBigInt(i * 7);
      mapWriter.endEntry();
    }
    mapWriter.endMap();
    mapWriter.setValueCount(1);
    vectorSchemaRoot.setRowCount(1);

    System.out.println(vectorSchemaRoot.getFieldVectors().size());
    System.out.println("vectorSchemaRoot.getVector(0): " + vectorSchemaRoot.getVector(0));

    try (FileOutputStream fileOutputStream = new FileOutputStream(file);
        ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())) {
      writer.start();
      writer.writeBatch();
      writer.end();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

System.out.println("Reading...");

try (BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)) {
  System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
  for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
    boolean loaded = reader.loadRecordBatch(arrowBlock);
    System.out.println(loaded);
    VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
    System.out.print(vectorSchemaRootRecover.contentToTSVString());
  }
} catch (IOException e) {
  e.printStackTrace();
}

Error

Exception in thread "main" java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed. nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[24], address:123230812873128, capacity:1, ArrowBuf[25], address:123230812873136, capacity:24] variadicBufferCounts: []
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
	at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:214)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
	at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:192)

Component(s)

Java

@vibhatha
Collaborator Author

It seems that the validity buffer of the key is not properly written. It is all null.
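For reference, a minimal way to observe this after the write (a sketch against the reproduction above; the child-dump loop is added for illustration and assumes imports for StructVector and FieldVector):

// Sketch: after mapWriter.setValueCount(1), dump each child of the map's
// entry struct with its null count; if the key's validity buffer was not
// written, every key slot reports as null.
StructVector entries = (StructVector) mapVector.getDataVector();
for (FieldVector child : entries.getChildrenFromFields()) {
  System.out.println(child.getName() + ": nullCount=" + child.getNullCount());
}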

@llama90
Contributor

llama90 commented Jun 20, 2024

@vibhatha Is this unrelated to this comment?

UPDATE:

I don't think it's related to that. I will create an issue and resolve it.

@vibhatha
Collaborator Author

@llama90 this is a much older issue that I am trying to solve.

@vibhatha vibhatha self-assigned this Jun 21, 2024
@vibhatha
Collaborator Author

@lidavidm a question:

Field keyField = new Field("id", FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field("value", FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry", FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));
Field mapIntToIntField = new Field("mapFieldIntToInt", FieldType.notNullable(new ArrowType.Map(false)), List.of(structField));

After debugging, this is what I think is happening. We have given the key field the name id and the value field the name value. When we try to write to the vectors, the StructVector (within the MapVector) already has two children, i.e.
mapVector.getChildrenFromFields().get(0).getChildrenFromFields().get(0).getField() -> key: Int(64, true) not null and mapVector.getChildrenFromFields().get(0).getChildrenFromFields().get(1).getField() -> value: Int(64, true).

But when we go to write the data, we hit this path:

@Override
public BigIntWriter bigInt() {
  switch (mode) {
    case KEY:
      return entryWriter.bigInt(MapVector.KEY_NAME);
    case VALUE:
      return entryWriter.bigInt(MapVector.VALUE_NAME);
    default:
      return this;
  }
}

This is the regular check we have, and KEY_NAME and VALUE_NAME are hardcoded as key and value respectively. They are not updated by looking into the given struct. Thus, at writing time, an additional vector is introduced alongside the one named id, which leaves the key's node and buffers unconsumed. At least, that is what is happening at a high level. If I rename id to key, the code works.

On the reading side, the schema ends up incorrect. In the worst case, let's say we get the schema from the vector itself; we still have two idle vectors whenever users use different names. Shouldn't we update KEY_NAME and VALUE_NAME properly? Or am I misreading this?
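For reference, the workaround mentioned above looks like this (a sketch only, not the fix: it simply adopts the conventional names that the writer hardcodes):

// Workaround sketch: name the entry struct's children with the conventional
// MapVector.KEY_NAME ("key") and MapVector.VALUE_NAME ("value") so that
// UnionMapWriter resolves the existing vectors instead of creating new ones.
Field keyField = new Field(MapVector.KEY_NAME,
    FieldType.notNullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field valueField = new Field(MapVector.VALUE_NAME,
    FieldType.nullable(new ArrowType.Int(64, true)), Collections.emptyList());
Field structField = new Field("entry",
    FieldType.notNullable(ArrowType.Struct.INSTANCE), List.of(keyField, valueField));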

@lidavidm
Member

We should get it from the vector, yes. They are recommended to be "key" and "value", but it is not meant to be required.

arrow/format/Schema.fbs

Lines 126 to 129 in d28078d

/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.

@vibhatha
Collaborator Author

So should we fix it, or enforce the key/value usage?

@lidavidm
Member

Fix it; the spec explicitly says not to enforce key/value.

@felipecrv
Contributor

felipecrv commented Jul 9, 2024

We should get it from the vector, yes. They are recommended to be "key" and "value" but it is not meant to be required

arrow/format/Schema.fbs

Lines 126 to 129 in d28078d

/// In a field with Map type, the field has a child Struct field, which then
/// has two children: key type and the second the value type. The names of the
/// child fields may be respectively "entries", "key", and "value", but this is
/// not enforced.

"this is not enforced" doesn't mean it's not assumed in a lot of places.

It's one of those cases where, in theory, you can use any name, but in practice there is a de facto standard.

Postel's Law applies, so we should advise users to avoid using different names while also making the official Arrow implementations robust (i.e. accepting of custom names).
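A minimal sketch of what "accepting of custom names" could mean on the Java side (illustrative only, under the assumption that the key and value are always the first and second children of the entry struct; this is not the actual patch):

// Sketch: resolve the entry struct's key and value children positionally
// instead of through the hardcoded "key"/"value" names, so a schema using
// e.g. "id" for the key still round-trips.
StructVector entries = (StructVector) mapVector.getDataVector();
FieldVector keyVector = entries.getChildrenFromFields().get(0);
FieldVector valueVector = entries.getChildrenFromFields().get(1);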
