GPKG: setting fid when writing file via arrow gives an error #11527

theroggy · 2024-12-19T23:36:44Z

What is the bug?

Setting an FID explicitly and writing the row to e.g. GPKG gives an error: Inconsistent values of FID and field of same name.

Steps to reproduce the issue

Script to reproduce the error:

from osgeo import ogr

ogr.UseExceptions()
ds = ogr.GetDriverByName("GPKG").CreateDataSource("C:/temp/test.gpkg")
src_lyr = ds.CreateLayer("src_lyr")

field_def = ogr.FieldDefn("field_integer", ogr.OFTInteger)
src_lyr.CreateField(field_def)

field_def = ogr.FieldDefn("field_string", ogr.OFTString)
src_lyr.CreateField(field_def)

feat_def = src_lyr.GetLayerDefn()
src_feature = ogr.Feature(feat_def)
src_feature.SetField("field_integer", 17)
src_feature.SetField("field_string", "abc def")
src_feature.SetGeometry(ogr.CreateGeometryFromWkt("POINT (1 2)"))
src_feature.SetFID(123)

src_lyr.CreateFeature(src_feature)

src_feature2 = ogr.Feature(feat_def)
for i in range(feat_def.GetFieldCount()):
    src_feature2.SetFieldNull(i)
src_lyr.CreateFeature(src_feature2)

dst_lyr = ds.CreateLayer("dst_lyr")

stream = src_lyr.GetArrowStream(["INCLUDE_FID=YES"])
schema = stream.GetSchema()

success, error_msg = dst_lyr.IsArrowSchemaSupported(schema)
assert success, error_msg

for i in range(schema.GetChildrenCount()):
    if schema.GetChild(i).GetName() != "geom":
        dst_lyr.CreateFieldFromArrowSchema(schema.GetChild(i))

while True:
    array = stream.GetNextRecordBatch()
    if array is None:
        break
    assert dst_lyr.WriteArrowBatch(schema, array) == ogr.OGRERR_NONE

dst_lyr.ResetReading()

dst_feature = dst_lyr.GetNextFeature()
assert str(src_feature) == str(dst_feature).replace("dst_lyr", "src_lyr")

dst_feature = dst_lyr.GetNextFeature()
assert str(src_feature2) == str(dst_feature).replace("dst_lyr", "src_lyr")

Versions and provenance

Tested on Windows 11, using gdal 3.10.0 installed via conda-forge. Can also be reproduced in e.g. GDAL 3.8.5 and 3.9.2 (geopandas/pyogrio#511).

Additional context

No response

The text was updated successfully, but these errors were encountered:

…ot set Related to OSGeo#11527

rouault · 2024-12-20T01:40:37Z

There was a stupidly embarassing error logic in the GPKG driver that is fixed per #11528 , but your issue is different.

It can be solved independently from that bugfix by just changing your code snippet to exclude the creation of an attribute field corresponding to the FID column of the source. So

for i in range(schema.GetChildrenCount()):
    if schema.GetChild(i).GetName() not in ("geom", src_lyr.GetFIDColumn()):
        dst_lyr.CreateFieldFromArrowSchema(schema.GetChild(i))

Dealing with FIDs in OGR has always be a bit tricky because the OGR abstraction give them a special status while database-like format make them a regular field. That isn't improved by Arrow not having a FID concept itself and thus needing to sometimes synthetize a column. I suspect the WriteArrowBatch() logic could perhaps be enhanced to support both a destination layer that has a named FID column and a regular field of the same name, but not totally sure we want to make that work because even if the GPKG driver would work that could cause issues in other drivers. So better not call CreateFieldFromArrowSchema() with a name that matches the name of the FID column.

jorisvandenbossche · 2024-12-20T07:52:48Z

to exclude the creation of an attribute field corresponding to the FID column of the source

How does the destination layer know that that column is the FID column? (because the arrow batch you are writing still has that field in the schema)
(I don't directly find a SetFIDColumn equivalent to OGRLayer::GetFIDColumn)

jorisvandenbossche · 2024-12-20T07:56:00Z

Ah, but OGRLayer::WriteArrowBatch has a FID=name option, and that's also how it is used in the example of copying source to out layer.

(for pyogrio's use of the arrow writer, I suppose we should enable the user to specify the name for the FID column then when writing)

rouault · 2024-12-20T08:04:20Z

(I don't directly find a SetFIDColumn equivalent to OGRLayer::GetFIDColumn)

There's none. Some drivers (like GPKG) offer the capacity to customize its name by offering a FID layer creation option at CreateLayer() time.

theroggy · 2024-12-20T11:19:31Z

@rouault thanks for the advice... The code in pyogrio did the same thing as the sample code above: add the "fid" column as an ordinary column as well... triggering the issue fixed in #11528, so that is being fixed now in pyogrio (geopandas/pyogrio#511).

Following up on that I noticed that the "fid" column detection is case insensitive, which is fine. However the issue fixed in #11528 doesn't seem to be triggered when the column to use the fids from is called "FID" instead of "fid"... which seems a bit odd?

theroggy · 2024-12-20T11:43:49Z

@rouault hmm... I was too fast... the behaviour seems to be still slightly different... I just made the check that an "fid" column should not be added as an ordinary column case insensitive, but this leads to another error: Error while writing batch to OGR layer: Cannot find OGR field for Arrow array FID.

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

I tried to make an overview of the different situations and the behaviour as I interprete it:

if column is called "fid" and ordinary column is added as well: bug fixed in GPKG: fix FID vs field of same name consistency check when field is not set #11528 is triggered, because the column is discovered as fid + is ordinary column and the comparison of both leads to an error
if column is called "fid" and not added as ordinary, everything is fine
if column is called "FID" and added as ordinary: column is not initially seen as fid column, but is discovered as fid-like column. Because it is not seen as fid column, no comparison is done, so no error, and it is "recuperated" as fid-like column to be saved as fid.
if column is called "FID" and not added as ordinary, it is not seen as fid column and cannot be "recuperated". This seems to trigger an error because no FID-like column is found?

rouault · 2024-12-20T16:30:56Z

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

to be honest, I'm lost :-) There might be things fixable, and others not. Not sure I'm in the mood of investigating that... I do remember I struggled mapping Arrow to OGR related to FID and yes things are going to be a bit messy. I'd wishing someone could take the lead in GDAL for "Arrow related activities"...
In general, but not always, field name comparisons in GDAL are case insensitive. In some places to satisfy some use cases where their was "foo" and "FOO", we start by a case sensitive comparison, and if no match fallback to the historical behaviour of case insensitive comparision
For WriteArrowBatch() specifically, at line 6523 "if (pszFIDName && sInfo.osName == pszFIDName)" of ogr/ogrsf_frmts/generic/ogrlayerarrow.cpp, the comparison is case sensitive.

theroggy · 2024-12-20T16:48:56Z

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

to be honest, I'm lost :-) There might be things fixable, and others not. Not sure I'm in the mood of investigating that... I do remember I struggled mapping Arrow to OGR related to FID and yes things are going to be a bit messy. I'd wishing someone could take the lead in GDAL for "Arrow related activities"... In general, but not always, field name comparisons in GDAL are case insensitive. In some places to satisfy some use cases where their was "foo" and "FOO", we start by a case sensitive comparison, and if no match fallback to the historical behaviour of case insensitive comparision For WriteArrowBatch() specifically, at line 6523 "if (pszFIDName && sInfo.osName == pszFIDName)" of ogr/ogrsf_frmts/generic/ogrlayerarrow.cpp, the comparison is case sensitive.

:-). No problem. It confirms what I thought... I had been reading some code, but didn't (immediately) find the case-sensitive comparison.

I found where the FIDAsRegularColumnIndex is detected, and there the EQUAL macro is used, so case insensitive... (

gdal/ogr/ogrsf_frmts/gpkg/ogrgeopackagetablelayer.cpp

Line 1851 in e7b6c09

EQUAL(oFieldDefn.GetNameRef(), m_pszFidColumn))

):

if (m_pszFidColumn != nullptr &&
        EQUAL(oFieldDefn.GetNameRef(), m_pszFidColumn))
    {
        m_iFIDAsRegularColumnIndex = m_poFeatureDefn->GetFieldCount() - 1;
    }

And... I also found the code that seems to set the FID to that column if there is no FID column found, here:

gdal/ogr/ogrsf_frmts/gpkg/ogrgeopackagetablelayer.cpp

Line 2433 in e7b6c09

poFeature->SetFID(poFeature->GetFieldAsInteger64(

So putting everything together it explains the behaviour I'm seeing: the "real" FID column detection is case sensitive, but the "FID regular column" detection is case insensitive, and if only the second type is found it is "recuperated".

So, no worries... I'm already happy that there is a logic and that the behaviour can be explained :-).

…ot set Related to #11527

This was referenced Dec 19, 2024

ENH: Use use_arrow=True for reading and writing dataframes geofileops/geofileops#392

Draft

BUG: write_dataframe to GPKG with an "fid" column gives an error if use_arrow=True geopandas/pyogrio#510

Closed

rouault self-assigned this Dec 20, 2024

rouault added a commit to rouault/gdal that referenced this issue Dec 20, 2024

GPKG: fix FID vs field of same name consistency check when field is n…

84ae544

…ot set Related to OSGeo#11527

rouault mentioned this issue Dec 20, 2024

GPKG: fix FID vs field of same name consistency check when field is not set #11528

Merged

theroggy mentioned this issue Dec 20, 2024

BUG: fix writing fids to e.g. GPKG file with use_arrow geopandas/pyogrio#511

Merged

rouault added a commit that referenced this issue Dec 21, 2024

GPKG: fix FID vs field of same name consistency check when field is n…

02c9d70

…ot set Related to #11527

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPKG: setting fid when writing file via arrow gives an error #11527

GPKG: setting fid when writing file via arrow gives an error #11527

theroggy commented Dec 19, 2024 •

edited

Loading

rouault commented Dec 20, 2024

jorisvandenbossche commented Dec 20, 2024

jorisvandenbossche commented Dec 20, 2024

rouault commented Dec 20, 2024

theroggy commented Dec 20, 2024 •

edited

Loading

theroggy commented Dec 20, 2024 •

edited

Loading

rouault commented Dec 20, 2024

theroggy commented Dec 20, 2024 •

edited

Loading

GPKG: setting fid when writing file via arrow gives an error #11527

GPKG: setting fid when writing file via arrow gives an error #11527

Comments

theroggy commented Dec 19, 2024 • edited Loading

What is the bug?

Steps to reproduce the issue

Versions and provenance

Additional context

rouault commented Dec 20, 2024

jorisvandenbossche commented Dec 20, 2024

jorisvandenbossche commented Dec 20, 2024

rouault commented Dec 20, 2024

theroggy commented Dec 20, 2024 • edited Loading

theroggy commented Dec 20, 2024 • edited Loading

rouault commented Dec 20, 2024

theroggy commented Dec 20, 2024 • edited Loading

theroggy commented Dec 19, 2024 •

edited

Loading

theroggy commented Dec 20, 2024 •

edited

Loading

theroggy commented Dec 20, 2024 •

edited

Loading

theroggy commented Dec 20, 2024 •

edited

Loading