Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPKG: setting fid when writing file via arrow gives an error #11527

Open
theroggy opened this issue Dec 19, 2024 · 8 comments
Open

GPKG: setting fid when writing file via arrow gives an error #11527

theroggy opened this issue Dec 19, 2024 · 8 comments
Assignees

Comments

@theroggy
Copy link
Contributor

theroggy commented Dec 19, 2024

What is the bug?

Setting an FID explicitly and writing the row to e.g. GPKG gives an error: Inconsistent values of FID and field of same name.

Steps to reproduce the issue

Script to reproduce the error:

from osgeo import ogr

ogr.UseExceptions()
ds = ogr.GetDriverByName("GPKG").CreateDataSource("C:/temp/test.gpkg")
src_lyr = ds.CreateLayer("src_lyr")

field_def = ogr.FieldDefn("field_integer", ogr.OFTInteger)
src_lyr.CreateField(field_def)

field_def = ogr.FieldDefn("field_string", ogr.OFTString)
src_lyr.CreateField(field_def)

feat_def = src_lyr.GetLayerDefn()
src_feature = ogr.Feature(feat_def)
src_feature.SetField("field_integer", 17)
src_feature.SetField("field_string", "abc def")
src_feature.SetGeometry(ogr.CreateGeometryFromWkt("POINT (1 2)"))
src_feature.SetFID(123)

src_lyr.CreateFeature(src_feature)

src_feature2 = ogr.Feature(feat_def)
for i in range(feat_def.GetFieldCount()):
    src_feature2.SetFieldNull(i)
src_lyr.CreateFeature(src_feature2)

dst_lyr = ds.CreateLayer("dst_lyr")

stream = src_lyr.GetArrowStream(["INCLUDE_FID=YES"])
schema = stream.GetSchema()

success, error_msg = dst_lyr.IsArrowSchemaSupported(schema)
assert success, error_msg

for i in range(schema.GetChildrenCount()):
    if schema.GetChild(i).GetName() != "geom":
        dst_lyr.CreateFieldFromArrowSchema(schema.GetChild(i))

while True:
    array = stream.GetNextRecordBatch()
    if array is None:
        break
    assert dst_lyr.WriteArrowBatch(schema, array) == ogr.OGRERR_NONE

dst_lyr.ResetReading()

dst_feature = dst_lyr.GetNextFeature()
assert str(src_feature) == str(dst_feature).replace("dst_lyr", "src_lyr")

dst_feature = dst_lyr.GetNextFeature()
assert str(src_feature2) == str(dst_feature).replace("dst_lyr", "src_lyr")

Versions and provenance

Tested on Windows 11, using gdal 3.10.0 installed via conda-forge. Can also be reproduced in e.g. GDAL 3.8.5 and 3.9.2 (geopandas/pyogrio#511).

Additional context

No response

@rouault
Copy link
Member

rouault commented Dec 20, 2024

There was a stupidly embarassing error logic in the GPKG driver that is fixed per #11528 , but your issue is different.

It can be solved independently from that bugfix by just changing your code snippet to exclude the creation of an attribute field corresponding to the FID column of the source. So

for i in range(schema.GetChildrenCount()):
    if schema.GetChild(i).GetName() not in ("geom", src_lyr.GetFIDColumn()):
        dst_lyr.CreateFieldFromArrowSchema(schema.GetChild(i))

Dealing with FIDs in OGR has always be a bit tricky because the OGR abstraction give them a special status while database-like format make them a regular field. That isn't improved by Arrow not having a FID concept itself and thus needing to sometimes synthetize a column. I suspect the WriteArrowBatch() logic could perhaps be enhanced to support both a destination layer that has a named FID column and a regular field of the same name, but not totally sure we want to make that work because even if the GPKG driver would work that could cause issues in other drivers. So better not call CreateFieldFromArrowSchema() with a name that matches the name of the FID column.

@jorisvandenbossche
Copy link
Contributor

to exclude the creation of an attribute field corresponding to the FID column of the source

How does the destination layer know that that column is the FID column? (because the arrow batch you are writing still has that field in the schema)
(I don't directly find a SetFIDColumn equivalent to OGRLayer::GetFIDColumn)

@jorisvandenbossche
Copy link
Contributor

Ah, but OGRLayer::WriteArrowBatch has a FID=name option, and that's also how it is used in the example of copying source to out layer.

(for pyogrio's use of the arrow writer, I suppose we should enable the user to specify the name for the FID column then when writing)

@rouault
Copy link
Member

rouault commented Dec 20, 2024

(I don't directly find a SetFIDColumn equivalent to OGRLayer::GetFIDColumn)

There's none. Some drivers (like GPKG) offer the capacity to customize its name by offering a FID layer creation option at CreateLayer() time.

@theroggy
Copy link
Contributor Author

theroggy commented Dec 20, 2024

@rouault thanks for the advice... The code in pyogrio did the same thing as the sample code above: add the "fid" column as an ordinary column as well... triggering the issue fixed in #11528, so that is being fixed now in pyogrio (geopandas/pyogrio#511).

Following up on that I noticed that the "fid" column detection is case insensitive, which is fine. However the issue fixed in #11528 doesn't seem to be triggered when the column to use the fids from is called "FID" instead of "fid"... which seems a bit odd?

@theroggy
Copy link
Contributor Author

theroggy commented Dec 20, 2024

@rouault hmm... I was too fast... the behaviour seems to be still slightly different... I just made the check that an "fid" column should not be added as an ordinary column case insensitive, but this leads to another error: Error while writing batch to OGR layer: Cannot find OGR field for Arrow array FID.

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

I tried to make an overview of the different situations and the behaviour as I interprete it:

  • if column is called "fid" and ordinary column is added as well: bug fixed in GPKG: fix FID vs field of same name consistency check when field is not set #11528 is triggered, because the column is discovered as fid + is ordinary column and the comparison of both leads to an error
  • if column is called "fid" and not added as ordinary, everything is fine
  • if column is called "FID" and added as ordinary: column is not initially seen as fid column, but is discovered as fid-like column. Because it is not seen as fid column, no comparison is done, so no error, and it is "recuperated" as fid-like column to be saved as fid.
  • if column is called "FID" and not added as ordinary, it is not seen as fid column and cannot be "recuperated". This seems to trigger an error because no FID-like column is found?

@rouault
Copy link
Member

rouault commented Dec 20, 2024

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

to be honest, I'm lost :-) There might be things fixable, and others not. Not sure I'm in the mood of investigating that... I do remember I struggled mapping Arrow to OGR related to FID and yes things are going to be a bit messy. I'd wishing someone could take the lead in GDAL for "Arrow related activities"...
In general, but not always, field name comparisons in GDAL are case insensitive. In some places to satisfy some use cases where their was "foo" and "FOO", we start by a case sensitive comparison, and if no match fallback to the historical behaviour of case insensitive comparision
For WriteArrowBatch() specifically, at line 6523 "if (pszFIDName && sInfo.osName == pszFIDName)" of ogr/ogrsf_frmts/generic/ogrlayerarrow.cpp, the comparison is case sensitive.

@theroggy
Copy link
Contributor Author

theroggy commented Dec 20, 2024

So it seems that the primary discovery of the fid column is case sensitive, but that the discovery of "ordinary" columns that could be used as FID is not... or something like that?

to be honest, I'm lost :-) There might be things fixable, and others not. Not sure I'm in the mood of investigating that... I do remember I struggled mapping Arrow to OGR related to FID and yes things are going to be a bit messy. I'd wishing someone could take the lead in GDAL for "Arrow related activities"... In general, but not always, field name comparisons in GDAL are case insensitive. In some places to satisfy some use cases where their was "foo" and "FOO", we start by a case sensitive comparison, and if no match fallback to the historical behaviour of case insensitive comparision For WriteArrowBatch() specifically, at line 6523 "if (pszFIDName && sInfo.osName == pszFIDName)" of ogr/ogrsf_frmts/generic/ogrlayerarrow.cpp, the comparison is case sensitive.

:-). No problem. It confirms what I thought... I had been reading some code, but didn't (immediately) find the case-sensitive comparison.

I found where the FIDAsRegularColumnIndex is detected, and there the EQUAL macro is used, so case insensitive... (

EQUAL(oFieldDefn.GetNameRef(), m_pszFidColumn))
):

if (m_pszFidColumn != nullptr &&
        EQUAL(oFieldDefn.GetNameRef(), m_pszFidColumn))
    {
        m_iFIDAsRegularColumnIndex = m_poFeatureDefn->GetFieldCount() - 1;
    }

And... I also found the code that seems to set the FID to that column if there is no FID column found, here:

poFeature->SetFID(poFeature->GetFieldAsInteger64(

So putting everything together it explains the behaviour I'm seeing: the "real" FID column detection is case sensitive, but the "FID regular column" detection is case insensitive, and if only the second type is found it is "recuperated".

So, no worries... I'm already happy that there is a logic and that the behaviour can be explained :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants