DataFrame -> ESRI Shapefile: UTF-8/16 mangled to ?????????. DataFrame -> CSV: UTF-8/16 mangled to Latin-1 characters #99

alecStewart1 · 2024-12-19T21:48:48Z

Hello!

Firstly, thank you for this package!

At work, we do a lot with Esri products and deal with shapefiles often. I read a CSV file and a shapefile into two separate DataFrames and combined them with vcat. The CSV file comes from calculating the centroids of polygons on a layer we have in ArcGIS; the shapefile contains points from another source:

import GeoDataFrames as GDF

centroid_df = GDF.read("/home/my-user/centroids.csv")
point_df = GDF.read("/home/my-user/points.shp")

combined_df = vcat(centroid_df, point_df, cols=:union)

GDF.write("/home/my-user/combined_points.shp", combined_df)

This does create a valid shapefile, but any column values containing Mandarin or Cyrillic script are shown as ????????? or "?????????" whenever I load the new combined shapefile with GeoDataFrames or into ArcGIS.

Writing to a CSV file behaves similarly, even with options=Dict("bom"=>"true"): Mandarin and Cyrillic characters are mangled into what look like Latin-1 characters:

# same dataframes as above

GDF.write("/home/my-user/combined_points.csv", combined_df, options=Dict("bom"=>"true"))

Is there an option I can pass to the driver for shapefiles, is there something I'm missing for both drivers, or is there something else I can do?

asinghvi17 · 2024-12-20T04:55:26Z

I tried an example with Cyrillic characters in a CSV through GeoDataFrames, and it seems fine. So it looks like the only issue is with your shapefile. Since I don't have access to your files, I can't tell you exactly what went wrong. But you can try Shapefile.jl and see if it loads those files better, since it probably makes fewer assumptions and may be easier to fix if broken.

GeoDataFrames is ArchGDAL under the hood so we are at the whims of GDAL here.

Can you try this minimal example, and see if that causes the same error on your machine? It seems to work for me.

julia> using GeoDataFrames

julia> descriptions = [String(rand('А':'Ҁ', 5)) for _ in 1:10]
10-element Vector{String}:
 "дэМѲЪ"
 "ѴЖїѾв"
 "ѨѶоѧз"
 "УѕѦИэ"
 "ѸѬѳѴо"
 "ЪўѝѯѮ"
 "ъѲТкў"
 "ѭйФйѓ"
 "ђЙнШѲ"
 "мщѼыг"

julia> geometries = tuple.(rand(10), rand(10))
10-element Vector{Tuple{Float64, Float64}}:
 (0.3616396660054806, 0.1902277850964662)
 (0.9946340856206181, 0.7562092008804872)
 (0.8648571829290774, 0.00931884536274874)
 (0.41750353601434986, 0.4618622731533355)
 (0.04766980429969825, 0.5472432276083967)
 (0.8020186665742213, 0.24774530424596475)
 (0.22464094645451838, 0.37652599046554514)
 (0.15877861428124762, 0.7791053151409258)
 (0.27718245266096586, 0.7923647914178605)
 (0.27286993041519136, 0.7142004310660254)

julia> df = GeoDataFrames.DataFrame(geometry = geometries, description = descriptions)
10×2 DataFrame
 Row │ geometry                description
     │ Tuple…                  String
─────┼─────────────────────────────────────
   1 │ (0.36164, 0.190228)     дэМѲЪ
   2 │ (0.994634, 0.756209)    ѴЖїѾв
   3 │ (0.864857, 0.00931885)  ѨѶоѧз
   4 │ (0.417504, 0.461862)    УѕѦИэ
   5 │ (0.0476698, 0.547243)   ѸѬѳѴо
   6 │ (0.802019, 0.247745)    ЪўѝѯѮ
   7 │ (0.224641, 0.376526)    ъѲТкў
   8 │ (0.158779, 0.779105)    ѭйФйѓ
   9 │ (0.277182, 0.792365)    ђЙнШѲ
  10 │ (0.27287, 0.7142)       мщѼыг

julia> GeoDataFrames.write("try1.csv", df)
"try1.csv"

julia> using CSV

julia> rdf = CSV.read("try1.csv", GeoDataFrames.DataFrame)
10×1 DataFrame
 Row │ description
     │ String15
─────┼─────────────
   1 │ дэМѲЪ
   2 │ ѴЖїѾв
   3 │ ѨѶоѧз
   4 │ УѕѦИэ
   5 │ ѸѬѳѴо
   6 │ ЪўѝѯѮ
   7 │ ъѲТкў
   8 │ ѭйФйѓ
   9 │ ђЙнШѲ
  10 │ мщѼыг

julia> all(rdf.description .== descriptions)
true

alecStewart1 · 2024-12-20T15:21:57Z

Here's what I got running the minimal example on my machine:

julia> import GeoDataFrames as GDF

julia> descriptions = [String(rand('А':'Ҁ', 5)) for _ in 1:10]
10-element Vector{String}:
 "жВѬћг"
 "ѰЯѧѮЯ"
 "бВеѹн"
 "ЛѫёгЗ"
 "јфѭѡѝ"
 "ФХеле"
 "ИјвЙп"
 "џнѩѫб"
 "ёьћяѽ"
 "сѪўїб"

julia> geometries = tuple.(rand(10), rand(10))
10-element Vector{Tuple{Float64, Float64}}:
 (0.6007809329793226, 0.4691203390956308)
 (0.3374102289148402, 0.24100429968946713)
 (0.8797485178458518, 0.352745991782348)
 (0.1556353875200167, 0.07057473124988933)
 (0.2184235181480766, 0.09903465565672998)
 (0.8270660955174479, 0.2707867773054232)
 (0.840726403212758, 0.5996573922935156)
 (0.314223991781063, 0.6113849793459665)
 (0.26012121229690677, 0.915283663456271)
 (0.5753997476649441, 0.23098723435765955)

julia> df = GDF.DataFrame(geometry = geometries, description = descriptions)
10×2 DataFrame
 Row │ geometry               description
     │ Tuple…                 String
─────┼────────────────────────────────────
   1 │ (0.600781, 0.46912)    лѲгѭѽ
   2 │ (0.33741, 0.241004)    ХѦѪџѦ
   3 │ (0.879749, 0.352746)   ЩЪПѝѥ
   4 │ (0.155635, 0.0705747)  ЛМыѕъ
   5 │ (0.218424, 0.0990347)  ЧѺЗѐП
   6 │ (0.827066, 0.270787)   эѺщВГ
   7 │ (0.840726, 0.599657)   ЧШќѿѼ
   8 │ (0.314224, 0.611385)   ХРїяѽ
   9 │ (0.260121, 0.915284)   ЪыкАй
  10 │ (0.5754, 0.230987)     ітАЯѥ

julia> GDF.write("try1.csv", df)
"try1.csv"

julia> using CSV

julia> rdf = CSV.read("try1.csv", GDF.DataFrame)
10×1 DataFrame
 Row │ description
     │ String15
─────┼─────────────
   1 │ лѲгѭѽ
   2 │ ХѦѪџѦ
   3 │ ЩЪПѝѥ
   4 │ ЛМыѕъ
   5 │ ЧѺЗѐП
   6 │ эѺщВГ
   7 │ ЧШќѿѼ
   8 │ ХРїяѽ
   9 │ ЪыкАй
  10 │ ітАЯѥ

julia> all(rdf.description .== descriptions)
true

Opening the try1.csv file in Excel has the characters mangled as Latin-1.

I would think this could be solved by giving options a Dict with CSV.jl's write option bom.

julia> GDF.write("try2.csv", df, options=Dict("bom"=>"true"))

That doesn't work either. However, calling CSV.write itself with bom=true does work:

julia> CSV.write("try3.csv", df, bom=true)

So whatever GeoDataFrames.write uses for writing to CSV doesn't use or allow the bom=true option of CSV.write, if it's using CSV.jl at all. If not, then the underlying mechanism should allow for adding a UTF-8 BOM.
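
Whether a given CSV carries a BOM can be checked directly from its first three bytes, which makes it easier to compare the files produced by the different write calls. The sketch below uses only Julia's Base; `has_utf8_bom` and `UTF8_BOM` are hypothetical names introduced for illustration and are not part of GeoDataFrames or CSV.jl.

```julia
# Hypothetical helper (not from GeoDataFrames or CSV.jl): report whether a
# file starts with the three-byte UTF-8 byte order mark EF BB BF.
const UTF8_BOM = UInt8[0xEF, 0xBB, 0xBF]

has_utf8_bom(path::AbstractString) = open(io -> read(io, 3), path) == UTF8_BOM

# Demo on a throwaway file: first written without a BOM, then with one.
path = joinpath(mktempdir(), "bom_demo.csv")
write(path, "name\nдэМѲЪ\n")
println(has_utf8_bom(path))   # false

open(path, "w") do io
    write(io, UTF8_BOM)            # BOM first
    write(io, "name\nдэМѲЪ\n")     # then the UTF-8 encoded content
end
println(has_utf8_bom(path))   # true
```

Running `has_utf8_bom` on try1.csv, try2.csv, and try3.csv would show which of the writers actually emitted the BOM.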

evetion · 2024-12-20T16:17:46Z

Can you try with WRITE_BOM instead of just bom when you do write? That is apparently the config option according to GDAL https://gdal.org/en/stable/drivers/vector/csv.html (which we use underneath).

alecStewart1 · 2024-12-20T17:12:35Z

> Can you try with WRITE_BOM instead of just bom when you do write? That is apparently the config option according to GDAL https://gdal.org/en/stable/drivers/vector/csv.html (which we use underneath).

Yup, that seemed to work!
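
For anyone landing here later, the call looks roughly like the sketch below. The value "YES" is an assumption (the thread does not record which value was used), and the file name and sample data are made up for illustration.

```julia
import GeoDataFrames as GDF

# Small made-up table with Cyrillic text, mirroring the minimal example above.
df = GDF.DataFrame(geometry = [(0.0, 0.0), (1.0, 1.0)],
                   description = ["дэМѲЪ", "ѴЖїѾв"])

# WRITE_BOM is documented by GDAL for its CSV driver; "YES" is assumed here.
GDF.write("try_bom.csv", df; options = Dict("WRITE_BOM" => "YES"))

# If the option took effect, the file starts with the UTF-8 BOM (EF BB BF).
println(open(io -> read(io, 3), "try_bom.csv") == UInt8[0xEF, 0xBB, 0xBF])
```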

Seems like for shapefiles it's SHAPE_ENCODING, although it doesn't say what values you can use.

https://gdal.org/en/stable/drivers/vector/shapefile.html#encoding

Looking here, I guess you can pass

CPL_ENC_UTF8
CPL_ENC_ASCII
CPL_ENC_ISO8859_1

Or uh...maybe just Dict("SHAPE_ENCODING"=>"UTF8") in our case?

EDIT 1:

Okay, neither Dict("SHAPE_ENCODING"=>"UTF8") nor Dict("SHAPE_ENCODING"=>"UTF-8") worked, so I guess it would be Dict("SHAPE_ENCODING"=>"CPL_ENC_UTF8")?

EDIT 2:

Dict("SHAPE_ENCODING"=>"CPL_ENC_UTF8") didn't work. Time to dig around source code I suppose.

EDIT 3:

Okay...leaving it blank doesn't work either (Dict("SHAPE_ENCODING"=>""))
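
One possible reason none of these took effect: the GDAL documentation lists SHAPE_ENCODING as a configuration option, whereas WRITE_BOM worked when passed through options, which suggests that options carries layer creation options. The Shapefile driver documents a separate layer creation option named ENCODING. The sketch below is untested against the files in this issue and assumes that "UTF-8" is an accepted value; the file name and sample data are made up. The column is called name because DBF field names are limited to 10 characters.

```julia
import GeoDataFrames as GDF

# Made-up sample rows with Cyrillic and Chinese text.
df = GDF.DataFrame(geometry = [(0.0, 0.0), (1.0, 1.0)],
                   name = ["дэМѲЪ", "中文"])

# ASSUMPTION: ENCODING is forwarded to GDAL as a Shapefile layer creation
# option and "UTF-8" is a valid value for it.
GDF.write("try_utf8.shp", df; options = Dict("ENCODING" => "UTF-8"))

# Read the file back and compare the text column by eye.
rdf = GDF.read("try_utf8.shp")
println(rdf.name)
```

If the strings still come back as question marks, the next thing to check would be how the file is interpreted on read, which is where the SHAPE_ENCODING configuration option applies.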
