DataFrame -> ESRI Shapefile: UTF-8/16 mangled to ?????????. DataFrame -> CSV: UTF-8/16 mangled to Latin-1 characters #99

Open
alecStewart1 opened this issue Dec 19, 2024 · 4 comments

Comments

@alecStewart1

alecStewart1 commented Dec 19, 2024

Hello!

Firstly, thank you for this package!

At work we do a lot with Esri products and deal with shapefiles a lot. I read a CSV file and a shapefile into two separate dataframes and combined them with vcat. The CSV file comes from calculating the centroids of polygons on a layer we have in ArcGIS; the shapefile contains points from another source:

import GeoDataFrames as GDF

centroid_df = GDF.read("/home/my-user/centroids.csv")
point_df = GDF.read("/home/my-user/points.shp")

combined_df = vcat(centroid_df, point_df, cols=:union)

GDF.write("/home/my-user/combined_points.shp", combined_df)

This does create a valid shapefile, but in any column whose values contain Mandarin or Cyrillic script, those values are shown as ????????? or "?????????" whenever I load the new combined shapefile with GeoDataFrames or in ArcGIS.

Writing to a CSV file has a similar problem, even with options=Dict("bom"=>"true"): Mandarin and Cyrillic characters are mangled into what look like Latin-1 characters:

# same dataframes as above

GDF.write("/home/my-user/combined_points.csv", combined_df, options=Dict("bom"=>"true"))

Is there an option I can pass to the driver for shapefiles, is there something I'm missing for both drivers, or is there something else I can do?

@asinghvi17
Contributor

asinghvi17 commented Dec 20, 2024

I tried an example with Cyrillic characters in a CSV through GeoDataFrames, and it seems fine, so it looks like the only issue is with your Shapefile. Since I don't have access to your files, I can't tell you exactly what went wrong. You can try Shapefile.jl and see if it loads those files better, since it probably makes fewer assumptions and may be easier to fix if broken.

GeoDataFrames is ArchGDAL under the hood so we are at the whims of GDAL here.


Can you try this minimal example, and see if that causes the same error on your machine? It seems to work for me.

julia> using GeoDataFrames

julia> descriptions = [String(rand('А':'Ҁ', 5)) for _ in 1:10]
10-element Vector{String}:
 "дэМѲЪ"
 "ѴЖїѾв"
 "ѨѶоѧз"
 "УѕѦИэ"
 "ѸѬѳѴо"
 "ЪўѝѯѮ"
 "ъѲТкў"
 "ѭйФйѓ"
 "ђЙнШѲ"
 "мщѼыг"

julia> geometries = tuple.(rand(10), rand(10))
10-element Vector{Tuple{Float64, Float64}}:
 (0.3616396660054806, 0.1902277850964662)
 (0.9946340856206181, 0.7562092008804872)
 (0.8648571829290774, 0.00931884536274874)
 (0.41750353601434986, 0.4618622731533355)
 (0.04766980429969825, 0.5472432276083967)
 (0.8020186665742213, 0.24774530424596475)
 (0.22464094645451838, 0.37652599046554514)
 (0.15877861428124762, 0.7791053151409258)
 (0.27718245266096586, 0.7923647914178605)
 (0.27286993041519136, 0.7142004310660254)

julia> df = GeoDataFrames.DataFrame(geometry = geometries, description = descriptions)
10×2 DataFrame
 Row │ geometry                description
     │ Tuple…                  String
─────┼─────────────────────────────────────
   1 │ (0.36164, 0.190228)     дэМѲЪ
   2 │ (0.994634, 0.756209)    ѴЖїѾв
   3 │ (0.864857, 0.00931885)  ѨѶоѧз
   4 │ (0.417504, 0.461862)    УѕѦИэ
   5 │ (0.0476698, 0.547243)   ѸѬѳѴо
   6 │ (0.802019, 0.247745)    ЪўѝѯѮ
   7 │ (0.224641, 0.376526)    ъѲТкў
   8 │ (0.158779, 0.779105)    ѭйФйѓ
   9 │ (0.277182, 0.792365)    ђЙнШѲ
  10 │ (0.27287, 0.7142)       мщѼыг

julia> GeoDataFrames.write("try1.csv", df)
"try1.csv"

julia> using CSV

julia> rdf = CSV.read("try1.csv", GeoDataFrames.DataFrame)
10×1 DataFrame
 Row │ description
     │ String15
─────┼─────────────
   1 │ дэМѲЪ
   2 │ ѴЖїѾв
   3 │ ѨѶоѧз
   4 │ УѕѦИэ
   5 │ ѸѬѳѴо
   6 │ ЪўѝѯѮ
   7 │ ъѲТкў
   8 │ ѭйФйѓ
   9 │ ђЙнШѲ
  10 │ мщѼыг

julia> all(rdf.description .== descriptions)
true

@alecStewart1
Author

Here's what I got running the minimal example on my machine:

julia> import GeoDataFrames as GDF

julia> descriptions = [String(rand('А':'Ҁ', 5)) for _ in 1:10]
10-element Vector{String}:
 "жВѬћг"
 "ѰЯѧѮЯ"
 "бВеѹн"
 "ЛѫёгЗ"
 "јфѭѡѝ"
 "ФХеле"
 "ИјвЙп"
 "џнѩѫб"
 "ёьћяѽ"
 "сѪўїб"

julia> geometries = tuple.(rand(10), rand(10))
10-element Vector{Tuple{Float64, Float64}}:
 (0.6007809329793226, 0.4691203390956308)
 (0.3374102289148402, 0.24100429968946713)
 (0.8797485178458518, 0.352745991782348)
 (0.1556353875200167, 0.07057473124988933)
 (0.2184235181480766, 0.09903465565672998)
 (0.8270660955174479, 0.2707867773054232)
 (0.840726403212758, 0.5996573922935156)
 (0.314223991781063, 0.6113849793459665)
 (0.26012121229690677, 0.915283663456271)
 (0.5753997476649441, 0.23098723435765955)

julia> df = GDF.DataFrame(geometry = geometries, description = descriptions)
10×2 DataFrame
 Row │ geometry               description
     │ Tuple…                 String
─────┼────────────────────────────────────
   1 │ (0.600781, 0.46912)    лѲгѭѽ
   2 │ (0.33741, 0.241004)    ХѦѪџѦ
   3 │ (0.879749, 0.352746)   ЩЪПѝѥ
   4 │ (0.155635, 0.0705747)  ЛМыѕъ
   5 │ (0.218424, 0.0990347)  ЧѺЗѐП
   6 │ (0.827066, 0.270787)   эѺщВГ
   7 │ (0.840726, 0.599657)   ЧШќѿѼ
   8 │ (0.314224, 0.611385)   ХРїяѽ
   9 │ (0.260121, 0.915284)   ЪыкАй
  10 │ (0.5754, 0.230987)     ітАЯѥ

julia> GDF.write("try1.csv", df)
"try1.csv"

julia> using CSV

julia> rdf = CSV.read("try1.csv", GDF.DataFrame)
10×1 DataFrame
 Row │ description
     │ String15
─────┼─────────────
   1 │ лѲгѭѽ
   2 │ ХѦѪџѦ
   3 │ ЩЪПѝѥ
   4 │ ЛМыѕъ
   5 │ ЧѺЗѐП
   6 │ эѺщВГ
   7 │ ЧШќѿѼ
   8 │ ХРїяѽ
   9 │ ЪыкАй
  10 │ ітАЯѥ

julia> all(rdf.description .== descriptions)
true

Opening try1.csv in Excel shows the characters mangled as Latin-1.

[Image: Screenshot 2024-12-20 at 9 11 45 AM]
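For anyone following along, the kind of mangling described here can be reproduced without Excel. This is only a sketch of the likely mechanism, assuming the reader treats each UTF-8 byte as one single-byte (Latin-1) character; `latin1_view` is a hypothetical helper name, not part of any package:

```julia
# Hypothetical helper: reinterpret every UTF-8 byte of `s` as one Latin-1
# character, which mimics a reader that assumes a single-byte encoding.
latin1_view(s::AbstractString) = String(Char.(codeunits(s)))

original = "где"                 # three Cyrillic letters, six UTF-8 bytes
mangled = latin1_view(original)

println(ncodeunits(original))    # prints 6
println(length(mangled))         # prints 6, one character per byte
println(mangled)                 # prints "Ð³Ð´Ðµ"
```

The file on disk is still valid UTF-8 in this scenario, which would be consistent with CSV.read round-tripping the strings correctly above.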

I would think this could be solved by giving options a Dict with CSV.jl's write option bom.

julia> GDF.write("try2.csv", df, options=Dict("bom"=>"true"))

That doesn't work either. However, calling CSV.write itself with bom=true does work:

julia> CSV.write("try3.csv", df, bom=true)
[Image: Screenshot 2024-12-20 at 9 19 07 AM]

So whatever GeoDataFrames.write uses for writing CSV files either isn't CSV.jl or doesn't pass the bom=true option through to CSV.write. If it isn't CSV.jl, then the underlying mechanism should offer some way to add a UTF-8 BOM header.
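A quick way to tell whether a given writer actually emitted the BOM is to look at the first three bytes of the file. A minimal sketch using only Julia's standard library; `has_utf8_bom` and `UTF8_BOM` are names made up for this example:

```julia
# The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = UInt8[0xEF, 0xBB, 0xBF]

# Hypothetical helper: true if the file at `path` starts with the UTF-8 BOM.
function has_utf8_bom(path::AbstractString)
    open(path, "r") do io
        read(io, 3) == UTF8_BOM   # reads at most 3 bytes
    end
end

# Usage: write one file without a BOM and one with it, then compare.
mktempdir() do dir
    plain = joinpath(dir, "plain.csv")
    with_bom = joinpath(dir, "with_bom.csv")
    write(plain, "description\nжВѬћг\n")
    open(with_bom, "w") do io
        write(io, UTF8_BOM)
        write(io, "description\nжВѬћг\n")
    end
    println(has_utf8_bom(plain))      # prints false
    println(has_utf8_bom(with_bom))   # prints true
end
```

Running `has_utf8_bom` on try1.csv, try2.csv and try3.csv should show which of the three calls above wrote the header.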

@evetion
Owner

evetion commented Dec 20, 2024

Can you try with WRITE_BOM instead of just bom when you do write? That is apparently the config option according to GDAL https://gdal.org/en/stable/drivers/vector/csv.html (which we use underneath).

@alecStewart1
Author

alecStewart1 commented Dec 20, 2024

> Can you try with WRITE_BOM instead of just bom when you do write? That is apparently the config option according to GDAL https://gdal.org/en/stable/drivers/vector/csv.html (which we use underneath).

Yup, that seemed to work!
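For reference, a sketch of what the working call might look like. The value "YES" is an assumption based on the YES/NO values listed in the GDAL CSV driver docs, and the file name and sample data are made up; the last lines only check that the file starts with the UTF-8 BOM bytes:

```julia
import GeoDataFrames as GDF

# Small made-up table in the same shape as the minimal example above.
df = GDF.DataFrame(
    geometry = [(0.0, 0.0), (1.0, 1.0)],
    description = ["жВѬћг", "ѰЯѧѮЯ"],
)

# Write into a fresh temporary directory so no existing file is in the way.
path = joinpath(mktempdir(), "try_bom.csv")

# Assumption: "YES" is an accepted value for GDAL's WRITE_BOM option.
GDF.write(path, df; options = Dict("WRITE_BOM" => "YES"))

# Check that the file starts with the UTF-8 byte order mark (EF BB BF).
first_bytes = open(io -> read(io, 3), path)
println(first_bytes == UInt8[0xEF, 0xBB, 0xBF])
```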

Seems like for shapefiles it's SHAPE_ENCODING, although it doesn't say what values you can use.

https://gdal.org/en/stable/drivers/vector/shapefile.html#encoding

Looking here, I guess you can pass

  • CPL_ENC_UTF8
  • CPL_ENC_ASCII
  • CPL_ENC_ISO8859_1

Or uh...maybe just Dict("SHAPE_ENCODING"=>"UTF8") in our case?

EDIT 1:

Okay, neither Dict("SHAPE_ENCODING"=>"UTF8") nor Dict("SHAPE_ENCODING"=>"UTF-8") worked, so I guess it would be Dict("SHAPE_ENCODING"=>"CPL_ENC_UTF8")?

EDIT 2:

Dict("SHAPE_ENCODING"=>"CPL_ENC_UTF8") didn't work. Time to dig around source code I suppose.

EDIT 3:

Okay...leaving it blank doesn't work either (Dict("SHAPE_ENCODING"=>""))
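One thing that may be worth trying, offered as an untested guess rather than a confirmed fix: the GDAL Shapefile docs describe SHAPE_ENCODING as a configuration option, while they list ENCODING as the layer creation option that sets the DBF encoding. Assuming `options` in GDF.write is passed on as layer creation options, a sketch would be the following (column name `descr` and the file name are made up, and the name is kept short because DBF field names are limited to 10 characters):

```julia
import GeoDataFrames as GDF

descriptions = ["жВѬћг", "ѰЯѧѮЯ"]
df = GDF.DataFrame(
    geometry = [(0.0, 0.0), (1.0, 1.0)],
    descr = descriptions,   # short field name to stay within the DBF limit
)

# Write into a fresh temporary directory so no existing file is in the way.
path = joinpath(mktempdir(), "points.shp")

# Assumption: ENCODING is forwarded to the Shapefile driver as a layer
# creation option and makes GDAL store the attribute text as UTF-8.
GDF.write(path, df; options = Dict("ENCODING" => "UTF-8"))

# Read the shapefile back and compare against the original strings.
rdf = GDF.read(path)
println(rdf.descr == descriptions)
```

If the comparison prints false or the strings still come back as question marks, then the option is not reaching the driver and the cause lies elsewhere.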
