Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lack of support for non-ASCII characters in cabal sdist #3758

Closed
snoyberg opened this issue Sep 2, 2016 · 21 comments
Closed

Lack of support for non-ASCII characters in cabal sdist #3758

snoyberg opened this issue Sep 2, 2016 · 21 comments

Comments

@snoyberg
Copy link
Collaborator

snoyberg commented Sep 2, 2016

This bug is really based in the tar package, and affects both Stack (commercialhaskell/stack#2549 and commercialhaskell/stack#2557) and cabal-install. Repro steps for cabal-install:

  1. git clone https://github.com/snoyberg/yaml
  2. cd yaml
  3. git checkout yaml/0.8.18.2
  4. cabal sdist
  5. tar xfv dist/yaml-0.8.18.2.tar.gz

Expected: Find a file with the name yaml-0.8.18.2/test/resources/accenté/foo.yaml

Actual: The file is named yaml-0.8.18.2/test/resources/accente^A/foo.yaml

Note that Hackage itself rejects non-ASCII filenames, so this is almost a moot point. However, if cabal sdist properly encodes the file path, Hackage will reject the upload with a useful error message, instead of silently accepting the upload (which happened to me). There may also be an argument to be made that Hackage's filepath checking code should be made a bit more strict I'm not sure.

@ezyang
Copy link
Contributor

ezyang commented Sep 2, 2016

To be clear, this only applies to cabal sdist, and not cabal act-as-setup -- sdist, because Cabal's code calls the tar program directly which does not have this bug.

See upstream haskell/tar#6

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

Thank you for the comment, you just unbroke my brain. I couldn't reconcile what I saw in the Cabal source code with the behavior I was experiencing.

@ezyang
Copy link
Contributor

ezyang commented Sep 2, 2016

It certainly puzzled me a while too.

@23Skidoo
Copy link
Member

23Skidoo commented Sep 2, 2016

So this is basically a dupe of haskell/tar#6?

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

Yes, but it can be worked around inside cabal-install the same way I proposed doing it in Stack, see https://github.com/commercialhaskell/stack/pull/2557/files#diff-7f7c1fa6ec0ee064e1a11c95f5b1b025L110.

@23Skidoo 23Skidoo added this to the cabal-install 1.24.0.1 milestone Sep 2, 2016
@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

I was about to try my hand at a patch for this, but I'm not sure which library to use for the UTF8 encoding. Neither text nor utf8-string appear to be dependencies of cabal-install right now, and I doubt they are desired additions. I suppose an alternative is simply to check the paths to ensure they are ASCII and error out if not.

@23Skidoo
Copy link
Member

23Skidoo commented Sep 2, 2016

@snoyberg cabal-install does depend on text indirectly (via parsec), so using text is fine.

@23Skidoo
Copy link
Member

23Skidoo commented Sep 2, 2016

@snoyberg

https://github.com/commercialhaskell/stack/pull/2557/files#diff-7f7c1fa6ec0ee064e1a11c95f5b1b025L110

Something like this workaround would be acceptable for now, but in the long run we should fix it at the source.

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

I was just looking into "fixing it at the source" as well. It seems unclear how to do it. One idea I had is:

  • Introduce a bytesToTarPath function, which would be defined as bytesToTarPath isDir path = toTarPath isDir (BS.unpack path)
  • Introduce a toTarPathUtf8 function, which would be defined as toTarPathUtf8 isDir path = bytesToTarPath isDir (encodeUtf8 (T.pack path))
  • Deprecate toTarPath

I can float this idea over there too. But I'm not too excited about it, since it's just moving the exact same approach I'm recommending here and in Stack to the tar package, and there doesn't seem to be anything better without breaking backwards compatibility.

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

Actually, this is harder to resolve in cabal-install than it was in Stack, due to usage of the Tar.pack function. So in this case, it probably does make more sense to move the fix upstream.

@23Skidoo
Copy link
Member

23Skidoo commented Sep 2, 2016

@snoyberg Since our aim is to make tarballs with unicode fail better, I was thinking about making the tar package fail directly on encountering paths with unicode.

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 2, 2016

IMO handling that case with UTF8 encoding makes more sense, but best behavior is certainly ambiguous.

@dcoutts
Copy link
Contributor

dcoutts commented Sep 3, 2016

So what do people recommend here? Should we generally switch cabal and hackage over to allowing UTF8 encoded file names? Or just try and fail in a clearer way?

Back in the day it wan't really possible to rely on file names being UTF8 on unix, but these days we probably can. The other reason we've historically been conservative on hackage is that we want to try and keep maximum compatibility for tools on all platforms and preserving stuff for long time periods, so sticking closely to Tar posix standards seems wise here.

And yes, I'd of course like whatever we decide to go upstream (can of course keep temporary workarounds).

@23Skidoo
Copy link
Member

23Skidoo commented Sep 3, 2016

Should we generally switch cabal and hackage over to allowing UTF8 encoded file names? Or just try and fail in a clearer way?

Let's implement the latter for now, which I think is much easier than the former.

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 3, 2016

I'm fully in favor of just failing more clearly. If not too complicated, I'd also recommend adding support now for UTF8 aware unpack in case the situation changes in the future, but that's not high priority IMO.

@dcoutts
Copy link
Contributor

dcoutts commented Sep 3, 2016

Ok, thanks for feedback folks. I concur, lets fail better now, and I'll put it on my TODO to support UTF8 filenames in the tar package, but won't treat it as an "omg! emergency!" level of urgency.

@snoyberg
Copy link
Collaborator Author

snoyberg commented Sep 4, 2016

Perhaps it would be wise to additionally tighten up the non-ASCII character detection in tar so that Hackage rejects packages that are uploaded using today's cabal sdist/stack sdist. In particular, yaml version 0.8.18.4 has a corrupted filepath which ideally would have been rejected by Hackage.

@ezyang ezyang modified the milestones: cabal-install 1.24.0.1, Soon Sep 6, 2016
@dcoutts
Copy link
Contributor

dcoutts commented Sep 8, 2016

True could add an ASCII check to the optional portability checks in the tar lib.

@23Skidoo 23Skidoo modified the milestones: 1.24.0.2, 1.24.0.1 Oct 8, 2016
@23Skidoo
Copy link
Member

23Skidoo commented Oct 8, 2016

Moving this to 1.24.0.2.

@23Skidoo 23Skidoo modified the milestones: 1.24.2.1, 1.24.2.0 Dec 2, 2016
@23Skidoo
Copy link
Member

23Skidoo commented Dec 2, 2016

Remilestoned.

@ulysses4ever
Copy link
Collaborator

This was a dupe of #2558 which is resolved now (thanks to upstream).

@ulysses4ever ulysses4ever closed this as not planned Won't fix, can't repro, duplicate, stale Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants