Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoded + sign in URI #55

Open
janvogt opened this issue Mar 6, 2020 · 6 comments
Open

Encoded + sign in URI #55

janvogt opened this issue Mar 6, 2020 · 6 comments

Comments

@janvogt
Copy link

janvogt commented Mar 6, 2020

I'd expect

let parse = parseRelativeRef strictURIParserOptions
     serialize = serializeURIRef'
     uri =  "/%2B" -- encoded + sign.
in serialize <$> (parse uri >>= parse . serialize) == Right uri

to be true, but it's not. Left hand side will be Right "/%20"

@hasufell
Copy link
Collaborator

hasufell commented Aug 5, 2022

We discovered this issue in ghcup: https://gitlab.haskell.org/haskell/ghcup-hs/-/issues/398

Although this seems to be spec compliant:

URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.

But the spec only talks about encoding, not decoding.

I'd prefer that I can have the parser fail in strict mode.

@hasufell
Copy link
Collaborator

hasufell commented Dec 5, 2022

Any news?

@hasufell
Copy link
Collaborator

urlDecode and urlEncode are not symmetric:

-- | This function was extracted from the @http-types@ package. The
-- license can be found in licenses/http-types/LICENSE
urlDecode ::
-- | Whether to decode '+' to ' '
Bool ->
BS.ByteString ->
BS.ByteString
urlDecode replacePlus z = fst $ BS.unfoldrN (BS.length z) go z
where
go bs' =
case BS.uncons bs' of
Nothing -> Nothing
Just (43, ws) | replacePlus -> Just (32, ws) -- plus to space
Just (37, ws) -> Just $
fromMaybe (37, ws) $ do
-- percent
(x, xs) <- BS.uncons ws
x' <- hexVal x
(y, ys) <- BS.uncons xs
y' <- hexVal y
Just (combine x' y', ys)
Just (w, ws) -> Just (w, ws)
hexVal w
| 48 <= w && w <= 57 = Just $ w - 48 -- 0 - 9
| 65 <= w && w <= 70 = Just $ w - 55 -- A - F
| 97 <= w && w <= 102 = Just $ w - 87 -- a - f
| otherwise = Nothing
combine :: Word8 -> Word8 -> Word8
combine a b = shiftL a 4 .|. b

-- | Percent-encoding for URLs. Specify a list of additional
-- unreserved characters to permit.
urlEncode :: [Word8] -> ByteString -> Builder
urlEncode extraUnreserved = mconcat . map encodeChar . BS.unpack
where
encodeChar ch
| unreserved' ch = BB.fromWord8 ch
| otherwise = h2 ch
unreserved' ch
| ch >= 65 && ch <= 90 = True -- A-Z
| ch >= 97 && ch <= 122 = True -- a-z
| ch >= 48 && ch <= 57 = True -- 0-9
unreserved' c = c `elem` extraUnreserved
h2 v = let (a, b) = v `divMod` 16 in bs $ BS.pack [37, h a, h b] -- percent (%)
h i
| i < 10 = 48 + i -- zero (0)
| otherwise = 65 + i - 10 -- 65: A

urlEncode allows to pass extra unreserved characters which means it won't percent-encode those. And indeed, it seems + is passed to it for paths:

unreservedPath8 :: [Word8]
unreservedPath8 = unreserved8 ++ map ord8 ":@&=+$,"

urlEncodePath :: ByteString -> Builder
urlEncodePath = urlEncode unreservedPath8

normalizeRelativeRef :: URINormalizationOptions -> Maybe Scheme -> URIRef Relative -> Builder
normalizeRelativeRef o@URINormalizationOptions {..} mScheme RelativeRef {..} =
  authority <> path <> query <> fragment
  where
    path
      | unoSlashEmptyPath && BS.null rrPath = BB.fromByteString "/"
      | segs == [""] = BB.fromByteString "/"
      | otherwise = mconcat (intersperse (c8 '/') (map urlEncodePath segs))
-- ...

This isn't such a trivial decision either, because e.g. tel:+1-816-555-1212 might break, see https://www.rfc-editor.org/rfc/rfc3966.html#section-3

On the other hand, specific schemes may need more specific parsers as per RFC 3986.

Looking at https://timothygu.me/urltester/#input=%2F%252B I don't see any parser decoding %2B to +.

@hasufell
Copy link
Collaborator

hasufell commented Dec 31, 2023

I tried finding more information about the practices with percent encoding and the + and space relationship:

What to percent encode can depend on the URI scheme, which have their own RFCs: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

The main spec says:

A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component.

So we only percent encode what is not in the allowed character set. For paths, usually consisting of pchar, the allowed set is:

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

      pct-encoded = "%" HEXDIG HEXDIG

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

So + is legal in the path and there's no reason to percent encode it.

As such, I think the expected behavior should be:

  1. parsing must interpret percent encoded characters in all the sections that allow pct-encoded as per the BNF, so %2B must be parsed as + in paths
  2. encoding (serializing) must not turn + into %2B, because + is legal in paths

I think this behavior is present in #64

@hasufell
Copy link
Collaborator

@kozross any opinion?

@kozross
Copy link

kozross commented Dec 31, 2023

@hasufell - what does this have to do with me?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants