Encoded + sign in URI #55

janvogt · 2020-03-06T20:05:35Z

I'd expect

let parse = parseRelativeRef strictURIParserOptions
     serialize = serializeURIRef'
     uri =  "/%2B" -- encoded + sign.
in serialize <$> (parse uri >>= parse . serialize) == Right uri

to be true, but it's not. Left hand side will be Right "/%20"

The text was updated successfully, but these errors were encountered:

hasufell · 2022-08-05T03:17:26Z

We discovered this issue in ghcup: https://gitlab.haskell.org/haskell/ghcup-hs/-/issues/398

Although this seems to be spec compliant:

URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.

But the spec only talks about encoding, not decoding.

I'd prefer that I can have the parser fail in strict mode.

hasufell · 2022-12-05T04:28:58Z

Any news?

hasufell · 2023-12-31T12:29:07Z

urlDecode and urlEncode are not symmetric:

uri-bytestring/src/URI/ByteString/Internal.hs

Lines 916 to 944 in 1c0134c

    
           -- | This function was extracted from the @http-types@ package. The 
        
           -- license can be found in licenses/http-types/LICENSE 
        
           urlDecode :: 
        
             -- | Whether to decode '+' to ' ' 
        
             Bool -> 
        
             BS.ByteString -> 
        
             BS.ByteString 
        
           urlDecode replacePlus z = fst $ BS.unfoldrN (BS.length z) go z 
        
             where 
        
               go bs' = 
        
                 case BS.uncons bs' of 
        
                   Nothing -> Nothing 
        
                   Just (43, ws) | replacePlus -> Just (32, ws) -- plus to space 
        
                   Just (37, ws) -> Just $ 
        
                     fromMaybe (37, ws) $ do 
        
                       -- percent 
        
                       (x, xs) <- BS.uncons ws 
        
                       x' <- hexVal x 
        
                       (y, ys) <- BS.uncons xs 
        
                       y' <- hexVal y 
        
                       Just (combine x' y', ys) 
        
                   Just (w, ws) -> Just (w, ws) 
        
               hexVal w 
        
                 | 48 <= w && w <= 57 = Just $ w - 48 -- 0 - 9 
        
                 | 65 <= w && w <= 70 = Just $ w - 55 -- A - F 
        
                 | 97 <= w && w <= 102 = Just $ w - 87 -- a - f 
        
                 | otherwise = Nothing 
        
               combine :: Word8 -> Word8 -> Word8 
        
               combine a b = shiftL a 4 .|. b

uri-bytestring/src/URI/ByteString/Internal.hs

Lines 949 to 967 in 1c0134c

    
           -- | Percent-encoding for URLs. Specify a list of additional 
        
           -- unreserved characters to permit. 
        
           urlEncode :: [Word8] -> ByteString -> Builder 
        
           urlEncode extraUnreserved = mconcat . map encodeChar . BS.unpack 
        
             where 
        
               encodeChar ch 
        
                 | unreserved' ch = BB.fromWord8 ch 
        
                 | otherwise = h2 ch 
        
               unreserved' ch 
        
                 | ch >= 65 && ch <= 90 = True -- A-Z 
        
                 | ch >= 97 && ch <= 122 = True -- a-z 
        
                 | ch >= 48 && ch <= 57 = True -- 0-9 
        
               unreserved' c = c `elem` extraUnreserved 
        
               h2 v = let (a, b) = v `divMod` 16 in bs $ BS.pack [37, h a, h b] -- percent (%) 
        
               h i 
        
                 | i < 10 = 48 + i -- zero (0) 
        
                 | otherwise = 65 + i - 10 -- 65: A

urlEncode allows to pass extra unreserved characters which means it won't percent-encode those. And indeed, it seems + is passed to it for paths:

unreservedPath8 :: [Word8]
unreservedPath8 = unreserved8 ++ map ord8 ":@&=+$,"

urlEncodePath :: ByteString -> Builder
urlEncodePath = urlEncode unreservedPath8

normalizeRelativeRef :: URINormalizationOptions -> Maybe Scheme -> URIRef Relative -> Builder
normalizeRelativeRef o@URINormalizationOptions {..} mScheme RelativeRef {..} =
  authority <> path <> query <> fragment
  where
    path
      | unoSlashEmptyPath && BS.null rrPath = BB.fromByteString "/"
      | segs == [""] = BB.fromByteString "/"
      | otherwise = mconcat (intersperse (c8 '/') (map urlEncodePath segs))
-- ...

This isn't such a trivial decision either, because e.g. tel:+1-816-555-1212 might break, see https://www.rfc-editor.org/rfc/rfc3966.html#section-3

On the other hand, specific schemes may need more specific parsers as per RFC 3986.

Looking at https://timothygu.me/urltester/#input=%2F%252B I don't see any parser decoding %2B to +.

hasufell · 2023-12-31T18:02:59Z

I tried finding more information about the practices with percent encoding and the + and space relationship:

RFC 3986 does not say anything about replacing + with space... it's from the HTML standard on application/x-www-form-urlencoded content type... I believe this library has no business dealing with that
percent-encoding is specified here:
- https://www.rfc-editor.org/rfc/rfc3986#section-2.1
when to encode/decode is specified here:
- https://www.rfc-editor.org/rfc/rfc3986#section-2.4

What to percent encode can depend on the URI scheme, which have their own RFCs: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

The main spec says:

A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component.

So we only percent encode what is not in the allowed character set. For paths, usually consisting of pchar, the allowed set is:

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

      pct-encoded = "%" HEXDIG HEXDIG

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

So + is legal in the path and there's no reason to percent encode it.

As such, I think the expected behavior should be:

parsing must interpret percent encoded characters in all the sections that allow pct-encoded as per the BNF, so %2B must be parsed as + in paths
encoding (serializing) must not turn + into %2B, because + is legal in paths

I think this behavior is present in #64

hasufell · 2023-12-31T18:12:50Z

@kozross any opinion?

kozross · 2023-12-31T18:25:42Z

@hasufell - what does this have to do with me?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encoded + sign in URI #55

Encoded + sign in URI #55

janvogt commented Mar 6, 2020

hasufell commented Aug 5, 2022

hasufell commented Dec 5, 2022

hasufell commented Dec 31, 2023

hasufell commented Dec 31, 2023 •

edited

Loading

hasufell commented Dec 31, 2023

kozross commented Dec 31, 2023

Encoded + sign in URI #55

Encoded + sign in URI #55

Comments

janvogt commented Mar 6, 2020

hasufell commented Aug 5, 2022

hasufell commented Dec 5, 2022

hasufell commented Dec 31, 2023

hasufell commented Dec 31, 2023 • edited Loading

hasufell commented Dec 31, 2023

kozross commented Dec 31, 2023

hasufell commented Dec 31, 2023 •

edited

Loading