Consider making handling of CR LF newlines more consistent with Gawk #51

Open
benhoyt opened this issue Feb 23, 2021 · 4 comments

benhoyt commented Feb 23, 2021

Per discussion on issue #33 (from here down), GoAWK handles CR LF (Windows) line endings differently from gawk (I haven't tried awk or mawk). GoAWK doesn't include the CR in the field (because it's part of the line ending), whereas Gawk does. I'm not sure if there are differences between Gawk's handling on Windows and Linux.

I kinda think the GoAWK approach is more sensible and platform-native, but consistency with other AWKs is good too ... worth thinking about further.
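To check which behavior a given awk has, a small probe can feed it one CR LF terminated line and look at the length of $0. This is only a sketch: `probe_cr` is a hypothetical helper name, and it assumes the awk under test is on the PATH.

```shell
# Hypothetical helper: reports whether an awk keeps the CR of a CR LF line in $0.
# Usage: probe_cr [awk-command]   (defaults to "awk")
probe_cr() {
    # The input is the two bytes "x\r" followed by a newline.
    # length($0) is 2 if the CR stays in the record, 1 if it was stripped.
    printf 'x\r\n' | "${1:-awk}" '{ print (length($0) == 2 ? "CR kept" : "CR stripped") }'
}

probe_cr
```

Going by the description above, `probe_cr goawk` should report "CR stripped" and `probe_cr gawk` on Linux should report "CR kept".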

Arnold Robbins said this:

Gawk is consistent. RS has the default value of \n and that is what terminates records. As far as gawk is concerned, the \r is no different from any other character, which is why it appears as part of the last field in the record.

That said, on Windows, I believe the default is to work in text mode, in which case gawk never sees the \r\n line ending, it only sees \n. One can use BINMODE to force gawk to see those characters, in which case you would need to set RS = "\r?\n" in order to get correct processing.

Take the Windows advice with a grain of salt. I have not used a Windows system directly in over two years, and when I did I used Cygwin, so some experimentation may be in order.

If one is processing a Windows file on Linux, then one should use a utility like dos2unix on the file, or tr, before sending the data to GoAwk, which does not (yet! hint, hint) allow RS to be a regular expression. Using GoAwk on Windows, well, you'll have to figure out what the Go runtime is handing off to your code.
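The preprocessing suggested above might look like the following sketch. `awk_nocr` is a hypothetical wrapper name, not part of any awk; it uses `tr` because it is available everywhere, and `dos2unix` could be substituted where it is installed.

```shell
# Hypothetical wrapper: delete every CR byte, then run awk with the given arguments.
# Caveat: this also removes CRs that are not part of a CR LF pair.
awk_nocr() {
    tr -d '\r' | awk "$@"
}

# Both second fields are one character long once the CRs are gone.
printf 'a b\r\nc d\r\n' | awk_nocr '{ print length($2) }'
```

Because the CR never reaches awk, the result no longer depends on which implementation runs the script.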

benhoyt commented Dec 23, 2021

I've thought about this a bit more, and I prefer the GoAWK behavior here, so I'm going to stick with it for now. Including the CR in the field seems against the spirit of FS=" " splitting the fields on whitespace and stripping the whitespace.

@benhoyt benhoyt closed this as completed Dec 23, 2021

ko1nksm commented Jun 3, 2022

I am confused by this spec of goawk.

With the exception of goawk, other awk implementations are consistent in their handling of newline characters. (Testing is done on Ubuntu 20.04)

$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

If you prefer the GoAWK behavior, how about setting the default value of RS to \r?\n?

$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

# See POSIX documentation below (nawk: awk version 20121220)
$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 0a 42 43 0a 44                                 |A.BC.D|

# It is fixed in the macOS 11.6.5 version of nawk (nawk: awk version 20200816)
$ printf "A\r\nB\rC\nD" | /usr/bin/awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

RS
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

In my opinion, portability is important. And in any case, we need a way to treat \r as a normal character for compatibility.
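On the portability point: since POSIX leaves a multi-character RS unspecified, one option that stays within specified behavior is to keep the default RS and strip a single trailing CR from each record inside the script. A minimal sketch, with `strip_trailing_cr` as a hypothetical helper name:

```shell
# Hypothetical helper: keep the default RS and drop one trailing CR per record.
# Modifying $0 makes awk re-split the fields, so the last field loses the CR too.
strip_trailing_cr() {
    awk '{ sub(/\r$/, ""); printf "%s", $0 }'
}

# Same input as the tests above; od shows the resulting bytes.
printf 'A\r\nB\rC\nD' | strip_trailing_cr | od -An -tx1
```

This should give the bytes 41 42 0d 43 44, matching the RS="\r?\n" results above: the CR in front of the LF is removed and the lone CR inside B\rC is left alone. In an awk that already drops the CR, the `sub` call is simply a no-op.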

benhoyt commented Jun 4, 2022

Thanks, I'm going to reopen this issue to revisit this.

@benhoyt benhoyt reopened this Jun 4, 2022
mikegleen commented

I don't want to have to care whether input text comes with \n or \r\n at the end of lines. And goawk makes this dream come true. With normal awk I can get code working on a unix-like system, deploy it to Windows (or process a file from Windows) and watch it crash and burn. Having to remember to say BEGIN{RS="\r?\n"} in every script is not a good solution.
