Rmk32 eol convention for input defaults to ANY, extend OPENSTREAM so that EOL can be specified as an "external format" #1785

Draft: wants to merge 3 commits into base: master


rmkaplan
Contributor

As per the technical meeting on 7/15/2024.

This sets the default EOL convention for input files to be ANY.

It also extends the possibilities for the externalformat parameter to OPENSTREAM. It can be a known format atom (e.g. :UTF-8) as before. But it can also be an EOL convention (CR, LF, CRLF, ANY) or a (format eolconvention) pair (e.g. (:XCCS LF)).
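To make the three accepted shapes concrete, here is a minimal Python model of the dispatch described above. It is only a sketch, not the Medley code: `parse_external_format` and the string stand-ins for Lisp atoms are hypothetical, and returning `None` for "use the system default" is an assumption.

```python
# Illustrative model only; names are hypothetical, not Medley's implementation.
EOL_CONVENTIONS = {"CR", "LF", "CRLF", "ANY"}


def parse_external_format(spec):
    """Return (format, eol_override).

    Either element is None when the caller did not specify it, meaning
    "fall back to the default" (an assumption of this sketch).
    """
    if spec is None:
        return None, None
    if isinstance(spec, (list, tuple)):
        # A (format eolconvention) pair, e.g. (:XCCS LF).
        if len(spec) != 2:
            raise ValueError(f"expected (format eol), got {spec!r}")
        fmt, eol = spec
        if eol not in EOL_CONVENTIONS:
            raise ValueError(f"unknown EOL convention: {eol!r}")
        return fmt, eol
    if spec in EOL_CONVENTIONS:
        # A bare EOL convention: override the EOL, keep the default format.
        return None, spec
    # Anything else is treated as a known format atom, as before.
    return spec, None
```

For example, `parse_external_format((":XCCS", "LF"))` yields `(":XCCS", "LF")`, while a bare `"LF"` yields `(None, "LF")`.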

The motivation for this extension is to sneak the EOL convention into the :EXTERNAL-FORMAT optional argument of CL:OPEN. The Common Lisp spec doesn't allow arbitrary opening parameters to be specified, so we trick it, at least for the EOL convention, by overloading the external format argument (essentially treating the EOL as a funky external format).

(* ; "Edited 6-Jul-2022 00:00 by rmk")
(* ; "Edited 19-Dec-2021 09:30 by rmk")
(* ; "Edited 14-Dec-2021 16:10 by rmk")
(* ; "Edited 13-Dec-2021 15:20 by rmk")
(* ; "Edited 29-Jun-2021 17:07 by rmk:")
(* ; "Edited 5-Oct-92 13:45 by jds")

(* ;; "RMK: July 2024: Default EOL to ANY on input streams, allow EXTERNAL FORMAT to be a (FORMAT EOL) list so CL:OPEN can get the EOL")
Contributor

Following the principle of "be liberal in what you accept, and conservative in what you generate", would it make sense for the EXTERNALFORMAT to be in proplist (:key val ...) format? That would allow the items to be in either order, and would establish the pattern for extending this in the future, if necessary.
Should that generalization be added to the implementation of CL:OPEN before it calls IL:OPENSTREAM? There it could also ensure the EOL symbols are in the IL: package, and put the values in the correct order.
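A rough Python sketch of that normalization step follows. The key names `:FORMAT` and `:EOL` are hypothetical (the thread does not fix them), and the package-interning step is left out because it has no Python analogue.

```python
# Illustrative sketch; the :FORMAT and :EOL key names are assumptions.
ALLOWED_KEYS = (":FORMAT", ":EOL")


def normalize_proplist(plist):
    """Turn a flat (:key val ...) list into a canonical (format, eol) pair."""
    if len(plist) % 2 != 0:
        raise ValueError("property list must have an even number of items")
    props = {}
    for key, value in zip(plist[0::2], plist[1::2]):
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown key: {key!r}")
        if key in props:
            raise ValueError(f"duplicate key: {key!r}")
        props[key] = value
    # Conservative output: always (format, eol), whatever the input order was.
    return props.get(":FORMAT"), props.get(":EOL")
```

Both `[":EOL", "LF", ":FORMAT", ":XCCS"]` and `[":FORMAT", ":XCCS", ":EOL", "LF"]` normalize to the same pair, so the callee only ever sees one ordering.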

Member

Consider the possibility that this change has too much flexibility, and that the flexibility means more error cases. What are the uses of EXTERNAL-FORMAT? When you are copying from one place to another, can you copy bytes instead of characters? (The ELEMENT-TYPE of Common Lisp streams can be BYTE or CHARACTER.)

A simpler-to-implement and more backward-compatible approach would be to get rid of EOL as a separate parameter and "bake" it into the EXTERNAL-FORMAT keyword:

We currently have :UTF-8 and :XCCS as the two frequent cases.
Declare that :UTF-8 implies EOL=LF and add (if you need it) :UTF-8-CR or :UTF-8-CRLF.
Declare that :XCCS implies EOL=CR on output and ANY on input.

Then you don't have to edit every place where a program assumes EQ can be used to answer whether two streams have the same EXTERNAL-FORMAT, which could happen anywhere.
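As a sketch of what "baking in" could look like, here is a small Python table in which each keyword fully determines the EOL behavior. The table layout and helper names are hypothetical, and treating :UTF-8-CR and :UTF-8-CRLF as applying in both directions is an assumption; plain `==` on strings stands in for EQ on atoms.

```python
# Hypothetical table: keyword -> (encoding, input EOL, output EOL).
BAKED_FORMATS = {
    ":UTF-8":      ("UTF-8", "LF", "LF"),
    ":UTF-8-CR":   ("UTF-8", "CR", "CR"),
    ":UTF-8-CRLF": ("UTF-8", "CRLF", "CRLF"),
    ":XCCS":       ("XCCS", "ANY", "CR"),
}


def eol_for(keyword, access):
    """Look up the EOL convention implied by a format keyword."""
    _encoding, input_eol, output_eol = BAKED_FORMATS[keyword]
    return input_eol if access == "INPUT" else output_eol


def same_external_format(a, b):
    """A single atom comparison is enough, since nothing else varies."""
    return a == b
```

With this shape, `eol_for(":XCCS", "INPUT")` is `"ANY"` and `eol_for(":XCCS", "OUTPUT")` is `"CR"`, and callers never need to compare a separate EOL field.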

Contributor Author

If you want to copy bytes, use COPYBYTES. If you want to copy characters, use COPYCHARS (which will convert the bytes from one format to another). Does commonlisp specify a function that branches on the element-type? It should choose which of the subfunctions to call.
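The distinction can be modeled in a few lines of Python. This is only an analogy: `copy_bytes`, `copy_chars` and `copy_stream` are hypothetical stand-ins for COPYBYTES, COPYCHARS and a dispatcher on element type, and Latin-1/UTF-8 are used because Python has no XCCS codec.

```python
import io


def copy_bytes(src, dst):
    """Byte copy: the data arrives unchanged, whatever it encodes."""
    dst.write(src.read())


def copy_chars(src, dst, src_encoding, dst_encoding):
    """Character copy: decode from one format, re-encode in the other."""
    text = src.read().decode(src_encoding)
    dst.write(text.encode(dst_encoding))


def copy_stream(src, dst, element_type,
                src_encoding="latin-1", dst_encoding="utf-8"):
    """Hypothetical dispatcher that branches on the element type."""
    if element_type == "BYTE":
        copy_bytes(src, dst)
    elif element_type == "CHARACTER":
        copy_chars(src, dst, src_encoding, dst_encoding)
    else:
        raise ValueError(f"unknown element type: {element_type!r}")


if __name__ == "__main__":
    data = "é".encode("latin-1")
    out = io.BytesIO()
    copy_stream(io.BytesIO(data), out, "CHARACTER")
    print(out.getvalue())  # the same character, now encoded as UTF-8
```

The byte copy is only safe when source and destination share the same format and EOL convention; otherwise the character copy is the one that converts.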

Each external format already has its own default EOL convention. This extension is for the case where for whatever reason the user wants to override that. For OPENSTREAM the override can be passed as a separate parameter, but CL:OPEN doesn't allow for that kind of additional specification. This is all about sneaking that in without doing more serious damage.

This doesn't affect what is returned as the external-format STREAMPROP of the stream, it's always an EQ-able atom. It's just that if the EOL convention had been changed from its default, the property in the external format wouldn't be accurate.

In Interlisp the function STREAMPROP can be used to change the format and the EOL separately, after the open. Does Common Lisp support that kind of operation? (Another use case for STREAMPROP: the ENDOFSTREAMOP as a stream property rather than something that has to be specified on each input operation. Does Common Lisp support that?)

I probably don't yet have the correct logic for the EOL convention of external formats, as we transition to ANY as the default for input streams. At open the ANY should be installed for input streams even if the format specifies one of the specific conventions. The format's convention should apply by default only to output streams. If the user really wants a specific format on input, then an override should be applied (at open or by STREAMPROP).
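The intended rule can be written down as a small decision function. This Python sketch is a model of the paragraph above, not the actual logic in the PR; the per-format defaults in the table are assumptions, and only the INPUT access mode is treated as an input stream here.

```python
# Assumed per-format defaults, for illustration only.
FORMAT_DEFAULT_EOL = {":UTF-8": "LF", ":XCCS": "CR"}


def effective_eol(fmt, access, override=None):
    """Pick the EOL convention to install when a stream is opened."""
    if override is not None:
        # An explicit override (at open, or later via STREAMPROP) always wins.
        return override
    if access == "INPUT":
        # Input streams get ANY even if the format names a specific convention.
        return "ANY"
    # The format's own convention applies by default only to output.
    return FORMAT_DEFAULT_EOL[fmt]
```

So `effective_eol(":UTF-8", "INPUT")` gives `"ANY"`, while `effective_eol(":UTF-8", "INPUT", override="LF")` gives `"LF"` for a user who really wants a specific convention on input.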

(BTW, in the original, inherited implementation of external formats there was a flag EOLVALID. I don't understand the use case for that, and it isn't fetched anywhere in our core directories. But I left it in.)

Member

Last week we decided to investigate some different options for ANY: find the first EOL and use that interpretation throughout.
Two things to think about: First EOL with input = ANY means you can do COPYBYTES.
Second, use EXTERNALFORMAT for EOL convention.

:FIRST-USE? Copychars vs copybytes. We're moving this to Draft.
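For discussion, here is a Python sketch of the "find the first EOL and use it throughout" option. It assumes a byte-oriented encoding in which bytes 13 and 10 occur only as CR and LF, and it looks at a whole buffer at once; a real stream would also have to handle a CR that is the last byte read so far. The function names are hypothetical.

```python
CR, LF = 0x0D, 0x0A
SEPARATORS = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n"}


def detect_first_eol(data):
    """Return "CR", "LF" or "CRLF" for the first EOL found, else None."""
    for i, byte in enumerate(data):
        if byte == LF:
            return "LF"
        if byte == CR:
            if i + 1 < len(data) and data[i + 1] == LF:
                return "CRLF"
            return "CR"
    return None


def split_lines_first_use(data):
    """Split using only the convention of the first EOL in the data."""
    eol = detect_first_eol(data)
    if eol is None:
        return [data]
    return data.split(SEPARATORS[eol])
```

Note the consequence for mixed files: once CRLF is chosen from the first line, a later bare LF is treated as ordinary data rather than as a line break.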

@masinter marked this pull request as draft July 29, 2024 22:32
@MattHeffron
Contributor

Has any additional work been done on this?
(Just asking...)

@rmkaplan
Contributor Author

Nothing more has been done. I believe that the next step is to add another 2-bit field to the STREAM datatype (beyond the part that Maiko knows about) to hold the actual EOL convention that is detected when the file is read as ANY. This is so that COPYCHARS can preserve the original EOL convention of the characters, and even be consistent if the EOL convention changes across the file.

@rmkaplan
Contributor Author

BTW, there is a long related discussion at issue #345

3 participants