Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic PSL formatting: what fixes should be applied? #2028

Closed
danderson opened this issue Jul 6, 2024 · 8 comments
Closed

Automatic PSL formatting: what fixes should be applied? #2028

danderson opened this issue Jul 6, 2024 · 8 comments

Comments

@danderson
Copy link
Contributor

I have a WIP that can automatically fix formatting issues in a PSL file, as well as apply machine edits while preserving correct formatting. It will take a few more PRs to merge it all, but you can peek at https://github.com/danderson/list/tree/danderson/psl-parser-format .

The idea is to give contributors a tool that they run before sending a PR, and it (a) tells them everything that they need to fix, but also (b) just automatically fixes things that don't require human judgement, like "did you sort these strings correctly". This also makes automated editing easier, because scripts can operate on an abstract syntax tree, and not have to worry about exactly how to lay out text bytes. Just adjust AST nodes, then validate+format+write the file back out and get a clean diff.

I'm opening this issue to get feedback from PSL maintainers: what formatting fixes do you want me to implement? @dnsguru @simon-friedberger

I've already implemented/thought of several, listed below. Each one requires a one-time PSL reformat to fix existing issues, with diffs ranging from <10 lines to >1000 lines. After that, we can require new contributions to run the formatter before sending the PR (and enforce it in CI), and there will be no random churn.

Each of the listed things is independent from the others, you can pick any combination. These changes are all purely cosmetic, and do not change the set of public suffixes, or the meaning of any comments (modulo bugs of course, but I'm confident I can reduce the risk of that with tests, pre/post format sanity checks, etc.).

  1. Basic consistency edits: remove leading/trailing whitespace, list wildcard exceptions immediately after corresponding wildcard, consistent inter-block spacing: patch

    • Technically this one does 3 independent things right now. I can separate them if you only want a subset. These are the changes that are more costly to not do, because avoiding this diff means I have to make the parser more complex to preserve the inconsistencies.
  2. Sort blocks in the private section, by company/product/owner name. This automates Volunteer support: comment on PRs that have Sorting wrong #1858, or at least one half of it. Somewhat large change, but not unjustified: all those blocks are indeed in the wrong order today. patch

  3. Sort suffixes in the private section in addition to the blocks. This is the other half of Volunteer support: comment on PRs that have Sorting wrong #1858. Patch for this one is tiny, 2 suffixes move by a few lines. patch

  4. Sort suffixes in the ICANN section. Large initial diff, due to the older TLD/ccTLD blocks with hand-curated information, and order changes in non-Latin scripts (I spot-checked some diffs with native speakers, they agree the new order is "correct" for their languages). patch

  5. Canonicalize metadata comments in the private section. The format I selected for the diff is "the most popular current format", to get a minimal diff. We can pick any format we like though. patch

    • The formatter only changes a comment when it has high confidence that it's not altering the meaning. This still has bugs in my WIP and some comments get mangled, please assume I will fix them and verify the diff if you want this reformat.
  6. Sort blocks in the ICANN section. I haven't tried this yet, but I assume the diff is going to be massive, and it needs coordination with the GTLD update script so they don't fight each other. But it's possible and pretty easy to implement, if you want it.

FWIW, my personal opinion:

  • 1+2+3 are low impact and seem obviously good to reduce maintainer workload.
  • I would like to do 4+6 to make the ICANN section clean and nice to machine-edit as well, but those 1-time diffs are very big, and I need to do more experimenting/verification to make sure the changes are correct. Not as high priority IMO.
  • I would like to do 5, but use the opportunity to switch over to a different "standard" comment format that is easier to parse (but still human-readable). That will also mean a large diff. I'm willing to do the necessary work, but IMO it's lower priority than other things on my todolist.
@simon-friedberger
Copy link
Contributor

I agree with what I think you're proposing. Let's do the private section bits first and do all of them. Let's figure out how to integrate this with the already automatic script for the ICANN section later.

@danderson
Copy link
Contributor Author

Acknowledged. I've set the initial configuration to focus on the private section, and we can look at the icann block later.

@dnsguru
Copy link
Member

dnsguru commented Aug 12, 2024

@danderson looks like there is a strange trailing crlf just before the ICANN section close that is causing the automation to trigger a PR daily.

Were you the last to touch this? Can you look at that?

@danderson
Copy link
Contributor Author

Likely related to the reformatting change, yes. I'm on it, apologies for the noise.

@dnsguru
Copy link
Member

dnsguru commented Aug 13, 2024

all good. your contributions towards evolving this are app appreciated immensely and things are never perfect in change.

please @ me on merges on the automaion

@danderson
Copy link
Contributor Author

danderson commented Aug 13, 2024

Looking at the timings on the git history, I believe #2093 fixed this, along with the submission of #2103. Timeline (times in US Pacific timezone because that's what github's giving me):

Bottom line is this mess was caused by my neglecting to adjust the gtld updater's output to match the reformat done in #2088, which caused the cronjob to keep trying to make the diff go away until Simon merged a fix. And then one last noisy change to make that newline go away once and for all. I'll be more careful in future, and will add "check what the automation runs would do" to my PR checklist.

The good news is the noise should be over now. The gtld updater and 'psltool fmt' both agree on what the file layout should be, so there should be no more spurious PRs.

@dnsguru
Copy link
Member

dnsguru commented Aug 13, 2024

ok - another thing to watch is that the date will always be different in ICANN's json file as they generate it daily, so maybe ignore it if it is the sole diff

@danderson
Copy link
Contributor Author

The gtld updater already handles that gracefully, thankfully. The timestamped header isn't included in the "has this changed" check, so something other than the timestamp has to change for it to write a new file and create a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants