update readme, add overview, clean up tests

tgbugs · Dec 2, 2021 · c90700b
1 parent 6f3987b
commit c90700b
Showing 6 changed files with 135 additions and 12 deletions.
diff --git a/README.org b/README.org
@@ -8,15 +8,14 @@ An attempt to specify a formal grammar for [[https://orgmode.org/worg/dev/org-sy
 
 The grammar itself lives in [[file:./laundry/parser.rkt][parser.rkt]]. It is implemented using Racket's \\
 [[https://docs.racket-lang.org/brag/#%28part._.The_language%29][#lang brag]]. The details of the implementation are in the comments.
+
+For an overview of the approach see [[file:./docs/overview.org]].
 * Getting started
 Install [[https://download.racket-lang.org/][Racket]] for your platform.
 From the location of this readme run the following.
 #+begin_src bash
-raco pkg install laundry/
+raco pkg install laundry/ org/
 #+end_src
-There is another package named =org-mode= in the Racket package
-manager so the trailing slash is required. Eventually we'll get around
-to working out the naming issues.
 
 Once everything is installed you can run the tests by invoking the
 following in the directory of this readme.
@@ -26,15 +25,15 @@ raco test laundry
 
 You can also parse individual Org files using [[file:./laundry/cli.rkt]].
 #+begin_src bash :results drawer
-laundry/cli.rkt docs/thoughts.org org-mode/test.org
+laundry/cli.rkt docs/thoughts.org laundry/test.org
 #+end_src
 * Status
-Most of the core elements of Org syntax have been fully specified,
-however other important forms such as the markup syntax have not been
-fully implemented.
+Laundry can parse most of Org syntax, though there are still issues
+with the correctness of the parse in a number of cases.
 
-The second pass operations needed to define Org semantics has not been
-implemented yet.
+In particular there are a number of edge cases in the interaction
+between the syntax for various Org objects that have not been
+resolved.
 * Objectives
 The primary objective of this work is to provide a reference grammar
 and implementation that can be used to test other implementations of

diff --git a/docs/overview.org b/docs/overview.org
@@ -0,0 +1,101 @@
+#+title: Overview
+* Current approach
+There are currently two Org syntax parsers implemented in laundry.
+The first produces a canonical abstract syntax tree, the second is
+used for DrRacket syntax highlighting. Their approaches are slightly
+different due to the fact that the tokenizer for DrRacket must
+conform to more restricted behavior required for syntax highlighting.
+
+In principle the approach used for DrRacket seems like it could be
+more efficient than the one that is used in the AST parser. This is
+because the AST parser has a step where it reassembles whole sections
+before passing them to a nested tokenizer/parser.
+
+The current implementation of the Org syntax parser in laundry works
+as follows. The full parsing process involves multiple nested passes
+that all have the same basic steps: tokenize, parse, expand. Nested
+steps are triggered during the expansion step where the raw stream is
+reconstructed prior to being passed to the nested parser. This all
+happens at compile time during syntax expansion.
+
+The current approach makes a trade off to push significant complexity
+into the tokenization step and to use nested tokenizers/parsers. Many
+of the tokenization patterns are shared between colorization and AST.
+However as mentioned, the approach in the AST is not optimal.
+
+Having tried the "put everything in the grammar" approach to parsing
+Org, I can state that it does not work for LR parsers due to numerous
+cases of ambiguity in such a grammar. There were also nasty
+performance issues with trying to implement everything in the grammar,
+though that may be an issue in the implementation of the grammar
+library.
+
+* Considerations
+The primary challenge in parsing Org syntax is that the formal
+grammar must not be an ambiguous grammar. The first attempt at
+a formal grammar was essentially a direct translation of the
+Org syntax specification document, however such a translation
+produces an ambiguous grammar. This is a well known issue with
+semi-formal language specifications [fn:: See the note in the
+[[https://tree-sitter.github.io/tree-sitter/creating-parsers#writing-the-grammar][tree-sitter documentation]]].
+
+The alternative is to use a parser/grammar that is more powerful than
+an LR parser/traditional eBNF grammar. One example would be to
+use a PEG grammar which avoids ambiguity via arbitrary lookahead.
+
+While such an approach is of interest, providing an unambiguous eBNF
+grammar that can be used with widely implemented LR-style parsers is
+of significant interest due to the fact that they are available in
+many languages.
+
+Without having actually attempted to write a PEG grammar and tokenizer
+for Org, it seems that the key trade off between using an LR vs PEG
+parser when specifying Org with respect to implementation complexity
+is likely to be that the tokenizer for the LR parser needs to do
+significantly more heavy lifting in order to avoid ambiguous
+constructions. In addition, it seems that in order for an LR parser to
+avoid ambiguity a single eBNF grammar cannot be used, instead nested
+subgrammars must be used since each requires a different tokenizer due
+to the fact that the top level tokenizer must be written in such a way
+as to avoid ambiguous parses in the grammar.
+* How to parse Org syntax
+There are 4 axes which have to be considered when parsing Org syntax.
+
+1. tokenizer/lexer complexity
+2. parser complexity
+3. nested grammars requiring multiple passes/phases of parsing, and
+   how parallel the parser can be
+4. newline first vs newline last, bof and eof
+
+I have explored some of the space defined by these dimensions.
+
+Using nested grammars is attractive, because it vastly simplifies the
+implementation of nested forms when by construction they do not have
+to worry about interactions with a form that has high
+priority. Consider for example the interaction between headings and
+source blocks.
+
+On the other hand, if your lexer supports certain features, it is possible
+to handle high priority forms if you construct your tokens with great care.
+
+However, this leads to grammars and tokenizers that are harder to
+understand.
+
+Org syntax also has many ambiguities. These often cannot be dealt with
+sufficiently in the grammar, and worse, apparently correct behavior
+becomes the result of quirks in the implementation rather than the
+result of a correct and unambiguous grammar.
+
+More grammars with less complexity? Or fewer grammars with more complexity?
+
+In a newline first grammar headings must be parsed as a subgrammar
+because the start of the line and the end of the line must be
+identified at all times because they induce major changes on the
+interpretation of forms, leading to e.g. the inability to correctly
+parse tags without having either a newline or an explicit eof token
+that can be matched.
+
+If your tokenizer cannot emit/detect bof/eof for inclusion in the
+grammar, you are going to have a bad time, because a regular Org
+grammar needs newlines at both ends. That said, a few more iterations
+may reveal that this does not necessarily have to be the case.
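Note on the bof/eof point at the end of overview.org: the sketch below is not laundry's tokenizer (which is written in Racket); it is a minimal Python illustration, under assumed names, of what a newline-first tokenizer with explicit boundary tokens might look like. `tokenize`, `HEADING`, and the token names `BOF`, `EOF`, `HEADING`, and `LINE` are all hypothetical. The idea is that every line is matched together with the newline that precedes it, so the input is padded with a synthetic leading newline and the end of input is reported as its own token.

```python
import re
from typing import Iterator, Tuple

Token = Tuple[str, str]

# Newline-first: a heading is a newline, one or more stars, a space,
# then the rest of the line. (Hypothetical, much simpler than real Org.)
HEADING = re.compile(r"\n(\*+) ([^\n]*)")


def tokenize(text: str) -> Iterator[Token]:
    """Yield (kind, value) tokens, with explicit BOF and EOF markers."""
    yield ("BOF", "")
    # Synthetic leading newline so the first line looks like every other line.
    padded = "\n" + text
    pos = 0
    # Invariant: pos always points at a newline character or at the end.
    while pos < len(padded):
        match = HEADING.match(padded, pos)
        if match:
            # Drop the leading newline from the reported value.
            yield ("HEADING", match.group(0)[1:])
            pos = match.end()
            continue
        # Not a heading: consume the newline and the rest of the line.
        end = padded.find("\n", pos + 1)
        if end == -1:
            end = len(padded)
        yield ("LINE", padded[pos + 1:end])
        pos = end
    # Explicit end marker, so a grammar can match a final line that has
    # no trailing newline.
    yield ("EOF", "")
```

With this shape a heading on the very first line and a last line without a trailing newline are both handled by the same rules as any other line, which is the property the overview says a grammar needs when it cannot otherwise see the start and end of the file.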
diff --git a/laundry/block.rkt b/laundry/block.rkt
@@ -109,3 +109,10 @@ block-type-rest : @not-whitespace
 ; TODO other escape sequences
 ;string-contents : ( @not-bs-dq | /BS DQ | BS )+
 
+; make tests pass
+wsnn : NOP
+newline : NOP
+nlpws : NOP
+blk-line-contents : NOP
+no-headlines : NOP
+not-whitespace : NOP
diff --git a/laundry/expander.rkt b/laundry/expander.rkt
@@ -319,6 +319,8 @@
   tags ; XXX internal should probably be a struct field
   archive
 
+  planning-malformed
+
   plain-list-element
   plain-list-line
   descriptive-list-line

diff --git a/laundry/plain-list.rkt b/laundry/plain-list.rkt
@@ -60,3 +60,16 @@ cb-done : CHAR-UPPER-X ; AAAAAAAAAAAAAAAAAAAAAAAA ;_; complexity XXX
 plain-list-tag-text : @not-newline
 plain-list-tag : pl-tag-end
 pl-tag-end : COLON COLON
+
+; make tests pass
+not-newline : NOP
+not-rsb-newline : NOP
+digits : NOP
+not-digits-newline : NOP
+not-at-newline1 : NOP
+not-lsb-at-digits-newline : NOP
+not-rsb-newline1 : NOP
+not-space-X-hyphen-newline1 : NOP
+not-lsb-space-X-hyphen-newline : NOP
+space : NOP
+not-colon-newline : NOP
diff --git a/laundry/test.rkt b/laundry/test.rkt
@@ -696,7 +696,7 @@
   (laundry-tokenizer-debug #f)
 
   (dotest "  #+begin_src")
-  (dotest "  #+begin_srclol" #:node-type 'paragraph)
+  (dotest "  #+begin_srclol" #:node-type 'paragraph) ; FIXME ... eof issue ?
   (dotest "  #+begin_src\n")
   (dotest "  #+begin_")
   (dotest "  #+begin_-") ; -> block
@@ -1229,6 +1229,7 @@ echo oops a block
       (values count batched)))
 
   (define (make-test-strings)
+    #f
     )
   )
 
@@ -2331,7 +2332,7 @@ don't affilaite to other unaff keyword
 
 (define h-l1
   '(org-file
-    (headline-node (heading 1)))) ; don't have to have (tags) if there are no tags
+    (headline-node (heading 1 "\n")))) ; don't have to have (tags) if there are no tags
 
 (module+ test-sentinel
   (current-module-path)