update readme, add overview, clean up tests
tgbugs committed Dec 2, 2021
1 parent 6f3987b commit c90700b
Showing 6 changed files with 135 additions and 12 deletions.
19 changes: 9 additions & 10 deletions README.org
@@ -8,15 +8,14 @@ An attempt to specify a formal grammar for [[https://orgmode.org/worg/dev/org-sy

The grammar itself lives in [[file:./laundry/parser.rkt][parser.rkt]]. It is implemented using Racket's \\
[[https://docs.racket-lang.org/brag/#%28part._.The_language%29][#lang brag]]. The details of the implementation are in the comments.

For an overview of the approach see [[file:./docs/overview.org]].
* Getting started
Install [[https://download.racket-lang.org/][Racket]] for your platform.
From the location of this readme run the following.
#+begin_src bash
raco pkg install laundry/
raco pkg install laundry/ org/
#+end_src
There is another package named =org-mode= in the Racket package
manager so the trailing slash is required. Eventually we'll get around
to working out the naming issues.

Once everything is installed you can run the tests by invoking the
following in the directory of this readme.
@@ -26,15 +25,15 @@ raco test laundry

You can also parse individual Org files using [[file:./laundry/cli.rkt]].
#+begin_src bash :results drawer
laundry/cli.rkt docs/thoughts.org org-mode/test.org
laundry/cli.rkt docs/thoughts.org laundry/test.org
#+end_src
* Status
Most of the core elements of Org syntax have been fully specified,
however other important forms such as the markup syntax have not been
fully implemented.
Laundry can parse most of Org syntax, though there are still issues
with the correctness of the parse in a number of cases.

The second pass operations needed to define Org semantics has not been
implemented yet.
In particular there are a number of edge cases in the interaction
between the syntax for various Org objects that have not been
resolved.
* Objectives
The primary objective of this work is to provide a reference grammar
and implementation that can be used to test other implementations of
101 changes: 101 additions & 0 deletions docs/overview.org
@@ -0,0 +1,101 @@
#+title: Overview
* Current approach
There are currently two Org syntax parsers implemented in laundry.
The first produces a canonical abstract syntax tree (AST), the second is
used for DrRacket syntax highlighting. Their approaches are slightly
different because the tokenizer for DrRacket must conform to the more
restricted behavior required for syntax highlighting.

In principle the approach used for DrRacket seems like it could be
more efficient than the one that is used in the AST parser. This is
because the AST parser has a step where it reassembles whole sections
before passing them to a nested tokenizer/parser.

The current implementation of the Org syntax parser in laundry works
as follows. The full parsing process involves multiple nested passes
that all have the same basic steps: tokenize, parse, expand. Nested
steps are triggered during the expansion step where the raw stream is
reconstructed prior to being passed to the nested parser. This all
happens at compile time during syntax expansion.
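
The tokenize, parse, expand cycle described above can be sketched roughly
as follows. This is a minimal illustration in Python rather than laundry's
Racket. The names =tokenize_top=, =parse_top=, =expand= and =parse_section=
are hypothetical, the nested section grammar (paragraphs separated by blank
lines) is only a stand-in, and everything here runs at run time rather than
during syntax expansion.

```python
import re

# Hypothetical top-level token pattern: a heading is stars, a space, a title.
HEADING = re.compile(r"^(\*+) (.*)$", re.MULTILINE)


def tokenize_top(text):
    """Top-level tokenizer: emit heading tokens and raw section tokens."""
    pos = 0
    for m in HEADING.finditer(text):
        if m.start() > pos:
            yield ("section", text[pos:m.start()])
        yield ("heading", len(m.group(1)), m.group(2))
        pos = m.end() + 1  # skip the newline that ends the heading line
    if pos < len(text):
        yield ("section", text[pos:])


def parse_top(tokens):
    """Top-level parser: here just a flat tree of the top-level tokens."""
    return ["org-file", *tokens]


def parse_section(raw):
    """Nested pass: tokenize and parse one reassembled section."""
    blocks = [b for b in re.split(r"\n[ \t]*\n", raw.strip("\n")) if b]
    return [("paragraph", b) for b in blocks]


def expand(tree):
    """Expansion: hand each raw section string to the nested parser."""
    out = [tree[0]]
    for node in tree[1:]:
        if node[0] == "section":
            out.append(("section", parse_section(node[1])))
        else:
            out.append(node)
    return out


def parse_org(text):
    return expand(parse_top(tokenize_top(text)))
```

The nested parser only ever sees the text of a single section, which is
the reassembly step that the AST parser pays for.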

The current approach makes a tradeoff: it pushes significant complexity
into the tokenization step and uses nested tokenizers/parsers. Many
of the tokenization patterns are shared between colorization and AST.
However, as mentioned above, the approach used for the AST is not optimal.

Having tried the "put everything in the grammar" approach to parsing
Org, I can state that it does not work for LR parsers due to numerous
cases of ambiguity in such a grammar. There were also nasty
performance issues with trying to implement everything in the grammar,
though that may be an issue in the implementation of the grammar
library.

* Considerations
The primary challenge in parsing Org syntax is that the formal
grammar must not be an ambiguous grammar. The first attempt at
a formal grammar was essentially a direct translation of the
Org syntax specification document, however such a translation
produces an ambiguous grammar. This is a well known issue with
semi-formal language specifications [fn:: See the note in the
[[https://tree-sitter.github.io/tree-sitter/creating-parsers#writing-the-grammar][tree-sitter documentation]]].

The alternative is to use a parser/grammar that is more powerful than
an LR parser/traditional eBNF grammar. One example would be to
use a PEG grammar, which avoids ambiguity via ordered choice and
arbitrary lookahead.
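
As a rough illustration of how a PEG sidesteps ambiguity, the sketch below
uses ordered choice: alternatives are tried in order and the first one that
succeeds wins, so a line that could be read as either a heading or a
paragraph line has exactly one parse. It is written in Python, not taken
from laundry, and the names =heading=, =paragraph_line= and
=ordered_choice= are hypothetical.

```python
import re


def heading(line):
    """Match stars, one space, then the title; return None on failure."""
    m = re.fullmatch(r"(\*+) (.*)", line)
    return ("heading", len(m.group(1)), m.group(2)) if m else None


def paragraph_line(line):
    """Fallback rule: any line at all is a paragraph line."""
    return ("paragraph-line", line)


def ordered_choice(*alternatives):
    """PEG-style choice: the first alternative that succeeds wins."""
    def parse(line):
        for alt in alternatives:
            result = alt(line)
            if result is not None:
                return result
        return None
    return parse


# "heading / paragraph_line" in PEG notation.
element = ordered_choice(heading, paragraph_line)
```

In an eBNF grammar where paragraph text may also begin with an asterisk,
both rules would match =* foo= and the grammar would be ambiguous. Here the
ordering decides.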

While such an approach is of interest, providing an unambiguous eBNF
grammar that can be used with widely implemented LR-style parsers is
of significant interest because such parsers are available in
many languages.

Without having actually attempted to write a PEG grammar and tokenizer
for Org, it seems that the key tradeoff in implementation complexity
between an LR and a PEG parser is likely to be that the tokenizer for
the LR parser needs to do significantly more heavy lifting in order to
avoid ambiguous constructions. In addition, it seems that an LR parser
cannot avoid ambiguity with a single eBNF grammar. Instead, nested
subgrammars must be used, and each requires a different tokenizer
because the top level tokenizer must be written in such a way as to
avoid ambiguous parses in the grammar.
* How to parse Org syntax
There are four axes which have to be considered when parsing Org syntax.

1. tokenizer/lexer complexity
2. parser complexity
3. nested grammars requiring multiple passes/phases of parsing, and
   how parallel the parser can be
4. newline first vs newline last, bof and eof

I have explored some of the space defined by these dimensions.

Using nested grammars is attractive because it vastly simplifies the
implementation of nested forms: by construction they do not have
to worry about interactions with a form that has higher
priority. Consider for example the interaction between headings and
source blocks.
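
The heading and source block interaction can be sketched as below, assuming
that an unescaped heading line always takes priority over an enclosing
block. The first pass cuts the text at every heading, so the nested block
pattern never has to know about headings. This is a Python illustration
with hypothetical names (=split_sections=, =has_complete_src_block=), not
laundry's implementation.

```python
import re

# A heading line: one or more stars at the start of a line, then a space.
HEADING = re.compile(r"^\*+ .*$", re.MULTILINE)

# A complete source block: begin line, any lines, then an end line.
SRC_BLOCK = re.compile(
    r"^[ \t]*#\+begin_src.*\n(?:.*\n)*?[ \t]*#\+end_src[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def split_sections(text):
    """First pass: headings win, so cut the text at every heading line."""
    starts = [m.start() for m in HEADING.finditer(text)]
    bounds = [0, *starts, len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


def has_complete_src_block(section):
    """Nested pass: look for a block inside a single section only."""
    return SRC_BLOCK.search(section) is not None
```

Run on the whole text, the block pattern would accept a block that contains
an unescaped =* heading= line. Run per section, it cannot, because the
begin and end lines land in different sections. A line escaped as =,*= is
not a heading, so such a block stays intact.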

On the other hand, if your lexer supports certain features, it is possible
to handle high priority forms if you construct your tokens with great care.

However, this leads to grammars and tokenizers that are harder to
understand.

Org syntax also has many ambiguities. These often cannot be dealt with
sufficiently in the grammar, and worse, apparently correct behavior
becomes the result of quirks in the implementation rather than as a
result of a correct and unambiguous grammar.

More grammars with less complexity? Or fewer grammars with more complexity?

In a newline first grammar headings must be parsed as a subgrammar
because the start of the line and the end of the line must be
identified at all times because they induce major changes on the
interpretation of forms, leading to e.g. the inability to correctly
parse tags without having either a newline or an explicit eof token
that can be matched.

If your tokenizer cannot emit/detect bof/eof for inclusion in the
grammar, you are going to have a bad time, because a regular Org
grammar needs newlines at both ends. That said, a few more iterations
may reveal that this does not necessarily have to be the case.
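
One way to picture the bof/eof problem is the sketch below: a newline first
heading pattern that also requires a newline after the tags matches nothing
on a bare one-line input, but works once sentinel newlines stand in for bof
and eof. This is a Python illustration under those assumptions. The pattern
=HEADING_NL_FIRST=, its simplified tag syntax and the function =headings=
are hypothetical and are not laundry's tokenizer.

```python
import re

# Newline first: a heading starts with a newline, and its optional tags
# must be followed by a newline (checked with a lookahead, not consumed,
# so that the same newline can start the next heading).
HEADING_NL_FIRST = re.compile(
    r"\n(\*+) (.*?)(?:[ \t]+(:[\w@#%:]+:))?[ \t]*(?=\n)"
)


def headings(text):
    """Pad with sentinel newlines so bof and eof look like line breaks."""
    padded = "\n" + text + "\n"
    return [
        (len(stars), title, tags)
        for stars, title, tags in HEADING_NL_FIRST.findall(padded)
    ]
```

Without the padding, a file consisting of a single heading with tags has
neither the leading nor the trailing newline that the pattern depends on.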
7 changes: 7 additions & 0 deletions laundry/block.rkt
@@ -109,3 +109,10 @@ block-type-rest : @not-whitespace
; TODO other escape sequences
;string-contents : ( @not-bs-dq | /BS DQ | BS )+

; make tests pass
wsnn : NOP
newline : NOP
nlpws : NOP
blk-line-contents : NOP
no-headlines : NOP
not-whitespace : NOP
2 changes: 2 additions & 0 deletions laundry/expander.rkt
@@ -319,6 +319,8 @@
tags ; XXX internal should probably be a struct field
archive

planning-malformed

plain-list-element
plain-list-line
descriptive-list-line
13 changes: 13 additions & 0 deletions laundry/plain-list.rkt
@@ -60,3 +60,16 @@ cb-done : CHAR-UPPER-X ; AAAAAAAAAAAAAAAAAAAAAAAA ;_; complexity XXX
plain-list-tag-text : @not-newline
plain-list-tag : pl-tag-end
pl-tag-end : COLON COLON

; make tests pass
not-newline : NOP
not-rsb-newline : NOP
digits : NOP
not-digits-newline : NOP
not-at-newline1 : NOP
not-lsb-at-digits-newline : NOP
not-rsb-newline1 : NOP
not-space-X-hyphen-newline1 : NOP
not-lsb-space-X-hyphen-newline : NOP
space : NOP
not-colon-newline : NOP
5 changes: 3 additions & 2 deletions laundry/test.rkt
@@ -696,7 +696,7 @@
(laundry-tokenizer-debug #f)

(dotest " #+begin_src")
(dotest " #+begin_srclol" #:node-type 'paragraph)
(dotest " #+begin_srclol" #:node-type 'paragraph) ; FIXME ... eof issue ?
(dotest " #+begin_src\n")
(dotest " #+begin_")
(dotest " #+begin_-") ; -> block
@@ -1229,6 +1229,7 @@ echo oops a block
(values count batched)))

(define (make-test-strings)
#f
)
)

@@ -2331,7 +2332,7 @@ don't affilaite to other unaff keyword

(define h-l1
'(org-file
(headline-node (heading 1)))) ; don't have to have (tags) if there are no tags
(headline-node (heading 1 "\n")))) ; don't have to have (tags) if there are no tags

(module+ test-sentinel
(current-module-path)