Simpler and less buggy preprocessor, support #include <> and more! #64

laurenthuberdeau · 2024-08-15T02:35:47Z

Context

I recently merged #8 and noted that I would come back to improving the tokenizer and preprocessor. This is that PR. It includes:

A simpler tokenizer that works better with macros, including the ability to produce '\n' tokens to detect end of lines: 46bbebc
Macro calls: empty arguments are no longer ignored. Example: M(1,,2,3) now has 4 arguments instead of being parsed as M(1,2,3): 9af3839
More flexible token pasting: empty arguments and C keywords can be pasted. Integers can be pasted to the left of identifiers, on the assumption that a later token pasting will produce a valid identifier: 9af3839 and 4ee81e9
Allow C keywords to be defined by the preprocessor. This allows us to replace unsupported types with supported types such as #define float int: 801be22
Support #include <> directives and the -I option, which together make it easy to use the portable libc: 64642cc
Reuse the C parser for #if expressions, reducing code duplication and giving us the ability to evaluate constant expressions (will be used to support constant non-integer array lengths): 1d9dd56
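
The macro-call and token-pasting rules listed above match what a standard C preprocessor does, so they can be checked with any conforming compiler. The following is a minimal sketch of such a check. The names PASTE, PASTE3, four_args and check_macro_rules, as well as the variable names, are made up for illustration and are not taken from pnut's test suite.

```c
#include <assert.h>
#include <string.h>

/* Stringify all four arguments; an empty argument stringifies to "". */
#define M(a, b, c, d) #a "|" #b "|" #c "|" #d

#define PASTE(a, b) a##b
#define PASTE3(a, b, c) a##b##c

/* Pasting with an empty argument leaves the other operand unchanged,
   so this declares a variable named `counter`. */
static int PASTE(, counter) = 7;

/* A keyword can be one half of a paste: this declares `int_like`. */
static int PASTE(int, _like) = 8;

/* `1x` alone is not an identifier, but with `v` pasted on its left the
   final result `v1x` is one. */
static int PASTE3(v, 1, x) = 9;

/* Returns the string built from M(1,,2,3): four arguments, second empty. */
const char *four_args(void) { return M(1,,2,3); }

/* Checks every rule above; aborts through assert if one does not hold. */
void check_macro_rules(void) {
  assert(strcmp(four_args(), "1||2|3") == 0);
  assert(counter == 7);
  assert(int_like == 8);
  assert(v1x == 9);
}
```

Calling check_macro_rules() from a main function is enough to exercise all four cases. If M(1,,2,3) were parsed as M(1,2,3), the file would not even compile, since M expects four arguments.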

A bunch of bug fixes and other quality of life improvements:

Macro lines ending in EOF instead of \n no longer crash: b496f4b
Using a macro and immediately redefining it no longer causes the new macro to be used in the expansion before the redefinition: 7ec88a8
The sh backend compiles a = b = 0 expressions to : $((a = b = 0)) instead of a=$((b = 0)).
Produce an error when parse_definition fails to parse instead of causing the code generator to crash: 8374691
Include file location when crashing with fatal_error: bae4cbe

The end result is that we can tokenize TCC-0.9.27 🎉

Because whitespace is significant to the preprocessor, it used to parse macros by looking for specific characters. This broke when a macro contained whitespace or comments, because the preprocessor could no longer recognize the end of the macro. This commit fixes the issue by adding a flag (skip_newlines) that tells the tokenizer whether to skip '\n'. While processing preprocessor directives, the flag is set to false so that the macro parser can stop at the end of the line.
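
The idea behind the flag can be sketched with a toy tokenizer. This is not pnut's tokenizer: only the name skip_newlines comes from the commit message, while struct lexer, next_tok, count_line_tokens and the token kinds are assumptions made for this example. A "word" here is simply a run of characters up to whitespace or a comment.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum tok_kind { TOK_EOF, TOK_NEWLINE, TOK_WORD };

struct lexer {
  const char *src;    /* NUL-terminated input */
  size_t pos;         /* offset of the next unread character */
  int skip_newlines;  /* nonzero: treat '\n' as ordinary whitespace */
};

/* Return the kind of the next token, skipping blanks and block comments. */
enum tok_kind next_tok(struct lexer *lx) {
  assert(lx != NULL && lx->src != NULL);
  for (;;) {
    const char *p = lx->src + lx->pos;
    if (*p == '\0') return TOK_EOF;
    if (*p == '\n') {
      lx->pos++;
      if (!lx->skip_newlines) return TOK_NEWLINE;
    } else if (*p == ' ' || *p == '\t' || *p == '\r') {
      lx->pos++;
    } else if (p[0] == '/' && p[1] == '*') {
      p += 2;
      while (*p != '\0' && !(p[0] == '*' && p[1] == '/')) p++;
      if (*p != '\0') p += 2;          /* step over the closing delimiter */
      lx->pos = (size_t)(p - lx->src);
    } else {
      /* Consume a run of characters up to whitespace or a comment. */
      do {
        p++;
      } while (*p != '\0' && !isspace((unsigned char)*p) &&
               !(p[0] == '/' && p[1] == '*'));
      lx->pos = (size_t)(p - lx->src);
      return TOK_WORD;
    }
  }
}

/* Count the tokens of a directive: read until the end of the line or EOF
   with newline skipping turned off, then restore the previous setting. */
int count_line_tokens(struct lexer *lx) {
  int saved = lx->skip_newlines;
  int n = 0;
  lx->skip_newlines = 0;
  while (next_tok(lx) == TOK_WORD) n++;
  lx->skip_newlines = saved;
  return n;
}
```

Because the directive parser stops on either a newline token or EOF, a directive on the last line of a file without a trailing '\n' is handled the same way as any other.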

Otherwise, while reading the arguments, the macro may be redefined and the expansion would use the new definition (only valid after the #define) instead of the previous one. An example showing the bug:

    #define FOO 1
    int foo_val = FOO
    #define FOO 3 // Overwrites FOO
    ;

Before, foo_val was assigned the value 3 and now 1 as expected.

Now that the tokenizer can produce NEWLINE tokens, we can reuse the C parser to parse #if expressions. Without newlines, the C parser would keep reading until the end of the expression, skipping over the newlines. Now, if it encounters a newline, a newline token is produced and the C parser fails to parse the expression. This replaces the code that implemented the shunting yard algorithm with a function that can evaluate constant expressions. This function evaluates AST nodes that represent constant expressions, and will be used to support non-integer literal expressions for array lengths.
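
A constant-expression evaluator over AST nodes can be sketched as a small recursive tree walk. The node layout and the names below (struct node, enum node_op, eval_constant) are assumptions for this example and do not reflect pnut's actual AST representation or operator set.

```c
#include <assert.h>
#include <stddef.h>

enum node_op { N_INT, N_ADD, N_SUB, N_MUL, N_LT, N_EQ, N_AND, N_OR, N_NOT };

struct node {
  enum node_op op;
  long value;              /* used only when op == N_INT */
  const struct node *lhs;  /* operand of N_NOT, left operand otherwise */
  const struct node *rhs;  /* right operand, NULL for N_INT and N_NOT */
};

/* Evaluate a tree that contains only integer literals and operators. */
long eval_constant(const struct node *n) {
  assert(n != NULL);
  switch (n->op) {
  case N_INT: return n->value;
  case N_ADD: return eval_constant(n->lhs) + eval_constant(n->rhs);
  case N_SUB: return eval_constant(n->lhs) - eval_constant(n->rhs);
  case N_MUL: return eval_constant(n->lhs) * eval_constant(n->rhs);
  case N_LT:  return eval_constant(n->lhs) < eval_constant(n->rhs);
  case N_EQ:  return eval_constant(n->lhs) == eval_constant(n->rhs);
  /* && and || short-circuit here exactly as they do in C. */
  case N_AND: return eval_constant(n->lhs) && eval_constant(n->rhs);
  case N_OR:  return eval_constant(n->lhs) || eval_constant(n->rhs);
  case N_NOT: return !eval_constant(n->lhs);
  }
  return 0; /* unreachable for well-formed trees */
}
```

Operator precedence never appears in the evaluator because the parser has already encoded it in the shape of the tree. That is what makes a separate shunting yard implementation unnecessary, and the same walk could serve an #if condition or an array length such as 2 * 8.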

laurenthuberdeau · 2024-09-02T18:01:49Z

TCC can now be preprocessed by pnut! 🎉 The only change required is making the string pool and heap (controlled by STRING_POOL_SIZE and HEAP_SIZE) 10x larger to not run out of memory.

When attempting to expand a macro, the list of arguments is parsed by get_macro_args_toks, which also produced the next token after ')'. That token was then pushed on the token stack so that it would be processed after the expanded macro's tokens. When multiple macros were expanded sequentially, this kept the last stack entry from ever being empty, which broke the stack reuse mechanism (similar to TCO). The bug was not visible when bootstrapping pnut because not enough macros were expanded in a row to trigger it. It is, however, a common pattern in TCC.
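
The reuse mechanism can be illustrated with a toy stack of token streams. This is a sketch under assumptions, not pnut's code: tok_stack, push_stream, next_from_top and depth_after are hypothetical names, and the model only shows why an entry that is never fully consumed prevents its slot from being reused.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16

struct tok_stream {
  const int *toks;  /* tokens to replay */
  size_t len;
  size_t pos;       /* index of the next token to hand out */
};

struct tok_stack {
  struct tok_stream entries[MAX_DEPTH];
  int depth;        /* number of live entries */
};

static int top_is_exhausted(const struct tok_stack *s) {
  return s->depth > 0 &&
         s->entries[s->depth - 1].pos == s->entries[s->depth - 1].len;
}

/* Start replaying `toks`. An exhausted top entry is overwritten rather than
   kept below the new one, so back-to-back expansions share a single slot. */
int push_stream(struct tok_stack *s, const int *toks, size_t len) {
  assert(s != NULL);
  if (top_is_exhausted(s)) s->depth--;
  if (s->depth == MAX_DEPTH) return -1;  /* out of slots */
  s->entries[s->depth].toks = toks;
  s->entries[s->depth].len = len;
  s->entries[s->depth].pos = 0;
  s->depth++;
  return 0;
}

/* Hand out the next token of the top entry; returns 0 if there is none. */
int next_from_top(struct tok_stack *s, int *tok) {
  struct tok_stream *t;
  if (s->depth == 0 || top_is_exhausted(s)) return 0;
  t = &s->entries[s->depth - 1];
  *tok = t->toks[t->pos++];
  return 1;
}

/* Replay `n` three-token expansions in a row, reading `consume` tokens from
   each before the next one starts. Returns the final depth, -1 on overflow. */
int depth_after(int n, size_t consume) {
  static const int expansion[] = { 1, 2, 3 };
  struct tok_stack s = { .depth = 0 };
  int tok;
  for (int i = 0; i < n; i++) {
    if (push_stream(&s, expansion, 3) != 0) return -1;
    for (size_t k = 0; k < consume; k++) next_from_top(&s, &tok);
  }
  return s.depth;
}
```

In this model, depth_after(10, 3) stays at depth 1 because every entry is empty when the next expansion starts. With depth_after(10, 2) one token is always left unread, so each expansion takes a new slot, and a long enough run overflows the stack. That is roughly the situation the pushed-back token created.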

laurenthuberdeau · 2024-09-02T18:12:59Z

There is no meaningful difference in the bootstrap times:

========== Branch 'main' ==========

PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: ksh
PNUT_SH_OPTIONS_EXTRA: 
0.110s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.187s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
31.903s for: ksh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
11.539s for: ksh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.024s for: ksh pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: dash
PNUT_SH_OPTIONS_EXTRA: 
0.101s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.189s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
45.131s for: dash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
12.368s for: dash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.014s for: dash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.101s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.192s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
84.046s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
29.969s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.038s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: yash
PNUT_SH_OPTIONS_EXTRA: 
0.102s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.207s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
150.327s for: yash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
51.881s for: yash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.045s for: yash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: zsh
PNUT_SH_OPTIONS_EXTRA: 
0.109s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.192s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
1753.253s for: zsh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
245.234s for: zsh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.038s for: zsh pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
M	benchmark-bootstrap-with-options.sh

========== Branch 'laurent/changes-for-TCC' ==========

PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: ksh
PNUT_SH_OPTIONS_EXTRA: 
0.201s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.164s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
32.880s for: ksh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
11.980s for: ksh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.022s for: ksh pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: dash
PNUT_SH_OPTIONS_EXTRA: 
0.101s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.185s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
46.880s for: dash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
12.703s for: dash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.014s for: dash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.101s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.190s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
85.792s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
30.380s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.039s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: yash
PNUT_SH_OPTIONS_EXTRA: 
0.105s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.186s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
152.946s for: yash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
52.065s for: yash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.045s for: yash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
PLATFORM: Darwin laurent-mbp 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
SHELL: zsh
PNUT_SH_OPTIONS_EXTRA: 
0.103s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.191s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
1765.779s for: zsh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
257.923s for: zsh pnut-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
0.041s for: zsh pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
0.001s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Di386 pnut.c > pnut-i386-compiled-pnut-i386-exe.exe
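
To put a number on the comparison, the relative change of the slowest step (the shell running pnut-sh.sh to compile pnut.c with -Dsh) can be computed from the times logged above. This is only a small arithmetic sketch: the times are copied from the logs, and the helper names (struct timing, percent_change, worst_slowdown) are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

struct timing {
  const char *shell;
  double main_s;    /* seconds on branch 'main' */
  double branch_s;  /* seconds on branch 'laurent/changes-for-TCC' */
};

/* Times of the "<shell> pnut-sh.sh ... -Dsh pnut.c" step from the logs. */
static const struct timing sh_step[] = {
  { "ksh",    31.903,   32.880 },
  { "dash",   45.131,   46.880 },
  { "bash",   84.046,   85.792 },
  { "yash",  150.327,  152.946 },
  { "zsh",  1753.253, 1765.779 },
};

/* Relative change from `before` to `after`, in percent. */
double percent_change(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}

/* Largest relative slowdown across the shells, in percent. */
double worst_slowdown(void) {
  double worst = 0.0;
  for (size_t i = 0; i < sizeof sh_step / sizeof sh_step[0]; i++) {
    double p = percent_change(sh_step[i].main_s, sh_step[i].branch_s);
    if (p > worst) worst = p;
  }
  return worst;
}
```

With these inputs the largest slowdown is the dash run, at just under 4%, and the zsh run changes by less than 1%. The logs show a single run per configuration, so part of these differences may be timing noise.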

fix token pasting + logging around macro expansion

9af3839

laurenthuberdeau marked this pull request as ready for review August 15, 2024 02:35

laurenthuberdeau force-pushed the main branch from 4821029 to 7b72561 Compare August 16, 2024 20:28

laurenthuberdeau added 3 commits August 18, 2024 16:27

Simplify tokenizer

06f6af1

Remove unused keywords

a036b47

Handle files not-ending with newline

b496f4b

laurenthuberdeau force-pushed the laurent/changes-for-TCC branch from 9bdbff8 to 105e163 Compare August 18, 2024 20:27

laurenthuberdeau force-pushed the laurent/changes-for-TCC branch from 105e163 to 3e4ec9d Compare August 18, 2024 21:26

laurenthuberdeau added 8 commits August 18, 2024 17:29

Add 5 timeout when compiling and running tests

75c7d48

Add missing ARROW case in print_tok

dc51f23

Support token pasting between int and identifier

4ee81e9

This creates an invalid identifier, but the result may be pasted with another identifier (to the left) resulting in a valid identifier.

Allow C keywords to be defined

801be22

This is useful to allow unused types to be redefined to something supported by pnut.

A few more tests

9daa299

Make prepare.sh script more verbose

e246f6b

Fix RT_USE_LOOKUP_TABLE when using unicode chars

e2db147

laurenthuberdeau force-pushed the laurent/changes-for-TCC branch from 3e4ec9d to e2db147 Compare August 18, 2024 21:29

laurenthuberdeau changed the title ~~Changes for TCC~~ More preprocessor changes Aug 18, 2024

laurenthuberdeau added 4 commits August 18, 2024 18:25

Nicer sequence of assignments

469e227

Indicate file location when calling fatal_error

bae4cbe

Crash when parse_definition fails to parse

8374691

Support #include <...> when -I option is used

64642cc

The -I option specifies the search path of files that are included with #include <...>.
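
Resolving an #include <...> directive against a search path comes down to extracting the name between the angle brackets and joining it with the directory given to -I. The sketch below shows one way to do that. The function names are hypothetical, and this is not how pnut necessarily implements the lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the file name found between '<' and '>' into `out`.
   Returns 0 on success, -1 if there is no <...> name or it does not fit. */
int angle_include_name(const char *directive, char *out, size_t out_size) {
  const char *open, *close;
  size_t len;
  assert(directive != NULL && out != NULL);
  open = strchr(directive, '<');
  close = open ? strchr(open, '>') : NULL;
  if (close == NULL) return -1;
  len = (size_t)(close - open - 1);
  if (len == 0 || len >= out_size) return -1;
  memcpy(out, open + 1, len);
  out[len] = '\0';
  return 0;
}

/* Build "<dir>/<name>" in `out`, adding a '/' only when `dir` needs one.
   Returns 0 on success, -1 if the result does not fit in `out_size` bytes. */
int join_include_path(const char *dir, const char *name,
                      char *out, size_t out_size) {
  size_t dlen;
  const char *sep;
  int n;
  assert(dir != NULL && name != NULL && out != NULL);
  dlen = strlen(dir);
  sep = (dlen == 0 || dir[dlen - 1] == '/') ? "" : "/";
  n = snprintf(out, out_size, "%s%s%s", dir, sep, name);
  return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For example, with a made-up search directory libc/include, the directive #include <stdio.h> resolves to libc/include/stdio.h. A preprocessor would then open that file and push it on its stack of input files.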

laurenthuberdeau force-pushed the laurent/changes-for-TCC branch from d83feef to 64642cc Compare August 19, 2024 01:31

laurenthuberdeau added 2 commits August 30, 2024 15:12

Merge branch 'main' into laurent/changes-for-TCC

89423a7

Move up AST nodes functions

d4e8109

laurenthuberdeau mentioned this pull request Aug 31, 2024

Add negative-zero.c test to document ksh bug #73

Merged

laurenthuberdeau added 3 commits August 31, 2024 13:00

Increase compile_test timeout when using pnut.sh

7bb7959

Adjust whitespace in fatal_error

e9e5dcc

laurenthuberdeau added 2 commits August 31, 2024 13:40

Revert "Fix RT_USE_LOOKUP_TABLE when using unicode chars"

5b57fcb

This reverts commit e2db147.

Merge branch 'main' into laurent/changes-for-TCC

4064bbc

laurenthuberdeau force-pushed the laurent/changes-for-TCC branch from 568be2a to f845ac9 Compare September 2, 2024 18:07

laurenthuberdeau changed the title ~~More preprocessor changes~~ Simpler and less buggy preprocessor, support #include <> and more! Sep 2, 2024

laurenthuberdeau merged commit f9b88d7 into main Sep 2, 2024
26 checks passed

laurenthuberdeau deleted the laurent/changes-for-TCC branch September 2, 2024 18:24
