
New parser -- first draft, request for review #274

Closed
wants to merge 27 commits

Conversation

kurtbrose

All regression tests are currently passing.

I still need to get all of the error messages cleaned up. (Every regression test that expects a failure has an exception raised -- but sometimes the message is cryptic.)

I'd appreciate some guidance if you have time:

  • does this look consistent with existing style for the project?
  • are there corner cases not covered by current regression tests to watch out for?
  • are there any wish-list parser features that I could integrate while I'm in here?


@mahmoud mahmoud left a comment


In general I think it's much more than a good start! Docstrings on functions and objects will really add to maintainability, too.

I also noticed that it doesn't run on Python 3 out-of-the-box, though it works fine on Python 2.7. We may have to put some effort in to port that.

Also, does this new parser offer any features that the ply-based one didn't support?

@@ -1,4 +1,4 @@
-#!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode
+// #!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode

I noticed this earlier. Technically thrift does not support hash comments, only C-style (// and /* */), so it's good the new parser caught it.

@@ -11,478 +11,80 @@
 import os
 import sys
 import types
-from ply import lex, yacc
-from .lexer import * # noqa
+import parsley

Parsley is awesome. Way better for supporting the non-global use case.

def _make_empty_struct(name, module, ttype=TType.STRUCT, base_cls=TPayload):
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    module.__dict__[name] = type(name, (base_cls, ), attrs)
    return module.__dict__[name]

I think setattr/getattr might be a bit more standard here, if I'm not missing anything.
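A sketch of that alternative (behavior-preserving; TType and TPayload as in the diff above):

def _make_empty_struct(name, module, ttype=TType.STRUCT, base_cls=TPayload):
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    cls = type(name, (base_cls,), attrs)
    setattr(module, name, cls)  # same effect as module.__dict__[name] = cls
    return cls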

return struct._tspec[attr][1]


GRAMMAR = '''

I could see a grammar this dense being daunting to some. I think a link to the thrift grammar page and a bit of whitespace could go a long way, though :)


That said, I do feel like having all the grammar in one place is strongly preferable to having it spread throughout many docstrings.
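For instance, a hypothetical fragment (not the actual grammar from this branch) showing how a spec link plus blank lines between rule groups could keep the single-string approach readable:

# Grammar reference: https://thrift.apache.org/docs/idl
GRAMMAR = '''
document = header*:hs definition*:ds ws -> Document(hs, ds)

header = include | namespace

definition = const | typedef | enum | struct | union | exception | service
'''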

)


class ParseError(ThriftGrammerError): pass

I think it might help if all exception types were grouped above, with docstrings about their usage.
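For example (ModuleLoaderError is invented here for illustration; only ParseError appears in this diff):

from .exc import ThriftParserError, ThriftGrammerError


class ParseError(ThriftGrammerError):
    """Raised when the .thrift source does not match the grammar."""


class ModuleLoaderError(ThriftParserError):
    """Hypothetical example: raised when an included file cannot be loaded."""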

PARSER = parsley.makeGrammar(
    GRAMMAR,
    {
        'Document': collections.namedtuple('Document', 'headers definitions'),

I like the use of namedtuples here, and that parsley supports that sort of set-once/immutable approach.
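A minimal, self-contained sketch of the binding mechanism: names in the dict passed to parsley.makeGrammar become available inside the grammar's semantic actions (the one-rule grammar here is hypothetical):

import collections

import parsley

Document = collections.namedtuple('Document', 'headers definitions')

parser = parsley.makeGrammar(
    "document = <anything*>:text -> Document([], text)",
    {'Document': Document})

doc = parser('struct Foo {}').document()
print(doc.definitions)  # -> 'struct Foo {}'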

@@ -11,478 +11,80 @@
 import os
 import sys
 import types
-from ply import lex, yacc
-from .lexer import * # noqa
+import parsley
 from .exc import ThriftParserError, ThriftGrammerError

Wish this were ThriftGrammarError instead of ThriftGrammerError. Maybe I can make a separate PR to alias that :P

setattr(cls, key, val)
_values_to_names[val] = key
_names_to_values[key] = val
setattr(cls, '_VALUES_TO_NAMES', _values_to_names)
setattr(cls, '_NAMES_TO_VALUES', _names_to_values)
cls.'_VALUES_TO_NAMES' = _values_to_names

@mahmoud mahmoud Dec 22, 2016


Forgot to remove the quotes, I think.

@kurtbrose
Author

kurtbrose commented Dec 23, 2016

Python 3 support is on the way. Getting all the regression tests passing there is my top priority right now.

The two biggest new features are:

1- Safe recovery from parse errors. (There is no global state other than successfully loaded modules, which is easy to clear.) This enables iterative development of thrift files at the REPL or in a jupyter notebook type environment without having to exit and restart the process after a failed parse. #268

2- Parses Apache Thrift's test file https://git-wip-us.apache.org/repos/asf?p=thrift.git;a=blob_plain;f=test/ThriftTest.thrift;hb=HEAD

@mariusvniekerk

Does this pull doc pieces into doc attributes for functions and classes? That would be nice.

@hit9
Contributor

hit9 commented Dec 23, 2016

Hi, the parser library we originally used was pyparsing. We switched to ply in v0.1.10; one important reason is that ply is much faster than pyparsing.

I have tested this pull request. I called thriftpy.load on a 2200+ line thrift file 1000 times. The ply version gives 0.00572609901428s and the parsley version gives 9.22570109367s.

import time

import thriftpy


def bench():
    start = time.time()
    for i in range(1000):
        thriftpy.load('my.thrift')
    end = time.time()
    print(end - start)

The my.thrift file is the thrift file of our private service; you can make one yourself.

thriftpy is heavily used in our company, and it behaves well. We probably won't make big changes to the current parser unless the ply-based parser has some major flaws, or the new parser has some very important new features or much better performance.

NOTE: Deprecated results; see later comments.

@mahmoud

mahmoud commented Dec 23, 2016

@hit9 with a difference that big, I'm fairly certain that you're seeing an artifact of the global cache. My experience is that Parsley is surprisingly comparable to Ply performance-wise.

@hit9
Contributor

hit9 commented Dec 23, 2016 via email

@mahmoud

mahmoud commented Dec 23, 2016

I expect it will still be slower, though; it's a stricter parser with more features (the ply parser can't parse Apache Thrift's test suite) :)

Because this happens at load/import time, I would expect a minor hit to be tolerable in exchange for increased compatibility.

@hit9
Contributor

hit9 commented Dec 23, 2016

Bench results (2200+ line thrift file, called 10 times, without cache):

  • ply version: 2.44250607491s (10 times)
  • parsley version: 78.2733860016s (10 times)

Hmm, 7.8s per thrift file parsed is unacceptable for our service startup..

@hit9
Contributor

hit9 commented Dec 23, 2016

the ply parser can't parse Apache Thrift's test suite

Yes.. we try to stick to the Apache Thrift parser, but some features were dropped or won't be implemented for historical reasons, and some features we just don't use.

The global variable issue may not be solvable, because ply requires global state.

@mahmoud

mahmoud commented Dec 23, 2016

Hmm, 30x is closer to the real figure than 1800x, but we'll need to play around with it some more; I'm sure we can speed it up. Parsley has options to emit the generated Python parser, and that should take care of a good chunk. But long story short, we should parse that big file in under a second, ideally, right?

@hit9
Contributor

hit9 commented Dec 23, 2016

Yes, <1s should be okay. We have some services with 2000-line thrift files.. By the way, what's the reason for building the new parser on parsley rather than sticking with ply? The current version works well for us.

@hit9
Contributor

hit9 commented Dec 23, 2016

The less load time, the better. One service may rely on multiple third-party thrift services; for example, one of our major thrift services needs to communicate with 15 thrift services.

@kurtbrose
Author

Good call on performance. I haven't taken a look at that since the first POC. This hasn't been optimized at all, so I'm sure there is a lot of low-hanging fruit.

I just broke identifying reserved tokens out to a regular expression for a ~30% speedup (based on tests/parser-cases/tutorial.thrift). Nine more optimizations like that would compound to a ~35x speedup.
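Roughly the shape of that change (the token list here is illustrative, not the actual reserved-word set):

import re

# One compiled alternation instead of one grammar rule per keyword.
RESERVED = ('include', 'namespace', 'const', 'typedef', 'enum',
            'struct', 'union', 'exception', 'service')
RESERVED_RE = re.compile(r'(%s)\b' % '|'.join(RESERVED))

print(RESERVED_RE.match('struct Point').group(1))  # -> 'struct'
print(RESERVED_RE.match('structure'))              # -> None (word boundary)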

In addition to optimization, we could cache the parsed files on disk, similar to how the Python module loader transparently stores .pyc files.

@kurtbrose
Author

Added docstring parsing to services and functions. Any number of comments right before the { of a service body, or immediately after a function definition, will get joined together with newlines and set as the __doc__ of the generated class.

service Foo /* the docstring */ {
    void foo() /* arg doc */
}

Open for feedback / other ideas.
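A minimal usage sketch, assuming a foo.thrift containing the snippet above:

import thriftpy

foo = thriftpy.load('foo.thrift')
print(foo.Foo.__doc__)  # -> 'the docstring'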

@kurtbrose
Author

I've played around a bit, and I see the following options to improve performance. (Disregarding py / pyc file caching -- since that is possible but orthogonal to the parser itself.)

  1. Move "heavy lifting" out of parsley.
  • Move parsing of various sub-parts to "built-in" rules. Along the lines of using a regex to identify keywords, there are a couple of mechanisms for breaking "leaf" rules out to other libraries / code. This approach has the advantage of being very iterative: the optimization can be done one rule at a time.

  • Move the first-level tree-node extraction into parsimonious (https://github.com/erikrose/parsimonious), which is highly optimized (see the sketch after this list). Another possibility is to move the entire thing to parsimonious, although this results in a much more verbose two-phase parser.

  2. Optimize parsley. This has the advantage of keeping optimization orthogonal to the grammar, so the grammar doesn't become longer or harder to modify.

I'm confident the parser will at least match the performance of the existing parser with a bit of work.
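To make the parsimonious idea from option 1 concrete, here's a hypothetical first-phase grammar that only chops a file into top-level blocks; a second phase would parse each block in detail:

from parsimonious.grammar import Grammar

coarse = Grammar(r'''
    document = block* ws
    block    = ws keyword body
    keyword  = ~r"include|namespace|const|typedef|enum|struct|union|exception|service"
    body     = ~r"[^{}]*(\{[^{}]*\})?"
    ws       = ~r"\s*"
''')

tree = coarse.parse('struct Point { 1: i32 x }\nenum Color { RED = 1 }')
print(len(tree.children[0].children))  # -> 2 top-level blocks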

@kurtbrose
Author

To give a bit more context to this PR, I'm currently trying to migrate a medium-sized mid-tier from an old version of facebook thrift to thriftpy (~320 thrift files, ~a dozen services and data stores/logs with occasional thrift formatting, maintained by ~30 engineers). The reason for the migration is that the Apache Thrift compiler is a huge pain point: it takes ~30 minutes to build and requires the latest C++ Boost libraries, which makes setting up the build environment painful. In addition to solving this, thriftpy also has the nice bonus of not requiring a separate build step. This opens the door to functionality similar to Avro (https://avro.apache.org/docs/1.2.0/#compare), where data schemas can be stored next to the data and passed around between processes.

The migration is necessarily going to be gradual, which means it is important for the process to be able to recover from parse errors. (A try/except around the import of a thrift module should allow the process to continue.) In addition, if trying to parse invalid thrift IDL can corrupt the whole system, treating thrift specs as data that can be stored in a DB is not viable.
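A sketch of that recovery pattern (the exception import path is an assumption based on the from .exc import above, and the file names are hypothetical):

import thriftpy
from thriftpy.parser.exc import ThriftParserError  # import path assumed

modules = {}
for path in ('user.thrift', 'orders.thrift', 'legacy.thrift'):
    try:
        modules[path] = thriftpy.load(path)
    except ThriftParserError as exc:
        # With no global parser state, one bad file doesn't poison the rest.
        print('skipping %s: %s' % (path, exc))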

@kurtbrose
Author

kurtbrose commented Dec 29, 2016

Getting into the guts of the parser (131 of our legacy thrift files parsing so far), there are a lot of "quirks".

One is the use of the reserved keywords 'END', 'end', and 'alias' as struct fields / enum items. The other is referencing enum items as module-level constants (e.g. "foo.BAR" where "foo" is an included module containing enum Baz { BAR = 1 }).

Given the quirks of our code, it might make sense to maintain this as a separate parser module rather than merging it to core.

… error messages improved with file paths in some places
@kurtbrose
Author

Closing for now; we are going with Thrift 0.9.3 and may come back around to this in the future.

Thank you very much everybody for your time and attention, this was a great learning experience on parsing for me.

@kurtbrose kurtbrose closed this Jan 4, 2017