New parser -- first draft, request for review #274
…ences are resolved internally
In general I think it's much more than a good start! Docstrings on functions and objects will really add to maintainability, too.
I also noticed that it doesn't run on Python 3 out-of-the-box, though it works fine on Python 2.7. We may have to put some effort in to port that.
Also, does this new parser offer any features that the ply-based one didn't support?
@@ -1,4 +1,4 @@
#!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode
// #!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode
I noticed this earlier. Technically, Thrift does not support hash comments, only C-style (`//` and `/* */`), so it's good the new parser caught it.
@@ -11,478 +11,80 @@
import os
import sys
import types
from ply import lex, yacc
from .lexer import *  # noqa
import parsley
Parsley is awesome. Way better for supporting the non-global use case.
def _make_empty_struct(name, module, ttype=TType.STRUCT, base_cls=TPayload):
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    module.__dict__[name] = type(name, (base_cls, ), attrs)
    return module.__dict__[name]
I think `setattr`/`getattr` might be a bit more standard here, if I'm not missing anything.
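A minimal sketch of that suggestion, using `setattr`/`getattr` on the module object instead of writing into `module.__dict__` directly (the `ttype` default and base class are placeholders here, since `TType`/`TPayload` aren't imported in this sketch):

```python
import types

def _make_empty_struct(name, module, ttype='STRUCT', base_cls=object):
    # same attrs as the original, but attach the new class with setattr
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    cls = type(name, (base_cls,), attrs)
    setattr(module, name, cls)
    return getattr(module, name)

# toy demonstration with a throwaway module object
mod = types.ModuleType('example_thrift')
Foo = _make_empty_struct('Foo', mod)
print(Foo is mod.Foo)  # True
```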
return struct._tspec[attr][1]

GRAMMAR = '''
I could see a grammar this dense being daunting to some. I think a link to the Thrift grammar page and a bit of whitespace could go a long way, though :)
That said, I do feel like having all the grammar in one place is strongly preferable to having it spread throughout many docstrings.
)

class ParseError(ThriftGrammerError): pass
I think it might help if all exception types were grouped above, with docstrings about their usage.
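A sketch of what that grouping could look like. The base-class relationship between `ThriftParserError` and `ThriftGrammerError` is an assumption for illustration; only `ParseError(ThriftGrammerError)` is confirmed by the diff:

```python
class ThriftParserError(Exception):
    """Base error for any problem found while loading a .thrift file."""

class ThriftGrammerError(ThriftParserError):
    """Raised when the input violates the Thrift grammar.

    (Inheriting from ThriftParserError is an assumption in this sketch.)
    """

class ParseError(ThriftGrammerError):
    """Raised by the parsley-based parser on a failed parse."""
```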
PARSER = parsley.makeGrammar(
    GRAMMAR,
    {
        'Document': collections.namedtuple('Document', 'headers definitions'),
I like the use of namedtuples here, and that parsley supports that sort of set-once/immutable approach.
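The set-once behavior being praised comes for free with `namedtuple` (the example document contents below are made up for illustration):

```python
import collections

Document = collections.namedtuple('Document', 'headers definitions')
doc = Document(headers=['include "shared.thrift"'], definitions=[])
print(doc.headers[0])

# namedtuple instances are write-once; rebinding a field raises AttributeError
try:
    doc.headers = []
except AttributeError:
    print('immutable')
```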
@@ -11,478 +11,80 @@
import os
import sys
import types
from ply import lex, yacc
from .lexer import *  # noqa
import parsley
from .exc import ThriftParserError, ThriftGrammerError
Wish this were `ThriftGrammarError` instead of `ThriftGrammerError`. Maybe I can make a separate PR to alias that :P
setattr(cls, key, val)
_values_to_names[val] = key
_names_to_values[key] = val
setattr(cls, '_VALUES_TO_NAMES', _values_to_names)
setattr(cls, '_NAMES_TO_VALUES', _names_to_values)
cls.'_VALUES_TO_NAMES' = _values_to_names
Forgot to remove the quotes, I think.
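For reference, the quoted form `cls.'_VALUES_TO_NAMES' = ...` is a `SyntaxError`; the intended fix is plain attribute assignment (or the `setattr` spelling). A stand-in class is used below, since the real enum machinery isn't shown:

```python
_values_to_names = {0: 'ZERO', 1: 'ONE'}

class Color:  # stand-in for the generated enum class
    pass

# plain attribute assignment, equivalent to
# setattr(Color, '_VALUES_TO_NAMES', _values_to_names)
Color._VALUES_TO_NAMES = _values_to_names
Color._NAMES_TO_VALUES = {v: k for k, v in _values_to_names.items()}
print(Color._NAMES_TO_VALUES['ONE'])  # 1
```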
Python 3 support is on the way. That's my top priority right now, to get all the regression tests passing. The two biggest new features are:

1. Safe recovery from parse errors. (There is no global state other than successfully loaded modules, which is easy to clear.) This enables iterative development of thrift files at the REPL or in a jupyter-notebook-type environment, without having to exit and restart the process after a failed parse. #268
2. Parses Apache Thrift's test file https://git-wip-us.apache.org/repos/asf?p=thrift.git;a=blob_plain;f=test/ThriftTest.thrift;hb=HEAD
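The recover-and-retry workflow can be sketched as follows. Everything here is illustrative, not thriftpy API: `load_with_retry` and `flaky_load` are hypothetical names, and `ValueError` stands in for a parser exception:

```python
def load_with_retry(load, path, fixes):
    """Call load(path); on a parse error, apply the next fix and retry
    instead of tearing down the interpreter (the REPL workflow above)."""
    last_err = None
    for fix in fixes:
        try:
            return load(path)
        except ValueError as e:  # stand-in for a thrift parse error
            last_err = e
            fix(path)            # e.g. the user edits the file
    raise last_err

# simulate a file that parses only on the second attempt
calls = []
def flaky_load(path):
    calls.append(path)
    if len(calls) == 1:
        raise ValueError('parse error')
    return 'module'

result = load_with_retry(flaky_load, 'my.thrift', [lambda p: None, lambda p: None])
print(result)  # 'module'
```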
Does this pull doc comments into doc attributes for functions and classes? That would be nice.
Hi, the parser library we originally used was pyparsing, and we switched to ply in v0.1.10; one important reason is that ply is much faster than pyparsing. I have tested this pull request with the following benchmark:

```python
def bench():
    start = time.time()
    for i in range(1000):
        thriftpy.load('my.thrift')
    end = time.time()
    print(end - start)
```

NOTE: Deprecated results, see later comments.
@hit9 with a difference that big, I'm fairly certain that you're seeing an artifact of the global cache. My experience is that Parsley is surprisingly comparable to Ply performance-wise.
Wow, I forgot the cache, will try later.
I expect it will still be slower, though; it's a stricter parser with more features (the ply parser can't parse Apache Thrift's test suite) :) Because this happens at load/import time, I would expect some minor hit to be tolerable for increased compatibility.
Bench results (2200+ line thrift file, called 10 times, without cache)

Hmm, the …
Yes, we try to stick with the Apache Thrift parser. But some features were dropped or won't be implemented for historical reasons, and some features we simply don't use. The …
Hmm, 30x is closer to the real figure than 1800x, but we'll need to play around with it some more; I'm sure we can speed it up. Parsley has options to emit the generated Python parser, and that should take care of a good chunk. But long story short, we should ideally parse that big file in under a second, right?
Yes, …
The …
Good call on performance. I haven't looked at that since the first POC. This hasn't been optimized at all, so I'm sure there is a lot of low-hanging fruit. I just broke identifying reserved tokens out into a regular expression for a ~30% speedup (based on tests/parser-cases/tutorial.thrift); nine more optimizations like that would compound to a ~35x speedup. In addition to optimization, we could cache the parsed files on disk, similar to how the Python module loader transparently stores .pyc files.
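The reserved-token trick described above might look something like this: one compiled alternation instead of testing each keyword in turn. The keyword list is abbreviated for illustration; the real Thrift grammar reserves many more words:

```python
import re

RESERVED = ('struct', 'service', 'enum', 'typedef', 'const', 'include')
# sort longest-first so a keyword that is a prefix of another can't
# shadow the longer match; the trailing \b guards the word boundary
PATTERN = '|'.join(sorted(RESERVED, key=len, reverse=True))
RESERVED_RE = re.compile(r'(?:%s)\b' % PATTERN)

def is_reserved(token):
    m = RESERVED_RE.match(token)
    return m is not None and m.end() == len(token)

print(is_reserved('struct'))   # True
print(is_reserved('structs'))  # False
```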
Added docstring parsing to services and functions. Any number of comments right before the …

Open for feedback / other ideas.
I've played around a bit, and I see the following options to improve performance. (Disregarding py / pyc file caching -- since that is possible but orthogonal to the parser itself.)
I'm confident the parser will at least match the performance of the existing parser with a bit of work.
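The on-disk caching idea mentioned earlier (loosely analogous to how CPython stores .pyc files) could be sketched like this; `load_cached` and `fake_parse` are hypothetical names, not thriftpy API:

```python
import hashlib
import os
import pickle
import tempfile

def load_cached(path, parse, cache_dir):
    """Parse `path`, caching the result on disk keyed by content hash."""
    with open(path, 'rb') as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    cache_file = os.path.join(cache_dir, digest + '.pickle')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)         # cache hit: skip parsing
    result = parse(path)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)            # cache miss: parse and store
    return result

# toy demonstration with a counting "parser"
calls = []
def fake_parse(path):
    calls.append(path)
    return {'definitions': []}

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'my.thrift')
    with open(src, 'w') as f:
        f.write('struct Foo {}')
    load_cached(src, fake_parse, d)
    load_cached(src, fake_parse, d)  # second call hits the cache
    print(len(calls))  # 1
```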
…nd python relative imports (allow imports to specify path relative to current file)
To give a bit more context to this PR, I'm currently trying to migrate a medium-sized mid-tier from an old version of Facebook Thrift to thriftpy (~320 thrift files, ~a dozen services and data stores/logs with occasional thrift formatting, maintained by ~30 engineers).

The reason for the migration is that the Apache Thrift compiler is a huge pain point. The compiler takes ~30 minutes to build, and it requires the latest C++ Boost libraries, which makes setting up the build environment painful. In addition to solving this, thriftpy has the nice bonus of not requiring a separate build step. This opens the door to functionality similar to Avro (https://avro.apache.org/docs/1.2.0/#compare), where data schemas can be stored next to the data and passed around between processes.

The migration is necessarily going to be gradual, which means it is important for the process to be able to recover from parse errors. (…
Getting into the guts of the parser (131 of our legacy thrift files parsing so far), there are a lot of "quirks". One is the use of the reserved keywords 'END', 'end', and 'alias' as struct fields / enum items. Another is referencing enum items as module-level constants (i.e. "foo.BAR" where "foo" is an included module with …

It might make sense to maintain this as a separate parser module rather than merging to core, given the quirks of our code.
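A hypothetical IDL fragment illustrating both quirks (invented for illustration, not taken from the actual codebase):

```thrift
// quirk 1: a reserved word used as an enum item, which a strict
// parser must special-case
enum Phase {
  START = 1,
  END = 2,    // 'END' is reserved in the Thrift grammar
}

// quirk 2: an enum item referenced through an included module,
// as if it were a module-level constant
const Phase DEFAULT_PHASE = foo.BAR
```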
… error messages improved with file paths in some places
Closing for now; we are going with Thrift 0.9.3 and may come back around to this in the future. Thank you very much, everybody, for your time and attention; this was a great learning experience on parsing for me.
All regression tests are currently passing.
I still need to get all of the error messages cleaned up. (Every regression test that expects a failure has an exception raised -- but sometimes the message is cryptic.)
I'd appreciate some guidance if you have time: