
New parser -- first draft, request for review #274

Closed
wants to merge 27 commits

Conversation

kurtbrose

All regression tests are currently passing.

I still need to get all of the error messages cleaned up. (Every regression test that expects a failure has an exception raised -- but sometimes the message is cryptic.)

I'd appreciate some guidance if you have time:

  • does this look consistent with existing style for the project?
  • are there corner cases not covered by current regression tests to watch out for?
  • are there any wish-list parser features that I could integrate while I'm in here?


@mahmoud mahmoud left a comment


In general I think it's much more than a good start! Docstrings on functions and objects will really add to maintainability, too.

I also noticed that it doesn't run on Python 3 out-of-the-box, though it works fine on Python 2.7. We may have to put some effort in to port that.

Also, does this new parser offer any features that the ply-based one didn't support?

@@ -1,4 +1,4 @@
-#!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode
+// #!/usr/local/bin/thrift --gen java:beans,nocamel,hashcode

I noticed this earlier. Technically thrift does not support hash comments, only C-style (// and /* */), so it's good the new parser caught it.

@@ -11,478 +11,80 @@
 import os
 import sys
 import types
-from ply import lex, yacc
-from .lexer import * # noqa
+import parsley

Parsley is awesome. Way better for supporting the non-global use case.

def _make_empty_struct(name, module, ttype=TType.STRUCT, base_cls=TPayload):
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    module.__dict__[name] = type(name, (base_cls, ), attrs)
    return module.__dict__[name]

I think setattr/getattr might be a bit more standard here, if I'm not missing anything.
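A sketch of that alternative (behavior-preserving; TType and TPayload as in the diff above):

def _make_empty_struct(name, module, ttype=TType.STRUCT, base_cls=TPayload):
    attrs = {'__module__': module.__name__, '_ttype': ttype}
    cls = type(name, (base_cls,), attrs)
    setattr(module, name, cls)  # same effect as module.__dict__[name] = cls
    return cls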

return struct._tspec[attr][1]


GRAMMAR = '''

I could see a grammar this dense being daunting to some. I think a link to the thrift grammar page and a bit of whitespace could go a long way, though :)


That said, I do feel like having all the grammar in one place is strongly preferable to having it spread throughout many docstrings.
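For instance, a hypothetical fragment (not the actual grammar from this branch) showing how a spec link plus blank lines between rule groups could keep the single-string approach readable:

# Grammar reference: https://thrift.apache.org/docs/idl
GRAMMAR = '''
document = header*:hs definition*:ds ws -> Document(hs, ds)

header = include | namespace

definition = const | typedef | enum | struct | union | exception | service
'''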

)


class ParseError(ThriftGrammerError): pass

I think it might help if all exception types were grouped above, with docstrings about their usage.
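For example (ModuleLoaderError is invented here for illustration; only ParseError appears in this diff):

from .exc import ThriftParserError, ThriftGrammerError


class ParseError(ThriftGrammerError):
    """Raised when the .thrift source does not match the grammar."""


class ModuleLoaderError(ThriftParserError):
    """Hypothetical example: raised when an included file cannot be loaded."""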

PARSER = parsley.makeGrammar(
    GRAMMAR,
    {
        'Document': collections.namedtuple('Document', 'headers definitions'),

I like the use of namedtuples here, and that parsley supports that sort of set-once/immutable approach.
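A minimal, self-contained sketch of the binding mechanism: names in the dict passed to parsley.makeGrammar become available inside the grammar's semantic actions (the one-rule grammar here is hypothetical):

import collections

import parsley

Document = collections.namedtuple('Document', 'headers definitions')

parser = parsley.makeGrammar(
    "document = <anything*>:text -> Document([], text)",
    {'Document': Document})

doc = parser('struct Foo {}').document()
print(doc.definitions)  # -> 'struct Foo {}'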

@@ -11,478 +11,80 @@
 import os
 import sys
 import types
-from ply import lex, yacc
-from .lexer import * # noqa
+import parsley
 from .exc import ThriftParserError, ThriftGrammerError

Wish this were ThriftGrammarError instead of ThriftGrammerError. Maybe I can make a separate PR to alias that :P

setattr(cls, key, val)
_values_to_names[val] = key
_names_to_values[key] = val
setattr(cls, '_VALUES_TO_NAMES', _values_to_names)
setattr(cls, '_NAMES_TO_VALUES', _names_to_values)
cls.'_VALUES_TO_NAMES' = _values_to_names

@mahmoud mahmoud Dec 22, 2016


Forgot to remove the quotes, I think.

@kurtbrose
Author

kurtbrose commented Dec 23, 2016

Python 3 support is on the way. Getting all the regression tests passing there is my top priority right now.

The two biggest new features are:

1- Safe recovery from parse errors. (There is no global state other than successfully loaded modules, which is easy to clear.) This enables iterative development of thrift files at the REPL or in a jupyter notebook type environment without having to exit and restart the process after a failed parse. #268

2- Parses Apache Thrift's test file https://git-wip-us.apache.org/repos/asf?p=thrift.git;a=blob_plain;f=test/ThriftTest.thrift;hb=HEAD

@mariusvniekerk

Does this pull doc pieces into doc attributes for functions and classes? That would be nice.

@hit9
Contributor

hit9 commented Dec 23, 2016

Hi, the parser library we originally used was pyparsing. We switched to ply in v0.1.10; one important reason is that ply is much faster than pyparsing.

I have tested this pull request. I called thriftpy.load on a 2200+ line thrift file 1000 times. The ply version gives 0.00572609901428s and the parsley version gives 9.22570109367s.

import time

import thriftpy


def bench():
    start = time.time()
    for i in range(1000):
        thriftpy.load('my.thrift')
    end = time.time()
    print(end - start)

The my.thrift file is the thrift file of our private service; you can make one yourself.

thriftpy is heavily used in our company, and it behaves well. We probably won't make big changes to the current parser unless the ply-based parser has some major flaws, or the new parser has some very important new features or much better performance.

NOTE: Deprecated results; see later comments.

@mahmoud

mahmoud commented Dec 23, 2016

@hit9 with a difference that big, I'm fairly certain that you're seeing an artifact of the global cache. My experience is that Parsley is surprisingly comparable to Ply performance-wise.

@hit9
Contributor

hit9 commented Dec 23, 2016 via email

@mahmoud

mahmoud commented Dec 23, 2016

I expect it will still be slower, though; it's a stricter parser with more features (the ply parser can't parse Apache Thrift's test suite) :)

Because this happens at load/import time, I would expect a minor hit to be tolerable in exchange for increased compatibility.

@hit9
Contributor

hit9 commented Dec 23, 2016

Bench results (2200+ line thrift file, called 10 times, without cache):

  • ply version: 2.44250607491s (10 times)
  • parsley version: 78.2733860016s (10 times)

Hmm, 7.8s per thrift file parsed is unacceptable for our service startup..

@hit9
Contributor

hit9 commented Dec 23, 2016

the ply parser can't parse Apache Thrift's test suite

Yes.. we try to stick to the Apache Thrift parser, but some features were dropped or won't be implemented for historical reasons, and some features we just don't use.

The global variable issue may not be solvable, because ply requires global state.

@mahmoud

mahmoud commented Dec 23, 2016

Hmm, 30x is closer to the real figure than 1800x, but we'll need to play around with it some more; I'm sure we can speed it up. Parsley has options to emit the generated Python parser, and that should take care of a good chunk. But long story short, we should parse that big file in under a second, ideally, right?

@hit9
Contributor

hit9 commented Dec 23, 2016

Yes, <1s should be okay. We have some services with 2000-line thrift files.. By the way, what's the reason for building the new parser on parsley rather than sticking with ply? The current version works well for us.

@hit9
Contributor

hit9 commented Dec 23, 2016

The less load time, the better. One service may rely on multiple third-party thrift services; for example, one of our major thrift services needs to communicate with 15 thrift services.

@kurtbrose
Author

Good call on performance. I haven't taken a look at that since the first POC. This hasn't been optimized at all, so I'm sure there is a lot of low-hanging fruit.

I just broke identifying reserved tokens out to a regular expression for a ~30% speedup (based on tests/parser-cases/tutorial.thrift). Nine more optimizations like that would compound to a ~35x speedup.
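Roughly the shape of that change (the token list here is illustrative, not the actual reserved-word set):

import re

# One compiled alternation instead of one grammar rule per keyword.
RESERVED = ('include', 'namespace', 'const', 'typedef', 'enum',
            'struct', 'union', 'exception', 'service')
RESERVED_RE = re.compile(r'(%s)\b' % '|'.join(RESERVED))

print(RESERVED_RE.match('struct Point').group(1))  # -> 'struct'
print(RESERVED_RE.match('structure'))              # -> None (word boundary)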

In addition to optimization, we could cache the parsed files on disk, similar to how the Python module loader transparently stores .pyc files.

@kurtbrose
Author

Added docstring parsing to services and functions. Any number of comments right before the { of a service body, or immediately after a function definition, will get joined together with newlines and set as the __doc__ of the generated class.

service Foo /* the docstring */ {
    void foo() /* arg doc */
}

Open for feedback / other ideas.
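A minimal usage sketch, assuming a foo.thrift containing the snippet above:

import thriftpy

foo = thriftpy.load('foo.thrift')
print(foo.Foo.__doc__)  # -> 'the docstring'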

@kurtbrose
Author

I've played around a bit, and I see the following options to improve performance. (Disregarding py / pyc file caching -- since that is possible but orthogonal to the parser itself.)

  1. Move "heavy lifting" out of parsley.
  • Move parsing of various sub-parts to "built-in" rules. Along the lines of using a regex to identify keywords, there are a couple of mechanisms for breaking "leaf" rules out to other libraries / code. This approach has the advantage of being very iterative: the optimization can be done one rule at a time.

  • Move the first-level tree-node extraction into parsimonious (https://github.com/erikrose/parsimonious), which is highly optimized (see the sketch after this list). Another possibility is to move the entire thing to parsimonious, although this results in a much more verbose two-phase parser.

  2. Optimize parsley. This has the advantage of keeping optimization orthogonal to the grammar, so the grammar doesn't become longer or harder to modify.

I'm confident the parser will at least match the performance of the existing parser with a bit of work.
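To make the parsimonious idea from option 1 concrete, here's a hypothetical first-phase grammar that only chops a file into top-level blocks; a second phase would parse each block in detail:

from parsimonious.grammar import Grammar

coarse = Grammar(r'''
    document = block* ws
    block    = ws keyword body
    keyword  = ~r"include|namespace|const|typedef|enum|struct|union|exception|service"
    body     = ~r"[^{}]*(\{[^{}]*\})?"
    ws       = ~r"\s*"
''')

tree = coarse.parse('struct Point { 1: i32 x }\nenum Color { RED = 1 }')
print(len(tree.children[0].children))  # -> 2 top-level blocks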

@kurtbrose
Author

To give a bit more context to this PR, I'm currently trying to migrate a medium-sized mid-tier from an old version of facebook thrift to thriftpy (~320 thrift files, ~a dozen services and data stores/logs with occasional thrift formatting, maintained by ~30 engineers). The reason for the migration is that the Apache Thrift compiler is a huge pain point: it takes ~30 minutes to build and requires the latest C++ Boost libraries, which makes setting up the build environment painful. In addition to solving this, thriftpy also has the nice bonus of not requiring a separate build step. This opens the door to functionality similar to Avro (https://avro.apache.org/docs/1.2.0/#compare), where data schemas can be stored next to the data and passed around between processes.

The migration is necessarily going to be gradual, which means it is important for the process to be able to recover from parse errors. (A try/except around the import of a thrift module should allow the process to continue.) In addition, if trying to parse invalid thrift IDL can corrupt the whole system, treating thrift specs as data that can be stored in a DB is not viable.
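A sketch of that recovery pattern (the exception import path is an assumption based on the from .exc import above, and the file names are hypothetical):

import thriftpy
from thriftpy.parser.exc import ThriftParserError  # import path assumed

modules = {}
for path in ('user.thrift', 'orders.thrift', 'legacy.thrift'):
    try:
        modules[path] = thriftpy.load(path)
    except ThriftParserError as exc:
        # With no global parser state, one bad file doesn't poison the rest.
        print('skipping %s: %s' % (path, exc))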

@kurtbrose
Author

kurtbrose commented Dec 29, 2016

Getting into the guts of the parser (131 of our legacy thrift files parsing so far), there are a lot of "quirks".

One is the use of the reserved keywords 'END', 'end', and 'alias' as struct fields / enum items. The other is referencing enum items as module-level constants (e.g. "foo.BAR" where "foo" is an included module containing enum Baz { BAR = 1 }).

Given the quirks of our code, it might make sense to maintain this as a separate parser module rather than merging it to core.

… error messages improved with file paths in some places
@kurtbrose
Author

Closing for now; we are going with Thrift 0.9.3 and may come back around to this in the future.

Thank you very much everybody for your time and attention, this was a great learning experience on parsing for me.

@kurtbrose kurtbrose closed this Jan 4, 2017