Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convenience: allow specifying the corpus description json file #172

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 25 additions & 13 deletions compiler_opt/rl/corpus.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,8 @@
# command line, where all the flags reference existing, local files.
FullyQualifiedCmdLine = Tuple[str, ...]

DEFAULT_CORPUS_DESCRIPTION_FILENAME = 'corpus_description.json'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_CORPUS...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean adding underscore('') before DEFAULT...

github believes that ''DEFAULT... means italic the word following the '_'



def _apply_cmdline_filters(
orig_options: Tuple[str, ...],
Expand Down Expand Up @@ -185,7 +187,7 @@ def __call__(self,
class Corpus:
"""Represents a corpus.

A corpus is created from a corpus_description.json file, produced by
A corpus is created from a corpus description json file, produced by
extract_ir.py (for example).

To use the corpus:
Expand Down Expand Up @@ -225,7 +227,7 @@ class ReplaceContext:

def __init__(self,
*,
data_path: str,
location: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

location is a bit confusing, seems that data_path is a more informative name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine to change to anything, but did you read the commit message (because I'm making the opposite argument there).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i missed that part.

data_path sounds to me that can represents both file path or dir path. My major concern about the naming of location is: 1) it does not describe it is the location of what so can be a bit confusing, data_location may be more descriptive in that sense; 2) the flag name is data_path now, having different names increases the burden a little bit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about corpus_description, since we'd subsequently go with deprecating dir-based value anyway? (probably subsequently worth updating the flag, too)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with corpus_description is that it is a bit too specific so if in the future we have more/different corpus formatting we have to do some migration work again. Meanwhile, data_path is a general and reasonable name imo.

I'm fine with either as long as the name is the same as the flag name, so up to you :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corpus_path (and ack on the flag). Less general than "data", reasonably ambiguous-ish to mean either dir or file, and says nothing about description or anything like that. wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corpus_path sounds great!

module_filter: Optional[Callable[[str], bool]] = None,
additional_flags: Tuple[str, ...] = (),
delete_flags: Tuple[str, ...] = (),
Expand All @@ -238,7 +240,10 @@ def __init__(self,
output) and validated.

Args:
data_path: corpus directory.
location: either the path to the corpus description json file in a corpus
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better that we only allow the input being the corpus_description path? The only concern is that it adds some migration cost...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be better, but to avoid migration pain, we could stage it: this patch first; then warn about deprecation; then deprecate

we can skip the 2nd step because we don't have a release model and so when folks would notice the change is out of our control; but can work with e.g. Fuchsia to update their scripts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the migration plan sgtm!

directory, or just the corpus directory, case in which
`DEFAULT_CORPUS_DESCRIPTION_FILENAME` will be assumed to be available
in that directory and used as corpus description.
additional_flags: list of flags to append to the command line
delete_flags: list of flags to remove (both `-flag=<value` and
`-flag <value>` are supported).
Expand All @@ -251,18 +256,24 @@ def __init__(self,
module_filter: a regular expression used to filter 'in' modules with names
matching it. None to include everything.
"""
self._base_dir = data_path
if tf.io.gfile.isdir(location):
self._base_dir = location
corpus_description = os.path.join(location,
DEFAULT_CORPUS_DESCRIPTION_FILENAME)
else:
self._base_dir = os.path.dirname(location)
corpus_description = location

self._sampler = sampler
# TODO: (b/233935329) Per-corpus *fdo profile paths can be read into
# {additional|delete}_flags here
with tf.io.gfile.GFile(
os.path.join(data_path, 'corpus_description.json'), 'r') as f:
with tf.io.gfile.GFile(corpus_description, 'r') as f:
corpus_description: Dict[str, Any] = json.load(f)

module_paths = corpus_description['modules']
if len(module_paths) == 0:
raise ValueError(
f'{data_path}\'s corpus_description contains no modules.')
f'{corpus_description} corpus description contains no modules.')

has_thinlto: bool = corpus_description['has_thinlto']

Expand All @@ -273,7 +284,7 @@ def __init__(self,
if corpus_description[
'global_command_override'] == constant.UNSPECIFIED_OVERRIDE:
raise ValueError(
'global_command_override in corpus_description.json not filled.')
f'global_command_override in {corpus_description} not filled.')
cmd_override = tuple(corpus_description['global_command_override'])
if len(additional_flags) > 0:
logging.warning(
Expand Down Expand Up @@ -314,7 +325,8 @@ def get_cmdline(name: str):
if cmd_override_was_specified:
ret = cmd_override
else:
with tf.io.gfile.GFile(os.path.join(data_path, name + '.cmd')) as f:
with tf.io.gfile.GFile(os.path.join(self._base_dir,
name + '.cmd')) as f:
ret = tuple(f.read().replace(r'{', r'{{').replace(r'}',
r'}}').split('\0'))
# The options read from a .cmd file must be run with -cc1
Expand All @@ -331,8 +343,8 @@ def get_cmdline(name: str):
contents = tp.map(
lambda name: ModuleSpec(
name=name,
size=tf.io.gfile.GFile(os.path.join(data_path, name + '.bc')).
size(),
size=tf.io.gfile.GFile(
os.path.join(self._base_dir, name + '.bc')).size(),
command_line=get_cmdline(name),
has_thinlto=has_thinlto), module_paths)
self._module_specs = tuple(
Expand Down Expand Up @@ -405,6 +417,6 @@ def create_corpus_for_testing(location: str,
if cmdline_is_override:
corpus_description['global_command_override'] = cmdline
with tf.io.gfile.GFile(
os.path.join(location, 'corpus_description.json'), 'w') as f:
os.path.join(location, DEFAULT_CORPUS_DESCRIPTION_FILENAME), 'w') as f:
f.write(json.dumps(corpus_description))
return Corpus(data_path=location, **kwargs)
return Corpus(location=location, **kwargs)
11 changes: 11 additions & 0 deletions compiler_opt/rl/corpus_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,17 @@ def test_constructor(self):
has_thinlto=False),))
self.assertEqual(len(cps), 1)

def test_specific_path(self):
basedir = self.create_tempdir()
cps = corpus.create_corpus_for_testing(
location=basedir, elements=[corpus.ModuleSpec(name='1', size=1)])
new_corpus_desc = os.path.join(basedir, 'hi.json')
tf.io.gfile.rename(
os.path.join(basedir, corpus.DEFAULT_CORPUS_DESCRIPTION_FILENAME),
new_corpus_desc)
cps2 = corpus.Corpus(location=new_corpus_desc)
self.assertTupleEqual(cps.module_specs, cps2.module_specs)

def test_invalid_args(self):
with self.assertRaises(
ValueError, msg='-cc1 flag not present in .cmd file'):
Expand Down
8 changes: 5 additions & 3 deletions compiler_opt/rl/train_locally.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,10 @@

flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
'Root directory for writing logs/summaries/checkpoints.')
flags.DEFINE_string('data_path', None,
'Path to directory containing the corpus.')
flags.DEFINE_string(
'data_path', None,
'Path to directory containing the corpus, or specific corpus description '
'json file.')
flags.DEFINE_integer(
'num_workers', None,
'Number of parallel data collection workers. `None` for max available')
Expand Down Expand Up @@ -103,7 +105,7 @@ def train_eval(worker_manager_class=LocalWorkerPoolManager,

logging.info('Loading module specs from corpus at %s.', FLAGS.data_path)
cps = corpus.Corpus(
data_path=FLAGS.data_path,
location=FLAGS.data_path,
additional_flags=problem_config.flags_to_add(),
delete_flags=problem_config.flags_to_delete(),
replace_flags=problem_config.flags_to_replace())
Expand Down
8 changes: 5 additions & 3 deletions compiler_opt/tools/generate_default_trace.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,10 @@
# see https://bugs.python.org/issue33315 - we do need these types, but must
# currently use them as string annotations

_DATA_PATH = flags.DEFINE_string('data_path', None,
'Path to folder containing IR files.')
_DATA_PATH = flags.DEFINE_string(
'data_path', None,
'Path to directory containing IR files, or path to description json file '
'under such a directory.')
_POLICY_PATH = flags.DEFINE_string(
'policy_path', '', 'Path to the policy to generate trace with.')
_OUTPUT_PATH = flags.DEFINE_string(
Expand Down Expand Up @@ -145,7 +147,7 @@ def main(_):
_MODULE_FILTER.value) if _MODULE_FILTER.value else None

cps = corpus.Corpus(
data_path=_DATA_PATH.value,
location=_DATA_PATH.value,
module_filter=lambda name: True
if not module_filter else module_filter.match(name),
additional_flags=config.flags_to_add(),
Expand Down