Add Script to Check Consistency Between Data Types in Directories and Metadata #390

Merged
merged 8 commits into from
Oct 24, 2024
55 changes: 55 additions & 0 deletions src/scribe_data/check/check_data_type_metadata.py
@@ -0,0 +1,55 @@
from scribe_data.cli.cli_utils import (
    LANGUAGE_DATA_EXTRACTION_DIR,
    data_type_metadata,
)


def check_data_type_metadata(output_file):
    """
    Check that subdirectories named for data types in language directories
    are also reflected in the data_type_metadata.json file, accounting for meta-languages.
    """
    # Extract valid data types from data_type_metadata
    valid_data_types = set(data_type_metadata.keys())

    def check_language_subdirs(lang_dir, meta_lang=None):
        discrepancies = []

        for language in lang_dir.iterdir():
            if language.is_dir():
                meta_language = meta_lang or language.name.lower()
                data_types_in_dir = []

                for data_type in language.iterdir():
                    if data_type.is_dir():
                        data_types_in_dir.append(data_type.name.lower())

                # Compare with valid data types
                missing_data_types = set(data_types_in_dir) - valid_data_types
                extra_data_types = valid_data_types - set(data_types_in_dir)

                if missing_data_types:
                    discrepancies.append(f"Missing in metadata for '{meta_language}': {missing_data_types}")
                if extra_data_types:
                    discrepancies.append(f"Extra in directory for '{meta_language}': {extra_data_types}")
Comment on lines +33 to +34
Contributor

I think the correct terms would be "Extra in metadata" or "Missing in directory"?
This is the result for English:
Extra in directory for 'english': {'conjunctions', 'pronouns', 'articles', 'postpositions', 'personal_pronouns', 'autosuggestions', 'prepositions'}
But the English directory doesn't have them.

Contributor Author
@KesharwaniArpita Oct 17, 2024

Thanks @catreedle, but "extra" here denotes any data type that is present in the language folder but not in the metadata file.
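
For reference, the two set differences used in the diff can be checked on a small example. This is a minimal sketch with hypothetical data (the data type names and the variable names `in_dir_not_in_metadata` and `in_metadata_not_in_dir` are illustrative, not taken from the repository):

```python
# Hypothetical data, for illustration only (not taken from the repository).
valid_data_types = {"nouns", "verbs", "prepositions"}  # keys of the metadata
data_types_in_dir = {"nouns", "verbs", "emoji_keywords"}  # subdirectories found

# Present in the language directory but absent from the metadata.
in_dir_not_in_metadata = data_types_in_dir - valid_data_types

# Present in the metadata but absent from the language directory.
in_metadata_not_in_dir = valid_data_types - data_types_in_dir

print(sorted(in_dir_not_in_metadata))
print(sorted(in_metadata_not_in_dir))
```

In the diff, `extra_data_types` is computed as `valid_data_types - set(data_types_in_dir)`, which is the second case: data types listed in the metadata that have no matching subdirectory. That is consistent with the English output quoted in the review comment.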


                # Recursively check sub-languages (if applicable)
                sub_lang_dir = language / 'sub-languages'
                if sub_lang_dir.exists():
                    discrepancies.extend(check_language_subdirs(sub_lang_dir, meta_language))

Comment on lines +37 to +40
Contributor

hmm...🤔

Contributor

My thought here is how #402 will affect this PR...🤔

Contributor

I see... sub_lang_dir = language / 'sub-languages' is written to also support the new flow coming from #402, yeah? @catreedle

Contributor Author

Yeah, it was done with generalisation in mind.

Contributor

I don't get it. Does this mean the directory structure will change?
Norwegian
--sub-languages
----Nynorsk
----Bokmål

Contributor

I don't get it. Does this mean the directory structure will change?
Norwegian
--sub-languages
----Nynorsk
----Bokmål

No, @catreedle, it will remain as is.

Contributor Author

Does this mean the directory structure will change?

No, the directory structure will remain the same, but the scripts will also be able to check the data types under sub-languages.
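
The recursion discussed in this thread can be tried on a throwaway directory tree. The following is a minimal sketch, not the PR's code: the helper `collect_data_types`, the `parent/sub-language` labels, and the directory layout are all hypothetical, and unlike the diff it skips the `sub-languages` folder itself when listing a language's data types.

```python
import tempfile
from pathlib import Path


def collect_data_types(lang_dir, parent=None):
    """Map each language (or 'parent/sub-language') to its data type folder names."""
    found = {}
    for language in sorted(lang_dir.iterdir()):
        if not language.is_dir():
            continue
        name = language.name.lower()
        label = f"{parent}/{name}" if parent else name
        sub_lang_dir = language / "sub-languages"
        # Variation on the diff: do not count 'sub-languages' as a data type.
        found[label] = sorted(
            d.name.lower()
            for d in language.iterdir()
            if d.is_dir() and d != sub_lang_dir
        )
        # Descend only when a 'sub-languages' folder is present.
        if sub_lang_dir.exists():
            found.update(collect_data_types(sub_lang_dir, label))
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout, for illustration only.
    for rel in (
        "English/nouns",
        "English/verbs",
        "Norwegian/sub-languages/Nynorsk/nouns",
        "Norwegian/sub-languages/Nynorsk/verbs",
    ):
        (root / rel).mkdir(parents=True)
    result = collect_data_types(root)

for label, data_types in result.items():
    print(label, data_types)
```

Languages without a `sub-languages` folder are handled by the same loop with no recursion, which is why the check can be written once without requiring any change to the existing layout.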

        return discrepancies


    # Start checking from the base language directory
    discrepancies = check_language_subdirs(LANGUAGE_DATA_EXTRACTION_DIR)

    # Store discrepancies in the output file
    with open(output_file, 'w', encoding='utf-8') as f:
        if discrepancies:
            for discrepancy in discrepancies:
                f.write(discrepancy + '\n')
        else:
            f.write("All data type metadata is up-to-date!\n")

    print(f"Discrepancies stored in: {output_file}")