add pre-commit hooks (public datasets ONLY) #745

Open: wants to merge 42 commits into base: main
42 commits
01ce5d1 edit (ruisi-su, May 24, 2022)
40d05cd unchange (ruisi-su, May 24, 2022)
1c22538 stats (ruisi-su, May 25, 2022)
f3442ac Merge branch 'master' of github.com:ruisi-su/biomedical (ruisi-su, May 25, 2022)
3ed9e8c Merge branch 'bigscience-workshop:master' into master (ruisi-su, May 27, 2022)
f75d0eb added init for ptm (ruisi-su, May 27, 2022)
ebb7d43 added proc meta script (ruisi-su, May 28, 2022)
fd15a96 Merge branch 'master' of github.com:bigscience-workshop/biomedical (ruisi-su, May 28, 2022)
e1fd031 Merge branch 'bigscience-workshop:master' into master (ruisi-su, May 28, 2022)
36b6477 Merge branch 'bigscience-workshop:master' into master (ruisi-su, May 29, 2022)
28e18bf add single (ruisi-su, May 29, 2022)
14872a8 Merge branch 'master' of github.com:bigscience-workshop/biomedical (ruisi-su, Jun 1, 2022)
4615a37 Merge branch 'bigscience-workshop:master' into master (ruisi-su, Jun 1, 2022)
25433ed add vis code (ruisi-su, Jun 3, 2022)
db4fbf2 Merge branch 'bigscience-workshop:master' into master (ruisi-su, Jun 4, 2022)
b1c10c7 added vis changes (ruisi-su, Jun 6, 2022)
60471aa Merge branch 'bigscience-workshop:master' into master (ruisi-su, Jun 6, 2022)
62a3247 remove proc file (ruisi-su, Jun 6, 2022)
cc6718e Merge branch 'master' of github.com:ruisi-su/biomedical (ruisi-su, Jun 6, 2022)
9d104b8 add vis code (ruisi-su, Jun 8, 2022)
ed66113 Merge remote-tracking branch 'upstream/master' (ruisi-su, Jun 15, 2022)
124d413 add paper script (ruisi-su, Jun 15, 2022)
e98e9e0 edit scripts (ruisi-su, Jun 16, 2022)
ab63147 edit scripts (ruisi-su, Jul 2, 2022)
695c88b add readme (ruisi-su, Jul 2, 2022)
a02459b remove wip code (ruisi-su, Jul 2, 2022)
645ab90 add ngram back in (ruisi-su, Jul 2, 2022)
b1845fe black and isort vis code (ruisi-su, Jul 2, 2022)
391b7bf move (ruisi-su, Jul 7, 2022)
73cb2be added pdfs (ruisi-su, Jul 9, 2022)
21a389b added pdfs that are not local and not broken (ruisi-su, Jul 11, 2022)
832b16c added agg pdf (ruisi-su, Jul 22, 2022)
1b37449 resolve readme (ruisi-su, Oct 17, 2022)
53d3e54 added precommit hooks file (ruisi-su, Oct 19, 2022)
5d40604 fix (ruisi-su, Oct 21, 2022)
54dd142 fix (ruisi-su, Oct 21, 2022)
40521b5 small change to one file to run precommit hooks (ruisi-su, Oct 21, 2022)
6c66e42 small change to one file to run precommit hooks (ruisi-su, Oct 21, 2022)
10d5041 remove irrelevant file (ruisi-su, Oct 21, 2022)
bdbd9cc remove irrelevant file (ruisi-su, Oct 21, 2022)
5c8a7af put pdfs back (ruisi-su, Oct 21, 2022)
09e29a6 Merge branch 'bigscience-workshop:main' into master (ruisi-su, Nov 24, 2022)
60 changes: 60 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,60 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+# See https://github.com/crmne/cookiecutter-modern-datascience
+fail_fast: true
+exclude: '^$'
+files: ^bigbio/biodatasets/
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.3.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-case-conflict
+      - id: debug-statements
+      - id: detect-private-key
+      - id: check-merge-conflict
+      - id: check-added-large-files
+  # - repo: https://github.com/myint/autoflake
+  #   rev: v1.7.6
+  #   hooks:
+  #     - id: autoflake
+  #       args:
+  #         - --in-place
+  #         - --remove-duplicate-keys
+  #         - --remove-unused-variables
+  #         - --remove-all-unused-imports
+  #         - --expand-star-imports
+  - repo: https://github.com/PyCQA/flake8
+    rev: 5.0.4
+    hooks:
+      - id: flake8
+        args:
+          - --max-line-length
+          - '119'
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.10.1
+    hooks:
+      - id: isort
+        args:
+          - --profile
+          - black
+  - repo: https://github.com/ambv/black
+    rev: 22.10.0
+    hooks:
+      - id: black
+        args:
+          - --line-length
+          - '119'
+          - --target-version
+          - py38
+  - repo: local
+    hooks:
+      - id: test-bigbio
+        name: running bigbio unit tests
+        entry: python -m tests.test_bigbio
+        language: system
+        files: ^bigbio/biodatasets/
+        pass_filenames: true
+        # always_run: true
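How the top-level `files` and `exclude` patterns narrow what the hooks see can be sketched in a few lines of Python. This is only an illustration, assuming pre-commit applies both patterns to each repo-relative path with `re.search`; the helper `select_files` and the sample paths are hypothetical and are not part of pre-commit or of this PR.

```python
import re


def select_files(paths, files=r"^bigbio/biodatasets/", exclude=r"^$"):
    """Hypothetical helper: keep paths that match `files` and do not match `exclude`."""
    include_re = re.compile(files)
    exclude_re = re.compile(exclude)
    return [p for p in paths if include_re.search(p) and not exclude_re.search(p)]


# Made-up list of changed files, for illustration only.
sample = [
    "bigbio/biodatasets/an_em/an_em.py",
    "streamlit_demo/vis_data_card.py",
    "tests/test_bigbio.py",
]
selected = select_files(sample)
print(selected)
```

Under that assumption, `exclude: '^$'` only matches an empty path and so excludes nothing, while `files: ^bigbio/biodatasets/` limits every hook to dataset loader files.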
25 changes: 8 additions & 17 deletions bigbio/biodatasets/an_em/an_em.py
@@ -54,11 +54,11 @@
 _DISPLAYNAME = "AnEM"
 
 _DESCRIPTION = """\
-AnEM corpus is a domain- and species-independent resource manually annotated for anatomical
-entity mentions using a fine-grained classification system. The corpus consists of 500 documents
-(over 90,000 words) selected randomly from citation abstracts and full-text papers with
-the aim of making the corpus representative of the entire available biomedical scientific
-literature. The corpus annotation covers mentions of both healthy and pathological anatomical
+AnEM corpus is a domain- and species-independent resource manually annotated for anatomical \
+entity mentions using a fine-grained classification system. The corpus consists of 500 documents \
+(over 90,000 words) selected randomly from citation abstracts and full-text papers with \
+the aim of making the corpus representative of the entire available biomedical scientific \
+literature. The corpus annotation covers mentions of both healthy and pathological anatomical \
 entities and contains over 3,000 annotated mentions.
 """
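
The effect of the added trailing backslashes can be illustrated with a short Python sketch. The strings below are made-up stand-ins, not the real `_DESCRIPTION`. Inside a non-raw triple-quoted string, a backslash directly before the newline removes that line break, so the text is stored as one continuous line.

```python
# Old form: every source line break is kept as "\n" inside the string.
without_backslash = """\
first line
second line
"""

# New form: a trailing backslash joins the next source line onto this one.
with_backslash = """\
first line \
second line
"""

print(repr(without_backslash))
print(repr(with_backslash))
```

The space before each backslash matters: without it, the last word of one line would run directly into the first word of the next.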

@@ -167,10 +167,7 @@ def _split_generators(self, dl_manager) -> List[datasets.SplitGenerator]:
                 name=datasets.Split.TRAIN,
                 gen_kwargs={
                     "filepath": all_data,
-                    "split_path": data_dir
-                    / "AnEM-1.0.4"
-                    / "development"
-                    / "train-files.list",
+                    "split_path": data_dir / "AnEM-1.0.4" / "development" / "train-files.list",
                     "split": "train",
                 },
             ),
@@ -186,10 +183,7 @@ def _split_generators(self, dl_manager) -> List[datasets.SplitGenerator]:
                 name=datasets.Split.VALIDATION,
                 gen_kwargs={
                     "filepath": all_data,
-                    "split_path": data_dir
-                    / "AnEM-1.0.4"
-                    / "development"
-                    / "test-files.list",
+                    "split_path": data_dir / "AnEM-1.0.4" / "development" / "test-files.list",
                     "split": "dev",
                 },
             ),
@@ -251,10 +245,7 @@ def _brat_to_source(self, filepath, brat_example):
             "equivalences": [
                 {
                     "entity_id": brat_entity["id"],
-                    "ref_ids": [
-                        f"{brat_example['document_id']}_{ids}"
-                        for ids in brat_entity["ref_ids"]
-                    ],
+                    "ref_ids": [f"{brat_example['document_id']}_{ids}" for ids in brat_entity["ref_ids"]],
                 }
                 for brat_entity in brat_example["equivalences"]
             ],
3 changes: 1 addition & 2 deletions streamlit_demo/vis_data_card.py
@@ -235,7 +235,7 @@ def gen_latex(dataset_name, helper, splits, schemas, fig_path):
         r"Token frequency distribution by split (top) and frequency of different kind of instances (bottom).}"
         + "\n"
     )
-    latex_bod += r"\end{figure}" + "\n" + r"\textbf{Dataset Description} "
+    latex_bod += r"\end{figure}" + "\n" + r"\textbf{Dataset Description:} "
     latex_bod += (
         fr"{descriptions}"
         + "\n"
@@ -403,4 +403,3 @@ def draw_figure(data_name, data_config_name, schema_type):
 latex_name = f"{data_name}_{config_name}.tex"
 write_latex(latex_bod, latex_name)
 print(latex_bod)
-