Research/subset #164

Draft · wants to merge 89 commits into base: development

Changes from all commits (89 commits):
de83804
Ignore more
benjamc Aug 12, 2024
fdd6f90
Update
benjamc Aug 12, 2024
542d63b
Add notebook
benjamc Aug 12, 2024
6a1fb0d
Update
benjamc Aug 12, 2024
70af9d2
Merge branch 'development' into research/subset
benjamc Aug 14, 2024
b7c718b
feat: validate ranking
benjamc Aug 14, 2024
0837387
Update
benjamc Aug 16, 2024
5490eec
add notebook
benjamc Aug 16, 2024
11122e8
Update
benjamc Aug 16, 2024
8a89183
Do not plot
benjamc Aug 17, 2024
5768330
Update
benjamc Aug 17, 2024
52d717e
Update notebook
benjamc Aug 18, 2024
c064dbf
Format
benjamc Aug 18, 2024
0a9d207
Compare subselections
benjamc Aug 18, 2024
959ce8e
Also save run cfgs
benjamc Aug 18, 2024
5cc4d07
Update notebook
benjamc Aug 18, 2024
794c9f7
Merge branch 'development' into research/subset
benjamc Nov 8, 2024
37c80ba
Merge branch 'development' into research/subset
benjamc Nov 8, 2024
4bb9826
Merge branch 'development' into research/subset
benjamc Nov 8, 2024
7a12285
Merge branch 'development' into research/subset
benjamc Nov 8, 2024
cbb1e94
fix timestamp
benjamc Nov 11, 2024
f3030ff
goodbye bash scripts
benjamc Nov 11, 2024
801d271
make prettier
benjamc Nov 11, 2024
25f350f
goosbye old file
benjamc Nov 11, 2024
5c2b20f
goodbye
benjamc Nov 11, 2024
2a732ff
build(pyproject): update ruff setting
benjamc Nov 11, 2024
50691a2
Ignore more
benjamc Nov 11, 2024
79c30ab
update program
benjamc Nov 11, 2024
6436945
feat(subselect): finally as python
benjamc Nov 11, 2024
8636ebd
tiny fixes
benjamc Nov 11, 2024
b4837ee
refactor(subselection)
benjamc Nov 12, 2024
ef97f0e
fix(problem): generate pymoo config
benjamc Nov 12, 2024
41f9439
local parallel
benjamc Nov 19, 2024
1b04986
refactor(create_subset_config): del dir, better error msg, new cmd
benjamc Nov 28, 2024
5f7a0c8
feat: new subsets for BB + MO
benjamc Nov 28, 2024
9a6c458
run more pymoo problems
benjamc Nov 28, 2024
741856c
Rename problem
benjamc Nov 28, 2024
4ade4ec
refactor(generate_problems): MO
benjamc Nov 28, 2024
f811513
refactor(benchmark_footprint): notebook
benjamc Nov 28, 2024
a60c585
refactor(inspect_problems): update notebook
benjamc Nov 28, 2024
f0f8d52
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Nov 28, 2024
dfb1410
build(yahpo_install): update
benjamc Nov 28, 2024
19b49e5
update yahpo install
benjamc Nov 28, 2024
3b5a852
update notebooks
benjamc Nov 28, 2024
9af890a
revert subselection progress
benjamc Nov 28, 2024
a17722b
please pandas
benjamc Nov 29, 2024
5331430
please pandas
benjamc Nov 29, 2024
9634a6e
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Nov 29, 2024
d58aaaa
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Nov 29, 2024
88fa41e
fix(hebo): ordinal hyperparameter + precommit
benjamc Nov 29, 2024
2862f4c
refactor(yahpo): update ConfigSpace API
benjamc Dec 1, 2024
acca085
ignore more
benjamc Dec 3, 2024
f5abcb5
add scenario and subset_id to config
benjamc Dec 3, 2024
efae5c8
add scenario and subset id to subselection configs
benjamc Dec 3, 2024
f3542e8
remove duplicates in config
benjamc Dec 3, 2024
cd43443
refactor(gather_data): more config keys
benjamc Dec 20, 2024
14de991
feat(gather_data): calculate log performance
benjamc Dec 20, 2024
30e11e9
update notebook
benjamc Dec 20, 2024
13b9541
Merge branch 'development' into research/subset
benjamc Dec 20, 2024
2c9f371
feat(report)
benjamc Jan 6, 2025
8c40aaa
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Jan 6, 2025
8b948f0
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Jan 6, 2025
071f1e4
fix(autorank): api
benjamc Jan 7, 2025
1167ff2
ignore more
benjamc Jan 7, 2025
5045ebb
feat(color_palette): more and nicer colors
benjamc Jan 7, 2025
b31dbdd
feat(generate_report): silent plotting
benjamc Jan 7, 2025
5d1e6ca
fix(ax): conversion of ordinal HPs
benjamc Jan 10, 2025
f0ced8e
style(utils): pre-commit
benjamc Jan 10, 2025
74c6dc1
style(...): pre-commit
benjamc Jan 10, 2025
c2da546
style
benjamc Jan 10, 2025
1c16a55
fix(ax): cat HP
benjamc Jan 10, 2025
da8f4c7
Merge branch 'development' into research/subset
benjamc Jan 10, 2025
754f333
Merge branch 'development' into research/subset
benjamc Jan 10, 2025
607ccdd
refactor(generate_report)
benjamc Jan 12, 2025
c2ecf65
fix(ax): allow_inactive_with_values=True
benjamc Jan 12, 2025
03b638a
fix(yahpo): ConfigSpace deprecation warning
benjamc Jan 12, 2025
62e8fc8
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Jan 12, 2025
abab6d2
docs(generate_report): more info
benjamc Jan 12, 2025
86c8423
fix(run_autorank): if nothing is lost
benjamc Jan 12, 2025
8866952
fix(utils): filter by final performance: if there is a small max val …
benjamc Jan 12, 2025
6e2f633
Merge branch 'development' into research/subset
benjamc Jan 12, 2025
f1691bc
feat(gather_data): collect from several folders
benjamc Jan 12, 2025
baafb7d
Merge branch 'research/subset' of github.com:automl/CARP-S into resea…
benjamc Jan 12, 2025
9124a7a
refactor(generate_report): goodbye plots, fix ranks and norm
benjamc Jan 14, 2025
e458d15
style(file_logger)
benjamc Jan 14, 2025
170531d
style(pareto_front)
benjamc Jan 15, 2025
0145ff8
style(overriderfinde)
benjamc Jan 15, 2025
e052d2d
style(loggingutils)
benjamc Jan 15, 2025
9901400
style(index_configs)
benjamc Jan 15, 2025
6 changes: 5 additions & 1 deletion .gitignore
@@ -138,6 +138,7 @@ containers
data
outputs
runs
results
runsold
*.json
*.code-workspace
@@ -167,4 +168,7 @@ runs_subset
runs*
*.parquet
run-data*
*.txt
*.txt
slurm-*
tmp*
*.aux
2 changes: 1 addition & 1 deletion README.md
@@ -95,7 +95,7 @@ such as surrogate models, in order to run the benchmark:

- For YAHPO, you can download the required surrogate benchmarks and meta-data with
```bash
bash container_recipes/benchmarks/YAHPO/prepare_yahpo.sh
bash container_recipes/benchmarks/YAHPO/install_yahpo.sh
```

## Minimal Example
16 changes: 11 additions & 5 deletions carps/analysis/concat_rundata.py
@@ -1,6 +1,9 @@
from __future__ import annotations

import pandas as pd
from carps.analysis.gather_data import load_set, convert_mixed_types_to_str

from carps.analysis.gather_data import convert_mixed_types_to_str, load_set


def concat_rundata():
paths = {
@@ -24,13 +27,16 @@ def concat_rundata():

args = []
for item in paths.values():
for k,v in item.items():
from pathlib import Path
args.append((v,k))
for k, v in item.items():
args.append((v, k))
res = [load_set(paths=a[0], set_id=a[1]) for a in args]
df = pd.concat([r[0] for r in res]).reset_index(drop=True)
df = convert_mixed_types_to_str(df)
df.to_parquet("rundata.parquet")

df_cfg = pd.concat([d for _, d in res]).reset_index(drop=True)
df_cfg.to_parquet("rundata_cfg.parquet")


if __name__ == "__main__":
concat_rundata()
concat_rundata()
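The `concat_rundata` change above merges run data from several result sets into combined parquet files. The core pattern — tag each frame with its set id, then concatenate — can be sketched independently of the CARP-S code (column and function names here are hypothetical, not the project's API):

```python
import pandas as pd


def concat_run_sets(sets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-set run data, tagging each row with its set id."""
    frames = []
    for set_id, df in sets.items():
        tagged = df.copy()
        tagged["set_id"] = set_id  # remember which set each row came from
        frames.append(tagged)
    # reset_index(drop=True) gives the combined frame one clean 0..n-1 index
    return pd.concat(frames).reset_index(drop=True)
```

Keeping the set id as a column preserves the provenance of each row after concatenation, which the diff's separate `rundata_cfg.parquet` file serves a similar purpose for.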
89 changes: 72 additions & 17 deletions carps/analysis/gather_data.py
@@ -90,7 +90,17 @@ def load_log(rundir: str | Path, log_fn: str = "trial_logs.jsonl") -> pd.DataFra
df["cfg_fn"] = config_fn
df["cfg_str"] = [(config_fn, cfg_str)] * len(df)

config_keys = ["benchmark", "problem", "seed", "optimizer_id", "task"]
config_keys = [
"benchmark_id",
"problemd_id",
"scenario",
"subset_id",
"benchmark",
"problem",
"seed",
"optimizer_id",
"task",
]
config_keys_forbidden = ["_target_", "_partial_"]
df = annotate_with_cfg(df=df, cfg=cfg, config_keys=config_keys, config_keys_forbidden=config_keys_forbidden)
# df = maybe_add_bandit_log(df, rundir, n_initial_design=cfg.task.n_initial_design)
@@ -346,8 +356,11 @@ def process_logs(logs: pd.DataFrame, keep_task_columns: list[str] | None = None)

# Add time
logger.debug("Calculate the elapsed time...")
logs = logs.groupby(by=["problem_id", "optimizer_id", "seed"]).apply(calc_time).reset_index(drop=True)

logs = (
logs.groupby(by=["problem_id", "optimizer_id", "seed"])
.apply(calc_time, include_groups=False)
.reset_index(drop=False)
)
logs = convert_mixed_types_to_str(logs, logger)
logger.debug("Done 😪🙂")
return logs
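This hunk adapts the elapsed-time computation to newer pandas, where calling `groupby(...).apply` with a function that also sees the grouping columns is deprecated; `include_groups=False` (available from pandas 2.2) opts out of passing them. For a cumulative computation like this, a per-group `cumsum` sidesteps the issue on older versions as well — a minimal sketch with hypothetical column names:

```python
import pandas as pd

logs = pd.DataFrame({
    "problem_id": ["p", "p", "q"],
    "optimizer_id": ["o", "o", "o"],
    "seed": [0, 0, 0],
    "duration": [1.0, 2.0, 5.0],
})

# Elapsed time per (problem, optimizer, seed) run as a running total;
# the result aligns with the original index, so no reset_index is needed.
logs["time"] = logs.groupby(["problem_id", "optimizer_id", "seed"])["duration"].cumsum()
```

The diff keeps `apply(calc_time, ...)` because `calc_time` presumably does more than a single aggregation; the sketch only illustrates why `include_groups` became necessary.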
@@ -367,9 +380,23 @@ def normalize_logs(logs: pd.DataFrame) -> pd.DataFrame:
logs["trial_value__cost_inc"] = logs["trial_value__cost"].transform("cummin")
logs["trial_value__cost_norm"] = logs.groupby("problem_id")["trial_value__cost"].transform(normalize)
logger.info("Calc normalized incumbent cost...")

# logs["trial_value__cost_log"] = logs["trial_value__cost"].apply(lambda x: np.log(x + 1e-10))
logs["trial_value__cost_log"] = logs.groupby(by=["problem_id"])["trial_value__cost"].transform(
lambda x: np.log(x - x.min() + 1e-10)
)
logs["trial_value__cost_inc_log"] = logs.groupby(by=["problem_id", "optimizer_id", "seed"])[
"trial_value__cost_log"
].transform("cummin")
logs["trial_value__cost_log_norm"] = logs.groupby("problem_id")["trial_value__cost_log"].transform(normalize)
logs["trial_value__cost_inc_log_norm"] = logs.groupby(by=["problem_id", "optimizer_id", "seed"])[
"trial_value__cost_log_norm"
].transform("cummin")

logs["trial_value__cost_inc_norm"] = logs.groupby(by=["problem_id", "optimizer_id", "seed"])[
"trial_value__cost_norm"
].transform("cummin")
logs["trial_value__cost_inc_norm_log"] = logs["trial_value__cost_inc_norm"].apply(lambda x: np.log(x + 1e-10))
if "time" not in logs:
logs["time"] = 0
logger.info("Normalize time...")
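The new log-performance columns shift costs per problem so the minimum lands near zero before taking the logarithm (the epsilon avoids `log(0)`), then track the best-so-far (incumbent) value per run via `cummin`. A standalone sketch of those two steps, with hypothetical column names:

```python
import numpy as np
import pandas as pd


def add_log_incumbent(logs: pd.DataFrame) -> pd.DataFrame:
    logs = logs.copy()
    # Per problem: shift so the best observed cost maps to ~0, then log-scale.
    logs["cost_log"] = logs.groupby("problem_id")["cost"].transform(
        lambda x: np.log(x - x.min() + 1e-10)
    )
    # Per run (problem, optimizer, seed): incumbent = best cost seen so far.
    logs["cost_inc_log"] = logs.groupby(["problem_id", "optimizer_id", "seed"])[
        "cost_log"
    ].transform("cummin")
    return logs
```

Note the design choice visible in the diff: shifting by the per-problem minimum makes log costs comparable across problems with very different offsets, at the price of the scale depending on the best value observed so far in the data.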
@@ -427,7 +454,10 @@ def get_interpolated_performance_df(
"trial_value__cost",
"trial_value__cost_norm",
"trial_value__cost_inc",
"trial_value__cost_inc_log",
"trial_value__cost_inc_log_norm",
"trial_value__cost_inc_norm",
"trial_value__cost_inc_norm_log",
]
logger.info("Create dataframe for neat plotting by aligning x-axis / interpolating budget.")

@@ -473,21 +503,46 @@ def load_logs(rundir: str):

# NOTE(eddiebergman): Use `n_processes=None` as default, which uses `os.cpu_count()` in `Pool`
def filelogs_to_df(
rundir: str, log_fn: str = "trial_logs.jsonl", n_processes: int | None = None
rundir: str | list[str], log_fn: str = "trial_logs.jsonl", n_processes: int | None = None
) -> tuple[pd.DataFrame, pd.DataFrame]:
logger.info(f"Get rundirs from {rundir}...")
rundirs = get_run_dirs(rundir)
logger.info(f"Found {len(rundirs)} runs. Load data...")
partial_load_log = partial(load_log, log_fn=log_fn)
results = map_multiprocessing(partial_load_log, rundirs, n_processes=n_processes)
df = pd.concat(results).reset_index(drop=True)
logger.info("Done. Do some preprocessing...")
df_cfg = pd.DataFrame([{"cfg_fn": k, "cfg_str": v} for k, v in df["cfg_str"].unique()])
df_cfg.loc[:, "experiment_id"] = np.arange(0, len(df_cfg))
df["experiment_id"] = df["cfg_fn"].apply(lambda x: np.where(df_cfg["cfg_fn"].to_numpy() == x)[0][0])
df_cfg.loc[:, "cfg_str"] = df_cfg["cfg_str"].apply(lambda x: x.replace("\n", "\\n"))
del df["cfg_str"]
del df["cfg_fn"]
"""Load logs from file and preprocess.

Will collect all results from all runs contained in `rundir`.

Parameters
----------
rundir : str | list[str]
Directory containing logs.
log_fn : str, optional
Filename of the log file, by default "trial_logs.jsonl"
n_processes : int | None, optional
Number of processes to use for multiprocessing, by default None

Returns.
-------
tuple[pd.DataFrame, pd.DataFrame]
Logs and config data frames.
"""
if isinstance(rundir, str):
rundir = [rundir]
rundirs = rundir
df_list = []
for rundir in rundirs:
logger.info(f"Get rundirs from {rundir}...")
rundirs = get_run_dirs(rundir)
logger.info(f"Found {len(rundirs)} runs. Load data...")
partial_load_log = partial(load_log, log_fn=log_fn)
results = map_multiprocessing(partial_load_log, rundirs, n_processes=n_processes)
df = pd.concat(results).reset_index(drop=True)
logger.info("Done. Do some preprocessing...")
df_cfg = pd.DataFrame([{"cfg_fn": k, "cfg_str": v} for k, v in df["cfg_str"].unique()])
df_cfg.loc[:, "experiment_id"] = np.arange(0, len(df_cfg))
df["experiment_id"] = df["cfg_fn"].apply(lambda x: np.where(df_cfg["cfg_fn"].to_numpy() == x)[0][0])
df_cfg.loc[:, "cfg_str"] = df_cfg["cfg_str"].apply(lambda x: x.replace("\n", "\\n"))
del df["cfg_str"]
del df["cfg_fn"]
df_list.append(df)
df = pd.concat(df_list).reset_index(drop=True)
logger.info("Done. Saving to file...")
# df = df.map(lambda x: x if not isinstance(x, list) else str(x))
df.to_csv(Path(rundir) / "logs.csv", index=False)
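`filelogs_to_df` now accepts either one run directory or a list of them, and assigns each unique config file an experiment id. The id-assignment step can be sketched on its own (hypothetical column names); a dict-based lookup is linear in the number of rows, whereas the `np.where` scan per row in the diff is quadratic:

```python
import numpy as np
import pandas as pd


def assign_experiment_ids(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # One row per unique config file, numbered consecutively.
    df_cfg = pd.DataFrame({"cfg_fn": df["cfg_fn"].unique()})
    df_cfg["experiment_id"] = np.arange(len(df_cfg))
    # Map each log row back to its experiment id with a single dict lookup.
    mapping = dict(zip(df_cfg["cfg_fn"], df_cfg["experiment_id"]))
    df = df.copy()
    df["experiment_id"] = df["cfg_fn"].map(mapping)
    return df, df_cfg
```

This is a sketch of the pattern, not the project's implementation; the real function also serializes `cfg_str` and drops the helper columns afterwards.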