The goal of ncaa-march-madness-2020 is to store the notebooks for this Kaggle competition; see the GitBook, which covers:
- Baseline
- XGBoost hyperparameter tuning
- Target encoding
- ID embedding
- GBDT + LR
- GBDT + LR k-fold
- Variable importance
- Linear vs. tree base learner?
- Auto-encoder for outlier detection
- Python package description
We publish our package, which bundles some internal functions; install it with
pip install ncaa-march-madness-2020
All notebooks live in the analysis directory and save all data files in the input, output, and data directories.
fs::dir_tree("analysis", recurse = TRUE, regexp = "ipynb")
#> analysis
#> +-- baseline.ipynb
#> +-- evaluate-features.ipynb
#> +-- gbdt_lr.ipynb
#> +-- gbdt_lr_CV.ipynb
#> +-- id2vec.ipynb
#> +-- linear-base-learner.ipynb
#> +-- march-madness-2020-ncaam-simple-lightgbm-on-kfold.ipynb
#> +-- Obtain_Answer.ipynb
#> +-- outliers.ipynb
#> +-- params_tuning.ipynb
#> +-- paris-madness.ipynb
#> +-- pkg_test.ipynb
#> \-- target-encoding.ipynb
fs::dir_tree(recurse = TRUE, regexp = "input|output|data")
#> .
#> +-- data
#> | +-- feature_importances.csv
#> | +-- id2vec.npy
#> | +-- NCAA2020_Kenpom.csv
#> | +-- outlier_df.csv
#> | +-- submission_True.csv
#> | +-- team_strength_embedding.csv
#> | +-- Tourney_Reuslt.csv
#> | \-- Tourney_Reuslt_inputs.csv
#> +-- input
#> | +-- google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
#> | | +-- MDataFiles_Stage1
#> | | | +-- Cities.csv
#> | | | +-- Conferences.csv
#> | | | +-- MConferenceTourneyGames.csv
#> | | | +-- MGameCities.csv
#> | | | +-- MMasseyOrdinals.csv
#> | | | +-- MNCAATourneyCompactResults.csv
#> | | | +-- MNCAATourneyDetailedResults.csv
#> | | | +-- MNCAATourneySeedRoundSlots.csv
#> | | | +-- MNCAATourneySeeds.csv
#> | | | +-- MNCAATourneySlots.csv
#> | | | +-- MRegularSeasonCompactResults.csv
#> | | | +-- MRegularSeasonDetailedResults.csv
#> | | | +-- MSeasons.csv
#> | | | +-- MSecondaryTourneyCompactResults.csv
#> | | | +-- MSecondaryTourneyTeams.csv
#> | | | +-- MTeamCoaches.csv
#> | | | +-- MTeamConferences.csv
#> | | | +-- MTeams.csv
#> | | | \-- MTeamSpellings.csv
#> | | +-- MEvents2015.csv
#> | | +-- MEvents2016.csv
#> | | +-- MEvents2017.csv
#> | | +-- MEvents2018.csv
#> | | +-- MEvents2019.csv
#> | | +-- MPlayers.csv
#> | | \-- MSampleSubmissionStage1_2020.csv
#> | \-- google-cloud-ncaa-march-madness-2020-division-1-mens-tournament.zip
#> +-- large_data
#> \-- output
#>   \-- paris-submission.csv
Download the competition data with the Kaggle API (https://github.com/Kaggle/kaggle-api):
kaggle competitions download -c google-cloud-ncaa-march-madness-2020-division-1-mens-tournament -p input
mkdir input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
unzip input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament.zip -d input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
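The mkdir and unzip steps above can also be done from Python, which may be convenient inside a notebook. The following is a minimal sketch using only the standard library; the helper name `extract_archive` is hypothetical and not part of the published package, and the paths simply mirror the shell commands.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir (created if missing).

    Returns the list of member names in the archive.
    Hypothetical helper for illustration; not part of the published package.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Paths mirror the shell commands; the archive must already be downloaded.
    name = "google-cloud-ncaa-march-madness-2020-division-1-mens-tournament"
    zip_file = Path("input") / f"{name}.zip"
    if zip_file.exists():
        members = extract_archive(zip_file, Path("input") / name)
        print(f"extracted {len(members)} files")
```

Run from the repository root so that the relative input path resolves to the same directory the notebooks read from.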
- Do feature engineering on goals and spots using distance (Nandakumar 2020).
- We skip multicollinearity detection among the features: we use XGBoost, whose tree-based learners are generally robust to correlated features; see https://datascience.stackexchange.com/a/39806/60879.
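Although the notebooks do not run a multicollinearity check, correlated features can still matter when reading variable importance, because the importance may be split between near-duplicate columns. The sketch below shows one simple way to flag such pairs with a Pearson correlation threshold. It is an illustration under stated assumptions, not the project's method: the function names, the 0.9 threshold, and the toy feature values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical toy features; these values are not from the competition data.
features = {
    "seed_diff": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rank_diff": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly twice seed_diff
    "noise": [0.3, -1.2, 0.8, -0.1, 0.5],
}

for name_a, name_b, r in correlated_pairs(features):
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Only the seed_diff / rank_diff pair is flagged here. Whether to drop, merge, or keep such a pair is a modelling choice; with tree boosters, keeping both usually does little harm to predictive accuracy.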
Please note that the ncaa-march-madness-2020 project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Nandakumar, Namita. 2020. “R + Tidyverse in Sports.” RStudio Conference 2020. 2020. https://resources.rstudio.com/rstudio-conf-2020/r-tidyverse-in-sports-namita-nandakumar.