Feedback #1

Open
wants to merge 35 commits into base: feedback

Changes from 10 commits
Commits (35)
f9121d3
Setting up GitHub Classroom Feedback
github-classroom[bot] Jan 2, 2024
1436ddf
Create feature_request.md
GangBean Jan 12, 2024
d8f1ee9
Create bug_report.md
GangBean Jan 12, 2024
2dd3684
Create PULL_REQUEST_TEMPLATE.md
GangBean Jan 12, 2024
adc24db
feat: baseline code download & change config to be managed via args.yaml #3
twndus Jan 15, 2024
a3e5c50
Merge branch 'main' of https://github.com/boostcampaitech6/level2-dkt…
twndus Jan 15, 2024
2401b5f
feat: 6 confusion matrix wandb logging added #6
twndus Jan 15, 2024
ea77c09
feat: baseline code download & change config to be managed via args.yaml #3
GangBean Jan 15, 2024
ab81da7
Merge branch 'main' into feat/6-cf
twndus Jan 16, 2024
fc0fa3b
Merge pull request #7 from boostcampaitech6/feat/6-cf
Dong-droid Jan 16, 2024
71489b8
Feat/8-gitignore
twndus Jan 16, 2024
1fd2fe6
Feat/10 ptfile (#11)
Dong-droid Jan 17, 2024
f646106
feat: produce k+1 best_model and submission files & implement retrain #2
twndus Jan 17, 2024
69407f1
feat: add wandb finish so that k+1 wandb runs are produced #2
twndus Jan 17, 2024
83b8784
refactor: remove unnecessary print statements #2
twndus Jan 17, 2024
32c896c
Add feat12/ensemble
Dong-droid Jan 17, 2024
1dac2ca
[FEAT] Add LightGBM to the baseline code (#16)
sangwoonoel Jan 17, 2024
258f0b1
[FEAT] Add Last Query model to the baseline - no ModelBase inheritance (#18)
sangwoonoel Jan 17, 2024
3e2e64c
[FEAT] Add Last Query model to the baseline - with ModelBase inheritance (#20)
sangwoonoel Jan 17, 2024
0b8b116
[FEAT] Implement ATTNLSTM (#22)
sangwoonoel Jan 17, 2024
294a870
Merge branch 'main' into feat/2-cv
Dong-droid Jan 17, 2024
bda44a4
Implement Feat/2-cv
GangBean Jan 17, 2024
6117394
[FEAT] Add Last Query model to the baseline - using the ModelBase inheritance part (#29)
sangwoonoel Jan 18, 2024
9f7eb82
[FEAT] Change code to save LightGCN training results + sweep with an LR scheduler (#43)
uhhyunjoo Jan 24, 2024
48e25b4
31 feat saint (#33)
Dong-droid Apr 11, 2024
828647a
46 lightgcn (#47)
uhhyunjoo Apr 11, 2024
822d278
feat: add last query LGBM #27 (#35)
twndus Apr 11, 2024
14bb077
Create doc
GangBean Apr 11, 2024
758460d
Create tmp
GangBean Apr 11, 2024
eab5782
Add files via upload
GangBean Apr 11, 2024
4c34019
Update doc
GangBean Apr 11, 2024
7b86dcf
Rename doc to doc.txt
GangBean Apr 11, 2024
d724b76
Delete doc.txt
GangBean Apr 11, 2024
02289bc
Delete docs/tmp
GangBean Apr 11, 2024
661472c
Update README.md
GangBean Apr 11, 2024
18 changes: 18 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,18 @@
---
name: Bug report
about: Template used when a bug is found
title: "[BUG] "
labels: ''
assignees: ''

---

## Description

## How to reproduce

1.
2.
3.

## Solution
14 changes: 14 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,14 @@
---
name: Feature request
about: Template used when adding a new feature
title: "[FEAT] "
labels: ''
assignees: ''

---

## Background
-

## To do
- [ ]
12 changes: 12 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,12 @@
## Overview
-

## Change Log
-

## To Reviewer
-

## Issue Tags
- Closed | Fixed: #
- See also: #
8 changes: 8 additions & 0 deletions README.md
@@ -0,0 +1,8 @@
# Deep Knowledge Tracing Baseline Code

This is the baseline code for the DKT track of Boostcamp A.I. Tech (5th cohort).
Two kinds of baselines are currently provided for the DKT competition.
+ `dkt/` This folder contains a baseline that tackles the task with a **Sequential Model**.
+ `lightgcn/` This folder contains a baseline that tackles the task with a Graph-based approach.

The two baselines have very similar file structures. However, because they depend on different libraries, **we recommend using separate `conda` environments** for them.
Empty file added __init__.py
Empty file.
30 changes: 30 additions & 0 deletions dkt/README.md
@@ -0,0 +1,30 @@
# Baseline1: Deep Knowledge Tracing

## Setup
```bash
cd /opt/ml/input/code/dkt
conda init
(base) . ~/.bashrc
(base) conda create -n dkt python=3.10 -y
(base) conda activate dkt
(dkt) pip install -r requirements.txt
(dkt) python train.py
(dkt) python inference.py
```

## Files
`code/dkt`
* `train.py`: 학습코드입니다.
* `inference.py`: 추론 후 `submissions.csv` 파일을 만들어주는 소스코드입니다.
* `requirements.txt`: 모델 학습에 필요한 라이브러리들이 정리되어 있습니다.

`code/dkt/dkt`
* `args.py`: `argparse`를 통해 학습에 활용되는 여러 argument들을 받아줍니다.
* `criterion.py`: Loss를 포함합니다.
* `datloader.py`: dataloader를 불러옵니다.
* `metric.py`: metric 계산하는 함수를 포함합니다.
* `model.py`: 여러 모델 소스 코드를 포함합니다. `LSTM`, `LSTMATTN`, `BERT`를 가지고 있습니다.
* `optimizer.py`: optimizer를 instantiate할 수 있는 소스코드를 포함합니다.
* `scheduler.py`: scheduler 소스코드를 포함합니다.
* `trainer.py`: 훈련에 사용되는 함수들을 포함합니다.
* `utils.py`: 학습에 필요한 부수적인 함수들을 포함합니다.
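
Reviewer note: to make the module layout above concrete, here is a minimal training-flow sketch. `Preprocess`, `split_data`, and `get_loaders` are the real names from `dkt/dataloader.py` later in this diff; the config values mirror `args.yaml`, and the commented trainer call is a hypothetical placeholder, since `args.py` and `trainer.py` are not part of the commits shown.

```python
# Minimal sketch of the assumed train.py flow (assumes the competition CSVs
# exist under data_dir; run_training is a hypothetical placeholder).
from types import SimpleNamespace

from dkt.dataloader import Preprocess, get_loaders

args = SimpleNamespace(
    data_dir="/opt/ml/input/data/", asset_dir="asset/",
    file_name="train_data.csv", max_seq_len=20,
    num_workers=1, batch_size=64,
)

preprocess = Preprocess(args)
preprocess.load_train_data(args.file_name)   # read CSV, label-encode categoricals
train, valid = preprocess.split_data(preprocess.get_train_data(), ratio=0.7)
train_loader, valid_loader = get_loaders(args, train, valid)

# run_training(args, train_loader, valid_loader)  # hypothetical trainer entry point
```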
Empty file added dkt/dkt-env/README.md
Empty file.
Empty file added dkt/dkt-env/dkt_env/__init__.py
Empty file.
808 changes: 808 additions & 0 deletions dkt/dkt-env/poetry.lock

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions dkt/dkt-env/pyproject.toml
@@ -0,0 +1,15 @@
[tool.poetry]
name = "dkt-env"
version = "0.1.0"
description = ""
authors = ["Your Name <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
transformers = "^4.36.2"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Empty file added dkt/dkt-env/tests/__init__.py
Empty file.
Binary file added dkt/dkt/__pycache__/args.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/criterion.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/dataloader.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/metric.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/model.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/optimizer.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/scheduler.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/trainer.cpython-310.pyc
Binary file not shown.
Binary file added dkt/dkt/__pycache__/utils.cpython-310.pyc
Binary file not shown.
33 changes: 33 additions & 0 deletions dkt/dkt/args.yaml
@@ -0,0 +1,33 @@
# dkt args
seed: 42 # int, seed
device: cpu # str, cpu or gpu
data_dir: /opt/ml/input/data/ # str, data directory

asset_dir: asset/ # str, asset directory
file_name: train_data.csv # str, train file name
model_dir: models/ # str, model directory
model_name: best_model.pt # str, model file name
output_dir: outputs/ # str, output directory
test_file_name: test_data.csv # str, test file name

max_seq_len: 20 # int, max sequence length
num_workers: 1 # int, number of workers

# model
hidden_dim: 64 # int, hidden dimension size
n_layers: 2 # int, number of layers
n_heads: 2 # int, number of heads
drop_out: .2 # float, dropout rate

# training
n_epochs: 20 # int, number of epochs
batch_size: 64 # int, batch size
lr: .0001 # float, learning rate
clip_grad: 10 # int, clip grad
patience: 5 # int, for early stopping

log_steps: 50 # int, print a log every n steps

model: lstm # str, model type
optimizer: adam # str, optimizer type
scheduler: plateau # str, scheduler type
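
Reviewer note: per commit "feat: baseline code download & change config to be managed via args.yaml #3", these values presumably replace CLI flags. The actual loader in `args.py` is not part of this diff, so the following is only a plausible sketch, assuming PyYAML is installed:

```python
# Sketch: load args.yaml into an attribute-style config (assumed approach;
# the real args.py from commit #3 is not shown in this diff).
from types import SimpleNamespace

import yaml  # PyYAML

with open("dkt/dkt/args.yaml") as f:
    args = SimpleNamespace(**yaml.safe_load(f))

print(args.model, args.lr, args.max_seq_len)  # lstm 0.0001 20
```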
6 changes: 6 additions & 0 deletions dkt/dkt/criterion.py
@@ -0,0 +1,6 @@
import torch


def get_criterion(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    loss = torch.nn.BCEWithLogitsLoss(reduction="none")
    return loss(pred, target)
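
Reviewer note: with `reduction="none"`, `get_criterion` returns one loss value per element rather than a scalar, so the caller is expected to mask padded timesteps and reduce. A sketch of that assumed usage (the real reduction lives in `trainer.py`, which is not in this diff):

```python
# Sketch: masked reduction of the per-element BCE loss (assumed usage;
# the actual masking/reduction is done in trainer.py, not shown here).
import torch

from dkt.criterion import get_criterion

preds = torch.randn(64, 20, requires_grad=True)  # logits, (batch, seq_len)
targets = torch.randint(0, 2, (64, 20)).float()
mask = torch.ones(64, 20)                        # 1 = valid step, 0 = padding

loss = get_criterion(preds, targets)             # shape (64, 20), one value per step
loss = (loss * mask).sum() / mask.sum()          # average over valid steps only
loss.backward()
```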
196 changes: 196 additions & 0 deletions dkt/dkt/dataloader.py
@@ -0,0 +1,196 @@
import os
import random
import time
from datetime import datetime
from typing import Tuple

import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder


class Preprocess:
    def __init__(self, args):
        self.args = args
        self.train_data = None
        self.test_data = None

    def get_train_data(self):
        return self.train_data

    def get_test_data(self):
        return self.test_data

    def split_data(self,
                   data: np.ndarray,
                   ratio: float = 0.7,
                   shuffle: bool = True,
                   seed: int = 0) -> Tuple[np.ndarray, np.ndarray]:
        """
        Split data into two parts with a given ratio.
        """
        if shuffle:
            random.seed(seed)  # fix to default seed 0
            random.shuffle(data)

        size = int(len(data) * ratio)
        data_1 = data[:size]
        data_2 = data[size:]
        return data_1, data_2

    def __save_labels(self, encoder: LabelEncoder, name: str) -> None:
        le_path = os.path.join(self.args.asset_dir, name + "_classes.npy")
        np.save(le_path, encoder.classes_)

    def __preprocessing(self, df: pd.DataFrame, is_train: bool = True) -> pd.DataFrame:
        cate_cols = ["assessmentItemID", "testId", "KnowledgeTag"]

        if not os.path.exists(self.args.asset_dir):
            os.makedirs(self.args.asset_dir)

        for col in cate_cols:
            le = LabelEncoder()
            if is_train:
                # For UNKNOWN class
                a = df[col].unique().tolist() + ["unknown"]
                le.fit(a)
                self.__save_labels(le, col)
            else:
                label_path = os.path.join(self.args.asset_dir, col + "_classes.npy")
                le.classes_ = np.load(label_path)

                df[col] = df[col].apply(
                    lambda x: x if str(x) in le.classes_ else "unknown"
                )

            # Assume every column is categorical
            df[col] = df[col].astype(str)
            test = le.transform(df[col])
            df[col] = test

        def convert_time(s: str):
            timestamp = time.mktime(
                datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple()
            )
            return int(timestamp)

        df["Timestamp"] = df["Timestamp"].apply(convert_time)
        return df

    def __feature_engineering(self, df: pd.DataFrame) -> pd.DataFrame:
        # TODO: Fill in if needed
        return df

    def load_data_from_file(self, file_name: str, is_train: bool = True) -> np.ndarray:
        csv_file_path = os.path.join(self.args.data_dir, file_name)
        df = pd.read_csv(csv_file_path)  # , nrows=100000)
        df = self.__feature_engineering(df)
        df = self.__preprocessing(df, is_train)

        # Used later to size the embedding_layer input when embedding features

        self.args.n_questions = len(
            np.load(os.path.join(self.args.asset_dir, "assessmentItemID_classes.npy"))
        )
        self.args.n_tests = len(
            np.load(os.path.join(self.args.asset_dir, "testId_classes.npy"))
        )
        self.args.n_tags = len(
            np.load(os.path.join(self.args.asset_dir, "KnowledgeTag_classes.npy"))
        )

        df = df.sort_values(by=["userID", "Timestamp"], axis=0)
        columns = ["userID", "assessmentItemID", "testId", "answerCode", "KnowledgeTag"]
        group = (
            df[columns]
            .groupby("userID")
            .apply(
                lambda r: (
                    r["testId"].values,
                    r["assessmentItemID"].values,
                    r["KnowledgeTag"].values,
                    r["answerCode"].values,
                )
            )
        )
        return group.values

    def load_train_data(self, file_name: str) -> None:
        self.train_data = self.load_data_from_file(file_name)

    def load_test_data(self, file_name: str) -> None:
        self.test_data = self.load_data_from_file(file_name, is_train=False)


class DKTDataset(torch.utils.data.Dataset):
    def __init__(self, data: np.ndarray, args):
        self.data = data
        self.max_seq_len = args.max_seq_len

    def __getitem__(self, index: int) -> dict:
        row = self.data[index]

        # Load from data
        test, question, tag, correct = row[0], row[1], row[2], row[3]
        data = {
            "test": torch.tensor(test + 1, dtype=torch.int),
            "question": torch.tensor(question + 1, dtype=torch.int),
            "tag": torch.tensor(tag + 1, dtype=torch.int),
            "correct": torch.tensor(correct, dtype=torch.int),
        }

        # Generate mask: truncate sequences longer than max_seq_len; otherwise keep them as-is
        seq_len = len(row[0])
        if seq_len > self.max_seq_len:
            for k, seq in data.items():
                data[k] = seq[-self.max_seq_len:]
            mask = torch.ones(self.max_seq_len, dtype=torch.int16)
        else:
            for k, seq in data.items():
                # Pre-padding non-valid sequences
                tmp = torch.zeros(self.max_seq_len)
                tmp[self.max_seq_len-seq_len:] = data[k]
                data[k] = tmp
            mask = torch.zeros(self.max_seq_len, dtype=torch.int16)
            mask[-seq_len:] = 1
        data["mask"] = mask

        # Generate interaction
        interaction = data["correct"] + 1  # add 1 to correct so 0 can be used for padding
        interaction = interaction.roll(shifts=1)
        interaction_mask = data["mask"].roll(shifts=1)
        interaction_mask[0] = 0
        interaction = (interaction * interaction_mask).to(torch.int64)
        data["interaction"] = interaction
        data = {k: v.int() for k, v in data.items()}
        return data

    def __len__(self) -> int:
        return len(self.data)


def get_loaders(args, train: np.ndarray, valid: np.ndarray) -> Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]:
    pin_memory = False
    train_loader, valid_loader = None, None

    if train is not None:
        trainset = DKTDataset(train, args)
        train_loader = torch.utils.data.DataLoader(
            trainset,
            num_workers=args.num_workers,
            shuffle=True,
            batch_size=args.batch_size,
            pin_memory=pin_memory,
        )
    if valid is not None:
        valset = DKTDataset(valid, args)
        valid_loader = torch.utils.data.DataLoader(
            valset,
            num_workers=args.num_workers,
            shuffle=False,
            batch_size=args.batch_size,
            pin_memory=pin_memory,
        )

    return train_loader, valid_loader
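
Reviewer note: the subtlest part of `DKTDataset.__getitem__` is the `interaction` feature: correctness is offset by 1 (reserving 0 for padding) and rolled one step right, so position t carries the answer from position t-1. A self-contained re-implementation of that transform on a toy sequence:

```python
# Sketch: the interaction transform from DKTDataset.__getitem__, re-implemented
# standalone on a toy sequence for illustration.
import torch

correct = torch.tensor([1, 0, 1, 1], dtype=torch.int)  # answers at steps 0..3
mask = torch.tensor([1, 1, 1, 1], dtype=torch.int16)   # all steps valid

interaction = correct + 1                  # reserve 0 for padding
interaction = interaction.roll(shifts=1)   # step t sees the answer from t-1
interaction_mask = mask.roll(shifts=1)
interaction_mask[0] = 0                    # step 0 has no previous answer
interaction = (interaction * interaction_mask).to(torch.int64)

print(interaction)  # tensor([0, 2, 1, 2]): no-prev, prev-correct, prev-wrong, prev-correct
```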
10 changes: 10 additions & 0 deletions dkt/dkt/metric.py
@@ -0,0 +1,10 @@
from typing import Tuple

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def get_metric(targets: np.ndarray, preds: np.ndarray) -> Tuple[float, float]:
    auc = roc_auc_score(y_true=targets, y_score=preds)
    acc = accuracy_score(y_true=targets, y_pred=np.where(preds >= 0.5, 1, 0))
    return auc, acc
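
Reviewer note: `get_metric` expects `preds` to be probabilities or scores; accuracy thresholds them at 0.5 while AUC uses them directly. A quick usage sketch:

```python
# Sketch: get_metric on toy predictions.
import numpy as np

from dkt.metric import get_metric

targets = np.array([1, 0, 1, 1, 0])
preds = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # predicted probabilities

auc, acc = get_metric(targets, preds)
print(f"AUC={auc:.3f}, ACC={acc:.3f}")       # AUC=1.000, ACC=0.800
```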