
Robustness of Hybrid Models in Cross-domain Readability Assessment

Abstract

Recent studies in automatic readability assessment have shown that hybrid models --- models that leverage both linguistically motivated features and neural models --- can outperform neural models. However, most evaluations on hybrid models have been based on in-domain data in English. This paper provides further evidence on the contribution of linguistic features by reporting the first direct comparison between hybrid, neural and linguistic models on cross-domain data. In experiments on a Chinese dataset, the hybrid model outperforms the neural model on both in-domain and cross-domain data. Importantly, the hybrid model exhibits much smaller performance degradation in the cross-domain setting, suggesting that the linguistic features are more robust and can better capture salient indicators of text difficulty.

Tools

scikit-learn==0.24.1
torch==1.11.0
transformers==4.5.0
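
These pinned versions can be installed with pip:

pip install scikit-learn==0.24.1 torch==1.11.0 transformers==4.5.0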

Pretrained Models

MacBERT: https://huggingface.co/hfl/chinese-macbert-large
BERT: https://huggingface.co/bert-base-chinese
BERT-wwm: https://huggingface.co/hfl/chinese-bert-wwm
RoBERTa: https://huggingface.co/hfl/chinese-roberta-wwm-ext

We use a learning rate of 2e-5 for all pretrained models.
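
All four checkpoints load the same way through the transformers library; a minimal sketch, where num_labels is a placeholder that must match the number of readability levels in the corpus:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# any of the four checkpoints listed above can be substituted here
name = "hfl/chinese-macbert-large"
tokenizer = AutoTokenizer.from_pretrained(name)
# num_labels is an assumption; set it to the number of readability levels
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)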

Demo on Google Colab

Mainland In-domain Version: https://colab.research.google.com/drive/1iyC_RXy1Y_U-M7dpwsZO5j_-by09OWUQ?usp=sharing
Mainland Cross-domain Version: https://colab.research.google.com/drive/1yCV8bL9z6EsTIjrX100iVhBSbMyiI3MG?usp=sharing
Hong Kong In-domain 8-way Version: https://colab.research.google.com/drive/1e06POLJITSuETPq9djFGBGYdgabpPWXd?usp=sharing
Hong Kong In-domain 12-way Version: https://colab.research.google.com/drive/1D4ixJsfPVDdKvFPueS7avvidMyHYGVjg?usp=sharing

How to Run

Most of the code is based on https://github.com/yjang43/pushingonreadability_transformers

  1. Go to the pushingonreadability_transformers-master folder.

  2. Create 5-fold splits of the dataset for training.

python kfold.py --corpus_path mainland.csv --corpus_name mainland
  • Stratified folds will be saved under the file name "data/mainland.{k}.{type}.csv", where k is the index of the fold in the 5-fold split and type is train, valid, or test. A sketch of this step follows below.
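
For reference, the splitting step amounts to a stratified 5-fold split; a minimal sketch with scikit-learn, assuming the corpus CSV has "text" and "label" columns (the actual kfold.py schema may differ):

import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

df = pd.read_csv("mainland.csv")  # assumed columns: "text", "label"
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k, (rest_idx, test_idx) in enumerate(skf.split(df, df["label"])):
    rest = df.iloc[rest_idx]
    # carve a validation set out of the non-test data, stratified by label
    train, valid = train_test_split(rest, test_size=0.1,
                                    stratify=rest["label"], random_state=42)
    train.to_csv(f"data/mainland.{k}.train.csv", index=False)
    valid.to_csv(f"data/mainland.{k}.valid.csv", index=False)
    df.iloc[test_idx].to_csv(f"data/mainland.{k}.test.csv", index=False)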
  3. Fine-tune a pretrained model on the dataset.
python train.py --corpus_name mainland --model chinese-macbert-large --learning_rate 2e-5
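
Under the hood this is standard sequence-classification fine-tuning; a minimal sketch with the transformers Trainer, where the CSV column names and num_labels are assumptions rather than what train.py actually uses:

import pandas as pd
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class ReadabilityDataset(torch.utils.data.Dataset):
    # assumes "text" and "label" columns; adapt to the real CSV schema
    def __init__(self, path, tokenizer):
        df = pd.read_csv(path)
        self.labels = df["label"].tolist()
        self.enc = tokenizer(df["text"].tolist(), truncation=True,
                             padding="max_length", max_length=512)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-large", num_labels=5)  # num_labels: placeholder
args = TrainingArguments(output_dir="checkpoint", learning_rate=2e-5,
                         num_train_epochs=5, per_device_train_batch_size=8)
Trainer(model=model, args=args,
        train_dataset=ReadabilityDataset("data/mainland.0.train.csv", tok),
        eval_dataset=ReadabilityDataset("data/mainland.0.valid.csv", tok)).train()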
  4. Collect output probabilities with the trained model.
python inference.py --checkpoint_path checkpoint/mainland.chinese-macbert-large.0.14 --data_path data/mainland.0.test.csv
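
Conceptually, this step runs the fine-tuned checkpoint over a test split and stores one softmax probability per readability level; a minimal sketch (the checkpoint layout and column names are assumptions):

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "checkpoint/mainland.chinese-macbert-large.0.14"
tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

df = pd.read_csv("data/mainland.0.test.csv")  # assumed column: "text"
rows = []
with torch.no_grad():
    for text in df["text"]:
        inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        rows.append(probs.tolist())
pd.DataFrame(rows).to_csv("data/mainland.0.test.prob.csv", index=False)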
  5. Collect linguistic features and combine them with the output probabilities (sketched below).
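
The combination itself is a row-aligned, column-wise concatenation of the handcrafted-feature table and the probability table; a minimal sketch with hypothetical file names:

import pandas as pd

feats = pd.read_csv("data/mainland.0.test.features.csv")  # hypothetical name
probs = pd.read_csv("data/mainland.0.test.prob.csv")      # hypothetical name
pd.concat([feats, probs], axis=1).to_csv(
    "result/mainland.0.test.combined.csv", index=False)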

  6. Go to the pushingonreadability_traditional_ML-master folder.

  7. Create a result folder and put the combined probability-and-feature files into it, e.g. mainland.0.train.combined.csv and mainland.0.test.combined.csv.

  8. Feed the combined files into the classifiers (a sketch follows the option list below).

python nonneural-classification.py -r
  • -r selects the random forest classifier
  • -s selects the SVM classifier
  • -g selects the XGBoost classifier
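
In essence, the -r option fits a random forest on the combined train file and evaluates it on the combined test file; a minimal sketch, assuming the gold level sits in a "label" column:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("result/mainland.0.train.combined.csv")
test = pd.read_csv("result/mainland.0.test.combined.csv")

# assumed: "label" holds the readability level; every other column is
# either a handcrafted feature or a model output probability
clf = RandomForestClassifier(random_state=42)
clf.fit(train.drop(columns=["label"]), train["label"])
pred = clf.predict(test.drop(columns=["label"]))
print("accuracy:", accuracy_score(test["label"], pred))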

References

Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features
https://aclanthology.org/2021.emnlp-main.834v2.pdf

Tools:
https://github.com/brucewlee/pushingonreadability_traditional_ML
https://github.com/yjang43/pushingonreadability_transformers

Most of our code is modified from the tools above.
