Robustness of Hybrid Models in Cross-domain Readability Assessment

Abstract

Recent studies in automatic readability assessment have shown that hybrid models (models that leverage both linguistically motivated features and neural models) can outperform purely neural models. However, most evaluations of hybrid models have been based on in-domain English data. This paper provides further evidence on the contribution of linguistic features by reporting the first direct comparison among hybrid, neural, and linguistic models on cross-domain data. In experiments on a Chinese dataset, the hybrid model outperforms the neural model on both in-domain and cross-domain data. Importantly, the hybrid model exhibits much smaller performance degradation in the cross-domain setting, suggesting that linguistic features are more robust and better capture salient indicators of text difficulty.

Tools

scikit-learn==0.24.1
torch==1.11.0
transformers==4.5.0
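
These pinned versions can be installed with pip:

pip install scikit-learn==0.24.1 torch==1.11.0 transformers==4.5.0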

Pretrained Models

MacBERT: https://huggingface.co/hfl/chinese-macbert-large
BERT: https://huggingface.co/bert-base-chinese
BERT-wwm: https://huggingface.co/hfl/chinese-bert-wwm
RoBERTa: https://huggingface.co/hfl/chinese-roberta-wwm-ext

We use a learning rate of 2e-5 for all pretrained models.
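
As a rough sketch (not the repository's exact code), any of these models can be loaded for fine-tuning with the transformers library; num_labels is an assumption and must match the number of readability levels in the target dataset:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# num_labels is an assumption: it must match the number of readability
# levels in the dataset (e.g. 8 or 12 for the Hong Kong data).
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-large", num_labels=8
)
```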

Demo on Google Colab

Mainland In-domain Version: https://colab.research.google.com/drive/1iyC_RXy1Y_U-M7dpwsZO5j_-by09OWUQ?usp=sharing
Mainland Cross-domain Version: https://colab.research.google.com/drive/1yCV8bL9z6EsTIjrX100iVhBSbMyiI3MG?usp=sharing
Hong Kong In-domain 8-way Version: https://colab.research.google.com/drive/1e06POLJITSuETPq9djFGBGYdgabpPWXd?usp=sharing
Hong Kong In-domain 12-way Version: https://colab.research.google.com/drive/1D4ixJsfPVDdKvFPueS7avvidMyHYGVjg?usp=sharing

How to Run

Most of the code is based on https://github.com/yjang43/pushingonreadability_transformers

  1. Go to the pushingonreadability_transformers-master folder.

  2. Create stratified 5-fold splits of a dataset for training:

python kfold.py --corpus_path mainland.csv --corpus_name mainland
  • Stratified folds of the data will be saved under the file name data/mainland.{k}.{type}.csv, where k is the fold index of the 5-fold split and type is either train, valid, or test.
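
For reference, the stratified split that kfold.py produces can be approximated with scikit-learn like so (a sketch only; the "label" column name is an assumption):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("mainland.csv")
# "label" is an assumed column holding the readability level;
# the real kfold.py also carves a valid split out of each training fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k, (train_idx, test_idx) in enumerate(skf.split(df, df["label"])):
    df.iloc[train_idx].to_csv(f"data/mainland.{k}.train.csv", index=False)
    df.iloc[test_idx].to_csv(f"data/mainland.{k}.test.csv", index=False)
```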
  3. Fine-tune the pretrained model on the dataset:
python train.py --corpus_name mainland --model chinese-macbert-large --learning_rate 2e-5
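Under the hood, fine-tuning with the transformers Trainer would look roughly like this (a sketch, not the repository's exact code; the dataset objects, epoch count, and batch size are assumptions):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoint",
    learning_rate=2e-5,              # the rate used for all pretrained models
    num_train_epochs=3,              # assumption; tune per dataset
    per_device_train_batch_size=16,  # assumption
)
# model is the classifier loaded above; train_dataset / valid_dataset are
# assumed pre-tokenized datasets built from the fold CSVs
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=valid_dataset)
trainer.train()
```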
  4. Collect output probabilities with the trained model:
python inference.py --checkpoint_path checkpoint/mainland.chinese-macbert-large.0.14 --data_path data/mainland.0.test.csv
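What inference.py collects is essentially the per-class probabilities, along these lines (a sketch; texts is an assumed list of documents from the test CSV, and tokenizer/model are as loaded above):

```python
import torch

inputs = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)  # one probability per readability level
```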
  5. Collect linguistic features and combine them with the output probabilities.

  6. Go to the pushingonreadability_traditional_ML-master folder.

  7. Create a result folder and put the combined probability-and-feature files into it, for example mainland.0.train.combined.csv and mainland.0.test.combined.csv.
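
Combining can be as simple as concatenating the feature columns with the probability columns (a sketch; the input file names and column layouts are assumptions):

```python
import pandas as pd

# assumed paths: one CSV of handcrafted features, one CSV of the
# probabilities collected by inference.py, aligned row by row
feats = pd.read_csv("mainland.0.train.features.csv")
probs = pd.read_csv("mainland.0.train.probs.csv")
pd.concat([feats, probs], axis=1).to_csv(
    "result/mainland.0.train.combined.csv", index=False
)
```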

  8. Feed the combined files into the classifiers:

python nonneural-classification.py -r
  • -r selects the random forest classifier
  • -s selects the SVM classifier
  • -g selects the XGBoost classifier
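
For reference, the -r option corresponds to a scikit-learn random forest along these lines (a sketch; the "label" column name is an assumption):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("result/mainland.0.train.combined.csv")
test = pd.read_csv("result/mainland.0.test.combined.csv")
# "label" is an assumed column; every other column is a feature or probability
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(train.drop(columns=["label"]), train["label"])
pred = clf.predict(test.drop(columns=["label"]))
print("accuracy:", accuracy_score(test["label"], pred))
```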

References

Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features
https://aclanthology.org/2021.emnlp-main.834v2.pdf

Tools:
https://github.com/brucewlee/pushingonreadability_traditional_ML
https://github.com/yjang43/pushingonreadability_transformers

Most of our code is modified from the above tools.