Recent studies in automatic readability assessment have shown that hybrid models, i.e., models that leverage both linguistically motivated features and neural models, can outperform neural models. However, most evaluations of hybrid models have been based on in-domain English data. This paper provides further evidence on the contribution of linguistic features by reporting the first direct comparison among hybrid, neural, and linguistic models on cross-domain data. In experiments on a Chinese dataset, the hybrid model outperforms the neural model on both in-domain and cross-domain data. Importantly, the hybrid model exhibits much smaller performance degradation in the cross-domain setting, suggesting that the linguistic features are more robust and better capture salient indicators of text difficulty.
Requirements:
scikit-learn==0.24.1
torch==1.11.0
transformers==4.5.0
Pretrained models:
MacBERT: https://huggingface.co/hfl/chinese-macbert-large
BERT: https://huggingface.co/bert-base-chinese
BERT-wwm: https://huggingface.co/hfl/chinese-bert-wwm
RoBERTa: https://huggingface.co/hfl/chinese-roberta-wwm-ext
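For orientation, the sketch below shows how one of these checkpoints can be loaded for sequence classification with the pinned transformers version. The number of readability levels (NUM_LEVELS) is an assumption for illustration, not a value taken from the paper.

```python
# Minimal sketch: load a pretrained checkpoint for readability-level
# classification. NUM_LEVELS is an assumption, not a value from the paper.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hfl/chinese-macbert-large"
NUM_LEVELS = 12  # e.g. one class per difficulty level

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LEVELS
)
```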
We use a learning rate of 2e-5 for all pretrained models.
Colab notebooks:
Mainland in-domain version: https://colab.research.google.com/drive/1iyC_RXy1Y_U-M7dpwsZO5j_-by09OWUQ?usp=sharing
Mainland cross-domain version: https://colab.research.google.com/drive/1yCV8bL9z6EsTIjrX100iVhBSbMyiI3MG?usp=sharing
Hong Kong in-domain 8-way version: https://colab.research.google.com/drive/1e06POLJITSuETPq9djFGBGYdgabpPWXd?usp=sharing
Hong Kong in-domain 12-way version: https://colab.research.google.com/drive/1D4ixJsfPVDdKvFPueS7avvidMyHYGVjg?usp=sharing
Most of the code is based on https://github.com/yjang43/pushingonreadability_transformers
Steps:
- Go to the pushingonreadability_transformers-master folder.
- Create 5 folds of the dataset for training (a sketch of the stratified split follows below):
  python kfold.py --corpus_path mainland.csv --corpus_name mainland
- Stratified folds are saved as "data/mainland.{k}.{type}.csv", where k is the fold index and type is train, valid, or test.
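For readers who want to split their own data, here is a minimal sketch of a stratified 5-fold split in the spirit of kfold.py. This is our reading of the step, not the script itself; the "label" column name is an assumption, and the real script additionally writes a valid split per fold.

```python
# Hedged sketch of a stratified 5-fold split; kfold.py is the authoritative
# implementation and also produces a valid split for each fold.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("mainland.csv")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for k, (train_idx, test_idx) in enumerate(skf.split(df, df["label"])):
    df.iloc[train_idx].to_csv(f"data/mainland.{k}.train.csv", index=False)
    df.iloc[test_idx].to_csv(f"data/mainland.{k}.test.csv", index=False)
```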
- Fine-tune a pretrained model on the dataset (a Trainer-based sketch follows below):
  python train.py --corpus_name mainland --model chinese-macbert-large --learning_rate 2e-5
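The sketch below shows roughly what the fine-tuning step does with the transformers Trainer. train.py is the actual implementation; the column names ("text", "label"), the label count, the epoch count, and the batch size here are assumptions, while the learning rate matches the one stated above.

```python
# Hedged fine-tuning sketch; train.py handles this (plus validation and
# checkpointing). Column names and hyperparameters other than the
# learning rate are assumptions.
import pandas as pd
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ReadabilityDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer difficulty labels."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
train_df = pd.read_csv("data/mainland.0.train.csv")
train_ds = ReadabilityDataset(
    tokenizer(list(train_df["text"]), truncation=True, padding=True),
    list(train_df["label"]),
)

model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-large", num_labels=12)  # label count assumed
args = TrainingArguments(output_dir="checkpoint", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```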
- Collect output probabilities with the trained model (see the inference sketch below):
  python inference.py --checkpoint_path checkpoint/mainland.chinese-macbert-large.0.14 --data_path data/mainland.0.test.csv
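Conceptually, this step is a softmax over the fine-tuned model's logits for each text. A hedged sketch, where the checkpoint layout and the "text" column name are assumptions:

```python
# Hedged sketch of probability collection; inference.py does the equivalent
# and writes the probabilities out for the hybrid classifier stage.
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "checkpoint/mainland.chinese-macbert-large.0.14"  # saved by train.py
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
model.eval()

df = pd.read_csv("data/mainland.0.test.csv")
probs = []
with torch.no_grad():
    for text in df["text"]:  # "text" column name is an assumption
        inputs = tokenizer(text, truncation=True, return_tensors="pt")
        logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1).squeeze(0).tolist())
```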
- Extract the linguistic features and combine them with the output probabilities (a sketch of this combination follows below).
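The combination step is a column-wise concatenation of the handcrafted feature table and the neural output probabilities, row-aligned on the same texts. A minimal sketch, with all file and column names assumed:

```python
# Hedged sketch of building a *.combined.csv file: the feature rows and the
# probability rows must refer to the same texts in the same order.
import pandas as pd

features = pd.read_csv("mainland.0.test.features.csv")  # assumed file name
probs = pd.read_csv("mainland.0.test.probs.csv")        # assumed file name
combined = pd.concat([features, probs], axis=1)
combined.to_csv("result/mainland.0.test.combined.csv", index=False)
```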
- Go to the pushingonreadability_traditional_ML-master folder.
- Create a result folder and put the combined output-probability-and-feature files into it, for example: mainland.0.train.combined.csv, mainland.0.test.combined.csv.
- Feed the combined files into the classifiers (a classifier sketch follows below):
  python nonneural-classification.py -r
  - -r: random forest classifier
  - -s: SVM classifier
  - -g: XGBoost classifier
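For orientation, the sketch below mirrors what the three flags select (our reading of the step, with the "label" column name assumed; the -g option additionally requires the xgboost package, which is not in the pinned requirements):

```python
# Hedged sketch of the classifier stage over the combined features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # -r
from sklearn.svm import SVC                          # -s
from xgboost import XGBClassifier                    # -g (needs xgboost)

train = pd.read_csv("result/mainland.0.train.combined.csv")
test = pd.read_csv("result/mainland.0.test.combined.csv")
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_test, y_test = test.drop(columns=["label"]), test["label"]

clf = RandomForestClassifier()  # swap in SVC() or XGBClassifier() as needed
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```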
Reference:
Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features (EMNLP 2021)
https://aclanthology.org/2021.emnlp-main.834v2.pdf
Tools:
https://github.com/brucewlee/pushingonreadability_traditional_ML
https://github.com/yjang43/pushingonreadability_transformers
Most of our code is modified from the above tools.