Recent studies in automatic readability assessment have shown that hybrid models, i.e., models that leverage both linguistically motivated features and neural models, can outperform neural models. However, most evaluations of hybrid models have been based on in-domain English data. This paper provides further evidence on the contribution of linguistic features by reporting the first direct comparison among hybrid, neural, and linguistic models on cross-domain data. In experiments on a Chinese dataset, the hybrid model outperforms the neural model on both in-domain and cross-domain data. Importantly, the hybrid model exhibits much smaller performance degradation in the cross-domain setting, suggesting that the linguistic features are more robust and better capture salient indicators of text difficulty.
Requirements:
scikit-learn==0.24.1
torch==1.11.0
transformers==4.5.0
Pretrained models:
MacBERT: https://huggingface.co/hfl/chinese-macbert-large
BERT: https://huggingface.co/bert-base-chinese
BERT-wwm: https://huggingface.co/hfl/chinese-bert-wwm
RoBERTa: https://huggingface.co/hfl/chinese-roberta-wwm-ext
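For orientation, the sketch below shows how one of these checkpoints can be loaded for sequence classification with the pinned transformers version. The number of readability levels (NUM_LEVELS) is an assumption for illustration, not a value taken from the paper.

```python
# Minimal sketch: load a pretrained checkpoint for readability-level
# classification. NUM_LEVELS is an assumption, not a value from the paper.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hfl/chinese-macbert-large"
NUM_LEVELS = 12  # e.g. one class per difficulty level

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LEVELS
)
```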
We use a learning rate of 2e-5 for all pretrained models.
Colab notebooks:
Mainland in-domain version: https://colab.research.google.com/drive/1iyC_RXy1Y_U-M7dpwsZO5j_-by09OWUQ?usp=sharing
Mainland cross-domain version: https://colab.research.google.com/drive/1yCV8bL9z6EsTIjrX100iVhBSbMyiI3MG?usp=sharing
Hong Kong in-domain 8-way version: https://colab.research.google.com/drive/1e06POLJITSuETPq9djFGBGYdgabpPWXd?usp=sharing
Hong Kong in-domain 12-way version: https://colab.research.google.com/drive/1D4ixJsfPVDdKvFPueS7avvidMyHYGVjg?usp=sharing
Most of the code is based on https://github.com/yjang43/pushingonreadability_transformers
Steps:
- Go to the pushingonreadability_transformers-master folder.
- Create 5 folds of the dataset for training (a sketch of the stratified split follows below):
  python kfold.py --corpus_path mainland.csv --corpus_name mainland
- Stratified folds are saved as "data/mainland.{k}.{type}.csv", where k is the fold index and type is train, valid, or test.
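For readers who want to split their own data, here is a minimal sketch of a stratified 5-fold split in the spirit of kfold.py. This is our reading of the step, not the script itself; the "label" column name is an assumption, and the real script additionally writes a valid split per fold.

```python
# Hedged sketch of a stratified 5-fold split; kfold.py is the authoritative
# implementation and also produces a valid split for each fold.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("mainland.csv")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for k, (train_idx, test_idx) in enumerate(skf.split(df, df["label"])):
    df.iloc[train_idx].to_csv(f"data/mainland.{k}.train.csv", index=False)
    df.iloc[test_idx].to_csv(f"data/mainland.{k}.test.csv", index=False)
```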
- Fine-tune a pretrained model on the dataset (a Trainer-based sketch follows below):
  python train.py --corpus_name mainland --model chinese-macbert-large --learning_rate 2e-5
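The sketch below shows roughly what the fine-tuning step does with the transformers Trainer. train.py is the actual implementation; the column names ("text", "label"), the label count, the epoch count, and the batch size here are assumptions, while the learning rate matches the one stated above.

```python
# Hedged fine-tuning sketch; train.py handles this (plus validation and
# checkpointing). Column names and hyperparameters other than the
# learning rate are assumptions.
import pandas as pd
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ReadabilityDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer difficulty labels."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
train_df = pd.read_csv("data/mainland.0.train.csv")
train_ds = ReadabilityDataset(
    tokenizer(list(train_df["text"]), truncation=True, padding=True),
    list(train_df["label"]),
)

model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-large", num_labels=12)  # label count assumed
args = TrainingArguments(output_dir="checkpoint", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```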
- Collect output probabilities with the trained model (see the inference sketch below):
  python inference.py --checkpoint_path checkpoint/mainland.chinese-macbert-large.0.14 --data_path data/mainland.0.test.csv
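Conceptually, this step is a softmax over the fine-tuned model's logits for each text. A hedged sketch, where the checkpoint layout and the "text" column name are assumptions:

```python
# Hedged sketch of probability collection; inference.py does the equivalent
# and writes the probabilities out for the hybrid classifier stage.
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "checkpoint/mainland.chinese-macbert-large.0.14"  # saved by train.py
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
model.eval()

df = pd.read_csv("data/mainland.0.test.csv")
probs = []
with torch.no_grad():
    for text in df["text"]:  # "text" column name is an assumption
        inputs = tokenizer(text, truncation=True, return_tensors="pt")
        logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1).squeeze(0).tolist())
```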
- Extract the linguistic features and combine them with the output probabilities (a sketch of this combination follows below).
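The combination step is a column-wise concatenation of the handcrafted feature table and the neural output probabilities, row-aligned on the same texts. A minimal sketch, with all file and column names assumed:

```python
# Hedged sketch of building a *.combined.csv file: the feature rows and the
# probability rows must refer to the same texts in the same order.
import pandas as pd

features = pd.read_csv("mainland.0.test.features.csv")  # assumed file name
probs = pd.read_csv("mainland.0.test.probs.csv")        # assumed file name
combined = pd.concat([features, probs], axis=1)
combined.to_csv("result/mainland.0.test.combined.csv", index=False)
```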
- Go to the pushingonreadability_traditional_ML-master folder.
- Create a result folder and put the combined output-probability-and-feature files into it, for example: mainland.0.train.combined.csv, mainland.0.test.combined.csv.
- Feed the combined files into the classifiers (a classifier sketch follows below):
  python nonneural-classification.py -r
  - -r: random forest classifier
  - -s: SVM classifier
  - -g: XGBoost classifier
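For orientation, the sketch below mirrors what the three flags select (our reading of the step, with the "label" column name assumed; the -g option additionally requires the xgboost package, which is not in the pinned requirements):

```python
# Hedged sketch of the classifier stage over the combined features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # -r
from sklearn.svm import SVC                          # -s
from xgboost import XGBClassifier                    # -g (needs xgboost)

train = pd.read_csv("result/mainland.0.train.combined.csv")
test = pd.read_csv("result/mainland.0.test.combined.csv")
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_test, y_test = test.drop(columns=["label"]), test["label"]

clf = RandomForestClassifier()  # swap in SVC() or XGBClassifier() as needed
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```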
Reference:
Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features (EMNLP 2021)
https://aclanthology.org/2021.emnlp-main.834v2.pdf
Tools:
https://github.com/brucewlee/pushingonreadability_traditional_ML
https://github.com/yjang43/pushingonreadability_transformers
Most of our code is modified from the above tools.