I took five German authors from the Kaggle German Literature dataset and built a pipeline that takes German text and predicts which of the five authors wrote it. The pipeline uses TF-IDF vectorization followed by a Multinomial Naive Bayes classifier.
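
A minimal sketch of such a pipeline, assuming scikit-learn is used. The training sentences, the `Author B` label, and the short stop word list are invented placeholders for illustration; the real corpus and the full stopwords-iso list would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the corpus: invented sentences with placeholder labels.
train_texts = [
    "Der Prozess begann vor dem Gericht.",
    "Das Gericht und der Prozess nahmen kein Ende.",
    "Der Wald rauschte im Abendwind.",
    "Im Wald sang der Abendwind sein Lied.",
]
train_authors = ["Kafka", "Kafka", "Author B", "Author B"]

# Tiny illustrative stop word list (assumption); a full German list
# would normally be loaded from a file instead.
german_stop_words = ["der", "das", "dem", "und", "im", "vor", "sein", "kein"]

# TF-IDF turns each text into a weighted bag of words,
# and Multinomial Naive Bayes classifies those vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words=german_stop_words),
    MultinomialNB(),
)
model.fit(train_texts, train_authors)

prediction = model.predict(["Vor dem Gericht wartete der Prozess."])[0]
print(prediction)
```

Wrapping both steps in one pipeline object means raw German strings go in and author labels come out, so the vectorizer fitted on the training data is reused unchanged at prediction time.
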
- Corpus: https://www.kaggle.com/jihyeseo/german-literature-from-digbiborg
- Stopwords: https://github.com/stopwords-iso/stopwords-de
- Other resources:
  - NER and POS Tagging for MNB: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
  - Translation API: https://tech.yandex.com/translate/
- To do:
  - Add translation API to pipeline
  - Improve recall for Kafka texts
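
To track progress on the Kafka recall item, per-class recall can be computed from held-out predictions. The sketch below assumes scikit-learn; the label lists are made up purely to show the call and are not results from this project.

```python
from sklearn.metrics import recall_score

# Hypothetical held-out labels and predictions (not real results).
true_authors = ["Kafka", "Kafka", "Kafka", "Author B", "Author B"]
predicted_authors = ["Kafka", "Author B", "Author B", "Author B", "Author B"]

# Recall for the Kafka class only: the share of true Kafka texts
# that the classifier actually labeled as Kafka.
kafka_recall = recall_score(
    true_authors, predicted_authors, labels=["Kafka"], average=None
)[0]
print(f"Kafka recall: {kafka_recall:.2f}")
```
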