Webpage Classifier
The crawler uses a machine learning model (a binary linear SVC) to classify documents into two classes: "data related" or "unrelated". It was trained with scikit-learn (Python) and integrated into the crawler pipeline via Apache Storm's multilang protocol.
- detect whether a web page is related to "datasets" (links to, contains, or offers an API for datasets) based on its textual content
- higher focus on a low false positive rate than on a low false negative rate:
  - We primarily want to filter unrelated content, not find all possible sites containing datasets (this better complements the hybrid content analysis approach in use).
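To make the two error rates concrete, here is a minimal sketch that computes them from a list of true and predicted labels. The function name `error_rates` and the toy labels are invented for illustration; they are not taken from the project.

```python
def error_rates(y_true, y_pred, positive="data"):
    """Return (false_positive_rate, false_negative_rate) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # False positive: an "unrelated" page that was classified as "data".
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negative: a "data" page that was classified as "unrelated".
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    negatives = sum(1 for t in y_true if t != positive)
    positives = sum(1 for t in y_true if t == positive)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Toy labels, invented for illustration only.
y_true = ["data", "data", "data", "unrelated", "unrelated", "unrelated", "unrelated"]
y_pred = ["data", "unrelated", "unrelated", "unrelated", "unrelated", "unrelated", "data"]
fpr, fnr = error_rates(y_true, y_pred)
```

In this toy case one of four unrelated pages slips through (false positive rate 0.25), while two of three data pages are missed. Under the goal stated above, a classifier with this profile would be tuned to lower the first number even at the cost of the second.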
Training data (~44,000 labeled web pages) was collected manually by crawling ~2,200 labeled websites. The 44k pages (27k labeled "data", 17k labeled "unrelated") were crawled using the crawler component with customized settings:
- labeled as "data": 1570 seed URLs from Ruth's table of data sources, from http://dataportals.org, and from other sources (such as http://aqicn.org/links/).
  - each URL crawled with a depth of 1 outlink and limited to the same host
- labeled as "unrelated": 600 URLs from Google searches of the "unrelated" test case (https://github.com/52North/ecmwf-dataset-crawl/blob/develop/controller/src/integration_tests/testcases.ts#L44-L63) + Alexa Top 500 pages.
  - unlimited crawling, stopped after 17k results.
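The same-host restriction used for the "data" seeds can be illustrated with a small sketch. This is not the crawler's actual configuration (the crawler itself is a separate component). The function `same_host_outlinks` and the example paths `/map/` and `/city/` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_host_outlinks(seed_url, hrefs):
    """Resolve hrefs against the seed URL and keep only links on the seed's host."""
    seed_host = urlparse(seed_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(seed_url, href)
        parsed = urlparse(absolute)
        # Keep only web links that stay on the seed's host; skip duplicates.
        if parsed.scheme in ("http", "https") and parsed.hostname == seed_host:
            if absolute not in kept:
                kept.append(absolute)
    return kept


seed = "http://aqicn.org/links/"
# Hypothetical outlinks found on the seed page.
hrefs = ["/map/", "http://aqicn.org/city/", "https://example.org/other", "mailto:info@aqicn.org"]
links = same_host_outlinks(seed, hrefs)
```

Only the two links on `aqicn.org` survive. Crawling these and stopping there corresponds to a depth of 1 outlink, which is what lets the seed's label be inherited by the crawled pages.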
Automatic language detection was run on all results, and only English pages were kept for training. Originally I had more fine-grained classes ("realtime", "historic", "dataportal", "agency" instead of just "data"), but the training data was not distinct enough to separate these labels.
Due to the restriction to the same web host during crawling, we can be fairly certain that the labels of the crawled pages labeled as "data" are accurate. They might not be referring directly to a dataset, but are always related to an organization providing datasets.
For the classifier the next steps were:
- extraction of text from the training data (using JSoup in the crawler)
- split text into words, filtering stopwords and place names (see here)
- determine weight of words per document (TF-IDF)
- filter top N (5-50) most important words across all documents (chi-squared)
- train classifier to determine which words most significantly separate documents of both classes
  - a LinearSVC classifier was used, as it yielded the best accuracy (by a small margin) during evaluation of various classifiers
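The tokenizing, weighting and feature-selection steps above can be sketched with the Python standard library alone. This is an illustration, not the project's training script (which used scikit-learn). The stopword list, the toy documents and all function names are invented, place-name filtering is left out, and the final training step is not reproduced. The chi-squared score here is computed on term presence per document (a 2x2 contingency table), a simplification that is not identical to scikit-learn's `chi2`, which works on the feature values.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "for", "in", "is", "to", "on"}


def tokenize(text):
    """Lowercase, split into alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf(docs_tokens):
    """Per-document weights: term frequency times log inverse document frequency."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights


def chi2_scores(docs_tokens, labels, positive="data"):
    """Chi-squared statistic of term presence vs. class label for each term."""
    n = len(docs_tokens)
    doc_sets = [set(tokens) for tokens in docs_tokens]
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = n - n_pos
    scores = {}
    for term in set().union(*doc_sets):
        a = sum(1 for s, l in zip(doc_sets, labels) if term in s and l == positive)
        b = sum(1 for s, l in zip(doc_sets, labels) if term in s and l != positive)
        c, d = n_pos - a, n_neg - b  # documents of each class without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_terms(scores, n):
    """Highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Toy corpus, invented for illustration only.
corpus = [
    ("Download the dataset via the API", "data"),
    ("Open dataset portal and API", "data"),
    ("Football news and results", "unrelated"),
    ("Celebrity news of the week", "unrelated"),
]
docs_tokens = [tokenize(text) for text, _ in corpus]
labels = [label for _, label in corpus]
weights = tfidf(docs_tokens)
top = top_terms(chi2_scores(docs_tokens, labels), 3)
```

On this toy corpus the three selected terms are "api", "dataset" and "news": each appears in every document of one class and in none of the other. With real crawl data the selection is far noisier, which is where the overfitting discussed further down comes from.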
Hyperparameters (such as the number of selected features) were selected using a random search with 20 iterations.
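A random search of this kind can be sketched in a few lines. The search space and the scoring function below are assumptions made for illustration: only the number of selected features (5-50) and the 20 iterations come from the text, while the `C` parameter and `toy_score` are hypothetical stand-ins for the real parameters and the cross-validated accuracy.

```python
import math
import random


def random_search(score_fn, space, n_iter=20, seed=0):
    """Sample n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; "C" is an assumed regularization parameter.
space = {"n_features": list(range(5, 51, 5)), "C": [0.01, 0.1, 1.0, 10.0]}


def toy_score(params):
    # Stand-in for a cross-validated accuracy; peaks at n_features=30, C=1.0.
    return 1.0 - abs(params["n_features"] - 30) / 100 - abs(math.log10(params["C"])) / 10


best_params, best_score = random_search(toy_score, space)
```

Unlike a grid search, the cost is fixed by the number of iterations rather than by the size of the grid, at the price of possibly missing the best combination.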
The training script used can be found on the crawler-test-manual branch.
While the classifier reaches a high accuracy of 94% when validated on a held-out split of the training data, the results on real-world examples are lacking (roughly 70% accuracy in my tests). This is because the training and test data originate from the same crawls and are therefore correlated: pages from the same website can end up on both sides of the split.
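One way to get a less optimistic validation score is to split by host rather than by page, so that no website contributes pages to both the training and the test set. The sketch below is an assumption about how this could be done, not something the project implemented. `split_by_host` and the example URLs are hypothetical.

```python
import random
from urllib.parse import urlparse


def split_by_host(urls, test_fraction=0.25, seed=0):
    """Split URLs into train/test so that each host lands entirely on one side."""
    hosts = sorted({urlparse(u).hostname for u in urls})
    rng = random.Random(seed)
    rng.shuffle(hosts)
    n_test = max(1, round(len(hosts) * test_fraction))
    test_hosts = set(hosts[:n_test])
    train = [u for u in urls if urlparse(u).hostname not in test_hosts]
    test = [u for u in urls if urlparse(u).hostname in test_hosts]
    return train, test


# Hypothetical crawl results: four hosts with two pages each.
urls = [
    f"http://{host}.example/page{i}"
    for host in ("a", "b", "c", "d")
    for i in (1, 2)
]
train, test = split_by_host(urls)
```

scikit-learn offers group-aware splitters such as `GroupShuffleSplit` for the same purpose, with the host passed as the group.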
Each step can be considered a prototype with headroom for improvement, but in general the approach taken has the following high-level issues:
1. document language detection is not very accurate -> blurs the language model by including foreign vocabulary from falsely detected languages. (minor impact)
   - mitigation: improved language detector that also considers the HTML tags + domain (e.g. cld2)
2. training data of the "unrelated" group is /very/ unrelated: the interesting case of pages with similar keywords but unrelated context is not covered.
3. training data of the "data" group might not refer directly to a dataset, but is always related to an organization providing datasets.
4. training data consists of a small set of distinct websites, with lots of sub-pages with similar content: leads to overfitting of the classifier to these websites. (major impact)
   - mitigation: reduce number of pages per site, increase number of distinct sites in training data
5. simplified selection of the relevant features/words: unrelated words tend to be included which happen to be relevant just for the given training data (through overfitting). (major impact)
   - partly mitigated via better training data. A better feature-selection algorithm would help further, but requires more ML know-how
6. classification is based on the occurrence of single words, not combinations of words (N-grams)
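To illustrate the last point, word N-grams turn adjacent words into additional features, so that a phrase such as "air quality" can be weighted differently from its individual words. The helper `word_ngrams` below is a hypothetical sketch, not part of the project.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["air", "quality", "data", "download"]
features = word_ngrams(tokens)
# features == ["air", "quality", "data", "download",
#              "air quality", "quality data", "data download"]
```

In scikit-learn the same effect is available through the `ngram_range` parameter of the text vectorizers, for example `ngram_range=(1, 2)`. The vocabulary grows quickly with N, which makes the feature-selection step more important.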
For our hybrid content analysis use case, improving two metrics of the classifier would be most useful:
- reduce mis-classifications (avoid e.g. arxiv.org papers labeled as "data").
- improve accuracy on roughly relevant results (avoid labeling the landing page of usgs.gov as "data").
The focus should probably be on the first issue for now. To fix the first issue, we should fix point 4. To fix the second issue, we should increase accuracy in the labels (fix point 3), and draw a finer distinction between the classes by introducing more edge-case training data (fix point 2). Points 1, 5 & 6 would improve general performance, but are not as critical.
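The mitigation for point 4 (fewer pages per site) could be applied to the existing crawl results before retraining. The sketch below assumes the results are available as a list of URLs. The function `cap_pages_per_host`, the limit of 2 and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cap_pages_per_host(urls, max_per_host=2):
    """Keep at most max_per_host URLs per host, preserving the input order."""
    counts = defaultdict(int)
    kept = []
    for url in urls:
        host = urlparse(url).hostname
        if counts[host] < max_per_host:
            counts[host] += 1
            kept.append(url)
    return kept


# Hypothetical crawl results dominated by one site.
crawled = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
balanced = cap_pages_per_host(crawled)
```

Capping alone shrinks the training set, so it only helps in combination with the second half of the mitigation: adding more distinct sites.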