Use the GitHub API to update the list of tools and extensions related to TaskWarrior. The list is displayed on the web site: http://taskwarrior.org/tools/
This is linked to the project for the future Tools page: http://brunovernay.github.io/taskwarrior-site-test/
The idea is to search GitHub for projects related to TaskWarrior and use the results to update the list of tools displayed on the TaskWarrior site.
The project started in Java, but I created a Python branch, as Python is more idiomatic to the TaskWarrior community. It should be compatible with Python 2 and 3 (http://pythonclock.org/).
I use PyGithub (https://github.com/PyGithub/PyGithub). There are many Python projects for working with GitHub, and even a book, Mining the Social Web.
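As a rough sketch of the search step, assuming PyGithub's `Github.search_repositories` call and a bare `taskwarrior` keyword as the query (the query actually used by this project may differ). The names `make_client` and `find_repositories` are hypothetical. The search function accepts any client object that exposes `search_repositories`, so it can be tried without network access.

```python
def make_client(token):
    """Build a PyGithub client. Imported lazily so the rest runs without PyGithub."""
    from github import Github  # pip install PyGithub
    return Github(token)


def find_repositories(client, keyword="taskwarrior", limit=None):
    """Return (full_name, html_url) pairs for repositories matching keyword.

    client is expected to behave like a PyGithub Github instance.
    The keyword is an assumption, not necessarily the project's real query.
    """
    found = []
    for repo in client.search_repositories(query=keyword):
        found.append((repo.full_name, repo.html_url))
        # Stop early if the caller only wants a few results.
        if limit is not None and len(found) >= limit:
            break
    return found
```

With a real token this could be called as `find_repositories(make_client(token), limit=10)`.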
- Copy the example configuration, then edit Config.py with your GitHub token:
    cp Config.py.example Config.py
- The old tool list is in data-tools-old.json
- Run the update (takes about 5 min):
    python3 Main.py > log-$(date -Iminutes).txt
- The new data is in data-tools.json
- It works
- We still have to set the category manually
- There is no API yet to get the license (GitHub is working on it)
- You have to enter your GitHub token (https://github.com/settings/tokens), because of the number of requests required.
- It only covers GitHub projects currently (BitBucket maybe one day ...)
- We might apply a diff after the update, to keep manual changes
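One possible way to keep manual changes, simpler than a real diff: copy the hand-edited fields from the old list into the new one. This is only a sketch. It assumes that both JSON files hold a list of objects with the fields of the tool list, that `category` is the only manual field, and that entries can be matched on `url_src`. `merge_tool_lists` and `MANUAL_FIELDS` are hypothetical names.

```python
MANUAL_FIELDS = ("category",)  # assumption: the fields edited by hand


def merge_tool_lists(old_tools, new_tools, manual_fields=MANUAL_FIELDS):
    """Return a copy of new_tools with manual fields taken from old_tools.

    Entries are matched on url_src. Tools that are not in the old list
    are kept unchanged.
    """
    old_by_url = {tool["url_src"]: tool for tool in old_tools}
    merged = []
    for tool in new_tools:
        entry = dict(tool)  # do not modify the caller's data
        previous = old_by_url.get(tool["url_src"], {})
        for field in manual_fields:
            if previous.get(field):
                entry[field] = previous[field]
        merged.append(entry)
    return merged
```

The old and new lists could be read with `json.load` from data-tools-old.json and data-tools.json before calling `merge_tool_lists(old, new)`.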
Note:
- The description is plain text, no HTML.
- Names can be duplicated, so I use url_src as the unique identifier. But some projects changed URL: for example, xtw changed its login name, so its URL is different. In that case I output a warning and create a duplicate entry.
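The duplicate check described in the note could look roughly like this. It is a sketch that assumes each tool is a dict with `name` and `url_src` keys. `find_duplicate_names` is a hypothetical name, not necessarily what Main.py does.

```python
import logging
from collections import defaultdict

log = logging.getLogger("tools")


def find_duplicate_names(tools):
    """Map each name used by more than one url_src to its sorted URLs."""
    urls_by_name = defaultdict(set)
    for tool in tools:
        urls_by_name[tool["name"]].add(tool["url_src"])
    duplicates = {
        name: sorted(urls)
        for name, urls in urls_by_name.items()
        if len(urls) > 1
    }
    # Only warn: the entries are kept, since url_src is the identifier.
    for name, urls in duplicates.items():
        log.warning("duplicate name %r: %s", name, ", ".join(urls))
    return duplicates
```

A project that moved to another login shows up here as one name with two URLs, which is the case that has to be resolved by hand.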
The mapping (tool list field: GitHub repository field):
- category: manual
- name: name
- description: description
- url: homepage
- url_src: html_url
- license: ???
- language: language (only the primary language, we have to request languages_url to know more)
- author: owner/login (+ collaborators, contributors, teams ...). We have to make multiple requests to get the real name instead of the login.
- theme: best guess from description
- verified: today
- last_update: updated_at (pushed_at would be more conservative, but would miss commits in non-master branches)
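The mapping above can be sketched as a single function. It assumes a repository object with PyGithub-style attributes (`name`, `description`, `homepage`, `html_url`, `language`, `owner.login`, `updated_at` as a datetime). The function name `repo_to_tool`, the empty-string defaults and the ISO date format are assumptions, not taken from the project.

```python
from datetime import date, datetime
from types import SimpleNamespace


def repo_to_tool(repo, today=None):
    """Map a PyGithub-like repository object to a tool-list entry."""
    today = today or date.today()
    return {
        "category": "",                       # set manually
        "name": repo.name,
        "description": repo.description or "",
        "url": repo.homepage or "",
        "url_src": repo.html_url,
        "license": "",                        # no source for this yet
        "language": repo.language or "",      # primary language only
        "author": repo.owner.login,           # login, not the real name
        "theme": "",                          # guessing is not covered here
        "verified": today.isoformat(),
        "last_update": repo.updated_at.date().isoformat(),
    }


# Made-up repository standing in for a real search result.
example = SimpleNamespace(
    name="demo",
    description=None,
    homepage="",
    html_url="https://github.com/someone/demo",
    language="Python",
    owner=SimpleNamespace(login="someone"),
    updated_at=datetime(2016, 1, 2, 3, 4),
)
entry = repo_to_tool(example, today=date(2016, 1, 3))
```

Passing `today` explicitly keeps the function easy to test. In normal use it can be left out.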
I fetch every README in order to perform some machine learning. The first idea would be to classify the tools by category. The obvious Python library seems to be scikit-learn (SciKit). There is also the NLTK library, which is more active, but since I only need simple text feature extraction and no complex natural language processing, I will stick to scikit-learn. Some references: