- Find the 10,000 most frequently used words in a language
- Create a basic app that displays 4 pictures for any of these words
- Fill the app with default Google search images as a starting point
- Improve the word-to-image matching with various heuristic and empirical methods
- Write down all the cool things we learned along the way
We will use German as the target language. The 10,000 most frequent words will be extracted by analyzing the text of the German Wikipedia. You can find it here, or more specifically here.
Please click the Watch button to get notifications from this project. If you have any questions or problems, use Issues.
Download the wiki bz2 file into the text-analysis directory of this repo, but please do NOT add it to the project: it is 6.4 gigabytes!
cd text-analysis
./xml.bz2-to-text.gz.sh dewiki-20170920-pages-articles.xml.bz2
- This extracts the raw text from Wikipedia into a file named dewiki-20170920-pages-articles.text.gz
- you can view the content of this file with:
gzcat dewiki-20170920-pages-articles.text.gz | head
- Note that the above processing takes around 20 minutes on a MacBook Pro.
./text.gz-to-words.gz.sh dewiki-20170920-pages-articles.text.gz
- This converts the text into a stream of words, one word per line. It also removes some undesired words. This reduces the time needed by the later processing scripts.
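As a rough illustration of the word-splitting step, here is a minimal sketch. It is not the contents of `text.gz-to-words.gz.sh`: the function name `to_words` is hypothetical, and it assumes that splitting on whitespace, ASCII punctuation, and digits is good enough. The filtering of undesired words done by the real script is not reproduced.

```shell
# Hypothetical helper: read text on stdin, write one word per line.
to_words() {
  # Turn every run of whitespace, punctuation, or digits into a single
  # newline. Under a C or UTF-8 locale, letters such as ä, ö, ü, ß pass
  # through unchanged.
  tr -s '[:space:][:punct:][:digit:]' '\n' |
    # Drop the empty line left behind when the input starts with a delimiter.
    sed '/^$/d'
}

# Possible usage with the file names from the steps above:
# gzip -dc dewiki-20170920-pages-articles.text.gz | to_words \
#   | gzip > dewiki-20170920-pages-articles.words.gz
```

Case is deliberately left untouched here, since later steps rely on the capitalization of German nouns.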
./words.gz-to-freq.tsv dewiki-20170920-pages-articles.words.gz
- This creates dewiki-20170920-pages-articles.freq.tsv, a list of the 10k most frequent words in German.
- Note that the previous list was incorrect because uppercase ÄÜÖ were handled wrongly.
- Improved the word counter to preserve the title case of German nouns.
- We can see that 5787 words are nouns!
cut -f2 dewiki-20170920-pages-articles.freq.tsv | grep -cE '^[A-ZÄÖÜ]'
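The counting step that produces the frequency list could look roughly like the sketch below. The function name `top_words` is hypothetical, and the output layout (count in column 1, word in column 2) is an assumption inferred only from the `cut -f2` call above; the real file may store something else in its first column.

```shell
# Hypothetical helper: read one word per line on stdin and print the N most
# frequent words as "count<TAB>word", most frequent first.
top_words() {
  n=${1:-10000}
  # Count exact, case-sensitive matches so "Morgen" and "morgen" stay apart.
  awk '{ count[$0]++ } END { for (w in count) printf "%d\t%s\n", count[w], w }' |
    # Sort by count (descending), then by word bytes so ties are deterministic.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 |
    head -n "$n"
}

# Possible usage with the file names from the steps above:
# gzip -dc dewiki-20170920-pages-articles.words.gz | top_words 10000
```

Keeping the comparison case-sensitive is what makes counting capitalized entries meaningful afterwards.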
- Did a partial automated word translation using the standard Unix dict
- 4401 words were not translated, mostly because they were in the accusative or dative case; a better dictionary is needed
- dewiki-20170920-pages-articles.prev.txt contains the 10 most frequent preceding words for each of our 10k words.
- dewiki-20170920-pages-articles.next.txt contains the 10 most frequent following words for each of our 10k words.
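One way to compute such neighbor lists from the word stream is sketched below. This is an illustration under assumptions, not the method used to build the files above: the function name `top_prev` and the output layout (word, previous word, count) are hypothetical, and the sketch ignores sentence boundaries and does not restrict itself to the 10k list.

```shell
# Hypothetical helper: read one word per line on stdin and print, for each
# word, its K most frequent predecessors as "word<TAB>previous<TAB>count".
top_prev() {
  k=${1:-10}
  tab=$(printf '\t')
  # Count each (word, previous word) pair in the stream.
  awk 'NR > 1 { pair[$0 "\t" prev]++ } { prev = $0 }
       END { for (p in pair) printf "%s\t%d\n", p, pair[p] }' |
    # Group by word, then order each group by count (descending), then by
    # predecessor so ties are deterministic.
    LC_ALL=C sort -t "$tab" -k1,1 -k3,3nr -k2,2 |
    # Keep only the first K rows of each group.
    awk -F '\t' -v k="$k" '$1 != cur { cur = $1; seen = 0 } seen++ < k'
}
```

Feeding the same function the word stream in reverse order (for example through `tac`, where it is available) yields the most frequent following words instead.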