-
Hey Ian, this is really awesome! If I have time tomorrow I'll try importing it into LibreOffice Calc to do some analysis. Just from reading the text I can already see some hypotheses validated:
-
Bigrams. All char pairs; I deleted those with a count of zero. Percentage is the percentage of the total bigram count. Trigrams tomorrow. Treat as preliminary, since the corpus appears to have dropped vowels with diacritics.
-
I have replaced the char frequency file with version 2 in the first post above. There was a trailing space at the end of each paragraph, which inflated the 'space' count and created the 'space enter' bigram instead of the 'period enter' bigram. The revised bigram file will be uploaded when done ... it takes a while.
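For anyone wanting to reproduce the trailing-space effect, here is a minimal sketch. It assumes the corpus is a list of paragraph strings joined by newlines; the function name `bigram_counts` and the two sample paragraphs are made up for illustration and are not taken from the actual processing scripts.

```python
from collections import Counter

def bigram_counts(paragraphs, strip_trailing=True):
    """Count adjacent character pairs over paragraphs joined by newlines."""
    if strip_trailing:
        # Drop spaces at the end of each paragraph before joining.
        paragraphs = [p.rstrip(" ") for p in paragraphs]
    text = "\n".join(paragraphs) + "\n"
    # Pair every character with its successor.
    return Counter(a + b for a, b in zip(text, text[1:]))

# Illustrative paragraphs, each ending in a stray space.
paragraphs = ["Hola. ", "Adiós. "]
dirty = bigram_counts(paragraphs, strip_trailing=False)
clean = bigram_counts(paragraphs)

print("space+enter:", dirty[" \n"], "->", clean[" \n"])
print("period+enter:", dirty[".\n"], "->", clean[".\n"])
```

With the stray spaces left in, every paragraph end is counted as 'space enter'; after stripping, the same positions are counted as 'period enter'.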
-
@iandoug & @Lobo-Feroz -- Please alert me when you've got letter frequency and letter bigram frequency lists ready for me to run in place of the crypto lists I've been relying on. I need them in a spreadsheet or in the following format: letters24 = ['E','A','O','S','N','I','R','L','D','C','T','U','P',
-
@iandoug -- For my engram-es layout optimization code to make sense of single-letter and letter bigram frequencies, I need just the frequencies of case-insensitive letters or letter bigrams -- no spaces or non-letter characters. So I summed the cleaned-up Leipzig corpus counts for each letter across all of its variants, whatever the case or diacritical mark, for example: e 294897235. Doing the same by hand for your bigrams would take me forever, however.
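A short script might save the manual work on the bigrams. This is only a sketch, assuming the raw bigram counts are already loaded into a dict that maps two-character strings to counts. The names `fold` and `fold_bigram_counts` are hypothetical, and keeping ñ as its own letter is my assumption, not something taken from engram-es; pass `keep=()` to merge it into n.

```python
import unicodedata
from collections import Counter

def fold(ch, keep=("ñ",)):
    """Lowercase a character and strip its diacritics, except for letters in `keep`."""
    ch = ch.lower()
    if ch in keep:
        return ch
    # NFD splits an accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_bigram_counts(raw_counts, keep=("ñ",)):
    """Merge case and diacritic variants; drop bigrams containing non-letters."""
    folded = Counter()
    for bigram, count in raw_counts.items():
        if len(bigram) != 2:
            continue  # skip malformed entries
        a, b = fold(bigram[0], keep), fold(bigram[1], keep)
        if len(a) == 1 and len(b) == 1 and a.isalpha() and b.isalpha():
            folded[a + b] += count
    return folded
```

The same `fold` function can be applied to single-character counts, so the letter and bigram lists stay consistent with each other.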
-
Sign frequencies. Based on @iandoug's spanish-character-frequency-v1.ods, I have grouped the punctuation signs per key with their shifted versions. I was getting errors when sorting by "Average", so I made a second sheet in the ods for a VLOOKUP table. My Calc-fu is probably not up to par; I'm sure it could have been done in the first sheet.
The ods: spanish-character-frequency-v1 - signs - 20200820.ods
The result:
@binarybottle, I hope these can help us sort the layout of the punctuation signs.
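The same per-key grouping can also be done outside the spreadsheet. The sketch below is an assumption-laden illustration: `rank_keys` is a hypothetical name, the three base/shifted pairs are examples from a Spanish ISO keyboard rather than the full grouping used in the ods, and it ranks keys by the sum of both signs, which may differ from the "Average" column.

```python
def rank_keys(counts, key_pairs):
    """Sum base and shifted sign counts per key, sorted by total (descending)."""
    totals = []
    for base, shifted in key_pairs:
        # Signs missing from the frequency table count as zero.
        total = counts.get(base, 0) + counts.get(shifted, 0)
        totals.append((base, shifted, total))
    return sorted(totals, key=lambda t: t[2], reverse=True)

# Example pairs only (assumed Spanish ISO layout); extend as needed.
EXAMPLE_PAIRS = [(",", ";"), (".", ":"), ("-", "_")]
```

Calling `rank_keys(counts, EXAMPLE_PAIRS)` with `counts` read from the character-frequency file returns one `(base, shifted, total)` tuple per key, most frequent key first.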
-
Let me start a new thread; the others are getting long.
Character frequency from an analysis of the cleaned and paragraphed Leipzig corpus is attached. I think the licence allows posting the compiled corpus; I will upload it to Zenodo once I have done the bigrams and trigrams, probably only tomorrow.
Version 2:
leipzig-spanish-char-freq.txt
Cheers, Ian