Everyone has their own personal reading history, and as we grow up, we are able to comprehend more and more complicated texts. But how do you estimate the level of difficulty of a given text, and how do you teach a computer to do that? In this project, you will find it out: write a program that determines how difficult the text is and for which age it is most suitable. Learn to call programs from the command-line and get more experience with using regexes and Math class.
Result: https://imgur.com/zadJPI0
Ever wonder how to measure if the text is hard or easy to read? For a human this task is pretty simple: you just read the text and feel if you're struggling or not. But how to teach a computer do that? In this project, you will write such a program.
Firstly, let's create a simple program. If the text contains more than 100 symbols (including spacebars and punctuation) then the text is considered hard to read. Else, the text is considered easy to read. If the text contains exactly 100 symbols then it is still easy to read.
The input contains a single line with text. Output "HARD" if the text is hard to read and "EASY" if the text is easy to read.
Examples The symbol > represents the user input. Notice that it's not the part of the input.
Example 1
This text is simple to read! EASY
Example 2
This text is hard to read. It contains a lot of sentences as well as a lot of words in each sentence. HARD
But the real text can be pretty long and still can be easy to read isn't it? There need to be done another approach. How about calculating a number of words in each sentence? Clearly, if each sentence of the text contains a lot of words then this text is hard to read.
Let's suppose that if the text contains on average more than 10 words per sentence, then this text is hard to read. Otherwise, this text is easy to read. Don't worry, we will consider more scientific ways in the next stages.
The input contains a single line with text. Output "HARD" if the text is hard to read and "EASY" if the text is easy to read.
For example, the first example contains two sentences with 6 and 10 words (numbers also counts as words) so the average is 8 and this is less than 10. In the second example, there is also 2 sentences but with 6 and 16 words so the average is 11 and this is greater than 10. If the average is equal to 10 then the text is still considered easy to read.
Don't forget that the sentences can end with a dot as well as with an exclamation mark and a question mark. But the last sentence can be with or without a punctuation character at the end.
Examples The symbol > represents the user input. Notice that it's not the part of the input.
Example 1
This text is simple to read! It has on average less than 10 words per sentence. EASY
Example 2
This text is hard to read. It contains a lot of sentenses as well as a lot of words in each sentence HARD
In this stage, you will program the Automated readability index. It was introduced in 1968 and a lot of research works rely on this. The index is calculated by the following ARI formula (https://en.wikipedia.org/wiki/Automated_readability_index).
You can look at different ages corresponding to the different scores by the table in this article.
Also, your program should read a file instead of typing a text manually. You should pass the filename through the command line arguments.
The program should output the score itself and an approximate age needed to comprehend the text.
Use the appropriate rounding function to calculate the score as integer.
You should also print how many characters, words, and sentences the text has.
The number of characters is any visible symbol (so, in the real text it's everything except space, newline "\n" and tab "\t").
Notice, that the text can contain multiple lines, not just a single line like in the previous stages. You should analyze all the lines.
Examples The symbol > represents the user input. Notice that it's not the part of the input.
Example 1
java Main in.txt The text is: Readability is the ease with which a reader can understand a written text. In natural language, the readability of text depends on its content and its presentation. Researchers have used various factors to measure readability. Readability is more than simply legibility, which is a measure of how easily a reader can distinguish individual letters or characters from each other. Higher readability eases reading effort and speed for any reader, but it is especially important for those who do not have high reading comprehension. In readers with poor reading comprehension, raising the readability level of a text from mediocre to good can make the difference between success and failure
Words: 108 Sentences: 6 Characters: 580 The score is: 12.86 This text should be understood by 18-24 year olds.
Example 2
java Main in.txt The text is: This is the page of the Simple English Wikipedia. A place where people work together to write encyclopedias in different languages. That includes children and adults who are learning English. There are 142,262 articles on the Simple English Wikipedia. All of the pages are free to use. They have all been published under both the Creative Commons License 3 and the GNU Free Documentation License. You can help here! You may change these pages and make new pages. Read the help pages and other good pages to learn how to write pages here. You may ask questions at Simple talk.
Words: 100 Sentences: 10 Characters: 476 The score is: 5.98 This text should be understood by 11-12 year olds.
In this stage, you should implement various other scientific approaches to calculate a readability score.
Take a look at different ages and corresponding scores in the table in this article (https://en.wikipedia.org/wiki/Automated_readability_index). This table is suitable for all the algorithms described in this stage. To calculate the age use the upper bound of the range. For example, if the range is 12-13 year olds then it's upper bound is 13 year olds.
The first algorithm is Flesch–Kincaid readability tests. First, you need to create a method that calculates a number of syllables in a word. The formula is given below. You can find more information here (https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests). You can use the second formula to calculate the index; it allows you to easily calculate the age of a person using the same table from the Automated Readability Index.
The second one is SMOG index. It stands for Simple Measure of Gobbledygook. To calculate it, you need to count the number of polysyllables which is the number of words with more than 2 syllables. The formula is shown below. You can find out more here (https://en.wikipedia.org/wiki/SMOG). The Wikipedia page says that at least 30 sentences are required for this index to work properly. Don't pay attention to this, just keep it in mind when you use this index in real life. As in the previous example, the grade level is calculated here, so to get the age of a person you need to use the table from the first link.
The next one is Coleman–Liau index. The formula is given below. For more information read this (https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index). L is the average number of characters per 100 words and S is the average number of sentences per 100 words. Like all other indices, the output is a person's grade level. Like all other indices, the result is a minimum grade level required to understand this text.
So, in this stage, you should program all three approaches. Don't forget about the Automated readability index! Also, there should be an option to choose all methods at the same time.
To count the number of syllables you should use letters a, e, i, o, u, y as vowels. You can see here (https://simple.wikipedia.org/wiki/Vowel) examples and intricacies with determining vowels in a word with 100% accuracy. So, let's use the following 4 rules:
- Count the number of vowels in the word.
- Do not count double-vowels (for example, "rain" has 2 vowels but only 1 syllable).
- If the last letter in the word is 'e' do not count it as a vowel (for example, "side" has 1 syllable).
- If at the end it turns out that the word contains 0 vowels, then consider this word as a 1-syllable one.
Example The symbol > represents the user input. Notice that it's not the part of the input.
java Main in.txt The text is: This is the front page of the Simple English Wikipedia. Wikipedias are places where people work together to write encyclopedias in different languages. We use Simple English words and grammar here. The Simple English Wikipedia is for everyone! That includes children and adults who are learning English. There are 142,262 articles on the Simple English Wikipedia. All of the pages are free to use. They have all been published under both the Creative Commons License and the GNU Free Documentation License. You can help here! You may change these pages and make new pages. Read the help pages and other good pages to learn how to write pages here. If you need help, you may ask questions at Simple talk. Use Basic English vocabulary and shorter sentences. This allows people to understand normally complex terms or phrases.
Words: 137 Sentences: 14 Characters: 687 Syllables: 210 Polysyllables: 17 Enter the score you want to calculate (ARI, FK, SMOG, CL, all): all
Automated Readability Index: 7.08 (about 13 year olds). Flesch–Kincaid readability tests: 6.31 (about 12 year olds). Simple Measure of Gobbledygook: 9.42 (about 15 year olds). Coleman–Liau index: 10.66 (about 17 year olds).
This text should be understood in average by 14.25 year olds.