Слава Україні! Support our Ukrainian friends..

A long, long time ago (to be exact – in February 2018) I prepared a Shiny app to explore the books of the most popular Project Gutenberg authors. The original app is still available (and bugged) via ShinyApps, and was inspired by this article. I first decided to port the unique words plot from Python to R, then utilised some readability scripts I’d written earlier, and all of a sudden it was all nice and Shiny. (ba-dum-tss) Also, it helped that I needed to create some kind of a project for my uni labs.

I recently digged back into the project due to my incessant interest in the concepts of readability tests and other measurable text features. I thought the app itself could use some improvements, and at least one of them could potentially be done without touching the existing codebase. Then there’s the issue with how to calculate the readability score – and stay sane.

### Dealing with the books galore

If you consider the “Books on this level” plot, there’s way too much data if you go for the B1 level:

This is due to the fact that there are way more books on the A2 - B2 levels than on any else or, putting it in the statistical lingo, the flesch.value vector data is skewed (left-skewed if you look at the flesch.value, and right-skewed if you consider the level). This becomes appearent when you group all the books by the level and count your chickens:

``````import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
``````
``````book_list = pd.read_csv('book_list.csv', sep=',', header=0)
``````
``````sns.set(style="ticks")
``````
``````.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}
``````

``````book_list.groupby('level').size()
``````
``````level
A1     14
A2    155
B1    187
B2     90
C1     22
C2     10
dtype: int64
``````

Another thing that becomes apparent when you plot the flesch.value (click for more info on that) against the average.goodreads.rating (pulled from the Goodreads API) is that it’s also left-skewed (right-skewed on the plot below because of the nature of the Y axis).

``````sns.jointplot(x='flesch.value',y='average.goodreads.rating',data=book_list,kind='hex')
``````
``````<seaborn.axisgrid.JointGrid at 0x7fd0b6787860>
``````

What I get from this is that, essentially, the rating may be a useless metric (“every product and business on earth is assigned a numerical rating from 1 to 5 on some site, but everything ends up at 3.7 +/- 0.1, so it means nothing, but we just keep doing it because our first goal is quantification as such, accuracy and precision be damned” to quote Gary Bernhardt). But I promised to myself not to dwell on the past and so I’m not changing the logic of the app, and I shall just pick fewer books for each grade level to make it look nicer on the actual plot. Also, these are the cream of the crop volumes, so maybe they’re just really good on average? Or maybe people are afraid to rate them too low?

But how do I pick only some of the books in such a way that I get a sample of each readability level (and Goodreads rating)? Given the plot above, there are some really hot regions (think 3.7 stars, 78 on the Flesch-Kincaid scale) but the rest could potentially stay as it is. As this is more for the demonstration purposes and not some actual sample-sized statistical research, the simplest way would be to first bucket the books in an even more fine manner (90-88 flesch.value, 88-86 flesch.value, and so on) and then from each such bucket, if there is more than one book, choose on the basis of their Goodreads rating (how varied it is).

``````bins_flesch = np.linspace(0,100,num=101)
cuts_flesch = pd.cut(book_list['flesch.value'], bins_flesch)
bins_rating = np.linspace(0,5,num=11)
cuts_rating = pd.cut(book_list['average.goodreads.rating'], bins_rating)
groups = book_list.groupby(by=[cuts_flesch, cuts_rating])
``````
``````groups.size().head()
``````
``````flesch.value  average.goodreads.rating
(26.0, 27.0]  (3.0, 3.5]                  1
(43.0, 44.0]  (3.5, 4.0]                  2
(44.0, 45.0]  (3.5, 4.0]                  1
(46.0, 47.0]  (3.5, 4.0]                  1
(47.0, 48.0]  (2.5, 3.0]                  1
dtype: int64
``````

Let’s assume that for each of these buckets, the goal would be to have a single element in it, and this element would be as close to the average of that particular bucket as possible. Note that this is going to be a pretty small data frame (I’m starting with <500 records) so performance isn’t an issue here. Normally you’d want to pre-allocate indices, and stuff like that.

``````res_df = pd.DataFrame(columns=list(book_list.columns.values))
``````
``````# https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array
def find_nearest(array, value):
array = np.asarray(array)
idx = (np.abs(array - value)).argmin()
return array[idx]
``````
``````els = []
for key, item in groups:
g = groups.get_group(key)
g_mean_choice = find_nearest(g['average.goodreads.rating'], g_mean)
g_choice = g[g['average.goodreads.rating'] == g_mean_choice]
f = [res_df, g_choice]
res_df = pd.concat(f)
``````
``````res_df.head()
``````
``````.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}
``````

``````cuts_flesch_2 = pd.cut(res_df['flesch.value'], bins_flesch)
cuts_rating_2 = pd.cut(res_df['average.goodreads.rating'], bins_rating)
groups = res_df.groupby(by=[cuts_flesch_2, cuts_rating_2])
``````
``````flesch.value  average.goodreads.rating
(26.0, 27.0]  (3.0, 3.5]                  1
(43.0, 44.0]  (3.5, 4.0]                  1
(44.0, 45.0]  (3.5, 4.0]                  1
(46.0, 47.0]  (3.5, 4.0]                  1
(47.0, 48.0]  (2.5, 3.0]                  1
dtype: int64
``````

Now there should be just 1 book per bucket. The final result is 142 books (`res_df.shape[0]`). Plugging this into the existing app gives us the following result:

Another thing I wanted to touch upon is the readability score calculation – which is pretty hard. To calculate the Flesch-Kincaid readability score you need to correctly parse the text into sentences, words and syllables. The raw implementation of the formula would be as follows:

``````def calculate_flesch_score(no_of_sentences, no_of_words, no_of_syllables):
return 206.835 - 1.015 * (no_of_words / no_of_sentences) - 84.6 * (no_of_syllables / no_of_words)
``````

Even leaving aside the issue of what is a word vs what is a sentence (take the Inuit languages with their extreme agglutination), determining the syllable count of a text is not that easy. You can use CMUDict, which is conveniently available via the nltk package, but there are a lot of words missing from there. Then there’s Pyphen but it’s pretty error-prone. I also came across the Readability package which also has its specific requirements pertaining to the pre-processing of the text.

Let’s see how all these perform on the benchmark of “Heart of Darkness”, one of the rare cases of the books I hate so much, any time I see a copy of it somewhere I want to throw it out of the closest window:

``````import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import cmudict
import pyphen
import re
``````
``````f = open('heart_of_darkness.txt')
``````
``````raw.find('The Nellie')
``````
``````646
``````
``````raw.find('End of the Project Gutenberg EBook of Heart of Darkness, by Joseph Conrad')
``````
``````211098
``````
``````raw = raw[646:211098]
``````
``````lower_parsed = re.sub('[^0-9a-zA-Z]+', ' ', raw).lower()
``````
``````words = word_tokenize(lower_parsed)
``````
``````len(words)
``````
``````39064
``````
``````unique_words = set(words)
``````
``````len(unique_words)
``````
``````5727
``````
``````sentences = sent_tokenize(raw)
``````
``````len(sentences)
``````
``````2417
``````

This gives me 39064 words (5727 of them unique) and 2417 sentences. Let’s see how many of these words can be found in the CMUdict:

``````cmudict = cmudict.dict()
``````
``````no_dict = []
for word in unique_words:
if word not in cmudict.keys():
no_dict.append(word)
``````
``````len(no_dict)
``````
``````359
``````

This means that 359 unique words are not accounted for in the CMUdict. Let’s see how often they appear in the whole book:

``````c = 0
for unique_word in no_dict:
for word in words:
if unique_word == word:
c += 1
print(c)
``````
``````507
``````

That’s a total of 507 words out of 39064, or ~1.3% of all the words in the book. Since this is a quick and dirty solution let’s say I can live with this fact, and move on. The plan is as follows:

1. create a lookup dictionary based off the stress patterns in the CMUdict,
2. calculate the number of syllables in the whole book based off the lookup dictionary.

The join-fu and error handling is due to the CMUdict sometimes having two different pronunciation of the same word (obviously, but not useful to me at that moment).

``````lookup_dict = {}
reg = re.compile(r'[^\d]+')

for word in unique_words:
if word not in cmudict.keys():
lookup_dict[word] = 1
else:
try:
lookup_dict[word] = len(reg.sub('', ''.join(cmudict[word][0])))
except TypeError:
lookup_dict[word] = len(reg.sub('', ''.join(cmudict[word][0][0])))
``````
``````n_syl = 0
for word in words:
n_syl += lookup_dict[word]
print(n_syl)
``````
``````52487
``````

52487 sounds a bit low, but let’s see how this translates into the score:

``````print(calculate_flesch_score(len(sentences), len(words), n_syl))
``````
``````76.76050250923326
``````

Another implementation, now fetching the syllable count with Pyphen:

``````import pyphen
dic = pyphen.Pyphen(lang='en_UK') # In reality, Conrad was a Pole.
``````

Pyphen returns a hyphenated string; I’m going to count these hyphens and add 1 to get the predicted number of syllables.

``````lookup_dict = {}
reg = re.compile(r'[^-]+')

for word in unique_words:
lookup_dict[word] = len(reg.sub('', word))+1
``````
``````n_syl = 0
for word in words:
n_syl += lookup_dict[word]
print(n_syl)
``````
``````39064
``````

This gives me 39064 syllables, which is even more dubious.

``````print(calculate_flesch_score(len(sentences), len(words), n_syl))
``````
``````105.8303827058337
``````

This makes no sense whatsover. A score of 105 would be for sentences in the style of “I see you.” (essentially, a lot of single-syllable words in really short sentences), and this book is far from that. Let’s see the final implementation via the Readability package:

``````import readability
``````
``````sentences = [s.replace('\n', ' ') for s in sentences]
text = '\n'.join(sentences)
``````
``````results = readability.getmeasures(text, lang='en')
``````
``````print(results['readability grades']['FleschReadingEase'])
``````
``````82.8232249133162
``````

The score of almost 83 again seems a bit too high but it’s at least close to the one calculated with the CMU dictionary. Finally, let’s see how the treetagger-based implementation performed last year:

``````book_list[book_list['title']=='Heart of Darkness']
``````
``````.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}
``````

The treetagger version has a similar score of ~78. Maybe I just really hate this book…