Zipfian Patterns in Books
Scope & Motivation
The Zipf pattern is one of the most intriguing mathematical relations to me, but when it is applied to linguistics, most research targets very large text corpora rather than individual books. Research has found that the strength of the Zipfian pattern differs across genres of text corpora, going from highest to lowest as: scientific publications -> non-fiction books -> fiction/poetry/religious texts. That makes perfect sense considering that George Zipf originally predicted the pattern for human speech that succeeds in conveying meaning.
I'm attempting to investigate whether the variance in Zipf pattern strength between individual books is:
- large enough to allow us to predict the genre of a book
- correlated with how successfully the book conveys meaning (how readable it is)
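To make "Zipf pattern strength" concrete, here is a minimal sketch of one possible measure. It assumes strength means how well a straight line fits the rank-frequency curve on log-log axes (reported as R², next to the fitted exponent). The function name `zipf_fit`, the simple regex tokenizer, and the choice of least squares are all illustrative assumptions, not the method this project has settled on.

```python
import math
import re
from collections import Counter


def zipf_fit(text):
    """Fit log(frequency) = a - s * log(rank) by ordinary least squares.

    Returns (s, r_squared). Both the tokenizer and the use of R^2 as a
    "strength" score are assumptions made for illustration.
    """
    # Naive tokenizer (assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    # If every word has the same frequency there is no variance to explain,
    # so report 0.0 rather than dividing by zero.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return -slope, r_squared
```

With this sketch, each book would yield an `(exponent, r_squared)` pair that could be compared across genres or against a readability score. Least squares on log-log data is a simple estimator and is sensitive to the long tail of rare words, so the numbers are best treated as a rough first pass.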
Current Direction
Currently working on a literature review and replicating the results of some existing research.
I have collected a large number of books in digital form and am currently writing algorithms to strip the files of any extra text that was not written by the author. This is tedious but important, because that additional text can skew the Zipf strength calculation significantly: individual books are much smaller than the text corpora typically used in research like this.
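As a rough illustration of the stripping step, the sketch below keeps only the text between a start marker and an end marker. The marker patterns are hypothetical, loosely modeled on the "*** START OF ..." / "*** END OF ..." lines found in some public-domain ebook files. The real files in this collection may need different, per-source rules, and the name `strip_wrapper` is an assumption made for this example.

```python
import re

# Hypothetical marker patterns (assumption); adjust per source of ebooks.
START_MARKER = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_wrapper(raw, start_re=START_MARKER, end_re=END_MARKER):
    """Return the text between the first start marker and the next end marker.

    If a marker is missing, fall back to the beginning or end of the file,
    so unmarked files pass through unchanged (apart from outer whitespace).
    """
    start = start_re.search(raw)
    begin = start.end() if start else 0
    # Only look for the end marker after the point where the body begins.
    end = end_re.search(raw, begin)
    stop = end.start() if end else len(raw)
    return raw[begin:stop].strip()
```

For example, `strip_wrapper("License\n*** START OF THE BOOK ***\nCall me Ishmael.\n*** END OF THE BOOK ***\nMore license")` returns `"Call me Ishmael."`. Front matter inside the markers (title pages, transcriber notes, tables of contents) would still need separate handling.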