So, where do all the numbers used in Plotting Plots come from? It’s an important question to ask. Knowing how the data are created will give you a more confident understanding of how they can (and cannot) be used.
Allow me to share.
I start with digital versions of the books, and a little computer program I wrote. I break up the books into parts (usually chapters, but sometimes stanzas, scenes, or sections) and the program tallies the author’s use of every single word in the text. I then save the tallies as a spreadsheet, which I use to power the visualization tools you see.
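For the curious, the tallying step can be sketched in a few lines of Python. This is only an illustration, not the actual program: the function names `tally_parts` and `to_csv` are made up for this example, and splitting words on whitespace is a simplifying assumption.

```python
import csv
import io
from collections import Counter


def tally_parts(parts):
    """Count word use in each part.

    `parts` maps a part label (e.g. a chapter title) to that part's text.
    Words are lowercased and split on whitespace, which is an assumption
    made for this sketch.
    """
    return {label: Counter(text.lower().split()) for label, text in parts.items()}


def to_csv(tallies):
    """Return spreadsheet-style CSV text: one row per word, one column per part."""
    labels = list(tallies)
    # Every word that appears in any part, in alphabetical order.
    words = sorted(set().union(*tallies.values()))
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["word"] + labels)
    for word in words:
        # A Counter returns 0 for words a part never uses.
        writer.writerow([word] + [tallies[label][word] for label in labels])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {"Chapter 1": "the cat saw the dog", "Chapter 2": "the dog ran"}
    print(to_csv(tally_parts(sample)))
```

Each row of the resulting table is a word and each column is a part of the book, which is one plausible layout for a spreadsheet that feeds a chart.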
There are a few things to know about the data themselves. First, text data are unstructured by nature and therefore sometimes very tricky to work with. I manually inspect and prepare every digital version of a text before running it through my program. I also manually inspect the data produced, as all sorts of odd things can happen. For example, words sometimes get merged with punctuation, or characters like apostrophes get converted into gibberish. My faith in computers alone, without human guidance, is weak.
Unless otherwise noted, I also remove numerals and punctuation. A frequent exception is the apostrophe, which appears regularly in contractions like “ain’t.”
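A cleaning rule along these lines could look like the Python sketch below. It is an assumption about one reasonable way to do it, not the actual program: the name `clean_words` and the exact pattern are hypothetical. It keeps runs of letters, allows a straight or curly apostrophe inside a word, and drops digits and all other punctuation.

```python
import re

# Runs of letters, optionally joined by a straight (') or curly (’) apostrophe.
# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a letter, including accented letters.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def clean_words(text):
    """Return the lowercased words in `text`, without digits or punctuation.

    Apostrophes are kept only when they sit between letters, so
    contractions survive but quotation marks do not.
    """
    return WORD_PATTERN.findall(text.lower())


if __name__ == "__main__":
    print(clean_words("Ain’t it 1851? Call me Ishmael."))
```

One consequence of this particular pattern is that a leading apostrophe, as in “’tis,” is dropped, which is the kind of edge case that still calls for a manual check.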
The process of quantifying text is never perfect. But it also doesn’t need to be. All that is needed is enough care to make most of the text visible and accurate as numbers, so readers can use the data to enjoy exploring texts in a new way. And that’s what I aim for.
That said, if you ever notice anything that seems incorrect or glitchy, please tell me via the messaging feature on the website or on social media. Most things can be easily adjusted and improved. I appreciate any and all feedback.