Machine Learning (ML) is one of our strongest statistical tools, one that can be harnessed for crucial, even life-saving, tasks like predicting cancer. ML algorithms take large, complex data sets, identify patterns, train on a set of cases where the outcome is known (e.g., whether a skin spot is cancerous), and then use what they learned on the training set to predict new, unknown cases. It's a powerhouse of a tool. And like most statistical tools, we can also have some fun with it.
Just as ML can identify skin patterns, it can work with words. ML can look at a set of annotated texts (e.g., texts for which we know the author and "teach" that to the algorithm) and then use that knowledge to predict whether a new text was written by a specific author. For example, Julia Silge, one of the authors of the tidytext package, used ML to predict whether lines from books were from Jane Austen's Pride and Prejudice or from H. G. Wells's The War of the Worlds. See her analysis, which served as the basis for this blog post, here.
Several factors work to Silge's advantage in that analysis. First, the texts she used, while not huge compared to other ML projects, are still rather long, as she looked at full books. Second, the authors she examined, and the specific pieces she chose, are quite different from each other: one is a romantic novel about emotional development, the other a science-fiction tale about the destruction of Earth. One would expect the two to use very different vocabularies.
As a first step, I downloaded the lyrics to 15 Beatles albums (all the studio albums, plus "1", a compilation of singles, most of which were not released on albums) and 23 studio albums by the Rolling Stones, using the R package geniusR. I saved the lyrics, split into individual lines, in a tidy format.
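The download step can be sketched roughly like this. This is an illustration, not the exact original code: it assumes the genius_album() interface from the geniusR package (later republished on CRAN as genius), and the two album titles are just examples standing in for the full list of 38 albums.

```r
# Rough sketch of the download step, assuming the geniusR/genius interface:
# genius_album() fetches one album's lyrics from Genius and returns a tidy
# tibble with (approximately) one row per lyric line.
library(genius)
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Illustrative subset of the albums; the real run covered 15 + 23 albums.
albums <- tribble(
  ~artist,              ~album,
  "The Beatles",        "Abbey Road",
  "The Rolling Stones", "Let It Bleed"
)

# One genius_album() call per (artist, album) pair, then flatten into a
# single tidy data frame of song lines.
lyrics <- albums %>%
  mutate(tracks = map2(artist, album, genius_album)) %>%
  unnest(tracks)
```

Note that this fetches data over the network from Genius, so results depend on the site's current state.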
After preparing the data for machine learning (removing stopwords and creating training and test data sets, among other things), I fit a generalized linear model via penalized maximum likelihood (the glmnet package in R).
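In outline, that preparation-plus-modeling step looks something like the following. This is a hedged sketch of the usual tidytext + glmnet workflow, not the original code: it assumes a tidy data frame `lyrics` with one row per song line and (hypothetical) columns `line_id`, `band`, and `text`.

```r
# Sketch of the modeling step, assuming a tidy data frame `lyrics` with
# columns line_id, band, text (names are illustrative).
library(dplyr)
library(tidytext)
library(glmnet)

# Tokenize into words and drop stopwords.
word_counts <- lyrics %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(line_id, word)

# Sparse document-term matrix: one row per song line, one column per word.
dtm <- word_counts %>%
  cast_sparse(line_id, word, n)

# Outcome vector (which band wrote each line), aligned with the matrix rows.
y <- lyrics$band[match(rownames(dtm), lyrics$line_id)]

# Logistic regression with a lasso penalty; the penalty strength lambda
# is chosen by cross-validation.
fit <- cv.glmnet(dtm, y == "Beatles", family = "binomial")
```

In practice you would fit on a training subset of the lines and evaluate on a held-out test set, as described above.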
As this little analysis is just for fun, I will not go into the technical details or the code itself. Suffice it to say that the model did pretty well at identifying Rolling Stones songs (actually, lines from songs), and not so well with Beatles songs. Out of the 2083 song lines by the Rolling Stones, 1762 were predicted to be Stones lines (84.6%). But for the Beatles, only 960 out of 1641 lines were correctly attributed to the Fab Four (58.5%).
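The per-band numbers above are just the diagonal of a two-by-two confusion matrix divided by its row sums. As a quick base-R check, using the counts reported in this post:

```r
# Confusion matrix built from the counts reported above:
# rows = actual band, columns = predicted band.
confusion <- matrix(
  c(1762, 2083 - 1762,   # actual Stones lines:  predicted Stones / Beatles
    1641 - 960, 960),    # actual Beatles lines: predicted Stones / Beatles
  nrow = 2, byrow = TRUE,
  dimnames = list(actual    = c("Stones", "Beatles"),
                  predicted = c("Stones", "Beatles"))
)

# Per-class accuracy: correct predictions divided by the number of
# actual lines for each band.
per_class_accuracy <- diag(confusion) / rowSums(confusion)
round(100 * per_class_accuracy, 1)   # roughly 84.6 for the Stones, 58.5 for the Beatles
```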
Here are the words with the strongest coefficients for each band. The top blue words make it very likely that a line came from the Stones; the top red words strongly predict the Beatles:
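A plot like this comes straight out of the model's coefficients. Here is a sketch of how one might extract them, assuming the fitted cv.glmnet object from the modeling step is called `fit` (a hypothetical name) and that Beatles lines were coded as the positive class:

```r
# Sketch: pull the strongest coefficients out of a fitted cv.glmnet model.
# Assumes `fit` is the cv.glmnet object and Beatles was coded as 1, so
# positive estimates push a line toward the Beatles, negative toward the Stones.
library(dplyr)
library(tibble)

coefs <- coef(fit, s = "lambda.1se")   # sparse column of coefficients

top_words <- tibble(
  word     = rownames(coefs),
  estimate = as.numeric(coefs)
) %>%
  filter(word != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate))) %>%
  slice_head(n = 20)
```

Feeding `top_words` into ggplot2, with the bars colored by the sign of the estimate, gives the kind of figure shown here.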
Here are a few examples of Beatles lines that were misclassified as Rolling Stones lines: "Cause I couldn't really stand it", "Well, the Ukraine girls really knock me out". And here are a few Stones lines misclassified as Beatles lines: "Why do you hide, why do you hide your love?", "She was born in an arctic zone", "Yeah, and he said one word to me, and that was death" (why on earth would that be a Beatles song?).
A few comments on the analysis, and on why it should not be taken too seriously:
1) The corpus is very small compared to what you would want for a machine learning task.
2) The comparison is between two English bands that worked in roughly the same scene at around the same time. Of course, the Stones appealed to a somewhat different, more rebellious audience, but the two bands still shared more cultural similarities than differences.
3) The Beatles, and to a lesser degree the Stones, split the songwriting among their members. The model here assumes that Paul McCartney, John Lennon, George Harrison, and Ringo Starr all use the same language... This is unlikely. The same applies to the Stones.
But again, the point here was just to have some fun... and for that purpose, the model seems to do well enough.