Text Mining of Rubaiyat of Omar Khayyam using R

January 10, 2023

Introduction

Omar Khayyam was an Iranian astronomer, mathematician, philosopher and poet who is best known for his quatrains (short four-line poems). He was almost unknown to the wider world until the 19th century, when Edward FitzGerald translated his poems into English and published them as the Rubaiyat of Omar Khayyam.

His achievements in astronomy and mathematics, such as his work on the solar calendar and on cubic equations, are admirable, but the world came to know him mainly through the philosophy hidden in his poems. Discovering that hidden meaning motivated me to do some analysis and document it for others, and that is exactly where text mining comes in. Text mining focuses on textual data and on extracting insight from it. For background on text mining in R, see the book Text Mining with R by Julia Silge and David Robinson.

Now that we have our tools, let’s start our journey into Khayyam’s mind, find out what was going on there, and enjoy it.

Important: Khayyam wrote these poems in Persian, but we will examine their English translation.

Call required packages

First, the packages we need must be installed and loaded.

# Package names must be quoted; tidyr is already part of tidyverse,
# and wordcloud is needed for the word cloud further down
install.packages(c("tidyverse", "tidytext", "gutenbergr",
                   "topicmodels", "wordcloud"))

library(tidyverse)
library(tidytext)
library(gutenbergr)
library(topicmodels)

Download the data

The gutenbergr package gives access to many public-domain books from Project Gutenberg. Every book has its own gutenberg_id and can be downloaded with it. For more details, see the package’s help pages. The gutenberg_id of the Rubaiyat of Omar Khayyam is 246. Let’s download it.

khayyam <- gutenberg_download(246)

## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org

Preprocessing of data

We want only the poems in the book, so sections such as the introduction, preface, appendix and references must be removed from the text. You can browse the whole text of the book with view(khayyam) in RStudio.

Exclusion of non-related data

We start with some data cleaning. Below, we drop rows 1:450 and 1878:2131, which hold the non-poem material, and keep the rest.

khayyam <- khayyam %>% slice(-c(1:450, 1878:2131))
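As a minimal sketch of how negative indices in slice() behave, the toy tibble below (hypothetical data, not the book) drops a leading and a trailing block of rows in the same way:

```r
library(dplyr)

# Hypothetical stand-in for the downloaded book: ten numbered lines
toy <- tibble(row = 1:10, text = paste("line", 1:10))

# Negative indices tell slice() which rows to drop:
# here the first three rows and the last two
kept <- toy %>% slice(-c(1:3, 9:10))

kept$row
```

Only the rows between the two dropped blocks remain. The row ranges used for the real book are tied to this particular download, so it is worth checking them again if the Gutenberg text ever changes.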

Stop words

In English (as in any other language), some words occur very frequently in almost every text but carry little information for text analysis. They are called stop words. The tidytext package provides a ready-made list of them in the stop_words data set, which we remove with anti_join().

tidy_khayyam <- khayyam %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)

## Joining, by = "word"
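To make the two steps concrete, here is a small sketch that runs the same pipeline on just the first two lines of the opening quatrain. The object names toy and toy_tokens are assumptions made for this illustration only.

```r
library(dplyr)
library(tidytext)

# Two lines of the opening quatrain, one row per line of text
toy <- tibble(
  line = 1:2,
  text = c("Awake! for Morning in the Bowl of Night",
           "Has flung the Stone that puts the Stars to Flight")
)

toy_tokens <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word")   # keep only words that are not in stop_words

toy_tokens$word
```

Passing by = "word" explicitly gives the same result as the code above but suppresses the "Joining, by" message. The words that survive should match the first rows of the tidy_khayyam output shown next.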

Below we can look at the tokens that remain and at the most common words in Khayyam’s poems. Note that stop words have already been removed.

tidy_khayyam

## # A tibble: 2,381 × 2
##    gutenberg_id word   
##           <int> <chr>  
##  1          246 awake  
##  2          246 morning
##  3          246 bowl   
##  4          246 night  
##  5          246 flung  
##  6          246 stone  
##  7          246 stars  
##  8          246 flight 
##  9          246 lo     
## 10          246 hunter 
## # … with 2,371 more rows

# The most common words in Khayyam's poems
tidy_khayyam %>%
    count(word, sort = TRUE)

## # A tibble: 1,097 × 2
##    word      n
##    <chr> <int>
##  1 wine     21
##  2 ah       19
##  3 cup      19
##  4 rose     19
##  5 earth    15
##  6 dust     14
##  7 life     14
##  8 lip      14
##  9 thou     14
## 10 day      12
## # … with 1,087 more rows

Additionally, we can add our own words to the predefined stop words. For example, we customize the list by adding the words ah and sans.

custom_stop_words <- bind_rows(
    tibble(word = c("ah", "sans"), lexicon = "custom"),
    stop_words
)

custom_stop_words

## # A tibble: 1,151 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 ah          custom 
##  2 sans        custom 
##  3 a           SMART  
##  4 a's         SMART  
##  5 able        SMART  
##  6 about       SMART  
##  7 above       SMART  
##  8 according   SMART  
##  9 accordingly SMART  
## 10 across      SMART  
## # … with 1,141 more rows

Now, with these removed as well, we can take a cleaner look at the most frequently used words in Khayyam’s poems.

tidy_khayyam <- khayyam %>%
    unnest_tokens(word, text) %>%
    anti_join(custom_stop_words)

## Joining, by = "word"

# The most common words in Khayyam's poems
tidy_khayyam %>%
    count(word, sort = TRUE)

## # A tibble: 1,095 × 2
##    word      n
##    <chr> <int>
##  1 wine     21
##  2 cup      19
##  3 rose     19
##  4 earth    15
##  5 dust     14
##  6 life     14
##  7 lip      14
##  8 thou     14
##  9 day      12
## 10 door     12
## # … with 1,085 more rows

Wine, cup and rose are the three most common words. They are followed by words such as earth, dust, life and lip. What are these about? What do the wine and the cup refer to? What about the rose and the lip, and the others?

Perhaps it is obvious. The wine and the cup refer to drinking and having fun. The rose, on the other hand, is a symbol of the beloved in Persian poetry, with its own layers of meaning. So lip and rose could refer to pleasure, carefreeness and, of course, love. It’s all about seizing the present moment.

The plot below shows the words that occur more than 10 times.

library(ggplot2)
tidy_khayyam %>%
    count(word, sort = TRUE) %>%
    filter(n > 10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) +
    geom_col() +
    xlab(NULL) +
    coord_flip()

Sentiment lexicons classify words by the sentiment they carry, for example as positive or negative. One of these lexicons is bing.

bing_word_counts <- tidy_khayyam %>%
    inner_join(get_sentiments("bing")) %>%
    count(word, sentiment, sort = TRUE)

## Joining, by = "word"

bing_word_counts

## # A tibble: 173 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 dust     negative     14
##  2 darkness negative      6
##  3 paradise positive      6
##  4 sweet    positive      6
##  5 vain     negative      6
##  6 angel    positive      5
##  7 hell     negative      5
##  8 love     positive      5
##  9 waste    negative      5
## 10 dead     negative      4
## # … with 163 more rows
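As a small illustration of what the join does, the sketch below looks up four words taken from the outputs above. The object names are hypothetical, and the word "wine" is included on the assumption (suggested by its absence from the table above) that it is not in the bing lexicon.

```r
library(dplyr)
library(tidytext)

# Four words from the frequency and sentiment tables above
toy_words <- tibble(word = c("sweet", "dust", "love", "wine"))

# inner_join() keeps only the words that appear in the lexicon
# and attaches their sentiment label
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

# Number of matched words per sentiment
toy_sentiment %>% count(sentiment)
```

Words that are missing from the lexicon are silently dropped by the inner join, which is why the most frequent word of the book does not show up in bing_word_counts at all.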

The plot below shows the words that contribute most to each sentiment.

# This can be shown visually
bing_word_counts %>%
    group_by(sentiment) %>%
    top_n(10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment",
         x = NULL) +
    coord_flip()

## Selecting by n

We can also draw a word cloud of Khayyam’s poems. The size of each word is determined by its frequency.

library(wordcloud)

## Loading required package: RColorBrewer

tidy_khayyam %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

## Joining, by = "word"

## Warning in wordcloud(word, n, max.words = 100): round could not be fit on page.
## It will not be plotted.

## Warning in wordcloud(word, n, max.words = 100): earth could not be fit on page.
## It will not be plotted.

Topic modeling

A document can be described as a mixture of several topics, each of which may be given its own label. Topic modeling is a method for finding such topics. A popular topic model is Latent Dirichlet Allocation (LDA). In LDA, each document is a mixture of topics and each topic is a mixture of words. For Khayyam’s poems we assume two topics, although more could be considered. If you are dealing with a document that includes several chapters, you can use the number of chapters as the number of topics.

First of all, we must get the data ready for fitting the LDA model by casting the word counts into a document-term matrix. Note that every row has the same gutenberg_id, so the whole book is treated as a single document here.

q <- tidy_khayyam %>%
     count(gutenberg_id, word, sort = TRUE)
b <- q %>%
     cast_dtm(gutenberg_id, word, n)
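To see what cast_dtm() produces, here is a minimal sketch on hypothetical counts for two made-up documents, q1 and q2. The resulting matrix has one row per document and one column per distinct term.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two documents (not taken from the book)
toy_counts <- tibble(
  document = c("q1", "q1", "q2", "q2"),
  word     = c("wine", "cup", "wine", "rose"),
  n        = c(2L, 1L, 1L, 3L)
)

# Cast the tidy counts into a DocumentTermMatrix (requires the tm package,
# which is installed as a dependency of topicmodels)
toy_dtm <- toy_counts %>%
  cast_dtm(document, word, n)

dim(toy_dtm)  # number of documents, number of distinct terms
```

Applied to the matrix b above, dim(b) is a quick way to check how many documents the model will actually see.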

Now, it’s time to fit LDA with k = 2 (two topics).

ap_lda <- LDA(b, k = 2, control = list(seed = 1234))
ap_lda

## A LDA_VEM topic model with 2 topics.

After fitting the model, we can extract the per-topic-per-word probabilities, called β (“beta”). Each value is the probability of that term being generated from that topic.

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics

## # A tibble: 2,190 × 3
##    topic term     beta
##    <int> <chr>   <dbl>
##  1     1 wine  0.00694
##  2     2 wine  0.0110 
##  3     1 cup   0.00978
##  4     2 cup   0.00629
##  5     1 rose  0.0108 
##  6     2 rose  0.00522
##  7     1 earth 0.00669
##  8     2 earth 0.00604
##  9     1 dust  0.00342
## 10     2 dust  0.00858
## # … with 2,180 more rows

The term “wine” has a probability of about 0.007 of being generated from topic 1, but about 0.011 of being generated from topic 2.

Next we take a look at the 10 terms that are most common within each topic.
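Because each topic is a probability distribution over the vocabulary, the beta values of a topic add up to 1. The sketch below checks this on a tiny, hypothetical corpus of three documents; it reads the same probabilities with posterior() from topicmodels instead of tidy(), and all object names are assumptions for this illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for three small documents (not the book itself)
toy_counts <- tibble(
  document = rep(c("q1", "q2", "q3"), each = 3),
  word     = c("wine", "cup", "grape",
               "rose", "lip", "wine",
               "dust", "earth", "rose"),
  n        = c(3L, 2L, 1L, 2L, 2L, 1L, 3L, 1L, 1L)
)

toy_dtm <- cast_dtm(toy_counts, document, word, n)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# posterior()$terms is a topics-by-terms matrix of word probabilities
toy_beta <- posterior(toy_lda)$terms

rowSums(toy_beta)  # one sum per topic
```

This is also why individual beta values in the book are small: the probability mass of each topic is spread over more than a thousand distinct words.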

ap_top_terms <- ap_topics %>%
     group_by(topic) %>%
     slice_max(beta, n = 10) %>% 
     ungroup() %>%
     arrange(topic, -beta)

ap_top_terms %>%
     mutate(term = reorder_within(term, beta, topic)) %>%
     ggplot(aes(beta, term, fill = factor(topic))) +
     geom_col(show.legend = FALSE) +
     facet_wrap(~ topic, scales = "free") +
     scale_y_reordered()

What does topic 1 say? Its top terms include “rose”, “lip”, “wine” and “thee”. It seems that Khayyam is talking about a relationship with a beloved, kissing, and drinking wine with a partner. What about topic 2? It has terms such as “wine”, “hand”, “drink” and “grape”. If you have read Khayyam’s poems, you know that he mostly wrote about holding a cup, drinking wine and enjoying life.

Pottery also flourished in Iran in Khayyam’s time, almost 1000 years ago. Khayyam believed that the bodies of the dead and buried turn into soil and dust over time, and that roses grow from that dust. He believed that the pottery we use is made from the soil and dust of dead people, and that one day pottery will be made from our own dust for later generations (e.g. a hundred years later). Therefore, we should treat pottery with respect, because it is made from other people’s dust. Time is very short, and we should appreciate and enjoy the brief moments of life.

To see which terms differ most between the two topics, we compute the log2 ratio of their beta values, keeping only terms with a beta above 0.001 in at least one topic.

beta_wide <- ap_topics %>%
     mutate(topic = paste0("topic", topic)) %>%
     pivot_wider(names_from = topic, values_from = beta) %>% 
     filter(topic1 > .001 | topic2 > .001) %>%
     mutate(log_ratio = log2(topic2 / topic1))
beta_wide

## # A tibble: 544 × 4
##    term   topic1  topic2 log_ratio
##    <chr>   <dbl>   <dbl>     <dbl>
##  1 wine  0.00694 0.0110      0.663
##  2 cup   0.00978 0.00629    -0.636
##  3 rose  0.0108  0.00522    -1.05 
##  4 earth 0.00669 0.00604    -0.146
##  5 dust  0.00342 0.00858     1.33 
##  6 life  0.00408 0.00790     0.955
##  7 lip   0.00841 0.00338    -1.32 
##  8 thou  0.00531 0.00661     0.317
##  9 day   0.00789 0.00218    -1.86 
## 10 door  0.00581 0.00435    -0.417
## # … with 534 more rows

In this output, we see that the term “wine” is more likely to belong to topic 2 (positive log ratio), whereas the terms “cup” and “rose” are more likely to belong to topic 1 (negative log ratio). Based on these results we could assign a label to each topic, but we leave that to the taste of our dear readers.
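As a worked example of how to read the log ratio, the snippet below recomputes it for “wine” from the rounded beta values printed above. The variable names are made up for this illustration, and because the inputs are rounded the result only approximately matches the table.

```r
# Rounded beta values for "wine", copied from the output above
beta_topic1 <- 0.00694
beta_topic2 <- 0.0110

# log2 ratio: positive means the term is more probable under topic 2,
# negative means it is more probable under topic 1
log_ratio <- log2(beta_topic2 / beta_topic1)
log_ratio

# Undoing the log gives the plain ratio of the two probabilities
2^log_ratio
```

A log ratio of 1 would mean the term is twice as probable under topic 2, and a log ratio of -1 twice as probable under topic 1, which makes the scale symmetric around zero.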

Conclusion

We introduced the basic concepts of text mining without delving into the details. In a nutshell, we analyzed Khayyam’s poems and saw that Khayyam believes in joy. I highly recommend that you read his poems and discover his mindset for yourself. If we had to summarize his thought and worldview in one sentence, it would be this:

Life is short. Always choose happiness.
