
Book and dictionary selection

I’ve chosen to analyze “The Tartar Steppe” by Dino Buzzati, my favorite novel. Its main theme is the flight of time, and I recommend it to anyone who is struggling with routine. The original text is in Italian, but I easily found an English version of the book online as a .txt file. (LINK)

For sentiment analysis, I selected a dictionary from the ANEW (Affective Norms for English Words) study. (LINK)

Pre-processing

For both the book and the dictionary, I manually removed a few lines from the top of each file, since they were not related to the actual content.
The first step in analyzing the book is to load the text in lowercase and remove all unwanted characters. To do that, I select the characters I want to keep and replace all the others with spaces.

import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS
import numpy as np
import matplotlib.pyplot as plt
#Load the text in lowercase and replace every character that is not a letter or a space
with open('tartarsteppe.txt', 'r') as f:
    text = f.read().lower()
letters = "abcdefghijklmnopqrstuvwxyz "
#Unique characters to discard (digits, punctuation, newlines, accented letters, ...)
non_word = list(set(text) - set(letters))
#Replace each unwanted character with a space in a single pass
text = "".join(c if c in letters else " " for c in text)
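
The same cleaning step can also be written with a regular expression. The following is a minimal sketch, not the code used for the analysis; the function name `clean_text` is an assumption introduced here for illustration.

```python
import re

def clean_text(raw):
    """Lowercase the text and replace every character outside a-z and space with a space."""
    return re.sub(r"[^a-z ]", " ", raw.lower())

# Punctuation and digits become spaces; letters are kept in lowercase
print(repr(clean_text("Hello, World! 42")))
```

Note that, like the loop above, this also turns newlines and accented letters into spaces.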

Next, I tokenize and lemmatize the text using the spaCy library; after that, I remove all the stopwords and extra whitespace.

#Tokenize and lemmatize
my_doc = nlp(text)
token_list = []
for token in my_doc:
    token_list.append(token.lemma_)

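#Remove stopwords, pronoun placeholders, and leftover whitespace
testo = []
for word in token_list:
    lexeme = nlp.vocab[word]
    #Compare the lemma string (not the Lexeme object) with spaCy v2's "-PRON-" placeholder
    if not lexeme.is_stop and word != "-PRON-" and len(word.strip()) > 1:
        testo.append(word.strip())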

Then, I load the ANEW dictionary and convert it into a Python dictionary whose keys are the words. In this way, I can easily look up the sentiment values of any given word.

#Importing and cleaning dictionary
dizionario=[]
with open('ANEW.txt', 'r') as f:
    for i in f:
        dizionario.append(i)
diz_data={}
for i in dizionario[1:]:
    line_sp=i.replace('\n','').strip().split('\t')
    diz_data[line_sp[0]]=list(map(lambda x: 0 if x=="." else float(x), line_sp[2:]))
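
To make the expected file layout explicit, here is a minimal sketch of the same parsing logic applied to a tiny in-memory sample. It assumes a tab-separated file with a header row, the word in the first column, the scores starting from the third column, and "." marking a missing value, as the code above implies. The column names, words, and numbers in the sample are made up, and `load_lexicon` is a hypothetical helper name.

```python
import io

# Hypothetical sample: column names, words, and scores are invented for illustration
sample = io.StringIO(
    "Word\tId\tValenceMean\tValenceSD\tFrequency\n"
    "sunny\t1\t7.5\t1.2\t13\n"
    "gloom\t2\t2.1\t1.5\t.\n"
)

def load_lexicon(handle):
    """Map each word to its list of numeric scores, reading "." as 0."""
    lexicon = {}
    next(handle)  # skip the header row
    for line in handle:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        lexicon[fields[0]] = [0.0 if v == "." else float(v) for v in fields[2:]]
    return lexicon

lexicon = load_lexicon(sample)
print(lexicon["gloom"])  # [2.1, 1.5, 0.0]
```

With this layout, the first element of each list is the value that the sentiment functions later average.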

Algorithm

Now I define the functions that are used to perform sentiment analysis.

The first function is “sent_val”, which takes as arguments a list of words and our sentiment dictionary. The function maps each word in the list to a sentiment score and computes the mean of those scores; words that are not in the dictionary are ignored.

The second function is “book_analysis”, which analyzes a book using the “sent_val” function. It takes as input a pre-processed book, the size of the word window that slides through the text, and the number of words shared by consecutive windows. The function extracts pieces of text from the book according to these parameters and computes the sentiment score for each piece.

The last function is “give_perc”, which simply tells where each word window is located in the book, as a percentage of the text. This is used to compare windows of different lengths and/or different books.
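
As a rough illustration of the lookup-and-average step, here is a minimal pure-Python sketch on a toy dictionary. The words, the scores, and the name `mean_valence` are assumptions made up for this example; unlike the function used in the analysis, the sketch returns NaN when no word in the list is found in the dictionary.

```python
# Made-up scores for illustration only: [valence, arousal]
toy_lexicon = {"happy": [8.0, 6.0], "war": [2.0, 7.0], "calm": [6.5, 2.0]}

def mean_valence(words, lexicon):
    """Average the first score (valence) of the words found in the lexicon."""
    scores = [lexicon[w][0] for w in words if w in lexicon]
    if not scores:
        return float("nan")  # no known word in this list
    return sum(scores) / len(scores)

# "soldier" is not in the toy lexicon, so only "happy" and "war" are averaged
print(mean_valence(["happy", "soldier", "war"], toy_lexicon))  # 5.0
```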

#Function that maps a list of words to sentiment scores and computes their mean
def sent_val(sentences,dizionario):
    mapping = list(filter(None,(list(map(dizionario.get, sentences)))))
    mapping= np.transpose(mapping)
    sentim = mapping[0].mean()
    return sentim

#Function that uses the sent_val function to analyze a book, given the window and overlap parameters
def book_analysis(testo,T,overlap):
    N_win=round(len(testo)/(T-overlap)-0.5)
    numerone=[]
    for window in range(N_win):
        t_w=testo[(T-overlap)*window:(T-overlap)*window+T]
        if len(t_w)>0:
            numerone.append(sent_val(t_w,diz_data))
    return numerone

#Function that gives a list of % book completion, given the window and overlap parameters
#Used when plotting the sentiment
def give_perc(testo,T,overlap):
    N_win=round(len(testo)/(T-overlap)-0.5)
    listona=[]
    for i in range(N_win):
        listona.append(i*100/N_win)
    return listona
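
To see which words each window covers, the index arithmetic used in book_analysis can be isolated in a small helper. This is only a sketch: `window_bounds` is a hypothetical name, and the text length of 3500 words is an arbitrary value chosen for the example.

```python
def window_bounds(n_words, T, overlap):
    """Return the (start, end) word indices of each window, mirroring the slicing in book_analysis."""
    step = T - overlap                    # consecutive windows start `step` words apart
    n_win = round(n_words / step - 0.5)   # same window count as in the functions above
    # Python slicing clips at the end of the list, so the end index is capped at n_words
    return [(step * w, min(step * w + T, n_words)) for w in range(n_win)]

bounds = window_bounds(3500, 1000, 100)
print(bounds)  # [(0, 1000), (900, 1900), (1800, 2800)]
```

In this example the last 700 words fall outside every window, so it may be worth checking how much of the tail is left out for a given choice of T and overlap.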

Results

In Figure 1, I plotted the emotional arc of the story, using T = 1000 and overlap = 100, together with its mean score. The story starts and ends on a more or less positive note, but the central part is very pessimistic, which shows up as a sharp drop in the emotional score. This shape resembles the so-called “Man in a hole” emotional arc, one of the shapes identified in Reagan et al. (2016). “Man in a hole” is characterized by a fall in sentiment followed by a rise; it is the opposite of the “Icarus” emotional arc, which rises first and then falls.

#Parameters
T=1000
overlap=100
#Functions
sent_score=book_analysis(testo,T,overlap)
x=give_perc(testo,T,overlap)
#Compute mean score (a constant line with the same length as the arc)
mean_score=[sum(sent_score)/len(sent_score)]*len(sent_score)
#plot
%matplotlib notebook
plt.plot(x,sent_score,linewidth=3,label='T= %s' % T,color='tab:green')
plt.plot(x,mean_score,"--",linewidth=2,label='Mean score',color='tab:green')
plt.xlabel('Book %')
plt.ylabel('Sentiment score')
plt.title('The Tartar Steppe, emotional arc')
plt.legend()
#Save before showing, since outside a notebook the figure can be empty after show()
plt.savefig("plot1.pdf")
plt.show()

In Figure 2, I performed a sensitivity analysis on the “word window” parameter (T), keeping the overlap fixed at 100. All window lengths show roughly the same pattern. Smaller windows, however, produce larger emotional swings, which tend to be averaged out in wider windows.

#Sensitivity Analysis
#Parameters
T_sens=[500,750,1000,1250,1500]
overlap=100
#plot
%matplotlib notebook
plt.xlabel('Book %')
plt.ylabel('Sentiment score')
plt.title('The Tartar Steppe emotional arc, T Sensitivity Analysis')
for T in T_sens:
    sent_score=book_analysis(testo,T,overlap)
    x=give_perc(testo,T,overlap)
    if T==1000:
        plt.plot(x,sent_score,linewidth=3,label='T= %s' % T)
    else:
        plt.plot(x,sent_score,'-',linewidth=1.5,label='T= %s' % T)
plt.legend()
#Save before showing, since outside a notebook the figure can be empty after show()
plt.savefig("plot2.pdf")
plt.show()
Leonardo Rizzo

PhD candidate in Network and Data Science. I am a data scientist with a background in economics and finance, and my research focuses on the economic and managerial applications of network science and agent-based simulation models.
