Speculative fiction mining

Data of many kinds has already been dissected by bored data scientists: movies, politics, demography, crime, even metal band lyrics. Fiction, and speculative fiction in particular, has been mostly overlooked, which is surprising given the stereotype that STEM students and professionals are passionate about speculative fiction.

Here I try to fill this gap by taking some data and drawing some plots.

Data source

First of all, we need two things: data to analyse and an idea of what data we need. Ideally the idea should come first, but in practice the process is more of a spiral than a straight line: some ideas lead to new data sources, and some data sources inspire new ideas. The basic requirement was that the data include genre, year and author.

I started with structured data sources, namely Wikidata, because I had heard that Freebase, which contained a lot of data about books, was merged into Wikidata after its shutdown. However, Wikidata turned out to contain only 28463 novels, 3329 science fiction items and 1722 fantasy items (this can be checked with a query to Wikidata).

DBpedia gave even worse results: 1654 items for Science_fiction and 1341 for Fantasy_literature (see the corresponding query).

So I was back to the unpleasant task of choosing a site to crawl and scrape. Goodreads, probably the most complete book database (like IMDb for movies), prohibits crawling in its robots.txt and TOS. fantasticfiction.com looked promising, but its coverage is not that good either. In the end I decided to go with Fantlab: from my point of view it has the fullest coverage of speculative fiction, especially (but not only) by Russian authors. As it turns out, there is even an API, but I preferred to simply crawl and scrape the data.
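
Whether a site permits crawling can be checked programmatically before writing a spider. The following is a minimal sketch using the standard library's urllib.robotparser; the robots.txt content and the example.com URLs are made up for illustration and are not the actual rules of any site mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /book/
"""

def is_allowed(url, user_agent="*"):
    """Return True if the hypothetical rules above allow fetching `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/book/42"))
print(is_allowed("https://example.com/about"))
```

For a live site one would call set_url() and read() on the parser instead of feeding it inline lines. Note that robots.txt covers only part of the picture: terms of service have to be read by a human.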

Crawling and scraping

Since I had no experience in scraping, I chose the most popular framework, scrapy. Non-obvious things I learned from using it:

1. In order to serialize Russian text properly, scrapy should be at least version 1.2 and FEED_EXPORT_ENCODING = 'utf-8' should be added to the settings (thanks to this recent SO answer).
2. XPath selectors can be copied from the browser (in Chrome: Right click - Inspect - Copy - Copy XPath). I didn’t find this advice anywhere, maybe because it is actually obvious.
3. You can use ItemLoaders to nicely preprocess/postprocess extracted info, and you can assemble items from multiple pages/requests, but not both at the same time; at least, I didn’t find a way to do it nicely.
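
The effect of the encoding setting can be illustrated without scrapy. The sketch below uses only the standard json module, whose ensure_ascii flag produces the same kind of \uXXXX escapes that show up in a JSON feed when no export encoding is set. It is an analogy, not scrapy's own code, and the item is made up.

```python
import json

# A made-up item with a Cyrillic value.
item = {"genre": "Фантастика"}

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(item)

# With ensure_ascii=False the Cyrillic text is kept readable.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the difference only matters when a human (or a tool that does not unescape) reads the file.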

(Since I wanted to get info about genres and this info is shown only when enough users voted for the same genre, I crawled only those works having at least one rating.)

Data preparation

We start by reading the serialized CSV into a pandas DataFrame. Along the way we keep only works dated from 1901 to 2016 and convert genres into an immutable set of strings.

import pandas as pd
import numpy as np

def genres_split(s):
    return frozenset(s.split(','))

def year_converter(s):
    # Keep only years in 1901..2016; everything else becomes NaN
    try:
        year = int(s)
    except ValueError:
        return np.nan
    return year if 1900 < year < 2017 else np.nan

filename='/home/nik/workspace/fiction-miner/fantlab_annot.csv'
df = pd.read_csv(filename, converters={"genres": genres_split, "year": year_converter}, index_col='id')
df.head(3)
id | book_type | annotation | year | reviews_count | rating_count | title | author_name | genres | rating
/work163462 | shortstory | Япония периода второй мировой войны. В маленьк... | 1957.0 | NaN | 36.0 | Солдат из сна | Кобо Абэ (安部公房 / Kōbō Abe) | (Реализм) | 7.36
/work160391 | novel | NaN | 1948.0 | NaN | 4.0 | 終りし道の標に Owarishi michi no shirube ni | Кобо Абэ (安部公房 / Kōbō Abe) | () | 7.25
/work164992 | shortstory | NaN | NaN | NaN | 4.0 | 家 / Ie | Кобо Абэ (安部公房 / Kōbō Abe) | () | 8.25
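
The behaviour of the two converters on edge cases is easy to miss, so here is a small standalone sketch. It is a variant of the functions above, not the exact code used for this post (the names parse_year and parse_genres are mine): it returns NaN for every rejected year, and it drops empty genre strings, which str.split would otherwise keep as ''.

```python
import math

def parse_year(s, first=1901, last=2016):
    """Return the year as int if it lies in [first, last], else NaN."""
    try:
        year = int(s)
    except ValueError:
        return math.nan
    return year if first <= year <= last else math.nan

def parse_genres(s):
    """Split a comma-separated genre string, ignoring empty pieces."""
    return frozenset(g.strip() for g in s.split(",") if g.strip())

print(parse_year("1957"), parse_year("1850"), parse_year(""))
print(parse_genres("Реализм,Фантастика"), parse_genres(""))
```

Both functions take a string because pandas passes the raw cell text to converters.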

We have genres as an immutable set of strings, which violates first normal form, so we perform the usual one-hot encoding. At the same time we keep only a subset of the genres and translate them from Russian into English.

main_genres_dict = {
    'Фантастика': 'Science fiction',
    'Фэнтези': 'Fantasy',
    'Магический реализм': 'Magic realism',
    'Мистика': 'Mystic fiction',
    'Хоррор/Ужасы': 'Horror fiction',
    'Сказка/Притча': 'Fairytale / parable',
    'Детектив': 'Detective fiction',
    'Боевик': 'Action fiction',
    'Триллер': 'Thriller',
    'Любовный роман': 'Romance novel',
    'Историческая проза': 'Historical fiction',
    'Сюрреализм': 'Surrealism',
    'Постмодернизм': 'Postmodernism',
    'Реализм': 'Realism',
    'Психоделика': 'Psyhodelic',
    }
scifi_subgenres = {
    '«Твёрдая» НФ': 'Hard SF',
    'Гуманитарная («мягкая») НФ': 'Soft SF',
    'Космоопера': 'Spaceopera',
    'Киберпанк': 'Cyberpunk',
    'Планетарная фантастика': 'Planet SF',
    'Таймпанк': 'Timepunk',
    'Хроноопера': 'Chronoopera',
    'Постапокалиптика': 'Postap',
    'Роман-катастрофа': 'Disaster novel',
    'Утопия': 'Utopia',
    'Антиутопия': 'Antiutopia',
    }
fantasy_subgenres = {
    'Эпическое фэнтези': 'Epic Fantasy',
    'Героическое фэнтези': 'Heroic Fantasy',
    'Городское фэнтези': 'Urban Fantasy',
    'Dark Fantasy': 'Dark Fantasy',
    'Технофэнтези': 'Technofantasy',
    'Science Fantasy': 'Science Fantasy',
    '«Дотолкиновское» фэнтези': 'Pretolkien Fantasy',
    '«Классическое» фэнтези': 'Classic Fantasy',
    'Артуриана': 'Arhturiana',
    'Анималистическое': 'Animalistic',
    'Мифологическое': 'Mythological Fantasy',
    }
# genres_dict = {**main_genres_dict, **scifi_subgenres, **fantasy_subgenres}
# genres_set = frozenset.union(*df.genres)
def encode_genres(df, genres_dict):
    # Adds, in place, three columns per genre: a 0/1 flag, the rating and the rating count
    # (the last two are NaN for works that do not belong to the genre).
    for genre, genre_lab in genres_dict.items():
        # taken from http://datascience.stackexchange.com/a/11799
        df[genre_lab] = df.apply(lambda x: int(genre in x['genres']), axis='columns')
        df[genre_lab+'_rat'] = df.apply(lambda x: x['rating'] if x[genre_lab]==1 else np.nan, axis=1)
        df[genre_lab+'_rat_count'] = df.apply(lambda x: x['rating_count'] if x[genre_lab]==1 else np.nan, axis=1)
        print(genre_lab + ": \t\t" + str(len(df[df[genre_lab] == 1])))
print('  Main genres')
encode_genres(df, main_genres_dict)
print('  SciFi subgenres')
encode_genres(df, scifi_subgenres)
print('  Fantasy subgenres')
encode_genres(df, fantasy_subgenres)
df.head(3)
  Main genres
Mystic fiction: 		1844
Fairytale / parable: 		1044
Fantasy: 		4635
Psyhodelic: 		263
Romance novel: 		276
Realism: 		4473
Action fiction: 		689
Thriller: 		283
Science fiction: 		12765
Postmodernism: 		209
Detective fiction: 		1273
Historical fiction: 		539
Surrealism: 		415
Magic realism: 		761
Horror fiction: 		1413
  SciFi subgenres
Hard SF: 		2815
Cyberpunk: 		269
Planet SF: 		651
Utopia: 		109
Spaceopera: 		1043
Soft SF: 		8957
Chronoopera: 		1021
Antiutopia: 		498
Timepunk: 		108
Postap: 		756
Disaster novel: 		218
  Fantasy subgenres
Science Fantasy: 		369
Pretolkien Fantasy: 		70
Epic Fantasy: 		443
Technofantasy: 		247
Dark Fantasy: 		359
Arhturiana: 		39
Heroic Fantasy: 		3364
Mythological Fantasy: 		213
Animalistic: 		57
Classic Fantasy: 		129
Urban Fantasy: 		649
id | book_type | annotation | year | reviews_count | rating_count | title | author_name | genres | rating | Arhturiana | ... | Postap_rat_count | Cyberpunk | Cyberpunk_rat | Cyberpunk_rat_count | Classic Fantasy | Classic Fantasy_rat | Classic Fantasy_rat_count | Horror fiction | Horror fiction_rat | Horror fiction_rat_count
/work163462 | shortstory | Япония периода второй мировой войны. В маленьк... | 1957.0 | NaN | 36.0 | Солдат из сна | Кобо Абэ (安部公房 / Kōbō Abe) | (Реализм) | 7.36 | 0 | ... | NaN | 0 | NaN | NaN | 0 | NaN | NaN | 0 | NaN | NaN
/work160391 | novel | NaN | 1948.0 | NaN | 4.0 | 終りし道の標に Owarishi michi no shirube ni | Кобо Абэ (安部公房 / Kōbō Abe) | () | 7.25 | 0 | ... | NaN | 0 | NaN | NaN | 0 | NaN | NaN | 0 | NaN | NaN
/work164992 | shortstory | NaN | NaN | NaN | 4.0 | 家 / Ie | Кобо Абэ (安部公房 / Kōbō Abe) | () | 8.25 | 0 | ... | NaN | 0 | NaN | NaN | 0 | NaN | NaN | 0 | NaN | NaN

3 rows × 120 columns
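
The one-hot encoding step is easier to see on a toy example. This is a stdlib-only sketch with made-up rows, not the pandas code used above; one_hot is a hypothetical helper name.

```python
def one_hot(rows, genres_dict):
    """Add a 0/1 field per translated genre to each row (a dict with a 'genres' set)."""
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for genre_ru, genre_en in genres_dict.items():
            new_row[genre_en] = int(genre_ru in row["genres"])
        encoded.append(new_row)
    return encoded

# Made-up rows for illustration.
rows = [
    {"title": "A", "genres": frozenset({"Фантастика", "Реализм"})},
    {"title": "B", "genres": frozenset()},
]
toy_dict = {"Фантастика": "Science fiction", "Фэнтези": "Fantasy"}

for row in one_hot(rows, toy_dict):
    print(row["title"], row["Science fiction"], row["Fantasy"])
```

Genres that are not in the dictionary (here Реализм) are silently dropped, which is how only a subset of the genres ends up in the final table.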

In addition to the explicit info like year, genre, author, rating, etc., I determine 2 more types of data based on authors’ names:

1. Authors’ nationality: Russian or Foreign
2. Authors’ gender: Male or Female

Authors’ nationality

In order to estimate the author’s nationality, I simply check whether the name contains at least one Latin letter. This relies on the observation that the names of foreign authors include their original names in parentheses. Of course, there may be errors of both types, for example pseudonyms of Russian authors or a missing original name of a foreign one, but I believe that in general the error rate is low enough.

import re
def containsEn(s):
    return bool(re.search('[a-zA-Z]', s))
print(containsEn('Майкл (Michael)'))
print(containsEn('Петр (псевдоним Ивана)'))
True
False
df_en = df[df.apply(lambda x: containsEn(x['author_name']), axis=1)]
df_rus = df[df.apply(lambda x: not containsEn(x['author_name']), axis=1)]
print(df.shape)
print(df_en.shape)
(75418, 120)
(36021, 120)
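
One blind spot of this heuristic is worth spelling out: a foreign author whose original name is written in Cyrillic rather than Latin script is counted as Russian. A minimal sketch of the same check on three names from the dataset (guess_origin is a hypothetical wrapper name):

```python
import re

LATIN = re.compile(r"[A-Za-z]")

def guess_origin(author_name):
    """Guess 'foreign' if the name contains any basic Latin letter, else 'russian'."""
    return "foreign" if LATIN.search(author_name) else "russian"

samples = [
    "Кобо Абэ (安部公房 / Kōbō Abe)",           # Latin transliteration present
    "Владимир Ильин",                          # Cyrillic only
    "Агоп Мелконян (Агоп Мъгърдич Мелконян)",  # original name is Cyrillic too
]
for name in samples:
    print(guess_origin(name), "-", name)
```

The third name is labelled as Russian although the parenthesised original name suggests otherwise, which is the kind of error the paragraph above accepts.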

Authors’ gender

In order to estimate the author’s gender, I split the full name into parts and intersect them with prepared sets of male and female names. These lists were built from the gazetteers of GATE 8 (both English and Russian lists) and a random site about dream interpretation, name choice and so on.

def read_in_set(filename, prefix='/home/nik/workspace/fiction-miner/names/'):
    # One name per line; blank lines and single-character entries are skipped
    with open(prefix + filename, encoding='utf-8') as f:
        return {line.strip() for line in f if len(line.strip()) > 1}
# taken from GATE 8 gazetteers and http://www.sonnik-online.net/imena/_mujskie.html
names_female_ru = read_in_set('first_names_female.lst')
names_male_ru = read_in_set('first_names_male.lst')
names_female_en = read_in_set('person_female_ext.lst')
names_male_en = read_in_set('person_male_ext.lst')
names_female = set.union(names_female_ru, names_female_en)
names_male = set.union(names_male_ru, names_male_en)
# the male list also contains female forms of male names (e.g. Alexandra), so drop every name that is also in the female set
names_male = names_male.difference(names_female)

import string
def check_gender(s, names_set):
    translator = str.maketrans('', '', string.punctuation)
    words_in_s = set(s.translate(translator).split())
    return bool(words_in_s.intersection(names_set))

def isMale(s):
    return check_gender(s, names_male)

def isFemale(s):
    return check_gender(s, names_female)

print(isMale('Вера Петрова'))
print(isMale('Владимиp Ильин'))  # Latin "p" typed instead of Cyrillic "р" (wrong keyboard layout), so no match
print(isMale('Владимир Ильин'))
print(isMale('Александра Олайва (Alexandra Oliva)')) #because of Oliva
print(isFemale('Алексей В. Андреев'))
print(isFemale('Вера Петрова'))
print(isMale('Алан Глинн (Alan Glynn)'))
print(isMale('Дэвид С. Гарнетт (David S. Garnett)'))
print(isFemale('John Doe'))
False
False
True
True
False
True
True
True
False
df_female = df[df['author_name'].apply(isFemale)]
print('females: ' + str(len(df_female)))
df_male = df[df['author_name'].apply(isMale)]
print('males: ' + str(len(df_male)))
females: 12367
males: 57734
df_unknown = df[df.apply(lambda x: not isMale(x['author_name']) and not isFemale(x['author_name']), axis=1)]
print('undefined: ' + str(df_unknown.shape))
print(np.unique(df_unknown['author_name'])[:5])
df_both = df[df.apply(lambda x: isMale(x['author_name']) and isFemale(x['author_name']), axis=1)]
print('both: ' + str(df_both.shape))
print(np.unique(df_both['author_name'])[:10])
undefined: (6884, 120)
['Zотов' 'А. Ли Мартинес (A. Lee Martinez)'
 'А. Н. Л. Манби (A. N. L. Munby)' 'Аврам Дэвидсон (Avram Davidson)'
 'Агоп Мелконян (Агоп Мъгърдич Мелконян)']
both: (1567, 120)
['Алан Глинн (Alan Glynn)' 'Александр и Людмила Белаш'
 'Александра Олайва (Alexandra Oliva)'
 'Александра Харви (Alyxandra Harvey)' 'Брижит Обер (Brigitte Aubert)'
 'Вероника Рот (Veronica Roth)' 'Габриэлла Пирс (Gabriella Pierce)'
 'Гай Гэвриел Кей (Guy Gavriel Kay)'
 'Гейл Карсон Ливайн (Gail Carson Levine)' 'Грэм Джойс (Graham Joyce)']

As we can see, most authors matched as both male and female have a surname that coincides with a first name of the opposite gender, e.g. Alyxandra Harvey.
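
A possible way to reduce such conflicts, sketched below under the assumption that the first word of the name is the given name, is to check only that word. The name sets here are tiny made-up stand-ins for the real lists, and guess_gender is a hypothetical helper, not the code used for the figures in this post.

```python
import string

# Tiny stand-in name lists (assumptions for illustration only).
MALE = {"Alan", "Harvey"}
FEMALE = {"Alexandra", "Veronica", "Glynn"}

_PUNCT = str.maketrans("", "", string.punctuation)

def guess_gender(full_name):
    """Classify by the first token only, so surnames cannot cause a conflict."""
    tokens = full_name.translate(_PUNCT).split()
    if not tokens:
        return "unknown"
    first = tokens[0]
    if first in FEMALE:
        return "female"
    if first in MALE:
        return "male"
    return "unknown"

print(guess_gender("Alexandra Harvey"))
print(guess_gender("Alan Glynn"))
print(guess_gender("A. Lee Martinez"))
```

The trade-off is that authors listed by initials, or with the surname first, fall into the unknown group instead of being matched by a later word.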

Genres development

Let's analyse how genres develop over the years. Area charts seem the most appropriate here, because they map volume to area.

Note that we use the ‘wiggle’ baseline, which is why the chart is centered around zero instead of building peaked mountains; see the matplotlib docs for details.

import matplotlib.pyplot as plt
%matplotlib notebook
# %matplotlib inline
from bokeh.palettes import d3  # only the d3 palettes are used below

def draw_plot(df2plot, genres_to_show, title, roll_window=2):
    fig, ax = plt.subplots(figsize=(11, 6))
    stackplot(df2plot, genres_to_show, ax, title, roll_window)

def stackplot(df2plot, genres_to_show, ax, title, roll_window):
    ax.set_title(title)
    palette = d3['Category10'][len(genres_to_show)]
    x = np.unique(df2plot['year'].dropna().values)
    y_all = []
    for i, genre in enumerate(genres_to_show):
        y = df2plot.groupby('year')[genre].sum()
        y_avg = y.rolling(window=roll_window, center=False).mean()
        y_all.append(y_avg)
    polys = ax.stackplot(x, y_all, labels=genres_to_show, linewidth=0, colors=palette, baseline='wiggle')
    ax.set_xticks(np.arange(1900,2020,10))
    ax.grid()
    ax.set_facecolor((0.8,0.8,0.8))  # set_axis_bgcolor was removed in matplotlib 2.2
    ax.legend(loc='lower left', prop={'size':13})
    return ax
top_genres = [  #i.e those having at least 1000 items
    'Science fiction',
    'Fantasy',
    'Mystic fiction',
    'Horror fiction',
    'Fairytale / parable', 
    'Detective fiction', 
    'Realism', 
]
draw_plot(df, top_genres, 'Genres development over years', 2)
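
To see what shifting the baseline does, here is a small sketch in plain Python. It uses the simpler symmetric baseline (half of the total placed below zero, which matplotlib offers as baseline='sym') rather than ‘wiggle’, which picks the baseline differently. In both cases each layer's thickness still equals the series value; only its vertical position changes. The numbers are made up.

```python
def stack_layers(series, symmetric=True):
    """Return a (bottom, top) pair of edge lists for each layer.

    `series` is a list of equally long lists, one per genre.
    """
    n_points = len(series[0])
    totals = [sum(s[i] for s in series) for i in range(n_points)]
    # Either start at zero or centre the whole stack around zero.
    base = [-t / 2 if symmetric else 0.0 for t in totals]
    layers = []
    for s in series:
        top = [b + v for b, v in zip(base, s)]
        layers.append((base, top))
        base = top  # the next layer sits on top of this one
    return layers

counts = [[1, 3, 2], [2, 1, 4]]  # made-up yearly counts for two genres
for bottom, top in stack_layers(counts):
    print(bottom, top)
```

Because thickness is preserved, the chart can be read the same way as an ordinary stacked area plot.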

Note also the averaging (the roll_window param), which smooths the natural sparseness of the data; without it (i.e. with roll_window=1) random peaks hide the trends, which matters especially for the later plots built on subsets of the data.
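
What the rolling window does can be shown on a toy series. The sketch below is a plain-Python trailing mean that mimics a non-centred rolling window: the first window - 1 positions have no full window and are left as NaN. The numbers are made up.

```python
import math

def rolling_mean(values, window):
    """Trailing moving average; positions without a full window are NaN."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(math.nan)
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result

counts = [0, 10, 0, 8, 2]  # made-up, spiky yearly counts
print(rolling_mean(counts, 1))
print(rolling_mean(counts, 2))
```

With a window of 1 the series is returned unchanged, spikes included; with a window of 2 neighbouring years are averaged and the spikes flatten out.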

sf_subgenres = [
    'Hard SF',
    'Soft SF',
    'Spaceopera',
    'Cyberpunk',
    'Planet SF',
    'Timepunk',
    'Chronoopera',
    'Postap',
#     'Disaster novel',
    'Utopia',
    'Antiutopia',
]
draw_plot(df, sf_subgenres, 'Science fiction subgenres development over years')

fantasy_subgenres = [
    'Epic Fantasy',
    'Heroic Fantasy',
    'Urban Fantasy',
    'Dark Fantasy',
    'Technofantasy',
    'Science Fantasy',
#     'Pretolkien Fantasy',
    'Classic Fantasy',
#     'Arhturiana',
#     'Animalistic',
    'Mythological Fantasy',
]
draw_plot(df, fantasy_subgenres, 'Fantasy subgenres development over years')

#draw_plot(df_rus, top_genres, 'Genres development over years (Russian authors)', 2)

Compare Russian and Foreign authors

def compare2plots_row(df2plot1, df2plot2, genres_to_show, title1, title2, roll_window=2):
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(11, 10))
    stackplot(df2plot1, genres_to_show, axes[0], title1, roll_window)
    stackplot(df2plot2, genres_to_show, axes[1], title2, roll_window)

compare2plots_row(df_rus, df_en, top_genres, 'Genres (Russian authors)', 'Genres (Foreign authors)')

We can see a lag between the two groups: in the first plot (Russian authors) the volume begins to shrink in 2015, while in the second one (Foreign authors) it does so already in 2010. If we further analyze subgenres, we find that most of the volume in the Russian authors plot is concentrated in the last 15 years, while the Foreign authors plot has most of its mass in the middle of the 20th century. A possible explanation is that works closer to the readers (both in time and in nationality) attract more of their attention and are therefore better represented in these plots.

compare2plots_row(df_rus, df_en, sf_subgenres, 'SciFi subgenres (Russian authors)', 'SciFi subgenres (Foreign authors)') 
compare2plots_row(df_rus, df_en, fantasy_subgenres, 'Fantasy subgenres (Russian authors)', 'Fantasy subgenres (Foreign authors)') 

It is interesting that Technofantasy and Science Fantasy are much more popular among Russian authors, unlike Epic Fantasy and Dark Fantasy.

Compare Female and Male authors

compare2plots_row(df_female, df_male, top_genres, 'Genres (Female authors)', 'Genres (Male authors)')

compare2plots_row(df_female, df_male, sf_subgenres, 'SciFi subgenres (Female authors)', 'SciFi subgenres (Male authors)') 
compare2plots_row(df_female, df_male, fantasy_subgenres, 'Fantasy subgenres (Female authors)', 'Fantasy subgenres (Male authors)')