Entry NLP4: Frequencies and Comparison

26 minute read

In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math and analysis!

import pandas as pd
import os
from IPython.display import display

import string
import re
import itertools
import nltk
nltk.download('stopwords')
# Grab and store the data
def read_script(file_path):
    corpus = ''
    with open(file_path, 'r', encoding='latin-1') as l:
        for line in l:
            if (re.match(r'[^\d+]', line)
               ) and (re.match(r'^(?!\s*$).+', line)
                      ) and not (re.match(r'(.*www.*)|(.*http:*)', line)
                                ) and not (re.match(r'Sync and correct*', line)):
                line = re.sub('</?i>|</?font.*>', '', line)
                corpus = corpus + ' ' + line
    return corpus

def load_files_to_dict(file_path, return_dict):    
    for thing in os.scandir(file_path):
        if thing.is_dir():
            new_path = os.path.join(file_path, thing.name)
            new_dict = return_dict[thing.name] = {}
            load_files_to_dict(new_path, new_dict)
        elif thing.is_file():
            return_dict[thing.name] = read_script(f'{file_path}/{thing.name}')
    return return_dict
def convert_dict_df(script_dict):
    return pd.DataFrame.from_dict(script_dict, orient='index').reset_index().rename(columns={'index':'script_name', 0:'corpus'})

# Clean the text and create ngrams
def punct_tokens(df, text_col):
    newline_list = '\t\r\n'
    remove_newline = str.maketrans(' ', ' ', newline_list)
    punct_list = string.punctuation + '-‘_”'
    nopunct = str.maketrans('', '', punct_list)
    df['no_punct_tokens'] = df[text_col].fillna("").str.lower().str.translate(remove_newline).str.translate(nopunct).str.split()
    return df

def create_ngrams(df):
    stop = nltk.corpus.stopwords.words('english')
    df['unigrams'] = df['no_punct_tokens'].apply(lambda x: [item for item in x if item not in stop])
    df['bigrams'] = df['unigrams'].apply(lambda x:(list(nltk.bigrams(x))))
    df['trigrams'] = df['unigrams'].apply(lambda x:(list(nltk.trigrams(x))))
    return df

def create_ngram_df(script_dict, text_col):
    df = convert_dict_df(script_dict)
    df = punct_tokens(df, text_col)
    df = create_ngrams(df)
    return df

Frequencies

Counting words is a common sample problem and can probably be considered the ‘hello world’ of NLP. Using a dictionary data structure, the concept isn’t difficult:

  • For each word (or in our case, n-gram) in the corpus
  • Insert the word if it’s not there (the dictionary key)
  • Add 1 to the count (the dictionary value)
frequency_dictionary = {}
for ngram in ngram_list:
    if ngram not in frequency_dictionary:
        frequency_dictionary[ngram] = 0
    frequency_dictionary[ngram] +=1
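
As an aside, the standard library’s collections.Counter handles the same insert-or-increment bookkeeping in a single call; a minimal, equivalent sketch:

from collections import Counter
frequency_dictionary = Counter(ngram_list)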

The question is how to apply this general concept to my specific use case.

The n-grams have already been created, so I don’t have to worry about longer n-grams (the bigrams, and I threw in trigrams because why not?) spilling from one script to another. That means I can concatenate all the n-grams of a specific category together (i.e. I don’t want to combine unigrams with bigrams, just all the unigrams with each other).

auth_file_path = os.path.join(os.getcwd(), 'data', '1960s')
raw_auth_dict = load_files_to_dict(auth_file_path, {})

auth_ngram_df = create_ngram_df(raw_auth_dict, 'corpus')
auth_ngram_df.head()
script_name corpus no_punct_tokens unigrams bigrams trigrams
0 The Twilight Zone - 3x17 - One More Pallbearer... You're traveling\n through another dimension-... [youre, traveling, through, another, dimension... [youre, traveling, another, dimension, dimensi... [(youre, traveling), (traveling, another), (an... [(youre, traveling, another), (traveling, anot...
1 The Twilight Zone - 3x05 - A Game of Pool.srt You're traveling\n through another dimension-... [youre, traveling, through, another, dimension... [youre, traveling, another, dimension, dimensi... [(youre, traveling), (traveling, another), (an... [(youre, traveling, another), (traveling, anot...
2 The Twilight Zone - 2x03 - Nervous Man in a Fo... You're traveling\n through another dimension-... [youre, traveling, through, another, dimension... [youre, traveling, another, dimension, dimensi... [(youre, traveling), (traveling, another), (an... [(youre, traveling, another), (traveling, anot...
3 The Twilight Zone - 4x05 - Mute.srt You unlock this door\n with the key of imagin... [you, unlock, this, door, with, the, key, of, ... [unlock, door, key, imagination, beyond, anoth... [(unlock, door), (door, key), (key, imaginatio... [(unlock, door, key), (door, key, imagination)...
4 The Twilight Zone - 3x04 - The Passersby.srt You're traveling\n through another dimension-... [youre, traveling, through, another, dimension... [youre, traveling, another, dimension, dimensi... [(youre, traveling), (traveling, another), (an... [(youre, traveling, another), (traveling, anot...

I already know I want to use the n-grams as my unique identifier, which means I’ll need to create a separate dataframe for each set of frequencies - mixing unigrams with bigrams wouldn’t let me do the analysis I want. This both simplifies and complicates the process, since I won’t be able to just add on to the same dataframe anymore.

The frequency_ct and dict_to_df functions that I created in the previous solution to the homework still work. The only new aspect is that I need to put all the n-gram lists from the different scripts together. My initial thought was to use list.extend, but that would require looping through every row of the dataframe, which isn’t the fastest or most memory-efficient solution.

Fortunately, there is an easier alternative: calling the sum method on the column, as described in this StackOverflow answer.

auth_ngram_df['unigrams'].sum()[:10]
['youre',
 'traveling',
 'another',
 'dimension',
 'dimension',
 'sight',
 'sound',
 'mind',
 'journey',
 'wondrous']
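
Since summing a column of lists works by repeated concatenation, it could get slow on a much larger corpus. If that ever became an issue, itertools.chain (already imported above) would be one possible alternative; this is just a sketch, not the approach used in the rest of the analysis:

list(itertools.chain.from_iterable(auth_ngram_df['unigrams']))[:10]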

Now that all of the ngrams are in a single list, it’s a simple matter of creating a function to process them.

def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict
test_freq = frequency_ct(auth_ngram_df['unigrams'].sum())
test_freq
{'youre': 1410,
 'traveling': 71,
 'another': 358,
 'dimension': 353,
 'sight': 131,
 'sound': 205,
 'mind': 422,
 'journey': 76,
 ...}

Of course now that I have my counts, I want to sort the n-grams from most frequent to least frequent. My favorite method to do this? DataFrames.

Unlike the previous convert_dict_df function, this one will need to be more flexible. It needs to handle the authentic 1960s corpus, all four of the modern corpora, and whichever n-grams I happen to be running. The addition of a couple of variables to handle column naming and a sort_values method takes care of it.

The corpus_name variable in particular is important later in the analysis. I’ll need to compare the authentic corpus which was written in the 1960s about the 1960s to each of the corpora written in the 21st century about the 1960s. With the flow I’ve established, I’ll need to merge dataframes to complete the analysis. This is most easily accomplished when the non-join-on columns have different names.

Example: If I join two dataframes that both have column names = ['unigram', 'frequency'], I’ll end up with a single dataframe with column names = ['unigram', 'frequency_x', 'frequency_y']. I find these _x and _y suffixes less than informative and prefer to name my columns explicitly.
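
To illustrate the default behavior with a tiny, made-up example (not data from the corpora):

left = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [5, 3]})
right = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [7, 2]})
left.merge(right, on='unigram')
# columns: ['unigram', 'frequency_x', 'frequency_y']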

def dict_to_df(freq_dict, gram_name, corpus_name):
    if not (isinstance(gram_name, str) and isinstance(corpus_name, str)):
        raise TypeError('gram_name and corpus_name variables must be strings')
    freq_colname = corpus_name + '_frequency'
    df = pd.DataFrame.from_dict(freq_dict, orient='index'
                               ).reset_index().rename(columns={'index':gram_name, 0:freq_colname}
                                                     ).sort_values(freq_colname, ascending=False)
    return df

But why stop my function at just the frequency? I also need normalized frequencies. Normalized frequencies level the playing field when comparing corpora of different sizes: with raw counts, a larger corpus will have n-grams with larger counts simply because there are more words overall than in a smaller corpus, which doesn’t necessarily reflect any meaningful difference.
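
For example, with made-up numbers: a word that appears 200 times in a 20,000-word corpus and 50 times in a 5,000-word corpus has the same normalized frequency (0.01) in both, even though the raw counts differ by a factor of four.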

Also, the homework problem requires getting ratios of the normalized frequencies later in the analysis.

def normalized_freq(freq_df, corpus_name):
    freq_col_name = corpus_name + '_frequency'
    norm_col_name = corpus_name + '_norm_freq'
    total_ct = freq_df[freq_col_name].sum()
    freq_df[norm_col_name] = freq_df[freq_col_name]/total_ct
    return freq_df

def create_frequencies(ngram_list, gram_name, corpus_name):
    freq_dict = frequency_ct(ngram_list)
    freq_df = dict_to_df(freq_dict, gram_name, corpus_name)
    freq_df = normalized_freq(freq_df, corpus_name)
    return freq_df
auth_freq_df = create_frequencies(auth_ngram_df['unigrams'].sum(), 'unigram', 'authentic')
auth_freq_df.head()
unigram authentic_frequency authentic_norm_freq
206 well 2272 0.012132
25 dont 2199 0.011742
175 im 1988 0.010616
26 know 1777 0.009489
19 mr 1604 0.008565
test_file_path = os.path.join(os.getcwd(), 'data', '21st-century')
raw_test_dict = load_files_to_dict(test_file_path, {})

test_ngram_dict = {}
for script_group in list(raw_test_dict.keys()):
    test_ngram_dict[script_group] = create_ngram_df(raw_test_dict[script_group], 'corpus')

test_freq_dict = {}
for script_group in list(test_ngram_dict.keys()):
    test_freq_dict[script_group] = create_frequencies(test_ngram_dict[script_group]['unigrams'].sum(), 'unigram', script_group)

test_freq_dict['Pan_Am'].head()
unigram Pan_Am_frequency Pan_Am_norm_freq
67 im 489 0.015189
114 oh 407 0.012642
11 dont 379 0.011772
50 well 373 0.011586
119 know 323 0.010033

Compare corpora

The last piece of this homework challenge is to compare the authentic corpus (written regarding the 1960s and penned in the 1960s) to the four test corpora (written regarding the 1960s but not penned until the 21st century).

To compare anything to anything, first I need to combine the different dataframes holding my test corpora with the authentic corpus. I decided to do this by merging the values for the authentic data into each of the dataframes holding the values for the test data.

compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    compare_dict[script_group] = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)
compare_dict['Pan_Am'].head()
unigram Pan_Am_frequency Pan_Am_norm_freq authentic_frequency authentic_norm_freq
0 im 489.0 0.015189 1988.0 0.010616
1 oh 407.0 0.012642 1580.0 0.008437
2 dont 379.0 0.011772 2199.0 0.011742
3 well 373.0 0.011586 2272.0 0.012132
4 know 323.0 0.010033 1777.0 0.009489

The equation I implemented in the previous solution to this homework was:

df['norm_freq_ratio'] = df.loc[(df['imitation_norm_freq'] != 0
                               ) & (df['authentic_norm_freq'] != 0), 'imitation_norm_freq'
                              ]/df.loc[(df['imitation_norm_freq'] != 0
                                       ) & (df['authentic_norm_freq'] != 0), 'authentic_norm_freq']

In order to implement this in the various dataframes, I’ll need a way to identify the appropriate columns, regardless of which dataframe I’m working with. This can be done by looking for ‘norm_freq’ in the column names - which will pull out the normalized frequency for both the authentic and test data.

[compare_dict['Pan_Am'].columns[compare_dict['Pan_Am'].columns.str.contains('norm_freq')]]
[Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')]

Referencing the dataframe by the dictionary and script group name is getting rather tedious, so I can assign the dataframe for a given script group to a shorter variable name. This has a much cleaner appearance and, more importantly, is easier to read. Regardless of how good (or not) code is, it’s much more common to have to read code in order to improve, maintain, update, or repair it than to write it. My philosophy is to make code as easy to read as possible, so that my future self can decipher what I was thinking the first time around.

test = compare_dict['Pan_Am']
test_cols = test.columns[test.columns.str.contains('norm_freq')]
test_cols
Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')

Now I can update my code to the more readable version. Since I use the test dataframe as the left object and the authentic dataframe as the right object in the join, I can count on the fact that the test:authentic columns will always be in the same order.

As an added bonus, I only have to write to the dictionary once instead of the initial write, then the update with the new columns.

compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    df = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)
    freq_cols = df.columns[df.columns.str.contains('norm_freq')]
    df['norm_freq_ratio'] = df.loc[(df[freq_cols[0]]!=0) & (df[freq_cols[1]]!=0), freq_cols[0]] / df.loc[(df[freq_cols[0]]!=0) & (df[freq_cols[1]]!=0), freq_cols[1]]
    compare_dict[script_group] = df
compare_dict['Pan_Am'].head()
unigram Pan_Am_frequency Pan_Am_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
0 im 489.0 0.015189 1988.0 0.010616 1.430801
1 oh 407.0 0.012642 1580.0 0.008437 1.498387
2 dont 379.0 0.011772 2199.0 0.011742 1.002538
3 well 373.0 0.011586 2272.0 0.012132 0.954965
4 know 323.0 0.010033 1777.0 0.009489 1.057309

High Ratios

High ratios for the normalized frequency show unigrams that were used commonly in the 21st-century scripts, but were extremely rare (though still present) in the 1960s scripts.

for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50))
    print('\n')
Pan_Am
unigram Pan_Am_frequency Pan_Am_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
51 dean 87.0 0.002702 1.0 0.000005 506.064637
18 pan 160.0 0.004970 3.0 0.000016 310.231195
162 amanda 32.0 0.000994 1.0 0.000005 186.138717
89 stewardess 54.0 0.001677 2.0 0.000011 157.054543
197 teddy 27.0 0.000839 1.0 0.000005 157.054543
281 stewardesses 19.0 0.000590 1.0 0.000005 110.519863
364 ryan 15.0 0.000466 1.0 0.000005 87.252524
456 cia 13.0 0.000404 1.0 0.000005 75.618854
452 ich 13.0 0.000404 1.0 0.000005 75.618854
491 monte 12.0 0.000373 1.0 0.000005 69.802019
483 omar 12.0 0.000373 1.0 0.000005 69.802019
84 ii 59.0 0.001833 5.0 0.000027 68.638652
548 carlo 10.0 0.000311 1.0 0.000005 58.168349
569 monsieur 10.0 0.000311 1.0 0.000005 58.168349
37 maggie 108.0 0.003355 11.0 0.000059 57.110743
603 le 9.0 0.000280 1.0 0.000005 52.351514
608 hier 9.0 0.000280 1.0 0.000005 52.351514
625 soviets 9.0 0.000280 1.0 0.000005 52.351514
634 lauras 9.0 0.000280 1.0 0.000005 52.351514
641 courier 9.0 0.000280 1.0 0.000005 52.351514
713 maggies 8.0 0.000248 1.0 0.000005 46.534679
349 rio 16.0 0.000497 2.0 0.000011 46.534679
734 moscow 8.0 0.000248 1.0 0.000005 46.534679
660 zu 8.0 0.000248 1.0 0.000005 46.534679
81 ted 59.0 0.001833 8.0 0.000043 42.899157
259 greg 21.0 0.000652 3.0 0.000016 40.717844
764 cockpit 7.0 0.000217 1.0 0.000005 40.717844
396 magazine 14.0 0.000435 2.0 0.000011 40.717844
770 casino 7.0 0.000217 1.0 0.000005 40.717844
818 graham 7.0 0.000217 1.0 0.000005 40.717844
449 previously 13.0 0.000404 2.0 0.000011 37.809427
925 tasty 6.0 0.000186 1.0 0.000005 34.901009
855 diplomatic 6.0 0.000186 1.0 0.000005 34.901009
883 palace 6.0 0.000186 1.0 0.000005 34.901009
877 cleared 6.0 0.000186 1.0 0.000005 34.901009
177 rome 30.0 0.000932 5.0 0.000027 34.901009
1004 choosing 5.0 0.000155 1.0 0.000005 29.084175
1041 pudding 5.0 0.000155 1.0 0.000005 29.084175
1015 guessing 5.0 0.000155 1.0 0.000005 29.084175
1112 khrushchev 5.0 0.000155 1.0 0.000005 29.084175
587 runway 10.0 0.000311 2.0 0.000011 29.084175
987 safely 5.0 0.000155 1.0 0.000005 29.084175
988 economy 5.0 0.000155 1.0 0.000005 29.084175
993 ugh 5.0 0.000155 1.0 0.000005 29.084175
113 london 44.0 0.001367 9.0 0.000048 28.437860
291 anderson 19.0 0.000590 4.0 0.000021 27.629966
105 mm 46.0 0.001429 11.0 0.000059 24.324946
468 cargo 12.0 0.000373 3.0 0.000016 23.267340
1371 fairly 4.0 0.000124 1.0 0.000005 23.267340
1336 32 4.0 0.000124 1.0 0.000005 23.267340
Mad_Men
unigram Mad_Men_frequency Mad_Men_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
138 sterling 170.0 0.001166 2.0 0.000011 109.151410
172 sally 143.0 0.000981 2.0 0.000011 91.815598
54 draper 365.0 0.002503 6.0 0.000032 78.118166
238 jesus 108.0 0.000741 2.0 0.000011 69.343249
553 francis 42.0 0.000288 1.0 0.000005 53.933638
317 clients 74.0 0.000507 2.0 0.000011 47.512967
187 joan 134.0 0.000919 4.0 0.000021 43.018497
195 betty 128.0 0.000878 4.0 0.000021 41.092295
435 jimmy 55.0 0.000377 2.0 0.000011 35.313691
457 ken 52.0 0.000357 2.0 0.000011 33.387490
843 crap 26.0 0.000178 1.0 0.000005 33.387490
931 presentation 23.0 0.000158 1.0 0.000005 29.535087
905 freddy 23.0 0.000158 1.0 0.000005 29.535087
354 bobby 67.0 0.000459 3.0 0.000016 28.678998
358 creative 66.0 0.000453 3.0 0.000016 28.250953
942 holloway 22.0 0.000151 1.0 0.000005 28.250953
937 clara 22.0 0.000151 1.0 0.000005 28.250953
1033 fatherinlaw 20.0 0.000137 1.0 0.000005 25.682685
994 spectacular 20.0 0.000137 1.0 0.000005 25.682685
1035 belle 20.0 0.000137 1.0 0.000005 25.682685
1069 joey 19.0 0.000130 1.0 0.000005 24.398550
1051 whitman 19.0 0.000130 1.0 0.000005 24.398550
1050 connie 19.0 0.000130 1.0 0.000005 24.398550
1142 delicious 18.0 0.000123 1.0 0.000005 23.114416
1120 jewish 18.0 0.000123 1.0 0.000005 23.114416
451 dick 53.0 0.000363 3.0 0.000016 22.686371
653 airlines 35.0 0.000240 2.0 0.000011 22.472349
691 partners 33.0 0.000226 2.0 0.000011 21.188215
1196 dallas 16.0 0.000110 1.0 0.000005 20.546148
1217 strategy 16.0 0.000110 1.0 0.000005 20.546148
1216 award 16.0 0.000110 1.0 0.000005 20.546148
1270 reception 15.0 0.000103 1.0 0.000005 19.262013
1331 danny 15.0 0.000103 1.0 0.000005 19.262013
1285 episode 15.0 0.000103 1.0 0.000005 19.262013
1303 casting 15.0 0.000103 1.0 0.000005 19.262013
755 previously 29.0 0.000199 2.0 0.000011 18.619946
1370 suitcase 14.0 0.000096 1.0 0.000005 17.977879
1371 cancel 14.0 0.000096 1.0 0.000005 17.977879
1358 grey 14.0 0.000096 1.0 0.000005 17.977879
1432 bowl 13.0 0.000089 1.0 0.000005 16.693745
364 duck 64.0 0.000439 5.0 0.000027 16.436918
279 lane 89.0 0.000610 7.0 0.000037 16.326850
635 hare 37.0 0.000254 3.0 0.000016 15.837656
630 beans 37.0 0.000254 3.0 0.000016 15.837656
636 greg 37.0 0.000254 3.0 0.000016 15.837656
482 campaign 49.0 0.000336 4.0 0.000021 15.730644
1536 joyce 12.0 0.000082 1.0 0.000005 15.409611
1572 chemical 12.0 0.000082 1.0 0.000005 15.409611
1584 salad 12.0 0.000082 1.0 0.000005 15.409611
1600 mens 12.0 0.000082 1.0 0.000005 15.409611
X-Men_First_Class
unigram X-Men_First_Class_frequency X-Men_First_Class_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
79 cia 10.0 0.002281 1.0 0.000005 427.076397
171 commands 5.0 0.001140 1.0 0.000005 213.538198
103 cuba 9.0 0.002052 2.0 0.000011 192.184379
210 sebastian 4.0 0.000912 1.0 0.000005 170.830559
211 shaws 4.0 0.000912 1.0 0.000005 170.830559
364 x 3.0 0.000684 1.0 0.000005 128.122919
326 presentation 3.0 0.000684 1.0 0.000005 128.122919
284 moscow 3.0 0.000684 1.0 0.000005 128.122919
264 threat 3.0 0.000684 1.0 0.000005 128.122919
370 homo 3.0 0.000684 1.0 0.000005 128.122919
500 jekyll 2.0 0.000456 1.0 0.000005 85.415279
396 groovy 2.0 0.000456 1.0 0.000005 85.415279
540 atom 2.0 0.000456 1.0 0.000005 85.415279
417 arrangement 2.0 0.000456 1.0 0.000005 85.415279
511 cola 2.0 0.000456 1.0 0.000005 85.415279
434 facility 2.0 0.000456 1.0 0.000005 85.415279
577 delicious 2.0 0.000456 1.0 0.000005 85.415279
415 florida 2.0 0.000456 1.0 0.000005 85.415279
624 formal 2.0 0.000456 1.0 0.000005 85.415279
623 dusseldorf 2.0 0.000456 1.0 0.000005 85.415279
216 argentina 4.0 0.000912 2.0 0.000011 85.415279
595 absorb 2.0 0.000456 1.0 0.000005 85.415279
118 russians 7.0 0.001596 4.0 0.000021 74.738369
115 turkey 8.0 0.001824 5.0 0.000027 68.332223
116 russia 8.0 0.001824 5.0 0.000027 68.332223
72 missiles 11.0 0.002509 7.0 0.000037 67.112005
18 hank 23.0 0.005245 15.0 0.000080 65.485048
335 jesus 3.0 0.000684 2.0 0.000011 64.061460
380 usa 3.0 0.000684 2.0 0.000011 64.061460
296 reconsider 3.0 0.000684 2.0 0.000011 64.061460
81 wow 10.0 0.002281 7.0 0.000037 61.010914
230 destination 4.0 0.000912 3.0 0.000016 56.943520
164 beast 5.0 0.001140 4.0 0.000021 53.384550
136 soviet 6.0 0.001368 5.0 0.000027 51.249168
368 rockets 3.0 0.000684 3.0 0.000016 42.707640
13 charles 26.0 0.005929 26.0 0.000139 42.707640
1117 spectacular 1.0 0.000228 1.0 0.000005 42.707640
1106 oneway 1.0 0.000228 1.0 0.000005 42.707640
1088 expectations 1.0 0.000228 1.0 0.000005 42.707640
525 backup 2.0 0.000456 2.0 0.000011 42.707640
529 nazis 2.0 0.000456 2.0 0.000011 42.707640
532 gates 2.0 0.000456 2.0 0.000011 42.707640
1124 currently 1.0 0.000228 1.0 0.000005 42.707640
575 senior 2.0 0.000456 2.0 0.000011 42.707640
1065 hoohoo 1.0 0.000228 1.0 0.000005 42.707640
588 freaks 2.0 0.000456 2.0 0.000011 42.707640
573 scratch 2.0 0.000456 2.0 0.000011 42.707640
407 serum 2.0 0.000456 2.0 0.000011 42.707640
491 mutated 2.0 0.000456 2.0 0.000011 42.707640
1140 colleges 1.0 0.000228 1.0 0.000005 42.707640
The_Kennedys
unigram The_Kennedys_frequency The_Kennedys_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
16 bobby 112.0 0.006192 3.0 0.000016 386.549750
86 khrushchev 30.0 0.001659 1.0 0.000005 310.620335
103 sighs 25.0 0.001382 1.0 0.000005 258.850279
165 rosemary 18.0 0.000995 1.0 0.000005 186.372201
12 kennedy 128.0 0.007077 9.0 0.000048 147.257048
101 cuba 25.0 0.001382 2.0 0.000011 129.425140
37 ii 60.0 0.003317 5.0 0.000027 124.248134
298 election 11.0 0.000608 1.0 0.000005 113.894123
163 ethel 18.0 0.000995 2.0 0.000011 93.186101
399 elected 8.0 0.000442 1.0 0.000005 82.832089
379 dallas 8.0 0.000442 1.0 0.000005 82.832089
394 mississippi 8.0 0.000442 1.0 0.000005 82.832089
418 cabinet 8.0 0.000442 1.0 0.000005 82.832089
110 senator 23.0 0.001272 3.0 0.000016 79.380752
460 organized 7.0 0.000387 1.0 0.000005 72.478078
214 christ 14.0 0.000774 2.0 0.000011 72.478078
127 meredith 21.0 0.001161 3.0 0.000016 72.478078
492 cia 7.0 0.000387 1.0 0.000005 72.478078
243 lyndon 13.0 0.000719 2.0 0.000011 67.301073
583 bastard 6.0 0.000332 1.0 0.000005 62.124067
559 bases 6.0 0.000332 1.0 0.000005 62.124067
525 francis 6.0 0.000332 1.0 0.000005 62.124067
556 option 6.0 0.000332 1.0 0.000005 62.124067
590 jolly 6.0 0.000332 1.0 0.000005 62.124067
616 dean 6.0 0.000332 1.0 0.000005 62.124067
607 mcnamara 6.0 0.000332 1.0 0.000005 62.124067
109 campaign 23.0 0.001272 4.0 0.000021 59.535564
294 campus 11.0 0.000608 2.0 0.000011 56.947061
725 sites 5.0 0.000276 1.0 0.000005 51.770056
689 bundy 5.0 0.000276 1.0 0.000005 51.770056
872 regime 4.0 0.000221 1.0 0.000005 41.416045
762 threat 4.0 0.000221 1.0 0.000005 41.416045
794 perception 4.0 0.000221 1.0 0.000005 41.416045
802 largely 4.0 0.000221 1.0 0.000005 41.416045
894 disaster 4.0 0.000221 1.0 0.000005 41.416045
775 grunts 4.0 0.000221 1.0 0.000005 41.416045
773 handing 4.0 0.000221 1.0 0.000005 41.416045
763 humiliating 4.0 0.000221 1.0 0.000005 41.416045
747 itsits 4.0 0.000221 1.0 0.000005 41.416045
836 defeat 4.0 0.000221 1.0 0.000005 41.416045
903 diplomatic 4.0 0.000221 1.0 0.000005 41.416045
740 subs 4.0 0.000221 1.0 0.000005 41.416045
857 cancel 4.0 0.000221 1.0 0.000005 41.416045
885 grasp 4.0 0.000221 1.0 0.000005 41.416045
798 lodge 4.0 0.000221 1.0 0.000005 41.416045
796 operational 4.0 0.000221 1.0 0.000005 41.416045
467 administration 7.0 0.000387 2.0 0.000011 36.239039
439 roosevelt 7.0 0.000387 2.0 0.000011 36.239039
508 interview 7.0 0.000387 2.0 0.000011 36.239039
445 jimmy 7.0 0.000387 2.0 0.000011 36.239039
high_score_results = pd.DataFrame(columns = ['script', 'score'])
for script_group in compare_dict.keys():
    high_score_results = high_score_results.append(
        {'script':script_group,
         'score':compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio'].sum()
        }, ignore_index=True)
display(high_score_results.sort_values('score'))
print('Best performing corpus (lowest score) {}'.format(high_score_results.iloc[high_score_results['score'].idxmin(), 0]))
print('Worst performing corpus (highest score) {}'.format(high_score_results.iloc[high_score_results['score'].idxmax(), 0]))
script score
1 Mad_Men 1456.975643
0 Pan_Am 3336.811190
3 The_Kennedys 3980.829683
2 X-Men_First_Class 4282.152672
Best performing corpus (lowest score) Mad_Men
Worst performing corpus (highest score) X-Men_First_Class

Low Ratios

Low ratios for the normalized frequency show unigrams that were used commonly in the 1960s scripts, but were rare in the 21st-century scripts.

for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio').head(50))
    print('\n')
Pan_Am
unigram Pan_Am_frequency Pan_Am_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
5337 honey 1.0 0.000031 152.0 0.000812 0.038269
3797 imagination 1.0 0.000031 138.0 0.000737 0.042151
3761 ship 1.0 0.000031 137.0 0.000732 0.042459
3260 human 1.0 0.000031 101.0 0.000539 0.057592
3247 major 1.0 0.000031 76.0 0.000406 0.076537
4218 machine 1.0 0.000031 75.0 0.000400 0.077558
3352 radio 1.0 0.000031 74.0 0.000395 0.078606
3118 jerry 1.0 0.000031 69.0 0.000368 0.084302
4146 shadow 1.0 0.000031 68.0 0.000363 0.085542
3089 martin 1.0 0.000031 66.0 0.000352 0.088134
3510 jackie 1.0 0.000031 65.0 0.000347 0.089490
5794 television 1.0 0.000031 60.0 0.000320 0.096947
5287 gun 1.0 0.000031 59.0 0.000315 0.098590
4002 floor 1.0 0.000031 55.0 0.000294 0.105761
3930 aunt 1.0 0.000031 54.0 0.000288 0.107719
1227 sound 4.0 0.000124 205.0 0.001095 0.113499
2584 whose 2.0 0.000062 100.0 0.000534 0.116337
5824 devil 1.0 0.000031 49.0 0.000262 0.118711
1297 earth 4.0 0.000124 194.0 0.001036 0.119935
2324 general 2.0 0.000062 94.0 0.000502 0.123762
2662 alan 2.0 0.000062 94.0 0.000502 0.123762
4767 mans 1.0 0.000031 46.0 0.000246 0.126453
3009 rip 1.0 0.000031 46.0 0.000246 0.126453
2823 ought 2.0 0.000062 88.0 0.000470 0.132201
4626 evil 1.0 0.000031 44.0 0.000235 0.132201
5683 scene 1.0 0.000031 44.0 0.000235 0.132201
2469 kids 2.0 0.000062 85.0 0.000454 0.136867
5001 rid 1.0 0.000031 42.0 0.000224 0.138496
2146 kid 2.0 0.000062 83.0 0.000443 0.140165
2976 account 1.0 0.000031 41.0 0.000219 0.141874
3576 peter 1.0 0.000031 41.0 0.000219 0.141874
2942 agnes 1.0 0.000031 39.0 0.000208 0.149150
4801 heaven 1.0 0.000031 39.0 0.000208 0.149150
4363 destroy 1.0 0.000031 37.0 0.000198 0.157212
2922 steel 1.0 0.000031 37.0 0.000198 0.157212
2321 dog 2.0 0.000062 72.0 0.000384 0.161579
4102 eh 1.0 0.000031 36.0 0.000192 0.161579
1911 ideas 2.0 0.000062 71.0 0.000379 0.163855
5318 team 1.0 0.000031 35.0 0.000187 0.166195
3242 broken 1.0 0.000031 33.0 0.000176 0.176268
3395 100 1.0 0.000031 33.0 0.000176 0.176268
4813 fellas 1.0 0.000031 32.0 0.000171 0.181776
4513 harmon 1.0 0.000031 32.0 0.000171 0.181776
1692 alive 3.0 0.000093 93.0 0.000497 0.187640
1658 indeed 3.0 0.000093 93.0 0.000497 0.187640
1027 town 5.0 0.000155 155.0 0.000828 0.187640
5646 explanation 1.0 0.000031 31.0 0.000166 0.187640
3003 fellow 1.0 0.000031 31.0 0.000166 0.187640
317 old 17.0 0.000528 526.0 0.002809 0.187997
2147 crossed 2.0 0.000062 61.0 0.000326 0.190716
Mad_Men
unigram Mad_Men_frequency Mad_Men_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
3985 twilight 3.0 0.000021 499.0 0.002665 0.007720
3764 zone 4.0 0.000027 506.0 0.002702 0.010151
12149 doc 1.0 0.000007 57.0 0.000304 0.022529
3302 captain 4.0 0.000027 208.0 0.001111 0.024695
10348 commander 1.0 0.000007 52.0 0.000278 0.024695
9954 emma 1.0 0.000007 48.0 0.000256 0.026753
10925 ace 1.0 0.000007 47.0 0.000251 0.027322
10578 schmidt 1.0 0.000007 46.0 0.000246 0.027916
7970 base 1.0 0.000007 45.0 0.000240 0.028536
4642 sight 3.0 0.000021 131.0 0.000700 0.029408
12626 precisely 1.0 0.000007 42.0 0.000224 0.030575
7376 destroy 1.0 0.000007 37.0 0.000198 0.034706
11629 sergeant 1.0 0.000007 37.0 0.000198 0.034706
5882 traveling 2.0 0.000014 71.0 0.000379 0.036173
7451 access 1.0 0.000007 34.0 0.000182 0.037769
10221 julius 1.0 0.000007 33.0 0.000176 0.038913
9446 jess 1.0 0.000007 33.0 0.000176 0.038913
5213 substance 2.0 0.000014 63.0 0.000336 0.040766
4142 colonel 3.0 0.000021 92.0 0.000491 0.041874
10549 radar 1.0 0.000007 29.0 0.000155 0.044280
3890 magic 3.0 0.000021 85.0 0.000454 0.045322
3835 mister 3.0 0.000021 85.0 0.000454 0.045322
9702 alex 1.0 0.000007 28.0 0.000150 0.045862
9698 grant 1.0 0.000007 28.0 0.000150 0.045862
7923 driscoll 1.0 0.000007 27.0 0.000144 0.047561
6934 illusion 1.0 0.000007 27.0 0.000144 0.047561
11946 witch 1.0 0.000007 27.0 0.000144 0.047561
8747 reckon 1.0 0.000007 26.0 0.000139 0.049390
2545 amen 6.0 0.000041 152.0 0.000812 0.050690
6240 stations 2.0 0.000014 49.0 0.000262 0.052414
4150 jerry 3.0 0.000021 69.0 0.000368 0.055832
11773 horn 1.0 0.000007 23.0 0.000123 0.055832
12777 christie 1.0 0.000007 23.0 0.000123 0.055832
8236 33 1.0 0.000007 23.0 0.000123 0.055832
10139 toward 1.0 0.000007 23.0 0.000123 0.055832
12232 shortly 1.0 0.000007 23.0 0.000123 0.055832
12410 engines 1.0 0.000007 22.0 0.000117 0.058370
9830 repeat 1.0 0.000007 22.0 0.000117 0.058370
8212 barney 1.0 0.000007 22.0 0.000117 0.058370
7255 main 1.0 0.000007 21.0 0.000112 0.061149
9793 item 1.0 0.000007 21.0 0.000112 0.061149
11043 function 1.0 0.000007 21.0 0.000112 0.061149
7471 properly 1.0 0.000007 21.0 0.000112 0.061149
10641 tonights 1.0 0.000007 21.0 0.000112 0.061149
10496 jenny 1.0 0.000007 19.0 0.000101 0.067586
10444 wings 1.0 0.000007 19.0 0.000101 0.067586
9436 ross 1.0 0.000007 19.0 0.000101 0.067586
10725 degree 1.0 0.000007 19.0 0.000101 0.067586
9437 reverend 1.0 0.000007 19.0 0.000101 0.067586
6728 doll 2.0 0.000014 37.0 0.000198 0.069413
X-Men_First_Class
unigram X-Men_First_Class_frequency X-Men_First_Class_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
212 mr 4.0 0.000912 1604.0 0.008565 0.106503
1516 boy 1.0 0.000228 311.0 0.001661 0.137324
1532 away 1.0 0.000228 305.0 0.001629 0.140025
874 hear 1.0 0.000228 297.0 0.001586 0.143797
782 long 1.0 0.000228 293.0 0.001565 0.145760
405 old 2.0 0.000456 526.0 0.002809 0.162386
1132 minute 1.0 0.000228 217.0 0.001159 0.196809
986 captain 1.0 0.000228 208.0 0.001111 0.205325
1374 room 1.0 0.000228 201.0 0.001073 0.212476
479 night 2.0 0.000456 394.0 0.002104 0.216790
1401 mrs 1.0 0.000228 195.0 0.001041 0.219014
1215 doctor 1.0 0.000228 182.0 0.000972 0.234657
1109 sit 1.0 0.000228 182.0 0.000972 0.234657
677 dead 1.0 0.000228 181.0 0.000967 0.235954
1500 land 1.0 0.000228 180.0 0.000961 0.237265
94 oh 9.0 0.002052 1580.0 0.008437 0.243271
673 quite 1.0 0.000228 171.0 0.000913 0.249752
1178 girl 1.0 0.000228 171.0 0.000913 0.249752
1001 trying 1.0 0.000228 169.0 0.000902 0.252708
376 mean 3.0 0.000684 502.0 0.002681 0.255225
635 dear 1.0 0.000228 166.0 0.000886 0.257275
1084 heard 1.0 0.000228 165.0 0.000881 0.258834
859 ago 1.0 0.000228 162.0 0.000865 0.263627
756 fine 1.0 0.000228 161.0 0.000860 0.265265
1429 tonight 1.0 0.000228 160.0 0.000854 0.266923
1204 fact 1.0 0.000228 155.0 0.000828 0.275533
1361 town 1.0 0.000228 155.0 0.000828 0.275533
1326 hours 1.0 0.000228 153.0 0.000817 0.279135
1363 honey 1.0 0.000228 152.0 0.000812 0.280971
42 well 15.0 0.003421 2272.0 0.012132 0.281961
406 around 2.0 0.000456 300.0 0.001602 0.284718
1469 car 1.0 0.000228 146.0 0.000780 0.292518
527 understand 2.0 0.000456 285.0 0.001522 0.299703
261 uh 4.0 0.000912 569.0 0.003038 0.300229
478 last 2.0 0.000456 282.0 0.001506 0.302891
334 youve 3.0 0.000684 422.0 0.002253 0.303609
1056 father 1.0 0.000228 139.0 0.000742 0.307249
1197 getting 1.0 0.000228 137.0 0.000732 0.311735
797 four 1.0 0.000228 136.0 0.000726 0.314027
418 told 2.0 0.000456 272.0 0.001452 0.314027
1314 house 1.0 0.000228 134.0 0.000716 0.318714
766 late 1.0 0.000228 134.0 0.000716 0.318714
360 next 3.0 0.000684 390.0 0.002083 0.328520
558 huh 2.0 0.000456 256.0 0.001367 0.333653
1240 game 1.0 0.000228 123.0 0.000657 0.347217
791 five 1.0 0.000228 122.0 0.000651 0.350063
381 morning 2.0 0.000456 243.0 0.001298 0.351503
1293 check 1.0 0.000228 118.0 0.000630 0.361929
742 says 1.0 0.000228 117.0 0.000625 0.365023
473 everything 2.0 0.000456 234.0 0.001250 0.365023
The_Kennedys
unigram The_Kennedys_frequency The_Kennedys_norm_freq authentic_frequency authentic_norm_freq norm_freq_ratio
1774 zone 2.0 0.000111 506.0 0.002702 0.040925
1874 earth 1.0 0.000055 194.0 0.001036 0.053371
2646 game 1.0 0.000055 123.0 0.000657 0.084179
3250 guess 1.0 0.000055 111.0 0.000593 0.093279
3314 kill 1.0 0.000055 106.0 0.000566 0.097679
1328 sound 2.0 0.000111 205.0 0.001095 0.101015
2802 hot 1.0 0.000055 101.0 0.000539 0.102515
2239 key 1.0 0.000055 97.0 0.000518 0.106742
2959 space 1.0 0.000055 90.0 0.000481 0.115045
2352 ought 1.0 0.000055 88.0 0.000470 0.117659
1618 story 2.0 0.000111 174.0 0.000929 0.119012
2584 hate 1.0 0.000055 81.0 0.000433 0.127827
2791 black 1.0 0.000055 77.0 0.000411 0.134468
1581 ten 2.0 0.000111 153.0 0.000817 0.135347
1760 amen 2.0 0.000111 152.0 0.000812 0.136237
2156 cold 1.0 0.000055 76.0 0.000406 0.136237
3020 darling 1.0 0.000055 76.0 0.000406 0.136237
2852 stuff 1.0 0.000055 71.0 0.000379 0.145831
2310 pool 1.0 0.000055 70.0 0.000374 0.147914
3095 shoot 1.0 0.000055 70.0 0.000374 0.147914
1260 ship 2.0 0.000111 137.0 0.000732 0.151153
3414 book 1.0 0.000055 67.0 0.000358 0.154537
3394 martin 1.0 0.000055 66.0 0.000352 0.156879
1471 death 2.0 0.000111 131.0 0.000700 0.158077
2172 odd 1.0 0.000055 63.0 0.000336 0.164349
928 play 3.0 0.000166 187.0 0.000999 0.166107
2229 wonder 1.0 0.000055 61.0 0.000326 0.169738
2543 boundaries 1.0 0.000055 60.0 0.000320 0.172567
1078 land 3.0 0.000166 180.0 0.000961 0.172567
2923 hit 1.0 0.000055 59.0 0.000315 0.175492
3638 message 1.0 0.000055 58.0 0.000310 0.178517
3135 fair 1.0 0.000055 58.0 0.000310 0.178517
2976 charlie 1.0 0.000055 58.0 0.000310 0.178517
2269 across 1.0 0.000055 57.0 0.000304 0.181649
2478 pick 1.0 0.000055 56.0 0.000299 0.184893
1869 creator 1.0 0.000055 56.0 0.000299 0.184893
1970 floor 1.0 0.000055 55.0 0.000294 0.188255
3388 heres 1.0 0.000055 55.0 0.000294 0.188255
1683 case 2.0 0.000111 107.0 0.000571 0.193533
2696 nights 1.0 0.000055 53.0 0.000283 0.195359
317 old 10.0 0.000553 526.0 0.002809 0.196844
2938 company 1.0 0.000055 52.0 0.000278 0.199116
3031 somewhere 1.0 0.000055 52.0 0.000278 0.199116
2871 apartment 1.0 0.000055 52.0 0.000278 0.199116
3685 bomb 1.0 0.000055 52.0 0.000278 0.199116
2213 station 1.0 0.000055 51.0 0.000272 0.203020
2656 person 1.0 0.000055 51.0 0.000272 0.203020
1706 street 2.0 0.000111 101.0 0.000539 0.205030
1875 named 1.0 0.000055 50.0 0.000267 0.207080
3509 dark 1.0 0.000055 50.0 0.000267 0.207080
low_score_results = pd.DataFrame(columns = ['script', 'score'])
for script_group in compare_dict.keys():
    low_score_results = low_score_results.append(
        {'script':script_group,
         'score':compare_dict[script_group].sort_values('norm_freq_ratio').head(50)['norm_freq_ratio'].sum()
        }, ignore_index=True)
display(low_score_results.sort_values('score', ascending=False))
print('Best performing corpus (highest score) {}'.format(low_score_results.iloc[low_score_results['score'].idxmax(), 0]))
print('Worst performing corpus (lowest score) {}'.format(low_score_results.iloc[low_score_results['score'].idxmin(), 0]))
script score
2 X-Men_First_Class 13.255571
3 The_Kennedys 7.791826
0 Pan_Am 6.533376
1 Mad_Men 2.309133
Best performing corpus (highest score) X-Men_First_Class
Worst performing corpus (lowest score) Mad_Men

Ranking

Both the top and bottom normalized frequency ratio scores measure mismatch with the authentic corpus:

  • The 50 highest ratios are words that were used frequently in the 21st-century scripts, but were rare in the 1960s
  • The 50 lowest ratios are words that were used frequently in the 1960s, but showed up rarely in the 21st-century scripts

In the high ratios set, the higher the ratio, the further the script is from the authentic corpus. In the low ratios set, the higher the ratio, the closer the script is to the authentic corpus. So to get my ranking, I’m going to subtract the low ratio from the high ratio. The script corpora will then be sorted from lowest (best) to highest (worst) score.

results = pd.DataFrame(columns = ['script', 'high_ratio', 'low_ratio'])
for script_group in compare_dict.keys():
    results = results.append(
        {'script':script_group,
         'high_ratio':compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio'].sum(),
         'low_ratio':compare_dict[script_group].sort_values('norm_freq_ratio').head(50)['norm_freq_ratio'].sum()
        }, ignore_index=True)
results['combined_score'] = results['high_ratio'] - results['low_ratio']
results = results.sort_values('combined_score')
results['rank'] = range(1, 1+len(results))
display(results)
script high_ratio low_ratio combined_score rank
0 Mad_Men 1456.975643 2.309133 1454.666510 1
1 Pan_Am 3336.811190 6.533376 3330.277814 2
3 The_Kennedys 3980.829683 7.791826 3973.037857 3
2 X-Men_First_Class 4282.152672 13.255571 4268.897101 4

The analysis for the unigrams is now complete. To see the clean code (including improvements to functions) and the results for unigrams, bigrams, and trigrams, see the accompanying notebook.

Caveats

There are several problems with this exercise and the solution.

Corpus data processing

The biggest initial problem for me was the fact that punctuation wasn’t removed, the n-grams were case-sensitive, and stopwords weren’t removed. The first two mean that words aren’t counted correctly, especially when they’re prone to different capitalizations and uses with punctuation. For example, in the initial solution I noticed ‘daddy’ written several ways. Here are several ways ‘daddy’ could be included in a script:

  • Daddy.
  • daddy.
  • Daddy
  • daddy
  • Daddy!
  • daddy!

That’s six variants of a single word which should all be counted together.
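
As a quick check, the lowercasing and punctuation stripping now done in punct_tokens collapses all of these into a single token; a minimal sketch of the same idea:

variants = ['Daddy.', 'daddy.', 'Daddy', 'daddy', 'Daddy!', 'daddy!']
nopunct = str.maketrans('', '', string.punctuation)
set(v.lower().translate(nopunct) for v in variants)
# {'daddy'}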

The last point, that stopwords weren’t removed, means that there’s a lot of meaningless noise: words like ‘the’, ‘a’, ‘an’, ‘of’, ‘for’, etc. remain in the analysis.

Proper Nouns

Related to proper counting and stopwords are proper nouns. In a script or novel, the names of the story’s characters will show up a disproportionate amount of the time. With a large enough corpus this becomes moot, because names common to the era will naturally show up more than modern names. However, these corpora aren’t large enough for this averaging of character names. The same is true for place names: the location where the script is set has a higher likelihood of being mentioned.

Ratio impact

As can be seen in the final results dataframe, the high ratios have a much larger impact on my ranking than the low ratios. This means that including words that were rare in the 1960s has a much bigger impact on the ranking than excluding words that were common.

Repetition

The authentic 1960s corpus includes many, many The Twilight Zone episodes. Most, if not all, of The Twilight Zone episodes start with the same introduction. This means that words like ‘traveling’, ‘another’, ‘dimension’, ‘sight’, ‘sound’, ‘mind’, and ‘journey’ are disproportionately represented. An improvement to the analysis would be to account for and remove this repetition so that it’s only represented once in the frequencies.
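
One possible (untested) way to do that would be to keep only the first occurrence of each non-blank subtitle line across the corpus before tokenizing. The sketch below uses a hypothetical dedupe_repeated_lines helper on the flat authentic dictionary returned by load_files_to_dict; it’s deliberately blunt and would also drop any dialogue repeated verbatim between scripts.

def dedupe_repeated_lines(script_dict):
    # Keep the first occurrence of each non-blank line across all scripts,
    # so boilerplate like the Twilight Zone introduction only counts once.
    seen = set()
    deduped = {}
    for name, corpus in script_dict.items():
        kept = []
        for line in corpus.splitlines():
            key = line.strip().lower()
            if key and key in seen:
                continue
            if key:
                seen.add(key)
            kept.append(line)
        deduped[name] = ' '.join(kept)
    return deduped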