Entry NLP4: Frequencies and Comparison
In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math and analysis!
import pandas as pd
import os
from IPython.display import display
import string
import re
import itertools
import nltk
nltk.download('stopwords')
# Grab and store the data
def read_script(file_path):
    corpus = ''
    with open(file_path, 'r', encoding='latin-1') as l:
        for line in l:
            # Keep lines that aren't bare subtitle index numbers, aren't
            # blank, and aren't watermark/attribution lines.
            if (re.match(r'[^\d+]', line)
                    and re.match(r'^(?!\s*$).+', line)
                    and not re.match(r'(.*www.*)|(.*http:*)', line)
                    and not re.match(r'Sync and correct*', line)):
                # Strip italic and font markup before appending.
                line = re.sub(r'</?i>|</?font.*>', '', line)
                corpus = corpus + ' ' + line
    return corpus
def load_files_to_dict(file_path, return_dict):
    for thing in os.scandir(file_path):
        if thing.is_dir():
            # Recurse into subdirectories, nesting a new dict per folder.
            new_path = os.path.join(file_path, thing.name)
            new_dict = return_dict[thing.name] = {}
            load_files_to_dict(new_path, new_dict)
        elif thing.is_file():  # note the parentheses: is_file is a method
            return_dict[thing.name] = read_script(os.path.join(file_path, thing.name))
    return return_dict
def convert_dict_df(script_dict):
    return (pd.DataFrame.from_dict(script_dict, orient='index')
              .reset_index()
              .rename(columns={'index': 'script_name', 0: 'corpus'}))
# Clean the text and create ngrams
def punct_tokens(df, text_col):
    # Two translation tables: the first deletes tab/return/newline
    # characters, the second deletes punctuation (plus a few typographic
    # characters not in string.punctuation).
    newline_list = '\t\r\n'
    remove_newline = str.maketrans(' ', ' ', newline_list)
    punct_list = string.punctuation + '-‘_”'
    nopunct = str.maketrans('', '', punct_list)
    df['no_punct_tokens'] = (df[text_col].fillna("")
                             .str.lower()
                             .str.translate(remove_newline)
                             .str.translate(nopunct)
                             .str.split())
    return df
def create_ngrams(df):
    # Drop English stopwords, then build bigrams and trigrams from what's left.
    stop = nltk.corpus.stopwords.words('english')
    df['unigrams'] = df['no_punct_tokens'].apply(lambda x: [item for item in x if item not in stop])
    df['bigrams'] = df['unigrams'].apply(lambda x: list(nltk.bigrams(x)))
    df['trigrams'] = df['unigrams'].apply(lambda x: list(nltk.trigrams(x)))
    return df
def create_ngram_df(script_dict, text_col):
    df = convert_dict_df(script_dict)
    df = punct_tokens(df, text_col)
    df = create_ngrams(df)
    return df
Frequencies
Counting words is a common sample problem and can probably be considered the ‘hello world’ of NLP. Using a dictionary data structure, the concept isn’t difficult:
- For each word (or in our case, n-gram) in the corpus
- Insert the word if it’s not there (the dictionary key)
- Add 1 to the count (the dictionary value)
frequency_dictionary = {}
for ngram in ngram_list:
    if ngram not in frequency_dictionary:
        frequency_dictionary[ngram] = 0
    frequency_dictionary[ngram] += 1
The question is how to apply this general concept to my specific use case.
The n-grams have already been created, so I don’t have to worry about longer n-grams (the bigrams, plus the trigrams I threw in because why not?) spilling from one script to another. That means I can concatenate all the n-grams of a given category together (i.e. I don’t want to combine unigrams with bigrams, just all the unigrams with each other).
auth_file_path = os.path.join(os.getcwd(), 'data', '1960s')
raw_auth_dict = load_files_to_dict(auth_file_path, {})
auth_ngram_df = create_ngram_df(raw_auth_dict, 'corpus')
auth_ngram_df.head()
script_name | corpus | no_punct_tokens | unigrams | bigrams | trigrams | |
---|---|---|---|---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
I already know I want to use the n-grams as my unique identifier, which means I’ll need to create a separate dataframe for each set of frequencies; mixing unigrams with bigrams wouldn’t let me do the analysis I want. This both simplifies and complicates the process, since I won’t be able to just add on to the same dataframe anymore.
The `frequency_ct` and `dict_to_df` functions that I created in the previous solution to the homework still work. The only new aspect is that I need to put all the n-gram lists from the different scripts together. My initial thought was to use `list.extend`, but that would require looping through every row of the dataframe, which isn’t the fastest or most memory-efficient solution.
Fortunately, there is an easy alternative: calling the `sum` method on the column, as described in this StackOverflow answer.
auth_ngram_df['unigrams'].sum()[:10]
['youre',
'traveling',
'another',
'dimension',
'dimension',
'sight',
'sound',
'mind',
'journey',
'wondrous']
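As an aside, since `itertools` was already imported at the top of the notebook, the same flattening could be done lazily with `chain.from_iterable`. Summing a column of lists works by repeated concatenation, so on very large corpora this variation (not what I used above, just an equivalent) can be kinder to memory:

# Flatten the column of lists without the intermediate copies that
# repeated list addition creates.
all_unigrams = list(itertools.chain.from_iterable(auth_ngram_df['unigrams']))
all_unigrams[:10]  # same first ten tokens as the sum() version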
Now that all of the n-grams are in a single list, it’s a simple matter of writing a function to count them.
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] += 1
    return freq_dict
test_freq = frequency_ct(auth_ngram_df['unigrams'].sum())
test_freq
{'youre': 1410,
'traveling': 71,
'another': 358,
'dimension': 353,
'sight': 131,
'sound': 205,
'mind': 422,
'journey': 76,
...}
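As another aside, Python’s standard library already implements this exact pattern. `collections.Counter` is a dict subclass, so a Counter-based version (shown here as an equivalent, not what I used) could feed the rest of the pipeline unchanged:

from collections import Counter

# Counter does the same insert-or-increment bookkeeping as frequency_ct,
# and adds conveniences like most_common().
counter_freq = Counter(auth_ngram_df['unigrams'].sum())
counter_freq.most_common(3)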
Of course, now that I have my counts, I want to sort the n-grams from most frequent to least frequent. My favorite method to do this? DataFrames.
Unlike the previous `convert_dict_df` function, this one needs to be more flexible. It has to handle the authentic 1960s corpus, all four of the modern corpora, and whichever n-grams I happen to be running. Adding a couple of variables to handle column naming and a `sort_values` call takes care of it.
The `corpus_name` variable in particular is important later in the analysis. I’ll need to compare the authentic corpus, which was written in the 1960s about the 1960s, to each of the corpora written in the 21st century about the 1960s. With the flow I’ve established, I’ll need to merge dataframes to complete the analysis. This is most easily accomplished when the non-join-on columns have different names.
Example: if I join two dataframes that both have column names `['unigram', 'frequency']`, I’ll end up with a single dataframe with the column names `['unigram', 'frequency_x', 'frequency_y']`. I find these `_x` and `_y` suffixes less than informative and prefer to name my columns explicitly.
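To make that concrete, here is a minimal sketch with made-up frames and numbers; `merge` also accepts a `suffixes` argument as an alternative to renaming columns up front:

import pandas as pd

left = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [373, 379]})
right = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [2272, 2199]})

# Default suffixes are the uninformative '_x' and '_y'.
left.merge(right, on='unigram').columns.tolist()
# ['unigram', 'frequency_x', 'frequency_y']

# Explicit suffixes are the built-in alternative to renaming beforehand.
left.merge(right, on='unigram', suffixes=('_test', '_authentic')).columns.tolist()
# ['unigram', 'frequency_test', 'frequency_authentic']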
def dict_to_df(freq_dict, gram_name, corpus_name):
    if not (isinstance(gram_name, str) and isinstance(corpus_name, str)):
        print('gram_name and corpus_name variables must be strings')
    freq_colname = corpus_name + '_frequency'
    df = (pd.DataFrame.from_dict(freq_dict, orient='index')
            .reset_index()
            .rename(columns={'index': gram_name, 0: freq_colname})
            .sort_values(freq_colname, ascending=False))
    return df
But why stop my function at just the frequency? I also need normalized frequencies. Normalized frequencies level the playing field when comparing corpora of different sizes: with raw counts, a larger corpus will have n-grams with larger counts simply because it contains more words overall, which doesn’t reflect any meaningful difference. Dividing each count by the corpus total gives a comparable proportion; for example, 100 occurrences in a 10,000-word corpus and 1,000 occurrences in a 100,000-word corpus both normalize to 0.01.
Also, the homework problem requires ratios of the normalized frequencies later in the analysis.
def normalized_freq(freq_df, corpus_name):
    freq_col_name = corpus_name + '_frequency'
    norm_col_name = corpus_name + '_norm_freq'
    # Divide each count by the corpus total so differently sized
    # corpora can be compared.
    total_ct = freq_df[freq_col_name].sum()
    freq_df[norm_col_name] = freq_df[freq_col_name] / total_ct
    return freq_df
def create_frequencies(ngram_list, gram_name, corpus_name):
    freq_dict = frequency_ct(ngram_list)
    freq_df = dict_to_df(freq_dict, gram_name, corpus_name)
    freq_df = normalized_freq(freq_df, corpus_name)
    return freq_df
auth_freq_df = create_frequencies(auth_ngram_df['unigrams'].sum(), 'unigram', 'authentic')
auth_freq_df.head()
unigram | authentic_frequency | authentic_norm_freq | |
---|---|---|---|
206 | well | 2272 | 0.012132 |
25 | dont | 2199 | 0.011742 |
175 | im | 1988 | 0.010616 |
26 | know | 1777 | 0.009489 |
19 | mr | 1604 | 0.008565 |
test_file_path = os.path.join(os.getcwd(), 'data', '21st-century')
raw_test_dict = load_files_to_dict(test_file_path, {})
test_ngram_dict = {}
for script_group in list(raw_test_dict.keys()):
    test_ngram_dict[script_group] = create_ngram_df(raw_test_dict[script_group], 'corpus')
test_freq_dict = {}
for script_group in list(test_ngram_dict.keys()):
    test_freq_dict[script_group] = create_frequencies(test_ngram_dict[script_group]['unigrams'].sum(), 'unigram', script_group)
test_freq_dict['Pan_Am'].head()
unigram | Pan_Am_frequency | Pan_Am_norm_freq | |
---|---|---|---|
67 | im | 489 | 0.015189 |
114 | oh | 407 | 0.012642 |
11 | dont | 379 | 0.011772 |
50 | well | 373 | 0.011586 |
119 | know | 323 | 0.010033 |
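(As an aside, the two loops above could be collapsed into dict comprehensions; this is purely a style choice, equivalent to what I ran:)

# Equivalent to the two explicit loops above.
test_ngram_dict = {group: create_ngram_df(raw_test_dict[group], 'corpus')
                   for group in raw_test_dict}
test_freq_dict = {group: create_frequencies(test_ngram_dict[group]['unigrams'].sum(),
                                            'unigram', group)
                  for group in test_ngram_dict}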
Compare corpora
The last piece of this homework challenge is to compare the authentic corpus (written about the 1960s and penned in the 1960s) to the four test corpora (written about the 1960s but not penned until the 21st century).
To compare anything to anything, I first need to combine the dataframes holding my test corpora with the authentic corpus. I decided to do this by merging the values for the authentic data into each of the dataframes holding the values for the test data.
compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    compare_dict[script_group] = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)
compare_dict['Pan_Am'].head()
unigram | Pan_Am_frequency | Pan_Am_norm_freq | authentic_frequency | authentic_norm_freq | |
---|---|---|---|---|---|
0 | im | 489.0 | 0.015189 | 1988.0 | 0.010616 |
1 | oh | 407.0 | 0.012642 | 1580.0 | 0.008437 |
2 | dont | 379.0 | 0.011772 | 2199.0 | 0.011742 |
3 | well | 373.0 | 0.011586 | 2272.0 | 0.012132 |
4 | know | 323.0 | 0.010033 | 1777.0 | 0.009489 |
The equation I implemented in the previous solution to this homework was:
df['norm_freq_ratio'] = df.loc[(df['imitation_norm_freq'] != 0
) & (df['authentic_norm_freq'] != 0), 'imitation_norm_freq'
]/df.loc[(df['imitation_norm_freq'] != 0
) & (df['authentic_norm_freq'] != 0), 'authentic_norm_freq']
In order to implement this across the various dataframes, I need a way to identify the appropriate columns regardless of which dataframe I’m working with. This can be done by looking for ‘norm_freq’ in the column names, which pulls out the normalized frequency columns for both the authentic and test data.
[compare_dict['Pan_Am'].columns[compare_dict['Pan_Am'].columns.str.contains('norm_freq')]]
[Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')]
Referencing the dataframe by the dictionary and script group name is getting rather tedious, so I’ll assign the dataframe I’m working with to a short variable instead. This has a much cleaner appearance and, more importantly, is easier to read. Regardless of how good (or not) code is, it’s far more common to read code in order to improve, maintain, update, or repair it than to write it. My philosophy is to make code as easy to read as possible, so that my future self can decipher what I was thinking the first time around.
test = compare_dict['Pan_Am']
test_cols = test.columns[test.columns.str.contains('norm_freq')]
test_cols
Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')
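pandas also ships a shortcut for this kind of substring lookup: `DataFrame.filter` with the `like` argument selects matching columns directly. An equivalent form (not what I used, just worth knowing):

# Same column selection via DataFrame.filter.
test.filter(like='norm_freq').columns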
Now I can update my code to the more readable version. Since I use the test dataframe as the left object and the authentic dataframe as the right object in the join, I can count on the test and authentic columns always being in the same order.
As an added bonus, I only have to write to the dictionary once, instead of an initial write followed by an update with the new columns.
compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    df = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)
    freq_cols = df.columns[df.columns.str.contains('norm_freq')]
    # Compute the ratio only where both corpora actually contain the
    # unigram; all other rows are left as NaN.
    mask = (df[freq_cols[0]] != 0) & (df[freq_cols[1]] != 0)
    df['norm_freq_ratio'] = df.loc[mask, freq_cols[0]] / df.loc[mask, freq_cols[1]]
    compare_dict[script_group] = df
compare_dict['Pan_Am'].head()
unigram | Pan_Am_frequency | Pan_Am_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
0 | im | 489.0 | 0.015189 | 1988.0 | 0.010616 | 1.430801 |
1 | oh | 407.0 | 0.012642 | 1580.0 | 0.008437 | 1.498387 |
2 | dont | 379.0 | 0.011772 | 2199.0 | 0.011742 | 1.002538 |
3 | well | 373.0 | 0.011586 | 2272.0 | 0.012132 | 0.954965 |
4 | know | 323.0 | 0.010033 | 1777.0 | 0.009489 | 1.057309 |
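For what it’s worth, the ratio line inside the loop above has a vectorized variation: replacing zeros with NaN before dividing yields the same result, since NaN propagates through division. A sketch, reusing `df` and `freq_cols` from the loop:

import numpy as np

# Zeros become NaN, and NaN propagates through the division, so unigrams
# missing from either corpus end up NaN, exactly as with the mask version.
df['norm_freq_ratio'] = (df[freq_cols[0]].replace(0, np.nan)
                         / df[freq_cols[1]].replace(0, np.nan))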
High Ratios
High normalized frequency ratios show unigrams that were used commonly in the 21st-century scripts but were extremely rare (though present) in the 1960s scripts.
for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50))
    print('\n')
Pan_Am
unigram | Pan_Am_frequency | Pan_Am_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
51 | dean | 87.0 | 0.002702 | 1.0 | 0.000005 | 506.064637 |
18 | pan | 160.0 | 0.004970 | 3.0 | 0.000016 | 310.231195 |
162 | amanda | 32.0 | 0.000994 | 1.0 | 0.000005 | 186.138717 |
89 | stewardess | 54.0 | 0.001677 | 2.0 | 0.000011 | 157.054543 |
197 | teddy | 27.0 | 0.000839 | 1.0 | 0.000005 | 157.054543 |
281 | stewardesses | 19.0 | 0.000590 | 1.0 | 0.000005 | 110.519863 |
364 | ryan | 15.0 | 0.000466 | 1.0 | 0.000005 | 87.252524 |
456 | cia | 13.0 | 0.000404 | 1.0 | 0.000005 | 75.618854 |
452 | ich | 13.0 | 0.000404 | 1.0 | 0.000005 | 75.618854 |
491 | monte | 12.0 | 0.000373 | 1.0 | 0.000005 | 69.802019 |
483 | omar | 12.0 | 0.000373 | 1.0 | 0.000005 | 69.802019 |
84 | ii | 59.0 | 0.001833 | 5.0 | 0.000027 | 68.638652 |
548 | carlo | 10.0 | 0.000311 | 1.0 | 0.000005 | 58.168349 |
569 | monsieur | 10.0 | 0.000311 | 1.0 | 0.000005 | 58.168349 |
37 | maggie | 108.0 | 0.003355 | 11.0 | 0.000059 | 57.110743 |
603 | le | 9.0 | 0.000280 | 1.0 | 0.000005 | 52.351514 |
608 | hier | 9.0 | 0.000280 | 1.0 | 0.000005 | 52.351514 |
625 | soviets | 9.0 | 0.000280 | 1.0 | 0.000005 | 52.351514 |
634 | lauras | 9.0 | 0.000280 | 1.0 | 0.000005 | 52.351514 |
641 | courier | 9.0 | 0.000280 | 1.0 | 0.000005 | 52.351514 |
713 | maggies | 8.0 | 0.000248 | 1.0 | 0.000005 | 46.534679 |
349 | rio | 16.0 | 0.000497 | 2.0 | 0.000011 | 46.534679 |
734 | moscow | 8.0 | 0.000248 | 1.0 | 0.000005 | 46.534679 |
660 | zu | 8.0 | 0.000248 | 1.0 | 0.000005 | 46.534679 |
81 | ted | 59.0 | 0.001833 | 8.0 | 0.000043 | 42.899157 |
259 | greg | 21.0 | 0.000652 | 3.0 | 0.000016 | 40.717844 |
764 | cockpit | 7.0 | 0.000217 | 1.0 | 0.000005 | 40.717844 |
396 | magazine | 14.0 | 0.000435 | 2.0 | 0.000011 | 40.717844 |
770 | casino | 7.0 | 0.000217 | 1.0 | 0.000005 | 40.717844 |
818 | graham | 7.0 | 0.000217 | 1.0 | 0.000005 | 40.717844 |
449 | previously | 13.0 | 0.000404 | 2.0 | 0.000011 | 37.809427 |
925 | tasty | 6.0 | 0.000186 | 1.0 | 0.000005 | 34.901009 |
855 | diplomatic | 6.0 | 0.000186 | 1.0 | 0.000005 | 34.901009 |
883 | palace | 6.0 | 0.000186 | 1.0 | 0.000005 | 34.901009 |
877 | cleared | 6.0 | 0.000186 | 1.0 | 0.000005 | 34.901009 |
177 | rome | 30.0 | 0.000932 | 5.0 | 0.000027 | 34.901009 |
1004 | choosing | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
1041 | pudding | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
1015 | guessing | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
1112 | khrushchev | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
587 | runway | 10.0 | 0.000311 | 2.0 | 0.000011 | 29.084175 |
987 | safely | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
988 | economy | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
993 | ugh | 5.0 | 0.000155 | 1.0 | 0.000005 | 29.084175 |
113 | london | 44.0 | 0.001367 | 9.0 | 0.000048 | 28.437860 |
291 | anderson | 19.0 | 0.000590 | 4.0 | 0.000021 | 27.629966 |
105 | mm | 46.0 | 0.001429 | 11.0 | 0.000059 | 24.324946 |
468 | cargo | 12.0 | 0.000373 | 3.0 | 0.000016 | 23.267340 |
1371 | fairly | 4.0 | 0.000124 | 1.0 | 0.000005 | 23.267340 |
1336 | 32 | 4.0 | 0.000124 | 1.0 | 0.000005 | 23.267340 |
Mad_Men
unigram | Mad_Men_frequency | Mad_Men_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
138 | sterling | 170.0 | 0.001166 | 2.0 | 0.000011 | 109.151410 |
172 | sally | 143.0 | 0.000981 | 2.0 | 0.000011 | 91.815598 |
54 | draper | 365.0 | 0.002503 | 6.0 | 0.000032 | 78.118166 |
238 | jesus | 108.0 | 0.000741 | 2.0 | 0.000011 | 69.343249 |
553 | francis | 42.0 | 0.000288 | 1.0 | 0.000005 | 53.933638 |
317 | clients | 74.0 | 0.000507 | 2.0 | 0.000011 | 47.512967 |
187 | joan | 134.0 | 0.000919 | 4.0 | 0.000021 | 43.018497 |
195 | betty | 128.0 | 0.000878 | 4.0 | 0.000021 | 41.092295 |
435 | jimmy | 55.0 | 0.000377 | 2.0 | 0.000011 | 35.313691 |
457 | ken | 52.0 | 0.000357 | 2.0 | 0.000011 | 33.387490 |
843 | crap | 26.0 | 0.000178 | 1.0 | 0.000005 | 33.387490 |
931 | presentation | 23.0 | 0.000158 | 1.0 | 0.000005 | 29.535087 |
905 | freddy | 23.0 | 0.000158 | 1.0 | 0.000005 | 29.535087 |
354 | bobby | 67.0 | 0.000459 | 3.0 | 0.000016 | 28.678998 |
358 | creative | 66.0 | 0.000453 | 3.0 | 0.000016 | 28.250953 |
942 | holloway | 22.0 | 0.000151 | 1.0 | 0.000005 | 28.250953 |
937 | clara | 22.0 | 0.000151 | 1.0 | 0.000005 | 28.250953 |
1033 | fatherinlaw | 20.0 | 0.000137 | 1.0 | 0.000005 | 25.682685 |
994 | spectacular | 20.0 | 0.000137 | 1.0 | 0.000005 | 25.682685 |
1035 | belle | 20.0 | 0.000137 | 1.0 | 0.000005 | 25.682685 |
1069 | joey | 19.0 | 0.000130 | 1.0 | 0.000005 | 24.398550 |
1051 | whitman | 19.0 | 0.000130 | 1.0 | 0.000005 | 24.398550 |
1050 | connie | 19.0 | 0.000130 | 1.0 | 0.000005 | 24.398550 |
1142 | delicious | 18.0 | 0.000123 | 1.0 | 0.000005 | 23.114416 |
1120 | jewish | 18.0 | 0.000123 | 1.0 | 0.000005 | 23.114416 |
451 | dick | 53.0 | 0.000363 | 3.0 | 0.000016 | 22.686371 |
653 | airlines | 35.0 | 0.000240 | 2.0 | 0.000011 | 22.472349 |
691 | partners | 33.0 | 0.000226 | 2.0 | 0.000011 | 21.188215 |
1196 | dallas | 16.0 | 0.000110 | 1.0 | 0.000005 | 20.546148 |
1217 | strategy | 16.0 | 0.000110 | 1.0 | 0.000005 | 20.546148 |
1216 | award | 16.0 | 0.000110 | 1.0 | 0.000005 | 20.546148 |
1270 | reception | 15.0 | 0.000103 | 1.0 | 0.000005 | 19.262013 |
1331 | danny | 15.0 | 0.000103 | 1.0 | 0.000005 | 19.262013 |
1285 | episode | 15.0 | 0.000103 | 1.0 | 0.000005 | 19.262013 |
1303 | casting | 15.0 | 0.000103 | 1.0 | 0.000005 | 19.262013 |
755 | previously | 29.0 | 0.000199 | 2.0 | 0.000011 | 18.619946 |
1370 | suitcase | 14.0 | 0.000096 | 1.0 | 0.000005 | 17.977879 |
1371 | cancel | 14.0 | 0.000096 | 1.0 | 0.000005 | 17.977879 |
1358 | grey | 14.0 | 0.000096 | 1.0 | 0.000005 | 17.977879 |
1432 | bowl | 13.0 | 0.000089 | 1.0 | 0.000005 | 16.693745 |
364 | duck | 64.0 | 0.000439 | 5.0 | 0.000027 | 16.436918 |
279 | lane | 89.0 | 0.000610 | 7.0 | 0.000037 | 16.326850 |
635 | hare | 37.0 | 0.000254 | 3.0 | 0.000016 | 15.837656 |
630 | beans | 37.0 | 0.000254 | 3.0 | 0.000016 | 15.837656 |
636 | greg | 37.0 | 0.000254 | 3.0 | 0.000016 | 15.837656 |
482 | campaign | 49.0 | 0.000336 | 4.0 | 0.000021 | 15.730644 |
1536 | joyce | 12.0 | 0.000082 | 1.0 | 0.000005 | 15.409611 |
1572 | chemical | 12.0 | 0.000082 | 1.0 | 0.000005 | 15.409611 |
1584 | salad | 12.0 | 0.000082 | 1.0 | 0.000005 | 15.409611 |
1600 | mens | 12.0 | 0.000082 | 1.0 | 0.000005 | 15.409611 |
X-Men_First_Class
unigram | X-Men_First_Class_frequency | X-Men_First_Class_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
79 | cia | 10.0 | 0.002281 | 1.0 | 0.000005 | 427.076397 |
171 | commands | 5.0 | 0.001140 | 1.0 | 0.000005 | 213.538198 |
103 | cuba | 9.0 | 0.002052 | 2.0 | 0.000011 | 192.184379 |
210 | sebastian | 4.0 | 0.000912 | 1.0 | 0.000005 | 170.830559 |
211 | shaws | 4.0 | 0.000912 | 1.0 | 0.000005 | 170.830559 |
364 | x | 3.0 | 0.000684 | 1.0 | 0.000005 | 128.122919 |
326 | presentation | 3.0 | 0.000684 | 1.0 | 0.000005 | 128.122919 |
284 | moscow | 3.0 | 0.000684 | 1.0 | 0.000005 | 128.122919 |
264 | threat | 3.0 | 0.000684 | 1.0 | 0.000005 | 128.122919 |
370 | homo | 3.0 | 0.000684 | 1.0 | 0.000005 | 128.122919 |
500 | jekyll | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
396 | groovy | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
540 | atom | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
417 | arrangement | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
511 | cola | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
434 | facility | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
577 | delicious | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
415 | florida | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
624 | formal | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
623 | dusseldorf | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
216 | argentina | 4.0 | 0.000912 | 2.0 | 0.000011 | 85.415279 |
595 | absorb | 2.0 | 0.000456 | 1.0 | 0.000005 | 85.415279 |
118 | russians | 7.0 | 0.001596 | 4.0 | 0.000021 | 74.738369 |
115 | turkey | 8.0 | 0.001824 | 5.0 | 0.000027 | 68.332223 |
116 | russia | 8.0 | 0.001824 | 5.0 | 0.000027 | 68.332223 |
72 | missiles | 11.0 | 0.002509 | 7.0 | 0.000037 | 67.112005 |
18 | hank | 23.0 | 0.005245 | 15.0 | 0.000080 | 65.485048 |
335 | jesus | 3.0 | 0.000684 | 2.0 | 0.000011 | 64.061460 |
380 | usa | 3.0 | 0.000684 | 2.0 | 0.000011 | 64.061460 |
296 | reconsider | 3.0 | 0.000684 | 2.0 | 0.000011 | 64.061460 |
81 | wow | 10.0 | 0.002281 | 7.0 | 0.000037 | 61.010914 |
230 | destination | 4.0 | 0.000912 | 3.0 | 0.000016 | 56.943520 |
164 | beast | 5.0 | 0.001140 | 4.0 | 0.000021 | 53.384550 |
136 | soviet | 6.0 | 0.001368 | 5.0 | 0.000027 | 51.249168 |
368 | rockets | 3.0 | 0.000684 | 3.0 | 0.000016 | 42.707640 |
13 | charles | 26.0 | 0.005929 | 26.0 | 0.000139 | 42.707640 |
1117 | spectacular | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
1106 | oneway | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
1088 | expectations | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
525 | backup | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
529 | nazis | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
532 | gates | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
1124 | currently | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
575 | senior | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
1065 | hoohoo | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
588 | freaks | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
573 | scratch | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
407 | serum | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
491 | mutated | 2.0 | 0.000456 | 2.0 | 0.000011 | 42.707640 |
1140 | colleges | 1.0 | 0.000228 | 1.0 | 0.000005 | 42.707640 |
The_Kennedys
unigram | The_Kennedys_frequency | The_Kennedys_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
16 | bobby | 112.0 | 0.006192 | 3.0 | 0.000016 | 386.549750 |
86 | khrushchev | 30.0 | 0.001659 | 1.0 | 0.000005 | 310.620335 |
103 | sighs | 25.0 | 0.001382 | 1.0 | 0.000005 | 258.850279 |
165 | rosemary | 18.0 | 0.000995 | 1.0 | 0.000005 | 186.372201 |
12 | kennedy | 128.0 | 0.007077 | 9.0 | 0.000048 | 147.257048 |
101 | cuba | 25.0 | 0.001382 | 2.0 | 0.000011 | 129.425140 |
37 | ii | 60.0 | 0.003317 | 5.0 | 0.000027 | 124.248134 |
298 | election | 11.0 | 0.000608 | 1.0 | 0.000005 | 113.894123 |
163 | ethel | 18.0 | 0.000995 | 2.0 | 0.000011 | 93.186101 |
399 | elected | 8.0 | 0.000442 | 1.0 | 0.000005 | 82.832089 |
379 | dallas | 8.0 | 0.000442 | 1.0 | 0.000005 | 82.832089 |
394 | mississippi | 8.0 | 0.000442 | 1.0 | 0.000005 | 82.832089 |
418 | cabinet | 8.0 | 0.000442 | 1.0 | 0.000005 | 82.832089 |
110 | senator | 23.0 | 0.001272 | 3.0 | 0.000016 | 79.380752 |
460 | organized | 7.0 | 0.000387 | 1.0 | 0.000005 | 72.478078 |
214 | christ | 14.0 | 0.000774 | 2.0 | 0.000011 | 72.478078 |
127 | meredith | 21.0 | 0.001161 | 3.0 | 0.000016 | 72.478078 |
492 | cia | 7.0 | 0.000387 | 1.0 | 0.000005 | 72.478078 |
243 | lyndon | 13.0 | 0.000719 | 2.0 | 0.000011 | 67.301073 |
583 | bastard | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
559 | bases | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
525 | francis | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
556 | option | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
590 | jolly | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
616 | dean | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
607 | mcnamara | 6.0 | 0.000332 | 1.0 | 0.000005 | 62.124067 |
109 | campaign | 23.0 | 0.001272 | 4.0 | 0.000021 | 59.535564 |
294 | campus | 11.0 | 0.000608 | 2.0 | 0.000011 | 56.947061 |
725 | sites | 5.0 | 0.000276 | 1.0 | 0.000005 | 51.770056 |
689 | bundy | 5.0 | 0.000276 | 1.0 | 0.000005 | 51.770056 |
872 | regime | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
762 | threat | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
794 | perception | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
802 | largely | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
894 | disaster | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
775 | grunts | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
773 | handing | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
763 | humiliating | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
747 | itsits | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
836 | defeat | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
903 | diplomatic | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
740 | subs | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
857 | cancel | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
885 | grasp | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
798 | lodge | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
796 | operational | 4.0 | 0.000221 | 1.0 | 0.000005 | 41.416045 |
467 | administration | 7.0 | 0.000387 | 2.0 | 0.000011 | 36.239039 |
439 | roosevelt | 7.0 | 0.000387 | 2.0 | 0.000011 | 36.239039 |
508 | interview | 7.0 | 0.000387 | 2.0 | 0.000011 | 36.239039 |
445 | jimmy | 7.0 | 0.000387 | 2.0 | 0.000011 | 36.239039 |
# DataFrame.append was removed in pandas 2.0, so collect the rows in a
# list and build the frame once.
rows = []
for script_group in compare_dict.keys():
    rows.append({'script': script_group,
                 'score': compare_dict[script_group].sort_values(
                     'norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio'].sum()})
high_score_results = pd.DataFrame(rows)
display(high_score_results.sort_values('score'))
print('Best performing corpus (lowest score) {}'.format(high_score_results.loc[high_score_results['score'].idxmin(), 'script']))
print('Worst performing corpus (highest score) {}'.format(high_score_results.loc[high_score_results['score'].idxmax(), 'script']))
script | score | |
---|---|---|
1 | Mad_Men | 1456.975643 |
0 | Pan_Am | 3336.811190 |
3 | The_Kennedys | 3980.829683 |
2 | X-Men_First_Class | 4282.152672 |
Best performing corpus (lowest score) Mad_Men
Worst performing corpus (highest score) X-Men_First_Class
Low Ratios
Low normalized frequency ratios show unigrams that were used commonly in the 1960s scripts but were rare in the 21st-century scripts.
for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio').head(50))
    print('\n')
Pan_Am
unigram | Pan_Am_frequency | Pan_Am_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
5337 | honey | 1.0 | 0.000031 | 152.0 | 0.000812 | 0.038269 |
3797 | imagination | 1.0 | 0.000031 | 138.0 | 0.000737 | 0.042151 |
3761 | ship | 1.0 | 0.000031 | 137.0 | 0.000732 | 0.042459 |
3260 | human | 1.0 | 0.000031 | 101.0 | 0.000539 | 0.057592 |
3247 | major | 1.0 | 0.000031 | 76.0 | 0.000406 | 0.076537 |
4218 | machine | 1.0 | 0.000031 | 75.0 | 0.000400 | 0.077558 |
3352 | radio | 1.0 | 0.000031 | 74.0 | 0.000395 | 0.078606 |
3118 | jerry | 1.0 | 0.000031 | 69.0 | 0.000368 | 0.084302 |
4146 | shadow | 1.0 | 0.000031 | 68.0 | 0.000363 | 0.085542 |
3089 | martin | 1.0 | 0.000031 | 66.0 | 0.000352 | 0.088134 |
3510 | jackie | 1.0 | 0.000031 | 65.0 | 0.000347 | 0.089490 |
5794 | television | 1.0 | 0.000031 | 60.0 | 0.000320 | 0.096947 |
5287 | gun | 1.0 | 0.000031 | 59.0 | 0.000315 | 0.098590 |
4002 | floor | 1.0 | 0.000031 | 55.0 | 0.000294 | 0.105761 |
3930 | aunt | 1.0 | 0.000031 | 54.0 | 0.000288 | 0.107719 |
1227 | sound | 4.0 | 0.000124 | 205.0 | 0.001095 | 0.113499 |
2584 | whose | 2.0 | 0.000062 | 100.0 | 0.000534 | 0.116337 |
5824 | devil | 1.0 | 0.000031 | 49.0 | 0.000262 | 0.118711 |
1297 | earth | 4.0 | 0.000124 | 194.0 | 0.001036 | 0.119935 |
2324 | general | 2.0 | 0.000062 | 94.0 | 0.000502 | 0.123762 |
2662 | alan | 2.0 | 0.000062 | 94.0 | 0.000502 | 0.123762 |
4767 | mans | 1.0 | 0.000031 | 46.0 | 0.000246 | 0.126453 |
3009 | rip | 1.0 | 0.000031 | 46.0 | 0.000246 | 0.126453 |
2823 | ought | 2.0 | 0.000062 | 88.0 | 0.000470 | 0.132201 |
4626 | evil | 1.0 | 0.000031 | 44.0 | 0.000235 | 0.132201 |
5683 | scene | 1.0 | 0.000031 | 44.0 | 0.000235 | 0.132201 |
2469 | kids | 2.0 | 0.000062 | 85.0 | 0.000454 | 0.136867 |
5001 | rid | 1.0 | 0.000031 | 42.0 | 0.000224 | 0.138496 |
2146 | kid | 2.0 | 0.000062 | 83.0 | 0.000443 | 0.140165 |
2976 | account | 1.0 | 0.000031 | 41.0 | 0.000219 | 0.141874 |
3576 | peter | 1.0 | 0.000031 | 41.0 | 0.000219 | 0.141874 |
2942 | agnes | 1.0 | 0.000031 | 39.0 | 0.000208 | 0.149150 |
4801 | heaven | 1.0 | 0.000031 | 39.0 | 0.000208 | 0.149150 |
4363 | destroy | 1.0 | 0.000031 | 37.0 | 0.000198 | 0.157212 |
2922 | steel | 1.0 | 0.000031 | 37.0 | 0.000198 | 0.157212 |
2321 | dog | 2.0 | 0.000062 | 72.0 | 0.000384 | 0.161579 |
4102 | eh | 1.0 | 0.000031 | 36.0 | 0.000192 | 0.161579 |
1911 | ideas | 2.0 | 0.000062 | 71.0 | 0.000379 | 0.163855 |
5318 | team | 1.0 | 0.000031 | 35.0 | 0.000187 | 0.166195 |
3242 | broken | 1.0 | 0.000031 | 33.0 | 0.000176 | 0.176268 |
3395 | 100 | 1.0 | 0.000031 | 33.0 | 0.000176 | 0.176268 |
4813 | fellas | 1.0 | 0.000031 | 32.0 | 0.000171 | 0.181776 |
4513 | harmon | 1.0 | 0.000031 | 32.0 | 0.000171 | 0.181776 |
1692 | alive | 3.0 | 0.000093 | 93.0 | 0.000497 | 0.187640 |
1658 | indeed | 3.0 | 0.000093 | 93.0 | 0.000497 | 0.187640 |
1027 | town | 5.0 | 0.000155 | 155.0 | 0.000828 | 0.187640 |
5646 | explanation | 1.0 | 0.000031 | 31.0 | 0.000166 | 0.187640 |
3003 | fellow | 1.0 | 0.000031 | 31.0 | 0.000166 | 0.187640 |
317 | old | 17.0 | 0.000528 | 526.0 | 0.002809 | 0.187997 |
2147 | crossed | 2.0 | 0.000062 | 61.0 | 0.000326 | 0.190716 |
Mad_Men
unigram | Mad_Men_frequency | Mad_Men_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
3985 | twilight | 3.0 | 0.000021 | 499.0 | 0.002665 | 0.007720 |
3764 | zone | 4.0 | 0.000027 | 506.0 | 0.002702 | 0.010151 |
12149 | doc | 1.0 | 0.000007 | 57.0 | 0.000304 | 0.022529 |
3302 | captain | 4.0 | 0.000027 | 208.0 | 0.001111 | 0.024695 |
10348 | commander | 1.0 | 0.000007 | 52.0 | 0.000278 | 0.024695 |
9954 | emma | 1.0 | 0.000007 | 48.0 | 0.000256 | 0.026753 |
10925 | ace | 1.0 | 0.000007 | 47.0 | 0.000251 | 0.027322 |
10578 | schmidt | 1.0 | 0.000007 | 46.0 | 0.000246 | 0.027916 |
7970 | base | 1.0 | 0.000007 | 45.0 | 0.000240 | 0.028536 |
4642 | sight | 3.0 | 0.000021 | 131.0 | 0.000700 | 0.029408 |
12626 | precisely | 1.0 | 0.000007 | 42.0 | 0.000224 | 0.030575 |
7376 | destroy | 1.0 | 0.000007 | 37.0 | 0.000198 | 0.034706 |
11629 | sergeant | 1.0 | 0.000007 | 37.0 | 0.000198 | 0.034706 |
5882 | traveling | 2.0 | 0.000014 | 71.0 | 0.000379 | 0.036173 |
7451 | access | 1.0 | 0.000007 | 34.0 | 0.000182 | 0.037769 |
10221 | julius | 1.0 | 0.000007 | 33.0 | 0.000176 | 0.038913 |
9446 | jess | 1.0 | 0.000007 | 33.0 | 0.000176 | 0.038913 |
5213 | substance | 2.0 | 0.000014 | 63.0 | 0.000336 | 0.040766 |
4142 | colonel | 3.0 | 0.000021 | 92.0 | 0.000491 | 0.041874 |
10549 | radar | 1.0 | 0.000007 | 29.0 | 0.000155 | 0.044280 |
3890 | magic | 3.0 | 0.000021 | 85.0 | 0.000454 | 0.045322 |
3835 | mister | 3.0 | 0.000021 | 85.0 | 0.000454 | 0.045322 |
9702 | alex | 1.0 | 0.000007 | 28.0 | 0.000150 | 0.045862 |
9698 | grant | 1.0 | 0.000007 | 28.0 | 0.000150 | 0.045862 |
7923 | driscoll | 1.0 | 0.000007 | 27.0 | 0.000144 | 0.047561 |
6934 | illusion | 1.0 | 0.000007 | 27.0 | 0.000144 | 0.047561 |
11946 | witch | 1.0 | 0.000007 | 27.0 | 0.000144 | 0.047561 |
8747 | reckon | 1.0 | 0.000007 | 26.0 | 0.000139 | 0.049390 |
2545 | amen | 6.0 | 0.000041 | 152.0 | 0.000812 | 0.050690 |
6240 | stations | 2.0 | 0.000014 | 49.0 | 0.000262 | 0.052414 |
4150 | jerry | 3.0 | 0.000021 | 69.0 | 0.000368 | 0.055832 |
11773 | horn | 1.0 | 0.000007 | 23.0 | 0.000123 | 0.055832 |
12777 | christie | 1.0 | 0.000007 | 23.0 | 0.000123 | 0.055832 |
8236 | 33 | 1.0 | 0.000007 | 23.0 | 0.000123 | 0.055832 |
10139 | toward | 1.0 | 0.000007 | 23.0 | 0.000123 | 0.055832 |
12232 | shortly | 1.0 | 0.000007 | 23.0 | 0.000123 | 0.055832 |
12410 | engines | 1.0 | 0.000007 | 22.0 | 0.000117 | 0.058370 |
9830 | repeat | 1.0 | 0.000007 | 22.0 | 0.000117 | 0.058370 |
8212 | barney | 1.0 | 0.000007 | 22.0 | 0.000117 | 0.058370 |
7255 | main | 1.0 | 0.000007 | 21.0 | 0.000112 | 0.061149 |
9793 | item | 1.0 | 0.000007 | 21.0 | 0.000112 | 0.061149 |
11043 | function | 1.0 | 0.000007 | 21.0 | 0.000112 | 0.061149 |
7471 | properly | 1.0 | 0.000007 | 21.0 | 0.000112 | 0.061149 |
10641 | tonights | 1.0 | 0.000007 | 21.0 | 0.000112 | 0.061149 |
10496 | jenny | 1.0 | 0.000007 | 19.0 | 0.000101 | 0.067586 |
10444 | wings | 1.0 | 0.000007 | 19.0 | 0.000101 | 0.067586 |
9436 | ross | 1.0 | 0.000007 | 19.0 | 0.000101 | 0.067586 |
10725 | degree | 1.0 | 0.000007 | 19.0 | 0.000101 | 0.067586 |
9437 | reverend | 1.0 | 0.000007 | 19.0 | 0.000101 | 0.067586 |
6728 | doll | 2.0 | 0.000014 | 37.0 | 0.000198 | 0.069413 |
X-Men_First_Class
unigram | X-Men_First_Class_frequency | X-Men_First_Class_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
212 | mr | 4.0 | 0.000912 | 1604.0 | 0.008565 | 0.106503 |
1516 | boy | 1.0 | 0.000228 | 311.0 | 0.001661 | 0.137324 |
1532 | away | 1.0 | 0.000228 | 305.0 | 0.001629 | 0.140025 |
874 | hear | 1.0 | 0.000228 | 297.0 | 0.001586 | 0.143797 |
782 | long | 1.0 | 0.000228 | 293.0 | 0.001565 | 0.145760 |
405 | old | 2.0 | 0.000456 | 526.0 | 0.002809 | 0.162386 |
1132 | minute | 1.0 | 0.000228 | 217.0 | 0.001159 | 0.196809 |
986 | captain | 1.0 | 0.000228 | 208.0 | 0.001111 | 0.205325 |
1374 | room | 1.0 | 0.000228 | 201.0 | 0.001073 | 0.212476 |
479 | night | 2.0 | 0.000456 | 394.0 | 0.002104 | 0.216790 |
1401 | mrs | 1.0 | 0.000228 | 195.0 | 0.001041 | 0.219014 |
1215 | doctor | 1.0 | 0.000228 | 182.0 | 0.000972 | 0.234657 |
1109 | sit | 1.0 | 0.000228 | 182.0 | 0.000972 | 0.234657 |
677 | dead | 1.0 | 0.000228 | 181.0 | 0.000967 | 0.235954 |
1500 | land | 1.0 | 0.000228 | 180.0 | 0.000961 | 0.237265 |
94 | oh | 9.0 | 0.002052 | 1580.0 | 0.008437 | 0.243271 |
673 | quite | 1.0 | 0.000228 | 171.0 | 0.000913 | 0.249752 |
1178 | girl | 1.0 | 0.000228 | 171.0 | 0.000913 | 0.249752 |
1001 | trying | 1.0 | 0.000228 | 169.0 | 0.000902 | 0.252708 |
376 | mean | 3.0 | 0.000684 | 502.0 | 0.002681 | 0.255225 |
635 | dear | 1.0 | 0.000228 | 166.0 | 0.000886 | 0.257275 |
1084 | heard | 1.0 | 0.000228 | 165.0 | 0.000881 | 0.258834 |
859 | ago | 1.0 | 0.000228 | 162.0 | 0.000865 | 0.263627 |
756 | fine | 1.0 | 0.000228 | 161.0 | 0.000860 | 0.265265 |
1429 | tonight | 1.0 | 0.000228 | 160.0 | 0.000854 | 0.266923 |
1204 | fact | 1.0 | 0.000228 | 155.0 | 0.000828 | 0.275533 |
1361 | town | 1.0 | 0.000228 | 155.0 | 0.000828 | 0.275533 |
1326 | hours | 1.0 | 0.000228 | 153.0 | 0.000817 | 0.279135 |
1363 | honey | 1.0 | 0.000228 | 152.0 | 0.000812 | 0.280971 |
42 | well | 15.0 | 0.003421 | 2272.0 | 0.012132 | 0.281961 |
406 | around | 2.0 | 0.000456 | 300.0 | 0.001602 | 0.284718 |
1469 | car | 1.0 | 0.000228 | 146.0 | 0.000780 | 0.292518 |
527 | understand | 2.0 | 0.000456 | 285.0 | 0.001522 | 0.299703 |
261 | uh | 4.0 | 0.000912 | 569.0 | 0.003038 | 0.300229 |
478 | last | 2.0 | 0.000456 | 282.0 | 0.001506 | 0.302891 |
334 | youve | 3.0 | 0.000684 | 422.0 | 0.002253 | 0.303609 |
1056 | father | 1.0 | 0.000228 | 139.0 | 0.000742 | 0.307249 |
1197 | getting | 1.0 | 0.000228 | 137.0 | 0.000732 | 0.311735 |
797 | four | 1.0 | 0.000228 | 136.0 | 0.000726 | 0.314027 |
418 | told | 2.0 | 0.000456 | 272.0 | 0.001452 | 0.314027 |
1314 | house | 1.0 | 0.000228 | 134.0 | 0.000716 | 0.318714 |
766 | late | 1.0 | 0.000228 | 134.0 | 0.000716 | 0.318714 |
360 | next | 3.0 | 0.000684 | 390.0 | 0.002083 | 0.328520 |
558 | huh | 2.0 | 0.000456 | 256.0 | 0.001367 | 0.333653 |
1240 | game | 1.0 | 0.000228 | 123.0 | 0.000657 | 0.347217 |
791 | five | 1.0 | 0.000228 | 122.0 | 0.000651 | 0.350063 |
381 | morning | 2.0 | 0.000456 | 243.0 | 0.001298 | 0.351503 |
1293 | check | 1.0 | 0.000228 | 118.0 | 0.000630 | 0.361929 |
742 | says | 1.0 | 0.000228 | 117.0 | 0.000625 | 0.365023 |
473 | everything | 2.0 | 0.000456 | 234.0 | 0.001250 | 0.365023 |
The_Kennedys
unigram | The_Kennedys_frequency | The_Kennedys_norm_freq | authentic_frequency | authentic_norm_freq | norm_freq_ratio | |
---|---|---|---|---|---|---|
1774 | zone | 2.0 | 0.000111 | 506.0 | 0.002702 | 0.040925 |
1874 | earth | 1.0 | 0.000055 | 194.0 | 0.001036 | 0.053371 |
2646 | game | 1.0 | 0.000055 | 123.0 | 0.000657 | 0.084179 |
3250 | guess | 1.0 | 0.000055 | 111.0 | 0.000593 | 0.093279 |
3314 | kill | 1.0 | 0.000055 | 106.0 | 0.000566 | 0.097679 |
1328 | sound | 2.0 | 0.000111 | 205.0 | 0.001095 | 0.101015 |
2802 | hot | 1.0 | 0.000055 | 101.0 | 0.000539 | 0.102515 |
2239 | key | 1.0 | 0.000055 | 97.0 | 0.000518 | 0.106742 |
2959 | space | 1.0 | 0.000055 | 90.0 | 0.000481 | 0.115045 |
2352 | ought | 1.0 | 0.000055 | 88.0 | 0.000470 | 0.117659 |
1618 | story | 2.0 | 0.000111 | 174.0 | 0.000929 | 0.119012 |
2584 | hate | 1.0 | 0.000055 | 81.0 | 0.000433 | 0.127827 |
2791 | black | 1.0 | 0.000055 | 77.0 | 0.000411 | 0.134468 |
1581 | ten | 2.0 | 0.000111 | 153.0 | 0.000817 | 0.135347 |
1760 | amen | 2.0 | 0.000111 | 152.0 | 0.000812 | 0.136237 |
2156 | cold | 1.0 | 0.000055 | 76.0 | 0.000406 | 0.136237 |
3020 | darling | 1.0 | 0.000055 | 76.0 | 0.000406 | 0.136237 |
2852 | stuff | 1.0 | 0.000055 | 71.0 | 0.000379 | 0.145831 |
2310 | pool | 1.0 | 0.000055 | 70.0 | 0.000374 | 0.147914 |
3095 | shoot | 1.0 | 0.000055 | 70.0 | 0.000374 | 0.147914 |
1260 | ship | 2.0 | 0.000111 | 137.0 | 0.000732 | 0.151153 |
3414 | book | 1.0 | 0.000055 | 67.0 | 0.000358 | 0.154537 |
3394 | martin | 1.0 | 0.000055 | 66.0 | 0.000352 | 0.156879 |
1471 | death | 2.0 | 0.000111 | 131.0 | 0.000700 | 0.158077 |
2172 | odd | 1.0 | 0.000055 | 63.0 | 0.000336 | 0.164349 |
928 | play | 3.0 | 0.000166 | 187.0 | 0.000999 | 0.166107 |
2229 | wonder | 1.0 | 0.000055 | 61.0 | 0.000326 | 0.169738 |
2543 | boundaries | 1.0 | 0.000055 | 60.0 | 0.000320 | 0.172567 |
1078 | land | 3.0 | 0.000166 | 180.0 | 0.000961 | 0.172567 |
2923 | hit | 1.0 | 0.000055 | 59.0 | 0.000315 | 0.175492 |
3638 | message | 1.0 | 0.000055 | 58.0 | 0.000310 | 0.178517 |
3135 | fair | 1.0 | 0.000055 | 58.0 | 0.000310 | 0.178517 |
2976 | charlie | 1.0 | 0.000055 | 58.0 | 0.000310 | 0.178517 |
2269 | across | 1.0 | 0.000055 | 57.0 | 0.000304 | 0.181649 |
2478 | pick | 1.0 | 0.000055 | 56.0 | 0.000299 | 0.184893 |
1869 | creator | 1.0 | 0.000055 | 56.0 | 0.000299 | 0.184893 |
1970 | floor | 1.0 | 0.000055 | 55.0 | 0.000294 | 0.188255 |
3388 | heres | 1.0 | 0.000055 | 55.0 | 0.000294 | 0.188255 |
1683 | case | 2.0 | 0.000111 | 107.0 | 0.000571 | 0.193533 |
2696 | nights | 1.0 | 0.000055 | 53.0 | 0.000283 | 0.195359 |
317 | old | 10.0 | 0.000553 | 526.0 | 0.002809 | 0.196844 |
2938 | company | 1.0 | 0.000055 | 52.0 | 0.000278 | 0.199116 |
3031 | somewhere | 1.0 | 0.000055 | 52.0 | 0.000278 | 0.199116 |
2871 | apartment | 1.0 | 0.000055 | 52.0 | 0.000278 | 0.199116 |
3685 | bomb | 1.0 | 0.000055 | 52.0 | 0.000278 | 0.199116 |
2213 | station | 1.0 | 0.000055 | 51.0 | 0.000272 | 0.203020 |
2656 | person | 1.0 | 0.000055 | 51.0 | 0.000272 | 0.203020 |
1706 | street | 2.0 | 0.000111 | 101.0 | 0.000539 | 0.205030 |
1875 | named | 1.0 | 0.000055 | 50.0 | 0.000267 | 0.207080 |
3509 | dark | 1.0 | 0.000055 | 50.0 | 0.000267 | 0.207080 |
# Same pattern as the high-ratio scores: build from a list of dicts
# instead of the removed DataFrame.append.
rows = []
for script_group in compare_dict.keys():
    rows.append({'script': script_group,
                 'score': compare_dict[script_group].sort_values(
                     'norm_freq_ratio').head(50)['norm_freq_ratio'].sum()})
low_score_results = pd.DataFrame(rows)
display(low_score_results.sort_values('score', ascending=False))
print('Best performing corpus (highest score) {}'.format(low_score_results.loc[low_score_results['score'].idxmax(), 'script']))
print('Worst performing corpus (lowest score) {}'.format(low_score_results.loc[low_score_results['score'].idxmin(), 'script']))
script | score | |
---|---|---|
2 | X-Men_First_Class | 13.255571 |
3 | The_Kennedys | 7.791826 |
0 | Pan_Am | 6.533376 |
1 | Mad_Men | 2.309133 |
Best performing corpus (highest score) X-Men_First_Class
Worst performing corpus (lowest score) Mad_Men
Ranking
The scores returned for both the top and bottom normalized frequency ratios measure bad things:
- The 50 highest ratios are words that were used frequently in the 21st-century scripts, but were rare in the 1960s
- The 50 lowest ratios are words that were used frequently in the 1960s, but showed up rarely in the 21st-century scripts
In the high ratios set, the higher the ratio, the further the script is from the authentic corpus. In the low ratios set, the higher the ratio, the closer the script is to the authentic corpus. So to get my ranking, I’m going to subtract the low ratio from the high ratio. The script corpora will then be sorted from lowest (best) to highest (worst) score.
# As above, build the frame from a list of dicts rather than the
# removed DataFrame.append.
rows = []
for script_group in compare_dict.keys():
    rows.append({'script': script_group,
                 'high_ratio': compare_dict[script_group].sort_values(
                     'norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio'].sum(),
                 'low_ratio': compare_dict[script_group].sort_values(
                     'norm_freq_ratio').head(50)['norm_freq_ratio'].sum()})
results = pd.DataFrame(rows)
results['combined_score'] = results['high_ratio'] - results['low_ratio']
results = results.sort_values('combined_score')
results['rank'] = range(1, 1 + len(results))
display(results)
script | high_ratio | low_ratio | combined_score | rank | |
---|---|---|---|---|---|
0 | Mad_Men | 1456.975643 | 2.309133 | 1454.666510 | 1 |
1 | Pan_Am | 3336.811190 | 6.533376 | 3330.277814 | 2 |
3 | The_Kennedys | 3980.829683 | 7.791826 | 3973.037857 | 3 |
2 | X-Men_First_Class | 4282.152672 | 13.255571 | 4268.897101 | 4 |
The analysis for the unigrams is now complete. To see the clean code (including improvements to functions) and the results for unigrams, bigrams, and trigrams, see the accompanying notebook.
Caveats
There are several problems with this exercise and the solution.
Corpus data processing
The biggest initial problem for me was that punctuation wasn’t removed, the n-grams were case sensitive, and stopwords weren’t removed. The first two mean that words aren’t counted properly, especially words prone to different capitalizations and adjacent punctuation. For example, in the initial solution I noticed ‘daddy’ written several ways. Here are several ways ‘daddy’ could appear in a script:
- Daddy.
- daddy.
- Daddy
- daddy
- Daddy!
- daddy!
That’s six variants of a single word, all of which should be counted together.
The last point, that stopwords weren’t removed, means there’s a lot of meaningless noise: words like ‘the’, ‘a’, ‘an’, ‘of’, ‘for’, etc. remain in the analysis.
Proper nouns
Related to proper counting and stopwords are proper nouns. In a script or novel, the names of the story’s characters show up a disproportionate amount of the time. With a large enough corpus this becomes moot, because names common to the era will naturally show up more often than modern names. However, these corpora aren’t large enough for that averaging of character names. The same is true for place names: the location where a script is set has a higher likelihood of being mentioned.
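If I wanted to attack this, one rough option would be part-of-speech tagging. A sketch only: tagging lowercased, punctuation-stripped tokens out of context is unreliable, so treat this as illustrative rather than a drop-in fix.

import nltk
nltk.download('averaged_perceptron_tagger')

def drop_proper_nouns(tokens):
    # NNP/NNPS are the Penn Treebank tags for singular/plural proper nouns.
    return [word for word, tag in nltk.pos_tag(tokens) if tag not in ('NNP', 'NNPS')]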
Ratio impact
As can be seen in the final results dataframe, the high ratios have a much larger impact on my ranking than the low ratios. This means that including words that were rare in the 1960s moves a script’s rank much more than omitting words that were common.
Repetition
The authentic 1960s corpus includes many, many episodes of The Twilight Zone, and most, if not all, of them start with the same introduction. This means that words like ‘traveling’, ‘another’, ‘dimension’, ‘sight’, ‘sound’, ‘mind’, and ‘journey’ are disproportionately represented. An improvement to the analysis would be to account for and remove this repetition so that the introduction is only represented once in the frequencies.
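A rough sketch of one way to do that, assuming the shared intro can be identified verbatim (the INTRO string below is a hypothetical placeholder, and matching would have to happen at a consistent cleaning stage with respect to case and punctuation):

# Hypothetical placeholder for the shared narration.
INTRO = 'you unlock this door with the key of imagination'

def strip_intro(corpus):
    # Remove the first occurrence of the intro from a script's corpus;
    # leaving one script untouched keeps a single copy in the frequencies.
    return corpus.replace(INTRO, '', 1)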