Entry NLP3: Clean Data and Split into N-grams
In the first entry of this series, I figured out how to process the raw files. In the second entry, I figured out how to load all files in a directory (even if it has subdirectories) and store the data.
Now I’m ready to make the analysis case insensitive, remove punctuation and stopwords, and split what’s left into n-grams.
Side note: To be fair, I worked on a pretty extensive NLP problem a few years ago. I’ll be reusing code and logic from that project.
import pandas as pd
import os
from IPython.display import display
import string
import re
import itertools
import nltk
nltk.download('stopwords')
def read_script(file_path):
    corpus = ''
    with open(file_path, 'r', encoding='latin-1') as l:
        for line in l:
            if (re.match(r'[^\d+]', line)
            ) and (re.match(r'^(?!\s*$).+', line)
            ) and not (re.match(r'(.*www.*)|(.*http:*)', line)
            ) and not (re.match(r'Sync and correct*', line)):
                line = re.sub(r'</?i>|</?font.*>', '', line)
                corpus = corpus + ' ' + line
    return corpus
def load_files_to_dict(file_path, return_dict):
    for thing in os.scandir(file_path):
        if thing.is_dir():
            new_path = os.path.join(file_path, thing.name)
            new_dict = return_dict[thing.name] = {}
            load_files_to_dict(new_path, new_dict)
        elif thing.is_file():
            return_dict[thing.name] = read_script(f'{file_path}/{thing.name}')
    return return_dict
file_path = os.path.join(os.getcwd(), 'data', '1960s')
unilayer_dict = load_files_to_dict(file_path, {})
Remove Punctuation
The list of things to remove includes `\n`, which denotes a newline. I found that including `\r` (a carriage return) and `\t` (a tab) is also helpful. These characters can all be hard to spot, as they are generally invisible and can randomly attach themselves to otherwise normal words.
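Since these characters are invisible when printed, `repr` is a quick way to expose them (the string below is a made-up example, not from the scripts):

```python
# A toy string with whitespace characters attached to otherwise normal words.
line = 'shadow\tand\rsubstance\n'

# print() hides them; repr() makes every \t, \r, and \n visible.
print(repr(line))  # 'shadow\tand\rsubstance\n'
```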
newline_list = '\t\r\n'
Next I’ll spell out the special characters I want to remove from the text. Fortunately, there’s a list of punctuation included in the `string` library.
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
This list is pretty comprehensive. Between this and the `newline_list` I created above, all the remaining characters from the “Remove” list are now addressed. For quick reference, I still had the following items to remove:
- `'#'`
- `'-'`
- `'('`
- `')'`
- `'"'`
- `'\n'`
In my previous project, I discovered the `translate` method. It replaces specified characters with those described in a dictionary or mapping table. The `maketrans` method creates the mapping table. This pair of methods is very handy for processing strings.
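A minimal sketch of the pair in action (the string is a toy example, not from the corpus):

```python
import string

# maketrans builds the mapping table; characters in the third
# argument are deleted rather than replaced.
nopunct = str.maketrans('', '', string.punctuation)

print("you're moving-into a land!".translate(nopunct))
# youre movinginto a land
```

Note that because the characters are deleted outright (not mapped to spaces), contractions like “you’re” collapse to “youre”, which is why that spelling shows up in the processed tokens later on.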
Now I can specify all my variables:
newline_list = '\t\r\n'
remove_newline = str.maketrans(' ', ' ', newline_list)
punct_list = string.punctuation
nopunct = str.maketrans('', '', punct_list)
To process the data, I can then just apply `str.translate` to the column holding the text.
df[text_col].fillna("").str.lower().str.translate(remove_newline).str.translate(nopunct).str.split()
This particular strategy hinges on the text being a value in a dataframe column. However, the output from the last notebook is a dictionary.
list(unilayer_dict.keys())[:5]
['The Twilight Zone - 3x17 - One More Pallbearer.srt',
'The Twilight Zone - 3x05 - A Game of Pool.srt',
'The Twilight Zone - 2x03 - Nervous Man in a Four Dollar Room.srt',
'The Twilight Zone - 4x05 - Mute.srt',
'The Twilight Zone - 3x04 - The Passersby.srt']
unilayer_dict['The Twilight Zone - 4x05 - Mute.srt'][:500]
' You unlock this door\n with the key of imagination.\n Beyond it is another dimension-\n a dimension of sound,\n a dimension of sight,\n a dimension of mind.\n You\'re moving into a land\n of both shadow and substance,\n of things and ideas.\n You\'ve just crossed over\n into the twilight zone.\n So...\n "the undersigned,\n "having accepted\n the following propositions:\n "A, that prior\n to the inception of language,\n "man communicated\n by telepathic means;\n "and b, that this ability\n not only still exists\n "but'
Convert dictionary to dataframe
A dictionary is easily converted into a dataframe with `pd.DataFrame.from_dict`. The gotcha for this particular use case is the parameter `orient`: it has to be set to `'index'` in order to turn each key:value pair into a row instead of a column. As a result, the keys become the index.
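To see the effect on a toy dictionary (the keys and values here are made up):

```python
import pandas as pd

toy = {'episode_a.srt': 'first corpus', 'episode_b.srt': 'second corpus'}

# orient='index' turns each key:value pair into a row,
# with the key serving as the row's index label.
df = pd.DataFrame.from_dict(toy, orient='index')
print(df)
```

With the default `orient='columns'`, pandas would instead try to treat each key as a column name, which raises an error here because the values are scalars.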
pd.DataFrame.from_dict(unilayer_dict, orient='index').head()
0 | |
---|---|
The Twilight Zone - 3x17 - One More Pallbearer.srt | You're traveling\n through another dimension-... |
The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... |
The Twilight Zone - 2x03 - Nervous Man in a Four Dollar Room.srt | You're traveling\n through another dimension-... |
The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... |
The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... |
The indexing quirk is easily fixed with `reset_index`, which makes all my variables accessible as columns. However, that leaves me with terrible column names, so the last step is to give the columns intuitive names.
pd.DataFrame.from_dict(unilayer_dict, orient='index').reset_index().head()
index | 0 | |
---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... |
Ultimately this will all be in a single function or series of functions and the column name won’t matter. However, I find it much easier to write and read the code when there are descriptive names - this goes for column names and function names. So I’m going to change the column names to be more easily understood.
test = pd.DataFrame.from_dict(unilayer_dict, orient='index').reset_index().rename(columns={'index':'script_name', 0:'corpus'})
test.head()
script_name | corpus | |
---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... |
While it is a single line of code, it is a little unwieldy and I’ll need to apply it to all the dictionaries, so I’ll write a quick function to do it for me.
def convert_dict_df(script_dict):
    return pd.DataFrame.from_dict(script_dict, orient='index').reset_index().rename(columns={'index':'script_name', 0:'corpus'})
Remove Punctuation
Now that the values are conveniently located in a dataframe, I just have to apply the logic defined earlier. To make it easier, I’ll put the logic into a function, then apply that function to the example dataframe.
def punct_tokens(df, text_col):
    newline_list = '\t\r\n'
    remove_newline = str.maketrans(' ', ' ', newline_list)
    punct_list = string.punctuation
    nopunct = str.maketrans('', '', punct_list)
    df['no_punct_tokens'] = df[text_col].fillna("").str.lower().str.translate(remove_newline).str.translate(nopunct).str.split()
    return df
punct_test = punct_tokens(test, 'corpus')
punct_test.head()
script_name | corpus | no_punct_tokens | |
---|---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... |
Remove stopwords
Now that the punctuation is out of the way, I can start thinking about breaking the text into different sized n-grams. What has worked for me in the past is to split the punctuation-free string into unigrams (called one-grams in the homework), then create different sizes of n-gram from there.
However, to get words with actual meaning, I first need to remove stopwords.
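As a quick illustration on a made-up token list, n-grams are just a sliding window over the unigrams; plain `zip` reproduces what `nltk.bigrams` and `nltk.trigrams` return on the same input:

```python
# Hypothetical unigram list for illustration.
unigrams = ['key', 'imagination', 'another', 'dimension']

# Pair each token with its successor(s); this mirrors
# nltk.bigrams / nltk.trigrams applied to the same list.
bigrams = list(zip(unigrams, unigrams[1:]))
trigrams = list(zip(unigrams, unigrams[1:], unigrams[2:]))

print(bigrams)
# [('key', 'imagination'), ('imagination', 'another'), ('another', 'dimension')]
```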
The `nltk` library has a handy list of stopwords. Note: Using the `nltk` library is beyond the scope of this series of entries. Historically, my use of the `nltk` library has mostly been limited to the stopword list and n-gram creation. I have used the `FreqDist` and `ConditionalFreqDist` functions, but found them a bit temperamental and ended up coding frequency counts myself for this exercise (see the next post).
nltk.corpus.stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
'ourselves', 'you', "you're", "you've", "you'll",
"you'd", 'your', 'yours', 'yourself', 'yourselves',
'he', 'him', 'his', 'himself', 'she', "she's",
'her', 'hers', 'herself', 'it', "it's", 'its',
'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom',
'this', 'that', "that'll", 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been',
'being', 'have', 'has', 'had', 'having', 'do',
'does', 'did', 'doing', 'a', 'an', 'the', 'and',
'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about',
'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from',
'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how',
'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor',
'not', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 's', 't', 'can', 'will', 'just', 'don',
"don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't",
'couldn', "couldn't", 'didn', "didn't", 'doesn',
"doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
'haven', "haven't", 'isn', "isn't", 'ma', 'mightn',
"mightn't", 'mustn', "mustn't", 'needn', "needn't",
'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't",
'wouldn', "wouldn't"]
The best way I found to remove stopwords was to use list comprehension in a lambda function.
All the code for this section was re-used, so I’ll lump the results all together.
def create_ngrams(df):
    stop = nltk.corpus.stopwords.words('english')
    df['unigrams'] = df['no_punct_tokens'].apply(lambda x: [item for item in x if item not in stop])
    df['bigrams'] = df['unigrams'].apply(lambda x: list(nltk.bigrams(x)))
    df['trigrams'] = df['unigrams'].apply(lambda x: list(nltk.trigrams(x)))
    return df
create_ngrams(punct_test).head()
script_name | corpus | no_punct_tokens | unigrams | bigrams | trigrams | |
---|---|---|---|---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
I appreciate this data structure because if anything doesn’t make sense later in the analysis, I can track it back to its source. As long as I can find a value in the designated n-gram column, I can see what the corpus looked like in its original form (the full concatenated string), after removal of punctuation, after removal of stopwords, and as n-grams. And because I have the script name, I can trace it all the way back to the script it came from.
The code to create this dataframe is a good chunk of related logic, so I’ll combine it into a single function for ease of use.
def create_ngram_df(script_dict, text_col):
    df = convert_dict_df(script_dict)
    df = punct_tokens(df, text_col)
    df = create_ngrams(df)
    return df
authentic_ngram_df = create_ngram_df(unilayer_dict, 'corpus')
authentic_ngram_df
script_name | corpus | no_punct_tokens | unigrams | bigrams | trigrams | |
---|---|---|---|---|---|---|
0 | The Twilight Zone - 3x17 - One More Pallbearer... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
1 | The Twilight Zone - 3x05 - A Game of Pool.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
2 | The Twilight Zone - 2x03 - Nervous Man in a Fo... | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
3 | The Twilight Zone - 4x05 - Mute.srt | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
4 | The Twilight Zone - 3x04 - The Passersby.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
... | ... | ... | ... | ... | ... | ... |
116 | The Twilight Zone - s05e36 - The Bewitchin' Po... | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
117 | The Twilight Zone - 3x03 - The Shelter.srt | You're traveling\n through another dimension-... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
118 | The Twilight Zone - s05e21 - Spur of the Momen... | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
119 | The Twilight Zone - 2x29 - The Obsolete Man.srt | You're traveling\n through another dimension\... | [youre, traveling, through, another, dimension... | [youre, traveling, another, dimension, dimensi... | [(youre, traveling), (traveling, another), (an... | [(youre, traveling, another), (traveling, anot... |
120 | The Twilight Zone - s05e13 - Ring-A-Ding Girl.srt | You unlock this door\n with the key of imagin... | [you, unlock, this, door, with, the, key, of, ... | [unlock, door, key, imagination, beyond, anoth... | [(unlock, door), (door, key), (key, imaginatio... | [(unlock, door, key), (door, key, imagination)... |
121 rows × 6 columns
To handle the multiple corpora of the 21st century scripts, I retained the dictionary-holding-another-data-structure setup. The name of each grouping (‘Pan_Am’, ‘Mad_Men’, ‘The_Kennedys’, ‘X-Men_First_Class’) is a key and the dataframe is the value. Using this, I can continue to reap the benefits of my functions, while keeping the groups, and their individual analyses, separate.
test_ngram_dict = {}
for script_group in list(bilayer_dict.keys()):
    test_ngram_dict[script_group] = create_ngram_df(bilayer_dict[script_group], 'corpus')
test_ngram_dict['Pan_Am']
script_name | corpus | no_punct_tokens | unigrams | bigrams | trigrams | |
---|---|---|---|---|---|---|
0 | Pan.Am.S01E09.srt | Previously on "Pan Am"...\n Look, I get to se... | [previously, on, pan, am, look, i, get, to, se... | [previously, pan, look, get, see, world, sam, ... | [(previously, pan), (pan, look), (look, get), ... | [(previously, pan, look), (pan, look, get), (l... |
1 | Pan.Am.S01E08.srt | 1\n Previously on "Pan Am"...\n Let's keep... | [1, previously, on, pan, am, lets, keep, it... | [1, previously, pan, lets, keep, new, york,... | [(1, previously), (previously, pan), (pan, ... | [(1, previously, pan), (previously, pan, le... |
2 | Pan.Am.S01E05.srt | Previously on "Pan Am"...\n What do you think... | [previously, on, pan, am, what, do, you, think... | [previously, pan, think, youre, ran, away, wed... | [(previously, pan), (pan, think), (think, your... | [(previously, pan, think), (pan, think, youre)... |
3 | Pan.Am.S01E11.srt | Previously on "Pan Am".\n MI6 will want answe... | [previously, on, pan, am, mi6, will, want, ans... | [previously, pan, mi6, want, answers, take, li... | [(previously, pan), (pan, mi6), (mi6, want), (... | [(previously, pan, mi6), (pan, mi6, want), (mi... |
4 | Pan.Am.S01E10.srt | Previously on "Pan Am"...\n I bet you've got ... | [previously, on, pan, am, i, bet, youve, got, ... | [previously, pan, bet, youve, got, surprises, ... | [(previously, pan), (pan, bet), (bet, youve), ... | [(previously, pan, bet), (pan, bet, youve), (b... |
5 | Pan.Am.S01E04.srt | Previously on "Pan Am"...\n - You're gonna me... | [previously, on, pan, am, youre, gonna, meet, ... | [previously, pan, youre, gonna, meet, kennedy,... | [(previously, pan), (pan, youre), (youre, gonn... | [(previously, pan, youre), (pan, youre, gonna)... |
6 | Pan.Am.S01E12.srt | Previously on "Pan Am".\n We'd like to move y... | [previously, on, pan, am, wed, like, to, move,... | [previously, pan, wed, like, move, courier, ag... | [(previously, pan), (pan, wed), (wed, like), (... | [(previously, pan, wed), (pan, wed, like), (we... |
7 | Pan.Am.S01E06.srt | Previously on "Pan Am".\n Why don't you came ... | [previously, on, pan, am, why, dont, you, came... | [previously, pan, dont, came, fog, captain, af... | [(previously, pan), (pan, dont), (dont, came),... | [(previously, pan, dont), (pan, dont, came), (... |
8 | Pan.Am.S01E07.srt | Previously on "Pan Am"...\n You smell like wh... | [previously, on, pan, am, you, smell, like, wh... | [previously, pan, smell, like, whiskey, cigare... | [(previously, pan), (pan, smell), (smell, like... | [(previously, pan, smell), (pan, smell, like),... |
9 | Pan.Am.S01E13.srt | Previously on "Pan Am"...\n Let's keep it in ... | [previously, on, pan, am, lets, keep, it, in, ... | [previously, pan, lets, keep, new, york, ginny... | [(previously, pan), (pan, lets), (lets, keep),... | [(previously, pan, lets), (pan, lets, keep), (... |
10 | Pan.Am.S01E03.srt | Previously on "Pan Am".\n You're always disap... | [previously, on, pan, am, youre, always, disap... | [previously, pan, youre, always, disappearing,... | [(previously, pan), (pan, youre), (youre, alwa... | [(previously, pan, youre), (pan, youre, always... |
11 | Pan.Am.S01E02.srt | Previously on "Pan Am"...\n Do you not wanna ... | [previously, on, pan, am, do, you, not, wanna,... | [previously, pan, wanna, get, married, need, d... | [(previously, pan), (pan, wanna), (wanna, get)... | [(previously, pan, wanna), (pan, wanna, get), ... |
12 | Pan.Am.S01E14.srt | Previously on "Pan Am"...\n There's a dealer ... | [previously, on, pan, am, theres, a, dealer, i... | [previously, pan, theres, dealer, london, whos... | [(previously, pan), (pan, theres), (theres, de... | [(previously, pan, theres), (pan, theres, deal... |
13 | Pan.Am.S01E01.srt | 1\n There you are.\n Enjoy your flight.\n ... | [1, there, you, are, enjoy, your, flight, j... | [1, enjoy, flight, jet, clipper, service, u... | [(1, enjoy), (enjoy, flight), (flight, jet)... | [(1, enjoy, flight), (enjoy, flight, jet), ... |
Putting it all together, the functions look like this:
def convert_dict_df(script_dict):
    return pd.DataFrame.from_dict(script_dict, orient='index').reset_index().rename(columns={'index':'script_name', 0:'corpus'})

def punct_tokens(df, text_col):
    newline_list = '\t\r\n'
    remove_newline = str.maketrans(' ', ' ', newline_list)
    punct_list = string.punctuation
    nopunct = str.maketrans('', '', punct_list)
    df['no_punct_tokens'] = df[text_col].fillna("").str.lower().str.translate(remove_newline).str.translate(nopunct).str.split()
    return df

def create_ngrams(df):
    stop = nltk.corpus.stopwords.words('english')
    df['unigrams'] = df['no_punct_tokens'].apply(lambda x: [item for item in x if item not in stop])
    df['bigrams'] = df['unigrams'].apply(lambda x: list(nltk.bigrams(x)))
    df['trigrams'] = df['unigrams'].apply(lambda x: list(nltk.trigrams(x)))
    return df

def create_ngram_df(script_dict, text_col):
    df = convert_dict_df(script_dict)
    df = punct_tokens(df, text_col)
    df = create_ngrams(df)
    return df