Package Documentation
Text Preprocessing
Text pre-processing module:
preprocessing.text.convert_html_entities(text_string)
Converts HTML5 character references within text_string to their corresponding Unicode characters and returns the converted string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
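As a rough illustration of the behaviour described above, the standard library's html.unescape decodes HTML5 character references. This is a minimal sketch and not necessarily how the package implements it; the function name convert_html_entities_sketch is hypothetical, and TypeError stands in for InputError, whose import path is not given in this section.

```python
import html


def convert_html_entities_sketch(text_string):
    """Hypothetical stand-in: decode HTML5 character references."""
    if not isinstance(text_string, str):
        # The documented function raises InputError; TypeError is a stand-in.
        raise TypeError("text_string must be a str")
    return html.unescape(text_string)


result = convert_html_entities_sketch("Fish &amp; chips &lt;3 caf&eacute;")
print(result)  # Fish & chips <3 café
```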
preprocessing.text.convert_ligatures(text_string)
Converts Latin character references within text_string to their corresponding Unicode characters and returns the converted string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should the argument passed be neither a string nor None
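Going by the function name, one plausible reading is that typographic ligatures are replaced with their component letters. The sketch below shows that reading with a small hand-written table; the table contents, the name expand_ligatures_sketch, and the use of TypeError in place of InputError are all assumptions, and handling of None is left out because its return value is not documented here.

```python
# Illustrative ligature table (an assumption, not the package's own data).
LIGATURE_MAP = str.maketrans({
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
})


def expand_ligatures_sketch(text_string):
    """Hypothetical stand-in: replace each ligature with its letters."""
    if not isinstance(text_string, str):
        # The documented function raises InputError; TypeError is a stand-in.
        raise TypeError("text_string must be a str")
    return text_string.translate(LIGATURE_MAP)


result = expand_ligatures_sketch("The \ufb01nal \ufb02ight")
print(result)  # The final flight
```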
preprocessing.text.correct_spelling(text_string)
Splits text_string into words and replaces each word not found in a pre-built dictionary with its most likely intended word, based on a relative probability dictionary. Returns the edited string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should the argument passed be neither a string nor None
preprocessing.text.create_sentence_list(text_string)
Splits text_string into a list of sentences using NLTK’s english.pickle tokenizer and returns the list as type list of str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.keyword_tokenize(text_string)
Extracts keywords from text_string by removing the words in NLTK’s list of English stopwords and any word shorter than 3 characters, and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
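The filtering described above can be sketched without NLTK by using a small stand-in stop-word set. The set contents and the name keyword_tokenize_sketch are hypothetical; the real function uses NLTK's full English list, and its exact tokenization may differ from the whitespace split assumed here.

```python
# Tiny stand-in stop-word set (assumption; NLTK's English list is far larger).
STOPWORDS = {"the", "and"}


def keyword_tokenize_sketch(text_string, stopwords=STOPWORDS):
    """Hypothetical stand-in: drop stop words and words shorter than 3 chars."""
    if not isinstance(text_string, str):
        # The documented function raises InputError; TypeError is a stand-in.
        raise TypeError("text_string must be a str")
    kept = [
        word
        for word in text_string.split()  # whitespace split is an assumption
        if word not in stopwords and len(word) >= 3
    ]
    return " ".join(kept)


result = keyword_tokenize_sketch("the fox is on the hill and it ran")
print(result)  # fox hill ran
```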
preprocessing.text.lemmatize(text_string)
Returns the base form of text_string, as type str, using NLTK’s WordNetLemmatizer.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.lowercase(text_string)
Converts text_string into lowercase and returns the converted string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.preprocess_text(text_string, function_list)
Applies each function in function_list to text_string, in the order given, and returns the processed string as type str.
Keyword arguments:
- function_list: list of functions available in preprocessing.text
- text_string: string instance
Exceptions raised:
- FunctionError: occurs should an invalid function be passed within the list of functions
- InputError: occurs should text_string be non-string, or function_list be non-list
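The chaining behaviour described above amounts to feeding each function's output into the next. The following is a minimal sketch under that assumption; preprocess_text_sketch and strip_digits are hypothetical names, and TypeError and ValueError stand in for InputError and FunctionError, whose import paths are not given in this section.

```python
import re


def preprocess_text_sketch(text_string, function_list):
    """Hypothetical stand-in: apply each function to the text, in list order."""
    if not isinstance(text_string, str) or not isinstance(function_list, list):
        # Stand-in for the documented InputError.
        raise TypeError("expected a str and a list of functions")
    for func in function_list:
        if not callable(func):
            # Stand-in for the documented FunctionError.
            raise ValueError("function_list contains a non-callable item")
        text_string = func(text_string)
    return text_string


def strip_digits(text_string):
    """Illustrative helper: remove every digit character."""
    return re.sub(r"\d", "", text_string)


result = preprocess_text_sketch("Hello World 42", [str.lower, strip_digits, str.strip])
print(result)  # hello world
```

With the package installed, the equivalent call would presumably pass functions from preprocessing.text, such as lowercase and remove_numbers, in function_list.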
preprocessing.text.remove_esc_chars(text_string)
Removes any escape character within text_string and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.remove_number_words(text_string)
Removes any integer represented as a word within text_string and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.remove_numbers(text_string)
Removes any digit value discovered within text_string and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
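Digit removal of this kind is commonly done with a regular expression. The sketch below is one way to do it and is not necessarily the package's approach; the name remove_numbers_sketch is hypothetical, TypeError stands in for InputError, and collapsing the leftover whitespace is an assumption that the description above does not confirm.

```python
import re


def remove_numbers_sketch(text_string):
    """Hypothetical stand-in: delete runs of digits, then tidy whitespace."""
    if not isinstance(text_string, str):
        # The documented function raises InputError; TypeError is a stand-in.
        raise TypeError("text_string must be a str")
    without_digits = re.sub(r"\d+", "", text_string)
    # Assumption: collapse the gaps left behind into single spaces.
    return " ".join(without_digits.split())


result = remove_numbers_sketch("Room 101 has 3 doors")
print(result)  # Room has doors
```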
preprocessing.text.remove_time_words(text_string)
Removes any word associated with time (day, week, month, etc.) within text_string and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed
preprocessing.text.remove_unbound_punct(text_string)
Removes all punctuation within text_string that is not attached to a non-whitespace character, or that is unexpectedly attached to another punctuation character (e.g. ".;';"), and returns the new string as type str.
Keyword argument:
- text_string: string instance
Exceptions raised:
- InputError: occurs should a non-string argument be passed