Package Documentation

Text Preprocessing

Text pre-processing module:

preprocessing.text.convert_html_entities(text_string)

Converts HTML5 character references within text_string to their corresponding Unicode characters and returns the converted string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed
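
As a rough illustration of the conversion described above, the sketch below uses the standard library's html.unescape, which resolves HTML5 named and numeric character references. The name convert_html_entities_sketch is hypothetical, and TypeError stands in for the package's InputError; whether the package is implemented this way is an assumption.

```python
import html


def convert_html_entities_sketch(text_string):
    """Hypothetical stand-in for the documented behavior."""
    if not isinstance(text_string, str):
        # The documented function raises InputError; TypeError is used here
        # because InputError is only defined inside the package.
        raise TypeError("text_string must be a str")
    # html.unescape handles named (&amp;) and numeric (&#233;) references.
    return html.unescape(text_string)


converted = convert_html_entities_sketch("Fish &amp; chips &lt; &pound;5, caf&#233;")
print(converted)  # Fish & chips < £5, café
```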

preprocessing.text.convert_ligatures(text_string)

Converts Latin character references within text_string to their corresponding Unicode characters and returns the converted string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should an argument that is neither a string nor None be passed

preprocessing.text.correct_spelling(text_string)

Splits text_string into words and replaces each word not found in a pre-built dictionary with its most likely intended word, based on a relative probability dictionary. Returns the edited string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should an argument that is neither a string nor None be passed
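
The description above resembles a frequency-based corrector. The following is a minimal sketch of that general idea, assuming candidates are limited to words one edit away and the most frequent known candidate wins. WORD_COUNTS, edits1, correct_word and correct_spelling_sketch are all hypothetical names; the package's own dictionary and selection rule may differ.

```python
import string

# Hypothetical word-frequency table; the package ships its own dictionary.
WORD_COUNTS = {"the": 500, "quick": 40, "brown": 30, "fox": 20, "jumps": 10}


def edits1(word):
    """Return every string one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent known neighbor."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible found, leave the word unchanged
    return max(candidates, key=WORD_COUNTS.get)


def correct_spelling_sketch(text_string):
    """Split on whitespace, correct each word, and rejoin with spaces."""
    return " ".join(correct_word(w) for w in text_string.split())


corrected = correct_spelling_sketch("teh quikc brown fx")
print(corrected)  # the quick brown fox
```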

preprocessing.text.create_sentence_list(text_string)

Splits text_string into a list of sentences using NLTK’s english.pickle tokenizer and returns the list as type list of str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.keyword_tokenize(text_string)

Extracts keywords from text_string using NLTK’s list of English stopwords, ignoring words shorter than 3 characters, and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed
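
The filtering rule described above can be sketched without NLTK. This is only an illustration: STOPWORDS is a small hypothetical set standing in for NLTK's English stopword list, keyword_tokenize_sketch is a hypothetical name, and splitting on whitespace is an assumption about how words are identified.

```python
# Tiny illustrative stopword set; the documented function uses NLTK's list.
STOPWORDS = {"the", "and", "over", "this", "that", "with"}


def keyword_tokenize_sketch(text_string):
    """Keep words of 3+ characters that are not stopwords."""
    kept = [
        word for word in text_string.split()
        if len(word) >= 3 and word.lower() not in STOPWORDS
    ]
    return " ".join(kept)


keywords = keyword_tokenize_sketch("the fox is quick and jumps over it")
print(keywords)  # fox quick jumps
```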

preprocessing.text.lemmatize(text_string)

Returns the base form of text_string, as type str, using NLTK’s WordNetLemmatizer.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.lowercase(text_string)

Converts text_string into lowercase and returns the converted string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.preprocess_text(text_string, function_list)

Applies each function in function_list to text_string, in the order given, and returns the processed string as type str.

Keyword arguments:

  • function_list: list of functions available in preprocessing.text
  • text_string: string instance

Exceptions raised:

  • FunctionError: occurs should an invalid function be passed within the list of functions
  • InputError: occurs should text_string be non-string, or function_list be non-list
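
The chaining behavior described above can be sketched as a simple loop in which each function receives the output of the previous one. All names below (preprocess_text_sketch, lowercase_sketch, strip_digits_sketch) are hypothetical stand-ins, and TypeError and ValueError stand in for the package's InputError and FunctionError.

```python
def lowercase_sketch(text_string):
    """Stand-in step: lowercase the text."""
    return text_string.lower()


def strip_digits_sketch(text_string):
    """Stand-in step: drop every digit character."""
    return "".join(ch for ch in text_string if not ch.isdigit())


def preprocess_text_sketch(text_string, function_list):
    """Apply each function in function_list to text_string, in order."""
    if not isinstance(text_string, str) or not isinstance(function_list, list):
        # Stand-in for the documented InputError.
        raise TypeError("expected a str and a list of functions")
    for func in function_list:
        if not callable(func):
            # Stand-in for the documented FunctionError.
            raise ValueError("function_list contains a non-callable item")
        text_string = func(text_string)
    return text_string


processed = preprocess_text_sketch(
    "Flight AB123 Delayed", [lowercase_sketch, strip_digits_sketch]
)
print(processed)  # flight ab delayed
```

Because every step consumes the previous step's output, the order of function_list matters; lowercasing before a case-sensitive step gives a different result than lowercasing after it.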

preprocessing.text.remove_esc_chars(text_string)

Removes any escape character within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.remove_number_words(text_string)

Removes any integer represented as a word within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.remove_numbers(text_string)

Removes any digit value discovered within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed
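
One way to read the behavior above is as a regular-expression substitution over digit runs. The sketch below assumes exactly that; remove_numbers_sketch is a hypothetical name, and whether the package also tidies the whitespace left behind is not stated in this documentation.

```python
import re


def remove_numbers_sketch(text_string):
    """Delete every run of digits; surrounding whitespace is left as-is."""
    return re.sub(r"\d+", "", text_string)


stripped = remove_numbers_sketch("route66 opened in 1926")
print(repr(stripped))  # 'route opened in '
```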

preprocessing.text.remove_time_words(text_string)

Removes any word associated with time (day, week, month, etc.) within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.remove_unbound_punct(text_string)

Removes all punctuation that is not attached to a non-whitespace character, or that is unexpectedly attached to another punctuation character (e.g. ”.;’;”), within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed

preprocessing.text.remove_urls(text_string)

Removes all URLs within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should a non-string argument be passed
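
URL removal of this kind is commonly done with a pattern match. The sketch below is one simplified possibility, not the package's actual pattern: URL_PATTERN and remove_urls_sketch are hypothetical, and the pattern only recognizes text starting with http://, https:// or www. up to the next whitespace.

```python
import re

# Simplified, assumed pattern: a scheme or "www." followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls_sketch(text_string):
    """Delete each matched URL; the whitespace around it is left in place."""
    return URL_PATTERN.sub("", text_string)


cleaned = remove_urls_sketch(
    "Docs at https://example.com/guide and www.example.org are online"
)
print(repr(cleaned))  # 'Docs at  and  are online'
```

Note that a pattern this simple also swallows punctuation that directly follows a URL, such as a trailing comma or period.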

preprocessing.text.remove_whitespace(text_string)

Removes all whitespace found within text_string and returns the new string as type str.

Keyword argument:

  • text_string: string instance

Exceptions raised:

  • InputError: occurs should an argument that is neither a string nor None be passed