languages

Module for organizing language dependencies for text cleaning.

The following languages have been selected because their stopwords can be removed via https://github.com/stopwords-iso/stopwords-iso/tree/master/python.

Contents:

lem_abbr_dict, stem_abbr_dict, sw_abbr_dict

wikirec.languages.lem_abbr_dict()

Returns a dictionary of languages and their abbreviations for lemmatization.

Returns:
lem_abbr_dict : dict

A dictionary with languages as keys and their abbreviations as values.

Notes

These languages can be lemmatized via https://spacy.io/usage/models.

They are also the languages whose words can be ordered by part of speech.
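As a minimal sketch of how such a dictionary might be consulted, the snippet below looks up the abbreviation for a language name. The stub function and its three entries are hypothetical stand-ins for the real return value of lem_abbr_dict(), and the lowercase-key convention is an assumption, not something this page confirms.

```python
def lem_abbr_dict_stub():
    """Hypothetical excerpt standing in for wikirec.languages.lem_abbr_dict()."""
    # Assumed shape: lowercase language name -> abbreviation.
    return {"english": "en", "german": "de", "french": "fr"}


def abbreviation_for(language, abbr_dict):
    """Return the abbreviation for a language name, or None if it is absent."""
    return abbr_dict.get(language.lower())


lem_langs = lem_abbr_dict_stub()
print(abbreviation_for("German", lem_langs))   # de
print(abbreviation_for("klingon", lem_langs))  # None
```

Returning None for an unknown language lets the caller decide whether to fall back to stemming or stop word removal.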

wikirec.languages.stem_abbr_dict()

Returns a dictionary of languages and their abbreviations for stemming.

Returns:
stem_abbr_dict : dict

A dictionary with languages as keys and their abbreviations as values.

Notes

These languages lack good lemmatizers and are therefore stemmed via https://www.nltk.org/api/nltk.stem.html.

wikirec.languages.sw_abbr_dict()

Returns a dictionary of languages and their abbreviations for stop word removal.

Returns:
sw_abbr_dict : dict

A dictionary with languages as keys and their abbreviations as values.

Notes

For these languages, only stop word removal is available, via https://github.com/stopwords-iso/stopwords-iso.
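The three dictionaries suggest a tiered approach to text cleaning: lemmatize where possible, stem otherwise, and fall back to stop word removal alone. The sketch below illustrates that idea under stated assumptions. The function name cleaning_strategy, the order of precedence, and the sample dictionary entries are all hypothetical and may not match how wikirec itself selects a method.

```python
def cleaning_strategy(language, lem_dict, stem_dict, sw_dict):
    """Pick a cleaning method for a language and return (method, abbreviation).

    Assumed precedence: lemmatization, then stemming, then stop words only.
    """
    name = language.lower()
    if name in lem_dict:
        return "lemmatize", lem_dict[name]
    if name in stem_dict:
        return "stem", stem_dict[name]
    if name in sw_dict:
        return "stopwords_only", sw_dict[name]
    raise ValueError(f"No cleaning support found for language: {language!r}")


# Hypothetical excerpts standing in for the three wikirec dictionaries.
lem_dict = {"english": "en", "german": "de"}
stem_dict = {"arabic": "ar", "hungarian": "hu"}
sw_dict = {"latin": "la", "zulu": "zu"}

print(cleaning_strategy("German", lem_dict, stem_dict, sw_dict))
print(cleaning_strategy("arabic", lem_dict, stem_dict, sw_dict))
print(cleaning_strategy("Zulu", lem_dict, stem_dict, sw_dict))
```

Raising ValueError for an unsupported language makes the gap explicit rather than silently skipping the cleaning step.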