data_utils
The data_utils module provides the functions needed for data loading and parsing.
Functions
wikirec.data_utils._iterate_and_parse_file()
wikirec.data_utils._clean_text_strings()
- wikirec.data_utils.input_conversion_dict()
Returns a dictionary of argument conversions for commonly recommended articles.
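A minimal usage sketch follows; the "books" lookup key is an assumption for illustration, so inspect the returned dictionary for the actual keys.

    from wikirec import data_utils

    # Retrieve the mapping of common topic arguments to their internal forms.
    conversions = data_utils.input_conversion_dict()

    # Hypothetical lookup: "books" is an assumed key, not a documented one.
    print(conversions.get("books"))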
- wikirec.data_utils.download_wiki(language='en', target_dir='wiki_dump', file_limit=-1, dump_id=False)
Downloads the most recent stable Wikipedia dump for the given language if it is not already in the specified directory within the pwd.
- Parameters:
- language : str (default=en)
The language of Wikipedia to download.
- target_dir : str (default=wiki_dump)
The directory in the pwd into which files should be downloaded.
- file_limit : int (default=-1, all files)
The limit for the number of files to download.
- dump_id : str (default=False)
The id of an explicit Wikipedia dump that the user wants to download.
Note: a value of False will select the third-from-last dump, i.e. the latest stable one.
- Returns:
- file_info : list of lists
Information on the downloaded Wikipedia dump files.
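A minimal usage sketch, assuming the package is installed and a network connection is available; the small file_limit is only to keep the example quick.

    from wikirec import data_utils

    # Download the first two files of the latest stable English dump
    # into ./wiki_dump (file_limit=-1, the default, would fetch all files).
    file_info = data_utils.download_wiki(
        language="en",
        target_dir="wiki_dump",
        file_limit=2,
        dump_id=False,
    )

    # Each element describes one downloaded dump file.
    for entry in file_info:
        print(entry)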
- wikirec.data_utils._process_article(title, text, templates='Infobox book')
Processes a Wikipedia article, looking for the given infobox templates.
- Parameters:
- title : str
The title of the article.
- text : str
The text to be processed.
- templates : str (default=Infobox book)
The target templates for the corpus.
- Returns:
- title, text, wikilinks : str, str, list
The data from the article.
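Because _process_article is a private helper invoked internally by parse_to_ndjson(), the following is only an illustrative sketch; the sample wikitext and the None check for articles without a matching template are assumptions.

    from wikirec import data_utils

    # Invented wikitext containing the target infobox template.
    sample_text = (
        "{{Infobox book\n"
        "| name = Example Title\n"
        "| author = Example Author\n"
        "}}\n"
        "The book links to [[Another Article]] in its body."
    )

    result = data_utils._process_article("Example Title", sample_text)
    if result is not None:  # assumed behavior for non-matching articles
        title, text, wikilinks = result
        print(title, wikilinks)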
- wikirec.data_utils.parse_to_ndjson(topics='books', language='en', output_path='topic_articles', input_dir='wikipedia_dump', partitions_dir='partitions', limit=None, delete_parsed_files=False, multicore=True, verbose=True)
Finds all Wikipedia entries for the given topics and converts them to JSON files.
- Parameters:
- topics : str (default=books)
The topics that articles should be subset by.
Note: this corresponds to the type of infobox from Wikipedia articles.
- language : str (default=en)
The language of Wikipedia that articles are being parsed for.
- output_path : str (default=topic_articles)
The name of the final output ndjson file.
- input_dir : str (default=wikipedia_dump)
The path to the directory where the data is stored.
- partitions_dir : str (default=partitions)
The path to the directory where the partitioned files should be stored.
- limit : int (default=None)
An optional limit of the number of topic articles per dump file to find.
- delete_parsed_files : bool (default=False)
Whether to delete the separate parsed files after combining them.
- multicore : bool (default=True)
Whether to use multiprocessing.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the processes.
- Returns:
- Wikipedia dump files parsed for the given template types and converted to JSON files.
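A sketch of a full parsing run, assuming the dump was downloaded into wiki_dump as in the download_wiki() example; the .ndjson extension on output_path is an assumption.

    from wikirec import data_utils

    # Parse the downloaded dump for book articles and combine the
    # results into a single ndjson file.
    data_utils.parse_to_ndjson(
        topics="books",
        language="en",
        output_path="topic_articles.ndjson",
        input_dir="wiki_dump",  # matches the target_dir used above
        partitions_dir="partitions",
        limit=None,
        delete_parsed_files=False,
        multicore=True,
        verbose=True,
    )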
- wikirec.data_utils._combine_tokens_to_str(tokens)
Combines the texts into one string.
- Parameters:
- tokens : str or list
The texts to be combined.
- Returns:
- texts_str : str
A string of the full text with unwanted words removed.
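An illustrative sketch of a direct call to this private helper, which is presumably invoked internally during cleaning.

    from wikirec import data_utils

    tokens = ["picture", "dorian", "gray"]

    # Assumed to join the tokens into a single string.
    texts_str = data_utils._combine_tokens_to_str(tokens)
    print(texts_str)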
- wikirec.data_utils._lower_remove_unwanted(args)
Lowercases tokens and removes numbers and, optionally, names.
- Parameters:
- args : list of tuples
The following arguments zipped.
- text : list
The text to clean.
- remove_names : bool
Whether to remove names.
- words_to_ignore : str or list
Strings that should be removed from the text body.
- stop_words : str or list
Stopwords for the given language.
- Returns:
- text_lower : list
The text with lowercased tokens and without unwanted tokens.
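Because the helper consumes its arguments as a single zipped tuple (a pattern suited to multiprocessing), the sketch below assembles that tuple by hand; the tuple order follows the parameter list above and is otherwise an assumption.

    from wikirec import data_utils

    text = ["The", "Picture", "of", "Dorian", "Gray", "1890"]
    remove_names = False
    words_to_ignore = ["gray"]
    stop_words = ["the", "of"]

    # Bundle the arguments in the documented order (assumed).
    args = (text, remove_names, words_to_ignore, stop_words)
    text_lower = data_utils._lower_remove_unwanted(args)
    print(text_lower)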
- wikirec.data_utils._lemmatize(tokens, nlp=None, verbose=True)
Lemmatizes tokens.
- Parameters:
- tokens : list or list of lists
Tokens to be lemmatized.
- nlp : spacy.load object (default=None)
A spaCy language model.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- lemmatized_tokens : list or list of lists
Tokens that have been lemmatized for nlp analysis.
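A sketch of lemmatizing a nested token list, assuming the small English spaCy pipeline is installed (python -m spacy download en_core_web_sm).

    import spacy
    from wikirec import data_utils

    # Load a spaCy language model to pass to the helper.
    nlp = spacy.load("en_core_web_sm")

    tokens = [["pictures", "painted", "artists"]]
    lemmatized_tokens = data_utils._lemmatize(tokens, nlp=nlp, verbose=False)
    print(lemmatized_tokens)  # e.g. [["picture", "paint", "artist"]]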
- wikirec.data_utils._subset_and_combine_tokens(args)
Subsets a text to a maximum length and combines it into a string.
- Parameters:
- args : list of tuples
The following arguments zipped.
- text : list
The list of tokens to be subset and combined.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- Returns:
- sub_comb_text : tuple
An index and its combined text.
- wikirec.data_utils.clean(texts, language='en', min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, remove_names=False, sample_size=1, verbose=True)
Cleans a text body to prepare it for analysis.
- Parameters:
- texts : str or list
The texts to be cleaned and tokenized.
- language : str (default=en)
The language of the texts to be cleaned.
- min_token_freq : int (default=2)
The minimum allowable frequency of a word inside the corpus.
- min_token_len : int (default=3)
The smallest allowable length of a word.
- min_tokens : int (default=0)
The minimum allowable length of a tokenized text.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- min_ngram_count : int (default=3)
The minimum occurrences for an n-gram to be included.
- remove_stopwords : bool (default=True)
Whether to remove stopwords.
- ignore_words : str or list (default=None)
Strings that should be removed from the text body.
- remove_names : bool (default=False)
Whether to remove common names.
- sample_size : float (default=1)
The proportion of the data to be randomly sampled.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- text_corpus, selected_idxs : list, list
The texts formatted for text analysis as well as the indexes for selected entries.
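A minimal end-to-end sketch, assuming parse_to_ndjson() wrote topic_articles.ndjson and that each record is a [title, text, wikilinks] list as returned by _process_article() above.

    import json

    from wikirec import data_utils

    # Load the parsed articles and pull out their text bodies.
    with open("topic_articles.ndjson", "r") as f:
        articles = [json.loads(line) for line in f]
    titles = [article[0] for article in articles]
    texts = [article[1] for article in articles]

    # Clean and tokenize the texts for downstream analysis.
    text_corpus, selected_idxs = data_utils.clean(
        texts,
        language="en",
        min_token_freq=2,
        min_token_len=3,
        remove_stopwords=True,
        verbose=True,
    )

    # Map the cleaned corpus back to the articles that survived cleaning.
    selected_titles = [titles[i] for i in selected_idxs]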