data_utils

The data_utils module provides functions for loading and parsing Wikipedia data.

Functions

input_conversion_dict, download_wiki, _process_article, parse_to_ndjson, _combine_tokens_to_str, _lower_remove_unwanted, _lemmatize, _subset_and_combine_tokens, clean

Classes

WikiXmlHandler

wikirec.data_utils.input_conversion_dict()[source]

Returns a dictionary of argument conversions for commonly recommended articles.
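
A minimal sketch of inspecting the conversions; the exact keys depend on the installed version:

    from wikirec import data_utils

    # Map of user-facing arguments to their internal equivalents
    conversions = data_utils.input_conversion_dict()
    print(sorted(conversions.keys()))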

wikirec.data_utils.download_wiki(language='en', target_dir='wiki_dump', file_limit=-1, dump_id=False)[source]

Downloads the most recent stable Wikipedia dump for the given language if it is not already present in the target directory.

Parameters:
language : str (default=en)

The language of Wikipedia to download.

target_dir : str (default=wiki_dump)

The directory in the pwd into which files should be downloaded.

file_limit : int (default=-1, all files)

The limit for the number of files to download.

dump_id : str (default=False)

The id of an explicit Wikipedia dump that the user wants to download.

Note: a value of False will select the third from the last (latest stable dump).

Returns:
file_info : list of lists

Information on the downloaded Wikipedia dump files.
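
A usage sketch; the two-file limit below is an arbitrary illustrative choice, not a default:

    from wikirec import data_utils

    # Download the latest stable English dump into ./wiki_dump,
    # capping the number of dump files for a quick test run
    file_info = data_utils.download_wiki(
        language="en",
        target_dir="wiki_dump",
        file_limit=2,   # -1 would download all files
        dump_id=False,  # False selects the latest stable dump
    )
    print(file_info)  # list of lists describing the downloaded files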

wikirec.data_utils._process_article(title, text, templates='Infobox book')[source]

Processes a Wikipedia article, looking for the given infobox templates.

Parameters:
title : str

The title of the article.

text : str

The text to be processed.

templates : str (default=Infobox book)

The target templates for the corpus.

Returns:
title, text, wikilinks : str, str, list

The data from the article.
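
_process_article is an internal helper, but a sketch of a direct call looks like this; the wikitext below is illustrative, and the bare three-value return is assumed from the description above:

    from wikirec.data_utils import _process_article

    raw_text = "{{Infobox book | name = Example}} A plot summary with a [[wikilink]]."
    title, text, wikilinks = _process_article(
        title="Example Title", text=raw_text, templates="Infobox book"
    )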

wikirec.data_utils.parse_to_ndjson(topics='books', language='en', output_path='topic_articles', input_dir='wikipedia_dump', partitions_dir='partitions', limit=None, delete_parsed_files=False, multicore=True, verbose=True)[source]

Finds all Wikipedia entries for the given topics and converts them to JSON files.

Parameters:
topics : str (default=books)

The topics that articles should be subset by.

Note: this corresponds to the type of infobox from Wikipedia articles.

language : str (default=en)

The language of Wikipedia that articles are being parsed for.

output_path : str (default=topic_articles)

The name of the final output ndjson file.

input_dir : str (default=wikipedia_dump)

The path to the directory where the data is stored.

partitions_dir : str (default=partitions)

The path to the directory where the parsed partition files should be stored before being combined.

limit : int (default=None)

An optional limit of the number of topic articles per dump file to find.

delete_parsed_files : bool (default=False)

Whether to delete the separate parsed files after combining them.

multicore : bool (default=True)

Whether to use multicore processing.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the processes.

Returns:
Wikipedia dump files parsed for the given template types and converted to JSON files.
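
A sketch of the typical call after download_wiki; input_dir is pointed at the directory used in the download step above rather than at the default:

    from wikirec import data_utils

    # Parse the downloaded dump for book articles and write them to an ndjson file
    data_utils.parse_to_ndjson(
        topics="books",
        language="en",
        output_path="topic_articles",
        input_dir="wiki_dump",      # where download_wiki placed the dump files
        partitions_dir="partitions",
        limit=None,                 # no per-file article limit
        delete_parsed_files=True,   # clean up the intermediate partition files
        multicore=True,
        verbose=True,
    )
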

wikirec.data_utils._combine_tokens_to_str(tokens)[source]

Combines the texts into one string.

Parameters:
tokens : str or list

The texts to be combined.

Returns:
texts_str : str

A string of the full text with unwanted words removed.
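
An illustrative call to this internal helper; the space-joined output shown in the comment is an assumption:

    from wikirec.data_utils import _combine_tokens_to_str

    tokens = ["fantasy", "novel", "dragons"]
    texts_str = _combine_tokens_to_str(tokens)  # e.g. "fantasy novel dragons"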

wikirec.data_utils._lower_remove_unwanted(args)[source]

Lowercases tokens and removes numbers and, optionally, names.

Parameters:
args : list of tuples

The following arguments zipped.

text : list

The text to clean.

remove_names : bool

Whether to remove names.

words_to_ignore : str or list

Strings that should be removed from the text body.

stop_words : str or list

Stopwords for the given language.

Returns:
text_lower : list

The text with lowercased tokens and without unwanted tokens.
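
Since args bundles the arguments listed above into a single zipped tuple, a direct call would look roughly as follows; the exact packing order is an assumption taken from that list:

    from wikirec.data_utils import _lower_remove_unwanted

    text = ["The", "Hobbit", "1937", "Tolkien"]
    args = (
        text,                # text : the tokens to clean
        False,               # remove_names
        [],                  # words_to_ignore
        ["the", "a", "an"],  # stop_words (illustrative, not a full stopword list)
    )
    text_lower = _lower_remove_unwanted(args)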

wikirec.data_utils._lemmatize(tokens, nlp=None, verbose=True)[source]

Lemmatizes tokens.

Parameters:
tokens : list or list of lists

Tokens to be lemmatized.

nlp : spacy.load object (default=None)

A spacy language model.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
lemmatized_tokens : list or list of lists

Tokens that have been lemmatized for nlp analysis.
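
A sketch using a spaCy pipeline; the en_core_web_sm model is an assumed choice and must be installed separately:

    import spacy

    from wikirec.data_utils import _lemmatize

    nlp = spacy.load("en_core_web_sm")  # any spaCy model for the target language
    tokens = [["the", "books", "were", "recommended"], ["stories", "about", "dragons"]]
    lemmatized_tokens = _lemmatize(tokens, nlp=nlp, verbose=True)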

wikirec.data_utils._subset_and_combine_tokens(args)[source]

Subsets a text to a maximum length and combines it into a string.

Parameters:
args : list of tuples

The following arguments zipped.

text : list

The list of tokens to be subset and combined.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

Returns:
sub_comb_text : tuple

An index and its combined text.

wikirec.data_utils.clean(texts, language='en', min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, remove_names=False, sample_size=1, verbose=True)[source]

Cleans the text bodies to prepare them for analysis.

Parameters:
texts : str or list

The texts to be cleaned and tokenized.

language : str (default=en)

The language of the texts being cleaned.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

remove_names : bool (default=False)

Whether to remove common names.

sample_size : float (default=1)

The proportion of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
text_corpus, selected_idxs : list, list

The texts formatted for text analysis as well as the indexes for selected entries.
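
A sketch of cleaning a small corpus of raw texts; the example strings stand in for article texts loaded from the parsed ndjson file:

    from wikirec import data_utils

    texts = [
        "The Hobbit is a children's fantasy novel by J. R. R. Tolkien.",
        "Dune is a 1965 epic science fiction novel by Frank Herbert.",
    ]

    text_corpus, selected_idxs = data_utils.clean(
        texts,
        language="en",
        min_token_freq=2,
        min_token_len=3,
        min_tokens=0,
        max_token_index=-1,
        min_ngram_count=3,
        remove_stopwords=True,
        ignore_words=None,
        remove_names=False,
        sample_size=1,
        verbose=True,
    )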

class wikirec.data_utils.WikiXmlHandler[source]

Parses XML data using SAX.

characters(content)[source]

Handles character data between opening and closing tags.

startElement(name, attrs)[source]

Handles the opening tag of an element.

endElement(name)[source]

Handles the closing tag of an element.
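
WikiXmlHandler plugs into Python's built-in SAX machinery; the sketch below feeds a decompressed dump file line by line. How parsed articles are exposed on the handler is not documented here, so that step is omitted, and the file name is illustrative:

    import xml.sax

    from wikirec.data_utils import WikiXmlHandler

    handler = WikiXmlHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # The parser calls startElement, characters and endElement on the handler
    # for every tag in the stream.
    with open("enwiki-latest-pages-articles1.xml", encoding="utf-8") as f:
        for line in f:
            parser.feed(line)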