
The data_utils module provides functions for data loading and parsing.




A dictionary of argument conversions for commonly recommended articles.

wikirec.data_utils.download_wiki(language='en', target_dir='wiki_dump', file_limit=-1, dump_id=False)

Downloads the most recent stable dump of Wikipedia for the given language (English by default) if it is not already in the target directory within the present working directory.

language : str (default=en)

The language of Wikipedia to download.

target_dir : str (default=wiki_dump)

The directory in the pwd into which files should be downloaded.

file_limit : int (default=-1, all files)

The limit for the number of files to download.

dump_id : str (default=False)

The id of an explicit Wikipedia dump that the user wants to download.

Note: a value of False selects the third-from-last dump, which is treated as the latest stable dump.

file_info : list of lists

Information on the downloaded Wikipedia dump files.
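The dump_id fallback described in the note above can be illustrated with a small sketch. The helper `select_dump_id` and the sample ids are hypothetical and not part of wikirec; the sketch only assumes that dump ids are date strings that sort chronologically.

```python
def select_dump_id(available_ids, dump_id=False):
    """Pick a dump id: an explicit one if given, else the third from the last.

    Hypothetical helper for illustration only; not part of wikirec.
    """
    if dump_id:
        if dump_id not in available_ids:
            raise ValueError(f"Unknown dump id: {dump_id}")
        return dump_id
    if len(available_ids) < 3:
        raise ValueError("Need at least three dumps to pick the third from the last.")
    # Step back to the third-from-last entry, which the documentation
    # treats as the latest stable dump.
    return sorted(available_ids)[-3]


ids = ["20210101", "20210120", "20210201", "20210220"]
default_choice = select_dump_id(ids)
explicit_choice = select_dump_id(ids, dump_id="20210101")
print(default_choice, explicit_choice)
```

Passing an explicit id bypasses the fallback entirely, which mirrors how the dump_id argument is documented.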

wikirec.data_utils._process_article(title, text, templates='Infobox book')

Processes a Wikipedia article, looking for the given infobox templates.

title : str

The title of the article.

text : str

The text to be processed.

templates : str (default=Infobox book)

The target templates for the corpus.

title, text, wikilinks: string, string, list

The data from the article.
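As a rough illustration of what looking for an infobox template and collecting wikilinks involves, here is a simplified sketch based on regular expressions. It is a stand-in written for this page, not the wikirec implementation, and the name `process_article_sketch` and the sample article are hypothetical.

```python
import re


def process_article_sketch(title, text, templates="Infobox book"):
    """Return (title, text, wikilinks) if a target template is used, else None.

    Simplified regex stand-in for illustration; not the wikirec implementation.
    """
    if isinstance(templates, str):
        templates = [templates]

    # A template call starts with "{{" followed by the template name.
    has_template = any(
        re.search(r"\{\{\s*" + re.escape(t) + r"\b", text, flags=re.IGNORECASE)
        for t in templates
    )
    if not has_template:
        return None

    # Wikilinks look like [[Target]] or [[Target|label]]; keep only the target.
    wikilinks = [m.strip() for m in re.findall(r"\[\[([^\]|#]+)", text)]
    return title, text, wikilinks


sample = (
    "{{Infobox book\n| name = Dune\n| author = [[Frank Herbert]]\n}}\n"
    "'''Dune''' is a 1965 [[science fiction|sci-fi]] novel."
)
result = process_article_sketch("Dune", sample)
skipped = process_article_sketch("Alien", "{{Infobox film}} A 1979 film.")
print(result[2], skipped)
```

Articles without a matching template return None here so that a caller can filter them out; whether wikirec signals a miss the same way is an assumption.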

wikirec.data_utils.parse_to_ndjson(topics='books', language='en', output_path='topic_articles', input_dir='wikipedia_dump', partitions_dir='partitions', limit=None, delete_parsed_files=False, multicore=True, verbose=True)

Finds all Wikipedia entries for the given topics and converts them to json files.

topics : str (default=books)

The topics that articles should be subset by.

Note: this corresponds to the type of infobox from Wikipedia articles.

language : str (default=en)

The language of Wikipedia that articles are being parsed for.

output_path : str (default=topic_articles)

The name of the final output ndjson file.

input_dir : str (default=wikipedia_dump)

The path to the directory where the data is stored.

partitions_dir : str (default=partitions)

The path to the directory where the output should be stored.

limit : int (default=None)

An optional limit of the number of topic articles per dump file to find.

delete_parsed_files : bool (default=False)

Whether to delete the separate parsed files after combining them.

multicore : bool (default=True)

Whether to use multicore processing.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the processes.

Wikipedia dump files parsed for the given template types and converted to json files.
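The ndjson output format stores one JSON document per line, which makes it easy to write and read article by article. The sketch below uses only the standard library; the record layout of `[title, text, wikilinks]` is an assumption based on the return value of `_process_article`, and the sample articles are made up.

```python
import json
import os
import tempfile

# Hypothetical records, assumed to follow the [title, text, wikilinks] layout.
articles = [
    ["Dune", "Dune is a 1965 novel.", ["Frank Herbert"]],
    ["Emma", "Emma is an 1815 novel.", ["Jane Austen"]],
]

path = os.path.join(tempfile.mkdtemp(), "topic_articles.ndjson")

# Write one JSON document per line.
with open(path, "w", encoding="utf-8") as f:
    for article in articles:
        f.write(json.dumps(article) + "\n")

# Read the file back line by line, skipping any blank lines.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

titles = [article[0] for article in loaded]
print(titles)
```

Because each line is independent, partial files from separate partitions can be concatenated into one final file without re-parsing them.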

Combines the texts into one string.

tokens : str or list

The texts to be combined.


A string of the full text with unwanted words removed.


Lower cases tokens and removes numbers and possibly names.

args : list of tuples

The following arguments zipped.


The text to clean.


Whether to remove names.

words_to_ignore : str or list

Strings that should be removed from the text body.

stop_words : str or list

Stopwords for the given language.


The text with lowercased tokens and without unwanted tokens.
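The token cleaning step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: `clean_tokens_sketch` is a hypothetical name, any token containing a digit is treated as a number, and name removal is left out.

```python
def clean_tokens_sketch(tokens, words_to_ignore=(), stop_words=()):
    """Lowercase tokens and drop numbers, ignored words and stopwords.

    Hypothetical helper for illustration; name removal is not covered.
    """
    # Accept a single string as well as a list, as the argument types suggest.
    if isinstance(words_to_ignore, str):
        words_to_ignore = [words_to_ignore]
    if isinstance(stop_words, str):
        stop_words = [stop_words]

    unwanted = {w.lower() for w in words_to_ignore} | {w.lower() for w in stop_words}

    cleaned = []
    for token in tokens:
        token = token.lower()
        # Assumption: a token with any digit in it counts as a number.
        if any(ch.isdigit() for ch in token):
            continue
        if token in unwanted:
            continue
        cleaned.append(token)
    return cleaned


cleaned = clean_tokens_sketch(
    ["The", "Year", "1965", "Dune", "novel"],
    words_to_ignore=["novel"],
    stop_words=["the"],
)
print(cleaned)
```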

wikirec.data_utils._lemmatize(tokens, nlp=None, verbose=True)

Lemmatizes tokens.

tokens : list or list of lists

Tokens to be lemmatized.

nlp : spacy.load object

A spacy language model.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

lemmatized_tokens : list or list of lists

Tokens that have been lemmatized for nlp analysis.


Subsets a text by a maximum length and combines it to a string.

args : list of tuples

The following arguments zipped.


The list of tokens to be subsetted and combined.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.


An index and its combined text.
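A minimal sketch of subsetting and combining is shown below. The tuple layout `(index, tokens, max_token_index)` and the reading of -1 as "no limit" are assumptions made for this example, and `subset_and_combine_sketch` is a hypothetical name.

```python
def subset_and_combine_sketch(args):
    """Truncate a token list and join it into one string.

    Assumes args is (index, tokens, max_token_index); illustration only.
    """
    index, tokens, max_token_index = args
    # Treat -1 as "no limit"; a plain tokens[:-1] would drop the last token.
    if max_token_index != -1:
        tokens = tokens[:max_token_index]
    return index, " ".join(tokens)


texts = [["dune", "desert", "planet"], ["emma", "novel"]]
zipped = list(zip(range(len(texts)), texts, [2] * len(texts)))
results = [subset_and_combine_sketch(a) for a in zipped]
print(results)
```

Packing the arguments into one tuple lets the function be passed directly to a `map` call, including the `map` of a process pool.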

wikirec.data_utils.clean(texts, language='en', min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, remove_names=False, sample_size=1, verbose=True)

Cleans text body to prepare it for analysis.

texts : str or list

The texts to be cleaned and tokenized.

language : str (default=en)

The language of the texts to be cleaned.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

remove_names : bool (default=False)

Whether to remove common names.

sample_size : float (default=1)

The fraction of the data to be randomly sampled (1 uses all of it).

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

text_corpus, selected_idxs : list, list

The texts formatted for text analysis as well as the indexes for selected entries.
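To make the frequency and length thresholds concrete, the sketch below applies `min_token_freq`, `min_token_len` and `min_tokens` to already tokenized texts and returns both the kept texts and their original indexes. It covers only these three parameters, and `filter_corpus_sketch` is a hypothetical helper, not the body of `clean`.

```python
from collections import Counter


def filter_corpus_sketch(tokenized_texts, min_token_freq=2, min_token_len=3, min_tokens=0):
    """Filter tokens by corpus frequency and length, then texts by size.

    Hypothetical helper for illustration; not the wikirec implementation.
    """
    # Count each token across the whole corpus.
    counts = Counter(t for text in tokenized_texts for t in text)

    text_corpus, selected_idxs = [], []
    for idx, text in enumerate(tokenized_texts):
        kept = [
            t for t in text
            if counts[t] >= min_token_freq and len(t) >= min_token_len
        ]
        # Keep a text only if enough tokens survive the filters.
        if len(kept) >= min_tokens:
            text_corpus.append(kept)
            selected_idxs.append(idx)
    return text_corpus, selected_idxs


tokenized = [
    ["dune", "desert", "of", "planet"],
    ["dune", "planet", "novel"],
    ["emma", "of"],
]
text_corpus, selected_idxs = filter_corpus_sketch(tokenized, min_tokens=1)
print(text_corpus, selected_idxs)
```

Returning the selected indexes alongside the corpus lets a caller realign titles or other metadata with the texts that survived filtering.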

class wikirec.data_utils.WikiXmlHandler

Parse through XML data using SAX.


characters(content)

Characters between opening and closing tags.

startElement(name, attrs)

Opening tag of element.


endElement(name)

Closing tag of element.
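The three callbacks above are the standard hooks of a SAX content handler. The following sketch shows how they cooperate to pull page titles and texts out of an XML stream using Python's `xml.sax`. The class `PageTextHandler`, its attributes and the tiny XML sample are hypothetical and only loosely modeled on a Wikipedia dump; this is not the WikiXmlHandler source.

```python
import xml.sax


class PageTextHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from <page> elements. Illustration only."""

    def __init__(self):
        super().__init__()
        self._current_tag = None
        self._buffer = []
        self._values = {}
        self.pages = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the tag closes.
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        # Start buffering only for the tags of interest.
        if name in ("title", "text"):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        if name == self._current_tag:
            self._values[name] = "".join(self._buffer)
            self._current_tag = None
        if name == "page":
            self.pages.append((self._values.get("title"), self._values.get("text")))
            self._values = {}


xml_data = (
    b"<mediawiki><page><title>Dune</title>"
    b"<revision><text>{{Infobox book}}</text></revision></page></mediawiki>"
)
handler = PageTextHandler()
xml.sax.parseString(xml_data, handler)
print(handler.pages)
```

SAX processes the document as a stream of events, so memory use stays flat even for very large inputs, which is presumably why it suits multi-gigabyte dump files.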