data_utils
The data_utils module provides the functions needed for data loading and parsing.
Functions
wikirec.data_utils._iterate_and_parse_file()
wikirec.data_utils._clean_text_strings()
- wikirec.data_utils.input_conversion_dict()
Returns a dictionary of argument conversions for commonly recommended articles.
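A minimal usage sketch follows; the "books" lookup key is an assumption for illustration, so inspect the returned dictionary for the actual keys.

    from wikirec import data_utils

    # Retrieve the mapping of common topic arguments to their internal forms.
    conversions = data_utils.input_conversion_dict()

    # Hypothetical lookup: "books" is an assumed key, not a documented one.
    print(conversions.get("books"))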
- wikirec.data_utils.download_wiki(language='en', target_dir='wiki_dump', file_limit=-1, dump_id=False)
Downloads the most recent stable Wikipedia dump for the given language if it is not already in the specified directory within the pwd.
- Parameters:
- language : str (default=en)
The language of Wikipedia to download.
- target_dir : str (default=wiki_dump)
The directory in the pwd into which files should be downloaded.
- file_limit : int (default=-1, all files)
The limit for the number of files to download.
- dump_id : str (default=False)
The id of an explicit Wikipedia dump that the user wants to download.
Note: a value of False will select the third-from-last dump, i.e. the latest stable one.
- Returns:
- file_info : list of lists
Information on the downloaded Wikipedia dump files.
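A minimal usage sketch, assuming the package is installed and a network connection is available; the small file_limit is only to keep the example quick.

    from wikirec import data_utils

    # Download the first two files of the latest stable English dump
    # into ./wiki_dump (file_limit=-1, the default, would fetch all files).
    file_info = data_utils.download_wiki(
        language="en",
        target_dir="wiki_dump",
        file_limit=2,
        dump_id=False,
    )

    # Each element describes one downloaded dump file.
    for entry in file_info:
        print(entry)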
- wikirec.data_utils._process_article(title, text, templates='Infobox book')
Processes a Wikipedia article, looking for the given infobox templates.
- Parameters:
- title : str
The title of the article.
- text : str
The text to be processed.
- templates : str (default=Infobox book)
The target templates for the corpus.
- Returns:
- title, text, wikilinks : str, str, list
The data from the article.
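Because _process_article is a private helper invoked internally by parse_to_ndjson(), the following is only an illustrative sketch; the sample wikitext and the None check for articles without a matching template are assumptions.

    from wikirec import data_utils

    # Invented wikitext containing the target infobox template.
    sample_text = (
        "{{Infobox book\n"
        "| name = Example Title\n"
        "| author = Example Author\n"
        "}}\n"
        "The book links to [[Another Article]] in its body."
    )

    result = data_utils._process_article("Example Title", sample_text)
    if result is not None:  # assumed behavior for non-matching articles
        title, text, wikilinks = result
        print(title, wikilinks)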
- wikirec.data_utils.parse_to_ndjson(topics='books', language='en', output_path='topic_articles', input_dir='wikipedia_dump', partitions_dir='partitions', limit=None, delete_parsed_files=False, multicore=True, verbose=True)
Finds all Wikipedia entries for the given topics and converts them to JSON files.
- Parameters:
- topics : str (default=books)
The topics that articles should be subset by.
Note: this corresponds to the type of infobox from Wikipedia articles.
- language : str (default=en)
The language of Wikipedia that articles are being parsed for.
- output_path : str (default=topic_articles)
The name of the final output ndjson file.
- input_dir : str (default=wikipedia_dump)
The path to the directory where the data is stored.
- partitions_dir : str (default=partitions)
The path to the directory where the partitioned files should be stored.
- limit : int (default=None)
An optional limit of the number of topic articles per dump file to find.
- delete_parsed_files : bool (default=False)
Whether to delete the separate parsed files after combining them.
- multicore : bool (default=True)
Whether to use multiprocessing.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the processes.
- Returns:
- Wikipedia dump files parsed for the given template types and converted to JSON files.
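A sketch of a full parsing run, assuming the dump was downloaded into wiki_dump as in the download_wiki() example; the .ndjson extension on output_path is an assumption.

    from wikirec import data_utils

    # Parse the downloaded dump for book articles and combine the
    # results into a single ndjson file.
    data_utils.parse_to_ndjson(
        topics="books",
        language="en",
        output_path="topic_articles.ndjson",
        input_dir="wiki_dump",  # matches the target_dir used above
        partitions_dir="partitions",
        limit=None,
        delete_parsed_files=False,
        multicore=True,
        verbose=True,
    )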
- wikirec.data_utils._combine_tokens_to_str(tokens)
Combines the texts into one string.
- Parameters:
- tokens : str or list
The texts to be combined.
- Returns:
- texts_str : str
A string of the full text with unwanted words removed.
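An illustrative sketch of a direct call to this private helper, which is presumably invoked internally during cleaning.

    from wikirec import data_utils

    tokens = ["picture", "dorian", "gray"]

    # Assumed to join the tokens into a single string.
    texts_str = data_utils._combine_tokens_to_str(tokens)
    print(texts_str)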
- wikirec.data_utils._lower_remove_unwanted(args)
Lowercases tokens and removes numbers and, optionally, names.
- Parameters:
- args : list of tuples
The following arguments zipped.
- text : list
The text to clean.
- remove_names : bool
Whether to remove names.
- words_to_ignore : str or list
Strings that should be removed from the text body.
- stop_words : str or list
Stopwords for the given language.
- Returns:
- text_lower : list
The text with lowercased tokens and without unwanted tokens.
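Because the helper consumes its arguments as a single zipped tuple (a pattern suited to multiprocessing), the sketch below assembles that tuple by hand; the tuple order follows the parameter list above and is otherwise an assumption.

    from wikirec import data_utils

    text = ["The", "Picture", "of", "Dorian", "Gray", "1890"]
    remove_names = False
    words_to_ignore = ["gray"]
    stop_words = ["the", "of"]

    # Bundle the arguments in the documented order (assumed).
    args = (text, remove_names, words_to_ignore, stop_words)
    text_lower = data_utils._lower_remove_unwanted(args)
    print(text_lower)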
- wikirec.data_utils._lemmatize(tokens, nlp=None, verbose=True)
Lemmatizes tokens.
- Parameters:
- tokens : list or list of lists
Tokens to be lemmatized.
- nlp : spacy.load object (default=None)
A spaCy language model.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- lemmatized_tokens : list or list of lists
Tokens that have been lemmatized for nlp analysis.
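A sketch of lemmatizing a nested token list, assuming the small English spaCy pipeline is installed (python -m spacy download en_core_web_sm).

    import spacy
    from wikirec import data_utils

    # Load a spaCy language model to pass to the helper.
    nlp = spacy.load("en_core_web_sm")

    tokens = [["pictures", "painted", "artists"]]
    lemmatized_tokens = data_utils._lemmatize(tokens, nlp=nlp, verbose=False)
    print(lemmatized_tokens)  # e.g. [["picture", "paint", "artist"]]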
- wikirec.data_utils._subset_and_combine_tokens(args)
Subsets a text to a maximum length and combines it into a string.
- Parameters:
- args : list of tuples
The following arguments zipped.
- text : list
The list of tokens to be subset and combined.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- Returns:
- sub_comb_text : tuple
An index and its combined text.
- wikirec.data_utils.clean(texts, language='en', min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, remove_names=False, sample_size=1, verbose=True)
Cleans a text body to prepare it for analysis.
- Parameters:
- texts : str or list
The texts to be cleaned and tokenized.
- language : str (default=en)
The language of the texts to be cleaned.
- min_token_freq : int (default=2)
The minimum allowable frequency of a word inside the corpus.
- min_token_len : int (default=3)
The smallest allowable length of a word.
- min_tokens : int (default=0)
The minimum allowable length of a tokenized text.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- min_ngram_count : int (default=3)
The minimum occurrences for an n-gram to be included.
- remove_stopwords : bool (default=True)
Whether to remove stopwords.
- ignore_words : str or list (default=None)
Strings that should be removed from the text body.
- remove_names : bool (default=False)
Whether to remove common names.
- sample_size : float (default=1)
The proportion of the data to be randomly sampled.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- text_corpus, selected_idxs : list, list
The texts formatted for text analysis as well as the indexes for selected entries.
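A minimal end-to-end sketch, assuming parse_to_ndjson() wrote topic_articles.ndjson and that each record is a [title, text, wikilinks] list as returned by _process_article() above.

    import json

    from wikirec import data_utils

    # Load the parsed articles and pull out their text bodies.
    with open("topic_articles.ndjson", "r") as f:
        articles = [json.loads(line) for line in f]
    titles = [article[0] for article in articles]
    texts = [article[1] for article in articles]

    # Clean and tokenize the texts for downstream analysis.
    text_corpus, selected_idxs = data_utils.clean(
        texts,
        language="en",
        min_token_freq=2,
        min_token_len=3,
        remove_stopwords=True,
        verbose=True,
    )

    # Map the cleaned corpus back to the articles that survived cleaning.
    selected_titles = [titles[i] for i in selected_idxs]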