model¶
The model
module provides functions for modeling text corpora and delivering recommendations.
Functions
- wikirec.model.gen_embeddings(method='bert', corpus=None, bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', path_to_json=None, path_to_embedding_model='wikilink_embedding_model', embedding_size=75, epochs=20, verbose=True, **kwargs)[source]¶
Generates embeddings given a modeling method and text corpus.
- Parameters
- method : str (default=bert)
The modeling method.
- Options:
BERT: Bidirectional Encoder Representations from Transformers
Word embeddings are derived via Google Neural Networks.
Embeddings are then used to derive similarities.
Doc2vec: Document to Vector
An entire document is converted to a vector.
Based on word2vec, but maintains the document context.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
Individual entries are then classified based on the percentage to which they fall into each category.
TFIDF: Term Frequency Inverse Document Frequency
Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.
These importances are then vectorized and used to relate documents.
WikilinkNN: Wikilinks Neural Network
Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.
- corpus : list of lists (default=None)
The text corpus over which analysis should be done.
- bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)
The BERT model to use.
- path_to_json : str (default=None)
The path to the parsed json file.
- path_to_embedding_model : str (default=wikilink_embedding_model)
The name of the embedding model to load or create.
- embedding_size : int (default=75)
The length of the embedding vectors between the articles and the links.
- epochs : int (default=20)
The number of modeling iterations through the training dataset.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the model creation.
- **kwargs : keyword arguments
Arguments corresponding to sentence_transformers.SentenceTransformer.encode, gensim.models.doc2vec.Doc2Vec, gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer.
- Returns
- embeddings : np.ndarray
Embeddings to be used to create article-article similarity matrices.
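To illustrate what the TFIDF option produces, here is a minimal pure-Python sketch of TF-IDF document embeddings. This is not wikirec's actual implementation (which delegates to sklearn.feature_extraction.text.TfidfVectorizer); the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_embeddings(corpus):
    """Toy TF-IDF: one dense vector per tokenized document."""
    vocab = sorted({tok for doc in corpus for tok in doc})
    n_docs = len(corpus)
    # Document frequency: how many documents contain each term
    df = {t: sum(1 for doc in corpus if t in doc) for t in vocab}
    embeddings = []
    for doc in corpus:
        counts = Counter(doc)
        # Term frequency scaled by inverse document frequency
        vec = [
            (counts[t] / len(doc)) * math.log(n_docs / df[t])
            for t in vocab
        ]
        embeddings.append(vec)
    return vocab, embeddings

corpus = [
    ["wizard", "school", "magic"],
    ["wizard", "ring", "quest"],
    ["detective", "murder", "case"],
]
vocab, emb = tfidf_embeddings(corpus)
```

Each row of `emb` is an embedding; terms absent from a document get weight 0, and terms shared by more documents are down-weighted.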
- wikirec.model.gen_sim_matrix(method='bert', metric='cosine', embeddings=None)[source]¶
Derives a similarity matrix from document embeddings.
- Parameters
- method : str (default=bert)
The modeling method.
- Options:
BERT: Bidirectional Encoder Representations from Transformers
Word embeddings are derived via Google Neural Networks.
Embeddings are then used to derive similarities.
Doc2vec: Document to Vector
An entire document is converted to a vector.
Based on word2vec, but maintains the document context.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
Individual entries are then classified based on the percentage to which they fall into each category.
TFIDF: Term Frequency Inverse Document Frequency
Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.
These importances are then vectorized and used to relate documents.
WikilinkNN: Wikilinks Neural Network
Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.
- metric : str (default=cosine)
The metric to be used when comparing vectorized corpus entries.
Note: options include cosine and euclidean.
- Returns
- sim_matrix : gensim.interfaces.TransformedCorpus or numpy.ndarray
The similarity matrix for the corpus from the given model.
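With metric=cosine, each entry of the matrix is the cosine of the angle between two embedding vectors. A pure-Python sketch of the computation (wikirec's actual implementation uses vectorized backends; this function is illustrative only):

```python
import math

def cosine_sim_matrix(embeddings):
    """Pairwise cosine similarity between embedding row vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Symmetric matrix with 1.0 on the diagonal (each vector vs. itself)
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
sim_matrix = cosine_sim_matrix(embeddings)
```

The resulting matrix is symmetric, with each diagonal entry equal to 1 (every document is maximally similar to itself).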
- wikirec.model.recommend(inputs=None, ratings=None, titles=None, sim_matrix=None, metric='cosine', n=10)[source]¶
Recommends similar items given an input or list of inputs of interest.
- Parameters
- inputs : str or list (default=None)
The name of an item or items of interest.
- ratings : list (default=None)
A list of ratings that correspond to each input.
Note: len(ratings) must equal len(inputs).
- titles : list (default=None)
The titles of the articles.
- sim_matrix : gensim.interfaces.TransformedCorpus or np.ndarray (default=None)
The similarity matrix for the corpus from the given model.
- n : int (default=10)
The number of items to recommend.
- metric : str (default=cosine)
The metric to be used when comparing vectorized corpus entries.
Note: options include cosine and euclidean.
- Returns
- recommendations : list of lists
The items most similar to the inputs, with their similarity scores.
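The core idea of recommend can be sketched in a few lines: look up the similarity-matrix rows for the inputs, average them (weighted by ratings when given), drop the inputs themselves, and return the top n titles. This is a simplified re-implementation of the idea under those assumptions, not wikirec's exact code.

```python
def recommend(inputs, titles, sim_matrix, ratings=None, n=10):
    """Rank titles by their (rating-weighted) average similarity to the inputs."""
    if isinstance(inputs, str):
        inputs = [inputs]
    if ratings is None:
        ratings = [1.0] * len(inputs)  # unrated inputs count equally
    idxs = [titles.index(t) for t in inputs]
    total_weight = sum(ratings)
    scores = []
    for j, title in enumerate(titles):
        if j in idxs:
            continue  # never recommend an input back to the user
        score = sum(r * sim_matrix[i][j] for i, r in zip(idxs, ratings)) / total_weight
        scores.append([title, score])
    # Highest similarity first, truncated to the top n items
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

titles = ["Harry Potter", "The Hobbit", "Dracula"]
sim_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
recs = recommend("Harry Potter", titles, sim_matrix, n=2)
```

With multiple inputs, passing ratings (one per input, as the docstring requires) biases the averaged similarity rows toward the items the user rated highly.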