model

The model module provides functions for modeling text corpora and delivering recommendations.

Functions

wikirec.model.gen_embeddings(method='bert', corpus=None, bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', path_to_json=None, path_to_embedding_model='wikilink_embedding_model', embedding_size=75, epochs=20, verbose=True, **kwargs)

Generates embeddings given a modeling method and text corpus.

Parameters:
method : str (default=bert)

The modelling method.

Options:

BERT: Bidirectional Encoder Representations from Transformers

  • Word embeddings are derived via Google's neural networks.

  • Embeddings are then used to derive similarities.

Doc2vec: Document to Vector

  • An entire document is converted to a vector.

  • Based on word2vec, but maintains the document context.

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • Individual entries are then classified based on the percentage to which they fall into each category.

TFIDF: Term Frequency Inverse Document Frequency

  • Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.

  • These importances are then vectorized and used to relate documents.

WikilinkNN: Wikilinks Neural Network

  • Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.

corpus : list of lists (default=None)

The text corpus over which analysis should be done.

bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)

The BERT model to use.

path_to_json : str (default=None)

The path to the parsed json file.

path_to_embedding_model : str (default=wikilink_embedding_model)

The name of the embedding model to load or create.

embedding_size : int (default=75)

The length of the embedding vectors for the articles and their wikilinks.

epochs : int (default=20)

The number of modeling iterations through the training dataset.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the model creation.

**kwargs : keyword arguments

Arguments corresponding to sentence_transformers.SentenceTransformer.encode, gensim.models.doc2vec.Doc2Vec, gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer.

Returns:
embeddings : np.ndarray

Embeddings to be used to create article-article similarity matrices.
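
A minimal usage sketch is shown below. The text_corpus placeholder is hypothetical and simply follows the corpus parameter's list-of-lists description; it is not part of the library's documentation.

    from wikirec import model

    # Hypothetical corpus following the "list of lists" description above:
    # one list of tokens per article.
    text_corpus = [
        ["harry", "potter", "philosopher", "stone", "wizard", "hogwarts"],
        ["hobbit", "bilbo", "baggins", "dwarves", "dragon", "ring"],
    ]

    # Uses the documented defaults: method="bert" with the
    # xlm-r-bert-base-nli-stsb-mean-tokens sentence-transformers model.
    embeddings = model.gen_embeddings(
        method="bert",
        corpus=text_corpus,
        bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    )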

wikirec.model.gen_sim_matrix(method='bert', metric='cosine', embeddings=None)

Derives a similarity matrix from document embeddings.

Parameters:
method : str (default=bert)

The modelling method.

Options:

BERT: Bidirectional Encoder Representations from Transformers

  • Word embeddings are derived via Google's neural networks.

  • Embeddings are then used to derive similarities.

Doc2vec: Document to Vector

  • An entire document is converted to a vector.

  • Based on word2vec, but maintains the document context.

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • Individual entries are then classified based on the percentage to which they fall into each category.

TFIDF: Term Frequency Inverse Document Frequency

  • Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.

  • These importances are then vectorized and used to relate documents.

WikilinkNN: Wikilinks Neural Network

  • Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.

metric : str (default=cosine)

The metric to be used when comparing vectorized corpus entries.

Note: options include cosine and euclidean.

Returns:
sim_matrix : gensim.interfaces.TransformedCorpus or numpy.ndarray

The similarity matrix for the corpus from the given model.
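
A minimal sketch follows, using a randomly generated array as a stand-in for the embeddings returned by gen_embeddings.

    import numpy as np

    from wikirec import model

    # Illustrative stand-in: in practice this is the np.ndarray
    # returned by model.gen_embeddings (see above).
    embeddings = np.random.rand(3, 8)

    sim_matrix = model.gen_sim_matrix(
        method="bert",
        metric="cosine",
        embeddings=embeddings,
    )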

wikirec.model.recommend(inputs=None, ratings=None, titles=None, sim_matrix=None, metric='cosine', n=10)

Recommends similar items given an input or list of inputs of interest.

Parameters:
inputs : str or list (default=None)

The name of an item or items of interest.

ratings : list (default=None)

A list of ratings that correspond to each input.

Note: len(ratings) must equal len(inputs).

titles : list (default=None)

The titles of the articles.

sim_matrix : gensim.interfaces.TransformedCorpus or np.ndarray (default=None)

The similarity matrix for the corpus from the given model.

n : int (default=10)

The number of items to recommend.

metric : str (default=cosine)

The metric to be used when comparing vectorized corpus entries.

Note: options include cosine and euclidean.

Returns:
recommendations : list of lists

The items that are most similar to the inputs, along with their similarity scores.
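
The sketch below is illustrative only: the titles list and the hand-written sim_matrix are hypothetical stand-ins for real article titles and for the output of gen_sim_matrix.

    import numpy as np

    from wikirec import model

    # Hypothetical titles, aligned with the rows of sim_matrix.
    titles = ["Harry Potter and the Philosopher's Stone", "The Hobbit", "Dune"]

    # Illustrative similarity matrix: in practice the output of model.gen_sim_matrix.
    sim_matrix = np.array([
        [1.0, 0.3, 0.2],
        [0.3, 1.0, 0.5],
        [0.2, 0.5, 1.0],
    ])

    # Returns the n items most similar to the input along with their
    # similarity scores, per the return description above.
    recommendations = model.recommend(
        inputs="The Hobbit",
        titles=titles,
        sim_matrix=sim_matrix,
        metric="cosine",
        n=2,
    )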