model¶
The model
module provides functions for modeling text corpora and delivering recommendations.
Functions
- wikirec.model.gen_embeddings(method='bert', corpus=None, bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', path_to_json=None, path_to_embedding_model='wikilink_embedding_model', embedding_size=75, epochs=20, verbose=True, **kwargs)[source]¶
Generates embeddings given a modeling method and text corpus.
- Parameters
- method : str (default=bert)
The modeling method.
- Options:
BERT: Bidirectional Encoder Representations from Transformers
Word embeddings are derived via Google Neural Networks.
Embeddings are then used to derive similarities.
Doc2vec: Document to Vector
An entire document is converted to a vector.
Based on word2vec, but maintains the document context.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
Individual entries are then classified based on the percentage to which they fall into each category.
TFIDF: Term Frequency Inverse Document Frequency
Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.
These importances are then vectorized and used to relate documents.
WikilinkNN: Wikilinks Neural Network
Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.
- corpus : list of lists (default=None)
The text corpus over which analysis should be done.
- bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)
The BERT model to use.
- path_to_json : str (default=None)
The path to the parsed json file.
- path_to_embedding_model : str (default=wikilink_embedding_model)
The name of the embedding model to load or create.
- embedding_size : int (default=75)
The length of the embedding vectors between the articles and the links.
- epochs : int (default=20)
The number of modeling iterations through the training dataset.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the model creation.
- **kwargs : keyword arguments
Arguments corresponding to sentence_transformers.SentenceTransformer.encode, gensim.models.doc2vec.Doc2Vec, gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer.
- Returns
- embeddings : np.ndarray
Embeddings to be used to create article-article similarity matrices.
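To illustrate what the TFIDF option produces, here is a minimal pure-Python sketch of TF-IDF document embeddings. This is not wikirec's actual implementation (which delegates to sklearn.feature_extraction.text.TfidfVectorizer); the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_embeddings(corpus):
    """Toy TF-IDF: one dense vector per tokenized document."""
    vocab = sorted({tok for doc in corpus for tok in doc})
    n_docs = len(corpus)
    # Document frequency: how many documents contain each term
    df = {t: sum(1 for doc in corpus if t in doc) for t in vocab}
    embeddings = []
    for doc in corpus:
        counts = Counter(doc)
        # Term frequency scaled by inverse document frequency
        vec = [
            (counts[t] / len(doc)) * math.log(n_docs / df[t])
            for t in vocab
        ]
        embeddings.append(vec)
    return vocab, embeddings

corpus = [
    ["wizard", "school", "magic"],
    ["wizard", "ring", "quest"],
    ["detective", "murder", "case"],
]
vocab, emb = tfidf_embeddings(corpus)
```

Each row of `emb` is an embedding; terms absent from a document get weight 0, and terms shared by more documents are down-weighted.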
- wikirec.model.gen_sim_matrix(method='bert', metric='cosine', embeddings=None)[source]¶
Derives a similarity matrix from document embeddings.
- Parameters
- method : str (default=bert)
The modeling method.
- Options:
BERT: Bidirectional Encoder Representations from Transformers
Word embeddings are derived via Google Neural Networks.
Embeddings are then used to derive similarities.
Doc2vec: Document to Vector
An entire document is converted to a vector.
Based on word2vec, but maintains the document context.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
Individual entries are then classified based on the percentage to which they fall into each category.
TFIDF: Term Frequency Inverse Document Frequency
Word importance increases proportionally to the number of times a word appears in the document while being offset by the number of documents in the corpus that contain the word.
These importances are then vectorized and used to relate documents.
WikilinkNN: Wikilinks Neural Network
Generate embeddings using a neural network trained on the connections between articles and their internal wikilinks.
- metric : str (default=cosine)
The metric to be used when comparing vectorized corpus entries.
Note: options include cosine and euclidean.
- Returns
- sim_matrix : gensim.interfaces.TransformedCorpus or numpy.ndarray
The similarity matrix for the corpus from the given model.
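With metric=cosine, each entry of the matrix is the cosine of the angle between two embedding vectors. A pure-Python sketch of the computation (wikirec's actual implementation uses vectorized backends; this function is illustrative only):

```python
import math

def cosine_sim_matrix(embeddings):
    """Pairwise cosine similarity between embedding row vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Symmetric matrix with 1.0 on the diagonal (each vector vs. itself)
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
sim_matrix = cosine_sim_matrix(embeddings)
```

The resulting matrix is symmetric, with each diagonal entry equal to 1 (every document is maximally similar to itself).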
- wikirec.model.recommend(inputs=None, ratings=None, titles=None, sim_matrix=None, metric='cosine', n=10)[source]¶
Recommends similar items given an input or list of inputs of interest.
- Parameters
- inputs : str or list (default=None)
The name of an item or items of interest.
- ratings : list (default=None)
A list of ratings that correspond to each input.
Note: len(ratings) must equal len(inputs).
- titles : list (default=None)
The titles of the articles.
- sim_matrix : gensim.interfaces.TransformedCorpus or np.ndarray (default=None)
The similarity matrix for the corpus from the given model.
- n : int (default=10)
The number of items to recommend.
- metric : str (default=cosine)
The metric to be used when comparing vectorized corpus entries.
Note: options include cosine and euclidean.
- Returns
- recommendations : list of lists
The items most similar to the inputs, with their similarity scores.
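The core idea of recommend can be sketched in a few lines: look up the similarity-matrix rows for the inputs, average them (weighted by ratings when given), drop the inputs themselves, and return the top n titles. This is a simplified re-implementation of the idea under those assumptions, not wikirec's exact code.

```python
def recommend(inputs, titles, sim_matrix, ratings=None, n=10):
    """Rank titles by their (rating-weighted) average similarity to the inputs."""
    if isinstance(inputs, str):
        inputs = [inputs]
    if ratings is None:
        ratings = [1.0] * len(inputs)  # unrated inputs count equally
    idxs = [titles.index(t) for t in inputs]
    total_weight = sum(ratings)
    scores = []
    for j, title in enumerate(titles):
        if j in idxs:
            continue  # never recommend an input back to the user
        score = sum(r * sim_matrix[i][j] for i, r in zip(idxs, ratings)) / total_weight
        scores.append([title, score])
    # Highest similarity first, truncated to the top n items
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

titles = ["Harry Potter", "The Hobbit", "Dracula"]
sim_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
recs = recommend("Harry Potter", titles, sim_matrix, n=2)
```

With multiple inputs, passing ratings (one per input, as the docstring requires) biases the averaged similarity rows toward the items the user rated highly.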