To perform tf-idf weighting in scikit-learn, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module.
The TfidfVectorizer class converts a collection of raw documents into a matrix of tf-idf features. It works by tokenizing the documents, computing the term frequency (tf) and inverse document frequency (idf) of each term, and then multiplying the tf and idf values to get the tf-idf weight of each term. Note that by default scikit-learn uses a smoothed idf and L2-normalizes each document vector, so the values differ slightly from the textbook tf × idf product.
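If you want to see exactly what weighting is applied, the following is a minimal sketch that reproduces the default behavior by hand. It assumes the default settings (smooth_idf=True, sublinear_tf=False, norm='l2') and uses CountVectorizer only to obtain the raw term counts:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["this is a sample", "this is another example example"]

# Raw term counts (tf); CountVectorizer shares the default tokenizer and vocabulary order
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf used by scikit-learn

tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

# Matches TfidfVectorizer with its default parameters
print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))  # True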
Here's an example of how to use the TfidfVectorizer class in scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform them into tf-idf vectors
vectors = vectorizer.fit_transform(documents)

# Print the tf-idf vectors
print(vectors.toarray())
This code will output a matrix of tf-idf vectors, with each row representing a document and each column representing a term. The values in the matrix are the tf-idf weights of the terms in the documents.
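To see which term each column corresponds to, you can ask the vectorizer for its vocabulary (get_feature_names_out is available in scikit-learn 1.0 and later; older versions use get_feature_names):

# Map column indices back to terms
print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']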
You can also customize the behavior of the TfidfVectorizer by setting various parameters, such as the tokenization function, the n-gram range, and the minimum and maximum document frequency of the terms.
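As an illustration of those parameters, here is a sketch using the same documents; the particular values chosen here are assumptions for the example, not recommended settings:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer(
    tokenizer=str.split,   # custom tokenization function: simple whitespace split
    token_pattern=None,    # silence the warning about the unused default token pattern
    ngram_range=(1, 2),    # include unigrams and bigrams
    min_df=1,              # keep terms appearing in at least 1 document
    max_df=0.9,            # drop terms appearing in more than 90% of documents
)

vectors = vectorizer.fit_transform(documents)
print(vectors.shape)  # (number of documents, number of surviving terms and bigrams)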