Complete Code for Computing Lexical Feature Words from English Text
Date: 2023-10-22 19:03:34
Below is complete Python code for computing the feature words of an English text:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

# download the necessary NLTK data and stop-word list
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# define a function to preprocess text
def preprocess_text(text):
    # convert to lowercase
    text = text.lower()
    # tokenize into words
    words = word_tokenize(text)
    # remove stop words and non-alphabetic tokens
    words = [word for word in words if word.isalpha() and word not in stop_words]
    # lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return words

# define a function to calculate feature words
def calculate_feature_words(text, num_features=50):
    # preprocess text
    words = preprocess_text(text)
    # count word frequencies
    word_count = Counter(words)
    # sort words by frequency in descending order
    sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    # take the top num_features words
    feature_words = [word for word, count in sorted_words[:num_features]]
    return feature_words

# example usage
text = "The quick brown fox jumps over the lazy dog. The dog barks at the fox."
feature_words = calculate_feature_words(text)
print(feature_words)
```
Output:
```
['fox', 'dog', 'quick', 'brown', 'jump', 'lazy', 'bark']
```
In this example, we use NLTK's tokenizer, lemmatizer, and stop-word list to preprocess the text. A simple frequency count tallies how many times each word appears in the text, the words are sorted by frequency in descending order, and the top 50 are selected as feature words. (Python's `sorted` is stable, so words with equal counts keep their first-seen order; here the text has fewer than 50 distinct words, so all of them are returned.)
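The sort-and-slice step above can also be written with `Counter.most_common`, which performs the same descending sort with ties kept in first-seen order. A minimal sketch, independent of NLTK (the `top_n_words` name and the sample token list are illustrative, not from the original code):

```python
from collections import Counter

def top_n_words(words, n=50):
    """Return the n most frequent words, ties broken by first appearance."""
    return [word for word, count in Counter(words).most_common(n)]

# tokens as preprocess_text() would produce for the example sentence
tokens = ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'dog', 'bark', 'fox']
print(top_n_words(tokens, 3))  # ['fox', 'dog', 'quick']
```

This avoids building the intermediate `sorted_words` list by hand and reads more idiomatically, while producing the same ranking as the explicit `sorted(..., reverse=True)` call.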