Complete Code for Computing Lexical Feature Words from English Text
Date: 2023-10-22 19:03:34
Below is complete Python code for computing the feature words of an English text:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

# download the necessary NLTK data and stop-word list
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# define a function to preprocess text
def preprocess_text(text):
    # convert to lowercase
    text = text.lower()
    # tokenize into words
    words = word_tokenize(text)
    # remove stop words and non-alphabetic tokens
    words = [word for word in words if word.isalpha() and word not in stop_words]
    # lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return words

# define a function to calculate feature words
def calculate_feature_words(text, num_features=50):
    # preprocess text
    words = preprocess_text(text)
    # count word frequencies
    word_count = Counter(words)
    # sort words by frequency in descending order
    sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    # take the top num_features words
    feature_words = [word for word, count in sorted_words[:num_features]]
    return feature_words

# example usage
text = "The quick brown fox jumps over the lazy dog. The dog barks at the fox."
feature_words = calculate_feature_words(text)
print(feature_words)
```
Output:
```
['fox', 'dog', 'quick', 'brown', 'jump', 'lazy', 'bark']
```
In this example, we use NLTK's tokenizer, lemmatizer, and stop-word list to preprocess the text. A simple frequency count tallies how many times each word appears in the text, the words are sorted by frequency in descending order, and the top 50 are selected as feature words. (Python's `sorted` is stable, so words with equal counts keep their first-seen order; here the text has fewer than 50 distinct words, so all of them are returned.)
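The sort-and-slice step above can also be written with `Counter.most_common`, which performs the same descending sort with ties kept in first-seen order. A minimal sketch, independent of NLTK (the `top_n_words` name and the sample token list are illustrative, not from the original code):

```python
from collections import Counter

def top_n_words(words, n=50):
    """Return the n most frequent words, ties broken by first appearance."""
    return [word for word, count in Counter(words).most_common(n)]

# tokens as preprocess_text() would produce for the example sentence
tokens = ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'dog', 'bark', 'fox']
print(top_n_words(tokens, 3))  # ['fox', 'dog', 'quick']
```

This avoids building the intermediate `sorted_words` list by hand and reads more idiomatically, while producing the same ranking as the explicit `sorted(..., reverse=True)` call.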