Preprocessing English Articles with the nltk Library in Python
1. Import the nltk library and required modules
```
import nltk
nltk.download('punkt')      # tokenizer models, needed on first use
# on newer NLTK releases you may also need: nltk.download('punkt_tab')
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')    # WordNet data, needed for the lemmatizer in step 8
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
```
2. Read in the English article
```
with open('article.txt', 'r', encoding='utf-8') as f:
article = f.read()
```
3. Convert the article to lowercase
```
article = article.lower()
```
4. Split into sentences
```
sentences = sent_tokenize(article)
```
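To see what sent_tokenize does, here is a quick standalone check on a made-up sample string (the text is purely illustrative):
```
from nltk.tokenize import sent_tokenize

sample = "Dr. Smith went to Washington. He arrived on Monday."
print(sent_tokenize(sample))
# Expected: the pretrained punkt model treats "Dr." as an abbreviation,
# not a sentence boundary:
# ['Dr. Smith went to Washington.', 'He arrived on Monday.']
```
This is why sent_tokenize is preferable to naively splitting on periods.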
5. Split into words
```
words = []
for sentence in sentences:
words.extend(word_tokenize(sentence))
```
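Note that word_tokenize splits off punctuation and contractions as separate tokens, which matters for the stop-word step below. A small illustrative check:
```
from nltk.tokenize import word_tokenize

print(word_tokenize("don't panic, it's fine."))
# Expected: contractions are split and punctuation becomes its own token:
# ['do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']
```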
6. Remove stop words
```
stop_words = set(stopwords.words('english'))  # load the English stop-word list
filtered_words = [word for word in words
                  if word.isalpha() and word not in stop_words]  # also drop punctuation/number tokens
```
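A quick sanity check on a short made-up sentence (reusing the stop_words set defined above) shows what gets dropped:
```
sample = word_tokenize("this is a simple example of the filtering step")
print([w for w in sample if w.isalpha() and w not in stop_words])
# Expected: only the content words survive:
# ['simple', 'example', 'filtering', 'step']
```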
7. Stemming
```
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
```
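The Porter stemmer chops suffixes by rule, so its output is not always a dictionary word. A few illustrative cases:
```
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["running", "flies", "studies", "easily"]:
    print(w, "->", stemmer.stem(w))
# Expected; note that "studi" and "easili" are not real words:
# running -> run
# flies -> fli
# studies -> studi
# easily -> easili
```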
8. Lemmatization (an alternative to stemming; note it is applied to filtered_words, not to the stems)
```
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
```
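Unlike the stemmer, the WordNet lemmatizer returns dictionary forms, but it treats every word as a noun by default; passing a pos tag changes the result:
```
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))              # -> 'cat' (default POS is noun)
print(lemmatizer.lemmatize("running"))           # -> 'running' (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # -> 'run' (treated as a verb)
```
For better results you can POS-tag the tokens first and map the tags to WordNet POS codes, though the simple form above is often good enough.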
9. Count word frequencies
```
from collections import Counter
word_count = Counter(lemmatized_words)
print(word_count.most_common(10)) # print the 10 most frequent words
```
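As a wrap-up, the steps above can be folded into a single helper. The function name preprocess, the isalpha filter, and the choice of lemmatization over stemming are assumptions of this sketch, not part of the original recipe:
```
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

def preprocess(text, top_n=10):
    """Lowercase, tokenize, filter stop words, lemmatize, and count."""
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    words = []
    for sentence in sent_tokenize(text.lower()):
        words.extend(word_tokenize(sentence))
    lemmas = [lemmatizer.lemmatize(w) for w in words
              if w.isalpha() and w not in stop_words]
    return Counter(lemmas).most_common(top_n)

with open('article.txt', 'r', encoding='utf-8') as f:
    print(preprocess(f.read()))
```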