URL sources for Chinese stopword text files
Here are some URLs for Chinese stopword text files; pick whichever suits your needs:
- https://github.com/goto456/stopwords
- https://github.com/fighting41love/funNLP/blob/master/data/stop_words.txt
- https://github.com/goto456/stopwords/blob/master/stopwords.txt
- https://www.cnblogs.com/baiboy/p/10630306.html
The stopword lists behind these URLs differ somewhat in coverage, so choose the one that best fits your task.
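As a quick way to pull one of these lists into Python, the sketch below fetches a raw file from the first repository. The file name `cn_stopwords.txt` is an assumption here; check the repository listing for the lists it actually ships:
```python
import requests

# Assumed file name -- the goto456/stopwords repo ships several lists.
URL = 'https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt'

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
stopwords = {line.strip() for line in resp.text.splitlines() if line.strip()}
print(f'loaded {len(stopwords)} stopwords')
```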
Related questions
Downloading a Chinese stopword text file for Python
In Python, stopwords are the common words that are usually ignored in natural-language-processing tasks, such as "的" and "是", because they contribute little to a text's meaning. If you need a Chinese stopword list, public resources include the stopword file bundled with the SnowNLP library and the Chinese list shipped with NLTK's stopwords corpus (available in recent releases of nltk_data).
**Ways to obtain a Chinese stopword file**:
1. **Using the SnowNLP library**: SnowNLP bundles a Chinese stopword list. You can download the source package from GitHub and look for the bundled stopword file (in current releases it sits under `snownlp/normal/stopwords.txt`), then load it yourself:
```python
# Load the stopword file extracted from the SnowNLP source package
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]
```
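Once loaded, filtering is just a set-membership test. A minimal sketch, assuming `jieba` for segmentation:
```python
import jieba

stop_set = set(stopwords)  # set lookup is O(1) per token
tokens = [w for w in jieba.cut('这家餐厅的菜真的很好吃') if w not in stop_set]
print(tokens)  # particles such as '的' are dropped if they appear in the list
```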
2. **NLTK**: NLTK targets English first, but recent releases of its stopwords corpus also ship a Chinese list, which you can load after downloading the corpus. Separately, if you segment with `jieba`, you can point its keyword extractor at any stopword file via `jieba.analyse.set_stop_words`:
```python
import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('stopwords')                     # fetch the corpus once
zh_stopwords = set(nltk_stopwords.words('chinese'))

# Alternatively, hand jieba's keyword extraction your own stopword file:
import jieba.analyse
jieba.analyse.set_stop_words('path/to/chinese_stopwords.txt')
```
**Related questions:**
1. How can I customize a stopword list?
2. After removing stopwords, how do I analyze the text?
3. What other text-preprocessing steps are commonly used in Python?
I have a text file in which each line is one review, plus a stopword file, a degree-adverb file, and a negation-word file. The degree-adverb and negation-word files each have two columns: the word and its score. Please write Python code that computes a sentiment value for every review in the first file (Chinese sentiment analysis), without using the NLTK library.
Sure. Below is a lexicon-based Chinese sentiment analyzer that handles negation words and degree adverbs.
First, read the degree-adverb and negation-word files into dictionaries mapping each word to its score, and the stopword file into a set:
```python
def read_file(file_path):
    """Read a two-column file (word score) into a {word: score} dict."""
    scores = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                scores[parts[0]] = float(parts[1])
    return scores

def read_wordlist(file_path):
    """Read a one-word-per-line file into a set (for stopwords)."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

stopwords = read_wordlist('stopwords.txt')
degree_words = read_file('degree_words.txt')
negative_words = read_file('negative_words.txt')
```
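As a quick sanity check of the expected file format (one word and one score per line, whitespace-separated), here is a throwaway round trip; the file name and scores are made up:
```python
# Write a tiny two-column file and read it back (illustrative values only).
with open('degree_words_demo.txt', 'w', encoding='utf-8') as f:
    f.write('非常 2.0\n有点 0.5\n')

print(read_file('degree_words_demo.txt'))  # {'非常': 2.0, '有点': 0.5}
```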
Then segment each review with jieba and drop the stopwords:
```python
import jieba

def tokenize(text):
    """Segment a review and drop stopwords."""
    return [w for w in jieba.cut(text) if w not in stopwords]
```
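For example (the exact tokens depend on your stopword list and jieba's dictionary):
```python
print(tokenize('这家餐厅很好吃,服务也很好。'))
# e.g. ['餐厅', '好吃', '服务'] once stopwords are removed
```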
Next, score each word and take an average, tracking a running modifier: a negation word flips the sign of the next sentiment word, a degree adverb scales it, and `sentiment_dict` (loaded below from `sentiment_dict.txt`) maps sentiment words to their base scores. Make sure the negation and degree words are not also in your stopword list, or `tokenize` will strip them before they can be scored:
```python
def calculate_sentiment(words):
    """Lexicon scoring: negators flip the running modifier, degree adverbs
    scale it, and the modifier resets after each sentiment-bearing word."""
    sentiment = 0.0
    count = 0
    modifier = 1.0
    for word in words:
        if word in negative_words:
            modifier *= -1                    # negation flips polarity
        elif word in degree_words:
            modifier *= degree_words[word]    # intensifier scales the next hit
        elif word in sentiment_dict:
            sentiment += sentiment_dict[word] * modifier
            count += 1
            modifier = 1.0                    # reset after scoring a word
    return sentiment / count if count else 0.0
```
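A minimal self-contained check with made-up dictionary entries shows how the modifier works:
```python
sentiment_dict = {'好吃': 0.8, '贵': -0.6}   # illustrative scores only
degree_words = {'非常': 2.0, '有点': 0.5}
negative_words = {'不': 1.0}

print(calculate_sentiment(['非常', '好吃']))  # 0.8 * 2.0 = 1.6
print(calculate_sentiment(['不', '好吃']))    # 0.8 * -1  = -0.8
print(calculate_sentiment(['有点', '贵']))    # -0.6 * 0.5 = -0.3
```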
Finally, wire these functions together to score a single review:
```python
def predict_sentiment(text):
    """Tokenize a review and return its averaged sentiment score."""
    return calculate_sentiment(tokenize(text))
```
The complete script:
```python
import jieba

def read_file(file_path):
    """Read a two-column file (word score) into a {word: score} dict."""
    scores = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                scores[parts[0]] = float(parts[1])
    return scores

def read_wordlist(file_path):
    """Read a one-word-per-line file into a set (for stopwords)."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def tokenize(text):
    """Segment a review and drop stopwords."""
    return [w for w in jieba.cut(text) if w not in stopwords]

def calculate_sentiment(words):
    """Lexicon scoring with a running negation/degree modifier."""
    sentiment = 0.0
    count = 0
    modifier = 1.0
    for word in words:
        if word in negative_words:
            modifier *= -1                    # negation flips polarity
        elif word in degree_words:
            modifier *= degree_words[word]    # intensifier scales the next hit
        elif word in sentiment_dict:
            sentiment += sentiment_dict[word] * modifier
            count += 1
            modifier = 1.0                    # reset after scoring a word
    return sentiment / count if count else 0.0

def predict_sentiment(text):
    """Tokenize a review and return its averaged sentiment score."""
    return calculate_sentiment(tokenize(text))

stopwords = read_wordlist('stopwords.txt')
degree_words = read_file('degree_words.txt')
negative_words = read_file('negative_words.txt')
sentiment_dict = read_file('sentiment_dict.txt')

text = '这家餐厅很好吃,服务也很好。但是价格有点贵。'
print(predict_sentiment(text))  # e.g. 0.525, depending on your dictionaries
```
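The original question asks for one score per review. Assuming the reviews sit one per line in a file (hypothetically named `comments.txt` here), apply `predict_sentiment` line by line:
```python
# Score every review in the comments file, one review per line.
with open('comments.txt', 'r', encoding='utf-8') as f:
    for line in f:
        review = line.strip()
        if review:
            print(f'{predict_sentiment(review):+.3f}\t{review}')
```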
Note that negative scores indicate negative sentiment, positive scores positive sentiment, and 0 neutral. The output is an average of your dictionary values scaled by degree adverbs, so its range depends on those files; with base scores in [-1, 1] and modest degree multipliers, results stay near that interval.