Tokenization with nltk and textrank
Noun extraction from short English texts with nltk
First, you need to install nltk and the sumy library (which provides a TextRank implementation).
In Python, the nltk library can be installed with the following command:
```python
!pip install nltk
```
The textrank functionality comes from the sumy library, which can be installed with the following command:
```python
!pip install sumy
```
Next, we use the nltk library to tokenize the text:
```python
import nltk
from nltk.tokenize import word_tokenize

# Download the punkt tokenizer data (needed once for word_tokenize)
nltk.download('punkt')

# Text to tokenize
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# Word-level tokenization
tokens = word_tokenize(text)
print(tokens)
```
The output is:
```
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
```
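Since the topic is noun extraction from short English texts, a minimal sketch of that step with nltk's pos_tag is shown below. It assumes the averaged_perceptron_tagger data has been downloaded and simply keeps tokens whose POS tag starts with "NN":
```python
import nltk
from nltk.tokenize import word_tokenize

# Download the POS tagger model (needed once)
nltk.download('averaged_perceptron_tagger')

text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# Tag each token with its part of speech and keep the noun tags (NN, NNS, NNP, NNPS)
tokens = word_tokenize(text)
tagged = nltk.pos_tag(tokens)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```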
Next, we use the textrank (sumy) library to tokenize the text:
```python
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
# Text to process
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."
# Parse the text into sentences
tokenizer = Tokenizer("english")
parser = PlaintextParser.from_string(text, tokenizer)
tokens = [str(sentence).strip() for sentence in parser.document.sentences]
print(tokens)
```
The output is:
```
['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.']
```
As you can see, when the text is processed with the textrank (sumy) library, the whole input is returned as a single sentence: sumy's parser splits text at the sentence level, and this input contains only one sentence. If you need sentence-level segmentation with nltk, you can use its sent_tokenize function, as in the sketch below.
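A minimal sketch of sentence-level splitting with nltk's sent_tokenize (the two-sentence example text is made up for illustration):
```python
from nltk.tokenize import sent_tokenize

# Hypothetical two-sentence text, so the split is visible
text = "NLP is a subfield of linguistics and computer science. It studies how computers process natural language."

# Sentence-level segmentation (requires the punkt data downloaded earlier)
sentences = sent_tokenize(text)
print(sentences)
# Expected (roughly): ['NLP is a subfield of linguistics and computer science.',
#                      'It studies how computers process natural language.']
```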