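Write a structured Python program: obtain corpus text from the web or the local disk, and use a stemmer to index the contexts of "happy". Use regular expressions to split the text into sentences and tokenize it, find the words that begin with "a", "b", or "c" and end with "ing", then drop the word-internal vowels, that is, keep only word-initial or word-final vowel sequences, so the words remain easy to read. Write the result to a local .txt file (mind the character type conversion).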
Time: 2024-02-16 15:01:29 Views: 135
This task can be done in Python with NLTK and the standard `re` module.
The implementation steps are as follows:
1. Import the required modules, such as nltk and re.
```python
import nltk
from nltk.stem import PorterStemmer
import re
```
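2. Load the corpus text, either from the web or from the local disk. Use one source at a time; reading both in sequence would let the second overwrite the first.
```python
# Option A: fetch the text from the web
from urllib import request
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')  # bytes -> str

# Option B: read the text from disk (uncomment to use instead of option A)
# with open('corpus.txt', 'r', encoding='utf8') as f:
#     raw = f.read()
```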
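The two sources can also sit behind one helper so that the bytes-to-str conversion happens in a single place. This is a minimal sketch, not part of the original answer: the name `load_corpus` and its `encoding` parameter are illustrative assumptions, and `utf-8-sig` is chosen on the assumption that the file may begin with a byte-order mark.

```python
from urllib import request

def load_corpus(source, encoding='utf-8-sig'):
    """Return the text at `source`, which may be an http(s) URL or a local path."""
    if source.startswith(('http://', 'https://')):
        with request.urlopen(source) as response:
            data = response.read()   # bytes from the network
    else:
        with open(source, 'rb') as f:
            data = f.read()          # bytes from disk
    # Decode once; 'utf-8-sig' also strips a leading BOM if one is present
    return data.decode(encoding)
```

With this helper, `raw = load_corpus('corpus.txt')` and `raw = load_corpus(url)` both return a `str`.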
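3. Use the Porter stemmer to collect the tokens related to "happy". PorterStemmer reduces "happy" to the stem "happi", so compare stems with each other rather than searching a stem for the substring 'happy'.
```python
ps = PorterStemmer()
tokens = nltk.word_tokenize(raw)
target = ps.stem('happy')  # the Porter stem of "happy" is "happi"
# Keep the original tokens whose stem contains the target stem
happy_words = [w for w in tokens if target in ps.stem(w)]
```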
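The task asks for the contexts of "happy", not only the matching tokens. One way to get them is a stem-keyed index of token positions, sketched below under stated assumptions. The names `build_stem_index`, `concordance`, and `toy_stem` are hypothetical; `toy_stem` is a crude stand-in that lets the sketch run without NLTK and is not the Porter algorithm.

```python
from collections import defaultdict

def build_stem_index(tokens, stem):
    """Map each stem to the positions of the tokens that reduce to it."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[stem(tok.lower())].append(pos)
    return index

def concordance(tokens, index, stem, word, width=4):
    """Return (left context, hit, right context) for tokens sharing word's stem."""
    rows = []
    for pos in index.get(stem(word.lower()), []):
        left = ' '.join(tokens[max(0, pos - width):pos])
        right = ' '.join(tokens[pos + 1:pos + 1 + width])
        rows.append((left, tokens[pos], right))
    return rows

def toy_stem(word):
    # Hypothetical stand-in stemmer: strips a few suffixes, nothing more
    for suffix in ('iness', 'ily', 'y'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

demo_tokens = "She was happy and he spoke happily of their happiness".split()
demo_index = build_stem_index(demo_tokens, toy_stem)
for left, hit, right in concordance(demo_tokens, demo_index, toy_stem, 'happy', width=2):
    print(f'{left:>12} [{hit}] {right}')
```

In the pipeline from this answer, passing `ps.stem` as the `stem` argument and the `tokens` list from `nltk.word_tokenize` should give the same kind of listing with real Porter stems.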
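4. Split the text into sentences and tokenize each sentence. The code below uses NLTK's `sent_tokenize` and `word_tokenize` rather than hand-written regular expressions.
```python
sentences = nltk.sent_tokenize(raw)
words = [nltk.word_tokenize(sentence) for sentence in sentences]
```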
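The task statement asks for regular expressions at this step. A regex-only version might look like the sketch below. The patterns and the names `split_sentences` and `tokenize` are illustrative assumptions, and a rule this simple will mis-split abbreviations such as "Mr. Smith".

```python
import re

# Split on whitespace that follows ., ! or ? and precedes a capital letter or a quote
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')
# A word (with inner apostrophes or hyphens), a number, or any other single symbol
TOKEN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|\S")

def split_sentences(text):
    """Split text into sentences at the boundaries matched by SENT_BOUNDARY."""
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

def tokenize(sentence):
    """Return the word, number, and punctuation tokens of one sentence."""
    return TOKEN.findall(sentence)

sample = 'Alice was beginning to get tired. "What is the use?" she thought. It was boring!'
for sentence in split_sentences(sample):
    print(tokenize(sentence))
```

The resulting list of token lists has the same shape as `words` above, so the later steps can consume either.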
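5. Find the words that begin with "a", "b", or "c" and end with "ing", then drop their word-internal vowels while keeping any word-initial or word-final vowel sequence.
```python
# Words starting with a, b, or c and ending in "ing" (case-insensitive)
pattern = re.compile(r'^[abc][a-z]*ing$', re.IGNORECASE)
# Matches an initial vowel run, a final vowel run, or any single non-vowel
keep = re.compile(r'^[aeiou]+|[aeiou]+$|[^aeiou]', re.IGNORECASE)
result = []
for sentence in words:
    for word in sentence:
        if pattern.match(word):
            # Joining the kept pieces drops only the word-internal vowels
            result.append(''.join(keep.findall(word)))
```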
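To see what the vowel rule from the task does to individual words, the compression can be isolated in a small function. This is a sketch; the name `compress` is an assumption, and the pattern keeps a vowel run only when it touches the start or the end of the word.

```python
import re

# An initial vowel run, a final vowel run, or any single non-vowel character
KEEP = re.compile(r'^[aeiou]+|[aeiou]+$|[^aeiou]', re.IGNORECASE)

def compress(word):
    """Drop word-internal vowels; keep vowel runs at the start or end of the word."""
    return ''.join(KEEP.findall(word))

for w in ['abandoning', 'becoming', 'calling', 'audio']:
    print(w, '->', compress(w))
# abandoning -> abndnng
# becoming -> bcmng
# calling -> cllng
# audio -> audio
```

Removing every vowel with `re.sub('[aeiou]', '', word)` would instead turn "abandoning" into "bndnng", losing the initial "a" that the task asks to keep.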
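6. Write the result to a local .txt file. Passing an explicit encoding makes the str-to-bytes conversion on write predictable across platforms.
```python
with open('result.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(result))
```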
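The note about character type conversion matters if some results arrive as `bytes` (for example, text read in binary mode) rather than `str`. A minimal sketch of a writer that normalizes both is shown below; `save_lines` is a hypothetical helper name, and UTF-8 is assumed for both decoding and writing.

```python
def save_lines(path, lines, encoding='utf-8'):
    """Write one item per line, decoding any bytes items to str first."""
    text_lines = [
        item.decode(encoding) if isinstance(item, bytes) else str(item)
        for item in lines
    ]
    # newline='\n' keeps line endings identical on every platform
    with open(path, 'w', encoding=encoding, newline='\n') as f:
        f.write('\n'.join(text_lines))
    return len(text_lines)
```

Calling `save_lines('result.txt', result)` would replace the `with` block above and return the number of lines written.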
The complete code is as follows:
```python
import nltk
from nltk.stem import PorterStemmer
import re
from urllib import request

# Load the text from the web
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')  # bytes -> str
# Or load it from disk instead
# with open('corpus.txt', 'r', encoding='utf8') as f:
#     raw = f.read()

# Stemming: the Porter stem of "happy" is "happi", so compare stems
ps = PorterStemmer()
tokens = nltk.word_tokenize(raw)
target = ps.stem('happy')
happy_words = [w for w in tokens if target in ps.stem(w)]

# Sentence splitting and tokenization
sentences = nltk.sent_tokenize(raw)
words = [nltk.word_tokenize(sentence) for sentence in sentences]

# Find words starting with a, b, or c and ending in "ing",
# then drop their word-internal vowels
pattern = re.compile(r'^[abc][a-z]*ing$', re.IGNORECASE)
# An initial vowel run, a final vowel run, or any single non-vowel
keep = re.compile(r'^[aeiou]+|[aeiou]+$|[^aeiou]', re.IGNORECASE)
result = []
for sentence in words:
    for word in sentence:
        if pattern.match(word):
            result.append(''.join(keep.findall(word)))

# Write the result to a local file
with open('result.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(result))
```
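Note: install the nltk module before running the code, using the command below. `word_tokenize` and `sent_tokenize` also need NLTK's Punkt tokenizer data, which can be fetched once from Python with `nltk.download('punkt')`; recent NLTK releases may ask for `punkt_tab` instead.
```bash
pip install nltk
```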