I have the following function:

```python
def parse_corpus(infile, outfile):
    '''parse the corpus of the infile into the outfile'''
    space = ' '
    i = 0

    def tokenize(text):
        return [lemma(token) for token in text.split()]

    with open(outfile, 'w', encoding='utf-8') as fout:
        # wiki = WikiCorpus(infile, lemmatize=False, dictionary={})  # gensim's WikiCorpus class for processing Wikipedia dumps
        wiki = WikiCorpus(infile, tokenizer_func=tokenize, dictionary={})  # gensim's WikiCorpus class for processing Wikipedia dumps
        for text in wiki.get_texts():
            fout.write(space.join(text) + '\n')
            i += 1
            if i % 10000 == 0:
                logger.info('Saved ' + str(i) + ' articles')
```

It raises the following error:

```
D:\软件\python\lib\site-packages\gensim\utils.py:1333: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected %s; aliasing chunkize to chunkize_serial" % entity)
Traceback (most recent call last):
  File "D:\pythonFiles\图灵\Python_project\self_learn\大语言模型\WikiExtractor.py", line 52, in <module>
    parse_corpus(infile, outfile)
  File "D:\pythonFiles\图灵\Python_project\self_learn\大语言模型\WikiExtractor.py", line 29, in parse_corpus
    for text in wiki.get_texts():
  File "D:\软件\python\lib\site-packages\gensim\corpora\wikicorpus.py", line 693, in get_texts
    for tokens, title, pageid in pool.imap(_process_article, group):
  File "D:\软件\python\lib\multiprocessing\pool.py", line 870, in next
    raise value
  File "D:\软件\python\lib\multiprocessing\pool.py", line 537, in _handle_tasks
    put(task)
  File "D:\软件\python\lib\multiprocessing\connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "D:\软件\python\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'parse_corpus.<locals>.tokenize'
```

How can I fix this?
Posted: 2023-09-11 12:09:40 · Views: 164
Try moving the `tokenize` function out of `parse_corpus` to module level and passing it to `WikiCorpus` as a parameter, as shown below. `multiprocessing` pickles functions by their qualified name, so only module-level functions can be sent to worker processes:
```python
def tokenize(text):
    return [lemma(token) for token in text.split()]

def parse_corpus(infile, outfile):
    '''parse the corpus of the infile into the outfile'''
    space = ' '
    i = 0
    wiki = WikiCorpus(infile, tokenizer_func=tokenize, dictionary={})
    with open(outfile, 'w', encoding='utf-8') as fout:
        for text in wiki.get_texts():
            fout.write(space.join(text) + '\n')
            i += 1
            if i % 10000 == 0:
                logger.info('Saved ' + str(i) + ' articles')
```
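The root cause can be reproduced in isolation, independent of gensim. Here is a minimal sketch (the names `module_level` and `make_local` are illustrative) showing that `pickle` serializes a function only as a reference to its qualified name, which is why a nested function triggers exactly the `Can't pickle local object` error from the traceback:

```python
import pickle

def module_level(text):
    # Defined at module top level: picklable, because pickle stores
    # just the qualified name 'module_level' and looks it up on load
    return text.split()

def make_local():
    def local_fn(text):
        # Defined inside another function: its qualified name is
        # 'make_local.<locals>.local_fn', which cannot be looked up
        # as a module attribute when unpickling
        return text.split()
    return local_fn

# Module-level function round-trips through pickle
restored = pickle.loads(pickle.dumps(module_level))
print(restored("a b"))  # ['a', 'b']

# Local function fails with the same error as in the question
try:
    pickle.dumps(make_local())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_local.<locals>.local_fn'
```

This is why hoisting `tokenize` to module level fixes the error: `multiprocessing` uses pickle to send the tokenizer to its worker processes.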
Alternatively, you can try replacing Python's built-in `multiprocessing` module with `pathos.multiprocessing`, which serializes with `dill` and can therefore pickle local functions.
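A third, stdlib-only option is to wrap the tokenizer in a module-level callable class. Instances of a top-level class are picklable, and any configuration travels along as instance attributes. This is a sketch; the `Tokenizer` class and its `lowercase` option are illustrative, not part of gensim:

```python
import pickle

class Tokenizer:
    """Picklable replacement for a nested tokenize() function.

    Instances pickle fine because the class itself is defined at
    module level; configuration is carried in instance attributes.
    """
    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def __call__(self, text):
        tokens = text.split()
        return [t.lower() for t in tokens] if self.lowercase else tokens

tokenize = Tokenizer()
# The instance round-trips through pickle, so multiprocessing can send it
clone = pickle.loads(pickle.dumps(tokenize))
print(clone("Hello World"))  # ['hello', 'world']
```

You could then pass `tokenizer_func=Tokenizer()` to `WikiCorpus` the same way the original code passes `tokenize` (assuming your gensim version accepts `tokenizer_func`, as the code in the question does).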