Extracting company names from financial text data with Python
Date: 2023-10-07 09:08:36
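Related resource: a Python-based segmentation tool for Chinese company names that extracts the place name, brand name (head word), industry term, and company suffix from a company name.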
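As a rough illustration of what such a decomposition looks like (this is not the referenced tool's implementation), the sketch below splits a company name with a regular expression built from tiny hypothetical lexicons. `PLACES`, `INDUSTRIES`, `SUFFIXES`, and `split_company_name` are made-up names for this example, and a real tool would need far larger dictionaries.

```python
import re

# Hypothetical mini-lexicons, for illustration only
PLACES = ['北京', '上海', '深圳', '杭州']
INDUSTRIES = ['科技', '网络技术', '信息技术', '投资', '贸易']
SUFFIXES = ['股份有限公司', '有限责任公司', '有限公司', '集团', '公司']


def _alternation(words):
    # Longest entries first, so '股份有限公司' is tried before '公司'
    ordered = sorted(words, key=len, reverse=True)
    return '|'.join(re.escape(w) for w in ordered)


PATTERN = re.compile(
    rf'^(?P<place>{_alternation(PLACES)})?(?:市|省)?'
    rf'(?P<brand>.+?)'
    rf'(?P<industry>{_alternation(INDUSTRIES)})?'
    rf'(?P<suffix>{_alternation(SUFFIXES)})$'
)


def split_company_name(name):
    """Return the four parts of a company name, or None if no suffix is found."""
    match = PATTERN.match(name)
    if match is None:
        return None
    # Optional groups that did not match come back as None; use '' instead
    return {key: value or '' for key, value in match.groupdict().items()}


parts = split_company_name('北京字节跳动科技有限公司')
print(parts)
```

The brand group is lazy, so it stops as soon as the remaining text can be read as an optional industry term followed by a known suffix. Names whose industry word is missing from the lexicon simply end up with a longer brand part.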
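Company names can be extracted from financial text with a named entity recognition (NER) model; common choices include CRF, BiLSTM+CRF, and BERT. Below is an example Python program that uses a BiLSTM+CRF model to extract company names: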
```python
import re

import jieba
import numpy as np
import pandas as pd
from keras.layers import Input, Embedding, Bidirectional, LSTM
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
# keras_contrib targets the standalone Keras 2.x package
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

PAD_ID, UNK_ID = 0, 1  # id 0 is reserved for padding, id 1 for unseen words
KEYWORD = re.compile('公司|集团|银行|保险')


def tokenize(text):
    """Keep only Chinese characters, then segment with jieba."""
    return jieba.lcut(re.sub(r'[^\u4e00-\u9fa5]', '', text))


# Load the data, drop rows without text, and segment each sentence
data = pd.read_csv('finance_text.csv')
data = data.dropna(subset=['text']).reset_index(drop=True)
data['tokens'] = data['text'].astype(str).apply(tokenize)
data = data[data['tokens'].apply(len) > 0].reset_index(drop=True)

# Build the vocabulary
word2id = {'<PAD>': PAD_ID, '<UNK>': UNK_ID}
for sentence in data['tokens']:
    for word in sentence:
        if word not in word2id:
            word2id[word] = len(word2id)


def encode(tokens):
    """Convert words to ids; unseen words map to UNK_ID."""
    return [word2id.get(word, UNK_ID) for word in tokens]


def label(tokens):
    """Weak labels: 1 if the token contains a company keyword, else 0."""
    return [1 if KEYWORD.search(word) else 0 for word in tokens]


# Convert to ids and labels, then pad every sequence to the same length
x = pad_sequences(data['tokens'].apply(encode).tolist(), padding='post', value=PAD_ID)
y = pad_sequences(data['tokens'].apply(label).tolist(), padding='post', value=0)
max_len = x.shape[1]

# Split into training and validation sets
train_size = int(len(x) * 0.8)
x_train, x_valid = x[:train_size], x[train_size:]
y_train = np.expand_dims(y[:train_size], axis=-1)
y_valid = np.expand_dims(y[train_size:], axis=-1)

# Define the model: embedding -> BiLSTM -> CRF
inputs = Input(shape=(max_len,))
embedding = Embedding(input_dim=len(word2id), output_dim=128)(inputs)
bilstm = Bidirectional(LSTM(units=64, return_sequences=True))(embedding)
outputs = CRF(2, sparse_target=True)(bilstm)  # two tags: 0 = other, 1 = company
model = Model(inputs=inputs, outputs=outputs)

# Compile and train the model
model.compile(optimizer='adam', loss=crf_loss, metrics=[crf_viterbi_accuracy])
model.fit(x_train, y_train, validation_data=(x_valid, y_valid), batch_size=32, epochs=10)

# Predict: test sentences go through the same tokenize/encode/pad steps
test_data = ['这家公司的股票表现不错', '保险公司的业绩增长很快']
test_tokens = [tokenize(sentence) for sentence in test_data]
test_x = pad_sequences([encode(tokens) for tokens in test_tokens], maxlen=max_len,
                       padding='post', truncating='post', value=PAD_ID)
pred_y = model.predict(test_x)  # one-hot Viterbi path, shape (n, max_len, 2)

# Convert the predicted tags back to company-name text
for tokens, pred in zip(test_tokens, pred_y):
    tags = pred.argmax(axis=-1)[:len(tokens)]
    company = ''.join(word for word, tag in zip(tokens, tags) if tag == 1)
    print(company)
```
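The final loop of the example concatenates every flagged token of a sentence into one string, so two separate company mentions in the same sentence would be merged. A minimal sketch of one way to keep them apart is to group consecutive tokens tagged 1; `extract_mentions` and the sample tokens and tags below are hypothetical and not part of the original code.

```python
from itertools import groupby


def extract_mentions(tokens, tags):
    """Join each run of consecutive tokens tagged 1 into one mention."""
    mentions = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag == 1:
            mentions.append(''.join(token for token, _ in group))
    return mentions


# Made-up segmentation and tags for a sentence with two mentions
tokens = ['平安', '银行', '与', '招商', '银行', '合作']
tags = [1, 1, 0, 1, 1, 0]
mentions = extract_mentions(tokens, tags)
print(mentions)
```

With a binary tag, two mentions that sit directly next to each other still merge; a BIO tagging scheme (separate "begin" and "inside" tags) is the usual way to handle that case.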
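The BiLSTM+CRF example above extracts company names from financial text as follows: it segments the text, builds a vocabulary and converts words to ids, labels company-name positions by matching keywords such as 公司, 集团, 银行, and 保险, and then trains the model and runs prediction. Because the labels come from keyword matching rather than manual annotation, the model can only learn to reproduce that rule; a practical system needs an annotated corpus.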
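What the CRF layer adds on top of the BiLSTM is joint decoding: instead of picking the best tag for each token independently, it picks the tag sequence with the highest total score, including a score for each transition between neighbouring tags. The following is a minimal pure-Python sketch of Viterbi decoding under that scoring; `viterbi_decode` and the emission and transition scores are made up for illustration and are not taken from a trained model or from keras_contrib.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptr)
    # Follow the back-pointers from the best final tag
    tag = max(range(n_tags), key=lambda i: score[i])
    path = [tag]
    for ptr in reversed(backptrs):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Made-up scores: the middle token slightly prefers tag 1 on its own,
# but the transition scores reward staying in the same tag
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]

greedy = [row.index(max(row)) for row in emissions]
path = viterbi_decode(emissions, transitions)
print(greedy, path)
```

Here the per-token choice flips the middle tag, while the joint decoding keeps a consistent sequence because switching tags twice costs more than the small emission gain.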