How do you compute text similarity with deep learning? Please provide code.
Sure, here's example code for calculating text similarity using deep learning (TensorFlow/Keras):
```
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, Bidirectional
from tensorflow.keras.models import Model
# example texts
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown dog jumps over the lazy fox"
# define tokenizer
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts([text1, text2])
# convert texts to integer id sequences
seq1 = tokenizer.texts_to_sequences([text1])
seq2 = tokenizer.texts_to_sequences([text2])
# pad sequences to a fixed length, keeping the batch dimension
# (shape (1, max_len)) so they can be fed to the model directly
max_len = 10
seq1 = pad_sequences(seq1, maxlen=max_len, padding='post')
seq2 = pad_sequences(seq2, maxlen=max_len, padding='post')
# define the model: both texts share one embedding layer and one
# bidirectional LSTM encoder; the two encodings are concatenated
# and passed through a small classifier head
input1 = Input(shape=(max_len,))
input2 = Input(shape=(max_len,))
embedding_layer = Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=100)
lstm_layer = Bidirectional(LSTM(128))
dense_layer = Dense(64, activation='relu')
output_layer = Dense(1, activation='sigmoid')  # probability that the pair is similar
embedded1 = embedding_layer(input1)
embedded2 = embedding_layer(input2)
encoded1 = lstm_layer(embedded1)
encoded2 = lstm_layer(embedded2)
merged = Concatenate()([encoded1, encoded2])
dense1 = dense_layer(merged)
output = output_layer(dense1)
model = Model(inputs=[input1, input2], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# train the model -- in practice you would fit on a large labeled
# dataset of text pairs (1 = similar, 0 = dissimilar); fitting on a
# single pair, as done here, only illustrates the API
labels = np.array([1])
model.fit(x=[seq1, seq2], y=labels, epochs=10)
# the sigmoid output is the model's similarity score for the pair
similarity_score = model.predict([seq1, seq2])[0][0]
print("Similarity score: %.2f" % similarity_score)
```
Note that this is just example code; there are many ways to compute text similarity with deep learning. The architecture used here is a simple one: a shared embedding layer, a shared bidirectional LSTM encoder, a dense layer, and a sigmoid output. The model is trained as a binary classifier (are the two texts similar or not?), and the sigmoid output of the trained model then serves directly as the similarity score, as shown in the last lines of the code. Keep in mind that the score is only meaningful after training on a reasonably sized labeled dataset of text pairs; the single-pair fit above is just a placeholder.
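One common alternative is a Siamese design: encode both texts with the same shared encoder and score them with cosine similarity instead of a classifier head. The sketch below is a minimal illustration, not a fixed recipe; the layer sizes are arbitrary, and it reuses `max_len`, the tokenizer vocabulary, and the padded `seq1`/`seq2` from the example above.
```
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dot
from tensorflow.keras.models import Model

# shared encoder: both texts pass through the same weights, so
# semantically similar texts should map to nearby vectors once trained
encoder_input = Input(shape=(max_len,))
x = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100)(encoder_input)
x = Bidirectional(LSTM(128))(x)
encoder = Model(encoder_input, x)

input1 = Input(shape=(max_len,))
input2 = Input(shape=(max_len,))
vec1 = encoder(input1)
vec2 = encoder(input2)

# Dot with normalize=True computes cosine similarity, in [-1, 1]
similarity = Dot(axes=1, normalize=True)([vec1, vec2])
siamese = Model(inputs=[input1, input2], outputs=similarity)

# untrained weights give an essentially arbitrary score; train the
# encoder first (e.g. with a contrastive or pair-classification loss)
score = siamese.predict([seq1, seq2])[0][0]
print("Cosine similarity: %.2f" % score)
```
Because the two texts never interact until the final cosine step, a Siamese model lets you pre-compute and cache text embeddings, which makes large-scale similarity search much cheaper than running a pair classifier on every candidate pair.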