Table 1: Complexity of various sequence models in terms of sequence length ($L$), batch size ($B$), and hidden dimension ($H$); tildes denote log factors. Metrics are parameter count, training computation, training space requirement, training parallelizability, and inference computation (for 1 sample and time-step). For simplicity, the state size $N$ of S4 is tied to $H$. Bold denotes model is theoretically best for that metric. Convolutions are efficient for training while recurrence is efficient for inference, while SSMs combine the strengths of both.

             Convolution³            Recurrence    Attention            S4
Parameters   $LH$                    $H^2$         $H^2$                $H^2$
Training     $\tilde{L}H(B + H)$     $BLH^2$       $B(L^2H + LH^2)$     $BH(\tilde{H} + \tilde{L}) + B\tilde{L}H$
Space        $BLH$                   $BLH$         $B(L^2 + HL)$        $BLH$
Parallel     Yes                     No            Yes                  Yes
Inference    $LH^2$                  $H^2$         $L^2H + H^2L$        $H^2$
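To make the inference column concrete: in the recurrent view of an SSM, each output is produced from a fixed-size state, so the per-step cost with a dense discretized state matrix is $O(N^2) = O(H^2)$ under the $N = H$ convention above. A minimal sketch (toy random values, not a trained model):

```python
import numpy as np

def ssm_step(x, u_k, A_bar, B_bar, C_bar):
    """One step of the recurrent (inference) view of a discretized SSM:
    x_k = A_bar @ x_{k-1} + B_bar * u_k,  y_k = C_bar @ x_k.
    With a dense N x N A_bar this costs O(N^2) per time step, independent of L."""
    x = A_bar @ x + B_bar * u_k
    y = C_bar @ x
    return x, y

# Toy usage with arbitrary values (illustration only).
N = 4
rng = np.random.default_rng(0)
A_bar, B_bar, C_bar = 0.5 * np.eye(N), rng.standard_normal(N), rng.standard_normal(N)
x = np.zeros(N)
for u_k in [1.0, 0.0, -1.0]:
    x, y_k = ssm_step(x, u_k, A_bar, B_bar, C_bar)
```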
3.4 Architecture Details of the Deep S4 Layer
Concretely, an S4 layer is parameterized as follows. First initialize an SSM with $A$ set to the HiPPO matrix (2). By Lemma 3.1 and Theorem 1, this SSM is unitarily equivalent to some $(\Lambda - PQ^*, B, C)$ for some diagonal $\Lambda$ and vectors $P, Q, B, C \in \mathbb{C}^{N \times 1}$. These comprise S4's $5N$ trainable parameters.
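For concreteness, here is a minimal sketch of what these $5N$ parameters look like; the random initialization is a placeholder for illustration only (in S4 the values come from the HiPPO matrix via Lemma 3.1 and Theorem 1), and the dense $A$ is materialized only to make the structure explicit, since the algorithm works directly with the structured form:

```python
import numpy as np

def init_s4_parameters(N, seed=0):
    """Illustrative sketch (not S4's actual initialization): the 5N trainable
    parameters are a diagonal Lambda, low-rank factors P and Q, and vectors
    B and C, each a length-N complex vector."""
    rng = np.random.default_rng(seed)
    cvec = lambda: rng.standard_normal(N) + 1j * rng.standard_normal(N)
    Lambda, P, Q, B, C = cvec(), cvec(), cvec(), cvec(), cvec()
    return Lambda, P, Q, B, C

def materialize_A(Lambda, P, Q):
    """Conceptually A = Lambda - P Q^* (diagonal plus rank-1)."""
    return np.diag(Lambda) - np.outer(P, Q.conj())
```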
The overall deep neural network (DNN) architecture of S4 is similar to prior work. As defined above, S4 defines a map from $\mathbb{R}^L \to \mathbb{R}^L$, i.e. a 1-D sequence map. Typically, DNNs operate on feature maps of size $H$ instead of 1. S4 handles multiple features by simply defining $H$ independent copies of itself, and then mixing the $H$ features with a position-wise linear layer for a total of $O(H^2) + O(HN)$ parameters per layer. Nonlinear activation functions are also inserted between these layers. Overall, S4 defines a sequence-to-sequence map of shape (batch size, sequence length, hidden dimension), exactly the same as related sequence models such as Transformers, RNNs, and CNNs.
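A minimal sketch of how such a layer could be wired, assuming the per-feature convolution kernels have already been produced by the $H$ independent S4 copies (the helper name and the ReLU choice are illustrative, not the reference implementation):

```python
import numpy as np

def s4_block(u, kernels, W_mix, b_mix):
    """Sketch of one deep S4 layer.

    u:       (Bsz, L, H) input sequences
    kernels: (H, L) one global (length-L) convolution kernel per feature,
             as produced by the H independent S4 copies
    W_mix:   (H, H) position-wise mixing weights  -> the O(H^2) term
    b_mix:   (H,)   bias
    """
    Bsz, L, H = u.shape
    n = 2 * L  # zero-pad so the FFT computes a linear (non-circular) convolution
    U = np.fft.rfft(u, n=n, axis=1)                     # (Bsz, L + 1, H)
    K = np.fft.rfft(kernels.T, n=n, axis=0)             # (L + 1, H)
    y = np.fft.irfft(U * K[None], n=n, axis=1)[:, :L]   # depthwise conv, (Bsz, L, H)
    y = y @ W_mix + b_mix                               # mix the H features position-wise
    return np.maximum(y, 0.0)                           # nonlinearity between layers
```

The mixing matrix accounts for the $O(H^2)$ parameters, while the $H$ independent SSM copies contribute the $O(HN)$ term.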
Note that the core S4 module is a linear transformation, but the addition of non-linear transformations
through the depth of the network makes the overall deep SSM non-linear. This is analogous to a vanilla CNN,
since convolutional layers are also linear. The broadcasting across $H$ hidden features described in this section
is also analogous to depthwise-separable convolutions. Thus, the overall deep S4 model is closely related to a
depthwise-separable CNN but with global convolution kernels.
Finally, we note that follow-up work found that this version of S4 can sometimes suffer from numerical instabilities when the $A$ matrix has eigenvalues in the right half-plane [14]. It introduced a slight change to the NPLR parameterization for S4 from $\Lambda - PQ^*$ to $\Lambda - PP^*$ that corrects this potential problem.
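As a quick illustration of why the $\Lambda - PP^*$ form is better behaved (a toy numerical check, not the actual S4 initialization): $PP^*$ is positive semidefinite, so when $\Lambda$ has non-positive real part the Hermitian part of $\Lambda - PP^*$ is negative semidefinite and every eigenvalue stays in the closed left half-plane, whereas subtracting a generic $PQ^*$ gives no such guarantee:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
# Diagonal with non-positive real part, plus random low-rank factors.
Lambda = -np.abs(rng.standard_normal(N)) + 1j * rng.standard_normal(N)
P = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Q = rng.standard_normal(N) + 1j * rng.standard_normal(N)

A_pq = np.diag(Lambda) - np.outer(P, Q.conj())  # original NPLR form
A_pp = np.diag(Lambda) - np.outer(P, P.conj())  # corrected form

print(np.linalg.eigvals(A_pq).real.max())  # may be positive -> potential instability
print(np.linalg.eigvals(A_pp).real.max())  # <= 0 (up to round-off): P P^* is PSD
```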
Table 1 compares the complexities of the most common deep sequence modeling mechanisms.
4 Experiments
Section 4.1 benchmarks S4 against the LSSL and efficient Transformer models. Section 4.2 validates S4 on
LRDs: the LRA benchmark and raw speech classification. Section 4.3 investigates whether S4 can be used as
a general sequence model to perform effectively and efficiently in a wide variety of settings including image
classification, image and text generation, and time series forecasting.
4.1 S4 Efficiency Benchmarks
We benchmark S4's training speed and memory use against both the LSSL and efficient Transformer variants designed for long-range sequence modeling. As outlined in Section 3, S4 is theoretically much more efficient than the LSSL, and Table 2 confirms that S4 is orders of magnitude more speed- and memory-efficient for practical layer sizes. In fact, S4's speed and memory use are competitive with the most
³ Refers to global (in the sequence length) and depthwise-separable convolutions, similar to the convolution version of S4.