Model-Free Methods: Feature Screening for High-Dimensional Survival Data Analysis


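As scientific data grow rapidly, fields such as medicine, biostatistics and machine learning increasingly need feature screening. This paper concerns model-free feature screening for high-dimensional survival data, a key strategy for reducing dimensionality and improving analysis efficiency when the number of candidate variables is large. It appeared in SCIENCE CHINA Mathematics (September 2018) under the title "Model-free feature screening for high-dimensional survival data", by Yuanyuan Lin, Xianhui Liu and Meiling Hao. The paper studies how to design a unified and robust model-free feature screening method in the presence of censoring.

Censoring means that a subject's survival time is only partially observed. It is common in survival analysis and can arise from the study design, loss to follow-up or other causes. Methods that rely on a specified model may perform poorly in this setting, so the authors propose a screening strategy that does not depend on a particular model assumption and aims to find the features associated with the survival outcome.

The core idea of model-free feature screening is to use a simple marginal statistic (a screening rule) to measure the strength of association between each feature and the survival time. Its advantage is flexibility: no statistical model, such as Cox proportional hazards regression, has to be specified in advance, which reduces the potential bias from model selection and parameter estimation. As the excerpt below indicates, predictors are ranked by an estimated marginal statistic and the top-ranked ones are retained by a hard threshold.

The paper may describe the concrete screening steps, including how censored observations are handled, how the threshold is chosen, and how stability and consistency are ensured in high dimensions. It may also report simulation studies or real-data examples that compare the method with alternatives.

The work is relevant to data science, biomedical research and public health because it offers a practical tool for identifying key biomarkers, or other factors that affect survival time, in large high-dimensional survival data sets.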
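The general screening idea described above (score each feature marginally against the censored outcome, then rank) can be sketched as follows. This is a minimal illustration, not the paper's statistic: it swaps in a Harrell-type concordance index as the marginal score, and the function names `marginal_concordance` and `screen_features` are hypothetical.

```python
import numpy as np


def marginal_concordance(x, time, event):
    """Harrell-type concordance between one feature and censored survival times.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. The pair is concordant when the subject who fails
    earlier also has the larger feature value.
    """
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # comparable[i, j]: i has an observed event strictly before j's observed time
    comparable = event[:, None] & (time[:, None] < time[None, :])
    n_pairs = comparable.sum()
    if n_pairs == 0:
        return 0.5  # no usable pairs: report "no association"

    diff = x[:, None] - x[None, :]
    concordant = (diff > 0) & comparable
    tied = (diff == 0) & comparable
    return (concordant.sum() + 0.5 * tied.sum()) / n_pairs


def screen_features(X, time, event, keep):
    """Rank the columns of X by |C - 0.5| and return the top `keep` indices."""
    X = np.asarray(X, dtype=float)
    utility = np.array(
        [abs(marginal_concordance(X[:, k], time, event) - 0.5)
         for k in range(X.shape[1])]
    )
    order = np.argsort(-utility, kind="stable")  # largest score first
    return order[:keep], utility
```

Each feature is scored on its own, so no joint model for the survival time is fitted. The pairwise comparison costs O(n^2) per feature, which is acceptable for a sketch but would need a faster implementation for very large samples.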

1620 Lin Y Y et al. Sci China Math September 2018 Vol. 61 No. 9

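(C4) $\{\mathrm{Cov}(\beta[\ldots])\}^{-1}\,E\,|E\{\beta[\ldots]\,I\{T \le \tilde{Y}\} \mid \tilde{Y}\}|$ is bounded away from zero, where $\tilde{Y}$ is an independent copy of $Y$.

(C5) $\min_{k \in \mathcal{A}} |\mathrm{Cov}(X_k, \beta[\ldots])| > 2\hat{c}\,n^{-\eta}$ for some constants $\hat{c} > 0$ and $\eta \in [0, 1/2)$.

(C6) $\liminf_{p \to \infty} \{\min_{k \in \mathcal{A}} |\mathrm{Cov}(X_k, \beta[\ldots])| - \max_{k \notin \mathcal{A}} |\mathrm{Cov}(X_k, \beta[\ldots])|\} > \hat{m}$ for some $\hat{m} > 0$.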

Condition (C1) is similar to [12, Condition (C6′)], which is common in the survival analysis literature to ensure that the local Kaplan-Meier estimator is well behaved. Condition (C2) is a sub-exponential moment condition on the predictors that holds for various distributions, for example, the normal distribution and distributions with bounded support. Conditions (C3)–(C5) are crucial to ensure the sure screening property. In addition to Conditions (C1)–(C5), Condition (C6) is imposed to ensure the ranking consistency property. Note that we do not impose any finite moment condition on the response variable. We state the sure screening property as well as a bound on the size of selected variables by the proposed method in Theorem 2.1.

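Theorem 2.1. Suppose Conditions (C1)–(C5) hold. Let $M_n$ be a sequence depending on $n$. If $\log(n)/(n^{1-2\eta}h) \to 0$, $n[\ldots] \to 0$, $nh[\ldots] \to \infty$, $n^{1-2\eta}[\ldots] \to \infty$ and $\log(n)/M_n[\ldots] \to 0$ as $n \to \infty$, then there exist constants $c[\ldots] > 0$, $c[\ldots] > 0$ and $\theta[\ldots] > 0$ such that

(1) (Sure screening property)

$$P\Big(\max_{1 \le k \le p} |\hat{w}_k - w_k| > c[\ldots]\,n^{-\eta}\Big) \le 2p\,\Big\{\exp(-c[\ldots]\,n^{1-2\eta}) + \exp\big(-[\ldots]^{1-2\eta}[\ldots]\big) + nC[\ldots]\exp(-M_n[\ldots])\Big\}$$

for $n$ sufficiently large. In addition, letting $\gamma = c[\ldots]\,n^{-\eta}$, we have

$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) > 1 - 2s_n\,\Big\{\exp(-c[\ldots]\,n^{1-2\eta}) + \exp\big(-[\ldots]^{1-2\eta}[\ldots]\big) + nC[\ldots]\exp(-M_n[\ldots])\Big\}$$

for $n$ sufficiently large, where $s_n$ is the size of $\mathcal{A}$.

(2) (Controlling false discovery rate) Moreover,

$$P\Big(|\hat{\mathcal{A}}| \le 2c[\ldots]^{-1}[\ldots]\sum_{k=1}^{[\ldots]}[\ldots]\Big) > 1 - 2p\,\Big\{\exp(-c[\ldots]\,n^{1-2\eta}) + \exp\big(-[\ldots]^{1-2\eta}[\ldots]\big) + nC[\ldots]\exp(-M_n[\ldots])\Big\}$$

for $n$ sufficiently large.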

The above theorem shows that if $M_n = O(n^{(1-2\eta)/2-\alpha})$ for any $0 < \alpha < (1-2\eta)/2$ and the dimensionality $p$ grows at an exponential rate of the sample size $n$, the sure screening property still holds. In other words, our method is able to handle ultra-high-dimensional data. In fact, the assumptions of Theorem 2.1 imply $\eta < 2/5$. Furthermore, under Condition (C2), the second part of Theorem 2.1 suggests that when the true model size $s_n$ is of the order of $n^{\beta}$, the size of the selected active set is of polynomial order with high probability. In particular, with $\beta + \eta < 1$, the hard thresholding rule with threshold $[n/\log(n)]$ adopted in Sections 3–4 is able to select the true active predictors with high probability.
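The hard thresholding rule mentioned above can be sketched as follows, reading $[n/\log(n)]$ as the integer part of $n/\log(n)$ (an assumption about the bracket notation). The function name `hard_threshold_select` and the input array of estimated statistics are hypothetical; how the statistics $\hat{w}_k$ are computed is not shown in this excerpt.

```python
import math

import numpy as np


def hard_threshold_select(utility, n):
    """Keep the d = floor(n / log(n)) predictors with the largest statistics.

    utility : estimated marginal statistics, one per predictor
    n       : sample size (must exceed 1 so that log(n) > 0)
    """
    utility = np.asarray(utility, dtype=float)
    d = int(math.floor(n / math.log(n)))
    d = min(d, utility.size)  # cannot keep more predictors than exist
    order = np.argsort(-utility, kind="stable")  # largest statistic first
    return np.sort(order[:d]), d
```

For example, with n = 100 the rule keeps 21 predictors, since 100 / log(100) is about 21.7.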

The next theorem establishes the ranking consistency property of the proposed method.

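Theorem 2.2 (Ranking consistency property). Assume Conditions (C1)–(C6) hold. If $\log(n)/(nh) \to [\ldots]$, $[\ldots] \to \infty$, $[\ldots]^{1-2\eta}[\ldots] \to \infty$ and $\log([\ldots])[\ldots] \to [\ldots]$ as $[\ldots] \to \infty$, then there exist constants $[\ldots] > 0$, $c[\ldots] > 0$ and $\theta[\ldots] > 0$ such that

$$P\Big(\min_{k \in \mathcal{A}} \hat{w}_k - \max_{k \notin \mathcal{A}} \hat{w}_k\ [\ldots]\Big) \le 2p\,\Big\{\exp(-c[\ldots]\,n) + \exp\big(-[\ldots]\big) + nC[\ldots]\exp(-M_n[\ldots])\Big\}$$

for $n$ sufficiently large.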
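Condition (C1) refers to a local Kaplan-Meier estimator, and both theorems involve a bandwidth $h$. The excerpt does not show how the estimator is defined, so the sketch below uses the standard kernel-weighted (Beran-type) form as an assumption. The Epanechnikov kernel and the function name `local_kaplan_meier` are illustrative choices, not taken from the paper.

```python
import numpy as np


def local_kaplan_meier(t, x0, x, time, event, h):
    """Kernel-weighted (Beran-type) Kaplan-Meier estimate of S(t | X = x0).

    x, time and event are 1-D arrays for a single covariate, the observed
    times and the event indicators (True = event observed); h is the bandwidth.
    """
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Epanechnikov kernel weights around x0, normalised to sum to one
    u = (x - x0) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    if k.sum() == 0:
        raise ValueError("no observations within bandwidth h of x0")
    w = k / k.sum()

    surv = 1.0
    # product over the distinct observed event times not exceeding t
    for s in np.unique(time[event & (time <= t)]):
        at_risk = w[time >= s].sum()  # weighted size of the risk set at s
        if at_risk <= 0.0:
            continue  # no local weight at risk, so no local weight fails either
        failing = w[event & (time == s)].sum()  # weighted number of events at s
        surv *= 1.0 - failing / at_risk
    return surv
```

When all observations receive equal weight, the product reduces to the ordinary Kaplan-Meier estimator. A smaller `h` localises the estimate around `x0` at the cost of using fewer observations.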
