Time Dependency and Data Value: The Timeliness Challenge in AI Products


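"This paper examines how time affects the value of data in AI and machine learning, along with the non-stationarity and value depreciation of data."

In the research paper "Time and the Value of Data", the authors study the key role that time-varying data plays in improving the quality of AI-based products and services. Time dependency is an important characteristic of data: as time passes, data gradually loses its relevance to a given problem, which can degrade algorithm performance and in turn reduce business value. The paper proposes a model that represents time dependency as a change in probability distribution, and it reports several unexpected findings.

First, the paper proves theoretically that even an unlimited amount of data accumulated over time may have limited power to predict the future. This implies that an algorithm trained on a dataset of bounded size may reach a similar level of performance. Moreover, adding more data that includes older samples is not always beneficial and may in fact hurt a firm's competitiveness, because older data can lower algorithm performance.

The paper then examines how data volume affects a firm's competitive advantage. Because of time dependency, the barrier to entry created by data volume is weakened: a competitor with a moderate amount of recent data may match the performance obtained from a large stock of historical data. This calls into question the importance of first-mover advantage in AI-based markets. In an empirical experiment on a next-word prediction task, the authors measure the loss of value in text data and find that, after seven years, 100MB of text data is about as effective as 50MB of current data on the prediction task.

These findings have far-reaching implications for the economics of AI and machine learning, and they underline how much the freshness and timeliness of data matter for maintaining algorithm efficiency and business value. Firms need to pay attention to the non-stationarity of data and understand the concepts of perishability and value depreciation in order to design effective data management and refresh strategies that sustain and improve the performance of their AI solutions. What matters is not only the quantity of data but also its quality and its time dimension.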


Harvard Business School Working Paper, No. 21-016

Acquiring data can happen in many ways. While firms may purchase data through an intermediary, they may also organically gather datasets over time from interactions with their users. In both methods, there are privacy concerns that may prevent users from sharing data. By modeling an intermediary that acquires data from users and sells it to firms, [10] investigates the issue of data externality and privacy. [2] discusses similar issues for organic data generation and claims that more data lowers the privacy barrier and motivates more users to share their data. This argument, together with the data network effect arguments above, suggests a significant growth rate in the size of a firm's data repository. [9, 20, 21] proposed a growth model for data in firms and the economy, and they answer questions about the firm's growth process.

Our paper studies the effectiveness of curated datasets and hence is not concerned with data solicitation or the growth of the firm's data repository. We argue that curating big datasets and blindly using them may not always give a firm a significant advantage, and may even put the firm at a disadvantage. Our arguments thus question universal assumptions about the value of data for a firm and how it may change the modes of competition. More precisely, we investigate whether and how curating large datasets can create barriers to entry and deter threats from entrants.

[17, 18, 33, 34, 38] discuss the implications of AI and, more precisely, of data for competition. For data in particular, most debates concern its volume and whether it creates a competitive advantage. Some of these studies, for example [31, 39], focus on antitrust issues and the potential role that data plays in creating a winner-take-all (monopoly) situation. Furthermore, there is research on how data can improve the prediction quality of services, either with respect to the degree of personalization [25, 39] or between adjacent products [7]. This research has direct strategy implications for how firms compete on growing the user base.

We believe that data characteristics play a crucial role in the value creation cycle and the modes of competition. For example, non-rivalry and exclusivity of a dataset to a firm can prevent other players from obtaining it, which in turn puts the owner in a superior position. Under exclusivity, data becomes an asset that behaves like the supply of a physical good. Biases build a harmful environment for both the company and its users; [16] provides a discussion of algorithmic fairness and potential biases. Paying closer attention to dataset characteristics, we can see that time-dependency and perishability are similar, since the sampling time determines them both. It is of great interest to see how dependency on time changes the strength of data externalities and influences the value creation cycle.

Electronic copy available at: https://ssrn.com/abstract=3680910


Perhaps the closest research to ours is [14], where the authors investigate the effect of historical search data on the quality of search results. They found little empirical evidence that old data improves the quality of search engine results. Also, [6] raises a question about the economies of scale that data provides for specific problems. They suggest a diminishing-returns-to-scale value model for data and argue that increasing data volume in advertisement applications does not improve the service quality. The results in these papers support our findings on the effectiveness of perishable data. We believe that both the search engine and advertisement businesses use time-sensitive data and hence face significant time dependency. Therefore, the data loses its effectiveness quickly, which contributes to the lack of significant improvement in prediction quality.

2. Background and Framework

In this section, our goal is to introduce how we approach the problem and explain why we make particular choices. We first describe time-dependency as a shift in distribution over time and dig into its cause. We argue that time-dependency is mostly due to exceptional reasons that most often cause a monotonic decrease in the value of data over time. We then give a brief introduction to machine learning and explain why we focus on maximum likelihood estimation (MLE) of the probability distribution. In particular, we introduce a decomposition of the MLE's objective function to lay the groundwork for the next section's propositions. We finally formalize the notion of effectiveness and define the substitution gain curve as a ratio of two effectiveness quantities: one from the past and the other from the substitute time.
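To make the idea of an MLE objective under a shifted distribution concrete, here is a minimal sketch. It is not the paper's decomposition, which this excerpt does not show. It uses the standard identity that the expected negative log-likelihood of a model q under the present distribution p equals the entropy of p plus the KL divergence from p to q. The toy corpora, the function names and the additive smoothing constant are all assumptions made for illustration.

```python
import math
from collections import Counter

def mle_unigram(tokens, vocab, smoothing=1e-6):
    """MLE of a categorical word distribution over `vocab`.

    A tiny additive smoothing (an assumption of this sketch) keeps
    unseen words at nonzero probability so the logarithms stay finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

def entropy(p):
    """H(p): the irreducible part of the loss."""
    return -sum(pw * math.log(pw) for pw in p.values() if pw > 0)

def kl_divergence(p, q):
    """KL(p || q): the extra loss caused by modelling p with q."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def cross_entropy(p, q):
    """Expected negative log-likelihood of model q under distribution p."""
    return -sum(pw * math.log(q[w]) for w, pw in p.items() if pw > 0)

# Hypothetical toy corpora standing in for "old" and "current" text.
old_text = "pray send word by post pray send the letter by post".split()
new_text = "please send a text please send the photo by phone".split()
vocab = sorted(set(old_text) | set(new_text))

p_now = mle_unigram(new_text, vocab)  # stands in for today's distribution
q_old = mle_unigram(old_text, vocab)  # model fitted on old data only

loss_old = cross_entropy(p_now, q_old)  # old model evaluated on today's text
loss_new = cross_entropy(p_now, p_now)  # best achievable loss, equal to H(p_now)
shift_penalty = kl_divergence(p_now, q_old)

print(f"loss with old model:   {loss_old:.3f}")
print(f"loss with fresh model: {loss_new:.3f}")
print(f"penalty from shift:    {shift_penalty:.3f}")
```

In this sketch the gap between the two losses is exactly the KL term, so any change in the data-generating distribution between the sampling time and the prediction time shows up as additional loss that more old data cannot remove.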

2.1. Change in distribution

Time-dependency has many causes, among which we can mention the change in consumers' tastes and behavior. If we look at the best-selling music albums from the 80s and compare them with the best sellers in 2020, we can see the difference in taste. Innovation is another reason for the change because, nowadays, we witness continuous innovation and considerable variation in the product and services space. The telegram seems ancient these days, and so will landline telephones soon. Because of these differences, it is not easy to translate the environment of one time into that of another. Perhaps the way people communicate is the best place to observe such changes. Hundreds of years ago, people used letters and the post to communicate over long distances. If we use those letters to train a text auto-completion model for smartphones today, users will be disappointed, because today we use some expressions or words less frequently. From a scientific modeling perspective, it is as if the data-generating distribution has changed and now assigns lower probability to those particular words or expressions. In the meantime, the language allows for the birth of new phrases and words, which is equivalent to an increase in their frequency of use. This birth and death
