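Time Dependency and the Value of Data: The Timeliness Challenge in AI Products
"This article examines how time affects the value of data in AI and machine learning, including the non-stationarity of data and the depreciation of its value."
In the research paper "Time and the Value of Data", the authors study the role that time-dependent data plays in improving the quality of AI-based products and services. Time dependency is an important characteristic of data: as time passes, data gradually loses its relevance to a given problem, which can degrade algorithm performance and in turn reduce business value. The paper proposes a model that represents time dependency as a change in probability distribution, and it reports several counterintuitive findings.
First, the paper proves theoretically that even an infinite amount of data accumulated over time may be limited in its ability to predict the future. This implies that an algorithm trained on a dataset of bounded size may reach a similar level of performance. Moreover, enlarging a dataset with older data is not always beneficial; it can hurt a firm's competitiveness, because older data may lower the algorithm's performance.
The paper then examines how data volume affects a firm's competitive advantage. Because of time dependency, the barrier to entry created by data volume is weakened: a competitor with a modest amount of recent data may achieve performance comparable to that obtained from a large stock of historical data. This calls into question the importance of first-mover advantage in AI-based markets. In an empirical experiment on a next-word prediction task, the authors measure the loss of value in text data; the result shows that 100MB of text data is, after seven years, about as effective for the prediction task as 50MB of current data.
These findings have far-reaching implications for the economics of AI and machine learning, and they underline the importance of data freshness and timeliness for maintaining algorithm efficiency and business value. Firms need to pay attention to the non-stationarity of data and to understand its perishability and value depreciation in order to design effective data management and update strategies that maintain and improve the performance of their AI solutions. What matters is not only the quantity of data but also its quality and its time dimension.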
Harvard Business School Working Paper, No. 21-016
Data can be acquired in many ways. Firms may purchase data through an intermediary, or they may
gather datasets organically over time from interactions with their users. With both methods,
privacy concerns may prevent users from sharing data. [10] models an intermediary that
acquires data from users and sells it to firms, and uses this model to investigate data externality and
privacy. [2] discusses similar issues for organic data generation and claims that more data lowers the privacy
barrier and motivates more users to share their data. This argument, together with the data network effect
arguments above, suggests a significant growth rate in the size of a firm's data repository. [9, 20, 21]
propose growth models for data in firms and the economy and address questions about the firm's growth
process.
Our paper studies the effectiveness of curated datasets and hence is not concerned with data solicitation
or the growth of the firm's data repository. We argue that curating big datasets and blindly using them may
not always give a firm a significant advantage, and it may even put a firm in a disadvantageous position.
Our arguments thus question universal assumptions about the value of data for a firm and how it may
change the modes of competition. More precisely, we investigate whether and how curating large datasets can
create barriers to entry and deter threats from entrants.
[17,18,33,34,38] discuss the implications of AI and, more precisely, of data for competition. Regarding data
in particular, most debates concern its volume and whether it creates a competitive advantage. Some of
these studies focus on antitrust issues and the potential role that data plays in creating a winner-take-all
(monopoly) situation; [31,39] are examples. Furthermore, there is research on how data
can improve the prediction quality of services with respect to either the degree of personalization [25,39]
or adjacent products [7]. This research has direct strategic implications for how firms compete
on growing the user base.
We believe that data characteristics play a crucial role in the value creation cycle and the modes of
competition. For example, non-rivalry and exclusivity of a dataset to a firm can prevent other players from
obtaining it, which in turn puts the owner in a superior position. Under exclusivity, data becomes an asset
that behaves like the supply of a physical good. Biases create a harmful environment for both the company
and its users; [16] discusses algorithmic fairness and potential biases. Paying closer
attention to dataset characteristics, we can see that time-dependency and perishability are similar since the
sampling time determines them both. It is of great interest to see how dependency on time changes the
strength of data externalities and influences the value creation cycle.
Electronic copy available at: https://ssrn.com/abstract=3680910
Perhaps the closest research to ours is [14], where the authors investigate the effect of historical search data
on the quality of search results. They found little empirical evidence that old data improves the quality
of search engine results. Also, [6] questions the economies of scale that data provides for specific
problems. They suggest a value model for data with diminishing returns to scale and argue that increasing data
volume in advertising applications does not improve service quality. The results in these papers
support our findings on the effectiveness of perishable data. We believe that both the search engine and
advertising businesses use time-sensitive data and hence face significant time dependency. The data
therefore loses its effectiveness quickly, which contributes to the lack of significant improvement in
prediction quality.
2. Background and Framework
In this section, we introduce our approach to the problem and explain why we make particular
choices. We first describe time-dependency as a shift in distribution over time and examine its causes. We
argue that time-dependency is mostly due to exceptional reasons that most often cause a monotonic decrease
in the value of data over time. We then briefly introduce machine learning and explain why we
focus on maximum likelihood estimation (MLE) of the probability distribution. In particular, we introduce a
decomposition of the MLE objective function to lay the groundwork for the next section's propositions.
Finally, we formalize the notion of effectiveness and define the substitution gain curve as a ratio of
two effectiveness quantities: one from the past and the other from the substitute time.
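As an informal illustration of this framing, the sketch below fits a categorical distribution by maximum likelihood on samples drawn at two different times and scores both fits against the current distribution. The decomposition it checks is the standard identity cross-entropy = entropy + KL divergence, which may differ from the decomposition the paper actually uses; that one is not shown in this excerpt. The word distributions, sample sizes, and function names are all hypothetical.

```python
import math
import random
from collections import Counter


def mle_categorical(samples, support, smoothing=1e-9):
    """MLE of a categorical distribution, i.e. relative frequencies.

    The tiny additive constant is an assumption of this sketch; it only
    keeps log(0) from occurring for symbols that were never sampled.
    """
    counts = Counter(samples)
    total = len(samples) + smoothing * len(support)
    return {x: (counts[x] + smoothing) / total for x in support}


def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def cross_entropy(p, q):
    """Expected negative log-likelihood of model q under true distribution p."""
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)


def kl_divergence(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def draw(p, n, rng):
    return rng.choices(list(p), weights=list(p.values()), k=n)


# Hypothetical word-usage distributions at an old and a current sampling time.
support = ["letter", "telegram", "phone", "text"]
p_old = {"letter": 0.50, "telegram": 0.30, "phone": 0.15, "text": 0.05}
p_new = {"letter": 0.05, "telegram": 0.01, "phone": 0.34, "text": 0.60}

rng = random.Random(0)
q_old = mle_categorical(draw(p_old, 20000, rng), support)  # large but old
q_new = mle_categorical(draw(p_new, 2000, rng), support)   # small but current

# Loss of each fitted model when predicting data from the current time.
loss_old = cross_entropy(p_new, q_old)
loss_new = cross_entropy(p_new, q_new)

# Standard identity: cross-entropy = irreducible entropy + KL gap.
gap_old = kl_divergence(p_new, q_old)
identity_error = abs(loss_old - (entropy(p_new) + gap_old))
```

In this toy setting the model fitted on ten times more, but older, samples has the larger loss on current data, and the excess over the entropy of `p_new` is exactly the KL term. Collecting more old samples shrinks the estimation error of `q_old` but cannot remove the gap between `p_old` and `p_new`.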
2.1. Change in distribution
Time-dependency has many causes, among them changes in consumers' tastes
and behavior. If we compare the best-selling music albums of the 80s with the best sellers
of 2020, we can see the difference in taste. Innovation is another cause of change: nowadays
we witness continuous innovation and considerable variation in the product and service space. The telegram
seems ancient these days, and so will landline telephones soon. Because of these differences, it is not
easy to translate the environment of one time into that of another. Perhaps the way people communicate
is the best place to observe such changes. Hundreds of years ago, people used letters and the post to
communicate over long distances. If we used those letters to train a text auto-completion model for smartphones
today, users would be disappointed, because today we use some of their expressions and words less frequently.
From a scientific modeling perspective, it is as if the data-generating distribution has changed and now assigns
lower probability to those particular words or expressions. In the meantime, the language allows for the birth
of new phrases and words, which is equivalent to an increase in their frequency of use. This birth and death