Agile Machine Learning: Building Effective Data Product Teams
"Agile Machine Learning.pdf" is a book about applying agile development methods to build and manage effective machine learning teams. The authors draw on the core principles of the Agile Manifesto to provide guidance for solving novel data problems at scale in production environments, with the goal of helping readers create better data products. In the book, authors Eric Carter and Matthew Hurst discuss how to build a data engineering team centered on metrics, experiments, and data, and how to make well-informed implementation and model-exploration decisions grounded in data and metrics. They emphasize the importance of analyzing data collectively in real time and the value of being able to objectively assess the current state of a project. The book also discusses data literacy, a key quality of a reliable data engineer, including an understanding of definitions and expectations.

The book is aimed at people who manage machine learning teams or who are responsible for building production-grade inference components. It is also useful for anyone responsible for data project workflows, including sampling, labeling, training, testing, improving, and maintaining models, as well as system and data metrics. Readers are expected to have a software engineering background, an understanding of machine learning fundamentals, and experience working with data.

By reading this book, readers will learn:
1. How to manage a data engineering team effectively and keep it focused on key metrics, experiments, and data.
2. How to make implementation and model-exploration decisions based on data and metrics.
3. The importance of analyzing data collectively in real time to improve team collaboration and understanding.
4. How to objectively measure the team's current state in order to continuously optimize and adjust.
5. How to improve data literacy and understand the knowledge and skills a data engineer should have.

"Agile Machine Learning" is a practice-oriented guide that helps teams adapt and deliver high-quality machine learning solutions in rapidly changing environments by applying ideas from agile development to machine learning projects.
© Eric Carter, Matthew Hurst 2019
E. Carter and M. Hurst, Agile Machine Learning, https://doi.org/10.1007/978-1-4842-5107-2_1
CHAPTER 1
Early Delivery
Our highest priority is to satisfy the customer through early and continuous
delivery of valuable [data].
—agilemanifesto.org/principles
Data projects, unlike traditional software engineering projects, are almost entirely governed by a resource fraught with unknown patterns, distributions, and biases – the data. To successfully execute a project that delivers value through inference over data sets, a novel set of skills, processes, and best practices needs to be adopted. In this chapter,
we look at the initial stages of a project and how we can make meaningful progress
through these unknowns while engaging the customer and continuously improving our
understanding of the data, its value, and the implications it holds for system design.
To get started, let's take a look at a scenario, drawn from the authors' experience working on Microsoft's Bing search engine, that introduces the early stages of a project involving mining local business data from the Web. There are millions of local business locations in the United States. Approximately 50%¹ of these maintain some form of web site, whether in the form of a simple, one-page design hosted by a drag-and-drop web hosting service or a sophisticated multi-brand site developed and maintained by a dedicated web team. The majority of these businesses update their sites before any other representation of the business data, driven by an economic incentive to ensure that their customers can find authoritative information about them.² For example, if their phone number is incorrect, then potential customers will not be able to reach them; if they move and their address is not updated, then they risk losing existing clients; if their business hours change with the seasons, then customers may turn away.

¹ Based on analysis of local business feed data.
² A survey of businesses showed that about 70% updated their web sites first and then other channels such as social media or Search Engine Optimization (SEO) representatives.
A local search engine is only as good as its data. Inaccurate or missing data cannot be improved by a pretty interface. Our team wanted to go to the source of the data – the business web site – to get direct access to the authority on the business. As an aggregator, we wanted our data to be as good as the data the business itself presented on the Web. In addition, we wanted a machine-oriented strategy that could compete with the high-scale, crowd-sourced methods that competitors benefitted from. Our vision was to build an entity extraction system that could ingest web sites and produce structured information describing the businesses presented on the sites.
Extraction projects like this require a schema and some notion of quality to deliver a viable product,³ both determined by the customer – that is, the main consumer of the data. Our goal was to provide additional data to an existing system which already ingested several feeds of local data, combining them to produce a final conflated output. With an existing product in place, the schema was predetermined, and the quality of the current data was a natural lower bound on the required quality. The schema included core attributes: business name, address, phone number, as well as extended attributes including business hours, latitude and longitude, menu (for restaurants), and so on. Quality was determined in terms of errors in these fields.

³ Be extremely wary of projects that haven't, won't, or can't define a desired output. If you find yourself in the vicinity of such a project, run away – or at least make it the first priority to determine exactly what the project is supposed to produce.
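The schema above is essentially a record type with required core fields and optional extended fields. A minimal sketch in Python (the class and field names below are our own illustration, not the schema actually used by the team) might look like this:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocalBusiness:
        # Core attributes: expected by the downstream conflation system.
        name: str
        address: str
        phone: str
        # Extended attributes: populated when the source site provides them.
        hours: Optional[dict] = None       # e.g., {"mon": "9:00-17:00"}
        latitude: Optional[float] = None
        longitude: Optional[float] = None
        menu_url: Optional[str] = None     # restaurants only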
The Metric Is the Customer The first big shift in going from traditional agile software projects to data projects is that much of the role of the customer is shifted to the metric measured for the data. The customer, or product owner, certainly sets things rolling and works in collaboration with the development team to establish and agree to the evaluation process. The evaluation process acts as an oracle for the development team to guide investments and demonstrate progress.

Just as the customer-facing metric is used to guide the project and communicate progress, any component being developed by the team can be driven by metrics. Establishing internal metrics provides an efficient way for teams to iterate on the inner loop⁴ (generally not observed by the customer). An inner metric will guide some area of work that is intended to contribute to progress of the customer-facing metric.

⁴ An "inner loop" is the high-frequency cycle of work that developers carry out when iterating on a task, bringing it to completion. It is a metaphorical reference to the innermost loop in a block of iterative code.
A metric requires a data set in an agreed-upon schema derived from a sampling
process over the target input population, an evaluation function (that takes
data instances and produces some form of score), and an aggregation function
(taking all of the instance results and producing some overall score). Each of
these components is discussed and agreed upon by the stakeholders. Note that
you will want to distinguish metrics of quality (i.e., how correct is the data) from
metrics of impact or value (i.e., what is the benefit to the product that is using the
data). You can produce plenty of high-quality data, but if it is not in some way an
improvement on an existing approach, then it may not have any actual impact.
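To make the three ingredients concrete, here is a minimal sketch in Python. The function names (evaluate_instance, aggregate) and the simple field-equality scoring are hypothetical illustrations, not the evaluation process the team agreed on; a real metric would be negotiated with the customer and would typically treat fields and error types differently.

    def evaluate_instance(extracted, truth):
        """Score one extracted record against its labeled truth (1.0 = all core fields match)."""
        fields = ["name", "address", "phone"]
        correct = sum(1 for f in fields if extracted.get(f) == truth.get(f))
        return correct / len(fields)

    def aggregate(instance_scores):
        """Combine per-instance scores into a single overall score."""
        return sum(instance_scores) / len(instance_scores) if instance_scores else 0.0

    # The evaluation data set is sampled from the target input population and labeled.
    evaluation_set = [
        ({"name": "Al's Diner", "address": "1 Main St", "phone": "555-0100"},   # extracted
         {"name": "Al's Diner", "address": "1 Main St", "phone": "555-0199"}),  # truth
    ]
    scores = [evaluate_instance(e, t) for e, t in evaluation_set]
    print(aggregate(scores))  # 0.666... : two of the three core fields are correct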
Getting Started
We began as a small team of two (armed with some solid data engineering skills) with
one simple goal – to drive out as many unknowns and assumptions as possible in the
shortest amount of time. To this end, we maximized the use of existing components to
get data flowing end to end as quickly as possible.
Inference We use the term “inference” to describe any sort of data
transformation that goes beyond simple data manipulation and requires some form
of conceptual modeling and reasoning, including the following techniques:
– Classification: Determining into which bucket to place a piece of data
– Extraction: Recognizing and normalizing information present in a document
– Regression: Predicting a scalar value from a set of inputs
– Logical reasoning: Deriving new information based on existing data and rules
Structural transformations of data (e.g., joining tables in a database) are not
included as inference, though they may be necessary components of the systems
that we describe.
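As a concrete illustration of the extraction bucket in the context of this project, the following hypothetical Python sketch recognizes a US-style phone number in raw HTML and normalizes it. It only illustrates what recognizing and normalizing information in a document means; it is far simpler than any production extractor.

    import re

    PHONE_PATTERN = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")

    def extract_phone(html):
        """Recognize a US-style phone number in raw HTML and normalize its format."""
        match = PHONE_PATTERN.search(html)
        if match is None:
            return None  # nothing recognized on this page
        area, prefix, line = match.groups()
        return f"({area}) {prefix}-{line}"

    print(extract_phone("<p>Call us at 425.555.0123 for reservations.</p>"))
    # prints: (425) 555-0123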
Adopting a strategy of finding technology rather than inventing technology allowed
us to build something in a matter of days that would quickly determine the potential of
the approach, as well as identify where critical investments were needed. We quickly
learned that studying the design of an existing system is a valuable investment in
learning how to build the next version.
But, before writing a single line of code, we needed to look at the data. Reviewing a uniform sample of web sites associated with businesses, we discovered the following:
– Most business web sites are small, with ten pages or fewer.
– Most sites use static web content – that is to say, all the information is present in the HTML rather than being dynamically fetched and rendered at the moment the visitor arrives at the site.
– Sites often have a page with contact information on it, though it is
common for this information to be present in some form on many
pages, and occasionally there is no single page which includes all
desired information.
– Many businesses have a related page on social platforms (Facebook,
Instagram, Twitter), and a minority of them only have a social
presence.
These valuable insights, which took a day to derive, allowed us to make quick,
broad decisions regarding our initial implementation. In hindsight, we recognized some important oversights; for example, we had not distinguished between (large) chain businesses and the smaller "singleton" businesses. From a search perspective, chain data is of higher
value. While chains represent a minority of actual businesses, they are perhaps the
most important data segment because users tend to search for chain businesses the
most. Chains tend to have more sophisticated sites, often requiring more sophisticated
extraction technology. Extracting an address from plain HTML is far easier than
extracting a set of entities dynamically placed on a map as the result of a zip code search.
Every Task Starts with Data Developers can gain insights into the fundamentals of a domain by looking at a small sample of data (less than 100). If there is some aspect of the data that dominates the space, it is generally easy to identify. Best practices for reviewing data include randomizing your data (this helps to remove bias from your observations) and viewing the data in as native a form as possible (ideally seeing data in a form equivalent to how the machinery will view it; viewing should not transform it). To the extent possible, ensure that you are looking at data from production⁵ (this is the only way to ensure that you have the chance to see issues in your end-to-end pipeline).

⁵ "Production" refers to your production environment – where your product is running and generating and processing data.
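A minimal sketch of the randomize-your-data advice, assuming the candidate sites are listed one URL per line in a text file (the file name and sample size below are made up for illustration): draw a uniform random sample and review those pages, rather than the first N rows, which are rarely representative.

    import random

    def uniform_sample(path, k=50, seed=7):
        """Draw k records uniformly at random from a file with one record (URL) per line."""
        with open(path, encoding="utf-8") as f:
            records = [line.strip() for line in f if line.strip()]
        rng = random.Random(seed)  # fixed seed keeps the review set reproducible
        return rng.sample(records, min(k, len(records)))

    for url in uniform_sample("business_site_urls.txt"):
        print(url)  # review each page in as native a form as possible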
With our early, broad understanding of the data, we rapidly began building out an
initial system. We took a pragmatic approach to architecture and infrastructure. We
used existing infrastructure that was built for a large number of different information
processing pipelines, and we adopted a simple sequential pipeline architecture that
allowed us to build a small number of stages connected by a simple data schema.
Specifically, we used Bing’s cloud computation platform which is designed to run
scripted processes that follow the MapReduce pattern on large quantities of data.
We made no assumptions that the problem could best be delivered with this generic
architecture, or that the infrastructure and the paradigms of computation that it
supported were perfectly adapted to the problem space. The only requirement was that
it was available and capable of running some processes at reasonable scale and would
allow developers to iterate rapidly for the initial phase of the project.
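A simple sequential pipeline of this kind can be pictured as a list of stage functions that each consume and produce records sharing one schema. The following Python sketch is our own illustration (not Bing's platform or its APIs); the stage names and record fields are hypothetical placeholders.

    from typing import Callable, Dict, Iterable, List

    Record = Dict[str, object]
    Stage = Callable[[Iterable[Record]], Iterable[Record]]

    def fetch_pages(sites: Iterable[Record]) -> Iterable[Record]:
        # Placeholder stage: attach raw HTML to each site record (the real fetch is omitted).
        for site in sites:
            yield {**site, "html": "<html>...</html>"}

    def extract_entities(pages: Iterable[Record]) -> Iterable[Record]:
        # Placeholder stage: run extraction over the HTML and emit structured fields.
        for page in pages:
            yield {"url": page["url"], "name": None, "phone": None}

    def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
        """Apply each stage in order; the shared record schema is the only coupling between stages."""
        for stage in stages:
            records = stage(records)
        return list(records)

    print(run_pipeline([{"url": "http://example.com"}], [fetch_pages, extract_entities]))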
Bias to Action In general, taking some action (reviewing data, building a
prototype, running an experiment) will always produce some information that is
useful for the team to make progress. This contrasts with debating options, being
intuitive about data, assuming that something is “obvious” about a data set, and
so on. This principle, however, cannot be applied recklessly – the action itself must
be well defined with a clear termination point and ideally a statement of how the
product will be used to move things forward.