Agile Machine Learning: Building Effective Data Product Teams
"Agile Machine Learning.pdf" is a book about applying agile development methods to build and manage effective machine learning teams. The authors draw on the core principles of the Agile Manifesto to provide guidance for solving novel data problems at scale in production environments, with the goal of helping readers create better data products. In the book, authors Eric Carter and Matthew Hurst discuss how to build a data engineering team centered on metrics, experiments, and data, and how to make well-informed implementation and model-exploration decisions grounded in data and metrics. They emphasize the importance of analyzing data collectively in real time and the value of being able to objectively assess the current state of a project. The book also discusses data literacy, a key quality of a reliable data engineer, including an understanding of definitions and expectations.

The book is aimed at people who manage machine learning teams or who are responsible for building production-grade inference components. It is also useful for anyone responsible for data project workflows, including sampling, labeling, training, testing, improving, and maintaining models, as well as system and data metrics. Readers are expected to have a software engineering background, an understanding of machine learning fundamentals, and experience working with data.

By reading this book, readers will learn:
1. How to manage a data engineering team effectively and keep it focused on key metrics, experiments, and data.
2. How to make implementation and model-exploration decisions based on data and metrics.
3. The importance of analyzing data collectively in real time to improve team collaboration and understanding.
4. How to objectively measure the team's current state in order to continuously optimize and adjust.
5. How to improve data literacy and understand the knowledge and skills a data engineer should have.

"Agile Machine Learning" is a practice-oriented guide that helps teams adapt and deliver high-quality machine learning solutions in rapidly changing environments by applying ideas from agile development to machine learning projects.
© Eric Carter, Matthew Hurst 2019
E. Carter and M. Hurst, Agile Machine Learning, https://doi.org/10.1007/978-1-4842-5107-2_1
CHAPTER 1
Early Delivery
Our highest priority is to satisfy the customer through early and continuous
delivery of valuable [data].
—agilemanifesto.org/principles
Data projects, unlike traditional software engineering projects, are almost entirely governed by a resource fraught with unknown patterns, distributions, and biases – the data. To successfully execute a project that delivers value through inference over data sets, a novel set of skills, processes, and best practices needs to be adopted. In this chapter,
we look at the initial stages of a project and how we can make meaningful progress
through these unknowns while engaging the customer and continuously improving our
understanding of the data, its value, and the implications it holds for system design.
To get started, let's take a look at a scenario, drawn from the authors' experience working on Microsoft's Bing search engine, that introduces the early stages of a project involving mining local business data from the Web. There are millions of local business locations in the United States. Approximately 50%¹ of these maintain some form of web site, whether in the form of a simple, one-page design hosted by a drag-and-drop web hosting service or a sophisticated multi-brand site developed and maintained by a dedicated web team. The majority of these businesses update their sites before any other representation of the business data, driven by an economic incentive to ensure that their customers can find authoritative information about them.² For example, if their phone number is incorrect, then potential customers will not be able to reach them; if they move and their address is not updated, then they risk losing existing clients; if their business hours change with the seasons, then customers may turn away.

¹ Based on analysis of local business feed data.
² A survey of businesses showed that about 70% updated their web sites first and then other channels such as social media or Search Engine Optimization (SEO) representatives.
A local search engine is only as good as its data. Inaccurate or missing data cannot be improved by a pretty interface. Our team wanted to go to the source of the data – the business web site – to get direct access to the authority on the business. As an aggregator, we wanted our data to be as good as the data the business itself presented on the Web. In addition, we wanted a machine-oriented strategy that could compete with the high-scale, crowd-sourced methods that competitors benefitted from. Our vision was to build an entity extraction system that could ingest web sites and produce structured information describing the businesses presented on the sites.
Extraction projects like this require a schema and some notion of quality to deliver a viable product,³ both determined by the customer – that is, the main consumer of the data. Our goal was to provide additional data to an existing system which already ingested several feeds of local data, combining them to produce a final conflated output. With an existing product in place, the schema was predetermined, and the quality of the current data was a natural lower bound on the required quality. The schema included core attributes: business name, address, phone number, as well as extended attributes including business hours, latitude and longitude, menu (for restaurants), and so on. Quality was determined in terms of errors in these fields.

³ Be extremely wary of projects that haven't, won't, or can't define a desired output. If you find yourself in the vicinity of such a project, run away – or at least make it the first priority to determine exactly what the project is supposed to produce.
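The schema above is essentially a record type with required core fields and optional extended fields. A minimal sketch in Python (the class and field names below are our own illustration, not the schema actually used by the team) might look like this:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocalBusiness:
        # Core attributes: expected by the downstream conflation system.
        name: str
        address: str
        phone: str
        # Extended attributes: populated when the source site provides them.
        hours: Optional[dict] = None       # e.g., {"mon": "9:00-17:00"}
        latitude: Optional[float] = None
        longitude: Optional[float] = None
        menu_url: Optional[str] = None     # restaurants only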
The Metric Is the Customer The first big shift in going from traditional agile software projects to data projects is that much of the role of the customer is shifted to the metric measured for the data. The customer, or product owner, certainly sets things rolling and works in collaboration with the development team to establish and agree to the evaluation process. The evaluation process acts as an oracle for the development team to guide investments and demonstrate progress.

Just as the customer-facing metric is used to guide the project and communicate progress, any component being developed by the team can be driven by metrics. Establishing internal metrics provides an efficient way for teams to iterate on the inner loop⁴ (generally not observed by the customer). An inner metric will guide some area of work that is intended to contribute to progress of the customer-facing metric.

⁴ An "inner loop" is the high-frequency cycle of work that developers carry out when iterating on a task, bringing it to completion. It is a metaphorical reference to the innermost loop in a block of iterative code.
A metric requires a data set in an agreed-upon schema derived from a sampling
process over the target input population, an evaluation function (that takes
data instances and produces some form of score), and an aggregation function
(taking all of the instance results and producing some overall score). Each of
these components is discussed and agreed upon by the stakeholders. Note that
you will want to distinguish metrics of quality (i.e., how correct is the data) from
metrics of impact or value (i.e., what is the benefit to the product that is using the
data). You can produce plenty of high-quality data, but if it is not in some way an
improvement on an existing approach, then it may not have any actual impact.
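To make the three ingredients concrete, here is a minimal sketch in Python. The function names (evaluate_instance, aggregate) and the simple field-equality scoring are hypothetical illustrations, not the evaluation process the team agreed on; a real metric would be negotiated with the customer and would typically treat fields and error types differently.

    def evaluate_instance(extracted, truth):
        """Score one extracted record against its labeled truth (1.0 = all core fields match)."""
        fields = ["name", "address", "phone"]
        correct = sum(1 for f in fields if extracted.get(f) == truth.get(f))
        return correct / len(fields)

    def aggregate(instance_scores):
        """Combine per-instance scores into a single overall score."""
        return sum(instance_scores) / len(instance_scores) if instance_scores else 0.0

    # The evaluation data set is sampled from the target input population and labeled.
    evaluation_set = [
        ({"name": "Al's Diner", "address": "1 Main St", "phone": "555-0100"},   # extracted
         {"name": "Al's Diner", "address": "1 Main St", "phone": "555-0199"}),  # truth
    ]
    scores = [evaluate_instance(e, t) for e, t in evaluation_set]
    print(aggregate(scores))  # 0.666... : two of the three core fields are correct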
Getting Started
We began as a small team of two (armed with some solid data engineering skills) with
one simple goal – to drive out as many unknowns and assumptions as possible in the
shortest amount of time. To this end, we maximized the use of existing components to
get data flowing end to end as quickly as possible.
Inference We use the term “inference” to describe any sort of data
transformation that goes beyond simple data manipulation and requires some form
of conceptual modeling and reasoning, including the following techniques:
– Classification: Determining into which bucket to place a piece of data
– Extraction: Recognizing and normalizing information present in a document
– Regression: Predicting a scalar value from a set of inputs
– Logical reasoning: Deriving new information based on existing data and rules
Structural transformations of data (e.g., joining tables in a database) are not
included as inference, though they may be necessary components of the systems
that we describe.
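As a concrete illustration of the extraction bucket in the context of this project, the following hypothetical Python sketch recognizes a US-style phone number in raw HTML and normalizes it. It only illustrates what recognizing and normalizing information in a document means; it is far simpler than any production extractor.

    import re

    PHONE_PATTERN = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")

    def extract_phone(html):
        """Recognize a US-style phone number in raw HTML and normalize its format."""
        match = PHONE_PATTERN.search(html)
        if match is None:
            return None  # nothing recognized on this page
        area, prefix, line = match.groups()
        return f"({area}) {prefix}-{line}"

    print(extract_phone("<p>Call us at 425.555.0123 for reservations.</p>"))
    # prints: (425) 555-0123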
Adopting a strategy of finding technology rather than inventing technology allowed
us to build something in a matter of days that would quickly determine the potential of
the approach, as well as identify where critical investments were needed. We quickly
learned that studying the design of an existing system is a valuable investment in
learning how to build the next version.
But, before writing a single line of code, we needed to look at the data. Reviewing a uniform sample of web sites associated with businesses, we discovered the following:
– Most business web sites are small, with ten pages or fewer.
– Most sites use static web content – that is to say, all the information is present in the HTML rather than being dynamically fetched and rendered at the moment the visitor arrives at the site.
– Sites often have a page with contact information on it, though it is
common for this information to be present in some form on many
pages, and occasionally there is no single page which includes all
desired information.
– Many businesses have a related page on social platforms (Facebook,
Instagram, Twitter), and a minority of them only have a social
presence.
These valuable insights, which took a day to derive, allowed us to make quick,
broad decisions regarding our initial implementation. In hindsight, we recognized some important oversights; for example, we had not distinguished between (large) chain businesses and the smaller "singleton" businesses. From a search perspective, chain data is of higher
value. While chains represent a minority of actual businesses, they are perhaps the
most important data segment because users tend to search for chain businesses the
most. Chains tend to have more sophisticated sites, often requiring more sophisticated
extraction technology. Extracting an address from plain HTML is far easier than
extracting a set of entities dynamically placed on a map as the result of a zip code search.
Every Task Starts with Data Developers can gain insights into the fundamentals of a domain by looking at a small sample of data (less than 100). If there is some aspect of the data that dominates the space, it is generally easy to identify. Best practices for reviewing data include randomizing your data (this helps to remove bias from your observations) and viewing the data in as native a form as possible (ideally seeing data in a form equivalent to how the machinery will view it; viewing should not transform it). To the extent possible, ensure that you are looking at data from production⁵ (this is the only way to ensure that you have the chance to see issues in your end-to-end pipeline).

⁵ "Production" refers to your production environment – where your product is running and generating and processing data.
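A minimal sketch of the randomize-your-data advice, assuming the candidate sites are listed one URL per line in a text file (the file name and sample size below are made up for illustration): draw a uniform random sample and review those pages, rather than the first N rows, which are rarely representative.

    import random

    def uniform_sample(path, k=50, seed=7):
        """Draw k records uniformly at random from a file with one record (URL) per line."""
        with open(path, encoding="utf-8") as f:
            records = [line.strip() for line in f if line.strip()]
        rng = random.Random(seed)  # fixed seed keeps the review set reproducible
        return rng.sample(records, min(k, len(records)))

    for url in uniform_sample("business_site_urls.txt"):
        print(url)  # review each page in as native a form as possible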
With our early, broad understanding of the data, we rapidly began building out an
initial system. We took a pragmatic approach to architecture and infrastructure. We
used existing infrastructure that was built for a large number of different information
processing pipelines, and we adopted a simple sequential pipeline architecture that
allowed us to build a small number of stages connected by a simple data schema.
Specifically, we used Bing’s cloud computation platform which is designed to run
scripted processes that follow the MapReduce pattern on large quantities of data.
We made no assumptions that the problem could best be delivered with this generic
architecture, or that the infrastructure and the paradigms of computation that it
supported were perfectly adapted to the problem space. The only requirement was that
it was available and capable of running some processes at reasonable scale and would
allow developers to iterate rapidly for the initial phase of the project.
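A simple sequential pipeline of this kind can be pictured as a list of stage functions that each consume and produce records sharing one schema. The following Python sketch is our own illustration (not Bing's platform or its APIs); the stage names and record fields are hypothetical placeholders.

    from typing import Callable, Dict, Iterable, List

    Record = Dict[str, object]
    Stage = Callable[[Iterable[Record]], Iterable[Record]]

    def fetch_pages(sites: Iterable[Record]) -> Iterable[Record]:
        # Placeholder stage: attach raw HTML to each site record (the real fetch is omitted).
        for site in sites:
            yield {**site, "html": "<html>...</html>"}

    def extract_entities(pages: Iterable[Record]) -> Iterable[Record]:
        # Placeholder stage: run extraction over the HTML and emit structured fields.
        for page in pages:
            yield {"url": page["url"], "name": None, "phone": None}

    def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
        """Apply each stage in order; the shared record schema is the only coupling between stages."""
        for stage in stages:
            records = stage(records)
        return list(records)

    print(run_pipeline([{"url": "http://example.com"}], [fetch_pages, extract_entities]))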
Bias to Action In general, taking some action (reviewing data, building a
prototype, running an experiment) will always produce some information that is
useful for the team to make progress. This contrasts with debating options, being
intuitive about data, assuming that something is “obvious” about a data set, and
so on. This principle, however, cannot be applied recklessly – the action itself must
be well defined with a clear termination point and ideally a statement of how the
product will be used to move things forward.