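Data Science with SQL Server 2016 in Practice

"Data Science with SQL Server" is a book for data scientists and technical professionals that discusses how to use Microsoft SQL Server for data science work. SQL Server is not only a database management system; it also offers advanced analytics, most notably through SQL Server R Services, which lets the RDBMS and the R language interact directly while keeping the data secure. The book has three main parts. The first introduces SQL Server R Services and basic SQL Server concepts as a getting-started guide. The second explains how data scientists work in this new environment, helping those who usually work in "information silos" fit better into the business development process. The third uses hands-on examples to show how SQL Server R Services can solve real-world problems, and readers can follow along with each chapter. The target audience is data scientists who are familiar with R, but because the book briefly introduces data science and R and points to learning resources, it also suits database administrators, developers, and other data professionals. The book does not cover all of SQL Server, but it provides references and explains some basic concepts to help data scientists who are unfamiliar with SQL Server. The authors are Buck Woody, Danielle Dean, Debraj GuhaThakurta, Gagan Bansal, Matt Conners, and Wee-Hyong Tok, all of whom have extensive experience in data science. The book is published by Microsoft Press and follows its usual standards of quality, which supports the professionalism and practicality of the content. Note that information in the book, such as URLs and references to websites, may change as technology evolves. The examples in the book are for illustration only, and all people and situations depicted are fictitious. With this book, readers can gain a deeper understanding of how to do data science in a SQL Server environment, improve their data processing and analysis skills, and learn how to combine R with SQL Server to make data-driven decisions in an enterprise setting.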

5 CHAPTER 1 | Using this book

• Multi-join operations
• Recursive SELECT statements
• Grouping, combining, and consolidating Data Manipulation Language (DML) statements
• SQL Server architecture and general operation

There are many courses you can take for SQL in general, and T-SQL specifically. Here are a few:

• Learn SQL is a great site to get started with general SQL: http://www.sqlcourse.com/
• Codecademy is another great place to get started: https://www.codecademy.com/learn/learn-sql
• To learn the basics of the T-SQL dialect, try this resource: http://www.tsql.info/
• Microsoft has a tutorial on getting started with T-SQL: https://msdn.microsoft.com/en-us/library/ms365303.aspx

Next, you’ll need to understand SQL Server’s architecture and features. For that, use the information in Books Online at https://msdn.microsoft.com/library/ms130214.aspx.

Step three: the R programming language and environment

R is a language and platform used to work with data, most often by using statistical methods. It’s very mature and is used by many data professionals around the world. It’s extended with “packages,” which are collections of code that you can reference by using dot notation and function calls.

If you know SQL, T-SQL, or a scripting language like Windows PowerShell, you’ll be familiar with the basic structure of an R program. It’s an interpreted language, and one of the interesting things about the way it works is in how it stores computational data. When you work with R, everything is stored in an ordered collection called a vector. This is both a strength and a weakness of the R system, one that Microsoft addresses with its enhancements to the R platform.

To learn more about R, you have a very wide array (pun intended) of choices:

• There’s a full course you can take on R at DataCamp: https://www.datacamp.com/
• The primary resource you can use for learning R on SQL Server is here: https://msdn.microsoft.com/library/mt674876.aspx
• And you can find tutorials on R for SQL Server here: https://msdn.microsoft.com/library/mt591993.aspx

You can also find out more about data science and working with R at my blog, which you can view at https://buckwoody.wordpress.com/. You’ll find a rich list of resources there to help you continue in your learning journey. If you want to go further and learn more about data science, check out https://buckwoody.wordpress.com/2015/09/16/the-amateur-data-science-body-of-knowledge/.

Now, on to R with SQL Server…

6 CHAPTER 2 | Microsoft SQL Server R Services

CHAPTER 2

Microsoft SQL Server R Services

This chapter presents an overview of SQL Server R Services, how it works, and where you can get it. We also show you how to make your solutions operational and where you can learn more about R on SQL Server.

The advantages of R on SQL Server

In a 2011 study, Erik Brynjolfsson of the Massachusetts Institute of Technology Sloan School of Management showed a link between the use of data-driven decision making and higher firm performance. Organizations are relying more and more on data interpretation in their operations. And much of that data lives in Relational Database Management Systems (RDBMS) like Microsoft SQL Server.

R has long been a popular data-processing language. It has thousands of external packages, is relatively easy to read and understand, and has rich data-processing features. R is used in thousands of organizations around the world by data-analysis professionals.

Note If you’re not familiar with R, check out the resources provided in Chapter 1.

A statistical programmer versed in R often accesses data stored in a database by using a package that calls the Open Database Connectivity (ODBC) Application Programming Interface (API), which serves as a conduit to the RDBMS to retrieve data. R then receives that data as a data.frame object. The results of the analysis are either pushed back across the network to the RDBMS, or the data professional saves them locally in tabular or other form. Using this approach, all of the processing of the data happens locally, with the exception of the SQL statement used to gather the initial set of data. Data is rarely sent back to the RDBMS; it is most often a receive operation.

The Structured Query Language (SQL) is another data-processing language designed specifically for working within an RDBMS. Its roots involve relational algebra and relational calculus, and it is used in multiple database systems. Most vendors extend the basic SQL constructs to take advantage of the platform they run on, and in the case of Microsoft SQL Server, this dialect is called Transact-SQL (T-SQL). T-SQL is used to query, update, and delete data, along with many other functions.

For the 2011 study, see http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1819486

In both R and T-SQL, the developer types commands in a step-wise fashion in an editor window or at a command-line interface (CLI). But the path of operations is different from that point on. R is an interpreted language, which means a set of binaries local to the command environment processes the operations and returns the result directly to the calling program. In SQL Server, the client is separate from the processing engine. The installation of SQL Server listens on a network interface, and the client software puts the commands on the network path in a particular protocol. The server accepts this packet with the T-SQL statements only if the packet is “well formed.” The commands are run on the server, and the results, along with any messages the server sends (such as the number of rows) and any error messages, are returned to the client over the same protocol. The primary load in this approach is on the server rather than the workstation. Of course, the workstation might then further process the data (using Java, C#, or some other local language), but often the business logic is done at the server level, with its security, performance, and other advantages and controls.

But SQL Server is much more than just a data store. It’s a rich ecosystem of services, tools, and an advanced language to deal with data of almost any shape and massive size. Many organizations store the vast majority of their actionable data within SQL Server by using custom and commercial software. It has more than 36 data types, and gives you the ability to define more.

SQL Server also has fine-grained security features. When these are applied, the data professional can simply query the data, and only the allowed datasets are returned. This facilitates good separation of duties, which is highly important in large, complex systems for which one group of professionals might handle the security of data, and another handles the querying and processing of the data.

SQL Server also has advanced performance features, such as a column-based (columnstore) index, which can provide extremely fast search and query functions over very large sets of data.

Using R on SQL Server combines the power of the R language (and its many packages) and the advantages of the SQL Server platform by placing the computation where the data lives. This means that you aren’t moving the data to the R system, with the networking, memory on two systems, CPU power on each side, and other costs that this involves. Instead, the code operates on the same system as the application data.

Combining R and SQL Server means that the R environment gains not only the functions and features in the R language, but also the ecosystem, security, and performance of SQL Server, as well as increased scale. And using R directly on SQL Server means that the R code can save the results of the operation to a new or existing table for other queries to access and update.
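The contrast between moving data to the code and running the computation next to the data can be sketched in a few lines. This is only an illustration under loud assumptions: an in-memory SQLite database stands in for SQL Server, a plain SQL aggregate stands in for R code running inside the server, and the `sales` and `sales_summary` tables are hypothetical. It is not how SQL Server R Services is implemented.

```python
import sqlite3
import statistics

# Assumption: in-memory SQLite stands in for the RDBMS; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0), ("south", 15.0), ("south", 25.0)],
)


def mean_by_region_client_side(connection):
    """Pull every row to the client, then compute locally (the ODBC-style pattern)."""
    by_region = {}
    for region, amount in connection.execute("SELECT region, amount FROM sales"):
        by_region.setdefault(region, []).append(amount)
    return {region: statistics.fmean(values) for region, values in by_region.items()}


def mean_by_region_in_database(connection):
    """Compute where the data lives and keep the result in a table for other queries."""
    connection.execute("DROP TABLE IF EXISTS sales_summary")
    connection.execute(
        "CREATE TABLE sales_summary AS "
        "SELECT region, AVG(amount) AS mean_amount FROM sales GROUP BY region"
    )
    return dict(connection.execute("SELECT region, mean_amount FROM sales_summary"))


client_result = mean_by_region_client_side(conn)
database_result = mean_by_region_in_database(conn)
print(client_result == database_result)
```

Both functions produce the same numbers. The difference is where the rows travel: the first copies every row to the client before any work happens, while the second returns only the per-region summary and leaves it in a table that later queries can read.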

A brief overview of the SQL Server R Services architecture

The native implementation of open-source R reads data into a data-frame structure, all of which is held in memory. This means that R is limited to working with data sizes that will fit into the RAM on the system that processes the data. Another limitation in R is within a few of the core packages that process certain algorithms, most notably those dealing with linear-regression math. These native calls can perform slowly.

SQL Server R Services

To address these limitations (and others), Microsoft R Server (MRS) brings several major enhancements to the R platform. Microsoft R Server is what is used in SQL Server R Services. The first enhancement is the ScaleR library, which allows MRS to “chunk” data stored on permanent storage in comma-separated-value files, databases, and many other data sources into manageable sets. The library also offers increased parallelization, which makes it possible for the R code to process data more efficiently.
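The idea of chunking can be illustrated with a small sketch. It is not ScaleR and makes no claim about how ScaleR works internally: the function names are hypothetical, and an in-memory `StringIO` stands in for a large CSV file on disk. It only shows how a statistic can be computed while holding one chunk of rows in memory at a time.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed CSV rows (hypothetical helper)."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(lines, column, chunk_size=2):
    """Compute the mean of a column while holding only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in read_chunks(lines, chunk_size):
        # Each chunk contributes a partial sum and a partial row count.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Assumption: StringIO stands in for a file too large to load at once.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
result = chunked_mean(sample, "amount", chunk_size=2)
print(result)
```

Because each chunk yields an independent partial sum and count, the per-chunk work could in principle be handed to separate workers and combined afterward, which is the kind of parallelization the text mentions.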

Microsoft R uses a binary storage format called XDF, which handles data frames in a more efficient pattern, allowing advantages such as appending data to the end of a file, and other performance improvements.

Another set of enhancements involves replacing some of the core calls to the math libraries in the open-source version of R with much higher-performance equivalents. Other enhancements involve extending the scaling features of R to distribute the workload across multiple servers.

R Server is available on multiple platforms, from Windows to Linux, and has multiple editions. Microsoft has also integrated the R Server code into its other platforms, including HDInsight (Hadoop) and, with the 2016 release, SQL Server. In this book, we’ll deal with the implementation in SQL Server 2016, called SQL Server R Services.

A SQL Server installation, called an instance, contains the binaries required to run the various RDBMS engine functions, Business Intelligence (BI) features, and other engines. To configure and secure the RDBMS environment, the instance also creates entries in an internal Windows database construct called the Windows Registry, along with a few SQL Server databases. The binaries run as Windows Services (equivalent to a Linux daemon), regardless of whether someone is signed in to the server. These Windows Services listen on networking ports for proper calls from client software.

In SQL Server 2016 and later, Microsoft combines the two environments by installing the Microsoft R Server binaries along with the SQL Server installation. Changes in the SQL Server base code allow the two environments to communicate securely in the same space and make it possible for the two services to be upgraded without affecting each other, within certain parameters. This architecture means that you have the purest possible form of both servers, while allowing SQL Server complete access to the R environment.

To use R code in this architecture, you must configure the SQL Server instance to enable the external scripts setting (which can be secured) so that the T-SQL code can make calls to the R Server. Data is passed as a data.frame object to the R code directly from SQL Server, and SQL Server interprets the results from the R code as a tabular or other format, depending on the data returned. In this manner, the T-SQL and R code can interoperate on the same data, all while using the features and functions in each language. Because the call stays within the constructs of SQL Server, the security and performance of that environment are maintained.

Preparing to use SQL Server R Services

After the installation and configuration of SQL Server R Services, you can begin to use your R code in two ways: by executing the code interactively, or, more commonly, by saving your R code within the body of a script that executes on SQL Server, called a stored procedure. The stored procedure can contain T-SQL and R code, and each can pass variables and data to the other. Before you can run your code, you’ll need to install SQL Server R Services.

Installing and configuring

You can install R Services on an initial installation of a SQL Server 2016 instance. You also can add R Services later by using the installation source. The installation or addition process will install the R server and client libraries onto the SQL Server.
