Abstract—The amount of data in our industry and the world is
exploding. Data is being collected and stored at unprecedented
rates. The challenge is not only to store and manage the vast
volume of data (“big data”), but also to analyze and extract
meaningful value from it. There are several approaches to
collecting, storing, processing, and analyzing big data. The main
focus of the paper is on unstructured data analysis. Unstructured
data refers to information that either does not have a pre-defined
data model or does not fit well into relational tables. Unstructured
data is the fastest-growing type of data; examples include
imagery, sensor readings, telemetry, video, documents, log
files, and e-mail files. There are several techniques to
address this problem space of unstructured analytics. These
techniques share the common characteristics of scale-out,
elasticity, and high availability.
MapReduce, in conjunction with the Hadoop Distributed File
System (HDFS) and the HBase database as part of the Apache
Hadoop project, is a modern approach to analyzing
unstructured data.
Hadoop clusters are an effective means of processing massive
volumes of data, and can be improved with the right architectural
approach.
TABLE OF CONTENTS
1. INTRODUCTION
2. SQL BASED AND NOSQL DATABASES
3. UNSTRUCTURED ANALYTICS
4. BENCHMARKS
5. PERFORMANCE CONSIDERATIONS
6. CAPACITY PLANNING CONSIDERATIONS
7. SUMMARY
REFERENCES
BIOGRAPHY
1. INTRODUCTION
As data volumes grow across the industry, new techniques
and approaches must be adopted. This paper focuses on the
unstructured aspects of data analytics and reviews, as a case
study, some of the key projects in Apache Hadoop. This
paper also describes the fundamentals of relational database
management systems (RDBMS) and their use for traditional
big data sets in data warehousing, decision support, and
analytics. The paper then reviews non-relational big data
approaches such as distributed/shared-nothing architectures,
horizontal scaling, key/value stores, and eventual
consistency. This part of the paper differentiates between
structured and unstructured data. The paper then describes
the building blocks and techniques of MapReduce, HDFS,
and HBase, and their implementation in the open-source
Hadoop framework. This paper focuses on the infrastructure
planning (compute, network, and storage systems), and
reviews Hadoop design criteria and implementation
considerations. Hadoop includes many technologies,
including MapReduce, which interact with the infrastructure
elements while analyzing data. This paper reviews
performance considerations and describes relevant
benchmarks with a Hadoop analytics cluster. In conclusion,
the paper reviews the alternatives to hosting an analytics
cluster in a public cloud.
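Before turning to the database approaches, it is useful to see the shape of the programming model the Hadoop case study is built around. Below is a minimal, single-process sketch of MapReduce (word count); the function and variable names are illustrative only and are not the Hadoop API, which distributes these phases across a cluster.

```python
# A single-process sketch of the MapReduce programming model (word
# count). In Hadoop, the map and reduce phases run in parallel across
# many nodes; here they run sequentially for clarity.
from collections import defaultdict

def map_phase(document):
    # Emit a (key, value) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return (key, sum(values))

documents = ["big data big value", "data at scale"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'value': 1, 'at': 1, 'scale': 1}
```

The value of this model is that `map_phase` and `reduce_phase` are stateless per key, so the framework can scale them out horizontally over the infrastructure elements discussed later in the paper.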
2. SQL BASED AND NOSQL DATABASES
The traditional method of managing structured data includes
a relational database and schema to manage the storage and
retrieval of the dataset. For managing large datasets in a
structured fashion, the primary approaches are data
warehouses and data marts. A data warehouse is a relational
database system used for storing, analyzing, and reporting
functions. The data mart is the layer used to access the data
warehouse. The data stored in the warehouse is sourced
from the operational systems. The data is cleaned,
transformed, catalogued, and made available for data mining
and online analytical processing (OLAP). Data warehouses
and marts are SQL (Structured Query Language) based
database systems. The two main approaches to storing data
in a data warehouse are the following:
• Dimensional: Data are partitioned into “fact” tables, which
generally hold transaction data, and “dimension” tables,
which hold the reference information that gives context to
the facts.
• Normalized: The tables are grouped together by subject
areas that reflect data categories, such as data on products,
customers, and so on. The normalized structure divides data
into entities, each of which becomes one or more tables in a
relational database.
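The dimensional approach above can be sketched with plain Python dictionaries standing in for tables; the table and column names (`dim_product`, `fact_sales`, and so on) are hypothetical, chosen only to illustrate the fact/dimension split.

```python
# A minimal sketch of a dimensional ("star schema") layout.
# Table and column names are illustrative, not from any real schema.

# Dimension table: reference data that gives context to the facts.
dim_product = {
    1: {"name": "sensor-A", "category": "telemetry"},
    2: {"name": "camera-B", "category": "imagery"},
}

# Fact table: transaction data, linked to dimensions by product_id.
fact_sales = [
    {"product_id": 1, "quantity": 10, "amount": 250.0},
    {"product_id": 2, "quantity": 3, "amount": 900.0},
    {"product_id": 1, "quantity": 5, "amount": 125.0},
]

# A typical analytical query: total amount per product category,
# resolved by joining each fact row to its dimension row.
totals = {}
for fact in fact_sales:
    category = dim_product[fact["product_id"]]["category"]
    totals[category] = totals.get(category, 0.0) + fact["amount"]

print(totals)  # {'telemetry': 375.0, 'imagery': 900.0}
```

In a real warehouse the same join is expressed in SQL over fact and dimension tables; the point here is only that facts stay narrow and high-volume while dimensions stay small and descriptive.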
There are several approaches adopted by NoSQL (Not Only
SQL) for storing and managing unstructured data, also
referred to as “non-relational data”. These systems, which
are sometimes also called “key-value stores”, share the
goals of massive scaling “on demand” (elasticity), data
model flexibility, and simplified application development
and deployment. NoSQL databases separate data
management from data storage, whereas relational databases
attempt to satisfy both concerns in a single system. One of the