Open Journal of Statistics, 2017, 7, 859-875
http://www.scirp.org/journal/ojs
ISSN Online: 2161-7198
ISSN Print: 2161-718X
DOI: 10.4236/ojs.2017.75061    Oct. 31, 2017
Using Boosted Regression Trees and Remotely
Sensed Data to Drive Decision-Making
Brigitte Colin, Samuel Clifford, Paul Wu, Samuel Rathmanner, Kerrie Mengersen
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
Abstract
Challenges in Big Data analysis arise due to the way the data are recorded,
maintained, processed and stored. We demonstrate that a hierarchical,
multivariate, statistical machine learning algorithm, namely Boosted Regression
Tree (BRT), can address Big Data challenges to drive decision making. The
challenge of this study is lack of interoperability since the data, a collection of
GIS shapefiles, remotely sensed imagery, and aggregated and interpolated
spatio-temporal information, are stored in monolithic hardware components.
For the modelling process, it was necessary to create one common input file.
By merging the data sources together, a structured but noisy input file,
showing inconsistencies and redundancies, was created. Here, it is shown that
BRT can process different data granularities, heterogeneous data and
missingness. In particular, BRT has the advantage of dealing with missing data
by default by allowing a split on whether or not a value is missing as well as
what the value is. Most importantly, the BRT offers a wide range of
possibilities regarding the interpretation of results and variable selection is
automatically performed by considering how frequently a variable is used to
define a split in the tree. A comparison with two similar regression models
(Random Forests and Least Absolute Shrinkage and Selection Operator, LASSO)
shows that BRT outperforms these in this instance. BRT can also be a starting
point for sophisticated hierarchical modelling in real world scenarios. For
example, a single or ensemble approach of BRT could be tested with existing
models in order to improve results for a wide range of data-driven decisions
and applications.
Keywords
Boosted Regression Trees, Remotely Sensed Data, Big Data Modelling Approach,
Missing Data
How to cite this paper: Colin, B., Clifford, S., Wu, P., Rathmanner, S. and
Mengersen, K. (2017) Using Boosted Regression Trees and Remotely Sensed Data to
Drive Decision-Making. Open Journal of Statistics, 7, 859-875.
https://doi.org/10.4236/ojs.2017.75061
Received: September 27, 2017
Accepted: October 28, 2017
Published: October 31, 2017
Copyright © 2017 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International
License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Open Access
1. Background
Data are typically stored in various ways and formats, mostly in monolithic
software architectures which do not allow for interoperability. Analysis of
data across multiple data sources is thus difficult, since the functionality of
each data source with respect to input and output, maintenance, data
processing, error handling and user interface is all interwoven rather than
provided by architecturally separate components. In order to create a basis for
analysing the data considered here, the datasets had to be extracted from their
original databases and combined to form a common input file for the modelling
process. Inevitably, this resulted in a data file structure which showed
missing data, inconsistencies, duplicates and redundancies.
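The merging step described above can be illustrated with a minimal sketch. It assumes pandas is available, and the two small tables, their column names (`paddock_id`, `month`, `head_count`, `green_cover_pct`) and their values are hypothetical stand-ins, not data or tooling from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical extracts from two separate systems, keyed by paddock and month.
observations = pd.DataFrame({
    "paddock_id": ["P1", "P1", "P2", "P3"],
    "month": ["2015-01", "2015-01", "2015-01", "2015-01"],
    "head_count": [120, 120, 80, 45],          # the P1 record appears twice
})
remote_sensing = pd.DataFrame({
    "paddock_id": ["P1", "P2", "P4"],
    "month": ["2015-01", "2015-01", "2015-01"],
    "green_cover_pct": [35.0, np.nan, 52.0],   # e.g. an unusable pixel -> NaN
})

# Drop exact duplicates, then outer-join so that neither source loses records.
merged = (
    observations.drop_duplicates()
    .merge(remote_sensing, on=["paddock_id", "month"], how="outer", indicator=True)
)

# Share of missing values per column in the common input file.
missing_share = merged.drop(columns="_merge").isna().mean()

print(merged)
print(missing_share)
```

The outer join keeps paddocks that appear in only one source, which is one way missing values enter the common file; the `_merge` indicator column records which source each row came from.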
A case study is presented here to examine land use data sourced from a
Geographic Information System (GIS), direct observations from an agricultural
company, and remotely sensed data. The data were extracted from a relational
database, Excel spreadsheets, remotely sensed imagery stored as raster data,
vector data from the GIS, data observed and measured directly in real time, and
interpolated data. By combining these data sources to form one common basis for
our analysis, issues of data volume, variety and veracity were encountered. Big
Data research clearly deals with issues beyond volume and belongs not only to
the ongoing digital revolution, but to the scientific revolution as well. The
question posed of Big Data, and illustrated in the case study presented here, is
whether new knowledge can be extracted from various data sources that have not
been analysed in combination before, and can thus assist in better and more
confident decision making.
2. Introduction
There is an exponential increase in interest in the use of digital data to improve
decision making in a range of areas such as human systems, urban environments,
agriculture and national security. For example, decisions in the agricultural
domain may require information based on vegetation or land use change,
estimation of crops or biomass, distribution of native or exotic species, livestock
or weed assessment and so on. One source of digital data that has generated
intense interest over the past decades is remotely sensed imagery. These data are
available from a wide range of sources, ranging from satellites to drones, and
have been used for a very wide range of environmental applications [1]-[8].
The availability and resolution of these data, combined with improved computer
storage and data management facilities, have greatly increased the opportunity
for mathematicians and statisticians to utilise this information in their
models and analyses. The challenge in linking remotely sensed data to
decision-making is that there are multiple steps in the process. Here, we focus
on an exemplar real-world problem in the livestock industry: deciding on the
allocation of animals to different paddocks and potentially different grazing
properties based on the predicted availability of grass over the year. This
problem arose in
the context of collaboration between statisticians at the Queensland University
of Technology and a large livestock organisation in Australia. The specific aim of
the project was to develop an ensemble of models to predict the carrying
capacity, that is, the number of animals that can be sustained on a paddock. In
order to achieve this goal we utilised remote sensing data and supporting
information about climate and paddock characteristics. Further, it was important
to present the results in a form that is useful for the agricultural decision
makers.
Difficult or challenging decisions demand thorough consideration, and even
then they imply uncertainty, complexity and different levels of risk. Making the
right decisions at the right time can lead to success, increased profit or
minimised risk. It is thus important that thoughtful consideration is put into
each decision. Figure 1 demonstrates the workflow following a Big Data approach
for our case study. Here, we use structured but heterogeneous data sources that
showed characteristics like missing data, noise and redundancies. All the data
sources were used to create a BRT model via an ensemble approach. The resulting
model and its output serve as a foundation for better decision making. The
steps involved in the process are depicted in Figure 1. Due to commercial
confidentiality concerns, the final results of the modelling workflow are not
presented here.
In this article we focus on one component of the ensemble modelling approach
employed in the project, namely the use of BRT to estimate so-called
animal equivalents per paddock. Since calves, cows and bulls of different ages
consume different amounts of grass, these animals are standardised to a
reference animal which can then be used as a common response variable in the
analysis. An interesting conundrum is that one of the major inputs into such a
model is the amount of grass, or more generally the biomass, in a paddock. This
can potentially be estimated directly from remote sensing, but is confounded by
the fact that animals are on the paddock eating the very thing that is being
measured by the sensor. Moreover, the decision maker may be interested in the
biomass estimates themselves, either directly via the remotely sensed
measurements or indirectly via the animal equivalents based on animal weight and
a metabolic formula.
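The excerpt does not state which metabolic formula was used to standardise animals. As an illustration only, the sketch below assumes a Kleiber-type metabolic weight (live weight raised to the power 0.75) and a hypothetical 450 kg reference animal. Both constants and both function names are assumptions, not values from the paper.

```python
REFERENCE_WEIGHT_KG = 450.0   # assumed reference animal, not from the paper
METABOLIC_EXPONENT = 0.75     # assumed Kleiber-type scaling exponent


def animal_equivalent(live_weight_kg: float) -> float:
    """Express one animal as a multiple of the assumed reference animal."""
    if live_weight_kg <= 0:
        raise ValueError("live weight must be positive")
    return (live_weight_kg / REFERENCE_WEIGHT_KG) ** METABOLIC_EXPONENT


def paddock_animal_equivalents(weights_kg) -> float:
    """Sum the animal equivalents of every animal on a paddock."""
    return sum(animal_equivalent(w) for w in weights_kg)


# Hypothetical herd: two reference-sized cows, one calf, one bull.
herd_weights_kg = [450.0, 450.0, 200.0, 700.0]
total_equivalents = paddock_animal_equivalents(herd_weights_kg)
print(round(total_equivalents, 2))
```

Under this assumption a 200 kg calf counts for more than its share of body weight, because metabolic demand grows more slowly than weight.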
BRT is a popular statistical and machine learning approach that has not yet
seen much application in the analysis of remotely sensed data. Indeed, although
BRTs were first defined two decades ago, they have only recently been extended
to deal with the types of features that are characteristic of remotely sensed
data, in particular their spatial and temporal dynamics. Most of the activity
around the use of BRT for agricultural and environmental applications does not
appear in the mainstream mathematical and statistical literature.
Figure 1. Modelling process for case study.
2.1. Case Study
The study area is located in the Northern Territory, Australia. The main climate
zone is identified as grassland with hot dry summers and mild winters [9]. It is a
heterogeneous region with complex topography, land cover and grassland types.
Identification, differentiation and quantitative estimation of biomass are of
primary interest in this case study. A range of data from different sources
was required for this problem. In this section, we describe the information
derived from Landsat imagery and comment briefly on other data. The reflectance
recorded by the Landsat sensor is stored as an 8-bit value, resulting in a scale
of 256 different grey values ranging from black (0, maximum absorption) to white
(255, maximum reflection). The electronically recorded data appear as an array
of numbers in digital format. In addition to the 8-bit quantisation, Landsat
offers several spectral bands in the visible and infrared parts of the
electromagnetic spectrum, and each individual pixel shows different values
across different bands. This means that each pixel carries one value per band
and is therefore represented differently in each spectral band. Raster data are
becoming increasingly common and increasingly large in volume, although it is
possible to reduce file size with compression functions.
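The array structure described above can be sketched with NumPy. The scene below is random data with a hypothetical shape of 6 bands and 4 × 5 pixels; it only illustrates the 8-bit, band-by-row-by-column layout, not a real Landsat product or a radiometric calibration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_bands, n_rows, n_cols = 6, 4, 5           # hypothetical scene dimensions

# 8-bit digital numbers: 0 (maximum absorption) to 255 (maximum reflection).
scene = rng.integers(0, 256, size=(n_bands, n_rows, n_cols), dtype=np.uint8)

# A single pixel is a vector holding one digital number per spectral band.
pixel_spectrum = scene[:, 2, 3]

# Rescale to the unit interval for modelling (a plain rescaling only).
scaled = scene.astype(np.float32) / 255.0

print(scene.dtype, scene.shape, scene.nbytes, "bytes")
print(pixel_spectrum)
```

At one byte per value, storage grows with bands × rows × columns, which is why full scenes become large in volume.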
There is a strong advantage in using remotely sensed Landsat imagery and
applied spectroscopy for these types of analyses because the data are freely
available, the imagery covers a wide geographical range, and it avoids expensive,
extensive and often impractical in-situ measurement. However, the trade-off is
in resolution: in-situ measurements provide highly localised accuracy whereas a
pixel in a Landsat image covers an area of 30 × 30 meters. It is noted that other
satellites are now able to provide higher resolution, but these are not yet freely
available for the areas of interest in this case study.
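The resolution trade-off can be put in numbers with a short calculation. The pixel size of 30 m comes from the text; the 50 km² paddock and the function name are hypothetical and chosen only to show the scale involved.

```python
PIXEL_SIDE_M = 30.0                          # Landsat pixel side, from the text
pixel_area_m2 = PIXEL_SIDE_M ** 2            # ground area of one pixel
pixel_area_ha = pixel_area_m2 / 10_000       # the same area in hectares


def pixels_covering(area_km2: float) -> int:
    """Approximate number of 30 m pixels needed to cover an area."""
    return round(area_km2 * 1_000_000 / pixel_area_m2)


paddock_km2 = 50.0                           # hypothetical paddock size
print(pixel_area_m2, pixel_area_ha, pixels_covering(paddock_km2))
```

Each pixel therefore averages reflectance over 900 square meters, while a single paddock of this size is still described by tens of thousands of pixels.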
Estimation of biomass using satellite data is of ongoing global interest. Grass
biomass estimation is challenging since the phenological growing cycle of
naturally existing grass is a dynamic process influenced by many complex
parameters, including grass type, soil, climate, topography and land use. With
the spectral information of remotely sensed imagery it is possible to detect
green vegetation, which is driven by the photosynthetic biochemical process of
grass biomass. However, since raster imagery is only a two dimensional
representation of the land cover it is difficult to derive the quantity of the
vertical grass biomass directly.
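The excerpt does not name a specific greenness index. One widely used option, shown here purely as an assumed example, is the Normalised Difference Vegetation Index, (NIR - red) / (NIR + red), computed per pixel from the red and near-infrared bands. The band values below are hypothetical.

```python
import numpy as np


def ndvi(red, nir):
    """Per-pixel NDVI; pixels where both bands are zero are returned as NaN."""
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    total = nir + red
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


# Hypothetical 2 x 2 patches of 8-bit red and near-infrared digital numbers.
red_band = np.array([[30, 80], [0, 120]], dtype=np.uint8)
nir_band = np.array([[150, 90], [0, 60]], dtype=np.uint8)

index = ndvi(red_band, nir_band)
print(index)
```

Converting to floating point before subtracting avoids wrap-around in unsigned 8-bit arithmetic. High values suggest green vegetation, but as the text notes, such an index describes cover rather than the vertical quantity of biomass.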
Fractional cover [10] data are often available as derived products. For example,
Geoscience Australia (GA) provides an Australian Reflectance Grid 25 (ARG25)
product, which gives a 25 meter scale fractional cover representation of
underlying vegetation across Australia, and TERN AusCover provides a 30 meter
resolution product from Landsat 5 and 7 covering the temporal extent 2000-2011.
Fractional cover unmixing algorithms use the spectral reflectance of a Landsat
scene for a pixel to break it into three fractions represented as percentage
values. These are photosynthetic vegetation (includes leaves and grass),
non-photosynthetic vege-