than an array of booleans.) In practice, N will be chosen to be a
power of two, allowing the modulus operation (line 9) to be
replaced by a simple bitwise AND.
If a more compact representation of the resulting feature vector is
required, a quick, linear scan of the boolean array can be used to
insert each occurring word hash into a hash table or into an array
containing the word hashes found. In many cases, though,
features are extracted in order to be used immediately for some
task, and in such cases the boolean array representation is often
sufficient.
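The registration and compaction steps described above can be sketched in Java as follows; the class and method names are illustrative, not taken from the paper's pseudo-code, and the mask trick assumes N is a power of two as suggested.

```java
// Sketch: register word hashes in a boolean array whose size is a power
// of two, so the modulus reduces to a bitwise AND, then optionally
// compact the array into a list of distinct feature ids by linear scan.
import java.util.ArrayList;
import java.util.List;

public class FeatureBits {
    static final int N = 1 << 20;        // power of two, as suggested
    static final int MASK = N - 1;       // h & MASK replaces h % N
    final boolean[] bits = new boolean[N];

    void register(int wordHash) {
        bits[wordHash & MASK] = true;    // AND also clears the sign bits
    }

    // Optional compaction: linear scan collecting the set feature ids.
    List<Integer> compact() {
        List<Integer> features = new ArrayList<>();
        for (int i = 0; i < N; i++)
            if (bits[i]) features.add(i);
        return features;
    }
}
```

The AND with `N - 1` also maps negative 32-bit hashes into range, so no separate absolute-value step is needed.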
A reasonably large bits array of size N = 2^20 is suggested. Larger
values of N result in a greater memory demand momentarily for
the bits array, whereas smaller values of N effectively restrict the
size of the hash space, conflating features that have different 32-
bit integer hash values together into a smaller space. Using N = 2^20
is easily sufficient for corpora or training sets with hundreds of
thousands of unique terms. For perspective, a corpus of 804,414
Reuters news articles contains 47,236 unique words, and a corpus
of 62,369 Arxiv.org astrophysics papers contains 99,757 unique
words [8]. Regarding the memory demands, most compilers will
allocate one byte per boolean for good performance, rather than
packing them tightly into bits. Thus, for N = 2^20 the size of this
bits table is 1 MB, which is a fraction of the L2 cache for modern
CPU chips. If one instead needs to track word counts for the
bag-of-words representation, the same array can be used to hold
8-bit or 16-bit word frequency counters.
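The counter variant can be sketched as below; the saturating increment is an assumption added here to guard against overflow of the 16-bit counters, not a detail from the paper.

```java
// Sketch: tracking bag-of-words counts with 16-bit counters instead of
// booleans. For N = 2^20 this array occupies 2 MB. The saturation at
// Short.MAX_VALUE is an illustrative overflow guard.
public class FeatureCounts {
    static final int N = 1 << 20;
    static final int MASK = N - 1;
    final short[] counts = new short[N];

    void register(int wordHash) {
        int i = wordHash & MASK;
        if (counts[i] != Short.MAX_VALUE) counts[i]++;  // saturating add
    }
}
```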
The pre-computed code table actually serves four different
functions simultaneously: word character determination,
lowercasing, hash randomization, and, as will be introduced later
in Section 5.2, Unicode processing. Table 3 shows pseudo-code
to prepare the table appropriate for 8-bit character streams. For
any character that is not a word character, the table contains zero.
For word characters, it maps the character to its lowercase
equivalent and maps the result to a distinct fixed random integer
code. This randomization substantially improves the hashing
function over just using the lowercase character code directly, as
the Java String.hashCode() method does. (For reference, Java
Strings use multiplicative hashing [10], computing their hashes by
iterating hash = 31*hash + char over their characters).
It should be noted that while in our tests our word characters were
letters and numbers and our normalization is lowercasing, this is all
embodied in the creation of the code table. Any other set of rules
could be used by providing a different table. The only restriction
is that the mapped and normalized character (if any) be decidable
from a single character. This can be relaxed, but the mechanisms
for doing so are beyond the scope of this paper.
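A minimal Java rendering of Table 3 might look like the following; the use of `java.util.Random` with a fixed seed and of `Character.isLetterOrDigit` as the word-character test are assumptions for illustration, since the paper only requires some fixed random integers and some word-character rule.

```java
// Sketch of Table 3's prepTable() for 8-bit characters. The seed is
// arbitrary but must stay constant so that indexing and later querying
// agree on the same codes.
import java.util.Random;

public class CodeTable {
    final int[] codetable = new int[256];

    CodeTable() {
        Random rng = new Random(42);               // fixed seed -> fixed codes
        int[] rand = new int[256];                 // 256 fixed random integers
        for (int i = 0; i < 256; i++) rand[i] = rng.nextInt();
        for (int ch = 0; ch < 256; ch++) {
            codetable[ch] = Character.isLetterOrDigit(ch)
                    ? rand[Character.toLowerCase(ch)]  // lowercase, then randomize
                    : 0;                               // non-word character
        }
    }
}
```

Because upper- and lowercase forms index the same `rand` entry, case folding falls out of the table lookup for free.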
Our hashing function, which we call Mapped Additive Shift
Hashing (MASH), consists of a simple arithmetic right shift, a
table lookup (from a very small table), and an addition. As such,
it is very efficient and easy for a compiler to optimize. Although
the rightward direction of the bit-shift operation (line 6) is
somewhat non-intuitive, it actually produces substantially better
hashes than using a left-shift. The reason is that we end up using
only the low order bits of the resulting word hash (line 9), and if a
left-shift were used instead, then these lowest order bits would be
largely unaffected by the first characters of the word, especially
for a longer word. (In the case in which the modulus is a power of
two for efficiency, they would be completely unaffected by
them.) If the rightmost ten bits are kept, then only the final ten
characters can contribute to the hash. By using a right-shift, on
the other hand, and by ensuring that each character contributes 32
bits to the value, a character’s influence continues (in a degrading
fashion) for at least 31 following characters. Even for the
occasional word longer than 32 characters, note that we use a
sign-extended right-shift, so that the parity of initial letters
continues to have some effect on the hash. Of course, this
influence does diminish, and so MASH is certainly not suitable
for hashing long texts, which is the goal of typical hash function
designs such as SHA1, MD5 and Rabin hashing, which are each
much more involved to compute.
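The per-character step can be sketched as follows. The one-bit shift is inferred from the text's claim that a character's influence persists for at least 31 following characters; the loop structure around it (detecting word boundaries where the code table yields zero) is a reconstruction from the description rather than a copy of Table 2.

```java
// Sketch of the MASH inner loop: a sign-extended arithmetic right shift,
// a table lookup, and an addition per character. A zero code marks a
// non-word character, i.e. a word boundary.
public class Mash {
    static final int[] codetable = new int[256]; // prepared as in Table 3

    static void extract(byte[] text, boolean[] bits, int mask) {
        int hash = 0;
        for (byte b : text) {
            int code = codetable[b & 0xFF];      // 0xFF handles bytes >= 128
            if (code != 0) {
                hash = (hash >> 1) + code;       // sign-extended right shift
            } else if (hash != 0) {
                bits[hash & mask] = true;        // register completed word
                hash = 0;
            }
        }
        if (hash != 0) bits[hash & mask] = true; // register final word
    }
}
```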
If one wishes to also produce features for each phrase bi-gram,
then at the moment when the hash is registered (line 9), we can
also register a phrase hash consisting of the current hash
combined with the previous word hash, using a symmetric
combination (such as XOR or addition) if it is desired to have
features that are independent of the order of the words or an
asymmetric combination (such as subtraction or XOR with a
shifted value) if order is to be preserved. Working in integer
space, such combination is much more efficient than working in
string space, in which typically a new string must be constructed
(or at least a new heap-based object that refers to the two strings).
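The two combination styles can be sketched as pure integer functions; the shift amount in the ordered variant is an arbitrary illustrative choice.

```java
// Sketch of phrase bi-gram hash combination in integer space. XOR is
// symmetric, so word order does not matter; XOR with a shifted copy of
// the previous hash is asymmetric, so order is preserved.
public class Bigrams {
    static int unordered(int prevHash, int hash) {
        return prevHash ^ hash;            // symmetric: order-independent
    }

    static int ordered(int prevHash, int hash) {
        return (prevHash << 5) ^ hash;     // asymmetric: order-sensitive
    }
}
```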
2.3 Use in Full-Text Indexing
So far the discussion has primarily focused on the use case of
extracting a bag or set of words, which is needed for use in
classification or clustering applications. But the speed benefits of
our method can also be leveraged for the widespread use case of
full-text indexing of large document repositories (either large
documents or many of them or both). In this case, the indexing
engine needs to extract the sequence of words/terms in a text, and
build reverse indices of them for later information retrieval. The
key difference for our features would be that the term features are
integers rather than Strings. When later a query is submitted, the
words that the user enters are submitted to the exact same hash
processing, yielding the same integer hashes that the text has
previously been indexed by. Note that for this usage, one can use
the full 32-bit integer hash space (i.e. without the modulo N and
the bit array on lines 1 and 9 of Table 2).
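The indexing use case can be sketched as below; the posting-list structure is illustrative only, standing in for whatever inverted-index machinery an engine actually uses, with the sole point being that keys are 32-bit integers rather than Strings.

```java
// Sketch of an inverted index keyed by the full 32-bit word hash. A
// query word is run through the same hash processing at lookup time,
// so it retrieves the postings recorded at indexing time.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndex {
    final Map<Integer, List<Integer>> postings = new HashMap<>();

    void index(int docId, int wordHash) {
        postings.computeIfAbsent(wordHash, k -> new ArrayList<>()).add(docId);
    }

    List<Integer> lookup(int queryWordHash) {
        return postings.getOrDefault(queryWordHash, Collections.emptyList());
    }
}
```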
3. EXPERIMENTS
In this section we present a series of experiments to evaluate the
speed and other qualities of our method compared with various
baselines. In all cases our compiler is Java JDK 1.6. Where
speed measurements are concerned, the measurements were
performed on a 2.4 GHz Intel® Xeon™ CPU, which is a 64-bit
architecture, running a version of Linux derived from Red Hat
version 4. Measuring elapsed time for performance can be quite
tricky, since other processes and threads regularly interrupt
processing for uncontrolled amounts of time, occasionally
slowing down the overall or average measurement substantially.
For this reason, we measure the throughput for each test over 50
separate iterations and use the measurement that shows the best
throughput.

Table 3. Code table pre-compilation for SpeedyFx.
prepTable():
1  int rand[256] = 256 fixed random integers;
2  foreach (ch: 0..255) {
3    codetable[ch] = isWord(ch) ?
4      rand[toLowerCase(ch)] : 0;
5  }
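The best-of-N measurement scheme can be sketched as a small harness; the method name and the bytes-per-second unit are illustrative choices, not details from the paper.

```java
// Sketch of best-of-N timing: repeat the workload and keep the
// iteration with the highest throughput, discarding iterations
// perturbed by OS scheduling or other processes.
public class BestOfTimer {
    static double bestThroughput(Runnable work, long bytesPerRun, int iters) {
        double best = 0;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            work.run();
            long dt = System.nanoTime() - t0;
            double throughput = bytesPerRun / (dt / 1e9);  // bytes per second
            if (throughput > best) best = throughput;
        }
        return best;
    }
}
```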