1.2.2 Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.AbstractClassifier
class. This, in turn, implements weka.classifiers.Classifier. Surprisingly
little is needed for a basic classifier: a routine which generates a classifier model
from a training dataset (= buildClassifier) and another routine which evaluates
the generated model on an unseen test dataset (= classifyInstance), or
generates a probability distribution for all classes (= distributionForInstance).
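As a minimal sketch of this API (the class name ClassifierDemo is just a
placeholder; weather.arff ships with the WEKA distribution and is assumed to
sit in the current directory):

import weka.classifiers.Classifier;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierDemo {
  public static void main(String[] args) throws Exception {
    // load the dataset and declare the last attribute to be the class
    Instances data = DataSource.read("weather.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // build a model from the training data
    Classifier cls = new ZeroR();
    cls.buildClassifier(data);

    // query the model for a single (here: the first) instance
    double pred = cls.classifyInstance(data.instance(0));
    double[] dist = cls.distributionForInstance(data.instance(0));
    System.out.println("predicted: " + data.classAttribute().value((int) pred));
    System.out.println("distribution: " + java.util.Arrays.toString(dist));
  }
}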
A classifier model is an arbitrarily complex mapping from all but one of the
dataset's attributes to the class attribute. The specific form and creation of
this mapping, or model, differs from classifier to classifier. For example,
ZeroR's (= weka.classifiers.rules.ZeroR) model just consists of a single value:
the most common class, or the mean of all numeric values in case of predicting
a numeric value (= regression learning). ZeroR is a trivial classifier, but it
gives a lower bound on the performance achievable on a given dataset, which
more complex classifiers should improve on significantly. As such it is a
reasonable test of how well the class can be predicted without considering the
other attributes.
Later, we will explain how to interpret the output from classifiers in detail.
For now, just focus on the Correctly Classified Instances figure in the
Stratified cross-validation section of the output and notice how it improves
from ZeroR to J48:
java weka.classifiers.rules.ZeroR -t weather.arff
java weka.classifiers.trees.J48 -t weather.arff
There are various approaches to determining the performance of classifiers.
Most simply, performance can be measured by counting the proportion of
correctly predicted examples in an unseen test dataset. This value is the
accuracy, which is also 1 - ErrorRate; both terms are used in the literature.
The simplest case uses a training set and a test set which are mutually
independent. This is referred to as a hold-out estimate. To estimate the
variance of these performance estimates, hold-out estimates may be computed by
repeatedly resampling the same dataset, i.e. randomly reordering it and then
splitting it into training and test sets with a specific proportion of the
examples, collecting all estimates on test data, and computing the average and
standard deviation of the accuracy.
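The following sketch computes one such hold-out estimate with the WEKA API;
the 66% split proportion, the seed and the class name HoldOut are arbitrary
choices for illustration:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldOut {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.arff");
    data.setClassIndex(data.numAttributes() - 1);
    data.randomize(new Random(42)); // randomly reorder the examples

    // split into 66% training and 34% test data
    int trainSize = (int) Math.round(data.numInstances() * 0.66);
    Instances train = new Instances(data, 0, trainSize);
    Instances test = new Instances(data, trainSize,
        data.numInstances() - trainSize);

    J48 cls = new J48();
    cls.buildClassifier(train); // train on the training split only

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(cls, test); // evaluate on the unseen test split
    System.out.println("accuracy: " + eval.pctCorrect() + " %");
  }
}

Wrapping the randomize/split/evaluate steps in a loop with different seeds and
collecting the pctCorrect() values then gives the average and standard
deviation described above.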
A more elaborate method is cross-validation. Here, a number of folds n is
specified. The dataset is randomly reordered and then split into n folds of
equal size. In each iteration, one fold is used for testing and the other n-1
folds are used for training the classifier. The test results are collected and
averaged over all folds. This gives the cross-validation estimate of the
accuracy. The folds can be purely random or slightly modified to create the
same class distribution in each fold as in the complete dataset. In the latter
case the cross-validation is called stratified. Leave-one-out (= loo)
cross-validation signifies that n is equal to the number of examples. Out of
necessity, loo cv has to be non-stratified: a test set consisting of a single
example cannot reflect the class distribution of the training data. Therefore
loo cv tends to give less reliable results. However, it is still quite useful
in dealing with small datasets, since it utilizes the greatest amount of
training data from the dataset.
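As a sketch, the same kind of cross-validation estimate that the command line
prints can be obtained via WEKA's Evaluation class; the 10 folds and the seed
are arbitrary choices, and WEKA stratifies the folds when the class is nominal:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // 10-fold cross-validation; each instance is tested exactly once
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(new J48(), data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}

In principle, passing data.numInstances() as the fold count yields the
leave-one-out estimate discussed above.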