1.2.2 Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class. Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier) and another routine which evaluates the generated model on an unseen test dataset (= classifyInstance), or generates a probability distribution for all classes (= distributionForInstance).
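In Java terms, a minimal subclass might look like the following sketch. It assumes the WEKA 3.x API, where weka.classifiers.Classifier is the abstract superclass (newer releases use weka.classifiers.AbstractClassifier instead); the class name MyClassifier and the placeholder logic are purely illustrative.

    import weka.classifiers.Classifier;
    import weka.core.Instance;
    import weka.core.Instances;

    public class MyClassifier extends Classifier {

      // Learn whatever the model needs from the training data.
      public void buildClassifier(Instances train) throws Exception {
        // ... estimate the model from train ...
      }

      // Predict a single class value: the index of a nominal class
      // value, or the predicted number in the regression case.
      public double classifyInstance(Instance instance) throws Exception {
        return 0.0; // placeholder prediction
      }

      // Alternatively, return a probability for each class; the
      // superclass derives either method from the other, so overriding
      // one of the two is sufficient.
      public double[] distributionForInstance(Instance instance) throws Exception {
        double[] dist = new double[instance.numClasses()];
        dist[0] = 1.0; // placeholder: all mass on the first class
        return dist;
      }
    }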
A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The specific form and creation of this mapping, or model, differs from classifier to classifier. For example, ZeroR's (= weka.classifiers.rules.ZeroR) model just consists of a single value: the most common class, or the mean of all numeric values in the case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance on a given dataset which should be significantly improved by more complex classifiers. As such it is a reasonable test of how well the class can be predicted without considering the other attributes.
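The same trivial model can also be built programmatically. The following sketch, which assumes weather.arff resides in the current directory, loads the weather data, builds ZeroR on it and prints the resulting single-value model:

    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ZeroRDemo {
      public static void main(String[] args) throws Exception {
        // Load the dataset and declare the last attribute as the class.
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Build the trivial model and print it (the most common class).
        ZeroR zeroR = new ZeroR();
        zeroR.buildClassifier(data);
        System.out.println(zeroR);
      }
    }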
Later, we will explain how to interpret the output from classifiers in detail – for now, just focus on the Correctly Classified Instances figure in the Stratified cross-validation section of the output and notice how it improves from ZeroR to J48:
java weka.classifiers.rules.ZeroR -t weather.arff
java weka.classifiers.trees.J48 -t weather.arff
There are various approaches to determining the performance of classifiers. Most simply, performance can be measured as the proportion of correctly predicted examples in an unseen test dataset. This value is the accuracy, which equals 1 - ErrorRate; both terms are used in the literature.
The simplest case is using a training set and a test set which are mutually independent. This is referred to as a hold-out estimate. To estimate the variance in these performance estimates, hold-out estimates may be computed by repeatedly resampling the same dataset, i.e. randomly reordering it and then splitting it into training and test sets with a specific proportion of the examples, collecting all estimates on the test data, and computing the average and standard deviation of the accuracy.
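A repeated hold-out estimate of this kind could be written with the WEKA API roughly as follows; the 2/3 training proportion and the ten repetitions are arbitrary choices for illustration.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    public class HoldOutDemo {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        for (int run = 0; run < 10; run++) {
          // Randomly reorder a copy of the data, then split 2/3 : 1/3.
          Instances copy = new Instances(data);
          copy.randomize(new Random(run));
          int trainSize = (int) (copy.numInstances() * 2.0 / 3.0);
          Instances train = new Instances(copy, 0, trainSize);
          Instances test = new Instances(copy, trainSize,
              copy.numInstances() - trainSize);

          // Train on the first part, evaluate on the held-out rest.
          J48 tree = new J48();
          tree.buildClassifier(train);
          Evaluation eval = new Evaluation(train);
          eval.evaluateModel(tree, test);
          System.out.println("Run " + run + ": "
              + eval.pctCorrect() + "% correct");
        }
      }
    }

Collecting the ten percentages and computing their mean and standard deviation then gives the repeated hold-out estimate described above.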
A more elaborate method is cross-validation. Here, a number of folds n is specified. The dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds. This gives the cross-validation estimate of the accuracy. The folds can be purely random or slightly modified so that each fold has the same class distribution as the complete dataset. In the latter case the cross-validation is called stratified. Leave-one-out (= loo) cross-validation signifies that n is equal to the number of examples. Out of necessity, loo cross-validation has to be non-stratified: with only a single example in each test set, the class distribution in the test set cannot match that in the training data. Therefore loo cross-validation tends to give less reliable results. However, it is still quite useful in dealing with small datasets, since it utilizes the greatest amount of training data from the dataset.
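WEKA's Evaluation class implements this procedure directly; for a nominal class it stratifies the folds. A brief sketch, with 10 folds chosen for illustration (setting the fold count to data.numInstances() would give leave-one-out):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    public class CrossValidationDemo {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Stratified 10-fold cross-validation of J48 on the weather data.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
      }
    }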