examples to familiarize yourself with its use (e.g. load and explore the weather data,
provided with the installation).
2. Create a string data file in ARFF format (see the description of the ARFF format
at http://www.cs.waikato.ac.nz/~ml/weka/arff.html). Follow the directions below:
First concatenate all text documents obtained in the data collection step into a
single text file (the text corpus), in which each document occupies exactly one line
of plain text. For example, this can be done by loading all text files in MS Word and
then saving the result in plain text format without line breaks. Other editors may be
used for this purpose too. Students with programming experience may want to write
a program to automate this step.
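For those who prefer to automate the concatenation, the following is a minimal
Python sketch, not a prescribed solution. It assumes the collected documents are
stored as UTF-8 .txt files in one folder; the function names flatten_document and
build_corpus are hypothetical and chosen only for this illustration.

```python
from pathlib import Path


def flatten_document(text: str) -> str:
    """Collapse all whitespace, including line breaks, into single spaces."""
    return " ".join(text.split())


def build_corpus(input_dir: str, output_file: str) -> int:
    """Write one line per .txt file in input_dir and return the document count.

    Assumption: every document is a plain-text .txt file directly inside input_dir.
    """
    out_path = Path(output_file).resolve()
    # Sort for a reproducible document order; skip the output file itself
    # in case it lives in the same folder.
    paths = [p for p in sorted(Path(input_dir).glob("*.txt"))
             if p.resolve() != out_path]
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(flatten_document(text) + "\n")
    return len(paths)
```

Calling build_corpus("documents", "corpus.txt") (both names are placeholders) would
produce a corpus file whose line order follows the alphabetical order of the file
names, which makes it easy to match each line to its document later.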
Once the file with the text corpus is created, enclose each line in it (the content of
an individual document) in straight double quotation marks ("), add the document name
at the beginning of the line and the document class at the end, and separate the three
items by commas. Also add a file header at the beginning of the file, followed by
@data, as shown below:
@relation departments_string
@attribute document_name string
@attribute document_content string
@attribute document_class string
@data
Anthropology, " anthropology anthropology anthropology consists …", A
…
This representation uses three attributes: document_name, document_content, and
document_class, all of type string. Each row in the data section (after @data)
represents one of the initial text documents. Note that the number of attributes and the
order in which they are listed in the header must correspond to the comma-separated
items in each row of the data section. An example of such a string data file is
“Departments-string.arff”, available from the data repository at
http://www.cs.ccsu.edu/~markov/dmwdata.zip, folder “Weka data”.
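The header and the quoting of each row can also be generated by a program. The
sketch below is one possible way to do it in Python, under stated assumptions: the
helper names quote_arff and to_arff are hypothetical, document names are quoted as
well so that names containing spaces survive, and embedded quotation marks are
escaped with a backslash, which should be checked against the ARFF description
linked above before relying on it.

```python
def quote_arff(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_arff(relation: str, documents) -> str:
    """Build ARFF text from (name, content, class) tuples.

    The three attribute names follow the header shown in the example above.
    """
    lines = [
        f"@relation {relation}",
        "@attribute document_name string",
        "@attribute document_content string",
        "@attribute document_class string",
        "@data",
    ]
    for name, content, doc_class in documents:
        # One row per document: name, quoted content, class label.
        lines.append(f"{quote_arff(name)}, {quote_arff(content)}, {doc_class}")
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file with the .arff extension and opened in
Weka. Class labels are written unquoted here, so they are assumed to be single
tokens without spaces, such as A in the example row.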
3. Create term counts, Boolean, and TFIDF data sets. Load the string data file in
Weka using the “Open file” button in “Preprocess” mode. After successful loading,
the system shows some statistics: the number of attributes (3), their type (string),
and the number of instances (rows in the data section, i.e. documents).
Choose the StringToNominal filter and apply it (one attribute at a time) to the first
attribute, document_name, and then to the last attribute (index 3), document_class. Then choose
the StringToWordVector filter and apply it with outputWordCounts=true. You may