Synopsis
This operator reads an example set from a file. It can be configured to read almost all line-based file formats.
Description
This operator reads an example set from one or more files. The default parameter values should work for most file formats (including the format produced by the ExampleSetWriter, CSV, ...). Please refer to the section First steps/File formats for details on the attribute description file, which is set by the parameter attributes and is used to specify attribute types. You can use the wizard of this operator or the Attribute Editor tool to create these .aml meta data files for your data sets.
This operator supports reading data from multiple source files. Each attribute (including special attributes like labels, weights, ...) may be read from a different file. Please note that only the minimum number of lines across all files will be read, i.e. if one of the data source files has fewer lines than the others, only that number of examples will be read.
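The truncation rule for multiple source files can be illustrated with a minimal Python sketch. It is not the operator's implementation; the function name `read_columns` and the in-memory sample files are assumptions made for illustration only.

```python
import io

def read_columns(sources):
    """Combine one line from each source into an example.

    zip() stops as soon as the shortest source is exhausted, which
    mirrors the rule that only the minimum number of lines is read.
    (Hypothetical helper, not part of the operator.)
    """
    return [tuple(value.rstrip("\n") for value in row) for row in zip(*sources)]

# Illustrative data: three feature lines, but only two label lines.
features = io.StringIO("1.0\n2.0\n3.0\n")
labels = io.StringIO("yes\nno\n")

examples = read_columns([features, labels])
print(examples)  # two examples; the third feature line is dropped
```

With three feature lines and two label lines, only two examples are produced.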
The split points can be defined with regular expressions (please refer to one of the many tutorials available on the web for an introduction). The default split parameter ",\s*|;\s*|\s+" should work for most file formats. This regular expression describes the following column separators:
- the character "," followed by whitespace of arbitrary length (including none)
- the character ";" followed by whitespace of arbitrary length (including none)
- whitespace of arbitrary length (at least one character)
Alternatives are separated by "|", which acts as a logical OR. Other useful separators might be "\t" for tabs, " " for a single space, and "\s" for any single whitespace character.
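To see how the default column separator expression behaves, the following Python sketch applies it with the standard `re` module. The operator itself is not written in Python; `split_columns` and `DEFAULT_SEPARATORS` are names chosen here for illustration, and the sketch ignores quoting and comments.

```python
import re

# The default column separator expression quoted in the text.
DEFAULT_SEPARATORS = r",\s*|;\s*|\s+"

def split_columns(line, separators=DEFAULT_SEPARATORS):
    """Split one data line at every match of the separator expression."""
    return re.split(separators, line)

# Comma plus space, bare semicolon, double space and tab all act as separators.
print(split_columns("5.1, 3.5;1.4  0.2\tsetosa"))

# A tab-only separator keeps ordinary spaces inside the values.
print(split_columns("a b\tc d", r"\t"))
```

Choosing a narrower expression such as "\t" is useful when values may themselves contain spaces, commas, or semicolons.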
Quoting is also possible with ". You can escape quotes with a backslash, i.e. \". Please note that you can change these characters with the parameters quote character and quoting escape character.
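The interaction between quoting and column separators can be sketched as a small scanner in Python. This is only an illustration under stated assumptions: separators are ignored inside quoted sections, the escape character is honored only inside quotes, and the function name `split_quoted` is hypothetical. The operator's actual parsing rules may differ in edge cases.

```python
import re

def split_quoted(line, separators=r",\s*|;\s*|\s+", quote='"', escape="\\"):
    """Split a line at separators, except inside quoted sections."""
    sep = re.compile(separators)
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == escape and i + 1 < len(line):
                # Escaped character: keep it literally, drop the escape.
                current.append(line[i + 1])
                i += 2
                continue
            if ch == quote:
                in_quotes = False  # closing quote
            else:
                current.append(ch)
            i += 1
            continue
        if ch == quote:
            in_quotes = True  # opening quote
            i += 1
            continue
        match = sep.match(line, i)
        if match and match.end() > i:
            # Separator found outside quotes: finish the current field.
            fields.append("".join(current))
            current = []
            i = match.end()
            continue
        current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields

print(split_quoted('1, "a, b", "say \\"hi\\""'))
```

In this sketch the quoted value keeps its embedded comma and space, and the escaped quotes end up as literal quote characters in the last field.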
Additionally, you can specify comment characters, which can be used at arbitrary locations in the data lines. Any content after a comment character will be ignored. Unknown attribute values can be marked with empty strings (if this is possible for your column separators) or with a question mark (recommended).
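Comment stripping and missing-value markers can be sketched in Python as follows. The comment character "#" used as a default here is an assumption for illustration (the text does not state the operator's default), `parse_line` is a hypothetical name, and quoting is deliberately ignored to keep the sketch short.

```python
import re

def parse_line(line, comment_chars="#", separators=r",\s*|;\s*|\s+"):
    """Drop comment content, split the rest, and map unknown values to None."""
    # Cut the line at the first comment character, if there is one.
    for i, ch in enumerate(line):
        if ch in comment_chars:
            line = line[:i]
            break
    line = line.strip()
    if not line:
        return []  # nothing but whitespace or a comment
    values = re.split(separators, line)
    # Empty strings and "?" both mark unknown attribute values.
    return [None if value in ("", "?") else value for value in values]

print(parse_line("5.1, ?, setosa # checked by hand"))
print(parse_line("1,,3"))
```

The second call shows why empty strings only work with some separators: with a pure whitespace separator such as "\s+", two adjacent separators collapse into one and the empty value disappears, which is one reason the question mark is the safer marker.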
Input
Output
- output:
Parameters
- configure operator: Configure this operator by means of a Wizard.
- attributes: Filename of the XML attribute description file. This file also contains the names of the files to read the data from.
- sample ratio: The fraction of the data set which should be read (1 = all; only used if sample_size = -1)
- sample size: The exact number of samples which should be read (-1 = use sample ratio; if not -1, sample_ratio will not have any effect)
- permute: Indicates if the loaded data should be permuted.
- decimal point character: Character that is used as decimal point.
- column separators: Column separators for data files (regular expression)
- use comment characters: Indicates if a comment character should be used.
- comment chars: Any content in a line after one of these characters will be ignored.
- use quotes: Indicates if quotes should be regarded.
- quote character: Specifies the character which should be used for quoting.
- quoting escape character: Specifies the character which should be used for escaping quotes.
- trim lines: Indicates if lines should be trimmed (empty spaces are removed at the beginning and the end) before the column split is performed.
- skip error lines: Indicates if lines which cannot be read should be skipped instead of letting this operator fail its execution.
- datamanagement: Determines how the data is represented internally.
- encoding: The encoding used for reading or writing files.
- use local random seed: Indicates if a local random seed should be used.
- local random seed: Specifies the local random seed
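The precedence between sample size and sample ratio, and the effect of a local random seed on permutation, can be sketched in Python. Only the precedence rule comes from the parameter descriptions above; the function names, the truncating rounding, the cap at the number of available examples, and the use of Python's `random` module are assumptions for illustration and say nothing about how the operator selects or shuffles examples internally.

```python
import random

def resolve_sample_count(total, sample_ratio=1.0, sample_size=-1):
    """Return how many of `total` examples would be read."""
    if sample_size != -1:
        # An explicit size wins; the ratio has no effect.
        return min(sample_size, total)
    # Otherwise the ratio applies (truncated to a whole number here).
    return int(total * sample_ratio)

def permute_examples(examples, local_seed=None):
    """Return a shuffled copy; a fixed seed makes the order repeatable."""
    rng = random.Random(local_seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled

print(resolve_sample_count(200, sample_ratio=0.5))                  # ratio is used
print(resolve_sample_count(200, sample_ratio=0.5, sample_size=30))  # size is used
print(permute_examples(range(5), local_seed=42) == permute_examples(range(5), local_seed=42))
```

Using a local seed in this way keeps the permutation reproducible across runs without affecting any other source of randomness.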