You should specify append mode when importing a table where new rows are continually being
added with increasing row id values. You specify the column containing the row’s id with
--check-column. Sqoop imports rows where the check column has a value greater than the one
specified with --last-value.
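As a minimal sketch of an append-mode import, the following assembles a command line using the
--incremental append argument together with --check-column and --last-value. The connection
string, table name, column name, and starting value are hypothetical placeholders, and the
command is printed for review rather than executed.

```shell
# Hypothetical connection string, table, and column names; adjust for your database.
CONNECT="jdbc:mysql://db.example.com/corp"
TABLE="EMPLOYEES"
CHECK_COLUMN="id"
LAST_VALUE=1000   # assumed: the highest id that has already been imported

# Build the command as a string and print it instead of running it.
APPEND_CMD="sqoop import --connect $CONNECT --table $TABLE --incremental append --check-column $CHECK_COLUMN --last-value $LAST_VALUE"
echo "$APPEND_CMD"
```

With these placeholder values, only rows whose id is greater than 1000 would be selected.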
An alternative table update strategy supported by Sqoop is called lastmodified mode. You should
use this when rows of the source table may be updated, and each such update sets the value of a
last-modified column to the current timestamp. Sqoop imports rows where the check column holds
a timestamp more recent than the timestamp specified with --last-value.
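A lastmodified import might be assembled along the following lines. This is only a sketch: the
connection string, table name, the last_modified column, and the timestamp are hypothetical,
and the command is printed rather than run.

```shell
# Hypothetical connection string, table, and timestamp column.
CONNECT="jdbc:mysql://db.example.com/corp"
LAST_VALUE="2010-05-01 00:00:00"   # assumed: timestamp reported by the previous import

# The timestamp contains a space, so it is quoted inside the command string.
LASTMOD_CMD="sqoop import --connect $CONNECT --table EMPLOYEES --incremental lastmodified --check-column last_modified --last-value '$LAST_VALUE'"
echo "$LASTMOD_CMD"
```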
At the end of an incremental import, the value that should be specified as --last-value for a
subsequent import is printed to the screen. When running a subsequent import, you should
pass this value to --last-value to ensure you import only the new or updated data. Creating an
incremental import as a saved job handles this automatically, and is the preferred
mechanism for performing a recurring incremental import. See the section on saved jobs later in
this document for more information.
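The saved-job approach can be sketched as two commands: one that creates the job with
sqoop job --create, and one that runs it with sqoop job --exec. The job name, connection string,
table, and column below are hypothetical, and both commands are printed rather than executed.

```shell
# Hypothetical job name; the import arguments follow the "--" separator.
JOB_NAME="nightly_employees"

CREATE_CMD="sqoop job --create $JOB_NAME -- import --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES --incremental append --check-column id --last-value 0"
EXEC_CMD="sqoop job --exec $JOB_NAME"

# Print both commands for review.
printf '%s\n' "$CREATE_CMD" "$EXEC_CMD"
```

Each run of the saved job is then expected to pick up from the last value recorded by the
previous run, so --last-value does not need to be supplied by hand.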
7.2.10. File Formats
You can import data in one of three file formats: delimited text, SequenceFiles, or Avro data files.
Delimited text is the default import format. You can also specify it explicitly by using the
--as-textfile argument. This argument will write string-based representations of each record to
the output files, with delimiter characters between individual columns and rows. These delimiters
may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line
formatting arguments.") The following shows the results of an example text-based import:
1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12
Delimited text is appropriate for most non-binary data types. It also readily supports further
manipulation by other tools, such as Hive.
SequenceFiles are a binary format that stores individual records in custom record-specific data
types. These data types are manifested as Java classes. Sqoop will automatically generate these
data types for you. This format supports exact storage of all data in binary representations, and
is appropriate for storing binary data (for example, VARBINARY columns), or data that will be
principally manipulated by custom MapReduce programs (reading from SequenceFiles is
higher-performance than reading from text files, as records do not need to be parsed).
Avro data files are a compact, efficient binary format that provides interoperability with
applications written in other programming languages. Avro also supports versioning, so that
when, e.g., columns are added to or removed from a table, previously imported data files can be
processed along with new ones.
By default, data is not compressed. You can compress your data by using the deflate (gzip)
algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the
--compression-codec argument. This applies to SequenceFile, text, and Avro files.
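The format and compression arguments can be combined as in the sketch below, which uses
--as-textfile, --as-sequencefile, and --as-avrodatafile. The connection string and table name
are hypothetical, and BZip2Codec is just one example of a Hadoop codec class; the commands are
printed rather than executed.

```shell
# Hypothetical base command shared by all three variants.
BASE="sqoop import --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES"

# Delimited text, compressed with an explicitly named Hadoop codec.
TEXT_CMD="$BASE --as-textfile --compression-codec org.apache.hadoop.io.compress.BZip2Codec"
# SequenceFiles, compressed with the default deflate (gzip) algorithm.
SEQ_CMD="$BASE --as-sequencefile --compress"
# Avro data files, uncompressed.
AVRO_CMD="$BASE --as-avrodatafile"

printf '%s\n' "$TEXT_CMD" "$SEQ_CMD" "$AVRO_CMD"
```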
7.2.11. Large Objects
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large,
then these columns should not be fully materialized in memory for manipulation, as most
columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored
inline with the rest of the data, in which case they are fully materialized in memory on every
access, or they can be stored in a secondary storage file linked to the primary data storage. By
default, large objects less than 16 MB in size are stored inline with the rest of the data. At a
larger size, they are stored in files in the _lobs subdirectory of the import target directory. These
files are stored in a separate format optimized for large record storage, which can accommodate
records of up to 2^63 bytes each. The size at which LOBs spill into separate files is controlled by
the --inline-lob-limit argument, which takes a parameter specifying the largest LOB size to keep
inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external
storage.
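Because --inline-lob-limit is given in bytes, it can help to compute the value explicitly. The
sketch below works out the 16 MB default in bytes and assembles two commands, one keeping that
limit and one forcing all large objects into external storage. The connection string and table
name are hypothetical, and the commands are printed rather than executed.

```shell
# 16 MB expressed in bytes.
DEFAULT_LIMIT=$((16 * 1024 * 1024))   # 16777216

# Hypothetical base command.
BASE="sqoop import --connect jdbc:mysql://db.example.com/corp --table DOCUMENTS"

# Keep LOBs up to 16 MB inline.
INLINE_CMD="$BASE --inline-lob-limit $DEFAULT_LIMIT"
# Place every large object in external storage.
EXTERNAL_CMD="$BASE --inline-lob-limit 0"

printf '%s\n' "$INLINE_CMD" "$EXTERNAL_CMD"
```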
Table 6. Output line formatting arguments: