sqoop --split-by
Posted: 2023-08-23 12:12:38 · Views: 261
The `--split-by` option tells Sqoop which column to partition the input data on. When importing from a relational database into the Hadoop ecosystem, Sqoop splits the table into ranges on this column and imports the ranges as parallel map tasks, which speeds up the import. Sqoop queries the column's minimum and maximum values and divides that range evenly among the mappers, assigning each subtask its own slice. The split column should therefore have evenly distributed values — typically the primary key or another indexed, ordered column; a heavily skewed column produces unbalanced tasks.
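Conceptually, Sqoop first runs `SELECT MIN(col), MAX(col)` on the split column, then divides that range among the mappers and gives each one a bounding `WHERE` clause. A minimal Python sketch of the range division (a simplification for illustration, not Sqoop's actual Java splitter):

```python
def split_boundaries(lo, hi, num_splits):
    """Divide the closed range [lo, hi] into num_splits contiguous
    sub-ranges, in the spirit of Sqoop's integer splitter. Each pair
    becomes one mapper's predicate: lo_i <= col < hi_i (the last
    range is inclusive of hi)."""
    # Evenly spaced boundary points via integer interpolation.
    points = [lo + (hi - lo) * i // num_splits for i in range(num_splits + 1)]
    points[-1] = hi  # guard against rounding on the final boundary
    return [(points[i], points[i + 1]) for i in range(num_splits)]
```

For example, with ids 1..100 and 4 mappers this yields the ranges `(1, 25)`, `(25, 50)`, `(50, 75)`, `(75, 100)` — four roughly equal slices, each imported by one map task.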
Related questions
```
sqoop import --connect jdbc:mysql://zhaosai:3306/mydb --username root --password jqe6b6 --table news --target-dir /user/news --fields-terminated-by “;” --hive-import --hive-table news -m 1
```

The command fails with the following output:

```
Warning: /opt/programs/sqoop-1.4.7.bin__hadoop-2.6.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/programs/sqoop-1.4.7.bin__hadoop-2.6.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/programs/sqoop-1.4.7.bin__hadoop-2.6.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/programs/sqoop-1.4.7.bin__hadoop-2.6.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
23/06/10 16:18:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
23/06/10 16:18:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/06/10 16:18:23 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/06/10 16:18:23 INFO tool.CodeGenTool: Beginning code generation
Sat Jun 10 16:18:23 CST 2023 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
23/06/10 16:18:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM news AS t LIMIT 1
23/06/10 16:18:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM news AS t LIMIT 1
23/06/10 16:18:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/programs/hadoop-2.7.6
Note: /tmp/sqoop-root/compile/84ba419f00fa83cb5d16dba722729d01/news.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
23/06/10 16:18:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/84ba419f00fa83cb5d16dba722729d01/news.jar
23/06/10 16:18:25 WARN manager.MySQLManager: It looks like you are importing from mysql.
23/06/10 16:18:25 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
23/06/10 16:18:25 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
23/06/10 16:18:25 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
23/06/10 16:18:25 ERROR tool.ImportTool: Import failed: No primary key could be found for table news. Please specify one with --split-by or perform a sequential import with '-m 1'.
```
This command uses Sqoop to import data from MySQL into HDFS and create a matching Hive table, but the import aborts with "Import failed: No primary key could be found for table news. Please specify one with --split-by or perform a sequential import with '-m 1'." Sqoop found no primary key on the `news` table, so it cannot parallelize the import; it asks you either to name a split column with `--split-by` or to force a single sequential mapper with `-m 1`. Note that the failing command already appears to include `-m 1`; when the error persists anyway, check that every option was actually parsed — the curly quotes (`“;”`) around the field terminator are a common copy-paste artifact that can break shell argument handling, so retype them as straight ASCII quotes.
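A hedged fix, assuming the `news` table has an auto-increment integer column named `id` (that column name is an assumption, not taken from the question): name a split column so the import can run in parallel, or keep `-m 1` for a sequential import. Straight ASCII quotes are used for the terminator, and `-P` replaces the plaintext password as the log's own warning suggests:

```shell
# "id" is an assumed split column; adjust to a real column in your
# news table. -P prompts for the password interactively instead of
# exposing it on the command line.
sqoop import \
  --connect jdbc:mysql://zhaosai:3306/mydb \
  --username root -P \
  --table news \
  --target-dir /user/news \
  --fields-terminated-by ';' \
  --hive-import --hive-table news \
  --split-by id \
  -m 4
```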
```
sqoop import --connect jdbc:mysql://192.168.160.130:3306/mydb --username root --password jqe6b6 --table news --hive-import --hive-table mydb.news --incremental append --check-column --last-value 0 --split-by news --target-dir /hdfs://zhaosai:9000/user/hive/warehouse/news
```
In the Sqoop command you provided, the `--check-column` option is missing its value. Sqoop uses that column to decide which rows are new or updated during an incremental append, so a bare `--check-column` immediately followed by `--last-value` causes the error.
Specify a valid column name after `--check-column`, for example:
```
sqoop import --connect jdbc:mysql://192.168.160.130:3306/mydb --username root --password jqe6b6 --table news --hive-import --hive-table mydb.news --incremental append --check-column id --last-value 0 --split-by id --target-dir hdfs://zhaosai:9000/user/hive/warehouse/news
```
In the command above, `--check-column` is set to `id`, assuming your table has an auto-incrementing integer `id` column; the check column must be an integer or timestamp column whose values grow as rows are added. Two other problems in your original command are worth fixing as well: `--split-by` should name a column, not the table (`--split-by id` rather than `--split-by news`), and the `--target-dir` URI should not have a leading slash before the `hdfs://` scheme.
Note that if you are importing the result of a custom query rather than a whole table, you must use the `--query` option instead of `--table`. The query's WHERE clause must contain the literal token `$CONDITIONS` (Sqoop substitutes each mapper's range predicate for it), and `--split-by` is still required on the command line; `--check-column` remains a separate option for incremental imports.
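A sketch of that free-form query pattern (the column names `id`, `title`, and `content` are assumptions for illustration; the single quotes around the query keep the shell from expanding `$CONDITIONS`):

```shell
# Free-form query import: Sqoop replaces $CONDITIONS with each
# mapper's range predicate; --split-by is required when -m > 1.
sqoop import \
  --connect jdbc:mysql://192.168.160.130:3306/mydb \
  --username root -P \
  --query 'SELECT id, title, content FROM news WHERE $CONDITIONS' \
  --split-by id \
  --target-dir hdfs://zhaosai:9000/user/hive/warehouse/news \
  -m 4
```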