本文共 3300 字,大约阅读时间需要 11 分钟。
hive里面的连接操作其实跟SQL还是差不多的...
首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:
[xingoo@localhost tmp]$ cat aa.txt 1 a 32 b 43 c 1[xingoo@localhost tmp]$ cat bb.txt 1 xxx 22 yyy 33 zzz 5
接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。
hive> create table aa > (a string,b string,c string) > row format delimited > fields terminated by ' ';OKTime taken: 0.19 secondshive> create table bb like aa;OKTime taken: 0.188 seconds
查看两个表的结构:
hive> describe aa;OKa string b string c string Time taken: 0.068 seconds, Fetched: 3 row(s)hive> describe bb;OKa string b string c string Time taken: 0.045 seconds, Fetched: 3 row(s)
下面可以基于本地的文件,导入数据
hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa;Loading data to table test.aaOKTime taken: 0.519 secondshive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb;Loading data to table test.bbOKTime taken: 0.321 seconds
内连接即基于on语句,仅列出表1和表2符合连接条件的数据。
hive> select * from aa a join bb b on a.c=b.a;MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1264 HDFS Write: 90 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK3 c 1 1 xxx 21 a 3 3 zzz 5Time taken: 17.083 seconds, Fetched: 2 row(s)
左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null
ive> select * from aa a left outer join bb b on a.c=b.a;MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1282 HDFS Write: 90 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 a 3 3 zzz 52 b 4 NULL NULL NULL3 c 1 1 xxx 2Time taken: 16.048 seconds, Fetched: 3 row(s)
类似左连接,同理。
hive> select * from aa a right outer join bb b on a.c=b.a;MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1306 HDFS Write: 90 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK3 c 1 1 xxx 2NULL NULL NULL 2 yyy 31 a 3 3 zzz 5Time taken: 15.483 seconds, Fetched: 3 row(s)
相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.
hive> select * from aa a full outer join bb b on a.c=b.a;MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 4026 HDFS Write: 270 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK3 c 1 1 xxx 2NULL NULL NULL 2 yyy 31 a 3 3 zzz 52 b 4 NULL NULL NULLTime taken: 1.689 seconds, Fetched: 4 row(s)
这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。
hive> select * from aa a left semi join bb b on a.c=b.a;MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1366 HDFS Write: 90 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 a 33 c 1Time taken: 16.811 seconds, Fetched: 2 row(s)
笛卡尔积会针对表1和表2的每条数据做连接...
hive> select * from aa join bb;MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1390 HDFS Write: 90 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 a 3 1 xxx 22 b 4 1 xxx 23 c 1 1 xxx 21 a 3 2 yyy 32 b 4 2 yyy 33 c 1 2 yyy 31 a 3 3 zzz 52 b 4 3 zzz 53 c 1 3 zzz 5
转载自: