网站首页 > 基础教程正文

Hive基础概念与基本知识详解

ccvgpt 2025-01-14 11:11:18 基础教程 1 ℃

hive与关系型数据库的比较

对比项	Hive	关系型数据库
查询语言	HQL	SQL
数据存储	HDFS	Llocal FS or RawDevice
执行器	MapReducer	Executor
数据insert	支持批量导入和单条插入	支持批量导入和单条插入
数据update和delete	支持追加，不支持删除	行级别的更新和删除
数据规模	大	小
执行延迟	高	低
分区	支持	支持
索引	v0.8之后支持	支持
数据加载模式	读时模式（快）	写时模式(慢)
扩展性	高	低
应用场景	海量数据查询	实时查询

Hive架构

编译器将一个Hive SQL转换操作符
操作符是Hive的最小的处理单元
每个操作符代表HDFS的一个操作或者一道MapReduce作业

Operator(操作符)

Operator都是Hive定义的处理过程
Operator的定义有：

protected List <Operator<? extends Serializable>> child Operators;
proetected List <Operator<? extends Serializable>> parentOperators;
protected boolean done; // 初始化值为false

Hive三种安装方式区别和搭建

Hive中metastore(元数据存储)的三种模式：

内嵌Derby模式
直连数据库模式
远程服务器模式

Hive基础知识

1. 创建表

create table person (
    id int ,
    name string,
    likes array<string>,
    address map<string, string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
localtion '/usr/'

2. 内部表和外部表

2.1 Hive内部表

create table [IF NOT EXISTS] table_name

删除表的时候，会将元数据和HDFS上的数据一起删除

2.2 Hive外部表

create external table [if not exists] table_name location hdfs_path

删除外部表只删除metastore的元数据，不删除hdfs中的表数据

3. 表分区

Hive分区partition必须在定义表时指定partition字段。

3.1 单分区表

create table day_table (id int, content string) partitioned by (dt string)

单分区表，按天分区，在表结构中存在id, content, dt三列.

上面的语句表示：以dt为文件夹区分

3.2 双分区建表语句

create table day_hour_table (id int, content string, dt string, hour string)
partitioned by (dt string, hour string)

双分区表，按天和小时分区，在表结构中新增加了dt和hour两列。

以dt为文件夹，再以hour为子文件夹区分

3.3 Hive表添加分区

表已经创建，在此基础上添加分区，注意只能添加在表创建时定义好的分区的值

alter table person6 add partition(age=30, sex="man")

表在创建时定义的是双分区，条件分区值时，不能只指定一个，需要同时指定两个分区的值。

3.4 修复分区

msck repair table table_name

修复分区是使用了hdfs中的数据，但是元数据中没有分区的信息，这个时候就可以使用分区修复。

在hdfs中，person7的目录下有age=10, age=20, age=30几个目录，因此在修复的时候，就创建了对应的partition分区的信息。

在元数据表中，则看到了对应的分区信息。

4. 使用已有表建表

create table empty_key_value_store like key_value_store;

这个创建语句，只是创建了和key_value_store相同的表结构，但是表中没有具体的数据信息。

也可以配合查询语句然后创建表结构:

create table table_name as select column1, column2 .. from table_name2

5. 删除数据

-- 删除数据
delete from table_name;

-- 删除数据
truncate table table_name;

以上的语句都是用来清空表数据，但是不能删除单条数据，当尝试删除单条数据的时候，会有以下的提示: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

6. 添加数据

6.1 load files into tables

在hive上进行数据的操作，主要包含copy和move两种操作，这两种操作都能够将数据文件移动到hive的表中。

load data [local] inpath 'filepath' [overwrite] into table tablename [partition (partcoll=val1, partcol2=val2)]

copy: load的是本地(local)的文件
move：load的是hdfs上的文件，不适用local. 如果也copy的话，hdfs上将产生不必要的文件。

6.2 Inserting data into Hive Tables from queries

查询的数据能够通过insert的语句插入到Hive表中，语法如下：

insert into table tablename1 [partition (partcol=val...)] select columns from tablenam32

insert overwrite table tablename1 [partition (partcoln=val1...)] [IF NOT EXSITS] select columns from tablename2

以上方式不建议使用，主要是因为insert的时候会创建mp任务并执行任务，需要比较长的时间。另外，上面的语句中也存在性能问题，主要是查询一条数据就需要插入一次。

将大表中的数据拆分到多个小表

from from_statement
insert overwrite table tablename1 [partitioin (partcol=val..)] 
[if not exists] select select_statements
insert overwrite table tablename1 [partitioin (partcol=val..)] 
[if not exists] select select_statements
insert overwrite table tablename1 [partitioin (partcol=val..)] 
[if not exists] select select_statements

以上方式只查询一次，性能较高

INSERT OVERWRITE：表示覆盖
INSERT INTO：表示追加

上一篇：在 C# .NET 中从 PDF 中提取表数据
下一篇：分库分表以后，如何管理几万张分片表?

网站首页 > 基础教程 正文

Hive基础概念与基本知识详解

hive与关系型数据库的比较

Hive架构

Hive架构

Operator(操作符)

Hive三种安装方式区别和搭建

Hive基础知识

1. 创建表

2. 内部表和外部表

2.1 Hive内部表

2.2 Hive外部表

3. 表分区

3.1 单分区表

3.2 双分区建表语句

3.3 Hive表添加分区

3.4 修复分区

4. 使用已有表建表

5. 删除数据

6. 添加数据

6.1 load files into tables

6.2 Inserting data into Hive Tables from queries

将大表中的数据拆分到多个小表

猜你喜欢

网站首页 > 基础教程正文