Common Regular Expressions in Hive

Functions and operators commonly used for pattern matching in Hive

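Function	Return type	Description
like	boolean	(A) str like (B) pattern: whether pattern B matches the entire content of A
rlike	boolean	(A) str rlike (B) regexp: whether regular expression B matches A (a match on any part of A is enough)
regexp	boolean	Same syntax and behavior as rlike; only the name differs
regexp_replace(str, regexp, rep)	string	Replaces every part of str that matches the regular expression regexp with the string rep
regexp_extract(str, regexp[, idx])	string	Matches str against the regular expression regexp and returns the capture group at index idx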

Note: parameters enclosed in [ ] are optional.

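1. like

Usage: A like B, where A is a string and B is a pattern. It tests whether B matches the whole of A.

The negated forms are A not like B and not A like B; both return the opposite of like.

The return type is boolean. If either A or B is NULL, the result is NULL.

The only wildcards allowed in pattern B are _ and %: _ stands for exactly one arbitrary character, and % stands for any number of characters (including none).

select 'hellohive' like 'hello%';
-- Result: true
select 'hellohive' like 'hello_';
-- Result: false
select 'hellohive' like 'hello____';
-- Result: true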
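The wildcard rules above can be sketched in plain Java by translating a like pattern into a regular expression and requiring a full match. This is only an illustration of the semantics, not Hive's implementation: the class LikeSketch and its method like are hypothetical names, and backslash escapes inside like patterns are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hive): models LIKE by translating
// the pattern into a Java regex and requiring a full-string match.
class LikeSketch {
    static boolean like(String str, String likePattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : likePattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");   // % : any number of characters, including none
            } else if (c == '_') {
                regex.append('.');    // _ : exactly one character
            } else {
                // every other character is matched literally
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // matches() succeeds only if the pattern covers the whole input
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(like("hellohive", "hello%"));    // true
        System.out.println(like("hellohive", "hello_"));    // false
        System.out.println(like("hellohive", "hello____")); // true
    }
}
```

The second call fails because hello_ describes a six-character string, while the input has nine characters.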

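2. rlike/regexp

Usage: A rlike B, where A is a string and B is a regular expression in Java regex syntax. It tests whether A matches B. Unlike like, B only has to match some part of A, so anchor the expression with ^ and $ when the whole string must match. The r in rlike stands for "regular".

The negated forms are A not rlike B and not A rlike B; both return the opposite of rlike.

The return type is boolean. If either A or B is NULL, the result is NULL.

select '\\';
-- Result: \ (a backslash must be escaped inside a Hive string literal, so '\\d' reaches the regex engine as \d)
select 'hellohive' regexp 'hello(hive)?'; -- does the string contain hello, optionally followed by hive?
-- Result: true
select 'abc123456' regexp '^\\d+$'; -- does the string consist of digits only?
-- Result: false
select 'abc123456' regexp '\\d+'; -- does the string contain any digits?
-- Result: true

select 'hellohive' rlike '^[a-zA-Z]+$';  -- does the string consist of letters only?
-- Result: true
select 'hello hive' rlike '^[a-zA-Z]+$';
-- Result: false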
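Because the patterns follow Java regex syntax, the difference between a partial match and an anchored match can be reproduced in plain Java. The sketch below assumes that rlike behaves like Matcher.find(), which is consistent with the results shown above; RlikeSketch and rlike are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical model of "A rlike B", assuming partial-match (find) semantics.
class RlikeSketch {
    static Boolean rlike(String str, String regex) {
        if (str == null || regex == null) {
            return null; // NULL on either side gives NULL
        }
        // find() succeeds if the regex matches anywhere in the string
        return Pattern.compile(regex).matcher(str).find();
    }

    public static void main(String[] args) {
        System.out.println(rlike("abc123456", "\\d+"));            // true
        System.out.println(rlike("abc123456", "^\\d+$"));          // false
        System.out.println(rlike("hellohive", "^[a-zA-Z]+$"));     // true
        System.out.println(rlike("hello hive", "^[a-zA-Z]+$"));    // false
        // Without anchors, a match on part of the string is enough:
        System.out.println(rlike("say hello", "hello(hive)?"));    // true
        System.out.println(rlike("say hello", "^hello(hive)?$"));  // false
    }
}
```

The last two calls show why anchors matter: the unanchored pattern only checks that hello occurs somewhere, while the anchored one requires the whole string to be hello or hellohive.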

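Matching Chinese text:

- 1. Test whether a string consists of Chinese characters only (every character is a Chinese character)
select '中国' rlike '^[\\u4e00-\\u9fa5]+$';
-- Result: true
select 'hello 中国' rlike '^[\\u4e00-\\u9fa5]+$';
-- Result: false

- 2. Test whether a string contains Chinese characters (at least one is enough)
select 'hello 中国' rlike '[\\u4e00-\\u9fa5]+';
-- Result: true

- 3. Test whether a string contains no Chinese characters
select 'hello 中国' not rlike '[\\u4e00-\\u9fa5]+';
-- Result: false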
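The three checks above rely on the character class covering the code points U+4E00 to U+9FA5. A minimal Java sketch of the same checks follows; ChineseMatchSketch and its methods are hypothetical names, and the test string is written with Unicode escapes (U+4E2D, U+56FD) so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the Chinese-character checks.
class ChineseMatchSketch {
    // Anchored: every character must fall in the U+4E00..U+9FA5 range
    private static final Pattern ALL_CHINESE = Pattern.compile("^[\\u4e00-\\u9fa5]+$");
    // Unanchored: one character in the range is enough
    private static final Pattern ANY_CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]");

    static boolean isAllChinese(String s) {
        return ALL_CHINESE.matcher(s).find();
    }

    static boolean containsChinese(String s) {
        return ANY_CHINESE.matcher(s).find();
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u56fd"; // the two-character word used in the queries above
        System.out.println(isAllChinese(word));                 // true
        System.out.println(isAllChinese("hello " + word));      // false
        System.out.println(containsChinese("hello " + word));   // true
        System.out.println(!containsChinese("hello " + word));  // false
    }
}
```

Characters outside this range, such as Chinese punctuation, are not matched by the class and would need to be added explicitly.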

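3. regexp_replace

- 1. Keep only the Chinese characters of a string by replacing everything else with an empty string
select regexp_replace('i love 中国!!','([^\\u4E00-\\u9FA5]+)','') ;
-- Result: 中国

- 2. Remove tabs, newlines, and carriage returns
select regexp_replace('\t abc \n def \r hij', '\n|\t|\r', '');
-- Result: ' abc  def  hij' (only the control characters are removed; the spaces, including the leading one, remain)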
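The two replacements above can be reproduced with String.replaceAll, which also replaces every match of a Java regex. This is a sketch under the assumption that regexp_replace behaves like replaceAll for these inputs; RegexpReplaceSketch and regexpReplace are hypothetical names.

```java
// Hypothetical model of regexp_replace(str, regexp, rep) using String.replaceAll.
class RegexpReplaceSketch {
    static String regexpReplace(String str, String regex, String rep) {
        return str.replaceAll(regex, rep); // replaces every match, not just the first
    }

    public static void main(String[] args) {
        // "\u4e2d\u56fd" is the two-character Chinese word from the query above
        String kept = regexpReplace("i love \u4e2d\u56fd!!", "([^\\u4E00-\\u9FA5]+)", "");
        System.out.println(kept.length()); // 2: only the two Chinese characters remain

        String cleaned = regexpReplace("\t abc \n def \r hij", "\n|\t|\r", "");
        // Brackets make the surviving leading space visible
        System.out.println("[" + cleaned + "]"); // [ abc  def  hij]
    }
}
```

To strip the surrounding whitespace as well, the pattern would have to include the space character, for example by using \\s in place of the three alternatives.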

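4. regexp_extract

select regexp_extract('ilovechina', 'i(.*?)(china)', 1);
-- Result: love
select regexp_extract('ilovechina', 'i(.*?)(china)', 0);
-- Result: ilovechina
select regexp_extract('ilovechina', 'i(.*?)(china)', 2);
-- Result: china

select regexp_extract('ilovechina', 'i(.*?)(chi)(na)', 1);
-- Result: love
select regexp_extract('ilovechina', 'i(.*?)(chi)(na)', 2);
-- Result: chi
select regexp_extract('ilovechina', 'i(.*?)(chi)(na)', 3);
-- Result: na
select regexp_extract('ilovechina', 'i(.*?)(chi)(na)');
-- Result: love

When the third parameter idx is omitted, it defaults to 1.

idx selects a capture group, counted by the opening parentheses from left to right. A non-empty result is returned only when the string contains a match for regexp; otherwise the result is an empty string. When idx is 0, the function returns the entire matched portion, which in the examples above happens to be the whole string.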
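The meaning of idx maps directly onto Matcher.group in Java. The following sketch assumes that regexp_extract finds the first match and returns the requested group, or an empty string when nothing matches; RegexpExtractSketch and regexpExtract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of regexp_extract(str, regexp[, idx]).
class RegexpExtractSketch {
    static String regexpExtract(String str, String regex, int idx) {
        Matcher m = Pattern.compile(regex).matcher(str);
        // group(0) is the whole match; group(n) is the n-th capture group
        return m.find() ? m.group(idx) : "";
    }

    // idx defaults to 1 when omitted
    static String regexpExtract(String str, String regex) {
        return regexpExtract(str, regex, 1);
    }

    public static void main(String[] args) {
        System.out.println(regexpExtract("ilovechina", "i(.*?)(china)", 1));     // love
        System.out.println(regexpExtract("ilovechina", "i(.*?)(china)", 2));     // china
        System.out.println(regexpExtract("ilovechina", "i(.*?)(chi)(na)"));      // love
        // idx 0 is the matched portion, not necessarily the whole input:
        System.out.println(regexpExtract("xx ilovechina yy", "i(.*?)(china)", 0)); // ilovechina
        System.out.println(regexpExtract("hello", "i(.*?)(china)", 1).isEmpty()); // true
    }
}
```

The fourth call shows that group 0 excludes the text around the match.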

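5. Selecting columns with a regular expression

Sometimes a query needs every column of a table except a few. When the table has many columns, selecting them with a regular expression saves typing them all out. With hive.support.quoted.identifiers set to none, a backquoted name in the select list is interpreted as a regular expression over column names.

set hive.support.quoted.identifiers=none;
- 1. Select the columns whose names end with time
select `.+time` from table;  -- names with at least one character before time
select `.*time` from table;  -- also matches a column named exactly time

- 2. Select all columns except ejd and ehub
select `(ejd|ehub)?+.+` from table;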

- 3. Select all columns except those whose names start with ejd or ehub
select `((ejd|ehub).*)?+.+` from table;
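The exclusion patterns work because ?+ is a possessive quantifier: once the optional group has consumed a name, it does not give characters back, so the trailing .+ has nothing left to match and the column is rejected. The sketch below illustrates this in plain Java, assuming the pattern must match the whole column name and is case-insensitive. ColumnRegexSketch, select, and the column list are hypothetical, and ((ejd|ehub).*)?+.+ is shown as one way to exclude by prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of regex column selection: keep the columns whose
// full name matches the pattern (assumed semantics, case-insensitive).
class ColumnRegexSketch {
    static final List<String> COLUMNS = Arrays.asList(
            "id", "ejd", "ehub", "ejd_code", "ehub_name", "time", "create_time");

    static List<String> select(List<String> columns, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> kept = new ArrayList<>();
        for (String column : columns) {
            if (p.matcher(column).matches()) {
                kept.add(column);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(select(COLUMNS, ".+time"));
        // [create_time]
        System.out.println(select(COLUMNS, ".*time"));
        // [time, create_time]
        System.out.println(select(COLUMNS, "(ejd|ehub)?+.+"));
        // [id, ejd_code, ehub_name, time, create_time]
        System.out.println(select(COLUMNS, "(ejd|ehub.*)?+.+"));
        // [id, ejd_code, time, create_time]  (ejd_code survives)
        System.out.println(select(COLUMNS, "((ejd|ehub).*)?+.+"));
        // [id, time, create_time]
    }
}
```

In (ejd|ehub.*)?+.+ the .* belongs only to the ehub alternative, so a name such as ejd_code is still selected. Wrapping the alternation first, as in ((ejd|ehub).*)?+.+, applies the prefix rule to both names.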
