大数据开发面试必会的SQL 30题！！！

- （1）查询每个区域的用户数
- （2）查询每个区域的男女用户数
- （3）查询姓张的用户数
- （4）筛选出id3～id5的用户
- （5）筛选出绩效不达标的员工
- （6）筛选出姓张的且绩效不达标的员工
- （7）查询获得销售冠军超过两次的人
- （8）查询某部门一年的月销售额最高涨幅
- （9）查询每个季度绩效得分大于70分的员工
- （10）删除重复值
- （11）行列互换
- （12）多列比较
- （13）对成绩进行分组
- （14）周累计数据获取
- （15）周环比数据获取
- （16）查询获奖员工信息
- （17）计算用户留存情况
- （18）筛选最受欢迎的课程
- （19）筛选出每个年级最受欢迎的三门课程
- （20）求累积和
- （21）获取新增用户数
- （22）获取用户首次购买时间
- （23）同时获取用户和订单数据
- （24）随机抽样
- （25）获取沉默用户数
- （26）获取新用户的订单数
- （27）获取借款到期名单
- （28）获取即将到期的借款信息
- （29）获取历史逾期借款信息
- （30）获取部门工资最高的员工
- （补）求连续登陆三天的用户
- （补）某互联网公司面试题

开发工具：

mysql-8.0
DataGrip

（1）查询每个区域的用户数

数据源：stu_table.csv

id,name,class,sex
4,张文华,二区,男
3,李思雨,一区,女
1,王小凤,一区,女
7,李智瑞,三区,男
6,徐文杰,二区,男
8,徐雨秋,三区,男
5,张青云,二区,女
9,孙皓然,三区,男
10,李春山,三区,男
2,刘诗迪,一区,女

需求：我们想知道每个区域有多少用户

解题思路：首先需要对区域进行分组，使用的是group by，然后对每个组内的用户进行计数聚合运算，使用的是count，最后运行结果如下表所示。

select
       class,
       count(id) as stu_sum
from
     test.stu_table
group by
         class;

运行结果：

（2）查询每个区域的男女用户数

数据源：stu_table.csv
需求：我们想知道每个区域内男生、女生分别有多少个。

-- 2.查询每个区域的男女用户数
-- 写法一
select class,sex,count(sex) from test.stu_table group by class,sex;
-- 写法二
select
       class ,
       count(case when sex = '男' then class end ) as '男',
       count(case when sex = '女' then class end ) as '女'
from
     test.stu_table
group by class;
-- 写法三
select
       sex ,
       count(case when class = '一区' then sex end ) as '一区',
       count(case when class = '二区' then sex end ) as '二区',
       count(case when class = '三区' then sex end ) as '三区'
from
     test.stu_table
group by sex;

运行结果：

（3）查询姓张的用户数

数据源：stu_table.csv
需求：我们想知道这张表中姓张的用户有多少个？

select
       count(id) as stu_num
from
     test.stu_table
where name like '张%';

运行结果：

（4）筛选出id3～id5的用户

数据源：stu_table.csv
需求：我们想要获取id按照从小到大的顺序排列以后id3～id5的用户的信息。

-- 4.筛选出id3～id5的用户
select * from test.stu_table order by id limit 2,3;

运行结果：

（5）筛选出绩效不达标的员工

数据源：score_table.csv

id,namr,group,score
1,王小凤,一部,88
2,刘诗迪,一部,70
3,李思雨,一部,92
4,张文华,二部,55
5,张青云,二部,77
6,徐文杰,二部,77
7,李智瑞,三部,56
8,徐雨秋,三部,91
9,孙皓然,三部,93
10,李春山,三部,57

需求：我们想把绩效不达标（绩效得分小于60分）的员工的信息筛选出来。

select * from test.score_table where score < 60;

运行结果：

（6）筛选出姓张的且绩效不达标的员工

数据源：score_table.csv

需求：我们现在想根据这张表筛选出姓张的且绩效不达标的员工的信息。

-- 6.筛选出姓张的且绩效不达标的员工
select * from test.score_table where score < 60 and name like '张%';

运行结果：

（7）查询获得销售冠军超过两次的人

数据源：month_table.csv

id,name,month_num
E002,王小凤,1
E001,张文华,2
E003,孙皓然,3
E001,张文华,4
E002,王小凤,5
E001,张文华,6
E004,李智瑞,7
E002,王小凤,8
E003,孙皓然,9

需求：现在需要查询获得销售冠军的次数超过2次的人及其获得销售冠军的次数。

select
       id,
       name,
       count(month_num) as num
from
     test.month_table
group by
         id,
         name
having
       num > 2;

运行结果：

（8）查询某部门一年的月销售额最高涨幅

数据源：sale_table.csv

year_num,month_num,sales
2019,1,2854
2019,2,4772
2019,3,3542
2019,4,1336
2019,5,3544
2018,1,2293
2018,2,2559
2018,3,2597
2018,4,2363

需求：现在我们想查询2019年的月销售额最高涨幅是多少。

select
       year_num,
       max(sales) as max_sales,
       min(sales) as min_sales,
       (max(sales) - min(sales)) as cha,
       ((max(sales) - min(sales)) / min(sales)) as growth
from
     test.sale_table
group by
         year_num;

运行结果：

（9）查询每个季度绩效得分大于70分的员工

数据源：score_info_table.csv

id,name,subject,score
1,王小凤,第一季度,88
1,王小凤,第二季度,55
1,王小凤,第三季度,72
3,徐雨秋,第一季度,92
3,徐雨秋,第二季度,77
3,徐雨秋,第三季度,93
2,张文华,第一季度,70
2,张文华,第二季度,77
2,张文华,第三季度,91

解题思路：我们要查询的是每个季度绩效得分都大于70分的员工，只要能够保证每个季度每位员工的最小绩效得分是大于70分的，就可以说明这位员工的每个季度绩效得分都大于70分。

需求：现在我们想要通过这张表查询每个季度绩效得分都大于70分的员工。

select
       id,
       name,
       min(score) as min_score
from
     test.score_info_table
group by
         id,
         name
having min_score > 70;

运行结果：

（10）删除重复值

数据源：stu_info_table.csv

id,name,t_1,t_2
1,王小凤,产品技术部,B端产品
2,刘诗迪,产品技术部,C端产品
3,李思雨,产品技术部,B端产品
5,张青云,销售运营部,数据分析
4,张文华,销售运营部,销售管理
6,徐文杰,销售运营部,销售管理
7,李智瑞,产品技术部,B端产品
8,徐雨秋,销售运营部,销售管理
9,孙皓然,产品技术部,B端产品

需求：现在我们想获取该公司一级部门及二级部门的信息，即哪些一级部门下包含哪些二级部门

select 
       t_1,
       t_2 
from 
     test.stu_info_table 
group by 
         t_1, 
         t_2 
order by t_1;

运行结果：

（11）行列互换

数据源：row_col_table.csv

year_num,month_num,sales
2019,1,100
2019,2,200
2019,3,300
2019,4,400
2020,1,200
2020,2,400
2020,3,600
2020,4,800

需求：我们需要把如上表所示的纵向存储数据的方式改成如下表所示的横向存储数据的方式。

解题思路：首先按照year_num分组，利用case when xxx then sales end条件控制语句，当month_num = 1时返回sales，以此类推，得到列值。

-- 11.行列互换
select * from test.row_col_table;
select year_num,
       sum(case when month_num = 1 then sales end ) as m1,
       sum(case when month_num = 2 then sales end ) as m2,
       sum(case when month_num = 3 then sales end ) as m3,
       sum(case when month_num = 4 then sales end ) as m4
from test.row_col_table group by year_num;

运行结果：

（12）多列比较

数据源：col_table.csv

col_1,col_2,col_3
5,10,7
1,10,6
9,3,5
5,2,9
10,4,3
5,2,9
5,8,6
8,8,6

需求：我们需要根据这三列数据生成一列结果列，结果列的生成规则为：如果col_1列大于col_2列，则结果为col_1列的数据；如果col_2列大于col_3列，则结果为col_3列的数据，否则结果为col_2列的数据。

解题思路：多列比较其实就是一个多重判断的过程，借助case when即可实现，先判断col_1 列和col_2列的关系，然后判断col_2列和col_3列的关系。这里需要注意的是，判断的执行顺序是先执行第一行case when，然后执行第二行case when，最后运行结果如下表所示。

select col_1,
       col_2,
       col_3,
       (case
           when col_1 > col_2 then col_1
           when col_2 > col_3 then col_3 end  ) as result_col
from
     test.col_table;

运行结果：

（13）对成绩进行分组

数据源：subject_table.csv

id,score
1,56
2,91
3,67
4,54
5,56
6,69
7,61
8,83
9,99

需求：我们想知道60分以下（不包含60分）、60～80分（不包含80分）、80～100分三个成绩段内分别有多少个学生

解题思路：写法一通过case when写法，分别求出各个分数段的个数作为列；写法二利用的是case when，完成成绩分段以后再对分段结果进行group by，接着在组内计数获得每个成绩段内的学生数

-- 写法一
select * from test.subject_table;
select
       count(case when score >= 80 then score end ) as '80～100分',
       count(case when score >= 60 and score < 80 then score end ) as '60～80分',
       count(case when score < 60 then score end ) as '60分以下'
from test.subject_table;

-- 写法二
select
       (case
    when score >= 80 then '80～100分'
    when score >= 60 and score < 80 then '60～80分'
    when score < 60 then '60分以下' end ) as score_bin ,
       count(case when score >= 80 then score
             when score >= 60 and score < 80 then score
             when score < 60 then score end) as count
from test.subject_table group by score_bin;

运行结果：

（14）周累计数据获取

数据源：order_table.csv

order_id,order_date
1,2019/1/8
2,2019/1/9
3,2019/1/10
4,2019/1/11
5,2020/1/8
6,2020/1/9
7,2020/1/10
8,2020/1/11
9,2020/1/12

需求：现在每天需要获取本周累计的订单数，本周累计是指本周一到获取数据当天，比如，今天是周三，那么本周累计就是周一到周三，该怎么实现呢？

解题思路：我们要获取本周累计的订单数，只需要把本周的订单明细筛选出来，然后对订单ID进行计数即可。

-- 14.周累计数据获取
select * from test.order_table;
-- 现在每天需要获取本周累计的订单数，本周累计是指本周一到获取数据当天，比如，今天是周三，那么本周累计就是周一到周三，该怎么实现呢？

update order_table set order_date = '2021/04/24' where order_id = 6;
update order_table set order_date = '2021/04/28' where order_id = 7;
update order_table set order_date = '2021/04/29' where order_id = 8;
update order_table set order_date = '2021/04/30' where order_id = 9;

select date_format(order_date,'%Y-%m-%d') from test.order_table;
select weekofyear(date_format(order_date,'%Y-%m-%d')) from test.order_table;
select weekofyear(current_date) as '本周';
select year(order_date) as '年' from test.order_table;
select week(order_date) as '周' from test.order_table;

select
       year(order_date) as '年',
       week(order_date) as '周',
       count(order_id)
from
     test.order_table
where
      year(order_date) = year(current_date)
  and week(order_date) = weekofyear(current_date)
group by
         year(order_date),
         week(order_date);

运行结果：

（15）周环比数据获取

数据源：order_table.csv

需求：获取当日的订单数和当日的环比订单数（即昨日的数据）

select
       count(case when date(order_date) = date (current_date) then order_id end ) as order_count,
       count(case when date_sub(date (current_date),interval 1 day ) = order_date then order_id end) as last_order_count
from
     test.order_table;

运行结果：

（16）查询获奖员工信息

数据源：16_table1.csv、16_table2.csv

id,name
1,王小凤
2,刘诗迪
3,李思雨
4,张文华
5,张青云
6,徐文杰
7,李智瑞
8,徐雨秋
9,孙皓然

id,name
1,王小凤
2,刘诗迪
3,李思雨
7,李智瑞
8,徐雨秋
9,孙皓然

需求：现在我们想通过table1表获取获奖员工的更多信息。

select * from test.`16_table1` inner join `16_table2` `16t2` on `16_table1`.id = `16t2`.id;

运行结果：

（17）计算用户留存情况

数据源：user_login.csv

uid,login_time
1,2021/4/21 6:00
1,2021/4/24 10:00
1,2021/4/25 19:00
2,2021/4/22 10:00
2,2021/4/28 9:00
2,2021/4/29 14:00
3,2021/4/27 8:00
3,2021/4/28 10:00

需求：我们想看用户的次日留存数、三日留存数、七日留存数（只要用户首次登录以后再登录就算留存下来了），该怎么实现呢？

解题思路：本题有两种解题思路，先看思路一：按照用户时间求出七日留存，首先按uid分组，求出每个uid的第一次登陆时间和最后一次登陆时间，算出中间间隔的时间，如果间隔为1就是次日留存，间隔为3就是3日留存，间隔为7就是7日留存，以此类推分别求出他们的数量。思路二：按照当日时间求出七日留存，如果用户登陆的时间正好等于当前日期前一天的日期，则去重统计uid数量即为次日留存数，以此类推分别求出三日留存、七日留存。

-- 写法一，按照用户时间求出七日留存
select
       count(case when user_day_value.day_value = 1 then uid end ) as '次日留存',
       count(case when user_day_value.day_value = 3 then uid end ) as '三日留存',
       count(case when user_day_value.day_value = 7 then uid end ) as '七日留存'
from
     (select uid,
       max(login_time),
       min(login_time),
       datediff(date (max(login_time)) , date (min(login_time))) as day_value
     from
        test.user_login
     group by
         uid) as user_day_value;
         
-- 写法二，按照当日时间求出七日留存
select
       count(distinct case when date(login_time) = date_sub(date (current_date),interval 1 day) then uid end ) as '次日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 3 day) then uid end ) as '三日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 7 day) then uid end ) as '七日留存'
from test.user_login;

运行结果：
写法一：

写法二：

（18）筛选最受欢迎的课程

数据源：course_table.csv

id,name,grade,course
1,王小凤,一年级,心理学
2,刘诗迪,二年级,心理学
3,李思雨,三年级,社会学
4,张文华,一年级,心理学
5,张青云,二年级,心理学
6,徐文杰,三年级,计算机
7,李智瑞,一年级,心理学
8,徐雨秋,二年级,计算机
9,孙皓然,三年级,社会学
10,李春山,一年级,社会学

需求：现在我们想知道最受欢迎的课程是哪一门。

解题思路：解法一通过按照course分组，按照course_count降序排序，取第一个值即为最受欢迎的课程，但是这种写法在一定程度上不严谨，比如说course_count最大值一样多，也就所说的并列第一，那么使用第一种方法就不能把所有最喜欢的课程都取出来。来看看第二种写法，通过使用rank窗口函数得出并列第一的值，通过where = 1，即可以全部取出最受欢迎的课程。

-- 18.筛选最受欢迎的课程
-- 解法一：不取重复值
select * from test.course_table;


select course,
       count(course) as course_count
from
     test.course_table
group by
         course
order by
         course_count desc limit 1;

-- 解法二：取重复值
select
       course,
       course_count
from
     (select
             course,
             count(course) as course_count,
             rank() over (order by count(course) desc ) as rank_num
     from
          test.course_table
     group by
              course) as t
where rank_num = 1;

运行结果：

（19）筛选出每个年级最受欢迎的三门课程

数据源：course_table.csv
需求：现在我们想知道每个年级最受欢迎的三门课程

解题思路：通过运用窗口函数，分组求Top N

-- 19.筛选出每个年级最受欢迎的三门课程
-- 解法一：利用两个窗口函数
select
       *
from
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank
     from
          (select grade,
                  course,
                  count(course) over (partition by grade,course) as count_num
          from test.course_table) as a
     group by a.grade,a.course,a.count_num) as b
where
      b.count_rank < 4 ;

-- 解法二
select * from test.course_table;

select
       *
from
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank from
        (select
          grade,
          course ,
          count(id) as count_num
        from
             test.course_table
        group by
                 grade,
                 course) as a) as b
where b.count_rank < 4;

运行结果：

根据此题也可以求出每个年级最受欢迎的课程，这里我不用窗口函数，使用连接子查询：

select
       *
from
     (select
             grade,
             course ,
             count(id) as count_num
     from
          test.course_table
     group by
              grade,
              course) as t1
where
      t1.count_num =
      (select
              t2.count_num
      from (select
                   grade,
                   course ,
                   count(id) as count_num
      from
           test.course_table
      group by
               grade,
               course) as t2
      where
            t1.grade = t2.grade
      order by
               t2.count_num desc limit 1);

运行结果：

（20）求累积和

数据源：consum_order_table.csv

order_id,uid,amount
201901,1,10
201902,2,20
201903,3,15
201904,3,15
201905,4,20
201906,4,20
201907,5,25
201908,5,25
201909,6,30
201910,6,30
201911,7,35
201912,7,35

需求：现在我们想看下80%的订单金额最少是由多少用户贡献的

解题思路：第一步按uid分组，求出每个uid的amount和，第二步在生成累积和的时候需要按照订单金额进行降序排列，这样就可以得到最少的人数，第三步利用子查询获取到全部的订单金额，求出小于总额80%的uid的数量

select
       count(uid)
from
     (select
       uid,
       sum_amount,
       sum(sum_amount) over (order by sum_amount desc ) as consume_amount
       from
            (select
                    uid,
                    sum(amount) as sum_amount
            from
                 test.consum_order_table
            group by uid) as uid_table) as t
where
      t.consume_amount < (select (sum(amount) * 0.8) as sum_80 from test.consum_order_table);

运行结果：

（21）获取新增用户数

数据源：user_reg_table.csv

uid,reg_time
1,2019/12/25 10:00
2,2019/12/26 10:00
3,2019/12/27 10:00
4,2019/12/28 10:00
5,2019/12/29 10:00
6,2019/12/30 10:00
7,2019/12/31 10:00
8,2020/1/1 10:00
9,2020/1/2 10:00
10,2020/1/3 10:00
11,2020/1/4 10:00

需求：我们想获取某一天的新增用户数，以及该天对应的过去7天内每天的平均新增用户数

解题思路：我们以2020-1-1那一天为例，首先通过case when 让reg_time = 设定的那一天，求出uid的数量即为某一天的新增用户数。然后求出2020-1-1向前7天的reg_time，求出uid的数量除以7即可得出过去7天内每天的平均新增用户数

-- 假设求2020-1-1那一天新增的用户
set @day_date = '2020-01-01';
select
       count(case when date (reg_time) = @day_date then uid end ) as new_count,
       (count(uid) / 7) as 7_avg_count

from
     test.user_reg_table
where
      date (reg_time) between date_sub(@day_date,interval 6 day) and @day_date;

运行结果：

（22）获取用户首次购买时间

数据源：first_order_table.csv

order_id,uid,order_time
201901,1,2020-01-01 10:00:00
201902,2,2020-01-02 10:00:00
201903,3,2020-01-03 10:00:00
201904,1,2020-01-04 10:00:00
201905,2,2020-01-05 10:00:00
201906,3,2020-01-06 10:00:00
201907,1,2020-01-07 10:00:00
201908,2,2020-01-08 10:00:00
201909,3,2020-01-09 10:00:00
201910,1,2020-01-10 10:00:00
201911,2,2020-01-11 10:00:00

需求：我们想获取每个用户的首次购买时间，以及首次购买时间是否在最近7天内

解题思路：首先按uid分组，求出每个uid的首次订单时间，然后通过case when判断首次订单时间是否大于当日七天之前的时间，如果大于则首次购买时间在最近7天内，如果小于则首次购买时间不在最近7天内

select
       uid,
       min(order_time) as first_time,
       (case when date_sub(current_date(),interval 7 day ) < min(date(order_time)) then 'yes' else 'no' end) as is_7_day
from
     test.first_order_table
group by
         uid;

运行结果：

（23）同时获取用户和订单数据

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取过去7天每天的新增用户数、订单数、下单用户数

解题思路：首先求过去7天每天新增用户数，再求过去7天每天新增订单数、下单用户数，两个表join即可得到需求

-- 假设以2020-01-04为例
set @day = '2020-01-04';

select
       a.tdate,
       a.count_uid,
       b.order_count,
       b.uesr_count
from
-- 求过去7天每天新增用户数
     (select
       date (reg_time) as tdate,
       count(uid) over(partition by date (reg_time)) as count_uid
     from
          test.user_reg_table
     where
           date (reg_time) between date_sub(@day,interval 7 day) and @day) as a
left join
-- 求过去7天每天新增订单数、下单用户数
     (select
             date (order_time) as tdate,
             count(order_id) as order_count ,
             count(uid) as uesr_count
     from
          test.first_order_table
     where
           date (order_time) between date_sub(@day,interval 7 day) and @day
     group by date (order_time)) as b

on a.tdate = b.tdate;

运行结果：

（24）随机抽样

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想从用户表中随机抽取5个用户，并获取这5个用户的历史购买订单数

解题思路：我们要随机抽取5个用户并获取他们的历史购买订单数，首先需要生成每个用户的历史购买订单数，然后从中随机抽取5个。具体的思路为利用rand()函数生成随机数，然后利用order by进行排序，最后利用limit将前5条数据显示出来，运行结果如下表所示。

select
       user_reg_table.uid,
       t.order_count
from
     test.user_reg_table

left join
         (select
                 uid,
                 count(order_id) as order_count
         from
              test.first_order_table
         group by
                  uid) as t

on user_reg_table.uid = t.uid
order by rand() limit 5;

运行结果：

（25）获取沉默用户数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取沉默用户的数量，沉默的定义是已注册但最近30天内没有购买记录的用户

解题思路：先求出最近三十天已经有购买记录的用户，再求出所有已经注册的用户。两个表连接，得出第一列为所有已经注册的用户，第二列则为最近三十天购买记录的用户，如果不能连接为null，最后求出null值的个数即为最近30天有购买记录的用户

-- 获取已经注册用户
select
       count(t1.uid) as slient_count
from
     test.user_reg_table as t1

-- 获取最近30天有购买记录的用户
left join
         (select
                 uid
         from
              test.first_order_table
         where
               date (order_time)
                   between current_date and date_sub(current_date,interval 30 day)
         group by uid) t2
on
    t1.uid = t2.uid
where
      t2.uid is null;

运行结果：

（26）获取新用户的订单数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取最近7天注册的新用户在最近7天内的订单数是多少

解题思路：第一步，求出最近7天注册的新用户，第二步，求出最近7天下订单的用户和订单数，第三步，两表连接，第二列为null值为最近7天新注册没下订单的用户，否则为下订单的用户，求出第三列的个数即为获取最近7天注册的新用户在最近7天内的订单数

select
       sum(t2.order_count)
from
-- 获取最近7天注册的新用户
       (select
               uid
       from
            test.user_reg_table
       where date (reg_time) between current_date and date_sub(current_date,interval 7 day )) as t1

left join
-- 获取最近7天下订单的用户和订单数
       (select
               uid,
               count(order_id) as order_count
       from
            test.first_order_table
       where date (order_time) between current_date and date_sub(current_date,interval 7 day )
       group by
                uid) as t2

on t1.uid = t2.uid;

运行结果：

（27）获取借款到期名单

数据源：loan_table.csv

id,loan_time,expire_time,reback_time,amount,status
1,2019/12/1,2019/12/31,,2208,0
2,2019/12/1,2019/12/31,2019/12/31,5283,1
3,2019/12/5,2020/1/4,,5397,0
4,2019/12/5,2020/1/4,,4506,0
5,2019/12/10,2020/1/9,,3244,0
6,2019/12/10,2020/1/9,2020/1/12,4541,1
7,2020/1/1,2020/1/31,2020/1/10,3580,1
8,2020/1/1,2020/1/31,,7045,0
9,2020/1/5,2020/2/4,,2067,0
10,2020/1/5,2020/2/4,,7225,0

需求：这张表包含id、loan_time（借款时间）、expire_time（到期时间）、reback_time（还款时间）、amount（金额）和status（还款状态，1表示已还款、0表示未还款）六个字段，我们想获取每天到期的借款笔数、借款金额和平均借款天数

select
       count(id) as loan_count ,
       sum(amount) as loan_amount,
       avg(datediff(reback_time,loan_time)) as avg_day
from
     test.loan_table
where
      date (loan_time) = current_date;

运行结果：

（28）获取即将到期的借款信息

数据源：loan_table.csv

需求：我们想知道有多少笔借款会在未来7天内到期，以及其中有多少笔是已经还款的

解题思路：先获取今日到未来7天所有的还款日期，求出count(id)即为借款会在未来7天内到期的数量，在计算出目前状态为还款的数量。

select
       count(id) as loan_count,
       count(case when status = 1 then id end ) as reback_count
from
     test.loan_table
where
      expire_time between current_date and date_add(current_date,interval 7 day );

运行结果：

（29）获取历史逾期借款信息

数据源：loan_table.csv

需求：我们想知道历史逾期的借款笔数和金额，以及至今还逾期的借款笔数和金额

解题思路：这里面的关键信息在于，逾期怎么判断，对到期时间和还款时间进行比较，如果是逾期且现在已经还款的，可以直接比较到期时间和还款时间，如果还款时间大于到期时间，则说明是逾期的；还有一种是逾期且至今还未还款的，这种情况是没有还款时间的，也就是还款时间为空，但是到期时间是在今天之前，说明已经到期但是未还款。

select
       count(id) as loan_count,
       sum(amount) as loan_amount,
       count(case when status = 0 then id end ) as no_reback_count,
       sum(case when status = 0 then amount end ) as  no_reback_amount
from
     test.loan_table
where (current_date > expire_time and reback_time is null) or (reback_time > loan_table.expire_time);

运行结果：

（30）获取部门工资最高的员工

数据源：Employee.csv、Department.csv

Id,Name,Salary,DepartmentId
1,Joe,70000,1
2,Jim,90000,1
3,Henry,80000,2
4,Sam,60000,2
5,Max,90000,1

Id,Name
1,IT
2,Sales

需求1：如何求出Employee表薪水最高的员工（含重复值）

-- 问题一：如何求出并列第一第二的值
-- 解法一：运用子查询
select
       a.Id,
       Name,
       Salary
from
     test.Employee a
         inner join
         (select
                 Id
         from
              test.Employee
         where
               Salary = (select max(Salary) from test.Employee)) b on a.Id = b.Id;

-- 解法二：运用查rank（）窗口函数
select
       Name,
       Salary
from
     (select
             Name,
             Salary,rank() over (order by Salary desc ) as rank_num
     from test.Employee) as t
where t.rank_num = 1;

运行结果：

需求2：求出部门工资最高的员工（含重复值）

-- 问题二：部门工资最高的员工
select
       t2.Department,
       t2.Employee,
       t2.Salary
from
     (select
             Department,
             Employee,
             Salary,
             rank() over (partition by Department order by Salary desc ) as rank_num
     from
          (select
                  a.Name as Employee,
                  a.Salary as Salary ,
                  b.Name as Department
          from
               test.Employee as a left join test.Department as b on a.DepartmentId = b.Id) as t) as t2
where rank_num = 1;

运行结果：

（补）求连续登陆三天的用户

表结构：

解题思路：

第一步首先按照id，和date分组，
第二步，使用窗口函数，按照id分区，按日期排序，
第三步，使用date_sub函数让date向前移动cum天
第四步，按照id,result_date分组，求出result_date的个数，并取出result_date>3的个数即为连续登陆3天以上的用户

具体代码：

select
       id,
       result_date,
       count(result_date) cnt_3
from
     (select
             id,
             date,
             cum,
             date_sub(date,interval cum day ) as result_date
     from
          (select
                  id,
                  date(date) as date,
                  row_number() over (partition by id order by date(date)) as cum
          from
               test.demo01
          group by
                   id,
                   date(date)
          ) as date_table) as t1
group by
         id,
         result_date
having count(result_date) >= 3;

（补）某互联网公司面试题

问题：一张用户日活信息表，用户某一天的用户info为空，需要用该用户的最近一天的info去补，请问该如何实现

表结构：

dt         uid     info
20210811   001    1
20210811   002    0
20210811   003    1
20210811   004    1
20210812   001    null
20210812   002    0
20210812   003    null
20210813   002    1
20210813   003    0
20210813   004    1
20210814   001    0
20210814   003    1
20210814   004    null

比如说001用户12号的数据为null，就需要拿10号或者14号的数据去补，而不是拿15号

20210810   001    1      
20210812   001    null  
20210814   001    1  
20210815   001    1  
20210816   001    0  
20210817   001    null     
20210818   001    null  
20210819   001    null

思路：先做笛卡尔乘积，再取时间差的绝对值的最小值，再分组排序，取第一个

具体测试数据：

测试代码：

select
    dt1
    ,uid
    ,info as info_null
from (
                  select dt1
                       , uid
                       , info_null
                       , info
                       , min(dt_diff)                                                   as min_dt_diff
                       , row_number() over (partition by dt1,uid order by min(dt_diff)) as rank_num
                  from (
                           select a.dt                      as dt1
                                , b.dt                      as dt2
                                , a.uid                     as uid
                                , a.info                    as info_null
                                , b.info                    as info
                                , abs(datediff(b.dt, a.dt)) as dt_diff
                           from
                               -- 计算出为空的数据
                               (
                                   select dt
                                        , uid
                                        , info
                                        , row_number() over (partition by uid order by dt) as rank_num
                                   from user_login_info
                                   where info is null
                               ) a
                                   left join
                               -- 计算出不为空的数据
                                   (
                                       select dt
                                            , uid
                                            , info
                                            , row_number() over (partition by uid order by dt) as rank_num
                                       from user_login_info
                                       where info is not null
                                   ) as b
                               on a.uid = b.uid
                           group by a.dt
                                  , b.dt
                                  , a.uid
                                  , a.info
                                  , b.info
                       ) as t
                  group by dt1
                         , uid
                         , info_null
                         , info
              ) as abt
where rank_num = 1

运行结果：

但是我这里的写法可能不太好，因为用户日活数据表数据量可能比较大，在做笛卡尔积的时候会性能较低。

以上内容仅供参考学习，如有侵权请联系我删除！
如果这篇文章对您有帮助，左下角的大拇指就是对博主最大的鼓励。
您的鼓励就是博主最大的动力！

更多推荐

大数据开发面试必会的SQL 30题！！！

大数据开发面试必会的SQL 30题！！！

目录

（1）查询每个区域的用户数

（2）查询每个区域的男女用户数

（3）查询姓张的用户数

（4）筛选出id3～id5的用户

（5）筛选出绩效不达标的员工

（6）筛选出姓张的且绩效不达标的员工

（7）查询获得销售冠军超过两次的人

（8）查询某部门一年的月销售额最高涨幅

（9）查询每个季度绩效得分大于70分的员工

（10）删除重复值

（11）行列互换

（12）多列比较

（13）对成绩进行分组

（14）周累计数据获取

（15）周环比数据获取

（16）查询获奖员工信息

（17）计算用户留存情况

（18）筛选最受欢迎的课程

（19）筛选出每个年级最受欢迎的三门课程

（20）求累积和

（21）获取新增用户数

（22）获取用户首次购买时间

（23）同时获取用户和订单数据

（24）随机抽样

（25）获取沉默用户数

（26）获取新用户的订单数

（27）获取借款到期名单

（28）获取即将到期的借款信息

（29）获取历史逾期借款信息

（30）获取部门工资最高的员工

（补）求连续登陆三天的用户

（补）某互联网公司面试题

发布评论取消回复

最近发表

热门文章

标签列表

大数据开发面试必会的SQL 30题！！！

目录

（1）查询每个区域的用户数

（2）查询每个区域的男女用户数

（3）查询姓张的用户数

（4）筛选出id3～id5的用户

（5）筛选出绩效不达标的员工

（6）筛选出姓张的且绩效不达标的员工

（7）查询获得销售冠军超过两次的人

（8）查询某部门一年的月销售额最高涨幅

（9）查询每个季度绩效得分大于70分的员工

（10）删除重复值

（11）行列互换

（12）多列比较

（13）对成绩进行分组

（14）周累计数据获取

（15）周环比数据获取

（16）查询获奖员工信息

（17）计算用户留存情况

（18）筛选最受欢迎的课程

（19）筛选出每个年级最受欢迎的三门课程

（20）求累积和

（21）获取新增用户数

（22）获取用户首次购买时间

（23）同时获取用户和订单数据

（24）随机抽样

（25）获取沉默用户数

（26）获取新用户的订单数

（27）获取借款到期名单

（28）获取即将到期的借款信息

（29）获取历史逾期借款信息

（30）获取部门工资最高的员工

（补）求连续登陆三天的用户

（补）某互联网公司面试题

相关文章

发布评论取消回复

最近发表

热门文章

标签列表