Python Web Scraping: Using Selenium to Scrape Bilibili Video Information

Preface

Until now, whenever I thought about writing a crawler, the steps I had in mind were mostly:

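Find the URL of the page to scrape and fetch the response with the requests library
If the response is HTML, parse it with a tool such as BeautifulSoup to extract the data
If the response is JSON, parse it directly into a JSON object
Save the data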
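
The parse step of this workflow can be sketched without any network access. This is only an illustration, not code from the original project: it uses the standard library's html.parser in place of BeautifulSoup, hard-coded strings in place of a requests response, and the names TitleCollector and parse_response are made up for the example.

```python
import json
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the title attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


def parse_response(body, content_type):
    """Pick the HTML or the JSON branch based on the content type."""
    if "json" in content_type:
        return json.loads(body)          # JSON: parse directly
    collector = TitleCollector()
    collector.feed(body)                 # HTML: walk the markup
    return collector.titles


# Stand-ins for the text a real HTTP response would carry.
html_body = '<ul><li><a title="Video A" href="/a">A</a></li></ul>'
json_body = '{"videos": [{"title": "Video A"}]}'

html_result = parse_response(html_body, "text/html")
json_result = parse_response(json_body, "application/json")
print(html_result)
print(json_result)
```

In a real crawler the two body strings would come from the HTTP response, and the returned data would then be handed to the save step.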

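But today I discovered Selenium. Selenium is a complete testing framework for web applications that can drive a real browser. Used in a crawler, it lets us interact with the website itself: for example, we can simulate searching for "爱乐之城" (La La Land) on Bilibili and read the content of the results page, without having to copy the URL by hand.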

Using Selenium

Installation

Installing the selenium library for Python is, as usual, a one-liner:

pip install selenium

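The selenium library alone is not enough, though; you also need to download a WebDriver. I use the Chrome browser, and the download address is: http://npm.taobao/mirrors/chromedriver/
Download the version that matches your browser, extract it, and add it to the PATH environment variable.
Checking the Chrome version
Open the menu in the top-right corner → Help → About Google Chrome to see the version.
Adding to the environment variable
Extracting the archive gives you the chromedriver executable. On Windows, move it into the directory that contains chrome.exe, then add that directory to the PATH environment variable.
On Ubuntu, running sudo mv chromedriver /usr/local/bin is enough.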
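
Before launching a browser it can be handy to confirm that the driver really is discoverable. Here is a small check, assuming the executable keeps its default name chromedriver (find_driver is just a name chosen for this sketch):

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH")
else:
    print(f"chromedriver found at {path}")
```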

Basic usage

Visiting a page

from selenium import webdriver

browser = webdriver.Chrome()          # launch Chrome through chromedriver
browser.get("http://www.baidu.com")   # load the Baidu home page
browser.close()                       # close the browser window

Running this opens a browser window and visits the Baidu home page.

Interacting with the page

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
search_box = browser.find_element(By.ID, 'kw')   # Baidu's search input
search_box.send_keys('ipad')
time.sleep(1)
search_box.clear()
search_box.send_keys('MacBook pro')
button = browser.find_element(By.ID, 'su')       # the search button
button.click()
time.sleep(5)                                    # leave time to see the results
browser.quit()

The program opens Baidu automatically, types ipad, clears the box, then types MacBook pro and clicks the search button.

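Information to scrape

Go to Bilibili, search for 爱乐之城, and press F12 on the results page to inspect the elements:

The video information we need to scrape is: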

Title
Description
View count
Danmaku (bullet comment) count
Upload date
Uploader (UP主)
Video link

Inspecting the elements shows that each video's information is an item in an unordered list, and that item contains everything we need to scrape.
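
One way to hold these seven fields is a small record type. This is only a sketch: the VideoInfo name and its field names are my own choices for illustration, the sample values are placeholders, and the counts are kept as strings because the page shows them as display text.

```python
from dataclasses import dataclass, asdict


@dataclass
class VideoInfo:
    title: str
    desc: str
    views: str      # view count, kept as display text
    barrages: str   # danmaku count, also display text
    date: str
    up: str         # uploader name
    href: str       # video link


# Placeholder values, not real scraped data.
sample = VideoInfo(
    title="Example title",
    desc="Example description",
    views="100",
    barrages="5",
    date="2019-01-01",
    up="example_up",
    href="https://example.com/video/1",
)
row = asdict(sample)   # a plain dict, convenient for saving later
print(row["title"], row["up"])
```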

Visiting Bilibili

Given what we want to scrape, the actions to perform through Selenium are:

Open the Bilibili home page, type "爱乐之城", and click search
On the results page, keep clicking "next page" to fetch the next page of results
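
Taken together, the two actions amount to a simple loop: search once, scrape the first page, then step through the remaining pages. The sketch below shows only that control flow. The callables search, scrape_page and go_to_page are placeholders standing in for the Selenium code, not functions from the original source, and search is assumed to return the number of result pages.

```python
def crawl(search, scrape_page, go_to_page):
    """Search once, then scrape every results page in order."""
    total_pages = search()               # assumed to return the page count
    scrape_page(1)                       # the search lands on page 1
    for page in range(2, total_pages + 1):
        go_to_page(page)                 # e.g. click "next page"
        scrape_page(page)
    return total_pages


# Dry run with stubs that only record the order of calls.
log = []


def fake_search():
    log.append("search")
    return 3


crawl(
    search=fake_search,
    scrape_page=lambda page: log.append(f"scrape {page}"),
    go_to_page=lambda page: log.append(f"goto {page}"),
)
print(log)
```

Passing the steps in as callables keeps the loop easy to check without opening a browser.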

Visiting Bilibili and searching

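# options is the Options object built in the headless section later on;
# call webdriver.Chrome() without it to keep the browser window visible
browser = webdriver.Chrome(options=options)
browser.set_window_size(1400, 900)

browser.get('https://www.bilibili.com')
# Refresh once so that the login pop-up does not cover the search button
browser.refresh()
search_input = browser.find_element(By.CLASS_NAME, 'search-keyword')
button = browser.find_element(By.CLASS_NAME, 'search-submit')
search_input.send_keys('爱乐之城')
button.click()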

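The page is refreshed because visiting Bilibili brings up a login pop-up that covers the search button and makes it impossible to click.

Visiting each page in turn

First, get the total number of pages: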

total_btn = browser.find_element(By.CSS_SELECTOR, "div.page-wrap > div > ul > li.page-item.last > button")
total = int(total_btn.text)
print(f'Total pages: {total}')

For each page, the response data has to be parsed and saved:

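from bs4 import BeautifulSoup

def get_source():
    html = browser.page_source           # HTML of the page as currently rendered
    soup = BeautifulSoup(html, 'lxml')   # requires the lxml parser to be installed
    save_to_excel(soup)                  # parse and store the results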

Getting the next page:

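from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def next_page(page_num):
    WAIT = WebDriverWait(browser, 10)   # time out after 10 s
    try:
        next_btn = WAIT.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.page-wrap > div > ul > li.page-item.next > button')))
        next_btn.click()
        WAIT.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div.page-wrap > div > ul > li.page-item.active > button'), str(page_num)))
        get_source()
    except TimeoutException:
        # the page did not load in time: refresh and try again
        browser.refresh()
        return next_page(page_num)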

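To guard against missing the page data because the page has not finished loading, I catch the timeout exception. WAIT.until calls the condition from EC every 0.5 s until it returns a success value, or until the timeout is reached and an exception is raised.
The code above checks two conditions: that the "next page" button is clickable, and that the page currently displayed is the requested one (page_num).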
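
To make the polling behaviour concrete, here is a simplified stand-in for the idea behind WebDriverWait.until, written with the standard library only. It is not Selenium's implementation: wait_until and its parameters are names invented for this sketch, and it raises the built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy
    value; raise TimeoutError if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(poll)


# A condition that only succeeds on its third call.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(ready, timeout=2.0, poll=0.01))
```

The short poll interval in the demo only keeps the example fast; the defaults mirror the 10 s timeout and 0.5 s interval described above.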

Parsing the data

Use BeautifulSoup to parse the data:

infos = soup.find_all(class_='video matrix')   # one element per video
for info in infos:
    link = info.find('a')                      # first link holds title and URL
    title = link.get('title')
    href = link.get('href')
    desc = info.find(class_='des hide').string.strip()
    views = info.find(class_='so-icon watch-num').text.strip()
    barrages = info.find(class_='so-icon hide').text.strip()
    date = info.find(class_='so-icon time').text.strip()
    up = info.find(class_='up-name').string.strip()
    print(f'Scraped: {title}  uploader: {up}  views: {views}')
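
If the href values turn out to be relative or protocol-relative (starting with //), they can be turned into full links before saving. That format is an assumption about the page, not something shown above, and BASE_URL, absolute_url and the av1 id below are hypothetical names for this sketch:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bilibili.com"   # assumed base for relative links


def absolute_url(href, base=BASE_URL):
    """Turn a relative or protocol-relative href into a full URL."""
    return urljoin(base, href)


full = absolute_url("//www.bilibili.com/video/av1")
print(full)
```

Links that are already absolute pass through unchanged, so the helper is safe to apply to every href.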

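The data can be written to an Excel spreadsheet with the xlwt library, or converted to JSON and so on. This part is straightforward, so I won't go into it.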
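
For completeness, here is a minimal sketch of the JSON route using only the standard library (it does not use xlwt). The save_to_json name, the file name and the sample row are placeholders, not taken from the original source.

```python
import json
import os
import tempfile


def save_to_json(rows, path):
    """Write a list of dicts to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)


# Placeholder row, not real scraped data.
rows = [
    {"title": "爱乐之城 example", "up": "example_up", "views": "100"},
]
out_path = os.path.join(tempfile.gettempdir(), "bilibili_videos.json")
save_to_json(rows, out_path)

# Read the file back to confirm it round-trips.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == rows)
```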

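Headless crawling

Having the browser window open is handy while debugging, but once the crawler is finished there is little reason to keep showing it. Earlier versions of Selenium achieved headless crawling by pairing with PhantomJS; this has since been replaced by the simpler headless mode built into the browser: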

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
browser = webdriver.Chrome(options=options)   # run Chrome without a visible window

Results

Full source code: https://github/chenf99/python/blob/master/BiliBiliCrawl/crawl.py

