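Contents

Getting web page source code with the Requests library
Getting web page source code with the Selenium library
Pros and cons of scraping with the Requests and Selenium libraries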

1. Getting web page source code with the Requests library

1.1 Using the Requests library to get the source code of Baidu News

Code:

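import requests

# Baidu search URL for the keyword 阿里巴巴 (Alibaba)
url = 'https://www.baidu/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=阿里巴巴'
# Send a GET request without custom headers and read the response body as text
res = requests.get(url).text
print(res)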

Page source returned:

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu/"></noscript>
</body>
</html>

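As you can see, the real page source was not returned. This is because the Baidu News site only accepts requests that appear to come from a browser and does not accept requests that identify themselves as coming from Python.
- Solution: pass the headers parameter to requests.get() so that the request imitates a browser.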
```
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62"
}
```

1.2 Improved version: imitating a browser to get the real page source

Implementation:

import requests
url = 'https://www.baidu/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=阿里巴巴'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62"
}
res = requests.get(url,headers=headers).text
print(res)

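1.3 More about the headers parameter

The headers parameter tells the website about the visitor. Its User-Agent field indicates which browser the visitor is using.
Sometimes the page source can be obtained without headers (for example, when scraping the official Python website). However, setting headers takes little effort and avoids a possible cause of failed requests, so adding it is still recommended.

Simply remember to put the following code at the top of every scraper: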

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62"
}

Then add the headers=headers argument each time you call requests.get().
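
As a rough illustration of this advice, the sketch below wraps the header handling in one small helper so that every request carries the same User-Agent. The names `DEFAULT_HEADERS`, `merge_headers`, and `fetch_html`, along with the 10-second timeout, are assumptions made for this example and are not part of the Requests library. The User-Agent value is the browser string shown above, written with standard spacing.

```python
# Shared browser-like headers (User-Agent written with standard spacing).
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62"
    )
}


def merge_headers(extra=None):
    """Return a copy of DEFAULT_HEADERS, optionally overridden by `extra`."""
    headers = dict(DEFAULT_HEADERS)
    if extra:
        headers.update(extra)
    return headers


def fetch_html(url, extra_headers=None, timeout=10):
    """Fetch `url` with browser-like headers and return the response text."""
    import requests  # imported here so merge_headers works without Requests installed

    response = requests.get(url, headers=merge_headers(extra_headers), timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

With a helper like this, a scraper would call `fetch_html(url)` instead of repeating `headers=headers` in every `requests.get()` call.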

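1.4 Drawbacks of the Requests library

The Requests library returns the page source before any rendering, so when it is used on a dynamically rendered page it often fails to return the content we want.
A quick way to check whether a page is dynamically rendered:
- Right-click the page and choose the view-source option from the context menu. If the source shown is very short and does not contain the information visible in the browser's developer tools, the content seen in the developer tools is the result of dynamic rendering.
To scrape data from a dynamically rendered page:
- Use the Selenium library to open a simulated browser, visit the page, and then read the rendered page source.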
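
The quick check described above can also be roughly automated: pick a piece of text that is visible in the browser and see whether it appears in the raw source returned by Requests. The sketch below uses a hypothetical helper name, `looks_dynamically_rendered`, and hard-coded sample HTML rather than a real page.

```python
def looks_dynamically_rendered(raw_html, visible_text):
    """Return True if text seen in the browser is missing from the raw source.

    raw_html: page source as returned by requests.get(...).text
    visible_text: a snippet you can see on the rendered page
    """
    return visible_text not in raw_html


# Sample sources for illustration only (not taken from a real site).
static_page = "<html><body><p>Closing price: 100</p></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(looks_dynamically_rendered(static_page, "Closing price"))   # False
print(looks_dynamically_rendered(dynamic_page, "Closing price"))  # True
```

This is only a heuristic: missing text can also mean the request was rejected, so it is worth looking at the returned source before concluding that the page is dynamic.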

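2. Getting web page source code with the Selenium library

2.1 Installing the simulated browser and the Selenium library

To scrape data with Selenium, you need to install the Selenium library for Python and also a simulated browser (see post 4 for detailed installation steps). Selenium controls this simulated browser to visit the page, and only then can the page source be read.

2.2 Getting the page source

Using the Selenium library to get stock information from Sina Finance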

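import time
from selenium import webdriver

browser = webdriver.Chrome()  # open the simulated Chrome browser
browser.get('https://finance.sina/stock/')
data = browser.page_source  # key line: the page source after rendering
print(data)
time.sleep(1000)  # keep the browser window open for 1000 seconds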

The following code closes the simulated browser:
```
browser.quit()
```
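
Because the browser stays open until `browser.quit()` is called, one possible pattern is to wrap the page load in `try`/`finally` so that the browser is closed even when loading fails. The sketch below is an illustration under assumptions: `get_rendered_source` and `make_chrome` are hypothetical names, and the optional `--headless` argument (which runs Chrome without a visible window) is not used elsewhere in this post.

```python
def make_chrome(headless=True):
    """Create a Chrome browser controlled by Selenium."""
    from selenium import webdriver  # imported here so the module loads without Selenium

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")  # run without opening a window
    return webdriver.Chrome(options=options)


def get_rendered_source(url, browser_factory=make_chrome):
    """Open `url` in a browser and return the rendered page source."""
    browser = browser_factory()
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even if get() raised
```

A call such as `get_rendered_source('https://finance.sina/stock/')` would then return the rendered source and leave no browser window behind. Passing the factory as a parameter is a design choice for this sketch: it lets a different browser, or a stand-in object, be swapped in.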

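3. Pros and cons of scraping with the Requests and Selenium libraries

Requests accesses the page directly, so scraping is very fast. Selenium has to open a simulated browser before visiting the page, so scraping is slower.
As a rough comparison, if Requests can scrape 50% of websites, Selenium can scrape 95%. Most sites that are hard to scrape can still have their page source retrieved with Selenium.
In practice the two libraries are usually combined so that each covers the other's weakness: if Requests can return the page source you need, use Requests first; if it cannot, fall back to Selenium.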
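
The combined strategy could be sketched roughly as follows: try the fast fetcher first and fall back to the browser-based one only when the expected content is missing. `fetch_with_fallback`, `fetch_static`, `fetch_rendered`, and `expected_text` are hypothetical names chosen for this example. In practice the two fetchers would wrap `requests.get()` and Selenium's `page_source`; stand-in functions are used here so the logic can be run without network access.

```python
def fetch_with_fallback(url, expected_text, fetch_static, fetch_rendered):
    """Return (source, method), preferring the fast static fetcher."""
    try:
        html = fetch_static(url)
    except Exception:
        html = ""  # treat a failed static request like a miss
    if expected_text in html:
        return html, "requests"
    return fetch_rendered(url), "selenium"


# Stand-ins for the real fetchers (assumptions for illustration only).
def fake_static(url):
    return "<div id='app'></div>"  # unrendered shell


def fake_rendered(url):
    return "<div id='app'>stock quotes</div>"  # content after rendering


source, method = fetch_with_fallback(
    "https://example.com", "stock quotes", fake_static, fake_rendered
)
print(method)  # selenium
```

Here the static fetcher returns only an empty shell, so the function falls back to the rendered version. When the expected text is already present in the static source, the slower browser is never started.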

