Python: using Selenium WebDriver + BeautifulSoup with frame switching to simulate clicking a page's "next page" button and scrape its data

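This is a record of a quickly written Python crawler. The goal was to scrape the company profiles of every stock listed under the New Third Board (新三板) section of the 中财网 data engine, at http://data.cfi/data_ndkA0A1934A1935A1986A1995.html.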

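On simpler sites each page number has its own link, so you can work out the pattern from how the link changes, generate the links for all pages, and fetch them one by one. On this site, however, the link does not change when you turn the page, so I looked at the request that is sent when clicking page 2.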

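The request uses the GET method and carries a curpage parameter that appears to control the page number. Changing that value in the request link made no difference to what came back, so I switched to the approach named in the title: use selenium + beautifulsoup to simulate clicking the page's "next page" button, and scrape the content of each page after it loads.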
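For reference, the parameter experiment above can be reproduced in a few lines. The sketch below is written for Python 3 (unlike the Python 2 listing later in this post) and rewrites the curpage value in the request URL that appears in main() further down. The helper name with_page is made up for illustration. On this site the rewritten URLs made no difference, which is why the approach was dropped, but the same pattern is how you would generate per-page links on a site where the URL does change.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="curpage"):
    """Return `url` with query parameter `param` set to `page`.

    Hypothetical helper. If `param` is absent, the URL is returned unchanged.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "sortfd="
    query = parse_qsl(parts.query, keep_blank_values=True)
    query = [(k, str(page) if k == param else v) for k, v in query]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Request URL taken from the listing in this post
base = ("http://data.cfi/cfidata.aspx?sortfd=&sortway=&curpage=2"
        "&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock=")

page_urls = [with_page(base, n) for n in range(1, 4)]
for u in page_urls:
    print(u)
```

Each generated URL could then be fetched and compared against the others to see whether the parameter is actually honoured.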

First, some preparation. Install the required packages by opening a command line and running pip install selenium and pip install beautifulsoup4.

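Next, download and install the chromedriver driver from https://sites.google/a/chromium/chromedriver/downloads. Remember to add it to your PATH environment variable, or place it directly in the working directory. (IE, PhantomJS and other browsers can be used as well.)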
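Before launching the browser it can help to confirm that the driver is actually reachable. This is a minimal Python 3 sketch using only the standard library. The helper name find_driver is made up for illustration, and the check only covers PATH lookup as performed by shutil.which.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of `name` if it can be found on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH; add its folder to PATH "
          "or place it in the working directory")
else:
    print("chromedriver found at", path)
```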

We start by collecting the home page link of every stock. The code (Python 2) is as follows:

# -*- coding: utf-8 -*-
import sys
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Python 2 only: make utf-8 the default encoding for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')


def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    page = 0
    lst = []  # links already written, used to skip duplicates
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            # every link on the page that opens in a new tab
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            # Click the "next page" (下一页) link, then wait for the page to load.
            # find_element_by_xpath is the older Selenium API; Selenium 4.3+
            # replaced it with driver.find_element(By.XPATH, ...).
            driver.find_element_by_xpath("//a[contains(text(),'下一页')]").click()
            time.sleep(2)
            page += 1
    return 'Finished'


def main():
    url = 'http://data.cfi/cfidata.aspx?sortfd=&sortway=&curpage=2&fr=content&ndk=A0A1934A1935A1986A1995&xztj=&mystock='
    crawl(url)


if __name__ == '__main__':
    main()

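Running this code fails every time. The error says that the button we are looking for cannot be found.

So we take a look at the page source. It turns out the page is split into several frames, which suggests we need to switch into the right frame first. The links we want to scrape sit in the frame whose name is "content", so we add one line of code: driver.switch_to.frame('content')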
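A quick way to confirm frame names without reading the source by eye is to list the frame elements programmatically. The sketch below is Python 3 and uses only the standard library's html.parser. The sample markup is made up for illustration and is not the real page source; FrameLister and list_frames are hypothetical names.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect (tag, name, src) for every <frame> and <iframe> element."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attrs = dict(attrs)
            self.frames.append((tag, attrs.get("name"), attrs.get("src")))


def list_frames(html):
    """Return the frames found in `html` in document order."""
    parser = FrameLister()
    parser.feed(html)
    return parser.frames


# Made-up frameset for illustration only
sample = """
<frameset rows="80,*">
  <frame name="top" src="top.html">
  <frame name="content" src="list.html">
</frameset>
"""

for tag, name, src in list_frames(sample):
    print(tag, name, src)
```

In the crawler, the same function could be called on driver.page_source right after driver.get(url) to see which frame names are available to switch_to.frame.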

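import time  # used by time.sleep below; the other imports are as in the first listing


def crawl(url):
    driver = webdriver.Chrome()
    driver.get(url)
    # the links live inside the frame named "content", so switch into it first
    driver.switch_to.frame('content')
    page = 0
    lst = []  # links already written, used to skip duplicates
    with open('./url.txt', 'a') as f:
        while page < 234:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            print(soup)
            urls_tag = soup.find_all('a', target='_blank')
            print(urls_tag)
            for i in urls_tag:
                if i['href'] not in lst:
                    f.write(i['href'] + '\n')
                    lst.append(i['href'])
            # click the "next page" (下一页) link, then wait for the page to load
            driver.find_element_by_xpath("//a[contains(text(),'下一页')]").click()
            time.sleep(2)
            page += 1
    return 'Finished'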

With that change, the script runs successfully.
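One possible refinement, not part of the original script: the loop checks `i['href'] not in lst`, which scans the whole list for every link, and `i['href']` raises KeyError for an anchor that has no href attribute. The Python 3 sketch below shows the same "write each link once" idea with a set and a guard for missing hrefs. It uses the standard library parser instead of BeautifulSoup so that it runs without extra packages; BlankTargetLinks and new_links are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class BlankTargetLinks(HTMLParser):
    """Collect href values of <a target="_blank"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # skip anchors without an href instead of failing on them
        if tag == "a" and attrs.get("target") == "_blank" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def new_links(html, seen):
    """Return hrefs in `html` that are not yet in `seen`, adding them to `seen`."""
    parser = BlankTargetLinks()
    parser.feed(html)
    fresh = []
    for href in parser.hrefs:
        if href not in seen:  # set membership test, constant time on average
            seen.add(href)
            fresh.append(href)
    return fresh


# Made-up markup standing in for two consecutive result pages
seen = set()
page1 = '<a target="_blank" href="/a.html">A</a><a href="/skip.html">S</a>'
page2 = ('<a target="_blank" href="/a.html">A</a>'
         '<a target="_blank" href="/b.html">B</a>')
print(new_links(page1, seen))
print(new_links(page2, seen))
```

In the crawler, new_links(driver.page_source, seen) would take the place of the inner for loop, with each returned link written to url.txt.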

Reference blog posts: http://unclechen.github.io/2016/12/11/python%E5%88%A9%E7%94%A8beautifulsoup+selenium%E8%87%AA%E5%8A%A8%E7%BF%BB%E9%A1%B5%E6%8A%93%E5%8F%96%E7%BD%91%E9%A1%B5%E5%86%85%E5%AE%B9/

http://wwwblogs/liyuhang/p/6661835.html
