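A Simple Python Crawler for Novels on Qidian (起点中文网), for Learning Only

Preface

During my internship I taught myself VBA. Now I am picking up the Python I learned in class again, and I am recording my progress here.

This article is for learning purposes only. Please do not use it commercially.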

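I. Crawler approach

Pages that do not require a login only need a simple crawler: fetch the novel's table of contents, then follow the links in it to fetch the text of each chapter.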

II. Steps

1. Import the libraries

The code is as follows (example):

import sys

import requests
from bs4 import BeautifulSoup

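2. Fetch the page

The code is as follows (example):

target = 'https://book.qidian/info/1024995653#Catalog'
req = requests.get(url=target)

To guard against failed requests and garbled text, add the two lines below. raise_for_status() raises an exception when the response has a 4xx or 5xx status, and apparent_encoding is the encoding that requests guesses from the response body:

req.raise_for_status()
req.encoding = req.apparent_encoding

The HTML of the page is now available:

html = req.text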
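The fetch steps above can be wrapped in one small helper. This is a minimal sketch rather than the author's code: the name fetch_html, the timeout value, and the User-Agent string are assumptions added for illustration. It accepts any object with a requests-style get() method, for example a requests.Session().

```python
def fetch_html(session, url, timeout=10):
    """Fetch url and return the decoded HTML (hypothetical helper)."""
    # The header value is an illustrative assumption, not taken from the article.
    headers = {"User-Agent": "Mozilla/5.0 (learning-crawler)"}
    resp = session.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()                 # fail loudly on 4xx/5xx
    resp.encoding = resp.apparent_encoding  # use the encoding guessed from the body
    return resp.text
```

With requests installed, html = fetch_html(requests.Session(), target) would stand in for the separate lines shown above. The timeout keeps a stalled connection from hanging the whole run.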

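3. Analyze the HTML

In the HTML we need to find the chapter titles and links that make up the table of contents, together with the tags that hold them.

Press F12 on the novel's catalog page and inspect the HTML. The table of contents sits under a div tag with class='catalog-content-wrap' and id='j-catalogWrap'. Looking further, nested tags such as volume-wrap and volume act as containers for the chapter list.

These extend all the way down to the a tags that carry the links. With the target located, the analysis is complete:

bf = BeautifulSoup(html, "html.parser")
catalogDiv = bf.find('div', class_='catalog-content-wrap', id='j-catalogWrap')
volumeWrapDiv = catalogDiv.find('div', class_='volume-wrap')
volumeDivs = volumeWrapDiv.find_all('div', class_='volume')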
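To see how these nested lookups behave without touching the network, the same calls can be run against a hand-written fragment. The HTML below is an invented stand-in that only mimics the class and id names described above; the real page presumably contains far more markup.

```python
from bs4 import BeautifulSoup

# Invented sample: only the class/id names come from the article.
SAMPLE = """
<div class="catalog-content-wrap" id="j-catalogWrap">
  <div class="volume-wrap">
    <div class="volume">
      <ul>
        <li><a href="//read.example/chapter/1">Chapter 1</a></li>
        <li><a href="//read.example/chapter/2">Chapter 2</a></li>
      </ul>
    </div>
    <div class="volume">
      <ul><li><a href="//read.example/chapter/3">Chapter 3</a></li></ul>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
catalog = soup.find("div", class_="catalog-content-wrap", id="j-catalogWrap")
volumes = catalog.find("div", class_="volume-wrap").find_all("div", class_="volume")

# Collect (title, href) pairs from every volume, in page order.
chapters = [(a.string, a.get("href")) for v in volumes for a in v.find_all("a")]
print(chapters)
```

Note that class_='volume' matches whole class names, so the volume-wrap div is not picked up as a volume.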

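4. Extract the information from the tags

Again using BeautifulSoup, take every a tag inside each volume and store its text and its href.

for volumeDiv in volumeDivs:
    aList = volumeDiv.find_all('a')
    for a in aList:
        chapterName = a.string
        chapterHref = a.get('href')

The whole table of contents has now been retrieved, and the hrefs can be used to fetch the chapter text.

5. Fetch the chapter text

Open any one of the links and inspect the HTML of the chapter text.

There are two layouts. In one, the text sits directly in p tags. In the other, each p contains a span with class=content-wrap that holds the text.

In both cases the text lives under the div with class='read-content j_readContent', so locate that first:

# chapterHref must be a full URL here; the complete code prepends 'https:' to each href
req = requests.get(url=chapterHref)
req.raise_for_status()
req.encoding = req.apparent_encoding
html = req.text
bf = BeautifulSoup(html, "html.parser")
mainTextWrapDiv = bf.find('div', class_='main-text-wrap')
readContentDiv = mainTextWrapDiv.find('div', class_='read-content j_readContent')
readContent = readContentDiv.find_all('span', class_='content-wrap')

At this point the chapter text, still wrapped in tags, is available. Because the tag layout differs from link to link, a condition separates the two cases:

textContent = ''  # initialized up front so the += below cannot hit an unbound name
if not readContent:
    # Plain <p> layout. The string literals of the original replace() calls were
    # lost in extraction; get_text() with a separator is a reconstruction that
    # puts each piece of text on its own line and strips surrounding whitespace.
    textContent = readContentDiv.get_text('\r\n', strip=True)
else:
    for content in readContent:
        # .string is None when the span has nested tags, so test for falsiness
        if not content.string:
            print('error format')
        else:
            textContent += content.string + '\r\n'

The chapter text has now been retrieved.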
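The two layouts can be tried out on small fragments. The sketch below rests on several assumptions: the HTML is invented, extract_text is a hypothetical name, and it relies on get_text() rather than on the author's string replacement.

```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Return chapter text with one paragraph per line (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ with a single name matches any tag whose class list contains it.
    div = soup.find("div", class_="read-content")
    spans = div.find_all("span", class_="content-wrap")
    if spans:
        # Layout 2: each <p> wraps a <span class="content-wrap">.
        parts = [s.get_text(strip=True) for s in spans]
        return "\r\n".join(p for p in parts if p)
    # Layout 1: text sits directly in <p> tags.
    return div.get_text("\r\n", strip=True)

PLAIN = '<div class="read-content j_readContent"><p>First.</p><p>Second.</p></div>'
SPANS = (
    '<div class="read-content j_readContent">'
    '<p><span class="content-wrap">First.</span></p>'
    '<p><span class="content-wrap">Second.</span></p></div>'
)

print(repr(extract_text(PLAIN)))
print(repr(extract_text(SPANS)))
```

Both fragments should yield the same two-line string, which is the point of the branch: callers do not need to know which layout a chapter used.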

All that remains is to loop over the chapter links to get the whole novel.
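That traversal can be sketched as a plain loop over the collected names and links. Everything here is illustrative: crawl_all, the delay parameter, and the injected fetch and save callables are hypothetical names, and the pause between requests is an added courtesy to the server rather than something the original code does.

```python
import time

def crawl_all(names, hrefs, fetch, save, delay=1.0):
    """Fetch and save every chapter; return the names of chapters that failed."""
    failed = []
    for name, href in zip(names, hrefs):
        try:
            save(name, fetch(href))
        except Exception:
            # Skip this chapter but remember it so it can be retried later.
            failed.append(name)
        time.sleep(delay)  # pause between requests (assumed value)
    return failed
```

Passing the fetch and save steps in as callables keeps the loop independent of how pages are downloaded or written, so it can be tried with stand-in functions before pointing it at a real site.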

Summary

The complete code follows:

#!/usr/bin/env python3
# coding=utf-8
# author:sakuyo
#----------------------------------
import sys

import requests
from bs4 import BeautifulSoup


class downloader(object):
    def __init__(self, target):  # initialization
        self.target = target
        self.chapterNames = []
        self.chapterHrefs = []
        self.chapterNum = 0
        self.session = requests.Session()

    def GetChapterInfo(self):  # get the chapter names and links
        req = self.session.get(url=self.target)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html, "html.parser")
        catalogDiv = bf.find('div', class_='catalog-content-wrap', id='j-catalogWrap')
        volumeWrapDiv = catalogDiv.find('div', class_='volume-wrap')
        volumeDivs = volumeWrapDiv.find_all('div', class_='volume')
        for volumeDiv in volumeDivs:
            aList = volumeDiv.find_all('a')
            for a in aList:
                chapterName = a.string
                chapterHref = a.get('href')
                self.chapterNames.append(chapterName)
                # the catalog hrefs come without a scheme, so prepend one
                self.chapterHrefs.append('https:' + chapterHref)
            self.chapterNum += len(aList)

    def GetChapterContent(self, chapterHref):  # get the chapter content
        req = self.session.get(url=chapterHref)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html, "html.parser")
        mainTextWrapDiv = bf.find('div', class_='main-text-wrap')
        readContentDiv = mainTextWrapDiv.find('div', class_='read-content j_readContent')
        readContent = readContentDiv.find_all('span', class_='content-wrap')
        textContent = ''  # initialized up front so the += below cannot hit an unbound name
        if not readContent:
            # Plain <p> layout. The string literals of the original replace() calls
            # were lost in extraction; get_text() with a separator is a reconstruction
            # that puts each piece of text on its own line and strips whitespace.
            textContent = readContentDiv.get_text('\r\n', strip=True)
        else:
            for content in readContent:
                # .string is None when the span has nested tags, so test for falsiness
                if not content.string:
                    print('error format')
                else:
                    textContent += content.string + '\r\n'
        return textContent

    def writer(self, path, name='', content=''):
        # 'a' mode appends to the end of an existing file of the same name
        with open(path, 'a', encoding='utf-8') as f:
            if name is None:
                name = ''
            if content is None:
                content = ''
            f.write(name + '\r\n')
            f.write(content)
            f.write('\r\n')


if __name__ == '__main__':  # entry point
    target = 'https://book.qidian/info/1024995653#Catalog'
    dlObj = downloader(target)
    dlObj.GetChapterInfo()
    print('Download started:')
    for i in range(dlObj.chapterNum):
        try:
            dlObj.writer('test.txt', dlObj.chapterNames[i], dlObj.GetChapterContent(dlObj.chapterHrefs[i]))
        except Exception:
            print('Download failed, chapter skipped')
        # i + 1 chapters are done; scale by 100 so the number matches the % sign
        sys.stdout.write(" Downloaded: %.3f%%" % (100 * (i + 1) / dlObj.chapterNum) + '\r')
        sys.stdout.flush()
    print('Download complete')
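The code above builds each chapter URL by prepending 'https:' to the href, which works as long as every href starts with '//'. As an alternative sketch (not part of the original code), urljoin from the standard library's urllib.parse resolves a link against the catalog URL and also copes with absolute or path-only hrefs. The hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://book.qidian/info/1024995653"

# Hypothetical hrefs covering three common shapes of a link:
# scheme-relative, path-only, and already absolute.
hrefs = ["//read.example/chapter/1", "/info/2", "https://other.example/x"]

resolved = [urljoin(base, h) for h in hrefs]
for url in resolved:
    print(url)
```

In GetChapterInfo this would amount to appending urljoin(self.target, chapterHref) instead of the concatenated string.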

Original link: https://blog.csdn/weixin_47190827/article/details/113087316
