First blog post & first time self-studying Python & first hands-on project

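Hi everyone, this blog is just my own casual notes:

I have been a BI engineer for three years, and this is somehow my first contact with Python. Teaching myself Python only now... sigh~~ am I late to the party?
First year after graduating: worked on the SAP BW module
Second year: started to get a rough idea of data modeling, data warehouses, and data marts
Third year: started doing ETL, data cleansing, data integration, and report presentation at my company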

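Databases I have used so far:
SAP HANA, SQL Server, Greenplum, MySQL, Hadoop, MongoDB, Oracle

BI tools I have used:
SAP BO, Tableau, Power BI, Microsoft CUBE, Tabular, etc.

The one thing I had never touched is Python, R, and the other languages that are so popular right now, so I am using some free time to teach myself Python.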

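Getting Started

First, pin down what to learn: web scraping!!
I searched all kinds of sites for Python self-study tutorials and materials.
The w3cschool material turned out to be the most useful. Whenever I didn't understand something I went straight to Bing. Simple and blunt.
Learn and apply at the same time:
I am used to learning by doing, so I studied the scraping-related libraries and applied them right away.
Libraries I learned:
BS4
urlopen (from urllib.request)
pyhdb
datetime
requests
re
I can't say I have finished learning them. I just know a little about how to use them. There is plenty of material online, and a Bing search on anything I don't understand turns up a pile of results~ hahaha
Pick a target and dive straight in:
My prey is a certain DM forum. (Because I personally like games.)

Enough chatter. Below is the rough code I threw together, embarrassing as it is. I did not consider efficiency at all, so I would welcome suggestions for improvement.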

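Hands-On

Analysis

First open the forum home page, analyze the site structure, and look for patterns.

1. To the right of the search bar, the page already lists every game category, which makes locating them easy.
All categories sit under class="scbar_hot_td", inside id="scbar_hot", as links with class="xi2".
The href attribute of each link holds that category's address.

2. Opening each category address reveals another very regular pattern~~ haha, so much fun
Under each category, the game forum addresses and names follow a fixed layout, which makes scraping the data much easier.

3. The same goes for each game's forum page: the basic information of every post follows a fixed layout.

For example, class="new" holds the post type, post title, popularity, and so on
class="by" holds the creator and creation date, plus the last replier and last reply date
class="num" holds the view count and reply count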
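To make the layout in point 3 concrete, here is a minimal sketch that parses one post row. The HTML snippet is a hypothetical imitation of the forum markup (the real page may differ), and the helper name parse_row is made up for illustration. The assumption that the link inside class="num" is the reply count and the em is the view count follows the scraping code in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one post row; the real forum HTML may differ.
SAMPLE_HTML = """
<table><tbody><tr>
  <th class="new"><em>[<a href="#">Guide</a>]</em>
    <a href="thread-1-1-1.html" class="s xst">Sample thread title</a></th>
  <td class="by"><cite><a href="#">alice</a></cite><em>2018-1-20</em></td>
  <td class="num"><a href="#">12</a><em>345</em></td>
  <td class="by"><cite><a href="#">bob</a></cite><em>2018-1-21 10:00</em></td>
</tr></tbody></table>
"""


def parse_row(tr):
    """Pull the fields described above out of one <tr> element."""
    title_cell = tr.find("th", class_="new")
    by_cells = tr.find_all("td", class_="by")
    num_cell = tr.find("td", class_="num")
    type_tag = title_cell.find("em")
    return {
        # The post type is wrapped in brackets, so drop the first and last character.
        "type": type_tag.get_text()[1:-1] if type_tag else "",
        "theme": title_cell.find("a", class_="s xst").get_text(),
        # First "by" cell: creator and creation date.
        "editor": by_cells[0].find("cite").get_text(),
        "createdate": by_cells[0].find("em").get_text(),
        # Last "by" cell: last replier and last reply date.
        "lastreply": by_cells[-1].find("cite").get_text(),
        "lastreplydate": by_cells[-1].find("em").get_text(),
        "replynum": num_cell.find("a").get_text(),
        "readnum": num_cell.find("em").get_text(),
    }


row = parse_row(BeautifulSoup(SAMPLE_HTML, "html.parser").find("tr"))
print(row)
```

Testing the selectors on a small saved snippet like this is a cheap way to check them before hitting the live site.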

Writing the Python

Extract the information for each game:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def getgrand(url):
    # url is the address of the forum home page.
    # conn (the HANA connection), check_contain_chinese, intohana_grandcat and
    # intohana_game are defined elsewhere in the script.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}  # request headers
    response = requests.get(url, headers=headers).content  # send the GET request
    soup = BeautifulSoup(response, "html.parser")  # parse with BeautifulSoup
    commid = soup.find_all('a', class_='xi2')  # find every link with class xi2
    for commid2 in commid[2:-1]:  # walk the category links on the forum home page
        href = commid2.get("href")  # address of each game category
        if href and len(href.split("-")) >= 2:
            grand_id = href.split("-")[1]  # category ID, used later in the data model
            site = urljoin(url, href)  # full category address
            print(site)
            cate = commid2.text  # category name
            if check_contain_chinese(cate):  # data cleansing: the link may not be a category, so drop names without Chinese characters
                response2 = requests.get(site, headers=headers).content  # send the GET request
                soup2 = BeautifulSoup(response2, "html.parser")  # parse with BeautifulSoup
                catmid = soup2.find_all('dt')
                intohana_grandcat(conn, grand_id, cate, site)  # load into HANA to build the category dimension table
                for catmid2 in catmid:  # walk each game listed on the category page
                    links = catmid2.find_all("a")
                    if not links:  # skip <dt> entries that contain no link
                        continue
                    a = links[0]
                    href2 = a.get("href")
                    gamename = a.text  # game name
                    # data cleansing: keep only links shaped like xxx-<id>-...html
                    if href2 and len(href2.split("-")) >= 2 and href2.endswith("html"):
                        site2 = urljoin(url, href2)  # address of each game's forum
                        game_id = href2.split("-")[1]  # ID of each game
                        intohana_game(conn, grand_id, cate, game_id, gamename, site2)  # build the game dimension table
                        getdetail(site2, conn, grand_id, game_id, gamename)  # scrape the posts of this game's forum
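The function above calls check_contain_chinese, which is not shown in this post. The following is only a guess at what such a helper could look like, using the re module from the library list. The character range (the basic CJK Unified Ideographs block) and the second helper, looks_like_forum_link, are assumptions added for illustration. They are not taken from the published code.

```python
import re

# Assumption: "Chinese" means at least one character in the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
_CJK = re.compile(r"[\u4e00-\u9fff]")


def check_contain_chinese(text):
    """Return True if text contains at least one Chinese character."""
    return bool(_CJK.search(text))


def looks_like_forum_link(href):
    """Hypothetical helper mirroring the href check used for game links:
    at least two dash-separated parts and an 'html' suffix."""
    return len(href.split("-")) >= 2 and href.endswith("html")


for name in ["单机游戏", "Login"]:
    print(name, check_contain_chinese(name))

for link in ["forum-123-1.html", "javascript:;", "forum.php"]:
    print(link, looks_like_forum_link(link))
```

With a link such as forum-123-1.html, href.split("-")[1] yields "123", which is the value the scraper stores as the ID.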

Scrape each game's forum page:

def getdetail(site, conn, GRAND_ID, GAME_ID, GAME_NAME):  # scrape the post list of one game forum
    # current_daytime (the crawl timestamp) and intohana_blogdetail are
    # defined elsewhere in the script.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}  # request headers

    response3 = requests.get(site, headers=headers).content  # send the GET request
    soup3 = BeautifulSoup(response3, "html.parser")  # parse with BeautifulSoup
    for tbody in soup3.find_all('tbody'):  # parse every tbody
        for tr in tbody.find_all('tr'):  # parse every tr inside the tbody
            new = tr.find('th', class_='new')  # cell holding the post type and title
            if new is None:
                # Not a post row. Skipping it avoids inserting values left over
                # from the previous row.
                continue
            em = new.find('em')  # post type
            post_type = em.text[1:-1] if em else ''  # data cleansing: strip the surrounding brackets
            title_link = new.find('a', class_='s xst')
            theme = title_link.text if title_link else ''  # post title
            num = tr.find('td', class_='num')  # reply and view counts
            if num:  # data cleansing
                replynum = num.find('a').text
                readnum = num.find('em').text
            else:
                replynum = '0'
                readnum = '0'
            editor = createdate = lastreply = lastreplydate = ''
            by = tr.find_all('td', class_='by')  # creator and last replier of the post
            if by:  # data cleansing
                createdate = by[0].find('em').text
                editor = by[0].find('cite').text
            if len(by) > 1:
                lastreply = by[-1].find('cite').text
                lastreplydate = by[-1].find('em').text
            # insert row by row into the database to build the fact table
            intohana_blogdetail(conn, GRAND_ID, GAME_ID, GAME_NAME, post_type, theme, replynum,
                                readnum, editor, createdate, lastreply, lastreplydate, current_daytime)
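The insert function intohana_blogdetail is not shown in this post. As a rough illustration of the row-by-row insert into a fact table, here is a sketch that uses sqlite3 from the standard library as a stand-in for SAP HANA. The table name, the column names, and the function names are all hypothetical. A real HANA load through pyhdb needs its own connection setup and may use a different placeholder style, so check the driver documentation.

```python
import sqlite3

# Hypothetical column names; the real fact table in HANA may be laid out differently.
COLUMNS = (
    "grand_id", "game_id", "game_name", "post_type", "theme", "reply_num",
    "read_num", "editor", "create_date", "last_reply", "last_reply_date",
    "crawl_time",
)


def create_fact_table(conn):
    """Create the stand-in fact table if it does not exist yet."""
    column_defs = ", ".join(name + " TEXT" for name in COLUMNS)
    conn.execute("CREATE TABLE IF NOT EXISTS blog_detail (" + column_defs + ")")


def insert_blog_detail(conn, row):
    """Insert one scraped post. Values are passed as parameters, never
    concatenated into the SQL string."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute("INSERT INTO blog_detail VALUES (" + placeholders + ")", row)
    conn.commit()


conn = sqlite3.connect(":memory:")
create_fact_table(conn)
insert_blog_detail(conn, (
    "1", "123", "Sample game", "Guide", "Sample thread title", "12",
    "345", "alice", "2018-1-20", "bob", "2018-1-21 10:00", "2018-1-22 09:00",
))
print(conn.execute("SELECT COUNT(*) FROM blog_detail").fetchone()[0])
```

Committing after every row is simple but slow. Collecting rows and inserting them in batches would be one way to speed things up.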

The full code is published on Git, together with a simple analysis of the current state:

https://github/zangmeisim/YouminAnalysis

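Quick analysis:

Tools

PyCharm
SAP HANA
SAP BO
Excel

Improvements

1. I don't know how to improve the code. It just feels too rough, and I would appreciate some guidance.
2. The goal of the analysis was unclear, so the extracted data has no clear purpose.
3. I hope some experts can teach me a few analysis methods.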
