Common Java frameworks, basic IntelliJ IDEA usage, and crawler systems

Learning resources: http://www.ityouknow/spring-boot.html

http://blog.didispace/spring-boot-learning-1/

***Using IntelliJ IDEA

https://wwwblogs/yjd_hycf_space/p/7483921.html

https://blog.csdn/qq_27093465/article/details/77449117/

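Settings: https://blog.csdn/weixin_40816738/article/details/90116150

Downloading and configuring IntelliJ IDEA: https://tech.souyunku/?p=16235

Resetting the evaluation period: IDEA -> Preferences -> Plugins -> click the gear icon at the top, click the plus sign, add https://plugins.zhile.io as a plugin repository, then find IDE Eval Reset and install it.

Setting the local Maven repository on Windows: File -> Settings -> Build, Execution, Deployment -> Build Tools -> Maven, then set Maven home directory, User settings file, and Local repository.

IntelliJ IDEA project structure:

In IntelliJ IDEA, "New Project" is roughly the equivalent of an Eclipse "workspace", while "New Module" is what actually creates a project.

Setting up Maven on macOS: download the Java JDK, Maven, and Tomcat, then configure the JDK environment variables. (A Spring Boot project does not need a separate Tomcat, because the web starter embeds one.)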

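Maven configuration in IntelliJ IDEA:

IntelliJ IDEA -> Preferences -> Build, Execution, Deployment -> Build Tools -> Maven, then set Maven home directory (the Maven install path), User settings file (settings.xml), and Local repository (the local Maven repository, which must also be configured in settings.xml).

Creating a Spring Boot project in the IDE:

File -> New -> Project -> Spring Initializr

Creating a plain Maven project in the IDE:

File -> New -> Project -> Maven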

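Configuring a plain Maven project in IntelliJ IDEA:

Run -> Edit Configurations -> Templates -> Tomcat Server -> Local. On the Server tab, click Configure next to Application server, select the local Tomcat installation, and click OK.

On the Deployment tab, click +, and in the dialog choose your project's artifact.

Maven settings: IntelliJ IDEA -> Preferences -> Build, Execution, Deployment -> Build Tools -> Maven, then set Maven home directory (the Maven install path), User settings file (settings.xml), and Local repository (also configured in settings.xml).

File -> Project Structure -> Artifacts

An artifact bundles the compiled Java classes, resource files, and so on. Several bundle formats are available, such as war, jar, and war exploded. Once a module has an artifact, it can be deployed to a web container. The difference between war and war exploded is that the latter is not compressed, so during development it is the better choice because file changes show up right away.

File -> Project Structure -> Artifacts -> +

Run -> Edit Configurations -> Maven. Note that to add a Maven configuration you must click the + at the top.

Run -> Edit Configurations -> Tomcat Server. Once the server is configured, the Run button turns green.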

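Ways to deploy and run from IntelliJ IDEA:

1. Package the application and place it under the server's webapp directory (rarely used now).

2. Deploy to a local Tomcat (for local testing). The output directory must be changed: https://blog.csdn/lairikeqi/article/details/93648594

3. Deploy and run with a Maven plugin (deploys to a remote server).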

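Packaging in IntelliJ IDEA:

Method 1: IntelliJ IDEA's built-in packaging with Artifacts

Right-click the project, choose Open Module Settings, then select Artifacts.

Method 2 (Maven lifecycle):

Double-click clean to clear the previous build output, then double-click package.

Method 3: https://blog.csdn/qq_28289405/article/details/81111182

Package with the Maven plugin maven-shade-plugin
Package with the Maven plugin maven-assembly-plugin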

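Steps for creating a microservice project (multi-module structure):

1. Create a folder A in your working directory and open it in IDEA.

2. Create a module under A: use Spring Initializr to quickly create a Spring Boot module as the parent project. It contains no code and only manages project configuration, so src can be deleted.

3. dependencyManagement only declares dependencies and their versions without pulling them in, whereas dependencies actually adds them to the build. In the parent project, dependencyManagement alone is usually enough.

A concrete walkthrough:

1. Create a folder javaIntelijIDEA and open it in IDEA.

2. Inside javaIntelijIDEA, create a parent project ABC: right-click -> New -> Module -> Maven.

3. ABC is the parent project. Dependencies added to ABC's pom are shared, so child modules do not need to declare them again. Parent projects use pom packaging; all other modules use jar packaging.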

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache/POM/4.0.0"
         xmlns:xsi="http://www.w3/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache/POM/4.0.0 http://maven.apache/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>ABC</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <description>Parent project</description>
    <modules>
        <module>ABC-Gateway</module>
        <module>ABC-Service</module>
        <module>ABC-common</module>
        <module>ABC-Service-api</module>
        <module>ABC-Eureka</module>
    </modules>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.2.5.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-logging</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>

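4. Under ABC, create ABG (ABC-Gateway in the pom above), the parent project for the microservice gateway; ABC-Service, the parent for all services; ABC-Service-api, the parent for all APIs; ABC-web, the parent for web page rendering; ABC-common, the utility package; and ABC-commondb, the database utility package.

5. Set up the registry center, Eureka. It also sits under the parent project ABC, at the same level as ABG.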

<dependencies>
   
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-netflix-eureka-server</artifactId>
    </dependency>
   
  </dependencies>

Create application.properties or application.yml in the resources folder:

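server:
  port: 3000 # port

eureka:
  instance:
    hostname: localhost # instance name of the Eureka server
  client:
    register-with-eureka: false # false: do not register this instance with the registry
    fetch-registry: false # false: this instance is the registry itself; it maintains service instances and does not need to fetch the registry
    service-url:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
      # The address used to talk to the Eureka Server; both service lookup and registration depend on it.

spring:
  application:
    name: eureka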

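Annotating the application's main class with @EnableEurekaServer turns the project into a Spring Cloud registry center.

To start a single project such as ABC-Eureka, right-click it and choose Run 'XXApplication'.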

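***Common Java frameworks: https://www.zhihu/question/389519975

***Framework combinations:

The SSH stack consists of three frameworks: Struts, Spring, and Hibernate.

The SSM stack consists of three frameworks: SpringMVC, Spring, and MyBatis.

***Maven

Maven is a project management and build automation tool. More and more developers use it to manage the jar dependencies of their projects, but for programmers its project build capability is what matters most.

***spring

Spring is, at its core, an IoC container that creates and wires application objects.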

***springBoot

***springMVC

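A lightweight, MVC-based web-layer framework that wraps the Servlet API. It belongs to the presentation (web) layer rather than the business-logic layer, and is a follow-on product of the Spring Framework.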

***MyBatis

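MyBatis is an excellent persistence framework that supports plain SQL queries, stored procedures, and advanced mapping. It eliminates almost all JDBC code, manual parameter setting, and result-set retrieval. MyBatis uses simple XML or annotations for configuration and mapping, and maps interfaces and Java POJOs (Plain Old Java Objects) to database records.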

***Mybatisplus

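**mycat: database operations

https://wwwblogs/fyy-hhzzj/p/9044775.html

Mycat is middleware for distributed databases. It supports read/write splitting and database/table sharding.

As a rule of thumb, a single MySQL table copes with roughly 10 million rows; beyond that, plan for splitting databases and tables.

Using it requires four configuration files: server.xml, schema.xml, rule.xml (the sharding algorithms), and sequence_db_conf.properties.

Mycat's default port is 8066.

Usage steps: download Mycat, then start Mycat.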
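
To make the idea of a sharding rule concrete, here is a minimal sketch of modulo sharding in plain Java. It is not Mycat code or Mycat configuration: the class name `ModShardRouter`, the table naming scheme, and the shard count are all assumptions chosen for illustration.

```java
// Hypothetical helper for illustration; Mycat itself is configured through rule.xml.
final class ModShardRouter {
    private final int shardCount;

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Math.floorMod keeps the result in [0, shardCount) even for negative ids.
    int shardFor(long id) {
        return (int) Math.floorMod(id, (long) shardCount);
    }

    // Assumed naming scheme: base table name plus "_" plus the shard index.
    String tableFor(String baseName, long id) {
        return baseName + "_" + shardFor(id);
    }

    public static void main(String[] args) {
        ModShardRouter router = new ModShardRouter(4);
        System.out.println(router.tableFor("orders", 10L));
    }
}
```

Modulo routing spreads sequential ids evenly, but changing the shard count later moves most rows, which is one reason range or consistent-hash rules are sometimes preferred.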

**HBase

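**Zookeeper: distributed auto-increment primary keys

***Dubbo is a distributed service framework that aims to provide a high-performance, transparent RPC (remote procedure call) solution for remote service invocation, together with SOA service governance. Put simply, Dubbo is a service framework. Without distributed requirements you do not need it; it only becomes useful once the system is distributed. In essence it is about service invocation: a distributed framework for remote service calls.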

***Redis

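Redis is a key-value store. It is similar to Memcached but supports more value types, including string, list (linked list), set, zset (sorted set), and hash. Depending on the type, these support push/pop, add/remove, intersection, union, difference, and richer operations, all of which are atomic. On top of that, Redis supports sorting in various ways. Like Memcached, it keeps data in memory for speed. The difference is that Redis periodically writes updated data to disk, or appends each modifying operation to a log file, and builds master-slave replication on top of this. The Redis dataset lives entirely in memory, with disk used only for persistence. Compared with many key-value stores, Redis has a fairly rich set of data types, and it can replicate data to any number of slave servers.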

***Log4j

Log levels, in priority order, are OFF, FATAL, ERROR, WARN, INFO, DEBUG, and ALL, or a level you define yourself.
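
The way a configured level acts as a threshold can be sketched with a small enum. This is a hypothetical illustration, not Log4j's own `Level` class; the enum name `LogLevel` and the method `allows` are assumptions, and only the levels listed above are modeled.

```java
// Hypothetical enum for illustration; this is not Log4j's own Level class.
enum LogLevel {
    // Declared from highest to lowest priority, in the order the text lists them.
    OFF, FATAL, ERROR, WARN, INFO, DEBUG, ALL;

    // Treating `this` as the configured threshold, decide whether a message
    // logged at messageLevel would be written.
    boolean allows(LogLevel messageLevel) {
        return messageLevel != OFF && messageLevel.ordinal() <= this.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(WARN.allows(ERROR)); // ERROR outranks WARN, so it is written
        System.out.println(WARN.allows(INFO));  // INFO is below WARN, so it is dropped
    }
}
```

With the threshold at OFF nothing is written, and with ALL every message passes.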

***Ehcache

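EhCache is a pure-Java, in-process caching framework that is fast and lean, and it is the default CacheProvider in Hibernate. Ehcache is a widely used open-source Java cache that can also be run as a distributed cache. It targets general-purpose caching, Java EE, and lightweight containers. Its features include memory and disk stores, cache loaders, cache extensions, cache exception handlers, a gzip caching servlet filter, and REST and SOAP APIs.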
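
What "in-process cache" means can be shown with the JDK alone. The sketch below is not the Ehcache API; it is a minimal LRU cache built on `LinkedHashMap`, and the class name `LruCache` and its capacity parameter are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-process LRU cache; Ehcache offers far more (disk store, loaders, ...).
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true makes iteration order follow recency of access.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", 3);   // exceeds capacity and evicts "b"
        System.out.println(cache.keySet());
    }
}
```

The cache lives in the same JVM heap as the application, so reads cost no network hop, but each process holds its own copy of the data.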

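**WebMagic crawler framework: http://git.oschina/flashsword20/webmagic

Documentation: http://webmagic.io/, http://webmagic.io/docs/zh

Rough WebMagic workflow: request a URL -> the Downloader downloads the page -> the PageProcessor parses the page and discovers more URLs -> the Pipeline handles the extracted results, while the newly discovered URLs become further requests that the Scheduler de-duplicates and queues -> the Downloader downloads the next page -> the PageProcessor parses it -> the Pipeline handles the extracted results.

Crawling needs proxy IPs, configured as dynamic (rotating) proxies.

Selenium is a tool that drives a browser to render pages. WebMagic relies on Selenium to crawl dynamic pages, which makes it possible to extract information generated by JavaScript.

ChromeDriver is the automated-testing interface developed by Google.

HttpClient is an HTTP request component; Jsoup is an HTML parser (with built-in HTTP request support).

WebMagic uses Jsoup as its HTML parsing tool.

Crawlers are either distributed or single-machine. The main distributed option is Apache Nutch, which is written in Java and runs on Hadoop. It has a steep learning curve and is generally only used for search-engine development.

Single-machine Java frameworks: WebMagic, WebCollector, and crawler4j.

Single-machine Python frameworks: Scrapy and pyspider.

webmagic-selenium supports crawling dynamic pages; webmagic-saxon supports XPath and XSLT parsing.

WebMagic's code has two parts: webmagic-core and webmagic-extension. WebMagic supports writing a crawler in its own annotation style; adding the webmagic-extension package enables this.

WebMagic is built from four components (Downloader, PageProcessor, Scheduler, Pipeline). Spider is the core of the internal flow, and the four components are its properties.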

1.Downloader

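The Downloader fetches pages from the internet for later processing. By default WebMagic uses Apache HttpClient as the download tool.

2.PageProcessor

The PageProcessor parses pages, extracts useful information, and discovers new links. WebMagic uses Jsoup as its HTML parser and, on top of it, developed Xsoup, a tool for evaluating XPath.

Of the four components, the PageProcessor differs for every site and every page, so it is the part users have to customize.

3.Scheduler

The Scheduler manages the URLs waiting to be fetched and handles de-duplication. By default WebMagic manages URLs with an in-memory JDK queue and de-duplicates with a set. Redis is also supported for distributed management.

Unless the project has special distributed requirements, there is no need to customize the Scheduler.

4.Pipeline

The Pipeline handles extracted results: computation, persistence to files or databases, and so on. By default WebMagic provides two result-handling options, "print to console" and "save to file".

A Pipeline defines how results are saved. To save to a particular database, write a corresponding Pipeline. One Pipeline per kind of requirement is usually enough.

** Spider is also the entry point for operating WebMagic. It wraps creating, starting, and stopping the crawler, multithreading, and similar functions.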
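
The interplay of the four components can be sketched without any crawler library. The code below is not WebMagic code: `MiniSpider`, its in-memory "web" map, and the `href` regex are all assumptions, chosen so the loop runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-threaded crawl loop mirroring the four roles described above.
final class MiniSpider {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    private final Map<String, String> fakeWeb;               // stand-in for the network
    private final Queue<String> queue = new ArrayDeque<>();  // Scheduler: pending URLs
    private final Set<String> seen = new HashSet<>();        // Scheduler: de-duplication
    private final List<String> results = new ArrayList<>();  // Pipeline: collected output

    MiniSpider(Map<String, String> fakeWeb) {
        this.fakeWeb = fakeWeb;
    }

    // Scheduler role: queue a URL only the first time it is seen.
    private void schedule(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    // Downloader role: a real one would issue an HTTP request here.
    private String download(String url) {
        return fakeWeb.getOrDefault(url, "");
    }

    // PageProcessor role: discover links, then hand one result to the Pipeline.
    private void process(String url, String body) {
        Matcher m = LINK.matcher(body);
        while (m.find()) {
            schedule(m.group(1));
        }
        results.add(url + " (" + body.length() + " chars)");
    }

    List<String> run(String startUrl) {
        schedule(startUrl);
        String url;
        while ((url = queue.poll()) != null) {
            process(url, download(url));
        }
        return results;
    }
}
```

Because the set remembers every URL ever queued, pages that link back to each other are fetched once and the loop terminates.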



public static void main(String[] args) {
    // Create a Spider from a PageProcessor
    Spider.create(new ListPageProcesser())
            // First URL to crawl: start from https://github/code4craft
            .addUrl("https://github/code4craft")
            // Set the Scheduler: use Redis to manage the URL queue
            .setScheduler(new RedisScheduler("localhost"))
            // Set the Pipeline: save results to files as JSON
            .addPipeline(new JsonFilePipeline("D:\\data\\webmagic"))
            // Run 5 threads concurrently
            .thread(5)
            // Start the crawler
            .run();
}

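WebMagic ships with two result-handling options out of the box: "print to console" and "save to file". The component that saves results is called a Pipeline. Printing results to the console is itself done by a built-in Pipeline, ConsolePipeline. To save results as JSON instead, simply switch the Pipeline implementation to JsonFilePipeline.

.addPipeline(new FilePipeline("/data/edu1")) sets how and where crawl results are stored; here they are saved under /data/edu1.

.addPipeline(new ConsolePipeline()) sets how and where crawl results are stored; here the results are also printed to the console.

.setScheduler(new FileCacheQueueScheduler()) stores fetched URLs in files, so that after the program is closed, the next run resumes from the previously fetched URLs. A path must be specified; two files, .urls.txt and .cursor.txt, are created there.

WebMagic mainly uses three extraction techniques: XPath, regular expressions, and CSS selectors.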

**API for extracting results

// Get the whole HTML
page.putField("download", page.getHtml().toString());
// Get the address of the first <a> link
page.putField("download", page.getHtml().links().toString());
// Get the addresses of all <a> links; a regex can also be chained after links()
page.putField("download", page.getHtml().links().all().toString());

// Get the URL of the current request
page.getUrl();


// **Regex (https://www.runoob/regexp/regexp-metachar.html)
page.putField("download", page.getHtml().regex("https://bss.csdn/cview/reg/.*").toString());


// ***XPath (https://www.w3school/xpath/xpath_syntax.asp)

// Get the div whose class is lt_avatar
page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']"));
// Get the img under the div whose class is lt_avatar
page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']/img"));
// Get the src attribute of the img
page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']/img/@src"));
// text() gets the text of element a; in the pipeline, entry.getValue() has no value
page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']/a/text()"));
// tidyText() gets the text of element a; in the pipeline, entry.getValue() also resolves the href link
page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']/a/tidyText()"));


// ***CSS: lt_avatar here is the class name. Format: element plus class name, with child elements written directly after.
page.putField("download", page.getHtml().css("div.lt_avatar a"));
// The trailing "text" argument gets the text of the a link
page.putField("download", page.getHtml().css("div.lt_avatar a", "text"));
// The trailing "href" argument gets the href attribute of the a link
page.putField("download", page.getHtml().css("div.lt_avatar a", "href"));


// ***Jsoup selectors
Elements mainElements = page.getHtml().getDocument()
        .getElementsByClass("lt-dialog__title");
String t = mainElements.text();
String attr = mainElements.attr("src"); // get an attribute value from the first element that has it
System.out.println(t);

// Get the first matched element
String html = mainElements.get(0).html();
List<String> all = new Html(html)
        .xpath("//div[@class=\"mod_relations\"]").links().all();
// Add to the crawl queue
page.addTargetRequests(all);

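**API for reading results

Method	Description	Example
get()	Returns one result as a String	String link = html.links().get()
toString()	Same as get(); returns one result as a String	String link = html.links().toString()
all()	Returns all extracted results	List<String> links = html.links().all()
match()	Whether there is any match	if (html.links().match()) { xxx; }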
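
The difference between a single result, all results, and a match check can be sketched with `java.util.regex` alone. The class `RegexSelection` below is hypothetical and only mirrors the method names in the table; returning `null` from `get()` when nothing matches is an assumption of this sketch, not a statement about WebMagic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for regex matches, named after the methods in the table above.
final class RegexSelection {
    private final List<String> matches = new ArrayList<>();

    RegexSelection(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
    }

    // First match, or null when nothing matched (an assumption of this sketch).
    String get() {
        return matches.isEmpty() ? null : matches.get(0);
    }

    // Every match, in document order; a copy so callers cannot mutate the holder.
    List<String> all() {
        return new ArrayList<>(matches);
    }

    // True when at least one match exists.
    boolean match() {
        return !matches.isEmpty();
    }

    public static void main(String[] args) {
        String text = "see http://a.example/1 and http://a.example/2";
        RegexSelection links = new RegexSelection(text, "http://a\\.example/\\d+");
        if (links.match()) {
            System.out.println(links.get());
            System.out.println(links.all());
        }
    }
}
```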

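**Site configuration fields

private String domain; // host domain

private String userAgent; // User-Agent of the HTTP request

private Map<String, String> defaultCookies = new LinkedHashMap<String, String>(); // default cookies

private Map<String, Map<String, String>> cookies = new HashMap<String, Map<String, String>>(); // cookies

private String charset; // character encoding

private int sleepTime = 5000; // sleep time, in ms

private int retryTimes = 0; // number of retries

private int cycleRetryTimes = 0; // number of cycle retries

private int retrySleepTime = 1000; // sleep time before a retry, in ms

private int timeOut = 5000; // default timeout, in ms

private static final Set<Integer> DEFAULT_STATUS_CODE_SET = new HashSet<Integer>(); // set of status codes

private Set<Integer> acceptStatCode = DEFAULT_STATUS_CODE_SET; // status codes treated as success by default

private Map<String, String> headers = new HashMap<String, String>(); // HTTP headers

private boolean useGzip = true; // gzip decoding is on by default

private boolean disableCookieManagement = false; // whether to disable cookie management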

***A simple example

***pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache/POM/4.0.0" xmlns:xsi="http://www.w3/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache/POM/4.0.0 http://maven.apache/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.9.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.easy</groupId>
    <artifactId>webmagic</artifactId>
    <version>0.0.1</version>
    <name>webmagic</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>


**
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import us.codecraft.webmagic.Spider;

@SpringBootApplication
public class WebmagicdemoApplication {

	public static void main(String[] args) {
		SpringApplication.run(WebmagicdemoApplication.class, args);
		// Get the movie titles and page links
//        Spider.create(new ListPageProcesser()).addUrl("https://me.csdn/u011146511")
//                .addPipeline(new MyPipeline()).thread(1).run();
//
        // Get the movie download address from the given detail page
        Spider.create(new DetailPageProcesser()).addUrl("https://me.csdn/u011146511")
                .addPipeline(new MyPipeline()).thread(1).run();
	}

}


**

import org.jsoup.select.Elements;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class DetailPageProcesser implements PageProcessor {
    private Site site = Site.me().setDomain("my.csdn");

    @Override
    public void process(Page page) {
    	
//        page.putField("download", page.getHtml().links().all().toString());
    	
    	// Regex
//      page.putField("download", page.getHtml().regex("https://bss.csdn/cview/reg/.*").toString());
    	
//    	// Jsoup selectors
//    	 Elements mainElements = page.getHtml().getDocument()
//                 .getElementsByClass("lt-dialog__title");
//    	 String t=mainElements.text();
//    	 System.out.println(t);
    	
    	//xpath
//    	 page.putField("download", page.getHtml().xpath("//div[@class='lt_avatar']"));
    	
    	//css
    	 page.putField("download", page.getHtml().css("div.lt_avatar a","href"));
    	 
    	 
    	 
    	
    }

    @Override
    public Site getSite() {
        return site;
    }
}

**

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

@Slf4j
public class MyPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        log.info("get page: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            /* ResultItems holds the extraction results as a Map.
             * Data saved with page.putField(key, value) can be read with ResultItems.get(key);
             * the key is the one passed to putField, and getValue() returns the
             * parsed result of the value passed to putField.
             */
            log.info("parsed result " + entry.getKey() + ":\t" + entry.getValue());
        }
    }
}

Besides page.addTargetRequests(url or list of urls), which adds URLs to the crawl queue,

you can also add a Request that simulates a POST to the crawl queue (pageNum is expected to be defined by the surrounding code):
Request req = new Request();
req.setMethod(HttpConstant.Method.POST);
req.setUrl("my.csdn");
req.setRequestBody(HttpRequestBody.json("{CategoryType: 'SiteHome', ParentCategoryId: 0, CategoryId: 808, PageIndex: " + pageNum
        + ", TotalPostCount: 4000,ItemListActionName:'PostList'}", "utf-8"));
page.addTargetRequest(req);
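
For comparison, the same kind of JSON POST can be built with the JDK's own `java.net.http` API (Java 11 or later). This is a sketch outside WebMagic, and it swaps in a strictly quoted JSON body instead of the relaxed one above. The class name `PostRequestSketch` and the URL `https://example.invalid/list` are placeholders, and the request is only built, never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it builds a request object but performs no network I/O.
final class PostRequestSketch {
    // Strict JSON: keys and string values are double-quoted.
    static String jsonBody(int pageIndex) {
        return "{\"CategoryType\":\"SiteHome\",\"ParentCategoryId\":0,\"CategoryId\":808,"
                + "\"PageIndex\":" + pageIndex
                + ",\"TotalPostCount\":4000,\"ItemListActionName\":\"PostList\"}";
    }

    static HttpRequest build(String url, int pageIndex) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(pageIndex), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://example.invalid/list", 2); // placeholder URL
        System.out.println(request.method() + " " + request.uri());
    }
}
```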

**WebCollector crawler system: https://github/CrawlScript/WebCollector

Tutorial: https://www.oschina/p/webcollector

Tutorials for various scenarios: http://datahref/archives/10

Downloading a site's images
@SpringBootApplication

//@ComponentScan("com.example.service")
public class DemoApplication {
	 
	public static void main(String[] args) {
		  Logger logger = LoggerFactory.getLogger(DemoApplication.class);
		    logger.info("Hello World");
	
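		    /**
	         * The ZhihuCrawler constructor initializes its data and passes both arguments on to the superclass:
	         * super(crawlPath, autoParse);
	         * crawlPath: the folder where crawl records are kept. Running this example creates a "crawl" directory under the application root to hold the crawl data.
	         */
		    ZhihuCrawler crawler = new ZhihuCrawler("crawl", true);
		    // Enable resumable crawling; otherwise every start crawls from scratch again
		    crawler.setResumable(true);
	        /**
	         * Start the crawler with a crawl depth of 4.
	         * The seed links added first count as depth 1.
	         */
	        try {
				crawler.start(4);
			} catch (Exception e) {
				e.printStackTrace();
			}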


	}
	
}


***
package com.example.crawler;

import java.io.File;
import java.io.IOException;

import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.util.FileUtils;
/**
 * BreadthCrawler is one of the most commonly used crawlers in WebCollector.
 */
public class ZhihuCrawler extends BreadthCrawler{
    // Folder where images are saved
    File downloadDir;

    // Atomic int used to generate image file names
    AtomicInteger imageId;
    private final static String downPath = "/Users/HaokeMaster/Desktop/sts/jianshuImage";

    public ZhihuCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // The crawler only parses image links automatically when both
        // autoParse and autoDetectImg are true
        setAutoParse(true);

        // With the default Requester, set a page-size limit as shown below,
        // otherwise you may receive an incomplete page.
        // The next line sets the page-size limit to 10 MB.
        //getConf().setMaxReceiveSize(1024 * 1024 * 10);

        /** Set the site to crawl.
         * addSeed adds a seed URL.
         * Seeds are added to the crawl data before the crawler starts and are marked as not yet fetched. This step is called injection. */
        // Add seed URLs
        addSeed("http://www.meishij/");
        // Restrict the crawl scope
        addRegex("http://www.meishij/.*");
        addRegex("http://images.meishij/.*");
        addRegex("-.*#.*");
        addRegex("-.*\\?.*");
        this.addSeed("https://www.meishij");

        /**
         * The loop below adds 4 seeds, which is effectively pagination. The result looks like:
         * https://blog.github/page/2/
         * https://blog.github/page/3/
         * https://blog.github/page/4/
         * https://blog.github/page/5/
         */
//        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
//            String seedUrl = String.format("https://blog.github/page/%d/", pageIndex);
//            this.addSeed(seedUrl);
//        }

        /** Restrict the crawl scope. addRegex takes a URL regular expression and can filter out links that need not be fetched, such as .js, .jpg, .css, and so on.
         * It can also select links that follow a pattern. The addRegex below would fetch addresses like:
         * https://blog.github/2018-07-13-graphql-for-octokit/
         */
//        this.addRegex("https://blog.github/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        /**
         * To filter out jpg|png|gif image addresses:
         * this.addRegex("-.*\\.(jpg|png|gif).*");
         * To filter out links whose value contains "#":
         * this.addRegex("-.*#.*");
         */

        /** Set the number of threads */
        setThreads(50);
        setTopN(100);

        /**
         * Whether to crawl resumably; the default is false
         * setResumable(true);
         */

    }

    /**
     * visit must be overridden. During the whole crawl, whenever a page that
     * meets the rules is fetched, WebCollector calls back this method and passes
     * in a page object that holds all information about the page.
     *
     * @param page the fetched page
     * @param next collects follow-up links to crawl
     */
    @Override
    public void visit(Page page, CrawlDatums next) {
        downloadDir = new File(downPath);
        if (!downloadDir.exists()) {
            downloadDir.mkdirs();
        }
        computeImageId();

        String url = page.url();
        /** If this page address really is one we want to crawl, extract values from it
         */
//        if (page.matchUrl("http://www.meishij/")) {

            /**
             * Use selectors to get the page title and body content
             */
//            String title = page.select(".foot p").text();
//            Elements content = page.select("html");
        // The content type decides what to do: only HTML pages and image responses can lead to images
        String contentType = page.response().contentType();
        System.out.println("contentType:\n" + contentType);
        if (contentType == null) {
            return;
        } else if (contentType.contains("html")) {
            // For a web page, extract the image URLs it contains and add them as follow-up tasks
            Elements imgs = page.select("img[src]");
            for (Element img : imgs) {
                String imgSrc = img.attr("abs:src");
                next.add(imgSrc);
            }
        } else if (contentType.startsWith("image")) {
            // For an image, download it directly
            String extensionName = contentType.split("/")[1];
            String imageFileName = imageId.incrementAndGet() + "." + extensionName;
            File imageFile = new File(downloadDir, imageFileName);
            try {
                byte[] image = page.content(); // image data
                FileUtils.write(imageFile, image);
                System.out.println("Saved image " + page.url() + " to " + imageFile.getAbsolutePath());
            } catch (IOException ex) {
                throw new RuntimeException(ex);
            }
        }

//            System.out.println("URL:\n" + url);
//            System.out.println("title:\n" + title);
//            System.out.println("content:\n" + content);

//        }
    }

    public void computeImageId() {
        int maxId = -1;
        for (File imageFile : downloadDir.listFiles()) {
            String fileName = imageFile.getName();
            String idStr = fileName.split("\\.")[0];
            int id = Integer.valueOf(idStr);
            if (id > maxId) {
                maxId = id;
            }
        }
        imageId = new AtomicInteger(maxId);
    }
	
    // To set a proxy or a custom User-Agent, override getResponse
    @Override
    public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {

        /* Set the proxy server */
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8889));
        //HttpRequest hr = new HttpRequest(crawlDatum, proxy);
        HttpRequest hr = new HttpRequest(crawlDatum, null);
//         hr.setProxy(proxy);
        hr.setUserAgent(getusergent());
        return hr.response();
    }




    private static String[] agentarray = { "Mozilla/5.0 (compatible, MSIE 10.0, Windows NT, DigExt)",
            "Mozilla/4.0 (compatible, MSIE 7.0, Windows NT 5.1, 360SE)",
            "Mozilla/4.0 (compatible, MSIE 8.0, Windows NT 6.0, Trident/4.0)",
            "Mozilla/5.0 (compatible, MSIE 9.0, Windows NT 6.1, Trident/5.0),",
            "Opera/9.80 (Windows NT 6.1, U, en) Presto/2.8.131 Version/11.11",
            "Mozilla/4.0 (compatible, MSIE 7.0, Windows NT 5.1, TencentTraveler 4.0)",
            "Mozilla/5.0 (Windows, U, Windows NT 6.1, en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Macintosh, Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh, U, Intel Mac OS X 10_6_8, en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Linux, U, Android 3.0, en-us, Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPad, U, CPU OS 4_3_3 like Mac OS X, en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/4.0 (compatible, MSIE 7.0, Windows NT 5.1, Trident/4.0, SE 2.X MetaSr 1.0, SE 2.X MetaSr 1.0, .NET CLR 2.0.50727, SE 2.X MetaSr 1.0)",
            "Mozilla/5.0 (iPhone, U, CPU iPhone OS 4_3_3 like Mac OS X, en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "MQQBrowser/26 Mozilla/5.0 (Linux, U, Android 2.3.7, zh-cn, MB200 Build/GRJ22, CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 
            };

    public static String getusergent() {
        Random rand = new Random();
        return agentarray[rand.nextInt(agentarray.length)];

    }
    
}
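
The computeImageId method above assumes every file in the download folder is named like `12.jpg`; a stray file such as `.DS_Store` would make `Integer.valueOf` throw. A more defensive variant could look like the sketch below. The class `ImageIdScanner` and its list-of-names signature are assumptions for illustration, so that the logic can be tried without touching the file system.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds the largest numeric file-name stem, ignoring other files.
final class ImageIdScanner {
    // Returns the largest N among names shaped like "N.ext", or -1 if there is none.
    static int maxNumericId(List<String> fileNames) {
        int maxId = -1;
        for (String name : fileNames) {
            int dot = name.indexOf('.');
            String stem = dot < 0 ? name : name.substring(0, dot);
            // Up to 9 digits always fits in an int, so parseInt cannot overflow here.
            if (!stem.matches("\\d{1,9}")) {
                continue; // skips names such as ".DS_Store" or "cover.png"
            }
            maxId = Math.max(maxId, Integer.parseInt(stem));
        }
        return maxId;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("3.jpg", ".DS_Store", "10.png", "cover.png");
        System.out.println(maxNumericId(names));
    }
}
```

In the crawler, the names would come from `downloadDir.list()`, which can return null if the folder is missing, so that case is worth checking before scanning.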

https://blog.csdn/jiangsanfeng1111/article/details/52334907

Crawling JavaScript-generated content: https://blog.csdn/osaymissyou0/article/details/49450209

WebCollector README: https://github/CrawlScript/WebCollector/blob/master/README.md
