Importing the JAR (Maven dependency)

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>

Code implementation

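      // Requires: java.io.File, org.jsoup.Jsoup, org.jsoup.nodes.Document,
      //           org.jsoup.nodes.Element, org.jsoup.select.Elements
      String filePath = "D:\\工作文档\\国民经济行业分类\\报告_test.html";
      // Read the .html file into a string
      String htmlStr = toHtmlString(new File(filePath));
      // Parse the string into a Document object
      Document doc = Jsoup.parse(htmlStr);
      // Within <body>, get the elements with class="fc" (the report table)
      Elements table = doc.body().getElementsByClass("fc");
      // Get the table's children: jsoup wraps the rows in a <tbody> while parsing.
      // Note: first() returns null if no element has class "fc".
      Elements children = table.first().children();
      // Get the tr elements under the first child (the tbody)
      Elements tr = children.get(0).getElementsByTag("tr");
      // Iterate over the tr elements, get each row's td elements, and print their text
      for (int i = 0; i < tr.size(); i++) {
          Element e1 = tr.get(i);
          Elements td = e1.getElementsByTag("td");
          for (int j = 0; j < td.size(); j++) {
              String value = td.get(j).text();
              System.out.print("  " + value);
          }
          System.out.println();
      }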

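/**
 * Reads the HTML source of a local .html file.
 *
 * @param file the HTML file to read
 * @return the file content as one string (line breaks are dropped);
 *         an empty string if the file could not be read
 */
public static String toHtmlString(File file) {
    StringBuilder htmlSb = new StringBuilder();
    // The JDK accepts "unicode" as an alias for UTF-16; it must match the file's actual encoding
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(file), "unicode"))) {
        String line;
        // readLine() returns null at end of file
        while ((line = br.readLine()) != null) {
            htmlSb.append(line);
        }
        // Delete the temporary file
        //file.delete();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the raw HTML text (no cleaning is applied)
    return htmlSb.toString();
}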
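
As a possible alternative to the reader loop in toHtmlString, the sketch below reads the whole file in one call with java.nio.file.Files and decodes it with an explicit Charset, which also keeps the line breaks. The class name HtmlFileReader and the method readHtml are hypothetical, and UTF-16 is assumed only because the test file declares charset=unicode; the small sample written to a temp file stands in for the real report.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post.
final class HtmlFileReader {

    // Reads the whole file and decodes it with the given charset.
    // Unlike a readLine() loop, line breaks are preserved.
    static String readHtml(Path path, Charset charset) {
        try {
            return new String(Files.readAllBytes(path), charset);
        } catch (IOException e) {
            // Fail loudly instead of silently returning an empty string
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a tiny UTF-16 sample stands in for the real report file
        Path tmp = Files.createTempFile("report_test", ".html");
        Files.write(tmp, "<p>A0111</p>\n<p>A0112</p>".getBytes(StandardCharsets.UTF_16));
        System.out.println(readHtml(tmp, StandardCharsets.UTF_16));
        Files.delete(tmp);
    }
}
```

The returned string can be passed to Jsoup.parse in the same way as the result of toHtmlString. Throwing an unchecked exception is a design choice here, so that a missing file is not mistaken for an empty document.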

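Printed output

Each table row is printed on its own line, with every cell's text preceded by two spaces.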

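Document properties and methods

Note: the properties and methods below belong to the browser's JavaScript document object (the DOM), not to jsoup's org.jsoup.nodes.Document class. jsoup provides similarly named lookups such as getElementById, getElementsByClass, and getElementsByTag.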

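1. Object properties

1.document.title // gets or sets the document title (the HTML title tag)
2.document.bgColor // sets the page background color
3.document.fgColor // sets the foreground (text) color
4.document.URL // the URL of the current document (read-only in current browsers; assign to location.href to open another page in the same window)
5.document.linkColor // color of links that have not been clicked
6.document.cookie // sets and reads cookies
7.document.fileSize // file size (note: read-only, Internet Explorer only)
8.document.charset // the document's character set
9.document.alinkColor // color of the active link (note: while the link has focus)
10.document.vlinkColor // color of links that have been clicked
11.document.fileCreatedDate // file creation date (note: read-only, Internet Explorer only)
12.document.fileModifiedDate // file modification date (note: read-only, Internet Explorer only)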

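2. Commonly used object methods

1.document.write() // dynamically writes content to the page
2.document.createElement(Tag) // creates an HTML element object
3.document.getElementById(ID) // gets the element with the given id
4.document.getElementsByClassName(className) // gets the elements with the given class (array-like collection)
5.document.getElementsByTagName(TagName) // gets the elements with the given tag name
6.document.body.appendChild(Tag) // appends a newly created element to body
7.document.getElementsByName(Name) // gets the elements with the given name attribute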
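
Java's standard library exposes several of the method names listed above through its org.w3c.dom interfaces. The sketch below tries createElement, getElementsByTagName, and appendChild on a tiny well-formed XHTML snippet. It is only an illustration: it uses the JDK's XML parser rather than jsoup, and the class name DomMethodsDemo, the helper names, and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's W3C DOM, not jsoup.
final class DomMethodsDemo {

    // Parses well-formed markup into a W3C DOM Document.
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    // Appends a new <td> with the given text to the first <tr> and
    // returns how many <td> elements the document contains afterwards.
    static int appendCell(Document doc, String text) {
        Element td = doc.createElement("td");            // createElement(Tag)
        td.setTextContent(text);
        NodeList rows = doc.getElementsByTagName("tr");  // getElementsByTagName(TagName)
        rows.item(0).appendChild(td);                    // appendChild(Tag)
        return doc.getElementsByTagName("td").getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><table><tr><td>A0111</td></tr></table></body></html>");
        System.out.println(appendCell(doc, "A0112")); // prints 2
    }
}
```

The JDK parser only accepts well-formed XML, so real-world HTML such as the test report (which contains &nbsp; entities and an unclosed meta tag) still needs a lenient parser like jsoup.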

HTML test file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=unicode">
<style>
.AlignLeft { text-align: left; }
.AlignCenter { text-align: center; }
.AlignRight { text-align: right; }
body { font-family: sans-serif; font-size: 11pt; }
td { vertical-align: top; padding-left: 4px; padding-right: 4px; }

tr.SectionGap td { font-size: 4px; border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionAll td { border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionBegin td { border-left: none; border-top: none; border-right: 1px solid Black; }
tr.SectionEnd td { border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionMiddle td { border-left: none; border-top: none; border-right: 1px solid Black; }
tr.SubsectionAll td { border-left: none; border-top: none; border-bottom: 1px solid Gray; border-right: 1px solid Black; }
tr.SubsectionEnd td { border-left: none; border-top: none; border-bottom: 1px solid Gray; border-right: 1px solid Black; }
table.fc { border-top: 1px solid Black; border-left: 1px solid Black; width: 100%; font-family: monospace; font-size: 10pt; }
td.DataItemHeader { color: #000000; background-color: #FFFFFF; background-color: #E7E7E7; padding-top: 8px; }
td.DataItemInsigDiff { color: #000000; background-color: #EEEEFF; }
td.DataItemInsigOrphan { color: #000000; background-color: #FAEEFF; }
td.DataItemNum { color: #696969; background-color: #F0F0F0; }
td.DataItemSame { color: #000000; background-color: #FFFFFF; }
td.DataItemSigDiff { color: #000000; background-color: #FFE3E3; }
td.DataItemSigOrphan { color: #000000; background-color: #F1E3FF; }
.DataSegInsigDiff { color: #0000FF; }
.DataSegSigDiff { color: #FF0000; }
</style>
<title>国民经济行业分类比较</title>
</head>
<body>
国民经济行业分类比较<br/>
已产生: 2022/2/16 14:00:55<br/>
&nbsp; &nbsp;
<br/>
模式:&nbsp; 全部 &nbsp;
<br/>
左边文件: D:\工作文档\国民经济行业分类\Eleasing-国民经济行业分类 (2).xlsx &nbsp;
<br/>
右边文件: D:\工作文档\国民经济行业分类\人行-国民经济分类.xlsx &nbsp;
<br/>
<table class="fc" cellspacing="0" cellpadding="0">
<tr class="SectionAll">
<td class="DataItemHeader">1:</td>
<td class="DataItemHeader">2:</td>
<td class="DataItemHeader">&nbsp;</td>
<td class="DataItemHeader">1:</td>
<td class="DataItemHeader">2:</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">A0111</td>
<td class="DataItemSame AlignLeft">稻谷种植</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">A0111</td>
<td class="DataItemSame AlignLeft">稻谷种植</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">A0112</td>
<td class="DataItemSame AlignLeft">小麦种植</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">A0112</td>
<td class="DataItemSame AlignLeft">小麦种植</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">T9600</span></td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">国际组织</span></td>
<td class="AlignCenter">+-</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="AlignCenter">-+</td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">T9700</span></td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">国际组织</span></td>
</tr>
<tr class="SectionEnd">
<td class="DataItemSame AlignLeft">代码</td>
<td class="DataItemSame AlignLeft">中文名称</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">代码</td>
<td class="DataItemSame AlignLeft">中文名称</td>
</tr>
</table>
<br/>
</body>
</html>

