

	import requests
	from lxml import etree

	res = requests.get(url)
	html = etree.HTML(res.text)  # this call raises the ValueError
	contents = html.xpath('//div/xxxx')

Running it fails with:

	Traceback (most recent call last):
	  File "xxxxxxxx.py", line 157, in <module>
	  File "xxxxxxxx.py", line 141, in get_website_title_content
	    html = etree.HTML(html_text)
	  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
	  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
	ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The key error is ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

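After some digging, the cause turns out to be the difference between res.text and res.content on the response that requests returns. The page begins with a declaration that names an encoding, and lxml refuses to parse an already-decoded string that still carries such a declaration. The definitions of text and content in the requests source (shown below) make the difference clear: res.text returns Unicode (str) data, while res.content returns bytes.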

    @property
    def content(self):
        """Content of the response, in bytes."""

        if self._content is False:
            # Read the contents.
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0 or self.raw is None:
                self._content = None
            else:
                self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

        self._content_consumed = True
        # don't need to release the connection; that's been handled by urllib3
        # since we exhausted the data.
        return self._content

    @property
    def text(self):
        """Content of the response, in unicode.

        If Response.encoding is None, encoding will be guessed using

        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return str('')

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            # A TypeError can be raised if encoding is None
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content
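
To see why the string form trips the parser, the sketch below mimics the decoding step of text using only built-in types. The sample document and the variable names raw and text are made up for illustration; no real response is involved.

```python
# Hypothetical payload standing in for res.content (the bytes as received).
raw = ('<?xml version="1.0" encoding="utf-8"?>'
       '<html><body><div>标题</div></body></html>').encode('utf-8')

# Same kind of call the text property makes: decode the bytes into a str.
text = str(raw, 'utf-8', errors='replace')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str

# Decoding changes the type but leaves the declaration in place, so the
# str still announces an encoding even though it is no longer bytes.
print(text.startswith('<?xml version="1.0" encoding="utf-8"?>'))  # True
```

A str that still carries an encoding declaration is the kind of input the ValueError message complains about, which is why it asks for bytes input instead.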

So the fix is simple. The first option is to use res.content directly:

	res = requests.get(url)
	html = etree.HTML(res.content)
	contents = html.xpath('//div/xxxx')

The second option is to convert res.text back to bytes before parsing:

	res = requests.get(url)
	html_text = res.text.encode('utf-8')  # str -> bytes
	html = etree.HTML(html_text)
	contents = html.xpath('//div/xxxx')
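
Both fixes come down to handing the parser bytes. As a minimal sketch, the two can be folded into one small helper; to_parser_input is a hypothetical name, not part of requests or lxml, and the utf-8 default is an assumption carried over from the second example.

```python
def to_parser_input(body, encoding='utf-8'):
    """Return bytes for the parser, given either a bytes or a str body.

    Hypothetical helper for illustration only.
    """
    if isinstance(body, bytes):
        # Already bytes (e.g. res.content): pass through unchanged.
        return body
    if isinstance(body, str):
        # Decoded text (e.g. res.text): encode it back to bytes.
        return body.encode(encoding)
    raise TypeError('expected bytes or str, got %s' % type(body).__name__)


# Intended use, assuming res is a requests response:
#     html = etree.HTML(to_parser_input(res.content))
#     html = etree.HTML(to_parser_input(res.text))
print(to_parser_input(b'<div>ok</div>'))
print(to_parser_input('<div>ok</div>'))
```

When the raw bytes are available, passing res.content is the shorter path, since it skips the decode and re-encode round trip.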

