Encoding troubles - one format to another

I have a scraper that collects data from an external source that I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, so I get

\u00e4

for a small 'a' with umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.

Is there any realistic way to convert these Unicode escapes into proper characters that doesn't involve me manually handling every single escape sequence and replacing it during the scrape?

*Here is a sample of the JSON that it spits out:

({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})

Best answer

Since \u00e4 is the JavaScript (and JSON) escape sequence for a Unicode character, one possibility is to use PHP's json_decode() function to decode it into a PHP string.

A valid JSON string would be:

$json = '"\u00e4"';

And this:

header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);

would give you the right output:

string 'ä' (length=2)

(It's one character, but two bytes long, because json_decode() returns UTF-8.)
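
To make the "one character, two bytes" point concrete, here is a minimal sketch that inspects the decoded string. It assumes the mbstring extension is available for mb_strlen(); the variable names are only illustrative.

```php
<?php
// The JSON text: a quoted string containing the \u00e4 escape.
$json = '"\u00e4"';

// json_decode() turns the escape into the UTF-8 encoded character.
$decoded = json_decode($json);

// strlen() counts bytes, mb_strlen() counts characters
// (assumption: the mbstring extension is enabled).
$byteCount = strlen($decoded);             // 2
$charCount = mb_strlen($decoded, 'UTF-8'); // 1
$hexBytes  = bin2hex($decoded);            // c3a4

echo $byteCount, ' bytes, ', $charCount, ' character, hex ', $hexBytes, "\n";
```

The two bytes c3 a4 are the UTF-8 encoding of U+00E4, which is why the page has to be served with a UTF-8 charset for the character to display correctly.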

Still, it feels a bit hackish ^^ And it might not work too well, depending on the kind of string you get as input: json_decode() only accepts valid JSON, so a bare \u00e4 without the surrounding double quotes would not decode.

[Edit] I've just seen your comment where you seem to indicate that you get JSON as input. If so, json_decode() might really be the right tool for the job ;-)
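
If the input really is JSON like the sample in the question, decoding the whole payload might look roughly like the sketch below. The sample in the question is cut off, so the payload here is a hypothetical completed version (closing quote and braces added), and stripping one pair of wrapping parentheses is an assumption about how the response is delivered.

```php
<?php
// Hypothetical payload modelled on the sample in the question; the closing
// quote and braces are added here so that the string is valid JSON.
$raw = '({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})';

// Assumption: the response is wrapped in a single pair of parentheses,
// which has to be removed before the text is valid JSON.
$json = preg_replace('/^\s*\(|\)\s*$/', '', $raw);

// Decode into nested associative arrays; every \uXXXX escape and every
// \/ and \" is resolved by the decoder, so no manual replacement is needed.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

// The HTML fragment now contains a real UTF-8 "ä".
$html = $data['content']['pagelet_tab_content'];
echo $html;
```

Sending the Content-type header with charset=UTF-8, as in the answer above, is still needed so the browser interprets the decoded bytes correctly.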
