Tokenizing in French using nltk

I am trying to tokenize French words, but when I do, the words that contain the "^" symbol (accented characters such as "ê") are returned as \xe escape sequences. The following is the code I implemented.

import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import SpaceTokenizer
from nltk.tokenize import RegexpTokenizer

data = "Vous êtes au volant d'une voiture et vous roulez à vitesse"
#wst = WhitespaceTokenizer()
#tokenizer = RegexpTokenizer('\s+', gaps=True)
token = WhitespaceTokenizer().tokenize(data)
print token

Output I got:

['Vous', '\xeates', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', '\xe0', 'vitesse']

Desired output:

['Vous', 'êtes', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', 'à', 'vitesse']

Accepted answer

In Python 2, to write UTF-8 text in your code, you need to start your file with # -*- coding: <encoding name> -*- when not using ASCII. You also need to prepend Unicode strings with u:

# -*- coding: utf-8 -*-
import nltk
...
data = u"Vous êtes au volant d'une voiture et vous roulez à grande vitesse"
print WhitespaceTokenizer().tokenize(data)
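Note that even with the u prefix, printing the whole list in Python 2 shows the repr of each element, so accented tokens may still be displayed as \xNN escapes. A minimal sketch, assuming your terminal accepts UTF-8, is to print the tokens one by one:

# Each token is a proper unicode string; printing it directly
# renders the accents instead of showing the \xNN repr.
for token in WhitespaceTokenizer().tokenize(data):
    print token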

When you're not writing data in your Python code but reading it from a file, you must make sure that it's properly decoded by Python. The codecs module helps here:

import codecs
codecs.open('fichier.txt', encoding='utf-8')
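Putting the two together, here is a minimal sketch under the assumption that fichier.txt is a UTF-8 text file to be split on whitespace (the file name is just the example used above):

# -*- coding: utf-8 -*-
import codecs
from nltk.tokenize import WhitespaceTokenizer

# codecs.open decodes the bytes to unicode as the file is read
with codecs.open('fichier.txt', encoding='utf-8') as f:
    text = f.read()

tokens = WhitespaceTokenizer().tokenize(text)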

This is good practice because if there is an encoding error, you will know about it right away: it won't bite you later on, e.g. after processing your data. This is also the only approach that works in Python 3, where codecs.open becomes open and decoding is always done right away. More generally, avoid the Python 2 str type like the plague and always stick with Unicode strings to make sure encoding is handled properly.
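For reference, a rough Python 3 equivalent of the same idea (same hypothetical fichier.txt), where the built-in open decodes on the fly:

from nltk.tokenize import WhitespaceTokenizer

# In Python 3, open() with an encoding yields str (Unicode) directly
with open('fichier.txt', encoding='utf-8') as f:
    text = f.read()

print(WhitespaceTokenizer().tokenize(text))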

Recommended readings:

Python 2: Unicode HOWTO
Python 3: Unicode HOWTO
Python 3 Text Files Processing
What's new in Python 3: Unicode

Bon courage !
