Python Spell Checking

LanguageTool

LanguageTool is an open-source grammar tool, also known as the spellchecker for OpenOffice. The language_tool_python library lets you detect grammar errors and spelling mistakes from a Python script or from a command-line interface. We will work with the language_tool_python Python package, which can be installed with the pip install language-tool-python command. By default, language_tool_python downloads a LanguageTool server .jar and runs it in the background to detect grammar errors locally. LanguageTool also offers a Public HTTP Proofreading API, which is supported as well, but the number of calls is restricted.

LanguageTool in Python

We will provide a practical example of how you can detect your grammar mistakes and also correct them. We will work with the following text:

LanguageTool offers spell and grammar checking. Just paste your text here and click the ‘Check Text’ button. Click the colored phrases for details on potential errors. or use this text too see an few of of the problems that LanguageTool can detecd. What do you thinks of grammar checkers? Please not that they are not perfect. Style issues get a blue marker: It’s 5 P.M. in the afternoon. The weather was nice on Thursday, 27 June 2017 “.

The text contains several deliberate grammar and spelling issues. Let’s see how we can detect them with Python:

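import language_tool_python

# start a local LanguageTool server for US English
tool = language_tool_python.LanguageTool('en-US')

text = """LanguageTool offers spell and grammar checking. Just paste your text here and click the 'Check Text' button. Click the colored phrases for details on potential errors. or use this text too see an few of of the problems that LanguageTool can detecd. What do you thinks of grammar checkers? Please not that they are not perfect. Style issues get a blue marker: It's 5 P.M. in the afternoon. The weather was nice on Thursday, 27 June 2017"""

# get the matches
matches = tool.check(text)
matches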

And we get:

[Match({'ruleId': 'UPPERCASE_SENTENCE_START', 'message': 'This sentence does not start with an uppercase letter', 'replacements': ['Or'], 'context': '...hrases for details on potential errors. or use this text too see an few of of the ...', 'offset': 168, 'errorLength': 2, 'category': 'CASING', 'ruleIssueType': 'typographical'}),
 Match({'ruleId': 'TOO_TO', 'message': 'Did you mean "to see"?', 'replacements': ['to see'], 'context': '...s on potential errors. or use this text too see an few of of the problems that Language...', 'offset': 185, 'errorLength': 7, 'category': 'CONFUSED_WORDS', 'ruleIssueType': 'misspelling'}),
 Match({'ruleId': 'EN_A_VS_AN', 'message': 'Use "a" instead of \'an\' if the following word doesn\'t start with a vowel sound, e.g. \'a sentence\', \'a university\'', 'replacements': ['a'], 'context': '...ential errors. or use this text too see an few of of the problems that LanguageToo...', 'offset': 193, 'errorLength': 2, 'category': 'MISC', 'ruleIssueType': 'misspelling'}),
 Match({'ruleId': 'ENGLISH_WORD_REPEAT_RULE', 'message': 'Possible typo: you repeated a word', 'replacements': ['of'], 'context': '...errors. or use this text too see an few of of the problems that LanguageTool can dete...', 'offset': 200, 'errorLength': 5, 'category': 'MISC', 'ruleIssueType': 'duplication'}),
 Match({'ruleId': 'MORFOLOGIK_RULE_EN_US', 'message': 'Possible spelling mistake found.', 'replacements': ['detect'], 'context': '...f of the problems that LanguageTool can detecd. What do you thinks of grammar checkers...', 'offset': 241, 'errorLength': 6, 'category': 'TYPOS', 'ruleIssueType': 'misspelling'}),
 Match({'ruleId': 'DO_VBZ', 'message': 'After the auxiliary verb \'do\', use the base form of the main verb. Did you mean "think"?', 'replacements': ['think'], 'context': '...at LanguageTool can detecd. What do you thinks of grammar checkers? Please not that th...', 'offset': 261, 'errorLength': 6, 'category': 'GRAMMAR', 'ruleIssueType': 'grammar'}),
 Match({'ruleId': 'PLEASE_NOT_THAT', 'message': 'Did you mean "note"?', 'replacements': ['note'], 'context': '... you thinks of grammar checkers? Please not that they are not perfect. Style issues...', 'offset': 296, 'errorLength': 3, 'category': 'TYPOS', 'ruleIssueType': 'misspelling'}),
 Match({'ruleId': 'PM_IN_THE_EVENING', 'message': 'This is redundant. Consider using "P.M."', 'replacements': ['P.M.'], 'context': "... Style issues get a blue marker: It's 5 P.M. in the afternoon. The weather was nice on Thursday, 27 Ju...", 'offset': 366, 'errorLength': 22, 'category': 'REDUNDANCY', 'ruleIssueType': 'style'}),
 Match({'ruleId': 'DATE_WEEKDAY', 'message': 'The date 27 June 2017 is not a Thursday, but a Tuesday.', 'replacements': [], 'context': '... the afternoon. The weather was nice on Thursday, 27 June 2017', 'offset': 413, 'errorLength': 22, 'category': 'SEMANTICS', 'ruleIssueType': 'inconsistency'})]

As we can see, we get a list of Match objects, each of which shows the ruleId, the message, and so on. You can find a detailed explanation of every rule ID in the LanguageTool Community. It is interesting to see the error it caught in the date, where it returns the message: The date 27 June 2017 is not a Thursday, but a Tuesday. In this case, however, it offers no correction, because it cannot guess what the author meant by entering this date 🙂
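A quick way to get an overview of a result like this is to count the matches per category. The sketch below is a minimal illustration that does not call LanguageTool: the `categories` list is simply the 'category' values copied by hand from the nine matches printed above. With real results, the same count could presumably be built from the category attribute of each match.

```python
from collections import Counter

# 'category' values copied from the nine matches printed above, in order.
# This hand-written list stands in for real Match objects.
categories = [
    "CASING",
    "CONFUSED_WORDS",
    "MISC",
    "MISC",
    "TYPOS",
    "GRAMMAR",
    "TYPOS",
    "REDUNDANCY",
    "SEMANTICS",
]

# Count how many issues fall into each category.
counts = Counter(categories)

# Print the categories from most to least frequent.
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

A summary like this makes it easy to see whether a text suffers mostly from typos, style issues, or real grammar problems.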

Now that we have detected the mistakes, we can correct them.

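my_mistakes = []
my_corrections = []
start_positions = []
end_positions = []

# collect every match that has at least one suggested replacement
for rules in matches:
    if len(rules.replacements) > 0:
        start_positions.append(rules.offset)
        end_positions.append(rules.errorLength + rules.offset)
        my_mistakes.append(text[rules.offset:rules.errorLength + rules.offset])
        my_corrections.append(rules.replacements[0])

# work on a list of characters so positions do not shift while editing
my_new_text = list(text)

for m in range(len(start_positions)):
    # put the correction at the start of the span and blank out the rest
    my_new_text[start_positions[m]] = my_corrections[m]
    for i in range(start_positions[m] + 1, end_positions[m]):
        my_new_text[i] = ""

my_new_text = "".join(my_new_text)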

And we get the corrected text:

LanguageTool offers spell and grammar checking. Just paste your text here and click the ‘Check Text’ button. Click the colored phrases for details on potential errors. Or use this text to see a few of the problems that LanguageTool can detect. What do you think of grammar checkers? Please note that they are not perfect. Style issues get a blue marker: It’s 5 P.M. The weather was nice on Thursday, 27 June 2017
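An alternative way to apply the suggestions is to splice each replacement into the string from right to left, so that a replacement of a different length never shifts the offsets of the matches still to be processed. The sketch below is only an illustration under stated assumptions: `Span` and `apply_corrections` are hypothetical names, not part of language_tool_python, and the sample is a fragment of the text above with offsets recomputed relative to that fragment.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in holding the three Match fields used here."""
    offset: int
    errorLength: int
    replacements: List[str]


def apply_corrections(text: str, spans: List[Span]) -> str:
    """Replace each flagged span with its first suggestion, right to left."""
    corrected = text
    for span in sorted(spans, key=lambda s: s.offset, reverse=True):
        if not span.replacements:
            continue  # nothing suggested (as with the date check), leave as is
        end = span.offset + span.errorLength
        corrected = corrected[:span.offset] + span.replacements[0] + corrected[end:]
    return corrected


# Fragment of the sample text; offsets are relative to this fragment.
fragment = "or use this text too see an few of of the problems"
spans = [
    Span(0, 2, ["Or"]),
    Span(17, 7, ["to see"]),
    Span(25, 2, ["a"]),
    Span(32, 5, ["of"]),
]

corrected = apply_corrections(fragment, spans)
print(corrected)
```

Working from the end of the string keeps the logic to a single slice per match, at the cost of assuming that the flagged spans do not overlap.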

Spelling and Grammar Mistakes

Let’s see the mistakes that we captured and their corresponding corrections.

list(zip(my_mistakes, my_corrections))

[('or', 'Or'),
('too see', 'to see'),
('an', 'a'),
('of of', 'of'),
('detecd', 'detect'),
('thinks', 'think'),
('not', 'note'),
('P.M. in the afternoon.', 'P.M.')]

Detailed Example

We will now take a closer look at the output we get from LanguageTool, using a simple example of just one sentence. Our sentence:

Your the best but their are allso  good!

text = "Your the best but their are allso  good!"
matches = tool.check(text)
len(matches)
# 4

LanguageTool detected 4 issues. Let’s go through them one by one, starting with the first.

matches[0]

And we get:

Match({'ruleId': 'YOUR_YOU_RE', 'message': 'Did you mean "You\'re"?', 'replacements': ["You're"], 'context': 'Your the best but their are allso  good !', 'offset': 0, 'errorLength': 4, 'category': 'TYPOS', 'ruleIssueType': 'misspelling'})

As we can see, it reports the ruleId; a message to the end user, which is “Did you mean "You're"?”; the suggested replacements; the context, which is the input; the offset, which is the position where the issue starts; the errorLength, which is the number of characters in the issue (4 in our case); the category of the mistake, which is “TYPOS” in our case; and the ruleIssueType, which is “misspelling”.

We can access each attribute of a language_tool_python.match.Match object with dot notation, that is, the object followed by a period and the attribute name. Let’s say that we want to get the replacements.

matches[0].replacements
# ["You're"]

Let’s have a look at the other issues detected by LanguageTool. The second detected issue was “their”, which is corrected to “there”.

matches[1]

And we get:

Match({'ruleId': 'THEIR_IS', 'message': 'Did you mean "there"?', 'replacements': ['there'], 'context': 'Your the best but their are allso  good !', 'offset': 18, 'errorLength': 5, 'category': 'CONFUSED_WORDS', 'ruleIssueType': 'misspelling'})

The third detected issue was “allso”, which is corrected to “also”.

matches[2]

And we get:

Match({'ruleId': 'MORFOLOGIK_RULE_EN_US', 'message': 'Possible spelling mistake found.', 'replacements': ['also', 'all so'], 'context': 'Your the best but their are allso  good !', 'offset': 28, 'errorLength': 5, 'category': 'TYPOS', 'ruleIssueType': 'misspelling'})

Finally, the last detected issue was the double space, which is corrected to a single space.

matches[3]

And we get:

Match({'ruleId': 'WHITESPACE_RULE', 'message': 'Possible typo: you repeated a whitespace', 'replacements': [' '], 'context': 'Your the best but their are allso  good!', 'offset': 33, 'errorLength': 2, 'category': 'TYPOGRAPHY', 'ruleIssueType': 'whitespace'})
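To make the meaning of offset and errorLength concrete, the following sketch slices the sentence with the values reported in the four matches above. It does not call LanguageTool: the `reported` list is copied by hand from the printed output, and the sentence is written with the double space after “allso” that the whitespace rule reports.

```python
# The example sentence, with the double space between "allso" and "good".
text = "Your the best but their are allso  good!"

# (offset, errorLength, first replacement) copied from the four matches above.
reported = [
    (0, 4, "You're"),
    (18, 5, "there"),
    (28, 5, "also"),
    (33, 2, " "),
]

# The flagged span is text[offset : offset + errorLength].
flagged = [text[offset:offset + length] for offset, length, _ in reported]

# Show each flagged span next to its suggested replacement.
for span, (_, _, replacement) in zip(flagged, reported):
    print(f"{span!r} -> {replacement!r}")
```

The same slice expression is what the correction code earlier in the article relies on to collect the mistakes.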

Discussion

If we want a free tool in Python that does similar work to Grammarly and supports more than 20 languages, then LanguageTool is a good option. Of course, no tool is perfect, and we cannot rely solely on grammar and spell checkers, but it is certainly something that we can use, mainly in NLP tasks and projects.

Originally published at https://predictivehacks.

Translated from: https://towardsdatascience/languagetool-grammar-and-spell-checker-in-python-578ac4e94642
