How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)

The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

Introduction

A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some knowledge of them is an added advantage.

Prerequisites

  • This tutorial is based on Python version 3.6.5. If you don’t have Python 3 installed, here’s a guide to install and set up a local programming environment for Python 3.

  • Familiarity in working with language data is recommended. If you’re new to using NLTK, check out the How To Work with Language Data in Python 3 using the Natural Language Toolkit (NLTK) guide.

Step 1 — Installing NLTK and Downloading the Data

You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

First, install the NLTK package with the pip package manager:

  • pip install nltk==3.3

This tutorial will use sample tweets that are part of the NLTK package. First, start a Python interactive session by running the following command:

  • python3

Then, import the nltk module in the Python interpreter.

  • import nltk

Download the sample tweets from the NLTK package:

  • nltk.download('twitter_samples')

Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). You are ready to import the tweets and begin processing the data.

Step 2 — Tokenizing the Data

Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.

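For a quick feel of what this looks like in practice, you could try the following in a Python interactive session. It is a minimal sketch (the sample tweet is made up) comparing a plain whitespace split with NLTK’s tweet-aware TweetTokenizer, which keeps mentions, hashtags, and emoticons together as single tokens:

from nltk.tokenize import TweetTokenizer

# A made-up tweet, just for illustration
sample = "Great job, @NLTK_org!! Loving the new release :-) #nlp"

# Naive tokenization: splitting on whitespace leaves punctuation attached to words
print(sample.split())

# A tweet-aware tokenizer keeps mentions, emoticons, and hashtags as single tokens
print(TweetTokenizer().tokenize(sample))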

To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py:

  • nano nlp_test.py

In this file, you will first import the twitter_samples so you can work with that data:

nlp_test.py
from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the model:

  • negative_tweets.json: 5000 tweets with negative sentiments

  • positive_tweets.json: 5000 tweets with positive sentiments

  • tweets.20150430-223406.json: 20000 tweets with no sentiments

Next, create variables for positive_tweets, negative_tweets, and text:

nlp_test.py
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

The strings() method of twitter_samples will print all of the tweets within a dataset as strings. Setting the different tweet collections as a variable will make processing and testing easier.

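If you want to confirm that the datasets loaded as expected, you could temporarily add a quick check like the following (optional, and not part of the tutorial’s script):

# Optional sanity check: each collection is a list of tweet strings
print(len(positive_tweets))   # 5000
print(len(negative_tweets))   # 5000
print(len(text))              # 20000
print(positive_tweets[0])     # raw text of the first positive tweet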

Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:

  • python3

Run the following commands in the session to download the punkt resource:

  • import nltk

  • nltk.download('punkt')

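Once punkt is available, you can see the behavior described above for yourself. This short aside (not part of the tutorial script) uses NLTK’s sentence tokenizer, which relies on punkt, with the example name from earlier:

from nltk.tokenize import sent_tokenize

# sent_tokenize relies on the punkt model downloaded above
sample = "S. Daityari is the author. The tutorial uses NLTK."
print(sent_tokenize(sample))
# The period after "S." should not end the first sentence:
# ['S. Daityari is the author.', 'The tutorial uses NLTK.']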

Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

nlp_test.py
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

If you’d like to test the script to see the .tokenized method in action, add the highlighted content to your nlp_test.py script. This will print the tokens of a single tweet from the positive_tweets.json dataset:

nlp_test.py
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

print(tweet_tokens[0])

Save and close the file, and run the script:

  • python3 nlp_test.py

The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:


   
    Output
   ['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'being',
 'top',
 'engaged',
 'members',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

Here, the .tokenized() method returns special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

Now that you’ve seen how the .tokenized() method works, make sure to comment out or remove the last line that prints the tokenized tweet from the script by adding a # to the start of the line:

nlp_test.py
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

#print(tweet_tokens[0])

Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

Step 3 — Normalizing the Data

Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

Stemming is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

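To make the trade-off concrete, here is a minimal sketch comparing a stemmer with the lemmatizer. It assumes the wordnet resource, downloaded just below, is already available:

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips suffixes heuristically and can produce non-words
print(stemmer.stem("running"), stemmer.stem("was"))        # run wa

# Lemmatization uses vocabulary and part of speech to find the canonical form
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("was", pos="v"))                # run be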

Before you proceed to use lemmatization, download the necessary resources by entering the following in to a Python interactive session:

  • python3

Run the following commands in the session to download the resources:

  • import nltk

  • nltk.download('wordnet')

  • nltk.download('averaged_perceptron_tagger')

wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.

Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python session, import the pos_tag function, and provide a list of tokens as an argument to get the tags. Let us try this out in Python:

  • from nltk.tag import pos_tag

  • from nltk.corpus import twitter_samples

  • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

  • print(pos_tag(tweet_tokens[0]))

Here is the output of the pos_tag function.


   
    Output
   [('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]

From the list of tags, here is the list of the most common items and their meaning:

  • NNP: Noun, proper, singular

  • NN: Noun, common, singular or mass

  • IN: Preposition or conjunction, subordinating

  • VBG: Verb, gerund or present participle

  • VBN: Verb, past participle

Here is a full list of the dataset.

In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

Update the nlp_test.py file with the following function that lemmatizes a sentence:

nlp_test.py
...

from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb.

Save and close the file, and run the script:

  • python3 nlp_test.py

Here is the output:


   
    Output
   ['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'be',
 'top',
 'engage',
 'member',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

Now that you have successfully created a function to normalize words, you are ready to move on to remove noise.

Step 4 — Removing Noise from the Data

In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

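If you are curious what NLTK considers stop words, you can peek at the built-in list. This assumes the stopwords resource, which you will download later in this step, is available:

from nltk.corpus import stopwords

# The English stop word list bundled with NLTK: short, very common words
stop_words = stopwords.words('english')
print(len(stop_words))
print(stop_words[:10])  # e.g. ['i', 'me', 'my', 'myself', 'we', ...]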

In this tutorial, you will use regular expressions in Python to search for and remove these items:

  • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.

  • Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.

  • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.

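As a small illustration of how .sub() works, the following uses a simplified URL pattern (not the exact pattern used in the function below) to replace a link with an empty string:

import re

# Replace anything that looks like a URL with an empty string
sample = "Check out my gig https://t.co/abc123 :)"
print(re.sub(r'https?://\S+', '', sample))
# 'Check out my gig  :)'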

Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

Add the following code to your nlp_test.py file to remove noise from the dataset:

nlp_test.py
...

import re, string

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

Finally, you can remove punctuation using the string library.

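The string module exposes these characters as a constant, which the remove_noise() function checks each token against:

import string

# The punctuation characters that remove_noise() filters out
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~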

In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

Execute the following command from a Python interactive session to download this resource:

  • nltk.download('stopwords')

Once the resource is downloaded, exit the interactive session.

You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

nlp_test.py
...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

print(remove_noise(tweet_tokens[0], stop_words))

After saving and closing the file, run the script again to receive output similar to the following:


   
    Output
   ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

nlp_test.py
...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

#print(remove_noise(tweet_tokens[0], stop_words))

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

nlp_test.py
...
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.


   
    Output
   ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']

There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine tune the noise removal process for your specific data.

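If your data needed it, one possible tweak would be to collapse long runs of repeated letters before further processing, so that variants like “Hii” and “Hiiiii” normalize to the same token. This is not part of the tutorial’s pipeline, just a hypothetical sketch:

import re

def squeeze_repeats(text):
    # Collapse three or more repeated characters down to two, e.g. 'Hiiiii' -> 'Hii'
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(squeeze_repeats("Hiiiii"))          # Hii
print(squeeze_repeats("sooooo gooood"))   # soo good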

Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

nlp_test.py
...
#print(positive_tweet_tokens[500])
#print(positive_cleaned_tokens_list[500])

In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

Step 5 — Determining Word Density

The most basic form of analysis on textual data is to extract word frequencies. A single tweet is too small an entity to find out the distribution of words; hence, the analysis of word frequencies is done on all positive tweets.

The following snippet defines a generator function, named get_all_words, that takes a list of tweets as an argument to provide a list of words in all of the tweet tokens joined. Add the following code to your nlp_test.py file:

nlp_test.py
...

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

nlp_test.py
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

When you run the file now, you will find the most common terms in the data:


   
    Output
   [(':)', 3691),
 (':-)', 701),
 (':d', 658),
 ('thanks', 388),
 ('follow', 357),
 ('love', 333),
 ('...', 290),
 ('good', 283),
 ('get', 263),
 ('thank', 253)]

From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned up the tweets for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

In the next step you will prepare data for sentiment analysis.

Step 6 — Preparing Data for the Model

Sentiment analysis is a process of identifying an attitude of the author on a topic that is being written about. You will create a training data set to train a model. This is a supervised machine learning process, which requires you to associate each dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an equation which predicts the weight of a person, given their height. A sentiment analysis model that you will build would associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts. The purpose of the first part is to build the model, whereas the next part tests the performance of the model.

In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function that changes the format of the cleaned data.

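For instance, a single cleaned tweet’s tokens would map to a feature dictionary like this (the tokens here are just an example):

sample_tokens = ['#followfriday', 'top', 'engage', 'member', ':)']
print(dict([token, True] for token in sample_tokens))
# {'#followfriday': True, 'top': True, 'engage': True, 'member': True, ':)': True}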

Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

nlp_test.py
...
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

Splitting the Dataset for Training and Testing the Model

Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

nlp_test.py
...
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

Step 7 — Building and Testing the Model

Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

nlp_test.py
...
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

print(classifier.show_most_informative_features(10))

Save, close, and execute the file after adding the code. The output of the code will be as follows:


   
    Output
   Accuracy is: 0.9956666666666667

Most Informative Features
                      :( = True           Negati : Positi =   2085.6 : 1.0
                      :) = True           Positi : Negati =    986.0 : 1.0
                 welcome = True           Positi : Negati =     37.2 : 1.0
                  arrive = True           Positi : Negati =     31.3 : 1.0
                     sad = True           Negati : Positi =     25.9 : 1.0
                follower = True           Positi : Negati =     21.1 : 1.0
                     bam = True           Positi : Negati =     20.7 : 1.0
                    glad = True           Positi : Negati =     18.1 : 1.0
                     x15 = True           Negati : Positi =     15.9 : 1.0
               community = True           Positi : Negati =     14.1 : 1.0

Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.

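If you want to see what that number means in concrete terms, the following optional sketch computes the same fraction by hand, assuming test_data and classifier as built above:

# Count how many test tweets the classifier labels correctly
correct = sum(1 for features, label in test_data
              if classifier.classify(features) == label)
print(correct, "of", len(test_data), "correct =", correct / len(test_data))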

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive datasets. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

nlp_test.py
...
from nltk.tokenize import word_tokenize

custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

Run the script to analyze the custom text. Here is the output for the custom text in the example:


   
    Output
   'Negative'

You can also check if it characterizes positive tweets correctly:

nlp_test.py
...
custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'

Here is the output:


   
    Output
   'Positive'

Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

nlp_test.py
...
custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'

Here is the output:


   
    Output
   'Positive'

The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. In case you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts leftover from following the tutorial, so the next step will guide you through aligning the code to Python’s best practices.

Step 8 — Cleaning Up the Code (Optional)

Though you have completed the tutorial, it is recommended to reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

  • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.

  • All functions should be defined after the imports.

  • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

We will also remove the code that was commented out while following the tutorial, along with the lemmatize_sentence function, since the lemmatization is handled by the new remove_noise function.

Here is the cleaned version of nlp_test.py:

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

if __name__ == "__main__":

    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')
    tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]

    stop_words = stopwords.words('english')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    print(freq_dist_pos.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                         for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                         for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)

    print("Accuracy is:", classify.accuracy(classifier, test_data))

    print(classifier.show_most_informative_features(10))

    custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

    custom_tokens = remove_noise(word_tokenize(custom_tweet))

    print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))

Conclusion

This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets to a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations that one must take care of while performing sentiment analysis.

Translated from: https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk
