Fine-Tuning GPT-2 for Magic the Gathering Flavour Text Generation

A template for fine tuning your own GPT-2 model.

GPT-3 has dominated the NLP news cycle recently with its borderline magical performance in text generation, but for everyone without $1,000,000,000 of Azure compute credits there are still plenty of ways to experiment with language models on your own. Hugging Face is a company focused on free, open source NLP tooling, and they provide one of the easiest ways of accessing pre-trained models and tokenizers for NLP experiments. In this article, I will share a method for fine tuning the 117M-parameter GPT-2 model on a corpus of Magic the Gathering card flavour texts to create a flavour text generator. This will all be captured in a Colab notebook, so you can copy and edit it to create generators for your own tasks!

Starting Point

Generative language models require billions of data points and millions of dollars in compute power to train successfully from scratch. For example, GPT-3 cost an estimated $4.6 million and 355 years of compute time to train. However, fine tuning many of these models for custom tasks is easily within reach of anyone with access to even a single GPU. For this project we will be using Colab, which comes with many common data science packages pre-installed, including PyTorch, and provides free access to GPU resources.

First, we will install the Hugging Face transformers library, which will also fetch the excellent (and fast) tokenizers library. Although Hugging Face provide a resource for text datasets in their nlp library, I will be sourcing my own data for this project. If you don’t have a dataset or application in mind, the nlp library would provide an excellent starting place for easy data acquisition.

This will install the Hugging Face transformers library and the tokenizer dependencies.
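
If you are starting from a fresh Colab runtime, a minimal install cell looks something like the snippet below (versions are not pinned here, so it simply pulls whatever transformers and tokenizers releases are current):

```python
# Run this in a Colab cell; the leading "!" executes it as a shell command.
!pip install transformers
```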

The Hugging Face libraries will give us access to the GPT-2 model as well as its pre-trained weights and biases, a configuration class, and a tokenizer to convert each word in our text dataset into a numerical representation to feed into the model for training. Tokenization is important because the models can’t work with text data directly, so it needs to be encoded into something more manageable. Below is a small, representative example of tokenization on some sample text to show what the encoding provides.

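To make the encoding step concrete, here is a small sketch using the standard pre-trained "gpt2" tokenizer from transformers; the printed ids are those reported in the example below.

```python
from transformers import GPT2Tokenizer

# Load the byte-pair-encoding tokenizer that matches the pre-trained GPT-2 model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode a short string into integer token ids, then decode them back to text.
token_ids = tokenizer.encode("Sample Text")
print(token_ids)                    # [36674, 8255]
print(tokenizer.decode(token_ids))  # "Sample Text"
```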

In this example, “Sample Text” is encoded by the tokenizer to the vector [36674, 8255].

The Data

Now it’s time to grab our data. For this project I’ll be using Magic the Gathering card flavour text from the Scryfall API, which returns an easily parsable JSON object of card data. Here, I extracted only English flavour text to avoid introducing new tokens for non-English words, as the GPT-2 model was originally trained on English-only data. After parsing, I was left with an iterable list object of 29,222 MtG flavour texts, a preview of which is below.

Here are a few samples of the training corpus.

The Loader

Now that we have our text data, we need to create a structured dataset and dataloader to feed it into the model appropriately. For this step, we will use built-in PyTorch classes to define the dataset and dataloader, which will feed the neural network. The dataloader object combines a dataset with a sampler and provides single- or multi-process iterators over the dataset (see the official documentation for further information). There are a lot of details here, but the important points are:

  1. The dataset object will create a new list in which each entry is a tuple of tensors.
  2. The first tensor is the encoded flavour text, wrapped in a start-of-text token and an end-of-text token, and padded up to a maximum embedding length (if the string is shorter than the maximum embedding space).
  3. The second tensor is an attention mask: a list of 1s and 0s that tells the model which tokens are important (the 1s) and which should be ignored (the 0s).

The code for creating this dataset object is below and has been generalized to fit any tokenizer and data list. Each encoded text is padded up to a maximum length, which can be specified. The longest string in my corpus was 98 tokens, so my tensors are only padded to a maximum length of 98. The GPT-2 model itself can accept sequences of up to 1,024 tokens, so keep in mind that the padding length you specify will affect the training speed of the model and the batch size you are able to allocate.

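The sketch below shows one way to write that dataset class. It assumes a transformers 3.x-style tokenizer call and a GPT2Tokenizer created with added start-of-text and pad special tokens (which is why ids 50257 and 50258 appear in the example output); the class name GPT2Dataset is illustrative rather than taken verbatim from the original notebook.

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

# GPT-2's own <|endoftext|> token (50256) is reused; the start-of-text (50257)
# and padding (50258) tokens are added on top of the base vocabulary.
tokenizer = GPT2Tokenizer.from_pretrained(
    "gpt2",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|pad|>",
)

class GPT2Dataset(Dataset):
    """Turns a list of strings into (input_ids, attention_mask) tensor pairs."""

    def __init__(self, txt_list, tokenizer, max_length=98):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each string in start/end-of-text tokens and pad to max_length.
            encodings = tokenizer(
                "<|startoftext|>" + txt + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
            )
            self.input_ids.append(torch.tensor(encodings["input_ids"]))
            self.attn_masks.append(torch.tensor(encodings["attention_mask"]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Each item is a tuple: (encoded flavour text, attention mask).
        return self.input_ids[idx], self.attn_masks[idx]
```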

Code to define a custom dataset in PyTorch.
Example output for the above dataset. In the first tensor, 50257 is the ‘start of sentence’ token, the subsequent tokens are the encoded words in the string, and 50256 represents the ‘end of sentence’ token. Combined, these tokens tell the model where the sentence construction starts and stops. The repeated token 50258 is the ‘padding’ token; padding positions are assigned 0s in the attention mask tensor so that the model gives them no weight.

We now need to split the dataset into a training and validation set before creating the dataloaders. The code below shows an example of doing this for the MTGDataset we have created from the dataset template code, using the GPT2Tokenizer we instantiated, and dividing the data into 80:20 training/validation sets. It is important to note that different samplers are employed for the training and validation dataloaders. We want random sampling for the training data, but that isn’t required for the validation samples, so these are iterated sequentially.

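Below is a sketch of that split and the two dataloaders. It assumes flavour_texts is the parsed list of 29,222 strings and reuses the GPT2Dataset class and tokenizer defined above; the variable names are illustrative.

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, random_split

batch_size = 32  # fits on a T4/K80 for 98-token sequences; reduce if memory is tight

# Build the full dataset (the "MTGDataset" referred to in the text), then split 80:20.
mtg_dataset = GPT2Dataset(flavour_texts, tokenizer, max_length=98)
train_size = int(0.8 * len(mtg_dataset))
val_size = len(mtg_dataset) - train_size
train_dataset, val_dataset = random_split(mtg_dataset, [train_size, val_size])

# Random sampling for training; sequential sampling is enough for validation.
train_dataloader = DataLoader(
    train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size
)
validation_dataloader = DataLoader(
    val_dataset, sampler=SequentialSampler(val_dataset), batch_size=batch_size
)
```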

The dataset was split using an 80:20 random sample of the MTGDataset.

The Model

Before training, we need to instantiate a few more things. First of all, we should load and set the parameters of the GPT-2 model. Next, we create an instance of the GPT-2 language model itself and configure it with the parameters we just set. Lastly, to speed up training we should run this on the available GPU; to do that we need to instruct PyTorch to load the model and data onto the cuda device.

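A sketch of that setup is below; it assumes the tokenizer (with its added special tokens) from the dataset step is still in scope.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Load the configuration matching the pre-trained 117M-parameter "gpt2" checkpoint.
configuration = GPT2Config.from_pretrained("gpt2", output_hidden_states=False)

# Instantiate the language-modelling model with the pre-trained weights.
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# The tokenizer gained start-of-text and pad tokens, so the embedding matrix
# must be resized to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Run training on the GPU by moving the model onto the cuda device.
device = torch.device("cuda")
model.to(device)
```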

Setting the model configuration, instantiating the model and running the model on the GPU.

At this point, we need to investigate what type of instance we have connected to in Colab. We can check this by running !nvidia-smi, which will display the device information, including the GPU model (P100, K80, T4, etc.) as well as the amount of VRAM available to the device. This information is crucial because it will inform our choice of batch size. Setting the batch size to the maximum you can fit into memory is generally good practice. On a T4 or K80 we can set the batch size to 32 for this particular data; otherwise the batch must be set smaller, or the data will fail to load onto the GPU and training won’t start. More VRAM enables larger batch sizes, which makes training faster.

Now we can set the epoch number (number of training cycles) and create the optimizer we will use for training. We will be using the Hugging Face implementation of AdamW, though other optimizers are acceptable. Fastai have a wonderful blog post explaining the AdamW optimizer, including a brief history and the recent tweaks that led to its current state.

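Something like the following works; the epoch count, learning rate and epsilon here are illustrative placeholders rather than the exact values from the original notebook.

```python
from transformers import AdamW

epochs = 5            # number of full passes over the training data
learning_rate = 5e-4  # illustrative value; tune to taste
epsilon = 1e-8        # small constant for numerical stability

# The Hugging Face implementation of the AdamW optimizer.
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=epsilon)
```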

At this stage, we could fine tune other hyperparameters, such as the learning rate, the beta and epsilon values of the optimizer, or vary the batch size or epoch number. If we are otherwise happy with the defaults, we can establish the training loop and begin!

Training

The code for the training loop is below. For anyone unfamiliar with neural network training, I’ll try to provide an accessible description of the basic workflow this code encapsulates:

  1. The training batch is loaded onto the GPU and the network makes predictions for some labels.
  2. The performance of the model is assessed and the loss (how far the predictions are from the truth) is calculated.
  3. The derivative of the loss is calculated and the optimizer moves down the gradient towards some minimum.
  4. The changes that reflect this step are then back-propagated through the model, the weights are updated at each layer, and the next batch is processed.

This process repeats for every training batch, and ideally the model will equilibrate at a position of minimized global loss. To see whether the model generalizes well to data it hasn’t seen, it is tested on the validation data. After this point, the model has been fine tuned on our new dataset, and we can examine the overall model performance and test the outputs to see how well this worked!

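A simplified sketch of such a loop is below. It assumes the model, optimizer and dataloaders defined earlier, and for brevity it also counts the padded positions in the loss rather than masking them out of the labels.

```python
import torch

device = torch.device("cuda")
model.to(device)

for epoch in range(epochs):
    # ---- Training ----
    model.train()
    total_train_loss = 0.0
    for input_ids, attn_masks in train_dataloader:
        input_ids = input_ids.to(device)
        attn_masks = attn_masks.to(device)

        model.zero_grad()
        # Passing labels=input_ids makes the model return the language-modelling
        # loss; the labels are shifted internally so each token predicts the next.
        outputs = model(input_ids, attention_mask=attn_masks, labels=input_ids)
        loss = outputs[0]
        total_train_loss += loss.item()

        loss.backward()   # back-propagate the gradients through the network
        optimizer.step()  # step the weights down the loss gradient

    # ---- Validation ----
    model.eval()
    total_val_loss = 0.0
    with torch.no_grad():
        for input_ids, attn_masks in validation_dataloader:
            input_ids = input_ids.to(device)
            attn_masks = attn_masks.to(device)
            outputs = model(input_ids, attention_mask=attn_masks, labels=input_ids)
            total_val_loss += outputs[0].item()

    print(
        f"Epoch {epoch + 1}: "
        f"train loss {total_train_loss / len(train_dataloader):.3f}, "
        f"val loss {total_val_loss / len(validation_dataloader):.3f}"
    )
```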

This is the basic training loop the model will iterate through. Other deep learning frameworks abstract this to a model.fit() method but in PyTorch we need to define our own training loops.

Training Made Easy

Shortly after this article was first published, Julien Chaumond pointed me to the new Trainer class in transformers, which makes this training loop significantly more concise and offers several other benefits as well. The Trainer even makes some of the dataloader instantiation obsolete: you only need to provide dataset objects, and it will automatically create the loaders, using random sampling for training and sequential sampling for validation, precisely as we configured manually. It will even prompt you to log in to a service like Weights and Biases to log your model training, and it can configure your model to train across multiple devices, including TPUs. Unless you really want to specify every detail of the training cycle manually, I would highly recommend using this method.

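The sketch below shows an equivalent Trainer setup. The output directories and argument values are placeholders, and the small data_collator lambda simply repacks our (input_ids, attention_mask) tuples into the dictionary format the model expects, with the input ids doubling as labels.

```python
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",  # placeholder path for checkpoints
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_dir="./logs",    # placeholder path for training logs
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # Repack each batch of (input_ids, attention_mask) tuples into model inputs.
    data_collator=lambda batch: {
        "input_ids": torch.stack([item[0] for item in batch]),
        "attention_mask": torch.stack([item[1] for item in batch]),
        "labels": torch.stack([item[0] for item in batch]),
    },
)

trainer.train()
```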

transformers.Trainer equivalent of the PyTorch training loop established earlier.

Evaluation

First, we will examine the shape of the training and validation loss curves, shown below. Given that many of the model parameters are defaults, this is a good outcome: the training loss doesn’t dip too far below the validation loss, which would have indicated possible over-fitting.

Training and validation loss values.

Now for the fun part! We will evaluate the model outputs from a human perspective. Below are five examples of the model outputs. I’m pretty pleased with these! They read like the flavour texts I put into the model, and a quick check shows they aren’t duplicates of any existing cards from the corpus. In some places the model even attributes quotes to real entities from MtG, in the correct location within the flavour text, so it definitely appears to understand the structure of our data. Overall, I would say that the training has gone really well and that the fine tuning has produced a new model that successfully generates novel MtG flavour texts!

Some example MtG flavour text outputs.

Future Changes

There are some obvious changes that we could make to this workflow that might improve the model. Some hyperparameter tuning could create more ‘accurate’ outputs, but that would be a difficult metric to calculate in this instance. Different language models, or a larger version of GPT-2 than the 117M-parameter model, could have been fine tuned as well. It is also possible that my scraping removed too many data points, or that the API I chose didn’t contain every possible flavour text, and a more exhaustive search would return a richer dataset.

If replicating this workflow isn’t for you, but you want to use this generator to create something on your own, this model is now hosted by Hugging Face. The embedded URL below will take you straight to the model’s home page.

By following the instructions on that page, or using the code chunk below, you can load the Magic-The-Generating model straight from the transformers library into your local Python environment.

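A sketch of that code chunk is below. The model id is an assumption based on my Hugging Face handle, so check the model page for the exact identifier before running it.

```python
from transformers import AutoModelWithLMHead, AutoTokenizer

# Assumed model id; confirm it on the Hugging Face model page.
model_name = "rjbownes/Magic-The-Generating"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)
```

From there, model.generate() can be used to sample new flavour texts locally.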

Acknowledgements

I think it’s really important to give credit where credit is due within the open source community. So if you found this blog entertaining or informative, please take a moment to visit these excellent resources for more language modelling content, tutorials and projects:

  • Rey Farhan, who was my inspiration for this particular project
  • Chris McCormick’s BERT fine-tuning tutorial (heavily cited by Rey)
  • Ian Porter’s GPT-2 tutorial
  • The Hugging Face language model fine-tuning script

Translated from: https://medium/@rjbownes/fine-tuning-gpt-2-for-magic-the-gathering-flavour-text-generation-3bafd0f9bb93
