How far can GPT2-small get with one day of training on a single 2080 Ti?

Introduction

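With ChatGPT taking off, more and more people have been discussing why LLMs usually adopt a decoder-only architecture, and GPT has been pushed to unprecedented prominence. Because the GPT architecture performs so well on zero-shot and few-shot tasks, many people want to try training one themselves. Training a large GPT model takes a lot of resources, though, so a small one is the more accessible choice: a small version of GPT2 can be trained in just one day.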

Framework

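Tsinghua University's ChatGLM has recently shown very good results, and its framework is built on DeepSpeed and Megatron-LM. Although I am only using a single GPU for now, building on that stack should make later changes, such as modifying the code to split the model across devices, fairly simple. This post therefore trains GPT2 with Megatron-LM.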

Environment setup

I am using Ubuntu 18.04.2. First download DeepSpeedExamples and go into the megatron/Megatron-LM folder. Inside that folder, create a new environment. I used virtualenv, running the following from within a conda environment:
pip install virtualenv
virtualenv -p python3.6 bert_env  # megatron only supports a Python 3.6 environment
source bert_env/bin/activate
pip install -r requirements.txt

Download the data into this folder. Start with the Wikipedia dump; wget is enough and the download is fairly fast. Then install WikiExtractor:
pip install wikiextractor
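Then run python -m wikiextractor.WikiExtractor <Wikipedia dump file> -o text --json to extract the articles in JSON format; the output files are saved under the text folder.
In the scripts folder, create process_wiki.sh, which builds the single large file that is fed to the model, i.e. one JSON file.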

#!/bin/bash

# loop through all files and subdirectories in the "text" directory
for file in text/*; do

    # check if the current file is a directory
    if [ -d "$file" ]; then

        # if it is a directory, loop through all files in it
        for subdirfile in "$file"/*; do
            # do something with the file
            echo "Processing file: $subdirfile"
            python scripts/presplit_sentences_json.py "$subdirfile" wikipedia/wikidump_lines.json
        done

    else

        # if it is a file, do something with it
        echo "Processing file: $file"

    fi
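    # NOTE: this break stops the loop after the first entry under text/;
    # remove it to process every subdirectory.
    break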
done

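Processing takes quite a while, around 2 to 4 hours. When it finishes, a file of roughly 18 GB appears under the wikipedia folder, and the data is ready.
Now training can start. scripts/pretrain_gpt2.sh looks roughly like this: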
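
As a rough illustration of the "many shards into one JSON file" step, here is a minimal Python sketch. It is not the repository's presplit_sentences_json.py: it does no sentence splitting and only concatenates the WikiExtractor shards into one JSON-lines file. The function name merge_wiki_json and the choice to skip pages with empty text are my own assumptions for illustration.

```python
import json
from pathlib import Path


def merge_wiki_json(text_dir, output_path):
    """Concatenate every WikiExtractor shard under text_dir into one JSON-lines file.

    output_path must live outside text_dir, otherwise it would be read as a shard.
    Returns the number of documents written.
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # WikiExtractor writes shards such as text/AA/wiki_00; sort for a stable order.
        for shard in sorted(Path(text_dir).rglob("*")):
            if not shard.is_file():
                continue
            with open(shard, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    doc = json.loads(line)  # one JSON object per line
                    # Assumption: pages with empty text are not useful for training.
                    if not doc.get("text", "").strip():
                        continue
                    out.write(json.dumps(doc, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

A call such as merge_wiki_json("text", "wikipedia/wikidump_lines.json") would mirror the paths used in the shell script, assuming the wikipedia folder already exists.
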

#! /bin/bash

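# Runs a GPT2-small sized model (12 layers, hidden size 768, 12 heads).
# The "345m" in the checkpoint paths below is only a directory name.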

RANK=0
WORLD_SIZE=1

CUDA_VISIBLE_DEVICES=0 python pretrain_gpt2.py \
       --num-layers 12 \
       --hidden-size 768 \
       --num-attention-heads 12 \
       --batch-size 8 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --train-iters 320000 \
       --save checkpoints/gpt2_345m \
       --load checkpoints/gpt2_345m \
       --resume-dataloader \
       --train-data wikipedia \
       --lazy-loader \
       --tokenizer-type GPT2BPETokenizer \
       --cache-dir cache \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --fp16 \
       --save-interval 5000


set +x
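
To get a feel for what this configuration amounts to, the sketch below estimates the parameter count and the number of tokens processed. It assumes a standard GPT-2 style layer layout with tied input/output embeddings and a vocabulary of 50304 entries (the 50257-token GPT-2 BPE vocabulary padded to a multiple of 128); both function names are hypothetical helpers, not part of Megatron-LM.

```python
def gpt2_param_count(num_layers, hidden_size, vocab_size, max_positions):
    """Approximate parameter count of a GPT-2 style decoder with tied embeddings."""
    h = hidden_size
    # Per layer: QKV (3h*h + 3h), attention output (h*h + h),
    # MLP (4h*h + 4h and 4h*h + h), and two LayerNorms (2 * 2h).
    per_layer = 12 * h * h + 13 * h
    embeddings = vocab_size * h + max_positions * h
    final_layernorm = 2 * h
    return num_layers * per_layer + embeddings + final_layernorm


def tokens_seen(batch_size, seq_length, train_iters):
    """Number of tokens processed over the whole run on a single GPU."""
    return batch_size * seq_length * train_iters


# Values taken from the script above; vocab_size=50304 is an assumption.
small = gpt2_param_count(num_layers=12, hidden_size=768,
                         vocab_size=50304, max_positions=512)
print(f"parameters: {small / 1e6:.1f}M")                    # parameters: 124.1M
print(f"tokens: {tokens_seen(8, 512, 320000) / 1e9:.2f}B")  # tokens: 1.31B
```

So the full 320000 iterations would cover about 1.31 billion tokens; a one-day run on a single card may stop well before that.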

Run bash scripts/pretrain_gpt2.sh to start training.
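
The learning-rate flags in the script (--lr 0.00015, --lr-decay-style cosine, --warmup .01) describe a warmup followed by cosine decay. The sketch below is a generic version of such a schedule, assuming that warmup is the fraction of training iterations spent on linear warmup; it is not necessarily the exact formula used by Megatron-LM's scheduler, and lr_at is a hypothetical helper.

```python
import math


def lr_at(step, base_lr=0.00015, train_iters=320000, warmup=0.01):
    """Linear warmup to base_lr, then cosine decay to zero (generic schedule)."""
    warmup_iters = round(warmup * train_iters)
    if step < warmup_iters:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_iters
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_iters) / (train_iters - warmup_iters)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 3200 (1% of 320000) and falls smoothly toward zero at the last iteration.
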
After training, bash scripts/generate_text.sh generates text, and bash scripts/eval_gpt2.sh measures the model's perplexity (ppl) on wiki103. One problem shows up here: during training the vocabulary is padded so that its size is divisible by 128, but the generation and evaluation code does not apply that padding. The code therefore has to be modified to pad the vocabulary where the tokenizer is prepared. For example, in generate_samples.py, add the padding block marked "Added" in the listing below:

def prepare_tokenizer(args):

    tokenizer_args = {
        'tokenizer_type': args.tokenizer_type,
        'corpus': None,
        'model_path': args.tokenizer_path,
        'vocab_size': args.vocab_size,
        'model_type': args.tokenizer_model_type,
        'cache_dir': args.cache_dir}
    tokenizer = make_tokenizer(**tokenizer_args)

    args.tokenizer_num_tokens = tokenizer.num_tokens
    args.tokenizer_num_type_tokens = tokenizer.num_type_tokens
    args.eod_token = tokenizer.get_command('eos').Id

    after = tokenizer.num_tokens
    # Padding to the model-parallel size alone is covered by the block below.
    # while after % mpu.get_model_parallel_world_size() != 0:
    #     after += 1

    # Added: pad the vocabulary the same way training does, so the embedding
    # size built here matches the one stored in the checkpoint.
    multiple = args.make_vocab_size_divisible_by * \
                   mpu.get_model_parallel_world_size()
    while (after % multiple) != 0:
        after += 1

    args.vocab_size = after
    print("prepare tokenizer done", flush=True)

    return tokenizer

With that change, generation and evaluation run.
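
The padding rule itself is simple arithmetic. The following sketch reproduces it in isolation, assuming the standard GPT-2 BPE vocabulary of 50257 tokens; pad_vocab_size is a hypothetical helper written for this post, not a function from the repository.

```python
def pad_vocab_size(num_tokens, divisible_by=128, model_parallel_size=1):
    """Round num_tokens up to the next multiple of divisible_by * model_parallel_size."""
    multiple = divisible_by * model_parallel_size
    return ((num_tokens + multiple - 1) // multiple) * multiple


print(pad_vocab_size(50257))                         # 50304
print(pad_vocab_size(50257, model_parallel_size=2))  # 50432
```

If generation builds the model with 50257 rows while the checkpoint holds 50304, the embedding shapes differ, which is why the padding has to be applied in both places.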

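I entered the prompt "hangzhou is a beautiful city", but the output placed it in jiangsu province, which shows that the model mixes up factual answers; its capacity is fairly small, after all. However, I also ran the same prompt once early in training, and the answer was correct: zhejiang province. It got worse as training went on, so my guess is that the model overfit.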

To run the model more efficiently, you can also use scripts/pretrain_gpt2_model_parallel.sh. My settings are as follows:

#! /bin/bash

# Runs the "345M" parameter model

GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt2.py \
       --model-parallel-size 2 \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --batch-size 8 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 320000 \
       --save checkpoints/gpt2_mnvbc_mp2 \
       --load checkpoints/gpt2_mnvbc_mp2 \
       --resume-dataloader \
       --train-data wikipedia mnvbc \
       --lazy-loader \
       --tokenizer-type GPT2BPETokenizer \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --no-load-optim \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --fp16 \
       --save-interval 5000
set +x

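This way you can train a larger model with a larger batch. One catch is that inference then has to be run the same way, with two GPUs, which is cumbersome. So I wrote my own code to merge the model-parallel checkpoint into a single model; after merging, the earlier generate_text.sh and eval_gpt2.sh work as before. The ppl I measured on wiki103 is 27, and since the model was not even fully trained, that is a pretty good result.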
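
For readers unfamiliar with the ppl number, here is a minimal sketch of how perplexity relates to per-token log-probabilities. It uses the textbook definition (exponential of the mean negative log-likelihood); the evaluation script may normalize differently, for example per original word rather than per BPE token, so treat this as an illustration only. The function name perplexity is my own.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Under this definition, a ppl of 27 means the model is on average as uncertain as if it were choosing uniformly among 27 tokens at every position.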