Firefly-2b6, a Chinese Conversational Large Language Model, Is Open-Sourced, Trained on 2.1 Million Samples

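In the article "Firefly (流萤): A Chinese Conversational Large Language Model", we introduced our work on the Firefly (流萤) project and shared the firefly-1b4 model we trained. It was the first model open-sourced by the Firefly project. Although it performed reasonably well, there was still considerable room for improvement in both the training data and the parameter count.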

Building on the firefly-1b4 experiments, we therefore cleaned the training data and enlarged it to 2.1 million samples, then used this data to train the firefly-2b6 model.

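In this article we share and introduce that model. Compared with firefly-1b4, firefly-2b6 makes considerable progress in code generation, and it also improves noticeably in classical Chinese poetry generation, couplets, essay writing, and open-domain generation.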

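The training configurations of firefly-1b4 and firefly-2b6 are shown in the table below. firefly-2b6 is trained more thoroughly in terms of both training data volume and number of training steps.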

Parameter	firefly-1b4	firefly-2b6
batch size	16	8
learning rate	3e-5	3e-5
warmup steps	3000	3000
lr schedule	cosine	cosine
max length	512	512
training steps	90k	260k
training set size	1.6 million	2.1 million
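
To make the learning-rate rows of the table concrete, here is a minimal sketch of a cosine schedule with warmup. The table only says "cosine" with 3000 warmup steps. The linear warmup from zero and the decay all the way to zero are assumptions borrowed from the common convention, and `lr_at_step` is a hypothetical helper name, not something from the Firefly code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_steps=3000):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Assumption: linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumption: cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the rate at a few points of a 260k-step run (the firefly-2b6 column).
for step in (0, 1500, 3000, 131_500, 260_000):
    print(step, lr_at_step(step, total_steps=260_000))
```

Under these assumptions the rate peaks at 3e-5 at step 3000 and then falls smoothly for the rest of the run. The actual training script may differ in its minimum rate or warmup shape.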

Project repository:

https://github.com/yangjianxin1/Firefly

Links to the model weights are given at the end of the article.

Using the Model

The model can be run with the following code:

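from transformers import BloomTokenizerFast, BloomForCausalLM

device = 'cuda'
path = 'YeungNLP/firefly-2b6'

# Load the tokenizer and weights, then switch to inference mode on the GPU.
tokenizer = BloomTokenizerFast.from_pretrained(path)
model = BloomForCausalLM.from_pretrained(path)
model.eval()
model = model.to(device)

text = input('User：')
while True:
    # Assumption: the special tokens were stripped from the published listing;
    # '<s>' and '</s>' (the BLOOM tokenizer's BOS/EOS strings) are restored here.
    text = '<s>{}</s></s>'.format(text)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    # Sample up to 250 new tokens with nucleus sampling and a repetition penalty.
    outputs = model.generate(input_ids, max_new_tokens=250, do_sample=True, top_p=0.7, temperature=0.35,
                             repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
    rets = tokenizer.batch_decode(outputs)
    # Drop the echoed prompt and any remaining end-of-sequence markers.
    output = rets[0].strip().replace(text, "").replace('</s>', "")
    print("Firefly：{}".format(output))
    text = input('User：')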
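
The `temperature=0.35` and `top_p=0.7` arguments passed to `model.generate` above control how the next token is sampled. The sketch below illustrates the idea in plain Python on a toy list of logits. It is a simplified illustration, not the transformers implementation, and `top_p_filter` is a hypothetical helper name. It does not cover `repetition_penalty`.

```python
import math

def top_p_filter(logits, temperature=0.35, top_p=0.7):
    """Return {token_index: renormalized probability} for the kept tokens."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this.
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.0]
print(top_p_filter(toy_logits))                   # settings from the article
print(top_p_filter(toy_logits, temperature=1.0))  # unscaled, for comparison
```

With a low temperature the probability mass concentrates on the top token, so fewer candidates survive the top-p cut and the output becomes more deterministic. Raising either value makes generation more varied.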