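Sharded model saving with HuggingFace transformers

Posted on 2023-06-10

Model repositories built with HuggingFace transformers often include a pytorch_model.bin.index.json file, a table that maps each parameter name to the .bin file that stores it. It is especially common with today's large models. This post shows how to shard a model's weights when saving it with HuggingFace transformers.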

pytorch_model.bin.index.json

{
  "metadata": {
    "total_size": 8396800
  },
  "weight_map": {
    "liner1.bias": "pytorch_model-00001-of-00002.bin",
    "liner1.weight": "pytorch_model-00001-of-00002.bin",
    "liner2.bias": "pytorch_model-00002-of-00002.bin",
    "liner2.weight": "pytorch_model-00002-of-00002.bin"
  }
}

Layout of the saved weights:

saved_pt
├── config.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
└── pytorch_model.bin.index.json
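
To make the role of the index file concrete, here is a minimal sketch that reads the weight_map shown above and groups the parameter names by shard file. It uses only the standard library. The index is embedded as a string purely for illustration (in practice it would be read from pytorch_model.bin.index.json inside saved_pt), and the variable names are made up for this example.

```python
import json
from collections import defaultdict

# The index shown above, embedded as a string so the sketch is self-contained.
INDEX_TEXT = """
{
  "metadata": {"total_size": 8396800},
  "weight_map": {
    "liner1.bias": "pytorch_model-00001-of-00002.bin",
    "liner1.weight": "pytorch_model-00001-of-00002.bin",
    "liner2.bias": "pytorch_model-00002-of-00002.bin",
    "liner2.weight": "pytorch_model-00002-of-00002.bin"
  }
}
"""

index = json.loads(INDEX_TEXT)

# Invert the parameter -> file mapping into file -> [parameters].
params_by_shard = defaultdict(list)
for param_name, shard_file in index["weight_map"].items():
    params_by_shard[shard_file].append(param_name)

for shard_file in sorted(params_by_shard):
    print(shard_file, params_by_shard[shard_file])
print("total bytes:", index["metadata"]["total_size"])
```

A total_size of 8396800 bytes would be consistent with two 1024x1024 linear layers with biases stored as float32, since 2 x (1024 x 1024 + 1024) x 4 = 8396800. The excerpt does not show the model definition, so that reading is only a guess.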


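Regex notes

Posted on 2020-12-13, updated on 2023-06-10

Regular expressions really do slip away if I go a while without using them, so here are a few examples from day-to-day work, kept for future reference.

Extracting the words inside brackets

This comes up often in sequence labeling tasks.
text: [以后][在全国范围][普遍][推广]
code: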

import re
text = "[以后][在全国范围][普遍][推广]"
words = re.findall(r'\[(.*?)\]',text)
print(words)

out:

['以后', '在全国范围', '普遍', '推广']

Finding the lines of a multi-line file that contain none of str1, str2, str3

Note: the regex package is recommended as a replacement for re.

import regex as re
p = r'^((?!str1|str2|str3).)*$'
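
As a quick check of how this pattern behaves, the sketch below applies it line by line. It assumes the standard-library re module instead of regex so that it runs without extra installs, and the sample text is invented for the example.

```python
import re

# Negative lookahead: before consuming each character, refuse to advance if
# one of the forbidden words starts at that position. A line therefore
# matches only when none of the words occurs anywhere in it.
pattern = re.compile(r'^((?!str1|str2|str3).)*$')

text = """keep this line
this one has str1 inside
another clean line
str3 at the start"""

kept = [line for line in text.splitlines() if pattern.match(line)]
print(kept)
```

Splitting into lines first keeps the example simple; matching against the whole file in one call would need the re.MULTILINE flag so that ^ and $ anchor at every line.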


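pandas examples

Posted on 2020-12-12, updated on 2023-06-10

Notes on some operations I ran into while cleaning data.

Cartesian product

Problem 1

Suppose we have Table 1 below and need to generate Table 2 from it. How should we do that?
Explanation: pair up every user_id in Table 1 with every item_id, set label to 0 for each combination that does not appear in Table 1, and merge the result with Table 1. (Only the contents need to match Table 2; the row order does not.)

Table 1: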

user_id	item_id	label
1	1	1
2	2	1
3	3	1

Table 2:

user_id	item_id	label
1	1	1
2	2	1
3	3	1
1	2	0
1	3	0
2	1	0
2	3	0
3	1	0
3	2	0
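
One possible way to build Table 2, sketched here as an illustration and not necessarily the approach the full post takes: construct every (user_id, item_id) pair with MultiIndex.from_product, then reindex Table 1 onto those pairs and fill the missing labels with 0. The variable names table1, all_pairs, and table2 are chosen for this example.

```python
import pandas as pd

table1 = pd.DataFrame(
    {"user_id": [1, 2, 3], "item_id": [1, 2, 3], "label": [1, 1, 1]}
)

# Every (user_id, item_id) combination, i.e. the Cartesian product.
all_pairs = pd.MultiIndex.from_product(
    [table1["user_id"].unique(), table1["item_id"].unique()],
    names=["user_id", "item_id"],
)

# Reindex onto the full set of pairs; pairs missing from Table 1 get label 0.
table2 = (
    table1.set_index(["user_id", "item_id"])
    .reindex(all_pairs, fill_value=0)
    .reset_index()
)
print(table2)
```

The rows come out in product order instead of the order shown in Table 2, which the problem statement allows.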


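Printing the learning rate in real time in TensorFlow 2.x

Posted on 2020-10-28, updated on 2023-06-10

In TensorFlow 2.x, the following line returns the current learning rate.

optimizer._decayed_lr(tf.float32).numpy()

The code below shows how to print the current learning rate in real time in TensorFlow 2.x.
It uses the AdamW optimizer from BERT as the example; the model-building part is omitted.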
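
The post's own AdamW code is not part of this excerpt. As a stand-in, here is a minimal sketch of a related approach that avoids _decayed_lr, which is a private method (note the leading underscore) and may not be available in every TensorFlow release. It assumes the learning rate is driven by a schedule object; the linear decay from 1e-3 to 0 over 100 steps is a hypothetical setting chosen for illustration.

```python
import tensorflow as tf

# Hypothetical schedule: linear decay from 1e-3 to 0 over 100 steps.
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-3,
    decay_steps=100,
    end_learning_rate=0.0,
    power=1.0,
)

# A schedule is callable: pass it a step number to get that step's learning rate.
for step in (0, 50, 100):
    print(step, float(schedule(step)))
```

Inside a training loop, the same call with the optimizer's own step counter, schedule(optimizer.iterations), would give the rate used for the current step.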


TensorFlow 2.x syntactic sugar

Posted on 2020-10-21, updated on 2023-06-10

tf.where

tf.where(condition, x=None, y=None, name=None)

If both x and y are None, tf.where returns the indices of the elements that satisfy the condition.

a = tf.Variable([[1,2,3,4],[3,4,2,1]])
tf.where(a>3)

out:

<tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[0, 3],
       [1, 1]])>

If both x and y are given, the result takes its values from x where the condition holds and from y everywhere else. (Very handy.)

a = tf.Variable([[1,2,3,4],[3,4,2,1]])
b = tf.zeros_like(a)
tf.where(a>3, b, a)

out:

<tf.Tensor: shape=(2, 4), dtype=int32, numpy=
array([[1, 2, 3, 0],
       [3, 0, 2, 1]], dtype=int32)>


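TensorFlow 2.x model deployment

Posted on 2020-08-22, updated on 2023-06-10

Once a model is trained, it usually has to be put to work in a production environment. The most common route is TensorFlow Serving, which deploys the model on a server so that clients can query it.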


CheckpointManager

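Posted on 2020-03-12, updated on 2023-06-10

A quick note on how to combine CheckpointManager with a Callback to save checkpoints at a fixed interval (here, after every training batch) while keeping only the N most recent ones.

import tensorflow as tf

N = 5
# Build the model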
class Model(tf.keras.Model):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        self.d = tf.keras.layers.Dense(1, kernel_initializer=tf.keras.initializers.ones())

    @tf.function
    def call(self, x, training=True, mask=None):
        return self.d(x)


# Define the callback that saves a checkpoint after every training batch
class Save_Callbacks(tf.keras.callbacks.Callback):
    def __init__(self, checkpoint_manager):
        super().__init__()
        self.checkpoint_manager = checkpoint_manager

    def on_train_batch_end(self, batch, logs=None):
        super().on_train_batch_end(batch, logs)
        self.checkpoint_manager.save()

        
model = Model()
model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer='SGD')
checkpoint = tf.train.Checkpoint(model=model, optimizer=model.optimizer)
checkpoint_manager = tf.train.CheckpointManager(checkpoint, 'save', max_to_keep=N)

model.fit(x=tf.ones((100, 3)), y=tf.constant(tf.ones((100, 1))), batch_size=2,
          callbacks=[Save_Callbacks(checkpoint_manager)])
model.reset_metrics()

# Restore the model from the most recently saved checkpoint
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint('save'))
print(model(tf.ones((2, 3))))

SGD

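Posted on 2020-03-04, updated on 2023-06-10

Why is the direction of steepest local descent the negative gradient? How should we understand the SGD algorithm?

SGD, stochastic gradient descent, is the optimization algorithm we use most often in machine learning. Here I briefly write down my own understanding of it. Many people know how to use it yet never quite grasp the principle behind it. I think the trouble lies in how partial derivatives (derivatives) are understood: the understanding is still stuck at "derivative = slope". What we need to internalize is that the derivative is a linear transformation.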
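
The question in the title can be made precise with a first-order expansion. The sketch below is a standard argument and is not taken from the post; the symbols f, x, d, and ε are notation introduced here, with f assumed differentiable at x and d a unit vector.

```latex
\begin{aligned}
f(x + \varepsilon d) &= f(x) + \varepsilon\, \nabla f(x)^{\top} d + o(\varepsilon), \qquad \|d\| = 1, \\
\nabla f(x)^{\top} d &\ge -\|\nabla f(x)\|\,\|d\| = -\|\nabla f(x)\| \quad \text{(Cauchy--Schwarz)}, \\
\text{with equality iff}\quad d &= -\frac{\nabla f(x)}{\|\nabla f(x)\|} \quad (\nabla f(x) \neq 0).
\end{aligned}
```

The map that sends d to ∇f(x)ᵀd is the linear transformation in question: to first order it gives the change in f for a small step along d, and among all unit directions that change is most negative along the negative gradient.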


Keras automatic hyperparameter tuner

Posted on 2020-01-30, updated on 2023-06-10

Placeholder, to be updated.

https://keras-team.github.io/keras-tuner/
https://github.com/keras-team/keras-tuner
https://blog.tensorflow.org/2020/01/hyperparameter-tuning-with-keras-tuner.html?linkId=81371017

Create-TFRecord

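Posted on 2020-01-14, updated on 2023-06-10

TFRecord is an efficient data storage format. It is especially useful for large datasets that cannot be read all at once: we can store the data as TFRecord files and read them back afterwards, which speeds up moving, reading, and processing the data.
Small datasets can be read and processed directly with the tf.data API.

In a TFRecord, each sample (example) is stored as a dictionary.

The main data types are:

int64: tf.train.Feature(int64_list=tf.train.Int64List(value=input))
float32: tf.train.Feature(float_list=tf.train.FloatList(value=input))
string: tf.train.Feature(bytes_list=tf.train.BytesList(value=input))
Note: input must be a list (a vector).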
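
The three feature types above can be tried out in a few lines. This is a minimal sketch, not the post's own code: the helper functions and the feature names token_ids, weights, and text are hypothetical, and the example is round-tripped through its serialized bytes in memory instead of being written to a TFRecord file.

```python
import tensorflow as tf

def int64_feature(values):
    # values must be a list of ints
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def float_feature(values):
    # values must be a list of floats
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def bytes_feature(values):
    # values must be a list of bytes objects
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

# One sample stored as a dictionary of named features.
example = tf.train.Example(features=tf.train.Features(feature={
    "token_ids": int64_feature([101, 2023, 102]),
    "weights": float_feature([1.0, 1.0, 0.0]),
    "text": bytes_feature(["hello".encode("utf-8")]),
}))

# Serialize to bytes (what a TFRecord file stores) and parse it back.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)
print(list(restored.features.feature["token_ids"].int64_list.value))
print(list(restored.features.feature["text"].bytes_list.value))
```

Strings have to be encoded to bytes before they go into a BytesList, which is why the text feature calls encode("utf-8").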

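Here is a common example from NLP.

There are 10 sentences (sentence), each with 128 token_ids.
The 10 labels (label), one for each sentence.
The token weights (mask) for each sentence.
Each sentence after Embedding, as a matrix and as a tensor (the two are the same thing; they are listed separately only to introduce two different storage methods later).

So how do we convert all of this into TFRecord?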
