This is a Chinese-English bilingual Transformer model. The repository contains the model configuration and training workflow; training loads the parameters defined in `params/config.json`. The training script `train.py` parses command-line arguments, including the dataset root directory, the number of training epochs, the batch size, the checkpoint interval, and the checkpoint directory.
Model Download Options (Choose One):
- Visit HuggingFace to access Files and versions
- Run `params/download.py` to download the parameters
Demo Video: bilibili
Training dataset sources are listed in the Model Card section of the HuggingFace download page.
License: The code is released under the Apache-2.0 license. Please credit the source when using it.
- 📊 Device Selection: The code defaults to training on CUDA.
- 🌐 Performance Optimization: `dtype=torch.bfloat16` is enabled to accelerate training and reduce memory usage. Ensure your hardware supports this feature (see the snippet below for a quick check).
- 🤖 Language Support: The model supports training in Chinese and English. The BBPE tokenizer was trained only on Chinese and English text, so OOV (out-of-vocabulary) issues are rare for these two languages but may occur for others.
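A quick way to check bfloat16 support on the current GPU, using PyTorch's built-in helper (falling back to float16 here is an illustrative choice, not something `train.py` necessarily does):

```python
import torch

# Prefer bfloat16 when the GPU supports it; otherwise fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Using dtype: {dtype}")
```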
📌 Training Guide
To train this Transformer model, follow these steps:
(1). Prepare Dataset:
Place the datasets in the designated root directory. Files should be text documents in Chinese, English, or a mix of both. The format should match the model's input requirements: preferably pre-tokenized token_ids stored as a torch.Tensor (a torch.Tensor saves memory compared to a Python list, whose integer elements default to 64-bit precision).
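A minimal sketch of this preprocessing step, assuming a tokenizer with an `encode` method (how the repository's BBPE tokenizer is actually loaded is not shown here; adapt accordingly):

```python
import torch

def save_token_ids(texts, tokenizer, out_path):
    """Pre-tokenize raw documents and save the ids as one compact tensor."""
    ids = []
    for text in texts:
        ids.extend(tokenizer.encode(text))
    # int32 uses 4 bytes per token; a Python list of ints costs far more,
    # since each element is a full 64-bit-plus Python object.
    tensor = torch.tensor(ids, dtype=torch.int32)
    torch.save(tensor, out_path)
    return tensor
```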
(2). Install Dependencies:
```bash
pip install -r requirements.txt
pip install .
```
(3). Run Training Script:
```bash
python train.py \
    --train_type=<seq|sft|dpo> \
    --data_root_path=/path/to/dataset \
    --n_epoch=5 \
    --batch_size=8 \
    --max_lr=2e-4 \
    --n_iter_ckpt=10000 \
    --ckpt_dir=checkpoints
```
Parameter Explanation:
- `--train_type`: Training type (`seq`, `sft`, or `dpo`)
- `--data_root_path`: Dataset root directory
- `--n_epoch`: Total number of training epochs
- `--batch_size`: Batch size
- `--n_iter_step`: Number of batches per training step
- `--warning_step`: Number of warmup steps
- `--max_lr`: Maximum learning rate (warmup + cosine decay schedule; see the sketch below)
- `--n_iter_ckpt`: Checkpoint saving interval, in iterations
- `--ckpt_dir`: Checkpoint directory
- `--resume_dir`: Checkpoint path to resume training from
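A minimal sketch of the warmup + cosine decay schedule mentioned above (the linear warmup shape and the step accounting are assumptions; the exact schedule lives in `train.py`):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```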
Training logs are saved to `train_log.txt`. Checkpoints are stored in the specified directory for resuming training or for evaluation.
👉 Usage Guide
(1). Chatting with the Model:
Run `chat.py`, or use the streaming/non-streaming interfaces directly:
Streaming Output:
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

history = []
while True:
    query = input(">> ")
    if query == "!exit":
        break

    # Print only the newly generated suffix on each streaming iteration.
    response_size = 0
    for response, history in model.stream_generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    ):
        print(response[response_size:], end="", flush=True)
        response_size = len(response)
```
Non-streaming Output:
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

history = []
while True:
    query = input(">> ")
    if query == "!exit":
        break

    # Generate the complete response in a single call, then print it.
    response = model.generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    )
    print(response)
```
(2). Retrieval-Augmented Generation (RAG):
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

query = input(">> ")  # the question to answer with retrieved context
retrieved_content = model.retrieve_generate(
    query=query,
    retrieve_top_k=5,
    temperature=0.6,
    top_k=30,
    top_p=0.95
)
print(retrieved_content)
```
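Note that `retrieve_top_k` (presumably the number of retrieved passages supplied to the model alongside the query) is distinct from the sampling parameter `top_k`.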
📌 Model Specifications
This model is a 24-layer Transformer whose configuration is defined in `config.json`, totaling approximately 1.0 billion (1.0B) parameters.
Key Design Choices:
- Weight tying between the embedding layer and the final linear layer (a standard parameter-saving technique for small models; a sketch is shown below)
- Embedding layer optimization: without weight tying, a 10,000-word vocabulary would consume roughly 102M parameters (0.1B)
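A minimal sketch of weight tying in PyTorch (the dimensions here are illustrative, not the values from `config.json`):

```python
import torch.nn as nn

vocab_size, d_model = 32768, 1024  # illustrative sizes, not the real config
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the weights: both layers now share a single (vocab_size, d_model)
# tensor, so the vocabulary parameters are counted once instead of twice.
lm_head.weight = embedding.weight
```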
Limitations:
- May struggle with complex language phenomena due to its small parameter count
- Prone to overfitting on specialized datasets
- Limited multilingual capabilities
Advantages:
- Runs efficiently on lower-spec hardware
- Shorter training time compared to larger models
Training Pipeline: The model has completed the pre-training + SFT (Supervised Fine-Tuning) + DPO (Direct Preference Optimization) workflow. All corresponding training code is included in the repository.
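For reference, the standard DPO objective (Rafailov et al., 2023), of which the `dpo` training type is presumably an implementation, is:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference model (typically the SFT checkpoint), $\sigma$ is the logistic sigmoid, and $\beta$ controls how far the policy may drift from the reference.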