
Original author: 图龙网络科技 | Published: 2023-09-23 | 234.84K reads

Hunyuan-DiT: Tencent Hunyuan's powerful multi-resolution diffusion transformer with fine-grained Chinese understanding

Posted by 太极混元 5 months ago | Category: Language models

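Tencent presents Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully designed the transformer structure, text encoder, and positional encoding. We also built a complete data pipeline from scratch to update and evaluate data for iterative model optimization.

For fine-grained language understanding, we trained a multimodal large language model (MLLM) to refine the captions of the images. Finally, Hunyuan-DiT can hold multi-turn multimodal dialogues with users, generating and refining images according to the context. Under our carefully designed holistic human evaluation protocol, with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models.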

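Hunyuan-DiT key features

Chinese-English bilingual DiT architecture

Hunyuan-DiT is a diffusion model in the latent space, as shown in the figure below. Following the latent diffusion model, we use a pretrained variational autoencoder (VAE) to compress images into a low-dimensional latent space and train a diffusion model to learn the data distribution in that space. Our diffusion model is parameterized with a transformer. To encode the text prompts, we combine a pretrained bilingual (English and Chinese) CLIP encoder with a multilingual T5 encoder.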

[Figure: Hunyuan-DiT architecture]
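
To make the VAE compression step concrete, the sketch below computes the latent tensor shape for a given image size. It assumes an 8x spatial downsampling factor and 4 latent channels, the usual configuration of SDXL-style VAEs such as the sdxl-vae-fp16-fix listed in the download table; `latent_shape` is a hypothetical helper for illustration and is not part of the repo.

```python
def latent_shape(height: int, width: int,
                 downsample: int = 8, channels: int = 4) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for an RGB image.

    The defaults (8x downsampling, 4 channels) are assumptions based on
    SDXL-style VAEs, not values read from the Hunyuan-DiT code.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the downsampling factor")
    return channels, height // downsample, width // downsample


c, h, w = latent_shape(1024, 1024)
print(c, h, w)                           # 4 128 128
print((3 * 1024 * 1024) / (c * h * w))   # 48.0, the reduction in element count
```

Under these assumptions the diffusion transformer works on 48 times fewer values than the raw RGB pixels.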

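Multi-turn Text2Image generation

Understanding natural language instructions and interacting with users over multiple turns are important for a text-to-image system. They help build a dynamic, iterative creation process that turns the user's idea into reality step by step. In this section, we describe how we give Hunyuan-DiT the ability to hold multi-turn conversations and generate images. We train an MLLM to understand the multi-turn user dialogue and output a new text prompt for image generation.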

[Figure: multi-turn Text2Image generation]
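
The control flow described above can be sketched roughly as follows. Everything here is a stand-in: `rewrite_prompt` plays the role of the MLLM and `generate_image` the role of the diffusion model, and both are hypothetical stubs so that the loop runs without any model weights.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Keeps the user's messages across turns."""
    turns: list = field(default_factory=list)


def rewrite_prompt(history: list) -> str:
    # Hypothetical stand-in for the MLLM: a real system would read the whole
    # dialogue and write a fresh prompt; here we simply join the turns.
    return ", ".join(history)


def generate_image(prompt: str) -> str:
    # Hypothetical stand-in for the text-to-image model.
    return f"<image for: {prompt}>"


def chat_turn(conv: Conversation, user_message: str) -> str:
    """One round: record the message, rebuild the prompt, generate an image."""
    conv.turns.append(user_message)
    return generate_image(rewrite_prompt(conv.turns))


conv = Conversation()
print(chat_turn(conv, "a cat on a sofa"))
print(chat_turn(conv, "make it a watercolor painting"))
```

The point of the sketch is only that each turn regenerates the image from a prompt derived from the full dialogue context, not from the latest message alone.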

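Comparisons

To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, we constructed a four-dimensional test set covering text-image consistency, absence of AI artifacts, subject clarity, and aesthetics. More than 50 professional evaluators performed the evaluation.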

Model	Open source	Text-image consistency (%)	Excluding AI artifacts (%)	Subject clarity (%)	Aesthetics (%)	Overall (%)
SDXL	✔	64.3	60.6	91.1	76.3	42.7
PixArt-α	✔	68.3	60.9	93.2	77.5	45.5
Playground 2.5	✔	71.9	70.8	94.9	83.3	54.3
SD 3	✘	77.1	69.3	94.6	82.5	56.7
MidJourney v6	✘	73.5	80.2	93.5	87.2	63.3
DALL-E 3	✘	83.9	80.3	96.5	89.4	71.0
Hunyuan-DiT	✔	74.2	74.3	95.4	86.6	59.0
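
As a quick check of the open-source claim, the snippet below loads the table values and finds the best open-source model per column. The numbers are transcribed from the table above (the SDXL row is read as 64.3, 60.6, 91.1, 76.3, 42.7); the helper names are illustrative only.

```python
# (model, open_source, consistency, no_artifacts, clarity, aesthetics, overall)
RESULTS = [
    ("SDXL",           True,  64.3, 60.6, 91.1, 76.3, 42.7),
    ("PixArt-α",       True,  68.3, 60.9, 93.2, 77.5, 45.5),
    ("Playground 2.5", True,  71.9, 70.8, 94.9, 83.3, 54.3),
    ("SD 3",           False, 77.1, 69.3, 94.6, 82.5, 56.7),
    ("MidJourney v6",  False, 73.5, 80.2, 93.5, 87.2, 63.3),
    ("DALL-E 3",       False, 83.9, 80.3, 96.5, 89.4, 71.0),
    ("Hunyuan-DiT",    True,  74.2, 74.3, 95.4, 86.6, 59.0),
]


def best_open_source(column: int) -> str:
    """Name of the open-source model with the highest score in a column."""
    candidates = [row for row in RESULTS if row[1]]
    return max(candidates, key=lambda row: row[column])[0]


def overall_rank(model: str) -> int:
    """1-based rank of a model by the overall score, across all models."""
    ordered = sorted(RESULTS, key=lambda row: row[6], reverse=True)
    return [row[0] for row in ordered].index(model) + 1


print([best_open_source(col) for col in range(2, 7)])
print(overall_rank("Hunyuan-DiT"))  # 3
```

Hunyuan-DiT leads the open-source models in every column, and ranks third overall behind the closed-source DALL-E 3 and MidJourney v6.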

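📜 Requirements

This repo consists of DialogGen (a prompt enhancement model) and Hunyuan-DiT (a text-to-image model).

The following table shows the requirements for running the models (batch size = 1):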

Model	--load-4bit (DialogGen)	GPU peak memory	GPU
DialogGen + Hunyuan-DiT	✘	32G	A100
DialogGen + Hunyuan-DiT	✔	22G	A100
Hunyuan-DiT	-	11G	A100
Hunyuan-DiT	-	14G	RTX3090/RTX4090

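An NVIDIA GPU with CUDA support is required.
- We have tested V100 and A100 GPUs.
- Minimum: the minimum GPU memory required is 11GB.
- Recommended: we recommend using a GPU with 32GB of memory for better generation quality.
Tested operating system: Linux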
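
The memory table can be read as a simple lookup. The sketch below lists which setups fit a given amount of GPU memory; the thresholds are the A100 figures from the table above, and `runnable_setups` is a hypothetical helper, not a repo utility. Note that the table reports 14G rather than 11G for Hunyuan-DiT alone on an RTX3090/RTX4090, so treat the lowest threshold as optimistic on those cards.

```python
# Peak-memory thresholds in GB, taken from the A100 rows of the table.
SETUPS = [
    (32, "DialogGen + Hunyuan-DiT"),
    (22, "DialogGen + Hunyuan-DiT with --load-4bit"),
    (11, "Hunyuan-DiT only"),
]


def runnable_setups(gpu_memory_gb: float) -> list:
    """Return the setups whose peak memory fits into the given GPU memory."""
    return [name for needed, name in SETUPS if gpu_memory_gb >= needed]


print(runnable_setups(24))
# ['DialogGen + Hunyuan-DiT with --load-4bit', 'Hunyuan-DiT only']
```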

🛠️ Dependencies and installation

Begin by cloning the repository:

git clone https://github.com/tencent/HunyuanDiT
cd HunyuanDiT

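Installation guide for Linux:

We provide an environment.yml file for setting up a Conda environment. Conda installation instructions are available in the Conda documentation.

We recommend CUDA versions 11.7 and 12.0+.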

# 1. Prepare conda environment
conda env create -f environment.yml

# 2. Activate the environment
conda activate HunyuanDiT

# 3. Install pip dependencies
python -m pip install -r requirements.txt

# 4. Install flash attention v2 for acceleration (requires CUDA 11.6 or above)
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3

Alternatively, you can use Docker to set up the environment.

# 1. Use the following link to download the docker image tar file.
# For CUDA 12
wget https://dit.hunyuan.tencent.com/download/HunyuanDiT/hunyuan_dit_cu12.tar
# For CUDA 11
wget https://dit.hunyuan.tencent.com/download/HunyuanDiT/hunyuan_dit_cu11.tar

# 2. Import the docker tar file and show the image meta information
# For CUDA 12
docker load -i hunyuan_dit_cu12.tar
# For CUDA 11
docker load -i hunyuan_dit_cu11.tar  
docker image ls
# 3. Run the container based on the image
docker run -dit --gpus all --init --net=host --uts=host --ipc=host --name hunyuandit --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged  docker_image_tag

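🧱 Download pretrained models

To download the model, first install huggingface-cli. (Detailed instructions are available in the Hugging Face documentation.)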

python -m pip install "huggingface_hub[cli]"

Then download the model using the following commands:

# Create a directory named 'ckpts' where the model will be saved, fulfilling the prerequisites for running the demo.
mkdir ckpts
# Use the huggingface-cli tool to download the model.
# The download time may vary from 10 minutes to 1 hour depending on network conditions.
huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.2 --local-dir ./ckpts
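
If you prefer to stay in Python, the same download can probably be done with `snapshot_download` from the `huggingface_hub` package that the previous step installs. This is a minimal sketch rather than a documented workflow of the repo; the wrapper name `download_checkpoints` is an assumption, and the import is done lazily so that defining the function does not require network access.

```python
REPO_ID = "Tencent-Hunyuan/HunyuanDiT-v1.2"
LOCAL_DIR = "./ckpts"


def download_checkpoints(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download a full model repository and return the local folder path."""
    # Imported here so the module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (starts a download that may take 10 minutes to 1 hour):
#     path = download_checkpoints()
```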


All models will be downloaded automatically. For more information about the model, visit the Hugging Face repository.

Model	#Params	Hugging Face download URL	Tencent Cloud download URL
mT5	1.6B	mT5	mT5
CLIP	350M	CLIP	CLIP
Tokenizer	-	Tokenizer	Tokenizer
DialogGen	7.0B	DialogGen	DialogGen
sdxl-vae-fp16-fix	83M	sdxl-vae-fp16-fix	sdxl-vae-fp16-fix
Hunyuan-DiT-v1.0	1.5B	Hunyuan-DiT-v1.0	Hunyuan-DiT-v1.0
Hunyuan-DiT-v1.1	1.5B	Hunyuan-DiT-v1.1	Hunyuan-DiT-v1.1
Hunyuan-DiT-v1.2	1.5B	Hunyuan-DiT-v1.2	Hunyuan-DiT-v1.2
Data demo	-	-	Data demo

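🚚 Training: data preparation

Refer to the commands below to prepare the training data.

Install dependencies

We provide an efficient data management library, IndexKits, which supports reading and managing hundreds of millions of samples during training. See the docs for more information.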
```
# 1 Install dependencies
cd HunyuanDiT
pip install -e ./IndexKits
```

Data download

Feel free to download the data demo.

# 2 Data download
wget -O ./dataset/data_demo.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip
unzip ./dataset/data_demo.zip -d ./dataset
mkdir ./dataset/porcelain/arrows ./dataset/porcelain/jsons

Data conversion

Create a CSV file for the training data with the fields listed in the table below.

Field	Required	Description	Example
`image_path`	Required	Image path	`./dataset/porcelain/images/0.png`
`text_zh`	Required	Text	青花瓷风格，一只蓝色的鸟儿站在蓝色的花瓶上，周围有白色的树木，背景是白色
`md5`	Optional	Image MD5 (Message Digest Algorithm 5)	`d41d8cd98f00b204e9800998ecf8427e`
`width`	Optional	Image width	`1024`
`height`	Optional	Image height	`1024`

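⚠️ Optional fields such as MD5, width, and height can be omitted. If omitted, the script below calculates them automatically. This can be time-consuming when processing large-scale training data.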
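
One way to produce such a CSV, and to precompute the optional `md5` column so the conversion script has less to do, is sketched below using only the standard library. It assumes the CSV starts with a header row that uses the field names from the table; `write_training_csv` and `file_md5` are hypothetical helpers, not part of the repo. Width and height are left out here because reading them needs an image library.

```python
import csv
import hashlib
from pathlib import Path


def file_md5(path) -> str:
    """MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_training_csv(pairs, csv_path) -> int:
    """Write (image_path, caption) pairs to a CSV and return the row count.

    Assumption: the header row uses the field names from the table above.
    """
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["image_path", "text_zh", "md5"])
        writer.writeheader()
        for image_path, caption in pairs:
            writer.writerow({
                "image_path": str(image_path),
                "text_zh": caption,
                "md5": file_md5(image_path),
            })
            count += 1
    return count
```

For example, `write_training_csv([("./dataset/porcelain/images/0.png", "青花瓷风格，一只小狗")], "./dataset/porcelain/csvfile/image_text.csv")` would write a one-row file at the path the conversion command expects. The example value `d41d8cd98f00b204e9800998ecf8427e` in the table is the MD5 of empty input, so a real image will give a different digest.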

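We use Arrow as the training data format, which offers a standard and efficient in-memory data representation. A conversion script is provided to convert CSV files to Arrow format.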

# 3 Data conversion 
python ./hydit/data_loader/csv2arrow.py ./dataset/porcelain/csvfile/image_text.csv ./dataset/porcelain/arrows 1

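Data selection and configuration file creation

We configure the training data through YAML files. In these files, you can set up standard data processing strategies for filtering, copying, deduplicating, and so on. For more details, see ./IndexKits.

For a sample file, please refer to file. For a full parameter configuration file, see file.

Create the training data index file using the YAML file.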

 # Single Resolution Data Preparation
 idk base -c dataset/yamls/porcelain.yaml -t dataset/porcelain/jsons/porcelain.json

 # Multi Resolution Data Preparation     
 idk multireso -c dataset/yamls/porcelain_mt.yaml -t dataset/porcelain/jsons/porcelain_mt.json

The directory structure of the porcelain dataset is:

 cd ./dataset

 porcelain
    ├──images/  (image files)
    │  ├──0.png
    │  ├──1.png
    │  ├──......
    ├──csvfile/  (csv files containing text-image pairs)
    │  ├──image_text.csv
    ├──arrows/  (arrow files containing all necessary training data)
    │  ├──00000.arrow
    │  ├──00001.arrow
    │  ├──......
    ├──jsons/  (final training data index files which read data from arrow files during training)
    │  ├──porcelain.json
    │  ├──porcelain_mt.json
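
Before starting a training run, it can help to confirm that the dataset folder matches this layout. The following is a small sketch using `pathlib`; the list of expected entries is taken from the tree above, and `missing_entries` is a hypothetical helper rather than a repo tool.

```python
from pathlib import Path

# Entries every prepared dataset folder should contain, per the tree above.
EXPECTED = [
    "images",
    "csvfile/image_text.csv",
    "arrows",
    "jsons",
]


def missing_entries(root) -> list:
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Usage: an empty list means the layout looks complete.
#     print(missing_entries("./dataset/porcelain"))
```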

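Full-parameter training:

Requirements:

The minimum requirement is a single GPU with at least 20GB of memory, but we recommend a GPU with about 30GB of memory to avoid host memory offloading.
Additionally, we encourage users to use multiple GPUs across different nodes to speed up training on large datasets.

Notice:

Personal users can also use the lightweight Kohya to fine-tune the model with about 16GB of memory. We are currently trying to further reduce the memory usage of our industry-level framework for personal users.
If you have enough GPU memory, try removing --cpu-offloading or --gradient-checkpointing to reduce time costs.

For distributed training specifically, you can flexibly control single-node or multi-node training by adjusting parameters such as --hostfile and --master_addr. For more details, see link.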

# Single Resolution Training
PYTHONPATH=./ sh hydit/train.sh --index-file dataset/porcelain/jsons/porcelain.json
# Multi Resolution Training
PYTHONPATH=./ sh hydit/train.sh --index-file dataset/porcelain/jsons/porcelain_mt.json --multireso --reso-step 64
# Training with old version of HunyuanDiT (<= v1.1)
PYTHONPATH=./ sh hydit/train_v1.1.sh --index-file dataset/porcelain/jsons/porcelain.json

After checkpoints are saved, you can use the following commands to evaluate the model.

# Inference
#   You should replace the 'log_EXP/xxx/checkpoints/final.pt' with your actual path.
python sample_t2i.py --infer-mode fa --prompt "青花瓷风格，一只可爱的哈士奇" --no-enhance --dit-weight log_EXP/xxx/checkpoints/final.pt --load-key module
# Old version of HunyuanDiT (<= v1.1)
#   You should replace the 'log_EXP/xxx/checkpoints/final.pt' with your actual path.
python sample_t2i.py --infer-mode fa --prompt "青花瓷风格，一只可爱的哈士奇" --model-root ./HunyuanDiT-v1.1 --use-style-cond --size-cond 1024 1024 --beta-end 0.03 --no-enhance --dit-weight log_EXP/xxx/checkpoints/final.pt --load-key module

LoRA:

We provide training and inference scripts for LoRA; see ./lora for details.

# Training for porcelain LoRA.
PYTHONPATH=./ sh lora/train_lora.sh --index-file dataset/porcelain/jsons/porcelain.json
# Inference using trained LORA weights.
python sample_t2i.py --infer-mode fa --prompt "青花瓷风格，一只小狗"  --no-enhance --lora-ckpt log_EXP/001-lora_porcelain_ema_rank64/checkpoints/0001000.pt

We offer two types of trained LoRA weights, porcelain and jade; see the links for details.

cd HunyuanDiT
# Use the huggingface-cli tool to download the model.
huggingface-cli download Tencent-Hunyuan/HYDiT-LoRA --local-dir ./ckpts/t2i/lora
# Quick start
python sample_t2i.py --infer-mode fa --prompt "青花瓷风格，一只猫在追蝴蝶"  --no-enhance --load-key ema --lora-ckpt ./ckpts/t2i/lora/porcelain

Inference: 6GB GPU VRAM inference

Running HunyuanDiT in under 6GB of GPU VRAM is now possible, based on diffusers. Here we provide instructions and a demo for a quick start.

The 6GB version supports NVIDIA graphics cards from the Ampere architecture series onward, such as the RTX 3070/3080/4080/4090 and A100.

The only thing you need to do is install the following libraries:

pip install -U bitsandbytes
pip install git+https://github.com/huggingface/diffusers
pip install torch==2.0.0

You can then run HunyuanDiT text-to-image generation directly in under 6GB of GPU VRAM.

Here is a demo for you.

cd HunyuanDiT
# Quick start
model_id=Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers-Distilled
prompt=一个宇航员在骑马
infer_steps=50
guidance_scale=6
python3 lite/inference.py ${model_id} ${prompt} ${infer_steps} ${guidance_scale}

More details can be found in ./lite.

Using Gradio: make sure the conda environment is activated before running the following commands.

# By default, we start a Chinese UI. Using Flash Attention for acceleration.
python app/hydit_app.py --infer-mode fa
# You can disable the enhancement model if the GPU memory is insufficient.
# The enhancement will be unavailable until you restart the app without the `--no-enhance` flag. 
python app/hydit_app.py --no-enhance --infer-mode fa
# Start with English UI
python app/hydit_app.py --lang en --infer-mode fa
# Start a multi-turn T2I generation UI. 
# If your GPU memory is less than 32GB, use '--load-4bit' to enable 4-bit quantization, which requires at least 22GB of memory.
python app/multiTurnT2I_app.py --infer-mode fa

You can then access the demo at http://0.0.0.0:443. Note that 0.0.0.0 here must be replaced with your server's IP address.

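Using 🤗 Diffusers:

Please install PyTorch 2.0 or higher in advance to satisfy the requirements of the specified version of the diffusers library.

Install 🤗 diffusers, ensuring that the version is at least 0.28.1: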

pip install git+https://github.com/huggingface/diffusers.git

or

pip install diffusers

You can generate images from both Chinese and English prompts using the following Python script:

import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]

You can use our distilled model to generate images even faster:

import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers-Distilled", torch_dtype=torch.float16)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt, num_inference_steps=25).images[0]

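More details can be found in HunyuanDiT-v1.2-Diffusers-Distilled.

More functions: for other functions such as LoRA and ControlNet, see the README in ./diffusers.

Using the command line: we provide several commands for a quick start: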

# Only Text-to-Image. Flash Attention mode
python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚" --no-enhance
# Generate an image with other image sizes.
python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚" --image-size 1280 768
# Prompt Enhancement + Text-to-Image. DialogGen loads with 4-bit quantization, but it may lose performance.
python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚"  --load-4bit

🏗️ Adapter: ControlNet

We provide training scripts for ControlNet; see ./controlnet for details.

# Training for canny ControlNet.
PYTHONPATH=./ sh hydit/train_controlnet.sh

We offer three types of trained ControlNet weights, canny, depth, and pose; see the links for details.

cd HunyuanDiT
# Use the huggingface-cli tool to download the model.
# We recommend using distilled weights as the base model for ControlNet inference, as our provided pretrained weights are trained on them.
huggingface-cli download Tencent-Hunyuan/HYDiT-ControlNet-v1.2 --local-dir ./ckpts/t2i/controlnet
huggingface-cli download Tencent-Hunyuan/Distillation-v1.2 ./pytorch_model_distill.pt --local-dir ./ckpts/t2i/model
# Quick start
python3 sample_controlnet.py --infer-mode fa --no-enhance --load-key distill --infer-steps 50 --control-type canny --prompt "在夜晚的酒店门前，一座古老的中国风格的狮子雕像矗立着，它的眼睛闪烁着光芒，仿佛在守护着这座建筑。背景是夜晚的酒店前，构图方式是特写，平视，居中构图。这张照片呈现了真实摄影风格，蕴含了中国雕塑文化，同时展现了神秘氛围" --condition-image-path controlnet/asset/input/canny.jpg --control-weight 1.0

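Instructions: a. Install dependencies

The dependencies and installation are basically the same as for the base model.

b. Model download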

# Use the huggingface-cli tool to download the model.
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner

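Inference:

Our model supports three different modes: directly generating a Chinese caption, generating a Chinese caption based on specific knowledge, and directly generating an English caption. The injected information can be either accurate cues or noisy labels (for example, raw descriptions crawled from the internet). The model can generate reliable and accurate descriptions based on both the inserted information and the image content.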

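Mode	Prompt template	Description
caption_zh	描述这张图片	Caption in Chinese
insert_content	根据提示词“{}”，描述这张图片	Caption with inserted knowledge
caption_en	Please describe the content of this image	Caption in English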
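
The three modes differ only in the prompt text sent along with the image. The sketch below shows how such templates might be filled in; it is not the code of mllm/caption_demo.py. The `caption_zh` template is assumed to be 描述这张图片 by analogy with the `insert_content` row, the `caption_en` template is given in English, and `build_prompt` is a hypothetical helper.

```python
PROMPT_TEMPLATES = {
    "caption_zh": "描述这张图片",
    "insert_content": "根据提示词“{}”，描述这张图片",
    "caption_en": "Please describe the content of this image",
}


def build_prompt(mode: str, content: str = "") -> str:
    """Return the prompt for a mode, inserting knowledge where required."""
    if mode not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown mode: {mode}")
    template = PROMPT_TEMPLATES[mode]
    if mode == "insert_content":
        if not content:
            raise ValueError("insert_content mode needs a non-empty content string")
        # The {} placeholder in the template receives the injected knowledge.
        return template.format(content)
    return template


print(build_prompt("insert_content", "宫保鸡丁"))
# 根据提示词“宫保鸡丁”，描述这张图片
```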

a. Single-image inference in Chinese

python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner"

b. Insert specific knowledge into the caption

python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner"

c. Single-image inference in English

python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner"

d. Multi-image inference in Chinese

### Convert multiple pictures to csv file. 
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv"

### Multiple pictures inference
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner"

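(Optional) To convert the output csv file to Arrow format, see Data Preparation #3 for detailed instructions.

Gradio:

To launch a Gradio demo locally, run the following commands one by one. For more detailed instructions, please refer to LLaVA.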

cd mllm
python -m llava.serve.controller --host 0.0.0.0 --port 10000

python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "../ckpts/captioner" --model-name LlavaMistral

You can then access the demo at http://0.0.0.0:443. Note that 0.0.0.0 here must be replaced with your server's IP address.

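🚀 Acceleration (for Linux)

We provide a TensorRT version of HunyuanDiT for inference acceleration (faster than flash attention). See Tencent-Hunyuan/TensorRT-libs for more details.
We provide a Distillation version of HunyuanDiT for inference acceleration. See Tencent-Hunyuan/Distillation for more details.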

🔗 BibTeX

If you find Hunyuan-DiT or DialogGen useful for your research and applications, please cite using this BibTeX:

@misc{li2024hunyuandit,
      title={Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding}, 
      author={Zhimin Li and Jianwei Zhang and Qin Lin and Jiangfeng Xiong and Yanxin Long and Xinchi Deng and Yingfang Zhang and Xingchao Liu and Minbin Huang and Zedong Xiao and Dayou Chen and Jiajun He and Jiahao Li and Wenyue Li and Chen Zhang and Rongwei Quan and Jianxiang Lu and Jiabin Huang and Xiaoyan Yuan and Xiaoxiao Zheng and Yixuan Li and Jihong Zhang and Chao Zhang and Meng Chen and Jie Liu and Zheng Fang and Weiyan Wang and Jinbao Xue and Yangyu Tao and Jianchen Zhu and Kai Liu and Sihuan Lin and Yifu Sun and Yun Li and Dongdong Wang and Mingtao Chen and Zhichao Hu and Xiao Xiao and Yan Chen and Yuhong Liu and Wei Liu and Di Wang and Yong Yang and Jie Jiang and Qinglin Lu},
      year={2024},
      eprint={2405.08748},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@article{huang2024dialoggen,
  title={DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation},
  author={Huang, Minbin and Long, Yanxin and Deng, Xinchi and Chu, Ruihang and Xiong, Jiangfeng and Liang, Xiaodan and Cheng, Hong and Lu, Qinglin and Liu, Wei},
  journal={arXiv preprint arXiv:2403.08857},
  year={2024}
}