
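import torch
from matplotlib import pyplot as plt
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
    StableDiffusionDepth2ImgPipeline,
)

device = "cuda" if torch.cuda.is_available() else "cpu"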

# Load the pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)

# Alternative when GPU memory is limited: load the half-precision (fp16) weights
# pipe = StableDiffusionPipeline.from_pretrained(
#     model_id, revision="fp16", torch_dtype=torch.float16
# ).to(device)

# Inspect the pipeline's components
print(list(pipe.components.keys()))

# Output
['vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler',
 'safety_checker', 'feature_extractor']

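The concepts behind these components were introduced earlier. Below, we use code to examine each component in detail.

1. Variational Autoencoder (VAE)

A variational autoencoder (VAE) is a model that encodes its input into a compressed representation and then decodes that representation back into something close to the original input. Its structure is shown in the figure:

When Stable Diffusion generates an image, the diffusion process is first applied in the VAE's latent space to produce a latent encoding, which is then decoded after diffusion to obtain the final output image. In other words, the UNet's input is not the full image but the features compressed by the VAE, which greatly reduces the computational cost. The code is as follows: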

# Create dummy data with values in the range [-1, 1)
images = torch.rand(1, 3, 512, 512).to(device) * 2 - 1
print("Input images shape:", images.shape)

# Encode to the latent space (0.18215 is the latent scaling factor)
with torch.no_grad():
    latents = 0.18215 * pipe.vae.encode(images).latent_dist.mean
print("Encoded latents shape:", latents.shape)

# Decode back to image space
with torch.no_grad():
    decoded_images = pipe.vae.decode(latents / 0.18215).sample
print("Decoded images shape:", decoded_images.shape)

# Output
Input images shape: torch.Size([1, 3, 512, 512])
Encoded latents shape: torch.Size([1, 4, 64, 64])
Decoded images shape: torch.Size([1, 3, 512, 512])

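In this example, the original 512×512-pixel image is compressed into a 64×64 latent representation with four channels: each spatial dimension is reduced to one-eighth of its original size. This is why the width and height parameters must be set to multiples of 8.

Note: VAE decoding is not perfect and some image quality is lost, but it is good enough in practice.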
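To make the arithmetic concrete, the following minimal sketch computes the latent shape for a given image size and the overall reduction in the number of values. The helper names `latent_shape` and `num_values` are hypothetical and not part of diffusers; the factor of 8 and the 4 latent channels are simply read off the shapes printed above.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Return the (channels, height, width) of the latent for an image.

    Hypothetical helper for illustration; downscale=8 and latent_channels=4
    are read off the shapes printed in the example above.
    """
    if height % downscale or width % downscale:
        raise ValueError(f"height and width must be multiples of {downscale}")
    return latent_channels, height // downscale, width // downscale


def num_values(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


image = (3, 512, 512)
latent = latent_shape(512, 512)
print(latent)                                  # (4, 64, 64)
print(num_values(image) / num_values(latent))  # 48.0
```

Although each spatial dimension shrinks by a factor of 8, the latent has four channels instead of three, so the total number of values the UNet has to process drops by a factor of 48 rather than 64.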

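2. Tokenizer and Text Encoder

How does the prompt control Stable Diffusion? First, the tokenizer splits the prompt into tokens and converts them to numeric IDs; these IDs are then fed to the text encoder. In practice, you can call the _encode_prompt method directly: it pads or truncates the token sequence to a length of 77 and returns the final text representation. The code is as follows: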

# Tokenize and encode the prompt manually

# Tokenize
input_ids = pipe.tokenizer(["A painting of a flooble"])["input_ids"]
print("Input ID -> decoded token")
for input_id in input_ids[0]:
    print(f"{input_id} -> {pipe.tokenizer.decode(input_id)}")

# Feed the token IDs to the CLIP text encoder
input_ids = torch.tensor(input_ids).to(device)
with torch.no_grad():
    text_embeddings = pipe.text_encoder(input_ids)["last_hidden_state"]
print("Text embeddings shape:", text_embeddings.shape)

# Output
Input ID -> decoded token
49406 -> <|startoftext|>
320 -> a
3086 -> painting
539 -> of
320 -> a
4062 -> floo
1059 -> ble
49407 -> <|endoftext|>
Text embeddings shape: torch.Size([1, 8, 1024])

Next, get the final text features:

text_embeddings = pipe._encode_prompt(
    "A painting of a flooble", device, 1, False, ""
)
print(text_embeddings.shape)

# Output
torch.Size([1, 77, 1024])

As you can see, the final text representation has been padded from a length of 8 to 77.
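The pad-or-truncate step can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the function name `pad_or_truncate` is hypothetical, and the padding ID of 0 is an assumption (the real value depends on the tokenizer's configuration). The token IDs are the ones printed above.

```python
def pad_or_truncate(input_ids, max_length=77, pad_id=0, eos_id=49407):
    """Force a token-ID sequence to exactly max_length entries.

    Illustrative sketch only. pad_id=0 is an assumed value; a real
    tokenizer defines its own padding token.
    """
    if len(input_ids) > max_length:
        # Keep the first max_length - 1 tokens and end with the end-of-text token
        return input_ids[: max_length - 1] + [eos_id]
    return input_ids + [pad_id] * (max_length - len(input_ids))


ids = [49406, 320, 3086, 539, 320, 4062, 1059, 49407]  # "A painting of a flooble"
padded = pad_or_truncate(ids)
print(len(ids), "->", len(padded))  # 8 -> 77
```

A fixed length means every prompt yields a text embedding of the same shape, which is what lets the UNet accept a tensor of shape (1, 77, 1024) regardless of how long the prompt is.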

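3. UNet

In a diffusion model, the UNet takes a noisy input and predicts the noise so that it can be removed. The network structure is shown in the figure below. Unlike the earlier examples, the input here is not the original image but its latent representation, and the text prompt is also fed to the UNet as conditioning.

Let's pass dummy data for the three inputs (latents, timestep, and text embeddings) through the model to see the shapes of the UNet's inputs and output during prediction. The code is as follows: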

# Create dummy inputs
timestep = pipe.scheduler.timesteps[0]
latents = torch.randn(1, 4, 64, 64).to(device)
text_embeddings = torch.randn(1, 77, 1024).to(device)

# Run the model to get a prediction
with torch.no_grad():
    unet_output = pipe.unet(latents, timestep, text_embeddings).sample
print("UNet output shape:", unet_output.shape)

# Output
UNet output shape: torch.Size([1, 4, 64, 64])

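4. Scheduler

The scheduler stores the information about how noise is added and manages how the noisy sample is updated based on the model's predictions. The default scheduler is PNDMScheduler.

We can plot how the noise level changes as the timestep increases during the noising process: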

# alphas_cumprod starts near 1 (little noise) and decreases towards 0 (mostly noise)
plt.plot(pipe.scheduler.alphas_cumprod, label=r"$\bar{\alpha}$")
plt.xlabel("Timestep (low noise to high noise ->)")
plt.title("Noise schedule")
plt.legend()
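For reference, the curve plotted above can be reproduced without loading the pipeline. The sketch below assumes the "scaled_linear" schedule, in which the betas are linearly spaced in square-root space and then squared, with beta_start=0.00085, beta_end=0.012 and 1000 training timesteps (the values that appear in the scheduler configuration printed later in this section). The function name is hypothetical.

```python
import math


def scaled_linear_alphas_cumprod(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Cumulative product of (1 - beta_t) for a 'scaled_linear' beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alphas_cumprod = []
    running = 1.0
    for i in range(num_steps):
        # Linearly interpolate in sqrt space, then square to get beta_t
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


abar = scaled_linear_alphas_cumprod()
# abar[t] is the share of signal variance left at timestep t;
# 1 - abar[t] is the share of noise variance.
print(f"t=0:   {abar[0]:.5f}")
print(f"t=999: {abar[-1]:.5f}")
```

The values start close to 1 (almost no noise) and fall towards 0 (almost pure noise) as the timestep index increases.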

Next, let's compare the results produced with a different scheduler, for example LMSDiscreteScheduler. The code is as follows:

from diffusers import LMSDiscreteScheduler

# Replace the original scheduler
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# Print the configuration
print("Scheduler config:", pipe.scheduler)

# Generate an image with the new scheduler
pipe(
    prompt="Palette knife painting of an winter cityscape",
    height=480,
    width=480,
    generator=torch.Generator(device=device).manual_seed(42),
).images[0]

# Output
Scheduler config: LMSDiscreteScheduler {
  "_class_name": "LMSDiscreteScheduler",
  "_diffusers_version": "0.11.1",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "trained_betas": null
}

The generated image is shown in the figure below:

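5. Reproducing the Complete Pipeline

So far we have examined each component of the pipeline step by step. Now we combine them to implement a complete pipeline by hand. The code is as follows: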

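guidance_scale = 8
num_inference_steps = 30
prompt = "Beautiful picture of a wave breaking"
negative_prompt = "zoomed in, blurry, oversaturated, warped"

# Seeded generator for reproducible noise (the seed 42 is an assumed value;
# the original snippet used `generator` without defining it)
generator = torch.Generator(device=device).manual_seed(42)

# Encode the prompt (with classifier-free guidance enabled)
text_embeddings = pipe._encode_prompt(prompt, device, 1, True, negative_prompt)

# Create random noise as the starting point
latents = torch.randn((1, 4, 64, 64), device=device, generator=generator)
latents *= pipe.scheduler.init_noise_sigma

# Prepare the scheduler
pipe.scheduler.set_timesteps(num_inference_steps, device=device)

# Sampling loop
for i, t in enumerate(pipe.scheduler.timesteps):
    # Duplicate the latents: one copy for the unconditional (negative-prompt)
    # branch and one for the text-conditioned branch
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = pipe.scheduler.scale_model_input(latent_model_input, t)

    # Predict the noise for both branches
    with torch.no_grad():
        noise_pred = pipe.unet(
            latent_model_input, t, encoder_hidden_states=text_embeddings
        ).sample

    # Apply classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (
        noise_pred_text - noise_pred_uncond
    )

    # Take one denoising step
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Decode the latents into an image (see Figure 6-14)
with torch.no_grad():
    image = pipe.decode_latents(latents.detach())

pipe.numpy_to_pil(image)[0]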

The generated result is shown in the figure below:
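The guidance step inside the sampling loop is worth isolating. The sketch below applies the same formula, unconditional prediction plus guidance_scale times the difference between the text-conditioned and unconditional predictions, to plain Python lists instead of tensors. The function name `apply_guidance` and the toy numbers are made up for illustration.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Implements uncond + scale * (text - uncond) element-wise on flat lists.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

print(apply_guidance(uncond, text, 0))  # [0.0, 1.0, -1.0]  (unconditional only)
print(apply_guidance(uncond, text, 1))  # [1.0, 1.0, 0.0]   (text-conditioned only)
print(apply_guidance(uncond, text, 8))  # [8.0, 1.0, 7.0]   (pushed past the text prediction)
```

A scale above 1 extrapolates away from the unconditional prediction, which is why larger values of guidance_scale make the result follow the prompt more strongly.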

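6. Other Pipelines

The earlier article "Diffusion Models in Practice (10): Stable Diffusion, a Large Text-Conditioned Image Generation Model" also mentioned several other pipelines, such as Img2Img for image-to-image style transfer, Inpainting for image repair, and Depth2Image for depth-guided generation. In this section we explore how these models work in practice.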

6.1 Img2Img

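So far, our images have been generated from completely random latents, using the full diffusion sampling loop. The Img2Img pipeline does not have to start from scratch: it first encodes an existing image into latents, then adds random noise to those latents and uses the result as the starting point.

The amount of noise added and the number of denoising steps determine the result. Adding a small amount of noise produces only minor changes, while adding a large amount of noise and running the full denoising process can yield an image that is completely different from the original and similar only in overall structure.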
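One way to picture the link between the amount of noise and the number of denoising steps is the sketch below. It assumes that a strength value between 0 and 1 selects the fraction of the schedule that is actually run, which is modeled on how img2img pipelines commonly handle this parameter; the function name `denoising_steps` is hypothetical and the details should be treated as an assumption rather than the library's exact code.

```python
def denoising_steps(num_inference_steps, strength):
    """Return (t_start, steps_run) for an img2img-style partial schedule.

    Sketch under the assumption that `strength` selects the fraction of the
    schedule to run: the first t_start steps are skipped, the rest are run.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


for strength in (0.0, 0.5, 1.0):
    print(strength, denoising_steps(50, strength))
```

Under this assumption, a strength of 0 skips every step and leaves the input untouched, while a strength of 1 runs the whole schedule starting from (almost) pure noise.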

# Load the Img2Img pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to(device)

# init_image is assumed to be a PIL image loaded beforehand
result_image = img2img_pipe(
    prompt="An oil painting of a man on a bench",
    image=init_image,  # the input image to edit
    strength=0.6,  # 0 leaves the input unchanged; 1 applies the strongest change
).images[0]

# Show the result (see Figure 6-15)
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].imshow(init_image)
axs[0].set_title("Input Image")
axs[1].imshow(result_image)
axs[1].set_title("Result")

The generated result is shown in the figure below:

6.2 Inpainting

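Inpainting is an image-repair technique that keeps part of an image unchanged and generates new content for the rest. The structure of the Inpainting UNet is shown in the figure below:

Let's look at an example: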

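pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)
pipe = pipe.to(device)

# The prompt tells the model what to fill the masked region with
# (init_image and mask_image are assumed to be PIL images loaded beforehand)
prompt = "A small robot, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

# View the result (see Figure 6-17)
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
axs[0].imshow(init_image)
axs[0].set_title("Input Image")
axs[1].imshow(mask_image)
axs[1].set_title("Mask")
axs[2].imshow(image)
axs[2].set_title("Result")

The generated result is shown in the figure below:

This is a promising model, and it becomes very powerful when combined with a model that generates masks automatically. For example, a model called CLIPSeg, available on Hugging Face Spaces, can generate masks automatically.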

6.3 Depth2Image

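If you want to keep an image's overall structure but not its original colors, for example to generate a new image with different colors or textures, this is hard to control through the Img2Img "strength" parameter. Depth2Img instead uses a depth-estimation model to predict a depth map, which is fed into a fine-tuned UNet when generating the image. The goal is for the generated image to preserve the depth information and overall structure of the original while filling in entirely new content in the relevant regions. The code is as follows: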

# Load the Depth2Img pipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth"
)
pipe = pipe.to(device)

# Generate a new image guided by the prompt and the predicted depth map
# (init_image is assumed to be a PIL image loaded beforehand)
prompt = "An oil painting of a man on a bench"
image = pipe(prompt=prompt, image=init_image).images[0]

# View the result (see Figure 6-18)
fig, axs = plt.subplots(1, 2, figsize=(16, 5))
axs[0].imshow(init_image)
axs[0].set_title("Input Image")
axs[1].imshow(image)
axs[1].set_title("Result")