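BLIP-2 Image Question Answering and Caption Generation Program (Source Code and Execution Results)

Python development environment and libraries

This section covers only the minimum required setup. For machine learning or deep learning work, it is convenient to also install NVIDIA CUDA, Visual Studio, Cursor, and similar tools. These are explained in detail on a separate page, https://www.kkaneko.jp/cc/dev/aiassist.html; refer to it as needed.

Installing Python 3.12

Skip this step if Python is already installed.

Open a Command Prompt with administrator privileges (press the Windows key or open the Start menu > type cmd > right-click > "Run as administrator") and run the following commands. Administrator privileges are required because the --scope machine option of winget installs software system-wide.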

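REM Install Python system-wide
winget install --scope machine --id Python.Python.3.12 -e --silent
REM Add Python to the system PATH
set "PYTHON_PATH=C:\Program Files\Python312"
set "PYTHON_SCRIPTS_PATH=C:\Program Files\Python312\Scripts"
REM setx does not change PATH in the current session, so the session copy is updated too;
REM otherwise the second setx would overwrite the entry added by the first one.
echo "%PATH%" | find /i "%PYTHON_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_PATH%" /M >nul & set "PATH=%PATH%;%PYTHON_PATH%"
echo "%PATH%" | find /i "%PYTHON_SCRIPTS_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_SCRIPTS_PATH%" /M >nul & set "PATH=%PATH%;%PYTHON_SCRIPTS_PATH%"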

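[Related external pages]

Official Python page: https://www.python.org/

Installing the AI editor Windsurf

An AI editor is recommended for editing and running Python programs. This section explains how to install Windsurf.

Open a Command Prompt with administrator privileges (press the Windows key or open the Start menu > type cmd > right-click > "Run as administrator") and run the following command to install Windsurf system-wide. Administrator privileges are required because the --scope machine option of winget installs software system-wide.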

winget install --scope machine Codeium.Windsurf -e --silent

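[Related external pages]

Official Windsurf page: https://windsurf.com/

Installing the required libraries

Run Command Prompt as administrator (press the Windows key or open the Start menu > type cmd > right-click > "Run as administrator") and run the following commands.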


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install transformers pillow opencv-python
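
After the two pip commands finish, it can be useful to confirm that the packages are importable before running the main program. The following is a minimal sketch using only the standard library; the helper name find_missing_modules is hypothetical and not part of the original program. Note that the pillow package is imported as PIL and opencv-python as cv2.

```python
import importlib.util

# Import names for the packages installed above
# (pillow is imported as PIL, opencv-python as cv2).
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "transformers", "PIL", "cv2"]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported as missing, rerunning the corresponding pip command in the same Command Prompt is the first thing to try.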

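BLIP-2 image question answering and caption generation program

Overview

This program is an image understanding system built on BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models). It takes still images as input and either answers a question about the image or generates a description of its content automatically. The user interface is interactive and supports three input methods: image files, camera capture, and sample images.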

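Key technologies

BLIP-2 architecture

BLIP-2 is a vision-language model that efficiently connects a frozen image encoder to a frozen large language model [1]. Its core component, the Q-Former (Querying Transformer), is a lightweight transformer that bridges the modality gap between visual features and language representations. With this design, a pre-trained image encoder and a pre-trained language model can be used without retraining either of them.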

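Q-Former mechanism

The Q-Former uses a set of learnable query vectors to extract the most relevant visual features from the image encoder [1]. Through self-attention and cross-attention, these queries convert visual information into a form the language model can consume. The Q-Former is pre-trained in two stages [1]: vision-language representation learning with the frozen image encoder, followed by vision-to-language generative learning with the frozen language model. This two-stage bootstrapping achieves high-quality image understanding while keeping the computational cost low.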
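
To make the idea of query-based feature extraction concrete, here is a toy sketch of a single cross-attention step in NumPy. It is a simplification, not the BLIP-2 implementation: it uses one attention head, random weights, and small made-up dimensions, and it omits the self-attention among queries and the multiple transformer layers of the real Q-Former. The only number taken from the paper is the count of 32 queries; all variable names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, image_features, w_q, w_k, w_v):
    """Single-head cross-attention: each query attends over all image features."""
    q = queries @ w_q                          # (num_queries, d)
    k = image_features @ w_k                   # (num_patches, d)
    v = image_features @ w_v                   # (num_patches, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
num_queries, query_dim = 32, 64     # 32 queries as in BLIP-2; 64 is a toy width
num_patches, image_dim = 257, 96    # toy stand-in for the frozen encoder output

queries = rng.normal(size=(num_queries, query_dim))         # learnable in the real model
image_features = rng.normal(size=(num_patches, image_dim))  # produced by the frozen encoder
w_q = rng.normal(size=(query_dim, query_dim)) * 0.1
w_k = rng.normal(size=(image_dim, query_dim)) * 0.1
w_v = rng.normal(size=(image_dim, query_dim)) * 0.1

out, weights = cross_attention(queries, image_features, w_q, w_k, w_v)
print(out.shape)      # (32, 64)
print(weights.shape)  # (32, 257)
```

The point of the sketch is the output shape: however many patch features the image encoder produces, the queries reduce them to a fixed number of vectors, which is what gets passed on toward the language model.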

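Technical features

Flexible model selection

One of four pre-trained models (OPT-2.7B, Flan-T5-XL, OPT-6.7B, Flan-T5-XXL) can be selected. Each uses a different language model backbone, so the balance between accuracy and processing speed can be chosen to suit the application [2].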
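
The menu-based selection can be sketched as a small pure function that maps the user's input to a model ID. This is an illustrative helper, not code from the program: resolve_model_choice and language_model_family are hypothetical names, and the backbone classification simply relies on the checkpoint naming convention.

```python
MODELS = [
    "Salesforce/blip2-opt-2.7b",
    "Salesforce/blip2-flan-t5-xl",
    "Salesforce/blip2-opt-6.7b",
    "Salesforce/blip2-flan-t5-xxl",
]


def resolve_model_choice(choice, models=MODELS):
    """Map a 1-based menu entry typed by the user to a model ID, or None if invalid."""
    choice = choice.strip()
    if not choice.isdecimal():
        return None
    index = int(choice) - 1
    if 0 <= index < len(models):
        return models[index]
    return None


def language_model_family(model_id):
    """Classify the language backbone from the checkpoint name (naming convention assumed)."""
    if "flan-t5" in model_id:
        return "flan-t5 (encoder-decoder)"
    return "opt (decoder-only)"


print(resolve_model_choice("2"))   # Salesforce/blip2-flan-t5-xl
print(resolve_model_choice("9"))   # None
```

Returning None for invalid input keeps the validation in one place, so the caller only has to check a single value before loading the model.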

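Automatic GPU/CPU selection

The program uses PyTorch's device detection (torch.cuda.is_available()) to choose the execution device: a CUDA GPU when one is available, otherwise the CPU. On a GPU the model is loaded in float16 to reduce memory use; on the CPU it is kept in float32 for stability.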
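
The selection policy can be written as a small decision function that is separate from the PyTorch call, which makes it easy to check without a GPU. This is a sketch under that assumption; choose_precision and detect_cuda are hypothetical helper names, and the dtype is returned as a string that a real program could resolve with getattr(torch, dtype_name).

```python
def choose_precision(cuda_available):
    """Return (device name, dtype name) following the policy described above."""
    if cuda_available:
        return "cuda", "float16"   # half precision to reduce GPU memory use
    return "cpu", "float32"        # full precision on the CPU


def detect_cuda():
    """Report CUDA availability; returns False when PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


device_name, dtype_name = choose_precision(detect_cuda())
print(device_name, dtype_name)
```

When the model runs in float16, the floating-point tensors produced by the processor generally need to be cast to the same dtype before being passed to the model.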

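Implementation highlights

Multiple input methods

File selection: select multiple images at once through a tkinter dialog
Camera capture: live image acquisition with OpenCV; press the space key to capture a frame
Sample images: downloaded automatically from predefined URLs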

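Visualizing results

The generated answer is overlaid on the original image. Text is drawn with PIL/Pillow using a Japanese-capable font (Meiryo), and lines are wrapped automatically to fit the image width. If a long answer would run past the bottom of the image, the remainder is replaced with an ellipsis.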
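
The wrapping and truncation logic can be illustrated independently of Pillow. In the sketch below, the width of a line is given by a measure function; len is used as a stand-in so the example runs without a font. The names wrap_words and clip_lines are hypothetical, and clip_lines is a slight variation on the program, which draws the ellipsis at the position of the first line that no longer fits.

```python
def wrap_words(text, max_width, measure=len):
    """Greedily pack whitespace-separated words into lines no wider than max_width.

    measure(line) returns the rendered width of a line; len is a stand-in.
    A single word wider than max_width is placed on a line of its own.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if measure(candidate) <= max_width:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def clip_lines(lines, max_lines, ellipsis="..."):
    """Keep at most max_lines lines; if clipped, the last kept line becomes an ellipsis."""
    if len(lines) <= max_lines:
        return list(lines)
    if max_lines <= 0:
        return []
    return list(lines[:max_lines - 1]) + [ellipsis]


wrapped = wrap_words("a cat sitting on a wooden table", 10)
print(wrapped)                 # ['a cat', 'sitting on', 'a wooden', 'table']
print(clip_lines(wrapped, 3))  # ['a cat', 'sitting on', '...']
```

With Pillow, the measure argument could be a function based on the same call the program uses, for example lambda s: draw.textbbox((0, 0), s, font=font)[2], where draw and font are the ImageDraw and ImageFont objects already created.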

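Adjustable token limits

Separate maximum token counts can be set for caption generation (MAX_NEW_TOKENS_CAPTION) and for question answering (MAX_NEW_TOKENS_QA), so the length of the generated text can be controlled to suit each task.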
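
How the two limits are chosen per request can be sketched as follows. The function build_request is a hypothetical helper. The "Question: ... Answer:" prompt wrapper is an assumption borrowed from the usage examples in the Hugging Face BLIP-2 documentation; the program in this article passes the question text to the processor unchanged.

```python
MAX_NEW_TOKENS_CAPTION = 30  # limit used when no question is given
MAX_NEW_TOKENS_QA = 50       # limit used for question answering


def build_request(question):
    """Return (prompt, max_new_tokens); prompt is None when captioning."""
    question = question.strip()
    if not question:
        return None, MAX_NEW_TOKENS_CAPTION
    # Assumed prompt format; the original program uses the raw question instead.
    return f"Question: {question} Answer:", MAX_NEW_TOKENS_QA


print(build_request(""))                 # (None, 30)
print(build_request("What is this?"))    # ('Question: What is this? Answer:', 50)
```

The returned pair corresponds to the text argument of the processor and the max_new_tokens argument of model.generate in the program.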

References

[1] Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the 40th International Conference on Machine Learning (ICML 2023). https://proceedings.mlr.press/v202/li23q.html

[2] Salesforce Research. (2023). BLIP-2 Model Cards. Hugging Face Model Hub. https://huggingface.co/Salesforce

Source code


"""
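Program name: BLIP-2 image question answering and caption generation program
Core technology: BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models)
Source: Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.
Key function: Efficient vision-language alignment through the Q-Former (Querying Transformer). It bridges a frozen image encoder and a large language model to understand image content and describe it in natural language
Pre-trained models:
  - Salesforce/blip2-opt-2.7b (uses the OPT-2.7B language model; lightweight)
  - Salesforce/blip2-flan-t5-xl (uses the Flan-T5-XL language model; balanced)
  - Salesforce/blip2-opt-6.7b (uses the OPT-6.7B language model; standard)
  - Salesforce/blip2-flan-t5-xxl (uses the Flan-T5-XXL language model; large)
  - URL: https://huggingface.co/Salesforce/ (downloaded automatically from the Hugging Face Model Hub)
Design:
  - Related technologies:
    * Transformers (Hugging Face): loading the BLIP-2 model and running inference
    * PyTorch: tensor operations and GPU/CPU control
    * OpenCV: camera control and image display
    * PIL/Pillow: image format conversion and text drawing
    * tkinter: file selection dialog
    * urllib: downloading the sample images
  - Input and output: Input: multiple still images or a camera (the user chooses from the menu "0: image file, 1: camera, 2: sample images". With 0, multiple files can be selected through tkinter. With 1, the camera is opened with OpenCV and the space key captures a frame (repeatable). With 2, the four images fruits.jpg, messi5.jpg, aero3.jpg, and Cat03.jpg are processed in order). Output: the result image shown in an OpenCV window (question and answer overlaid on the image), and the analysis result printed to the console
  - Processing steps:
    1. The user selects one of the four BLIP-2 models
    2. The selected model and processor are loaded (the GPU is used automatically when available)
    3. The user selects the input method (file / camera / sample)
    4. For each image:
       a. Display the image for 1 second
       b. The user enters a question (leave blank to generate a description automatically)
       c. Convert BGR to RGB and then to PIL format
       d. Preprocess the image (and the question) with the processor
       e. Run inference with the model to generate the answer or description
       f. Draw the result on the image and display it
  - Preprocessing and postprocessing:
    * Preprocessing: BGR-to-RGB color conversion, conversion to a PIL Image, normalization and tokenization by the processor
    * Postprocessing: decoding the generated tokens, removing special tokens (skip_special_tokens=True), trimming whitespace with strip()
  - Additional processing:
    * float16 precision on the GPU to reduce memory use (device_map='auto')
    * float32 precision retained on the CPU
    * Automatic line wrapping based on the image width (split by word, wrapped according to the measured drawing width)
    * Ellipsis shown when the text would exceed the image height
    * More reliable camera initialization on Windows (cv2.CAP_DSHOW is tried first)
  - Settings that may need adjustment:
    * MAX_NEW_TOKENS_CAPTION (30): maximum number of new tokens for caption generation. Larger values give more detailed descriptions, smaller values give shorter ones
    * MAX_NEW_TOKENS_QA (50): maximum number of new tokens for question answering. Complex questions need larger values
    * CAMERA_ID (0): ID of the camera device to use. Change it when several cameras are connected
Future plan: automatically adjust MAX_NEW_TOKENS_CAPTION by evaluating whether the generated sentence is complete. Specifically, check whether the text ends with a period or a complete word and, if it does not, regenerate with the token limit increased in steps of 10
Other important notes:
  - Questions must be entered in English
  - Large models require a download of several GB on first load
  - A Japanese-capable font (meiryo.ttc) is required (Windows)
  - Temporary files are deleted automatically after the sample images are used
Prerequisites:
  - pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
  - pip install transformers pillow opencv-python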
"""
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image, ImageDraw, ImageFont
import cv2
import tkinter as tk
from tkinter import filedialog
import urllib.request
import os
import numpy as np
import time
from datetime import datetime

# Model definitions
MODELS = [
    'Salesforce/blip2-opt-2.7b',
    'Salesforce/blip2-flan-t5-xl',
    'Salesforce/blip2-opt-6.7b',
    'Salesforce/blip2-flan-t5-xxl'
]

# Settings
MAX_NEW_TOKENS_CAPTION = 30  # max new tokens for caption generation
MAX_NEW_TOKENS_QA = 50  # max new tokens for question answering
MODEL_COUNT = len(MODELS)  # number of available models
CAMERA_ID = 0  # camera device ID

# Result log
results_log = []

def image_processing(img):
    current_time = time.time()

    # Show the input image first
    cv2.imshow('Input Image', img)
    cv2.waitKey(1000)  # display for 1 second

    print('\n=== Image analysis ===')
    print('Enter a question in English (e.g., What is this? / What color is it?)')
    print('Press Enter alone to generate a description automatically')
    question = input('Question (leave blank for automatic description): ')

    # Convert BGR to RGB and then to a PIL image
    pil_image = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

    if not question.strip():
        # No question: generate an image caption
        print('Generating a description of the image...')
        # Cast floating-point inputs to the model dtype (float16 on the GPU)
        inputs = processor(images=pil_image, return_tensors='pt').to(device, model.dtype)

        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS_CAPTION)
            answer = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
        question = "Description"
    else:
        # Generate an answer to the question
        inputs = processor(images=pil_image, text=question, return_tensors='pt').to(device, model.dtype)

        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS_QA)
            answer = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()

    # Print the result to the console
    print('\n=== Analysis result ===')
    print(f'Question/mode: {question}')
    print(f'Answer: {answer}')

    result = f'Q: {question}, A: {answer}'

    # Draw the result on the image
    result_img = img.copy()
    img_height, img_width = img.shape[:2]
    FONT_PATH = 'C:/Windows/Fonts/meiryo.ttc'
    FONT_SIZE = 20

    try:
        font = ImageFont.truetype(FONT_PATH, FONT_SIZE)
    except OSError:
        # Without the font, return the image without the overlay
        print(f'Font file {FONT_PATH} was not found')
        return result_img, result, current_time

    img_pil = Image.fromarray(cv2.cvtColor(result_img, cv2.COLOR_BGR2RGB))
    draw = ImageDraw.Draw(img_pil)

    # Draw the question on the image
    draw.text((10, 10), f'Q: {question}', font=font, fill=(0, 255, 0))

    # Wrap lines according to the image width
    max_text_width = img_width - 20  # 10 px margin on each side

    # Split the answer into lines that fit
    lines = []
    words = answer.split()
    current_line = []

    for word in words:
        test_line = ' '.join(current_line + [word])
        # Measure the actual drawing width
        bbox = draw.textbbox((0, 0), test_line, font=font)
        text_width = bbox[2] - bbox[0]

        if text_width <= max_text_width:
            current_line.append(word)
        else:
            if current_line:
                lines.append(' '.join(current_line))
                current_line = [word]
            else:
                # A single word that is too wide goes on its own line
                lines.append(word)
                current_line = []

    if current_line:
        lines.append(' '.join(current_line))

    # Draw the answer, making sure it stays within the image height
    y_offset = 40
    line_height = 25

    for line in lines:
        # Check the remaining height before drawing
        if y_offset + line_height > img_height - 10:
            # Near the bottom edge: draw an ellipsis and stop
            draw.text((10, y_offset), "...", font=font, fill=(255, 255, 0))
            break
        draw.text((10, y_offset), line, font=font, fill=(255, 255, 0))
        y_offset += line_height

    result_img = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2BGR)

    return result_img, result, current_time


def process_and_display_images(image_sources, source_type):
    display_index = 1
    for source in image_sources:
        img = cv2.imread(source) if source_type == 'file' else source
        if img is None:
            continue
        cv2.imshow(f'Image_{display_index}', img)
        processed_img, result, current_time = image_processing(img)
        cv2.imshow(f'BLIP-2 analysis_{display_index}', processed_img)
        print(datetime.fromtimestamp(current_time).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], result)
        results_log.append(result)
        display_index += 1

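# Main processing starts here
print('=== BLIP-2 image question answering and caption generation system ===')
print('\n[Overview]')
print('Analyzes image content using a BLIP-2 model')
print('Question answering mode: answers a question about the image')
print('Automatic description mode: describes the image content automatically')
print('\n[Controls]')
print('Camera mode: space key to capture, q key to quit')
print('During image analysis: type a question, or press Enter for an automatic description')
print('\n[Notes]')
print('Enter questions in English')
print('Large models take time to load')
print('-' * 50)

print('\nAvailable models:')
for i, model_name in enumerate(MODELS):
    print(f'{i+1}. {model_name}')

model_choice = input(f'Select a model number (1-{MODEL_COUNT}): ')
if not model_choice.isdigit() or not (1 <= int(model_choice) <= MODEL_COUNT):
    print('Invalid selection')
    exit()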

model_name = MODELS[int(model_choice)-1]

# Automatic GPU/CPU selection
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {str(device)}')
# Optimization when using the GPU
if device.type == 'cuda':
    torch.backends.cudnn.benchmark = True

print(f'Selected model: {model_name}')
print('Loading model...')

# Model initialization
try:
    if device.type == 'cuda':
        # float16 on the GPU to reduce memory use
        model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map='auto'
        )
    else:
        # float32 on the CPU
        model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float32
        ).to(device)

    processor = Blip2Processor.from_pretrained(model_name)
    print('Model loaded')
except Exception as e:
    print(f'Failed to load the model: {e}')
    exit()

print('\nSelect an input method:')
print('0: Image file')
print('1: Camera')
print('2: Sample images')

choice = input('Selection: ')

# Process according to the selected input method
try:
    if choice == '0':
        root = tk.Tk()
        root.withdraw()
        if not (paths := filedialog.askopenfilenames()):
            exit()
        process_and_display_images(paths, 'file')
        cv2.waitKey(0)

    elif choice == '1':
        # Try DirectShow first on Windows, then fall back to the default backend
        cap = cv2.VideoCapture(CAMERA_ID, cv2.CAP_DSHOW)
        if not cap.isOpened():
            cap = cv2.VideoCapture(CAMERA_ID)
        cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)

        if not cap.isOpened():
            print('Could not open the camera')
            exit()

        print('\n[Camera controls]')
        print('Space key: capture an image and analyze it')
        print('q key: quit')

        try:
            while True:
                ret, frame = cap.read()
                if not ret:
                    break
                cv2.imshow('Camera', frame)
                key = cv2.waitKey(1) & 0xFF
                if key == ord(' '):
                    processed_img, result, current_time = image_processing(frame)
                    cv2.imshow('BLIP-2 analysis', processed_img)
                    print(datetime.fromtimestamp(current_time).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], result)
                    results_log.append(result)
                elif key == ord('q'):
                    break
        finally:
            cap.release()
    else:
        urls = [
            "https://raw.githubusercontent.com/opencv/opencv/master/samples/data/fruits.jpg",
            "https://raw.githubusercontent.com/opencv/opencv/master/samples/data/messi5.jpg",
            "https://raw.githubusercontent.com/opencv/opencv/master/samples/data/aero3.jpg",
            "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
        ]
        downloaded_files = []
        for i, url in enumerate(urls):
            try:
                urllib.request.urlretrieve(url, f"sample_{i}.jpg")
                downloaded_files.append(f"sample_{i}.jpg")
            except Exception:
                print(f"Failed to download image: {url}")
        process_and_display_images(downloaded_files, 'file')
        cv2.waitKey(0)

        # Delete the temporary files
        for filename in downloaded_files:
            try:
                os.remove(filename)
            except OSError:
                pass

finally:
    print('\n=== Program finished ===')
    cv2.destroyAllWindows()
    if results_log:
        with open('result.txt', 'w', encoding='utf-8') as f:
            f.write('=== Results ===\n')
            f.write(f'Device: {str(device).upper()}\n')
            if device.type == 'cuda':
                f.write(f'GPU: {torch.cuda.get_device_name(0)}\n')
            f.write('\n')
            f.write('\n'.join(results_log))
        print('\nResults were saved to result.txt')
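
The future plan noted in the program header (regenerating with a larger token limit when the caption looks incomplete) could be approached roughly as follows. This is only a sketch under stated assumptions: generate is any callable that takes max_new_tokens and returns text, fake_generate is a stand-in used here instead of the BLIP-2 model, and "complete" is approximated by sentence-final punctuation. Since BLIP-2 captions often end without a period, the max_tokens cap is what guarantees termination.

```python
def looks_complete(text):
    """Heuristic: the text ends with sentence-final punctuation."""
    return text.rstrip().endswith((".", "!", "?"))


def generate_until_complete(generate, start_tokens=30, step=10, max_tokens=100):
    """Call generate(max_new_tokens) with growing limits until the output looks complete."""
    budget = start_tokens
    text = generate(budget)
    while not looks_complete(text) and budget + step <= max_tokens:
        budget += step
        text = generate(budget)
    return text, budget


# Stand-in generator for demonstration: truncates a fixed sentence by word count.
SENTENCE = "a bowl of fruit sits on a wooden table next to a window."


def fake_generate(max_new_tokens):
    return " ".join(SENTENCE.split()[:max_new_tokens])


text, used = generate_until_complete(fake_generate, start_tokens=5, step=5, max_tokens=30)
print(used, text)  # 15 a bowl of fruit sits on a wooden table next to a window.
```

In the actual program, generate would wrap the call to model.generate and processor.batch_decode for the current image.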