MediaPipeによるしぐさ検出（ソースコードと実行結果）

Python開発環境，ライブラリ類

ここでは、最低限の事前準備について説明する。機械学習や深層学習を行う場合は、NVIDIA CUDA、Visual Studio、Cursorなどを追加でインストールすると便利である。これらについては別ページ https://www.kkaneko.jp/cc/dev/aiassist.htmlで詳しく解説しているので、必要に応じて参照してください。

Python 3.12 のインストール

インストール済みの場合は実行不要。

管理者権限でコマンドプロンプトを起動（手順：Windowsキーまたはスタートメニュー > cmd と入力 > 右クリック > 「管理者として実行」）し、以下を実行する。管理者権限は、wingetの--scope machineオプションでシステム全体にソフトウェアをインストールするために必要である。

REM Python をシステム領域にインストール
winget install --scope machine --id Python.Python.3.12 -e --silent
REM Python のパス設定
set "PYTHON_PATH=C:\Program Files\Python312"
set "PYTHON_SCRIPTS_PATH=C:\Program Files\Python312\Scripts"
echo "%PATH%" | find /i "%PYTHON_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_PATH%" /M >nul
echo "%PATH%" | find /i "%PYTHON_SCRIPTS_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_SCRIPTS_PATH%" /M >nul

【関連する外部ページ】

Python の公式ページ: https://www.python.org/

AI エディタ Windsurf のインストール

Pythonプログラムの編集・実行には、AI エディタの利用を推奨する。ここでは，Windsurfのインストールを説明する。

管理者権限でコマンドプロンプトを起動（手順：Windowsキーまたはスタートメニュー > cmd と入力 > 右クリック > 「管理者として実行」）し、以下を実行して、Windsurfをシステム全体にインストールする。管理者権限は、wingetの--scope machineオプションでシステム全体にソフトウェアをインストールするために必要となる。

winget install --scope machine Codeium.Windsurf -e --silent

【関連する外部ページ】

Windsurf の公式ページ: https://windsurf.com/

必要なライブラリのインストール

コマンドプロンプトを管理者として実行（手順：Windowsキーまたはスタートメニュー > cmd と入力 > 右クリック > 「管理者として実行」）し、以下を実行する

pip install opencv-python mediapipe numpy scipy pillow
mkdir "C:\Program Files\Python312\Lib\site-packages\mediapipe\modules"
mkdir "C:\Program Files\Python312\Lib\site-packages\mediapipe\modules\pose_landmark"
icacls "C:\Program Files\Python312\Lib\site-packages\mediapipe\modules" /grant "%USERNAME%:(OI)(CI)F" /T
icacls "C:\Program Files\Python312\Lib\site-packages\mediapipe\modules\pose_landmark" /grant "%USERNAME%:(OI)(CI)F" /T

MediaPipeによるしぐさ検出プログラム

概要

このプログラムは動画像から人物の全身543点のランドマークを検出し、追跡することで身体動作を認識する。まばたき検出にはEAR（Eye Aspect Ratio）アルゴリズム[1]を使用し、手の震えには高周波成分抽出、肩の緊張には変動係数による評価を行う。

主要技術

MediaPipe Holistic: Googleが開発した機械学習フレームワークである。顔468点、両手21点ずつ（計42点）、体33点の合計543点のランドマークをリアルタイムで検出する[2]。BlazeFace、BlazePalm、BlazePoseの3つのモデルを統合している。
Eye Aspect Ratio (EAR): 目の開き具合を定量化するアルゴリズムである。6つの目のランドマーク点から垂直距離と水平距離の比率を計算する[1]。まばたき検出の標準的手法として使用されている。

参考文献

[1] Soukupová, T., & Čech, J. (2016). Real-time eye blink detection using facial landmarks. In 21st computer vision winter workshop (pp. 1-8). https://vision.fe.uni-lj.si/cvww2016/proceedings/papers/05.pdf
[2] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C. L., Yong, M. G., Lee, J., Chang, W. T., Hua, W., Georg, M., & Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172. https://arxiv.org/abs/1906.08172

ソースコード


# MediaPipeによるしぐさ検出プログラム
# 特徴技術名: MediaPipe Holistic
# 出典: Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C. L., Yong, M. G., Lee, J., Chang, W. T., Hua, W., Georg, M., & Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
# 特徴機能: 全身543点のランドマーク検出機能。顔468点、両手21点ずつ（計42点）、体33点の合計543点のランドマークをリアルタイムで同時検出し、人体の詳細な姿勢と動作を認識する。BlazeFace、BlazePalm、BlazePoseの3つのモデルを統合した包括的なソリューション。
# 学習済みモデル: MediaPipe Holisticは統合モデルとして、BlazeFace（顔検出）、BlazePalm（手検出）、BlazePose（姿勢検出）の学習済みモデルを含む。これらはmediapipeライブラリのインストール時に自動的に取得され、リアルタイムでの人体ランドマーク検出を実現する。
# 方式設計:
#   関連利用技術: OpenCV（動画入力・表示・画像処理）、NumPy（数値計算・配列処理）、Tkinter（ファイル選択GUI）、Pillow（日本語フォント描画）、Urllib（サンプル動画ダウンロード）
#   入力と出力: 入力: 動画（ユーザは「0:動画ファイル，1:カメラ，2:サンプル動画」のメニューで選択．0:動画ファイルの場合はtkinterでファイル選択．1の場合はOpenCVでカメラが開く．2の場合はhttps://raw.githubusercontent.com/opencv/opencv/master/samples/data/vtest.aviを使用）、出力: 動画
#   処理手順: 1.動画フレーム取得 2.MediaPipe Holisticによる543点ランドマーク検出 3.EARアルゴリズムによるまばたき検出 4.手の震え検出（高周波成分抽出） 5.肩の緊張検出（変動係数計算） 6.結果の画面表示
#   前処理、後処理: 前処理: RGB色空間変換（MediaPipe要求仕様）、後処理: BGR色空間復帰（OpenCV表示用）
#   追加処理: EAR計算による目の開き具合定量化、手ランドマークの高周波成分抽出による震え検出、肩ランドマークの位置変動係数による緊張検出
#   調整を必要とする設定値: EAR_THRESHOLD（まばたき検出閾値、通常0.25、個人差により調整必要）、HIGH_FREQ_THRESHOLD（手の震え検出閾値、高周波成分の基準値）、TENSION_THRESHOLD（肩の緊張検出閾値、変動係数の基準値）
#   算出・計算処理の検証: EAR計算式、高周波成分抽出、変動係数計算が正しく実装されていることを確認
# 将来方策: 閾値の自動調整機能、個人校正機能
# その他の重要事項: リアルタイム処理性能、照明条件への対応
# 前準備: pip install mediapipe opencv-python pillow numpy scipy を実行してください

import cv2
import mediapipe as mp
import numpy as np
import tkinter as tk
from tkinter import filedialog
import urllib.request
import os
import time
from datetime import datetime
from PIL import Image, ImageDraw, ImageFont
import math
from scipy import signal

frame_count = 0
results_log = []
fps_estimate = 30.0
frame_times = []

EAR_THRESHOLD = 0.25
HIGH_FREQ_THRESHOLD = 0.5
TENSION_THRESHOLD = 0.01
FONT_PATH = 'C:/Windows/Fonts/meiryo.ttc'
FONT_SIZE = 16

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

# 履歴データ
left_hand_history = []
right_hand_history = []
left_shoulder_history = []
right_shoulder_history = []
HISTORY_SIZE = 30
MIN_FILTER_LENGTH = 18

# MediaPipe Face Meshの正しい目のランドマーク番号
RIGHT_EYE_LANDMARKS = [33, 159, 158, 133, 153, 145]
LEFT_EYE_LANDMARKS = [362, 380, 374, 263, 386, 385]

def calculate_ear(eye_landmarks):
    # 正しいEAR計算式: (||p2-p6|| + ||p3-p5||) / (2 * ||p1-p4||)
    p1, p2, p3, p4, p5, p6 = eye_landmarks

    A = math.sqrt((p2[0] - p6[0])**2 + (p2[1] - p6[1])**2)
    B = math.sqrt((p3[0] - p5[0])**2 + (p3[1] - p5[1])**2)
    C = math.sqrt((p1[0] - p4[0])**2 + (p1[1] - p4[1])**2)

    if C == 0:
        return 0

    ear = (A + B) / (2.0 * C)
    return ear

def update_fps_estimate():
    global fps_estimate, frame_times
    current_time = time.time()
    frame_times.append(current_time)

    if len(frame_times) > 10:
        frame_times.pop(0)

    if len(frame_times) >= 2:
        dt = frame_times[-1] - frame_times[0]
        if dt > 0:
            fps_estimate = (len(frame_times) - 1) / dt

def extract_high_frequency_component(hand_landmarks, hand_history):
    # 手の重心位置を計算
    center_x = np.mean([lm[0] for lm in hand_landmarks])
    center_y = np.mean([lm[1] for lm in hand_landmarks])

    hand_history.append([center_x, center_y])
    if len(hand_history) > HISTORY_SIZE:
        hand_history.pop(0)

    if len(hand_history) < MIN_FILTER_LENGTH:
        return 0.0

    # 時系列データから高周波成分を抽出
    x_data = np.array([pos[0] for pos in hand_history])
    y_data = np.array([pos[1] for pos in hand_history])

    # ハイパスフィルタで高周波成分を抽出
    try:
        nyquist = 0.5 * fps_estimate
        high_cutoff = 2.0

        if high_cutoff >= nyquist:
            high_cutoff = nyquist * 0.8

        b, a = signal.butter(2, high_cutoff/nyquist, btype='high')

        # データ長がフィルタ要求を満たすか確認
        padlen = min(len(x_data) // 3, 9)

        if len(x_data) >= 3 * len(b):
            x_filtered = signal.filtfilt(b, a, x_data, padlen=padlen)
            y_filtered = signal.filtfilt(b, a, y_data, padlen=padlen)

            # 高周波成分の強度を計算
            high_freq_intensity = np.sqrt(np.mean(x_filtered**2) + np.mean(y_filtered**2))
            return high_freq_intensity
        else:
            return 0.0
    except (ValueError, RuntimeWarning) as e:
        return 0.0

def calculate_shoulder_tension(shoulder_history):
    if len(shoulder_history) < 5:
        return 0.0

    # 肩の位置の変動係数を計算
    positions = np.array(shoulder_history)
    x_coords = positions[:, 0]
    y_coords = positions[:, 1]

    if len(x_coords) > 1:
        x_std = np.std(x_coords)
        y_std = np.std(y_coords)
        x_mean = np.mean(x_coords)
        y_mean = np.mean(y_coords)

        if abs(x_mean) > 1e-6 and abs(y_mean) > 1e-6:
            cv_x = x_std / abs(x_mean)
            cv_y = y_std / abs(y_mean)
            return (cv_x + cv_y) / 2

    return 0.0

def draw_eye_landmarks(image, eye_landmarks, color=(0, 255, 0)):
    # 目のランドマーク6点を描画
    for point in eye_landmarks:
        cv2.circle(image, (int(point[0]), int(point[1])), 2, color, -1)

    # 目の輪郭を線で結ぶ
    points = np.array(eye_landmarks, np.int32)
    cv2.polylines(image, [points], True, color, 1)

def draw_hand_landmarks(image, hand_landmarks, color=(255, 0, 0)):
    # 手のランドマークを描画
    for point in hand_landmarks:
        cv2.circle(image, (int(point[0]), int(point[1])), 2, color, -1)

def draw_shoulder_landmarks(image, left_shoulder, right_shoulder):
    # 肩のランドマークを描画
    cv2.circle(image, (int(left_shoulder[0]), int(left_shoulder[1])), 5, (0, 0, 255), -1)
    cv2.circle(image, (int(right_shoulder[0]), int(right_shoulder[1])), 5, (0, 0, 255), -1)
    cv2.line(image, (int(left_shoulder[0]), int(left_shoulder[1])),
             (int(right_shoulder[0]), int(right_shoulder[1])), (0, 0, 255), 2)

def video_frame_processing(frame):
    global frame_count
    current_time = time.time()
    frame_count += 1

    update_fps_estimate()

    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    with mp_holistic.Holistic(
        static_image_mode=False,
        model_complexity=1,
        enable_segmentation=False,
        refine_face_landmarks=True) as holistic:

        results = holistic.process(rgb_frame)

        processed_frame = frame.copy()

        # 初期化
        right_ear = left_ear = 0.0
        left_hand_tremor = right_hand_tremor = 0.0
        left_shoulder_tension = right_shoulder_tension = 0.0

        # まばたき検出（EAR計算）
        if results.face_landmarks:
            face_landmarks = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in results.face_landmarks.landmark]

            # 目のランドマークを取得して描画
            right_eye = [face_landmarks[i] for i in RIGHT_EYE_LANDMARKS]
            left_eye = [face_landmarks[i] for i in LEFT_EYE_LANDMARKS]

            right_ear = calculate_ear(right_eye)
            left_ear = calculate_ear(left_eye)

            # 目のランドマークのみ描画
            draw_eye_landmarks(processed_frame, right_eye, (0, 255, 0))
            draw_eye_landmarks(processed_frame, left_eye, (0, 255, 0))

        # 手の震え検出（高周波成分抽出）
        if results.left_hand_landmarks:
            left_hand_landmarks = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in results.left_hand_landmarks.landmark]
            left_hand_tremor = extract_high_frequency_component(left_hand_landmarks, left_hand_history)
            draw_hand_landmarks(processed_frame, left_hand_landmarks, (255, 0, 0))

        if results.right_hand_landmarks:
            right_hand_landmarks = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in results.right_hand_landmarks.landmark]
            right_hand_tremor = extract_high_frequency_component(right_hand_landmarks, right_hand_history)
            draw_hand_landmarks(processed_frame, right_hand_landmarks, (255, 0, 0))

        # 肩の緊張検出（変動係数）
        if results.pose_landmarks:
            pose_landmarks = [(lm.x * frame.shape[1], lm.y * frame.shape[0]) for lm in results.pose_landmarks.landmark]

            if len(pose_landmarks) >= 13:
                left_shoulder = pose_landmarks[11]
                right_shoulder = pose_landmarks[12]

                # 肩の履歴を更新
                left_shoulder_history.append(left_shoulder)
                right_shoulder_history.append(right_shoulder)

                if len(left_shoulder_history) > HISTORY_SIZE:
                    left_shoulder_history.pop(0)
                if len(right_shoulder_history) > HISTORY_SIZE:
                    right_shoulder_history.pop(0)

                left_shoulder_tension = calculate_shoulder_tension(left_shoulder_history)
                right_shoulder_tension = calculate_shoulder_tension(right_shoulder_history)

                # 肩のランドマークのみ描画
                draw_shoulder_landmarks(processed_frame, left_shoulder, right_shoulder)

        # 画面への数値表示
        text_lines = [
            f"右目EAR: {right_ear:.3f} 左目EAR: {left_ear:.3f}",
            f"右手震え: {right_hand_tremor:.3f} 左手震え: {left_hand_tremor:.3f}",
            f"右肩緊張: {right_shoulder_tension:.3f} 左肩緊張: {left_shoulder_tension:.3f}"
        ]

        # 状態判定
        blink_status = "まばたき" if (right_ear + left_ear) / 2 < EAR_THRESHOLD else "正常"
        right_hand_status = "震え" if right_hand_tremor > HIGH_FREQ_THRESHOLD else "正常"
        left_hand_status = "震え" if left_hand_tremor > HIGH_FREQ_THRESHOLD else "正常"
        right_shoulder_status = "緊張" if right_shoulder_tension > TENSION_THRESHOLD else "正常"
        left_shoulder_status = "緊張" if left_shoulder_tension > TENSION_THRESHOLD else "正常"

        status_lines = [
            f"目: {blink_status}",
            f"右手: {right_hand_status} 左手: {left_hand_status}",
            f"右肩: {right_shoulder_status} 左肩: {left_shoulder_status}"
        ]

        # 日本語フォント描画
        try:
            font = ImageFont.truetype(FONT_PATH, FONT_SIZE)
            img_pil = Image.fromarray(cv2.cvtColor(processed_frame, cv2.COLOR_BGR2RGB))
            draw = ImageDraw.Draw(img_pil)

            y_offset = 10
            for line in text_lines + status_lines:
                draw.text((10, y_offset), line, font=font, fill=(0, 255, 0))
                y_offset += 25

            processed_frame = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2BGR)
        except (OSError, IOError):
            y_offset = 30
            for line in text_lines + status_lines:
                cv2.putText(processed_frame, line, (10, y_offset), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
                y_offset += 25

        # print用の結果文字列
        result = f"右目EAR:{right_ear:.3f}, 左目EAR:{left_ear:.3f}, 右手震え:{right_hand_tremor:.3f}, 左手震え:{left_hand_tremor:.3f}, 右肩緊張:{right_shoulder_tension:.3f}, 左肩緊張:{left_shoulder_tension:.3f}"

        return processed_frame, result, current_time

print("MediaPipeによるしぐさ検出プログラム")
print("全身543点のランドマークを検出し、まばたき（EAR）・手の震え（高周波成分）・肩の緊張（変動係数）を分析します")
print("操作方法: qキーでプログラム終了")
print()
print("0: 動画ファイル")
print("1: カメラ")
print("2: サンプル動画")

choice = input("選択: ")

if choice == '0':
    root = tk.Tk()
    root.withdraw()
    path = filedialog.askopenfilename()
    if not path:
        print("ファイルが選択されませんでした")
        exit()
    cap = cv2.VideoCapture(path)
elif choice == '1':
    cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
    if not cap.isOpened():
        cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
else:
    SAMPLE_URL = 'https://raw.githubusercontent.com/opencv/opencv/master/samples/data/vtest.avi'
    SAMPLE_FILE = 'vtest.avi'
    try:
        urllib.request.urlretrieve(SAMPLE_URL, SAMPLE_FILE)
        cap = cv2.VideoCapture(SAMPLE_FILE)
    except:
        print("サンプル動画のダウンロードに失敗しました")
        exit()

if not cap.isOpened():
    print('動画ファイル・カメラを開けませんでした')
    exit()

print('\n=== 動画処理開始 ===')
print('操作方法:')
print('  q キー: プログラム終了')

try:
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        MAIN_FUNC_DESC = "MediaPipe しぐさ検出"
        processed_frame, result, current_time = video_frame_processing(frame)
        cv2.imshow(MAIN_FUNC_DESC, processed_frame)

        if choice == '1':
            print(datetime.fromtimestamp(current_time).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], result)
        else:
            print(frame_count, result)

        results_log.append(result)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

finally:
    print('\n=== プログラム終了 ===')
    cap.release()
    cv2.destroyAllWindows()

    if results_log:
        with open('result.txt', 'w', encoding='utf-8') as f:
            f.write('=== MediaPipeしぐさ検出結果 ===\n')
            f.write(f'処理フレーム数: {frame_count}\n')
            f.write(f'検出ランドマーク数: 543点（顔468点、両手42点、体33点）\n')
            f.write('\n=== 検出結果詳細 ===\n')
            for i, result in enumerate(results_log, 1):
                f.write(f'フレーム{i}: {result}\n')
        print(f'\n処理結果をresult.txtに保存しました')