[Kubernetes] Model Serving


Introduction

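The goal is to use Kubernetes in an on-premises environment to load an MNIST model onto a GPU and return inference results.

Clients reach the server through a NodePort service, send API requests, and receive the results.

The steps are as follows.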

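  1. Obtain a simple model.
  2. Serve the model locally with FastAPI, loading it onto the GPU.
  3. Send API requests from a client to get predictions, and monitor the GPU.
  4. If steps 1 to 3 work, write a Dockerfile and package the server as an image.
  5. Build the image and push it to a private registry.
  6. Write a Deployment and expose it with a NodePort Service.
  7. Deploy the model through K8s.
  8. Connect to a node where the pod was scheduled and monitor the GPU.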
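
Step 3 can be exercised with a small client. The sketch below is not part of the original setup: it assumes the server exposes POST /predict/ with a form field named file (as in the FastAPI code later in this post) and uses only the Python standard library. The helper names build_multipart and predict are hypothetical.

```python
import json
import os
import uuid
import urllib.request


def build_multipart(field_name: str, filename: str, payload: bytes,
                    content_type: str = "image/png") -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(base_url: str, image_path: str, timeout: float = 10.0) -> dict:
    """POST an image file to <base_url>/predict/ and return the decoded JSON reply."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart(
            "file", os.path.basename(image_path), f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/predict/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For the local test, a call such as predict("http://localhost:8000", "digit.png") would be used, where digit.png is a placeholder for any digit image. Once the NodePort Service is deployed, only the base URL changes to the node's IP address and node port.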

Spec

(Node1) GPU: RTX 2060 | Master & Worker

(Node2) GPU: RTX A5000 | Worker

(Node3) GPU: none | Worker

Pre-installation

K8s is assumed to be installed already.

  • Install the NVIDIA driver
    • Check the CUDA version with nvidia-smi
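
The CUDA version check can also be scripted. The header of nvidia-smi reports the highest CUDA version the installed driver supports, so the CUDA version of the container image should not exceed it. The following is a minimal sketch; the function names are hypothetical, and it assumes nvidia-smi is on the PATH.

```python
import re
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version: X.Y' value from nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*([0-9]+\.[0-9]+)", smi_output)
    return match.group(1) if match else None


def driver_cuda_version() -> Optional[str]:
    """Run nvidia-smi and return the CUDA version the driver reports, if any."""
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    )
    return parse_cuda_version(result.stdout)
```

Keeping the parsing separate from the subprocess call lets the parser be checked against saved output on a machine that has no GPU.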

Serving

DockerFile Packaging

  • Write the ONNX Runtime serving code with FastAPI
    Code
    import io
    import logging
    import os
    import time
    
    import numpy as np
    import onnxruntime as ort
    from fastapi import FastAPI, File, UploadFile
    from PIL import Image
    
    # Write logs to ./logs/server.log as well as to stdout
    log_dir = "logs"
    os.makedirs(log_dir, exist_ok=True)
    os.chmod(log_dir, 0o777)
    
    log_file = os.path.join(log_dir, "server.log")
    
    
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[
            logging.FileHandler(log_file, mode="a"), 
            logging.StreamHandler()  
        ],
    )
    logger = logging.getLogger(__name__)
    
    app = FastAPI()
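    onnx_model_path = "./mnist-12.onnx"
    
    # Load the ONNX model, preferring the GPU provider
    logger.info(f"Loading ONNX model from {onnx_model_path}")
    
    available_providers = ort.get_available_providers()
    if "CUDAExecutionProvider" not in available_providers:
        logger.warning(f"CUDAExecutionProvider not available: {available_providers}")
    
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    session = ort.InferenceSession(onnx_model_path, providers=providers)
    # Log the providers the session actually uses, after loading has succeeded
    logger.info(f"ONNX model loaded successfully with providers: {session.get_providers()}")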
    
    # MNIST preprocess
    def preprocess_image(image_bytes):
        logger.info("Preprocessing input image...")
        start_time = time.time()
        
        image = Image.open(io.BytesIO(image_bytes)).convert("L")  
        image = image.resize((28, 28))  
        
        image_np = np.array(image, dtype=np.float32) / 255.0  
        image_np = image_np.reshape(1, 1, 28, 28)  # (b, c, h, w)
        
        elapsed_time = time.time() - start_time
        logger.info(f"Preprocessing completed in {elapsed_time:.4f} seconds.")
        
        return image_np
    
    @app.get("/health")
    async def health_check():
        return {"status": "healthy"}
    
    @app.post("/predict/")
    async def predict(file: UploadFile = File(...)):
        logger.info(f"Received file: {file.filename}")
    
        image_bytes = await file.read()
        input_tensor = preprocess_image(image_bytes)
    
        # inference
        input_name = session.get_inputs()[0].name
        output_name = session.get_outputs()[0].name
        
        logger.info("Running inference on the model...")
        start_time = time.time()
        outputs = session.run([output_name], {input_name: input_tensor})
        inference_time = time.time() - start_time
        logger.info(f"Inference completed in {inference_time:.4f} seconds.")
    
        prediction = np.argmax(outputs[0])
        logger.info(f"Predicted class: {prediction}")
    
        return {"prediction": int(prediction), "inference_time": round(inference_time, 4)}
    
    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)
  • Write requirements.txt
    Code
    fastapi
    uvicorn
    onnxruntime-gpu==1.19.2
    pillow
    numpy
    opencv-python
    python-multipart
  • Write the Dockerfile
    Code
    # Base image: CUDA 12.4 runtime on Ubuntu 22.04 (Python 3.10 is installed below)
    FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
    
    WORKDIR /app
    
    RUN apt-get update && apt-get install -y \
        python3.10 \
        python3-pip \
        libgl1 \
        libglib2.0-0 \
        wget
    
    RUN ln -s /usr/bin/python3.10 /usr/bin/python
    
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    
    COPY mnist.py .
    COPY mnist-12.onnx .
    RUN mkdir -p logs && chmod 777 logs
    
    # Install the CUDA toolkit and cuDNN; the toolkit version matches the 12.4 base image and the paths below
    RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
        dpkg -i cuda-keyring_1.1-1_all.deb && \
        apt-get update && \
        apt-get -y install cuda-toolkit-12-4 && \
        apt-get -y install cudnn
    
    ENV CUDA_VISIBLE_DEVICES=0
    ENV PATH=/usr/local/cuda-12.4/bin:$PATH
    ENV LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
    
    EXPOSE 8000
    
    CMD ["uvicorn", "mnist:app", "--host", "0.0.0.0", "--port", "8000"]
  • Set up a private Docker registry
    Code
    sudo docker run -d -p 5000:5000 --restart=always --name registry registry:2
  • Build and push the image
    Code
    sudo docker build -t mnist-serving:0.6 .
    sudo docker tag mnist-serving:0.6 localhost:5000/mnist-serving:0.6
    sudo docker push localhost:5000/mnist-serving:0.6
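
Before moving on to K8s, it can help to confirm that the push reached the registry. The sketch below queries the tags/list endpoint of the Docker Registry HTTP API v2 and checks for the expected tag. It assumes the registry is reachable over plain HTTP, as in the docker run command above; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def tags_url(registry: str, repository: str) -> str:
    """Build the Registry API v2 URL that lists the tags of a repository."""
    return f"http://{registry}/v2/{repository}/tags/list"


def has_tag(tags_response: str, tag: str) -> bool:
    """Check a tags/list JSON reply, e.g. {"name": "...", "tags": ["0.6"]}."""
    payload = json.loads(tags_response)
    # The registry may return null for "tags" when the repository is empty.
    return tag in (payload.get("tags") or [])


def image_is_pushed(registry: str, repository: str, tag: str,
                    timeout: float = 5.0) -> bool:
    """Return True if <registry>/<repository>:<tag> is listed by the registry."""
    try:
        with urllib.request.urlopen(tags_url(registry, repository),
                                    timeout=timeout) as response:
            return has_tag(response.read().decode("utf-8"), tag)
    except urllib.error.HTTPError as error:
        if error.code == 404:  # repository unknown to the registry
            return False
        raise
```

For the image above, the call would be image_is_pushed("localhost:5000", "mnist-serving", "0.6") on the registry host.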

K8s

  • Deployment.yaml (replicas: 2)
    Code
    # mnist_deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mnist-serving
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: mnist-serving
      template:
        metadata:
          labels:
            app: mnist-serving
        spec:
          containers:
          - name: mnist-container
            image: 192.168.0.32:5000/mnist-serving:0.6
            imagePullPolicy: Always
            ports:
            - containerPort: 8000
            resources:
              limits:
                memory: "512Mi"
                cpu: "500m"
                nvidia.com/gpu: 1
            volumeMounts:
            - name: logs-volume
              mountPath: /app/logs
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          volumes:
          - name: logs-volume
            hostPath:
              path: /var/log/mnist-serving
              type: DirectoryOrCreate
  • Service.yaml
    Code
    # mnist_service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: mnist-service
    spec:
      type: NodePort
      selector:
        app: mnist-serving
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
          nodePort: 30080
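  • Apply the manifests. Running kubectl get pods -o wide should then show the two replicas deployed on the two GPU nodes.
    kubectl apply -f mnist_deployment.yaml
    kubectl apply -f mnist_service.yaml
  • Open a shell on one of the nodes and check GPU utilization with nvidia-smi to confirm that the pod is actually using the GPU.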
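
The GPU check in the last step can be automated on the node. The following sketch calls nvidia-smi with its --query-gpu and --format=csv,noheader,nounits options and parses the result. The GpuSample type and function names are hypothetical, and the parser assumes every queried field is numeric (some GPUs report [N/A] for certain fields, which this sketch does not handle).

```python
import subprocess
from dataclasses import dataclass

QUERY_FIELDS = "index,name,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    index: int
    name: str
    utilization_pct: int
    memory_used_mib: int
    memory_total_mib: int


def parse_gpu_samples(csv_output: str) -> list[GpuSample]:
    """Parse lines such as '0, NVIDIA GeForce RTX 2060, 35, 1024, 6144'."""
    samples = []
    for line in csv_output.strip().splitlines():
        index, name, util, used, total = [field.strip() for field in line.split(",")]
        samples.append(GpuSample(int(index), name, int(util), int(used), int(total)))
    return samples


def query_gpus() -> list[GpuSample]:
    """Run nvidia-smi once and return one sample per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_samples(result.stdout)
```

Calling query_gpus() in a loop while a client sends requests to /predict/ should show memory in use on the GPU assigned to the pod. A model as small as MNIST may barely move the utilization figure, so memory usage is usually the clearer signal.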