
Running a GPU Job on NKS

한크크 2023. 7. 25. 15:29
First, list the node labels to confirm which node pool holds the GPU nodes that the workloads should be scheduled onto.

# kct2 get node --show-labels
NAME           STATUS   ROLES    AGE     VERSION   LABELS
gpu-w-33q5     Ready    <none>   3d17h   v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.GPU.T4.G002.C016.M080.NET.SSD.B050.G001,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=gpu-w-33q5,kubernetes.io/os=linux,ncloud.com/nks-nodepool=gpu,nodeId=18462172,regionNo=1,type=gpu,zoneNo=3
test3-w-2qa6   Ready    <none>   83d     v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.STAND.C008.M032.NET.SSD.B050.G002,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=test3-w-2qa6,kubernetes.io/os=linux,ncloud.com/nks-nodepool=test3,nodeId=17080373,regionNo=1,zoneNo=3
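
The GPU node already carries a type=gpu label, which the nodeSelector settings used below will match; GPU nodes can therefore be listed directly, for example:

# kct2 get nodes -l type=gpu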

1. Install the NVIDIA device plugin

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

>> Add a nodeSelector so the plugin is deployed only on the GPU instances (only the relevant part of the downloaded DaemonSet spec is shown below).
vi nvidia-device-plugin.yml

---
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      nodeSelector:   ## added: schedule the plugin only on nodes labeled type=gpu
        type: "gpu"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
---

kct2 apply -f nvidia-device-plugin.yml

 kct2 get pods -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-fspr7      1/1     Running   0          40s     198.18.2.218   gpu-w-33q5     <none>           <none>
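
Once the device plugin pod is running, the GPU node should advertise the nvidia.com/gpu resource in its capacity and allocatable fields (a quick check, using the node name from the listing above):

kct2 describe node gpu-w-33q5 | grep -i nvidia.com/gpu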

2. Run a PyTorch job

The files from https://github.com/pytorch/examples/tree/main/mnist were downloaded to a NAS volume and then mounted into the pod.
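
For reference, a rough sketch of how the example could have been staged on the NAS volume, assuming the NFS export is mounted on a temporary server in the same VPC (the mount point /mnt/nas and the clone path are illustrative):

# mount the NAS export and copy the mnist example into the path the Job mounts below
mount -t nfs 169.254.82.85:/n2534632_pvc83026478097d4f288 /mnt/nas
git clone https://github.com/pytorch/examples.git /tmp/pytorch-examples
mkdir -p /mnt/nas/examples
cp -r /tmp/pytorch-examples/mnist /mnt/nas/examples/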

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-pod
spec:
  ttlSecondsAfterFinished: 10
  template:
    spec:
      nodeSelector: # schedule the job's pod onto the GPU node
        type: "gpu"
      containers:
      - name: pytorch-container
        image: pytorch/pytorch
        command:
        - "/bin/sh"
        - "-c"
        args:
        - cd ./mnist && pwd && python3 main.py && echo "complete"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: examples
          mountPath: /workspace
      volumes:  # files needed for the PyTorch job were downloaded to the NAS volume and are mounted from there
      - name: examples
        nfs:
          server: 169.254.82.85
          path: /n2534632_pvc83026478097d4f288/examples
          readOnly: false
      restartPolicy: OnFailure

The same NAS volume could also be mounted through the PersistentVolumeClaim csi-pod-1 instead of referencing the NFS server directly.
 kct2 apply -f pytorch-job.yaml
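
The Job creates a pod with a generated name suffix; it can be looked up through the job-name label before tailing its logs:

 kct2 get pods -l job-name=pytorch-pod -o wide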
 
 kct2 logs pytorch-pod-9p72l -f
/workspace/mnist
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.383654
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.893991
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.607930
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.358046
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.448105
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.274314
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.618691
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.241671
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.265854
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.292246
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.203914
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.353010
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.173982
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.330888
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.189820
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.276857
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.243717
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.223437
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.125721
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.262643
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.079488
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.154174
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.174587
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.375594
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.375659
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.091523
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.142970
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.232557
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.212773
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.160379
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.097161
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.203838
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.135524
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.350136
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.308733

>> Attach to the PyTorch pod and check GPU utilization
kct2 exec -it pytorch-pod-9p72l /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
# nvidia-smi
Tue Jul 25 05:33:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:06.0 Off |                  Off |
| N/A   55C    P0    47W /  70W |   1958MiB / 16127MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
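
Inside the same pod, PyTorch should also report the GPU (a quick sanity check; the device name printed depends on the instance type):

# python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"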

 

3. Run Jupyter

vi jupyter.yaml
---

apiVersion: v1
kind: Pod
metadata:
  name: tf-jupyter
  labels:
    app: jupyter
spec:
  nodeSelector:
   type: "gpu"
  containers:
    - name: tf-jupyter-container
      image: tensorflow/tensorflow:latest-gpu-jupyter
      volumeMounts:
        - mountPath: /notebooks
          name: host-volume
      resources:
        limits:
           nvidia.com/gpu: 2 # requesting 2 GPUs
      command: ["/bin/sh"]
      args: ["-c","jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token= ","echo complete"]
  volumes:
  - name: host-volume
    nfs:
      server: 169.254.82.85
      path: /n2534632_pvc83026478097d4f288/examples
      readOnly: false


---

apiVersion: v1
kind: Service
metadata:
  name: jupyter-svc
spec:
  type: LoadBalancer
  selector:
    app: jupyter
  ports:
   - protocol: TCP
     port: 80
     targetPort: 8888
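
Apply the jupyter.yaml file above and look up the LoadBalancer endpoint from the EXTERNAL-IP column; the notebook is then reachable on port 80, which the service forwards to 8888 in the container:

kct2 apply -f jupyter.yaml
kct2 get svc jupyter-svc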

 
