Running a GPU job on NKS
# kct2 get node --show-labels
NAME           STATUS   ROLES    AGE     VERSION   LABELS
gpu-w-33q5     Ready    <none>   3d17h   v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.GPU.T4.G002.C016.M080.NET.SSD.B050.G001,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=gpu-w-33q5,kubernetes.io/os=linux,ncloud.com/nks-nodepool=gpu,nodeId=18462172,regionNo=1,type=gpu,zoneNo=3
test3-w-2qa6   Ready    <none>   83d     v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.STAND.C008.M032.NET.SSD.B050.G002,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=test3-w-2qa6,kubernetes.io/os=linux,ncloud.com/nks-nodepool=test3,nodeId=17080373,regionNo=1,zoneNo=3
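The `type=gpu` label on `gpu-w-33q5` is what the nodeSelector entries in the steps below match. A label-selector query is a quick way to confirm which nodes carry it. This is a minimal sketch that assumes `kct2` is an alias for `kubectl` against this cluster; `gpu_nodes` is a hypothetical helper name.

```shell
# List the nodes that carry a given label (defaults to type=gpu).
# Assumption: plain kubectl is used here in place of the kct2 alias.
gpu_nodes() {
  kubectl get nodes -l "${1:-type=gpu}" -o name
}

# Usage:
#   gpu_nodes                          # nodes labelled type=gpu
#   gpu_nodes ncloud.com/nks-nodepool=gpu
```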
1. Install the NVIDIA device plugin
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

>> Add a nodeSelector so that the plugin is installed only on GPU instances

vi nvidia-device-plugin.yml
---
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      nodeSelector:   ## node label added
        type: "gpu"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
---

kct2 apply -f nvidia-device-plugin.yml
kct2 get pods -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-fspr7   1/1   Running   0   40s   gpu-w-33q5   <none>   <none>
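Once the plugin pod is running, the node should advertise `nvidia.com/gpu` as an allocatable resource, and this is what the `resources.limits` entries in the later manifests draw from. The check can be sketched as below, assuming plain `kubectl` stands in for the `kct2` alias; `gpu_allocatable` is a hypothetical helper name.

```shell
# Print how many GPUs a node advertises to the scheduler.
# The dot inside the resource name must be escaped in the jsonpath expression.
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage:
#   gpu_allocatable gpu-w-33q5
```

Empty output would suggest the device plugin has not registered the GPUs on that node yet.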
2. Run a PyTorch job
The files from https://github.com/pytorch/examples/tree/main/mnist were downloaded to a NAS volume, which was then mounted into the pod.
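One way to place the example on the NAS volume is to clone the repository on a machine where the volume is mounted, so that `mnist/main.py` ends up under an `examples` directory. This is only a sketch: the mount directory passed in is an assumption, and `fetch_mnist_example` is a hypothetical helper name.

```shell
# Clone pytorch/examples into <NAS mount dir>/examples.
# $1: directory where the NAS volume is mounted (hypothetical, e.g. /mnt/nas)
fetch_mnist_example() {
  git clone --depth 1 https://github.com/pytorch/examples.git "$1/examples"
}

# Usage:
#   fetch_mnist_example /mnt/nas
#   ls /mnt/nas/examples/mnist/main.py
```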
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-pod
spec:
  ttlSecondsAfterFinished: 10
  template:
    spec:
      nodeSelector:   # schedule the pod onto the GPU node
        type: "gpu"
      containers:
      - name: pytorch-container
        image: pytorch/pytorch
        command:
        - "/bin/sh"
        - "-c"
        args:
        - cd ./mnist && pwd && python3 main.py && echo "complete"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: examples
          mountPath: /workspace
      restartPolicy: OnFailure
      volumes:   # files needed by the PyTorch job, downloaded to the NAS volume and mounted here
      - name: examples
        nfs:
          server:
          path: /n2534632_pvc83026478097d4f288/examples
          readOnly: false
      # Alternative: a PVC-backed volume. A pod spec takes a single `volumes`
      # key, so use one or the other.
      # volumes:
      # - name: examples
      #   persistentVolumeClaim:
      #     claimName: csi-pod-1
kct2 apply -f pytorch-job.yaml
kct2 logs pytorch-pod-9p72l -f
/workspace/mnist
Train Epoch: 1 [0/60000 (0%)] Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.383654
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.893991
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.607930
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.358046
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.448105
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.274314
Train Epoch: 1 [4480/60000 (7%)] Loss: 0.618691
Train Epoch: 1 [5120/60000 (9%)] Loss: 0.241671
Train Epoch: 1 [5760/60000 (10%)] Loss: 0.265854
Train Epoch: 1 [6400/60000 (11%)] Loss: 0.292246
Train Epoch: 1 [7040/60000 (12%)] Loss: 0.203914
Train Epoch: 1 [7680/60000 (13%)] Loss: 0.353010
Train Epoch: 1 [8320/60000 (14%)] Loss: 0.173982
Train Epoch: 1 [8960/60000 (15%)] Loss: 0.330888
Train Epoch: 1 [9600/60000 (16%)] Loss: 0.189820
Train Epoch: 1 [10240/60000 (17%)] Loss: 0.276857
Train Epoch: 1 [10880/60000 (18%)] Loss: 0.243717
Train Epoch: 1 [11520/60000 (19%)] Loss: 0.223437
Train Epoch: 1 [12160/60000 (20%)] Loss: 0.125721
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.262643
Train Epoch: 1 [13440/60000 (22%)] Loss: 0.079488
Train Epoch: 1 [14080/60000 (23%)] Loss: 0.154174
Train Epoch: 1 [14720/60000 (25%)] Loss: 0.174587
Train Epoch: 1 [15360/60000 (26%)] Loss: 0.375594
Train Epoch: 1 [16000/60000 (27%)] Loss: 0.375659
Train Epoch: 1 [16640/60000 (28%)] Loss: 0.091523
Train Epoch: 1 [17280/60000 (29%)] Loss: 0.142970
Train Epoch: 1 [17920/60000 (30%)] Loss: 0.232557
Train Epoch: 1 [18560/60000 (31%)] Loss: 0.212773
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.160379
Train Epoch: 1 [19840/60000 (33%)] Loss: 0.097161
Train Epoch: 1 [20480/60000 (34%)] Loss: 0.203838
Train Epoch: 1 [21120/60000 (35%)] Loss: 0.135524
Train Epoch: 1 [21760/60000 (36%)] Loss: 0.350136
Train Epoch: 1 [22400/60000 (37%)] Loss: 0.308733

>> Connect to the PyTorch pod and check GPU utilization

kct2 exec -it pytorch-pod-9p72l /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
# nvidia-smi
Tue Jul 25 05:33:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:06.0 Off |                  Off |
| N/A   55C    P0    47W /  70W |   1958MiB / 16127MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
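The same utilization figures can be read without opening an interactive shell, using the `--` form that the deprecation warning recommends together with the query mode of `nvidia-smi`. A minimal sketch, assuming plain `kubectl` in place of the `kct2` alias; `pod_gpu_usage` is a hypothetical helper name.

```shell
# Print GPU utilization and memory usage from inside a running pod as one CSV line.
# $1: pod name, e.g. pytorch-pod-9p72l
pod_gpu_usage() {
  kubectl exec "$1" -- nvidia-smi \
    --query-gpu=utilization.gpu,memory.used,memory.total \
    --format=csv,noheader
}

# Usage:
#   pod_gpu_usage pytorch-pod-9p72l
```

The CSV form is easier to poll in a loop or feed into a script than the full table.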
3. Run Jupyter
vi jupyter.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-jupyter
  labels:
    app: jupyter
spec:
  nodeSelector:
    type: "gpu"
  containers:
  - name: tf-juypter-container
    image: tensorflow/tensorflow:latest-gpu-jupyter
    volumeMounts:
    - mountPath: /notebooks
      name: host-volume
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
    command: ["/bin/sh"]
    args: ["-c","jupyter notebook --no-browser --ip= --allow-root --NotebookApp.token= ","echo complete"]
  volumes:
  - name: host-volume
    nfs:
      server:
      path: /n2534632_pvc83026478097d4f288/examples
      readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-svc
spec:
  type: LoadBalancer
  selector:
    app: jupyter
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8888
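After both objects are applied, the notebook is reachable on port 80 of whatever address the LoadBalancer Service receives. The lookup can be sketched as below, assuming plain `kubectl` in place of the `kct2` alias. Whether the load balancer reports an IP or a hostname depends on the provider, so the sketch reads both fields; `jupyter_url` is a hypothetical helper name.

```shell
# Print the external URL of the Jupyter Service once the load balancer is provisioned.
# $1: Service name (defaults to jupyter-svc)
jupyter_url() {
  addr=$(kubectl get svc "${1:-jupyter-svc}" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}')
  # Nothing is printed while the address is still pending.
  [ -n "$addr" ] && printf 'http://%s/\n' "$addr"
}

# Usage:
#   kubectl apply -f jupyter.yaml
#   jupyter_url
```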
