NAVER Cloud
Running a GPU job on NKS
한크크
2023. 7. 25. 15:29
# kct2 get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
gpu-w-33q5 Ready <none> 3d17h v1.25.8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.GPU.T4.G002.C016.M080.NET.SSD.B050.G001,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=gpu-w-33q5,kubernetes.io/os=linux,ncloud.com/nks-nodepool=gpu,nodeId=18462172,regionNo=1,type=gpu,zoneNo=3
test3-w-2qa6 Ready <none> 83d v1.25.8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.STAND.C008.M032.NET.SSD.B050.G002,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=test3-w-2qa6,kubernetes.io/os=linux,ncloud.com/nks-nodepool=test3,nodeId=17080373,regionNo=1,zoneNo=3
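>> The GPU node pool here already carries a type=gpu label (see the LABELS column above), which the nodeSelector used below relies on. If your node pool doesn't set such a label, you can add one manually; the key/value are just the ones this post uses:
# hypothetical: only needed if the label isn't already present
kct2 label node gpu-w-33q5 type=gpu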
1. Install the NVIDIA device plugin
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
>> Add a nodeSelector so the plugin is deployed only on GPU instances
vi nvidia-device-plugin.yml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      nodeSelector: ## node label added
        type: "gpu"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
---
kct2 apply -f nvidia-device-plugin.yml
kct2 get pods -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-fspr7 1/1 Running 0 40s 198.18.2.218 gpu-w-33q5 <none> <none>
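>> Once the plugin pod is Running, the GPU node should advertise an nvidia.com/gpu resource. A quick way to check (node name from the output above; a single-T4 node should report 1):
kct2 get node gpu-w-33q5 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'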
2. Run a PyTorch job
The files from https://github.com/pytorch/examples/tree/main/mnist were downloaded to a NAS volume, which is then mounted into the pod.
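>> As a rough sketch of staging those files, assuming the NAS volume is mounted at /mnt/nas on a host that can reach it (the mount point is hypothetical):
# assumption: /mnt/nas is where the NAS export 169.254.82.85:/n2534632_pvc83026478097d4f288 is mounted
mount -t nfs 169.254.82.85:/n2534632_pvc83026478097d4f288 /mnt/nas
git clone https://github.com/pytorch/examples.git /mnt/nas/examples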
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-pod
spec:
  ttlSecondsAfterFinished: 10
  template:
    spec:
      nodeSelector: # nodeSelector so the pod is scheduled onto a GPU node
        type: "gpu"
      containers:
      - name: pytorch-container
        image: pytorch/pytorch
        command:
        - "/bin/sh"
        - "-c"
        args:
        - cd ./mnist && pwd && python3 main.py && echo "complete"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: examples
          mountPath: /workspace
      restartPolicy: OnFailure
      volumes: # files needed for the PyTorch job were downloaded to the NAS volume, mounted here
      - name: examples
        nfs:
          server: 169.254.82.85
          path: /n2534632_pvc83026478097d4f288/examples
          readOnly: false
kct2 apply -f pytorch-job.yaml
kct2 logs pytorch-pod-9p72l -f
/workspace/mnist
Train Epoch: 1 [0/60000 (0%)] Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.383654
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.893991
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.607930
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.358046
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.448105
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.274314
Train Epoch: 1 [4480/60000 (7%)] Loss: 0.618691
Train Epoch: 1 [5120/60000 (9%)] Loss: 0.241671
Train Epoch: 1 [5760/60000 (10%)] Loss: 0.265854
Train Epoch: 1 [6400/60000 (11%)] Loss: 0.292246
Train Epoch: 1 [7040/60000 (12%)] Loss: 0.203914
Train Epoch: 1 [7680/60000 (13%)] Loss: 0.353010
Train Epoch: 1 [8320/60000 (14%)] Loss: 0.173982
Train Epoch: 1 [8960/60000 (15%)] Loss: 0.330888
Train Epoch: 1 [9600/60000 (16%)] Loss: 0.189820
Train Epoch: 1 [10240/60000 (17%)] Loss: 0.276857
Train Epoch: 1 [10880/60000 (18%)] Loss: 0.243717
Train Epoch: 1 [11520/60000 (19%)] Loss: 0.223437
Train Epoch: 1 [12160/60000 (20%)] Loss: 0.125721
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.262643
Train Epoch: 1 [13440/60000 (22%)] Loss: 0.079488
Train Epoch: 1 [14080/60000 (23%)] Loss: 0.154174
Train Epoch: 1 [14720/60000 (25%)] Loss: 0.174587
Train Epoch: 1 [15360/60000 (26%)] Loss: 0.375594
Train Epoch: 1 [16000/60000 (27%)] Loss: 0.375659
Train Epoch: 1 [16640/60000 (28%)] Loss: 0.091523
Train Epoch: 1 [17280/60000 (29%)] Loss: 0.142970
Train Epoch: 1 [17920/60000 (30%)] Loss: 0.232557
Train Epoch: 1 [18560/60000 (31%)] Loss: 0.212773
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.160379
Train Epoch: 1 [19840/60000 (33%)] Loss: 0.097161
Train Epoch: 1 [20480/60000 (34%)] Loss: 0.203838
Train Epoch: 1 [21120/60000 (35%)] Loss: 0.135524
Train Epoch: 1 [21760/60000 (36%)] Loss: 0.350136
Train Epoch: 1 [22400/60000 (37%)] Loss: 0.308733
>> Connect to the PyTorch pod and check GPU utilization
kct2 exec -it pytorch-pod-9p72l /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
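>> As the warning says, the non-deprecated form separates the command with --:
kct2 exec -it pytorch-pod-9p72l -- /bin/sh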
# nvidia-smi
Tue Jul 25 05:33:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:06.0 Off |                  Off |
| N/A   55C    P0    47W /  70W |   1958MiB / 16127MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
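>> From the same shell you can also confirm that PyTorch itself sees the GPU; on this node it should print True and the device name (Tesla T4 here):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"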
3. Run Jupyter
vi jupyter.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-jupyter
  labels:
    app: jupyter
spec:
  nodeSelector:
    type: "gpu"
  containers:
  - name: tf-jupyter-container
    image: tensorflow/tensorflow:latest-gpu-jupyter
    volumeMounts:
    - mountPath: /notebooks
      name: host-volume
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
    command: ["/bin/sh"]
    args: ["-c", "jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token= && echo complete"]
  volumes:
  - name: host-volume
    nfs:
      server: 169.254.82.85
      path: /n2534632_pvc83026478097d4f288/examples
      readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-svc
spec:
  type: LoadBalancer
  selector:
    app: jupyter
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8888
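kct2 apply -f jupyter.yaml
kct2 get svc jupyter-svc
>> Once the EXTERNAL-IP of jupyter-svc is populated, open http://<EXTERNAL-IP> in a browser. Note that --NotebookApp.token= disables token authentication, so this setup is only suitable for testing.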