  • Running a GPU job on NKS
    NAVER Cloud 2023. 7. 25. 15:29
    # kct2 get node --show-labels
    NAME           STATUS   ROLES    AGE     VERSION   LABELS
    gpu-w-33q5     Ready    <none>   3d17h   v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.GPU.T4.G002.C016.M080.NET.SSD.B050.G001,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=gpu-w-33q5,kubernetes.io/os=linux,ncloud.com/nks-nodepool=gpu,nodeId=18462172,regionNo=1,type=gpu,zoneNo=3
    test3-w-2qa6   Ready    <none>   83d     v1.25.8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=SVR.VSVR.STAND.C008.M032.NET.SSD.B050.G002,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=1,failure-domain.beta.kubernetes.io/zone=3,kubernetes.io/arch=amd64,kubernetes.io/hostname=test3-w-2qa6,kubernetes.io/os=linux,ncloud.com/nks-nodepool=test3,nodeId=17080373,regionNo=1,zoneNo=3
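    The `type=gpu` label on the GPU node pool is what drives scheduling later in this post. Conceptually, `nodeSelector` is just a label-subset match: a node is eligible only if every key/value pair in the pod's selector appears in the node's labels. A minimal Python sketch (illustrative only, not actual scheduler code; labels abbreviated from the output above):

```python
# Illustrative sketch of nodeSelector semantics: a node qualifies only if
# every selector key/value pair is present among the node's labels.

def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """Return True if all selector pairs are present in the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# Labels abbreviated from `kct2 get node --show-labels` above.
gpu_node = {"ncloud.com/nks-nodepool": "gpu", "type": "gpu"}
cpu_node = {"ncloud.com/nks-nodepool": "test3"}

selector = {"type": "gpu"}
print(matches_node_selector(gpu_node, selector))  # True
print(matches_node_selector(cpu_node, selector))  # False
```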

    1. Install the NVIDIA device plugin

    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
    
    >> Add a nodeSelector so the plugin is deployed only to the GPU instances
    vi nvidia-device-plugin.yml
    
    ---
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          nodeSelector:   ## match the node label added above
            type: "gpu"
          containers:
          - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
            name: nvidia-device-plugin-ctr
            env:
              - name: FAIL_ON_INIT_ERROR
                value: "false"
    ---
    
    kct2 apply -f nvidia-device-plugin.yml
    
    kct2 get pods -n kube-system -o wide | grep nvidia
    nvidia-device-plugin-daemonset-fspr7      1/1     Running   0          40s     198.18.2.218   gpu-w-33q5     <none>           <none>
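    Once the plugin pod is running, the kubelet advertises `nvidia.com/gpu` as an extended resource on the node, and the scheduler treats it as plain integer capacity accounting. A rough sketch of that fit check (illustrative only; the T4 node in this post exposes a single GPU):

```python
# Illustrative sketch: scheduling against an extended resource such as
# nvidia.com/gpu is simple integer bookkeeping per node.

def fits(node_allocatable: dict, pod_requests: dict) -> bool:
    """A pod fits only if every requested resource is available on the node."""
    return all(node_allocatable.get(res, 0) >= qty
               for res, qty in pod_requests.items())

gpu_node = {"cpu": 16, "nvidia.com/gpu": 1}  # device plugin installed
cpu_node = {"cpu": 8}                        # no plugin -> resource absent

pod = {"nvidia.com/gpu": 1}
print(fits(gpu_node, pod))  # True
print(fits(cpu_node, pod))  # False
```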

    2. Run a PyTorch job

    I downloaded the files from https://github.com/pytorch/examples/tree/main/mnist to a NAS volume and mounted it into the pod.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-pod
    spec:
      ttlSecondsAfterFinished: 10
      template:
        spec:
          nodeSelector: # schedule onto the GPU node
            type: "gpu"
          containers:
          - name: pytorch-container
            image: pytorch/pytorch
            command:
            - "/bin/sh"
            - "-c"
            args:
            - cd ./mnist && pwd && python3 main.py && echo "complete"
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: examples
              mountPath: /workspace
          volumes:  # files needed for the PyTorch job, downloaded to the NAS volume and mounted via NFS
          - name: examples
            nfs:
              server: 169.254.82.85
              path: /n2534632_pvc83026478097d4f288/examples
              readOnly: false
            # alternatively, mount the same volume through a PVC:
            # persistentVolumeClaim:
            #   claimName: csi-pod-1
          restartPolicy: OnFailure
    kct2 apply -f pytorch-job.yaml

    kct2 logs pytorch-pod-9p72l -f
    /workspace/mnist
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.282550
    Train Epoch: 1 [640/60000 (1%)] Loss: 1.383654
    Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.893991
    Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.607930
    Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.358046
    Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.448105
    Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.274314
    Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.618691
    Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.241671
    Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.265854
    Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.292246
    Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.203914
    Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.353010
    Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.173982
    Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.330888
    Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.189820
    Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.276857
    Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.243717
    Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.223437
    Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.125721
    Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.262643
    Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.079488
    Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.154174
    Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.174587
    Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.375594
    Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.375659
    Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.091523
    Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.142970
    Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.232557
    Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.212773
    Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.160379
    Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.097161
    Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.203838
    Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.135524
    Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.350136
    Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.308733
    
    >> Connect to the pytorch pod and check GPU utilization
    kct2 exec -it pytorch-pod-9p72l /bin/sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    # nvidia-smi
    Tue Jul 25 05:33:11 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:06.0 Off |                  Off |
    | N/A   55C    P0    47W /  70W |   1958MiB / 16127MiB |     16%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
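    Besides the table above, `nvidia-smi` also supports machine-readable output, e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader`, which is handier for monitoring scripts. A small sketch parsing that CSV form; the sample line is hard-coded here to mirror the table (16% utilization, 1958 MiB used):

```python
# Parse one line of `nvidia-smi --query-gpu=utilization.gpu,memory.used
# --format=csv,noheader` output, e.g. "16 %, 1958 MiB".

def parse_gpu_line(line: str) -> tuple:
    """Parse 'NN %, MMMM MiB' into (utilization_pct, memory_used_mib)."""
    util_part, mem_part = line.split(",")
    util = int(util_part.strip().rstrip("%").strip())
    mem = int(mem_part.strip().split()[0])
    return util, mem

sample = "16 %, 1958 MiB"  # values taken from the table above
print(parse_gpu_line(sample))  # (16, 1958)
```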

     

    3. Run Jupyter

    vi jupyter.yaml
    ---
    
    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-jupyter
      labels:
        app: jupyter
    spec:
      nodeSelector:
       type: "gpu"
      containers:
        - name: tf-jupyter-container
          image: tensorflow/tensorflow:latest-gpu-jupyter
          volumeMounts:
            - mountPath: /notebooks
              name: host-volume
          resources:
            limits:
               nvidia.com/gpu: 2 # requesting 2 GPUs
          command: ["/bin/sh"]
          args: ["-c", "jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token="]
      volumes:
      - name: host-volume
        nfs:
          server: 169.254.82.85
          path: /n2534632_pvc83026478097d4f288/examples
          readOnly: false
    
    
    ---
    
    apiVersion: v1
    kind: Service
    metadata:
      name: jupyter-svc
    spec:
      type: LoadBalancer
      selector:
        app: jupyter
      ports:
       - protocol: TCP
         port: 80
         targetPort: 8888
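    One subtlety in the jupyter Pod's `command`/`args` above: with `sh -c`, only the string immediately after `-c` is executed; any further list items become `$0`, `$1`, ... inside that command and are never run as extra commands. To chain commands, join them in the one `-c` string with `&&`. A quick demonstration via `subprocess`:

```python
import subprocess

# Only the first argument after -c is executed; "echo two" here becomes
# $0 of the shell and is NOT run as a second command.
first = subprocess.run(["sh", "-c", "echo one", "echo two"],
                       capture_output=True, text=True)
print(repr(first.stdout))  # 'one\n'

# To run two commands, chain them inside the single -c string instead:
both = subprocess.run(["sh", "-c", "echo one && echo two"],
                      capture_output=True, text=True)
print(repr(both.stdout))  # 'one\ntwo\n'
```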

     
