Model Inference

Preparing Model Files and Datasets

Ensure that the server is connected to the network.

  1. Prepare the model implementation file and weight file.

    1. Configure the software source. For details, see Checking the Source Validity.
    2. Install and configure git-lfs (Ubuntu is used as an example).
      apt-get install -y git
      apt-get install -y git-lfs
      
      # Configure git-lfs.
      git lfs install

      If "Git LFS initialized" is displayed, git-lfs is configured successfully.

    3. Download the model implementation and weight files and save them to any path (for example, /home).
      git config --global http.sslVerify "false"
      git clone https://huggingface.co/THUDM/chatglm2-6b
      cd chatglm2-6b
      git reset --hard 4e38bef4c028beafc8fb1837462f74c02e68fcc2
    4. The directory structure of chatglm2-6b is as follows:
      |-- config.json
      |-- configuration_chatglm.py
      |-- modeling_chatglm.py
      |-- pytorch_model-00001-of-00007.bin
      |-- pytorch_model-00002-of-00007.bin
      |-- pytorch_model-00003-of-00007.bin
      |-- pytorch_model-00004-of-00007.bin
      |-- pytorch_model-00005-of-00007.bin
      |-- pytorch_model-00006-of-00007.bin
      |-- pytorch_model-00007-of-00007.bin
      |-- pytorch_model.bin.index.json
      |-- quantization.py
      |-- tokenization_chatglm.py
      |-- tokenizer_config.json
      |-- tokenizer.model
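
      If git-lfs was not initialized before the clone, the .bin shards may be small LFS pointer files rather than the actual weights. The following Python sketch flags suspicious files (the size threshold is a heuristic, not an official check):

      import os

      ckpt = "/home/chatglm2-6b"  # adjust to your download path
      for name in sorted(os.listdir(ckpt)):
          if name.endswith(".bin"):
              size = os.path.getsize(os.path.join(ckpt, name))
              # Real weight shards are roughly 1 GB each; LFS pointer files are under 1 KB.
              status = "OK" if size > 1024 * 1024 else "suspicious - run 'git lfs pull'"
              print(f"{name}: {size} bytes ({status})")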
    5. Add the "world_size" field shown below to the config.json file:
      { 
        ......
        "world_size": 1,
        ......
      }
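
      Alternatively, the field can be added programmatically. A minimal Python sketch, assuming config.json sits in the download path used above:

      import json

      path = "/home/chatglm2-6b/config.json"
      with open(path, "r", encoding="utf-8") as f:
          cfg = json.load(f)
      cfg["world_size"] = 1  # value required by this guide
      with open(path, "w", encoding="utf-8") as f:
          json.dump(cfg, f, indent=2, ensure_ascii=False)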

  2. Obtain the quantization weights.

    Contact Huawei technical support. After the file is obtained, upload it to any path (for example, /home) on the server and decompress it to obtain the quant_weight folder.

  3. Download the C-Eval dataset.

    Obtain the C-Eval dataset package, upload it to any path (for example, /home/dataset) on the server, and decompress it to obtain the CEval folder.

    The directory structure of the CEval folder is as follows:

    |-- test
    |-- val
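
    To confirm that the dataset decompressed correctly before running inference, a quick Python check (the path matches the DATASET value used in the next section):

    import os

    ds = "/home/dataset/CEval"
    for sub in ("test", "val"):
        p = os.path.join(ds, sub)
        if os.path.isdir(p):
            print(f"{sub}: {len(os.listdir(p))} entries")
        else:
            print(f"{sub}: missing - check the decompression path")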

Model Inference

Ensure that the server is connected to the network.

  1. Install the third-party dependencies. (/home/transformer-llm is only an example. Replace it with the actual path.)

    cd /home/transformer-llm/pytorch/examples/chatglm2_6b
    pip3 install -r requirements.txt
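
    To verify that key dependencies are importable after installation, a small Python sketch (the module names are assumptions; requirements.txt is authoritative):

    import importlib.util

    for mod in ("torch", "transformers", "sentencepiece"):
        found = importlib.util.find_spec(mod) is not None
        print(f"{mod}: {'installed' if found else 'MISSING'}")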

  2. Before inference, configure the following environment variables:

    export HCCL_BUFFSIZE=110
    export HCCL_OP_BASE_FFTS_MODE_ENABLE=1
    export TASK_QUEUE_ENABLE=1
    export ATB_OPERATION_EXECUTE_ASYNC=1
    export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
    
    # Optional: enabling multi-stream can improve performance on Atlas 300I Pro and Atlas 300I Duo.
    export ATB_USE_TILING_COPY_STREAM=1
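
    Because a missing variable does not raise an error at launch time, a pre-flight check can save a failed run. A minimal Python sketch covering the required variables above (the optional multi-stream switch is omitted):

    import os

    expected = {
        "HCCL_BUFFSIZE": "110",
        "HCCL_OP_BASE_FFTS_MODE_ENABLE": "1",
        "TASK_QUEUE_ENABLE": "1",
        "ATB_OPERATION_EXECUTE_ASYNC": "1",
        "ATB_LAYER_INTERNAL_TENSOR_REUSE": "1",
    }
    for name, want in expected.items():
        got = os.environ.get(name)
        status = "OK" if got == want else f"expected {want}, got {got}"
        print(f"{name}: {status}")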

  3. Perform inference on the C-Eval dataset.

    Run the following commands in the /home/transformer-llm/pytorch/examples/chatglm2_6b directory:
    # Set the path of the model implementation file and weight file.
    export CHECKPOINT=/home/chatglm2-6b
    
    # Set the path of the dataset.
    export DATASET=/home/dataset/CEval
    
    # Set the path of the quantization weight folder.
    export QUANT_WEIGHT_PATH=/home/quant_weight
    
    # Single-chip quantization
    export ENABLE_QUANT=1
    python3 generate_weights.py --model_path ${CHECKPOINT}
    python3 main.py --mode precision_dataset --model_path ${CHECKPOINT} --ceval_dataset ${DATASET} --batch 8 --device 0
    
    # Dual-chip quantization (Atlas 300I Duo)
    export ENABLE_QUANT=1
    python3 generate_weights.py --model_path ${CHECKPOINT} --tp_size 2
    torchrun --nproc_per_node 2 --master_port 2000 main.py --mode precision_dataset --model_path ${CHECKPOINT} --ceval_dataset ${DATASET} --batch 8 --tp_size 2 --device 0
    • Check whether error information similar to the following is displayed (/usr/local/gcc7.3.0/lib64/libgomp.so.1 is only an example):
      ImportError: /usr/local/gcc7.3.0/lib64/libgomp.so.1: cannot allocate memory in static TLS block
      ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 12591) of binary: /usr/local/python3.9.2/bin/python3.
      If yes, run the following command to configure the environment variable (/usr/local/gcc7.3.0/lib64/libgomp.so.1 is only an example):
      export LD_PRELOAD=/usr/local/gcc7.3.0/lib64/libgomp.so.1
    • Check whether error information similar to the following is displayed:
      ImportError: This modeling file requires the following packages that were not found in your environment: atb_speed. Run `pip install atb_speed`
      If yes, run the following commands (replace /home/transformer-llm with the actual model package path):
      cd /home/transformer-llm/pytorch/examples/atb_speed_sdk
      pip install .
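
    The single-chip and dual-chip invocations above differ only in the launcher and the tp_size argument. A hypothetical Python wrapper that rebuilds the same command lines (flags copied from the commands above; this script is not part of the model package):

    import os
    import subprocess

    tp_size = 1  # set to 2 for dual-chip runs on Atlas 300I Duo
    common = [
        "main.py", "--mode", "precision_dataset",
        "--model_path", os.environ["CHECKPOINT"],
        "--ceval_dataset", os.environ["DATASET"],
        "--batch", "8", "--device", "0",
    ]
    if tp_size == 1:
        cmd = ["python3"] + common
    else:
        cmd = ["torchrun", "--nproc_per_node", str(tp_size),
               "--master_port", "2000"] + common + ["--tp_size", str(tp_size)]
    subprocess.run(cmd, check=True)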

  4. Test the model performance.

    Run the following commands in the /home/transformer-llm/pytorch/examples/chatglm2_6b directory:
    # Set the path of the model implementation file and weight file.
    export CHECKPOINT=/home/chatglm2-6b
    
    # Set the path of the dataset.
    export DATASET=/home/dataset/CEval
    
    # Set the path of the quantization weight folder.
    export QUANT_WEIGHT_PATH=/home/quant_weight
    
    # Single-chip quantization
    export ENABLE_QUANT=1
    python3 generate_weights.py --model_path ${CHECKPOINT} # Skip this step if the weights have already been generated.
    python3 main.py --mode performance --model_path ${CHECKPOINT} --batch 8 --set_case_pair 1 --seqlen_in_pair 256,512,1024 --seqlen_out_pair 64,128,256 --device 0
    
    # Dual-chip quantization (Atlas 300I Duo)
    export ENABLE_QUANT=1
    python3 generate_weights.py --model_path ${CHECKPOINT} --tp_size 2 # Skip this step if the weights have already been generated.
    torchrun --nproc_per_node 2 --master_port 2000 main.py --mode performance --model_path ${CHECKPOINT} --batch 16 --tp_size 2 --set_case_pair 1 --seqlen_in_pair 256,512,1024 --seqlen_out_pair 64,128,256 --device 0
    • Check whether error information similar to the following is displayed (/usr/local/gcc7.3.0/lib64/libgomp.so.1 is only an example):
      ImportError: /usr/local/gcc7.3.0/lib64/libgomp.so.1: cannot allocate memory in static TLS block
      ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 12591) of binary: /usr/local/python3.9.2/bin/python3.
      If yes, run the following command to configure the environment variable (/usr/local/gcc7.3.0/lib64/libgomp.so.1 is only an example):
      export LD_PRELOAD=/usr/local/gcc7.3.0/lib64/libgomp.so.1
    • Check whether error information similar to the following is displayed:
      ImportError: This modeling file requires the following packages that were not found in your environment: atb_speed. Run `pip install atb_speed`
      If yes, run the following commands (replace /home/transformer-llm with the actual model package path):
      cd /home/transformer-llm/pytorch/examples/atb_speed_sdk
      pip install .
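
    One reading of --set_case_pair 1 is that the input and output lengths pair positionally (256 in/64 out, 512/128, 1024/256); this pairing is an assumption rather than documented behavior. A Python sketch of the resulting cases:

    # Assumption: with --set_case_pair 1, lengths pair positionally.
    in_lens = [256, 512, 1024]
    out_lens = [64, 128, 256]
    for i, o in zip(in_lens, out_lens):
        print(f"performance case: {i} input tokens -> {o} output tokens")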