Sentiment Analysis Environment Setup: GPU Version

[toc]

# Official Documentation

# Server Configuration

GPU model: RTX 3090
VRAM: 24 GB
Highest supported CUDA: 12.0
Memory allocated per GPU: 24 GB
Number of GPUs: 1
Data disk: 40 GB


Check other environment information

Check the CUDA version

~ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
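
The toolkit release in this output (11.7) is the number used later to choose a matching package. As a minimal sketch, it can be extracted programmatically as follows; the helper names `parse_cuda_release` and `installed_cuda_release` are made up for this illustration, and the regular expression assumes the "release X.Y" wording shown above.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the CUDA release (e.g. "11.7") found in `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Run `nvcc -V` if nvcc is on PATH and parse its output; otherwise return None."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "-V"], capture_output=True, text=True, check=False
    )
    return parse_cuda_release(result.stdout)


# Line copied from the nvcc output above
sample_line = "Cuda compilation tools, release 11.7, V11.7.64"
print(parse_cuda_release(sample_line))  # prints 11.7
print(installed_cuda_release())         # None on machines without nvcc
```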


# Download Anaconda3

wget -b https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh

Create a new environment

conda create --name nlp python=3.8 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

Activate the environment

# Enter the base environment first
source activate
# Then activate the newly created environment
conda activate nlp
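
To confirm that the right interpreter is in use after activation, a small check like the one below may help. It is only a sketch: the function name `describe_environment` is made up for this example, and it relies on conda setting the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect the interpreter version and the active conda environment name."""
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        # conda sets CONDA_DEFAULT_ENV when an environment is activated
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "executable": sys.executable,
    }


info = describe_environment()
print(info)
if info["conda_env"] != "nlp":
    print("Warning: the 'nlp' environment does not appear to be active")
```

If the environment is active, the reported Python version should be 3.8 and the executable path should point inside the `nlp` environment directory.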


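The detailed Anaconda installation steps are not covered here.

# Install the environment

# Install paddlepaddle-gpu

The package you install must match your CUDA version; mine is 11.7.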

python3 -m pip install paddlepaddle-gpu==2.5.0rc1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html

Version mapping: https://www.paddlepaddle.org.cn/documentation/docs/zh/install/pip/linux-pip.html#gpu
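
The install command above pairs the Paddle version with a `.post117` suffix for CUDA 11.7. The sketch below builds such a requirement string from a CUDA release; the function name is hypothetical, and the `.post<major><minor>` pattern is inferred from that single command, so not every CUDA release necessarily has a matching wheel. The mapping page remains the authority.

```python
def paddle_gpu_requirement(paddle_version: str, cuda_release: str) -> str:
    """Build a pip requirement such as 'paddlepaddle-gpu==2.5.0rc1.post117'.

    Assumes the '.post<major><minor>' naming seen in the install command;
    check the official mapping page before relying on the result.
    """
    major, minor = cuda_release.split(".")[:2]
    return f"paddlepaddle-gpu=={paddle_version}.post{major}{minor}"


print(paddle_gpu_requirement("2.5.0rc1", "11.7"))
# prints paddlepaddle-gpu==2.5.0rc1.post117
```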

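After installation, run paddle.utils.run_check() to verify that it succeeded.

If the check prints a success message, the installation is working.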
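
The check can be wrapped in a short script that also reports whether the installed build has CUDA support and which device Paddle selects. This is a sketch, not an official procedure: `check_paddle_install` is a made-up helper name, and it returns early when Paddle is not importable so the script does not crash on a machine without it.

```python
import importlib.util


def check_paddle_install() -> dict:
    """Run Paddle's self-check and report basic GPU information."""
    if importlib.util.find_spec("paddle") is None:
        return {"installed": False}

    import paddle

    # Prints diagnostic output, including whether the installation works
    paddle.utils.run_check()
    return {
        "installed": True,
        "version": paddle.__version__,
        "compiled_with_cuda": paddle.device.is_compiled_with_cuda(),
        "device": paddle.device.get_device(),
    }


print(check_paddle_install())
```

On a working GPU setup, `compiled_with_cuda` should be `True` and `device` should name a GPU rather than the CPU.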

# Install PaddleNLP

pip install --force-reinstall paddlenlp==2.6.0 -i https://mirror.baidu.com/pypi/simple

# Training

# Download the training data

mkdir data
cd data
wget https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/label_studio.tar.gz
tar -zxvf label_studio.tar.gz


# Build the samples

python label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --task_type ext \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --options "正向" "负向" "未提及" \
    --negative_ratio 5 \
    --is_shuffle True \
    --seed 1000
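
To make the `--splits 0.8 0.1 0.1`, `--is_shuffle True`, and `--seed 1000` options concrete, here is a minimal sketch of a seeded shuffle followed by a ratio split into train, dev, and test sets. It illustrates the general idea only and is not taken from label_studio.py; the function `split_samples` is made up for this example, and the real script may round or shuffle differently.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(
    samples: Sequence,
    splits: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 1000,
    shuffle: bool = True,
) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, then cut into train/dev/test by ratio."""
    items = list(samples)
    if shuffle:
        # A dedicated Random instance keeps the split reproducible
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * splits[0])
    n_dev = int(len(items) * splits[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


train, dev, test = split_samples(range(100))
print(len(train), len(dev), len(test))  # prints 80 10 10
```

Because the seed is fixed, running the split twice on the same data gives the same three sets, which keeps evaluation results comparable between runs.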


# Model training

python -u -m paddle.distributed.launch --gpus "0" finetune.py \
  --train_path ./data/train.json \
  --dev_path ./data/dev.json \
  --save_dir ./checkpoint \
  --learning_rate 1e-5 \
  --batch_size 16 \
  --max_seq_len 512 \
  --num_epochs 3 \
  --model uie-senta-base \
  --seed 1000 \
  --logging_steps 10 \
  --valid_steps 100 \
  --device gpu


# Model evaluation

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.json \
    --batch_size 16 \
    --max_seq_len 512 \
    --debug


Results

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.08s/it]
[2023-09-08 09:36:22,374] [    INFO] - -----------------------------
[2023-09-08 09:36:22,375] [    INFO] - Class Name: 评价维度
[2023-09-08 09:36:22,375] [    INFO] - Evaluation Precision: 0.88732 | Recall: 0.88732 | F1: 0.88732
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.08it/s]
[2023-09-08 09:36:22,706] [    INFO] - -----------------------------
[2023-09-08 09:36:22,706] [    INFO] - Class Name: 观点词
[2023-09-08 09:36:22,707] [    INFO] - Evaluation Precision: 0.73494 | Recall: 0.78205 | F1: 0.75776
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.30it/s]
[2023-09-08 09:36:23,257] [    INFO] - -----------------------------
[2023-09-08 09:36:23,257] [    INFO] - Class Name: X的观点词
[2023-09-08 09:36:23,257] [    INFO] - Evaluation Precision: 0.83636 | Recall: 0.82143 | F1: 0.82883
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  7.19it/s]
[2023-09-08 09:36:23,956] [    INFO] - -----------------------------
[2023-09-08 09:36:23,956] [    INFO] - Class Name: X的情感倾向[未提及,正向,负向]
[2023-09-08 09:36:23,956] [    INFO] - Evaluation Precision: 0.92424 | Recall: 0.92424 | F1: 0.92424
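
The F1 values in the log are the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes F1 for two of the logged classes as a sanity check; the function name `f1_score` is chosen for this example and is not part of evaluate.py.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation log above
logged = {
    "观点词": (0.73494, 0.78205),
    "X的观点词": (0.83636, 0.82143),
}
for class_name, (p, r) in logged.items():
    print(class_name, round(f1_score(p, r), 5))
```

When precision and recall are equal, as for the 评价维度 class, F1 is that same value.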


# Model inference

from paddlenlp import Taskflow
schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]
senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best")
print(senta("这家店的房间很大，店家服务也很热情，就是房间隔音不好"))
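
The nested result is easier to read once flattened into one row per aspect. The sketch below assumes the output is a list of dicts keyed by the schema name, where each aspect carries a `text` field and a `relations` dict; that layout is an assumption and should be checked against the actual printed output. `flatten_aspects` is a made-up helper, and `sample_results` is a hand-written stand-in, not real model output.

```python
from typing import Dict, List


def flatten_aspects(results: List[Dict], aspect_key: str = "评价维度") -> List[Dict]:
    """Turn nested extraction results into one flat row per aspect."""
    rows = []
    for item in results:
        for aspect in item.get(aspect_key, []):
            row = {"aspect": aspect.get("text")}
            # Each relation (opinion word, sentiment polarity) holds a list of spans
            for relation_name, spans in aspect.get("relations", {}).items():
                row[relation_name] = [span.get("text") for span in spans]
            rows.append(row)
    return rows


# Hand-written stand-in with the ASSUMED structure (not real model output)
sample_results = [
    {
        "评价维度": [
            {
                "text": "服务",
                "relations": {
                    "观点词": [{"text": "热情"}],
                    "情感倾向[正向,负向,未提及]": [{"text": "正向"}],
                },
            }
        ]
    }
]

for row in flatten_aspects(sample_results):
    print(row)
```

With a real model, `sample_results` would be replaced by the value returned from the `senta(...)` call.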


# Testing with my own annotated data

# Build the samples

python label_studio.py \
    --label_studio_file ./data2/project-4-at-2023-09-07-06-13-78e5962b.json \
    --task_type ext \
    --save_dir ./data2 \
    --splits 0.8 0.1 0.1 \
    --options "正向" "负向" "未提及" \
    --negative_ratio 5 \
    --is_shuffle True \
    --seed 1000


# Model training

python -u -m paddle.distributed.launch --gpus "0" finetune.py \
  --train_path ./data2/train.json \
  --dev_path ./data2/dev.json \
  --save_dir ./checkpoint2 \
  --learning_rate 1e-5 \
  --batch_size 1 \
  --max_seq_len 512 \
  --num_epochs 3 \
  --model uie-senta-base \
  --seed 1000 \
  --logging_steps 10 \
  --valid_steps 100 \
  --device gpu


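A note on one problem:

With a larger batch_size, training finished almost immediately but no model was saved; after changing batch_size to 1, a model was saved. I have not confirmed the cause yet.

It may be related to the step-interval settings such as logging_steps and valid_steps: if the model is only saved once every fixed number of steps, a run with too few total steps never reaches a save point.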
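
One way to test this hypothesis is to count the total number of training steps and compare it with the save interval. The sketch below assumes that a checkpoint is written only every `valid_steps` steps, which is an assumption about finetune.py and should be verified in the script. The sample count of 200 is hypothetical, and both function names are made up for this example.

```python
import math


def total_training_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps when each epoch covers every sample once."""
    return math.ceil(num_samples / batch_size) * num_epochs


def reaches_save_point(
    num_samples: int, batch_size: int, num_epochs: int, valid_steps: int
) -> bool:
    """True if training runs long enough to hit at least one save interval."""
    return total_training_steps(num_samples, batch_size, num_epochs) >= valid_steps


num_samples = 200  # hypothetical size of train.json; use your own count
for batch_size in (16, 1):
    steps = total_training_steps(num_samples, batch_size, num_epochs=3)
    saved = reaches_save_point(num_samples, batch_size, num_epochs=3, valid_steps=100)
    print(f"batch_size={batch_size}: {steps} steps, reaches a save point: {saved}")
```

Under these assumptions, a batch size of 16 gives 39 steps, fewer than 100, while a batch size of 1 gives 600. If that matches the real run, lowering `--valid_steps` would be an alternative to shrinking the batch size.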

# Model evaluation

python evaluate.py \
    --model_path ./checkpoint2/model_best \
    --test_path ./data2/test.json \
    --batch_size 16 \
    --max_seq_len 512 \
    --debug


Last updated: 2023/09/13, 14:23:09
