情感分析之文章正负面分析完整版

[toc]

# 环境准备

CPU服务器
服务器需要支持avx2指令
- 可以通过cat /proc/cpuinfo|grep avx2命令查看服务器是否包含avx2指令
Ubuntu系统原始服务器

# 环境安装

# 安装Anaconda3

下载安装文件

wget -b https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh && tail -f wget-log

设置权限

chmod 755 Anaconda3-2023.07-2-Linux-x86_64.sh

执行安装

sh Anaconda3-2023.07-2-Linux-x86_64.sh

直接敲回车

Please, press ENTER to continue
>>>

1
2

输入yes

Do you accept the license terms? [yes|no]
[no] >>>  yes

1
2

回车，设置安装路径

Anaconda3 will now be installed into this location:
/root/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda3] >>>

1
2
3
4
5
6
7
8

输入yes

Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
[no] >>> yes

1
2
3

安装成功

Thank you for installing Anaconda3!

配置环境变量

# 一键写入配置
grep -qF 'export PATH=$PATH:/root/anaconda3/bin' /etc/profile || echo 'export PATH=$PATH:/root/anaconda3/bin' >> /etc/profile && source /etc/profile

1
2

创建环境

# 在命令行输入以下命令，创建名为 sentiment_analysis 的环境,此处为加速下载，使用清华源
conda create --name sentiment_analysis python=3.8 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

1
2

使用环境

# 激活环境前先进入 base 环境
source activate
# 再进入创建的环境
conda activate sentiment_analysis

1
2
3
4

到这里就进入了虚拟环境中

(base) # conda activate sentiment_analysis
(sentiment_analysis) #

1
2

# 安装PaddleNLP

# 安装paddlepaddle

安装指定版本，截止当前最新版本安装可能存在些问题，无法直接应用

pip install --force-reinstall paddlepaddle==2.5.0rc1 -i https://pypi.tuna.tsinghua.edu.cn/simple

# 安装paddlenlp

pip install --use-pep517 paddlenlp==2.6.0 -i https://mirror.baidu.com/pypi/simple

pip install wordcloud==1.8.2.2 -i https://mirror.baidu.com/pypi/simple

安装完成后可能会提示如下警告忽略即可

# 服务应用

# Python代码简单调用

创建test.py文件，文件内容如下

from paddlenlp import Taskflow
# 下面这行可以忽略，提取结果根据schema来区分
#schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]
schema = ['情感倾向[正向，负向]']
senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema)
print(senta("漏水的问题，到夏天都体现出来了。一般都是天窗漏水，排水管长时间不清理堵塞导致的。这个问题应该加入保养里，每次保养时清理就可以避免发生。"))

1
2
3
4
5
6

执行test.py

python test.py

首次执行会自动下载代码中指定的模型，加载完成后，在输出结果中可以看到

[{'情感倾向[正向，负向]': [{'text': '负向', 'probability': 0.9972618997001703}]}]

官方说明：点击跳转 (opens new window)

>>> from paddlenlp import Taskflow
# 默认使用bilstm模型进行预测，速度快
>>> senta = Taskflow("sentiment_analysis")
>>> senta("这个产品用起来真的很流畅，我非常喜欢")
[{'text': '这个产品用起来真的很流畅，我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}]

# 使用SKEP情感分析预训练模型进行预测，精度高
>>> senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch")
>>> senta("作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。")
[{'text': '作为老的四星酒店，房间依然很整洁，相当不错。机场接机服务很好，可以在车上办理入住手续，节省时间。', 'label': 'positive', 'score': 0.984320878982544}]

# 使用UIE模型进行情感分析，具有较强的样本迁移能力
# 1. 语句级情感分析
>>> schema = ['情感倾向[正向，负向]']
>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema)
>>> senta('蛋糕味道不错，店家服务也很好')
[{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.996646058824652}]}]

# 2. 评价维度级情感分析
>>> # Aspect Term Extraction
>>> # schema =  ["评价维度"]
>>> # Aspect - Opinion Extraction
>>> # schema =  [{"评价维度":["观点词"]}]
>>> # Aspect - Sentiment Extraction
>>> # schema =  [{"评价维度":["情感倾向[正向,负向,未提及]"]}]
>>> # Aspect - Sentiment - Opinion Extraction
>>> schema =  [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}]

>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema)
>>> senta('蛋糕味道不错，店家服务也很热情')
[{'评价维度': [{'text': '服务', 'start': 9, 'end': 11, 'probability': 0.9709093024793489, 'relations': { '观点词': [{'text': '热情', 'start': 13, 'end': 15, 'probability': 0.9897222206316556}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9999327669598301}]}}, {'text': '味道', 'start': 2, 'end': 4, 'probability': 0.9105472387838915, 'relations': {'观点词': [{'text': '不错', 'start': 4, 'end': 6, 'probability': 0.9946981266891619}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9998829392709467}]}}]}]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

# 定制化情感分析

# 下载PaddleNLP代码

# 方法一

git直接拉取

mkdir /opt/jast && cd /opt/jast && git clone https://github.com/PaddlePaddle/PaddleNLP.git

如果github网络有问题可以使用gitee代码拉取
mkdir /opt/jast && cd /opt/jast && git clone https://gitee.com/paddlepaddle/PaddleNLP.git
1

# 方法二

直接下载GIT上的代码上传服务器后解压

# 标注数据

参考：https://www.hadoop.wiki/pages/598a4a/#_3-1-%E8%AF%AD%E5%8F%A5%E7%BA%A7%E6%83%85%E6%84%9F%E5%88%86%E7%B1%BB%E4%BB%BB%E5%8A%A1 (opens new window)

[https://datamining.blog.csdn.net/article/details/134964167](

标注数据分类如下

# 上传标注数据

进入刚刚下载的代码目录

cd PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction

目录文件如下

(sentiment_analysis) # ls -l 
总用量 152
-rw-r--r-- 1 root root  3830 12月 12 20:47 batch_predict.py
drwxr-xr-x 2 root root  4096 12月 12 20:47 deploy
-rw-r--r-- 1 root root  5009 12月 12 20:47 evaluate.py
-rw-r--r-- 1 root root  6600 12月 12 20:47 finetune.py
-rw-r--r-- 1 root root  7381 12月 12 20:47 label_studio.md
-rw-r--r-- 1 root root 31389 12月 12 20:47 label_studio.py
-rw-r--r-- 1 root root 42153 12月 12 20:47 README.md
-rw-r--r-- 1 root root  9583 12月 12 20:47 utils.py
-rw-r--r-- 1 root root 30950 12月 12 20:47 visual_analysis.py

1
2
3
4
5
6
7
8
9
10
11

创建存放数据目录

mkdir data

将标注好的数据上传到data目录

这里也可以直接下载我标注好的数据，大概9k条左右文章和评论数据

链接: https://pan.baidu.com/s/1nZLbOXVNrY6HoaWVVm9eaA?pwd=5fei 提取码: 5fei

(sentiment_analysis) # ls -lh
总用量 12M
-rw-r--r-- 1 root root 12M 12月 12 21:11 project-31-at-2023-12-12-08-07-661bbaf0.json

1
2
3

# 使用官方提供脚本进行样本构建

回到目录/opt/jast/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction

cd /opt/jast/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction

执行下面命令

python label_studio.py \
    --label_studio_file ./data/project-31-at-2023-12-12-08-07-661bbaf0.json \
    --task_type cls \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --options "正向" "负向" "中性" \
    --negative_ratio 5 \
    --is_shuffle True \
    --seed 1000

1
2
3
4
5
6
7
8
9

参数介绍：

label_studio_file: 从label studio导出的语句级情感分类的数据标注文件。

task_type: 选择任务类型，可选有抽取和分类两种类型的任务，其中前者需要设置为ext，后者需要设置为cls。由于此处为语句级情感分类任务，因此需要设置为cls。

save_dir: 训练数据的保存目录，默认存储在data目录下。

splits: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照8:1:1的比例将数据划分为训练集、验证集和测试集。

options: 情感极性分类任务的选项设置。对于语句级情感分类任务，默认支持2分类：正向 和 负向；对于属性级情感分析任务，默认支持3分类：正向, 负向和 未提及，其中未提及表示要分析的属性在原文本评论中未提及，因此无法分析情感极性。如果业务需要其他情感极性选项，可以通过options字段进行设置，需要注意的是，如果定制了options，参数label_studio_file指定的文件需要包含针对新设置的选项的标注数据。

is_shuffle: 是否对数据集进行随机打散，默认为True。

seed: 随机种子，默认为1000.

备注：参数options可以不进行手动指定，如果这么做，则采用默认的设置。针对语句级情感分类任务，其默认将被设置为："正向" "负向"；

# 修改源码

执行命令是可能会遇到错误

[2023-12-12 16:34:26,931] [    INFO] - [Train] Start to convert annotation data.
Traceback (most recent call last):
  File "label_studio.py", line 738, in <module>
    do_convert()
  File "label_studio.py", line 698, in do_convert
    train_examples = convertor.convert_cls_examples(raw_examples[:p1], data_flag="Train")
  File "label_studio.py", line 103, in convert_cls_examples
    items = self.process_text_tag(line, task_type="cls")
  File "label_studio.py", line 93, in process_text_tag
    items["label"] = line["annotations"][0]["result"][0]["value"]["choices"]
IndexError: list index out of range

1
2
3
4
5
6
7
8
9
10
11

原因：可能是因为标注的数据有些是为空的，我们修改下label_studio.py的代码，将异常数据直接捕获跳过

修改代码第103行左右的convert_cls_examples方法

修改前

    def convert_cls_examples(self, raw_examples, data_flag="Data"):
        """
        Convert labeled data for classification task.
        """
        examples = []
        logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
        for line in raw_examples:
             items = self.process_text_tag(line, task_type="cls")
             text, labels = items["text"], items["label"]
             example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
             examples.append(example)

        logger.info("{0:7} End to convert annotation data.\n".format(""))
        return examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14

修改后

    def convert_cls_examples(self, raw_examples, data_flag="Data"):
        """
        Convert labeled data for classification task.
        """
        examples = []
        logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
        for line in raw_examples:
            try:
                items = self.process_text_tag(line, task_type="cls")
                text, labels = items["text"], items["label"]
                example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
                examples.append(example)
            except Exception :
                logger.info("数据解析错误")
        logger.info("{0:7} End to convert annotation data.\n".format(""))
        return examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Github上也有相关讨论：https://github.com/PaddlePaddle/PaddleNLP/issues/6292 (opens new window)

再次执行样本构建密令，就会运行成功

[2023-12-12 21:41:51,767] [    INFO] -         End to convert annotation data.

[2023-12-12 21:41:51,809] [    INFO] - Save 5642 examples to ./data/train.json.
[2023-12-12 21:41:51,814] [    INFO] - Save 717 examples to ./data/dev.json.
[2023-12-12 21:41:51,820] [    INFO] - Save 686 examples to ./data/test.json.
[2023-12-12 21:41:51,820] [    INFO] - Finished! It takes 0.52 seconds

1
2
3
4
5
6

在目录下可以看到生成的数据

(sentiment_analysis) # ls -lh data
总用量 16M
-rw-r--r-- 1 root root 389K 12月 12 21:41 dev.json
-rw-r--r-- 1 root root  12M 12月 12 16:11 project-31-at-2023-12-12-08-07-661bbaf0.json
-rw-r--r-- 1 root root 399K 12月 12 21:41 test.json
-rw-r--r-- 1 root root 3.1M 12月 12 21:41 train.json

1
2
3
4
5
6

# 训练模型

nohup python -u -m paddle.distributed.launch  finetune.py \
--train_path ./data/train.json \
--dev_path ./data/dev.json \
--save_dir ./checkpoint \
--learning_rate 1e-5 \
--batch_size 1 \
--max_seq_len 512 \
--num_epochs 3 \
--model uie-senta-base \
--seed 1000 \
--logging_steps 10 \
--valid_steps 100 \
--device cpu &

1
2
3
4
5
6
7
8
9
10
11
12
13

可配置参数说明：

train_path：必须，训练集文件路径。

dev_path：必须，验证集文件路径。

save_dir：模型 checkpoints 的保存目录，默认为"./checkpoint"。

learning_rate：训练最大学习率，UIE 推荐设置为 1e-5；默认值为1e-5。

batch_size：训练集训练过程批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为 16。

max_seq_len：模型支持处理的最大序列长度，默认为512。

num_epochs：模型训练的轮次，可以视任务情况进行调整，默认为10。

model：训练使用的预训练模型。可选择的有uie-senta-base, uie-senta-medium, uie-senta-mini, uie-senta-micro, uie-senta-nano，默认为uie-senta-base。

logging_steps: 训练过程中日志打印的间隔 steps 数，默认10。

valid_steps: 训练过程中模型评估的间隔 steps 数，默认100。

seed：全局随机种子，默认为 42。

device: 训练设备，可选择 'cpu'、'gpu' 其中的一种；默认为 GPU 训练。

首次训练模型也会自动下载基础模型，训练过程中也会持续输出进度

[2023-12-12 22:00:34,540] [    INFO] - global step 10, epoch: 1, loss: 0.00642, speed: 0.06 step/s
[2023-12-12 22:03:23,946] [    INFO] - global step 20, epoch: 1, loss: 0.00649, speed: 0.06 step/s
[2023-12-12 22:06:15,562] [    INFO] - global step 30, epoch: 1, loss: 0.00514, speed: 0.06 step/s
.....
.....
100%|█████████▉| 812/815 [18:14<00:04,  1.35s/it]
100%|█████████▉| 813/815 [18:15<00:02,  1.35s/it]
100%|█████████▉| 814/815 [18:17<00:01,  1.34s/it]
100%|██████████| 815/815 [18:18<00:00,  1.34s/it]
100%|██████████| 815/815 [18:18<00:00,  1.35s/it]
[2023-12-12 23:42:18,957] [    INFO] - Evaluation precision: 0.76042, recall: 0.71656, F1: 0.73784

1
2
3
4
5
6
7
8
9
10
11

执行后保存的模型会在,刚刚训练模型配置的save_dir目录下，model_best 目录就是我们训练保存的最佳模型

(base) ➜  unified_sentiment_extraction git:(develop) ✗ ls checkpoint 
model_100   model_1200  model_1500  model_1800  model_2000  model_2300  model_500  model_800
model_1000  model_1300  model_1600  model_1900  model_2100  model_300   model_600  model_900
model_1100  model_1400  model_1700  model_200   model_2200  model_400   model_700  model_best

1
2
3
4

# 模型测试

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.json \
    --batch_size 16 \
    --max_seq_len 512

1
2
3
4
5

可配置参数说明：

model_path：必须，进行评估的模型文件夹路径，路径下需包含模型权重文件model_state.pdparams及配置文件model_config.json。

test_path：必须，进行评估的测试集文件。

batch_size：训练集训练过程批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为 16。

max_seq_len：文本最大切分长度，输入超过最大长度时会对输入文本进行自动切分，默认为512。

debug：是否开启debug模式对每个正例类别分别进行评估，该模式仅用于模型调试，默认关闭。

# 模型应用

参考：https://www.hadoop.wiki/pages/33341c/ (opens new window)

上次更新: 2023/12/19, 21:00:20

← 情感分析之使用PaddleGPU文章正负面分析完整版情感分析值使用PaddleGPU文章观点提取分析完整版→