Notes on using the TensorFlow semantic segmentation API (training DeepLab on Cityscapes)
Pitfalls encountered:
1. Environment:
- TensorFlow 1.8 + CUDA 9.0 + cuDNN 7.0 + Anaconda3 + Python 3.5
- The newer TensorFlow 1.12 and 1.10 releases did not work for me: both fail with a convolution algorithm error ("convolution algorithm..."). A minimal setup sketch follows this list.
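As a rough sketch of the environment setup (the conda environment name and the exact pip version pin are my assumptions, not from the original post):

```bash
# Sketch only: assumes CUDA 9.0 and cuDNN 7.0 are already installed system-wide.
conda create -n deeplab python=3.5 -y
source activate deeplab
# The TensorFlow 1.8 GPU build is the combination that worked here.
pip install tensorflow-gpu==1.8.0
```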
2. Dataset conversion
```bash
# Exit immediately if a command exits with a non-zero status.
set -e

CURRENT_DIR=$(pwd)
WORK_DIR="."

# Root path for Cityscapes dataset.
CITYSCAPES_ROOT="${WORK_DIR}/cityscapes"

# Create training labels.
python "${CITYSCAPES_ROOT}/cityscapesscripts/preparation/createTrainIdLabelImgs.py"

# Build TFRecords of the dataset.
# First, create output directory for storing TFRecords.
OUTPUT_DIR="${CITYSCAPES_ROOT}/tfrecord"
mkdir -p "${OUTPUT_DIR}"

BUILD_SCRIPT="${CURRENT_DIR}/build_cityscapes_data.py"

echo "Converting Cityscapes dataset..."
python "${BUILD_SCRIPT}" \
  --cityscapes_root="${CITYSCAPES_ROOT}" \
  --output_dir="${OUTPUT_DIR}"
```
- First, install the cityscapesScripts package in the current conda environment; it has to support Python 3.5.
- By default, cityscapesscripts/preparation/createTrainIdLabelImgs.py converts the JSON annotations under all of gtFine's test, train, and val folders into TrainId label images (*.png). However, a dozen or so JSON files under the test folder are incorrectly encoded and make the script fail every time, so remove them first (a sketch of this workaround follows the list).
- Then run build_cityscapes_data.py to convert the images and labels into TFRecord format.
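A minimal sketch of the preparation steps above, assuming the dataset lives under ./cityscapes with the cityscapesScripts checkout inside it (as in the conversion script). Moving the whole test annotation folder aside, instead of deleting individual JSON files, is my own variant of the workaround, not the original author's exact procedure:

```bash
# Install the Cityscapes tooling into the active Python 3.5 conda env.
pip install cityscapesScripts

CITYSCAPES_ROOT="./cityscapes"

# Workaround sketch: park the test annotations so the label generation
# only touches train/val and never sees the badly encoded test JSON files.
mkdir -p "${CITYSCAPES_ROOT}/gtFine_test_parked"
mv "${CITYSCAPES_ROOT}/gtFine/test" "${CITYSCAPES_ROOT}/gtFine_test_parked/" 2>/dev/null || true

# Generate the TrainId label images, then build the TFRecords
# (the same two steps as in the conversion script above).
python "${CITYSCAPES_ROOT}/cityscapesscripts/preparation/createTrainIdLabelImgs.py"
python build_cityscapes_data.py \
  --cityscapes_root="${CITYSCAPES_ROOT}" \
  --output_dir="${CITYSCAPES_ROOT}/tfrecord"
```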
3. Training on Cityscapes
- Wrap the training command in a script file: train_deeplab_cityscapes.sh
```bash
#!/bin/bash
# CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --backbone resnet --lr 0.01 --workers 4 --epochs 40 --batch-size 16 --gpu-ids 0,1,2,3 --checkname deeplab-resnet --eval-interval 1 --dataset coco

PATH_TO_INITIAL_CHECKPOINT='/home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/model.ckpt'
PATH_TO_TRAIN_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/'
PATH_TO_DATASET='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/tfrecord'
WORK_DIR='/home/rjw/tf-models/research/deeplab'

# From tensorflow/models/research/
python "${WORK_DIR}"/train.py \
  --logtostderr \
  --training_number_of_steps=40000 \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size=513 \
  --train_crop_size=513 \
  --train_batch_size=1 \
  --fine_tune_batch_norm=False \
  --dataset="cityscapes" \
  --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
  --train_logdir=${PATH_TO_TRAIN_DIR} \
  --dataset_dir=${PATH_TO_DATASET}
```
Parameter analysis:
- training_number_of_steps: number of training iterations.
- train_crop_size: crop size of the training images; since my GPU has only 8 GB of memory, I set it to 513.
- train_batch_size: training batch size; also limited by the hardware, so I kept it at 1.
- fine_tune_batch_norm=False: whether to fine-tune the batch-norm parameters. The official recommendation is to set this to False whenever the training batch size is smaller than 12. This setting matters; otherwise training crashes at around step 2000.
- tf_initial_checkpoint: the pretrained initial checkpoint, here the previously downloaded ../research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt.index.
- train_logdir: directory where the training weights are saved; note that it was already created when the project directories were set up at the beginning, here "../research/deeplab/exp/train_on_train_set/train/" (see the TensorBoard sketch after this list).
- dataset_dir: path to the dataset, i.e. the TFRecord directory created earlier, here "../dataset/cityscapes/tfrecord".
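As a small optional aid (not part of the original post), the training curves written to train_logdir can be watched with TensorBoard:

```bash
# Point TensorBoard at the training log directory used in the script above.
PATH_TO_TRAIN_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/'
tensorboard --logdir="${PATH_TO_TRAIN_DIR}" --port 6006
```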
4. Evaluation
- Evaluation script:
```bash
#!/bin/bash
# CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --backbone resnet --lr 0.01 --workers 4 --epochs 40 --batch-size 16 --gpu-ids 0,1,2,3 --checkname deeplab-resnet --eval-interval 1 --dataset coco

PATH_TO_INITIAL_CHECKPOINT='/home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/'
PATH_TO_CHECKPOINT='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/'
PATH_TO_EVAL_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/eval/'
PATH_TO_DATASET='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/tfrecord'
WORK_DIR='/home/rjw/tf-models/research/deeplab'

# From tensorflow/models/research/
python "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=1025 \
  --eval_crop_size=2049 \
  --dataset="cityscapes" \
  --checkpoint_dir=${PATH_TO_INITIAL_CHECKPOINT} \
  --eval_logdir=${PATH_TO_EVAL_DIR} \
  --dataset_dir=${PATH_TO_DATASET}
```
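One thing visible in the log further below is that eval.py finishes a pass and then keeps waiting for new checkpoints. If you only want a single evaluation of an already trained model, my understanding is that deeplab's eval.py accepts a max_number_of_evaluations flag that stops the loop; verify the flag against your copy of eval.py. A sketch, here pointed at the trained-model directory:

```bash
# Single-pass evaluation sketch: --max_number_of_evaluations=1 should stop
# eval.py from looping while it waits for new checkpoints.
WORK_DIR='/home/rjw/tf-models/research/deeplab'
PATH_TO_CHECKPOINT='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/'
PATH_TO_EVAL_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/eval/'
PATH_TO_DATASET='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/tfrecord'

python "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=1025 \
  --eval_crop_size=2049 \
  --dataset="cityscapes" \
  --max_number_of_evaluations=1 \
  --checkpoint_dir=${PATH_TO_CHECKPOINT} \
  --eval_logdir=${PATH_TO_EVAL_DIR} \
  --dataset_dir=${PATH_TO_DATASET}
```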
- Result: model.ckpt-40000 is the model obtained after training for 40000 iterations starting from the initial checkpoint. Evaluating the initial (pretrained) model afterwards still gives a rather low miou_1.0, and I am not sure whether some parameter is set incorrectly.
- Note: the archive of the officially provided pretrained checkpoint does not contain a `checkpoint` file, so you have to add one manually (a sketch follows this note).
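A minimal sketch of such a hand-written `checkpoint` file, using the pretrained-model directory from the scripts above (adjust the path to your own layout):

```bash
# tf.train.latest_checkpoint() reads this small text file to locate model.ckpt.
cat > /home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/checkpoint <<'EOF'
model_checkpoint_path: "model.ckpt"
all_model_checkpoint_paths: "model.ckpt"
EOF
```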
```
INFO:tensorflow:Restoring parameters from /home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt-40000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-12-18-07:13:08
INFO:tensorflow:Evaluation [50/500]
INFO:tensorflow:Evaluation [100/500]
INFO:tensorflow:Evaluation [150/500]
INFO:tensorflow:Evaluation [200/500]
INFO:tensorflow:Evaluation [250/500]
INFO:tensorflow:Evaluation [300/500]
INFO:tensorflow:Evaluation [350/500]
INFO:tensorflow:Evaluation [400/500]
INFO:tensorflow:Evaluation [450/500]
miou_1.0[0.478293568]
INFO:tensorflow:Waiting for new checkpoint at /home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/
INFO:tensorflow:Found new checkpoint at /home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/model.ckpt
INFO:tensorflow:Graph was finalized.
2018-12-18 15:18:05.210957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-12-18 15:18:05.211047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-18 15:18:05.211077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0
2018-12-18 15:18:05.211100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N
2018-12-18 15:18:05.211645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9404 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-12-18-07:18:06
INFO:tensorflow:Evaluation [50/500]
INFO:tensorflow:Evaluation [100/500]
INFO:tensorflow:Evaluation [150/500]
INFO:tensorflow:Evaluation [200/500]
INFO:tensorflow:Evaluation [250/500]
INFO:tensorflow:Evaluation [300/500]
INFO:tensorflow:Evaluation [350/500]
INFO:tensorflow:Evaluation [400/500]
INFO:tensorflow:Evaluation [450/500]
miou_1.0[0.496331513]
```
5. Visualization
- Generate the segmentation result images under the vis directory:
```bash
#!/bin/bash
# CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --backbone resnet --lr 0.01 --workers 4 --epochs 40 --batch-size 16 --gpu-ids 0,1,2,3 --checkname deeplab-resnet --eval-interval 1 --dataset coco

PATH_TO_CHECKPOINT='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/'
PATH_TO_VIS_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/vis/'
PATH_TO_DATASET='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/tfrecord'
WORK_DIR='/home/rjw/tf-models/research/deeplab'

# From tensorflow/models/research/
python "${WORK_DIR}"/vis.py \
  --logtostderr \
  --vis_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --vis_crop_size=1025 \
  --vis_crop_size=2049 \
  --dataset="cityscapes" \
  --colormap_type="cityscapes" \
  --checkpoint_dir=${PATH_TO_CHECKPOINT} \
  --vis_logdir=${PATH_TO_VIS_DIR} \
  --dataset_dir=${PATH_TO_DATASET}
```
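If I read deeplab's vis.py correctly (an assumption worth checking against your copy), the colored predictions land in a segmentation_results subfolder of the vis log directory. A quick way to confirm the run produced output:

```bash
PATH_TO_VIS_DIR='/home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/vis/'
# List a few of the generated prediction images.
ls "${PATH_TO_VIS_DIR}/segmentation_results" | head
```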
The above is based on my personal experience; I hope it serves as a useful reference.
Original article: https://www.cnblogs.com/ranjiewen/p/10134108.html