Linux - Detours Series 4: Installing the Torque/PBS Job System and the Problems I Ran Into
System: Anolis8 (offline). All commands are run as root.
Contents
- 1. Installation steps
- 2. Summary of problems
- (1) socket_connect_unix failed: 15137 / could not connect to trqauthd
- (2) cannot connect to server abc01 (errno=111) Connection refused
- (3) cannot connect to server abc01 (errno=15009) munge executable not found, unable to authenticate
- (4) pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
- (5) qsub: submit error (No default queue specified MSG=requested queue not found) or (Unknown queue MSG=requested queue not found)
- (6) Submitted jobs stay in the Queued (Q) state
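1. Installation steps
I installed from rpm packages, using Torque version 4.2.10. I tracked down the dependencies one by one on the Alibaba open source mirror site.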
rpm -ivh *.rpm
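Other tutorials I used for reference (thanks for sharing!):
(1) [Cluster maintenance] Installing PBS + maui on CentOS
(2) Installing the PBS job management system
2. Summary of problems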
(1) socket_connect_unix failed: 15137 / could not connect to trqauthd
[root@abc01 ~]# qstat
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd
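Torque's trqauthd daemon is either not running or misconfigured. trqauthd is the part of the Torque system that handles authentication and communication between the client commands and the server.
Solution:
- Start trqauthd:
sudo systemctl start trqauthd (or: sudo service trqauthd start)
- Check that it is running:
sudo systemctl status trqauthd (or: sudo service trqauthd status)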
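The start-then-check routine above can be wrapped in a small helper. This is only a sketch: `ensure_service` is a hypothetical name I made up, and it assumes a systemd host where the unit is called trqauthd, as in the commands above.

```shell
#!/usr/bin/env bash
# ensure_service NAME (hypothetical helper)
# Starts the systemd unit NAME only if systemd does not report it as active.
ensure_service() {
    local name="$1"
    if systemctl is-active --quiet "$name"; then
        echo "$name: already running"
    else
        echo "$name: not running, starting it"
        systemctl start "$name"
    fi
}
```

Calling `ensure_service trqauthd` and then re-running qstat is a quick way to see whether the 15137 error was only a stopped daemon.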
(2) cannot connect to server abc01 (errno=111) Connection refused
[root@abc01 ~]# qstat
Unable to communicate with abc01(100.222.150.255)
Cannot connect to specified server host 'abc01'.
Unable to communicate with abc01(100.222.150.255)
Cannot connect to specified server host 'abc01'.
Unable to communicate with abc01(100.222.150.255)
Cannot connect to specified server host 'abc01'.
qstat: cannot connect to server abc01 (errno=111) Connection refused
Solution:
Make sure the host name is the same in /var/lib/torque/server_name, /var/lib/torque/mom_priv/config, /etc/hosts and any related files.
In my case, correcting the host name in the config file fixed the problem.
- vi /var/lib/torque/mom_priv/config
# Configuration for pbs_mom.
$pbsserver abc01
$logevent 255
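To avoid comparing the three files by eye, something like the following could be used. It is a sketch under assumptions: `check_torque_hostnames` is a hypothetical helper, server_name is assumed to contain just the host name, and the mom config is assumed to use the `$pbsserver <host>` form shown above.

```shell
#!/usr/bin/env bash
# check_torque_hostnames SERVER_NAME_FILE MOM_CONFIG HOSTS_FILE (hypothetical helper)
# Reports whether the server name, the $pbsserver entry and the hosts file agree.
check_torque_hostnames() {
    local server_name_file="$1" mom_config="$2" hosts_file="$3"
    local server mom_server

    # server_name is assumed to hold only the server's host name.
    server="$(tr -d '[:space:]' < "$server_name_file")"
    # In the pbs_mom config the server is given as: $pbsserver <host>
    mom_server="$(awk '$1 == "$pbsserver" { print $2; exit }' "$mom_config")"

    echo "server_name: $server"
    echo "mom config:  $mom_server"

    if [ "$server" != "$mom_server" ]; then
        echo "MISMATCH: server_name and \$pbsserver differ"
        return 1
    fi
    # Look for the name as a whole field (after the address) on a non-comment line.
    if ! awk -v h="$server" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }' "$hosts_file"; then
        echo "MISSING: $server has no entry in $hosts_file"
        return 1
    fi
    echo "OK: host names are consistent"
}
```

On the setup described here the call would be `check_torque_hostnames /var/lib/torque/server_name /var/lib/torque/mom_priv/config /etc/hosts`.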
(3) cannot connect to server abc01 (errno=15009) munge executable not found, unable to authenticate
[root@abc01 ~]# qstat
munge_encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (6)
Unable to communicate with abc01(100.222.150.255)
Communication failure.
munge_encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (6)
Unable to communicate with abc01(100.222.150.255)
Communication failure.
munge_encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (6)
Unable to communicate with abc01(100.222.150.255)
Communication failure.
qstat: cannot connect to server abc01 (errno=15009) munge executable not found, unable to authenticate
Solution:
- Install and configure Munge:
Make sure the three rpm packages munge, munge-libs and munge-devel are installed.
Make sure the key file /etc/munge/munge.key exists. If it does not, generate one:
sudo /usr/sbin/create-munge-key   # creates a new munge key file at /etc/munge/munge.key
- Enable and start Munge:
sudo systemctl enable munge
sudo systemctl start munge
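Before starting the service it can help to check the key file itself. The sketch below assumes GNU `stat` and assumes that munged wants the key readable by its owner only (mode 400 or 600); `check_munge_key` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_munge_key [KEYFILE] (hypothetical helper)
# Verifies that the key file exists, is non-empty and is not accessible
# to group or others. Defaults to the path used in this article.
check_munge_key() {
    local key="${1:-/etc/munge/munge.key}"
    if [ ! -s "$key" ]; then
        echo "missing or empty: $key"
        return 1
    fi
    local mode
    mode="$(stat -c '%a' "$key")"   # octal permission bits, e.g. 400
    case "$mode" in
        400|600) echo "ok: $key (mode $mode)" ;;
        *)       echo "too permissive: $key (mode $mode)"; return 1 ;;
    esac
}
```

Once munge is running, `munge -n | unmunge` is a common local round-trip test; it should report a success status if the daemon and key are working.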
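(4) pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
Solution:
Write the host name and node information into the server_priv/nodes file.
- vi /var/lib/torque/server_priv/nodes
abc01 np=128
Note: the value for np (the number of processors) can be found with the nproc command.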
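The nodes entry can also be generated instead of typed by hand. A minimal sketch, assuming the short host name from `hostname -s` is the name Torque should use (on my machine that would be abc01); `nodes_line` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# nodes_line [HOST] [NP] (hypothetical helper)
# Prints one entry in the "host np=N" format used by server_priv/nodes.
# HOST defaults to the short host name, NP to the output of nproc.
nodes_line() {
    local host="${1:-$(hostname -s)}"
    local np="${2:-$(nproc)}"
    printf '%s np=%s\n' "$host" "$np"
}
```

For example, `nodes_line abc01 128` prints the line shown above, and `nodes_line >> /var/lib/torque/server_priv/nodes` appends an entry for the local machine. As far as I know, pbs_server has to be restarted before it picks up changes to this file.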
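(5) qsub: submit error (No default queue specified MSG=requested queue not found) or (Unknown queue MSG=requested queue not found)
This happens when there is no default queue, or the queue has not been enabled.
Solution:
Set up a default queue:
- qmgr -c "create queue abc"   # create a queue named abc
qmgr -c "set server default_queue=abc"   # make abc the default queue
qmgr -c "set queue abc queue_type=Execution"   # set the queue type to execution queue
qmgr -c "set queue abc enabled=True"   # enable the queue so that jobs can be submitted to it
qmgr -c "set queue abc started=True"   # start the queue so that its jobs can be scheduled and run
- qmgr -c "list server"
If the output contains default_queue = abc, the default queue is in place.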
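Since qmgr also reads directives from standard input, the queue setup can be generated for any queue name and piped in as one batch. This is a sketch: `queue_directives` is a hypothetical helper, and putting the default_queue line last (after the queue is fully configured) is my own ordering choice, not something Torque requires as far as I know.

```shell
#!/usr/bin/env bash
# queue_directives QUEUE (hypothetical helper)
# Prints the qmgr directives that create an execution queue, enable and
# start it, and make it the server's default queue.
queue_directives() {
    local q="$1"
    cat <<EOF
create queue $q
set queue $q queue_type = Execution
set queue $q enabled = True
set queue $q started = True
set server default_queue = $q
EOF
}
```

Running `queue_directives abc` on its own lets you review the directives first; `queue_directives abc | qmgr` applies them.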
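(6) Submitted jobs stay in the Queued (Q) state
Solution:
- qmgr -c "p s"
Prints the current configuration of the PBS/Torque server.
- qmgr -c "set server scheduling = True"
Enables scheduling, which allows the PBS server to schedule jobs.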
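To check the flag without reading the whole dump, the output of `qmgr -c "p s"` can be filtered. A minimal sketch, assuming the setting appears as a line of the form `set server scheduling = True`; `scheduling_enabled` is a hypothetical helper that reads the dump on standard input.

```shell
#!/usr/bin/env bash
# scheduling_enabled (hypothetical helper)
# Reads the output of: qmgr -c "p s" on stdin.
# Returns 0 if it contains "set server scheduling = True", non-zero otherwise.
scheduling_enabled() {
    grep -Eiq '^[[:space:]]*set server scheduling = true[[:space:]]*$'
}
```

Usage would look like `qmgr -c "p s" | scheduling_enabled && echo "scheduling is on"`. If the flag is already True and jobs still sit in Q, it may be worth checking that a scheduler daemon (for example pbs_sched or maui) is actually running.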
Original article: https://blog.csdn.net/island_chenyanyu/article/details/143843439