Manipulator path planning for a drilling and anchoring robot based on an improved PPO algorithm

Abstract: The low level of automation and intelligence of coal mine roadway support equipment limits roadway forming efficiency and is a key cause of the imbalance between mining and excavation. To address the low automation and poor support efficiency of this equipment, a deep-reinforcement-learning path planning method is proposed for a drilling and anchoring robot that integrates a cantilever-type roadheader with a multi-degree-of-freedom manipulator. A coal mine roadway environment is constructed in a virtual environment; collision detection models between the manipulator and the robot body, the coal wall, and the supporting steel belt are established; and collision detection is performed in the virtual environment with the hierarchical bounding box method, yielding an obstacle avoidance strategy under the confined roadway boundary. Improvements to the Proximal Policy Optimization (PPO) algorithm are then proposed from several directions. Because the state-space input length of the multi-degree-of-freedom manipulator is not fixed, a Long Short-Term Memory (LSTM) network is introduced to process the environmental state input, improving the algorithm's adaptability to the environment. To cope with sparse rewards and penalties, an Intrinsic Curiosity Module (ICM) is introduced, which grants intrinsic rewards that encourage the agent to explore the environment more thoroughly. An agent is built on the reward and punishment mechanism, and its state and action spaces are defined according to the motion characteristics of the drilling and anchoring robot. The two algorithms are used to train the agent in the same scene, and the cumulative reward, steps per episode, Actor network loss, Critic network loss, and other indicators are compared and analyzed; simulation ablation experiments follow. The results show that, in a scenario where the original PPO algorithm cannot complete the task, the improved algorithm shortens the path length by 3.98% and the planning time by 25.6% compared with the PPO-ICM variant, which can also complete the task. To further verify the robustness of the improved algorithm, multiple groups of experiments were designed; the improved PPO algorithm completed the path planning task in all of them, with the distance error between the path end point and the target position within 3.88 cm and the angle error between the anchor bolt and the vertical direction within 3°. The method can thus effectively complete the path planning task and raise the automation level of the coal mine roadway support system. The results verify the feasibility and effectiveness of the proposed method for path planning of the drilling and anchoring robot's multi-degree-of-freedom manipulator when anchor hole positions vary during underground roadway support.
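The abstract names the hierarchical bounding box method but gives no implementation details. The sketch below illustrates the pruning idea with a two-level hierarchy of axis-aligned boxes: a cheap root-box test over the whole arm, then per-link tests only if the root overlaps. The three-link arm and coal-wall coordinates are invented purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class AABB:
    """Axis-aligned bounding box: min/max corners in metres."""
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def overlaps(self, other: "AABB") -> bool:
        # Boxes overlap iff their extents overlap on every axis.
        return all(self.lo[i] <= other.hi[i] and other.lo[i] <= self.hi[i]
                   for i in range(3))

def enclosing(boxes):
    """Coarse root box enclosing all child boxes (top of the hierarchy)."""
    lo = tuple(min(b.lo[i] for b in boxes) for i in range(3))
    hi = tuple(max(b.hi[i] for b in boxes) for i in range(3))
    return AABB(lo, hi)

def collides(link_boxes, obstacle: AABB) -> bool:
    """Two-level hierarchical test: root check first, then per-link boxes."""
    if not enclosing(link_boxes).overlaps(obstacle):
        return False                     # prune: whole arm is clear
    return any(b.overlaps(obstacle) for b in link_boxes)

# Hypothetical numbers for illustration: three manipulator links and the
# coal wall as an obstacle box.
links = [AABB((0.0, 0.0, 0.0), (0.5, 0.2, 0.2)),
         AABB((0.5, 0.0, 0.0), (1.2, 0.2, 0.2)),
         AABB((1.2, 0.0, 0.0), (1.6, 0.15, 0.15))]
coal_wall = AABB((1.5, -2.0, -2.0), (1.7, 2.0, 2.0))
print(collides(links, coal_wall))  # True: the last link reaches the wall
```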
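For reference, the baseline Actor and Critic losses that the abstract's training comparison tracks follow the standard clipped-surrogate PPO formulation (Schulman et al., 2017). This is the textbook form, not the authors' code; clip_eps and vf_coef are common defaults, not values reported in the paper.

```python
import torch

def ppo_losses(logp_new, logp_old, advantage, value_pred, value_target,
               clip_eps=0.2, vf_coef=0.5):
    """Clipped-surrogate PPO losses in their standard form."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    actor_loss = -torch.min(unclipped, clipped).mean()
    critic_loss = vf_coef * (value_pred - value_target).pow(2).mean()
    return actor_loss, critic_loss

# Toy batch of 4 transitions, random values for demonstration only.
lp_new, lp_old = torch.randn(4), torch.randn(4)
adv = torch.randn(4)
v, v_tgt = torch.randn(4), torch.randn(4)
print(ppo_losses(lp_new, lp_old, adv, v, v_tgt))
```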
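The LSTM-based handling of the variable-length state input could look like the following sketch: per-joint state vectors are fed to an LSTM as a sequence of whatever length the manipulator configuration dictates, and the final hidden state provides a fixed-size embedding for the Actor and Critic heads. All dimensions (per-joint features, hidden sizes, action count) are assumptions for illustration, not the paper's values.

```python
import torch
import torch.nn as nn

class LSTMStateEncoder(nn.Module):
    """Encodes a variable-length sequence of per-joint states into a fixed
    vector, so the Actor/Critic heads see a constant input size."""
    def __init__(self, per_joint_dim=6, hidden_dim=64, n_actions=7):
        super().__init__()
        self.lstm = nn.LSTM(per_joint_dim, hidden_dim, batch_first=True)
        self.actor = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

    def forward(self, joint_seq):
        # joint_seq: (batch, n_joints, per_joint_dim); n_joints may vary
        # between calls -- the LSTM's final hidden state absorbs the length.
        _, (h, _) = self.lstm(joint_seq)
        z = h[-1]                          # (batch, hidden_dim)
        return self.actor(z), self.critic(z)

enc = LSTMStateEncoder()
for n_joints in (5, 6, 7):                # different manipulator configurations
    logits, value = enc(torch.randn(1, n_joints, 6))
    print(n_joints, logits.shape, value.shape)
```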
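The ICM component follows the usual curiosity formulation (Pathak et al., 2017): a forward model predicts the next state's features from the current features and action, and its prediction error is paid out as an intrinsic reward on top of the sparse extrinsic reward; an inverse model shapes the feature space. This is a minimal sketch with invented dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal Intrinsic Curiosity Module: forward-model prediction error
    on next-state features becomes the intrinsic reward."""
    def __init__(self, state_dim=12, action_dim=7, feat_dim=32, eta=0.1):
        super().__init__()
        self.eta = eta
        self.encode = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU())
        # forward model: phi(s_t), a_t -> predicted phi(s_{t+1})
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim))
        # inverse model: phi(s_t), phi(s_{t+1}) -> predicted a_t
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim))

    def forward(self, s, a, s_next):
        phi, phi_next = self.encode(s), self.encode(s_next)
        phi_pred = self.forward_model(torch.cat([phi, a], dim=-1))
        a_pred = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        # intrinsic reward: scaled forward-model prediction error
        r_int = self.eta * 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(-1)
        fwd_loss = F.mse_loss(phi_pred, phi_next.detach())
        inv_loss = F.mse_loss(a_pred, a)
        return r_int.detach(), fwd_loss + inv_loss

icm = ICM()
s, a, s_next = torch.randn(4, 12), torch.randn(4, 7), torch.randn(4, 12)
r_int, icm_loss = icm(s, a, s_next)
print(r_int.shape)   # one intrinsic reward per transition
# the total reward fed to PPO would be r_extrinsic + r_int
```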
