The use of additional fiber bands for optical communications, known as multi-band or band-division multiplexing (BDM), increases the traffic that can be served in transparent optical networks. In recent years, many solutions have been proposed for resource allocation in such multi-band architectures. This work presents a novel approach based on reinforcement learning (RL) techniques to allocate resources in multi-band elastic optical networks. Two new environments were implemented and added to the Optical-RL-Gym toolkit, covering four scenarios with different band availability. Six agents were tested on four real network topologies, and their episode rewards were compared over a large number of training steps. Results show that Trust Region Policy Optimization (TRPO) is the best-performing agent, with consistent results across all the scenarios and network topologies considered. In addition, we illustrate how the blocking probability varies with the traffic load, and how usage is distributed across bands, to support further discussion.