No. 92 Wucheng Rd
Taiyuan, 030006
China
Shanxi University
Off-policy reinforcement learning, Sample efficiency, Exploration-exploitation trade-off, Cognitive consistency