Tsinghua University
Beijing 100084, China
Keywords: Off-policy reinforcement learning, Sample efficiency, Exploration-exploitation trade-off, Cognitive consistency