You're neither used to consciously thinking about operating individual joints in real life (assuming you're lucky enough to have two working legs) nor in games. You press the keys and, sure, QWOP moves, but you instantly realize the process will not be a smooth one.

Next, the game was wrapped in an OpenAI Gym style environment, which defined the state space, action space, and reward function. The state consisted of the position, velocity, and angle of each body part and joint. The action space consisted of 11 possible actions: each of the 4 QWOP buttons, the 6 two-button combinations, and no keypress (a minimal wrapper is sketched below).

The first agent's learning algorithm was Actor-Critic with Experience Replay (ACER)². This reinforcement learning algorithm combines Advantage Actor-Critic (A2C) with a replay buffer for both on-policy and off-policy learning. The agent therefore learns not only from its most recent experience but also from older experiences stored in memory, which makes ACER more sample efficient (i.e. it learns faster) than its on-policy-only counterparts. To address the instability of off-policy estimators, it uses Retrace Q-value estimation, importance weight truncation, and efficient trust region policy optimization. Refer to the original paper² for more details.
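As a concrete illustration, here is a minimal sketch of what such a Gym-style wrapper could look like, built around the 11-action discrete space. The `QWOPEnv` class, the `game` bridge object and its methods (`restart`, `send_keys`, `read_state`, `distance_delta`, `is_fallen`, `is_finished`), and the feature count are hypothetical stand-ins for whatever interface actually drives the browser game; they are not taken from the original write-up.

```python
import itertools
import numpy as np
import gym
from gym import spaces

KEYS = ["q", "w", "o", "p"]
# 11 discrete actions: no keypress, each single button, every two-button combination.
ACTIONS = [()] + [(k,) for k in KEYS] + list(itertools.combinations(KEYS, 2))


class QWOPEnv(gym.Env):
    """Hypothetical Gym wrapper around the browser game (illustrative only)."""

    def __init__(self, game, num_features=60):
        self.game = game  # assumed bridge to the running game
        self.action_space = spaces.Discrete(len(ACTIONS))  # == 11
        # Flattened positions, velocities and angles of body parts and joints.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(num_features,), dtype=np.float32
        )

    def reset(self):
        self.game.restart()
        return self._observe()

    def step(self, action):
        self.game.send_keys(ACTIONS[action])  # hold this button combination for one frame
        obs = self._observe()
        reward = self.game.distance_delta()   # forward progress this step
        done = self.game.is_fallen() or self.game.is_finished()
        return obs, reward, done, {}

    def _observe(self):
        return np.asarray(self.game.read_state(), dtype=np.float32)
```

The reward shown here is only the forward-progress term; the stability penalties discussed later would be added to (and eventually removed from) it.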
AI finally learns to take strides.

The agent wasn't discovering a special technique used by all of the top speedrunners: swinging the legs upwards and forwards to create extra momentum. I wanted to show the AI this technique, but I was definitely not skilled enough to demonstrate it myself, so I reached out for help. Kurodo, one of the world's top QWOP speedrunners, was generous enough to record and encode 50 episodes of his play for my agent to learn from.

I first tried supervised pre-training like before, but the results were not great: the agent could not make use of the data immediately and quickly forgot/overwrote the pre-training experience. I then tried injecting the experience directly into the ACER replay buffer for off-policy learning, so that half of the agent's memory would be games it played itself and the other half would be Kurodo's experiences. This approach is somewhat similar to Deep Q-learning from Demonstrations (DQfD)⁶, except that rather than adding all the demonstrations at the start, demonstrations are added at the same rate as real experience. The behavior policy probabilities μ(a|s) must also be included with each injection (see the mixed-buffer sketch below). Although there is little theoretical basis for this scheme, it worked surprisingly well.

AI learns advanced leg swing technique.

However, the agent's policy (its actions) would become unstable after a while because it was unable to fully reconcile the external data with its own policy, at which point I removed the expert data from its memory and let it continue training by itself. The final version of this agent had the following training schedule: pre-training, 25 hours by itself, 15 hours with Kurodo's data, and another 25 hours by itself. Its best recorded run was 1m 8s, which is a top-10 speedrun.

Now that the agent had mastered the correct techniques, I created a new agent solely for the purpose of improving speed. I recorded runs of the ACER agent and fed them to a new Prioritized DDQN⁷ agent. This DQN variant was chosen because it is much more sample efficient, so it can quickly optimize the techniques already learned by ACER. Previously the reward included a number of penalties, among them penalties for low torso height, vertical torso movement, and excessive knee bending. When the agent was just learning to stride, these helped stabilize its running technique, but now that it knew how to stride, I took the training wheels off and let it optimize purely for forward velocity (see the reward sketch below). The effective frame rates for training and testing were increased to 18 and 25 respectively (from 9 previously) so that the agent could take more precise actions.

The new agent was first pre-trained with ACER's runs and then improved through self-learning; unlike ACER, it did not do mixed-memory training with Kurodo's data. It was trained for a total of 40 hours, all of it self-learning apart from a couple of minutes of pre-training. The best recorded run of the final agent was 47.34 seconds, which is faster than the best recorded run by a human (48.34 seconds).
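To make the mixed-memory idea concrete, here is a minimal sketch of a replay buffer that keeps the agent's own transitions and demonstration transitions side by side and samples them half and half, storing the behavior policy probabilities alongside each transition. The class, method, and field names are hypothetical illustrations and not the author's actual implementation; in a real ACER setup the stored probabilities would feed the truncated importance weights.

```python
import random
from collections import deque


class MixedReplayBuffer:
    """Sketch: half self-play experience, half demonstration experience."""

    def __init__(self, capacity=50_000):
        # Separate ring buffers so demonstrations are never crowded out
        # by the agent's own, much more plentiful, experience.
        self.own = deque(maxlen=capacity // 2)
        self.demo = deque(maxlen=capacity // 2)

    def add_own(self, transition, mu_probs):
        # `mu_probs`: the behavior policy's action probabilities at this state,
        # stored so off-policy (importance-weighted) updates remain possible.
        self.own.append((transition, mu_probs))

    def add_demo(self, transition, mu_probs):
        # Demonstrations are injected at the same rate as real experience,
        # rather than all at once as in DQfD.
        self.demo.append((transition, mu_probs))

    def sample(self, batch_size):
        # Draw half of each batch from the agent's own games, half from the expert.
        half = batch_size // 2
        own_part = random.sample(list(self.own), min(half, len(self.own)))
        demo_part = random.sample(
            list(self.demo), min(batch_size - len(own_part), len(self.demo))
        )
        batch = own_part + demo_part
        random.shuffle(batch)
        return batch
```

Dropping the `demo` half reproduces the later stage of training, after the expert data was removed from memory.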
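The training-wheels reward can be sketched in the same spirit. The write-up only says that penalties existed for low torso height, vertical torso movement, and excessive knee bending, and that they were later dropped; the field names, thresholds, and coefficients below are invented placeholders.

```python
def shaped_reward(info, training_wheels=True):
    """Sketch: forward progress plus optional stability penalties (illustrative values)."""
    reward = info["distance_delta"]  # forward velocity term, always present
    if training_wheels:
        # Hypothetical penalty terms used only while the agent learns to stride.
        reward -= 0.1 * max(0.0, 1.0 - info["torso_height"])       # low torso height
        reward -= 0.1 * abs(info["torso_vertical_velocity"])       # vertical torso movement
        reward -= 0.05 * max(0.0, info["knee_bend"] - 1.5)         # excessive knee bending
    return reward
```

With `training_wheels=False` the agent is rewarded purely for forward velocity, which is the setting used for the final speed-focused agent.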
With this, QWOP is added to the list of games where AI has surpassed humans.