Learning To Play Minecraft With Video PreTraining (VPT)


The internet contains a vast number of videos. You can watch a person give a gorgeous presentation, a digital artist draw a beautiful sunset, or a Minecraft player build an intricate house. However, these videos provide only a record of what happened, not of how it was achieved: you do not see the exact sequence of mouse movements and keys pressed. This makes it harder to build large-scale foundation models in these domains than in language with GPT, because the lack of action labels poses a new challenge. In language, the "action labels" are simply the next words in a sentence, so no such problem arises.



We present Video PreTraining (VPT), a semi-supervised imitation learning method that makes use of the vast amount of unlabeled video data available on the internet. We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use both past and future information to predict the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos, and the agent learns to act via behavioral cloning on those labels.
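To make the contrast concrete, below is a minimal, hypothetical PyTorch sketch of the two objectives: the IDM attends to frames both before and after each timestep to predict the action, while the behavioral-cloning policy must predict the same (pseudo-labeled) actions causally from past frames only. This is not the released VPT code; module shapes, names, and hyperparameters are illustrative.

```python
# Illustrative sketch only, not the released VPT implementation.
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Predicts the action at step t from frames both before AND after t."""

    def __init__(self, frame_dim: int, n_actions: int):
        super().__init__()
        # frame_dim must be divisible by the number of attention heads.
        layer = nn.TransformerEncoderLayer(frame_dim, nhead=8, batch_first=True)
        # No causal mask: every position may attend to past and future frames.
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(frame_dim, n_actions)

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, time, frame_dim) -> (batch, time, n_actions)
        return self.action_head(self.encoder(frame_embeddings))


def pseudo_label_videos(idm: InverseDynamicsModel, frames: torch.Tensor) -> torch.Tensor:
    """Use a trained IDM to label unlabeled internet video with action ids."""
    with torch.no_grad():
        return idm(frames).argmax(dim=-1)  # (batch, time)


def behavioral_cloning_loss(policy_logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """The causal policy (which must only see past frames) is trained to
    reproduce the IDM's pseudo-labeled actions at every timestep."""
    return nn.functional.cross_entropy(
        policy_logits.flatten(0, 1), pseudo_labels.flatten()
    )
```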



VPT Zero-Shot Results



We chose to validate our method in Minecraft because it is one of the most actively played video games in the world, and therefore has a wealth of freely available video data, and because it is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. In contrast to previous works in Minecraft that use simplified action spaces, our AI uses the native human interface: the mouse and keyboard at a 20Hz framerate.
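As an illustration of what "native human interface" means in practice, the sketch below shows one plausible way to represent a single 20Hz action as keypresses plus continuous mouse movement. The field names and structure are assumptions for illustration, not the exact action space from the paper.

```python
# Hypothetical representation of one native-human-interface action.
from dataclasses import dataclass, field
from typing import List

FRAMES_PER_SECOND = 20  # the agent observes and acts at 20 Hz


@dataclass
class HumanInterfaceAction:
    keys_down: List[str] = field(default_factory=list)  # e.g. ["w", "space"]
    mouse_dx: float = 0.0                                # horizontal camera movement
    mouse_dy: float = 0.0                                # vertical camera movement
    left_click: bool = False                             # attack / mine
    right_click: bool = False                            # place block / use item


# At this rate, about 50 seconds of play corresponds to 1,000 consecutive actions.
assert 50 * FRAMES_PER_SECOND == 1000
```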



Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the "VPT foundation model") accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table. This sequence takes a human proficient in Minecraft approximately 50 seconds, or 1,000 consecutive game actions.



Additionally, the model performs other complex skills that are common in the game, such as swimming and hunting animals for food. It also learned "pillar jumping", a common Minecraft behavior in which the player repeatedly jumps and places a block underneath themselves to gain height.



Fine-tuning with Behavioral Cloning



Foundation models are designed to have a broad behavior profile and to be capable across a wide range of tasks. To incorporate new knowledge or to let them specialize on a narrower task distribution, it is common to fine-tune these models on smaller, more specific datasets. To show how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in freshly generated Minecraft worlds and build simple houses from basic Minecraft materials. We hoped this would amplify the foundation model's ability to reliably perform "early game" skills such as building crafting tables. Fine-tuning on this dataset yields a significant improvement in the model's ability to reliably perform early game skills, and the fine-tuned model also learns to go deeper into the technology tree, crafting both wooden and stone tools. Sometimes we even see the agent searching through villages and raiding chests.
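The following is a minimal sketch of what behavioral-cloning fine-tuning on such a downstream dataset might look like, assuming a PyTorch policy and a generic (frames, actions) dataset. The checkpoint path, dataset, and hyperparameters are hypothetical.

```python
# Illustrative fine-tuning loop; not the released VPT training code.
import torch
from torch.utils.data import DataLoader


def finetune(policy, foundation_checkpoint: str, housebuilding_dataset, epochs: int = 3):
    """Start from foundation-model weights and continue supervised training
    on a small, narrower contractor dataset (e.g. house building)."""
    policy.load_state_dict(torch.load(foundation_checkpoint))
    loader = DataLoader(housebuilding_dataset, batch_size=16, shuffle=True)
    # A lower learning rate than pretraining helps the model specialize
    # without losing the broad behavior learned from the large video corpus.
    optimizer = torch.optim.Adam(policy.parameters(), lr=2e-5)
    for _ in range(epochs):
        for frames, actions in loader:
            logits = policy(frames)  # (batch, time, n_actions)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), actions.flatten()
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```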



Improved early game behavior from BC fine-tuning



Data Scaling



An important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM than to directly train a BC foundation model from that same small contractor dataset. To validate this hypothesis, we train foundation models on increasing amounts of data, from 1 hour to 70,000 hours. Models trained on 2,000 hours of data or less use contractor data with ground-truth action labels; models trained on more than 2,000 hours use internet data labeled by our IDM. We then take each foundation model and fine-tune it on the house-building dataset described in the previous section.
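A small sketch of that experimental setup is shown below: the 2,000-hour threshold comes from the text, while the list of intermediate dataset sizes is illustrative rather than the exact set of runs from the paper.

```python
# Illustrative sketch of the data-scaling experiment described above.
def label_source(hours_of_training_data: float) -> str:
    """Runs at or below 2,000 hours use contractor data with ground-truth
    action labels; larger runs use internet video labeled by the IDM."""
    if hours_of_training_data <= 2000:
        return "contractor_ground_truth"
    return "idm_labeled_internet_video"


# Hypothetical set of foundation-model training runs, from 1 to 70,000 hours.
for hours in [1, 10, 100, 1000, 2000, 10000, 70000]:
    print(f"{hours:>6} hours -> {label_source(hours)}")
```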



Fine-tuning: Effect of foundation model training data



As foundation model data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.



Fine-Tuning with Reinforcement Learning



When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL, because emulating human behavior is likely far more helpful than taking random actions. We set our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult when using the native human interface.



The process of crafting a diamond pickaxe is complicated and requires many subtasks. To make this task tractable, we reward agents for each item in the sequence.
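As a concrete illustration of this reward shaping, the sketch below rewards the agent the first time it obtains each item on the path to a diamond pickaxe within an episode. The milestone list and reward values are illustrative assumptions, not the exact scheme used in the paper.

```python
# Hypothetical shaped reward over the diamond-pickaxe item sequence.
DIAMOND_PICKAXE_MILESTONES = [
    "log", "planks", "crafting_table", "wooden_pickaxe",
    "cobblestone", "stone_pickaxe", "furnace", "iron_ore",
    "iron_ingot", "iron_pickaxe", "diamond", "diamond_pickaxe",
]


def shaped_reward(inventory: dict, already_rewarded: set) -> float:
    """Return a reward the first time each milestone item appears in the
    agent's inventory during an episode; zero otherwise."""
    reward = 0.0
    for item in DIAMOND_PICKAXE_MILESTONES:
        if inventory.get(item, 0) > 0 and item not in already_rewarded:
            reward += 1.0
            already_rewarded.add(item)
    return reward
```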



We find that an RL policy trained from a random initialization (the standard RL approach) barely achieves any reward: it rarely learns to collect logs or even sticks. In stark contrast, fine-tuning from the VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes proficient humans over 20 minutes (24,000 actions) on average.



Reward over episodes



Conclusion



VPT paves the path toward allowing agents to learn to act by watching the vast number of videos on the internet. Compared to generative video modeling or contrastive methods, which would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results will transfer to other similar domains, such as computer usage.



For more information, please see our paper. We are also open-sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year: contestants can use and fine-tune our models to try to solve many difficult tasks in Minecraft. Anyone interested can visit the competition webpage to compete for a $100,000 blue-sky prize as well as a regular $20,000 prize pool.

