

StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including from 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model for 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder. We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language.
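As a minimal sketch of the code-completion use case, the snippet below loads StarCoder through the 🤗 Transformers library and completes a function signature. Note that the `bigcode/starcoder` checkpoint is gated on the Hub, so you need to accept its license and authenticate first; the prompt and generation settings here are purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated checkpoint on the Hugging Face Hub: accept the license
# and run `huggingface-cli login` before downloading.
checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Prompt the model with the start of a function and let it
# autocomplete the body.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```

The same `generate` interface covers the other use cases mentioned above: instead of a bare function signature, the prompt can be a code snippet followed by an instruction, or a series of dialogue turns when using the model as a technical assistant.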
