Count tokens

count_tokens(text, model_name='cl100k_base')

Count the number of tokens in the text

References: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

| Encoding name           | OpenAI models                                                                                 |
|-------------------------|-----------------------------------------------------------------------------------------------|
| `cl100k_base`           | gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large  |
| `p50k_base`             | Codex models, text-davinci-002, text-davinci-003                                              |
| `r50k_base` (or `gpt2`) | GPT-3 models like davinci                                                                     |

Parameters:

| Name         | Type  | Description                                                                              | Default         |
|--------------|-------|------------------------------------------------------------------------------------------|-----------------|
| `text`       | `str` | The text to count tokens in                                                                | *required*      |
| `model_name` | `str` | The tiktoken encoding name to use for tokenization (an encoding name, not a model name)   | `'cl100k_base'` |

Returns:

| Name          | Type  | Description                      |
|---------------|-------|----------------------------------|
| `total_token` | `int` | The number of tokens in the text |

Source code in Docs2KG/utils/llm/count_tokens.py
```python
import tiktoken


def count_tokens(text, model_name="cl100k_base") -> int:
    """
    Count the number of tokens in the text.

    References: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    | Encoding name       | OpenAI models                                                                                |
    |---------------------|----------------------------------------------------------------------------------------------|
    | cl100k_base         | gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large |
    | p50k_base           | Codex models, text-davinci-002, text-davinci-003                                             |
    | r50k_base (or gpt2) | GPT-3 models like davinci                                                                    |

    Args:
        text (str): The text to count tokens in
        model_name (str): The tiktoken encoding name to use for tokenization
            (despite the parameter name, this is an encoding name, not a
            model name). Default is "cl100k_base"

    Returns:
        total_token (int): The number of tokens in the text
    """
    enc = tiktoken.get_encoding(model_name)
    tokens = enc.encode(text)
    return len(tokens)
```