How to get started with Docs2KG?

Installation

We have published the package to PyPI as Docs2KG.

You can install it via:

pip install Docs2KG

Confirm the Installation

Run python in your terminal, and then import the package:

import Docs2KG

print(Docs2KG.__author__)  # ai4wa
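
If you would rather check the installation without importing the package, the standard library can look up the installed distribution. The following is a minimal sketch: the helper name `installed_version` is hypothetical and is not part of Docs2KG. It assumes only that the distribution is registered under the name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "Docs2KG" is the distribution name from the pip command above.
found = installed_version("Docs2KG")
print(f"Docs2KG version: {found}" if found else "Docs2KG is not installed")
```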

Set up the `config.yml` file

Several settings need to be configured in the config.yml file.

You can copy the config.example.yml file and rename it to config.yml.

cp config.example.yml config.yml
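
On systems without `cp`, or if you want to avoid overwriting a config.yml you have already edited, the copy can be done from Python. This is a small sketch rather than anything Docs2KG ships: `ensure_config` is a hypothetical helper, and it assumes both files are in the directory you pass in.

```python
import shutil
from pathlib import Path


def ensure_config(example: Path, target: Path) -> bool:
    """Copy example to target unless target already exists.

    Returns True if a copy was made, False if target was left untouched.
    """
    if target.exists():
        return False
    shutil.copyfile(example, target)
    return True
```

Calling `ensure_config(Path("config.example.yml"), Path("config.yml"))` from the project directory mirrors the `cp` command above, except that an existing config.yml is kept.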

There are three main configurations that you need to set up:

data directories for input, output and ontology
agent configurations for openai, ollama, huggingface and llamacpp
ontology configurations: entity list, relation list, ontology JSON and domain description text

Set up the `CONFIG_FILE` environment variable

You can set the CONFIG_FILE environment variable to the path of the config.yml file:

export CONFIG_FILE=/path/to/config.yml

Or you can set it in your code before importing Docs2KG:

import os
from pathlib import Path

os.environ["CONFIG_FILE"] = str(Path.cwd() / "config.yml")
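
A mistyped path is easier to diagnose if it is caught before the package is imported. The sketch below wraps the assignment above in a hypothetical helper, `set_config_file`, which is not part of Docs2KG. It resolves the path and raises an error if the file is missing. How Docs2KG itself reacts to a missing file is not covered here.

```python
import os
from pathlib import Path


def set_config_file(path) -> Path:
    """Point CONFIG_FILE at an existing file and return its absolute path."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Config file not found: {resolved}")
    # Store an absolute path so a later change of working directory is harmless.
    os.environ["CONFIG_FILE"] = str(resolved)
    return resolved
```

Call `set_config_file("config.yml")` first, and run `import Docs2KG` only afterwards.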

Data Directories

input_dir: The directory where the input documents are stored.
output_dir: The directory where the output knowledge graph will be stored. The output JSON is written to this directory.
ontology_dir: The directory where the ontology files are stored.
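
It can help to create the three directories up front so that the paths in config.yml point at real locations. The following is a sketch under assumptions: the sub-directory names `input`, `output` and `ontology` and the helper `ensure_data_dirs` are illustrative only. Use whatever paths you actually set for input_dir, output_dir and ontology_dir.

```python
from pathlib import Path


def ensure_data_dirs(root: Path, names=("input", "output", "ontology")) -> dict:
    """Create one sub-directory per name under root and return their paths."""
    paths = {name: root / name for name in names}
    for path in paths.values():
        # parents=True creates root if needed; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

For example, `ensure_data_dirs(Path("data"))` returns a dictionary whose values can be copied into the three directory settings.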

Agent Configurations

openai: The configuration for the OpenAI API. You can change the endpoints to use the Azure OpenAI API.
ollama: The configuration for the Ollama API. You will need to set up an Ollama server yourself and have the specific model running.
huggingface: The configuration for the Hugging Face API. You will need to obtain an API key from Hugging Face.
llamacpp: The configuration for the quantized model. You will need to download your specific model and set the path to it.

Ontology Configurations

Entity List and Relation List

entity_list: The list of entities that you want to extract from the documents. It should have the following format:

entity,entity_type
xx,xx

relation_list: The list of relations that you want to extract from the documents. It should have the following format:

relation,relation_type
xx,xx
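
Both lists can be checked before they are handed to the pipeline. The sketch below assumes they are comma-separated files whose first row is the header shown above. `read_pairs` is a hypothetical helper, not a Docs2KG function, and the file names in the usage note are placeholders.

```python
import csv
from pathlib import Path


def read_pairs(path: Path, header: tuple) -> list:
    """Read a two-column CSV file and return its data rows as tuples.

    Raises ValueError if the header differs from the expected one or if a
    data row does not hold exactly two non-empty values.
    """
    with path.open(newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    if not rows or tuple(cell.strip() for cell in rows[0]) != header:
        raise ValueError(f"{path.name}: expected header {','.join(header)}")
    pairs = []
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != 2 or not all(cell.strip() for cell in row):
            raise ValueError(f"{path.name}: data row {number} needs two non-empty values")
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```

An entity list would be read with `read_pairs(Path("entity_list.csv"), ("entity", "entity_type"))`, and a relation list with the header `("relation", "relation_type")`.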

Ontology JSON

ontology_json: The ontology JSON file that you want to use for the knowledge graph. It should have the following format:

{
  "entity_types": [
    "TECTSETTIN",
    "Exploration Method",
    "SUPERBASIN"
  ],
  "relation_types": [
    "WorksFor",
    "Manages",
    "LeadsProject",
    "BelongsTo"
  ],
  "connections": [
    [
      "WorksFor",
      "Person",
      "Organization"
    ],
    [
      "Manages",
      "Person",
      "Department"
    ],
    [
      "LeadsProject",
      "Person",
      "Project"
    ],
    [
      "BelongsTo",
      "Project",
      "Department"
    ]
  ]
}
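
A quick consistency check can catch typos in an ontology file. The sketch below assumes that each entry in connections is a [relation, source entity type, target entity type] triple, and that every name it uses should also be listed under relation_types or entity_types. Whether Docs2KG enforces this is not stated here, and `undeclared_names` is a hypothetical helper. In the example above, the connections refer to entity types such as Person and Organization that are not listed under entity_types, so this check would report them.

```python
import json


def undeclared_names(ontology: dict) -> dict:
    """Return names used in connections but missing from the declared lists."""
    entity_types = set(ontology.get("entity_types", []))
    relation_types = set(ontology.get("relation_types", []))
    missing_relations, missing_entities = set(), set()
    for relation, source, target in ontology.get("connections", []):
        if relation not in relation_types:
            missing_relations.add(relation)
        for entity in (source, target):
            if entity not in entity_types:
                missing_entities.add(entity)
    # Sorted lists keep the report stable between runs.
    return {
        "relation_types": sorted(missing_relations),
        "entity_types": sorted(missing_entities),
    }


# Illustrative input only: "Manages" and "Department" are deliberately undeclared.
sample = json.loads("""
{
  "entity_types": ["Person", "Organization"],
  "relation_types": ["WorksFor"],
  "connections": [
    ["WorksFor", "Person", "Organization"],
    ["Manages", "Person", "Department"]
  ]
}
""")
print(undeclared_names(sample))
```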

Domain Description Text

Put the domain description text in the domain_description.txt file. It will be used to generate the ontology later.