How to get started with Docs2KG?
Installation
We have published the package to PyPi: Docs2KG,
You can install it via:
pip install Docs2KG
Confirm the Installation
Run python in your terminal, and then import the package:
import Docs2KG
print(Docs2KG.__author__) # ai4wa
Set up the config.yml
file
We have several configurations that you will need to set up in the config.yml
file.
You can copy the config.example.yml
file and rename it to config.yml
.
cp config.example.yml config.yml
There are three main configurations that you need to set up:
data
directories for input, output and ontology- agent configurations for
openai
,ollama
,huggingface
andllamacpp
- ontology configurations regarding: entity list, relation list, ontology json and domain description text
Setup CONFIG_FILE
environment variable
You can either set the CONFIG_FILE
environment variable to the path of the config.yml
file.
export CONFIG_FILE=/path/to/config.yml
Or you can set it before you import Docs2KG in the code like this
import os
from pathlib import Path
os.environ["CONFIG_FILE"] = str(Path.cwd() / "config.yml")
Data Directories
input_dir
: The directory where the input documents are stored.output_dir
: The directory where the output knowledge graph will be stored. Our output json will be stored in this directory.ontology_dir
: The directory where the ontology files are stored.
Agent Configurations
openai
: The configuration for the OpenAI API. You can change the endpoints to use Azure OpenAI API.ollama
: The configuration for the OLLAMA API. You will need to set up ollama sever by yourself and get the specific model running.huggingface
: The configuration for the Huggingface API. You will need to acquire the API key from Huggingface.llamacpp
: The configuration for the quantization model. You will need to download your specific model and set the path to the model.
Ontology Configurations
Entity List and Relation List
entity_list
: The list of entities that you want to extract from the documents. It should have the following format:
entity,entity_type
xx,xx
relation_list
: The list of relations that you want to extract from the documents. It should have the following format:
relation,relation_type
xx,xx
Ontology JSON
ontology_json
: The ontology json file that you want to use for the knowledge graph. It should have the following
format:
{
"entity_types": [
"TECTSETTIN",
"Exploration Method",
"SUPERBASIN"
],
"relation_types": [
"WorksFor",
"Manages",
"LeadsProject",
"BelongsTo"
],
"connections": [
[
"WorksFor",
"Person",
"Organization"
],
[
"Manages",
"Person",
"Department"
],
[
"LeadsProject",
"Person",
"Project"
],
[
"BelongsTo",
"Project",
"Department"
]
]
}
Domain Description Text
Put the domain description text in the domain_description.txt
file. It will be used to generate the ontology later.