How to use Docs2KG Package?
After you have done the setup. Then you can start to implement your own code to generate the knowledge graph from your documents.
The overall steps includes:
- Document Digitization: Get any format into markdown
- KG Construction: Construct the knowledge graph from the documents
Example code
import os
from pathlib import Path
# set the environment variable to the config file to current directory
os.environ["CONFIG_FILE"] = str(Path.cwd() / "config.yml")
from Docs2KG.digitization.image.pdf_docling import PDFDocling
from Docs2KG.kg_construction.layout_kg.layout_kg import LayoutKGConstruction
from Docs2KG.kg_construction.semantic_kg.ner.ner_spacy_match import NERSpacyMatcher
from Docs2KG.utils.config import PROJECT_CONFIG
if __name__ == "__main__":
# digitization for PDF
pdf_path = (
PROJECT_CONFIG.data.input_dir / "gsdRec_2024_08.pdf"
) # path to the pdf file or any other format
processor = PDFDocling(file_path=pdf_path) # initialize the processor
processor.process() # process the pdf file
# knowledge graph construction
project_id = "wamex" # project id
md_files = (
PROJECT_CONFIG.data.output_dir
/ "gsdRec_2024_08"
/ "PDFDocling"
/ "gsdRec_2024_08.md"
) # path to the markdown file which is generated from the pdf
layout_kg_construction = LayoutKGConstruction(
project_id
) # initialize the layout kg construction
layout_kg_construction.construct(
[{"content": md_files.read_text(), "filename": md_files.stem}]
) # construct the layout kg
example_json = (
PROJECT_CONFIG.data.output_dir
/ "projects"
/ project_id
/ "layout"
/ "gsdRec_2024_08.json"
) # path to the layout kg json file
# entity extraction with NER spacy matcher given an entity list
entity_extractor = NERSpacyMatcher(project_id)
entity_extractor.construct_kg([example_json])
After this, the output json will be with the entities and relations extracted from the documents.
Upload the output json to the Docs2KG web interface for next step human-in-the-loop validation and verification.