How to use the Docs2KG package

Once setup is complete, you can write your own code to generate a knowledge graph from your documents.

The overall steps are:

  • Document Digitization: Convert the input documents, whatever their format, into Markdown
  • KG Construction: Construct the knowledge graph from the digitized documents

Example code

import os
from pathlib import Path

# Point the CONFIG_FILE environment variable at config.yml in the current
# directory (set here, before the Docs2KG imports below)

os.environ["CONFIG_FILE"] = str(Path.cwd() / "config.yml")

from Docs2KG.digitization.image.pdf_docling import PDFDocling
from Docs2KG.kg_construction.layout_kg.layout_kg import LayoutKGConstruction
from Docs2KG.kg_construction.semantic_kg.ner.ner_spacy_match import NERSpacyMatcher
from Docs2KG.utils.config import PROJECT_CONFIG

if __name__ == "__main__":
    # digitization for PDF
    pdf_path = (
            PROJECT_CONFIG.data.input_dir / "gsdRec_2024_08.pdf"
    )  # path to the pdf file or any other format
    processor = PDFDocling(file_path=pdf_path)  # initialize the processor
    processor.process()  # process the pdf file
    # knowledge graph construction
    project_id = "wamex"  # project id
    md_files = (
            PROJECT_CONFIG.data.output_dir
            / "gsdRec_2024_08"
            / "PDFDocling"
            / "gsdRec_2024_08.md"
    )  # path to the markdown file which is generated from the pdf
    layout_kg_construction = LayoutKGConstruction(
        project_id
    )  # initialize the layout kg construction
    layout_kg_construction.construct(
        [{"content": md_files.read_text(), "filename": md_files.stem}]
    )  # construct the layout kg
    example_json = (
            PROJECT_CONFIG.data.output_dir
            / "projects"
            / project_id
            / "layout"
            / "gsdRec_2024_08.json"
    )  # path to the layout kg json file
    # entity extraction with NER spacy matcher given an entity list
    entity_extractor = NERSpacyMatcher(project_id)
    entity_extractor.construct_kg([example_json])

After this runs, the output JSON contains the entities and relations extracted from the documents.
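Before uploading, it can help to take a quick look at what the output JSON contains. The sketch below is not part of Docs2KG and assumes nothing about the output schema: `summarize_json` is a hypothetical helper that uses only the Python standard library to report the top-level structure of a JSON file and how often each key name occurs in it. The path passed on the command line is whichever JSON file you want to inspect, for example the layout file referenced as `example_json` above.

```python
import json
import sys
from collections import Counter
from pathlib import Path


def summarize_json(path):
    """Return a small, schema-agnostic summary of a JSON file.

    Hypothetical helper for illustration; it is not a Docs2KG API.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"top_level_type": type(data).__name__}
    if isinstance(data, dict):
        summary["top_level_keys"] = sorted(data.keys())
    elif isinstance(data, list):
        summary["length"] = len(data)

    # Walk the whole structure and count how often each key name appears.
    key_counts = Counter()
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            key_counts.update(node.keys())
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    summary["key_counts"] = dict(key_counts)
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_json.py path/to/output.json
    print(json.dumps(summarize_json(sys.argv[1]), indent=2))
```

Because the helper only counts keys, it works regardless of how the entities and relations are laid out in the file; an unexpectedly empty summary is a hint that an earlier step did not produce the expected output.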

Upload the output JSON to the Docs2KG web interface for the next step: human-in-the-loop validation and verification.