SemanticKG

The plan is:

  • We keep layout_kg.json and use it as the base
  • Then we extract the linkages between nodes
  • The result is written out as semantic_kg.json

Within it, each link record will have:

  • source_uuid
  • source_semantic
  • predicate
  • target_uuid
  • target_semantic
  • extraction_method
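
Each record in semantic_kg.json can be pictured as a small dict with the six fields above. A purely illustrative sketch (the uuid values and semantic labels below are hypothetical, not real Docs2KG output):

```python
# Illustrative sketch of one link record in semantic_kg.json.
# The uuid values and semantic labels are hypothetical examples.
semantic_link = {
    "source_uuid": "3f2b6c1e-0000-4000-8000-000000000001",
    "source_semantic": "table",
    "predicate": "MENTIONED_IN",
    "target_uuid": "9a4d7e20-0000-4000-8000-000000000002",
    "target_semantic": "text_block",
    "extraction_method": "rule_based",  # or "llm" when the LLM path is used
}

# All six fields from the plan are present
assert set(semantic_link) == {
    "source_uuid", "source_semantic", "predicate",
    "target_uuid", "target_semantic", "extraction_method",
}
```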

What we want to link:

  • Table to Content
    • Where the table is mentioned, i.e. finding its reference point in the text
  • Image to Content
    • Same as for tables

Discussion

  • Within Page
    • Text2KG with Named Entity Recognition
  • Across Pages
    • Summary Linkage?
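
For the within-page Text2KG route, each extracted triplet follows the shape requested from the LLM in `llm_extract_triplet` below. The keys come from that prompt; the values here are made-up examples:

```python
# Shape of one Text2KG triplet as requested from the LLM in llm_extract_triplet.
# The keys come from the prompt in the source below; the values are made up.
triplet = {
    "subject": "Figure 1.1",
    "subject_ner_type": "FIGURE",
    "predicate": "shows",
    "object": "the distribution of the population",
    "object_ner_type": "CONCEPT",
}

# Each text node gets a list of such triplets under node_properties["text2kg"].
text2kg = [triplet]
assert {"subject", "predicate", "object"} <= set(triplet)
```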

Alert: some of the functions require the help of an LLM.
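
As the constructor below shows, `SemanticKG` expects `<folder_path>/kg/layout_kg.json` to already exist and raises `FileNotFoundError` otherwise. A minimal sketch of preparing that layout, using only the standard library (the layout_kg structure here is a simplified, hypothetical example):

```python
import json
import tempfile
from pathlib import Path

# SemanticKG reads <folder_path>/kg/layout_kg.json as its base graph.
folder_path = Path(tempfile.mkdtemp())
kg_folder = folder_path / "kg"
kg_folder.mkdir(parents=True, exist_ok=True)

# Minimal hypothetical layout_kg.json: a root node with one page child.
layout_kg = {
    "node_type": "document",
    "children": [
        {
            "node_type": "page",
            "node_properties": {"page_text": "example page text"},
            "children": [],
        }
    ],
}
(kg_folder / "layout_kg.json").write_text(json.dumps(layout_kg, indent=4))

assert (kg_folder / "layout_kg.json").exists()

# With that file in place, the class can then be driven as:
#     kg = SemanticKG(folder_path, llm_enabled=True)
#     kg.add_semantic_kg()
```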

Source code in Docs2KG/kg/semantic_kg.py
class SemanticKG:
    """
    The plan is

    - We keep the layout_kg.json, and use this as the base
    - Then we start to extract the linkage
    - And then we have a semantic_kg.json

    Within this one we will have

    - source_uuid
    - source_semantic
    - predicate
    - target_uuid
    - target_semantic
    - extraction_method

    What we want to link:

    - Table to Content
        - Where this table is mentioned, which is actually finding the reference point
    - Image to Content
        - Same as table

    Discussion

    - Within Page
        - Text2KG with Named Entity Recognition
    - Across Pages
        - Summary Linkage?

    Alerts: Some of the functions will require the help of LLM
    """

    def __init__(
        self,
        folder_path: Path,
        llm_enabled: bool = False,
        input_format: str = "pdf_exported",
    ):
        """
        Initialize the SemanticKG class
        Args:
            folder_path (Path): The path to the pdf file
            llm_enabled (bool, optional): Whether to use LLM. Defaults to False.

        """
        self.folder_path = folder_path
        self.llm_enabled = llm_enabled
        self.cost = 0
        logger.info("LLM is enabled" if self.llm_enabled else "LLM is disabled")
        self.kg_folder = self.folder_path / "kg"
        if not self.kg_folder.exists():
            self.kg_folder.mkdir(parents=True, exist_ok=True)

        self.layout_kg_file = self.kg_folder / "layout_kg.json"
        # if layout_kg does not exist, then raise an error
        if not self.layout_kg_file.exists():
            raise FileNotFoundError(f"{self.layout_kg_file} does not exist")
        # load layout_kg
        self.layout_kg = self.load_kg(self.layout_kg_file)
        self.input_format = input_format

    def add_semantic_kg(self):
        """
        As discussed in the plan, we will add the semantic knowledge graph based on the layout knowledge graph

        Returns:

        """
        # we will start with the image to content
        if self.input_format == "pdf_exported":
            self.semantic_link_image_to_content()
            self.semantic_link_table_to_content()
        if self.input_format != "excel":
            self.semantic_page_summary()
        self.semantic_text2kg()

    def semantic_link_image_to_content(self):
        """
        Link the image to the content

        1. We will need to extract the image's caption and reference point
        2. Use this caption or 1.1 to search the context, link the image to where the image is mentioned

        Returns:

        """

        # first locate the image caption
        for page in self.layout_kg["children"]:
            # within the page node, then it should have the children start with the image node
            for child in page["children"]:
                if child["node_type"] == "image":
                    # child now is the image node
                    # if this child does not have children, then we will skip
                    if "children" not in child or len(child["children"]) == 0:
                        continue
                    # logger.info(child)
                    for item in child["children"]:
                        # if this is the caption, then we will extract the text
                        text = item["node_properties"]["content"]
                        if self.util_caption_detection(text):
                            logger.info(f"Figure/Caption detected: {text}")
                            # we will use this
                            child["node_properties"]["caption"] = text
                            """
                            Link the caption to where it is mentioned

                            For example, if the caption is "Figure 1.1: The distribution of the population",
                            then we will search the context
                            And found out a place indicate that: as shown in Figure 1.1,
                            the distribution of the population is ...

                            We need to find a way to match it back to the content

                            Current plan of attack: a rule-based approach.

                            If the caption contains a "Figure XX" anchor, we search the
                            context for "Figure XX" and link it back to the content.
                            """

                            uuids, caption = self.util_caption_mentions_detect(
                                caption=text
                            )
                            logger.info(f"UUIDs: {uuids}")
                            child["node_properties"]["mentioned_in"] = uuids
                            if caption:
                                child["node_properties"]["unique_description"] = caption
                            continue

        self.export_kg()

    def semantic_link_table_to_content(self):
        """
        Link the table to the content

        So we will do the same thing first for the table

        Returns:

        """
        for page in self.layout_kg["children"]:
            # within the page node, then it should have the children start with the image node
            for child in page["children"]:
                if child["node_type"] == "table_csv":
                    # child now is the image node
                    # if this child do not have children, then we will skip
                    if "children" not in child or len(child["children"]) == 0:
                        continue
                    # logger.info(child)
                    for item in child["children"]:
                        # if this is the caption, then we will extract the text
                        text = item["node_properties"]["content"]
                        if self.util_caption_detection(text):
                            logger.info(f"Table/Caption detected: {text}")
                            # we will use this
                            child["node_properties"]["caption"] = text
                            uuids, caption = self.util_caption_mentions_detect(
                                caption=text
                            )
                            logger.info(f"UUIDs: {uuids}")
                            child["node_properties"]["mentioned_in"] = uuids
                            if caption:
                                child["node_properties"]["unique_description"] = caption
                            continue
        self.export_kg()

    def semantic_text2kg(self):
        """
        ## General Goal of this:

        - A list of triplet: (subject, predicate, object)
        - Triplets will be associated to the tree
        - Frequent subject will be merged, and linked

        Plan of attack:

        1. We need to do the Named Entity Recognition for each sentence
        2. Do NER coexist relationship
        3. Last step will be extracting the semantic NER vs NER relationship

        How do we construct the relation?

        - We will grab the entities mapping to text uuid
        {
           "ner_type": {
            "entities": [uuid1, uuid2]
           }
        }

        """
        if self.llm_enabled:
            current_cost = self.cost
            # do the triple extraction
            # self.semantic_triplet_extraction(self.layout_kg)
            logger.info("Start the semantic text2kg extraction")
            nodes = self.semantic_triplet_extraction(self.layout_kg, [])
            # use tqdm to show the progress
            for node in tqdm(nodes, desc="Extracting triplets"):
                # extract the triplets from the text
                text = node["node_properties"]["content"]
                logger.debug(text)
                if text == "":
                    continue
                triplets = self.llm_extract_triplet(text)
                node["node_properties"]["text2kg"] = triplets
            self.export_kg()
            logger.info(f"LLM cost: {self.cost - current_cost}")
        else:
            # Hard to do this without LLM
            logger.info("LLM is not enabled, skip the semantic text2kg extraction")

    def semantic_triplet_extraction(self, node: dict, nodes: List[dict]):
        """
        Extract triplets from the text

        It will update the node with the Text2KG field, add a list of triplets
        Args:
            node (dict): The node in the layout knowledge graph
            nodes (List[dict]): The list of nodes

        Returns:

        """
        for child in node["children"]:
            if "children" in child:
                nodes = self.semantic_triplet_extraction(child, nodes)
            content = child["node_properties"].get("content", "")
            if not content:
                continue
            nodes.append(child)
        return nodes

    def semantic_page_summary(self):
        """
        Summarize the page, which gives a better understanding of it.

        Not sure whether this will enhance RAG or not,
        but it will make the page easier for humans to understand.

        When doing the summary, we may also need to supply the previous and
        following pages' information as context.

        Returns:

        """
        for page_index, page in enumerate(self.layout_kg["children"]):
            if page["node_type"] == "page":
                page_content = page["node_properties"]["page_text"]
                logger.debug(page_content)
                summary = self.llm_page_summary(page_content)
                page["node_properties"]["summary"] = summary

        self.export_kg()

    @staticmethod
    def load_kg(file_path: Path) -> dict:
        """
        Load the knowledge graph from JSON

        Args:
            file_path (Path): The path to the JSON file

        Returns:
            dict: The knowledge graph
        """
        with open(file_path, "r") as f:
            kg = json.load(f)
        return kg

    def export_kg(self):
        """
        Export the semantic knowledge graph to a JSON file
        """

        with open(self.layout_kg_file, "w") as f:
            json.dump(self.layout_kg, f, indent=4)

    def util_caption_detection(self, text: str) -> bool:  # noqa
        """
        Given a text, detect whether it is a caption for an image or a table.

        We currently use keyword matching; the LLM-based detection is
        disabled because its performance is not good enough yet.

        Returns:

        """
        for keyword in CAPTION_KEYWORDS:
            if keyword in text.lower():
                return True
        # if self.llm_enabled:
        #     return self.llm_detect_caption(text)
        return False

    def util_caption_mentions_detect(self, caption: str) -> Tuple[List[str], str]:
        """

        First we need to find the unique description for the caption.

        For example: Plate 1.1: The distribution of the population

        Plate 1.1 is the unique description

        We will need to search the whole document to find the reference point

        Args:
            caption (str): The caption text


        Returns:
            uuids (List[str]): The list of uuids where the caption is mentioned
            unique_description (str): The unique description extracted from the caption

        """
        # first extract the unique description
        # Extract the unique description from the caption
        keyword_patten = "|".join(CAPTION_KEYWORDS)
        match = re.search(rf"(\b({keyword_patten}) \d+(\.\d+)*\b)", caption.lower())
        unique_description = None
        if match:
            unique_description = match.group(1)
        else:
            if self.llm_enabled:
                """
                Try to use LLM to do this work
                """
                unique_description = self.llm_detect_caption_mentions(caption)
                logger.info(f"Unique description: {unique_description}")

        if not unique_description:
            return [], ""
        logger.info(f"Unique description: {unique_description}")
        mentioned_uuids = []
        # search the context
        mentioned_uuids = self.util_mentioned_uuids(
            self.layout_kg, unique_description, mentioned_uuids
        )
        return mentioned_uuids, unique_description

    def util_mentioned_uuids(
        self, node: dict, unique_description: str, uuids: List[str]
    ) -> List[str]:
        """
        Search the context for the unique description

        Args:
            node (dict): The node in the layout knowledge graph
            unique_description (str): The unique description extracted from the caption
            uuids (List[str]): The list of uuids where the unique description is mentioned

        Returns:
            uuids (List[str]): The list of uuids where the unique description is mentioned
        """
        for child in node["children"]:
            if "node_properties" in child:
                if "content" in child["node_properties"]:
                    if (
                        unique_description
                        in child["node_properties"]["content"].lower()
                    ):
                        uuids.append(child["uuid"])
            if "children" in child:
                uuids = self.util_mentioned_uuids(child, unique_description, uuids)
        return uuids

    def llm_detect_caption(self, text: str) -> bool:
        """
        Use LLM to detect whether the given text is a caption for an image or table.

        Args:
            text (str): The text to be evaluated.

        Returns:
            bool: True if the text is identified as a caption, False otherwise.
        """
        try:
            messages = [
                {
                    "role": "system",
                    "content": """You are a system that help to detect if the text is a caption for an image or table.
                                """,
                },
                {
                    "role": "user",
                    "content": f"""
                        Is the following text a caption for image or table?
                        You need to detect if the text is a caption for an image or table.

                        Please return the result in JSON format as follows:
                            - {{"is_caption": 1}} if it is a caption,
                            - or {{"is_caption": 0}} if it is not a caption.

                        Some examples are captions for images or tables:
                        - "Figure 1.1: The distribution of the population"
                        - "Table 2.1: The distribution of the population"
                        - "Plate 1.1: The distribution of the population"
                        - "Graph 1.1: The distribution of the population"

                        "{text}"
                    """,
                },
            ]
            response, cost = openai_call(messages)
            self.cost += cost
            logger.debug(f"LLM cost: {cost}, response: {response}, text: {text}")
            response_dict = json.loads(response)
            return response_dict.get("is_caption", 0) == 1
        except Exception as e:
            logger.error(f"Error in LLM caption detection: {e}")
        return False

    def llm_detect_caption_mentions(self, caption: str) -> Optional[str]:
        """
        Use LLM to detect the mentions of the given caption in the document.

        Args:
            caption (str): The caption text.

        Returns:
            Optional[str]: The unique description of the caption, or None on failure.
        """
        try:
            messages = [
                {
                    "role": "system",
                    "content": """You are an assistant that can detect the unique description
                                  of a caption in a document.
                                """,
                },
                {
                    "role": "user",
                    "content": f"""
                        Please find the unique description of the caption in the document.

                        For example, if the caption is "Plate 1.1: The distribution of the population",
                        the unique description is "Plate 1.1".

                        Given caption:

                        "{caption}"

                        Return the str within the json with the key "uid".
                    """,
                },
            ]
            response, cost = openai_call(messages)
            self.cost += cost
            logger.debug(f"LLM cost: {cost}, response: {response}, caption: {caption}")
            response_dict = json.loads(response)
            return response_dict.get("uid", "")
        except Exception as e:
            logger.error(f"Error in LLM caption mentions detection: {e}")
            logger.exception(e)
        return None

    def llm_extract_triplet(self, text: str) -> List[dict]:
        """
        Extract the triplet from the text
        Args:
            text (str): The text to extract the triplets from


        Returns:
            triplets (List[dict]): The list of triplets extracted from the text
        """
        try:
            messages = [
                {
                    "role": "system",
                    "content": """You are an assistant that can extract the triplets from a given text.
                                """,
                },
                {
                    "role": "user",
                    "content": f"""
                        Please extract the triplets from the following text:

                        "{text}"

                        Return the triplets within the json with the key "triplets".
                        And the triplets should be in the format of a list of dictionaries,
                        each dictionary should have the following keys:
                        - subject
                        - subject_ner_type
                        - predicate
                        - object
                        - object_ner_type

                    """,
                },
            ]
            response, cost = openai_call(messages)
            self.cost += cost
            logger.debug(f"LLM cost: {cost}, response: {response}, text: {text}")
            response_dict = json.loads(response)
            return response_dict.get("triplets", [])
        except Exception as e:
            logger.error(f"Error in LLM triplet extraction: {e}")
        return []

    def llm_page_summary(self, page_content: str) -> str:
        """

        Args:
            page_content (str): The content of the page

        Returns:
            summary (str): The summary of the page

        """
        try:
            messages = [
                {
                    "role": "system",
                    "content": """You are an assistant that can summarize the content of a page.
                                """,
                },
                {
                    "role": "user",
                    "content": f"""
                        Please summarize the content of the page.

                        "{page_content}"

                        Return the summary within the json with the key "summary".
                    """,
                },
            ]
            response, cost = openai_call(messages)
            self.cost += cost
            logger.debug(
                f"LLM cost: {cost}, response: {response}, page_content: {page_content}"
            )
            response_dict = json.loads(response)
            return response_dict.get("summary", "")
        except Exception as e:
            logger.error(f"Error in LLM page summary: {e}")
        return ""

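The rule-based branch of `util_caption_mentions_detect` boils down to a keyword-anchored regex over the lowercased caption. A standalone sketch of that logic (the `CAPTION_KEYWORDS` list here is an assumed example; the real constant is defined elsewhere in Docs2KG):

```python
import re
from typing import Optional

# Assumed keyword list for illustration; the real CAPTION_KEYWORDS constant
# lives elsewhere in Docs2KG.
CAPTION_KEYWORDS = ["figure", "table", "plate", "graph"]

def extract_unique_description(caption: str) -> Optional[str]:
    """Pull the 'figure 1.1'-style anchor out of a caption, mirroring the
    rule-based branch of util_caption_mentions_detect."""
    keyword_pattern = "|".join(CAPTION_KEYWORDS)
    match = re.search(rf"(\b({keyword_pattern}) \d+(\.\d+)*\b)", caption.lower())
    return match.group(1) if match else None

print(extract_unique_description("Figure 1.1: The distribution of the population"))
# -> "figure 1.1" (matching is done on the lowercased caption)
```

Once the anchor is found, `util_mentioned_uuids` simply walks the tree and collects the uuid of every node whose lowercased content contains it.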
__init__(folder_path, llm_enabled=False, input_format='pdf_exported')

Initialize the SemanticKG class Args: folder_path (Path): The path to the pdf file llm_enabled (bool, optional): Whether to use LLM. Defaults to False.

Source code in Docs2KG/kg/semantic_kg.py
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
def __init__(
    self,
    folder_path: Path,
    llm_enabled: bool = False,
    input_format: str = "pdf_exported",
):
    """
    Initialize the SemanticKG class
    Args:
        folder_path (Path): The path to the pdf file
        llm_enabled (bool, optional): Whether to use LLM. Defaults to False.

    """
    self.folder_path = folder_path
    self.llm_enabled = llm_enabled
    self.cost = 0
    logger.info("LLM is enabled" if self.llm_enabled else "LLM is disabled")
    self.kg_folder = self.folder_path / "kg"
    if not self.kg_folder.exists():
        self.kg_folder.mkdir(parents=True, exist_ok=True)

    self.layout_kg_file = self.kg_folder / "layout_kg.json"
    # if layout_kg does not exist, then raise an error
    if not self.layout_kg_file.exists():
        raise FileNotFoundError(f"{self.layout_kg_file} does not exist")
    # load layout_kg
    self.layout_kg = self.load_kg(self.layout_kg_file)
    self.input_format = input_format

add_semantic_kg()

As discussed in the plan, we will add the semantic knowledge graph based on the layout knowledge graph

Returns:

Source code in Docs2KG/kg/semantic_kg.py
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
def add_semantic_kg(self):
    """
    As discussed in the plan, we will add the semantic knowledge graph based on the layout knowledge graph

    Returns:

    """
    # we will start with the image to content
    if self.input_format == "pdf_exported":
        self.semantic_link_image_to_content()
        self.semantic_link_table_to_content()
    if self.input_format != "excel":
        self.semantic_page_summary()
    self.semantic_text2kg()

export_kg()

Export the semantic knowledge graph to a JSON file

Source code in Docs2KG/kg/semantic_kg.py
302
303
304
305
306
307
308
def export_kg(self):
    """
    Export the semantic knowledge graph to a JSON file
    """

    with open(self.layout_kg_file, "w") as f:
        json.dump(self.layout_kg, f, indent=4)

llm_detect_caption(text)

Use LLM to detect whether the given text is a caption for an image or table.

Parameters:

Name Type Description Default
text str

The text to be evaluated.

required

Returns:

Name Type Description
bool bool

True if the text is identified as a caption, False otherwise.

Source code in Docs2KG/kg/semantic_kg.py
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
def llm_detect_caption(self, text: str) -> bool:
    """
    Use LLM to detect whether the given text is a caption for an image or table.

    Args:
        text (str): The text to be evaluated.

    Returns:
        bool: True if the text is identified as a caption, False otherwise.
    """
    try:
        messages = [
            {
                "role": "system",
                "content": """You are a system that help to detect if the text is a caption for an image or table.
                            """,
            },
            {
                "role": "user",
                "content": f"""
                    Is the following text a caption for image or table?
                    You need to detect if the text is a caption for an image or table.

                    Please return the result in JSON format as follows:
                        - {'is_caption': 1} if it is a caption,
                        - or {'is_caption': 0} if it is not a caption.

                    Some examples are captions for images or tables:
                    - "Figure 1.1: The distribution of the population"
                    - "Table 2.1: The distribution of the population"
                    - "Plate 1.1: The distribution of the population"
                    - "Graph 1.1: The distribution of the population"

                    "{text}"
                """,
            },
        ]
        response, cost = openai_call(messages)
        self.cost += cost
        logger.debug(f"LLM cost: {cost}, response: {response}, text: {text}")
        response_dict = json.loads(response)
        return response_dict.get("is_caption", 0) == 1
    except Exception as e:
        logger.error(f"Error in LLM caption detection: {e}")
    return False

llm_detect_caption_mentions(caption)

Use LLM to detect the mentions of the given caption in the document.

Parameters:

Name Type Description Default
caption str

The caption text.

required

Returns:

Type Description
Optional[str]

List[str]: The list of uuids where the caption is mentioned.

Source code in Docs2KG/kg/semantic_kg.py
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
def llm_detect_caption_mentions(self, caption: str) -> Optional[str]:
    """
    Use LLM to detect the mentions of the given caption in the document.

    Args:
        caption (str): The caption text.

    Returns:
        List[str]: The list of uuids where the caption is mentioned.
    """
    try:
        messages = [
            {
                "role": "system",
                "content": """You are an assistant that can detect the unique description
                              of a caption in a document.
                            """,
            },
            {
                "role": "user",
                "content": f"""
                    Please find the unique description of the caption in the document.

                    For example, if the caption is "Plate 1.1: The distribution of the population",
                    the unique description is "Plate 1.1".

                    Given caption:

                    "{caption}"

                    Return the str within the json with the key "uid".
                """,
            },
        ]
        response, cost = openai_call(messages)
        self.cost += cost
        logger.debug(f"LLM cost: {cost}, response: {response}, caption: {caption}")
        response_dict = json.loads(response)
        return response_dict.get("uid", "")
    except Exception as e:
        logger.error(f"Error in LLM caption mentions detection: {e}")
        logger.exception(e)
    return None

llm_extract_triplet(text)

Extract the triplet from the text Args: text (str): The text to extract the triplets from

Returns:

Name Type Description
triplets List[dict]

The list of triplets extracted from the text

Source code in Docs2KG/kg/semantic_kg.py
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
def llm_extract_triplet(self, text) -> List[dict]:
    """
    Extract the triplet from the text
    Args:
        text (str): The text to extract the triplets from


    Returns:
        triplets (List[dict]): The list of triplets extracted from the text
    """
    try:
        messages = [
            {
                "role": "system",
                "content": """You are an assistant that can extract the triplets from a given text.
                            """,
            },
            {
                "role": "user",
                "content": f"""
                    Please extract the triplets from the following text:

                    "{text}"

                    Return the triplets within the json with the key "triplets".
                    And the triplets should be in the format of a list of dictionaries,
                    each dictionary should have the following keys:
                    - subject
                    - subject_ner_type
                    - predicate
                    - object
                    - object_ner_type

                """,
            },
        ]
        response, cost = openai_call(messages)
        self.cost += cost
        logger.debug(f"LLM cost: {cost}, response: {response}, text: {text}")
        response_dict = json.loads(response)
        return response_dict.get("triplets", [])
    except Exception as e:
        logger.error(f"Error in LLM triplet extraction: {e}")
    return []

llm_page_summary(page_content)

Parameters:

Name Type Description Default
page_content str

The content of the page

required

Returns:

Name Type Description
summary str

The summary of the page

Source code in Docs2KG/kg/semantic_kg.py
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
def llm_page_summary(self, page_content: str) -> str:
    """

    Args:
        page_content (str): The content of the page

    Returns:
        summary (str): The summary of the page

    """
    try:
        messages = [
            {
                "role": "system",
                "content": """You are an assistant that can summarize the content of a page.
                            """,
            },
            {
                "role": "user",
                "content": f"""
                    Please summarize the content of the page.

                    "{page_content}"

                    Return the summary as JSON with the key "summary".
                """,
            },
        ]
        response, cost = openai_call(messages)
        self.cost += cost
        logger.debug(
            f"LLM cost: {cost}, response: {response}, page_content: {page_content}"
        )
        response_dict = json.loads(response)
        return response_dict.get("summary", "")
    except Exception as e:
        logger.error(f"Error in LLM page summary: {e}")
    return ""

load_kg(file_path) staticmethod

Load the knowledge graph from JSON

Parameters:

Name Type Description Default
file_path Path

The path to the JSON file

required

Returns:

Name Type Description
dict dict

The knowledge graph

Source code in Docs2KG/kg/semantic_kg.py
@staticmethod
def load_kg(file_path: Path) -> dict:
    """
    Load the knowledge graph from JSON

    Args:
        file_path (Path): The path to the JSON file

    Returns:
        dict: The knowledge graph
    """
    with open(file_path, "r") as f:
        kg = json.load(f)
    return kg
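A minimal usage sketch of the loading step: write a toy `layout_kg.json` to a temporary directory and read it back with the same plain `json.load` the method uses. The file name and toy content here are illustrative; the real file is produced by the layout extraction stage.

```python
import json
import tempfile
from pathlib import Path

# Toy layout KG; the real one comes from the layout extraction step.
toy_kg = {"node_type": "document", "children": []}

with tempfile.TemporaryDirectory() as tmp:
    kg_path = Path(tmp) / "layout_kg.json"
    kg_path.write_text(json.dumps(toy_kg))

    # Equivalent to load_kg(kg_path): plain json.load on the file
    with open(kg_path, "r") as f:
        loaded = json.load(f)
```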

Link the image to the content

  1. Extract the image's caption and its reference label (e.g. "Figure 1.1")
  2. Use the caption or reference label to search the context and link the image to where it is mentioned

Returns:

Source code in Docs2KG/kg/semantic_kg.py
def semantic_link_image_to_content(self):
    """
    Link the image to the content

    1. Extract the image's caption and its reference label (e.g. "Figure 1.1")
    2. Use the caption or reference label to search the context and link the image to where it is mentioned

    Returns:

    """

    # first locate the image caption
    for page in self.layout_kg["children"]:
        # within the page node, the children may include image nodes
        for child in page["children"]:
            if child["node_type"] == "image":
                # child is now an image node
                # if it has no children, skip it
                if "children" not in child or len(child["children"]) == 0:
                    continue
                # logger.info(child)
                for item in child["children"]:
                    # if this is the caption, then we will extract the text
                    text = item["node_properties"]["content"]
                    if self.util_caption_detection(text):
                        logger.info(f"Figure/Caption detected: {text}")
                        # we will use this
                        child["node_properties"]["caption"] = text
                        """
                        Link the caption to where it is mentioned

                        For example, if the caption is "Figure 1.1: The distribution of the population",
                        then we will search the context
                        And found out a place indicate that: as shown in Figure 1.1,
                        the distribution of the population is ...

                        We need to find a way to match it back to the content

                        Current plan of attack: a rule-based approach.

                        If there is a Figure XX, then we will search the context for Figure XX,
                        and link it back to the content
                        Because the content have not been
                        """

                        uuids, caption = self.util_caption_mentions_detect(
                            caption=text
                        )
                        logger.info(f"UUIDs: {uuids}")
                        child["node_properties"]["mentioned_in"] = uuids
                        if caption:
                            child["node_properties"]["unique_description"] = caption
                        continue

    self.export_kg()

Link the table to the content

We do the same caption detection and mention linking as for images, but for tables

Returns:

Source code in Docs2KG/kg/semantic_kg.py
def semantic_link_table_to_content(self):
    """
    Link the table to the content

    We do the same caption detection and mention linking as for images, but for tables

    Returns:

    """
    for page in self.layout_kg["children"]:
        # within the page node, the children may include table nodes
        for child in page["children"]:
            if child["node_type"] == "table_csv":
                # child is now a table node
                # if it has no children, skip it
                if "children" not in child or len(child["children"]) == 0:
                    continue
                # logger.info(child)
                for item in child["children"]:
                    # if this is the caption, then we will extract the text
                    text = item["node_properties"]["content"]
                    if self.util_caption_detection(text):
                        logger.info(f"Table/Caption detected: {text}")
                        # we will use this
                        child["node_properties"]["caption"] = text
                        uuids, caption = self.util_caption_mentions_detect(
                            caption=text
                        )
                        logger.info(f"UUIDs: {uuids}")
                        child["node_properties"]["mentioned_in"] = uuids
                        if caption:
                            child["node_properties"]["unique_description"] = caption
                        continue
    self.export_kg()

semantic_page_summary()

Summarize the page, which gives a better understanding of it.

Not sure whether this will enhance RAG or not.

But it will make the page easier for humans to understand.

When summarizing, we may also need to supply the previous and later pages' information

Returns:

Source code in Docs2KG/kg/semantic_kg.py
def semantic_page_summary(self):
    """
    Summarize the page, which gives a better understanding of it.

    Not sure whether this will enhance RAG or not.

    But it will make the page easier for humans to understand.

    When summarizing, we may also need to supply the previous and later pages' information

    Returns:

    """
    for page in self.layout_kg["children"]:
        if page["node_type"] == "page":
            page_content = page["node_properties"]["page_text"]
            logger.debug(page_content)
            summary = self.llm_page_summary(page_content)
            page["node_properties"]["summary"] = summary

    self.export_kg()

semantic_text2kg()

General Goal of this:
  • A list of triplets: (subject, predicate, object)
  • Triplets will be associated with the tree
  • Frequent subjects will be merged and linked

Plan of attack:

  1. We need to do the Named Entity Recognition for each sentence
  2. Do NER coexist relationship
  3. Last step will be extracting the semantic NER vs NER relationship

How to construct the relation?

  • We will grab the entities mapping to text uuid { "ner_type": { "entities": [uuid1, uuid2] } }
Source code in Docs2KG/kg/semantic_kg.py
def semantic_text2kg(self):
    """
    ## General Goal of this:

    - A list of triplets: (subject, predicate, object)
    - Triplets will be associated with the tree
    - Frequent subjects will be merged and linked

    Plan of attack:

    1. We need to do the Named Entity Recognition for each sentence
    2. Do NER coexist relationship
    3. Last step will be extracting the semantic NER vs NER relationship

    How to construct the relation?

    - We will grab the entities mapping to text uuid
    {
       "ner_type": {
        "entities": [uuid1, uuid2]
       }
    }

    """
    if self.llm_enabled:
        current_cost = self.cost
        # do the triple extraction
        # self.semantic_triplet_extraction(self.layout_kg)
        logger.info("Start the semantic text2kg extraction")
        nodes = self.semantic_triplet_extraction(self.layout_kg, [])
        # use tqdm to show the progress
        for node in tqdm(nodes, desc="Extracting triplets"):
            # extract the triplets from the text
            text = node["node_properties"]["content"]
            logger.debug(text)
            if text == "":
                continue
            triplets = self.llm_extract_triplet(text)
            node["node_properties"]["text2kg"] = triplets
        self.export_kg()
        logger.info(f"LLM cost: {self.cost - current_cost}")
    else:
        # Hard to do this without LLM
        logger.info("LLM is not enabled, skip the semantic text2kg extraction")

semantic_triplet_extraction(node, nodes)

Extract triplets from the text

It will update the node with the Text2KG field, adding a list of triplets.

Parameters:

Name Type Description Default
node dict

The node in the layout knowledge graph

required
nodes List[dict]

The list of nodes collected so far

required

Returns:

Source code in Docs2KG/kg/semantic_kg.py
def semantic_triplet_extraction(self, node: dict, nodes: List[dict]):
    """
    Extract triplets from the text

    It will update the node with the Text2KG field, add a list of triplets
    Args:
        node (dict): The node in the layout knowledge graph
        nodes (List[dict]): The list of nodes

    Returns:

    """
    for child in node["children"]:
        if "children" in child:
            nodes = self.semantic_triplet_extraction(child, nodes)
        content = child["node_properties"].get("content", "")
        if not content:
            continue
        nodes.append(child)
    return nodes
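A standalone sketch of this recursive collection on a toy tree. The node shapes below are simplified stand-ins for the layout KG structure, not the exact schema.

```python
# Depth-first walk that gathers every node carrying non-empty "content",
# so the caller can batch them for triplet extraction.
def collect_text_nodes(node: dict, nodes: list) -> list:
    for child in node.get("children", []):
        if "children" in child:
            nodes = collect_text_nodes(child, nodes)
        # skip nodes without text content, as the method above does
        if child.get("node_properties", {}).get("content", ""):
            nodes.append(child)
    return nodes

# Toy layout KG: one text node at the top level, one nested under an
# empty-content container node.
toy_kg = {
    "children": [
        {"node_properties": {"content": "Perth is in Western Australia."},
         "children": []},
        {"node_properties": {"content": ""},
         "children": [
             {"node_properties": {"content": "It lies on the Swan River."}}
         ]},
    ]
}

nodes = collect_text_nodes(toy_kg, [])
```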

util_caption_detection(text)

Given a text, detect whether it is a caption for an image or a table

If LLM is enabled, we would use the LLM to detect the caption; if not, we use keyword matching. Currently the LLM does not perform well, so keyword matching is used.

Returns:

Source code in Docs2KG/kg/semantic_kg.py
def util_caption_detection(self, text: str) -> bool:  # noqa
    """
    Given a text, detect whether it is a caption for an image or a table

    If LLM is enabled, we would use the LLM to detect the caption
    If it is not enabled, we use keyword matching
        - Currently the LLM does not perform well, so keyword matching is used

    Returns:

    """
    for keyword in CAPTION_KEYWORDS:
        if keyword in text.lower():
            return True
    # if self.llm_enabled:
    #     return self.llm_detect_caption(text)
    return False
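The keyword fallback can be sketched in isolation. Note that `CAPTION_KEYWORDS` below is a guessed subset for illustration; the real constant lives in Docs2KG.

```python
# Guessed subset of CAPTION_KEYWORDS; the real list lives in Docs2KG.
CAPTION_KEYWORDS = ["figure", "fig.", "table", "plate"]

def caption_detection(text: str) -> bool:
    """True if any caption keyword appears in the lowered text."""
    return any(keyword in text.lower() for keyword in CAPTION_KEYWORDS)
```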

util_caption_mentions_detect(caption)

First we need to find the unique description for the caption.

For example: Plate 1.1: The distribution of the population

Plate 1.1 is the unique description

We will need to search the whole document to find the reference point

Parameters:

Name Type Description Default
caption str

The caption text

required

Returns:

Name Type Description
uuids List[str]

The list of uuids where the caption is mentioned

unique_description str

The unique description extracted from the caption

Source code in Docs2KG/kg/semantic_kg.py
def util_caption_mentions_detect(self, caption: str) -> Tuple[List[str], str]:
    """

    First we need to find the unique description for the caption.

    For example: Plate 1.1: The distribution of the population

    Plate 1.1 is the unique description

    We will need to search the whole document to find the reference point

    Args:
        caption (str): The caption text


    Returns:
        uuids (List[str]): The list of uuids where the caption is mentioned
        unique_description (str): The unique description extracted from the caption

    """
    # first extract the unique description
    # Extract the unique description from the caption
    keyword_pattern = "|".join(CAPTION_KEYWORDS)
    match = re.search(rf"(\b({keyword_pattern}) \d+(\.\d+)*\b)", caption.lower())
    unique_description = None
    if match:
        unique_description = match.group(1)
    else:
        if self.llm_enabled:
            """
            Try to use LLM to do this work
            """
            unique_description = self.llm_detect_caption_mentions(caption)
            logger.info(f"Unique description: {unique_description}")

    if not unique_description:
        # return an empty result matching the Tuple[List[str], str] signature
        return [], ""
    logger.info(f"Unique description: {unique_description}")
    mentioned_uuids = []
    # search the context
    mentioned_uuids = self.util_mentioned_uuids(
        self.layout_kg, unique_description, mentioned_uuids
    )
    return mentioned_uuids, unique_description
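The rule-based extraction above can be reproduced standalone (again with a guessed `CAPTION_KEYWORDS` subset): the regex matches a keyword followed by a dotted number, e.g. "Plate 1.1".

```python
import re

# Guessed subset of CAPTION_KEYWORDS; the real list lives in Docs2KG.
CAPTION_KEYWORDS = ["figure", "table", "plate"]

def extract_unique_description(caption: str):
    """Return e.g. 'plate 1.1' from 'Plate 1.1: ...', or None if no match."""
    keyword_pattern = "|".join(CAPTION_KEYWORDS)
    # "<keyword> <digits[.digits]...>" on the lowered caption
    match = re.search(rf"(\b({keyword_pattern}) \d+(\.\d+)*\b)", caption.lower())
    return match.group(1) if match else None
```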

util_mentioned_uuids(node, unique_description, uuids)

Search the context for the unique description

Parameters:

Name Type Description Default
node dict

The node in the layout knowledge graph

required
unique_description str

The unique description extracted from the caption

required
uuids List[str]

The list of uuids where the unique description is mentioned

required

Returns:

Name Type Description
uuids List[str]

The list of uuids where the unique description is mentioned

Source code in Docs2KG/kg/semantic_kg.py
def util_mentioned_uuids(
    self, node: dict, unique_description: str, uuids: List[str]
) -> List[str]:
    """
    Search the context for the unique description

    Args:
        node (dict): The node in the layout knowledge graph
        unique_description (str): The unique description extracted from the caption
        uuids (List[str]): The list of uuids where the unique description is mentioned

    Returns:
        uuids (List[str]): The list of uuids where the unique description is mentioned
    """
    for child in node["children"]:
        if "node_properties" in child:
            if "content" in child["node_properties"]:
                if (
                    unique_description
                    in child["node_properties"]["content"].lower()
                ):
                    uuids.append(child["uuid"])
        if "children" in child:
            uuids = self.util_mentioned_uuids(child, unique_description, uuids)
    return uuids
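A standalone sketch of this recursive search on a simplified toy tree (the node shapes are illustrative stand-ins for the layout KG schema):

```python
# Walk the tree and collect uuids of nodes whose content mentions the
# unique description (compared case-insensitively, as above).
def mentioned_uuids(node: dict, unique_description: str, uuids: list) -> list:
    for child in node.get("children", []):
        content = child.get("node_properties", {}).get("content", "")
        if unique_description in content.lower():
            uuids.append(child["uuid"])
        uuids = mentioned_uuids(child, unique_description, uuids)
    return uuids

toy_kg = {
    "children": [
        {
            "uuid": "a",
            "node_properties": {"content": "As shown in Figure 1.1, the trend holds."},
            "children": [
                {
                    "uuid": "b",
                    "node_properties": {"content": "Unrelated text."},
                    "children": [],
                }
            ],
        }
    ]
}
```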