How to Extract High-Value Knowledge Graph Relationships / 如何萃取高價值的知識圖譜關係¶
Author: QuarkAndCode
Published:
Source: https://medium.com/@QuarkAndCode/how-to-extract-high-value-knowledge-graph-relationships-36bab532d6f7
Fetched: 2026-06-07T02:00:59.267975
Knowledge graphs are most effective when they connect facts rather than just store them. Their real value comes from relationships, such as who owns what, which product depends on which component, which disease is linked to which symptom, which author wrote which paper, which customer is tied to which account, or which document supports a business decision.
知識圖譜 (Knowledge Graph) 在「連結」事實而非僅僅「儲存」事實時最為有效。它真正的價值來自於關係 (relationships),例如誰擁有什麼、哪個產品依賴哪個元件、哪種疾病與哪種症狀相關、哪位作者撰寫了哪篇論文、哪位顧客與哪個帳戶相連,或是哪份文件支持某項商業決策。
A knowledge graph relationship is usually represented as a connection between two entities. In RDF, this is commonly expressed as a subject–predicate–object triple, where the predicate describes the relationship between the subject and object. The W3C RDF specification describes this as a statement that a relationship indicated by the predicate holds between two resources.
知識圖譜關係通常被表示為兩個實體 (entities) 之間的連結。在 RDF(資源描述框架,Resource Description Framework)中,這通常以「主詞—謂詞—受詞」三元組 (subject–predicate–object triple) 來表達,其中謂詞 (predicate) 描述了主詞與受詞之間的關係。W3C 的 RDF 規範將其描述為一則陳述:表示由謂詞所指示的關係成立於兩個資源之間。
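As a minimal sketch of the triple idea, a statement can be modeled as a three-field tuple. The `Triple` class and the `ex:` identifiers are illustrative assumptions, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str  # named "obj" to avoid confusion with Python's built-in object


# The predicate names the relationship that holds between the two resources.
triple = Triple("ex:ProductA", "ex:usesComponent", "ex:BatteryB")

# A tiny graph can be held as a set of triples; a set drops exact duplicates.
graph = {triple, Triple("ex:ProductA", "ex:usesComponent", "ex:BatteryB")}
```

Holding the graph as a set means the same statement extracted twice is stored once, which is the simplest form of deduplication.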
Not every relationship you extract is worth keeping. If a graph is full of weak, duplicate, vague, or unsupported relationships, it quickly becomes cluttered. The aim is not to collect as many relationships as possible, but to focus on those that are accurate, useful, explainable, and valuable for search, analytics, recommendations, automation, or decision-making.
並非每一條你萃取出的關係都值得保留。如果一個圖譜充滿薄弱、重複、模糊或缺乏佐證的關係,它很快就會變得雜亂無章。目標並不是盡可能蒐集最多的關係,而是聚焦於那些準確、有用、可解釋,且對搜尋、分析、推薦、自動化或決策具有價值的關係。
What Makes a Knowledge Graph Relationship "High-Value"? / 是什麼讓知識圖譜關係具有「高價值」?¶
A high-value relationship is a connection that improves the usefulness of the graph. It helps answer important questions, supports business workflows, reveals hidden patterns, or improves the accuracy of downstream systems.
高價值關係是一種能提升圖譜實用性的連結。它有助於回答重要問題、支援商業工作流程、揭示隱藏的模式,或提升下游系統的準確度。
For example, the relationship:
舉例來說,以下這條關係:
Product A — usesComponent — Battery B
產品 A — 使用元件 (usesComponent) — 電池 B
is more useful than:
比下面這條更有用:
Product A — mentionedWith — Battery B
產品 A — 一同被提及 (mentionedWith) — 電池 B
The first relationship gives us specific, actionable information. It can help with supply-chain analysis, compatibility checks, warranty decisions, product recommendations, and risk alerts. The second relationship shows only that the two entities appeared together, which might not be significant.
第一條關係提供了具體、可操作的資訊。它可以協助供應鏈分析、相容性檢查、保固決策、產品推薦與風險警示。第二條關係只顯示這兩個實體曾一同出現,這可能並不具有意義。
High-value relationships usually have five qualities:
高價值關係通常具備五項特質:
They are specific. "Founded by," "manufactured in," "depends on," "approved by," and "compatible with" are stronger than vague predicates such as "related to" or "associated with."
它們是具體的。「由……創立」、「製造於」、「依賴於」、「由……核准」與「相容於」這類謂詞,比起「與……相關」或「與……關聯」這類模糊的謂詞更為有力。
They are relevant to a real use case. A relationship that helps answer a customer, operational, compliance, research, or SEO question is more valuable than one that merely adds volume.
它們與真實的使用情境相關。一條能協助回答顧客、營運、法規遵循、研究或 SEO(搜尋引擎最佳化,Search Engine Optimization)問題的關係,比起僅僅增加數量的關係更有價值。
They are verifiable. A strong relationship should be traceable to a source such as a document, database row, API response, contract, scientific paper, or trusted web page. Provenance matters because users need to know where a fact came from.
它們是可驗證的。一條有力的關係應能追溯至某個來源,例如文件、資料庫資料列、API 回應、合約、科學論文或可信賴的網頁。來源出處 (provenance) 很重要,因為使用者需要知道一項事實是從何而來。
They are reusable. A good relationship can support multiple queries, workflows, or applications.
它們是可重複使用的。一條好的關係能支援多種查詢、工作流程或應用。
They are maintained. Relationships can become stale. Job titles change, companies merge, prices shift, policies update, and scientific understanding evolves. A high-value graph needs a process for refreshing and correcting relationships over time.
它們是被持續維護的。關係可能會過時。職稱會變動、公司會合併、價格會波動、政策會更新,科學認知也會演進。一個高價值的圖譜需要一套流程,以隨時間更新與修正關係。
Start With the Questions the Graph Must Answer / 從圖譜必須回答的問題開始¶
The best relationship extraction projects begin with questions, not tools. These are often called competency questions in ontology and knowledge graph design. They define what the graph should be able to answer.
最好的關係萃取專案始於問題,而非工具。在本體論 (ontology) 與知識圖譜設計中,這些問題常被稱為能力問題 (competency questions)。它們定義了圖譜應該能夠回答什麼。
Examples include:
範例包括:
This step prevents the graph from becoming a dumping ground. If a relationship does not help answer a meaningful question, it may not be worth extracting.
這個步驟能防止圖譜淪為垃圾傾倒場。如果一條關係無助於回答任何有意義的問題,那麼它可能就不值得被萃取。
Industry knowledge graphs are used in search, product understanding, social networks, question answering, and enterprise discovery. A well-known industry paper from Google Research describes knowledge graphs as structured factual knowledge used by major companies to power intelligent products and search experiences.
產業界的知識圖譜被應用於搜尋、產品理解、社群網路、問答 (question answering) 與企業探索。一篇來自 Google Research 的知名產業論文將知識圖譜描述為一種結構化的事實知識,被各大公司用來驅動智慧型產品與搜尋體驗。
Build a Relationship Schema Before Extracting at Scale / 在大規模萃取之前先建立關係綱要¶
A relationship schema defines the types of connections your graph should contain. Without a schema, extraction systems may produce many variations of the same idea:
關係綱要 (relationship schema) 定義了你的圖譜應包含哪些類型的連結。若沒有綱要,萃取系統可能會針對同一個概念產生許多變體:
· "works for"
· 「為……工作 (works for)」
· "is employed by"
· 「受僱於 (is employed by)」
· "employee at"
· 「任職於 (employee at)」
· "staff member of"
· 「為……的員工 (staff member of)」
· "joined the company"
· 「加入該公司 (joined the company)」
· "has employer"
· 「擁有雇主 (has employer)」
These may all need to map to one canonical relationship, such as:
這些可能全都需要對應到一條標準化 (canonical) 的關係,例如:
Person — worksFor — Organization
人物 — 任職於 (worksFor) — 組織
A schema does not need to be perfect at the start. It can evolve. But it should include the core relationship types, entity types, allowed directions, expected data sources, and validation rules.
綱要在一開始不需要完美。它可以逐步演進。但它應包含核心的關係類型、實體類型、允許的方向、預期的資料來源,以及驗證規則。
A simple relationship definition might look like this:
一個簡單的關係定義可能長這樣:
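As a hedged sketch in Python, a definition can record the canonical predicate, the entity types allowed on each side, the expected sources, and the context a fact must carry. The field names, the `hr_database` and `company_website` source labels, and the `conforms` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelationshipType:
    name: str                          # canonical predicate
    subject_type: str                  # entity type allowed on the subject side
    object_type: str                   # entity type allowed on the object side
    expected_sources: Tuple[str, ...]  # where such facts are expected to come from
    required_context: Tuple[str, ...] = ()


WORKS_FOR = RelationshipType(
    name="worksFor",
    subject_type="Person",
    object_type="Organization",
    expected_sources=("hr_database", "company_website"),  # hypothetical labels
    required_context=("start_date",),
)


def conforms(rel_type, subject_type, object_type):
    """Check entity types and direction against the definition."""
    return (subject_type, object_type) == (rel_type.subject_type, rel_type.object_type)
```

Because subject and object types are stored separately, a reversed triple (Organization worksFor Person) fails the check instead of slipping into the graph.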
For semantic web and structured data projects, Schema.org can also help standardize public-facing descriptions of entities and relationships. Schema.org states that its vocabularies cover entities, relationships between entities, and actions, and that they can be expressed through formats such as RDFa, Microdata, and JSON-LD.
對於語意網 (semantic web) 與結構化資料 (structured data) 專案而言,Schema.org 也能協助標準化對外公開的實體與關係描述。Schema.org 指出其詞彙涵蓋了實體、實體之間的關係以及動作 (actions),並且這些可透過 RDFa、Microdata 與 JSON-LD 等格式來表達。
Choose the Right Sources for Relationship Extraction / 為關係萃取選擇正確的來源¶
High-value relationships usually come from high-quality sources. Choosing the right sources is one of the most important steps in building a knowledge graph.
高價值關係通常來自高品質的來源。選擇正確的來源是建構知識圖譜中最重要的步驟之一。
Structured data sources are often the easiest to process. These include relational databases, CRM records, product catalogs, spreadsheets, APIs, data warehouses, and event logs. They usually contain explicit relationships such as customer-to-account, product-to-category, or employee-to-department.
結構化資料來源通常是最容易處理的。這些包括關聯式資料庫、CRM(客戶關係管理,Customer Relationship Management)紀錄、產品目錄、試算表、API、資料倉儲 (data warehouses) 與事件日誌 (event logs)。它們通常包含明確的關係,例如顧客對帳戶、產品對類別,或員工對部門。
Semi-structured sources include HTML pages, JSON files, XML documents, metadata, tables, and structured web markup. These often have clear entity attributes and relationships, but may need cleaning and normalization.
半結構化來源包括 HTML 頁面、JSON 檔案、XML 文件、詮釋資料 (metadata)、表格,以及結構化的網頁標記。這些來源通常具有清楚的實體屬性與關係,但可能需要清理與正規化 (normalization)。
Unstructured sources include PDFs, reports, emails, contracts, manuals, research papers, news articles, transcripts, and support tickets. These are harder to process because relationships are expressed in natural language. However, they often contain the richest and most valuable knowledge.
非結構化來源包括 PDF、報告、電子郵件、合約、手冊、研究論文、新聞文章、逐字稿與客服工單。這些較難處理,因為關係是以自然語言 (natural language) 表達的。然而,它們往往蘊含最豐富、最有價值的知識。
For SEO and public web visibility, structured data is especially important. Google Search Central explains that structured data provides Google with explicit clues about a page's meaning and can help it understand information about people, books, companies, recipes, and other entities on the web. Google also notes that JSON-LD is generally recommended when site setup allows it.
對於 SEO 與公開網路的能見度而言,結構化資料格外重要。Google Search Central 解釋,結構化資料能為 Google 提供關於頁面意義的明確線索,並能協助其理解網路上關於人物、書籍、公司、食譜與其他實體的資訊。Google 也指出,在網站設置允許的情況下,通常建議使用 JSON-LD。
Extract Entities Before Extracting Relationships / 在萃取關係之前先萃取實體¶
Relationship extraction depends on entity extraction. A system must first recognize the things being connected before it can understand how they are connected.
關係萃取仰賴於實體萃取 (entity extraction)。系統必須先辨識出被連結的事物,才能理解它們是如何被連結的。
After identifying entities, the system should normalize them. "IBM," "International Business Machines," and "IBM Corp." may refer to the same organization. "New York," "NYC," and "New York City" may need to be consolidated into a single location entity, depending on the domain.
在辨識出實體之後,系統應將其正規化。「IBM」、「International Business Machines」與「IBM Corp.」可能指的是同一個組織。「New York」、「NYC」與「New York City」可能需要整合成單一的地點實體,視領域而定。
This step is often called entity linking or entity resolution. It is essential because duplicate entities create duplicate relationships. A graph with three versions of the same company will yield fragmented, misleading results.
這個步驟常被稱為實體連結 (entity linking) 或實體解析 (entity resolution)。它至關重要,因為重複的實體會產生重複的關係。一個圖譜若存在同一家公司的三種版本,將會產生破碎、具誤導性的結果。
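The simplest form of entity resolution is an alias lookup, sketched below. The alias table and the `org:` and `loc:` identifiers are hypothetical; real systems usually add fuzzy matching and contextual disambiguation on top of a lookup like this.

```python
# Hypothetical alias table; in practice this would come from a curated registry.
ALIASES = {
    "ibm": "org:ibm",
    "international business machines": "org:ibm",
    "ibm corp.": "org:ibm",
    "new york": "loc:new-york-city",
    "nyc": "loc:new-york-city",
    "new york city": "loc:new-york-city",
}


def resolve(mention):
    """Return the canonical ID for a mention, or None if it is unknown."""
    key = " ".join(mention.lower().split())  # lowercase and collapse whitespace
    return ALIASES.get(key)


mentions = ["IBM", "International Business Machines", "IBM Corp.", "Acme"]
resolved = {m: resolve(m) for m in mentions}
```

Unknown mentions return `None` rather than a guessed ID, so they can be queued for review instead of silently creating a new duplicate entity.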
Use Multiple Extraction Methods, Not Just One / 使用多種萃取方法,而非僅靠一種¶
There is no one best way to extract knowledge graph relationships. The most effective pipelines usually combine rules, machine learning, language models, human review, and graph validation.
並不存在唯一最佳的方法來萃取知識圖譜關係。最有效的處理流程 (pipelines) 通常會結合規則、機器學習 (machine learning)、語言模型、人工審查與圖譜驗證。
1. Rule-Based Relationship Extraction / 1. 基於規則的關係萃取¶
Rule-based extraction works well when the language is predictable or the data is structured. For example:
當語言可預測或資料具結構性時,基於規則 (rule-based) 的萃取效果良好。例如:
· "X was founded by Y."
· 「X 由 Y 所創立。」
· "X is headquartered in Y."
· 「X 的總部位於 Y。」
· "X requires Y."
· 「X 需要 Y。」
· "X is compatible with Y."
· 「X 與 Y 相容。」
· "X reports to Y."
· 「X 向 Y 匯報。」
Rules can be created with regular expressions, dependency parsing, database mappings, or template-based patterns. They are transparent and easy to audit, but they can miss relationships expressed in unexpected ways.
規則可以透過正規表達式 (regular expressions)、依存句法剖析 (dependency parsing)、資料庫對應,或基於樣板的模式來建立。它們透明且容易稽核,但可能會遺漏以非預期方式表達的關係。
Rule-based methods are useful for legal clauses, product manuals, financial filings, policy documents, and other domains where language follows repeatable patterns.
基於規則的方法適用於法律條款、產品手冊、財務申報文件、政策文件,以及其他語言遵循可重複模式的領域。
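A few of the templates above can be sketched as regular expressions. The `NAME` pattern (runs of capitalized words) and the predicate names are simplifying assumptions; production rules would normally run on top of a proper entity recognizer.

```python
import re

# Entity names are matched loosely as runs of capitalized words.
NAME = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"

# Each rule pairs a sentence template with a canonical predicate.
RULES = [
    (re.compile(rf"({NAME}) was founded by ({NAME})"), "foundedBy"),
    (re.compile(rf"({NAME}) is headquartered in ({NAME})"), "headquarteredIn"),
    (re.compile(rf"({NAME}) reports to ({NAME})"), "reportsTo"),
]


def extract(sentence):
    """Return (subject, predicate, object) triples matched by any rule."""
    triples = []
    for pattern, predicate in RULES:
        for subj, obj in pattern.findall(sentence):
            triples.append((subj, predicate, obj))
    return triples


found = extract("Acme Corp was founded by Jane Doe. Acme Corp is headquartered in Berlin.")
```

Each rule is a single readable line, which is what makes this approach easy to audit. A sentence phrased differently, such as "Jane Doe started Acme Corp", matches none of the rules.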
2. Open Information Extraction / 2. 開放式資訊萃取¶
Open Information Extraction, often called OpenIE, extracts relation triples from text without requiring every relationship type to be defined in advance. The original OpenIE research introduced a scalable approach for extracting large numbers of relational tuples from web text without manually specifying every target relation first.
開放式資訊萃取 (Open Information Extraction,常稱為 OpenIE) 能從文字中萃取關係三元組,而不需要事先定義每一種關係類型。最初的 OpenIE 研究提出了一種可擴展的方法,能從網路文字中萃取大量的關係元組 (relational tuples),而無需先手動指定每一個目標關係。
Stanford CoreNLP's OpenIE annotator, for example, extracts open-domain triples consisting of a subject, relation, and object. Its documentation describes triples such as a person being born in a place and notes that OpenIE can be useful when there is limited training data.
舉例來說,Stanford CoreNLP 的 OpenIE 標註器會萃取出由主詞、關係與受詞組成的開放領域 (open-domain) 三元組。其文件描述了諸如「某人出生於某地」這類三元組,並指出當訓練資料有限時,OpenIE 可能會很有用。
OpenIE is useful for discovery. It helps find possible relationships you might not have included in your original schema. However, OpenIE results often need cleaning because natural-language predicates can be inconsistent, wordy, or repetitive.
OpenIE 對於探索很有用。它能協助找出你原本綱要中可能未納入的潛在關係。然而,OpenIE 的結果往往需要清理,因為自然語言的謂詞可能不一致、冗長或重複。
For example, OpenIE might extract:
舉例來說,OpenIE 可能會萃取出:
"Battery B" — "is required for operation of" — "Product A"
「電池 B」 — 「為運作所必需 (is required for operation of)」 — 「產品 A」
A knowledge graph pipeline may need to normalize that into:
知識圖譜處理流程可能需要將其正規化為:
Product A — requiresComponent — Battery B
產品 A — 需要元件 (requiresComponent) — 電池 B
3. Supervised Relation Extraction / 3. 監督式關係萃取¶
Supervised relation extraction uses labeled examples to train a model. For instance, humans may label sentences that express relationships such as:
監督式關係萃取 (supervised relation extraction) 使用已標註的範例來訓練模型。舉例來說,人類可能會標註那些表達以下關係的句子:
· acquiredBy
· 被收購 (acquiredBy)
· locatedIn
· 位於 (locatedIn)
· treats
· 治療 (treats)
· causes
· 導致 (causes)
· authorOf
· 為……作者 (authorOf)
· manufacturedBy
· 由……製造 (manufacturedBy)
· partOf
· 為……一部分 (partOf)
The model learns how those relationships appear in text and then predicts them in new documents.
模型學習這些關係在文字中如何出現,接著在新的文件中預測它們。
This approach can be accurate if you have enough labeled data. The downside is cost, since creating high-quality labeled examples takes time and domain knowledge.
如果你擁有足夠的標註資料,這種方法可以很準確。其缺點是成本,因為建立高品質的標註範例需要時間與領域知識。
4. Distant Supervision / 4. 遠程監督¶
Distant supervision reduces the need for manual labeling by automatically generating training examples from an existing knowledge base. The classic distant supervision paper by Mintz, Bills, Snow, and Jurafsky used Freebase relations to train relation extractors from unlabeled text, offering an alternative to hand-labeled corpora.
遠程監督 (Distant Supervision) 透過從既有的知識庫自動產生訓練範例,來減少人工標註的需求。由 Mintz、Bills、Snow 與 Jurafsky 所撰寫的經典遠程監督論文,使用 Freebase 的關係來從未標註的文字中訓練關係萃取器,提供了一種替代人工標註語料庫的方案。
This method is helpful if you already have a partial knowledge graph or a trusted database. It can help you scale up relation extraction, but it may also introduce noisy labels. The process should still include confidence scoring and validation.
如果你已經擁有部分的知識圖譜或可信賴的資料庫,這種方法會很有幫助。它能協助你擴大關係萃取的規模,但也可能引入帶有雜訊的標籤。此流程仍應包含信賴度評分 (confidence scoring) 與驗證。
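The labeling heuristic can be sketched in a few lines: any sentence that mentions both entities of a known fact is treated as a training example for that relation. The knowledge base and sentences below are invented for illustration, and plain substring matching stands in for real entity linking.

```python
# Known facts from an existing knowledge base (hypothetical sample).
KB = {
    ("Acme Corp", "Berlin"): "headquarteredIn",
    ("Jane Doe", "Acme Corp"): "worksFor",
}

sentences = [
    "Acme Corp opened a new office near its Berlin campus.",
    "Jane Doe joined Acme Corp as an engineer.",
    "Jane Doe visited Berlin last week.",
]


def label_sentences(sentences, kb):
    """Label a sentence with a KB relation when both entities appear in it."""
    examples = []
    for sentence in sentences:
        for (subj, obj), relation in kb.items():
            if subj in sentence and obj in sentence:
                examples.append((sentence, subj, obj, relation))
    return examples


training_examples = label_sentences(sentences, KB)
```

The first sentence is labeled `headquarteredIn` although it never says where the headquarters is. That is the kind of noisy label the approach tends to produce, and the reason confidence scoring and validation are still needed.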
5. LLM-Assisted Relationship Extraction / 5. 大型語言模型輔助的關係萃取¶
Large language models can help extract relationships from complex text, especially when the relationships are implied rather than expressed in a simple pattern. They can also help rewrite messy natural-language relations into canonical graph predicates.
大型語言模型 (Large Language Models, LLMs) 能協助從複雜文字中萃取關係,尤其是當關係是隱含的、而非以簡單模式表達時。它們也能協助將雜亂的自然語言關係改寫為標準化的圖譜謂詞。
For example, a document may say:
舉例來說,一份文件可能會寫道:
"Product A cannot operate unless Firmware B has been installed."
「除非已安裝韌體 B,否則產品 A 無法運作。」
A language model can help infer:
語言模型可以協助推論出:
Product A — requiresSoftware — Firmware B
產品 A — 需要軟體 (requiresSoftware) — 韌體 B
LLMs are especially useful for candidate generation, schema mapping, summarizing evidence, and handling varied language. However, they should not be treated as the final source of truth. For high-value relationships, every extraction should be grounded in source text, checked against schema rules, and assigned a confidence score.
LLM 在候選生成 (candidate generation)、綱要對應、摘要證據與處理多樣化語言方面格外有用。然而,它們不應被視為最終的真實依據。對於高價值關係,每一次萃取都應以原始文字為依據、依綱要規則加以檢核,並賦予一個信賴度分數。
A practical LLM-assisted pipeline should require the model to return:
一個實用的 LLM 輔助處理流程,應要求模型回傳:
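One plausible set of fields is the triple itself, a supporting quote, and a confidence score. The sketch below assumes those field names and a small predicate whitelist, none of which is prescribed by any particular model API, and checks a returned candidate before it is accepted. No model is called; `candidate` stands in for parsed model output.

```python
ALLOWED_PREDICATES = {"requiresSoftware", "requiresComponent", "worksFor"}
REQUIRED_FIELDS = {"subject", "predicate", "object", "evidence", "confidence"}


def check_candidate(candidate, source_text):
    """Return a list of problems; an empty list means the candidate passes."""
    missing = REQUIRED_FIELDS - set(candidate)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if candidate["predicate"] not in ALLOWED_PREDICATES:
        problems.append("predicate not in schema")
    if candidate["evidence"] not in source_text:
        problems.append("evidence is not a verbatim quote from the source")
    if not 0.0 <= candidate["confidence"] <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems


source = "Product A cannot operate unless Firmware B has been installed."
candidate = {
    "subject": "Product A",
    "predicate": "requiresSoftware",
    "object": "Firmware B",
    "evidence": "Product A cannot operate unless Firmware B has been installed.",
    "confidence": 0.9,
}
```

Requiring the evidence to be a verbatim substring of the source is a deliberately strict grounding check: a paraphrased or invented quote fails it.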
This keeps the relationship easier to explain and review.
這能讓關係更容易解釋與審查。
Normalize Relationship Predicates / 正規化關係謂詞¶
Raw extraction often produces many relationship phrases that mean the same thing. Normalization converts them into a controlled vocabulary.
原始的萃取常會產生許多意義相同的關係用語。正規化會將它們轉換為一套受控詞彙 (controlled vocabulary)。
For example:
舉例來說:
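As a minimal sketch, a lookup table can map raw phrases to canonical predicates. The table contents and the `swap` flag are assumptions for illustration. The flag marks phrases that run in the opposite direction to the canonical predicate, so that subject and object are exchanged during normalization.

```python
# Hypothetical mapping: raw phrase -> (canonical predicate, swap arguments?).
CANONICAL = {
    "works for": ("worksFor", False),
    "is employed by": ("worksFor", False),
    "employee at": ("worksFor", False),
    "has employee": ("worksFor", True),
    "is required for operation of": ("requiresComponent", True),
    "owns": ("owns", False),
    "is owned by": ("owns", True),
}


def normalize(subject, raw_predicate, obj):
    """Map a raw triple onto the controlled vocabulary, or return None."""
    entry = CANONICAL.get(raw_predicate.strip().lower())
    if entry is None:
        return None  # unknown phrase: keep as a candidate for schema review
    predicate, swap = entry
    return (obj, predicate, subject) if swap else (subject, predicate, obj)
```

With this table, `normalize("Battery B", "is required for operation of", "Product A")` yields the triple Product A, requiresComponent, Battery B.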
This step improves query quality. Instead of searching across dozens of near-duplicate predicates, users and applications can query one canonical relationship.
這個步驟能提升查詢品質。使用者與應用程式不必在數十個近乎重複的謂詞間搜尋,而可以查詢單一條標準化的關係。
Normalization should also preserve direction. "Company A owns Company B" is not the same as "Company A is owned by Company B." Direction errors are among the most damaging mistakes in knowledge graph construction.
正規化也應保留方向。「公司 A 擁有公司 B」與「公司 A 被公司 B 擁有」並不相同。方向錯誤是知識圖譜建構中最具破壞性的錯誤之一。
Add Context to Relationships / 為關係加入脈絡¶
Some relationships are always true, but many are only true within a specific context.
有些關係永遠為真,但許多關係只在特定脈絡 (context) 中為真。
For example:
舉例來說:
Person A — worksFor — Company B
人物 A — 任職於 (worksFor) — 公司 B
may need context:
可能需要以下脈絡:
· role title
· 職稱
· department
· 部門
· start date
· 起始日期
· end date
· 結束日期
· location
· 地點
· employment status
· 任職狀態
· source document
· 來源文件
· confidence score
· 信賴度分數
Without time and context, the graph may become misleading. A person who worked for a company in 2018 may not work there today. A product that was compatible with one software version may not be compatible with the next.
若沒有時間與脈絡,圖譜可能會變得具誤導性。一個在 2018 年任職於某公司的人,今日可能已不在該公司任職。一個曾與某個軟體版本相容的產品,可能與下一個版本不相容。
Context is especially important for relationships involving employment, ownership, contracts, prices, regulations, medical evidence, scientific claims, political roles, and product compatibility.
對於涉及任職、所有權、合約、價格、法規、醫療證據、科學主張、政治職位與產品相容性的關係而言,脈絡格外重要。
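A sketch of how time context changes query results, assuming a simple validity interval on an employment relationship. The class and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Employment:
    person: str
    organization: str
    role_title: str
    start_date: date
    end_date: Optional[date] = None  # None means the relationship is still open

    def holds_on(self, day):
        """True if the worksFor relationship was valid on the given day."""
        if day < self.start_date:
            return False
        return self.end_date is None or day <= self.end_date


job = Employment("Person A", "Company B", "Engineer", date(2018, 1, 15), date(2019, 6, 30))
```

A query for "who works at Company B today" can then filter on `holds_on(date.today())` instead of returning every person who was ever employed there.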
Capture Provenance for Every Important Relationship / 為每一條重要關係記錄來源出處¶
Provenance tells users where a relationship came from and how it was created. It answers questions such as:
來源出處 (provenance) 告訴使用者一條關係從何而來,以及它是如何被建立的。它能回答以下問題:
· Which source supports this relationship?
· 哪個來源支持這條關係?
· When was it extracted?
· 它是何時被萃取的?
· Was it extracted by a rule, a model, an API, or a human?
· 它是由規則、模型、API 還是人類萃取的?
· Has it been verified?
· 它是否已被驗證?
· What evidence in the text supports it?
· 文字提供了哪些證據支持?
· Has the source changed?
· 來源是否已經變動?
The W3C PROV-O specification provides a model for representing and exchanging provenance information across systems and domains. It includes classes and properties that can be specialized for different applications.
W3C 的 PROV-O 規範提供了一套模型,用於在跨系統與跨領域之間表示與交換來源出處資訊。它包含可針對不同應用加以特化 (specialized) 的類別與屬性。
A high-value relationship should not just be stored as:
一條高價值關係不應只被儲存為:
Supplier A — supplies — Component B
供應商 A — 供應 (supplies) — 元件 B
It should also include:
它還應包含:
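A sketch of the extra fields, assuming plain Python dataclasses rather than PROV-O terms. The field names and the `contract-2024-017` identifier are hypothetical; they mirror the provenance questions listed earlier in this section (source, time of extraction, extraction method, verification status, evidence).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Provenance:
    source_id: str     # document, row, or URL the fact came from
    evidence: str      # the supporting passage
    extracted_by: str  # "rule", "model", "api", or "human"
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    verified: bool = False


@dataclass
class Relationship:
    subject: str
    predicate: str
    obj: str
    provenance: Provenance


rel = Relationship(
    "Supplier A", "supplies", "Component B",
    Provenance(
        source_id="contract-2024-017",  # hypothetical identifier
        evidence="Supplier A shall supply Component B to the buyer.",
        extracted_by="rule",
    ),
)
```

New relationships start as unverified, so verification is an explicit later step rather than a default.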
This makes the graph more trustworthy and easier to audit.
這能讓圖譜更值得信賴,也更容易稽核。
Score Relationships by Value, Not Just Confidence / 依價值而非僅依信賴度為關係評分¶
Confidence and value are different. A relationship can be highly confident but not very useful. Another relationship may be moderately confident but extremely valuable if it affects risk, revenue, compliance, or user experience.
信賴度與價值是不同的。一條關係可能信賴度很高卻不太有用。另一條關係可能信賴度中等,但若它影響到風險、營收、法規遵循或使用者體驗,則可能極具價值。
A good relationship scoring model should consider both.
一個好的關係評分模型應同時考量兩者。
A simple scoring formula might look like this:
一個簡單的評分公式可能長這樣:
Relationship Value Score = Confidence + Source Authority + Use-Case Importance + Reuse + Freshness + Specificity
關係價值分數 = 信賴度 + 來源權威性 + 使用情境重要性 + 重用性 + 新鮮度 + 具體性
This score helps teams decide which relationships to review first, which to publish, and which to keep as low-confidence candidates.
這個分數能協助團隊決定哪些關係該優先審查、哪些該發布,以及哪些該保留為低信賴度的候選項目。
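The formula can be sketched as a plain sum. Two things here are assumptions not stated in the formula itself: each factor is scaled to the range 0 to 1, and optional per-factor weights default to 1.

```python
FACTORS = ("confidence", "source_authority", "use_case_importance",
           "reuse", "freshness", "specificity")


def value_score(factors, weights=None):
    """Sum the six factors, each expected in [0, 1]; weights default to 1."""
    weights = weights or {}
    total = 0.0
    for name in FACTORS:
        value = factors[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weights.get(name, 1.0) * value
    return total


candidate = {"confidence": 0.5, "source_authority": 1.0, "use_case_importance": 1.0,
             "reuse": 0.5, "freshness": 0.5, "specificity": 1.0}
score = value_score(candidate)  # unweighted sum of the six factors
```

In this example a moderately confident relationship still scores well because it comes from an authoritative source and matters to a use case. Weights let a team emphasize whichever factors matter most in its domain.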
Graph analytics can also help identify important entities and relationships. Centrality algorithms, for example, are commonly used to determine important nodes in a network. Neo4j's Graph Data Science documentation lists centrality methods such as PageRank, degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality.
圖譜分析也能協助辨識重要的實體與關係。舉例來說,中心性演算法 (centrality algorithms) 常被用來判定網路中的重要節點。Neo4j 的 Graph Data Science 文件列出了多種中心性方法,例如 PageRank、度中心性 (degree centrality)、接近中心性 (closeness centrality)、中介中心性 (betweenness centrality) 與特徵向量中心性 (eigenvector centrality)。
However, being important in the graph does not mean something is true. A highly connected node might be important, but its relationships still need to be checked.
然而,在圖譜中具有重要性並不代表某件事為真。一個高度連結的節點或許很重要,但其關係仍需經過查核。
Validate Relationships Before Publishing / 在發布之前驗證關係¶
Validation protects the graph from incorrect facts, broken logic, and inconsistent structure.
驗證能保護圖譜免於錯誤的事實、破損的邏輯與不一致的結構。
Common validation checks include:
常見的驗證檢查包括:
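Two checks that follow directly from a relationship schema are entity-type checks and direction checks. The sketch below implements them in plain Python over a hypothetical schema and entity-type table, and also rejects self-references. It is a stand-in for illustration, not a substitute for a shapes-based validator on RDF data.

```python
# Hypothetical schema: predicate -> (allowed subject type, allowed object type).
SCHEMA = {
    "worksFor": ("Person", "Organization"),
    "usesComponent": ("Product", "Component"),
}
ENTITY_TYPES = {
    "Person A": "Person",
    "Company B": "Organization",
    "Product A": "Product",
    "Battery B": "Component",
}


def validate(triple):
    """Return a list of violations for one (subject, predicate, object) triple."""
    subject, predicate, obj = triple
    if predicate not in SCHEMA:
        return [f"unknown predicate: {predicate}"]
    errors = []
    subject_type, object_type = SCHEMA[predicate]
    if ENTITY_TYPES.get(subject) != subject_type:
        errors.append(f"subject type must be {subject_type}")
    if ENTITY_TYPES.get(obj) != object_type:
        errors.append(f"object type must be {object_type}")
    if subject == obj:
        errors.append("self-referencing relationship")
    return errors
```

A reversed triple such as Company B worksFor Person A fails both type checks, which is how a direction error surfaces before loading.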
For RDF graphs, SHACL is a widely used validation language. The W3C SHACL specification defines it as a language for validating RDF graphs against conditions expressed as shapes.
對於 RDF 圖譜而言,SHACL 是一種被廣泛使用的驗證語言。W3C 的 SHACL 規範將其定義為一種語言,用於依照以「形狀 (shapes)」表達的條件來驗證 RDF 圖譜。
Validation should happen before relationships are loaded into production. It should also continue after loading because new data can conflict with old data.
驗證應在關係被載入正式環境 (production) 之前進行。它也應在載入之後持續進行,因為新資料可能會與舊資料發生衝突。
Use Human Review Where the Cost of Error Is High / 在錯誤代價高的地方採用人工審查¶
Automation can extract relationships quickly, but human review is still important for high-stakes domains.
自動化能快速萃取關係,但對於高風險的領域,人工審查仍然重要。
Human experts should review relationships involving:
人類專家應審查涉及以下事項的關係:
· legal obligations
· 法律義務
· medical or scientific claims
· 醫療或科學主張
· financial risk
· 財務風險
· regulatory compliance
· 法規遵循
· security permissions
· 安全權限
· supplier dependencies
· 供應商依賴關係
· public brand facts
· 公開的品牌事實
· customer-sensitive data
· 顧客敏感資料
Human review does not need to cover every relationship. A risk-based approach works better. Review high-value and low-confidence relationships first. Low-risk, high-confidence relationships can often be accepted automatically.
人工審查不需要涵蓋每一條關係。以風險為基礎的做法效果更好。優先審查高價值與低信賴度的關係。低風險、高信賴度的關係通常可以自動被接受。
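The risk-based policy can be sketched as a small routing function plus a queue ordering. The 0.9 threshold and the `value` and `confidence` keys are illustrative assumptions, not recommended settings.

```python
def route(confidence, high_risk, auto_accept_threshold=0.9):
    """Send high-risk or uncertain relationships to a reviewer."""
    if high_risk or confidence < auto_accept_threshold:
        return "human_review"
    return "auto_accept"


def review_order(candidates):
    """Sort the review queue: highest value first, then lowest confidence."""
    return sorted(candidates, key=lambda c: (-c["value"], c["confidence"]))


queue = review_order([
    {"id": 1, "value": 1, "confidence": 0.9},
    {"id": 2, "value": 3, "confidence": 0.8},
    {"id": 3, "value": 3, "confidence": 0.4},
])
```

High-risk relationships go to review regardless of confidence, so a confident but wrong legal or medical fact cannot bypass a human.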
Store Relationships in a Graph-Friendly Format / 以對圖譜友善的格式儲存關係¶
Once extracted, normalized, scored, and validated, relationships should be stored in a format that supports graph queries.
關係一旦經過萃取、正規化、評分與驗證,就應以一種支援圖譜查詢的格式來儲存。
Common options include:
常見的選項包括:
A relationship should usually include not only the subject, predicate, and object, but also details such as source, confidence, timestamps, status, and evidence.
一條關係通常不僅應包含主詞、謂詞與受詞,還應包含諸如來源、信賴度、時間戳記、狀態與證據等細節。
Example:
範例:
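A minimal sketch of such a record, serialized as JSON so that it can be loaded into most graph stores or kept as one object per line. All identifiers and values are invented for illustration, and the key names are an assumption rather than a standard.

```python
import json

record = {
    "subject": "product:a",
    "predicate": "usesComponent",
    "object": "component:battery-b",
    "source": "doc-123",  # hypothetical source identifier
    "confidence": 0.92,
    "status": "approved",
    "extracted_at": "2024-01-15T09:30:00Z",
    "evidence": "Product A uses Battery B.",
}

line = json.dumps(record, sort_keys=True)  # one JSON object per line
restored = json.loads(line)                # round-trips without loss
```

In a property graph the extra fields would typically become edge properties. In RDF they need a mechanism for statements about statements, such as reification or named graphs.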
Improve Relationship Extraction Over Time / 隨時間持續改善關係萃取¶
Extracting knowledge graphs is not a one-time task. The best graphs get better through feedback.
萃取知識圖譜並非一次性的任務。最好的圖譜會透過回饋而不斷變得更好。
Track these metrics:
追蹤以下指標:
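Precision and recall against a hand-checked sample are two standard extraction metrics. The sketch below computes both from sets of triples; the sample data is invented, and the choice of these two metrics is an assumption about what a team would track.

```python
def precision_recall(extracted, gold):
    """Compare extracted triples with a hand-checked gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {("A", "worksFor", "B"), ("C", "worksFor", "D"),
        ("E", "owns", "F"), ("G", "owns", "H")}
extracted = {("A", "worksFor", "B"), ("C", "worksFor", "D"), ("X", "owns", "Y")}
precision, recall = precision_recall(extracted, gold)
```

Here two of the three extracted triples are correct and two of the four gold triples were found. Tracking both numbers per relationship type shows whether a pipeline is too noisy or too conservative.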
Feedback from users, analysts, search logs, and failed queries should guide the next extraction cycle. For example, if users often search for product compatibility but the graph lacks those relationships, that signals a need to improve extraction for that area.
來自使用者、分析師、搜尋日誌與失敗查詢的回饋,應引導下一輪的萃取週期。舉例來說,如果使用者經常搜尋產品相容性,但圖譜卻缺乏這些關係,這就暗示了該領域的萃取需要改善。
Common Mistakes to Avoid / 應避免的常見錯誤¶
One common mistake is extracting relationships without a clear purpose. This creates a large graph that looks impressive but answers few important questions.
一個常見的錯誤是在沒有明確目的的情況下萃取關係。這會產生一個看起來令人印象深刻、卻幾乎無法回答重要問題的龐大圖譜。
Another mistake is using vague predicates. "Related to" is easy to extract but hard to use. Specific relationships improve search, reasoning, filtering, and analytics.
另一個錯誤是使用模糊的謂詞。「與……相關」容易萃取,卻難以使用。具體的關係能提升搜尋、推理、篩選與分析。
A third mistake is ignoring provenance. A relationship without evidence is hard to trust, especially when users need to make decisions based on it.
第三個錯誤是忽略來源出處。一條沒有證據的關係難以令人信賴,尤其當使用者需要依據它來做決策時。
A fourth mistake is skipping entity resolution. If the same company, product, or person appears under different names, the graph becomes fragmented.
第四個錯誤是跳過實體解析。如果同一家公司、產品或人物以不同的名稱出現,圖譜就會變得破碎。
A fifth mistake is treating AI output as final. AI can speed up extraction, but high-value relationships still need schema alignment, source checks, validation, and quality control.
第五個錯誤是將 AI 的輸出視為最終結果。AI 能加速萃取,但高價值關係仍需要綱要對齊、來源查核、驗證與品質控管。
Practical Workflow for Extracting High-Value Knowledge Graph Relationships / 萃取高價值知識圖譜關係的實用工作流程¶
A strong extraction workflow looks like this:
一個有力的萃取工作流程如下:
· Define the use case and questions the graph must answer.
· 定義使用情境,以及圖譜必須回答的問題。
· Identify the most valuable entity and relationship types.
· 辨識最有價值的實體與關係類型。
· Select trusted structured, semi-structured, and unstructured sources.
· 選擇可信賴的結構化、半結構化與非結構化來源。
· Extract and normalize entities.
· 萃取並正規化實體。
· Link duplicate entities to canonical IDs.
· 將重複的實體連結至標準化的識別碼 (canonical IDs)。
· Extract candidate relationships using rules, OpenIE, supervised models, distant supervision, LLMs, or a hybrid method.
· 使用規則、OpenIE、監督式模型、遠程監督、LLM 或混合式方法來萃取候選關係。
· Normalize raw predicates into a controlled relationship schema.
· 將原始謂詞正規化為一套受控的關係綱要。
· Add context such as dates, roles, locations, and source evidence.
· 加入諸如日期、職務、地點與來源證據等脈絡。
· Assign confidence and value scores.
· 賦予信賴度與價值分數。
· Validate relationships with schema rules and quality checks.
· 以綱要規則與品質檢查來驗證關係。
· Send high-risk or uncertain relationships to human review.
· 將高風險或不確定的關係送交人工審查。
· Store approved relationships in a graph database or RDF store.
· 將已核准的關係儲存於圖形資料庫或 RDF 儲存庫中。
· Monitor freshness, conflicts, user feedback, and graph performance.
· 監控新鮮度、衝突、使用者回饋與圖譜效能。
Final Thoughts / 結語¶
The value of a knowledge graph is not measured by the number of nodes and edges it contains. It is measured by the quality of the relationships it can use to answer meaningful questions.
知識圖譜的價值,並非以它所包含的節點 (nodes) 與邊 (edges) 數量來衡量。它是以其能用來回答有意義問題的關係品質來衡量。
High-value knowledge graph relationships are specific, trusted, contextual, and useful. They connect the right entities in the right way, with enough evidence for people and systems to rely on them. The best extraction pipelines combine human domain knowledge, clear schemas, strong source selection, automated extraction, provenance, validation, and continuous improvement.
高價值的知識圖譜關係是具體的、可信賴的、具脈絡的且有用的。它們以正確的方式連結正確的實體,並具備足夠的證據讓人們與系統得以倚賴。最好的萃取流程結合了人類的領域知識、清晰的綱要、有力的來源選擇、自動化萃取、來源出處、驗證與持續改善。
A graph built this way becomes more than a database. It becomes a living map of knowledge that supports search, discovery, reasoning, analytics, automation, and better decisions.
以這種方式建構的圖譜,會成為比資料庫更多的東西。它會成為一張活生生的知識地圖,支援搜尋、探索、推理、分析、自動化與更好的決策。
References / 參考文獻¶
· W3C. "RDF 1.1 Concepts and Abstract Syntax." Defines RDF triples and explains how predicates describe relationships between resources.
· W3C.《RDF 1.1 Concepts and Abstract Syntax》。定義 RDF 三元組,並解釋謂詞如何描述資源之間的關係。
· Hogan, A., Blomqvist, E., Cochez, M., et al. "Knowledge Graphs." ACM Computing Surveys / arXiv. Provides a broad survey of knowledge graph data models, schema, identity, extraction, quality, refinement, and publication.
· Hogan, A.、Blomqvist, E.、Cochez, M. 等人。《Knowledge Graphs》。ACM Computing Surveys / arXiv。對知識圖譜的資料模型、綱要、身分識別、萃取、品質、精煉與發布提供了廣泛的綜述。
· Noy, N., Gao, Y., Jain, A., et al. "Industry-scale Knowledge Graphs: Lessons and Challenges." Google Research / Communications of the ACM. Discusses enterprise knowledge graph use cases across search, products, social networks, and AI systems.
· Noy, N.、Gao, Y.、Jain, A. 等人。《Industry-scale Knowledge Graphs: Lessons and Challenges》。Google Research / Communications of the ACM。探討企業知識圖譜在搜尋、產品、社群網路與 AI 系統中的使用情境。
· Stanford CoreNLP. "OpenIE." Documentation describing extraction of open-domain subject–relation–object triples.
· Stanford CoreNLP.《OpenIE》。描述開放領域「主詞—關係—受詞」三元組萃取的說明文件。
· W3C. "Shapes Constraint Language (SHACL)." Defines SHACL as a language for validating RDF graphs against shapes and constraints.
· W3C.《Shapes Constraint Language (SHACL)》。將 SHACL 定義為一種依據形狀與約束來驗證 RDF 圖譜的語言。
· W3C. "PROV-O: The PROV Ontology." Provides a standard ontology for representing and exchanging provenance information.
· W3C.《PROV-O: The PROV Ontology》。提供一套標準本體論,用於表示與交換來源出處資訊。
· Schema.org. "Schema.org." Describes a vocabulary for entities, relationships, and actions that can be expressed through RDFa, Microdata, and JSON-LD.
· Schema.org.《Schema.org》。描述一套用於實體、關係與動作的詞彙,可透過 RDFa、Microdata 與 JSON-LD 來表達。
· Google Search Central. "Introduction to Structured Data Markup in Google Search." Explains how structured data helps Google understand page content and notes Google's general recommendation to use JSON-LD when possible.
· Google Search Central.《Introduction to Structured Data Markup in Google Search》。解釋結構化資料如何協助 Google 理解頁面內容,並指出 Google 在可行時建議使用 JSON-LD 的一般性建議。
· Neo4j Graph Data Science Documentation. "Centrality." Describes centrality algorithms used to identify important nodes in graph networks.
· Neo4j Graph Data Science 文件。《Centrality》。描述用於辨識圖形網路中重要節點的中心性演算法。