GraphRAG: Building a Better AI System (Full Tutorial) / GraphRAG:打造更優秀的 AI 系統(完整教學)¶
Author: Thu Vu
Published:
Source: https://medium.com/@vuthihienthu.ueb/graphrag-building-a-better-ai-system-full-tutorial-8dd609312e46
Fetched: 2026-06-07T02:18:26.466335

Image source: Author
圖片來源:作者
Knowledge graphs aren’t just a fancy way to represent information. They’re a powerful way to help AI actually reason over it.
知識圖譜 (Knowledge Graph) 不只是一種炫麗的資訊呈現方式,它更是一種能讓 AI 真正對資訊進行推理的強大工具。
In today’s walkthrough, we’re building a GraphRAG system to understand AI copyright. This topic is a mess. And I mean that in the most interesting way possible.
在今天的教學中,我們要打造一套 GraphRAG 系統來理解 AI 著作權 (AI copyright) 議題。這個主題非常混亂——而我是以最有趣的角度來形容它的。
And just like with any complex topic, the information isn’t sitting neatly in one place. It’s scattered across hundreds of news articles, court filings, policy documents, and hot takes published across the web.
就像任何複雜的主題一樣,相關資訊並不會整齊地集中在同一個地方,而是散落在數百篇新聞報導、法院訴狀、政策文件,以及散布於網路上的各種熱門評論之中。
We’re going to build a system that scrapes that information live from Google, turns it into a structured knowledge graph, and then uses GraphRAG to ask questions no search engine or standard AI could reliably answer — questions like:
我們將建立一套系統,從 Google 即時抓取(scrape)這些資訊,將其轉換成結構化的知識圖譜,然後運用 GraphRAG 來提出一些任何搜尋引擎或標準 AI 都無法可靠回答的問題——例如:
“Which companies are at the center of these disputes — and how are they all connected?”
「哪些公司處於這些爭議的核心——而它們彼此之間又是如何相互關聯的?」
By the end of this walkthrough, you’ll know exactly:
在本教學結束時,你將能確切了解:
- how GraphRAG works
- when to use it, and
- how to build it yourself on a real document dataset you may have

- GraphRAG 是如何運作的
- 何時該使用它,以及
- 如何在你可能擁有的真實文件資料集上自行建立它
I’ll also share a code walkthrough of this example project: scraping relevant articles with a web search API, building the knowledge graph with LlamaIndex, analyzing it with Graspologic, visualizing it with PyVis, and finally querying the system.
我也會分享這個範例專案的程式碼導覽——使用網路搜尋 API 抓取相關文章、使用 LlamaIndex 建立知識圖譜、使用 Graspologic 進行分析、使用 PyVis 進行視覺化,最後對這套系統進行查詢。
Note: You can also watch the video version of this blog post👇
註:你也可以觀看本篇部落格文章的影片版本 👇
The Problem With Standard RAG / 標準 RAG 的問題¶
So let’s start with a quick recap of how standard RAG (Retrieval-Augmented Generation) works, because understanding its limitations is what makes GraphRAG click.
讓我們先快速回顧一下標準 RAG(檢索增強生成,Retrieval-Augmented Generation)是如何運作的,因為了解它的限制,正是讓你理解 GraphRAG 的關鍵。
In a typical RAG pipeline, you take your documents — PDFs, articles, transcripts, whatever — and split them into chunks. Each chunk gets converted into a numerical vector using an embedding model.
在典型的 RAG 流程(pipeline)中,你會把文件——PDF、文章、逐字稿等任何東西——切分成多個區塊(chunk)。每個區塊都會透過嵌入模型(embedding model)轉換成一個數值向量(vector)。

These vectors capture the meaning of the text, so chunks about similar topics end up close together in what we call a vector space.
這些向量會捕捉文字的語意,因此關於相似主題的區塊最終會在我們所謂的向量空間 (vector space) 中彼此靠近。
When a user asks a question, the system converts that question into a vector too, finds the chunks closest to it, pulls those chunks, and feeds them to the LLM as context. The LLM then generates an answer based on what it was given.
當使用者提出問題時,系統也會把該問題轉換成向量,找出與其最接近的區塊,將這些區塊取出,並作為上下文(context)餵給大型語言模型(LLM)。接著 LLM 會根據所獲得的內容生成答案。
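The retrieval step described above can be sketched in a few lines of plain Python. This is a toy illustration, not the pipeline used in this project: the three-dimensional vectors are written by hand (a real system would get them from an embedding model), and `cosine_similarity` and `retrieve` are hypothetical helper names.

```python
import math

# Toy "embeddings": hand-written 3-d vectors standing in for real model output.
CHUNKS = {
    "NYT sued OpenAI over training data.": [0.9, 0.1, 0.0],
    "The EU AI Act sets transparency rules.": [0.1, 0.9, 0.0],
    "Stable Diffusion generates images.": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, top_k=2):
    """Return the top_k chunks whose vectors are closest to the query vector."""
    ranked = sorted(
        CHUNKS.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# A query vector that points mostly along the first ("lawsuit") axis.
context = retrieve([0.8, 0.2, 0.1])
```

The chunks in `context` are what would be pasted into the LLM prompt. Note that nothing here knows how the chunks relate to each other; ranking is purely by distance.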

Image source: Author
圖片來源:作者
This is great, because it allows the LLM to answer questions about data it was never trained on — your company’s internal docs, research papers, customer tickets, whatever.
這非常棒,因為它讓 LLM 能夠回答有關它從未訓練過的資料的問題——你公司的內部文件、研究論文、客戶工單等任何東西。
But here’s the problem.
但問題就在這裡。
The more you scale your data, the more accuracy drops.
你的資料規模越大,準確率下降得越多。
One study found that vector search accuracy starts degrading at just 10,000 pages — reaching a 12% accuracy drop at 100,000 pages. The more documents you add, the more overlap you get in the embedding space, and the harder it becomes for the system to retrieve the right chunks.
一項研究發現,向量搜尋的準確率在僅僅 10,000 頁時就開始退化——在 100,000 頁時準確率下降達 12%。你加入的文件越多,嵌入空間中的重疊就越多,系統要檢索到正確的區塊也就越困難。

Image source: https://www.eyelevel.ai/post/do-vector-databases-lose-accuracy-at-scale
圖片來源:https://www.eyelevel.ai/post/do-vector-databases-lose-accuracy-at-scale
But scaling isn’t even the main issue. Standard RAG has two more fundamental blind spots:
但規模問題甚至還不是主要的問題。標準 RAG 還有兩個更根本的盲點:
- First, each chunk is treated as an isolated fragment. Once documents are split and embedded, every chunk exists on its own, disconnected from the chunks around it and from related information in other documents. The system finds text that sounds like your question, but has no understanding of how those fragments connect to form a complete picture.
- 第一,每個區塊都被當作孤立的片段來處理。一旦文件被切分並嵌入後,每個區塊都各自獨立存在,與周圍的區塊以及其他文件中的相關資訊毫無連結。系統找到的是聽起來像你問題的文字,卻完全不理解這些片段如何相互連結以構成完整的全貌。
- Second, there’s no ability to reason across documents. When an answer requires linking information scattered across multiple sources, or when the question is about the dataset as a whole (like “What are the main themes in the data?”), standard RAG has no mechanism for it. Vector similarity matches individual chunks. It doesn’t synthesize across them.
- 第二,它沒有跨文件推理的能力。當一個答案需要連結散布在多個來源的資訊時,或當問題是針對整個資料集(例如「這份資料中的主要主題有哪些?」)時,標準 RAG 並沒有可以處理它的機制。向量相似度只比對個別區塊,它無法在這些區塊之間進行綜合。

Image Source: Author
圖片來源:作者
This is the problem GraphRAG was built to solve.
這正是 GraphRAG 被打造出來要解決的問題。
What Is GraphRAG? / 什麼是 GraphRAG?¶
Instead of treating your documents as disconnected chunks in vector space, GraphRAG adds a structural layer on top.
GraphRAG 不再把你的文件視為向量空間中彼此無連結的區塊,而是在其上加入一個結構化的層。
Here’s the core idea: it uses an LLM to read each chunk and extract entities — people, companies, technologies, events, legal cases — and the relationships between them. These entities become nodes in a knowledge graph, and their relationships become edges connecting them.
核心概念如下:它使用 LLM 來閱讀每個區塊,並抽取出實體 (entities)——人物、公司、技術、事件、法律案件——以及它們之間的關係 (relationships)。這些實體會成為知識圖譜中的節點(node),而它們的關係則成為連接它們的邊(edge)。
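To make the node-and-edge idea concrete, here is a minimal sketch in plain Python. The triplets are made up for illustration (they are not output from this project), and `build_graph` and `connected_entities` are hypothetical helpers. The point is that facts extracted from different articles end up in one structure that can be walked.

```python
from collections import defaultdict

# Hypothetical triplets, as an LLM might extract them from two different articles.
TRIPLETS = [
    ("The New York Times", "FILED_AGAINST", "NYT v. OpenAI"),  # from article A
    ("OpenAI", "DEFENDANT_IN", "NYT v. OpenAI"),               # from article A
    ("Sam Altman", "PART_OF", "OpenAI"),                        # from article B
]


def build_graph(triplets):
    """Store the graph as an adjacency map: node -> list of (relation, neighbor)."""
    graph = defaultdict(list)
    for source, relation, target in triplets:
        graph[source].append((relation, target))
        graph[target].append((relation, source))  # keep it traversable both ways
    return graph


def connected_entities(graph, start):
    """Breadth-first walk: every entity reachable from `start` through any edge."""
    seen, queue = {start}, [start]
    while queue:
        current = queue.pop(0)
        for _, neighbor in graph[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


graph = build_graph(TRIPLETS)
linked = connected_entities(graph, "Sam Altman")
```

Starting from a person mentioned only in article B, the walk reaches the lawsuit and the plaintiff from article A. That cross-document chaining is what isolated chunks cannot provide.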

Image source: Author
圖片來源:作者
The result is a structured graph of your entire dataset that reflects how the information actually relates across documents.
其成果是一張涵蓋整個資料集的結構化圖譜,反映出資訊實際上是如何跨文件相互關聯的。
Microsoft Research, which originally published the GraphRAG paper, calls this sensemaking. It’s the ability to understand connections, patterns, and themes across a large body of information, rather than just retrieving isolated facts.
最初發表 GraphRAG 論文的微軟研究院(Microsoft Research)將此稱為意義建構 (sensemaking)。它指的是理解大量資訊之間的連結、模式與主題的能力,而不僅僅是檢索孤立的事實。

Image Source: https://arxiv.org/pdf/2404.16130
圖片來源:https://arxiv.org/pdf/2404.16130
GraphRAG doesn’t just find text that matches your query. It reasons over a web of connected entities, which means it can chain facts across documents, trace relationships between things mentioned in completely different sources, and answer questions about the dataset as a whole.
GraphRAG 不只是找出符合你查詢的文字。它會在一張由相互連結的實體所構成的網絡上進行推理,這意味著它能跨文件串連事實、追溯在完全不同來源中所提及事物之間的關係,並回答關於整個資料集的問題。
Using a knowledge graph has also been shown to improve LLM response accuracy.
使用知識圖譜也已被證實能提升 LLM 回應的準確率。

Image Source: https://data.world/blog/generative-ai-benchmark-increasing-the-accuracy-of-llms-in-the-enterprise-with-a-knowledge-graph/
GraphRAG vs. Vector RAG: When to Use Each / GraphRAG 與向量 RAG 的對比:各自何時使用¶
I want to be clear about something — GraphRAG doesn’t replace standard vector RAG. They’re good at different things.
我想釐清一件事——GraphRAG 並不會取代標準的向量 RAG(Vector RAG)。它們各自擅長不同的事情。
Here are the simple rules to follow:
以下是可以遵循的簡單原則:
Use GraphRAG when:
在以下情況使用 GraphRAG:
- You’re working with hundreds or thousands of interconnected documents
- Questions require connecting facts, tracing relationships, or identifying patterns
- You need big-picture answers such as themes, trends, summaries across an entire dataset
- Explainability matters. You need to trace how the system arrived at an answer
- You’re in a domain like law, policy, or research where accuracy on complex queries is critical

- 你正在處理數百或數千份彼此相互關聯的文件
- 問題需要連結事實、追溯關係或辨識模式
- 你需要宏觀的答案,例如橫跨整個資料集的主題、趨勢或摘要
- 可解釋性(explainability)很重要。你需要追溯系統是如何得出某個答案的
- 你身處法律、政策或研究等領域,在這些領域中,複雜查詢的準確性至關重要
Use Vector RAG when:
在以下情況使用向量 RAG:
- The questions are simple, direct lookups: “When was this law passed?” or “Who filed this lawsuit?”
- The answer lives inside a single document or chunk
- Speed and cost are the priority
- Your dataset is small and doesn’t have dense cross-document relationships

- 問題是簡單、直接的查找:「這條法律是什麼時候通過的?」或「誰提起了這場訴訟?」
- 答案就存在於單一文件或單一區塊之中
- 速度與成本是首要考量
- 你的資料集很小,且沒有密集的跨文件關係
How GraphRAG Works (The Pipeline) / GraphRAG 如何運作(流程)¶
Let’s get into how GraphRAG actually works under the hood. There are two main phases: indexing, where you build the knowledge graph, and querying, where you retrieve from it.
讓我們深入了解 GraphRAG 在底層實際是如何運作的。它主要有兩個階段:索引(indexing),也就是你建立知識圖譜的階段;以及查詢(querying),也就是你從中檢索資訊的階段。

Image source: General GraphRAG Pipeline
圖片來源:通用 GraphRAG 流程
This is the general pipeline. Microsoft’s approach, which this project follows, extends the general pipeline with two additional steps: community detection (grouping related entities into clusters) and community summarization (generating LLM summaries for each cluster). At query time, these summaries are queried instead of the raw graph, which is what makes it particularly effective for big-picture questions.
這是通用流程。本專案所遵循的微軟做法,在通用流程上額外擴充了兩個步驟:社群偵測 (community detection)(將相關實體分組成叢集 cluster)以及社群摘要 (community summarization)(為每個叢集生成 LLM 摘要)。在查詢時,被查詢的是這些摘要,而非原始圖譜,這正是使它在處理宏觀問題時特別有效的原因。

Image source: https://arxiv.org/pdf/2404.16130
圖片來源:https://arxiv.org/pdf/2404.16130
About the Project / 關於本專案¶
For this project we’re using:
在本專案中,我們會使用:
- SerpAPI to scrape Google News articles on AI copyright and governance. It gives clean, structured results with no browser automation required.
- SerpAPI 用來抓取 Google News 上關於 AI 著作權與治理(governance)的文章。它能提供乾淨、結構化的結果,且不需要瀏覽器自動化(browser automation)。
- LlamaIndex for orchestration. It handles the entire pipeline from document loading through entity extraction, graph construction, and querying. I’m using LlamaIndex here because it has the most native GraphRAG support.
- LlamaIndex 用於協調編排(orchestration)。它處理整個流程,從文件載入、實體抽取、圖譜建構到查詢。我在這裡使用 LlamaIndex,是因為它對 GraphRAG 有最原生的支援。
- Graspologic (Microsoft Research’s graph algorithm library) for community detection using the hierarchical Leiden algorithm. This is what groups related entities into meaningful clusters.
- Graspologic(微軟研究院的圖演算法函式庫)使用階層式 Leiden 演算法(hierarchical Leiden algorithm)進行社群偵測。這就是將相關實體分組成有意義叢集的工具。
- D3.js for visualization. It’s a JavaScript library for building custom, interactive data visualizations in the browser. We use it to render the knowledge graph as a dynamic network you can search, filter, and explore.
- D3.js 用於視覺化。它是一個 JavaScript 函式庫,用來在瀏覽器中建立客製化、互動式的資料視覺化。我們用它把知識圖譜呈現為一個你可以搜尋、篩選與探索的動態網絡。
- For LLMs, I’m using gpt-4o-mini for entity extraction, since it's cheaper and fast enough for this step, then gpt-4o for the final answer generation, where quality matters more.
- 在 LLM 方面,我使用 gpt-4o-mini 來進行實體抽取,因為它較便宜,且對這個步驟而言速度足夠快;然後使用 gpt-4o 來進行最終答案生成,因為那裡品質更為重要。
No external graph database needed — the graph is stored in-memory using LlamaIndex’s SimplePropertyGraphStore.
不需要外部的圖資料庫(graph database)——圖譜是使用 LlamaIndex 的 SimplePropertyGraphStore 儲存在記憶體中。
Code Walkthrough / 程式碼導覽¶
We’ll use VS Code and build this step by step.
我們會使用 VS Code,並一步一步地建立它。
Step 1 — Setup and Imports / 步驟一——設定與匯入¶
First, let’s install our dependencies and import everything we need.
首先,讓我們安裝相依套件(dependencies)並匯入所有需要的東西。
We’re using LlamaIndex for the GraphRAG pipeline, Graspologic for community detection, D3.js for visualization, and Requests to call the SerpAPI.
我們使用 LlamaIndex 來處理 GraphRAG 流程、Graspologic 來進行社群偵測、D3.js 來進行視覺化,以及 Requests 來呼叫 SerpAPI。
```python
# Install dependencies
# pip install llama-index llama-index-llms-openai graspologic pandas nest-asyncio python-dotenv requests
import asyncio

import nest_asyncio
import pandas as pd
import networkx as nx
from dotenv import load_dotenv
from llama_index.core import Document, PropertyGraphIndex, Settings
from llama_index.core.graph_stores.types import (
    EntityNode, KG_NODES_KEY, KG_RELATIONS_KEY, Relation
)
from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.async_utils import run_jobs
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.llms.openai import OpenAI
from graspologic.partition import hierarchical_leiden

# Patch the event loop so LlamaIndex async calls work inside Jupyter
nest_asyncio.apply()

# Load API keys from .env file (create one with OPENAI_API_KEY=sk-...)
load_dotenv()
```
Step 2 — Configuration / 步驟二——組態設定¶
We’re using two different models for different parts of the pipeline:
我們在流程的不同部分使用兩個不同的模型:

```python
# ── LLM Configuration ─────────────────────────────────────────────────────────
EXTRACTION_LLM = OpenAI(model="gpt-4o-mini", temperature=0)  # For extraction + summaries
QUERY_LLM = OpenAI(model="gpt-4o", temperature=0)            # For final answer synthesis

# ── Pipeline Parameters ────────────────────────────────────────────────────────
MAX_ARTICLES = 50         # Number of articles to process
MAX_PATHS_PER_CHUNK = 20  # Max entity-relationship-entity triplets extracted per chunk
NUM_WORKERS = 4           # Parallel workers for extraction (faster, same cost)
MAX_CLUSTER_SIZE = 10     # Max entities per community cluster
GRAPH_OUTPUT_FILE = "ai_copyright_graph.html"

Settings.llm = EXTRACTION_LLM
```
Step 3 — Scrape Articles with SerpAPI / 步驟三——使用 SerpAPI 抓取文章¶
Now let’s get our data.
現在讓我們取得資料。
I’m using SerpAPI to scrape Google search results for our dataset. SerpAPI gives you real-time, structured, clean search results from Google and other search engines through a simple API.
我使用 SerpAPI 來抓取 Google 搜尋結果作為我們的資料集。SerpAPI 透過一個簡單的 API,提供來自 Google 與其他搜尋引擎的即時、結構化、乾淨的搜尋結果。
I ran two queries: "AI intellectual property" and "copyright Generative AI". After filtering out pages that couldn't be scraped due to paywalls or bot protection, we ended up with 13 successfully scraped articles saved to ai_copyright_dataset.csv.
我執行了兩個查詢:"AI intellectual property"(AI 智慧財產權)與 "copyright Generative AI"(生成式 AI 著作權)。在過濾掉因付費牆(paywall)或機器人防護(bot protection)而無法抓取的頁面後,我們最終得到 13 篇成功抓取的文章,並儲存到 ai_copyright_dataset.csv。
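The full scraper lives in the repo, but the shape of a SerpAPI call can be sketched with the standard library alone. Treat everything here as an assumption rather than the project's actual code: the `google_news` engine name, the `news_results` response key, and the helper names `build_search_url`, `parse_results`, and `search` are illustrative, so check them against the SerpAPI documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

SERPAPI_ENDPOINT = "https://serpapi.com/search"  # SerpAPI's JSON search endpoint


def build_search_url(query, api_key, engine="google_news"):
    """Build the request URL; the engine name is an assumption, not from the article."""
    params = {"engine": engine, "q": query, "api_key": api_key}
    return f"{SERPAPI_ENDPOINT}?{urllib.parse.urlencode(params)}"


def parse_results(payload):
    """Keep only title and link; 'news_results' is the assumed response key."""
    return [
        {"title": item.get("title", ""), "link": item.get("link", "")}
        for item in payload.get("news_results", [])
        if item.get("link")
    ]


def search(query, api_key):
    """Perform the live request (needs a real key and network access)."""
    with urllib.request.urlopen(build_search_url(query, api_key), timeout=30) as resp:
        return parse_results(json.load(resp))


# Offline demonstration with a hand-written payload instead of a live call.
sample_payload = {
    "news_results": [{"title": "Example headline", "link": "https://example.com/a"}]
}
articles = parse_results(sample_payload)
```

Each returned link still has to be fetched and cleaned to get the article text, which is where paywalls and bot protection filter pages out.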
If you want to see the full scraping pipeline, including how we pull the full article text and handle YouTube transcripts, check out the GitHub repo here.
如果你想看完整的抓取流程,包括我們如何擷取完整的文章內文以及處理 YouTube 逐字稿,請查看這裡的 GitHub 儲存庫(repo)。

Image: Scraped articles, saved to ai_copyright_dataset.csv
圖片:抓取的文章,儲存到 ai_copyright_dataset.csv
Step 4 — Define the Ontology / 步驟四——定義本體論¶
Before we touch the LLM, we define our ontology. The ontology is the schema of our knowledge graph. It controls exactly what kinds of entities and relationships the LLM is allowed to extract.
在我們接觸 LLM 之前,我們先定義本體論(ontology)。本體論是我們知識圖譜的綱要(schema)。它精確地控制 LLM 被允許抽取哪些種類的實體與關係。
This is an important step. Without a defined schema, the LLM will:
這是一個重要的步驟。若沒有定義好的綱要,LLM 將會:
- invent different entity types per document (“AI_COMPANY” vs “TECH_FIRM” vs “ORGANIZATION”)
- use inconsistent relationship labels for the same concept

- 為每份文件發明出不同的實體類型(「AI_COMPANY」對「TECH_FIRM」對「ORGANIZATION」)
- 對同一個概念使用不一致的關係標籤
This would make the resulting graph messy and not very useful.
這會讓最終產生的圖譜變得雜亂且不太有用。
For this particular use case, I defined the 7 entity types and 8 relationship types below.
針對這個特定的使用情境,我在下方定義了 7 種實體類型與 8 種關係類型。
But this really depends on your use case and what you want to research.
但這實際上取決於你的使用情境以及你想研究的內容。
```python
# ── Entity types ──────────────────────────────────────────────────────────────
# We'll embed these directly into the extraction prompt below.
ENTITY_TYPES = [
    "ORGANIZATION",  # Companies, labs, industry groups (OpenAI, Google, RIAA)
    "PERSON",        # Executives, policymakers, judges (Sam Altman, Thierry Breton)
    "LEGISLATION",   # Laws, acts, regulations (EU AI Act, DMCA, Copyright Act)
    "LEGAL_CASE",    # Lawsuits, court rulings (NYT v. OpenAI, Getty v. Stability AI)
    "CONCEPT",       # Abstract ideas: fair use, training data, IP rights
    "GOVERNMENT",    # Nations, regulatory bodies, courts (EU, US Copyright Office)
    "AI_SYSTEM",     # Specific models or products (GPT-4, Stable Diffusion, Gemini)
]

# ── Relationship types ─────────────────────────────────────────────────────────
RELATION_TYPES = [
    "FILED_AGAINST",  # Plaintiff Organization/Person → LEGAL_CASE
    "DEFENDANT_IN",   # Organization/Person → LEGAL_CASE
    "REGULATES",      # GOVERNMENT/LEGISLATION → ORGANIZATION/AI_SYSTEM
    "ADVOCATES_FOR",  # ORGANIZATION/PERSON → CONCEPT or policy position
    "TRAINED_ON",     # AI_SYSTEM → CONCEPT or dataset type
    "PART_OF",        # PERSON → ORGANIZATION
    "REFERENCES",     # LEGAL_CASE/LEGISLATION → CONCEPT
    "OPPOSES",        # ORGANIZATION/GOVERNMENT → LEGISLATION/CONCEPT
]
```
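The comments next to each relationship type describe which entity types are expected on each side of the arrow. The project keeps that information in comments only; as an optional extra, it could be turned into a post-extraction check. The sketch below is one hypothetical way to do that (`RELATION_SIGNATURES` and `is_valid_triplet` are names I made up), covering just three relations for brevity.

```python
# Allowed (source types, target types) per relation, read off the ontology comments.
# Only three relations are shown; the rest would follow the same pattern.
RELATION_SIGNATURES = {
    "DEFENDANT_IN": ({"ORGANIZATION", "PERSON"}, {"LEGAL_CASE"}),
    "PART_OF": ({"PERSON"}, {"ORGANIZATION"}),
    "REGULATES": ({"GOVERNMENT", "LEGISLATION"}, {"ORGANIZATION", "AI_SYSTEM"}),
}


def is_valid_triplet(source_type, relation, target_type):
    """True if the relation is known and both endpoint types match its signature."""
    signature = RELATION_SIGNATURES.get(relation)
    if signature is None:
        return False
    allowed_sources, allowed_targets = signature
    return source_type in allowed_sources and target_type in allowed_targets


ok = is_valid_triplet("ORGANIZATION", "DEFENDANT_IN", "LEGAL_CASE")
bad = is_valid_triplet("LEGAL_CASE", "PART_OF", "CONCEPT")
```

A check like this catches triplets that use an allowed label in a nonsensical direction, which a plain list of allowed labels cannot.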
Step 5 — Write the Extraction Prompt / 步驟五——撰寫抽取提示詞¶
This is where our ontology gets embedded into the LLM’s instructions.
這就是我們把本體論嵌入到 LLM 指令中的地方。
Most GraphRAG implementations extract bare triplets: (OpenAI, DEFENDANT_IN, NYT v. OpenAI).
大多數 GraphRAG 的實作只抽取裸三元組(bare triplet):(OpenAI, DEFENDANT_IN, NYT v. OpenAI)。
But this prompt extracts descriptions alongside every entity and relationship:
但這個提示詞會在每個實體與關係旁邊一併抽取出描述(description):
- Entity description: “OpenAI is an AI research company that developed GPT-4, currently facing multiple copyright lawsuits”
- Relationship description: “OpenAI is named as defendant in the New York Times lawsuit over allegedly using copyrighted articles as training data”

- 實體描述:「OpenAI 是一家開發了 GPT-4 的 AI 研究公司,目前正面臨多起著作權訴訟」
- 關係描述:「OpenAI 在《紐約時報》的訴訟中被列為被告,該訴訟指控其涉嫌使用受著作權保護的文章作為訓練資料」
These descriptions flow through to the community summaries, making them far richer and enabling much better final answers.
這些描述會一路流入社群摘要之中,使摘要遠為豐富,並讓最終的答案品質好得多。
```python
# Build the entity and relationship type strings dynamically from our ontology
entity_types_str = ", ".join(ENTITY_TYPES)
relation_types_str = ", ".join(RELATION_TYPES)

KG_TRIPLET_EXTRACT_TMPL = f"""
-Goal-
Given a news article about AI copyright, governance, or intellectual property,
identify all entities mentioned in the article and their relationships.
Extract up to {{max_knowledge_triplets}} entity-relation triplets.

-Allowed Entity Types-
{entity_types_str}

-Allowed Relationship Types-
{relation_types_str}

-Steps-
1. Identify ALL entities. For each entity extract:
   - name: Name of the entity, capitalized
   - type: One of the allowed entity types above
   - description: A brief description of the entity and its role in AI copyright/governance
2. Identify relationships between entities. For each pair extract:
   - source: name of the source entity
   - target: name of the target entity
   - relation: one of the allowed relationship types above
   - description: a sentence explaining why and how these entities are related

-Real Data-
######################
text: {{text}}
######################
"""
```
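One detail in the template is easy to miss: it mixes single braces (filled immediately by the f-string) with doubled braces (left as placeholders for later). The small sketch below shows the two stages with a shortened template. Plain `str.format` is used as a stand-in for the second stage; it is not how LlamaIndex fills prompts internally.

```python
entity_types_str = ", ".join(["ORGANIZATION", "PERSON"])

# Stage 1: the f-string fills {entity_types_str} now; doubled braces become single.
template = f"""Allowed types: {entity_types_str}
Extract up to {{max_knowledge_triplets}} triplets.
text: {{text}}"""

# Stage 2: the remaining placeholders are filled once per chunk.
prompt = template.format(max_knowledge_triplets=20, text="OpenAI was sued.")
```

After stage 1, `template` still contains the literal text `{text}` and `{max_knowledge_triplets}`, which is exactly what a prompt template needs in order to substitute values at extraction time.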
Step 6 — Define Structured Extraction Models / 步驟六——定義結構化抽取模型¶
Instead of parsing raw LLM text with regex, we define Pydantic models that describe exactly what we want the LLM to return. LlamaIndex passes these to OpenAI as a function schema, so the response is structured JSON — validated and typed automatically.
我們不使用正規表達式(regex)來解析 LLM 的原始文字,而是定義 Pydantic 模型,精確地描述我們希望 LLM 回傳的內容。LlamaIndex 將這些作為函式綱要(function schema)傳給 OpenAI,因此回應會是結構化的 JSON——並自動完成驗證與型別標註。
Three models are defined:
定義了三個模型:
- ExtractedEntity: a single entity with name, type, and description
- ExtractedRelationship: a relationship between two entities, with source, target, relation, and description
- ExtractionResult: the top-level wrapper containing a list of each

- ExtractedEntity:一個單一實體,包含 name、type 與 description
- ExtractedRelationship:兩個實體之間的一個關係,包含 source、target、relation、description
- ExtractionResult:最上層的包裝器,包含上述兩者各自的清單
EntityTypeStr and RelationTypeStr are Literal types built from the ontology. This means the LLM cannot return an invalid type because Pydantic will reject it before it ever reaches our code.
EntityTypeStr 與 RelationTypeStr 是從本體論建立的 Literal 型別。這意味著 LLM 無法回傳無效的型別,因為 Pydantic 會在它觸及我們的程式碼之前就予以拒絕。
```python
from pydantic import BaseModel, Field, field_validator
from typing import Literal, List

EntityTypeStr = Literal[
    "ORGANIZATION", "PERSON", "LEGISLATION", "LEGAL_CASE",
    "CONCEPT", "GOVERNMENT", "AI_SYSTEM"
]
RelationTypeStr = Literal[
    "FILED_AGAINST", "DEFENDANT_IN", "REGULATES", "ADVOCATES_FOR",
    "TRAINED_ON", "PART_OF", "REFERENCES", "OPPOSES"
]


class ExtractedEntity(BaseModel):
    name: str = Field(description="Name of the entity, capitalized")
    type: EntityTypeStr = Field(description="One of the allowed entity types")
    description: str = Field(description="Brief description of the entity and its role")


class ExtractedRelationship(BaseModel):
    source: str = Field(description="Name of the source entity")
    target: str = Field(description="Name of the target entity")
    relation: RelationTypeStr = Field(description="One of the allowed relationship types")
    description: str = Field(description="Sentence explaining the relationship")


class ExtractionResult(BaseModel):
    entities: List[ExtractedEntity] = Field(default_factory=list)
    relationships: List[ExtractedRelationship] = Field(default_factory=list)
```
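Because the ontology is written out twice (once as a list in Step 4, once as a Literal here), the two can drift apart when one is edited. A small guard using only the standard library can catch that. This is an optional addition of mine, not part of the original project, and `is_allowed_entity_type` is a hypothetical helper that mirrors the membership check Pydantic applies to a Literal field.

```python
from typing import Literal, get_args

ENTITY_TYPES = [
    "ORGANIZATION", "PERSON", "LEGISLATION", "LEGAL_CASE",
    "CONCEPT", "GOVERNMENT", "AI_SYSTEM",
]
EntityTypeStr = Literal[
    "ORGANIZATION", "PERSON", "LEGISLATION", "LEGAL_CASE",
    "CONCEPT", "GOVERNMENT", "AI_SYSTEM",
]

# get_args() returns the tuple of values listed inside a Literal type.
literal_values = get_args(EntityTypeStr)

# Fail fast if someone edits one definition and forgets the other.
assert set(literal_values) == set(ENTITY_TYPES), "ontology and Literal have drifted"


def is_allowed_entity_type(value):
    """Manual version of the membership check applied to a Literal field."""
    return value in literal_values
```

With this in place, an invented type such as "TECH_FIRM" is rejected the same way the structured-output validation would reject it.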
Step 7 — Build the Custom GraphRAGExtractor / 步驟七——建立客製化的 GraphRAGExtractor¶
This is the core extraction component. It sends each text chunk to the LLM with our ontology-constrained prompt, parses the response, and stores the extracted entities and relationships as structured objects on each node.
這是核心的抽取元件。它使用我們受本體論約束的提示詞,將每個文字區塊送往 LLM,解析其回應,並將抽取出的實體與關係作為結構化物件儲存在每個節點上。
Why build a custom extractor?
為什麼要建立一個客製化的抽取器?
LlamaIndex’s built-in SchemaLLMPathExtractor extracts entity and relationship labels but drops descriptions. By building our own extractor:
LlamaIndex 內建的 SchemaLLMPathExtractor 會抽取實體與關係的標籤,但會捨棄描述。透過建立我們自己的抽取器:
- Every EntityNode carries an entity_description property
- Every Relation carries a relationship_description property
- These flow into community summaries, making them much richer

- 每個 EntityNode 都帶有一個 entity_description 屬性
- 每個 Relation 都帶有一個 relationship_description 屬性
- 這些都會流入社群摘要中,使其豐富得多
The extractor runs asynchronously with num_workers=4 — processing 4 chunks in parallel, which speeds up extraction significantly without increasing API cost.
這個抽取器以 num_workers=4 非同步(asynchronously)執行——同時平行處理 4 個區塊,這在不增加 API 成本的情況下大幅加快了抽取速度。
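The "faster, same cost" point is worth seeing in isolation: the number of LLM calls stays the same, only how many are in flight at once changes. Below is a minimal sketch of bounded parallelism using `asyncio.Semaphore`. It is a stand-in for illustration, not LlamaIndex's `run_jobs`, and `fake_extract` simply sleeps instead of calling an API.

```python
import asyncio


async def fake_extract(chunk_id, active, peak):
    """Stand-in for one LLM extraction call; tracks how many run at once."""
    active[0] += 1
    peak[0] = max(peak[0], active[0])
    await asyncio.sleep(0.01)  # pretend to wait on the API
    active[0] -= 1
    return f"chunk-{chunk_id}-done"


async def run_with_workers(num_chunks, num_workers=4):
    """Run all jobs, but never more than num_workers at the same time."""
    semaphore = asyncio.Semaphore(num_workers)
    active, peak = [0], [0]

    async def bounded(chunk_id):
        async with semaphore:
            return await fake_extract(chunk_id, active, peak)

    # gather() returns results in the same order the jobs were submitted.
    results = await asyncio.gather(*(bounded(i) for i in range(num_chunks)))
    return results, peak[0]


results, peak_concurrency = asyncio.run(run_with_workers(num_chunks=10))
```

Ten jobs still mean ten calls, so the bill is unchanged; the semaphore only caps how many wait on the network simultaneously. Inside Jupyter, `asyncio.run` needs the `nest_asyncio.apply()` patch from Step 1.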
```python
class GraphRAGExtractor(TransformComponent):
    """
    Extracts entities and relationships WITH descriptions from each text chunk.

    Uses Pydantic structured output (via OpenAI function calling) to guarantee
    that every entity has a valid type and every relationship uses an allowed label.

    Data flow for a single chunk:
        Raw text
        → LLM (astructured_predict)
        → ExtractionResult(
              entities=[
                  ExtractedEntity(name="OpenAI", type="ORGANIZATION", description="AI lab..."),
                  ExtractedEntity(name="NYT v. OpenAI", type="LEGAL_CASE", description="Copyright lawsuit..."),
              ],
              relationships=[
                  ExtractedRelationship(source="OpenAI", target="NYT v. OpenAI",
                                        relation="DEFENDANT_IN", description="OpenAI is named defendant..."),
              ]
          )
        → EntityNode(name="OpenAI", label="ORGANIZATION", properties={...})
        → EntityNode(name="NYT v. OpenAI", label="LEGAL_CASE", properties={...})
        → Relation(label="DEFENDANT_IN", source_id=<OpenAI id>, target_id=<NYT v. OpenAI id>)
    """

    llm: LLM = Field(default_factory=lambda: Settings.llm)
    extract_prompt: PromptTemplate = Field(
        default_factory=lambda: PromptTemplate(KG_TRIPLET_EXTRACT_TMPL)
    )
    num_workers: int = 4
    max_paths_per_chunk: int = 10

    @field_validator("extract_prompt", mode="before")
    @classmethod
    def coerce_to_prompt_template(cls, v):
        # Accept a plain string and wrap it in a PromptTemplate
        return PromptTemplate(v) if isinstance(v, str) else v

    def __call__(self, nodes, show_progress=False, **kwargs):
        # Synchronous entry point: run the async extraction to completion
        return asyncio.run(self.acall(nodes, show_progress=show_progress, **kwargs))

    async def _aextract(self, node: BaseNode) -> BaseNode:
        text = node.get_content(metadata_mode="llm")
        # Calls the LLM with our ontology prompt, parses into ExtractionResult,
        # then builds EntityNode + Relation objects and attaches them to node metadata.
        # Full implementation in the GitHub repo.
        # Note: acall(), which __call__ relies on, is not shown in this excerpt.
```
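The body of `_aextract` is left to the repo, so the following is only my guess at the shape of the last step in the docstring's data flow: turning one chunk's extraction result into node and relation objects that carry their descriptions as properties. It uses plain dataclasses (`SketchNode`, `SketchRelation`) and a dict instead of the real LlamaIndex and Pydantic classes, and dropping relations that point at unlisted entities is my own design choice, not something the article specifies.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Simplified stand-in for an entity node: name, label, free-form properties."""
    name: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class SketchRelation:
    """Simplified stand-in for a relation between two entity names."""
    label: str
    source_id: str
    target_id: str
    properties: dict = field(default_factory=dict)


def to_graph_objects(extraction):
    """Turn one chunk's extraction result (a plain dict here) into nodes and relations."""
    nodes = [
        SketchNode(e["name"], e["type"], {"entity_description": e["description"]})
        for e in extraction["entities"]
    ]
    known = {n.name for n in nodes}
    relations = [
        SketchRelation(
            r["relation"], r["source"], r["target"],
            {"relationship_description": r["description"]},
        )
        for r in extraction["relationships"]
        # Skip relations that point at entities the model never listed.
        if r["source"] in known and r["target"] in known
    ]
    return nodes, relations


extraction = {
    "entities": [
        {"name": "OpenAI", "type": "ORGANIZATION", "description": "AI lab"},
        {"name": "NYT v. OpenAI", "type": "LEGAL_CASE", "description": "Copyright lawsuit"},
    ],
    "relationships": [
        {"source": "OpenAI", "target": "NYT v. OpenAI", "relation": "DEFENDANT_IN",
         "description": "OpenAI is named defendant"},
        {"source": "OpenAI", "target": "Unknown Entity", "relation": "OPPOSES",
         "description": "dangling edge"},
    ],
}
nodes, relations = to_graph_objects(extraction)
```

The property names `entity_description` and `relationship_description` match the ones the store reads back in Step 8, which is what lets descriptions flow into the community summaries.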
Step 8 — Build the GraphRAGStore / 步驟八——建立 GraphRAGStore¶
GraphRAGStore extends LlamaIndex's SimplePropertyGraphStore with two additional capabilities: community detection and community summary generation.
GraphRAGStore 在 LlamaIndex 的 SimplePropertyGraphStore 之上擴充了兩項額外功能:社群偵測與社群摘要生成。
By bundling these into the store itself, the pipeline stays clean — after building the index, you just call graph_store.build_communities() and everything is handled.
藉由將這些功能整合進儲存體(store)本身,整個流程保持簡潔——在建立索引之後,你只需呼叫 graph_store.build_communities(),所有事情就會被處理好。
How community detection works here:
社群偵測在這裡的運作方式:
- First, convert the property graph to a NetworkX graph
- Then run hierarchical Leiden to find entity clusters
- For each cluster, collect all entities (with descriptions) and relationships (with descriptions)
- Finally, ask the LLM to write a briefing note for each cluster

- 首先,將屬性圖(property graph)轉換為 NetworkX 圖
- 接著執行階層式 Leiden 演算法以找出實體叢集
- 對每個叢集,收集所有實體(含描述)與關係(含描述)
- 最後,請 LLM 為每個叢集撰寫一份簡報筆記(briefing note)
The descriptions captured during extraction make these briefings significantly richer than if we’d only stored bare labels.
在抽取過程中所擷取的描述,使這些簡報遠比我們只儲存裸標籤時要豐富得多。
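The four steps above can be sketched without any graph library. To be clear about the substitution: this toy version groups nodes by connected components, which is a much cruder technique than hierarchical Leiden and is used here only because it fits in a few lines. The edges and the helper names `detect_communities` and `briefing_inputs` are hypothetical.

```python
from collections import defaultdict

EDGES = [
    ("OpenAI", "DEFENDANT_IN", "NYT v. OpenAI"),
    ("The New York Times", "FILED_AGAINST", "NYT v. OpenAI"),
    ("EU", "REGULATES", "Stable Diffusion"),
]


def detect_communities(edges):
    """Label each node with a community id using connected components (NOT Leiden)."""
    neighbors = defaultdict(set)
    for source, _, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    community_of, next_id = {}, 0
    for start in sorted(neighbors):
        if start in community_of:
            continue
        stack = [start]
        while stack:  # depth-first flood fill of one component
            node = stack.pop()
            if node not in community_of:
                community_of[node] = next_id
                stack.extend(neighbors[node] - set(community_of))
        next_id += 1
    return community_of


def briefing_inputs(edges, community_of):
    """Group relationship strings by community, ready to paste into a summary prompt."""
    grouped = defaultdict(list)
    for source, relation, target in edges:
        if community_of[source] == community_of[target]:
            grouped[community_of[source]].append(f"{source} --[{relation}]--> {target}")
    return dict(grouped)


community_of = detect_communities(EDGES)
inputs = briefing_inputs(EDGES, community_of)
```

Each value in `inputs` is the raw material for one briefing note. Leiden differs in that it can split a single connected component into several dense clusters and produce a hierarchy of them.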
```python
class GraphRAGStore(SimplePropertyGraphStore):
    """
    Extends SimplePropertyGraphStore with:
    - Leiden community detection
    - LLM-generated community summaries (using entity + relationship descriptions)

    After building the PropertyGraphIndex, call build_communities() to
    detect clusters and generate summaries. These summaries are then
    used by GraphRAGQueryEngine at query time.
    """

    community_summaries: dict = {}

    def build_communities(self):
        """
        Main entry point: run community detection and generate summaries.
        Call this once after PropertyGraphIndex is built.
        """
        print("Running community detection...")
        nx_graph = self._to_networkx()
        if not nx_graph.nodes:
            print("⚠️ Graph is empty - no communities to detect")
            return
        print(f"Graph has {nx_graph.number_of_nodes()} nodes, {nx_graph.number_of_edges()} edges")

        # Run hierarchical Leiden algorithm
        clusters = hierarchical_leiden(nx_graph, max_cluster_size=MAX_CLUSTER_SIZE)
        num_communities = len(set(c.cluster for c in clusters))
        print(f"Found {num_communities} communities")

        # Collect entities and relationships per community
        community_info = self._collect_community_info(nx_graph, clusters)

        # Generate LLM summaries
        self._generate_summaries(community_info)
        print(f"\n✅ {len(self.community_summaries)} community summaries generated")

    def _to_networkx(self) -> nx.Graph:
        """
        Convert the LlamaIndex property graph to a NetworkX graph for Graspologic.
        Edges carry relationship label and description as attributes.
        """
        nx_graph = nx.Graph()

        # Add entity nodes
        for node in self.graph.nodes.values():
            if isinstance(node, EntityNode):
                nx_graph.add_node(node.id)

        # Add edges with relationship metadata
        for relation in self.graph.relations.values():
            if relation.source_id in nx_graph and relation.target_id in nx_graph:
                nx_graph.add_edge(
                    relation.source_id,
                    relation.target_id,
                    relationship=relation.label,
                    description=relation.properties.get("relationship_description", ""),
                    # nx.Graph is undirected, so remember the original direction
                    source=relation.source_id,
                    target=relation.target_id,
                )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters) -> dict:
        """
        For each community cluster, collect:
        - Entity names, types, and descriptions
        - Relationship strings (with descriptions where available)
        This rich data is what makes the LLM summaries informative.
        """
        community_mapping = {item.node: item.cluster for item in clusters}

        # Build a lookup of node id -> {name, type, description}
        node_details = {}
        for node in self.graph.nodes.values():
            if not isinstance(node, EntityNode):
                continue
            node_details[node.id] = {
                "name": node.name,
                "type": node.label,
                "description": node.properties.get("entity_description", ""),
            }

        community_info = {}
        for item in clusters:
            cid, nid = item.cluster, item.node
            community_info.setdefault(cid, {"entities": [], "relationships": []})

            # Add entity details to community
            if nid in node_details:
                community_info[cid]["entities"].append(node_details[nid])

            # Add relationships to neighbors in the same community
            for neighbor in nx_graph.neighbors(nid):
                if community_mapping.get(neighbor) == cid:
                    edge = nx_graph.get_edge_data(nid, neighbor) or {}
                    rel = edge.get("relationship", "RELATED")
                    desc = edge.get("description", "")
                    # Use the stored endpoints so the arrow keeps its original
                    # direction no matter which side of the edge we arrived from.
                    src_id = edge.get("source", nid)
                    tgt_id = edge.get("target", neighbor)
                    src_name = node_details.get(src_id, {}).get("name", src_id)
                    tgt_name = node_details.get(tgt_id, {}).get("name", tgt_id)

                    # Format: "Entity A --[RELATION]--> Entity B (description)"
                    entry = f"{src_name} --[{rel}]--> {tgt_name}"
                    if desc:
                        entry += f" ({desc})"
                    community_info[cid]["relationships"].append(entry)
        return community_info

    def _generate_summaries(self, community_info):
        """
        Ask the LLM to write a briefing note for each community cluster.
        Uses entity descriptions and relationship descriptions for richer context.
        """
        for community_id, data in community_info.items():
            if not data["relationships"] and not data["entities"]:
                continue

            # Format entities with type and description
            entities_text = "\n".join([
                f"- {e['name']} ({e['type']}): {e['description']}"
                for e in data["entities"] if e.get("name")
            ])

            # Deduplicate and format relationships
            relationships_text = "\n".join(sorted(set(data["relationships"])))

            prompt = f"""You are analysing a cluster of entities from news articles about
AI copyright, governance, and intellectual property.

Entities in this cluster:
{entities_text}

Relationships:
{relationships_text}

Write a concise briefing (3-5 sentences) that:
1. Identifies the main organizations, people, legal cases, or topics in this cluster
2. Explains how they are connected and why - including legal or regulatory context
3. Highlights any disputes, lawsuits, policy positions, or tensions
4. Notes anything particularly relevant for understanding AI copyright or governance

Briefing:"""
            response = EXTRACTION_LLM.complete(prompt)
            self.community_summaries[community_id] = response.text
            print(f" Community {community_id}: {response.text[:100]}...")

    def get_community_summaries(self) -> dict:
        return self.community_summaries
```
Step 9 — Build the GraphRAGQueryEngine / 步驟九——建立 GraphRAGQueryEngine¶
The query engine uses a two-step approach:
這個查詢引擎使用兩步驟的方法:
- Per-community answering: you ask the LLM to answer the question from each community summary independently. If a summary isn’t relevant, the LLM says so and we skip it. This avoids polluting the final answer with irrelevant content.
- Aggregation: combine all relevant partial answers into one final, non-redundant response using QUERY_LLM (the stronger model).

- 逐社群回答(Per-community answering):你請 LLM 獨立地根據每個社群摘要來回答問題。如果某個摘要不相關,LLM 會指出這點,我們就會跳過它。這可以避免用不相關的內容污染最終答案。
- 彙整(Aggregation):使用 QUERY_LLM(較強的模型)將所有相關的部分答案組合成一個最終、無冗餘的回應。
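This two-step approach is a map-filter-reduce pattern, and it can be shown without any LLM at all. In the sketch below both model calls are replaced with crude stand-ins (keyword overlap for the per-community step, string joining for aggregation), so it demonstrates the control flow only. The summaries and function names are invented for the example.

```python
NO_ANSWER = "No relevant information."

SUMMARIES = {
    0: "OpenAI is the defendant in NYT v. OpenAI, a copyright lawsuit over training data.",
    1: "The EU AI Act imposes transparency obligations on providers of generative models.",
}


def answer_from_summary(summary, question):
    """Stand-in for the per-community LLM call: naive keyword overlap, no model."""
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    summary_words = {w.strip("?.,").lower() for w in summary.split()}
    return summary if keywords & summary_words else NO_ANSWER


def aggregate(partial_answers):
    """Stand-in for the final LLM call: here we simply join the partial answers."""
    return " ".join(partial_answers)


def query(question):
    # Map: ask every community summary the same question.
    partials = [answer_from_summary(s, question) for s in SUMMARIES.values()]
    # Filter: drop communities that reported nothing relevant.
    relevant = [p for p in partials if p != NO_ANSWER]
    if not relevant:
        return "Not enough information in the graph."
    # Reduce: merge what is left into one answer.
    return aggregate(relevant)


final = query("Which lawsuit involves OpenAI?")
```

One consequence of this design is cost: every query triggers one call per community before aggregation, which is why the cheaper model handles the map step.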
class GraphRAGQueryEngine(CustomQueryEngine):
    """
    Queries all community summaries and synthesises a single answer.

    Step 1: For each community summary, ask EXTRACTION_LLM to answer the
            question based only on that summary. If not relevant, skip.
    Step 2: Aggregate all relevant partial answers using QUERY_LLM into
            one final, clear, non-redundant response.
    """

    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        summaries = self.graph_store.get_community_summaries()
        if not summaries:
            return "No community summaries found. Run graph_store.build_communities() first."

        # Step 1: Get a partial answer from each community summary
        community_answers = [
            self._answer_from_community(summary, query_str)
            for summary in summaries.values()
        ]

        # Filter out empty/irrelevant responses
        relevant_answers = [a for a in community_answers if a.strip()]
        if not relevant_answers:
            return "I don't have enough information in the knowledge graph to answer that question."

        # Step 2: Aggregate into one final answer
        return self._aggregate(relevant_answers, query_str)

    def _answer_from_community(self, summary: str, query: str) -> str:
        """
        Ask EXTRACTION_LLM to answer the query from a single community summary.
        Returns empty string if the summary isn't relevant to the question.
        We use the cheaper model here since this runs once per community.
        """
        prompt = (
            f"Community summary:\n{summary}\n\n"
            f"Question: {query}\n\n"
            f"If this summary contains information relevant to the question, answer it. "
            f"If not relevant, reply exactly: 'No relevant information.'\n\n"
            f"Answer:"
        )
        response = EXTRACTION_LLM.complete(prompt)
        text = response.text.strip()
        # Filter out non-answers
        return "" if "no relevant information" in text.lower() else text

    def _aggregate(self, answers: List[str], query: str) -> str:
        """
        Synthesise all relevant partial answers into one final response.
        Uses QUERY_LLM (gpt-4o) for better reasoning quality on the final step.
        """
        combined = "\n\n---\n\n".join(answers)
        prompt = (
            f"You have received answers from multiple knowledge graph communities about this question:\n\n"
            f"Question: {query}\n\n"
            f"Community answers:\n{combined}\n\n"
            f"Synthesise these into a single, clear, well-structured final answer. "
            f"Remove redundancy, keep all important details, and ensure the answer "
            f"directly addresses the question.\n\n"
            f"Final Answer:"
        )
        return self.llm.complete(prompt).text
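To see the two-step pattern in isolation, here is a minimal sketch that swaps the real models for a keyword-matching stub, so it runs without any API key. The names `fake_llm`, `answer_from_community`, `graph_rag_query`, and `toy_summaries` are illustrative assumptions and are not part of the tutorial's code; only the filter-then-aggregate logic mirrors the class above.

```python
from typing import Callable, Dict

NO_INFO = "no relevant information"


def answer_from_community(llm: Callable[[str], str], summary: str, query: str) -> str:
    """Step 1: answer from one summary, or return "" when it is not relevant."""
    prompt = (
        f"Community summary:\n{summary}\n\n"
        f"Question: {query}\n\n"
        "If not relevant, reply exactly: 'No relevant information.'\n\nAnswer:"
    )
    text = llm(prompt).strip()
    return "" if NO_INFO in text.lower() else text


def graph_rag_query(llm: Callable[[str], str], summaries: Dict[int, str], query: str) -> str:
    """Step 2: join the relevant partial answers and ask for one synthesis."""
    partials = [answer_from_community(llm, s, query) for s in summaries.values()]
    relevant = [p for p in partials if p]
    if not relevant:
        return "I don't have enough information in the knowledge graph to answer that question."
    combined = "\n\n---\n\n".join(relevant)
    return llm(f"Question: {query}\n\nCommunity answers:\n{combined}\n\nFinal Answer:")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: keyword matching only."""
    if prompt.startswith("Community summary:"):
        summary = prompt.split("Question:")[0]
        if "Getty" in summary:
            return "Getty Images sued Stability AI."
        return "No relevant information."
    # Aggregation call: report how many partial answers survived the filter
    body = prompt.split("Community answers:\n")[1].split("\n\nFinal Answer:")[0]
    return f"{len(body.split('---'))} partial answer(s) used"


# Toy summaries, made up for this sketch
toy_summaries = {
    0: "Getty Images sued Stability AI over training images.",
    1: "The EU AI Act sets transparency obligations.",
}
result = graph_rag_query(fake_llm, toy_summaries, "Who sued Stability AI?")
print(result)
```

Only the first toy summary passes the relevance filter, so the aggregation prompt receives a single partial answer. With real models, the exact-phrase check is the fragile part: a model that paraphrases the refusal would slip through the filter.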
Step 10 — Load Article Dataset / 步驟十——載入文章資料集¶
Load the articles previously scraped with SerpApi.
載入先前使用 SerpApi 抓取的文章。
# Load from CSV ───────────────────────────────────────────────────
df = pd.read_csv("ai_copyright_dataset.csv")
print(f"Number of articles: {len(df)}")
df.head()

Image: Articles scraped with SerpApi
圖片:使用 SerpApi 抓取的文章
Step 11 — Convert to LlamaIndex Documents / 步驟十一——轉換為 LlamaIndex Documents¶
We wrap each article as a LlamaIndex Document object.
我們將每篇文章包裝成一個 LlamaIndex 的 Document 物件。
Notice we’re keeping the title and source as metadata — we don’t need the LLM to extract those, we already have them. This is exactly the point we made about the ontology: don’t pay an LLM to rediscover information you already have.
請注意,我們把標題與來源保留為中介資料(metadata)——我們不需要 LLM 去抽取這些,因為我們已經擁有它們了。這正是我們先前對本體論所提到的重點:別花錢請 LLM 去重新發現你早已擁有的資訊。
# Wrap each article as a LlamaIndex Document
# text is the main content; everything else is metadata
# Assumption: `articles` is the list of row dicts from the DataFrame loaded in Step 10
articles = df.to_dict("records")
nodes = [
    Document(
        text=article["text"],
        metadata={
            "title": article.get("title", ""),
            "source": article.get("source", ""),
        },
    )
    for article in articles
]
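Scraped datasets often contain rows with empty text or the same article syndicated twice, and each of those would still cost an extraction call. The sketch below shows one way to filter them before wrapping. It assumes the CSV has the `title`, `source`, and `text` columns used above; `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins so the example runs without LlamaIndex installed.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LlamaIndex Document (text plus metadata)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def rows_to_documents(rows: List[Dict[str, str]]) -> List[SimpleDocument]:
    """Wrap rows as documents, skipping empty or duplicate article text."""
    seen = set()
    docs = []
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text or text in seen:
            continue  # nothing to extract from, or already wrapped
        seen.add(text)
        docs.append(
            SimpleDocument(
                text=text,
                metadata={
                    "title": row.get("title") or "",
                    "source": row.get("source") or "",
                },
            )
        )
    return docs


# Toy CSV, made up for this sketch, with the three columns the tutorial reads
sample_csv = io.StringIO(
    "title,source,text\n"
    "Getty v. Stability,Example News,Getty Images sued Stability AI.\n"
    "Empty scrape,Example Blog,\n"
    "Getty v. Stability (repost),Example Wire,Getty Images sued Stability AI.\n"
)
documents = rows_to_documents(list(csv.DictReader(sample_csv)))
print(len(documents), documents[0].metadata)
```

Exact-match deduplication is deliberately crude. Near-duplicates (the same wire story with a different headline or footer) would need fuzzy matching, which is out of scope here.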
Step 12 — Build the Knowledge Graph / 步驟十二——建立知識圖譜¶
Now everything comes together in the extraction pipeline.
現在所有東西都在抽取流程中匯聚在一起。
PropertyGraphIndex orchestrates the full workflow:
PropertyGraphIndex 協調整個工作流程:
- It passes each chunk to GraphRAGExtractor
- The extractor calls the LLM using our ontology-constrained prompt
- The resulting parsed entities and relationships are stored in GraphRAGStore

- 它將每個區塊傳給 GraphRAGExtractor
- 抽取器使用我們受本體論約束的提示詞呼叫 LLM
- 解析後產生的實體與關係會被儲存在 GraphRAGStore 中
This is the most time-consuming step.
這是最耗時的步驟。
# Instantiate the extractor with our ontology prompt
kg_extractor = GraphRAGExtractor(
    llm=EXTRACTION_LLM,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=MAX_PATHS_PER_CHUNK,
    num_workers=NUM_WORKERS,
)

# GraphRAGStore is both the graph database and the community detection engine
graph_store = GraphRAGStore()

print("Building knowledge graph... (this may take a few minutes)")
index = PropertyGraphIndex(
    nodes=nodes,
    kg_extractors=[kg_extractor],
    property_graph_store=graph_store,
    embed_kg_nodes=False,  # we don't need vector similarity search for this GraphRAG pipeline
    show_progress=True,
)
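The chunk, extractor, store flow described above can be mimicked with a toy pipeline. This is only a sketch of the data flow, not how PropertyGraphIndex is implemented: `toy_extractor` parses pre-written `A | REL | B` lines instead of calling an LLM, and `ToyGraphStore` and `build_graph` are hypothetical names invented for the example.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def toy_extractor(chunk: str) -> List[Triplet]:
    """Hypothetical stand-in for the LLM call: parses 'A | REL | B' lines."""
    triplets = []
    for line in chunk.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


class ToyGraphStore:
    """Minimal in-memory store: a set of entities and a list of relations."""

    def __init__(self) -> None:
        self.entities: Set[str] = set()
        self.relations: List[Triplet] = []

    def upsert(self, triplets: List[Triplet]) -> None:
        for subj, rel, obj in triplets:
            self.entities.update((subj, obj))
            if (subj, rel, obj) not in self.relations:
                self.relations.append((subj, rel, obj))


def build_graph(chunks: List[str], max_paths_per_chunk: int = 2) -> ToyGraphStore:
    """Pass every chunk to the extractor, cap its output, store the result."""
    store = ToyGraphStore()
    for chunk in chunks:
        store.upsert(toy_extractor(chunk)[:max_paths_per_chunk])
    return store


# Toy chunks, made up for this sketch
toy_chunks = [
    "Stability AI | DEFENDANT_IN | Getty v. Stability AI\n"
    "Getty Images | OPPOSES | Unlicensed training\n"
    "Stability AI | ADVOCATES_FOR | Fair use",
    "Stability AI | DEFENDANT_IN | Getty v. Stability AI\nnot a triplet line",
]
toy_store = build_graph(toy_chunks)
print(len(toy_store.entities), len(toy_store.relations))
```

The third triplet of the first chunk is dropped by the per-chunk cap, and the repeated triplet in the second chunk is merged rather than stored twice. Those are the two effects a cap like `max_paths_per_chunk` and a shared store would be expected to have.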
Step 13 — Build Communities & Generate Summaries / 步驟十三——建立社群並生成摘要¶
Now we run community detection and generate LLM summaries for each cluster.
現在我們執行社群偵測,並為每個叢集生成 LLM 摘要。
This is the step that enables big-picture queries. Instead of searching for similar chunks, GraphRAG queries these pre-built community briefings — each one a synthesized overview of a topic cluster in your data.
這就是讓宏觀查詢得以實現的步驟。GraphRAG 不去搜尋相似的區塊,而是查詢這些預先建立好的社群簡報——每一份都是你資料中某個主題叢集的綜合概覽。
For our AI copyright dataset, you should see communities forming around things like:
對於我們的 AI 著作權資料集,你應該會看到社群圍繞著以下這類事物而形成:
- US copyright litigation (NYT, Getty, authors suing AI companies)
- EU regulatory activity (EU AI Act, GDPR interactions)
- Training data debates (fair use arguments, dataset licensing)
- Specific AI systems and their legal exposure

- 美國著作權訴訟(《紐約時報》、Getty、控告 AI 公司的作者群)
- 歐盟的監管活動(歐盟 AI 法案、與 GDPR 的交互作用)
- 訓練資料的辯論(合理使用 fair use 的論點、資料集授權)
- 特定的 AI 系統及其所面臨的法律風險
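# Run community detection and generate one LLM summary per community
graph_store.build_communities()
summaries = graph_store.get_community_summaries()
print(f"\n✅ {len(summaries)} community summaries ready for querying")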
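To make the idea of a community concrete, the sketch below groups a toy set of triplets into clusters and collects the relationship text that a summary prompt would receive for each one. Note the swap: it uses plain connected components as a stand-in for the hierarchical community detection the tutorial relies on, so it will never split one large connected cluster into sub-topics. All names and data here are illustrative assumptions.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def detect_communities(triplets: List[Triplet]) -> Dict[str, int]:
    """Assign each entity a community id using connected components.

    Stand-in only: a modularity-based method would also split large
    components into denser sub-clusters.
    """
    neighbours = defaultdict(set)
    for subj, _, obj in triplets:
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)

    community_of: Dict[str, int] = {}
    next_id = 0
    for start in sorted(neighbours):
        if start in community_of:
            continue
        # Breadth-first search labels everything reachable from `start`
        community_of[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in community_of:
                    community_of[other] = next_id
                    queue.append(other)
        next_id += 1
    return community_of


def briefing_inputs(triplets: List[Triplet], community_of: Dict[str, int]) -> Dict[int, List[str]]:
    """Group relationship lines by community, ready to paste into a summary prompt."""
    grouped = defaultdict(list)
    for subj, rel, obj in triplets:
        grouped[community_of[subj]].append(f"{subj} -> {rel} -> {obj}")
    return dict(grouped)


# Toy triplets, made up for this sketch
toy_triplets = [
    ("Stability AI", "DEFENDANT_IN", "Getty v. Stability AI"),
    ("Getty Images", "OPPOSES", "Stability AI"),
    ("European Commission", "ADVOCATES_FOR", "EU AI Act"),
]
toy_communities = detect_communities(toy_triplets)
toy_inputs = briefing_inputs(toy_triplets, toy_communities)
for community_id, lines in sorted(toy_inputs.items()):
    print(community_id, lines)
```

The litigation entities land in one community and the EU entities in another, which is the shape of input the briefing prompt is written for.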
Step 14 — Visualize the Knowledge Graph / 步驟十四——將知識圖譜視覺化¶
Create a d3.js network graph visualization from the graph we built.
從我們建立的圖譜建立一個 d3.js 網絡圖視覺化。
Open ai_copyright_graph.html in your browser after running this cell.
執行這個程式碼儲存格(cell)後,在你的瀏覽器中開啟 ai_copyright_graph.html。
def export_graph_data(graph_store: GraphRAGStore, output_file: str = "graph_data.json"):
    """
    Export the knowledge graph to a JSON file compatible with the D3.js template.

    Run this once after graph_store.build_communities().
    Saves to disk so you can re-run the visualization without reprocessing
    the full pipeline - which is expensive.
    """
    # Walks the graph store, builds a node metadata lookup, collects all edges,
    # and writes everything to a JSON file for the D3 template to consume.
    # Full implementation in the GitHub repo.


def visualize_graph(
    graph_store: GraphRAGStore = None,
    graph_data_file: str = "graph_data.json",
    template_file: str = "graph_template.html",
    output_file: str = GRAPH_OUTPUT_FILE,
):
    """
    Generate a D3.js interactive knowledge graph visualization.

    Can be called in two ways:
    1. Pass graph_store directly - exports fresh data then builds the HTML:
           visualize_graph(graph_store=graph_store)
    2. Load from a previously exported JSON file - no pipeline re-run needed:
           visualize_graph()  # loads graph_data.json automatically

    Features:
    - Color-coded nodes by entity type (from ontology)
    - Node size scales with degree (more connections = bigger)
    - Click a node to highlight its neighbourhood, dim everything else
    - Sidebar legend - click any entity type to toggle its visibility
    - Search bar to find and highlight any node by name
    - Sliders for link distance, charge strength, and edge label threshold
    - Hover tooltips with entity name, type, description, and connection count
    - Stats header showing total nodes, edges, and communities
    """
    # Exports the graph data to JSON, loads the D3 HTML template,
    # injects the data, and writes out the final interactive visualization file.
    # Full implementation in the GitHub repo.


# Export data and build visualization from the graph store
visualize_graph(graph_store=graph_store)
The result:
結果:

Image source: GraphRAG visualization
圖片來源:GraphRAG 視覺化
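Since the export function is only outlined in this article, here is a rough sketch of what a JSON payload for a D3 force layout could look like. The field names (`id`, `type`, `community`, `degree`, `radius`, `label`, `stats`) and the `to_d3_payload` helper are assumptions made for illustration; the actual schema lives in the repo's template and may differ.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def to_d3_payload(
    triplets: List[Triplet],
    entity_types: Dict[str, str],
    community_of: Dict[str, int],
) -> dict:
    """Build a nodes/links dictionary with degree-based node sizes."""
    degree = Counter()
    for subj, _, obj in triplets:
        degree[subj] += 1
        degree[obj] += 1

    nodes = [
        {
            "id": name,
            "type": entity_types.get(name, "UNKNOWN"),
            "community": community_of.get(name, -1),
            "degree": degree[name],
            "radius": 6 + 2 * degree[name],  # more connections = bigger node
        }
        for name in sorted(degree)
    ]
    links = [{"source": s, "target": o, "label": r} for s, r, o in triplets]
    stats = {
        "nodes": len(nodes),
        "edges": len(links),
        "communities": len({n["community"] for n in nodes}),
    }
    return {"nodes": nodes, "links": links, "stats": stats}


# Toy inputs, made up for this sketch (entity type labels are hypothetical)
toy_edges = [
    ("Stability AI", "DEFENDANT_IN", "Getty v. Stability AI"),
    ("Getty Images", "OPPOSES", "Stability AI"),
]
toy_types = {
    "Stability AI": "ORGANIZATION",
    "Getty Images": "ORGANIZATION",
    "Getty v. Stability AI": "LEGAL_CASE",
}
toy_membership = {name: 0 for name in toy_types}

payload = to_d3_payload(toy_edges, toy_types, toy_membership)
print(json.dumps(payload["stats"]))
```

Writing `json.dumps(payload)` to a file once and reloading it for later renders is what lets the visualization be rebuilt without re-running extraction.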
Step 15 — Query the System / 步驟十五——查詢系統¶
Now we run queries that would be impossible to answer reliably with standard RAG.
現在我們執行一些用標準 RAG 不可能可靠回答的查詢。
The GraphRAGQueryEngine asks the LLM to answer the question from each community summary, drops the summaries the LLM flags as not relevant, and synthesizes a final answer from the remaining partial answers, all using the community briefings generated in Step 13.
GraphRAGQueryEngine 會請 LLM 根據每個社群摘要分別回答問題、略過被 LLM 判定為不相關的摘要,並將其餘的部分答案綜合成一個最終答案;這一切都使用步驟十三中生成的社群簡報。
query_engine = GraphRAGQueryEngine(
graph_store=graph_store,
llm=QUERY_LLM, # gpt-4o for final synthesis
)
print("✅ Query engine ready")
First, a big-picture thematic question, exactly the kind that standard RAG struggles with:
首先,一個宏觀的主題性問題,正是標準 RAG 難以應付的那種:
# ── Query 1: Big-picture thematic question ────────────────────────────────────
# Requires synthesising the main legal arguments across many articles.
# Standard RAG would struggle — no single chunk contains the full picture.
q1 = "What are the main legal arguments being made around AI copyright and training data?"
print(f"Query: {q1}")
print("=" * 70)
print(query_engine.custom_query(q1))
The answer:
答案:
Query: What are the main legal arguments being made around AI copyright and training data?
======================================================================
The main legal arguments surrounding AI copyright and training data focus on several key issues:
1. **Originality and Authorship**: Traditional copyright laws require human authorship and originality for a work to be eligible for copyright protection. This raises questions about whether AI-generated content can be considered original and who, if anyone, can claim authorship. Cases like Naruto v. Slater have established that non-humans cannot hold copyright, complicating the status of AI-generated works.
2. **Fair Use Doctrine**: The use of copyrighted materials for training AI systems is a contentious issue. Proponents argue that such use could be considered fair use, especially if it is transformative and does not substitute the original work. Legal precedents like Authors Guild v. Google support this view, but the interpretation of fair use in the context of AI remains debated.
3. **Ownership of AI-Generated Works**: There is ongoing debate about who owns the rights to AI-generated content. Potential claimants include the developers of the AI, the users who input data, or the AI itself, though current laws typically do not recognize AI as an author.
4. **Ethical Use of Training Data**: The legality of using copyrighted material without permission for AI training is a significant concern. Lawsuits, such as those involving The New York Times and OpenAI, highlight the tension between innovation and the rights of content creators.
5. **Transparency and Opt-Out Mechanisms**: Advocates call for transparency in AI training processes and the implementation of opt-out mechanisms to protect individual rights and ensure informed consent regarding the use of personal data.
6. **Balancing Innovation and Protection**: There is a need to balance fostering innovation in AI with protecting intellectual property rights. This includes discussions on potential legal reforms to adapt copyright laws to the realities of AI-generated content.
These arguments reflect the complexities of integrating AI advancements with existing intellectual property laws and underscore the necessity for updated regulations to address these challenges.
中文翻譯:
查詢:圍繞 AI 著作權與訓練資料所提出的主要法律論點有哪些?
======================================================================
圍繞 AI 著作權與訓練資料的主要法律論點集中在幾個關鍵議題上:
- 原創性與著作人身分(Originality and Authorship):傳統著作權法要求一件作品必須具有人類著作人身分與原創性,才有資格受著作權保護。這引發了關於 AI 生成內容是否能被視為原創、以及究竟誰(如果有的話)能主張著作人身分的疑問。像「Naruto v. Slater」這樣的案件已確立非人類不能持有著作權,使得 AI 生成作品的地位變得更為複雜。
- 合理使用原則(Fair Use Doctrine):使用受著作權保護的素材來訓練 AI 系統是一個具爭議的議題。支持者主張此種使用或可被視為合理使用,特別是當它具有轉化性(transformative)且不會取代原作時。像「Authors Guild v. Google」這樣的法律先例支持此一觀點,但在 AI 脈絡下對合理使用的詮釋仍存在爭論。
- AI 生成作品的所有權(Ownership of AI-Generated Works):關於誰擁有 AI 生成內容的權利,目前仍有持續的爭論。潛在的權利主張者包括 AI 的開發者、輸入資料的使用者,或 AI 本身,儘管現行法律通常不承認 AI 為著作人。
- 訓練資料的合乎倫理使用(Ethical Use of Training Data):未經許可使用受著作權保護的素材來訓練 AI 的合法性是一項重大疑慮。諸如涉及《紐約時報》與 OpenAI 的訴訟,凸顯了創新與內容創作者權利之間的緊張關係。
- 透明度與退出機制(Transparency and Opt-Out Mechanisms):倡議者呼籲 AI 訓練過程應具透明度,並導入退出(opt-out)機制,以保護個人權利並確保在使用個人資料方面取得知情同意。
- 在創新與保護之間取得平衡(Balancing Innovation and Protection):需要在促進 AI 創新與保護智慧財產權之間取得平衡。這包括了關於潛在法律改革的討論,以使著作權法適應 AI 生成內容的現實。
這些論點反映了將 AI 進展與現有智慧財產權法整合的複雜性,並凸顯了制定更新法規以應對這些挑戰的必要性。
Now a cross-entity relationship question:
現在來看一個跨實體的關係問題:
# ── Query 2: Cross-entity relationship question ────────────────────────────────
# Requires tracing which companies appear in disputes and what positions they hold.
# GraphRAG can follow DEFENDANT_IN, ADVOCATES_FOR, OPPOSES edges to answer this.
q2 = "Which companies are involved in AI copyright or governance disputes, and what are their positions?"
print(f"Query: {q2}")
print("=" * 70)
print(query_engine.custom_query(q2))
The answer:
答案:
Query: Which companies are involved in AI copyright or governance disputes, and what are their positions?
======================================================================
Several companies are currently involved in AI copyright and governance disputes, each with distinct positions and challenges:
1. **Stability AI**: Facing legal challenges from artists and Getty Images, Stability AI is accused of using copyrighted images without authorization to train its AI model, Stable Diffusion. The company focuses on AI development and innovation, while the plaintiffs advocate for the protection of intellectual property rights.
2. **OpenAI and Meta**: Both companies are engaged in legal disputes over the use of copyrighted material for training generative AI systems. OpenAI is specifically facing a lawsuit from The New York Times for copyright infringement. The disputes revolve around the complexities of copyright law and the implications of the Fair Use Doctrine.
3. **Thomson Reuters and Ross Intelligence**: Thomson Reuters has filed a lawsuit against Ross Intelligence, alleging unauthorized use of its copyrighted materials to train AI models. Ross Intelligence is defending against these accusations, which question the legality of using copyrighted content for AI training.
4. **Midjourney**: Involved in discussions about copyright protections for AI-generated works, Midjourney challenges the U.S. Copyright Office's regulations, advocating for establishing copyright protections for AI-generated content.
5. **DABUS and LAION**: DABUS is involved in legal battles over its status as an inventor, challenging traditional notions of inventorship in AI contexts. LAION, a non-profit organization providing datasets for AI training, faces ethical concerns regarding data usage and copyright, raising questions about its responsibilities in AI training processes.
These disputes highlight ongoing tensions in the legal landscape regarding AI's role in innovation, the ethical implications of data sourcing, and the protection of intellectual property rights in the context of AI technologies.
中文翻譯:
查詢:哪些公司涉入了 AI 著作權或治理爭議,以及它們的立場為何?
======================================================================
目前有數家公司涉入 AI 著作權與治理爭議,每一家都各有不同的立場與挑戰:
- Stability AI:面臨來自藝術家與 Getty Images 的法律挑戰,Stability AI 被指控未經授權使用受著作權保護的圖像來訓練其 AI 模型 Stable Diffusion。該公司聚焦於 AI 開發與創新,而原告則倡議保護智慧財產權。
- OpenAI 與 Meta:這兩家公司都因使用受著作權保護的素材來訓練生成式 AI 系統而捲入法律爭議。OpenAI 特別面臨《紐約時報》提起的著作權侵權訴訟。這些爭議圍繞著著作權法的複雜性以及合理使用原則的影響。
- Thomson Reuters 與 Ross Intelligence:Thomson Reuters 已對 Ross Intelligence 提起訴訟,指控其未經授權使用其受著作權保護的素材來訓練 AI 模型。Ross Intelligence 正在針對這些指控進行抗辯,這些指控質疑了使用受著作權保護內容來訓練 AI 的合法性。
- Midjourney:涉入關於 AI 生成作品著作權保護的討論,Midjourney 對美國著作權局(U.S. Copyright Office)的法規提出挑戰,倡議為 AI 生成內容建立著作權保護。
- DABUS 與 LAION:DABUS 涉入關於其作為發明人身分的法律爭訟,挑戰了 AI 脈絡下傳統的發明人身分(inventorship)概念。LAION 是一個為 AI 訓練提供資料集的非營利組織,面臨關於資料使用與著作權的倫理疑慮,引發了關於其在 AI 訓練過程中責任的疑問。
這些爭議凸顯了法律環境中關於 AI 在創新中的角色、資料來源取得的倫理影響,以及在 AI 技術脈絡下保護智慧財產權等方面持續存在的緊張關係。
And finally, a comparative policy question that requires reasoning across different governments and jurisdictions:
最後,是一個比較性的政策問題,需要跨不同政府與司法管轄區進行推理:
# ── Query 3: Comparative policy question ──────────────────────────────────────
# Requires reasoning across different governments and jurisdictions —
# connecting EU, US, and UK regulatory approaches from different articles.
q3 = "How are different governments (e.g. EU, US, and UK) approaching AI governance?"
print(f"Query: {q3}")
print("=" * 70)
print(query_engine.custom_query(q3))
The result:
結果:
Query: How are different governments (e.g. EU, US, and UK) approaching AI governance?
======================================================================
Different governments are adopting varied approaches to AI governance, reflecting their unique legal and regulatory landscapes. The European Union is advancing its regulatory framework through the EU AI Act, which aims to establish comprehensive guidelines for AI systems. This act represents a proactive approach to AI governance, focusing on creating a structured environment for AI development and deployment.
In contrast, the United States has a more traditional stance, particularly in the realm of intellectual property. The U.S. Patent and Trademark Office (USPTO) does not recognize AI as a legal author or inventor, as evidenced by legal cases such as Thaler v. Perlmutter and Thaler v. Vidal. These cases underscore the US's position that only natural persons can be recognized as inventors on patent applications, which may not fully accommodate AI-generated inventions.
The UK's approach to AI governance is not explicitly detailed in the provided summaries, but it is noted that the UK Intellectual Property Office is considering the implications of AI in innovation, potentially indicating a more progressive stance similar to the European Patent Office (EPO). This suggests that the UK may be aligning its policies with broader European perspectives, although specific details are not provided.
Overall, these differing approaches highlight the regulatory complexities and the need for clear guidelines as governments navigate the intersection of technology and intellectual property law in the context of AI.
中文翻譯:
查詢:不同的政府(例如歐盟、美國與英國)是如何處理 AI 治理的?
======================================================================
不同的政府正採取各異的 AI 治理方法,反映出它們各自獨特的法律與監管環境。歐盟正透過歐盟 AI 法案(EU AI Act)推進其監管框架,該法案旨在為 AI 系統建立全面性的準則。這項法案代表了一種對 AI 治理的前瞻性做法,聚焦於為 AI 的開發與部署打造一個結構化的環境。
相比之下,美國則採取較為傳統的立場,特別是在智慧財產權領域。美國專利商標局(USPTO)不承認 AI 為合法的著作人或發明人,諸如「Thaler v. Perlmutter」與「Thaler v. Vidal」等法律案件即為佐證。這些案件凸顯了美國的立場,即只有自然人才能在專利申請中被承認為發明人,而這可能無法完全容納 AI 生成的發明。
英國對 AI 治理的做法在所提供的摘要中並未明確詳述,但值得注意的是,英國智慧財產局(UK Intellectual Property Office)正在考量 AI 在創新中的影響,這可能顯示出一種與歐洲專利局(EPO)類似、較為進步的立場。這暗示英國可能正將其政策與更廣泛的歐洲觀點對齊,儘管並未提供具體細節。
總體而言,這些不同的做法凸顯了監管上的複雜性,以及在各政府於 AI 脈絡下穿梭於技術與智慧財產權法交會處時,對於明確準則的需求。
Wrap Up / 總結¶
And there you have it — a fully working GraphRAG system, built on live data you scraped yourself, reasoning over one of the most complex and fast-moving topics in AI right now.
這樣就完成了——一套完全可運作的 GraphRAG 系統,建立在你自己抓取的即時資料之上,並對當前 AI 領域中最複雜、變化最快速的主題之一進行推理。
You can access the full code here, and if you want to try this on your own search topics, you can sign up for SerpApi here.
你可以在這裡取得完整程式碼,而如果你想在自己的搜尋主題上嘗試這套系統,可以在這裡註冊 SerpApi。
Thanks for reading!
感謝閱讀!
See you in the next one.
我們下次再見。