Knowledge Graph Storage and Querying: RDF/Neo4j/SPARQL

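“The value of a knowledge graph lies not only in building it but also in storing and querying it: how to get data in efficiently and get it out quickly.”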

1. Two Storage Paradigms for Knowledge Graphs

1.1 RDF Graphs vs Property Graphs

Two approaches dominate knowledge graph storage today:

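Feature | RDF graph | Property graph
--- | --- | ---
Standard | W3C standards (RDF/OWL) | No unified standard until recently (largely vendor-specific; ISO GQL was published in 2024)
Representation | Triples (Subject, Predicate, Object) | Nodes + edges + properties
Query language | SPARQL | Cypher / Gremlin
Reasoning | RDFS/OWL inference built into many stores | Usually no built-in inference
Representative products | Jena, Virtuoso, GraphDB | Neo4j, JanusGraph, TigerGraph
Typical use cases | Academia, Semantic Web, government data | Enterprise applications, real-time queries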

RDF triple representation:
(马化腾, founded, 腾讯)
(腾讯, headquartered in, 深圳)

Property graph representation:
(node: 马化腾 {type: "Person", name: "马化腾"})
  |
  | [FOUNDED]
  v
(node: 腾讯 {type: "Company", name: "腾讯", founded: 1998})
  |
  | [HEADQUARTERED_IN]
  v
(node: 深圳 {type: "City", name: "深圳"})
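
To make the contrast concrete, here is a minimal sketch of the same facts in both models, using plain Python containers rather than any database API. The node ids (n1, n2, n3) and the helper property_graph_to_triples are names invented for this illustration.

```python
# RDF style: every statement, including attributes, is one (subject, predicate, object) triple.
triples = {
    ("马化腾", "founded", "腾讯"),
    ("腾讯", "headquarteredIn", "深圳"),
    ("腾讯", "foundingYear", 1998),
}

# Property-graph style: nodes and edges each carry their own key-value properties.
nodes = {
    "n1": {"labels": ["Person"], "props": {"name": "马化腾"}},
    "n2": {"labels": ["Company"], "props": {"name": "腾讯", "founded": 1998}},
    "n3": {"labels": ["City"], "props": {"name": "深圳"}},
}
edges = [
    {"start": "n1", "type": "FOUNDED", "end": "n2", "props": {"year": 1998}},
    {"start": "n2", "type": "HEADQUARTERED_IN", "end": "n3", "props": {}},
]


def property_graph_to_triples(nodes, edges):
    """Flatten a property graph into triples; edge properties are dropped."""
    out = set()
    for node_id, node in nodes.items():
        for key, value in node["props"].items():
            out.add((node_id, key, value))
    for edge in edges:
        out.add((edge["start"], edge["type"], edge["end"]))
    return out


flat = property_graph_to_triples(nodes, edges)
print(len(flat))  # 4 node-property triples + 2 edge triples
```

The flattening loses the year property on the FOUNDED edge, which hints at the main modeling difference: a property graph attaches data to edges directly, while plain RDF needs an extra construct (such as reification) to say something about a statement.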

1.2 How to Choose?

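Small to medium scale (< 100 million triples or relationships) → Neo4j (easy to use, good performance)
Large scale + academic or Semantic Web use → RDF store (Jena/Virtuoso)
Needs distributed scale-out → JanusGraph / TigerGraph
Needs reasoning → GraphDB / Stardog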

2. RDF and SPARQL

2.1 RDF Basics

RDF (Resource Description Framework) is the W3C standard for representing knowledge as subject-predicate-object triples.

<!-- RDF/XML example -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:kg="http://example.org/knowledge#">

  <kg:Person rdf:about="http://example.org/knowledge#MH">
    <kg:name>马化腾</kg:name>
    <kg:born>1971</kg:born>
    <!-- The relation is itself a triple: MH founded Tencent -->
    <kg:founded rdf:resource="http://example.org/knowledge#Tencent"/>
  </kg:Person>

  <kg:Company rdf:about="http://example.org/knowledge#Tencent">
    <kg:name>腾讯</kg:name>
    <!-- A separate property for the year, so kg:founded always links a person to a company -->
    <kg:foundingYear>1998</kg:foundingYear>
  </kg:Company>

</rdf:RDF>

# The more commonly used Turtle format
@prefix kg: <http://example.org/knowledge#> .

kg:MH a kg:Person ;
    kg:name "马化腾" ;
    kg:born 1971 ;
    kg:founded kg:Tencent .

kg:Tencent a kg:Company ;
    kg:name "腾讯" ;
    kg:foundingYear 1998 ;
    kg:headquartered kg:Shenzhen .

kg:Shenzhen a kg:City ;
    kg:name "深圳" .

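2.2 SPARQL Queries

SPARQL is the query language for RDF. Its SELECT ... WHERE form resembles SQL, but the WHERE block matches triple patterns instead of table rows.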

# Query 1: which companies did 马化腾 (kg:MH) found?
PREFIX kg: <http://example.org/knowledge#>

SELECT ?company ?name
WHERE {
  kg:MH kg:founded ?company .
  ?company kg:name ?name .
}

# Result:
# ?company = kg:Tencent, ?name = "腾讯"

# Query 2: find all people and the companies they founded
PREFIX kg: <http://example.org/knowledge#>

SELECT ?person ?personName ?company ?companyName
WHERE {
  ?person a kg:Person ;
          kg:name ?personName ;
          kg:founded ?company .
  ?company a kg:Company ;
           kg:name ?companyName .
}

# Query 3: multi-hop query
# Find the city where the company of a person's partner is headquartered
PREFIX kg: <http://example.org/knowledge#>

SELECT ?person ?city
WHERE {
  ?person kg:partneredWith ?partner .
  ?partner kg:worksAt ?company .
  ?company kg:headquartered ?city .
}
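
Conceptually, a SPARQL engine evaluates a WHERE block by matching each triple pattern against the data and joining the resulting variable bindings on shared variables. The sketch below is a naive pure-Python illustration of that idea, not how Jena or any real engine is implemented; is_var, match_pattern, and evaluate are hypothetical helper names, and terms are plain strings rather than real IRIs.

```python
def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`; None on failure."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Join all triple patterns and return every consistent variable binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


triples = [
    ("kg:MH", "kg:founded", "kg:Tencent"),
    ("kg:MH", "kg:name", "马化腾"),
    ("kg:Tencent", "kg:name", "腾讯"),
]
query = [
    ("kg:MH", "kg:founded", "?company"),
    ("?company", "kg:name", "?name"),
]
results = evaluate(query, triples)
print(results)  # [{'?company': 'kg:Tencent', '?name': '腾讯'}]
```

Each additional pattern that reuses a variable (here ?company) acts as one more hop, which is why multi-hop queries are just longer lists of patterns.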
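
# Query 4: OPTIONAL (optional match)
PREFIX kg: <http://example.org/knowledge#>

SELECT ?person ?name ?company
WHERE {
  ?person a kg:Person ;
          kg:name ?name .
  OPTIONAL { ?person kg:founded ?company }
}

# A person may not have founded a company; OPTIONAL leaves ?company unbound in that row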
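
OPTIONAL behaves like a left outer join: every solution of the required part is kept, and the optional variables are filled in only when a match exists. A minimal sketch of that behavior in plain Python follows; the second person (kg:Alice) and the helper name left_join_founded are made up for the illustration.

```python
people = [
    {"person": "kg:MH", "name": "马化腾"},
    {"person": "kg:Alice", "name": "Alice"},  # hypothetical person with no company
]
founded = [("kg:MH", "kg:Tencent")]  # (person, company) pairs


def left_join_founded(people, founded):
    """Keep every person; attach a company when one exists, else None."""
    rows = []
    for row in people:
        companies = [c for p, c in founded if p == row["person"]]
        if companies:
            rows.extend({**row, "company": c} for c in companies)
        else:
            rows.append({**row, "company": None})  # ?company stays unbound
    return rows


rows = left_join_founded(people, founded)
for row in rows:
    print(row)
```

Without OPTIONAL (an inner join), the row for the person with no company would disappear from the result.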
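
# Query 5: FILTER (filter condition)
# Assumes people carry a numeric kg:age property (the sample data above only has kg:born)
PREFIX kg: <http://example.org/knowledge#>

SELECT ?person ?age
WHERE {
  ?person a kg:Person ;
          kg:age ?age .
  FILTER (?age >= 30 && ?age <= 50)
}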

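2.3 Using Apache Jena

Jena is an open-source RDF framework from the Apache Software Foundation. It is written in Java, and its Fuseki server exposes a SPARQL endpoint over HTTP. The Python snippet below uses rdflib, a separate pure-Python RDF library, to build and query the same small graph in memory.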

from rdflib import Graph, Namespace, Literal

# Create an in-memory graph
g = Graph()
kg = Namespace("http://example.org/knowledge#")

# Add triples. kg.MH expands to <http://example.org/knowledge#MH>,
# the same IRI that kg:MH refers to in the query below.
g.add((kg.MH, kg.founded, kg.Tencent))
g.add((kg.MH, kg.name, Literal("马化腾")))

# SPARQL query
query = """
PREFIX kg: <http://example.org/knowledge#>
SELECT ?company
WHERE { kg:MH kg:founded ?company }
"""

results = g.query(query)
for row in results:
    print(row.company)  # http://example.org/knowledge#Tencent
3. Property Graphs and Cypher (Neo4j)

3.1 Neo4j Basics

Neo4j is one of the most widely used property graph databases and uses Cypher as its query language.

// Create nodes
CREATE (mh:Person {name: "马化腾", born: 1971})
CREATE (tencent:Company {name: "腾讯", founded: 1998})
CREATE (shenzhen:City {name: "深圳"})

// Create relationships (same statement, so mh, tencent and shenzhen are still in scope)
CREATE (mh)-[:FOUNDED {year: 1998}]->(tencent)
CREATE (tencent)-[:HEADQUARTERED_IN]->(shenzhen)
CREATE (mh)-[:WORKS_AT]->(tencent);

// Return the whole graph (a separate statement)
MATCH (n) RETURN n;

3.2 Cypher Queries

// Query 1: find the companies founded by 马化腾
MATCH (p:Person {name: "马化腾"})-[r:FOUNDED]->(c:Company)
RETURN c.name AS company

// Result: 腾讯

// Query 2: multi-hop query
// Find the city where 马化腾's company is headquartered
MATCH (p:Person {name: "马化腾"})-[:FOUNDED]->(c:Company)-[:HEADQUARTERED_IN]->(city:City)
RETURN city.name

// Result: 深圳
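
// Query 3: find all people and the companies they founded
// (including people with no FOUNDED relationship)
MATCH (p:Person)
OPTIONAL MATCH (p)-[:FOUNDED]->(c:Company)
RETURN p.name, collect(c.name) AS companies

// collect() skips nulls, so a person with no company gets an empty list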

// Query 4: path query
// Find any path from 马化腾 to 深圳 (at most 5 hops, direction ignored)
MATCH path = (p:Person {name: "马化腾"})-[*1..5]-(c:City {name: "深圳"})
RETURN path

// Possible results:
// 马化腾 -[FOUNDED]-> 腾讯 -[HEADQUARTERED_IN]-> 深圳
// 马化腾 -[WORKS_AT]-> 腾讯 -[HEADQUARTERED_IN]-> 深圳
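
The two result paths can be reproduced with a small depth-first enumeration. The sketch below assumes the three relationships created in section 3.1 and mimics two properties of the variable-length pattern: direction is ignored, and a relationship is never used twice within one path. The function name find_paths is made up for the illustration and says nothing about how Neo4j executes the query internally.

```python
edges = [
    ("马化腾", "FOUNDED", "腾讯"),
    ("马化腾", "WORKS_AT", "腾讯"),
    ("腾讯", "HEADQUARTERED_IN", "深圳"),
]


def find_paths(edges, start, goal, max_hops=5):
    """Enumerate paths of 1..max_hops relationships from start to goal,
    ignoring direction and never reusing a relationship within one path."""
    paths = []

    def walk(node, used, path):
        if path and node == goal:
            paths.append(list(path))
        if len(path) == max_hops:
            return
        for index, (src, rel, dst) in enumerate(edges):
            if index in used:
                continue
            if src == node:
                nxt = dst
            elif dst == node:
                nxt = src
            else:
                continue
            walk(nxt, used | {index}, path + [(src, rel, dst)])

    walk(start, frozenset(), [])
    return paths


paths = find_paths(edges, "马化腾", "深圳")
for path in paths:
    print(" / ".join(f"{s} -[{r}]-> {d}" for s, r, d in path))
```

The number of candidate paths grows quickly with the hop limit on dense graphs, which is why an upper bound such as 5 is worth keeping in the pattern.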
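
// Query 5: shortest path
MATCH path = shortestPath(
  (p1:Person {name: "马化腾"})-[*]-(p2:Person {name: "李嘉诚"})
)
RETURN path

// Returns one shortest path between the two people;
// use allShortestPaths(...) to get every path of that minimal length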
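
On an unweighted graph, a shortest path in terms of hop count can be found with breadth-first search. The following is a minimal sketch of that standard technique, not a description of Neo4j's internal algorithm; the node names A to E and the adjacency dictionary are placeholders.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Undirected placeholder graph: each edge is listed in both directions.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_path(adjacency, "A", "E"))  # ['A', 'B', 'D', 'E']
```

A-C-D-E is equally short; breadth-first search simply returns the first one it reaches, which mirrors the difference between asking for one shortest path and asking for all of them.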

// Query 6: aggregation
// Count how many founders each company has
MATCH (p:Person)-[:FOUNDED]->(c:Company)
RETURN c.name AS company, count(p) AS founder_count
ORDER BY founder_count DESC
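
In Cypher, the non-aggregated columns in RETURN act as the grouping key, so the query above groups by company name without an explicit GROUP BY. The same computation over a list of (founder, company) pairs looks like this in Python; the pair ("P3", "C2") is a placeholder, not real data.

```python
from collections import Counter

founded_edges = [
    ("马化腾", "腾讯"),
    ("张志东", "腾讯"),
    ("P3", "C2"),  # hypothetical placeholder founder and company
]

# Group by company and count founders, then sort by count descending.
founder_count = Counter(company for _person, company in founded_edges)
ranked = founder_count.most_common()
print(ranked)  # [('腾讯', 2), ('C2', 1)]
```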
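
// Query 7: counting matches of a pattern
// Find companies with at least 3 employees
MATCH (c:Company)
WITH c, COUNT { (:Person)-[:WORKS_AT]->(c) } AS employee_count
WHERE employee_count >= 3
RETURN c.name, employee_count
// Neo4j 5 no longer accepts size() over a pattern; on 4.x write size((:Person)-[:WORKS_AT]->(c))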

3.3 Using Neo4j from Python

from neo4j import GraphDatabase

class Neo4jConnection:
    """Thin wrapper around the official Neo4j Python driver."""

    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def query(self, cypher, parameters=None):
        # Materialize the records before the session closes
        with self.driver.session() as session:
            result = session.run(cypher, parameters)
            return [dict(record) for record in result]

    def close(self):
        self.driver.close()

# Connect
conn = Neo4jConnection("bolt://localhost:7687", "neo4j", "password")

# Insert data (values go in as parameters, never via string formatting)
conn.query("""
    CREATE (mh:Person {name: $name, born: $born})
""", {"name": "马化腾", "born": 1971})

# Query
results = conn.query("""
    MATCH (p:Person)-[:FOUNDED]->(c:Company)
    WHERE p.name = $name
    RETURN c.name AS company
""", {"name": "马化腾"})

print(results)  # [{'company': '腾讯'}] once the graph from section 3.1 is loaded

4. Indexing and Query Optimization for Knowledge Graphs

4.1 Graph Indexes

// Create indexes on frequently queried properties
CREATE INDEX person_name_index FOR (p:Person) ON (p.name);
CREATE INDEX company_name_index FOR (c:Company) ON (c.name);

// Composite index
CREATE INDEX person_born_index FOR (p:Person) ON (p.name, p.born);

// List all indexes
SHOW INDEXES;
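
Why an index helps can be seen with a toy model: without one, finding a person by name means inspecting every Person node; with one, the lookup goes straight to the matching entries. The sketch below uses a Python dictionary as a stand-in. Real database indexes are typically tree-based rather than hash-based, and the names scan, seek, and name_index are invented for this illustration.

```python
from collections import defaultdict

# 1000 filler nodes plus the one we are looking for
people = [{"id": i, "name": f"person-{i}"} for i in range(1000)]
people.append({"id": 1000, "name": "马化腾"})


def scan(people, name):
    """No index: check every node that carries the label."""
    return [p for p in people if p["name"] == name]


# Build the "index" once: property value -> list of nodes
name_index = defaultdict(list)
for p in people:
    name_index[p["name"]].append(p)


def seek(name_index, name):
    """With an index: jump directly to the matching nodes."""
    return name_index.get(name, [])


print(scan(people, "马化腾") == seek(name_index, "马化腾"))  # True
```

Both functions return the same nodes; the difference is that scan touches all 1001 entries on every call, while seek does a single lookup at the cost of maintaining the extra structure on writes.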

4.2 Relationship Property Indexes

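// Index a property on a relationship type (relationship indexes are available since Neo4j 4.3)
CREATE INDEX rel_type_index FOR ()-[r:FOUNDED]-() ON (r.year);

// Inspect the query plan without executing the query
EXPLAIN
MATCH (p:Person)-[:FOUNDED]->(c:Company)
WHERE p.name = "马化腾"
RETURN c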

4.3 Query Optimization Tips

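// A WHERE placed directly after MATCH is planned together with the pattern,
// so this prefix filter can use the index on :Person(name)
MATCH (p:Person)
WHERE p.name STARTS WITH "马"
RETURN p

// An inline property map is shorthand for an equality test only;
// {name: STARTS WITH "马"} is not valid Cypher
MATCH (p:Person {name: "马化腾"})
RETURN p

// ❌ Inefficient: returns whole nodes with all their properties
MATCH (p:Person)-[:FOUNDED]->(c:Company)
RETURN p, c

// ✅ Efficient: return only the properties you need
MATCH (p:Person)-[:FOUNDED]->(c:Company)
RETURN p.name, c.name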

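5. Reasoning over Knowledge Graphs

5.1 RDF Reasoning

RDF supports reasoning based on RDFS/OWL: given schema statements, an inference engine derives triples that were never asserted explicitly.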

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix kg: <http://example.org/knowledge#> .

# Define the class hierarchy
kg:Person rdfs:subClassOf kg:Agent .
kg:Company rdfs:subClassOf kg:Agent .

# Define the property hierarchy
kg:founded rdfs:subPropertyOf kg:involves .
kg:headquartered rdfs:subPropertyOf kg:involves .

# Define the property's domain and range
kg:founded rdfs:domain kg:Person .
kg:founded rdfs:range kg:Company .

# With inference enabled, the following can be derived automatically:
# "马化腾 is an Agent" (because Person is a subclass of Agent)
# "马化腾 involves 腾讯" (because founded is a subproperty of involves)
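
These derivations can be reproduced with a tiny forward-chaining loop that applies four RDFS entailment rules (rdfs2, rdfs3, rdfs7, rdfs9) until nothing new appears. This is a minimal sketch over string triples, assuming the schema above plus one asserted fact; it is not a complete RDFS reasoner, and rdfs_closure is a name made up for the illustration.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply four RDFS rules repeatedly until no new triple is produced."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 == SUBPROP and s2 == p:
                    new.add((s, o2, o))     # rdfs7: inherit via subPropertyOf
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))  # rdfs2: subject gets the domain class
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))  # rdfs3: object gets the range class
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))  # rdfs9: inherit via subClassOf
        if new <= inferred:
            return inferred
        inferred |= new


triples = {
    ("kg:Person", SUBCLASS, "kg:Agent"),
    ("kg:Company", SUBCLASS, "kg:Agent"),
    ("kg:founded", SUBPROP, "kg:involves"),
    ("kg:founded", DOMAIN, "kg:Person"),
    ("kg:founded", RANGE, "kg:Company"),
    ("kg:MH", "kg:founded", "kg:Tencent"),  # the only asserted fact
}
closure = rdfs_closure(triples)
for triple in sorted(closure - triples):
    print(triple)
```

Note that the single asserted fact is enough: the domain and range statements type kg:MH as a Person and kg:Tencent as a Company, and the subclass rule then lifts both to Agent.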

5.2 Rule-Based Reasoning

# Rule-based reasoning, sketched as pseudocode.
# NOTE: Reasoner is a hypothetical interface used for illustration;
# it is not the actual API of PyReason or any other library.

# Define logical rules
rules = [
    "CEO(X, Y) & Company(Y) -> BusinessPerson(X)",      # the CEO of a company is a business person
    "Founder(X, Y) & Company(Y) -> BusinessPerson(X)",  # so is a founder
    "BusinessPerson(X) & BusinessPerson(Y) & WorksWith(X, Y) -> Colleagues(X, Y)",
]

# Run inference
reasoner = Reasoner(rules)
reasoner.add_facts([
    ("Company", "腾讯"),  # needed for the Company(Y) condition to hold
    ("CEO", "马化腾", "腾讯"),
    ("Founder", "张志东", "腾讯"),
    ("WorksWith", "马化腾", "张志东"),
])

results = reasoner.infer()
print(results["BusinessPerson"])  # 马化腾, 张志东
print(results["Colleagues"])      # 马化腾-张志东
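
For readers who want something that runs without any reasoning library, the same three rules can be evaluated with a naive forward-chaining loop. This is a self-contained sketch: rules are encoded as (body, head) tuples, variables start with a question mark, and unify and forward_chain are hypothetical helper names. A Company fact for 腾讯 is included because the first two rules require Company(Y).

```python
def unify(atom, fact, binding):
    """Match one rule atom against one fact, extending the variable binding."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    out = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if term.startswith("?"):
            if out.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return out


def forward_chain(rules, facts):
    """Fire every rule until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            bindings = [{}]
            for atom in body:
                bindings = [
                    b2 for b in bindings for f in facts
                    if (b2 := unify(atom, f, b)) is not None
                ]
            for b in bindings:
                derived = tuple(b.get(t, t) for t in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


rules = [
    ([("CEO", "?x", "?y"), ("Company", "?y")], ("BusinessPerson", "?x")),
    ([("Founder", "?x", "?y"), ("Company", "?y")], ("BusinessPerson", "?x")),
    ([("BusinessPerson", "?x"), ("BusinessPerson", "?y"), ("WorksWith", "?x", "?y")],
     ("Colleagues", "?x", "?y")),
]
facts = [
    ("Company", "腾讯"),
    ("CEO", "马化腾", "腾讯"),
    ("Founder", "张志东", "腾讯"),
    ("WorksWith", "马化腾", "张志东"),
]
derived = forward_chain(rules, facts)
print(sorted(f for f in derived if f[0] in ("BusinessPerson", "Colleagues")))
```

The third rule only fires on the second use of the derived BusinessPerson facts, which is the essence of forward chaining: conclusions feed back in as premises until a fixed point is reached.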

6. Visualizing Knowledge Graphs

6.1 Neo4j Bloom

Neo4j Bloom is Neo4j's official graph visualization tool and supports near-natural-language search.

User input: "companies founded by 马化腾"
Bloom translates the phrase into a Cypher query and visualizes the result

6.2 Visualization with Python

import networkx as nx
import matplotlib.pyplot as plt

# Export from Neo4j into a NetworkX graph
# (conn is the Neo4jConnection instance from section 3.3)
def export_to_networkx(conn):
    # A DiGraph keeps one edge per node pair, so parallel relationships
    # (e.g. FOUNDED and WORKS_AT between the same nodes) overwrite each other
    G = nx.DiGraph()

    results = conn.query("""
        MATCH (p:Person)-[r]->(c)
        RETURN p.name AS source, type(r) AS relation, c.name AS target
    """)

    for row in results:
        G.add_edge(row['source'], row['target'],
                   relation=row['relation'])

    return G

# Visualize
G = export_to_networkx(conn)

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color='lightblue',
        font_size=10, font_weight='bold')
nx.draw_networkx_edge_labels(G, pos,
    edge_labels=nx.get_edge_attributes(G, 'relation'))
plt.show()

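7. Comparing Knowledge Graph Query Languages

Feature | SPARQL | Cypher
--- | --- | ---
Standard | W3C standard | openCypher (open specification)
Pattern matching | Graph patterns (triple patterns) | Graph patterns (node-relationship-node)
Variable binding | SELECT ... WHERE | MATCH ... RETURN
Aggregation | GROUP BY, HAVING | WITH, collect(), count()
Subqueries | Nested SELECT | CALL { ... } subqueries
Path queries | Property paths | [*n..m] syntax
Updates | INSERT/DELETE | CREATE/SET/DELETE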
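
# SPARQL path query
PREFIX kg: <http://example.org/knowledge#>
SELECT ?person ?company
WHERE {
  ?person (kg:founded|kg:co_founded)+ ?company .
}
# Finds companies founded directly or indirectly

// Roughly equivalent Cypher (a relationship type containing a hyphen would need backticks)
MATCH (p:Person)-[:FOUNDED|CO_FOUNDED*]->(c:Company)
RETURN p, c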

8. Hands-On: Building a Small Knowledge Graph

8.1 End-to-End Workflow

import json

# Neo4jConnection is the helper class defined in section 3.3

class KnowledgeGraphBuilder:
    def __init__(self, uri, user, password):
        self.conn = Neo4jConnection(uri, user, password)

    def build_from_json(self, data: list):
        """Build the knowledge graph from JSON-like data."""

        # Clear existing data first (destructive: wipes the whole database)
        self.conn.query("MATCH (n) DETACH DELETE n")

        # Pass 1: create every entity node.
        # Labels and relationship types cannot be Cypher parameters, so they are
        # interpolated into the query text; only do this with trusted input.
        for item in data:
            self.conn.query(f"""
                CREATE (:{item['type']} {{name: $name,
                                          id: $id,
                                          properties: $props}})
            """, {
                "name": item['name'],
                "id": item['id'],
                # Stored as a JSON string: property values cannot be nested maps
                "props": json.dumps(item.get('properties', {}))
            })

        # Pass 2: create relationships once all nodes exist. In a single pass,
        # MATCH would miss target nodes that had not been created yet.
        for item in data:
            for rel in item.get('relations', []):
                self.conn.query(f"""
                    MATCH (a:{item['type']} {{id: $from_id}}),
                          (b:{rel['target_type']} {{id: $to_id}})
                    CREATE (a)-[:{rel['type']}]->(b)
                """, {
                    "from_id": item['id'],
                    "to_id": rel['target_id']
                })

        # Create indexes (IF NOT EXISTS keeps repeated runs from failing)
        self.conn.query("CREATE INDEX IF NOT EXISTS FOR (n:Person) ON (n.name)")
        self.conn.query("CREATE INDEX IF NOT EXISTS FOR (n:Company) ON (n.name)")

        return "Knowledge graph built"

    def query_path(self, from_entity: str, to_entity: str, max_hops: int = 5):
        """Find the shortest path between two entities."""
        # The hop bound cannot be a parameter, so it is formatted in as an int
        result = self.conn.query(f"""
            MATCH path = shortestPath(
                (a {{name: $from_name}})-[*1..{int(max_hops)}]-(b {{name: $to_name}})
            )
            RETURN path
        """, {"from_name": from_entity, "to_name": to_entity})
        return result

# Usage
data = [
    {
        "type": "Person",
        "id": "p1",
        "name": "马化腾",
        "properties": {"born": 1971, "nationality": "中国"},
        "relations": [
            {"type": "FOUNDED", "target_type": "Company", "target_id": "c1"}
        ]
    },
    {
        "type": "Company",
        "id": "c1",
        "name": "腾讯",
        "properties": {"founded": 1998, "industry": "互联网"},
        "relations": [
            {"type": "HEADQUARTERED_IN", "target_type": "City", "target_id": "city1"}
        ]
    },
    {
        "type": "City",
        "id": "city1",
        "name": "深圳",
        "properties": {"province": "广东", "population": 17660000}
    }
]

builder = KnowledgeGraphBuilder("bolt://localhost:7687", "neo4j", "password")
builder.build_from_json(data)

# Query the path
path = builder.query_path("马化腾", "深圳")
print(path)
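
Because labels and relationship types are spliced into the query text rather than passed as parameters, input that does not come from a trusted source should be validated first. One possible guard is sketched below: it accepts only plain identifiers before they reach an f-string. The function names safe_identifier and create_node_query and the exact whitelist pattern are assumptions for this illustration, not part of the Neo4j driver.

```python
import re

# Letters, digits and underscores only, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return name


def create_node_query(label: str) -> str:
    """Build the CREATE statement; property values remain parameters."""
    return f"CREATE (:{safe_identifier(label)} {{name: $name, id: $id}})"


print(create_node_query("Person"))  # CREATE (:Person {name: $name, id: $id})

try:
    create_node_query("Person) DETACH DELETE (n")
except ValueError as err:
    print("rejected:", err)
```

The whitelist is deliberately strict: it also rejects legitimate labels that would need backtick quoting, which keeps the check simple at the cost of some flexibility.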

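Summary

  1. Storage paradigms: RDF graphs (standardized, strong reasoning) vs property graphs (easy to use, good performance)
  2. RDF/SPARQL: W3C standards, suited to academic work and scenarios that need reasoning
  3. Neo4j/Cypher: a widely used graph database; Cypher is intuitive and easy to learn
  4. Index optimization: index frequently queried properties to avoid scanning every node with a label
  5. Reasoning: RDF stores commonly ship RDFS/OWL inference; property graphs need extra tools
  6. Visualization: Neo4j Bloom, NetworkX + Matplotlib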
