15. Design Proximity Service / Yelp
In this chapter, we design a proximity service. A proximity service is used to discover nearby places such as restaurants, hotels, theaters, museums, etc., and is a core component that powers features like finding the best restaurants nearby on Yelp or finding k-nearest gas stations on Google Maps.
在本章中,我们设计了一个邻近服务。邻近服务用于发现附近的地点,例如餐馆、酒店、剧院、博物馆等,并且是支持在 Yelp 上查找附近最好的餐馆或在 Google 地图上查找 k 最近的加油站等功能的核心组件。
Step 1 - Understand the Problem and Establish Design Scope
1.1 Functional requirements
- Return all businesses based on a user’s location (latitude and longitude pair) and radius. 根据用户位置(纬度和经度对)和半径返回所有商家。
- Maximal search radius allowed 允许的最大搜索半径, 20 km (12.5 miles)
- Search radius options in UI 搜索半径选项
- 0.5 km (0.3 miles)
- 1 km (0.6 miles)
- 2 km (1.2 miles)
- 5 km (3.1 miles)
- 10 km (6.2 miles)
- 20 km (12.5 miles)
- Business owners can add, delete or update a business, but this information doesn’t need to be reflected in real-time. 企业主可以添加、删除或更新企业,但这些信息不需要实时反映。
- Customers can view detailed information about a business. 客户可以查看有关企业的详细信息。
1.2 Non-functional requirements
- Low latency 低延迟. Users should be able to see nearby businesses quickly.
- Data privacy 数据隐私. Location info is sensitive data. When we design a location-based service (LBS), we should always take user privacy into consideration.
- High availability and scalability 高可用性 requirements. We should ensure our system can handle the spike in traffic during peak hours in densely populated areas.
1.3 Back-of-the-envelope estimation 粗略计算
- DAU(daily active users): 100 million
- Businesses 商家: 200 million
- Assume a user makes 5 search queries per day. 假设一个用户每天进行 5 次搜索查询。
- Search QPS =
100 million * 5 / 10^5 = 5,000
Step 2 - Propose High-Level Design and Get Buy-In
2.1 API Design
GET /v1/search/nearby
- This endpoint returns businesses based on certain search criteria. 该端点根据某些搜索条件返回企业。
- The business object contains everything needed to render the search result page, but we may still need additional attributes such as pictures, reviews, star rating, etc., to render the business detail page. 企业对象包含渲染搜索结果页面所需的所有内容,但我们可能仍然需要其他属性,例如图片、评论、星级评分等,以渲染企业详细信息页面。
- Therefore, when a user clicks on the business detail page, a new endpoint call to fetch the detailed information of a business is usually required. 因此,当用户单击企业详细信息页面时,通常需要调用新的端点来获取企业的详细信息。
// Request Parameters
{
"latitude": "decimal",
"longitude": "decimal",
"radius": "int"
}
// Response Body
{
"total": 10,
"businesses":[{business object}]
}
APIs for a business
API | Detail |
---|---|
GET /v1/businesses/{:id} | Return detailed information about a business |
POST /v1/businesses | Add a business |
PUT /v1/businesses/{:id} | Update details of a business |
DELETE /v1/businesses/{:id} | Delete a business |
2.2 Data model
Read/write ratio 读写比率
Read volume is high because the following two features are very commonly used:
- Search for nearby businesses. 搜索附近的商家。
- View the detailed information of a business. 查看商家的详细信息。
On the other hand, the write volume is low because adding, removing, and editing business info are infrequent operations. 另一方面,由于 CRUD 操作不是频繁的操作,因此写入量较低。
Data schema
The key database tables are the business table and the geospatial (geo) index table.
Business table 商家表
The business table contains detailed information about a business. The primary key is business_id
.
business {
"business_id": "string", PK
"name": "string",
"address": "string",
"city": "string",
"state": "string",
"zip_code": "string",
"latitude": "decimal",
"longitude": "decimal",
"phone_number": "string",
"categories": "string",
"num_reviews": "int",
"avg_rating": "decimal",
"photo_urls": "string",
"owner_id": "string",
"created_at": "timestamp",
"updated_at": "timestamp"
}
Geo index table 地理索引表
A geo index table is used for the efficient processing of spatial operations. 地理索引表用于高效处理空间操作。
High-level design
The system comprises two parts: location-based service (LBS) and business-related service.
Location-based service 基于位置的服务
The LBS service is the core part of the system which finds nearby businesses for a given radius and location. LBS 服务是系统的核心部分,它可以在给定的半径和位置找到附近的商家。
The LBS has the following characteristics:
- It is a read-heavy service with no write requests. 读多,没有写请求。
- QPS is high, especially during peak hours in dense areas. 高并发服务
- This service is stateless so it’s easy to scale horizontally. 无状态服务,易于水平扩展。
无状态服务 Stateless Service 的概念主要指服务在处理请求时不保存任何客户端特定的数据。在无状态架构中,每个请求都被当作一个独立的单位,不依赖于之前或之后的请求。这意味着每个请求必须包含所有必要的信息来处理它,服务不会利用之前的交互或会话状态。
- 每个请求独立: 服务对每个请求的处理就像是第一次与该客户端交互,不依赖于之前的请求历史。
- 不保留会话信息: 服务不会存储任何关于 客户端会话的信息,如用户身份、过去的活动记录等。
- 简化的服务器设计: 由于服务器不需要跟踪和管理状态信息,服务器设计相对简单,易于实现和维护。
- 易于水平扩展: 因为每个请求都是独立的,所以可以通过增加服务器数量来轻松扩展服务能力,而不需要担心会话数据的同步问题。
- 负载均衡友好: 由于不存在状态信息,请求可以被任意服务器处理,使得负载均衡更加有效。
Business service 商家服务
- Business owners create, update, or delete businesses. Those requests are mainly write operations, and the QPS is not high.
- 商家创建、更新、删除业务,这些请求以写操作为主,QPS不高。
- Customers view detailed information about a business. QPS is high during peak hours.
- 用户访问和查看商家信息是重点服务。
Database cluster 数据库集群
- The database cluster can use the primary-secondary setup. In this setup, the primary database handles all the write operations, and multiple replicas are used for read operations. 主从数据库。主数据库处理所有写操作,多个副本用于读操作。
- Data is saved to the primary database first and then replicated to replicas. Due to the replication delay, there might be some discrepancy between data read by the LBS and the data written to the primary database. 数据优先被写入到主数据库,然后复制到副本。由于复制延迟,LBS 读取的数据和写入主数据库的数据之间可能存在一些差异。
- This inconsistency is usually not an issue because business information doesn’t need to be updated in real-time.不一致性的影响并不大,因为商家信息不需要实时更新。
Step 3 - Algorithms to fetch nearby businesses
In real life, companies might use existing geospatial databases such as Geohash in Redis or Postgres with PostGIS extension.
Before we dive into the answers, let’s take a look at different types of indexing methods. In a broad sense, there are two types of geospatial indexing approaches, as shown in Figure 5. The highlighted ones are the algorithms we discuss in detail because they are commonly used in the industry.
- Hash: even grid, geohash, cartesian tiers, etc.
- Tree: quadtree, Google S2, RTree, etc.
Even though the underlying implementations of those approaches are different, the high-level idea is the same, that is, to divide the map into smaller areas and build indexes for fast search 将地图划分为更小的区域并建立索引以进行快速搜索。.
Option 1: Two-dimensional search 二维搜索
The most intuitive but naive way to get nearby businesses is to draw a circle with the predefined radius and find all the businesses within the circle. 以预先定义的半径画一个圆,并找到圆内的所有企业。
This process can be translated into the following pseudo SQL query:
SELECT business_id, latitude, longitude
FROM business
WHERE (latitude BETWEEN {:my_lat} - radius AND {:my_lat} + radius) AND
(longitude BETWEEN {:my_long} - radius AND {:my_long} + radius)
- Cons 缺点
- This query is not efficient because we need to scan the whole table. 如果没有适当的索引,数据库可能执行全表扫描来查找满足条件的行。
- To fetch businesses within the radius, we need to perform an intersect operation on those two datasets. This is not efficient because each dataset contains lots of data. 为了获取半径内的企业,我们需要对这两个数据集执行相交操作。这效率不高,因为每个数据集都包含大量数据。
- 数据库可以利用 B+ 索引来快速找到满足范围条件的数据的起始点,然后按顺序遍历索引直到结束点。索引本身是有序的,即使基础数据不是。即便如此,查询效率依旧非常低。