Typically, a long URL is a system-generated URL, usually containing multiple query parameters that may be encrypted/encoded to mask those parameters. The need for shortened URLs arises when long URLs have to be shared for various purposes, like sharing links via SMSs, Emails, media types, like Twitter or WhatsApp, etc.
A URL shortening service might sound like a simple service at first, but it poses a very interesting and challenging engineering problem. Please pause here and think about how you could design a system to generate an infinite number of short URLs for possibly all long URLs in the world. The interesting part of the problem is generating unique, random identifiers for these long URLs. The identifiers should be short enough to fit in the URL with an acceptable character set.
A typical short URL could be something like this https://ts.dz/qp7h3zi, where ts.dz is a short domain for a particular application, service, or company, and the last part in the URL qp7h3zi is a unique, random identifier. This part is called as the token. Tokens are alphanumeric strings and act as identifiers to the actual long URL. The length of a token typically varies from 6-9 characters. While designing the URL shortening service, we pay close attention to the length of the token and the characters it contains. To be URL safe, consider smaller case characters a-z and numbers from 0-9, we have a total character set of 36 character set. The number of unique tokens that can be generated using various token lengths are shown below.
|Token Length||Unique Token Combination||Range|
|6||2,176,782,336||> 2 Billion|
|7||78,364,164,096||> 78 Billion|
|8||2,821,109,907,456||> 2 Trillion|
|9||101,559,956,668,416||> 100 Trillion|
From the above table, we can choose the length of our token to be 7 characters long, which can accommodate more than 78 billion unique tokens. If we require a larger range, then setting the token length to 8 characters gives us a larger set, which comes to around 2 trillion (we can build a service possibly with almost all URLs in the world! ). The token generation is covered below, after the high-level design section.
You may also like:
Building Blocks of a Scalable Architecture.
The High-Level Design (HLD) consists of two parts.
- Short URL Generation — covers how to handle a request by the system to generate a short URL for an incoming long URL
- Short URL Redirection — covers how to handle a short URL click request by a user so that the user can be redirected to the actual location i.e. long URL
HLD – Short URL Generation
The URL Shortening service accepts a long URL as input, generates a short URL, and returns it. If the incoming long URL already exists in the system, then the previous generated short URL has to be returned. If the existing short URL has expired, then a new one has to be created and sent as a response.
HLD – Short URL Redirection
When a user clicks on the short URL, the request should be redirected to the URL Shortening Service, which would then redirect it to the actual long URL location. The short domain in the short URL has to be mapped to the service like below:
The logic to generate unique, random tokens can be tricky. The URL Shortening service must ensure that the token should not be assigned to two (or more) different long URLs. Otherwise, the system may incorrectly redirect the short URL to another location (long URL). Tokens can be generated by multiple strategies by using 3rd party libraries or using conventional methods like Base36 encoding. We’ll cover each of these cases individually.
A very reliable library that generates unique, random tokens (ids) is Hashids (which has support in more than 30 languages!). Hashids can create short, unique Ids from numbers (positive numbers and zero) and allows custom alphabets as well as salt, so Ids are unique only to you. The Ids generated are mangled and are not easy to guess. The code footprint is very minimal, and there are no external library dependencies.
Sample tokens generated for salt = ”this is my salt” and character set = “abcdefghijklmnopqrstuvwxyz1234567890”.
This is a very straightforward approach, and most languages have these functions. If you don’t need all the fancy extras Hashids has to offer, this method will work just fine. It’ll probably be faster too. The input range of numbers can be synchronized using a central Zookeeper service.
A high-level design for token generation using Hashids library is shown below. The input to the Hashids library is a number (long) that generates unique tokens. As the URL Shortening service will be working in a cluster with “N” nodes, the input numbers to the Hashids library should be non-repetitive. To handle this, we introduce a synchronization service, like Zookeeper, which can manage and handle the range of Ids (range of numbers) assigned to the URL Shortening application node.
Every application node, on start-up, requests the Zookeeper service for a new range, and the same will be recorded by the Zookeeper. For every short URL request, the URL Shortening application node increments a number (from the range assigned by Zookeeper) and passes it to the Hashids library, which in-turn generates a unique token for a corresponding input number. This token will be saved into the database corresponding to the long URL.
Once the assigned number range has been exhausted by the application node, the URL Shortening application requests for a new range from Zookeeper. This design gives us the flexibility to add any number of application nodes based on the incoming traffic.
The Zookeeper service can be replaced by an instance of DynamoDB, which can maintain a counter. The role of the Zookeeper service is to maintain the ranges that are already assigned to application nodes and assign new ones when requested. A simple table with a counter variable in DynamoDB can be stored. This counter variable is made atomic in nature so that any application node saving an incremental number value doesn’t clash (writes-reads don’t clash). See how to implement atomic counters in DynamoDB in this link. The DynamoDB can also be replaced by a MySQL table as well or any other system that offers atomic number generation like Redis INCR.
Complete System Design
The complete system design can be summarized, as shown below. The main points and how the work flows are maintained are described below the diagram.
Note: A Redis cache setup is added here to improve application performance, as newly generated short URLs can be responded to from the cache rather than querying the database. The cache entries (short URL and long URL map) can be configured with an expiry or an eviction policy, such as LRU (least recently used).
- The URL Shortening application nodes works in a cluster; whenever the application nodes start or when a new node is added to the cluster, they talk to a Zookeeper service.
- The Zookeeper service allocates a unique number range to each application node on request.
- A request to shorten a long URL reaches the Load Balancer, which then appropriately delegates the call to one of the URL Shortening application nodes in a round-robin fashion.
- The application checks if there is an existing short URL entry in the database/cache for the long URL. If an entry exists, and if the short URL is still valid (not expired), then the same short URL will be responded back with HTTP 200 OK.
- If the long URL does not exist in the database, then the URL Shortening application uses the next available number (from the range allotted by Zookeeper in step 2) and generates a new token by using the Hashid library. A short URL is generated using the token and the configured expiry is set and then saved to the database and to the cache. The short URL is sent back to the client with a status of HTTP 201 CREATED.
- Whenever the short URL is clicked (by any user), the request reaches the Load Balancer and is passed on to one of the URL Shortening application nodes.
- The application checks in the cache if a long URL exists for the short URL token and then goes to step 9.
- If there is no cache entry, then the application checks in the database and goes to step 9.
- Upon finding a long URL that the client has to be redirected to, the application responds with HTTP 302 with the long URL set in the HTTP header Location. This redirects the client from short URL to the actual location (i.e. long URL).
- If there is no such token found in the database/cache the application node responds with a status of HTTP 404 NOT FOUND.