A mirror is a copy of data hosted on a second server. Mirrors make the same information available from several sources. Software for Unix-like systems is distributed this way: copies of repositories are kept on mirrors around the world. Mirrors spread the load efficiently and provide high-speed downloads.
Our company has its own mirror, where copies of popular Linux repositories are saved. In this article, we’ll be taking a detailed look at how it’s built.
When we launched our Cloud Server in 2010, we chose net-install as our installation model. Distributions are installed using the native installer from one of the official mirrors. With this model, the latest version, with all of the latest changes made by the distribution maintainers, can always be downloaded. Another advantage of the net-install model is that it lets you avoid problems commonly associated with cloning instances (needing to generate SSH keys, filesystem UUIDs, etc.).
For our main mirror, we chose mirror.yandex.ru since it’s the closest geographically and contains all of the repositories our clients need. This was fine in the beginning, but then the unexpected happened. The number of installations started to grow, and our engineers began actively testing templates. In the end, Yandex got tired of handling the enormous number of identical requests and blocked our subnet from accessing the mirror.
While looking for a stable solution that would reduce the likelihood of unexpected issues, we had an idea: make nginx a proxy server for several mirrors. This seemed like a perfectly sound and reliable solution: even if one uplink dropped, we’d still be able to download files from another without any problems. However, we quickly ran into problems with the diverse structure of mirrors: for example, CentOS repositories can be saved in /centos on one uplink, but /CentOS on another, and /www/mirror/srv/pub/centos on a third.
Universal mirrors that carried every distribution we needed (CentOS, Debian, Ubuntu, OpenSUSE) could be counted on one hand, so we had to make a separate mirror list for each distribution.
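A per-distribution mirror list can be pictured as a mapping from a distribution to the base URLs under which each uplink keeps it. The sketch below is only an illustration in Python, not our nginx configuration: the hostnames and the `candidate_urls` helper are hypothetical.

```python
# Hypothetical uplinks: each keeps the same repository under a different base path.
MIRRORS = {
    "centos": [
        "http://mirror-a.example.org/centos",
        "http://mirror-b.example.org/CentOS",
        "http://mirror-c.example.org/www/mirror/srv/pub/centos",
    ],
    "debian": [
        "http://mirror-a.example.org/debian",
    ],
}


def candidate_urls(distro, relative_path):
    """Return the URLs to try, in order, for a file inside a distribution's repository."""
    bases = MIRRORS.get(distro, [])
    return [f"{base.rstrip('/')}/{relative_path.lstrip('/')}" for base in bases]
```

A client or proxy would walk this list in order and fall back to the next URL when an uplink fails.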
When putting this idea into practice, we ran into even bigger issues:
- uplink speeds are rarely consistent: it’s not uncommon for one host to transfer at 5-10 Mbps one minute, then a few hours later at no more than 5-20 Kbps. Since the installer downloads packages one at a time, the drop in speed may delay the installation indefinitely;
- some uplinks may not have been configured properly: instead of returning an RPM package, they returned an HTML page with the text “It works!”;
- some uplinks were missing a package listed in the catalog, or had the package but with an incorrect checksum. This can occur if upstream synchronization happens in the wrong order: first index files and then packages, not the other way around. Errors can also occur if rsync is misconfigured so that it writes files in place instead of first saving the new contents to a temporary file.
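Two of the failures above, an HTML page returned in place of an RPM and a package with the wrong checksum, can be caught with simple checks on the downloaded bytes. The following is a minimal sketch, not the code we run: the function names are hypothetical, and it assumes the catalog publishes SHA-256 checksums. The only hard fact it relies on is that every RPM file starts with the four bytes `ED AB EE DB`.

```python
import hashlib

RPM_MAGIC = b"\xed\xab\xee\xdb"  # first four bytes of every RPM file


def looks_like_rpm(data):
    """True if the payload starts with the RPM magic number (so it is not an HTML page)."""
    return data[:4] == RPM_MAGIC


def sha256_matches(data, expected_hex):
    """Compare the SHA-256 of the payload with the checksum taken from the catalog."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


def is_valid_package(data, expected_hex):
    """A package is accepted only if both checks pass."""
    return looks_like_rpm(data) and sha256_matches(data, expected_hex)
```

Checking the magic number first is cheap and rejects placeholder pages before any hashing is done.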
Because of these difficulties, we had automated-installation issues on more than one occasion. To fix these errors once and for all, we made our own mirror — mirror.selectel.ru. It’s only available from Selectel IP addresses (since we pay for outgoing traffic, we’d rather not risk it going public and racking up 10-20 gigs of traffic).
By making our own mirror, we solved all of the problems listed above. Among the advantages, we should mention:
- uplink synchronization does not interrupt a client’s service and in no way affects their working copy;
- the synchronized copy replaces the existing copy only if the checksums of all the new packages match;
- if an uplink is unavailable or returns corrupt data for whatever reason, the mirror will continue to return data from the last working version;
- uplink synchronization is configured per distribution: some distributions may be synchronized more often than others. It’s also possible to clone multiple repositories.
Operating systems are installed on our dedicated servers from this mirror.
How Repositories are Built
A repository is usually made up of two key components: a catalog (index) and a pool (package storage).
Information on packages in the repository is saved in the catalog: name, description, architecture, version, checksum, and in some cases, information on dependencies and package contents. The catalog also shows where one version or another of a file is located in a pool for each package.
The package files themselves are stored in the pool. They can be arranged in a hierarchy or simply placed together in a single directory.
At the root of each RPM repository is a directory with catalog files — repodata. A description of all the catalog sections is saved in the file repomd.xml. Every section is presented in a separate file in the catalog’s directory. The description contains the path to the file containing the section and its checksum.
The contents of repomd.xml may look like the following:
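    <repomd xmlns="http://linux.duke.edu/metadata/repo">
      <revision>1362531727</revision>
      <data type="...">
        <checksum type="sha256">87aa4c4e19f9a3ec93e3d820f1ea6b6ece8810cb45f117a16354465e57a1b50d</checksum>
        <open-checksum type="sha256">77b5cfcf2c06156858a14a52595e1f69cd8cbb58c09699a3ea4391379260e943</open-checksum>
        <location href="..."/>
        <timestamp>1362531876</timestamp>
        <size>2043735</size>
        <open-size>12931923</open-size>
      </data>
      <data type="...">
        <checksum type="sha256">243fdef956d09cb6d022e894e40d145f497bcf3d6d2bed79814e1c88452b9d29</checksum>
        <open-checksum type="sha256">533872a158160ac3a83746a676c125b5cfb2411725079502b0d5be4f4d05196e</open-checksum>
        <location href="..."/>
        <timestamp>1362531897.21</timestamp>
        <database_version>10</database_version>
        <size>3605913</size>
        <open-size>14942208</open-size>
      </data>
      ...
    </repomd>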
The RPM catalog contains the following sections:
- primary: a description of all the packages saved in the repository, package file paths, and their checksums;
- filelists: lists of files included in each package;
- group: descriptions of package groups installed using yum groupinstall;
- other: additional information (for example: changelogs).
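To show how a sync script might read this catalog, here is a short Python sketch that extracts the location and checksum of every section from repomd.xml. It is an illustration rather than an excerpt from our scripts: the `parse_repomd` name is hypothetical, and the code assumes the standard `http://linux.duke.edu/metadata/repo` namespace used by repomd.xml.

```python
import xml.etree.ElementTree as ET

# XML namespace of the elements in repomd.xml
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def parse_repomd(xml_text):
    """Map each section type (primary, filelists, ...) to its file path and checksum."""
    root = ET.fromstring(xml_text)
    sections = {}
    for data in root.findall("repo:data", REPO_NS):
        checksum = data.find("repo:checksum", REPO_NS)
        location = data.find("repo:location", REPO_NS)
        sections[data.get("type")] = {
            "href": location.get("href"),           # path relative to the repository root
            "checksum_type": checksum.get("type"),  # e.g. "sha256"
            "checksum": checksum.text.strip(),
        }
    return sections
```

With this mapping in hand, each section file can be downloaded and compared against its published checksum before the catalog is trusted.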
The structuring and grouping of packages varies from one distribution to another. For example, CentOS keeps all packages in the Packages directory at the root of the repository, and a separate repository is created for each architecture.
OpenSUSE keeps packages for every architecture in one repository, with separate pools in per-architecture directories (i686, x86_64, and so on).
All of the packages in a DEB repository are kept in a shared pool, which avoids duplicating packages across releases. A separate catalog is created for each release in the repository.
Catalog parsing begins with the file /dists/[distribution]/Release (distribution here is the code name of the release, like squeeze/wheezy/jessie). This file contains a list of all the release’s components, as well as information on the size and checksum of all of the index files. The Release file is signed by the archive maintainers and the signature is saved in the file Release.gpg (the contents of Release may be located along with the signature in the file InRelease).
A description of the pool’s contents is found in two kinds of index files: Packages (where binary packages are listed) and Sources (where source code is listed).
The path to the Packages file is /dists/[distribution]/[component]/binary-[architecture]/Packages, and the path to the Sources file is /dists/[distribution]/[component]/source/Sources.
Note: index files are sometimes compressed with gzip or bzip2. In this case, the extension .gz or .bz2 gets attached to the file name. Some clients support LZMA (.lzma), XZ (.xz), and LZIP (.lz).
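The Release file lists its index files in checksum blocks such as `SHA256:`, where each indented line holds a checksum, a size, and a path relative to /dists/[distribution]/. The Python sketch below reads one such block and builds the Packages path described above. The function names are hypothetical, and signature verification is deliberately left out.

```python
def parse_release_checksums(release_text, field="SHA256"):
    """Return {relative path: (checksum, size)} from one checksum block of a Release file."""
    entries = {}
    in_block = False
    for line in release_text.splitlines():
        if line.startswith(" "):
            # Indented continuation line: "<checksum> <size> <path>"
            if in_block:
                checksum, size, path = line.split()
                entries[path] = (checksum, int(size))
        else:
            # A new top-level field starts; remember whether it is the block we want.
            in_block = line.split(":", 1)[0] == field
    return entries


def packages_index_path(distribution, component, architecture):
    """Path of the Packages index, relative to the repository root."""
    return f"dists/{distribution}/{component}/binary-{architecture}/Packages"
```

A mirror script could compare every downloaded index file against these entries and reject the sync if any size or checksum differs.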
An example of the Packages file record:
    Package: openssh-server
    Source: openssh
    Version: 1:6.2p2-6
    Installed-Size: 747
    Maintainer: Debian OpenSSH Maintainers
    Architecture: amd64
    Replaces: openssh-client (<= 2.16), libcomerr2 (>= 1.01), libgssapi-krb5-2 (>= 1.10+dfsg~), libkrb5-3 (>= 1.6.dfsg.2), libpam0g (>= 0.99.7.1), libselinux1 (>= 1.32), libssl1.0.0 (>= 1.0.1), libwrap0 (>= 7.6-4~), zlib1g (>= 1:1.1.4), openssh-client (= 1:6.2p2-6), sysv-rc (>= 2.88dsf-24) | file-rc (>= 0.8.16), libpam-runtime (>= 0.76-14), libpam-modules (>= 0.72-9), adduser (>= 3.9), dpkg (>= 1.9.0), lsb-base (>= 4.1+Debian3), procps
    Recommends: xauth, ncurses-term
    Suggests: ssh-askpass, rssh, molly-guard, ufw, monkeysphere, openssh-blacklist, openssh-blacklist-extra
    Conflicts: rsh-client (<< 0.16.1-1), sftp, ssh (<< 1:3.8.1p1-9), ssh-krb5 (<< 1:4.3p2-7), ssh-nonfree (<< 2), ssh-socks, ssh2
    Description: secure shell (SSH) server, for secure access from remote machines
    Multi-Arch: foreign
    Homepage: http://www.openssh.org/
    Description-md5: 842cc998cae371b9d8106c1696373919
    Tag: admin::login, implemented-in::c, interface::daemon, network::server, protocol::ssh, role::program, security::authentication, security::cryptography, use::login, use::transmission
    Section: net
    Priority: optional
    Filename: pool/main/o/openssh/openssh-server_6.2p2-6_amd64.deb
    Size: 257438
    MD5sum: 1f18e568c17d81cc2c493ee48c93a03f
    SHA1: 207f131bbd4d709a47bcb69c997520c998ed7593
    SHA256: 242b7f041292dea0702b24e19dc6355f47147796b227f1024665920a493641f2
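For verification, the fields of such a record that matter most are Filename, Size, and the checksums, since they tie a catalog entry to a file in the pool. As a rough sketch (the `parse_stanza` helper is hypothetical and handles only a single record), the "Field: value" format can be parsed like this:

```python
def parse_stanza(stanza_text):
    """Parse one Packages record into a dict of field name -> value."""
    fields = {}
    current = None
    for line in stanza_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines around the record
        if line[0] in " \t" and current is not None:
            # Indented line: continuation of the previous field (e.g. a long description)
            fields[current] += "\n" + line.strip()
        else:
            # Split on the first colon only, so values such as "1:6.2p2-6" stay intact
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields
```

Given the parsed record, a script can locate the file named in `Filename` and compare its size and SHA256 with the values from the catalog.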
How Our Mirror Works
Repositories for each distribution are saved on our mirror in duplicate: a shadow (background) and a working (foreground) copy. Both copies are located on a separate LVM volume, which lets us add disk space on the fly. A verified copy of the mirror is saved in the working section, which is served by nginx. The shadow section synchronizes with the upstream mirror and is then thoroughly verified.
Validation covers the catalog, its digital signature (if it has one), and the checksums of all the index files. Verifying the checksums of every package is impractical: some repository pools hold not tens but hundreds of gigabytes of packages. This is why checksums are checked only for the new packages that rsync has downloaded. When verification is finished, the shadow and working sections are swapped using the mv command. This way, we can almost guarantee the atomicity of the swap (three quick mv calls are enough to exchange the directories) and minimize possible downtime. Files that are open and being transferred at the time of the swap are not closed.
After the two sections have switched places, the shadow section locally “catches up” with the present status of the working copy.
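The three-step swap can be sketched in Python with `os.rename`, which behaves like mv within a single filesystem. This is an illustration under assumptions, not an excerpt from our scripts: the directory names `working`, `shadow`, and `swap-tmp` are hypothetical, and the order of the three renames is one plausible choice.

```python
import os


def swap_copies(root, working="working", shadow="shadow", tmp="swap-tmp"):
    """Swap the working and shadow directories with three renames.

    All three paths must be on the same filesystem so that each rename is atomic.
    """
    working_path = os.path.join(root, working)
    shadow_path = os.path.join(root, shadow)
    tmp_path = os.path.join(root, tmp)

    os.rename(working_path, tmp_path)     # 1. move the old working copy aside
    os.rename(shadow_path, working_path)  # 2. the verified shadow copy goes live
    os.rename(tmp_path, shadow_path)      # 3. the old working copy becomes the shadow
```

Each rename is atomic on its own, but between steps 1 and 2 the working path briefly does not exist, which is why the swap as a whole is only almost atomic. Processes that already hold open files keep reading them, because a rename does not invalidate open file descriptors.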
This algorithm is implemented in our set of scripts, grouped together as “mirror-sync”, which we have published on GitHub under the GNU GPL. We hope a lot of users find our efforts useful and that some of our visitors can take advantage of our experience when making their own mirror.
If you have any comments or suggestions for improving our mirror, please leave them in the comments below. We’ll certainly take them into account when making future changes.