RubyGems.org is the Ruby community’s gem hosting service. Gem developers can publish their gems for anyone to install, and Ruby developers can browse gem pages to learn more about dependencies and revision histories. Their open-source site is fronted by Fastly, whose CDN logs are easy to send straight to Honeycomb.
Being able to sift through CDN traffic for a site like RubyGems.org exposes a wealth of detail about the gems that the Ruby community is downloading most, which gems have the largest number of actively downloaded versions, and how the Fastly cache status affects download times.
Below, find a few examples of interesting tidbits we discovered by exploring their Fastly data.
Note: All of these questions / explorations link directly to a graph attempting to answer that question. That graph is a permalink to a previously-run (and permanently preserved) execution of that question. To run it again, simply hit “Run Query” to get recent data.
Ashburn, VA makes up a remarkably large portion of traffic to RubyGems.org.
Workday trends surface differences in patterns between humans and automated tools.
gunzenhausen in the data) has the most unique client IPs using IPv6, followed by Redmond, WA. (And almost nobody uses HTTP2 yet.)
Fastly even helpfully exposes which world cities contribute to IPv6 traffic.
ORs instead of
geo_country (string): Fastly’s geolocation functionality offers all sorts of ways to identify useful characteristics about your client traffic. Use Group By or Where with this field to discover trends that occur by location!
resp_body_size (int): given the nature of RubyGems.org (to serve published gems to the Ruby community), the number of bytes written in the HTTP response might seem deceptively constant and boring at first blush. Of course all requests served with downloaded_gem_name = rspec should have similar resp_body_sizes, since they are all serving the same file! But when might that not be true? If you slice this data by another field like geo_city, instead of one tightly correlated with payload size, and try a SUM(resp_body_size) rather than an AVG or percentile, you could learn a bit about which cities are consuming the most bandwidth.
time_elapsed (int): in any sort of HTTP-serving service, the duration taken to serve the request (in this case, in microseconds) is a great candidate for visualization. HEATMAPs show distributions over time (like a histogram turned sideways and smeared over time), P95s show the 95th percentile graphed over time, and adding a Group By lets you compare the time_elapsed of one group of Fastly requests against another.
bundler_minor_version (string): the overwhelming majority of requests to RubyGems.org are being made by Bundler, a popular Ruby package manager. These two columns break out the different versions of Bundler being used to talk to RubyGems.org, and are not populated by Fastly; instead, they are fields that were extracted on the fly from the user_agent field. (See the technical fine print below for more about how we defined these derived columns.)
downloaded_gem_version (string): these two fields are not always populated for all requests to RubyGems.org, since not all requests are asking to download a gem. For those that are, though, these two derived columns extract the specific gem name and specific gem version being downloaded, based on regular expression patterns in the Fastly-provided
url (string): the URL served. Requests to /info* are mostly clients fetching metadata that they will use to decide what gems to download and install.
But do not stop there! You can find a full description of each field in the right-hand sidebar under the Details tab.
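To make the SUM-versus-AVG point for resp_body_size and the P95 idea for time_elapsed a little more concrete, here is a minimal Ruby sketch over a handful of made-up events. The field names match the dataset, but the values are invented for illustration, the helper names are hypothetical, and the nearest-rank percentile is an assumption rather than a description of how Honeycomb computes P95.

```ruby
# Hypothetical events: field names follow the dataset, values are invented.
SAMPLE_REQUESTS = [
  { geo_city: "ashburn", resp_body_size: 1_000, time_elapsed: 200 },
  { geo_city: "ashburn", resp_body_size: 1_000, time_elapsed: 400 },
  { geo_city: "ashburn", resp_body_size: 1_000, time_elapsed: 9_000 },
  { geo_city: "redmond", resp_body_size: 1_000, time_elapsed: 300 }
].freeze

# SUM(field) per group: total bytes served to each city.
def sum_by(events, group, field)
  events.group_by { |e| e[group] }
        .transform_values { |es| es.sum { |e| e[field] } }
end

# AVG(field) per group: stays flat when every request serves the same file.
def avg_by(events, group, field)
  events.group_by { |e| e[group] }
        .transform_values { |es| es.sum { |e| e[field] }.fdiv(es.size) }
end

# Nearest-rank percentile (an assumed definition, chosen for simplicity).
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Pxx(field) per group, for comparing one group's latency against another.
def percentile_by(events, group, field, pct)
  events.group_by { |e| e[group] }
        .transform_values { |es| percentile(es.map { |e| e[field] }, pct) }
end

puts sum_by(SAMPLE_REQUESTS, :geo_city, :resp_body_size).inspect
puts avg_by(SAMPLE_REQUESTS, :geo_city, :resp_body_size).inspect
puts percentile_by(SAMPLE_REQUESTS, :geo_city, :time_elapsed, 95).inspect
```

In this toy data the average response size is identical for both cities, while the sum shows that one city pulls three times the bytes, which is the kind of difference a SUM grouped by geo_city is meant to surface.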
We have described some fun starter queries for you to begin exploring RubyGems.org’s traffic.
A couple of Honeycomb-specific notes: if you are struggling with the query builder, you can find some helpful documentation here. And note that RubyGems.org serves quite a bit of traffic, so we recommend constraining queries to a short time period (say, 30 minutes) while you experiment with queries — this way you can iterate fast while exploring, then expand the time window when you find a query worth running over a longer period of time.
downloaded_gem_version, Visualize the overall volume of requests over time (COUNT), and set the Where clause to just rspec traffic (downloaded_gem_name = rspec).
downloaded_gem_name, Visualize the overall distribution and 99th percentile of response times (P99(time_elapsed)), and set the Where clause to just traffic that matches the download pattern (url starts-with /gems/).
downloaded_gem_name table cell will expose a ⋯ button which you can click to “Only show me events in this group.” Then, try adding a geo_city Group By to see which cities are requesting these slow downloads of your chosen gem.
downloaded_gem_name, then Visualize the number of distinct countries represented in the client IPs recorded by Fastly (COUNT_DISTINCT(geo_country_code)). You may also want to set the Where clause to traffic where
geo_city, then Visualize the number of distinct human languages requested by the HTTP headers (COUNT_DISTINCT(request_accept_language)). We order by the first Visualization in descending order by default, but you can change the ordering here to COUNT_DISTINCT(...) asc to view the least multilingual cities, or order by geo_city to get an alphabetical listing.
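The Group By, COUNT_DISTINCT, and ordering steps in that last query can be sketched in plain Ruby. This is only an illustration over invented events: the helper names are hypothetical, and the sketch counts distinct values exactly, whereas a real query engine may use an approximation.

```ruby
# Hypothetical events: field names follow the dataset, values are invented.
LANGUAGE_EVENTS = [
  { geo_city: "tokyo",  request_accept_language: "ja" },
  { geo_city: "tokyo",  request_accept_language: "en-US" },
  { geo_city: "tokyo",  request_accept_language: "ja" },
  { geo_city: "berlin", request_accept_language: "de" },
  { geo_city: "berlin", request_accept_language: "en-US" },
  { geo_city: "berlin", request_accept_language: "fr" },
  { geo_city: "austin", request_accept_language: "en-US" }
].freeze

# COUNT_DISTINCT(field) per group: number of unique values seen in each group.
def count_distinct_by(events, group, field)
  events.group_by { |e| e[group] }
        .transform_values { |es| es.map { |e| e[field] }.uniq.size }
end

# Order [group, count] pairs by count; ties fall back to the group name.
def order_by_count(counts, direction: :desc)
  sorted = counts.sort_by { |key, n| [n, key] }
  direction == :desc ? sorted.reverse : sorted
end

counts = count_distinct_by(LANGUAGE_EVENTS, :geo_city, :request_accept_language)
puts order_by_count(counts).inspect                  # most multilingual first
puts order_by_count(counts, direction: :asc).inspect # least multilingual first
puts counts.sort_by { |city, _| city }.inspect       # alphabetical by geo_city
```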
Playing with data is all well and good, but where did it all come from?
RubyGems.org is first and foremost open source, and supported by a fantastic crew of folks who were willing to expose their realtime data to the public in the interest of a learning opportunity for the community. Connecting their traffic to Honeycomb was simply a matter of configuring their CDN logs to output a structured format for ingestion by Honeycomb.
To protect client privacy while also preserving uniqueness, the client_ip field populated in the Log Streaming to Honeycomb docs has been replaced with a client_ip_hash, which hashes the
On the Honeycomb side, a handful of extra columns were created in order to extract some particularly useful fields out of the standard HTTP fields:
The first two operate on the user_agent field while the latter two operate on the served
(You can see the definition of the derived columns by expanding the Details sidebar and clicking on the field name in question.)
While these fields could certainly be populated in the Fastly config, using derived columns to extract values from other fields on the fly allows for a bit more flexibility in column definition and column evolution.
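As a rough illustration of the kind of extraction such a derived column performs on user_agent, here is a minimal Ruby sketch. It is not the actual derived column definition. It assumes that Bundler user agents begin with bundler/ followed by a dotted version, and that "minor version" means the first two components joined by a dot; the function name and the example strings are hypothetical.

```ruby
# Assumption: Bundler's user agent starts with "bundler/<major>.<minor>...".
BUNDLER_UA_PATTERN = %r{\Abundler/(?<major>\d+)\.(?<minor>\d+)}

# Returns "MAJOR.MINOR" for Bundler user agents, or nil for anything else.
def bundler_minor_version(user_agent)
  match = BUNDLER_UA_PATTERN.match(user_agent.to_s)
  match && "#{match[:major]}.#{match[:minor]}"
end

# Illustrative user agent strings, not taken from the dataset.
puts bundler_minor_version("bundler/1.16.1 rubygems/2.7.6 ruby/2.5.1 command/install").inspect
puts bundler_minor_version("Ruby, RubyGems/2.6.14 x86_64-linux Ruby/2.4.3").inspect
```

Returning nil for non-Bundler clients mirrors the idea that a derived column is simply left unpopulated when its source field does not match.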
The URL pattern for gem downloads from RubyGems.org is /gems/NAME-VERSION.gem, and we were able to use the Gem::Version::VERSION_PATTERN to confidently match the values we were interested in.
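The same matching idea can be tried locally in Ruby, since Gem::Version::VERSION_PATTERN ships with RubyGems. The sketch below is an approximation of the approach, not the derived column definition itself: the constant and function names are hypothetical, and it assumes url holds only the request path with no query string.

```ruby
require "rubygems"

# Build the download pattern around RubyGems' own version pattern.
# The name group is greedy, so hyphenated gem names stay intact.
GEM_DOWNLOAD_PATTERN =
  %r{\A/gems/(?<name>.+)-(?<version>#{Gem::Version::VERSION_PATTERN})\.gem\z}

# Returns the two derived values for a gem download URL, or nil otherwise.
def parse_gem_download(url)
  match = GEM_DOWNLOAD_PATTERN.match(url.to_s)
  return nil unless match

  { downloaded_gem_name: match[:name], downloaded_gem_version: match[:version] }
end

puts parse_gem_download("/gems/rspec-3.7.0.gem").inspect
puts parse_gem_download("/gems/rack-test-0.8.3.gem").inspect
puts parse_gem_download("/info/rspec").inspect
```

Requests that do not match the download pattern, such as /info/rspec, return nil here, which corresponds to the derived columns being empty for non-download traffic.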