A Guide to Exploring RubyGems.org Traffic | Honeycomb


A Guide to Exploring RubyGems.org Traffic

RubyGems.org is the Ruby community’s gem hosting service. Gem developers can publish their gems for anyone to install, and Ruby developers can browse gem pages to learn more about dependencies and revision histories. Their open-source site is fronted by Fastly, whose CDN logs are easy to send straight to Honeycomb.

Sifting through CDN traffic for a site like RubyGems.org reveals which gems the Ruby community downloads most, which gems have the largest number of actively downloaded versions, and how the Fastly cache status affects download times.


What are some things we can learn from this dataset?

Below are a few examples of what we found while exploring RubyGems.org’s Fastly data.

Note: Each question below links directly to a graph that attempts to answer it. That graph is a permalink to a previously run (and permanently preserved) execution of the query. To run it again against recent data, click “Run Query.”

A few particularly interesting fields

  • geo_city (string), geo_country (string) — Fastly’s geolocation functionality offers all sorts of ways to identify useful characteristics about your client traffic. Use Group By or Where with this field to discover trends that occur by location!

  • resp_body_size (int) — given the nature of RubyGems.org (to serve published gems to the Ruby community), the number of bytes written in the HTTP response might seem deceptively constant and boring at first blush: of course all requests served with downloaded_gem_name = rspec should have similar resp_body_sizes, they’re all serving the same file!

    … But when might that not be true? If you slice this data by another field like geo_city, instead of one tightly correlated with payload size, and try a SUM(resp_body_size) rather than an AVG or percentile, you could learn a bit about which cities are consuming the most bandwidth.

  • time_elapsed (int) — in any sort of HTTP-serving service, the duration taken to serve the request (in this case, in microseconds) is a great candidate for visualization. HEATMAPs show distributions over time (like a histogram turned sideways and smeared over time), P95s show the 95th percentile graphed over time, and adding a Group By lets you compare the time_elapsed of one group of Fastly requests against another.

  • bundler_version (string), bundler_minor_version (string) — the overwhelming majority of requests to RubyGems.org are being made by Bundler, a popular Ruby package manager. These two columns break out the different versions of Bundler being used to talk to RubyGems.org, and are not populated by Fastly — instead, they’re fields that were extracted on-the-fly from the user_agent field. (See the technical fine print below, for more about how we defined these derived columns.)

  • downloaded_gem_name (string), downloaded_gem_version (string) — these two fields aren’t always populated for all requests to RubyGems.org, since not all requests are asking to download a gem. For those that are, though, these two derived columns extract the specific gem name and specific gem version being downloaded, based on regular expression patterns in the Fastly-provided url field.

  • url (string) — the URL served. Requests to /api* and /info* are mostly clients fetching metadata that they will use to decide what gems to download and install.

But don’t stop there! You can find a full description of each field in the right-hand sidebar under the Details tab.
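
To make the SUM versus AVG point about resp_body_size concrete, here is a minimal Python sketch that groups a handful of events two different ways. The sample events and their sizes are invented purely for illustration, and the helper name sum_and_avg_by is hypothetical; only the field names come from the dataset described above.

```python
from collections import defaultdict

# Hypothetical sample events; the values are invented for illustration only.
events = [
    {"geo_city": "Tokyo", "downloaded_gem_name": "rspec", "resp_body_size": 1000},
    {"geo_city": "Tokyo", "downloaded_gem_name": "rspec", "resp_body_size": 1000},
    {"geo_city": "Tokyo", "downloaded_gem_name": "rails", "resp_body_size": 4000},
    {"geo_city": "Berlin", "downloaded_gem_name": "rails", "resp_body_size": 4000},
]


def sum_and_avg_by(events, group_field, value_field):
    """Return {group: (sum, avg)} of value_field, grouped by group_field."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for event in events:
        key = event.get(group_field)
        value = event.get(value_field)
        if key is None or value is None:
            continue  # skip events where either field is not populated
        totals[key] += value
        counts[key] += 1
    return {key: (totals[key], totals[key] / counts[key]) for key in totals}


# Grouped by gem, the average is constant per group: every request serves the same file.
by_gem = sum_and_avg_by(events, "downloaded_gem_name", "resp_body_size")

# Grouped by city, the sum shows which city pulled the most bytes overall.
by_city = sum_and_avg_by(events, "geo_city", "resp_body_size")
```

Grouping by gem name gives one constant size per gem, which is the "boring" view. Grouping by city mixes gems together, so the sum becomes a rough proxy for bandwidth consumed per location.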

Now, go explore on your own!

We’ve described some fun starter queries for you to begin exploring RubyGems.org’s traffic.

A couple of Honeycomb-specific notes: if you’re struggling with the query builder, you can find some helpful documentation here. And note that RubyGems.org serves quite a bit of traffic, so we recommend constraining queries to a short time period (say, 30 minutes) while you experiment with queries — this way you can iterate fast while exploring, then expand the time window when you find a query worth running over a longer period of time.

  • How often are certain versions of, say, rspec being downloaded?
    • Hints: you’ll want to Group By downloaded_gem_version, Visualize the overall volume of requests over time (COUNT), and set the Where clause to just rspec traffic (downloaded_gem_name = rspec).
  • Which are the gems that are the slowest to download?
    • Hints: you’ll want to Group By downloaded_gem_name, Visualize the overall distribution and 99th percentile of response times (HEATMAP(time_elapsed) and P99(time_elapsed)), and set the Where clause to just traffic that matches the download pattern (url starts-with /gems/).
    • Next steps: pick a gem to explore more. In the summary table, mousing over a downloaded_gem_name table cell will expose a button which you can click to “Only show me events in this group.” Then, try adding a geo_city Group By to see which cities are requesting these slow downloads of your chosen gem.
  • Which gems are the most internationally popular?
    • Hints: you’ll want to Group By downloaded_gem_name, then Visualize the number of distinct countries represented in the client IPs recorded by Fastly (COUNT_DISTINCT(geo_country_code)). You may also want to set the Where clause to traffic where downloaded_gem_name exists.
  • Which cities served by RubyGems.org are the most multilingual?
    • Hints: you’ll want to Group By geo_city, then Visualize the number of distinct human languages requested by the HTTP headers (COUNT_DISTINCT(request_accept_language)). We order by the first Visualization in descending order by default, but you can change the ordering here to COUNT_DISTINCT(...) asc to view the least multilingual cities, or order by geo_city to get an alphabetical listing.
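
If you prefer to think about these starter queries as data rather than as clicks in the query builder, the hints above can be written down as small query descriptions. The sketch below is a hedged illustration: the helper build_query is hypothetical, and the key names (breakdowns, calculations, filters, time_range) are an assumption modeled on Honeycomb's query specification, so check the current API documentation before relying on them.

```python
import json


def build_query(breakdowns, calculations, filters=(), time_range_seconds=1800):
    """Assemble a query description as a plain dict.

    Key names are assumed to mirror Honeycomb's query specification.
    The default time range of 1800 seconds is the 30-minute window
    suggested above for quick iteration.
    """
    return {
        "time_range": time_range_seconds,
        "breakdowns": list(breakdowns),      # Group By
        "calculations": list(calculations),  # Visualize
        "filters": list(filters),            # Where
    }


# How often are certain versions of rspec being downloaded?
rspec_versions = build_query(
    breakdowns=["downloaded_gem_version"],
    calculations=[{"op": "COUNT"}],
    filters=[{"column": "downloaded_gem_name", "op": "=", "value": "rspec"}],
)

# Which gems are the slowest to download?
slowest_gems = build_query(
    breakdowns=["downloaded_gem_name"],
    calculations=[
        {"op": "HEATMAP", "column": "time_elapsed"},
        {"op": "P99", "column": "time_elapsed"},
    ],
    filters=[{"column": "url", "op": "starts-with", "value": "/gems/"}],
)

print(json.dumps(rspec_versions, indent=2))
```

Writing the query this way makes the mapping explicit: Group By becomes breakdowns, Visualize becomes calculations, and Where becomes filters.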

Some technical fine print

Playing with data is all well and good, but where did it all come from?

RubyGems.org is first and foremost open source, and supported by a fantastic crew of folks who were willing to expose their realtime data to the public in the interest of a learning opportunity for the community. Connecting their traffic to Honeycomb was simply a matter of configuring their CDN logs to output a structured format for ingestion by Honeycomb.

To protect client privacy while also preserving uniqueness, the client_ip field described in the Log Streaming to Honeycomb docs has been replaced with client_ip_hash, which contains a hash of each client_ip value.
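
The source does not say which hash function or keying scheme produces client_ip_hash. As one possible approach, here is a minimal sketch using a keyed hash (HMAC-SHA256) from Python's standard library. The function name hash_client_ip, the secret key, and the sample addresses are all assumptions made for illustration.

```python
import hashlib
import hmac


def hash_client_ip(client_ip: str, secret_key: bytes) -> str:
    """Return a hex digest that is stable per IP but does not reveal the IP.

    A keyed hash is an assumption here, chosen because an unkeyed hash of
    an IPv4 address can be reversed by enumerating the whole address space.
    """
    return hmac.new(secret_key, client_ip.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and documentation-range addresses (RFC 5737).
key = b"example-secret-key"
first = hash_client_ip("203.0.113.7", key)
again = hash_client_ip("203.0.113.7", key)
other = hash_client_ip("203.0.113.8", key)
```

The same address always maps to the same digest, so COUNT_DISTINCT over the hashed field still counts unique clients, while different addresses produce different digests.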

Extracting the gem name + version

On the Honeycomb side, a handful of extra columns were created in order to extract some particularly useful fields out of the standard HTTP fields: bundler_version, bundler_minor_version, downloaded_gem_name, and downloaded_gem_version. The first two operate on the user_agent field while the latter two operate on the served url. (You can see the definition of the derived columns by expanding the Details sidebar and clicking on the field name in question.)
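
As a rough illustration of the user_agent extraction, here is a Python sketch rather than the actual derived column definition. It assumes Bundler identifies itself with a token of the form bundler/X.Y.Z in the user agent, and that bundler_minor_version means the first two version components; the regular expression, the function name, and the sample strings are all hypothetical.

```python
import re

# Assumed token format: "bundler/" followed by a dotted version string.
BUNDLER_VERSION = re.compile(r"\bbundler/(?P<version>[0-9]+(?:\.[0-9A-Za-z]+)*)")


def bundler_versions(user_agent):
    """Return (bundler_version, bundler_minor_version), or (None, None)."""
    match = BUNDLER_VERSION.search(user_agent or "")
    if match is None:
        return None, None  # request was not made by Bundler
    version = match.group("version")
    minor = ".".join(version.split(".")[:2])  # keep major.minor only
    return version, minor


# Hypothetical user agent strings, for illustration only.
from_bundler = bundler_versions("bundler/2.4.10 rubygems/3.4.10 ruby/3.2.2 command/install")
from_browser = bundler_versions("Mozilla/5.0 (X11; Linux x86_64)")
```

Returning an empty pair for non-Bundler traffic mirrors how a derived column is simply unpopulated on events where the pattern does not match.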

While these fields could certainly be populated in the Fastly config, using derived columns to extract values from other fields on the fly allows for a bit more flexibility in column definition and column evolution. The URL pattern for gem downloads from RubyGems.org is /gems/NAME-VERSION.gem, and we were able to utilize the NAME_PATTERN and Gem::Version::VERSION_PATTERN to confidently match the values we were interested in.
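
To show the idea behind the /gems/NAME-VERSION.gem match, here is a simplified Python sketch. It is not the derived column definition used in the dataset, and the regular expression is only an approximation of the Ruby patterns named above: it assumes a version starts with a digit and consists of dot-separated alphanumeric segments, and it does not handle platform-suffixed gem files.

```python
import re

# Simplified approximation, not the real NAME_PATTERN or VERSION_PATTERN.
# The name is matched lazily so that hyphenated names such as "rspec-core"
# end at the last hyphen that is followed by a version.
GEM_URL = re.compile(
    r"^/gems/(?P<name>[A-Za-z0-9._-]+?)"
    r"-(?P<version>[0-9]+(?:\.[0-9A-Za-z]+)*)"
    r"\.gem$"
)


def parse_gem_download(url):
    """Return (gem_name, gem_version) for a download URL, else (None, None)."""
    match = GEM_URL.match(url)
    if match is None:
        return None, None  # not a gem download, e.g. an /api or /info request
    return match.group("name"), match.group("version")


hyphenated = parse_gem_download("/gems/rspec-core-3.12.0.gem")
prerelease = parse_gem_download("/gems/rails-7.1.0.rc1.gem")
metadata = parse_gem_download("/api/v1/dependencies")
```

Because gem names may themselves contain hyphens and digits, the split between name and version cannot be a simple "split on the last hyphen" in every case, which is one reason to lean on well-tested patterns rather than an ad hoc rule.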