CS615A -- Aspects of System Administration - Scripting Exercise

Text Processing

The Wikimedia project provides analytics data files for the page views of its projects, including the popular Wikipedia. This data is available from https://dumps.wikimedia.org/, with page views at https://dumps.wikimedia.org/other/pageviews/.

The format of these files is documented in more detail elsewhere; the files we are interested in consist of these four space-separated fields:

domain_code page_title count_views total_response_size
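
As an illustration of the format above, one line can be split into its four fields as follows. This is only a minimal sketch: the names PageView and parse_line are hypothetical, the sample line is made up, and both numeric fields are assumed to be plain non-negative integers.

```python
from typing import NamedTuple


class PageView(NamedTuple):
    """One record of an hourly pageviews file (hypothetical helper type)."""
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line: str) -> PageView:
    """Split one space-separated line into its four fields.

    Raises ValueError if the line does not have exactly four fields
    or if the numeric fields are not integers.
    """
    domain, title, views, size = line.rstrip("\n").split(" ")
    return PageView(domain, title, int(views), int(size))


# Made-up sample line, not taken from a real dump file.
record = parse_line("en Main_Page 42 0\n")
print(record.page_title, record.count_views)
```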

Using this format, fetch a file for a specific hour (e.g., the file for the hour starting at 18:00 UTC on March 29th, 2020) and answer the following questions:

  • How many unique objects were requested?
  • How many unique objects were requested for the en domain code (English Wikipedia) only?
  • Which object was requested most often?
  • How many requests per second were handled during this hour?
  • How much data was transferred in total?
  • Which was the largest object requested?
  • What is the longest word found on the ten most frequently retrieved English Wikipedia pages?
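
Most of the questions above reduce to a single pass over the file. The following is one possible sketch, not a reference solution. It assumes that an "object" is a (domain_code, page_title) pair, that total_response_size is the total number of bytes served for that object during the hour, and that malformed lines can be skipped; the function name summarize is hypothetical. The last question requires fetching the pages themselves and is not covered here.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def summarize(lines):
    """Aggregate an iterable of pageview lines into a dict of answers."""
    views = Counter()  # (domain_code, page_title) -> total views
    sizes = Counter()  # (domain_code, page_title) -> total bytes
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip malformed lines
        domain, title, count, size = fields
        views[(domain, title)] += int(count)
        sizes[(domain, title)] += int(size)

    total_views = sum(views.values())
    return {
        "unique_objects": len(views),
        "unique_en_objects": sum(1 for d, _ in views if d == "en"),
        "most_requested": max(views, key=views.get) if views else None,
        "requests_per_second": total_views / SECONDS_PER_HOUR,
        "total_bytes": sum(sizes.values()),
        "largest_object": max(sizes, key=sizes.get) if sizes else None,
    }


# Made-up sample data; a real file would be read line by line,
# e.g. with gzip.open(path, "rt", encoding="utf-8").
sample = ["en Main_Page 3000 500", "en Foo 600 900", "de Bar 0 100"]
print(summarize(sample))
```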

Can you generalize these questions to make your answer more flexible for other areas of interest?
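
One way to approach such a generalization is to turn the fixed choices (the domain code, the number of results) into parameters. The sketch below is only an illustration under that assumption; the function name top_pages and its parameters are hypothetical.

```python
import heapq


def top_pages(lines, domain=None, n=10):
    """Return the n most viewed (count_views, domain_code, page_title)
    tuples, optionally restricted to a single domain code."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip malformed lines
        code, title, count, _size = fields
        if domain is None or code == domain:
            rows.append((int(count), code, title))
    # nlargest sorts by the first tuple element (the view count) first.
    return heapq.nlargest(n, rows)


# Made-up sample data for illustration only.
sample = ["en A 5 0", "de B 9 0", "en C 7 0"]
print(top_pages(sample, domain="en", n=1))
```

Passing domain=None considers every project, so the same function can serve questions about other languages or about all projects combined.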
