The Web & the Digital Humanities

Motivation

The amount of cultural information generated every day on the Web presents new opportunities for historians, political scientists, sociologists, linguists, computer scientists, and other scholars. Much of this information is captured in web archives (WARC files) created by organizations such as the Internet Archive, the BnF, and the British Library. By collaborating with these institutions, scholars from the digital humanities can use web archives to study human cultural production. For instance:

  • To study Latin American Women’s Rights Movements through Language, Time and Space.
  • To compare the way countries conduct online political campaigns.
  • To analyze public opinion on social media platforms (e.g., #metoo).

Due to the massive volume and variety of information available in web archives (e.g., webpages, PDFs, images, videos, audio), manually exploring them is a complex and time-consuming task. In this context, data science and Big Data technologies offer an opportunity to extract value from large collections of WARC files such as Common Crawl.

Objective

Through this datathon, participants will:

  • Develop awareness of the new analytical requirements, arising from the datafication phenomenon, that call for data science expertise.

  • Put their Big Data technical skills into practice to analyse and explore web archives produced in the context of the LINFRANUM (Cartographie de la Littérature Française Numérique, a mapping of French digital literature) project.

  • Develop storytelling skills to explain analytical pipelines to both technical and non-technical audiences.

The datathon will make heavy use of Apache Spark and the Archives Unleashed Toolkit, a project that makes petabytes of historical internet content accessible to scholars and others interested in researching the recent digital past.
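
As a first taste of the toolkit, the sketch below loads a collection of WARC files and extracts the crawled webpages as a Spark DataFrame. It assumes a notebook or PySpark shell started with the toolkit's aut package, where sc and sqlContext are already provided; the WARC path is a placeholder.

    # Minimal sketch: load WARC files with the Archives Unleashed Toolkit's
    # Python API ("aut"). Assumes `sc` (SparkContext) and `sqlContext` are
    # provided by the Spark notebook/shell; the path is a placeholder.
    from aut import WebArchive

    archive = WebArchive(sc, sqlContext, "/path/to/warcs/")

    # Derive a tabular view of the crawled webpages.
    webpages = archive.webpages()
    webpages.printSchema()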

Learning Outcomes

Regarding Spark, you will be able to:

  • Set up Spark in an externalised computing environment and schematically explain the underlying functional architecture.
  • Transform unstructured data collections into a tabular representation (i.e., make sure that you can explain the necessity of this transformation and its benefits when working with Spark).
  • Design and explain a data exploration pipeline using Spark operators (see the sketch after this list).
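
To make these outcomes concrete, here is a minimal sketch of such a pipeline. It assumes a derived webpages table with columns such as domain and language; the parquet path and column names are illustrative and may differ in your collection.

    # Sketch of a data exploration pipeline over a derived webpages table.
    # The parquet path and column names are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Prepare the Spark environment: the session is the entry point to the
    # cluster (driver plus executors).
    spark = SparkSession.builder.appName("warc-exploration").getOrCreate()

    # Tabular representation of formerly unstructured WARC records.
    webpages = spark.read.parquet("/path/to/derived/webpages.parquet")

    # Exploration pipeline built from Spark operators: keep French-language
    # pages, count pages per domain, and rank the domains.
    result = (
        webpages
        .filter(F.col("language") == "fr")
        .groupBy("domain")
        .count()
        .orderBy(F.desc("count"))
    )
    result.show(10)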

Regarding data storytelling, you will be able to:

  • Import visualization tools into your notebook.
  • Produce code to feed visualization functions (see the sketch after this list).
  • Provide textual insights that describe the input data and interpret the results of data processing.
  • Understand the architecture behind your notebook, including the storage services, Spark data processing, and the visualization tool(s).
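
Here is a minimal sketch of that last step, assuming result is the small, aggregated DataFrame produced by the pipeline sketch above; only aggregates should be collected into the notebook, never raw WARC records.

    # Feed a visualization function with Spark results. Assumes `result`
    # is the aggregated DataFrame from the pipeline sketch above.
    import matplotlib.pyplot as plt

    # Collect only the small aggregate to the driver as a pandas DataFrame.
    top_domains = result.limit(10).toPandas()

    # Plot the top domains by number of pages.
    top_domains.plot.barh(x="domain", y="count", legend=False)
    plt.xlabel("Number of pages")
    plt.title("Top 10 domains in the collection")
    plt.tight_layout()
    plt.show()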