Profiling Queries (phase 1)

Schema

To answer the profiling queries, you need to extend the dataframe returned by the AUT .webpages() method, as shown in the extended schema below.

Note that:

  • all new columns are derived from the url column.
  • the derivation can be done using the tldextract and urllib.parse libraries.

Extended Schema

df.webpages() 
|
|-- crawl_date                  
|-- mime_type_web_server        
|-- mime_type_tika              
|-- language                    
|-- content                     
|-- url:                         http://forums.news.cnn.com:80/
    |-- url_host_name:           forums.news.cnn.com
    |-- url_domain:              cnn
    |-- url_subdomain:           forums.news
    |-- url_tld:                 com
    |-- url_registered_domain:   cnn.com
    |-- url_domain_reversed:     com.cnn.news.forums
    |-- url_protocol:            http
    |-- url_host_port:           80
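A minimal sketch of the derivation, using only urllib.parse and a naive right-split of the host name (tldextract, mentioned above, handles multi-part suffixes such as .co.uk correctly and should be preferred in practice):

```python
from urllib.parse import urlparse

def derive_url_columns(url):
    """Derive the extended-schema columns from a url string."""
    parsed = urlparse(url)
    host = parsed.hostname                    # forums.news.cnn.com
    parts = host.split(".")
    tld = parts[-1]                           # naive: single-label suffix only
    domain = parts[-2]
    return {
        "url_host_name": host,
        "url_domain": domain,
        "url_subdomain": ".".join(parts[:-2]),
        "url_tld": tld,
        "url_registered_domain": f"{domain}.{tld}",
        "url_domain_reversed": ".".join(reversed(parts)),
        "url_protocol": parsed.scheme,
        "url_host_port": parsed.port,
    }

cols = derive_url_columns("http://forums.news.cnn.com:80/")
```

Applied to the example URL, this reproduces the values shown in the schema (e.g., url_registered_domain = cnn.com, url_domain_reversed = com.cnn.news.forums).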

Optional (see HTTP header fields)

    url_host_ip         192.168.1.10                                cf. log.txt file
    content_length      Content-Length: 348                         cf. content’s HTTP header 
    content_charset     Content-Type: text/html; charset=utf-8      cf. content’s HTTP header
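If the raw HTTP headers are available, the two optional content columns can be extracted with simple regular expressions. A hedged sketch, where the header block is just the example values from the table above:

```python
import re

# example HTTP header lines (taken from the table above)
headers = ("Content-Type: text/html; charset=utf-8\r\n"
           "Content-Length: 348\r\n")

m_len = re.search(r"Content-Length:\s*(\d+)", headers)
m_cs = re.search(r"charset=([\w-]+)", headers)

content_length = int(m_len.group(1)) if m_len else None
content_charset = m_cs.group(1) if m_cs else None
```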

Queries

The following queries must be computed using the WARC data collection of your choice.

Domain names

  1. Which are the top registered domains (see example)?

    • number of URLs per domain
    • percentage of the collection's URLs per domain
  2. Which are the top TLDs and gTLDs (see example)?

  3. Identify domains having more than one TLD (e.g., amazon/.com/.fr).

  4. Identify subdomains for each domain.
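The first three domain queries can be sketched in plain Python over the derived columns (the value lists below are toy data, an assumption; in practice these would be group-by/count aggregations over the dataframe):

```python
from collections import Counter

# queries 1-2: top registered domains, counts and collection share
domains = ["cnn.com", "cnn.com", "cnn.com", "bbc.co.uk"]  # toy data
counts = Counter(domains)
share = {d: round(100 * n / len(domains), 1) for d, n in counts.items()}

# query 3: domains appearing under more than one TLD
pairs = [("amazon", "com"), ("amazon", "fr"), ("cnn", "com")]  # toy data
tlds = {}
for dom, tld in pairs:
    tlds.setdefault(dom, set()).add(tld)
multi_tld = [d for d, s in tlds.items() if len(s) > 1]
```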

URLs

  1. Compute the distribution of URL components. For instance:

    • http/https
    • port number
  2. Count the number of words used in URLs

    • e.g., split URLs at . (dot) and - (hyphen)
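Query 2 can be sketched as below. The split characters follow the hint above; also treating ':' and '/' as delimiters, and stripping the scheme first, are assumptions:

```python
import re

def url_words(url):
    """Split a URL into 'words' at dots, hyphens, and path separators."""
    body = re.sub(r"^[a-z]+://", "", url)     # drop the scheme
    return [w for w in re.split(r"[.\-/:]", body) if w]

n_words = len(url_words("http://forums.news.cnn.com:80/top-stories"))
```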

Web pages content

  1. List the languages used in the webpages.

  2. Compute the distribution of: 

  3. Identify page titles (optional)
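A sketch of the language distribution over webpages (the language list is toy data, an assumption; in AUT this is the language column of the dataframe):

```python
from collections import Counter

languages = ["en", "en", "en", "fr"]  # toy stand-in for df.language

dist = {lang: n / len(languages) for lang, n in Counter(languages).items()}
```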

Collection & Hosts

  1. How many pages, per language, were collected per host?

  2. What is the total number of bytes that was collected by the crawler for each host?

  3. What is the ratio between the webpages and Web resources that were collected per host? (optional)

  4. For each host, list the server IPs with which the crawler interacted. Then, determine whether custom domains point to any of these hosts (optional)
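Queries 1 and 2 can be sketched as group-by aggregations. The rows below are toy (url_host_name, language, content_length) tuples, an assumption, with content_length taken from the optional columns above:

```python
from collections import Counter, defaultdict

rows = [                       # toy (host, language, content_length) data
    ("cnn.com", "en", 348),
    ("cnn.com", "en", 512),
    ("bbc.co.uk", "en", 1024),
]

# query 1: pages per (host, language) pair
pages = Counter((host, lang) for host, lang, _ in rows)

# query 2: total bytes collected per host
bytes_per_host = defaultdict(int)
for host, _, length in rows:
    bytes_per_host[host] += length
```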

Images

  1. Identify the largest/smallest image in the dataset (width x height)

  2. Identify the images appearing in the data collection under different names (i.e., images having the same MD5 hash)

  3. Find images shared between more than 2 domains (example)
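Queries 2 and 3 both reduce to grouping images by their MD5 hash. A sketch over toy (md5, filename, domain) records (all values below are assumptions):

```python
from collections import defaultdict

images = [                      # toy (md5, filename, domain) records
    ("a1", "logo.png",  "cnn.com"),
    ("a1", "brand.png", "bbc.co.uk"),
    ("a1", "logo.png",  "foo.org"),
    ("b2", "x.jpg",     "cnn.com"),
]

names, domains = defaultdict(set), defaultdict(set)
for md5, name, domain in images:
    names[md5].add(name)
    domains[md5].add(domain)

# query 2: same content appearing under different file names
renamed = [h for h, s in names.items() if len(s) > 1]
# query 3: images shared between more than 2 domains
shared = [h for h, s in domains.items() if len(s) > 2]
```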

Web graph

  1. Identify domains with strong/weak connectivity
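One simple notion of connectivity is a domain's degree in the domain-level link graph: strongly connected domains have many edges, weakly connected ones have few. A sketch over a toy edge list (an assumption):

```python
from collections import Counter

edges = [("cnn.com", "bbc.co.uk"),   # toy domain-level link edges
         ("cnn.com", "foo.org"),
         ("cnn.com", "bar.net")]

degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

strongest = degree.most_common(1)[0][0]
```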

Multimedia

  1. Compute the statistics of the multimedia files within the data collection (optional)