Profiling Queries (phase 1)

Schema

To answer the profiling queries, you need to extend the dataframe returned by the AUT .webpages() method, as shown in the extended schema below.

Note that:

  • all new columns are derived from the url column.
  • the derivation can be done using the tldextract and urllib.parse libraries.

Extended Schema

df.webpages() 
|
|-- crawl_date                  
|-- mime_type_web_server        
|-- mime_type_tika              
|-- language                    
|-- content                     
|-- url:                         http://forums.news.cnn.com:80/
    |-- url_host_name:           forums.news.cnn.com
    |-- url_domain:              cnn
    |-- url_subdomain:           forums.news
    |-- url_tld:                 com
    |-- url_registered_domain:   cnn.com
    |-- url_domain_reversed:     com.cnn.news.forums
    |-- url_protocol:            http
    |-- url_host_port:           80
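A minimal sketch of the derivation, using only urllib.parse and a naive right-split of the host name (tldextract, mentioned above, handles multi-part suffixes such as .co.uk correctly and should be preferred in practice):

```python
from urllib.parse import urlparse

def derive_url_columns(url):
    """Derive the extended-schema columns from a url string."""
    parsed = urlparse(url)
    host = parsed.hostname                    # forums.news.cnn.com
    parts = host.split(".")
    tld = parts[-1]                           # naive: single-label suffix only
    domain = parts[-2]
    return {
        "url_host_name": host,
        "url_domain": domain,
        "url_subdomain": ".".join(parts[:-2]),
        "url_tld": tld,
        "url_registered_domain": f"{domain}.{tld}",
        "url_domain_reversed": ".".join(reversed(parts)),
        "url_protocol": parsed.scheme,
        "url_host_port": parsed.port,
    }

cols = derive_url_columns("http://forums.news.cnn.com:80/")
```

Applied to the example URL, this reproduces the values shown in the schema (e.g., url_registered_domain = cnn.com, url_domain_reversed = com.cnn.news.forums).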

Optional (see HTTP header fields)

    url_host_ip         192.168.1.10                                cf. log.txt file
    content_length      Content-Length: 348                         cf. content’s HTTP header 
    content_charset     Content-Type: text/html; charset=utf-8      cf. content’s HTTP header
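If the raw HTTP headers are available, the two optional content columns can be extracted with simple regular expressions. A hedged sketch, where the header block is just the example values from the table above:

```python
import re

# example HTTP header lines (taken from the table above)
headers = ("Content-Type: text/html; charset=utf-8\r\n"
           "Content-Length: 348\r\n")

m_len = re.search(r"Content-Length:\s*(\d+)", headers)
m_cs = re.search(r"charset=([\w-]+)", headers)

content_length = int(m_len.group(1)) if m_len else None
content_charset = m_cs.group(1) if m_cs else None
```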

Queries

The following queries must be computed using the WARC data collection of your choice.

Domain names

  1. Which are the top registered domains (see example)?

    • number of URLs per domain
    • percentage of the collection's URLs per domain
  2. Which are the top TLDs and gTLDs (see example)?

  3. Identify domains having more than one TLD (e.g., amazon/.com/.fr).

  4. Identify subdomains for each domain.
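The first three domain queries can be sketched in plain Python over the derived columns (the value lists below are toy data, an assumption; in practice these would be group-by/count aggregations over the dataframe):

```python
from collections import Counter

# queries 1-2: top registered domains, counts and collection share
domains = ["cnn.com", "cnn.com", "cnn.com", "bbc.co.uk"]  # toy data
counts = Counter(domains)
share = {d: round(100 * n / len(domains), 1) for d, n in counts.items()}

# query 3: domains appearing under more than one TLD
pairs = [("amazon", "com"), ("amazon", "fr"), ("cnn", "com")]  # toy data
tlds = {}
for dom, tld in pairs:
    tlds.setdefault(dom, set()).add(tld)
multi_tld = [d for d, s in tlds.items() if len(s) > 1]
```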

URLs

  1. Compute the distribution of URL components. For instance:

    • http/https
    • port number
  2. Count the number of words used in URLs

    • e.g., split URLs at . (dot) and - (hyphen)
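Query 2 can be sketched as below. The split characters follow the hint above; also treating ':' and '/' as delimiters, and stripping the scheme first, are assumptions:

```python
import re

def url_words(url):
    """Split a URL into 'words' at dots, hyphens, and path separators."""
    body = re.sub(r"^[a-z]+://", "", url)     # drop the scheme
    return [w for w in re.split(r"[.\-/:]", body) if w]

n_words = len(url_words("http://forums.news.cnn.com:80/top-stories"))
```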

Web pages content

  1. List the languages used in the webpages.

  2. Compute the distribution of: 

  3. Identify page titles (optional)
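A sketch of the language distribution over webpages (the language list is toy data, an assumption; in AUT this is the language column of the dataframe):

```python
from collections import Counter

languages = ["en", "en", "en", "fr"]  # toy stand-in for df.language

dist = {lang: n / len(languages) for lang, n in Counter(languages).items()}
```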

Collection & Hosts

  1. How many pages, per language, were collected per host?

  2. What is the total number of bytes that was collected by the crawler for each host?

  3. What is the ratio between the webpages and Web resources that were collected per host? (optional)

  4. For each host, list the server IPs with which the crawler interacted. Then, determine whether custom domains point to any of these hosts (optional)
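Queries 1 and 2 can be sketched as group-by aggregations. The rows below are toy (url_host_name, language, content_length) tuples, an assumption, with content_length taken from the optional columns above:

```python
from collections import Counter, defaultdict

rows = [                       # toy (host, language, content_length) data
    ("cnn.com", "en", 348),
    ("cnn.com", "en", 512),
    ("bbc.co.uk", "en", 1024),
]

# query 1: pages per (host, language) pair
pages = Counter((host, lang) for host, lang, _ in rows)

# query 2: total bytes collected per host
bytes_per_host = defaultdict(int)
for host, _, length in rows:
    bytes_per_host[host] += length
```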

Images

  1. Identify the largest/smallest image in the dataset (width x height)

  2. Identify the images appearing in the data collection under different names (i.e., images having the same MD5 hash)

  3. Find images shared between more than 2 domains (example)
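Queries 2 and 3 both reduce to grouping images by their MD5 hash. A sketch over toy (md5, filename, domain) records (all values below are assumptions):

```python
from collections import defaultdict

images = [                      # toy (md5, filename, domain) records
    ("a1", "logo.png",  "cnn.com"),
    ("a1", "brand.png", "bbc.co.uk"),
    ("a1", "logo.png",  "foo.org"),
    ("b2", "x.jpg",     "cnn.com"),
]

names, domains = defaultdict(set), defaultdict(set)
for md5, name, domain in images:
    names[md5].add(name)
    domains[md5].add(domain)

# query 2: same content appearing under different file names
renamed = [h for h, s in names.items() if len(s) > 1]
# query 3: images shared between more than 2 domains
shared = [h for h, s in domains.items() if len(s) > 2]
```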

Web graph

  1. Identify domains with strong/weak connectivity
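One simple notion of connectivity is a domain's degree in the domain-level link graph: strongly connected domains have many edges, weakly connected ones have few. A sketch over a toy edge list (an assumption):

```python
from collections import Counter

edges = [("cnn.com", "bbc.co.uk"),   # toy domain-level link edges
         ("cnn.com", "foo.org"),
         ("cnn.com", "bar.net")]

degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

strongest = degree.most_common(1)[0][0]
```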

Multimedia

  1. Compute the statistics of the multimedia files within the data collection (optional)