Collector

Living Standard,

This version:
https://mellonscholarlycommunication.github.io/spec-collector
Editors:
(meemoo - Flemish Institute for Archiving)
(IDLab - Ghent University)
(IDLab - Ghent University)

Abstract

This document specifies the implementation requirements for the Collector component.

1. Set of documents

This document is one of the specifications produced by the ResearcherPod and ErfgoedPod projects:

  1. Overview

  2. Orchestrator

  3. Data Pod

  4. Rule language

  5. Artefact Lifecycle Event Log

  6. Collector (this document)

2. Introduction

The Collector is an Autonomous Agent that traverses the scholarly network to gather event information on specific artefacts. Its task is to gather all useful information it can and to rank it according to the preferences set by the actor running the collector instance. The main use case of this component is to complement existing third-party indexes for scholarly artefacts such as Google Scholar and arXiv. The component collects information from the Web by crawling the network, taking into account the actor’s preferences regarding trusted sources. The Collector’s discovery process requires the materialization of three capabilities: selection, ranking, and verification of retrieved artefact Event information.

3. Definitions

This document uses the following defined terms from Overview of the ResearcherPod specifications § definitions:

4. Collector interface

A Collector component MUST be deployable as a local background process or as a remote web service. A Collector component MUST provide an interface on initialization through which the initializing actor can set the starting parameters for the component. This interface MUST include the possibility to set one or more target artefact URIs, for which the component must crawl the network to discover Event information. Additionally, the interface MUST support setting an initial set of data sources the component may crawl, mapped to a trust score for each data source indicating the actor’s trust in that source. On initialization, the collector automatically starts the collection process. This provisioning MUST be possible using a PUT or POST to a dedicated [HTTP1.1] resource if the collector is deployed as a remote web service.
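For example, the initialization parameters described above could be captured in a small interface like the following. This is a non-normative sketch; the type and property names are illustrative and not mandated by this specification.

// Hypothetical shape of the Collector initialization parameters.
interface CollectorInit {
  /** Target artefact URIs for which Event information must be collected. */
  targets: string[];
  /** Initial data sources, each mapped to a trust score (here assumed to lie in [0, 1]). */
  datasources: Record<string, number>;
  /** Optional actor-defined filters constraining collection and ranking. */
  filters?: unknown[];
}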

5. The Collection Process

The Collector component gathers Artefact event information in the decentralized scholarly communication network. Its main goal is the collection of both Lifecycle Event and Interaction Event information on the target artefacts in the network. Based on the parameters passed on initialization, the component crawls the network for this information. The component can be initialized with no target artefacts, in which case it SHOULD crawl the network and return all information for all discovered artefacts to the initializing actor.

5.1. Selecting

The first step in the selection process is selecting the data sources which the collector instance will crawl. Initially, the available sources include the URI of the artefact for which information is sought, as well as the set of available data sources and their trust scores assigned by the actor. During the crawling process, the collector will come across new links that may lead to new data sources. These data sources MUST be added to the sources index with a trust score based on the user preferences or derived from the data source in which they were discovered. The actor initializing the Collector component SHOULD be able to set the logic for assigning these trust scores on initialization. The algorithm deciding in which order the data sources are crawled MUST be based on both the trust score of the data sources and the type of data expected to be found in the data source. Any subsequent or concurrent collection tasks with other target URIs SHOULD make use of the updated sources list created through other collection tasks to speed up discovery. An example algorithm for the selection process is given in § 6.1 The selection algorithm.
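For example, the derivation of a trust score for a newly discovered data source could be sketched as follows. The damping factor and the derivation rule are assumptions for illustration; the actual logic is set by the initializing actor.

// Sources index: data source URI mapped to a trust score (assumed in [0, 1]).
const sources = new Map<string, number>();

// A newly discovered source inherits a damped fraction of the trust score of
// the source it was found in, unless the actor's preferences already assign
// it an explicit score. Both the damping factor and this rule are illustrative.
function addDiscoveredSource(
  discovered: string,
  foundIn: string,
  explicitScores: Map<string, number>,
  damping = 0.5,
): void {
  if (sources.has(discovered)) return; // already indexed
  const inherited = (sources.get(foundIn) ?? 0) * damping;
  sources.set(discovered, explicitScores.get(discovered) ?? inherited);
}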

5.2. Ranking

On crawling the data sources, Event information may be discovered on the target artefact. These discovered Events MUST be displayed and may be ordered according to their resulting trust factor, in combination with the other filters defined by the actor. The trust factor given to an Event MUST initially be deduced from the trust factor of the data source on which it was discovered. An example ranking algorithm is given in § 6.2 The ranking algorithm. On verification of the Event in the verification step, this trust score MAY be adapted based on the trust in the service that created the event and with which it was verified.

5.3. Verifying

After the retrieval of an Event, the collector MAY choose to try to verify the authenticity of the discovered event information. This functionality MUST be available either automatically or through an interface where the actor can designate specific Events for the collector to verify. Event information can be validated according to the algorithm described in § 6.3 The validation algorithm. The user should be warned of data corruption of any kind. We define corruption as any truncation of or alteration to an artefact or its metadata. In the case of versioned data, the verification step MUST take into account the versions of the resources, and verify against the specific versions of the resources that are defined in the data.

6. The proposed algorithms

6.1. The selection algorithm

The Collector selection algorithm has the task of discovering and crawling the available data sources for artefact Event information on the network. The algorithm MUST

6.1.1. Implementation

  1. Dereference the artefact, discovering any defined Event Log instances according to the Event Log spec.

  2. List all found Event instances.

  3. Discover the data sources of all event instances, and add all data sources to the list of data sources. The trust factor of these sources MAY be adapted from the trust factor of the current data source.

  4. Repeat until a cutoff threshold value is reached:

    • Pick the data source with the highest trust score from the priority queue (the score may be adapted using additional filters).

    • Dereference the data source and add the Event information to the listing.

    • Add all discovered data sources to the priority queue.

The discovery of data sources in an Event Log happens by taking the origin of all listed Events.
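For example, the steps above could be realized along the following lines. This sketch is non-normative: the data structures, the trust derivation rule and the cutoff criterion are assumptions, and the dereferencing of data sources is abstracted behind a caller-supplied function.

interface EventInfo {
  origin: string; // data source in which the Event was found
  trust: number;  // trust factor, initially that of its origin
  data: unknown;  // the Event payload itself
}

// Supplied by the caller, e.g. an HTTP GET that parses an Event Log and returns
// the Events it lists together with any further data sources it links to.
type Dereference = (source: string) => Promise<{ events: EventInfo[]; sources: string[] }>;

async function collect(
  artefact: string,
  trust: Map<string, number>,
  deref: Dereference,
  threshold = 0.1,
): Promise<EventInfo[]> {
  if (!trust.has(artefact)) trust.set(artefact, 1); // step 1: start from the artefact itself
  const queue = [artefact];
  const seen = new Set(queue);
  const collected: EventInfo[] = [];

  while (queue.length > 0) {
    // step 4a: pick the data source with the highest trust score
    queue.sort((a, b) => (trust.get(b) ?? 0) - (trust.get(a) ?? 0));
    const source = queue.shift()!;
    const score = trust.get(source) ?? 0;
    if (score < threshold) break; // cutoff threshold reached

    // step 4b: dereference the source and add its Event information to the listing
    const { events, sources } = await deref(source);
    collected.push(...events.map((e) => ({ ...e, trust: score })));

    // step 4c: add newly discovered data sources to the priority queue,
    // deriving their trust factor from the current source
    for (const s of sources) {
      if (!seen.has(s)) {
        seen.add(s);
        if (!trust.has(s)) trust.set(s, score * 0.5);
        queue.push(s);
      }
    }
  }
  return collected;
}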

6.2. The ranking algorithm

The Collector ranking algorithm has the task of ranking the discovered Events connected to the queried artefact. This ranking is adaptable, as entity ranking may change on verification of a discovered event or on adaptation of the data sources’ trust scores used by the algorithm. The algorithm used (and its parameters) must be adaptable through the collector interface.
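For example, a ranking could combine an Event’s trust factor with actor-defined filter scores through adjustable weights, so that changing the parameters through the collector interface simply re-sorts the listing. The scoring formula below is an assumption for illustration, not part of this specification.

interface RankingParams {
  trustWeight: number;                 // weight given to the trust factor
  filterWeights: Map<string, number>;  // weights for actor-defined filters
}

interface DiscoveredEvent {
  trust: number;                       // trust factor of the Event
  filterScores: Map<string, number>;   // per-filter scores, assumed in [0, 1]
}

// Rank Events by a weighted sum of their trust factor and filter scores.
function rank(events: DiscoveredEvent[], params: RankingParams): DiscoveredEvent[] {
  const score = (e: DiscoveredEvent): number => {
    let s = params.trustWeight * e.trust;
    for (const [name, weight] of params.filterWeights) {
      s += weight * (e.filterScores.get(name) ?? 0);
    }
    return s;
  };
  return [...events].sort((a, b) => score(b) - score(a));
}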

6.3. The validation algorithm

The validation method used will depend on the available data.

6.3.1. Implementation

  1. Discover the origin of an Event

  2. If the Event has a digital signature, verify the signature with the origin

    • If the signature verifies, confirm the verification of the event.

    • If the signature does not verify, flag Event as tampered with.

    • If the origin is not available, mark as unresolved. These events can be verified by discovering matching events in archived versions.

  3. Index the Event. If an identical Event is encountered, mark it as validated.
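For example, the steps above could be implemented along the following lines. The signature scheme is not defined by this document, so the cryptographic check is abstracted behind a caller-supplied function; the status values are illustrative.

type VerificationStatus = "verified" | "tampered" | "unresolved" | "unchecked";

interface CollectedEvent {
  id: string;          // identifier of the Event
  origin: string;      // URI of the Event's origin
  signature?: string;  // optional digital signature over the Event contents
  body: string;        // serialized Event contents
}

// Supplied by the caller: checks the signature against key material published
// by the origin; `null` signals that the origin could not be resolved.
type VerifySignature = (e: CollectedEvent) => Promise<boolean | null>;

async function validate(
  e: CollectedEvent,
  verify: VerifySignature,
  index: Map<string, CollectedEvent>,
): Promise<VerificationStatus> {
  let status: VerificationStatus = "unchecked";
  if (e.signature) {
    const result = await verify(e);              // step 2: verify the signature with the origin
    if (result === null) status = "unresolved";  // origin not available
    else status = result ? "verified" : "tampered";
  }
  // step 3: index the Event; an identical Event discovered elsewhere corroborates it
  const known = index.get(e.id);
  if (known && known.body === e.body && status !== "tampered") status = "verified";
  index.set(e.id, e);
  return status;
}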

Any Event MUST have a defined origin. To validate an Event, the first step MUST be to dereference the Event origin. The validation algorithm MUST include

Multiple methods of verification can be used:

Note: 
These mechanisms provide solutions for verifying the origin and contents of an event found in an event log.
In case of an event not mentioned in the event log of an actor, the collector has no way of discovering it.

7. Deploying a collector

A Collector MUST be deployable as a local background process or as a remote web service. In case of the latter, an actor SHOULD be able to spawn, initialize and trigger the instance over [HTTP1.1].
POST /test HTTP/1.1
Host: collector.service
Content-Type: application/json
Accept: application/json

{
  "target": "http://target.artefact.uri",
  "datasources": {
    "http://data.source/1": 1,
    "http://data.source/2": 0.8,
    "http://data.source/3": 0.5,
    "http://data.source/4": 0
  },
  "filters": [
    ...
  ]
}

If deployed as a local background process, a (custom) API MUST be present that is able to perform these actions.
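For example, a local deployment could expose a programmatic counterpart of the HTTP interface shown above. The method names below are illustrative; the actual API is implementation-defined.

// Illustrative programmatic counterpart of the HTTP interface.
interface CollectorApi {
  /** Start a collection run for the given target, data sources and filters. */
  start(init: { target: string; datasources: Record<string, number>; filters?: unknown[] }): Promise<void>;
  /** Retrieve the current ranked listing of discovered Events. */
  results(): Promise<unknown[]>;
}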

8. Updating collector targets

9. Notes

9.1. Deleted Events

10. Spec roadmap

  1. Consolidate discovery and verification algorithms

  2. Specify interfaces for component

    • creation interface

    • running interface

      • retrieve current results + ranking

      • update targets ? - new process?

  3. Create flow graphs for the different use cases.

  4. Specify implementation details.

Appendix A: Implementation details

Retrieving inbox notifications

Observing LDP resource state updates

Time based trigger implementations

11. Acknowledgement

We thank Herbert Van de Sompel, DANS + Ghent University, hvdsomp@gmail.com for the valuable input during this project.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[SPEC-OVERVIEW]
Miel Vander Sande; et al. Overview of the ResearcherPod specifications. Editor’s Draft. URL: http://mellonscholarlycommunication.github.io/spec-overview/