02 Feb


The rapid growth and fragmented character of social media (Twitter, Facebook, blogs, etc.) and publicly available structured data (e.g., Linked Open Data) have made it challenging to extract knowledge from such noisy, multilingual sources in a robust, scalable and accurate manner. Existing approaches often fail when encountering unpredictable input, a shortcoming that is amplified by the lack of suitable training and evaluation gold standards. The goal of this interdisciplinary project is to address these challenges by developing new methods arising from the Human Computation (HC) paradigm, which harnesses collective intelligence to augment automated methods. Embedding HC in the emerging discipline of Web Science, however, is far from trivial, especially when aiming to extract knowledge from heterogeneous, noisy, and multilingual data: Which knowledge artefacts are best suited for HC-based acquisition? How can complex knowledge extraction be broken down into a series of simple HC tasks? How can noisy input be used to train a knowledge extraction algorithm? Aiming to solve these problems in an effective and lasting fashion, the uComp project has the following four objectives:

1. Develop a generic, configurable and reusable HC framework. Using HC for knowledge extraction requires significant research effort from several scientific disciplines and poses challenges in terms of scalability, accuracy, and feasibility. uComp will break new ground by creating a reusable framework with an extensible set of knowledge acquisition tasks. This generic HC framework will include new empirical and formal methods for (i) HC process configuration; (ii) engagement, profiling and incentivisation of human contributors; (iii) reliability monitoring, cheating prevention and quality control.

2. Address challenges of dealing with noisy data. HC-based approaches for knowledge extraction raise issues of quality control (inter-human agreement, monitoring the quality of acquired knowledge), aggregation of noisy input data (reconciliation, provenance) and learning from this data to optimise algorithms. Some of these issues are already addressed within knowledge acquisition and language processing infrastructures, as long as small groups of highly skilled human experts are involved. However, as we move towards large-scale HC processing, further experimentation is required on how best to acquire high-quality resources from relatively unskilled contributors, who have little training and are self-directed.

3. Embed human computation into knowledge extraction workflows. uComp will study new empirical and formal methods for integrating HC tasks into complex workflows, a novel knowledge extraction approach that we term Embedded Human Computation (EHC). This approach will go beyond knowledge acquisition and support other key steps in the knowledge processing life cycle. Comprehensive methodological and algorithmic support will be achieved through embedding the HC framework into mature, open-source knowledge extraction tools, to be coupled with new pattern discovery methods. We will maximise impact by using appropriate knowledge-encoding standards, addressing ethical and legal issues, and leveraging the research communities of established open-source toolkits.

4. Evaluate EHC performance. uComp will evaluate flexibility, generalisability, accuracy and scalability of EHC for acquiring factual and affective knowledge. A rigorous evaluation process based on objective and reproducible experiments will measure the quality of HC-created resources. Additionally, the efficiency of combining large-scale HC with automated methods will be compared against both human experts and automated knowledge extraction. We will also create a new method and accompanying shared datasets for quantitative black-box evaluation, where collective intelligence is incorporated into a computer-human symbiotic infrastructure to address the gold-standard bottleneck. An open evaluation campaign will be organised to benchmark uComp’s new methods against existing knowledge processing algorithms from the Web Science research community.
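
To make the second objective more concrete, the following minimal sketch illustrates one simple way of aggregating noisy contributor input: a majority vote per item, reported together with the share of contributors who agreed with the winning label. This is an illustrative assumption only, not the method uComp will adopt; the function name `aggregate_labels`, the item identifiers and the sentiment labels are all hypothetical.

```python
from collections import Counter


def aggregate_labels(judgements):
    """Majority-vote each item and report how strongly contributors agreed.

    `judgements` maps a (hypothetical) item id to the list of labels that
    individual contributors assigned to it. Items without labels are skipped.
    Ties are resolved in favour of the label that was submitted first.
    """
    results = {}
    for item_id, labels in judgements.items():
        if not labels:
            continue
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        # Agreement = fraction of contributors who chose the winning label.
        results[item_id] = (label, votes / len(labels))
    return results


# Hypothetical example data: three contributors label two short texts.
judgements = {
    "tweet-1": ["positive", "positive", "negative"],
    "tweet-2": ["neutral", "neutral", "neutral"],
}
aggregated = aggregate_labels(judgements)
for item_id, (label, agreement) in aggregated.items():
    print(f"{item_id}: {label} (agreement {agreement:.2f})")
```

Items with low agreement could then be routed to additional contributors or to an expert, which is one possible starting point for the quality control and reconciliation questions raised above.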