Technology of content categorisation based on the analysis of social media and social-psychological data using crowdsourcing platforms

Relevance of the project

Current models for predicting complex user traits (political beliefs, social attitudes, psychological traits) from social media rely on labelled data, which presupposes that the author of a text (or other content) has first been labelled as having or not having the trait. However, the specific nature of such data can significantly reduce the quality of the resulting social media analysis models. The solution is to develop a technology for reliably labelling the data needed to train algorithms that analyse social media texts.

The project aims to develop a technology for generating reliable labelled data sets for subsequent use in training text analysis algorithms.

Project tasks:

Develop technological and methodological principles for generating reliable data sets.
Formulate the requirements, capabilities and limitations of the measurement tools to be used.

This is necessary to assess complex traits and generate datasets that conform to the developed principles.
Create an algorithm for assessing data quality.

Data are collected via crowdsourcing platforms, and respondents who completed the proposed tasks in bad faith are filtered out.
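
The project description does not specify how bad-faith respondents are detected or how data quality is scored. As a purely illustrative sketch of one common approach, the Python below assumes that some tasks are control tasks with known answers, drops workers whose accuracy on those tasks falls below a threshold, and aggregates the remaining answers by majority vote. The function names, the `min_accuracy` threshold of 0.7, and the example data are hypothetical and are not taken from the project.

```python
from collections import Counter, defaultdict


def worker_accuracy(answers, gold):
    """Return each worker's share of correctly answered control tasks.

    answers: iterable of (worker_id, task_id, label) tuples.
    gold: dict mapping control task_id to its known correct label.
    Workers who answered no control tasks are absent from the result.
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, task, label in answers:
        if task in gold:
            seen[worker] += 1
            hits[worker] += int(label == gold[task])
    return {worker: hits[worker] / seen[worker] for worker in seen}


def filter_and_aggregate(answers, gold, min_accuracy=0.7):
    """Keep trusted workers only, then majority-vote the non-control tasks.

    min_accuracy is an assumed cut-off; workers without any control
    answers cannot be assessed and are therefore excluded as well.
    """
    accuracy = worker_accuracy(answers, gold)
    trusted = {w for w, a in accuracy.items() if a >= min_accuracy}

    votes = defaultdict(Counter)
    for worker, task, label in answers:
        if worker in trusted and task not in gold:
            votes[task][label] += 1

    # Most frequent label per task among trusted workers.
    labels = {task: c.most_common(1)[0][0] for task, c in votes.items()}
    return labels, trusted


# Hypothetical example: g1 and g2 are control tasks, t1 and t2 are real tasks.
gold = {"g1": "yes", "g2": "no"}
answers = [
    ("w1", "g1", "yes"), ("w1", "g2", "no"), ("w1", "t1", "yes"), ("w1", "t2", "no"),
    ("w2", "g1", "yes"), ("w2", "g2", "no"), ("w2", "t1", "yes"), ("w2", "t2", "no"),
    ("w3", "g1", "no"), ("w3", "g2", "yes"), ("w3", "t1", "no"), ("w3", "t2", "yes"),
]
labels, trusted = filter_and_aggregate(answers, gold)
print(sorted(trusted), labels)
```

In this example worker `w3` fails both control tasks and is excluded, so only the answers from `w1` and `w2` contribute to the final labels. A real pipeline would likely combine several signals (response time, consistency checks, agreement between workers) rather than rely on control tasks alone.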

Planned Outcomes:

Methodology and technology for generating robust labelled datasets for predicting complex traits on social media;

An algorithm for assessing the quality of data collected through crowdsourcing platforms under this methodology;

Robust labelled datasets for predicting complex traits.

The project is implemented jointly with a partner

Yandex

Project team

Maria Chumakova

Aleksandr Vecherin

Alisa Kuzmina

Denis Stukal