Innovation: creating a Data Lake at Médiamétrie

Faced with the influx of information from Big Data, Médiamétrie created a Data Lake in 2016 to centralise the audience data collected via its measurement systems.

The objective? To conduct analyses for audience measurement and R&D by adopting the latest approaches in Data Science and Big Data. This required structural choices. Audience le Mag retraces the steps of this innovation.

An efficient and secure database

In 2015, working groups began to meet at Médiamétrie to reflect on the possibility of creating a Data Lake. A Data Lake is “a secure store of immutable, raw data, for the most part not processed, which acts as a data mining and analysis source,” explains Mélanie Langlois, Director of Médiamétrie’s IS Innovation Department.

Médiamétrie increasingly needs to cross-reference a large and growing volume of TV, Internet and radio audience results. This is done for the purposes of R&D and innovation in audience measurements, to enhance existing facilities and propose new offers. Médiamétrie is also responsible for processing important data from its partners or clients, as part of Data Media qualification activities. To this end, the teams must have structured and secure access to the data, especially to carry out projects in line with market timeframes. The purpose of the Data Lake is therefore to give Médiamétrie’s teams the chance to respond with “agile” projects.

As explained by Mélanie Langlois, “The Data Lake involves no processing and no mathematical or statistical model, and it doesn’t enhance the data. It is intended for the Data Scientists’ mining and analysis work; it’s not a tool for producing study results.”

The challenges involved in creating a Data Lake are many. Security and confidentiality must be ensured, for example by managing identities and user access to the Data Lake.

Médiamétrie is particularly vigilant in relation to personal data protection. In accordance with the General Data Protection Regulation (GDPR), the Data Lake takes into account the principle of Privacy by Design, which consists in integrating the personal data management and protection rules established by the GDPR from the early stages of the project. To meet this requirement, Médiamétrie carried out a Privacy Impact Assessment, the objective of which is to assess the risks associated with data processing, in line with the principle of accountability on the part of the data controller.

Lastly, the Data Lake must make it possible to load and store large volumes of data in a cost-effective way.

Strictly compliant with security requirements, the Data Lake offers users a unified work interface.

A catalogue of homogeneous data

The innovative dimension of Médiamétrie’s Data Lake lies in combining the strengths of traditional Data Warehouses with the state-of-the-art features of Data Lakes. It contains a data catalogue that can be used for complex analyses, and Médiamétrie’s teams have developed an interface for shared access to the data.

Médiamétrie defined a two-step process for loading data into the Data Lake:

● First, providing information (meta-data) which enables the system to automatically recognise the data files that are inputted; this is the framework of the data catalogue.

● Second, pushing the data files into the Data Lake by specifying the framework that enables recognition of the file and its contents. All the data inputted into the Data Lake thus has the same technical format for storage, regardless of its origin. In this way, the Data Lake can adapt to any type of data.
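
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Médiamétrie’s actual system: the names Framework, DataLake, register and push are hypothetical, the input files are assumed to be delimited text with a header row, and the common storage format is assumed to be one JSON record per row.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Framework:
    """Hypothetical catalogue entry: metadata describing one kind of input file."""
    name: str
    delimiter: str
    columns: tuple  # expected column names, in order


class DataLake:
    """Toy in-memory stand-in for a Data Lake (illustrative only)."""

    def __init__(self):
        self.catalogue = {}  # framework name -> Framework
        self.storage = []    # append-only list of JSON strings

    def register(self, framework):
        # Step 1: declare the metadata so that files can be recognised later.
        self.catalogue[framework.name] = framework

    def push(self, framework_name, raw_text):
        # Step 2: push a file, naming the framework that describes it.
        fw = self.catalogue[framework_name]  # KeyError if never registered
        reader = csv.reader(io.StringIO(raw_text), delimiter=fw.delimiter)
        header = tuple(next(reader))
        if header != fw.columns:
            raise ValueError(f"file does not match framework {fw.name!r}")
        count = 0
        for row in reader:
            if not row:  # skip blank lines
                continue
            # Every record gets the same storage shape, whatever its origin.
            record = {"source": fw.name, "data": dict(zip(fw.columns, row))}
            self.storage.append(json.dumps(record, sort_keys=True))
            count += 1
        return count


lake = DataLake()
lake.register(Framework("tv_audience", ";", ("channel", "viewers")))
lake.register(Framework("web_audience", ",", ("site", "visits")))
lake.push("tv_audience", "channel;viewers\nA;1200\nB;800\n")
lake.push("web_audience", "site,visits\nexample.org,50\n")
```

In this sketch, two files with different delimiters and columns end up as records of identical shape, which is the property the article attributes to the framework. A file whose header does not match its declared framework is rejected rather than stored.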

As Mélanie Langlois points out, “Because the framework formalises the format and structure of the data to a very high degree, several Médiamétrie teams (from the departments of IT, science, Business Units, etc.) can work jointly on the same data.”

Use of the cloud to optimise costs

Data is stored in the Cloud, an appropriate option given the large volumes involved. Cloud storage is also scalable: capacity can be adapted as storage needs change. The economic model of Médiamétrie’s Data Lake, based on Cloud principles, therefore keeps the cost of storing Médiamétrie’s terabytes of data under control. The infrastructure costs of running algorithms are incurred only on demand, based on the amount of computing time the Data Scientist teams need.

To ensure the necessary flexibility in the calculations and analyses, Médiamétrie’s teams are free to choose their working environment, their tools, and the desired computing power. All the calculations are therefore performed in the Cloud and users are responsible for their resource consumption. There are no limitations on the calculation capacity of the proposed infrastructure.

Médiamétrie’s IS Innovation Department needed to integrate several types of technology into this infrastructure in order to meet business, economic and technical objectives.

The first uses

Médiamétrie has already carried out a number of R&D projects using the Data Lake. Examples include improving the quality of audience data collection, by studying how to avoid blockages in measurement systems on certain versions of mobile OSs, and analysing the performance of tags.

The Data Lake also plays an important role in the development of Médiamétrie’s Data Business offer, which aims to enhance partners’ data with audience results from Médiamétrie panels.

For Patrice de Flaujac, Director of Information Systems at Médiamétrie, “the Data Lake makes it possible to reinforce value creation in studies and to propose new, innovative offers.”

Laure Osmanian Molinero
