A high-quality data scientist must have a solid background in mathematics, statistics, and computing, which serves as a foundation for developing expertise in machine learning and, more generally, in artificial intelligence. This strong theoretical background allows them to understand the scope of existing "big data" solutions, to evolve those solutions, and to keep pace with the sector's very rapid development throughout their career.
A rapidly increasing share of human activity leaves massive trails of computer data that can be used for better management or better services. This data can be encoded, for example in the case of bank transactions, or obtained from sensors ranging from simple temperature sensors to high-definition cameras. It is often produced at a high rate, accumulated in large volumes, and drawn from a variety of sources in formats ranging from the well-defined structures used in a database to completely unconstrained texts or images. The term "big data" refers to this accumulated data, and the techniques used to analyze and exploit it are referred to as "data science".
In the business world, "big data" and "data science" have moved with destabilizing speed from the concept stage to that of an essential tool for developing and improving products and for optimizing how companies operate. This is not limited to the IT sector: it affects fields as diverse as chemistry, mechanics, online retail, energy, and hospital management. This "big data" revolution is very often essential for the development, or even the survival, of companies in an increasingly competitive world.
Alongside this corporate enthusiasm for big data, new scientific and technical questions naturally arise, centered around the following points:
- How to extract more relevant information from existing data through supervised learning? Recent advances in deep learning and in random forest methods can be mentioned here.
- How to go beyond supervised learning paradigms to extend the range of practical problems that can be addressed? Reinforcement learning is the key example here: it often makes it possible to learn decision policies far more sophisticated than those derived from classical supervised learning, such as policies for driving cars or for playing games such as Go and increasingly sophisticated video games.
- How can we enrich existing data in a context where obtaining new data can have a significant cost?
- How to incorporate more and more unstructured data, such as video sequences, text, and traces of human-machine interaction, into learning pipelines?
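To make the supervised-learning point in the first question concrete, here is a minimal, stdlib-only sketch of the paradigm: a model is built from labeled examples and then predicts labels for new inputs. A 1-nearest-neighbour rule is used only because it needs no libraries; the data and function names are invented for the illustration, and deep networks and random forests follow the same fit/predict pattern at far greater sophistication.

```python
# Minimal illustration of supervised learning: learn from labeled
# examples, then predict labels for unseen inputs.

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    def dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pair: dist(pair[0], x))
    return label

# Labeled examples: (features, class) -- invented toy data.
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.5, 8.5), "high")]

print(predict_1nn(train, (1.1, 0.9)))  # near the "low" cluster
print(predict_1nn(train, (9.0, 9.0)))  # near the "high" cluster
```

The same two-phase structure, generalizing from labeled data to predictions on new data, is what the more advanced methods mentioned above scale up to high-dimensional, unstructured inputs.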
These questions are often complex, and in order to develop "big data" products and solutions it is therefore important for companies to be able to recruit specialists who are specifically trained in data science. The same goes for the many research laboratories that increasingly depend on skilled data scientists to exploit their experimental data.
While at the dawn of the "big data" era it was still possible to imagine that an engineer, a computer scientist, or a mathematician with a general education could adequately fill a position intended for a true "data scientist", this is clearly no longer the case, given the complexity of most "big data" problems encountered and the very rapid scientific and technological progress.
This complexity, coupled with the speed at which the big data sector is changing, reinforces the point made at the outset: a high-quality data scientist needs solid foundations in mathematics, statistics, and computing on which to build expertise in machine learning and, more generally, in artificial intelligence, so as to understand, evolve, and keep pace with "big data" solutions throughout their career.
It is precisely this training that this Master in Data Science aims to offer, and given the rise of "big data" and the expertise gathered on this topic at the University of Liège, there is every reason to believe it will succeed at both regional and international level. It will also attract scientists who already hold a degree and who wish to undertake training that can give new momentum to their career. The target profile is a "data scientist" able to detect opportunities for data exploitation and to design and implement the IT systems needed to realize those opportunities.
Mastering the scientific foundations
The foundations of data science lie in applied mathematics (probability theory, statistics, optimization), computer science (algorithms, data structures, automata, complexity), and artificial intelligence (machine learning, knowledge representation, automated reasoning).
In order to develop long-term skills and an ability to adapt to the techniques of the future, it is essential to master these scientific foundations.
Knowing how to implement IT tools
The aim of data science is to extract synthetic, exploitable knowledge from data captured in the real world. These data are often of heterogeneous quality, usually come in large volumes, and take very varied forms (texts, numerical values, images, time series). The types of knowledge to be extracted are also very varied (predictive models of behavior, homogeneous behavior groups, decision rules, relevant variables). The technological tools available for extracting knowledge from data include machine learning and optimization toolkits, data visualization techniques, parallel programming languages and paradigms, and massively parallel and distributed storage and computing systems.
The practice of data science requires a very good knowledge of the possibilities and limitations of these tools, and the know-how to implement them when developing a solution.
Knowing how to develop a solution in a real environment
The development of a 'big data' solution involves a number of steps, including defining the target knowledge, choosing the data to be exploited, prototyping a data processing pipeline, collecting data, testing and optimizing the pipeline, presenting results, and developing a pipeline maintenance cycle to ensure sustainability. In order to ensure that the developed solution meets the needs of users and can function effectively and permanently in the target environment (laboratory, industry, administration, etc.), it is necessary to involve the 'customer' to understand the nature of the data and needs, as well as the constraints on the ground.
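As a rough illustration of the pipeline-prototyping step above, the sketch below chains collection, cleaning, and analysis as plain functions so each stage can be tested, optimized, or replaced independently. All function names and data are invented for the example; a real pipeline would read from sensors, logs, or databases and produce a genuine model rather than a summary statistic.

```python
# Toy end-to-end data pipeline: the structure (independent, chainable
# stages), not the content, is the point. All data is invented.

def collect():
    # Stand-in for reading from sensors, logs or a database.
    return [("2024-01-01", 12.5), ("2024-01-02", None), ("2024-01-03", 14.1)]

def clean(records):
    # Drop records with missing measurements.
    return [(d, v) for d, v in records if v is not None]

def summarize(records):
    # "Model" stage reduced to a summary statistic for the sketch.
    values = [v for _, v in records]
    return {"n": len(values), "mean": sum(values) / len(values)}

def run_pipeline():
    return summarize(clean(collect()))

report = run_pipeline()
print(report)
```

Keeping each step as a separate unit is what makes the later phases listed above, testing, optimizing, and maintaining the pipeline, tractable in practice.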
It is therefore necessary both to master the principles of managing a 'big data' project and to be able to dialogue with the experts of the target domain and with the client's IT managers, in order to make the right technical choices when developing a 'big data' project.
Knowing how to make a cost-benefit analysis
In order to help companies make the right choices for data-mining projects, it is necessary to be able to analyze the costs and financial benefits of such an operation, both in the early phases and over the long term.
The data scientist must therefore have a methodology for carrying out a cost-benefit analysis on the basis of the information provided by the client company and in dialogue with its strategic management.
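As a toy numerical illustration of such an analysis (all monetary figures are invented, and a real methodology would also account for risk and discounting), one can compare up-front and running costs against recurring benefits to find the break-even point:

```python
# Toy break-even calculation for a data-science project.
# All monetary figures are invented for illustration; real analyses
# would also discount future cash flows and model uncertainty.

def months_to_break_even(upfront_cost, monthly_cost, monthly_benefit):
    """Smallest number of months after which cumulative benefit
    exceeds cumulative cost; None if the project never pays off."""
    net_per_month = monthly_benefit - monthly_cost
    if net_per_month <= 0:
        return None  # running costs eat the benefit: no payback
    months = 0
    balance = -upfront_cost
    while balance < 0:
        balance += net_per_month
        months += 1
    return months

print(months_to_break_even(100_000, 5_000, 20_000))   # 7 months
print(months_to_break_even(100_000, 20_000, 15_000))  # None: never pays off
```

Even this crude arithmetic makes the early-phase versus long-term distinction above concrete: a project with a large up-front cost can still be justified if its net monthly benefit is positive and the payback horizon is acceptable.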
Understanding legal and societal implications
Implementing data science applications can change how work is distributed within enterprises and/or involve exploiting information about the behavior of people (workers, customers, the general public). To be viable, such applications must therefore comply with privacy legislation and be acceptable to both the company and its staff.
The 'data scientist' must be aware of the legal and societal implications of the projects in which they will be engaged.