Big Data: breaking the curse of dimensionality

Interview with Prof. Wolfgang Marquardt, Chairman of Forschungszentrum Jülich (Germany) and Big Data specialist

The term big data is complex. On the one hand, it describes the sheer amount of data; on the other, it characterizes the technology required to collect and analyze that data. The fact is that big data is essential in medicine. Data-supported models not only help to advance medical research, they also make it easier to reach treatment decisions.


Photo: Man with glasses and a tablet in his hands

Prof. Wolfgang Marquardt; © Forschungszentrum Jülich

In this interview, Professor Wolfgang Marquardt talks about the potential of big data in medicine, and why not all data is the same.

Professor Marquardt, big data has become something of a trendy buzzword. Everyone uses it, but hardly anyone knows what’s behind it. What actually is big data?

Prof. Wolfgang Marquardt:
Big data is not just a buzzword without substance. It is not only about the size or amount of data, but also about heterogeneous, unstructured and multifaceted data. The speed at which data is generated and the new possibilities for data processing and interpretation justify claims about new potential in science and application. Big data leads to brand-new approaches to solving problems. You can actually illustrate this quite well with the example of medicine.

What potential does big data hold in this area?

Marquardt: The potential is multifaceted. Let’s take personalized medicine, for example: biological processes that take place in the body of a particular patient can be detected using molecular-biological omics technologies. In addition, a person's medical history will increasingly be available in electronically documented form. This provides personalized data of various types and from a variety of sources. These data have to be brought together on one level to create an overall picture of a person. This lays the foundation for making personalized treatment decisions.

Recently, IBM defined a new business segment with Watson Health. With the Watson technology, data from the medical-scientific field are not only made available but also linked to form a new source of information in the context of a specific medical question. The doctor uses the information prepared in this way for diagnostic or therapeutic decisions. Of course, further research is required to make these types of systems more common and secure.

You could also use big data to describe pathologies with non-invasive methods. Imaging and image analysis play an essential role here. If you take this a step further, you can make predictions as to which disease patterns could develop in large groups of patients, so-called cohorts. The predictions could stretch all the way to survival analyses for severe diseases, pandemics and the like. With this data and well-founded knowledge, the development of patient groups or individual patients could be predicted.
Photo: huge data server

The high-performance computer Watson can process about 200 million pages of medical textbooks and journals in just three seconds. This gives it a knowledge advantage over human doctors; ©

How is the collected data being analyzed and interpreted?

Marquardt: We need to consider a number of tasks here. The first area is the collection, provision and availability of data, including questions of data protection and, with it, the obligation to anonymize or pseudonymize personal data. The second area relates to the extraction of information and knowledge from data. For example, mathematical methods are used to search for patterns or correlations in the data, from which hypotheses can be generated.

Here at Forschungszentrum Jülich, the recognition of spatial patterns in organs is an essential element of the Human Brain Project. The aim is to create a "brain atlas", which includes not only the spatial structures of the brain at high resolution, but also the assignment of functional areas to anatomical structures in the brain. To do this, the laboratory creates two-dimensional light-microscopy views of the brain, which are compiled into a spatial model. Since the data is very extensive, high-performance computers have to be used to aggregate and compare thousands of two-dimensional images. Such an atlas is an important step toward elucidating brain function in the long term and, in some cases, predicting it.

There is also the option of combining these image data from electrophysiological or optogenetic experiments with genetic data. Data mining or machine learning approaches reveal correlations and patterns, help to find cause-and-effect relationships, and ultimately produce qualitative data-driven models. The entire analysis is shaped by both technology and methodology. The development of these methods always requires consideration of natural science and biomedical knowledge.
Photo: Projection of different icons

Interfaces between different disciplines are particularly relevant when it comes to privacy issues. This calls for large interdisciplinary research collaborations that reach beyond the borders of individual institutes; ©

Big data could literally be translated as "a lot of data". Does the adage "more is more" also apply in medicine?

Marquardt: If I had to decide, I would say "more does not help much". Basically, we are able to create as much data as we like without it containing any information. You need to have the right data: data that delivers an added benefit and is of high quality. This also means that measurement errors need to be quantified or even eliminated. If you want to correlate a symptom with the causes of a disease, the number of parameters you have to include is large. This is called the "curse of dimensionality". You can only break this curse by using knowledge to reduce data requirements. Often, very simple relationships are sufficient to reduce the amount of data you need. If data is not cleverly selected and interpreted, more does not actually help. However, when you systematically collect high-quality datasets, a huge amount of data can provide a big benefit.

The importance of big data will continue to increase in medical research. Are clinical trials therefore soon to "become extinct"?

Marquardt: This is inconceivable. Today we are more at a point where there are not enough clinical trials. That is to say, our goal should not be to reduce the number of clinical trials, but to use data and analytical methods to increase the quality and the informative value of these studies. Complementing clinical trials with additional data sources such as medical history and personal living conditions should increase their value.
Photo: Melanie Günther; Copyright: B. Frommann


The interview was conducted by Melanie Günther and translated by Elena O'Meara.