Investigating Content - medical studies use case


The term (Big) Data Analytics appears everywhere at the moment. The need to process large numbers of documents has reached not only large companies but also small and medium-sized enterprises, governments and many other public organizations.

What do we expect from (Big) Data Analytics? In our opinion, valuable information. Valuable information can take many forms depending on the data under investigation: from identifying topics and categorizing data to finding anomalies, hidden dependencies and relations.

Large amounts of data don’t automatically give you more value. Intelligent search is great and useful, but it doesn’t help you all that much if you don’t know what you are looking for. So what we need is a combination of algorithms that gives us an overview of the information or contextual highlights such as topics, hidden dependencies and relations.

Our goal is to experiment with algorithms that provide the content highlights of large data sets. And we can go even further and create a dynamic content map that changes when the data change or when a user navigates through the highlights, such as topics, down to a specific document. We call it a Content Navigation Map.

In our experimental work we came across a site for publicly available medical studies. The site has a very nice search mask. But given more than 200,000 studies with their detailed descriptions and trial results, we would like to see the overall contextual picture condensed to the valuable information. Basically, given all the studies, what are the topics, issues or results they cover?


Our experimental dataset contains more than 200,000 studies. Each document contains the study title, summary, course of events, place and dates, executing department, target group with a description such as age, gender and medical records, as well as other specifications of the study itself.

Content Highlights

What can we investigate in such a data set? For example, the topics that appear across the studies, their geographical distribution, and how they vary with demographic information (age, gender) or time period. Furthermore, we can investigate dependencies and relations between the documented conditions, study descriptions, study types etc. What do we need to discover such information? On the analytics side, there is a variety of methods we can experiment with, such as collocation and term analysis, significant terms extraction, topic modelling with knowledge maps, text similarity measures, entity extraction, association analysis, anomaly and outlier detection and many other algorithms from the areas of machine learning, data/text mining and the semantic web.
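To make one of these methods concrete, here is a minimal sketch of significant terms extraction using a plain TF-IDF score. It is only an illustration under our own assumptions, not the pipeline we use: the function names (tokenize, significant_terms), the tokenization rule and the three sample titles are all invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def significant_terms(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        # A term that occurs in every document gets a score of 0.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_n]])
    return results


# Invented sample titles, not taken from the real dataset.
studies = [
    "Effect of insulin therapy on blood glucose in adults with diabetes",
    "Insulin pump therapy in children with diabetes",
    "Exercise training and blood pressure in adults with hypertension",
]

for title, terms in zip(studies, significant_terms(studies)):
    print(title, "->", terms)
```

Words that occur in every title (here "in" and "with") score zero and drop out, while words specific to a single study rise to the top. On real study descriptions one would most likely add stop-word removal and a more careful tokenizer, but the ranking idea stays the same.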

Data processing challenges and values

There are several steps that need to be considered in order to process the data, identify and extract relevant information and finally visualize it. We describe the high-level (technical) concepts of a processing pipeline in our next blog post.

Dealing with (big) data processing poses a variety of challenges. For example:
  • The dimensions of the data set(s) - what kind of information is stored in a data set?
  • Extracting relevant information - what can be treated as relevant and what not? 
  • Finding appropriate visualisation(s) 
  • Visualisation(s) that provide the possibility to navigate through the content of a data set
  • Extracting hidden information and dependencies through enrichment or combination of different data sets

Smart analytics in combination with appropriate visualization can give us the desired valuable information. An overview and a categorization of the data give us a first insight into the content of our dataset. An appropriate visual representation outlines the major highlights and allows relevant information to be identified. Integrating visualization with dynamic navigation through the data enhances the user experience and provides an alternative to search.