Latest Data Science student projects from class #14

by Badru Stanicki


CodeNotary: AIOps - Cutting Server Costs with Machine Learning

Students: Gianluca Macauda, Maritsa Norton Oleson

Simplified NUMA server architecture
As our reliance on digital platforms increases, more and more organizations - from banks to government institutions - find their operational costs bloated by misconfigured server settings. CodeNotary can provide the expertise to fix these issues, but investing resources upfront is difficult to prioritize when there is no commercially available method for estimating the expected business impact. Maritsa Norton and Gianluca Macauda developed a tool that aims to assist VMware NUMA users in making this decision by forecasting the effect of configuration changes on server systems’ performance.
 
In this project, our students had access to a unique data source provided by CodeNotary. They set up infrastructure for data processing and forecasted server efficiency based on each server’s configuration settings and historical usage patterns. Their pilot regression models predicted the impact of known configuration changes on highly inefficient systems within 82-92% of the actual results. This lays the groundwork for applications that allow organizations to quantify the impact of configuration changes with greater certainty.
 
Graph weekly summaries
 
Fig 2: Real (blue) versus predicted (red) inefficiencies on a weekly basis. Each value on the x-axis is an individual Virtual Machine on a weekly level. Inefficiencies on the y-axis are presented as averaged Remote Memory (as a percentage of total memory). The higher the values, the more inefficient a Virtual Machine is running. Predictions are made by an optimized Random Forest regression model.
 

SPI: Impulsing Social Progress

Students: Gilda Fernandez-Concha Jahnsen, Lena Rubi, Céline C.
 
We tend to relate the success of a country to economic growth, assuming that economic growth and social progress go together. The Social Progress Index (SPI) was created as an alternative to GDP, to measure the social progress of a country. SPI is a framework linked to the Sustainable Development Goals of the United Nations and is composed of multiple indicators that can be divided into three dimensions: basic human needs, foundations of wellbeing, and opportunities.
 
In this project, Gilda, Lena and Céline teamed up with the Social Progress Imperative foundation, which provided a data set of the 52 indicators used to calculate the SPI for more than 160 countries over the last ten years (2011-2020). They applied unsupervised machine learning to find groups of countries with similar social progress. This makes it possible to derive a list of the countries that are most similar to a given country, a functionality that can now be added to SPI's webpage.
 
Using dimensionality reduction, they mapped the 52 dimensions provided by SPI to a much smaller set of key features, which makes it possible to directly visualize similarities and differences between countries. For this purpose, they created interactive and user-friendly visualizations that allow stakeholders of a society to interact with the data provided by the Social Progress Imperative.
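To make the idea concrete, here is a minimal sketch of how "most similar countries" could be derived from an indicator table. It is not the team's actual pipeline: the country names, the four indicator columns, and the choice of PCA with Euclidean distance are all assumptions made for illustration.

```python
import numpy as np

def most_similar(names, X, target, k=2, n_components=2):
    """Return the k countries closest to `target` in a PCA-reduced space."""
    X = np.asarray(X, dtype=float)
    # Standardize each indicator so that no single scale dominates.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # PCA via SVD: project onto the leading principal components.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ vt[:n_components].T
    # Rank all other countries by distance to the target in the reduced space.
    i = names.index(target)
    dist = np.linalg.norm(coords - coords[i], axis=1)
    order = [j for j in np.argsort(dist) if j != i][:k]
    return [names[j] for j in order]

# Hypothetical countries and indicator scores (not SPI data).
names = ["A", "B", "C", "D", "E"]
scores = [
    [90, 85, 80, 88],
    [89, 86, 79, 87],
    [50, 45, 55, 52],
    [52, 44, 53, 50],
    [70, 65, 68, 72],
]

neighbours_of_a = most_similar(names, scores, "A")
neighbours_of_c = most_similar(names, scores, "C")
```

The two-dimensional coordinates computed inside the function are also what one would plot to compare countries visually.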

computed clusters for 2020
 

Sihl: advanced cash flow forecasting 

Students: Ferdinand Limmer, Raoul Steiger, Thomas Massie
 
Many companies struggle to anticipate how much cash will be immediately available to them in the near future. How much cash do we have? How much can we spend? Being able to make well-informed forecasts about what might happen allows for reasonable planning. This is always important, and even more so during times of high uncertainty such as the still ongoing COVID-19 pandemic.
 
Thomas, Ferdinand and Raoul helped the SIHL Group, an SME headquartered in Ostermundigen (BE, Switzerland), analyze their business data in order to, first, reconstruct historic cash flows and, second, predict the near (1-3 months) and far (>3 months) future cash flow based on both historic data and callable cash flow (that is, expected cash flow from, for example, payments that are known to be due at a specific point in time).
 
The team used Prophet, a library developed by Facebook that is designed specifically to analyze and forecast time series. Prophet is based on an additive model in which non-linear global trends are fit together with seasonality and holiday effects. It is known to work best with time series that have strong seasonal effects (daily, weekly, yearly). Moreover, Prophet uses Stan, a state-of-the-art platform for statistical modeling and high-performance statistical computation, which makes fitting super fast.
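The additive idea can be illustrated without Prophet itself. The sketch below is a simplified stand-in, not Prophet's implementation: it fits a single linear trend plus Fourier terms for a yearly cycle by ordinary least squares, whereas Prophet uses a more flexible trend and fits with Stan. All numbers, including the callable cash flow, are synthetic and hypothetical.

```python
import numpy as np

def design(t, period=12.0, order=2):
    """Columns: intercept, linear trend, and Fourier terms for one seasonality."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t]
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Synthetic monthly cash flow: trend + yearly cycle + noise (hypothetical).
rng = np.random.default_rng(0)
t_hist = np.arange(36)
y_hist = (100 + 1.5 * t_hist
          + 10 * np.sin(2 * np.pi * t_hist / 12)
          + rng.normal(0, 1, 36))

# Fit trend and seasonality jointly by least squares.
coef, *_ = np.linalg.lstsq(design(t_hist), y_hist, rcond=None)

# Forecast the next three months and compare with payments already known.
t_future = np.arange(36, 39)
forecast = design(t_future) @ coef
callable_cash = np.array([120.0, 95.0, 130.0])  # hypothetical known payments
gap = forecast - callable_cash  # portion of the forecast not yet backed by known payments
```

Here `gap` plays the role of the difference between the forecast and the callable cash flow that management would look at.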

graph Sihl Group
 
Fig.: Historic data is used to train a model to forecast the future cash flow (black line and circles), and these estimates are compared with the callable cash flow (blue bars). The callable cash flow consists of payments that are due in the future and already known at the time of the prediction. The difference between both (red bars) allows management to anticipate future cash flow in a much more informed and data-driven manner.
 

Nispera: Performance analysis on solar power plants

Students: Marcus Lindberg, Lina Siegrist, Lisa Crowther
 
The detection of soiling losses at photovoltaic plants and the decision of when to clean the panels is an important business problem. The cost of cleaning the panels at such large-scale plants needs to be balanced against the losses in output occurring from soiling.
 
The challenge was to identify output losses caused by soiling in the absence of sensor data that would quantify soiling directly. Lisa, Lina, and Marcus developed a semi-automated pipeline that analyzes photovoltaic panel performance using power output data to detect park-wide soiling-related losses, and that further analyzes the soiling of individual strings of panels to identify clusters of the most-soiled strings. This will be useful for cleaning and maintenance recommendations, as it allows soiling to be detected from power output, temperature, and irradiance data alone.
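The post does not describe the team's method in detail. One common way to estimate soiling from exactly these three signals is a temperature-corrected performance ratio, sketched below under stated assumptions: the function names, the rated capacity, the readings, and the temperature coefficient of -0.4% per °C (a typical value for crystalline silicon) are all illustrative, not taken from the project.

```python
def performance_ratio(power_kw, irradiance_wm2, module_temp_c,
                      rated_kw, temp_coeff=-0.004):
    """Ratio of measured power to the power expected at this irradiance and temperature."""
    # Expected output scales with irradiance (relative to 1000 W/m^2)
    # and drops as the module heats up above 25 degrees C.
    expected_kw = (rated_kw
                   * (irradiance_wm2 / 1000.0)
                   * (1.0 + temp_coeff * (module_temp_c - 25.0)))
    return power_kw / expected_kw

def soiling_loss(pr_now, pr_clean):
    """Fraction of output lost relative to a freshly cleaned baseline."""
    return 1.0 - pr_now / pr_clean

# Hypothetical readings for a 1000 kW plant under identical conditions.
pr_clean = performance_ratio(900.0, 1000.0, 25.0, rated_kw=1000.0)
pr_today = performance_ratio(810.0, 1000.0, 25.0, rated_kw=1000.0)
loss = soiling_loss(pr_today, pr_clean)
```

Computing the same ratio per string rather than per plant would give the values that a clustering step could group into more and less soiled strings.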

Quantification of soiling losses

Module clustering soiling
 

Cencosud Scotiabank: user reviews analysis

Student: Natalie Arias
 
The aim of Natalie's project was to gain insights from customer reviews in order to improve the online banking app of Cencosud Scotiabank. Web scraping allowed her to gather user reviews of competing apps. She used Natural Language Processing (NLP) techniques to extract key phrases and compare Scotiabank's app reviews with those of its competitors. The results will allow Scotiabank to understand how customers perceive its service and how it can improve to keep up with their expectations.
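As a rough illustration of comparing review vocabularies, the sketch below scores single words and two-word phrases by how much more often they appear in one app's reviews than in a competitor's. This simple frequency comparison is a stand-in for the key-phrase extraction used in the project, and the reviews, stopword list, and function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "is", "to", "and", "i", "it", "app", "this", "my", "in", "of"}

def term_counts(reviews):
    """Count words and adjacent word pairs across a list of review texts."""
    counts = Counter()
    for text in reviews:
        tokens = [w for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOPWORDS]
        counts.update(tokens)
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

def distinctive_terms(own, other, top=3):
    """Terms whose relative frequency is highest in `own` compared with `other`."""
    a, b = term_counts(own), term_counts(other)
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    score = {t: a[t] / total_a - b[t] / total_b for t in a}
    # Sort by score, breaking ties alphabetically for a stable result.
    return sorted(score, key=lambda t: (-score[t], t))[:top]

# Hypothetical reviews, not scraped data.
own_reviews = ["Login fails every time", "login fails after update", "Cannot login"]
competitor_reviews = ["Fast transfers and easy login", "easy transfers"]

top_terms = distinctive_terms(own_reviews, competitor_reviews)
```

Swapping the two arguments gives the terms that stand out in the competitor's reviews instead.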

Most relevant terms scale
 

Lausanne University Hospital: Patient similarity in oncology

Student: Julien Dupont
 
Precision treatment is the future of medicine. It implies identifying cohorts of similar patients and measuring the distance of the index patient to their cluster. Hard clustering based on domain knowledge is still the dominant approach used by doctors to prescribe the appropriate treatment. In oncology, imaging, laboratory analysis, and vital parameters support the classification of patients into cancer types and stages, as well as into levels of ability to withstand treatments and their side effects.

The Precision Oncology Center of the University Hospital in Lausanne provided a unique data set with over 80’000 patients. Julien Dupont helped set up data science pipelines and kickstarted a new research topic at the hospital. His proof of concept demonstrates the relevance of this approach, reveals the potential of the method, and will support acquiring future funding.

In the long term, this project aims to add a layer of data-driven knowledge to support decision making. Unsupervised learning based on demographic data and patient journey data will allow us to identify common patient trajectories and, thereby, enable personalized treatment leading to better care and outcomes in Oncology.

Interested in reading more about Constructor Academy and tech related topics? Then check out our other blog posts.
