Embracing a new challenge: Feedzai!

Since 2016 I’ve been focusing on data science and distributed computing, a path I decided to take back in late 2014 when I went back to university for a Master’s in Computer Science. At the time, I was looking to learn a bit about machine learning and tons about distributed systems and web development - but while I was learning more about Operational Research and Machine Learning, my direction changed a little and I began to study Data Visualization and Data Warehousing. And while studying these subjects, I knew the path I was really interested in was data and distributed systems.

When it was time to choose a subject for my dissertation, I was sure I wanted to apply Data Science, but I also wanted to put my computer science skills on the table - concurrency and parallelism, distributed computation, database systems optimization, and so on. So, when I started my dissertation at Altran Portugal (which was about using machine learning to detect short-circuits in induction motors), I knew my short-term future would be about artificial intelligence (more specifically machine learning) and distributed systems (streaming and information feeds).

After an amazing year at Altran Portugal, learning a ton of things and leading the development of a project that was also my dissertation, I felt I needed more from my next challenge (well, I also took a break for 4 months, but that’s another story). I wanted my next challenge to be in a context where deciding in real time is a true necessity, and where I could learn more about Machine Learning and Data Science.

Feeds and AI

Analysing and deciding upon your business data in real time is a capability that brings value to a lot of companies. It unleashes the power of historical data by combining that knowledge with information about what is happening right now. Doing this the right way is a tough challenge - and Artificial Intelligence (AI), more specifically Data Science (a sub-field of AI), is here to help.

Leveraging your data in real time presents infrastructure and engineering challenges - scaling the system right and supporting (Big) data processing - plus the modelling know-how to put Data Science to work with the joint knowledge of the past and the present.

For the Data Science process to work properly, it has to be fed correctly. Are you familiar with the phrase “we are what we eat”? The resulting model of a Data Science pipeline is a pretty good example of that phrase, since it will only be as good as the data it is fed. This is not only about data quality, but also about the human bias present in the data.
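
To make that bias point concrete, here is a tiny smoke test I like as an illustration - plain Python, with made-up field names - that checks whether the positive label rate differs wildly across groups in the training data:

```python
from collections import defaultdict

def label_rate_by_group(rows, group_field, label_field):
    """Fraction of positive labels per group.

    Large gaps between groups hint that the data (and therefore any
    model trained on it) may carry a human bias worth investigating.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_field]] += 1
        positives[row[group_field]] += int(row[label_field])
    return {group: positives[group] / totals[group] for group in totals}

# Hypothetical training rows: "region" stands in for any attribute
# we do not want the model to blindly pick up on.
rows = [
    {"region": "north", "fraud": 1},
    {"region": "north", "fraud": 0},
    {"region": "south", "fraud": 0},
    {"region": "south", "fraud": 0},
]
print(label_rate_by_group(rows, "region", "fraud"))
# {'north': 0.5, 'south': 0.0} - a gap worth a closer look
```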

Also, for the knowledge gathered during the Data Science pipeline to produce results in an effective, real-time production environment, the streams (a.k.a. feeds) that take and transform the data must be resilient and must cope with the variance of the data volume over time.
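
A common way to build that resilience is backpressure: a bounded buffer between the producer and the consumer, so a burst in volume slows the producer down instead of crashing the pipeline. A minimal single-process sketch with Python’s standard library (a real feed would use something like Kafka, but the idea is the same):

```python
import queue
import threading

# A bounded queue is the simplest form of backpressure: when the
# consumer falls behind, put() blocks and the producer slows down
# instead of the pipeline falling over under a burst.
buffer = queue.Queue(maxsize=1000)
SENTINEL = object()

def producer(events):
    for event in events:
        buffer.put(event)         # blocks while the buffer is full
    buffer.put(SENTINEL)          # signal that the feed is done

def consumer():
    processed = 0
    while (event := buffer.get()) is not SENTINEL:
        processed += 1            # stand-in for the real processing
    print(f"processed {processed} events")

feed = threading.Thread(target=producer, args=(range(100_000),))
feed.start()
consumer()
feed.join()
```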

Not only this: even when you have your model and your offline pipeline - the one where the model was iterated on - how long will it take to turn it into a production pipeline, with features computed in a feasible way so that results arrive in (near) real time?
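
To make this concrete: a feature like “amount spent by this card in the last hour” is a simple group-by offline, but online it has to be maintained incrementally, per entity, as events arrive. A rough sketch of the online version (the entity name and window size are made up):

```python
from collections import defaultdict, deque

class SlidingSum:
    """Per-entity sum over a trailing time window, updated per event.

    Offline, this feature would be a single group-by; online it must
    be kept incrementally so each event is scored in (near) real time.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)   # entity -> (timestamp, amount)
        self.sums = defaultdict(float)

    def update(self, entity, timestamp, amount):
        events, cutoff = self.events[entity], timestamp - self.window
        events.append((timestamp, amount))
        self.sums[entity] += amount
        while events and events[0][0] < cutoff:    # expire old events
            _, old_amount = events.popleft()
            self.sums[entity] -= old_amount
        return self.sums[entity]

feature = SlidingSum(window_seconds=3600)
print(feature.update("card-42", timestamp=0, amount=10.0))    # 10.0
print(feature.update("card-42", timestamp=1800, amount=5.0))  # 15.0
print(feature.update("card-42", timestamp=4000, amount=1.0))  # 6.0 (first event expired)
```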

Data Variance

Dealing with real-time information is not trivial. In the banking domain, for example, there is simply a lot of information to process - think about the number of transactions your bank has to handle every day. And when you use your card for a payment, you do not want to wait long for that transaction to be authorized - so there is a maximum amount of time those transactions can take to be processed.

Now, on top of this, think about how the number of transactions varies during the day - maybe at lunchtime there are a lot of them, and at midnight there are very few, as illustrated in figure 1.

Figure 1: A hypothetical plot of transaction count by hour of day

But in both cases you do not want to wait a whole minute for the transaction to be processed - you want it to be fast.

So, real-time systems have to deal with this data variance phenomenon in order to work properly. Getting the architecture right and making the right decisions, alongside implementing the components the right way, is a tough engineering problem that cannot be ignored - and it will definitely dictate the availability of your system.

Your Feeds have to handle the Data Variance!
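
A back-of-the-envelope way to see why this matters for the architecture: by Little’s law, the concurrency you need tracks the arrival rate, so provisioning for the average hour leaves you underwater at the peak. A toy sketch with a hypothetical hourly profile:

```python
# Little's law: concurrency needed ~= arrival rate * service time.
# Sizing for the average hour under-serves the peak hours.
SERVICE_TIME_S = 0.05             # hypothetical time to process one transaction

hourly_tps = {0: 50, 4: 20, 8: 400, 12: 900, 16: 600, 20: 300}  # made-up profile

for hour, tps in sorted(hourly_tps.items()):
    print(f"{hour:02d}:00  {tps:4d} tx/s -> ~{tps * SERVICE_TIME_S:.0f} workers in flight")

average_tps = sum(hourly_tps.values()) / len(hourly_tps)
peak_tps = max(hourly_tps.values())
print(f"average load needs ~{average_tps * SERVICE_TIME_S:.0f} workers, "
      f"the lunchtime peak needs ~{peak_tps * SERVICE_TIME_S:.0f}")
```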

Historical Data

Historical data is the large amount of information available about the past. We want to match patterns of what is happening right now against what happened before. Doing so is valuable to understand how a certain entity is behaving at the moment compared with how that entity usually behaves.
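
One standard way to do that comparison is to keep a running profile per entity and flag events that deviate from it. Here is a minimal sketch using Welford’s online algorithm for the running mean and variance (the entity name and amounts are made up):

```python
import math
from collections import defaultdict

class EntityProfile:
    """Running mean/variance per entity (Welford's online algorithm),
    so "how does this entity usually behave" is one O(1) update away."""

    def __init__(self):
        self.n = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)

    def update(self, entity, value):
        self.n[entity] += 1
        delta = value - self.mean[entity]
        self.mean[entity] += delta / self.n[entity]
        self.m2[entity] += delta * (value - self.mean[entity])

    def zscore(self, entity, value):
        if self.n[entity] < 2:
            return 0.0                        # not enough history yet
        std = math.sqrt(self.m2[entity] / (self.n[entity] - 1))
        return (value - self.mean[entity]) / std if std else 0.0

profile = EntityProfile()
for amount in (12.0, 9.0, 11.0, 10.0):        # this card's usual spending
    profile.update("card-42", amount)
print(profile.zscore("card-42", 250.0))       # huge z-score: unusual for this card
```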

Doing this in real time is a challenge, and typically involves two domains - the engineering domain and the data science domain.

The engineering challenge is how to cross historical information with the real-time data in a performant way. This sounds like small stuff, but when you consider that the system has to be fault-tolerant and distributed, with a lot of events, things really start to get tricky to support the time-bound constraints of these systems.
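
A common trick to keep that join fast is co-partitioning: route the events and the historical state of the same entity to the same partition, so the lookup is a local in-memory read instead of a remote call. A toy single-process sketch of the idea:

```python
import zlib

NUM_PARTITIONS = 4

# One state store per partition: events for an entity always land on
# the same partition, so its historical profile is a local lookup.
stores = [{} for _ in range(NUM_PARTITIONS)]

def partition(entity_id: str) -> int:
    # crc32 is stable across runs, unlike Python's salted hash()
    return zlib.crc32(entity_id.encode()) % NUM_PARTITIONS

def enrich(event):
    """Join a real-time event with its entity's historical profile."""
    store = stores[partition(event["entity"])]
    profile = store.setdefault(event["entity"], {"count": 0, "total": 0.0})
    enriched = {**event, "past_count": profile["count"], "past_total": profile["total"]}
    profile["count"] += 1
    profile["total"] += event["amount"]
    return enriched

print(enrich({"entity": "card-42", "amount": 10.0}))
print(enrich({"entity": "card-42", "amount": 99.0}))  # sees the history above
```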

The data science challenge is to find patterns in the data that are associated with the insights we want to extract - and doing so over billions of transactions is a challenge of its own.

Maybe we have found a pattern in our historical data that appears 70% of the time in the event we are trying to predict, but the trend of that pattern is decreasing, and in the most recent data it is almost non-existent. What do you do?
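
One pragmatic answer is to stop treating all history equally: weight recent observations more, for instance with an exponential time decay, so a fading pattern contributes less than its lifetime average suggests. A toy sketch (the half-life is an arbitrary choice):

```python
import math

def decayed_support(observations, half_life_days):
    """Fraction of the time the pattern held, with older observations
    down-weighted exponentially (half the weight every half_life_days)."""
    decay = math.log(2) / half_life_days
    weight_sum = hit_sum = 0.0
    for age_days, pattern_held in observations:
        weight = math.exp(-decay * age_days)
        weight_sum += weight
        hit_sum += weight * pattern_held
    return hit_sum / weight_sum

# (age in days, did the pattern hold?) - common long ago, rare lately
history = [(300, 1), (250, 1), (200, 1), (100, 1), (30, 0), (10, 0), (1, 0)]
raw = sum(held for _, held in history) / len(history)
print(f"raw support:     {raw:.2f}")                                          # ~0.57
print(f"decayed support: {decayed_support(history, half_life_days=60):.2f}")  # ~0.16
```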


Figure 2: Data Science? That’s just feeding data to models.

Another challenge arises when the models containing the learned patterns (the ones data scientists have found) have to be deployed in these real-time systems - maybe a feature that is pretty useful for the model cannot be reproduced in real time due to time-bound constraints, and therefore that feature cannot help us.
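
A simple guard against this is to measure every candidate feature against the latency budget before it ever reaches production, and drop (or precompute) the ones that blow it. A hypothetical sketch:

```python
import time

LATENCY_BUDGET_S = 0.005          # hypothetical per-feature budget: 5 ms

def within_budget(feature_fn, sample_event, budget_s=LATENCY_BUDGET_S, trials=10):
    """Time a feature on a sample event a few times; reject it if the
    worst observed run exceeds the real-time budget."""
    worst = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        feature_fn(sample_event)
        worst = max(worst, time.perf_counter() - start)
    return worst <= budget_s

# Hypothetical features: a cheap arithmetic one, and one that "scans
# all history" and has no chance of meeting the budget.
cheap = lambda event: event["amount"] * 2
expensive = lambda event: sum(i * i for i in range(500_000))

event = {"amount": 42.0}
print("cheap fits the budget:    ", within_budget(cheap, event))      # True
print("expensive fits the budget:", within_budget(expensive, event))  # False on most machines
```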

This is the environment where I want to make a difference and where I want to apply my knowledge - right here, surrounded by Engineering and Data Science.


Right here, between Feeds and AI.

feeds | ai


Feedzai

And Feedzai does not only do this - this knowledge is in its very genesis! At Feedzai, I believe I will have plenty of opportunities to learn in this domain, as well as to apply my own knowledge. On top of that, the people at Feedzai really know their stuff! Working surrounded by this kind of people will be a huge learning experience for me - having the opportunity to get to know them and to learn with them!

So, for my next challenge I chose Feedzai - because I believe in the project and in the People!

Figure 3: Feedzai is not a crystal ball. Feedzai is AI!