Good morning, dear readers of Tecnogalaxy. Today we will look at what the leaked Yandex source code contains.

Yandex, a Russian tech giant, was recently the victim of hackers who published over 40 GB of the company’s source code. The company denied being hacked.

The leaked sources are packed into individual components, such as analytics, cloud, portal, and so on. Source code consists of plain text files, so 40 GB of pure code would amount to an enormous number of lines of code (LOC). In reality, the leaked data is a mix of source code and media files, and the media files clearly make up most of it.

Automated testing

Test automation is a technique used in software development to make sure the software works as intended: engineers write code that tests an application. Taking the taxi module as an example, a test engineer might write a test to check that the billing process is correct, so that the user can continue with the taxi order. Inside “billing-test-data” we find a 400 MB file, “06-balance-select.json”.

Automated tests are usually written in a generic way. They take some input data, and the test program checks that feeding this data into our taxi module produces the expected output. The leaked data contains many such test data and test automation files.
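
To make the idea concrete, here is a minimal sketch of a data-driven test in Python. It is not taken from the leak: the field names, the can_order rule and the test cases are all invented for illustration, and the JSON string only stands in for a data file like the one mentioned above.

```python
import json

# Hypothetical test data standing in for a JSON data file.
# The field names are assumptions made up for this example.
TEST_CASES_JSON = """
[
  {"balance": 500, "fare": 320, "expected_can_order": true},
  {"balance": 100, "fare": 320, "expected_can_order": false},
  {"balance": 320, "fare": 320, "expected_can_order": true}
]
"""


def can_order(balance: int, fare: int) -> bool:
    """Toy billing rule (an assumption): the order may proceed
    if the balance covers the fare."""
    return balance >= fare


def run_cases(raw_json: str) -> list:
    """Run every case from the data and return the cases that failed."""
    failures = []
    for case in json.loads(raw_json):
        actual = can_order(case["balance"], case["fare"])
        if actual != case["expected_can_order"]:
            failures.append(case)
    return failures


if __name__ == "__main__":
    failed = run_cases(TEST_CASES_JSON)
    print(f"{len(failed)} failing case(s)")
```

The test logic stays the same no matter how many cases the data file holds, which is one reason such data files can grow very large.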

Many modern software products use a protocol called gRPC to communicate with each other. The details of this communication can be difficult for an outsider to decipher, because data transfer objects in gRPC are binary and therefore not human-readable. An automation tester needs to know how to communicate with the service. Fortunately, the leaked Yandex data contains “.proto” files, which describe the structure of these binary data transfer objects. This can be very interesting for some researchers.
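
The following Python sketch shows why the binary format is hard to read without the “.proto” description. It hand-encodes a single integer field in the Protocol Buffers wire format, which gRPC commonly uses. The helper names are invented for this example, it handles only non-negative integers, and it is not a replacement for a real protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint (7 bits per byte)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: a tag (field number + wire type 0) then the value."""
    tag = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)


def decode_varint(data: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_int_field(data: bytes):
    """Return (field number, wire type, value) for a single varint field."""
    tag, pos = decode_varint(data, 0)
    value, _ = decode_varint(data, pos)
    return tag >> 3, tag & 0x07, value


if __name__ == "__main__":
    wire = encode_int_field(1, 150)
    print(wire.hex())              # 089601
    print(decode_int_field(wire))  # (1, 0, 150)
```

The bytes on the wire carry only a field number and a value. The name and meaning of field 1 exist only in the “.proto” file, which is why those files are so useful to anyone trying to understand the traffic.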


One of the largest directories within the taxi module is called “infra”, which is probably short for “infrastructure”.

It contains several hundred “.yaml” files (YAML originally stood for “Yet Another Markup Language”). This format is usually used for configuration. In the case of the taxi module, it is used to describe the API endpoints and the response codes they are allowed to return. This can also be interesting for researchers, because YAML files provide information about the endpoints of a web service.
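
As a rough illustration, the Python sketch below shows the kind of structure a YAML parser might return for one endpoint description, and how a test could check a response against it. The path, method and codes are invented for this example and do not come from the leak.

```python
# Hypothetical parsed form of one endpoint description.
# Every value here is an assumption made up for illustration.
ENDPOINT_SPEC = {
    "path": "/v1/orders",
    "method": "POST",
    "responses": [200, 400, 404],
}


def is_documented_response(spec: dict, method: str, status_code: int) -> bool:
    """Return True if the method matches the spec and the status code is listed."""
    return method.upper() == spec["method"] and status_code in spec["responses"]


if __name__ == "__main__":
    print(is_documented_response(ENDPOINT_SPEC, "post", 200))
    print(is_documented_response(ENDPOINT_SPEC, "post", 500))
```

A description like this tells a reader which endpoints exist and how they are expected to behave, without running the service at all.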

Business logic

So far we have seen that most of the content in the leaked data consists of media files, test automation, and infrastructure code. But that is not all. The interesting part of any software is the so-called “business logic”. This is where all the magic happens: business logic describes, for example, how we can “book a taxi”.

There are many files with extensions such as “.cpp” and “.hpp”, which are typical of C++ programs. On closer inspection, they seem to be related to some web services. At first glance, however, those files do not appear to be complete programs. We can find some handlers for web requests (accepting connections, retrieving data), but not much business logic seems to be involved here.

In any case, it is difficult to say whether the leaked source code is genuine. All files in the archives are dated February 24, 2022. The “leaker” states that the data was downloaded in July 2022, while the link to the files was published on January 25, 2023.

From the inspection of source, media and configuration files of various modules, we have to assume that the leak contains real data. It is hard to imagine anyone creating such a quantity of material just for fun. However, since the core business code is largely missing, we may conclude that the leak contains only a reduced version of the Yandex software repositories.
