Welcome to Innaxis News

The Innaxis Post

Call for entry level or junior Data Scientists

Innaxis Research and Foundation (www.innaxis.org) is currently seeking exceptional Data Scientists to join its research and development team based in Madrid, Spain. The position is aimed at talented and highly motivated individuals who want to pursue and lead a career in Data Science and Big Data outside the more mainstream, conventional alternatives such as consulting or academia. Individuals with a great deal of imagination, problem-solving skills, ambition and passion are encouraged to apply.

As a Data Scientist, you will mainly help the team to understand, analyse and mine data, but also to prepare the data and assess its quality. You will also develop methods for data fusion and anonymization. Ultimately, your goal will be to extract the best knowledge and insights from data, overcoming technical limitations and complying with regulatory requirements. You will also work closely with data engineers, helping the engineering team to define the requirements for the Big Data architectures and covering the whole process of data gathering, processing and delivery. You will always need to stay ahead and use the latest technologies and solutions for the best performance and data insight.

About Innaxis

If not unique, Innaxis is at the very least unconventional: it is a private, independent, non-profit research institute focused on Data Science and its applications, most notably in aviation, air traffic management and mobility, among other areas.

As an independent entity, Innaxis determines its own research agenda and now has a decade of experience in European research programs, with more than 30 successfully executed projects. New projects and initiatives are evaluated continuously, and the institute is open to new opportunities and ideas proposed within the team.

Our team consists of a highly interdisciplinary group of scientists, developers, engineers and program managers, together with an extensive network of external partners and collaborators, from private companies to universities, public entities and other research institutes.

Wish lists

Our team members work very closely together, so broader knowledge means much better coordination. The following list of skills describes the whole Data Scientist team at Innaxis. Do not hesitate to apply, even if you don’t have all the skills below. Hardly any single person does.

  • University degree, MSc or PhD in Data Science or Computer Science, or in a related field, provided you have sufficient experience.
  • No professional experience required, although it might be positively evaluated.
  • Proficient in a variety of programming languages, for instance Python, Scala, Java, R or C++, and up to date on the newest software libraries and APIs, e.g. TensorFlow, Theano.
  • Experience with acquisition, preparation, storage and delivery of data, including concepts ranging from ETL to Data Lakes.
  • Knowledge of the most commonly used software stacks such as LAMP, LAPP, LEAP, OpenStack, SMACK and similar.
  • Familiar with some of the IaaS, PaaS and SaaS platforms currently available such as Amazon Web Services, Microsoft Azure, Google Cloud and similar.
  • Understanding of the most popular knowledge discovery and data mining problems and algorithms: predictive analytics, classification, MapReduce, deep learning, random forest, support vector machines and such.
  • Hands-on experience with the most common visualisation tools: Tableau, Qlik, QuickSight, etc.
  • Continuous interest in the latest technologies and developments, e.g. blockchain, Terraform.
  • Excellent English communication skills. It is the working language at Innaxis.
  • Availability and willingness to travel in Europe and engage with our research partners and stakeholders.
  • And of course, great doses of imagination, problem solving skills, ambition and passion.

Your benefits

The successful candidate will be offered a position at Innaxis as a Data Scientist, including a unique set of benefits:

  • Being part of a young, dynamic, highly qualified, collaborative and heterogeneous international team.
  • Great flexibility and excellent working conditions.
  • Long-term and stable position. Innaxis has been growing steadily since its foundation ten years ago.
  • A fair salary according to the nature of the institute and adjusted to skills, experience and education.
  • Independence: given the non-profit and research-focused nature of Innaxis, the institute is driven by different forces than the private sector, free of commercial and profit interests.
  • The possibility to develop a unique career outside the mainstream paths of academia, private companies and consulting.
  • No outsourcing whatsoever: all tasks will be performed at Innaxis offices.
  • Opportunity to get around Europe while visiting our extensive partner network.
  • An agile working methodology; Innaxis recently implemented JIRA/Scrum and all the research is done on a collaborative wiki/Confluence.

How to apply

Interested candidates should send an email to recruitment@innaxis.org containing:

  • An up-to-date and detailed CV in PDF. References, academic records and proofs might be requested afterwards, but they are not necessary for applying.
  • A research motivation letter, explaining carefully why she or he is the perfect candidate.
  • It is highly recommended to include any professional Internet presence, such as GitHub and/or Stack Overflow profiles, website or blog, portfolio, LinkedIn account, etc.
  • Any other relevant information supporting the application

You will then be contacted and a personal selection process will start.

Research documents clustering for CAMERA

The Horizon 2020 Coordination and support action CAMERA evaluates the impact of European mobility-related projects. The CORDIS database presents a high volume of unclassified project data to which manual methodologies would be impossible to apply due to the high dimensionality of the dataset. Also, not all of the projects presented in the CORDIS database are related to mobility.

These problems show the need for algorithms that detect patterns within the corpus of documents in the database. By using automated methodologies on non-classified databases, we can widen the scope of the project. This implies looking at all texts, including those not normally associated with mobility but that may have a soft relation with mobility areas. Also, by developing a data-driven statistical model, more metrics regarding the projects can be designed and assessed.

The rise of statistical Natural Language Processing (NLP)

Natural Language Processing (NLP) is a branch of artificial intelligence that aims to make computers “understand”, interpret and manipulate human language. NLP combines different disciplines including computer science, computational linguistics and statistical models in its pursuit to fill the gap between human communication and computer automation. The main challenges faced by NLP are speech recognition, natural language understanding and natural language generation.

Since the so-called “statistical revolution” in the late 1980s, NLP research has relied heavily on machine learning. The machine learning methodology focuses on using statistical inference, that is, automatically learning the rules of natural language through the analysis of large corpora of real-world examples. In machine learning, a corpus is a set of documents with human or computer annotations that can be used to generate a large set of “features” that “teach” algorithms to “understand” the relationships within the documents.

CAMERA’s challenges

In CAMERA, we analyze more than 40,000 projects from 2007 to 2020. The document corpus is composed by concatenating the “title” and “objective” of each project, yielding a combined average text length of 4200 words per document. The documents can’t easily be labeled by topic, so no prior information is known regarding their content or the number of topics covered. This prevents the use of supervised machine learning techniques to classify the documents by topic, adding a new layer of difficulty.

Furthermore, the texts are expected to present complex topical distributions, with soft links between subtopics and with documents belonging to multiple topics. Among all the research areas covered in the European Union, only mobility is relevant in CAMERA. This means that we need to identify the right topical distributions in a very sparse space. In this context, traditional unsupervised machine learning algorithms (e.g. clustering) do not perform well because they assign fixed classes to texts, limiting the possibilities of hybrid topic distributions. In this scenario, more complex methodologies are required, such as using probabilistic clustering to generate topic models.

Topic modelling in CAMERA

Topic modeling is a well-known tool for discovering hidden semantic structures in a corpus of documents. Topic models learn many related words from large corpora without any supervision. Based on the words used within a document, they mine topic level relationships by assuming that a single document covers a small set of concise topics. Furthermore, the output of the algorithm is a cluster of terms that identify a “topic”. The topic model can be very useful for quickly examining a large set of texts and automating the discovery of topical relationships between them.

The most popular topic modeling algorithm is Latent Dirichlet Allocation (LDA). LDA is a three-level hierarchical Bayesian model that fits words and documents over an underlying set of topics. The main particularity of this algorithm is that it is an unsupervised generative statistical model that allows sets of observations to be explained by unobserved groups that break down why some parts of the data are similar. In this case, the observations are words collected into documents, and each document can be represented as a mixture of a small number of topics.

The aspect of LDA that is most relevant for the CAMERA objective is that it can be seen as a matrix factorization technique. Any collection of documents can be represented in a vector space as a document-term matrix. The document-term matrix gives the frequency count of each word (represented as columns) in each document (represented as rows). LDA decomposes this document-term matrix into two lower-dimensional matrices: the document-topics matrix and the topic-terms matrix, with dimensions (N,K) and (K,M) respectively, where K is the number of topics (a parameter fixed by the analyst), N is the number of documents and M is the number of distinct terms.
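As a minimal sketch of the document-term matrix described above, the following uses only the Python standard library. The toy documents and the function name `document_term_matrix` are illustrative assumptions, not CAMERA data or code. The final comment notes the shapes an LDA factorization with a chosen number of topics would produce; no LDA is fitted here.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Toy corpus, invented for illustration only.
docs = [
    "urban mobility and rail transport",
    "air traffic management and airport mobility",
    "protein folding simulation",
]
vocabulary, dtm = document_term_matrix(docs)
n_docs, n_terms = len(dtm), len(vocabulary)
n_topics = 2  # the number of topics, chosen by the analyst

# LDA would approximate the (n_docs x n_terms) count matrix with
# a (n_docs x n_topics) document-topics matrix and
# a (n_topics x n_terms) topic-terms matrix.
print((n_docs, n_terms), "->", (n_docs, n_topics), "and", (n_topics, n_terms))
```

In practice the tokenisation step would also remove stop words and very rare terms, which keeps the number of columns manageable for a corpus of tens of thousands of documents.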

By using the LDA topic modeling approach, we can analyze the corpus of documents and iteratively extract the documents with a higher probability of belonging to mobility-related topics. After we extract the most relevant documents, we can run the topic modelling again and extract the distribution of mobility-related subtopics. This will give us quantitative metrics such as the degree of coverage in specific research areas, correlations between topics and similarity metrics to find similar projects.
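The extraction step can be sketched as a simple filter on the document-topics matrix. This is only an illustration under stated assumptions: the function `select_relevant`, the threshold value and the probabilities below are hypothetical, and the post does not say which selection rule CAMERA actually uses.

```python
def select_relevant(doc_topic, relevant_topics, threshold=0.5):
    """Return indices of documents whose summed probability over
    `relevant_topics` is at least `threshold`."""
    selected = []
    for index, distribution in enumerate(doc_topic):
        mass = sum(distribution[topic] for topic in relevant_topics)
        if mass >= threshold:
            selected.append(index)
    return selected


# Hypothetical document-topics matrix: each row sums to 1.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.35, 0.35],
]
mobility_topics = [0, 1]  # assumed to be the topics an analyst tagged as mobility-related

kept = select_relevant(doc_topic, mobility_topics, threshold=0.6)
print(kept)
```

The retained documents would then form the smaller corpus on which a second topic model is fitted to obtain the subtopics.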

But the methodology presents another problem: with LDA being an unsupervised algorithm, we cannot “choose” which topics are interesting and which are not. This is a huge issue when looking for a certain topic or distribution of topics. How did we solve this problem? Stay tuned for my next post on how we turned the unsupervised LDA methodology into a semi-supervised one.

1st CAMERA Workshop

What are Europe's mobility goals and how can progress towards these goals be measured? What would make up a feasible set of key performance indicators (KPIs) for mobility? And which major aspects of the work towards creating Europe's future transport system are addressed in the Mobility4EU Action Plan?

These are some of the key questions that were discussed during the workshop organised by the EU-sponsored CAMERA and Mobility4EU projects on 15th June in Brussels.

The aims of the “European mobility for the future: strategic roadmaps and performance assessment” workshop were to acquire feedback from experts from different mobility sectors on the development of a strategic roadmap for the European transport system (Mobility4EU Action Plan) and to discuss the research requirements, gaps, and bottlenecks shown up by this roadmap (Progress towards EU mobility goals).

A key output was that it is very easy to define things that should happen and very difficult to decide exactly how to measure them! For more details on the workshop results visit www.h2020camera.eu.

WORKSHOP MATERIALS: Handout | Final results

Domino: The structure

Author: Luis Delgado

The Domino project is structured in six work packages, as shown in the following image:

WP3 will analyse the current and future structure of the ATM system and define the mechanisms and the case studies that will be tested by Domino. These first case studies are the investigative case studies, which will define the first set of scenarios to be tested. WP4 will develop an Agent Based Model (ABM) which will be able to execute the different scenarios. In Domino, we understand the different actors in the system as agents which try to optimise their utility functions subject to the system constraints and the environment. The system constraints change when different mechanisms are implemented, as different options arise; and the environment in ATM is subject to uncertainty that the actors need to manage.

The metrics generated by the ABM will cover the impact on both flights and passengers. These outcomes will be analysed by WP5, where a Complexity Science toolbox will be used in order to generate knowledge on the status of the system. Traditional and complex metrics will be generated, along with specific network analyses to understand how the elements in the system are coupled and where the bottlenecks are generated. Once again, this dual flight and passenger perspective of the system is core to these analyses.

WP2 will provide support to the other technical packages in terms of data requirements, acquisition and preparation. Domino will model a past day of operations with new mechanisms applied to it.

Finally, Domino requires close collaboration and feedback from stakeholders and experts. This will be achieved with the interactions in WP6. The mechanisms will be subject to a consultation, the model developed in WP4 will be calibrated with the help of stakeholders and the results of the investigative case studies will be shared in a workshop (to be run in Spring 2019). This workshop will be the forum where adaptive case studies will be selected. These case studies try to mitigate some of the network issues identified in the results of the investigative case studies. The adaptive case studies will be run again from WP3 to WP5 to develop Domino's methodology: you have a new mechanism (a technological or operational change) and you'd like to learn about its impact on the ATM system; this mechanism is modelled within the ABM framework and tested with the Complexity Science toolbox; and once hotspots are identified, they can be mitigated by creating new scenarios to test!

Keep in touch to learn more or provide feedback to Domino and follow our updates regarding the preliminary results and the workshop!

See http://www.domino-eu.com for more info on the project.

Domino: The knock-on effect

AUTHOR: Luis Delgado

The objective of Domino is to analyse the coupling of elements in the ATM system and how changes (for example, by implementing different mechanisms) have an impact on the interrelationships between elements. In order to achieve this, Domino will develop a set of tools, a methodology and a platform to assess the coupling of ATM systems from a flight and a passenger perspective.

Different actors in the ATM system might have different views of its elements and their criticality. For this reason, Domino adds the passenger's view to the more classic flight-centred vision.

In Domino, the ATM system is seen as a set of elements that are related to each other by how the different actors (airlines, flights, passengers, airports, etc.) use them. The behaviour of these actors depends on the available rules of the system. These rules are defined, partially, by the mechanisms that are in place. Complexity Science tools will allow us to understand how the elements in the system are interconnected and how these connections change when the system is modified.

Domino will develop an Agent Based Modelling platform to capture the different relations in the system, and it will focus on three mechanisms, implemented and deployed with different scopes: Dynamic Cost Indexing (DCI), User Driven Prioritisation Process (UDPP) and Extended Arrival Manager (E-AMAN). Domino will provide a view of the effect of deploying solutions in different manners, e.g., harmonised vs. local/independent deployment.

If a piece in the system is knocked over, which others are going to be affected? Let Domino tell us!

See http://www.domino-eu.com for more info on the project.

European mobility for the future: strategic roadmaps and performance assessment

What are Europe's mobility goals and how can progress towards these goals be measured? What would make up a feasible set of key performance indicators (KPIs) for mobility? And which major aspects of the work towards creating Europe's future transport system are addressed in the Mobility4EU Action Plan?

These are some of the key questions that will be discussed at a workshop organised by the EU-sponsored CAMERA and Mobility4EU projects on 15th June in Brussels.

The aims of the “European mobility for the future: strategic roadmaps and performance assessment” workshop are to acquire feedback from experts from different mobility sectors on the development of a strategic roadmap for the European transport system (Mobility4EU Action Plan) and to discuss the research requirements, gaps, and bottlenecks shown up by this roadmap (Progress towards EU mobility goals).

The workshop will consist of two distinct sessions: the morning session on the Mobility4EU “Action Plan” and the afternoon one on the CAMERA “Progress towards EU mobility goals” topic. Both of these sessions will rotate through three parallel round-table discussions, and participants will actively contribute to the development of the Action Plan and to the identification of important aspects to be taken into consideration when designing Europe's future transport system.

Do you think you could bring something to these discussions? Are you interested in hearing what other experts think? Don’t miss this opportunity: take a look at the agenda and register for the workshop now! Admittance is free but places are limited.

EUROPEAN MOBILITY FOR THE FUTURE
STRATEGIC ROADMAPS AND PERFORMANCE ASSESSMENT

BRUSSELS, 15 JUNE 2018 Blue Point Brussels
Bvd. Auguste Reyers 80
1030 Brussels

www.h2020camera.eu
micol.biscotto@dblue.it
annika.paul@bauhaus-luftfahrt.net

www.mobility4EU.eu
marcia.urban@bauhaus-luftfahrt.net
beate.mueller@vdivde-it.de

Aircraft, network, and zoology

It is well known that building a schedule plan for an airline is a difficult problem. The core difficulty is indeed to take into account the multiple constraints of aircraft, crew, maintenance, passenger connections, etc., while trying to capture as much market as possible, all with minimum expenses. It is similar to riding a bike... except you do not know who is riding, where the wheels are, where you are supposed to go and if you should buy a car instead.

One of the most important constraints is the aircraft, since:

  • it is impossible to fly without it (rockets are quite unsafe to land at airports),
  • it is quite expensive (I've been told).

Let's imagine that, as an airline, you roughly know what cities you want to connect and how many passengers should travel with you. Where should your existing aircraft fly? Should you buy one? Do you have different strategies if you are a low-cost carrier or a traditional one? These are roughly the questions that our agents are trying to answer in the second block of our Vista model, the "schedule mapper". Of course, since our model simulates all the airlines in Europe, we cannot dedicate as much time (real and computational) as airlines do in reality to their schedule plan. But, as for the other parts of Vista, we are trying to capture the main behaviours of the system.

As usual, we start from what we can observe from data. For instance, it is common to say that aircraft usually go back and forth, and that some of them sometimes do triangular flights. Is that true? To investigate this, we take a three-day time window in which we track the itineraries of aircraft in terms of airports, defined as "patterns", using DDR data. What kind of patterns 'live' in this environment? How should we classify them?

First, just as taxonomists do not care about the specifics of a single individual when making a classification, we should not take into account the details of the patterns to classify them (in fact, that's the definition of a classification...). So, for instance, Rome - Paris - Rome has the same pattern as Frankfurt - London - Frankfurt, and both can be rewritten as 1 - 2 - 1. If a specific sequence is an individual in zoology, a pattern is thus akin to a taxon.
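The relabelling described above can be sketched in a few lines: each airport receives a number in order of first appearance. The function name `canonical_pattern` is an illustrative assumption, not part of the Vista code.

```python
def canonical_pattern(itinerary):
    """Replace each airport by the rank of its first appearance (1, 2, 3, ...)."""
    labels = {}
    pattern = []
    for airport in itinerary:
        if airport not in labels:
            labels[airport] = len(labels) + 1
        pattern.append(labels[airport])
    return tuple(pattern)


rome = canonical_pattern(["Rome", "Paris", "Rome"])
frankfurt = canonical_pattern(["Frankfurt", "London", "Frankfurt"])
print(rome, frankfurt, rome == frankfurt)
```

Because the result is a tuple, it can be used directly as a dictionary key when counting how often each pattern appears in the data.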

We can roughly divide these taxons into two "kingdoms": the ones which are closed (more explicitly, which have at least one closed loop), and the rest. For instance, an aircraft doing Paris - Frankfurt - Rome - Paris - Rome - Paris in three days has a closed pattern, whereas an aircraft doing Rome - Madrid - Barcelona has an open one. Of course, in the long run, most aircraft do at least one full loop, but in three days some of them cannot make it. However, when counted in number of flights, most of them are closed in three days already, as shown in the figure below. In the following, we focus only on these closed taxons. Pretty much like one could focus a study on mammals, for instance, except that in this case the mammals represent most of the animal kingdom.
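A minimal sketch of the closed/open split, assuming that "at least one closed loop" means that some airport is visited more than once within the time window. The function name `is_closed` is hypothetical.

```python
def is_closed(itinerary):
    """True if at least one airport is visited more than once."""
    return len(set(itinerary)) < len(itinerary)


closed_example = ["Paris", "Frankfurt", "Rome", "Paris", "Rome", "Paris"]
open_example = ["Rome", "Madrid", "Barcelona"]
print(is_closed(closed_example), is_closed(open_example))
```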

Among them, some are more elemental than others, in the sense that they cannot be constructed from their peers. These are the ones which have exactly one closed loop. The ones present in the data are represented in the figure below, with their frequency of appearance (the number n corresponds to the number of airports in the loop). Most of them are single returns (1 - 2 - 1), triangular flights (1 - 2 - 3 - 1), and rectangular flights (1 - 2 - 3 - 4 - 1), and we focus on these three in the following. Note that rectangular flights seem more frequent than triangular ones, perhaps contrary to popular belief.

All the other patterns can be constructed from these elementary ones, and we name them 'combined' patterns. For instance, (1 - 2 - 1 - 2 - 1) is composed of two single returns. In terms of zoology, it is a bit like saying that an elephant can be obtained by gluing a snake to a hippopotamus. Or that a giraffe is really nothing more than a horse with a periscope in the throat, which personally I believe very much. In any case, it is easy to plot the frequency of appearance of these combined taxons, as shown in the figure below. Since all of them come from three elementary taxons, we use the notation (X, Y, Z), where e.g. (2, 0, 0) represents two returns, (1, 1, 1) represents a return, a triangular flight and a rectangular one, etc. Some very rare patterns have been omitted from the figure. As expected from the previous figure, most aircraft go back and forth during the three days. It is interesting to see that triangular flights are very under-represented, and that it is more frequent to have a rectangular flight every now and then, in combination with returns. Note that when a pattern features several returns, they are not necessarily between the same airports (e.g. Warsaw - Oslo - Warsaw - Vienna - Warsaw). In fact, we found that most of the combined patterns are 'impure', i.e. they are composed of elementary patterns with different airports (like gluing two birds of different colours, for instance).
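One possible way to obtain the (X, Y, Z) notation from an itinerary is sketched below. The post does not describe how the decomposition is actually computed, so the stack-based loop extraction and the function name `decompose` are assumptions made for illustration: every time the aircraft comes back to an airport it has already visited on the current path, the loop it has just closed is counted by its number of airports and removed from the path.

```python
def decompose(itinerary):
    """Count elementary loops as (returns, triangular, rectangular)."""
    counts = {2: 0, 3: 0, 4: 0}  # loop size (number of airports) -> occurrences
    path = []  # airports on the current path, each at most once
    for airport in itinerary:
        if airport in path:
            start = path.index(airport)
            loop_size = len(path) - start
            if loop_size in counts:
                counts[loop_size] += 1
            # Drop the closed loop, keeping the airport we came back to.
            del path[start + 1:]
        else:
            path.append(airport)
    return counts[2], counts[3], counts[4]


print(decompose([1, 2, 1, 2, 1]))
print(decompose(["Warsaw", "Oslo", "Warsaw", "Vienna", "Warsaw"]))
print(decompose(["Paris", "Frankfurt", "Rome", "Paris", "Rome", "Paris"]))
```

Loops with more than four airports are ignored in this sketch, in line with the focus on returns, triangular and rectangular flights.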

What does Vista do with this freak zoo? Well, the way the airlines implicitly choose the different patterns is a complex procedure, driven by the different constraints cited above. So the idea is that the best patterns should be selected for their efficiency, much like some taxons are selected by evolution based on their fitness in the given environment. Each taxon also has some particularities. For instance, flights using the taxon (4, 0, 1) mainly depart (from their first airport) in the early morning, whereas taxons (2, 0, 0) are used by flights departing more frequently in the late morning, and sometimes in the evening, as shown in the figure below. Other regularities can be found, for instance in terms of average turn-around times.

In the model, we use all these data to build reasonable schedules by resampling the different taxons for each airline. This will be described in a later blog post. And no more weird animal crossings, we swear!

SafeClouds Mid-term Review

On 11th April, we had a successful mid-term review for our H2020 project, SafeClouds. The meeting was hosted by Eurocontrol in Brussels, with participants from all entities involved in the project.

Read Eurocontrol's post on the mid-term review here!

ANSPs, how do changes in fuel price affect your airspace revenues?

AUTHOR: Luis Delgado

Vista allows the analysis of complex scenarios with interactions between metrics of different stakeholders.

Flight plan generation and route selection

When airlines select their flight plans between a given origin and destination many different factors need to be considered, such as possible routes available, weather, aircraft performance or time required. Vista uses a data-driven approach analysing historical flight plans, routes between airports and aircraft performances to estimate the cost of operating those different routes.

As shown in the above diagram, the historical analysis of data allows us to generate a pool of two-dimensional routes; probability distributions for cruise wind, speed and requested flight level; and the length and duration of the climb and descent phases. With this information, for each possible route we can estimate the 4D trajectories that the airline will plan and estimate the total operating cost of these possibilities.

A given flight will, of course, follow only one of the possibilities, so at the pre-tactical level the different flight plan options are prioritised considering their expected direct operating costs (as a function of flight time, fuel and en-route airspace charges). This selection is not deterministic, as airlines will not always follow the apparent least-cost route, and in Vista we are interested in reproducing realistic flight plan selections, not the best option!
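The post does not specify how the non-deterministic selection is implemented. As a hedged illustration, one common way to favour cheaper options without always picking the cheapest is a multinomial logit (softmax) choice over the expected costs. Everything in the sketch below is an assumption: the function names, the `sensitivity` parameter and the cost figures are invented and are not taken from Vista.

```python
import math
import random


def selection_probabilities(costs, sensitivity=0.01):
    """Logit probabilities: lower cost means higher probability of selection."""
    cheapest = min(costs)
    # Subtracting the cheapest cost keeps the exponentials numerically stable.
    weights = [math.exp(-sensitivity * (cost - cheapest)) for cost in costs]
    total = sum(weights)
    return [weight / total for weight in weights]


def choose_flight_plan(costs, sensitivity=0.01, rng=random):
    """Draw the index of one flight plan according to the logit probabilities."""
    probabilities = selection_probabilities(costs, sensitivity)
    return rng.choices(range(len(costs)), weights=probabilities, k=1)[0]


# Invented expected direct operating costs (EUR) of three candidate flight plans.
costs = [10500.0, 10350.0, 10800.0]
probabilities = selection_probabilities(costs)
chosen = choose_flight_plan(costs, rng=random.Random(42))
print(probabilities, chosen)
```

A larger `sensitivity` concentrates the probability on the least-cost plan, while a value of zero makes all options equally likely. In a data-driven setting such a parameter would be calibrated against historical flight plan choices.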

What if we change the cost of fuel?

Vista is a great tool to analyse the impact of changes in parameters such as fuel cost on the behaviour of the stakeholders in the system. In some areas of Europe, airlines can select between different routes which might incur different en-route airspace charges, fuel consumption and flying times. This leads to trade-offs that can be captured by Vista. An example of one of those regions is western Europe, for flights between the UK and the Canary Islands. As shown in this image, airlines can select more direct routes using the airspace of France, Spain and Portugal, or operate longer routes which benefit from the low usage cost of the Oceanic airspace.

The trade-offs between different metrics for the airlines can be explicitly computed by Vista as shown in the image below for different fuel price scenarios. With higher fuel cost, shorter routes tend to be selected leading to lower fuel usage but higher airspace en-route charges.
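The trade-off can be illustrated with a toy calculation. The linear cost form, the function names and every number below are assumptions invented for this sketch; they are not Vista outputs or real route data. The point is only that a longer route with lower charges is cheaper when fuel is cheap, and the shorter route wins once fuel becomes expensive enough.

```python
def operating_cost(route, fuel_price, time_cost_per_min=10):
    """Direct operating cost (EUR) as fuel + en-route charges + time."""
    return (route["fuel_kg"] * fuel_price
            + route["charges_eur"]
            + route["time_min"] * time_cost_per_min)


def cheaper_route(routes, fuel_price):
    """Name of the route with the lowest direct operating cost."""
    return min(routes, key=lambda name: operating_cost(routes[name], fuel_price))


# Invented figures: the direct route burns less fuel but pays higher charges.
routes = {
    "direct": {"fuel_kg": 9000, "charges_eur": 2400, "time_min": 235},
    "oceanic": {"fuel_kg": 9400, "charges_eur": 2000, "time_min": 245},
}

for fuel_price in (0.5, 1.5):  # EUR per kg
    print(fuel_price, cheaper_route(routes, fuel_price))
```

With these invented figures the break-even fuel price is 0.75 EUR per kg: below it the longer oceanic route is cheaper, above it the direct route is.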

As Vista considers multiple stakeholders it is possible to assess the impact of these changes on the demand and expected revenue obtained by the different ANSPs as shown in the following images:

Expected revenue due to en-route charges variation for GCTS - EGKK flights

Expected revenue due to en-route charges variation for all of ECAC flights

The figure above shows the expected changes in revenues for the different ANSPs across Europe when the fuel price changes. This illustrates how different parameters are interconnected for different stakeholders in subtle ways that can be captured by Vista: changes in fuel prices lead to variations in route preferences, which might have an impact on airspace usage and on the revenues of the ANSPs!

SafeClouds presented in the EU-US workshop

Last January, a team of European and American entities organised a workshop on transatlantic research with the support of the European Commission. The event was hosted by the FAA in their facilities at the William J. Hughes Technical Center in Atlantic City. Most attendees were US and European companies interested in how the different research threads could be boosted through international cooperation.

Among the subjects discussed during the three-day event, data analytics was mentioned several times as an interesting area with applicability to different areas of industrial research. In particular, safety data analytics was covered in three presentations. First, the FAA presented its ASIAS programme, which collects data from more than 40 carriers and has been leading the developments in this field for more than a decade. Second, EASA presented the Data4Safety programme, recently launched and in a proof-of-concept stage. Lastly, Innaxis presented the research programme SafeClouds.eu, including the latest technological developments and how they could complement the existing initiatives by providing and exploring new research avenues.