Complex networks, data mining, causality, and beyond

Over the last few weeks Innaxis has published two papers that may be of interest to air transport researchers, among others.

The first paper is an extensive review on the combined use of complex network theory and data mining. Complex network analysis and data mining share the same general goal, that of extracting information from complex systems to ultimately create a new compact quantifiable representation, and they also often address similar problems. Despite these commonalities, a surprisingly low number of researchers take advantage of both methodologies, as many conclude that these two fields are either largely redundant or totally antithetic. In this review, we challenge this perception, show that this state of affairs should be put down to contingent rather than conceptual differences, and argue that these two fields can in fact be used advantageously in a synergistic manner. The review starts by presenting an overview of both fields, and by illustrating some of their fundamental concepts. A variety of contexts in which complex network theory and data mining have been used in a synergistic manner are then presented. Finally, all discussed concepts are illustrated with worked examples through a series of hands-on sections, which we hope will help the reader to put these ideas into practice. If you have ever wondered how a real-world problem can be tackled with these two techniques, you should definitely read this review!

 

 

The second paper addresses the common confusion between correlation and causality. To address it, many causality metrics have been proposed in the literature, all sharing the same drawback: they are defined for time series. In other words, the system (or systems) under analysis should display a time evolution. Associating causality with the temporal domain is intuitive, due to the way the human brain incorporates time into our perception of causality; nevertheless, this association creates some rather important problems.

For instance, suppose one is trying to detect whether there is a causal relationship between the workload of an ATC controller and the appearance of loss of separation events. These events are only defined at one point in time. To illustrate, one can detect an instance of a loss of separation and check the corresponding workload; afterwards, perform the same actions for another event; and so forth. In the end, the researcher would get two vectors of features, which do not encode any temporal evolution; in other words, consecutive values are not correlated. So, in this situation, how can we detect whether a true causality (and not just a correlation) is present?

In this paper we propose a novel metric able to detect causality within static data sets, by analysing how extreme events in one element correspond to the appearance of extreme events in a second element (refer to the picture above for a graphical representation). The metric is able to detect non-linear causalities, to analyse both cross-sectional and longitudinal data sets, and to discriminate between real causalities and correlations caused by confounding factors.
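
To make the idea of relating extreme events in two static vectors more concrete, here is a naive sketch in Python. It is not the metric proposed in the paper: it only counts, under the assumption that "extreme" means "above a chosen threshold", how often extremes in one vector are accompanied by extremes in the other. The function name, the thresholds and the toy data are all hypothetical.

```python
def extreme_coincidence(first, second, first_threshold, second_threshold):
    """Fraction of extreme observations of `first` for which `second` is also extreme.

    'Extreme' is assumed to mean 'strictly above the threshold'; this is an
    illustrative choice, not the definition used in the paper.
    """
    extreme_idx = [i for i, value in enumerate(first) if value > first_threshold]
    if not extreme_idx:
        return None  # no extreme events in `first`: the ratio is undefined
    both = sum(1 for i in extreme_idx if second[i] > second_threshold)
    return both / len(extreme_idx)


# Made-up data: one pair of values per observed event, with no time ordering.
first = [2, 9, 3, 8, 1, 9, 2, 3]
second = [1, 9, 2, 8, 9, 9, 9, 1]

# Extremes in `first` are always accompanied by extremes in `second`...
forward = extreme_coincidence(first, second, 7, 7)
# ...but extremes in `second` also appear when `first` is low.
backward = extreme_coincidence(second, first, 7, 7)
print(forward, backward)
```

An asymmetry between the two directions is only a hint: on its own, this count cannot separate real causality from the effect of a confounding factor, which is one of the things the metric in the paper is designed to do.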

If you are interested in these ideas, feel free to have a look at these two papers:

M. Zanin et al., Combining complex networks and data mining: why and how. Physics Reports (2016), pp. 1-44. http://authors.elsevier.com/a/1T3yF_8QfbYE-k. Also available at: http://arxiv.org/abs/1604.08816
M. Zanin, On causality of extreme events. PeerJ. Also available at: http://arxiv.org/abs/1601.07054

If you have questions about them, please contact M. Zanin at mzanin@innaxis.org

Finally, Seddik Belkoura is going to present a paper at the forthcoming ICRAT 2016, Philadelphia, about the use of the static causality metric to study delay propagation. You can find the paper on the official website of the conference (http://www.icrat.org/), and also by contacting him at sb@innaxis.org.


Secure information sharing – secure multi-party computation in air traffic management

Easing confidentiality between business rivals through a clever use of mathematics

Secure Multi-party Computation (SMC) is the preferred technique when multiple parties have to perform a computation, yet do not want to share private, confidential data. SecureDataCloud, the first research project about the application of SMC in air transport, was recently completed by a team led by Dr. Zanin. The foundations of this technique are valuable for potential applications in the context of cyber-security, air transport and other domains.

The history of cryptography, i.e. the study of techniques for secure communication in the presence of adversaries, is fascinating and has been linked to social and cultural changes. Over two thousand years ago, the Caesar cipher was the state of the art. It involved an alphabet shift with a constant key, such that “abc” may be encrypted to “bcd”. This concept, while very simple by today’s standards, was a novel technique in the days of the Roman Empire.
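
The alphabet shift can be sketched in a few lines of Python. This is only an illustration: it assumes the modern 26-letter lowercase alphabet and leaves every other character untouched, and the function names are made up for this example.

```python
import string


def caesar_encrypt(text, shift):
    """Shift each lowercase ASCII letter by `shift` positions, wrapping after 'z'."""
    alphabet = string.ascii_lowercase
    k = shift % 26
    shifted = alphabet[k:] + alphabet[:k]
    # Characters outside a-z are not in the table and are left as they are.
    return text.translate(str.maketrans(alphabet, shifted))


def caesar_decrypt(text, shift):
    """Undo the encryption by shifting in the opposite direction."""
    return caesar_encrypt(text, -shift)


print(caesar_encrypt("abc", 1))  # bcd
print(caesar_decrypt("bcd", 1))  # abc
```

With only 25 useful keys, trying every shift is enough to break it, which is why the constant key is the weak point.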

Then, a substantial change in technology occurred in 1553, when the Vigenère cipher was invented by Giovan Battista Bellaso. This new cipher relied on a long keyword: each letter of the message is substituted according to the corresponding letter of the keyword. If the keyword is long enough (ideally, as long as the message itself), this scheme is secure. The challenge of transmitting a long secret key was met by using sentences from books owned by both the sender and the receiver, although in those days it was less likely that both would own the same book.
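
As a minimal sketch of the keyword-controlled substitution, the following Python function shifts each letter of the message by the alphabet position of the matching keyword letter, repeating the keyword when it is shorter than the message. It assumes both message and keyword contain only lowercase letters a-z; the function name and the sample message are illustrative.

```python
def vigenere(text, key, decrypt=False):
    """Shift the i-th letter of `text` by the position of the i-th key letter."""
    sign = -1 if decrypt else 1
    base = ord("a")
    out = []
    for i, ch in enumerate(text):
        # The key is repeated cyclically when it is shorter than the text.
        shift = ord(key[i % len(key)]) - base
        out.append(chr((ord(ch) - base + sign * shift) % 26 + base))
    return "".join(out)


ciphertext = vigenere("attackatdawn", "lemon")
print(ciphertext)  # lxfopvefrnhr
print(vigenere(ciphertext, "lemon", decrypt=True))  # attackatdawn
```

With a one-letter keyword this reduces to the Caesar cipher; the longer and less repetitive the keyword, the fewer regularities are left for an attacker to exploit.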

The greatest progress in the evolution of cryptography has been achieved in recent decades, through the development of Secure Multi-party Computation (SMC) techniques. Previously, the scenario involved two parties trying to maintain privacy against an external adversary. However, in many modern applications, two or more parties need to maintain their privacy against each other, not just external adversaries. Yet, they still need to collaborate to exchange critical information, which is a significant change in the information security framework.

Secure computation was invented by Andrew Yao in 1982, and can be exemplified by the following problem, as originally proposed by Yao himself. Suppose two millionaires, Alice and Bob, are interested in knowing which of them is wealthier, yet they do not want to reveal their actual wealth. To put it differently, both parties (Alice and Bob) possess some information, respectively represented by A and B; the SMC problem is then the evaluation of a function C = f(A, B), such that at the end both Alice and Bob get to know C, but they don’t gain any additional information about A and B.
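
The idea of evaluating a function on private inputs can be illustrated with a case much simpler than the millionaires’ comparison: a secure sum based on additive secret sharing. This is a textbook technique chosen here for brevity; it is neither Yao’s protocol nor the one used in SecureDataCloud. The modulus, the function names, the made-up inputs and the three-party setting are assumptions of this sketch, and all parties are simulated inside one process. Three parties are used because with only two, knowing the sum and your own input would reveal the other input.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum


def make_shares(secret, n_parties):
    """Split `secret` into n random-looking shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the protocol locally: only sums of received shares are revealed."""
    n = len(private_values)
    # distributed[i][j] is the share that party i would send to party j.
    distributed = [make_shares(value, n) for value in private_values]
    # Each party j publishes only the total of the shares it received.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


private_inputs = [120, 45, 300]  # made-up private values of three parties
total = secure_sum(private_inputs)
print(total)  # 465
```

A real deployment would also need secure channels between the parties and protection against participants who deviate from the protocol, both of which this sketch ignores.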

Many solutions have been proposed in the last 30 years enabling the evaluation of (almost) any function. The mathematics involved in such computations can be complex, and the computational cost associated with SMC protocols is high. Just to give an example, the secure two-party evaluation of an Advanced Encryption Standard (AES) encryption was achieved in 2007 (Lindell and Pinkas, 2007) but the computation takes around 20 minutes. Using SMC to access your bank account could be really secure, but access to the information may take 20 minutes.

Innaxis started working on solving certain information-sharing paradigms in Air Traffic Management (ATM) using SMC in 2012. In these scenarios, different stakeholders must share information to reach a common goal, as mandated by the concept of Collaborative Decision-Making (CDM). Such information may be confidential, and parties may not be comfortable sharing it because of the risks involved. For instance, in the case of slot trading, airlines may be interested in trading slots, but revealing their target price is tantamount to giving away business information (i.e. the business value of that slot, the number of passengers they expect to allocate there, and so forth). Other applications of SMC could enable the exchange of safety information; exchanging the number of certain safety-critical events might be beneficial to all airlines, but this kind of information is confidential and very sensitive, and would be better shared through an SMC protocol.

Can these problems be solved by a trusted “neutral party”, in charge of managing the information and ensuring that no ill-conceived analyses are executed? Possibly, but you have to find such a party, trust that the information remains confidential within it, and ensure the security of the communication links used to transmit the data. Additionally, having a single entity with access to every piece of data makes the system very vulnerable to cyber-attacks.

Starting from these considerations, we decided to start a research line concerning the use of SMC within air transport. The SESAR programme of the European Union recognised the value of this and financed the research project SecureDataCloud. We addressed two important problems: the trading of airport slots by airlines, and the calculation of delay statistics, both processed in a secure way.

The reader may refer to the several publications that resulted from this research work, with concrete implementation details that address and solve the mathematical and computational challenges. Specifically, (Zanin et al., 2013) outlines the main ideas behind the project and how SMC could be applied to ATM. (Zanin et al., 2014) and (Zanin et al., 2016) study a parallel problem, i.e. the creation of a secure CO2 allowance trading mechanism. Finally, (Zanin et al., 2015) deals with the problem of creating a secure trading mechanism for airport slot allocation.

Massimiliano Zanin will present SMC for air transport applications at the forthcoming Eurocontrol Cyber-security workshop, next March 23rd in Toulouse. If you need more details about this talk, or about SMC in general, please feel free to contact Massimiliano at mz@innaxis.org.

 

References:

Y. Lindell and B. Pinkas, “An efficient protocol for secure two-party computation in the presence of malicious adversaries,” Eurocrypt 2007, vol. Springer LNCS 4515, pp. 52-78, 2007.

Zanin, Massimiliano, et al. “SecureDataCloud: Introducing Secure Computation in ATM.” SESAR Innovation Days, Stockholm (2013).

Zanin, Massimiliano, et al. “Enabling the Aviation CO2 Allowance Trading Through Secure Market Mechanisms.” SESAR Innovation Days, Madrid (2014).

Zanin, Massimiliano, et al. “Design and Implementation of a Secure Auction System for Air Transport Slots.” Services (SERVICES), 2015 IEEE World Congress on. IEEE, 2015.

Zanin, Massimiliano, et al. “Towards a Secure Trading of Aviation CO2 Allowance.” Journal of Air Transport Management, in press, 2016.

The Case for Mobility Modelling in Europe


There are many performance targets for the European aviation system. It is clear that performance-based frameworks are needed and utilised, especially when decision makers need to act on legislative packages or when operational managers need to make procedural changes or decisions regarding technology in aviation. The principle underlying this model of operations is that any costly decision must ultimately result in an increase in performance.

Different performance frameworks look into different aspects of the European aviation framework, with varying goals that are not necessarily compatible or aligned in the same direction. To illustrate, FlightPath 2050 envisions an air transport system that improves safety levels but also guarantees a time performance for future passengers in Europe: a maximum of four hours door-to-door travel time for 90% of travellers. This number is not arbitrary, as it corresponds to the type of experience high-level experts had envisioned for European passengers. However, punctuality and efficiency metrics are mostly flight-centred. Passengers are rarely considered in time-performance schemes and therefore very little is known about the actual door-to-door time performance from the passenger perspective. Decisions such as ‘when’ or ‘where’ to act in achieving this goal have proven to be more challenging than initially expected.

The European Commission Single European Sky Unit is working on Reference Period 3, which delves deeper into the performance scheme for air navigation services and network functions. This performance framework is very detailed, but unfortunately does not yet include provisions for passenger time-performance. Due to the complexity of different, non-interchangeable metrics, the KPAs and goals of the different performance schemes do not necessarily match.

SESAR and CleanSky have detailed, technical performance goals. Since they look into specific pieces of technology or procedures, it is clear that their technologies will improve the performance of many concrete operational elements (e.g. runway performance); however, it is unclear how much those programmes will contribute to other performance frameworks. For instance, Europe may need additional funding to ensure better technology, or a different distribution of effort across the different technology research areas.

Mobility Modelling with Mercury

It is not realistic to believe a top-down Performance Framework can rule all initiatives. Each initiative has its own complexities, which justify running it independently, on occasion working with different groups of stakeholders or professionals. Nonetheless, a single vision for European mobility is needed.

Innaxis and the University of Westminster have been working for over 5 years on an integrated mobility model that provides a wide range of performance and mobility metrics, for use by a variety of airlines, network managers and policy makers. This integrated mobility model is called the Mercury Air Transport model (Mercury).

Mercury is capable of modelling passenger connectivity inside the European aviation system, along with a wide range of flight and passenger prioritisation scenarios. In order to cope with this monumental task, Mercury uses Soft Computing techniques and runs on a cloud-based infrastructure. Mercury has been validated by airlines and captures airline decision-making and related costs by fusing a variety of data sources. Furthermore, Mercury works within the integration of different Performance Frameworks to produce the most accurate and useful metrics for each stakeholder.


Data Scientist position at Innaxis

Innaxis is seeking a Data Scientist to join its research and development team in aviation projects. As a member of the team, you will join a very interdisciplinary group of researchers, scientists, mathematicians and engineers who work for private companies and public institutions on solving the most challenging problems and getting the most out of their data.

A mixture of creativity and technical skills is required to complement the skill set of a team that has spent the last 5 years achieving landmarks in terms of network performance analyses across different areas within the aviation sector.

We are looking for a talented individual to help the team complement the existing research threads on machine learning and data mining, to provide new insights on the performance of complex systems and enable the real-time analysis of complex phenomena. Being part of our team will mean cooperating with other skilled researchers currently focused on knowledge discovery, data engineers and visualisation experts.

Requirements are as follows:

  • Degree in Computer Science or similar (mathematics, physics) with outstanding background and experience in programming.
  • Experience in the collection and preparation of datasets for machine learning exercises.
  • Understanding of general architectures and tools for machine learning, from validation to (automatic) feature selection.
  • Fluency in English: it is the working language at Innaxis!

Technical skills that may be relevant in the evaluation:

  • Understanding of the theoretical and implementation approaches for standard data mining models and algorithms, from SVMs to deep learning techniques based on Deep Neural Networks, as well as their combination.
  • Basic knowledge of database technologies and use: MySQL, MongoDB, JQuery.
  • Any programming language is a plus: both general (Python, C, Matlab) and data analysis oriented (Weka, R) …

We offer:

  • Immediate start within a highly qualified and collaborative international team with innovative thinking and working methodology focused on the development of large scale research and innovation projects.
  • Interesting salary as a function of skills, experience and education.
  • Flexibility and good working conditions.

Interested candidates should send their detailed CV and relevant information to innovation@innaxis.org

Big Data Engineer position at Innaxis

Innaxis is seeking a Big Data Engineer to join its research and development team. As a member of the team, you will join a very interdisciplinary group of researchers, scientists and engineers who work for private companies and public institutions on solving the most challenging problems and getting the most out of their data. A mixture of creativity and technical skills is required to complement the skill set of a team that has spent the last 5 years achieving landmarks in terms of network performance across different areas within the aviation sector.

We are looking for a talented individual to help the team complement the existing research threads on engineering infrastructures to support data mining against large datasets. Your role within the team will be to design, test and implement state-of-the-art information acquisition systems for existing data sources within the aviation sector. This data will be further analysed in search of insightful patterns and, ultimately, knowledge discovery. The acquisition systems developed should also be cost-efficient, reliable and compliant with our data providers’ privacy directives.

Requirements are as follows:

  • Degree or MSc in Computer Science with outstanding background and experience in programming and systems management.
  • Strong interest in Amazon cloud-based solutions, specifically EC2, EBS, RDS and IAM.
  • Strong interest in database design and management, including SQL and NoSQL solutions and ecosystems.
  • Enthusiasm for software design and testing methodologies.
  • Fluency in English: it is the working language at Innaxis!

Technical skills that may be relevant in the evaluation:

  • Knowledge of database technologies and use: MySQL, MongoDB, JQuery
  • Proficiency in at least one programming language: Python, Perl, R, C++
  • Understanding of data mining algorithms: KDD, support vector machines, etc.

We offer:

  • Immediate start within a highly qualified and collaborative international team with innovative thinking and working methodology focused on the development of large scale research and innovation projects.
  • Interesting salary as a function of skills, experience and education.
  • Flexibility and excellent working conditions.

Interested candidates should send their detailed CV, a research interest letter and any relevant information to innovation@innaxis.org

Complex, functional and multi-layer networks: from the brain to air transport

This post was written by Innaxis researcher, Massimiliano Zanin. 


 

In the last few years, researchers have realized that interactions between the constituting elements of complex systems seldom develop on a single channel. Let’s take the case of a social network: information exchange may happen orally, electronically, or even indirectly; additionally, people interact according to different types of relationships, like friendship and co-working. This is important because the type of information shared may significantly depend on the channel and on the type of relation: you would probably not say the same thing to a co-worker in an email as you would to a significant other face to face. Due to this, it may be necessary to include different types, or layers, of links in order to obtain a meaningful representation of the system under study. Neglecting such multi-layer structure, or in other words working with the projected network, may alter our perception of the topology and dynamics, leading to a wrong understanding of the properties of the system.

For a couple of years now, I’ve been interested in the multi-layer structure of the air transport system, see for instance Refs. [1, 2]. Clearly, not all connections are the same: it is straightforward to see that a clear multi-layer structure is created by airlines and airline alliances, which allow an easy movement of passengers within them, but make inter-layer movements difficult. Last year we published a huge monograph on multi-layer networks, which covers all aspects: from the definition of topological metrics and the analysis of dynamical processes, up to a review of applications. You can find it in Ref. [3]. (But please, do not print it before checking the number of pages!)

More recently, I’ve started asking myself: “what about multi-layer functional networks?” Let’s take one step back, and see what functional networks are.

In the early stages of complex network theory, this paradigm was mainly used to analyze systems whose structure, either physical or virtual, could be directly mapped into a network. Once again, this is the case of the air transport system, as links (direct flights between pairs of cities) have a physical nature and are easily accessible. It soon became clear that in certain cases this is not possible, as the only information obtainable from the system itself is the evolution through time of some observables. Such measurable variables reflect the behavior of the interacting elements constituting the system, and as such, the value of every observable is expected to be a “function” of the values of its peers. When the structure of such interactions is inferred from the dynamics of the observables, the result is called a functional network.

Many examples of functional networks are available, but probably the most famous is the study of brain dynamics. First of all, it should be noted that physical connections between brain regions do exist, but they are quite difficult to assess… especially if you don’t want to damage the brain! Also, physical connections are interesting, but much more important are the connections that actually activate when the brain is performing some kind of task. A functional network representation can be the perfect solution. By considering the magnetic or electric field generated by spiking neurons, links are established whenever some kind of synchronisation is detected between the recorded time series, usually by means of metrics like Pearson’s linear correlation, Synchronization Likelihood, or Granger Causality. When two regions are synchronised, they are (probably, indeed this point can be discussed!) exchanging some kind of information, and thus participating in a specific computation: functional networks thus represent these collaborative processes.
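
As a rough sketch of how a functional network can be reconstructed, the Python code below links two nodes whenever the absolute Pearson correlation between their time series exceeds a threshold. The threshold value, the node names and the toy signals are assumptions made for illustration; real studies choose the synchronisation metric and the threshold with far more care.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's linear correlation between two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def functional_network(series, threshold=0.8):
    """Return the set of node pairs whose series are strongly correlated."""
    links = set()
    for a, b in combinations(sorted(series), 2):
        if abs(pearson(series[a], series[b])) > threshold:
            links.add((a, b))
    return links


# Made-up recordings for three hypothetical regions.
signals = {
    "region_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "region_2": [2.1, 3.9, 6.2, 8.0, 9.9],  # closely follows region_1
    "region_3": [3.0, 1.0, 4.0, 1.0, 5.0],  # mostly unrelated
}
network = functional_network(signals)
print(network)
```

Here only region_1 and region_2 end up connected. Swapping `pearson` for another metric changes the links but not the overall recipe.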

Now, what about the multi-layer structure of the brain? It is well known that the human cortex has a six-layer structure, in which each layer is responsible for a different level of information abstraction and integration. This structure is nevertheless neglected, due to the limited spatial resolution of magnetic and electric sensors, and the analysed time series just correspond to the global activity of the top-most layers. We are thus projecting the multi-layer network into a single layer. Are we confident that the resulting network is still representative of the original brain activity? Notice that the non-linear nature of the projection process can foster the appearance of constructive or destructive interferences: a link may appear in the projection even if no relationship is present in any layer; or links in two layers can interfere, and disappear from the projection.

How can we validate this hypothesis? It cannot be done with brain data, as we still cannot solve the spatial resolution problem – let’s see how technology will evolve in the next decade. I found a solution by moving back to aviation. Specifically, we can create functional networks of delays: nodes are airports, pairwise connected when there is a correlation (or causality) between the time series representing their average hourly delay. Airports are thus connected if a delay propagation process is detected between them. The concept is not new: see for instance the work performed in the POEM WP-E project [4]; the advantage is that delay propagation can be assessed without any modelling process, and without having to gather information about aircraft turn-arounds, crews, etc. Moreover, the availability of high-resolution real data allows the reconstruction of a complete multi-layer picture, in which each layer corresponds to a different airline. But we can also collapse the dynamics, in order to simulate the creation of a single-layer representation, and compare the structures of the single- and multi-layer representations.
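
The comparison between per-layer and collapsed dynamics can be sketched as follows. This is a toy illustration, not the analysis of Ref. [5]: it assumes the projection is a plain hour-by-hour sum of the per-airline series, uses made-up values that represent deviations from the mean delay (so they can be negative), and all names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's linear correlation between two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def project(layers):
    """Collapse all layers by summing, hour by hour, the series of each airport."""
    projected = {}
    for airports in layers.values():
        for airport, series in airports.items():
            current = projected.get(airport, [0] * len(series))
            projected[airport] = [c + s for c, s in zip(current, series)]
    return projected


# layers[airline][airport] -> hourly delay deviations (made-up numbers)
layers = {
    "airline_A": {"AP1": [1, -1, 1, -1], "AP2": [1, -1, 1, -1]},
    "airline_B": {"AP1": [1, 1, -1, -1], "AP2": [-1, -1, 1, 1]},
}

per_layer = {name: pearson(a["AP1"], a["AP2"]) for name, a in layers.items()}
single = project(layers)
collapsed = pearson(single["AP1"], single["AP2"])
print(per_layer, collapsed)
```

In this constructed case the two airports are perfectly (anti)correlated within each airline, yet the correlation of the collapsed series is zero: the link is present in every layer and disappears from the projection.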

This is exactly what I’ve done in a paper recently published in Physica A [5]. Results are quite startling! First, the most central nodes in the projections do not correspond to the nodes of high centrality in each layer; therefore, the former analysis gives biased estimates, which cannot reliably be used to detect the most critical elements in the system. If you then try to use a single-layer model to allocate resources, you would probably end up giving money to the wrong airport! Furthermore, when a simple dynamical model is executed, the magnitude of the error yielded by considering a single-layer projection is as big as the results themselves, thus indicating that any estimate obtained with this simplification is meaningless.

So, what does this mean in terms of complex systems modelling? Can we neglect the multi-layer structure? The answer is clearly NO.

Let’s consider the problem of modeling and forecasting the dynamics of the air transport network. First, the results obtained imply that any simulation performed to understand the dynamics of the system may yield misleading results when the multi-layer structure created by airlines is neglected. In spite of this, most recent research works in this field fail to include this essential ingredient, both in the analysis of delay propagation and of the network robustness to disruptions and attacks. Second, it should be noted that the air transport system is created by the interactions between a large number of agents, which may create different layers along different dimensions. For instance, multiple flights do not just share the airline, but they may also be connected by the crew operating them. Disregarding these different layer dimensions, like crews, aircraft types or flight type (cargo or passengers), may further bias our understanding of the system.

If the most important airports, in terms of delay propagation, cannot reliably be detected with a projected functional network, the identification of functional hubs in brain dynamics may likewise be confused by the fact that the multi-layer structure of the cortex is neglected. Global hubs may therefore not correspond to the most important nodes in each layer: the single-layer analysis may then misinform us about the real structure created by information flows. Or, in other words, it is possible that all results obtained in neuroscience by means of functional networks may be biased… quite a big problem!

Summing up: complex networks, and their functional version, are very powerful tools to understand the hidden dynamics behind real complex systems. Yet, one has to remember this: one layer does not fit all!

P.S.: One last comment: if you want to play with the data of Ref. [5], you can find an interactive version of the paper here.

 

References:

[1] Cardillo, A., Gómez-Gardeñes, J., Zanin, M., Romance, M., Papo, D., Del Pozo, F., & Boccaletti, S. (2013). Emergence of network features from multiplexity. Scientific reports, 3. Freely accessible here.

[2] Cardillo, A., Zanin, M., Gómez-Gardeñes, J., Romance, M., del Amo, A. J. G., & Boccaletti, S. (2012). Modeling the multi-layer nature of the European Air Transport Network: Resilience and passengers re-scheduling under random failures. arXiv preprint arXiv:1211.6839. Preprint available here.

[3] Boccaletti, S., Bianconi, G., Criado, R., Del Genio, C. I., Gómez-Gardeñes, J., Romance, M., … & Zanin, M. (2014). The structure and dynamics of multilayer networks. Physics Reports, 544(1), 1-122. Preprint available here.

[4] Cook, A., Tanner, G., Cristóbal, S., & Zanin, M. (2013). New perspectives for air transport performance. Third SESAR Innovation Days, 26th – 28th November 2013. PDF available here.

[5] Zanin, M. (2015). Can we neglect the multi-layer structure of functional networks?. Physica A: Statistical Mechanics and its Applications. Preprint available here.


Innaxis looks forward to sharing their work in ATM at SIDS 2013 in Stockholm

As in every year, many of us are getting ready to participate in the SESAR Innovation Days, organised by Eurocontrol and the SJU, this year in Stockholm.

In 2013, Innaxis has been particularly busy in this research field and, just as in previous editions, we will be especially active during the SID. In any case, with the goal of stimulating discussions with you in different research areas, this email aims to give you a small briefing on how we are participating in Stockholm in this important event:

The ComplexWorld network has come a long way in the last three years. Paula López (plc@innaxis.org) will give a presentation on the first day providing details of the different activities and how the network is setting itself up for 2014. On Day 3 the network will hold the satellite event “Complex Metrics in ATM” and a PhD session. Please do not hesitate to talk to Paula if you are interested in more information on the network activities.

At Innaxis we have been working on passenger-oriented metrics for a few years now, crafting a detailed tool to compute those metrics focusing on the 4 hour door-to-door challenge of FlightPath2050 and now Horizon2020. On the second day of SID (27th Nov), our colleagues from the University of Westminster will present this tool and the initial results. As part of our efforts to improve the way performance is assessed in air transport, this year we also worked on other case studies, including scenarios in which there is no tool set available to correctly design ATM operational concepts. To tackle the challenges of changing ATM while still remaining in control of the performance assessment, it is critical to look into new ways of estimating KPIs. This is what the tool set developed by CASSIOPEIA has accomplished. A poster about this project and the tool set developed will be available at SID.

If you would like more information on the four hour door-to-door challenge in which Innaxis is currently engaged, the latest passenger metrics developed in POEM, or the most recent agent-based CASSIOPEIA modelling framework, please do not hesitate to contact the architect of these design tools for ATM, Samuel Cristóbal (scristobal@innaxis.org) who will be at the conference over the whole week.

Data Science has been an area of major interest at Innaxis over the last few years and in October we organised the first Data Science Workshop for Air Transport, which was held in Madrid. We are working on different elements of an infrastructure to allow major data mining work for Air Transport on different fronts; from evaluating current delay propagation, resilience of airports and airlines against disturbances, to evaluating new paradigms on safety monitoring, all of which is based on powerful data analytics. We are very proud of our advancements in the area. On Tuesday the 26th, our colleague Massimiliano Zanin will present a paper comparing traffic density as measured today (i.e., number of aircraft crossing a sector), with other measures based on data analytics. If you need any information about this interesting research topic, please contact Mass directly (mz@innaxis.org).

Information Management has also been an area of interest for us. In particular, we think the Data Science paradigms will only be fully enabled if data is shared across stakeholders, and this can only be achieved if the right secure and encrypted mechanisms are put in place. We will present a poster with a number of SecureDataCloud ideas for ATM, which could be of use on different fronts, safety and fuel consumption among others. You should also talk to Mass if Information Management is your area of interest.

Last but not least, we will also serve as rapporteurs, helping Eurocontrol to draw some conclusions and providing our own views on future research avenues. Carlos Álvarez will take care of this during the closing session. Please contact Carlos (calvarez@innaxis.org) if you feel inspired by his words!

We hope we have many opportunities to interact next week and hope you find our activities interesting and motivating for future initiatives.

See you in Stockholm!

Wh- Questions about Data Science

The five Wh’s of Data Science – What, Why, When, Who and Which.

While preparing the upcoming October workshop on Data Science, Innaxis has gathered wh- questions and simple answers about the “new reality” of data science. We also provide links to pages where more information about these important questions can be found.

What?

The basic answer to “what is Data Science?” could be “a set of fundamental principles that support and guide the principled extraction of information and knowledge from data”. Definitions, especially of new terms, should remain simple despite the urge to make them complicated. Furthermore, the boundaries between Big Data, Data Science, Statistics and Data Mining are not easy to discern: these fields share common principles and tools and, importantly, the same aim, the extraction of valuable information.

Why?

What is the reason for extracting information from data? There is a brilliant quote by Jean Baudrillard: “Information can tell us everything. It has all the answers. But they are answers to questions we have not asked, and which doubtless don’t even arise.” In this context, proper data science is generally neither basic science nor long-term research; it is considered an extremely valuable resource for the creation of business. It means mining large amounts of both structured and unstructured data to identify patterns that can directly help an organization with its costs, with creating customer profiles, increasing efficiencies, recognizing new market opportunities and enhancing its competitive advantage.

When?

Throughout history, many names have been given to a well-known equivalence, information = power, from medieval censuses to the Royal Navy strategies based on statistical analysis. As for the current understanding of Data Science, the term has moved away from being a synonym for Data Analysis in the early 20th century to being associated, from the 1990s onwards, with Knowledge Discovery (KD). One of the very best compilations of data science history and publications over the last 60 years can be found in this Forbes article.

Throughout history, the methods and tools used have changed, developing as mathematical, data extraction, software and hardware capabilities have increased, particularly in recent years. The consequent “sudden” eruption in Data Science jobs, which reflects the market’s real interest in the potential benefits of knowledge extraction, is shown in the following graph taken from LinkedIn analytics:

Courtesy LinkedIn Corp.

Who?

If you are a lawyer or a doctor, everybody knows more or less your level of university education and the nature of your daily tasks. What, then, is a “Data Scientist”? The paths that lead to a Data Science career are not clearly defined and are difficult to identify. The so-called “Sexiest Job of the 21st Century” (according to the Harvard Business Review) needs a common definition and even specific university degrees. The data jockeys that have always been employed on Wall Street are no longer alone. Meanwhile, the scope and variety of the data now available keep growing non-stop, so operational, statistical and even hacking backgrounds are all welcome when it comes to extracting value from it. More information about data scientist careers and the main disciplines can be found in this excellent article from naturejobs.com.

To understand Data Science job titles, we recommend you also have a look at this article by Vincent Granville from DataScienceCentral. It is a living tongue twister: a data mining activity carried out by a data scientist on data scientist job titles. In short, it resembles the following recipe: take a mixer from the kitchen; add the words “Data”, “Analytics” and “Scientist”; switch it on; add an institutional label such as “Director”, “Junior” or “Manager”. An optional topping could be your university degree, “Engineer” or “Mathematician”. There you have one of the possible titles of today’s data scientist.

Which?

Which data is “datascience-able”? As we described in our previous post about Data Science, there is huge potential in almost every imaginable field that can provide data of sufficient quality for analysis. However, even where the data is available, there are challenges, generally connected with data storage and management capabilities. These challenges are covered in detail in the Innaxis blog post “The benefits and challenges of Big Data”. One of the remarkable and exciting things about Data Science is that additional knowledge can be extracted from data sets that, at first sight, are not expected to provide anything beyond the obvious potential of the so-called “direct” datasets. The reality is that it is hard to know which data sets will add value before testing them with Data Science. Once discovered, hidden patterns and unseen correlations add more valuable knowledge for an organization than direct cause-and-effect relationships. They mean being one step ahead, which is crucial in the highly competitive world in which we live.

By Héctor Ureta – Collaborative R&D Aerospace Engineer at Innaxis

 

 
