Algorithmic trading, also called automated trading, has been at the centre stage of the trading world for a few years now, and the share of volumes attributed to it keeps rising. As a result, it has become a highly competitive market that depends heavily on technology. The basic architecture has consequently undergone major changes over the past decade and continues to do so. Innovating on technology is now a necessity for anyone competing in algorithmic trading, which has made the field a hotbed for advances in computer and network technologies.
Any trading system, conceptually, is nothing more than a computational block that interacts with the exchange on two different streams.
- Receives market data
- Sends order requests and receives replies from the exchange.
The market data received typically informs the system of the latest order book. It might also contain additional information such as the volume traded so far and the last traded price and quantity for a scrip. However, to make a decision on the data, the trader might need to look at old values or derive certain parameters from history. To cater to that, a conventional system would have a historical database to store the market data, along with tools to use that database. Analysis would also involve a study of the trader's past trades, hence another database for storing the trading decisions. Last but not least, there is a GUI for the trader to view all this information on screen.
The entire system can now be broken down into:
- The exchange(s) – the external world
- The server
  - Receive market data
  - Store market data
  - Store orders generated by the user
- The application
  - Take inputs from the user including the trading decisions
  - Interface for viewing the information including the data and orders
  - An order manager sending orders to the exchange
The traditional architecture could not scale up to the needs and demands of automated trading with DMA (direct market access). The latency between the origin of an event and the generation of an order went beyond the dimension of human control and entered the realm of milliseconds and microseconds. The tools that handle and analyse market data therefore needed to adapt accordingly. Order management also needed to become more robust and capable of handling many more orders per second. Since the time frame is so small compared to human reaction time, risk management also needed to handle orders in real time and in a completely automated way.
For example, even if the reaction time for an order is 1 millisecond (which is a lot compared to the latencies we see today), the system is still capable of making 1000 trading decisions in a single second. Each of these 1000 trading decisions needs to pass through risk management within the same second to reach the exchange. And this is only the problem of complexity. Since the architecture now involves automated logic, 100 traders can be replaced by a single system, which adds scale to the problem: if each logical unit generates 1000 orders a second, 100 such units generate 100,000 orders every second. The decision-making and order-sending parts therefore need to be much faster than the market data receiver in order to keep up with the rate of data.
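The arithmetic in this example can be written out as a quick back-of-the-envelope calculation. The per-check budget at the end rests on an assumption that is not in the text, namely that risk checks run one after another on a single thread.

```python
# Back-of-the-envelope throughput, using the figures from the text.
REACTION_TIME_S = 0.001      # 1 millisecond per trading decision
LOGICAL_UNITS = 100          # strategies replacing 100 human traders

decisions_per_unit = round(1 / REACTION_TIME_S)            # decisions per second per unit
orders_per_second = decisions_per_unit * LOGICAL_UNITS     # orders per second in total

# Assumption: risk checks run serially on one thread, so each check
# gets an equal slice of one second.
risk_budget_us = 1_000_000 / orders_per_second

print(f"{decisions_per_unit} decisions/s per unit")
print(f"{orders_per_second} orders/s in total")
print(f"{risk_budget_us:.0f} microseconds per risk check")
```

Under that assumption, each risk check has roughly 10 microseconds, which is why risk management has to be fully automated.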
The infrastructure this module demands is therefore far superior to that of a traditional system (discussed in the previous section). As a result, the engine that runs the decision-making logic, also known as the ‘Complex Event Processing’ engine or CEP, moved from within the application to the server. The application layer is now little more than a user interface for viewing the CEP and providing parameters to it.
The problem of scaling also leads to an interesting situation. Say 100 different logics are run over a single market data event (as in the earlier example). There might be common pieces of complex calculation that most of the 100 logic units need, for example the calculation of Greeks for options. If each logic unit were to function independently, each would perform the same Greek calculation, needlessly using up processor resources. To eliminate this redundancy, complex repeated calculations are typically hived off into a separate calculation engine, which provides the Greeks as an input to the CEP.
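A minimal sketch of such a shared calculation engine is shown below. The class name `GreeksEngine`, the cache key and the sample inputs are illustrative assumptions. The delta formula is the standard Black-Scholes call delta for a non-dividend-paying underlying, used here only as a stand-in for "an expensive calculation".

```python
import math

def bs_call_delta(spot, strike, rate, vol, t_years):
    """Black-Scholes delta of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * t_years) / (
        vol * math.sqrt(t_years))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))   # standard normal CDF of d1

class GreeksEngine:
    """Hypothetical calculation engine: computes once per market data event."""
    def __init__(self):
        self.computations = 0
        self._cache = {}

    def delta(self, event_id, spot, strike, rate, vol, t_years):
        # Assumption: event_id uniquely identifies the market data snapshot,
        # so (event, strike, expiry) is enough to identify a result.
        key = (event_id, strike, t_years)
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = bs_call_delta(spot, strike, rate, vol, t_years)
        return self._cache[key]

engine = GreeksEngine()
# 100 logic units ask for the same delta on the same market data event.
deltas = [engine.delta(event_id=1, spot=100.0, strike=100.0, rate=0.05,
                       vol=0.2, t_years=0.5) for _ in range(100)]
print(engine.computations)   # number of times the expensive calculation ran
```

A production engine would also evict results for old events. The sketch leaves that out to keep the sharing idea visible.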
Although the application layer is primarily a view, some of the risk checks (which are now resource-hungry operations owing to the problem of scale) can be offloaded to it, especially those that concern the sanity of user inputs, such as fat-finger errors. The rest of the risk checks are now performed by a separate Risk Management System (RMS) within the Order Manager (OM), just before an order is released. The problem of scale also means that where earlier 100 different traders managed their own risk, there is now only one RMS to manage risk across all logical units or strategies. However, some risk checks may be particular to certain strategies, while others need to be applied across all strategies. Hence the RMS itself comprises a strategy-level RMS (SLRMS) and a global RMS (GRMS). It might also include a UI to view the SLRMS and GRMS.
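The two-level check can be sketched as follows. The specific limits (a per-strategy maximum order size and a global notional cap), the class names and the strategy names are assumptions chosen for illustration. A real RMS would apply many more checks.

```python
from dataclasses import dataclass

@dataclass
class Order:
    strategy: str
    symbol: str
    quantity: int
    price: float

class RiskManager:
    """Hypothetical RMS: strategy-level checks (SLRMS), then a global check (GRMS)."""
    def __init__(self, strategy_max_qty, global_max_notional):
        self.strategy_max_qty = strategy_max_qty         # per-strategy order size limits
        self.global_max_notional = global_max_notional   # limit shared by all strategies
        self.global_notional = 0.0

    def check(self, order):
        # SLRMS: limit specific to the strategy that produced the order.
        if order.quantity > self.strategy_max_qty.get(order.strategy, 0):
            return False, "SLRMS: order size above strategy limit"
        # GRMS: limit applied across every strategy on the server.
        notional = order.quantity * order.price
        if self.global_notional + notional > self.global_max_notional:
            return False, "GRMS: global notional limit breached"
        self.global_notional += notional
        return True, "accepted"

rms = RiskManager({"pairs": 500, "market_making": 100}, global_max_notional=100_000.0)
print(rms.check(Order("pairs", "ABC", 400, 100.0)))          # within both limits
print(rms.check(Order("market_making", "ABC", 200, 100.0)))  # too large for its strategy
print(rms.check(Order("pairs", "XYZ", 500, 150.0)))          # would exceed the global cap
```

Running the strategy-level check first means a rejected order never consumes any of the shared global limit.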
Emergence of protocols
With innovations come necessities. Since the new architecture was capable of scaling to many strategies per server, the need to connect to multiple destinations from a single server emerged. So the order manager hosted several adaptors to send orders to multiple destinations and receive data from multiple exchanges. Each adaptor acts as an interpreter between the protocol that is understood by the exchange and the protocol of communication within the system. Multiple exchanges mean multiple adaptors.
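The adaptor idea can be sketched as a small class hierarchy. Both wire formats below are invented for illustration and do not correspond to any real exchange. The point is only that the order manager routes one internal order format through whichever adaptor is registered for a destination.

```python
from abc import ABC, abstractmethod

class ExchangeAdapter(ABC):
    """Translates the internal order format into one exchange's protocol."""
    @abstractmethod
    def encode_order(self, order: dict) -> bytes: ...

class PipeDelimitedAdapter(ExchangeAdapter):
    # Hypothetical exchange that expects "SYMBOL|SIDE|QTY|PRICE" text lines.
    def encode_order(self, order):
        line = f"{order['symbol']}|{order['side']}|{order['qty']}|{order['price']}\n"
        return line.encode("ascii")

class KeyValueAdapter(ExchangeAdapter):
    # Hypothetical exchange that expects "key=value" pairs joined by ";".
    def encode_order(self, order):
        return ";".join(f"{k}={v}" for k, v in order.items()).encode("ascii")

class OrderManager:
    def __init__(self):
        self.adapters = {}

    def register(self, destination, adapter):
        self.adapters[destination] = adapter

    def route(self, destination, order):
        # Strategies only ever produce the internal format.
        return self.adapters[destination].encode_order(order)

om = OrderManager()
om.register("EXCH_A", PipeDelimitedAdapter())
om.register("EXCH_B", KeyValueAdapter())
order = {"symbol": "ABC", "side": "BUY", "qty": 100, "price": 10.5}
print(om.route("EXCH_A", order))
print(om.route("EXCH_B", order))
```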
However, to add a new exchange to the system, a new adaptor has to be designed and plugged into the architecture, since each exchange follows its own protocol, optimized for the features that exchange provides. To avoid the hassle of adding adaptors, standard protocols have been designed. The most prominent among them is the FIX (Financial Information eXchange) protocol. This not only makes it manageable to connect to different destinations on the fly, but also drastically reduces the time to market when connecting to a new destination.
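To give a feel for what a FIX message looks like on the wire, here is a rough sketch that assembles a NewOrderSingle in the classic tag=value encoding. The tags are standard FIX 4.2 fields. The CompIDs, order ID, symbol and timestamps are made-up values, and the helper `fix_message` is a hypothetical name. A real session also needs logon, sequence number management and heartbeats, which are left out.

```python
SOH = "\x01"  # FIX field delimiter

def fix_message(msg_type, fields, begin_string="FIX.4.2"):
    """Assemble a tag=value FIX message with BodyLength (9) and CheckSum (10)."""
    body = f"35={msg_type}{SOH}" + "".join(f"{tag}={value}{SOH}" for tag, value in fields)
    head = f"8={begin_string}{SOH}9={len(body.encode('ascii'))}{SOH}"
    # CheckSum: sum of every byte before the checksum field, modulo 256.
    checksum = sum((head + body).encode("ascii")) % 256
    return f"{head}{body}10={checksum:03d}{SOH}"

# NewOrderSingle (35=D): buy 100 ABC at a limit price of 10.50.
# All identifiers and timestamps below are illustrative values.
msg = fix_message("D", [
    (49, "MYDESK"),              # SenderCompID
    (56, "EXCHANGE"),            # TargetCompID
    (34, 1),                     # MsgSeqNum
    (52, "20240102-09:15:00"),   # SendingTime
    (11, "ORD0001"),             # ClOrdID
    (21, 1),                     # HandlInst: automated execution
    (55, "ABC"),                 # Symbol
    (54, 1),                     # Side: buy
    (38, 100),                   # OrderQty
    (40, 2),                     # OrdType: limit
    (44, "10.50"),               # Price
    (60, "20240102-09:15:00"),   # TransactTime
])
print(msg.replace(SOH, "|"))     # print with a visible delimiter
```

Because every destination that speaks FIX accepts this same structure, connecting to a new one is mostly a matter of configuration, not of writing a new encoder.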
The presence of standard protocols also makes it easy to integrate with third-party vendors for analytics or market data feeds. As a result, the market becomes very efficient, since integrating with a new destination or vendor is no longer a constraint.
In addition, simulation becomes very easy, as receiving data from the real market and sending orders to a simulator is just a matter of using the FIX protocol to connect to the simulator. The simulator itself can be built in-house or procured from a third-party vendor. Similarly, recorded data can simply be replayed, with the adaptors agnostic to whether the data comes from the live market or from a recorded data set.
Emergence of low latency architectures
With the building blocks of an algorithmic trading system in place, strategies were optimized for the ability to process huge amounts of data in real time and make quick trading decisions. But with the advent of standard communication protocols like FIX, the technological entry barrier to setting up an algorithmic trading desk became lower, and the field hence more competitive. As servers got more memory and higher clock frequencies, the focus shifted towards reducing the latency of decision making. Over time, reducing latency became a necessity for reasons such as:
- Strategy makes sense only in a low latency environment
- Survival of the fittest – competitors pick you off if you are not fast enough
The problem, however, is that latency is an overarching term that encompasses several different delays. Although latency is easily understood, it is difficult to quantify, and lumping all of these delays into one generic figure rarely makes much sense. How the problem of reducing latency is approached therefore becomes increasingly important.
If we look at the basic life cycle,
- A market data packet is published by the exchange
- The packet travels over the wire
- The packet arrives at a router on the server side.
- The router forwards the packet over the network on the server side.
- The packet arrives on the Ethernet port of the server.
- Depending on whether the transport is UDP or TCP, the appropriate processing takes place, and the packet, stripped of its headers and trailers, makes its way into the memory of the adaptor.
- The adaptor then parses the packet and converts it into a format internal to the algorithmic trading platform
- This packet now travels through the several modules of the system – CEP, tick store, etc.
- The CEP analyses and sends an order request
- The order request then goes through the same cycle as the market data packet, in reverse.
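One way to reason about this life cycle is as a latency budget: a delay per stage, summed into a tick-to-order total. The stage names below follow the list above, but every number is a made-up placeholder. Real values have to be measured on the actual network and hardware.

```python
# Hypothetical per-stage latencies in microseconds; real values must be measured.
stages_us = [
    ("wire: exchange to site", 400.0),
    ("router and switch forwarding", 20.0),
    ("NIC and UDP/TCP stack", 15.0),
    ("adaptor parse and normalise", 5.0),
    ("CEP and other modules", 50.0),
    ("order encode and network stack", 20.0),
    ("wire: site to exchange", 400.0),
]

total_us = sum(latency for _, latency in stages_us)
for name, latency in stages_us:
    # Show each stage's absolute latency and its share of the total.
    print(f"{name:<32}{latency:>8.1f} us {latency / total_us:>6.1%}")
print(f"{'tick-to-order total':<32}{total_us:>8.1f} us")

# The stage with the largest share is where optimization pays off first.
slowest = max(stages_us, key=lambda stage: stage[1])
print("largest contributor:", slowest[0])
```

With placeholder numbers like these, the wire dominates the budget, which is consistent with distance being the first thing to optimize.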
High latency at any of these steps means high latency for the entire cycle. Hence latency optimization usually starts with the first step in this cycle that is in our control, i.e. “the packet travels over the wire”. The easiest thing to do here is to shorten the distance to the destination as much as possible. Colocations are facilities provided by exchanges to host the trading server in close proximity to the exchange. The following diagram illustrates the gains that can be made by cutting the distance.
For any kind of high-frequency strategy involving a single destination, colocation has become a de facto requirement. However, strategies that involve multiple destinations need careful planning. Several factors, such as the time taken by each destination to reply to order requests and how it compares with the ping time between the two destinations, must be considered before making such a decision. The decision may also depend on the nature of the strategy.
Network latency is usually the first step in reducing overall latency of an algorithmic trading system. However there are plenty of other places where the architecture can be optimized.
Propagation latency
time taken to send the bits along the wire.
This is constrained by the speed of light, of course. Apart from reducing the physical distance, several optimizations have been introduced to reduce propagation latency. For example, the estimated round-trip time over an ordinary cable between Chicago and New York is 13.1 milliseconds. In October 2012, Spread Networks announced latency improvements that brought the estimated round-trip time down to 12.98 milliseconds. Microwave communication, adopted by firms such as Tradeworx, brought the estimated round-trip time down further to 8.5 milliseconds. Note that the theoretical minimum is about 7.5 milliseconds. Continuing innovations are pushing the boundaries of science and fast approaching the theoretical limit set by the speed of light. The latest developments in laser communication, a technology adopted earlier in defence, have shaved further nanoseconds off an already thin latency over short distances.
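The theoretical floor is simply distance divided by the speed of light. The sketch below assumes a straight-line Chicago to New York distance of roughly 1,145 km and a fibre refractive index of about 1.47. Both are approximations that are not taken from the text, so the results land near the quoted figures without matching them exactly.

```python
C_VACUUM_KM_S = 299_792.458   # speed of light in vacuum
DISTANCE_KM = 1_145.0         # assumed Chicago-New York straight-line distance
FIBRE_INDEX = 1.47            # assumed refractive index of optical fibre

def round_trip_ms(distance_km, speed_km_s):
    """Round-trip propagation delay in milliseconds."""
    return 2 * distance_km / speed_km_s * 1_000

vacuum_ms = round_trip_ms(DISTANCE_KM, C_VACUUM_KM_S)
fibre_ms = round_trip_ms(DISTANCE_KM, C_VACUUM_KM_S / FIBRE_INDEX)

print(f"vacuum, straight line: {vacuum_ms:.2f} ms")
print(f"fibre, straight line:  {fibre_ms:.2f} ms")
```

Light travels more slowly in glass than in air, and real cable routes are longer than the straight line. Both effects help explain why microwave links get closer to the floor than fibre does.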
Network processing latency
introduced by routers, switches, etc.
The next level of optimization in the architecture of an algorithmic trading system is the number of hops that a packet takes to travel from point A to point B. A hop is defined as one portion of the path between source and destination during which a packet doesn’t pass through a physical device like a router or a switch. For example, a packet could travel the same distance via two different paths, with two hops on the first path and three on the second. Assuming the propagation delay is the same, the routers and switches on each path still add their own latency, and as a rule of thumb, the more hops there are, the more latency is added.
Network processing latency may also be affected by what we refer to as microbursts. A microburst is a sudden increase in the rate of data transfer that does not necessarily affect the average rate of data transfer. Since algorithmic trading systems are rule-based, such systems tend to react to the same event in the same way. As a result, many participating systems may send orders at once, leading to a sudden flurry of data transfer between the participants and the destination, that is, a microburst. The following diagram represents what a microburst is.
The first figure shows a 1 second view of the data transfer rate. We can see that the average rate is well below the available bandwidth of 1 Gbps. However, if we dive deeper and look at the second image (the 5 millisecond view), we see that the transfer rate has spiked above the available bandwidth several times each second. As a result, the packet buffers on the network stack, both in the network endpoints and in the routers and switches, may overflow. To avoid this, an algorithmic trading system is typically allocated a bandwidth much higher than the observed average rate.
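The difference between the two views can be reproduced with a small calculation over packet timestamps. The trace below is synthetic (invented packet sizes and timings), and the rates are offered load, not what a saturated link would actually carry. The sketch only shows how a 5 millisecond window exposes a burst that a 1 second average hides.

```python
LINK_BPS = 1_000_000_000      # 1 Gbps of available bandwidth, as in the text
WINDOW_US = 5_000             # 5 millisecond view
DURATION_US = 1_000_000       # 1 second view

def window_rates_bps(packets, window_us, duration_us):
    """packets: (timestamp_us, size_bytes) pairs. Returns bits/s for each window."""
    bits = [0] * (duration_us // window_us)
    for ts_us, size_bytes in packets:
        bits[ts_us // window_us] += size_bytes * 8
    return [b * 1_000_000 / window_us for b in bits]

# Synthetic trace (made-up numbers): steady background traffic plus one burst.
packets = [(i * 1_000, 1_500) for i in range(1_000)]        # one 1500-byte packet per ms
packets += [(500_000 + i * 5, 1_500) for i in range(800)]   # 800 packets within 4 ms

rates = window_rates_bps(packets, WINDOW_US, DURATION_US)
average_bps = sum(size * 8 for _, size in packets) * 1_000_000 / DURATION_US
burst_windows = [i for i, rate in enumerate(rates) if rate > LINK_BPS]

print(f"1 s average:      {average_bps / 1e6:.1f} Mbps")
print(f"peak 5 ms window: {max(rates) / 1e6:.1f} Mbps")
print(f"windows above link rate: {burst_windows}")
```

The one-second average stays far below the link rate while a single 5 millisecond window exceeds it, which is the situation in which buffers overflow.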
Serialization latency
time taken to pull the bits on and off the wire.
A 1500-byte packet transmitted on a T1 line (1,544,000 bps) would incur a serialization delay of about 8 milliseconds. The same 1500-byte packet sent over a 56K modem (57,344 bps) would take about 209 milliseconds, while a 1G Ethernet line would reduce this latency to about 12 microseconds.
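These figures come from dividing the packet size in bits by the line rate. A short sketch of the calculation, using the link speeds quoted above (the helper name `serialization_delay_s` is just for illustration):

```python
def serialization_delay_s(packet_bytes, link_bps):
    """Time to clock one packet onto the wire: bits divided by line rate."""
    return packet_bytes * 8 / link_bps

PACKET_BYTES = 1_500
links = {
    "56K modem (57,344 bps)": 57_344,
    "T1 (1,544,000 bps)": 1_544_000,
    "1G Ethernet": 1_000_000_000,
}

delays_us = {name: serialization_delay_s(PACKET_BYTES, bps) * 1e6
             for name, bps in links.items()}
for name, delay in delays_us.items():
    print(f"{name:<24}{delay:>12.1f} us")
```

The results are approximate in the same way the quoted values are, since framing overhead such as preambles and inter-frame gaps is ignored.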
Interrupt latency
introduced by interrupts while receiving the packets on a server.
Interrupt latency is defined as the time elapsed between when an interrupt is generated to when the source of the interrupt is serviced. When is an interrupt generated? Interrupts are signals to the processor emitted by hardware or software indicating that an event needs immediate attention. The processor in turn responds by suspending its current activity, saving its state and handling the interrupt. Whenever a packet is received on the NIC, an interrupt is sent to handle the bits that have been loaded into the receive buffer of the NIC. The time taken to respond to this interrupt not only affects the processing of the newly arriving payload, but also the latency of the existing processes on the processor.
Solarflare introduced OpenOnload in 2011, which implements a technique known as kernel bypass: packet processing is not left to the operating system kernel but is done in user space. The NIC maps the entire packet directly into user space, where it is processed, so interrupts are completely avoided and each packet is processed faster. The following diagram clearly demonstrates the advantages of kernel bypass.
Application latency
time taken by the application to process.
This depends on the number of packets, the processing power allocated to the application logic, the complexity of the calculations involved, programming efficiency and so on. Increasing the number of processors on the system would, in general, reduce application latency, as would a higher clock frequency. Many algorithmic trading systems also dedicate processor cores to essential elements of the application, such as the strategy logic. This avoids the latency introduced when a process switches between cores.
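On Linux, dedicating a core to a process can be sketched with the standard library's CPU affinity calls. This is only an illustration of the idea: the helper name `pin_to_core` is hypothetical, and `os.sched_setaffinity` is not available on every platform. Production systems usually combine pinning with kernel-level core isolation, which is not shown here.

```python
import os

def pin_to_core(core_id):
    """Pin the calling process to one CPU core (Linux only). Returns True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False                          # not available on this platform
    if core_id not in os.sched_getaffinity(0):
        return False                          # core is not usable by this process
    try:
        os.sched_setaffinity(0, {core_id})    # pid 0 means "the calling process"
    except OSError:
        return False
    return True

if hasattr(os, "sched_getaffinity"):
    core = min(os.sched_getaffinity(0))       # any allowed core, for demonstration
    print("pinned:", pin_to_core(core))
    print("affinity now:", os.sched_getaffinity(0))
```

Once pinned, the scheduler no longer migrates the process between cores, so its working set stays in that core's caches.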
Similarly, if the strategy has been programmed with cache sizes and locality of memory access in mind, there will be many more cache hits, reducing latency further. To facilitate this, many systems use very low-level programming languages to optimize the code for the specific architecture of the processors. Some firms have even gone to the extent of burning complex calculations onto hardware using Field-Programmable Gate Arrays (FPGAs). With increasing complexity comes increasing cost, and the following diagram aptly illustrates this.
Levels of sophistication
The world of high frequency algorithmic trading has entered an era of intense competition. With each participant adopting new methods of ousting the competition, technology has progressed by leaps and bounds. Modern day algorithmic trading architectures are quite complex compared to their early stage counterparts. Accordingly, advanced systems are more expensive to build both in terms of time and money.
| |Standard 10GE network card|Low latency 10GE network card|FPGA|ASIC|
|---|---|---|---|---|
|Latency|20 microseconds + application time|5 microseconds + application time|3-5 microseconds|Sub-microsecond latency|
|Ease of deployment|Trivial|Kernel driver installation|Retraining of programmers|Specialists|
|Effort to develop|Weeks|Months|2-3 man-years|2-3 man-years|