Thursday, May 13, 2021

System design: CQRS pattern | MY 60 minutes study | CQRS documents by Greg Young

May 13, 2021

Introduction

It is the most challenge thing to learn by book reading. I like to use blogger to help me, track every minute and also highlight things new to me. 

CQRS document by Greg Young

Page 46/56

Event Storage as a Queue 

It has been previously discussed that the events coming out of a domain are also an [Integration Model]. Very often these events are not only saved but also published to queue where they are dispatched asynchronously to listeners either within the same system (the reporting model is a good example) or to other applications. An issue that exists with many systems publishing events is that they require a two phase commit between whatever storage they are using (Relational or otherwise) and the publishing of their events to the queue. 

The reason that the two-phase commit is needed is that a catastrophe could occur during the small period of time between when the write to the data storage commits and when the write to the queue commits. If a failure were to happen during this period the message would not be published on the queue (or if the other direction it may be published but the change may not be saved). If either case were to happen the listeners of the events would be out of sync with the producer.

The two-phase commit can be expensive but for low latency systems there is a larger problem when dealing with this situation. Generally the queue itself is persistent so the event becomes written on disk twice in the two-phase commit, once to the Event Storage and once to the persistent queue. Given for most systems having dual writes is not that important but if you have low latency requirements it can become quite an expensive operation as it will also force seeks on the disk. Figure 5 illustrates the two phase commit between data storage and a publishing queue. 

Some try to get around this problem by only writing to a queue then have something on the other side of the queue update the data storage with the changes represented by the events, this however has some issues. The largest issue is that not all of the events will be able to be written to the storage, eventual consistency has been introduced and it is possible that an optimistic concurrency problem will occur on the write of the events. Dealing with this problem in a production system is non-trivial. 

Many organizations do the opposite, use the event storage as a queue. Adding a sequence number to the Events table previously discussed allows the use the Event Storage as a queue. Figure 5 illustrates the change to the schema of the Events table. 

The database would insure that the values of sequence number would be unique and incrementing, this can be easily done using an auto-incrementing type. Because the values are unique and incrementing a secondary process can chase the Events table, publishing the events off to the queue. The chasing process would simply have to store the value of the sequence number of the last event it had processed, it could even update this value with a two-phase commit bringing the update and the publish to the queue into the same transaction. This process can be seen in Figure 7.

The work has been taken off of the initial processing in a known safe way. The publish can happen asynchronously to the actual write. This lowers the latency of completing the initial operation, it also will limit the number of disk writes in the processing of the initial request to one. This strategy can be extremely valuable when dealing with low latency requirements as it allows much of the work on the initial processing to be offloaded to another process asynchronously and in a safe way, there is little difference whether the publish happens as part of the initial processing or asynchronously as generally messages are published asynchronously anyways, using the Event Store as a queue just raises the time until the message is actually published slightly, this can be viewed as slightly raising the SLA.

CQRS and Event Sourcing

CQRS and Event Sourcing become most interesting when combined together. This chapter looks at the intersection of these two concepts within a system where Domain Driven Design has been applied.

CQRS and Event Sourcing have a symbiotic relationship. CQRS allows Event Sourcing to be used as the data storage mechanism for the domain. One of the largest issues when using Event Sourcing is that you cannot ask the system a query such as “Give me all users whose first names are ‘Greg’”. This is due to not having a representation of current state. With CQRS the only query that exists within the domain is GetById which is supported with Event Sourcing.

Event Sourcing is also very important when building out a non-trivial CQRS based system. The problem with integration between the two models is a large one. The maintaining of say relational models, one for read and the other for write, is quite costly. It becomes especially costly when you factor in that there is also an event model in order to synchronize the two. With Event Sourcing the event model is also the persistence model on the Write side. This drastically lowers costs of development as no conversion between the models is needed.

The original stereotypical architecture with using commands in Figure 1 can be compared to Figure 2 CQRS with Event Sourcing and found to be roughly equivalent amounts of work.

Cost Analysis

The client will be identical amounts of work between the two architectures. This is because the client operates in the exact same way. In both architectures the client receives DTOs and produces Commands that tell the Application Server to do something.

The queries between the two models will also be very similar in terms of cost. In the stereotypical architecture the queries are being built off of the domain model, in the CQRS based architecture they are being built by the Thin Read Layer projecting directly to DTOs. As was discussed in “Command and Query Responsibility Segregation” the Thin Read Layer should be equally or in some cases less expensive.

 The main differentiation between the two architectures when looking at cost is in the domain model and persistence. In the stereotypical architecture an ORM was doing most of the heavy lifting in order to persist the domain model within a Relational Database. This process introduces an Impedance Mismatch between the domain model and the storage mechanism, the Impedance Mismatch as discussed in “Events as a Storage Mechanism” can be highly costly both in productivity and the knowledge that developers need to have.

The CQRS and Event Sourcing based architecture does not have an Impedance Mismatch between the domain model and the storage mechanism on the Write side. The domain produces events, these same events are all that is stored. The usage of events is all that the domain model knows about. There is however an impedance mismatch in the read model. The Event Handlers must take events and update the read model to its concept of what the events mean. The Impedance Mismatch here is between the Events and the Relational Model.

The Impedance Mismatch between events and a Relational Model is much smaller than the mismatch between an Object Model and a Relational Model and is much easier to bridge in simple ways. The reason for this is that the Event Model does not have structure, it is representing actions that should be taken within the Relational Model.

Looked at from this perspective, the two architectures have roughly the same amount of work being done. Its not that its a lot more work or a lot less work; its just different work. The event based model may be slightly more costly due to the need of definition of events but this cost is relatively low and it also offers a smaller Impedance Mismatch to bridge which helps to make up for the cost. The event based model also offers all of the benefits discussed in “Events” that also help to reduce the overall initial cost of the creation of the events.

Integration

Everything up until this point has been comparing the systems in isolation. This rarely happens within an organization. More often than not organizations do only rely on systems but on systems of systems that are integrated in some way.

With the stereotypical architecture no integration has yet been supported, except of course perhaps integration through the database which is a well established anti-pattern for most systems. Integration is viewed as an afterthought.

The integration must be custom written. Many organizations choose to build services over the top of their model to allow for integration. Some of these services may be the same services that the clients use but more often than not there is additional work that must be done in order to support integration.

A larger problem exists when the product is being delivered to many customers. It is the teams responsibility to provide hooks for all of the customers and how they would like to integrate with thesystem. This often becomes a very large and unwieldy piece of code, especially on systems that are installed at hundreds or thousands of different clients all of which have different needs. The business model here tends to be to bill the client for each piece of custom integration, this can be quite profitable but it is a terrible model in terms of software. 

With the CQRS and Event Sourcing based model, integration has been thought of since the very first use case. The Read side needs to integrate and represent what is occurring on the Write Side, it is an integration point. The integration model is “production ready” all throughout the initial building of the system and it is being tested throughout by the integration with the Read Side.

The event based integration model is also known to be complete as all behaviors within the system have events. If the system is capable of doing something, it is by definition automatically integrated. In some circumstances it may be desirable to limit the publishing of events but it is a decision to limit what is published as opposed to needing to write code to publish something.

The event based model is also by nature a push model that contains many advantages over the pull model. If the stereotypical architecture desired a push based model then there would be large amounts of work added to track events and ensure that they were synchronized with what the system recorded in its own data model.

Differences in Work Habits

The two architectures also differ greatly in parallelization of work. In the stereotypical architecture work is generally done in vertical slices. There are four common methodologies used.

  • Data Centric: Start with database and working out. 
  • Client Centric: Start with client and work in. 
  • Façade/Contract First: Start with façade, then work back to data model then work finally implement client 
  • Domain Centric: Start with the domain model, work out to the client then implement data model 
These methodologies all have a commonality; they tend to work in vertical slices. The same developers will work on a feature through these steps. The same can be done with the CQRS and Event Sourcing based architecture but it does not need to be. Consider a very high level view of the systems as contained in Figure 3.

The architecture can be viewed as three distinct decoupled areas. The first is the client; it consumes DTOs and produces Commands. The second is the domain; it consumes commands and produces events. The third is the Read Model; it consumes events and produces DTOs. The decoupled nature of these three areas can be extremely valuable in terms of team characteristics.

Parallelization

It is relatively easy to have five to eight developers working on vertical slices at a given point without running into too many conflicts in what is being done. This is because for a small number of developers it is relatively easy to communicate what each developer is working on and to insure that there are few if any areas where developers overlap. This problem becomes much more difficult as you scale up the number of developers.

Instead of working in vertical slices the team can be working on three concurrent vertical slices, the client, the domain, and the read model. This allows for a much better scaling of the number of developers working on a project as since they are isolated from each other they cause less conflict when making changes. It would be reasonable to nearly triple a team size without introducing a larger amount of conflict due to not needing to introduce more communication for the developers. They still communicate in the same way but they communicate about smaller decoupled pieces. This can be
extremely beneficial when time to market is important as it can drastically lower the amount of calendar time to get a project done. 

All Developers are not Created Equally

There, it has been said. On a team there are many different types of developers, some attributes to consider in differences amongst developers include

  • Technical Proficiency 
  • Knowledge of the Business Domain 
  • Cost 
  • Soft Skills  
The points of decoupling are natural and support the specialization of teams in given areas. As an example in the domain, the best candidate is a person who is high in cost but also has a large amount of business knowledge, technical proficiency, and soft skills to talk with domain experts. When dealing with the read model and the generation of DTOs this is simply not the case, it is a relatively straight forward thing to do. The requirements are different which often leads to the next item.

Outsourcing

It is often not cost effective to keep low cost, medium skilled developers on a team. The overhead of keeping employees in terms of salary costs as well as compliance with various governmental regulations is often not worth the benefits of having the developers as employees. If a company is in a high cost locale, the company can certainly get cheaper developers offshore. Whether offshore or onshore the separation helps with successfully outsourcing part of a project.
 
Outsourced projects often fail because large amounts of communication are required between the outsourcers and the local team or domain experts. With these communications many problems can come up including time differences, cultural, and linguistic.

The Read Model as an example is an ideal area of the system to outsource. The contracts for the Read Model as well of specifications for how it work are quite concrete and easily described. Little business knowledge is needed and the technical proficiency requirements on most systems will be in the midrange.

The Domain Model on the other hand is something that will not work at all if outsourced. The developers of the Domain Model need to have large amounts of communications with the domain experts. The developers will also benefit greatly by having initial domain knowledge. These developers are best kept locally within the team and should be highly valued.

A company can save large amounts of capital by outsourcing this area of the system at a low risk, this capital can then be reinvested in other, more important areas of the system. The directed use of capital is very important in reaching a higher quality, lower cost system.

Specialization

A problem exists when working with vertical slices. The “best” developers, with best being defined as most valuable, work with the domain. When working with a vertical slice though anecdotal evidence suggests that they spend roughly 20-30% of their time in this endeavor. 

With the secondary architecture, the team of developers working with the domain spend 80+% of their time working with the domain and interacting with Domain Experts. The developers have no concern for how the data model is persisted, or what data needs to be displayed to users. The developers focus on the use cases of the system. They need only know Commands and Events. 

This specialization frees them to engage in the far more important activities of reaching a good model and a highly descriptive Ubiquitous Language with the Domain Experts. It also helps to optimize the time of the Domain Experts as opposed to having them sit idly while the “technical” aspects of vertical slices are being worked on. 

Only Sometimes 

There are many benefits offered through the separation but they do not need to be used. It is also quite common to have a normal sized team still work in vertical slices. There are benefits in terms of risk management amongst other things to having a small to medium sized team work in vertical slices of the whole system. 

The real benefit with the CQRS and Event Sourcing based architecture is that the option exists to bring it into three distinct vertical slices with each having its own attributes optimized as opposed to using a one size fits all mechanism.

 

 

 


 

No comments:

Post a Comment