Miracle #1


2 Jul 2009
As I said in my last posting, our current path to full multi-node performance for Darkstar requires that we solve four problems. Each of these is a full research problem in its own right, so in some sense we need to simultaneously perform four miracles. I’ve said in the past that a good project requires the solution of exactly one miracle, so we are currently violating that rule. But it is still worth the effort to give it a try.
The first of these miracles was stated earlier as
  • How do we co-locate data with the tasks or players that are using that data while still keeping the data at least conceptually as a shared resource available from any node on the network?
Put another way, how do we set the system up so that most data accesses are local to a particular server without making it impossible for the data to be accessed by the other servers if the need arises?
This sounds like a pretty classic caching problem, which ought to be open to a pretty classic caching solution. In such a solution, you start with all of the data being held in a single database (or at least a conceptually single database; it may itself be replicated for higher reliability). If some server needs to access some piece of data, a copy of that data is sent to the server needing the access, and the central database keeps track of the fact that the copy has been made. As long as no changes are made to the data item in the cache, other copies can be handed out to other servers. If a change is made, the central server first has to be told that a change is about to happen. If no one else has asked for the right to change the data, the server that asked is told to go ahead. Once the change is made, it has to be sent back to the centralized server, and every other server holding a cached copy needs to be told that the data has changed so that its copy can be updated.
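To make the round trips in this classic scheme concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names `CentralStore`, `ServerCache`, `request_write`, and `write_back` are hypothetical and are not part of Darkstar, and method calls stand in for network messages.

```python
class CentralStore:
    """Hypothetical central database that tracks which servers hold copies."""

    def __init__(self, data):
        self.data = dict(data)
        self.copies = {}   # key -> set of caches holding a copy
        self.writer = {}   # key -> cache currently allowed to write

    def read(self, key, cache):
        # Hand out a copy and remember who has it.
        self.copies.setdefault(key, set()).add(cache)
        return self.data[key]

    def request_write(self, key, cache):
        # Grant the write only if no other server has already asked for it.
        if self.writer.get(key) not in (None, cache):
            return False
        self.writer[key] = cache
        return True

    def write_back(self, key, value, cache):
        # Record the change, then update every other cached copy.
        self.data[key] = value
        for other in self.copies.get(key, set()):
            if other is not cache:
                other.local[key] = value
        self.writer.pop(key, None)


class ServerCache:
    """Hypothetical per-server cache in front of the central store."""

    def __init__(self, store):
        self.store = store
        self.local = {}

    def get(self, key):
        if key not in self.local:
            self.local[key] = self.store.read(key, self)
        return self.local[key]

    def put(self, key, value):
        self.get(key)  # make sure this server is registered as a copy holder
        if not self.store.request_write(key, self):
            raise RuntimeError("another server holds the write permission")
        self.local[key] = value
        self.store.write_back(key, value, self)


store = CentralStore({"door": "closed"})
a = ServerCache(store)
b = ServerCache(store)
a.get("door")
b.get("door")
a.put("door", "open")  # permission request plus write-back: two round trips
```

In this sketch every `put` talks to the central store twice before the change is visible everywhere, which is the latency cost that matters once writes are common.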
This would work fine if the games we are trying to support had the same sort of data access patterns that are seen in standard (enterprise) database applications. In those applications, most data is only read, not altered. Throughput is vital, but some variation in latency is fine. And it is absolutely necessary that the standard ACID properties (atomicity, consistency, isolation, and durability) are preserved.
But as we know, games aren’t like that. This whole exercise is meant to cut down on latency, which is the enemy of fun. Further, from what we can tell the ratio of writes to reads in games is much higher than it is in enterprise applications. From our (unscientific) observations, it looks like about half of the objects that are obtained in any one task will be altered and then need to be written. So while it may look like we have a pretty standard database caching problem, in fact the environment and requirements are very different.
The problem domain is also different, and it is this difference that allows us some room for a solution. Unlike enterprise systems, games don’t need to be completely reliable. As long as the game state is consistent, it seems ok to allow a bit of game play to be lost (where a bit is some time period measured in small numbers of seconds). We don’t want to give up consistency, but we are willing to give up a small measure of durability to get low latency.
Here is the idea, leaving out lots of the complicated details. I should point out that the work on this, along with most of the thinking, is being done by Tim Blackman, so I’m just reporting on his work as I understand it. We start with a network-accessible data store, which is a conceptually centralized repository of all of the information that we need to store for the server side of the game. This may in the future be replicated for high reliability, but from the point of view of all of the machines that are running a Darkstar server, there is one of these per game.
When an item is needed by one of the servers running the game logic, that item is copied over to the server and manipulated locally. The central server knows where the copies are, so that if some other server asks for a copy of the data, the central server knows where the cached copy is and can ask for an update. On the game server, the local copy can be used in various (transactional) tasks. When the data is changed by some task, that change is not written back to the central data store on transaction commit. Instead, the network update is delayed until either the central store asks for it (because some other game server needs to access the data) or the local server has time to send the updates back.
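A rough Python sketch of this delayed write-back may help. It is not the Darkstar implementation: `CentralStore`, `GameServer`, `recall`, and `flush_when_idle` are hypothetical names, method calls stand in for network messages, and the sketch deliberately ignores the ordering of updates across transactions.

```python
class CentralStore:
    """Hypothetical central data store that remembers who holds each item."""

    def __init__(self, data):
        self.data = dict(data)
        self.holder = {}  # key -> game server holding the cached copy

    def fetch(self, key, server):
        holder = self.holder.get(key)
        if holder is not None and holder is not server:
            holder.recall(key)  # ask the current holder for its update
        self.holder[key] = server
        return self.data[key]


class GameServer:
    """Hypothetical game server with a local write-back cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()  # keys changed locally but not yet sent back

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store.fetch(key, self)
        return self.cache[key]

    def commit(self, key, value):
        self.read(key)
        self.cache[key] = value  # the commit is local: no network write here
        self.dirty.add(key)

    def recall(self, key):
        # The central store wants the item: send any pending change, drop the copy.
        if key in self.dirty:
            self.store.data[key] = self.cache[key]
            self.dirty.discard(key)
        self.cache.pop(key, None)

    def flush_when_idle(self):
        # Called when the server has spare time to send its updates back.
        for key in self.dirty:
            self.store.data[key] = self.cache[key]
        self.dirty.clear()


store = CentralStore({"hp": 100})
s1 = GameServer(store)
s2 = GameServer(store)
s1.commit("hp", 90)
stale = store.data["hp"]  # the central store has not seen the change yet
seen = s2.read("hp")      # triggers a recall from s1, so s2 sees the new value
```

The commit on `s1` costs no network traffic; the only network cost is paid when `s2` actually asks for the item or when `s1` is idle.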
One of the tricky bits in all of this is that when an update is sent back to the central store, all of the updates that occurred in the same transaction are also sent back, and all of those are sent back only after all of the updates from previous transactions have been returned to the central store. This means that the order of data changes is the same at the central server as on the individual game servers, and that the updates happen a transaction at a time. This will ensure that the state of the central server is always consistent, even if it is not up to date.
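The ordering rule can be sketched as a queue of committed transactions that is only ever drained from the front. This is an illustrative model, not the real code: the `GameServer` class, its `flush_until` method, and the use of a plain dict as the central store are all assumptions made for the example.

```python
from collections import deque


class GameServer:
    """Hypothetical game server that sends updates back in commit order."""

    def __init__(self, central):
        self.central = central  # a dict standing in for the central store
        self.cache = {}
        self.pending = deque()  # committed transactions, oldest first

    def commit(self, changes):
        # A transaction is modelled as a dict of key -> new value.
        self.cache.update(changes)
        self.pending.append(dict(changes))

    def flush_until(self, key):
        """Send back whole transactions, oldest first, up to and including
        the most recent pending one that changed `key`."""
        last = max((i for i, tx in enumerate(self.pending) if key in tx),
                   default=-1)
        for _ in range(last + 1):
            self.central.update(self.pending.popleft())  # one transaction at a time


central = {"alice_gold": 10, "bob_gold": 0, "sword_owner": "bob"}
server = GameServer(central)
server.commit({"alice_gold": 5, "bob_gold": 5, "sword_owner": "alice"})  # a trade
server.commit({"alice_gold": 6})                                          # some loot
server.commit({"bob_gold": 4})                                            # a purchase
server.flush_until("alice_gold")
```

Asking only for `alice_gold` also carries the sword and Bob's side of the trade back to the central store, because they were changed in the same earlier transaction. The later, unrelated purchase stays queued on the game server.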
The problem with this scheme arises if a game server crashes (or becomes disconnected). There may be transactions that have committed locally on that server but whose changes have not yet been sent back to the central store. In such a case, those changes will be lost. We don’t believe that more than a couple of seconds of play will be lost in such a case, but there will still be some loss. However, if the players that were connected to the crashed server reconnect to the game on some other server, they will at least see a consistent view of the world, even if it does not reflect the changes of the last little bit of game play. It’s like the deja vu aspect of the Matrix: you will be returned to a slightly earlier version of the world, and resume play from there.
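One way to see why a crash loses play but not consistency: if whole transactions reach the central store in commit order, then whatever survives is a prefix of the committed history. The toy model below assumes exactly that; the function name, the gold-trading data, and the "total gold" invariant are made up for illustration.

```python
def central_state_after_crash(initial, committed, flushed_count):
    """Replay only the transactions that reached the central store.

    committed: list of dicts (key -> new value) in local commit order.
    flushed_count: how many of them were sent back before the crash.
    """
    state = dict(initial)
    for tx in committed[:flushed_count]:
        state.update(tx)
    return state


initial = {"alice_gold": 10, "bob_gold": 0}
# Each transaction moves gold between the two players, so the total stays at 10.
committed = [
    {"alice_gold": 7, "bob_gold": 3},
    {"alice_gold": 4, "bob_gold": 6},
    {"alice_gold": 0, "bob_gold": 10},
]

# Try every possible crash point, from "nothing flushed" to "everything flushed".
survivors = [central_state_after_crash(initial, committed, n) for n in range(4)]
totals = [s["alice_gold"] + s["bob_gold"] for s in survivors]
```

In this model every entry in `totals` is 10: a crash can roll the world back to an earlier point, but it never leaves half of a trade applied.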
Of course, this will only work if the data being used on a game server is not also being accessed on a different game server. If multiple game servers are trying to share the same data, then the central server will be constantly asking each of the game servers to flush their cache, and we will introduce network latencies into the game once again. So we need to find a way to ensure that all of the players who are accessing the same data are located on the same game server. But that’s another miracle, and will be addressed in another posting.

Source: http://blogs.sun.com/scalinggames/entry/miracle_1
