Page 3 of 6

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Wed Sep 28, 2011 8:20 am
by cami
tzlaine wrote: Carsten, as you can see above, I have been unable to reproduce all your results. The optimized version runs in ~5s for me as for you, but I'm getting ~20s for the vanilla FO you first started investigating -- not minutes. I'm really curious what might be happening here, which is why I posted such detailed info on my test system above. Any idea why we're seeing such radically different things out of FO rev. 4282?
Yes, in revision 4280 my first patch, that only limited search depth, was comitted, which brought a speedup factor of ~7. So you are only experiencing the speedup from my second patch (~3) and the very last changes (~1.5), amounting for a total speedup of ~4.5, which matches your numbers excellently.

The inital post was at revision 4279, if you compare with that version you'll very likely see running times of about 2:00-2:30 minutes.

Edit: r4282 is correct, mixed this up. I'm unsure why the effect is less dramatic in optimized builds, but I guess inlining speeds up the tree/graph traversals dramatically.

Test conditions:
CPU: x86_64 Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz GenuineIntel
L1-Cache: Write Through, 256 kB, 4-way Set-associative
L2-Cache: Write Through, 1024 kB, 8-way Set-associative
L3-Cache: Write Back, 8192 kB, 16-way Set-associative
RAM: 2x DDR3, 64 bits width, 2048 MB (total 4GB), 1333 MHz
OS: GNU/Linux 2.6.38-gentoo-r6 #3 SMP
libc: glibc 2.12.2 shared
boost: 1.46.1 shared
compiler: gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5
CFLAGS: -g -O2 -march=core2

all callgrind runs additionally with cflags:
-fno-inline -fno-inline-functions -fno-inline-small-functions -fno-indirect-inlining -fno-early-inlining -fno-inline-functions-called-once -finline-limit=0

Test operation:
1. Starting the client without parameters
2. loading the game file using menu
3. clicking on "turn".
4. terminate client when turn button became enabled again

Callgrind operation:
1. Starting valgrind --tool=callgrind freeoriond
2. client operation as above
3. server terminates automatically after exit of the last human client

Speedup estimations were deduced from callgrind total cycle comparison and rounded appropriately to the low measurement precision.
Times provided in the posts are simply gathered by "looking on the watch" and not very precise.

PS. On my system sizeof(int) = 4, sizeof(long) = 8. Note that boost::adjacency_list<> uses long e.g. for the number of vertices.
tzlaine wrote:I still think one of my changes could replace some of Carsten's patch for an even faster result. More on that later.
Looking forward to this, as I'm still not happy with the running time. There's still some container copying that cannot be removed, if you plan to replace some containers by std::vector we're probably going to see another slight speedup. However, such a change would have been too daring for me so I didn't include it ^^° I tried to write my code in a way though that it can benefit from it.
tzlaine wrote:Again though, nice work! I have some comments on the patch, those will come after I've tried integrating some of my ideas into your existing work (those efforts may or may not moot my comments).
Thank you very much =) And while we're at it: thanks to everyone for all the efforts in making this game what it is today, and what it will become. Very enjoyable! Well, I'm actually playing the game, I like to explore larger galaxies and just wanted to get the game to run more comfortably for me ^^° I didnt get as far as I hoped for, but I'm still quite happy with the result. Again, looking forward to your improvements, though I'm pessimistic about the less-than-a-second goal by now.


There's only one way I see to make another really substantial speedup, which is eleminating the big for-each-effect condition evaluation loop. This is hard work however, as the code wasn't designed that way. Maybe you can use a caching approach like a did for GetGraphIndex(), but it needs a to be a pretty smart one. Maybe a functional (as in functional programming) scripting engine can help you here, they can usually detect and eleminate re-evaluation of the same term internally.

P.S.2: going to run the tests as tzlaine did for comparison.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Wed Sep 28, 2011 11:15 am
by cami
Test results in tzlaine's style:
r4282: real 48.1s user 19.2s sys 0.5s
r4283: real 45.3s user 17.0s sys 0.5s
2nd patch: real 45.0s user 16.9s sys 0.5s
final patch: real 17.5s user 4.3s sys 0.4s
r4301: real 43.4s user 17.1s sys 0.4s
  • r4283 is my first patch, it was committed vanilla.
    The first patch simply limits the path search to the maximum length that we actually want to check for.
  • 2nd patch was applied to r4280 and upgraded to r4282 afterwards
    The second patch runs all path searches in one WithinStarlaneJumps evaluation simultaneously (see next post)
  • final patch was applied to r4296 and downgraded to r4283 afterwards
    The final patch uses cheaper containers for the patch II changes and removes unnecessary copying operations in general
  • I had to apply a small compile error fix for r4301
  • all tests were run on a clean, fresh build
  • I'm unsure about the huge discrepancy between real time and CPU time. The ratio makes me suspect that a CPU second means running one second on all the CPU cores, and freeorion is currently unable to fully utilize my 8-way parallel configuration. It might also simply be spending lots of time performing I/O though.
  • r4282 is a lot faster than my initial runs, probably I didn't compile an optimized build in the first place. Sorry for the mislead!
  • apparently, speedups are significantly different when inlining is enabled,
  • especially I'm surprised about the little difference between r4283 and patch II. In my previous tests, it was a lot more noticeable.
    I didn't take the pain to fully recompile with enabled inlining during the development.
  • OTOH, speedup from patch II to final version is a lot more significant with inlining than without.
  • my best guess is aggressive optimization relativizes the tremendous recomputation in the original WithinStarLaneJumps by efficient parallelization, L1 Cache usage and possibly even recomputation elemination.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Wed Sep 28, 2011 2:18 pm
by cami
Test results in tzlaine's style, all optimizations disabled (-O0):

r4282: real 200.5s user 114.9s sys 1.2s
r4283: real 139.2s user 77.7s sys 0.9s
2nd patch: real 145.5s user 82.0s sys 0.9s
final patch: real 46.2s user 22.6s sys 0.5s
r4301: real 144.6s user 89.3s sys 0.9s

As you can see, my several-minute results most certainly were without optimizations. Sorry again.
Without optimizations patch II even slows down. Seems either -O2 without inlining is a very special case for that patch, or I mixed something up in its evaluation. I could imagine I compared patch II -O2 without inlining to r4282 or r4283 without optimization at all or something like that.

These results make me wonder whether SystemsWithinJumps() is actually useful. It should never be much slower than the original implementation, and there are many cases where it's a lot faster, but they may be irrelevant to either the test game or the current FreeOrion at all, and it's a lot of code.
What it does is instead of searching the shortest path from every starting point to every target individually, it adds a virtual starting vertex with edges leading to all the real starting points and searches until all targets have been found. It's probably easier for you to evaluate whether there are "real" situations where that improves a lot - typically that will be the case when there are many targets that are far from many starting points, but close to at least one, and jump limit isn't too small, that is, when the original implementation performs lots of futile search for paths that don't exist or are unnecessarily long. Note that the starting points correspond to subcondition matches, whereas the targets correspond to the set of "potential matches" for WithinStarLaneJumps::Eval().
As the callgrinds suggest, in the second patch the gain is probably cancelled by the use of more expensive containers. In the final patch these are replaced by std::vector whenever possible (using compile time checks). Unfortunately I did not produce any patches or callgrinds between those two states that might reveal the actual benefit of SystemWithinJumps(). If you request, and I find the time, I could probably generate those though.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 2:07 am
by tzlaine
cami wrote:
tzlaine wrote: Carsten, as you can see above, I have been unable to reproduce all your results. The optimized version runs in ~5s for me as for you, but I'm getting ~20s for the vanilla FO you first started investigating -- not minutes. I'm really curious what might be happening here, which is why I posted such detailed info on my test system above. Any idea why we're seeing such radically different things out of FO rev. 4282?
Yes, in revision 4280 my first patch, that only limited search depth, was comitted, which brought a speedup factor of ~7. So you are only experiencing the speedup from my second patch (~3) and the very last changes (~1.5), amounting for a total speedup of ~4.5, which matches your numbers excellently.

The inital post was at revision 4279, if you compare with that version you'll very likely see running times of about 2:00-2:30 minutes.

Edit: r4282 is correct, mixed this up. I'm unsure why the effect is less dramatic in optimized builds, but I guess inlining speeds up the tree/graph traversals dramatically.
Good news. I'm glad it wasn't some weird edge case we were going to need to fix.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 2:11 am
by tzlaine
cami wrote:[*]I'm unsure about the huge discrepancy between real time and CPU time. The ratio makes me suspect that a CPU second means running one second on all the CPU cores, and freeorion is currently unable to fully utilize my 8-way parallel configuration. It might also simply be spending lots of time performing I/O though.
It's the latter. We're loading the save file and tons of art and audio assets (the music is probably streaming during the entire performance run).
[*]apparently, speedups are significantly different when inlining is enabled,
Yes! This is the single most effective optimization any compiler ever does.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 2:27 am
by Geoff the Medio
tzlaine wrote:...some weird edge case we were going to need to fix.
It's probably possible to arrange such a case, by doing something like making some of the parameters and subconditions to a WithinStarlaneJumps condition depend on the candidate being matched explicitly, or the root candidate when that is also the local candidate of WithinStarlaneJumps. Something like (untested):

Code: Select all

And [
    Planet
    WithinStarlaneJumps jumps = RootCandidate.Supply condition = And [
        Ship
        Design name = "SD_COLONY_SHIP"
        OwnedBy RootCandidate.Owner
    ]
]
which matches planets that have, within a number of starlane jumps equal to the planet's supply meter, a colony ship owned by the same empire that owns the planet. (I think this is a reasonably plausible example case...)

This condition is likely to be quite inefficient to evaluate since the subcondition and valueref both reference the object being matched by the whole condition. That is, each candidate object could have a different supply meter and a different owner. Consequently, both need to be re-evaluated for each object being matched, and this likely would cause much longer overall condition evaluation times.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 2:32 am
by tzlaine
cami wrote:These results make me wonder whether SystemsWithinJumps() is actually useful. It should never be much slower than the original implementation, and there are many cases where it's a lot faster, but they may be irrelevant to either the test game or the current FreeOrion at all, and it's a lot of code.
I agree. I think your optimization there was really clever, but in the end too much code to verify and maintain, and I went another way, and got the same performance (plus just a bit more) with something a lot simpler.

Here are some more performance numbers. Remember the originals?
r4282: 21.0 21.2
r4299: 19.3 19.6

r4299 + Zach: 18.9 18.7
r4299 + Cami: 5.4 5.3
Now here are the new numbers:

FO r4299 + Cami + Zach1: 4.9 5.0
FO r4299 + Cami + Zach2: 4.9 4.9

"Zach1" refers to my own replacement for SystemGraphIndex() instead of Carsten's. His used an ad hoc hash map. Mine populates the same hash map once, putting it in Universe. It's less code and a bit more efficient.

"Zach2" refers to my replacement of the extensive code in Carsten's patch used to optimize WithinStarlaneJumps with a call to boost::johnson_all_pairs_shortest_paths() to get the shortest paths between each pair of vertices in the graph. We look this up in a table later to get O(1) shortest path lookups.

Now, since I've looked this amount of the patch over very carefully, I'll commit just this part now. I'll commit the Effects.{h,cpp} parts later after I've had the chance to go over those changes more closely.

Finally, here are the numbers on the changes I'm committing. It looks like most of the gains are coming from the stuff in Effect.{h,cpp}.

FO r4299 + Cami + Zach2 - Effects.{h,cpp}: 12.4 12.2

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 2:34 am
by tzlaine
Geoff the Medio wrote:It's probably possible to arrange such a case, by doing something like...
If you could gin up a patch that you've tested and know works, I would really like to test it out.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 3:13 am
by Geoff the Medio
tzlaine wrote:"Zach2" refers to my replacement of the extensive code in Carsten's patch used to optimize WithinStarlaneJumps with a call to boost::johnson_all_pairs_shortest_paths() to get the shortest paths between each pair of vertices in the graph. We look this up in a table later to get O(1) shortest path lookups.
MSVC 2010 with boost 1.46.1 doesn't like 4304...

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 3:28 am
by tzlaine
Ugh. I'll have to tackle this tomorrow. Sorry for the breakage.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 4:32 am
by Geoff the Medio
tzlaine wrote:Ugh. I'll have to tackle this tomorrow. Sorry for the breakage.
Commenting out the include line, and the line where the johnson stuff is used seems to get it to build. (I didn't try just one or the other).
Edit: Although FreeOrion is crashing for me at startup after doing that... although that may be related to unrelated changes I've made locally... /Edit
cami wrote:freeorion is currently unable to fully utilize my 8-way parallel configuration.
Determining effect target sets (Universe::GetEffectsAndTargets) might be a good case for parallelization efforts, as the relevant data does not change while this is occurring, and the results of each condition are independent.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 7:02 am
by cami
Super cool tzlaine =) I'm quite happy the boilerplate is gone. At first, I was a little worried storing the distances forever uses lots of memory (compare: I store two numbers per target. you store two numbers per every possible pair of vertices in the graph. That doesnt scale well) and Ω(VE) set-up time would be, although probably optimal, also tremendous on larger graphs

500 Nodes, average degree 2: 500,000x ops 2 MB mem
2,000 Nodes, average degree 2: 8,000,000x ops 32 MB mem
5,000 Nodes, average degree 2: 50,000,000x ops 200 MB mem
20,000 Nodes, average degree 2: 800,000,000x ops 3200 MB mem

That's one reason I didn't do this, the other is I didnt know if and when you change the system graph. Fortunately, it's improbable that graph sizes where this is starting to matter are ever relevant to this game, even in multiplayer mode on embedded devices, and it seems the system graph is statical atm. I dont know whether there's an efficient known way to adjust distance matrices, but even if you have to recompute the map every turn your version will be as good as mine.

I'm pretty certain this way is a lot faster than my patch in reality, as even despite all the optimizations WithinStarlaneJumps was still taking a significant portion of the time, and now this time is invested only once instead of every turn.

I'm also happy to hear that these changes do make a significant dent even without the copying reductions. Tells me I wasnt hallucinating ^^°

What bothers me though is that you keep the double loop forall (source,target) get distance(source,target) in WithinStarlaneJumps::Eval. This scales quadratically with the number of objects, not systems, which grows quite a bit higher (the example game has few colonized systems yet). I got the closest object "on the way", but now that we have the distances in constant time, isn't there a more efficient way to perform this check? If the closest subcondition match is within the #subcondition_matches closest objects to the target (in fact it needs to be one step closer than this allows), or if there are less objects at most jump_limit-1 steps away than there are subcondition matches, a simple BFS search for any subconditon match starting at the target will take at most the time of scanning through all the subcondition matches, the less the closer the object is (in fact you have to apply a correction factor here as scanning a list item is faster than testing set membership of a vertex). I think it is worthwhile to use this alternative method in most cases for a jump limit of 1 (this is actually used). To verify when its actually useful we'd have to keep the d-neighbourhood sizes for each vertex and small distances d in memory and compare them to the subcondition match list size. Then again, hardware evolution will quickly render this effort moot within the next few years, so...

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 8:02 am
by cami
Geoff The Medio wrote:cannot convert from 'boost::detail::edge_desc_impl<Directed,Vertex>' to 'const boost::detail::edge_desc_impl<Directed,Vertex>'
-.- A very tricky conversion indeed, converting from Type to const Type. This is obviously an MSVC bug. Finding a workaround can be tricky in this case.
Geoff The Medio wrote:Determining effect target sets (Universe::GetEffectsAndTargets) might be a good case for parallelization efforts, as the relevant data does not change while this is occurring, and the results of each condition are independent.'
Good point. If there's a portable way to determine CPU number you could simply have each CPU evaluate its share of the effect conditions.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 10:52 am
by tzlaine
cami wrote:Super cool tzlaine =) I'm quite happy the boilerplate is gone. At first, I was a little worried storing the distances forever uses lots of memory (compare: I store two numbers per target. you store two numbers per every possible pair of vertices in the graph. That doesnt scale well) and Ω(VE) set-up time would be, although probably optimal, also tremendous on larger graphs

500 Nodes, average degree 2: 500,000x ops 2 MB mem
2,000 Nodes, average degree 2: 8,000,000x ops 32 MB mem
5,000 Nodes, average degree 2: 50,000,000x ops 200 MB mem
20,000 Nodes, average degree 2: 800,000,000x ops 3200 MB mem

That's one reason I didn't do this, the other is I didnt know if and when you change the system graph. Fortunately, it's improbable that graph sizes where this is starting to matter are ever relevant to this game, even in multiplayer mode on embedded devices, and it seems the system graph is statical atm. I dont know whether there's an efficient known way to adjust distance matrices, but even if you have to recompute the map every turn your version will be as good as mine.
Right. We're unlikely to want to have people play in galaxies larger than 500 or 1000. If so, we can replace the full mapping with a sparse matrix. The sparse matrix will slow things down, so I'd rather pay in memory, which is usually plentiful.
I'm pretty certain this way is a lot faster than my patch in reality, as even despite all the optimizations WithinStarlaneJumps was still taking a significant portion of the time, and now this time is invested only once instead of every turn.
It seems that way from the numbers.
I'm also happy to hear that these changes do make a significant dent even without the copying reductions. Tells me I wasnt hallucinating ^^°

What bothers me though is that you keep the double loop forall (source,target) get distance(source,target) in WithinStarlaneJumps::Eval. This scales quadratically with the number of objects, not systems, which grows quite a bit higher (the example game has few colonized systems yet). I got the closest object "on the way", but now that we have the distances in constant time, isn't there a more efficient way to perform this check? If the closest subcondition match is within the #subcondition_matches closest objects to the target (in fact it needs to be one step closer than this allows), or if there are less objects at most jump_limit-1 steps away than there are subcondition matches, a simple BFS search for any subconditon match starting at the target will take at most the time of scanning through all the subcondition matches, the less the closer the object is (in fact you have to apply a correction factor here as scanning a list item is faster than testing set membership of a vertex). I think it is worthwhile to use this alternative method in most cases for a jump limit of 1 (this is actually used). To verify when its actually useful we'd have to keep the d-neighbourhood sizes for each vertex and small distances d in memory and compare them to the subcondition match list size. Then again, hardware evolution will quickly render this effort moot within the next few years, so...
There's no way of getting away from the O(N*N) performance of doing this kind of search. We have implemented a greedy system where possible (we only search from previous search results in an And for instance). The only fix that can get rid of this is to memoize results for later reuse. For this to work, we need to have a better understanding of when the universe is constant and when it is not (and so we must throw out our memos). This is what I've been planning for some time. I hope it's even possible.

Re: [PATCH] WithinStarlaneJumps implementation

Posted: Thu Sep 29, 2011 11:02 am
by tzlaine
A couple of notes about your original patch, Carsten.

First, this:

Code: Select all

@@ -4294,16 +4402,15 @@
 
         ObjectSet& from_set = search_domain == MATCHES ? matches : non_matches;
         ObjectSet& to_set = search_domain == MATCHES ? non_matches : matches;
-        ObjectSet::iterator it = from_set.begin();
-        ObjectSet::iterator end_it = from_set.end();
-        for ( ; it != end_it; ) {
-            ObjectSet::iterator temp = it++;
+        std::map<const UniverseObject *, std::pair<const UniverseObject *, int> > object_distances(BuildDistanceMap(subcondition_matches, from_set, jump_limit));
+        
+        for (ObjectSet::iterator it = from_set.begin(), end_it = from_set.end(); it != end_it; ++it) {
             // does candidate object contain any subcondition matches?
-            bool match = WithinStarlaneJumpsSimpleMatch(*temp, subcondition_matches, jump_limit);
+            bool match = WithinStarlaneJumpsSimpleMatch(*it, subcondition_matches, jump_limit, object_distances);
             // transfer
             if ((search_domain == MATCHES && !match) || (search_domain == NON_MATCHES && match)) {
-                to_set.insert(*temp);
-                from_set.erase(temp);
+                to_set.insert(*it);
+                from_set.erase(it);
             }
         }
     } else {
breaks things. Do you see why now that I've pointed it out?

Also, your implementation changed some logic from:

determining the shortest path for fleets in deep space by looking at both possible paths through the systems on either end of its starlane (accurate but slow)

with this:

Code: Select all

+        if (!system_one && result < MANY_JUMPS) ++result; // object one isn't in a system, so add a jump
+        if (!system_two && result < MANY_JUMPS) ++result; // object two isn't in a system, so add a jump // TZL
This is probably wrong 50% of the time. We need to optimize, but we must also preserve correctness.