01:12:42 | * | Trixar_za is now known as Trix[a]r_za |
06:47:58 | * | Boscop quit (*.net *.split) |
06:47:59 | * | ccssnet quit (*.net *.split) |
06:48:00 | * | reactormonk quit (*.net *.split) |
06:48:01 | * | mal`` quit (*.net *.split) |
06:48:01 | * | fowl quit (*.net *.split) |
06:48:01 | * | Roin quit (*.net *.split) |
06:48:01 | * | comex quit (*.net *.split) |
06:48:01 | * | silven quit (*.net *.split) |
06:48:02 | * | CodeBlock quit (*.net *.split) |
06:48:02 | * | Owner quit (*.net *.split) |
06:48:02 | * | moxie quit (*.net *.split) |
06:48:03 | * | Amrykid quit (*.net *.split) |
06:48:05 | * | rking quit (*.net *.split) |
06:48:05 | * | JStoker quit (*.net *.split) |
06:48:05 | * | dom96 quit (*.net *.split) |
06:48:05 | * | Araq quit (*.net *.split) |
06:48:05 | * | Trix[a]r_za quit (*.net *.split) |
06:48:07 | * | XAMPP quit (*.net *.split) |
06:48:08 | * | Reisen_ quit (*.net *.split) |
06:48:08 | * | shevy quit (*.net *.split) |
07:11:42 | * | moxie joined #nimrod |
07:11:42 | * | XAMPP joined #nimrod |
07:11:42 | * | shevy joined #nimrod |
07:11:42 | * | Owner joined #nimrod |
07:11:42 | * | reactormonk joined #nimrod |
07:11:42 | * | Reisen_ joined #nimrod |
07:11:42 | * | silven joined #nimrod |
07:11:42 | * | Amrykid joined #nimrod |
07:11:42 | * | mal`` joined #nimrod |
07:11:42 | * | fowl joined #nimrod |
07:11:42 | * | rking joined #nimrod |
07:11:42 | * | Araq joined #nimrod |
07:11:42 | * | JStoker joined #nimrod |
07:11:42 | * | Roin joined #nimrod |
07:11:42 | * | dom96 joined #nimrod |
07:11:42 | * | CodeBlock joined #nimrod |
07:11:42 | * | Trix[a]r_za joined #nimrod |
07:11:42 | * | comex joined #nimrod |
07:11:45 | * | Boscop joined #nimrod |
07:11:46 | * | ccssnet joined #nimrod |
09:45:33 | * | moxie quit (Ping timeout: 276 seconds) |
09:47:00 | dom96 | hello |
09:47:05 | Araq | hi dom96 |
09:47:17 | dom96 | Araq: Did you not read scrollback? |
09:47:51 | Araq | I think I did, why? |
09:47:59 | dom96 | graphics.nim/DrawLineAA -- sdl_gfx (as fowl pointed out) contains a function which does this. |
09:48:14 | dom96 | Should we alias DrawLineAA to that or remove DrawLineAA altogether? |
09:48:19 | Araq | so what, native nimrod code is nicer |
09:48:38 | dom96 | Yeah, but it's incorrect :P |
09:48:44 | dom96 | it doesn't work half the time |
09:48:46 | Araq | sdl_gfx is an additional package? |
09:49:00 | Araq | well why does it not work? |
09:53:42 | dom96 | it fails on some coordinates for some reason |
10:02:29 | dom96 | Araq: Did you see my reminder that we should test everything in taint mode? :P |
10:05:42 | Araq | the tester itself is compiled in taint mode |
10:05:49 | Araq | so there is some checking |
10:06:08 | Araq | however, ftpclient etc. indeed need to be tested in taint mode |
10:08:46 | dom96 | mmm |
11:20:47 | * | apriori| joined #nimrod |
11:21:00 | apriori| | hi guys |
11:21:04 | Araq | hi apriori| |
11:21:15 | apriori| | is it me being unable to count, or is the user count actually increasing? |
11:21:35 | Araq | increasing |
11:21:41 | apriori| | good :) |
11:21:55 | apriori| | Araq: I've got a question concerning your GC |
11:22:03 | Araq | alright |
11:22:21 | apriori| | now I'm definitely no expert on this topic.. but you said, slices are bad for it because of the interior pointers they represent |
11:22:28 | Araq | yeah |
11:22:36 | apriori| | aren't there actually GC adaptations that can deal with it in a sane way? |
11:23:10 | apriori| | yesterday I found a paper, for example, about that exact topic.. though I don't quite know whether that is supposed to be used with your GC variant |
11:23:49 | apriori| | funnily enough, even though apparently being from after 2006 it references tons of papers from the mid 70s |
11:23:56 | apriori| | http://benediktmeurer.de/files/fast-garbage-compaction-with-interior-pointers.pdf |
11:26:16 | Araq | you can do quite a lot to keep the overhead negligible |
11:26:29 | Araq | just as you can to get away with only 1 pointer type like D does |
11:27:38 | Araq | but it seems much wiser to disallow both |
11:28:08 | Araq | Go and D both start with an incredibly permissive memory model |
11:28:27 | Araq | and GC performance is left as an implementation detail |
11:28:34 | Araq | this is not Nimrod's design at all |
11:29:31 | apriori| | while I can agree with that... I think, since I also used D for quite a while, slices offer a really convenient way to deal with subranges in an array ... |
11:30:12 | apriori| | and that efficiently, without copying, etc. |
11:31:29 | Araq | walter claims D's design allows for a precise copying collector |
11:31:47 | Araq | until that is actually implemented and works (!) I doubt it |
11:32:54 | apriori| | I understand... well, the GC is definitely a major topic in D, which the devs seem to have real problems with... |
11:33:12 | apriori| | the D community devs have been complaining about the GC for a long time already .. |
11:34:46 | apriori| | I guess for the matrix stuff (which is my current motivation why I'd like to have efficient slices) I will just stick to pointer + offsetrange and create respective function overloads |
11:35:06 | apriori| | Araq: did you happen to see eigen3? the linear algebra library? |
11:35:45 | Araq | adrian benchmarked slices and on a modern CPU ptr+offset is as efficient as 'ptr' only |
11:36:07 | Araq | so you can do the slices just fine and keep the base pointer around for the GC |
11:36:17 | apriori| | yeah, that's what I meant |
11:36:26 | apriori| | so you won't have interior pointers |
11:37:21 | apriori| | well, about eigen3... I absolutely like its design. it's a C++ header-only library, which uses a hell of a lot of expression template magic to create a really convenient API |
11:37:22 | Araq | btw the GC supports interior pointers on the stack |
11:37:35 | apriori| | hm |
11:37:46 | Araq | and it caused huge pains |
11:37:56 | apriori| | what causes huge pains? |
11:38:05 | Araq | and it took me a month to get the overhead down to almost 0 |
11:38:15 | apriori| | oh, ok.. the interior pointers |
11:38:50 | Araq | the GC has to support it because we lack the control over the generated assembler code |
11:38:51 | apriori| | so it's fine to pass ptr+range into a function, calculate that down to pointers there and use these... |
11:38:58 | apriori| | or what would be the real consequence of that? |
11:39:20 | Araq | I'm just explaining why I hate them |
11:39:25 | apriori| | okk |
11:39:26 | apriori| | ok |
11:39:45 | apriori| | about lack of control... |
11:39:59 | apriori| | do you plan to write a real backend sooner or later? |
11:40:03 | apriori| | like using llvm or so? |
11:40:17 | Araq | not anymore |
11:40:29 | Araq | at least not before 1.0 is out |
11:40:40 | apriori| | yeah, I understand |
11:41:04 | apriori| | well, some languages also dealt quite well with compiling down to C for a while until a real backend was developed |
11:41:07 | apriori| | (haskell?) |
11:41:20 | Araq | eiffel still does that |
11:41:36 | apriori| | in a way it really makes sense... |
11:41:37 | Araq | though I don't know about the "dealt quite well" ;-) |
11:41:50 | apriori| | one automatically inherits all improvements to the compilers |
11:42:14 | apriori| | well, I think.. one problem is the lack of proper debugging |
11:42:39 | Araq | I plan to patch the generated debug info |
11:42:44 | apriori| | if one compiles down to C, one would at least need a plugin for the debugger to "demangle" (if one can call it that) the generated code |
11:42:48 | Araq | so that it shows demangled names |
11:43:03 | apriori| | yeah, that would be nice |
11:43:17 | apriori| | I like the edb idea though... |
11:43:19 | Araq | on the other hand, I can easily debug the generated C already |
11:43:29 | apriori| | at least one debugger that would be exactly the same, everywhere |
11:43:57 | Araq | edb is nice, but it's often too slow |
11:44:10 | apriori| | and needs a recompile of the source |
11:44:22 | Araq | plus having the debugger in the same address space is insane ;-) |
11:44:27 | apriori| | hehe, yeah |
11:44:42 | apriori| | one crash/messing with the registers tearing it all down |
11:46:58 | apriori| | btw., about the snippets of foldl/foldr etc. I once showed you... |
11:47:04 | Araq | btw one motivating example for the TR macros is copy elimination for slices |
11:47:16 | apriori| | actually it should be possible to optimize foldl because it is a tail call |
11:47:37 | Araq | yeah one of the 4 was a tail call iirc |
11:48:12 | Araq | but I dislike tail calls sometimes ;-) |
11:48:17 | apriori| | why? |
11:48:30 | fowl | can i set up a tr so that array[#, type] expands to array[0..#-1, type] |
11:48:56 | Araq | fowl: perhaps, I guess not |
11:49:21 | Araq | we could add it to the language if it constantly annoys you though |
11:49:29 | apriori| | I would second that |
11:49:39 | Araq | I know ;-) |
11:49:49 | fowl | # would be an int literal only |
11:49:50 | apriori| | it is a nice feature to allow a range in there (btw... is a set also allowed?) |
11:50:03 | fowl | apriori|: you can use enums |
11:50:10 | apriori| | fowl: yup, I know that |
11:50:15 | apriori| | I meant an ordinal type with "holes" in it |
11:50:35 | Araq | no, how could that work? |
11:50:42 | apriori| | only via mapping |
11:50:55 | Araq | that's not an 'array' then |
11:51:05 | Araq | use a table for that |
11:51:13 | apriori| | okay |
11:51:40 | Araq | fowl: if it's important enough for you, make a feature request |
11:51:40 | apriori| | Araq: about using # instead of a...b |
11:51:56 | apriori| | I think one actually rarely uses the a..b special case, in which indexing doesn't start with 0 |
11:52:12 | Araq | about the tail calls: principle of least power; recursion can do much more than a simple loop |
11:52:36 | apriori| | sure.. but recursion is also convenient ;) |
11:52:44 | Araq | so a loop is easier to read when you care about performance |
11:53:17 | Araq | there are also no tail calls in debug mode due to the stack tracing |
11:53:23 | apriori| | ouch |
11:53:45 | apriori| | that could make code run a hell of a lot slower in debug mode ;) |
11:54:17 | Araq | yeah |
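The foldl case mentioned earlier can be written both ways; a minimal sketch, with proc names and signatures invented for illustration (not stdlib):

```nim
# tail-recursive foldl: the recursive call is the last thing that
# happens, so it could become a jump -- but, per the discussion,
# not in debug mode where stack tracing is active
proc foldlRec[T](s: seq[T], f: proc (a, b: T): T, acc: T, i = 0): T =
  if i >= len(s): return acc
  return foldlRec(s, f, f(acc, s[i]), i + 1)

# the loop form needs no tail-call optimization at all, which is
# the "principle of least power" point made above
proc foldlLoop[T](s: seq[T], f: proc (a, b: T): T, acc: T): T =
  result = acc
  for x in items(s):
    result = f(result, x)
```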
11:54:32 | Araq | the a..b special case is more common if the index type is an enum |
11:54:39 | apriori| | yep |
11:54:56 | apriori| | but for "normal" arrays it's not |
11:55:06 | Araq | but there are also lots of people who prefer 1..N |
11:55:22 | apriori| | yeah, well I currently constantly write using low/high only.. |
11:55:34 | Araq | however they have a hard time already as strings and sequences start with 0 .. |
11:55:56 | apriori| | and with the matrix multiplication for example, I would even have to make sure that the len of the index space is identical, no matter how wickedly it was set up |
11:56:11 | apriori| | not quite sure though, whether that's actually a good idea there |
11:56:26 | Araq | you'll soon write m.first .. m.last everywhere |
11:56:33 | apriori| | so currently I work more with low(A)+offset and low(B)+offset etc. |
11:56:41 | Araq | to support efficient slicing ... |
11:57:42 | apriori| | okay :) |
12:00:07 | Araq | well if you want efficient slicing that is |
12:00:21 | Araq | I'm not sure it's useful all the time |
12:00:47 | apriori| | one needs exact control over when copies are done and when not |
12:01:27 | apriori| | and, well... for linear algebra stuff efficient slicing definitely rocks |
12:02:00 | Araq | how can you program efficient generic matrices anyway? 4x4 vs. 100x100 is a big difference, right? |
12:02:26 | apriori| | sure it is |
12:02:42 | apriori| | with bigger matrices one would switch over to non-full representations |
12:02:49 | apriori| | coordinate matrix, etc. |
12:03:05 | Araq | yep |
12:03:13 | Araq | so what should the stdlib support? |
12:03:39 | apriori| | but those assume that matrices are actually sparse.. and there also exist cases in which they simply aren't.. and sparse representations of full matrices are of course a hell of a lot slower than simple 2-dim arrays (or 1 flat array) |
12:03:54 | apriori| | the stdlib should of course support both |
12:04:30 | apriori| | and I also wouldn't include some automatic logic to switch over to e.g. sparse matrix for bigger matrices |
12:04:41 | apriori| | that would be a bad idea |
12:05:23 | apriori| | because usually one implements coordinate matrices as a map of a triple (i, j, V), keeping it sorted by one coordinate |
12:05:52 | apriori| | and this would result in insertion order being very important for the construction performance of the sparse matrix |
12:07:18 | apriori| | the stdlib should also make sure, that the matrix representation is compatible with other stuff out there... |
12:07:46 | apriori| | e.g. column-major in its representation for passing it directly without transposing into OpenGL |
12:08:39 | Araq | ah, didn't know about that |
12:09:34 | apriori| | yeah, well.. I guess that column major stuff is a relic of the early fortran times, though I really don't know where that convention comes from |
12:10:24 | Araq | because it's "intuitive"? a[x, y] == a[x][y] |
12:11:10 | apriori| | yeah, I guess so |
12:11:25 | apriori| | but it constantly confuses me.. because.. AFAIK C++ is row-major |
12:11:40 | Araq | yes |
12:12:09 | Araq | in fact, I think of an array of strings to get it right ;-) |
12:12:18 | apriori| | hehe |
12:14:12 | apriori| | I'm not quite sure, whether we should later add support for common BLAS and *PACK packages |
12:14:33 | apriori| | I mean.. they are widely used, yes.. but hell, their API sucks so bad, one definitely would need to wrap them. |
12:15:06 | Araq | btw I've implemented a simple stack trace profiler :D |
12:15:08 | apriori| | BLAS is like a library for basic operations on the types (e.g. vector operations) and usually it is replaced on the ABI level |
12:15:30 | Araq | the "embedding" part makes much more sense for a profiler than a debugger |
12:15:47 | apriori| | yeah, when I read that commit message I thought: "wtf.. didn't he just write 'I guess I'll write one on my own' this very evening?!" |
12:16:08 | Araq | yeah, didn't feel like hunting bugs ;-) |
12:16:10 | apriori| | agreed.. but such a profiler should definitely support sampling |
12:16:42 | Araq | what do you mean? it already does |
12:17:17 | apriori| | with sampling I actually mean not counting every instruction one by one.. which usually results in quite slow behaviour |
12:17:27 | apriori| | but only take a sample of an "area of interest" |
12:17:38 | Araq | {.push profiler:off.} :P |
12:17:57 | apriori| | one should invert the logic |
12:17:59 | Araq | but yeah, that doesn't do quite the right thing |
12:18:04 | apriori| | some global flag.. and some local regions |
12:18:20 | Araq | yeah, good point; I'm adding it |
12:18:35 | apriori| | you're just insane, man ;) |
12:18:43 | apriori| | pulling off such a huge pile of work on your own |
12:20:02 | apriori| | as an addition, I meant: https://en.wikipedia.org/wiki/Profiling_(computer_programming) |
12:20:11 | apriori| | read the part about "statistical profilers" |
12:21:04 | apriori| | while not extremely precise, sampling profilers can at least give you a good hint about bottlenecks without impacting runtime performance too much |
12:21:34 | Araq | well it is a sampling profiler |
12:21:49 | apriori| | interesting..that MIPS even had hardware support for it |
12:21:53 | apriori| | ok, good :) |
12:22:25 | fowl | i had to rearrange an enum so that it was in order |
12:22:42 | fowl | last enum on this page: http://enet.bespin.org/protocol_8h.html |
12:22:53 | Araq | it used to be 'flat' profiler but then there's already gprof |
12:22:53 | fowl | is it still compatible if rearranged? |
12:23:20 | Araq | fowl: I'd think so |
12:23:46 | Araq | however it looks like you want an enum + a set to represent it in Nimrod |
12:24:29 | apriori| | Araq: btw.. couldn't one just generate a static internal mapping for such enums with "holes"? |
12:24:44 | apriori| | like.. one has e.g. the actual enum values 2, 4, 8, etc. |
12:24:52 | Araq | why not do instead: |
12:25:01 | apriori| | there is no problem in actually interpreting those as 0, 1, 2 in an array |
12:25:31 | Araq | type TMyEnum = enum meA = 0, meB, meUnused, meUnused, meC |
12:25:47 | fowl | why's that? i have ENET_PROTOCOL_HEADER_FLAG_MASK = ENET_PROTOCOL_HEADER_FLAG_COMPRESSED.cint or ENET_PROTOCOL_HEADER_FLAG_SENT_TIME.cint in the enum and it hasn't raised any problems |
12:25:48 | Araq | er, meUnused2 of course for the second |
12:26:17 | apriori| | Araq: bad thing.. because if people use e.g. 2^x, then you get a hell of a lot of such unused sections to write |
12:26:34 | apriori| | but, yeah, you already said it.. that's definitely a "flagging" case, which is better described using a set |
12:26:38 | Araq | 2^x means you want a set of an enum anyway |
12:27:15 | Araq | and yeah if you can't map it easily you have to live with 'const's and inferior type checking |
12:27:32 | Araq | though you can also do: |
12:27:38 | Araq | type TEnum = distinct cint |
12:27:46 | Araq | const valueA = TEnum(0) |
12:28:10 | apriori| | interesting |
12:28:12 | Araq | proc `or`(a, b: TEnum): TEnum {.borrow.} |
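Putting Araq's fragments together, an enet-style flag type could look like this sketch; the type name, constant names, and bit values here are all made up for illustration:

```nim
type
  TProtocolFlag = distinct cint   # distinct: no accidental mixing with cint

const
  flagA = TProtocolFlag(1 shl 14)   # hypothetical bit values
  flagB = TProtocolFlag(1 shl 15)

# borrow the bitwise operations from the underlying cint
proc `or`(a, b: TProtocolFlag): TProtocolFlag {.borrow.}
proc `and`(a, b: TProtocolFlag): TProtocolFlag {.borrow.}
proc `==`(a, b: TProtocolFlag): bool {.borrow.}

# combined mask, analogous to the enet FLAG_MASK member
let flagMask = flagA or flagB
```

Compared to a plain enum with unused filler members, this keeps type checking for the flag values while allowing arbitrary gaps and combinations.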
12:28:46 | apriori| | Araq: not wanting to annoy you again.. but.. pleeaase start documenting every single pragma there is ;) |
12:30:39 | Araq | it's all documented |
12:30:45 | Araq | but you have to read the manual |
12:30:51 | apriori| | ok |
12:30:52 | Araq | and skip the verbose introduction |
12:31:08 | Araq | the tuts simply don't cut it |
12:31:11 | Araq | brb |
12:31:18 | apriori| | k |
13:01:25 | Araq | back |
13:06:36 | * | apriori| quit (Ping timeout: 255 seconds) |
13:31:06 | * | apriori| joined #nimrod |
13:43:35 | * | XAMPP quit (Ping timeout: 272 seconds) |
13:52:32 | apriori| | Araq: one question.. |
13:52:39 | apriori| | about constrained general types |
13:52:53 | Araq | they're a bit limited for now, yes |
13:53:02 | Araq | basically you can do: |
13:53:08 | apriori| | would you prefer to have that constraint (e.g. square matrix) expressed by using only the matching arguments in the function |
13:53:13 | apriori| | or should I check with assert? |
13:53:42 | Araq | hu? |
13:53:46 | Araq | can't follow |
13:53:51 | Araq | do you mean: |
13:53:52 | apriori| | like: matrixmultiply with only 3 dimensions instead of 4.. thereby already enforcing that the matrices are compatible |
13:54:25 | Araq | I have no opinion about it |
13:54:38 | apriori| | http://pastebin.com/5q2zZNgK |
13:55:13 | apriori| | I think, assert might be better, because it allows an actual non-cryptic error message |
13:55:27 | Araq | resul[i, j] = cast[T](0) # unnecessary, but if you want to do it, use: |
13:55:36 | apriori| | whereas the other variant will just spit out a not necessarily helpful compiler error |
13:55:42 | Araq | result[i, j] = T(0) |
13:55:49 | apriori| | ok |
13:55:55 | apriori| | but usually it would implicitly cast? |
13:55:58 | Araq | and yeah the identifier lookup in generics needs to be fixed |
13:56:17 | Araq | typo: 'resul' vs 'result' |
13:56:27 | apriori| | yeah.. not compiled yet |
13:56:28 | Araq | and the compiler doesn't complain until instantiation time |
13:56:34 | apriori| | yes |
13:56:36 | apriori| | but... |
13:56:43 | Araq | it's really horrible and I apologize ;-) |
13:56:52 | apriori| | it has no real clue about a constraint being in place in version 2 |
13:57:11 | Araq | and yeah, usually it would implicitly cast |
13:57:12 | apriori| | it would just say "can't find a matching function" |
13:57:40 | Araq | you should be able to do: |
13:58:05 | apriori| | hm.. I guess this is a good usecase for when + hint |
13:58:08 | apriori| | + assert |
13:58:12 | Araq | yeah |
13:58:19 | Araq | was about to suggest that |
13:58:22 | Araq | :-) |
13:58:24 | apriori| | :) |
14:04:47 | Araq | apriori|: this program is pretty bad for the GC as it generates lots of garbage |
14:04:59 | Araq | for i in 0.. 1_000_000: |
14:05:15 | Araq | let s = formatFloat(i) |
14:06:01 | Araq | there are lots of ways to deal with this, but they all have problems: |
14:06:41 | Araq | a) change formatFloat to take a 'result: var string' so that the memory can be re-used within formatFloat |
14:07:17 | Araq | b) fix the string implementation to have an STL-like "short string" optimization |
14:07:36 | Araq | where a short string is directly embedded in the object |
14:08:25 | apriori| | hm |
14:08:45 | apriori| | do you have a minimal format string example? |
14:08:57 | Araq | strutils.formatFloat |
14:08:59 | apriori| | e.g. the equivalent of printf("something %i", bla)" |
14:09:29 | Araq | c) make strings reference counted like in Delphi |
14:10:02 | Araq | c) is really nice when the output target is C++ because we have fast exception safe destructors there |
14:10:52 | Araq | the real problem is that we want proc p: string for convenience |
14:10:54 | apriori| | (btw.. % operator similar to python's... that rocks ;) |
14:11:00 | apriori| | yeah... |
14:11:08 | Araq | but need proc p(result: var string) for performance |
14:11:41 | Araq | would be nice if we could transform one into the other |
14:12:06 | Araq | TR macros can do the transformation |
14:12:23 | Araq | but that still requires 2 different 'p' implementations |
14:13:19 | apriori| | yes |
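The two-signature pattern the discussion keeps returning to can be sketched with a hypothetical formatting proc; `fmt` is an invented name, and a real version would write the digits directly into the buffer rather than going through `$`, which still allocates a temporary:

```nim
# convenience form: allocates a fresh string on every call
proc fmt(x: float): string =
  result = $x

# performance form: the caller owns the buffer, so a hot loop
# reuses one allocation instead of producing garbage per iteration
proc fmt(x: float, result: var string) =
  setLen(result, 0)             # keep the buffer, drop the contents
  add(result, $x)               # the $ temp still allocates here

var buf = ""
for i in 0..1_000:
  fmt(float(i), buf)            # buf's storage is reused each time
```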
14:13:24 | apriori| | btw... |
14:13:30 | apriori| | how are those tr macros actually applied? |
14:13:39 | apriori| | because.. in term rewriting order should matter |
14:13:50 | Araq | in an extra pass over the AST after semantic checking |
14:14:31 | Araq | the longest match should win but currently it's applied in reverse definition order |
14:14:39 | apriori| | ok |
14:15:06 | Araq | well hm |
14:15:10 | * | zahary joined #nimrod |
14:15:28 | Araq | I'm wrong, it should already implement "longest match wins" |
14:15:47 | apriori| | okay |
14:18:22 | Araq | hi zahary |
14:18:44 | zahary | hi |
14:21:12 | apriori| | Araq: btw: say I got those types: |
14:21:15 | apriori| | TMatrixNM*[M, N, T] = array[M, array[N, T]] |
14:21:16 | apriori| | TVectorN*[A, T] = array[A, T] |
14:22:12 | apriori| | assume I got a matrix with only say 1 row or 1 column.. will the codegen for array[0..1, array[0..n, T]] expand to the same as array[0..n, T] ? |
14:23:04 | apriori| | I mean.. is that obvious useless extra dimension stripped? |
14:23:16 | Araq | no but GCC may do that |
14:23:26 | Araq | it should ;-) |
14:24:01 | apriori| | hm, because I currently think I could also just treat vectors as special matrices |
14:24:32 | Araq | just try it and look the generated assembler |
14:24:37 | Araq | *look at |
14:24:39 | apriori| | yeah, I guess so |
14:24:47 | Araq | it's the only thing that works with modern optimizers |
14:25:15 | apriori| | okay |
14:25:30 | * | q66_ joined #nimrod |
14:25:36 | Araq | benchmarking may not help as much as that could trigger some cache anomalies |
14:26:05 | Araq | where the less efficient code runs faster because it's at an address that happens to not produce a cache conflict |
14:26:22 | apriori| | yeah... cache aware programming sucks ;) |
14:27:16 | * | q66_ is now known as q66 |
14:30:12 | Araq | zahary: read the history please about the formatFloat issue |
14:31:07 | Araq | any ideas how to solve the "excessive proc duplications for strings" problem? |
14:38:25 | zahary | what is the issue again? |
14:38:42 | Araq | for i in 0.. 1_000_000: |
14:38:51 | Araq | let s = formatFloat(i) |
14:39:25 | Araq | produces lots of garbage |
14:39:58 | Araq | and the GC is not really good for this |
14:43:20 | apriori| | btw: how would I need to write an overload for a len function for ranges? |
14:43:33 | apriori| | I mean.. what is the proper type of a range[T]? |
14:43:42 | * | zahary quit (Read error: No route to host) |
14:43:51 | apriori| | because something like len*[T](x: range[T]): int does not work |
14:43:52 | * | zahary joined #nimrod |
14:44:07 | zahary | I guess it's possible to statically determine that this string is temporary and then some alternative allocation strategy could be used for it |
14:45:07 | Araq | yeah and you can do it with a TR macro |
14:45:15 | Araq | but the problem is you want |
14:45:21 | Araq | proc p: string |
14:45:25 | Araq | for convenience |
14:45:27 | Araq | and |
14:45:38 | Araq | proc p(result: var string) for performance |
14:45:57 | Araq | and you need to provide both |
14:46:04 | Araq | and then the TR macro can do its job |
14:48:33 | Araq | apriori|: proc len*(x: typedesc[range]): int |
14:48:38 | zahary | i've said before that I think RVO should be used everywhere and when not explicitly told otherwise procs should be able to construct objects both on the stack and on the heap (if the result is a new heap location, the compiler just places an allocation call before the proc call) |
14:49:39 | Araq | I remember now, but that's not the only solution |
14:50:00 | Araq | if you have 'result: var string' you can easily do: |
14:50:05 | Araq | setLen(result, 0) |
14:50:14 | Araq | # fill 'result' then |
14:50:34 | Araq | with no new allocations unless it happens to need a buffer resize |
14:50:51 | Araq | and it doesn't matter that it's been placed on the heap |
14:51:04 | zahary | you mean, var string is more optimal when there is a chain of operations? |
14:51:05 | * | Boscop quit (Ping timeout: 244 seconds) |
14:51:15 | zahary | let me think about it |
14:51:19 | Araq | in fact, it may be beneficial to have it on the heap because it can stay there |
14:51:50 | Araq | whereas if it's on the stack you likely end up copying it to the heap later |
14:55:59 | zahary | the question boils down to whether it's possible to optimise a proc like uppercase(x: string): string to sometimes reuse existing memory |
14:57:33 | * | Boscop joined #nimrod |
14:59:30 | zahary | in C++, the current thinking is that if you accept the input as a copy it would be possible for the compiler to reuse its buffer in the result string. also sometimes it would be possible to avoid copying the input string when calling the function (when the input is an rvalue) |
14:59:53 | zahary | obviously uppercase(x: var string) is different interface so we are ignoring it here |
15:03:20 | zahary | C++ is not completely optimal as there is still code for resource stealing and noop destructors being executed even in the best case - in theory it's possible to compile uppercase(x: string): string automatically in 2 forms (one of them being uppercase(result: var string)) |
15:10:37 | apriori| | Araq: that does not work :/ |
15:12:44 | apriori| | http://pastebin.com/Jcm06fDj |
15:13:33 | apriori| | I know that test is silly, because there is array.len, but it's just a test for extracting the range and passing that over to the proc |
15:15:02 | zahary | well, apriori|, why are you passing the array to the test function. you should pass the T type |
15:15:36 | apriori| | yep, right |
15:15:42 | apriori| | but that doesn't change the error message |
15:16:16 | zahary | let me try it. it's probably my fault :) |
15:18:53 | zahary | this works for me: http://pastebin.com/6zbeNhDz |
15:19:10 | zahary | indeed, there is some error when you say typedesc[range] - I'll look into it later |
15:20:39 | apriori| | ok, testing |
15:21:09 | apriori| | yup, working, thank you |
15:21:23 | Araq | zahary: well yeah the problem is the optimization changes semantics |
15:21:43 | Araq | 'result: var T' is an inout as opposed to an 'out' |
15:25:24 | zahary | the optimisations I'm describing are possible if you know that the input string lifetime ends at the proc call - in C++ you can then move it to the param instead of copying it and in the other hypothetical optimisation you then choose the alternative "var version" of uppercase |
15:30:33 | Araq | it's not at all about copying vs. moving IMO |
15:30:57 | Araq | in fact the code snippet already does not copy, but move |
15:36:40 | Araq | the problem is that the current semantics of a string result prohibit a variant of loop invariant code motion |
15:36:43 | apriori| | hrm |
15:36:56 | apriori| | Error: illformed AST: m1[0, 0] .. what the hell? :/ |
15:37:12 | Araq | for i: x = p(alloc()) |
15:37:20 | Araq | --> |
15:37:23 | Araq | y = alloc() |
15:37:28 | Araq | for i: p(x) |
15:38:03 | Araq | er, loop hoisting, it's not really "invariant" |
15:38:08 | * | Trix[a]r_za is now known as Trixar_za |
15:41:47 | Araq | we could do this: |
15:42:16 | Araq | proc toUpper(x: string, result: var string) {.asfunction.} # in stdlib |
15:42:23 | Araq | and then allow usage as |
15:42:35 | Araq | a function that returns something |
15:42:43 | Araq | echo toUpper("abc") |
15:42:57 | Araq | and the compiler introduces a temporary for us: |
15:43:07 | Araq | var tmp1 = "" |
15:43:22 | Araq | toUpper("abc", tmp1) |
15:43:26 | Araq | echo tmp1 |
15:43:49 | Araq | which is something along the lines of what you suggested I guess? |
15:44:52 | * | moxie joined #nimrod |
15:44:56 | Araq | we can't hide it completely I think because it affects proc var compatibility |
15:45:22 | Araq | plus for 'int' it's stupid to transform it into 'result: var int' |
15:47:13 | Araq | thinking about it more, all we need is a 'snoopResult' pragma |
15:47:36 | Araq | proc toUpper(s: string): string {.snoopResult.} = |
15:47:56 | Araq | if isNil(result): result = newStringOfCap(s.len) |
15:48:14 | Araq | else: result.ensureCap(s.len) |
15:48:30 | Araq | for i in 0.. <s.len: result[i] = toUpper(s[i]) |
15:49:28 | Araq | 'snoopResult' ensures it's passed as 'var T' and that it's not reset/initialized to binary zero |
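Collected in one place, Araq's hypothetical 'snoopResult' toUpper from the lines above. Neither the pragma nor `ensureCap` exists; this is a design sketch only, with a `setLen` added so the indexing stays in bounds:

```nim
proc toUpper(s: string): string {.snoopResult.} =
  # under the proposal, 'result' arrives as 'var string' and keeps
  # whatever buffer it had from a previous call instead of nil
  if isNil(result): result = newStringOfCap(s.len)
  else: result.ensureCap(s.len)   # hypothetical capacity helper
  setLen(result, s.len)
  for i in 0 .. s.len-1:
    result[i] = toUpper(s[i])
```

In a loop, the same result buffer would then be filled over and over, which is exactly the formatFloat-garbage problem from earlier in the discussion.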
15:50:26 | apriori| | I guess, another bug: http://pastebin.com/893ZZ0ZJ |
15:51:34 | Araq | it's not really a bug, if you overload [] for arrays the compiler is confused |
15:51:54 | Araq | and uses the builtin [] which does not support ',' notation |
15:52:05 | apriori| | Araq: it's not overloaded anywhere there |
15:52:08 | Araq | so it's an "illformed AST" ;-) |
15:52:15 | apriori| | hm... |
15:52:30 | Araq | I can fix it easily I think |
15:52:42 | apriori| | well, if that's the case, how should I overload [] for a matrix which is array[..,array[.., T]]? |
15:53:03 | Araq | well I expected people to wrap it in an object |
15:53:14 | Araq | but I don't mind supporting it for arrays |
15:53:23 | apriori| | hm, ok |
15:54:39 | apriori| | https://github.com/Araq/Nimrod/issues/205 I wonder how that would actually work there... |
15:54:46 | apriori| | return type inference out of nowhere? |
15:55:04 | Araq | yeah somehow people simply expect that to work ... |
15:58:09 | apriori| | zahary: as an update to the ticket you were working on: https://github.com/Araq/Nimrod/issues/202 |
15:58:24 | apriori| | name doesn't cause ICE anymore, but segfaults |
15:58:50 | zahary | apriori|, 2) works for me. can you send me the exact code for which you get a segfault |
16:01:28 | zahary | Araq, what problem do you see if there isn't snoopResult, but rather the compiler implements it for you automatically in certain cases? |
16:02:18 | apriori| | zahary: sry, I'm stupid today.. it does segfault if you pass in a variable instead of a type, which should be expected. but maybe we should provide an overload |
16:05:09 | zahary | "we can't hide it completely I think because it affects proc var compatibility" - you mean that when the proc is converted to a proc var the optimization is no longer applicable? sure, but this sounds like a rarely triggered edge case. snoopResult cannot be overloaded with a regular upperCase proc I think, which is somewhat more limiting |
16:05:54 | zahary | apriori|, the compiler should never segfault so it's bug after all |
16:07:13 | apriori| | zahary: yup, right |
16:11:18 | zahary | … also you can have a pragma on the proc var rather then the proc itself (much like you define the calling convention) |
16:23:36 | Araq | '(result: var int)' is not compatible to '(): int' |
16:24:02 | Araq | '(): int' is returned in a register |
16:24:14 | Araq | so that's why it affects proc var compatibility |
16:24:28 | Araq | and yeah 'snoopResult' is for proc var types too |
16:24:52 | Araq | you can't make it the default because it changes semantics within toUpper's body |
16:25:08 | Araq | it's this semantic change that allows the optimization |
16:25:19 | Araq | in toUpper's body you can then do: |
16:25:32 | Araq | setLen(result, 0); fill result |
16:25:48 | Araq | this is otherwise not possible because 'result' starts with a 'nil' value |
16:26:23 | Araq | 'snoopResult' transforms the 'result' from an 'out T' into an 'inout T' |
16:26:29 | fowl | Araq: I think you said not to, but I'm storing the address of a dereferenced ref and casting it back to a ref and it's working, so what's the reason I shouldn't do this?
16:27:22 | Araq | depends on what you mean exactly |
16:27:41 | fowl | i'm using it for c callbacks |
16:27:47 | Araq | it could be fine but addr x[] is not any different from cast[T](x) |
16:27:59 | Araq | except more confusing to read IMHO |
16:28:25 | Araq | so I'd use the 'cast' version |
16:29:06 | fowl | ? i need a generic pointer to store |
16:29:47 | fowl | but all the related functions expect a ref type so casting it to ptr T isn't useful
16:33:07 | Araq | but 'addr x[]' produces a ptr T |
16:33:14 | Araq | so I can't follow |
16:33:36 | fowl | because I can't store ref T as a generic pointer
16:34:34 | Araq | well yeah so you need a 'ptr T' |
16:34:49 | Araq | so you can do: cast[ptr T](myRefT) |
16:35:02 | Araq | no need to do: addr myRefT[] |
16:35:21 | fowl | ooo ok |
16:35:25 | zahary | Araq, sure, I understand the implications. What I'm saying is that it's possible to compile 2 versions of the proc from a single definition. The compiler automatically uses the "snoop" version in regular chained calls like "foo".toUpper.trim.reverse and when converting the proc to a procvar, it just looks at the procvar type to determine which implementation to pick |
16:36:14 | zahary | no need for the user to mark the proc as snoop (thus disabling some of its uses in regular code) |
16:36:28 | Araq | well that's my suggestion as well |
16:36:48 | Araq | a snoop proc can be called like every other proc |
16:37:03 | Araq | client code is not affected except for proc vars |
16:37:15 | * | shevy quit (Read error: Operation timed out) |
16:37:34 | Araq | but in the implementation of toUpper the programmer needs to provide the optimized version |
16:37:42 | zahary | ok, but why should I enable my proc for snooping - this should be the default - have a pragma to disable it if you want |
16:38:09 | Araq | yeah maybe, I don't know |
16:38:43 | zahary | maybe we still disagree somewhere - snoop only in certain situations in my point of view |
16:39:11 | zahary | … when the input is a var about to be destroyed |
16:40:09 | zahary | what happens when you call a snoop proc in expression like |
16:40:10 | zahary | var a = x.toUpper ? |
16:41:04 | zahary | x is copied and then passed to the snooped proc? |
16:48:07 | Araq | why? x ain't copied today |
16:48:44 | Araq | it's not even a problem if you do: x = x.toUpper |
16:49:04 | Araq | though it could become a problem for other examples |
16:49:22 | Araq | but then we already have an alias analysis
16:51:22 | zahary | ah, poor example indeed |
16:51:57 | Araq | well I think 'snoop' shouldn't be the default because it changes semantics in a subtle way |
16:52:18 | * | shevy joined #nimrod |
16:52:40 | Araq | and you can do another interesting thing with 'snoop': the proc can actually *add* to the result buffer instead of replacing its content |
16:53:00 | Araq | which is very useful for string handling |
16:54:05 | Araq | though I think it's still not enough to transform $(x: tree) into toStringAux(x: tree, result) |
16:54:18 | Araq | in the recursive case |
16:54:47 | Araq | a snoop proc should be allowed to be called like: |
16:55:02 | Araq | p(x, res) |
16:55:10 | Araq | instead of: res = p(x) |
16:55:53 | zahary | what about? |
16:55:54 | zahary | var a = x.reverse.toUpper # 1) both reverse and toUpper are snoop procs. 2) reverse is not snoop |
16:55:54 | zahary | so you see snoop as different semantics in which you expect the result to be already constructed so you are supposed to append to it? I seem to care more about automatic copy elision in certain situations |
16:57:11 | Araq | well it's at least an interesting idea |
16:58:03 | Araq | 1) both are snoops: where is the problem? |
16:58:12 | Araq | var tmp = "" |
16:58:21 | Araq | x.reverse(tmp) |
16:58:31 | zahary | syntax-assisted append-semantics are a cool idea, but somewhat orthogonal to what I'm talking about
16:58:54 | Araq | var tmp2 = ""; toUpper(tmp, tmp2) |
16:59:04 | Araq | a = tmp2 |
16:59:18 | Araq | or maybe directly: toUpper(tmp, a) |
16:59:34 | Araq | 2) reverse is not snoop: |
16:59:54 | Araq | var tmp = x.reverse |
17:00:08 | Araq | toUpper(tmp, a) |
17:00:44 | Araq | it's all very easy to do with TR macros except that the TR macro can't optimize the ordinary toUpper easily |
17:01:07 | Araq | because it can't transform its body |
17:01:30 | zahary | I suggest that it's possible to do |
17:01:30 | zahary | var a = x |
17:01:30 | zahary | reverse(var a) |
17:01:30 | zahary | toUpper(var a) |
17:01:59 | Araq | we often have: |
17:02:12 | Araq | proc `$`(x): string = |
17:02:14 | Araq | result = "" |
17:02:19 | Araq | toStringAux(x, result) |
17:03:02 | Araq | but the compiler can easily transform toStringAux to a proc with result type |
17:03:31 | Araq | the other direction is MUCH harder to do for the compiler |
17:03:50 | Araq | so the programmer should only have to provide toStringAux |
17:04:14 | Araq | which is what 'snoopResult' enables |
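[editor's note] The toStringAux pattern Araq describes (a `$` wrapper over an append-style helper) can be sketched in Python; the `Node` tree type and function names here are hypothetical, not from the discussion. The point is that the helper appends into one shared buffer instead of allocating a fresh string at every recursion level:

```python
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def to_string_naive(node):
    # Allocates a fresh string at every level of the recursion.
    s = str(node.value)
    for child in node.children:
        s += " " + to_string_naive(child)
    return s

def to_string_aux(node, out):
    # Appends into one shared buffer: no per-level string allocations.
    out.append(str(node.value))
    for child in node.children:
        out.append(" ")
        to_string_aux(child, out)

def to_string(node):
    # The '$'-style wrapper: create the buffer once, let the helper fill it.
    out = []
    to_string_aux(node, out)
    return "".join(out)
```

What 'snoopResult' would buy is deriving the wrapper form from the helper form (or vice versa) automatically, so the programmer writes only one of them.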
17:05:23 | Araq | your transformation only works with 1 argument procs I think? |
17:05:35 | Araq | I mean unary procs |
17:06:04 | zahary | no, the var part is the result - other arguments can still be passed along |
17:06:57 | zahary | what is hard in my suggestion is compiling the correct proc body to mutate the input in place |
17:10:13 | Araq | yeah ;-) |
17:12:31 | zahary | well, the body can be supplied by the user and then the rule is just that |
17:12:31 | zahary | proc (x: var T, …) is interchangeable with proc(x: T, …): T in certain situations
17:13:46 | zahary | still thinking about automatically transforming the body - seems to be possible if you cache reads of certain fields into local variables |
17:15:34 | zahary | you know the input is immutable, so if you write to some field in the result and then attempt to read the same field from the input, the compiler has to ensure that the original value was cached before the mutation (sure, this will actually result in worse code sometimes)
17:18:42 | zahary | what about my first rule instead of snoopResult? the benefit is that intermediate variables in x.reverse.toUpper are eliminated this way
17:19:30 | Araq | sorry, what is your first rule again? |
17:19:41 | zahary | proc (x: var T, …) is interchangeable with proc(x: T, …): T in certain situations
17:19:55 | apriori| | ohm.. how would I overload the [] operator so that it could be used for any operation (like +=, -= etc.) on the scalar type? (matrix scenario) |
17:20:10 | Araq | proc []: var T |
17:20:38 | fowl | how about making them return var strings, then variants that take var strings |
17:21:04 | Araq | but I will implement a[i, j] to mean a[j][i] if 'a' is an array, so that you don't have to overload any [] at all |
17:21:37 | apriori| | Araq: I thought about your wrapping of the array as object |
17:21:40 | zahary | there are 2 definitions of toUpper supplied by me
17:21:41 | zahary | proc toUpper(x: string): string |
17:21:41 | zahary | proc toUpper(x: var string) |
17:21:41 | zahary | the compiler notices that x.trim.toUpper.reverse matches the first overload, but it will nevertheless use the second one, because it can determine that x.trim produces a temporary value that can be mutated in place |
17:22:09 | apriori| | it makes sense.. because I could then "annotate" the matrix, e.g. as being trilinear, etc... and allow procs to use special algorithms then |
17:22:15 | Araq | well, should a[x,y] mean a[x][y] or a[y][x]? |
17:23:25 | Araq | zahary: 2 definitions of toUpper is exactly the problem that I want to avoid |
17:23:53 | zahary | well, my first thesis was that it's possible to derive the second definition from the first one |
17:23:55 | Araq | your rule is easily done with a TR macro, no need to implement it in the compiler |
17:24:14 | Araq | but I want to get rid of the 2 definitions |
17:24:33 | Araq | because it means a strutils of twice the size it is today |
17:25:17 | Araq | and I doubt it's possible to derive the second implementation from the first one
17:25:34 | apriori| | Araq: [x][y] for column major |
17:26:55 | Araq | ok |
17:27:06 | Araq | that's easier to remember anyway |
17:27:10 | apriori| | indeed |
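[editor's note] The `[]`/`[]=` overloading under discussion maps naturally onto Python's `__getitem__`/`__setitem__`, which receive the index tuple directly. A minimal sketch of the column-major convention just agreed on, where `a[x, y]` means `a[x][y]` (the `Matrix` class and its sizes are illustrative, not from the log):

```python
class Matrix:
    # Column-major: element (x, y) lives at data[x][y],
    # i.e. data is a list of columns, each column a list of rows.
    def __init__(self, cols, rows):
        self.data = [[0.0] * rows for _ in range(cols)]

    def __getitem__(self, idx):
        # m[x, y] arrives here as the tuple idx == (x, y)
        x, y = idx
        return self.data[x][y]

    def __setitem__(self, idx, value):
        x, y = idx
        self.data[x][y] = value

m = Matrix(2, 3)
m[1, 2] = 5.0
```

In Nimrod the same effect needs separate `[]` and `[]=` overloads (and, as noted below, `var T` vs `T` return overloading for in-place updates like `+=`).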
17:28:27 | zahary | it certainly is, only not necessarily in an efficient way
17:28:27 | zahary | proc (x: var T) = |
17:28:27 | zahary | input = x.copy |
17:28:27 | zahary | … replace any occurrence of x with input in the original body |
17:29:34 | Araq | interesting |
17:30:04 | zahary | but that misses the point of reusing existing resources (which is hard problem I agree) |
17:32:05 | zahary | my point is that snoop doesn't buy that much in these chained calls - you still have temporary objects and reallocations? |
17:32:08 | Araq | apriori|: there is a codegen bug returning a 'var array', watch out |
17:32:22 | apriori| | I don't even get to that now
17:32:31 | apriori| | got some issues with [] for component access already
17:33:05 | apriori| | []= catches the assignment just fine.. but I don't get other stuff to work (and actually, only a proc that returns the ref should do, but that's not working)
17:33:20 | apriori| | and I guess.. because [] is not allowed as a normal proc name? just a guess |
17:33:43 | Araq | dunno -.- |
17:35:08 | apriori| | yup, [] not allowed as normal non-op proc |
17:38:44 | apriori| | oh fuck |
17:39:24 | fowl | proc foo[]() = is foo() with no generics |
17:39:41 | apriori| | yeah |
17:42:56 | apriori| | hm |
17:43:02 | apriori| | this _might_ be a bug too |
17:43:12 | apriori| | in a function, with its return type as value type
17:43:20 | apriori| | shouldnt result be actually "var T"? |
17:43:26 | apriori| | in the context of the function |
17:43:49 | Araq | what makes you think it isn't? |
17:44:05 | apriori| | wait a sec
17:45:06 | apriori| | Araq: http://pastebin.com/jGGWiKb6 |
17:46:06 | Araq | why do you provide the ': T' version?
17:46:18 | Araq | and the '[]=' version?
17:46:28 | apriori| | []= is needed for the assignment
17:46:35 | apriori| | result[i, j] = 0 |
17:46:36 | Araq | that's the bug then |
17:47:24 | apriori| | and []: T is needed for a[i, k] in the line and b[k, j], because those are not var variables
17:48:02 | Araq | hm |
17:48:55 | Araq | I see, we need overloading of 'var T' and 'T' I think |
17:49:06 | apriori| | yep |
17:49:24 | Araq | C++ is actually much better designed than most people realize |
17:49:37 | apriori| | yeah, I figured :P |
17:49:45 | apriori| | because of ref etc... |
17:50:09 | apriori| | brb.. and thank you guys :) |
17:57:06 | Araq | zahary: I'm still not worried about elimination of temporaries, it'll be done once I implemented the optimizer I have in mind |
17:57:58 | Araq | especially call chains are simple to optimize |
17:58:24 | zahary | how will it work? |
17:59:16 | Araq | GCSE with the effect analysis that I outlined days ago |
17:59:41 | Araq | plus copy elimination |
18:00:01 | Araq | temporaries can be dealt with much like 'let' variables |
18:00:01 | zahary | can you provide some example transformations? |
18:00:11 | * | Trixar_za is now known as Trix[a]r_za |
18:00:26 | Araq | essentially 'let' gives us an SSA representation |
18:00:33 | zahary | this is not about eliminating chained assignments |
18:00:49 | Araq | the lack of 'goto' also helps ;-) |
18:01:47 | Araq | zahary: but it is the same problem as register allocation, only simpler |
18:01:56 | Araq | as the number of registers is unbound |
18:02:16 | Araq | you simply merge temporaries whose lifetime does not overlap
18:02:33 | zahary | it's about mutating values in place. I don't see how the low level SSA optimisations can help here as the optimization makes use of high-level knowledge about the string type.
18:02:33 | zahary | describe how x.trim.reverse will work? |
18:03:02 | Araq | well if we have snooping procs: |
18:03:20 | Araq | trim(x, tmp1) |
18:03:28 | Araq | reverse(tmp1, tmp2) |
18:03:53 | Araq | dest = tmp2 |
18:04:25 | Araq | --> |
18:04:42 | Araq | trim(x, tmp1); reverse(tmp1, dest) |
18:05:13 | Araq | and then ... ok, this needs more thoughts ;-) |
18:05:56 | Araq | damn you :P |
18:06:19 | zahary | :) |
18:07:16 | zahary | btw check out this. I was a little bit blown away / humbled by the description of what kind of optimisation is carried out by gcc
18:07:17 | zahary | http://stackoverflow.com/questions/10250419/why-does-gcc-generate-such-radically-different-assembly-for-nearly-the-same-c-co/10251097#10251097 |
18:13:23 | Araq | and yet I never saw a compiler perform anything like that with string ops ;-) |
18:14:20 | zahary | well, that would be because it requires high-level knowledge of the type as I said :) |
18:14:46 | Araq | which is why our optimizer should be concerned only with high level optimizations |
18:15:17 | Araq | I can't see anything expensive in fast_trunc_one btw, I wouldn't worry about it at all |
18:15:33 | Araq | it doesn't access memory |
18:15:48 | Araq | it's irrelevant ;-) |
18:18:20 | zahary | what amazed me is how the compiler figured out from sign = i & 0x80000000; that sign = -sign and then used that to make another non-trivial transformation |
18:19:34 | Araq | I saw the microsoft compiler transform a nontrivial recursion that implements integer add into the 'add' asm instruction :-) |
18:19:59 | Araq | there was still an unnecessary 'jmp' involved iirc |
18:20:09 | Araq | but the recursion had been completely eliminated |
18:20:16 | Araq | and no, it was *not* a tail call |
18:21:21 | zahary | so we should leave the SSA stuff to these guys for a while :) |
18:21:43 | Araq | well I don't plan to transform anything to SSA |
18:22:02 | Araq | but we get it for free: let vars are in SSA |
18:22:18 | Araq | and every compiler introduced temporary is in SSA too |
18:22:36 | zahary | yes, and the C compiler will happily take care of them |
18:23:04 | zahary | we should strive to eliminate high-level code (calls to user copy procs, etc) |
18:23:45 | Araq | true but I still think copy elimination for ints and strings is essentially the same |
18:24:03 | Araq | except that you get it for 'int' for free in the C backend |
18:24:06 | zahary | sure |
18:28:21 | Araq | the new profiler hooks work exactly like a "stop the world" mechanism could work |
18:28:40 | Araq | every proc entry and every loop body is injected with a call to 'nimProfile()' |
18:29:58 | Araq | but I have no idea how expensive that is as the profiler requires stack traces to be turned on |
18:30:07 | Araq | which are expensive |
18:33:21 | zahary | yeah, I snooped the diff :) btw, it may be worthwhile to support some existing report format
18:33:21 | zahary | http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html |
18:33:21 | zahary | there are some visualization scripts with packages like these |
18:33:49 | zahary | maybe dom96 could be interested in such a feature?
18:35:42 | Araq | he's away ;-) |
18:36:28 | Araq | but we should do important work instead, I already played too long with it |
18:38:12 | Araq | btw do you happen to know how fixed length software caches work? |
18:39:48 | Araq | I mean a count table which keeps the entries with the greatest count values
18:40:00 | zahary | most used? |
18:40:04 | Araq | yeah |
18:40:24 | Araq | but it should only use a fixed amount of memory |
18:40:47 | zahary | I've worked on something similar for textures - let me try to recall what was it all about |
18:41:02 | Araq | it can only work stochastically |
18:41:12 | Araq | I think |
18:41:31 | zahary | well, you could go with priority queue, but that wasn't the solution then |
18:46:08 | Araq | a fixed length priority queue does not work as we don't know the priority upfront |
18:46:36 | Araq | the priority is the usage counter which we don't know when profiling |
18:49:15 | zahary | don't quite follow. priority queue will have log(n) updates on each usage. |
18:50:19 | Araq | yes but it's not of a fixed length |
18:50:31 | Araq | if we start with [1, 1, 1] |
18:51:18 | Araq | well ok, let me elaborate: we have stack traces |
18:51:53 | Araq | and we count them |
18:52:06 | Araq | the stack trace with the highest count is the critical path |
18:52:41 | Araq | so if we have 3 slots in our count table, we can keep track of 3 different stack traces |
18:52:58 | Araq | but now a forth stack trace is encountered |
18:53:12 | Araq | and we have to replace 1 slot somehow |
18:53:13 | zahary | I looked at the old texture cache code and it's not overly sophisticated - it doesn't remove one item at a time so purges are linear scans over the whole cache with some value chosen as a threshold for removal in a heuristic way (there is a global counter of texture usages and you can divide the delta usages from the last purge to get the expected average usage)
18:53:45 | * | shevy left #nimrod ("I'll be back ... maybe") |
18:54:12 | zahary | otherwise, these caches are called LFU (least frequently used) - here is a paper that provides O(1) for all operations |
18:54:13 | zahary | http://dhruvbird.com/lfu.pdf |
18:54:14 | Araq | but whatever slot we pick, it could be that that's the stack trace that ends up getting the greatest count value |
18:54:29 | zahary | uses hash table and linked lists |
18:55:07 | Araq | I'm not concerned about O(1), I'm concerned about a replacement strategy that guarantees say, "won't be wrong with p probability" |
18:55:50 | Araq | ok, so I want a fixed size LFU ;-) |
18:56:17 | zahary | you can have a max size before starting to purge with the above algorithm |
18:56:27 | zahary | seems simple to implement |
18:56:27 | Araq | which can only be done stochastically |
18:57:47 | zahary | the algorithm is not stochastic in nature, but the results still won't be precise (some stack can be purged and placed back into the cache multiple times) |
18:58:59 | zahary | I've supervised a memory profiler implementation that used stack traces as keys btw - let me try to recall how that worked too :)
19:01:34 | Araq | currently I use a hash table that tracks the max chain length |
19:02:29 | Araq | if the table is full, one element on the hash chain is replaced |
19:02:40 | Araq | the element with the minimal count |
19:03:21 | Araq | the max chain length is used so that it doesn't compute the minimum over all slots |
19:04:12 | zahary | how do you get the element with the minimal count? |
19:04:53 | Araq | as I said, I do: for i in 0..maxChainLength-1: # follow the chain |
19:05:19 | Araq | when finding the element in the chain, it's not a new element and we can update its counter |
19:05:41 | Araq | but we also keep the index of the minimum while traversing the chain |
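[editor's note] Araq's scheme (probe a bounded chain; bump the key if found, fill an empty slot if any, otherwise overwrite the chain's minimum-count entry) might look roughly like this; the open-addressing details and names are illustrative, not the actual compiler code:

```python
def record(table, counts, key, max_chain, hashfn=hash):
    """Fixed-size count table with chain-minimum replacement:
    probe at most max_chain slots starting at the key's home slot."""
    n = len(table)
    home = hashfn(key) % n
    min_i = home  # index of the smallest count seen on this chain
    for step in range(max_chain):
        i = (home + step) % n
        if table[i] == key:
            counts[i] += 1          # existing entry: update its counter
            return
        if table[i] is None:
            table[i] = key          # free slot: insert
            counts[i] = 1
            return
        if counts[i] < counts[min_i]:
            min_i = i               # remember the chain's minimum
    table[min_i] = key              # chain full: evict the least-counted entry
    counts[min_i] = 1
```

This keeps eviction O(max_chain) without scanning the whole table, which matches the stated motivation for tracking the max chain length.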
19:05:52 | zahary | the algorithm from the paper just adds a linked list next to the table that is updated with insertion sort - since the updates always change the usage count by 1, you know that the insertion sort will have O(1) cost
19:06:43 | Araq | the problem is: every new element starts with counter 1 |
19:07:12 | Araq | but it still needs to be added because it could become the critical stack trace |
19:07:39 | zahary | and after some time, the newcomers are always the prime candidates for purging? |
19:08:17 | Araq | no ;-) |
19:08:28 | Araq | I keep no "age" |
19:08:39 | Araq | once added it's treated equally to everything else |
19:08:47 | Araq | hrm, that sounds stupid |
19:08:54 | Araq | :D |
19:09:26 | Araq | oh well the table is big enough anyway |
19:09:43 | Araq | the code dealing with purging is never executed in practice |
19:09:55 | zahary | so what's the problem again? |
19:10:04 | zahary | newcomers start with 1, so? |
19:10:07 | Araq | I'm interested in what's out there |
19:10:39 | Araq | what I've implemented works good enough already |
19:10:40 | zahary | you can start newcomers as the current minimum so they don't get unfair disadvantage |
19:11:27 | Araq | that could work, yes |
19:11:33 | zahary | otherwise, O(1) sounds like you can't get better than this |
19:12:39 | Araq | as I said I'm interested in some numbers about how probable it is that the critical stack trace will get purged incidentically |
19:14:02 | zahary | aha |
19:15:22 | zahary | btw, most profilers I've worked with don't seem to purge anything (I'm not entirely sure tho) - you see functions that were called only once in the report
19:15:39 | Araq | yeah I'm aware ;-) |
19:16:07 | Araq | but it seemed risky to let it allocate |
19:18:31 | Araq | now it allocates anyway, but I can easily change that |
19:21:51 | zahary | btw I've seen another inaccurate strategy used in some intrusive profilers |
19:23:26 | zahary | you don't insert the real stack traces in a cache, but rather assign a single record for each proc (this record resides in a static variable, so it's very cheap to update) |
19:23:35 | * | XAMPP joined #nimrod |
19:24:05 | zahary | so for functions with low fan-in it works relatively accurately, but for high fan-in you get results that are not very useful
19:24:45 | Araq | I can't follow |
19:24:58 | Araq | there is a static variable per proc, ok |
19:25:12 | Araq | what does the variable contain? a counter? |
19:25:26 | zahary | and you just measure the time spent in that proc and its children |
19:25:44 | Araq | that's exactly what my old profiler did :-) |
19:25:55 | zahary | aha, ok |
19:26:13 | Araq | but it was as useful as gprof |
19:26:17 | zahary | in practice, when you get to pick your own functions to profile, the inaccuracies potentially don't matter that much
19:26:49 | zahary | hmm, I disagree with the comparison with gprof
19:27:04 | Araq | so why reinvent gprof? the name mangling is not bad enough to justify it |
19:28:03 | zahary | you still get meaningful nested report like this: |
19:28:03 | zahary | readNetwork 10% |
19:28:03 | zahary | updateWorld 20% |
19:28:03 | zahary | simulatePhysics 14% |
19:28:04 | zahary | runAI 5% |
19:28:04 | zahary | runMobs 3% |
19:28:05 | zahary | renderWorld 40% |
19:28:26 | zahary | gprof only shows leaves
19:28:41 | Araq | no, it also shows total time |
19:29:01 | Araq | I think ... |
19:29:05 | zahary | but not for functions that are up the stack
19:29:17 | zahary | it just looks at the program counter to see in which function you are currently in |
19:32:01 | Araq | http://www.linuxselfhelp.com/gnu/gprof/html_chapter/gprof_5.html |
19:32:17 | Araq | "the call graph" |
19:33:30 | zahary | the disclaimer is that I haven't actually used it, but that's how people explain it usually: |
19:33:30 | zahary | http://stackoverflow.com/questions/6328673/which-is-the-most-reliable-profiling-tool-gprof-or-kachegrind |
19:33:30 | zahary | I've experience only with sampling profilers |
19:38:21 | Araq | ok |
19:38:36 | Araq | but gprof is also a "sampling" profiler, isn't it? |
19:39:45 | Araq | btw profiling shows the copyTree calls for semchecking of calls is really expensive |
19:40:31 | Araq | but we can get rid of them once I rewrote sigmatch with all the changes we have in mind |
19:41:54 | zahary | you mean the copyTree in DirectOp? |
19:42:41 | Araq | there is also a copyTree for each arg in sigmatch |
19:44:05 | zahary | when I tested it in 4 different profilers I got consistent results, but the copyTree in DirectOp at least was very cheap |
19:44:19 | apriori| | Araq: can you give me a simple example for a tr macro which wants to replace a call to a generic function (not a specific one) with something? |
19:45:37 | Araq | the copyTree produces much GC pressure I think |
19:46:52 | zahary | or the profiler overestimates recursive procs |
19:46:58 | Araq | apriori|: template t{f(X)}(f: expr, X: varargs[expr]): expr = g(X) |
19:47:27 | Araq | no because the stack trace did not show copyTree as the leaf |
19:47:35 | zahary | … small recursive procs - the issue with intrusive profilers I've brought up before
19:47:38 | Araq | it showed it as the *reason* :P |
19:47:46 | apriori| | Araq: I meant a function with a generic parameter |
19:48:02 | apriori| | Araq: and a named function.. not all possible calls |
19:48:08 | Araq | apriori|: it's eliminated |
19:48:15 | apriori| | hm? |
19:48:35 | apriori| | so I can only match on specific calls to a generic function or what do you mean? |
19:48:47 | Araq | template f{nameHere(X)}(X: varargs[expr]): expr = g(X) |
19:49:08 | Araq | nameHere[int, int](a, b, c) # should be replaced with: |
19:49:14 | Araq | g(a, b, c) |
19:49:33 | apriori| | hm, okay, thank you |
19:52:19 | Araq | btw even easier and should also work: template t{nameHere}: expr = g |
19:53:02 | apriori| | well, let me be more precise: |
19:53:31 | apriori| | template optimizeCrossOfSameVector{cross(a, b)}(a: expr{alias}, b: expr) = [ 0, 0, 0 ] |
19:53:37 | apriori| | the problem here, is: |
19:54:13 | apriori| | I need to know the actual type of an element in a or b so I can properly construct that array literal |
19:54:42 | Araq | a aliases b? |
19:54:50 | apriori| | or vise versa |
19:57:36 | Araq | proc default[T](): T {.inline.} = nil |
19:57:59 | Araq | template ... = default[type(a)]() |
19:58:48 | Araq | you can also construct the [] in a macro but why bother |
20:01:47 | fowl | if you have a generic ref, how do you set the free proc |
20:01:57 | fowl | this is what i tried https://gist.github.com/3734130 |
20:02:39 | Araq | new(result, free2[A]) |
20:03:38 | fowl | ah |
20:05:59 | apriori| | Araq: btw., are tr macros automatically active? |
20:06:13 | Araq | yeah |
20:06:25 | apriori| | hm |
20:11:52 | apriori| | Araq: what is the scope of tr macros? |
20:12:20 | Araq | everything in the same module after declaration
20:12:30 | apriori| | no way to export it? |
20:12:34 | Araq | they don't respect scope |
20:12:40 | Araq | export with * |
20:13:01 | apriori| | grrr... |
20:13:15 | apriori| | actually obvious o_O |
20:21:08 | apriori| | Araq: alias check doesn't seem to work (or works different than I think): |
20:21:15 | apriori| | template optimizeCrossOfSameVector*{cross(a, b)}(a: expr{alias}, b: expr) : expr = |
20:21:16 | apriori| | [ 0.0, 0.0, 0.0 ] |
20:21:34 | apriori| | this should actually only replace, if a is the same array as b |
20:21:51 | apriori| | I now went for the simpler pattern cross(a, a)(a: expr) |
20:26:32 | Araq | strange |
20:27:15 | apriori| | I guess tomorrow is bug filing day ;) |
20:27:28 | Araq | yeah |
20:30:27 | zahary | Araq, how much of the total time is spent in copyTree according to the built-in profiler? |
20:30:52 | Araq | I don't know, I'm improving it and will measure again soon |
20:31:16 | zahary | I did some profiles with Instruments on mac os and Very Sleepy on windows so I can compare |
20:31:46 | Araq | well I can't do "total time spent in" |
20:32:05 | Araq | I can only do "frequently encountered stack traces" |
20:33:42 | * | moxie quit (Remote host closed the connection) |
20:34:47 | zahary | is there percent of the samples for which it was in the trace? |
20:35:08 | Araq | yeah, just wait a sec ;-) |
20:41:38 | Araq | copyTree 36/1992 = 1.8% |
20:41:52 | Araq | ok, can't reproduce now that I improved the profiler |
20:42:51 | Araq | well it's still significant, but not the worst offender, |
20:42:55 | Araq | the stack trace is: |
20:42:58 | Araq | Entry: 9/770 Calls: 4/879 = 0.46% [sum: 77; 77/879 = 8.8%] |
20:43:00 | Araq | add 11/1992 = 0.55% |
20:43:02 | Araq | addZCT 4/1992 = 0.20% |
20:43:04 | Araq | doOperation 5/1992 = 0.25% |
20:43:06 | Araq | nimGCvisit 5/1992 = 0.25% |
20:43:07 | Araq | forAllChildren 5/1992 = 0.25% |
20:43:09 | Araq | CollectZCT 8/1992 = 0.40% |
20:43:10 | Araq | collectCTBody 17/1992 = 0.85% |
20:43:12 | Araq | collectCT 17/1992 = 0.85% |
20:43:13 | Araq | rawNewObj 14/1992 = 0.70% |
20:43:15 | Araq | newObj 16/1992 = 0.80% |
20:43:16 | Araq | newNode 8/1992 = 0.40% |
20:43:18 | Araq | copyTree 36/1992 = 1.8% |
20:43:19 | Araq | matchesAux 24/1992 = 1.2% |
20:43:21 | Araq | matches 23/1992 = 1.2% |
20:43:22 | Araq | resolveOverloads 35/1992 = 1.8% |
20:43:24 | Araq | semOverloadedCall 44/1992 = 2.2% |
20:43:25 | Araq | ... 96/1992 = 4.8% |
20:43:27 | Araq | CommandCompileToC 101/1992 = 5.1% |
20:43:29 | Araq | MainCommand 101/1992 = 5.1% |
20:43:30 | Araq | HandleCmdLine 101/1992 = 5.1% |
20:44:14 | Araq | the CRC stuff wins again |
20:44:31 | Araq | Entry: 0/770 Calls: 21/879 = 2.4% [sum: 21; 21/879 = 2.4%] |
20:44:32 | Araq | updateCrc32 4/1992 = 0.20% |
20:44:34 | Araq | newCrcFromRopeAux 3/1992 = 0.15% |
20:44:36 | Araq | crcFromRope 3/1992 = 0.15% |
20:44:43 | Araq | is the top entry |
20:44:50 | zahary | hmm, I'm getting much more actually, weird |
20:45:40 | zahary | I'm getting 12.5% now - indeed mostly because of GC functions after it |
20:45:53 | Araq | I'm using -d:release --stackTrace:on |
20:46:24 | Araq | well the copyTree is really bad in any case: |
20:46:29 | Araq | think about nested calls |
20:46:39 | Araq | we copy them again and again |
20:46:54 | Araq | we copy the whole call |
20:47:04 | Araq | and each argument again |
20:47:09 | zahary | my profile says the worst copyTrees are these in matchesAux |
20:47:32 | Araq | that's what I'm talking about, yeah |
20:47:33 | zahary | this makes sense as they are performed once per-candidate |
20:48:06 | Araq | Entry: 10/770 Calls: 4/879 = 0.46% [sum: 81; 81/879 = 9.2%] |
20:48:08 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:10 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:11 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:13 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:15 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:16 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:18 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:19 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:21 | Araq | genericAssignAux 45/1992 = 2.3% |
20:48:22 | Araq | genericAssign 6/1992 = 0.30% |
20:48:24 | Araq | resolveOverloads 35/1992 = 1.8% |
20:48:25 | Araq | this is also pretty bad |
20:48:35 | Araq | I'm not even sure where the call to genericAssign comes from |
20:48:56 | zahary | 1.8% total for genericAssignAux here |
20:49:39 | Araq | well I take it that means my profiler works |
20:50:27 | Araq | but it's not surprising really, calls and nkDotExpr are everywhere
20:50:36 | Araq | so it's important to optimize these |
20:51:18 | Araq | it doesn't make sense to optimize semYield because there are so few 'yield' statements in Nimrod code ;-) |
20:52:17 | Araq | oh and I compute the percentages wrong :D |
20:52:32 | zahary | how much do you get for collectbody ? |
20:53:15 | Araq | collectCTBody 17/1992 = 0.85% |
20:53:32 | Araq | but I'm not sure how to compute it |
20:53:44 | Araq | the 1992 refers to the total stack slots |
20:53:57 | Araq | but it should be for unique stack slots, right? |
20:54:18 | zahary | 1992 total slots in your cache table? |
20:54:42 | zahary | it should be real active slots |
20:55:00 | Araq | no 1992 slots in total |
20:55:15 | Araq | 879 stack traces |
20:55:53 | zahary | so what's 1992? capacity (that's what I meant) |
20:56:32 | Araq | well it occurs in 12 stack traces out of 49 |
20:56:58 | Araq | 1992 is the length of all stack traces together I think |
20:57:29 | zahary | 12/49 is much closer to what I measure |
20:57:46 | Araq | yeah, I should record that instead |
20:57:52 | zahary | I get 22.3 |
20:57:55 | zahary | % |
20:58:23 | Araq | close enough :-) |
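The two denominators being compared above can be sketched as follows (a minimal illustration with hypothetical traces, not the actual profiler code): dividing a proc's frame count by the combined length of all traces (the "1992" figure) understates its share; the meaningful ratio is how many traces it appears in out of the total number of traces (the "12/49" figure).

```python
# Hypothetical sampled stack traces, ordered root-first.
traces = [
    ("main", "sem", "collectCTBody"),
    ("main", "sem", "semExpr", "semStmt"),
    ("main", "cgen"),
]

# Wrong denominator: total stack slots across all traces (9 here).
frames = sum(len(t) for t in traces)
slot_share = sum(t.count("collectCTBody") for t in traces) / frames

# Right denominator: number of traces the proc appears in, over all traces.
trace_share = sum("collectCTBody" in t for t in traces) / len(traces)

print(round(slot_share, 2), round(trace_share, 2))  # 0.11 0.33
```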
20:58:48 | zahary | you don't have functions like setjmp or memcpy on your radar, right? |
20:59:04 | zahary | no way to get to them at an interruption point
20:59:20 | Araq | that doesn't matter, does it? |
20:59:35 | Araq | it will be attributed to the proc calling these |
21:00:17 | zahary | yes |
21:19:31 | Araq | it's now collectCTBody 67/787 = 8.5% |
21:19:55 | Araq | it occurs in 67 out of 787 stack traces |
21:20:10 | Araq | however, that includes stack traces which occur only once |
21:20:20 | Araq | dunno if that's the correct way to count it |
21:20:38 | Araq | a stack trace that occurs only once is random noise ... |
21:21:16 | Araq | however only 44/787 occur more than once ... |
21:22:29 | zahary | why random noise - a stack trace that occurs once is just half as good as a stack trace that occurs twice
21:23:14 | Araq | hm |
21:23:50 | Araq | I am thinking about init time |
21:24:06 | Araq | oh, so it's once in processCmdLineArgs() ... |
21:26:12 | Araq | nah, it's wrong |
21:26:28 | Araq | if it occurs in a stack trace that occurred 3x
21:26:35 | Araq | it should have weight 3 |
21:28:55 | zahary | if there are 100 traces collected at equal intervals of time and proc X is present on 16 of them, then it's likely that 16% of the program's time can be eliminated by deleting proc X
21:29:38 | Araq | now that's a useful statement |
21:30:18 | Araq | so weight 3 for a stack trace that occurred 3x is correct, right?
21:31:15 | zahary | well, what do you do with that weight? |
21:31:24 | Araq | sum over it |
21:31:34 | zahary | ok, then it's right |
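The weighting rule agreed on above can be sketched like this (names and traces are hypothetical): each distinct stack trace carries a count of how often it was sampled, and a proc's share of run time is the summed counts of the traces containing it, divided by the total number of samples.

```python
from collections import Counter

def proc_share(traces: Counter, proc: str) -> float:
    """traces maps a stack trace (tuple of proc names) to its sample count."""
    total = sum(traces.values())
    hits = sum(count for trace, count in traces.items() if proc in trace)
    return hits / total

# 10 samples in total; semExpr appears on a trace seen 3x and one seen 4x,
# so it is weighted 3 + 4 = 7 out of 10 samples.
samples = Counter({
    ("main", "compile", "semExpr"): 3,
    ("main", "compile", "semExpr", "semStmt"): 4,
    ("main", "compile", "cgen"): 3,
})
print(proc_share(samples, "semExpr"))  # 0.7
```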
21:31:55 | Araq | I get: collectCTBody 77/789 = 9.8% then |
21:32:25 | Araq | nah arghhh |
21:32:35 | Araq | it's too late already |
21:33:05 | Araq | it should be 900 cause that's the number of samplings |
21:33:46 | Araq | making it 8.5% |
21:33:59 | Araq | that's not in line with your results, is it? |
21:38:35 | zahary | I'm getting 22% on both systems |
21:39:14 | zahary | what happens when you change the sampling frequency? |
21:41:39 | Araq | well it's a bit weird |
21:41:46 | Araq | the frequency is 5ms |
21:42:06 | Araq | and the program runs for 5.1 secs |
21:42:17 | Araq | which means I should get 1000 samples
21:42:24 | Araq | but I get only 899 samples |
21:42:50 | Araq | which is suspicious |
21:44:02 | Araq | I guess sampling itself is that expensive |
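A back-of-envelope check of the numbers above: a 5 ms sampling interval over a 5.1 s run should yield roughly 1020 samples, so seeing only 899 suggests each sample (the stack walk plus recording) costs real time. This is only an estimate under the assumption that the missing samples are entirely due to per-sample overhead.

```python
run_time_s = 5.1
interval_s = 0.005
expected = run_time_s / interval_s   # ~1020 samples at a perfect 5 ms cadence
observed = 899

# If sampling overhead stretched the effective interval, each sample cost
# roughly this much extra time:
overhead_s = run_time_s / observed - interval_s
print(round(expected), round(overhead_s * 1000, 2))  # 1020 0.67
```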
21:46:22 | Araq | 3332 samples for 1ms frequency :-/ |
21:47:07 | Araq | collectCTBody 278/3332 = 8.3% now |
21:47:36 | Araq | semExpr 1475/3332 = 44.% btw |
21:50:21 | zahary | what's 60% and more? |
21:50:29 | zahary | btw, there is a big difference between a clean build and a rebuild
21:50:50 | zahary | I usually benchmark the rebuild so the C compiler is out of the picture
21:50:57 | Araq | so do I |
21:51:12 | Araq | koch boot -d:release --profiler:on --stackTrace:on |
21:51:16 | zahary | well, semExpr is 80+% for me |
21:52:24 | Araq | CommandCompileToC 3332/3332 = 1.0e+02% |
21:52:29 | Araq | aka 100% |
21:53:52 | Araq | well I know why it's only 44% for semExpr |
21:54:03 | Araq | I don't record whole stack traces |
21:54:16 | Araq | only 20 entries are in a stack trace |
21:54:30 | zahary | aha, I was about to ask this on the first result for collect |
21:54:44 | Araq | so if semExpr is in the "..." section, it is not recorded |
21:55:12 | Araq | last 3 entries are always the stack bottom |
21:55:22 | Araq | so it's accurate for CommandCompileToC |
21:59:43 | Araq | lookupInRecordAndBuildCheck 136/3332 = 4.1% |
22:02:33 | zahary | 7% - it's consistently half the time |
22:03:02 | Araq | same as semexpr applies for it I guess |
22:03:21 | Araq | newCrcFromRopeAux 184/3332 = 5.5% |
22:03:40 | zahary | 6.5% |
22:03:50 | Araq | crcFromFile 57/3332 = 1.7% |
22:04:12 | zahary | 0.8% |
22:04:29 | zahary | but I have SSD disk |
22:04:45 | Araq | should be in RAM anyway |
22:05:00 | Araq | but maybe linux's caching isn't as good as I think it is |
22:05:26 | Araq | copyTree 387/3332 = 12.% |
22:05:41 | zahary | yep, me too |
22:06:10 | zahary | rawalloc is a leaf for me 84% of the time
22:06:15 | zahary | should be shallow for you too |
22:06:31 | Araq | indeed it is |
22:06:46 | zahary | 10% |
22:07:07 | Araq | newObj 423/3332 = 13.% |
22:07:47 | Araq | newObjRC1 89/3332 = 2.7% |
22:07:48 | Araq | newSeqRC1 114/3332 = 3.4% |
22:08:03 | zahary | newObj 23.3%, leaf only 14% |
22:08:37 | zahary | newSeq 6.8% |
22:08:45 | Araq | huh, how can this ever be a leaf?
22:09:06 | zahary | 6.8% total time |
22:09:37 | zahary | any function can be leaf if the profiler interrupted execution somewhere within the body |
22:10:01 | Araq | I see |
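The leaf-versus-total distinction just explained can be illustrated with a few hypothetical samples: a proc is the leaf of a sample when the profiler interrupted execution inside its own body, but it accrues total time whenever it appears anywhere on the stack.

```python
from collections import Counter

# Hypothetical samples, ordered root-first; the last entry is the leaf.
samples = [
    ("main", "newSeq", "rawAlloc"),   # interrupted inside rawAlloc
    ("main", "newSeq", "rawAlloc"),
    ("main", "newSeq"),               # interrupted inside newSeq's own body
    ("main", "copyTree"),
]

leaf = Counter(trace[-1] for trace in samples)             # self time
total = Counter(p for trace in samples for p in set(trace))  # inclusive time
print(leaf["newSeq"], total["newSeq"])  # 1 3
```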
22:10:43 | Araq | so it's accurate with a factor of 2 :-) |
22:10:46 | zahary | try raising the depth limit to see whether these 2x discrepancies are indeed coming from it |
22:11:07 | Araq | good idea |
22:12:34 | Araq | semExpr 2041/3349 = 61.% with a depth limit of 35 |
22:12:58 | Araq | so I guess it'll reach your number if I increase it further
22:13:24 | Araq | newObj 427/3349 = 13.% |
22:13:40 | Araq | but it's always near top anyway |
22:13:58 | Araq | copyTree 373/3349 = 11.% same |
22:16:47 | zahary | how much do you get for rawalloc? |
22:17:01 | Araq | can't measure it |
22:17:20 | zahary | it has profiling off? |
22:17:28 | zahary | it's a nimrod proc, no? |
22:17:38 | Araq | yeah but I turned it off |
22:19:19 | Araq | semExpr 2041/3349 = 61.% |
22:19:20 | Araq | semExprWithType 1075/3349 = 32.% |
22:19:33 | Araq | this is interesting, I would have guessed they are the same |
22:19:33 | * | Trix[a]r_za is now known as Trixar_za |
22:21:18 | zahary | I get the same |
22:22:19 | zahary | if I break down semExpr, I can see that evalImport is called from it, which in turn carries all the passes on the module (sem, cgen, etc) |
22:22:33 | zahary | so that's how it takes all the time |
22:22:44 | Araq | ah I see |
22:23:10 | Araq | newObj 427/3349 = 13.% |
22:23:12 | Araq | ropef 249/3349 = 7.4% |
22:23:14 | Araq | genTraverseProc 28/3349 = 0.84% |
22:23:48 | zahary | ropef 8.9% |
22:23:49 | zahary | genTraverse 0.2% |
22:24:24 | zahary | newObj 23% - this is that last mystery |
22:25:13 | Araq | is it? maybe it's less severe when everything runs slower due to the stack tracing |
22:26:20 | Araq | scanComment 15/3349 = 0.45% |
22:26:53 | zahary | scanComment 0.2% |
22:27:23 | * | Trixar_za is now known as Trix[a]r_za |
22:28:45 | * | Trix[a]r_za is now known as Trixar_za |
22:32:05 | Araq | but I have to sleep now, good night |
22:32:31 | Trixar_za | Goodnite Araq |
22:53:32 | * | zahary quit (Quit: Leaving.) |
22:54:53 | * | q66 quit (Quit: Quit) |
23:00:13 | * | apriori| quit (Remote host closed the connection) |