
# Beware: the three-parameter Stream.iterate method might fall one short

Bartender
The other day I wanted, for some reason, to get a list of partial sums, given a list of ints. So, if the list had ints {a1, a2, ..., an} I wanted to create a list {a1, a1 + a2, a1 + a2 + a3, ...}.

At first I thought that Stream.iterate() would do the job easily, but then I stumbled over the fact that you must supply a UnaryOperator, while we obviously need a BiFunction, something like: (a, i) -> a + list.get(i).
But there is always a way to transform a function of one parameter into a function of two parameters. I tried:
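The code block has not survived in this archived copy of the thread. Judging from the description, the attempt was presumably along these lines (a hypothetical reconstruction: the running sum and the next index are packed into a two-element array, so a UnaryOperator can play the role of the BiFunction):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartialSums {
    // Three-argument Stream.iterate(seed, hasNext, next):
    // the state array holds {running sum, index of the next element}.
    static List<Integer> partialSums(List<Integer> list) {
        return Stream.iterate(new int[]{list.get(0), 1},       // seed: {sum, next index}
                              a -> a[1] < list.size(),         // hasNext predicate
                              a -> new int[]{a[0] + list.get(a[1]), a[1] + 1})
                     .map(a -> a[0])
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Falls one short: prints [1, 3, 6, 10] -- the final sum 15 is missing,
        // because the predicate already fails for the element that contains it.
        System.out.println(partialSums(List.of(1, 2, 3, 4, 5)));
    }
}
```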

But, to my misery and exactly as the API documentation explains, this falls one short! For instance, for the list {1, 2, 3, 4, 5} you get only the first four partial sums; the final sum never appears.

In this situation a solution is easy, since you know up front how many iterations must take place, so a repair could be:
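The repair itself is also missing from this copy; with the size known up front, the two-argument Stream.iterate() plus limit() does the job (again a sketch, not the original code):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartialSumsFixed {
    static List<Integer> partialSums(List<Integer> list) {
        // Two-argument iterate() plus limit(): since the number of elements is
        // known up front, no hasNext predicate is needed, and nothing is lost.
        return Stream.iterate(new int[]{list.get(0), 1},
                              a -> new int[]{a[0] + list.get(a[1]), a[1] + 1})
                     .limit(list.size())
                     .map(a -> a[0])
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(partialSums(List.of(1, 2, 3, 4, 5)));  // [1, 3, 6, 10, 15]
    }
}
```

Note that limit() is short-circuiting, so the next function is never applied to the last element and list.get() never runs out of bounds.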

But in situations where you do not know that number up front, and the last iteration must be done when the Predicate fails for the first time, you have no alternative but to do that last iteration manually. So beware!

(Of course, using a classic for(x; y; z) loop would save you a lot of difficulty and misery, but it wouldn't be half as much fun.)

The intention was to go from a list of observations to a TreeMap of the sample cumulative distribution function, to make simulations easy, but that is another story.

Master Rancher
Here's a variation I came up with that's close to the original structure, but it avoids the requirement of knowing the list size. Accordingly, it uses an Iterable rather than a List as input:
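Mike's code is not preserved here; one way to match that description is to wrap the running sum in an Iterator and stream it, so no size is needed (a hypothetical sketch):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class IterablePartialSums {
    // Partial sums without knowing the size up front: the wrapping Iterator
    // simply stops when the input does, so nothing falls one short.
    static Stream<Integer> partialSums(Iterable<Integer> input) {
        Iterator<Integer> it = input.iterator();
        Iterator<Integer> sums = new Iterator<>() {
            int sum = 0;
            public boolean hasNext() { return it.hasNext(); }
            public Integer next()    { return sum += it.next(); }
        };
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(sums, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        partialSums(List.of(1, 2, 3, 4, 5)).forEach(System.out::println);  // 1 3 6 10 15
    }
}
```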

Another variant, for fun:
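That variant is lost as well; in the same playful spirit, Arrays.parallelPrefix() computes partial sums in place:

```java
import java.util.Arrays;

public class PrefixSums {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        Arrays.parallelPrefix(a, Integer::sum);   // in-place cumulative sums
        System.out.println(Arrays.toString(a));   // [1, 3, 6, 10, 15]
    }
}
```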

Mike Simmons
Master Rancher
Or:
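The snippet is missing here; based on the discussion that follows (a Collector.of() whose combiner is never called on a sequential stream), it presumably resembled:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

public class CollectorPartialSums {
    static List<Integer> partialSums(List<Integer> list) {
        // Each element is added to the previous running total in the list.
        // The combiner would only run on a parallel stream, so it just throws.
        Collector<Integer, List<Integer>, List<Integer>> toPartialSums = Collector.of(
            ArrayList::new,
            (acc, x) -> acc.add(acc.isEmpty() ? x : acc.get(acc.size() - 1) + x),
            (a, b) -> { throw new UnsupportedOperationException("sequential streams only"); });
        return list.stream().collect(toPartialSums);
    }

    public static void main(String[] args) {
        System.out.println(partialSums(List.of(1, 2, 3, 4, 5)));  // [1, 3, 6, 10, 15]
    }
}
```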

This is a bit troubling though, as it relies on the assumption that the combiner (third argument of Collector.of()) will never be called, since we're not using a parallel stream.  Though truthfully, almost all of these implementations rely on that sort of assumption - otherwise our sums would be totaled in a different order.  In comparison, an old-fashioned non-streaming solution would be clearer:
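That non-streaming version is also missing; the classic loop is short enough to reconstruct with confidence:

```java
import java.util.ArrayList;
import java.util.List;

public class LoopPartialSums {
    static List<Integer> partialSums(List<Integer> list) {
        List<Integer> sums = new ArrayList<>();
        int sum = 0;
        for (int x : list) {
            sums.add(sum += x);   // running total, sequential by construction
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(partialSums(List.of(1, 2, 3, 4, 5)));  // [1, 3, 6, 10, 15]
    }
}
```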

Piet Souris
Bartender
hi Mike,

great alternatives! Although I seldom like Streams with a tailor-made accumulator, since I think you might as well do it the classic way.

The use of a one-element array to remember a previous value is one that I use often myself. Short and simple. This is how I derive my cdf; maybe you see some improvements?
The idea is that I get a List of observations, which I then turn into a TreeMap<T, Double>, the Double being the sample probability of observation T (using the natural ordering of T), and then create the cdf:
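Piet's code did not survive archiving. Judging from the critique that follows (two collect() calls, the 1.0 multiplication, observations.size() inside the lambda, the one-element array), it must have looked roughly like this hypothetical reconstruction:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class Cdf {
    // Sample probability per observation, keyed by the observation.
    static <T extends Number & Comparable<? super T>> TreeMap<T, Double>
            probabilities(List<T> observations) {
        Map<T, Long> counts = observations.stream()
            .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                     e -> e.getValue() * 1.0 / observations.size(),
                     (a, b) -> a, TreeMap::new));
    }

    // Cumulative distribution, keyed by cumulative probability (for sampling).
    static <T> TreeMap<Double, T> cdf2(TreeMap<T, Double> probabilities) {
        TreeMap<Double, T> cdf = new TreeMap<>();
        double[] sum = {0.0};   // one-element array: mutable state inside the lambda
        probabilities.forEach((k, v) -> { sum[0] += v; cdf.put(sum[0], k); });
        return cdf;
    }

    public static void main(String[] args) {
        System.out.println(cdf2(probabilities(List.of(1, 1, 2, 3))));  // {0.5=1, 0.75=2, 1.0=3}
    }
}
```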

Saloon Keeper
I find it really strange that you like using Java's functional features so much Piet, but then shirk the 'proper way' of doing things in preference of using hacks like modifying objects to get around the limitation that you can't modify local variables from lambdas, which is there for a good reason.

The forEach() in your cdf2() method is not functional. It's procedural code. Then why not just use a for-each loop? That way you won't even have to resort to silly things like using the array.

Why is your CDF keyed by probability? A CDF returns a probability given an observation, so it makes sense to keep T as the key.

You use two collect() operations in your cdf() method. That's wasteful. Use a downstream collector to specify what should be done with the values of the map while the groupingBy() is doing its magic.

Why do you multiply by 1.0 to cast your value? That's what the (double) operator is for.

Don't repeat method calls inside loops. Save calls like observations.size() to local variables first.

When you split up methods into smaller chunks (a good thing) make sure to give the chunks descriptive names. I'd call cdf() getCumulativeDistribution() and I'd call cdf2() accumulateDistribution().

Why is T a Number? The only thing that your methods require about T is that it has a natural order.

Mike Simmons
Master Rancher

Stephan van Hulst wrote:Why is your CDF keyed by probability? A CDF returns a probability given an observation, so it makes sense to keep T as the key.

Well, an inverse CDF can also be useful - e.g. if you want to generate values that follow the original probability distribution, you can generate random numbers in [0, 1) and apply the inverse CDF.  If that's the intent (or something similar), then fine.  However, it's misleading to call it a CDF in this case - it's an inverse CDF, or quantile function.

Piet Souris
Bartender
hi Stephan,

thanks for the critiques, even if it is not very egoless!

You're right, the code is kinda sloppy; it shows all the signs of haste. But I'll give some response.

I find it really strange that you like using Java's functional features so much Piet, but then shirk the 'proper way' of doing things in preference of using hacks like modifying objects to get around the limitation that you can't modify local variables from lambdas, which is there for a good reason.

Well, it is such a light local 'offence' that the convenience far outweighs it.

The forEach() in your cdf2() method is not functional. It's procedural code. Then why not just use a for-each loop? That way you won't even have to resort to silly things like using the array.

The map.forEach() allows the very convenient (k,v) -> construct, and for that I'm willing to skip functional code. As you see, it is just a very clear short one-liner.

Why is your CDF keyed by probability? A CDF returns a probability given an observation, so it makes sense to keep T as the key.

The intent is to do sampling, and if I have a stochastic variable X with cdf F(x), then under certain circumstances F(X) is uniformly distributed. If you think of the graph of F(x), the procedure is to draw a number (uniformly) from the y-axis, go horizontally to the right until you cross the graph, and then go vertically down to the x-axis: that is your random sample. The TreeMap methods 'floor' and 'ceiling' are ideal for this.
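A minimal sketch of that lookup, assuming an inverse-CDF TreeMap keyed by cumulative probability (the map contents here are made up for illustration):

```java
import java.util.Random;
import java.util.TreeMap;

public class InverseCdfSampling {
    // Draw u uniformly from [0, 1) and find the first cumulative probability >= u.
    static <T> T sample(TreeMap<Double, T> inverseCdf, Random random) {
        double u = random.nextDouble();               // uniform draw from the y-axis
        return inverseCdf.ceilingEntry(u).getValue(); // smallest key >= u
    }

    public static void main(String[] args) {
        TreeMap<Double, Integer> inverseCdf = new TreeMap<>();
        inverseCdf.put(0.5, 1);    // P(X <= 1) = 0.5
        inverseCdf.put(0.75, 2);   // P(X <= 2) = 0.75
        inverseCdf.put(1.0, 3);    // P(X <= 3) = 1.0
        System.out.println(sample(inverseCdf, new Random()));
    }
}
```

Since nextDouble() is strictly below 1.0 and the last key is exactly 1.0, ceilingEntry() never returns null.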

You use two collect() operations in your cdf() method. That's wasteful. Use a downstream collector to specify what should be done with the values of the map while the groupingBy() is doing its magic.

Good point. Never used that downstream collector so far, but I will read the API again and incorporate it.

I'd call cdf() getCumulativeDistribution()

Hmm, cdf is a very much used abbreviation, so for me it was very clear.

Why is T a Number? The only thing that your methods require about T is that it has a natural order.

Correct, but for now I limit myself to numeric random variables, with their in-built natural ordering.

The whole purpose arose when someone at another site was assigned the question about a mortality table, the X being the age of death of a person who is now x years old. That made me realize I'd never done this in Java so far.

Funny, in Liutauras' topic about the GoL, Junilu talks about egoless programming. Now I find myself defending my code... It IS a long way to Tipperary.

Stephan van Hulst
Saloon Keeper

Piet Souris wrote:thanks for the critiques, even if it is not very egoless!

If, going by the description of peer reviews that I found on the Wiki for egoless programming, you mean that my review wasn't friendly or collegial, I apologize. I try to write in a neutral way but that doesn't always come across through text.

As you see, it is just a very clear short one-liner.

Short yes, clear no. I had to look twice because the combination of the forEach() and the multiple statements on one line first made it look like a regular for loop. I would have taken your point about the forEach being a nice construct to use over an enhanced for-loop, if you had named the variables k and v something descriptive like observation and count, but now you've only used it for the sake of brevity.

Hmm, cdf is a very much used abbreviation, so for me it was very clear.

But as soon as you posted your code for review, it was for a wider audience than just yourself. I always recommend writing identifiers out in full, so there's as little ambiguity as possible. Avoid abbreviations or symbols common to math, unless they are REALLY common, such as constants like PI. Also note that as a method name, cdf is redundant. It's like I suffixed the word "method" to all my method names.

Correct, but for now I limit myself to numeric random variables, with their in-built natural ordering.

Why? Also, in software development, "for now" is never "for now".

Piet Souris
Bartender

Stephan van Hulst wrote:

Piet Souris wrote:thanks for the critiques, even if it is not very egoless!

If, going by the description of peer reviews that I found on the Wiki for egoless programming, you mean that my review wasn't friendly or collegial, I apologize. I try to write in a neutral way but that doesn't always come across through text.

Now I must apologize. I meant to say that accepting comments in an "egoless" way is even harder than I imagined. So far I was the only one looking at my code (or "the" code), and I always felt pleased about it. Your comments put me right back with two feet on the ground. Lot to think about, foremost that code that I think is obvious, need not be obvious at all.

Mike Simmons
Master Rancher
Hi Piet, Stephan.  I'm enjoying this - I hope you are too.

Stephan van Hulst wrote:You use two collect() operations in your cdf() method. That's wasteful. Use a downstream collector to specify what should be done with the values of the map while the groupingBy() is doing its magic.

Aside from the extra collection, I'm a bit troubled by doing all the summation on floating-point values.  Seems like it would be less error prone to do the summation on longs, with nice precise values, and then divide by the observation count at the end.  In this way you can better guarantee that the total probability will be 1.0, rather than 0.999999999756 or some such.
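For instance, counting with longs and dividing only once at the end makes the last cumulative value exactly total/total (a sketch, not code from the thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class LongCounting {
    // Accumulate exact long counts; perform the single division at the end,
    // so the final cumulative probability is exactly 1.0.
    static Map<Integer, Double> toCdf(TreeMap<Integer, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Double> cdf = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<Integer, Long> e : counts.entrySet()) {
            running += e.getValue();
            cdf.put(e.getKey(), (double) running / total);  // single, late division
        }
        return cdf;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> counts = new TreeMap<>(Map.of(1, 2L, 2, 1L, 3, 1L));
        System.out.println(toCdf(counts));  // {1=0.5, 2=0.75, 3=1.0}
    }
}
```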

Stephan van Hulst wrote:Why is T a Number? The only thing that your methods require about T is that it has a natural order.

Good catch, this is true.  To some extent, you may not even need this requirement, for sampling purposes - but it's nice to see the map in a sorted order.  Which I assume is why the extra TreeMap was put there.

Piet Souris wrote:The map.forEach() allows the very convenient (k,v) -> construct, and for that I'm willing to skip functional code. As you see, it is just a very clear short one-liner.

Yeah, that's a nice feature.  Conversely, I don't really like having to use the one-element array trick to work around Java's lack of true closures - I find it's an additional readability impediment, where readability is often determined by how often you (and other readers) have seen a particular construct.  In this case, I favor returning to a more traditional loop so I can just update a normal variable.

Here are the tweaks I came up with, for your consideration:

Or, a slightly less performant but maybe more readable and reusable version:
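Neither version survives in this copy. From Stephan's later review (raw types, a double size parameter, the new map in remap()) the general-purpose pair presumably looked something like this reconstruction, warts included:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GeneralPurpose {
    static <T extends Comparable<? super T>> TreeMap<Double, T>
            cumulativeDistribution(List<T> observations) {
        TreeMap<T, Long> counts = observations.stream()
            .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
        return remap(integrate(counts), observations.size());
    }

    // Running totals of the values, preserving iteration order.
    static <T> Map<T, Long> integrate(Map<T, Long> counts) {
        Map<T, Long> sums = new LinkedHashMap<>();
        long sum = 0;
        for (Map.Entry<T, Long> e : counts.entrySet()) {
            sums.put(e.getKey(), sum += e.getValue());  // assignment used as expression
        }
        return sums;
    }

    // Swap key and value, scaling the cumulative count into a probability.
    // The size parameter is a double purely to force an implicit cast.
    static <T> TreeMap<Double, T> remap(Map<T, Long> sums, double size) {
        TreeMap<Double, T> result = new TreeMap<>();
        sums.forEach((k, v) -> result.put(v / size, k));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cumulativeDistribution(List.of(1, 1, 2, 3)));  // {0.5=1, 0.75=2, 1.0=3}
    }
}
```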

Here both integrate() and remap() become more general-purpose utility methods.  The final call to remap could also be expanded to:
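The expanded call itself is missing too; presumably it was an inline stream over the running totals, something like this (hypothetical, with a concrete Integer map for illustration):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class RemapExpanded {
    // Expanding a remap()-style call inline as a stream over cumulative counts.
    static TreeMap<Double, Integer> remapInline(TreeMap<Integer, Long> sums, double size) {
        return sums.entrySet().stream()
            .collect(Collectors.toMap(
                e -> e.getValue() / size,    // cumulative count -> probability key
                Map.Entry::getKey,
                (a, b) -> a,
                TreeMap::new));
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> sums = new TreeMap<>();
        sums.put(1, 2L);  // running totals for the observations {1, 1, 2, 3}
        sums.put(2, 3L);
        sums.put(3, 4L);
        System.out.println(remapInline(sums, 4));  // {0.5=1, 0.75=2, 1.0=3}
    }
}
```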

which is, once again, more readable, but slower.

In several places, we could also replace List and TreeMap with more general types like Collection and NavigableMap.  But I'm used to the short versions of each, even when they're unnecessarily specific.

Piet Souris
Bartender

Mike Simmons wrote:Hi Piet, Stephen.  I'm enjoying this - I hope you are too.

Sure I do, I had not expected the response at all!

Gosh Mike, I thought I knew a thing or two when it comes to Streams and Lambdas, but reading your great innovations I get the feeling I should have put this topic in the Beginners Forum!
I just gave Stephan a well-deserved cow; I'll give you a well-deserved cow too, tomorrow.

Thanks guys, for the lessons! And if you have more for me: send them in!

Stephan van Hulst
Saloon Keeper

Mike Simmons wrote:Aside from the extra collection, I'm a bit troubled by doing all the summation on floating-point values.  Seems like it would be less error prone to do the summation on longs, with nice precise values, and then divide by the observation count at the end.  In this way you can better guarantee that the total probability will be 1.0, rather than 0.999999999756 or some such.

Strongly agreed. Always perform division as late as possible, and then prefer BigDecimal over Double, unless performance is important and accuracy isn't.

In several places, we could also replace List and TreeMap with more general types like Collection and NavigableMap.  But I get used to the short versions of each, even when they're unnecessarily specific.

Please get into the habit of doing this, especially when creating public method signatures. Many APIs are awkward to use or maintain because they contain types that are more specific than required.

Here are the tweaks I came up with, for your consideration:

• Don't use raw types. Declare your type variable as T extends Comparable<? super T>.
• Use wildcards in method parameters: Map<? extends T, ? extends Long>. I don't feel strongly about doing this with final types (like Long), but it is more consistent and clearly conveys intent.
• Why is the size parameter a double? If the only reason is to perform an implicit cast, then you're sacrificing the principle of least astonishment to make your code just a tiny little bit less verbose.
• Why are you iterating over a copy of the map? You're not modifying the original map, so iterating over it is safe. If the reason is that you need the map to be sorted, then just accept a sorted map.
• Personally I would have your remap() method operate on the current map, rather than a new one. It saves a lot of copying, and if you actually need a copy then just make the copy before you pass it into the method.
• Don't use the return values of assignments in enclosing expressions in any but the most common idioms, such as incrementing an index while accessing an array.

My attempt:

Mike Simmons
Master Rancher
Hi Stephan!

Stephan van Hulst wrote:Strongly agreed. Always perform division as late as possible, and then prefer BigDecimal over Double, unless performance is important and accuracy isn't.

In this case, it's not that double is less accurate - it's that there is no benefit whatsoever to bringing in BigDecimal.  Unless we wanted more than 18 digits in the answer, or are concerned with overflowing the long - which is pretty much impossible here; we could really use an int.

Moreover, it's putting unnecessary work on the clients of this code, adapting to an unnecessarily convoluted API.  This isn't financial data - it's a probability distribution.  When someone uses this TreeMap, they're most likely going to be generating random numbers with Random.nextDouble() and using that as a key to look up the nearest values with floor() and ceiling().  One could convert these random numbers into BigDecimal I suppose - but why?  It's not going to be any more precise, just slower and harder to read.

I also confess to a general prejudice against BigDecimal even for financial data, as I find a properly written round(double val, int digits) method can fix most any problems, and is more readable.  Whereas rounding errors can creep into BigDecimal calculations too (especially where division is concerned, or a value is converted incorrectly from floating point to BigDecimal) but they're harder to find because all code involving BigDecimal is harder to read than equivalent code with double or long.  But that's a more general discussion - in this code, I don't see any reason for BigDecimal at all.

Stephan van Hulst wrote:Don't use raw types. Declare your type variable as T extends Comparable<? super T>.

Agreed, good catch.

Stephan van Hulst wrote:Use wildcards in method parameters: Map<? extends T, ? extends Long>. I don't feel strongly about doing this with final types (like Long), but it is more consistent and clearly conveys intent.

Yeah, I'm less enthusiastic about this one as it makes method signatures even harder to read, but it can have benefits, so OK.

Stephan van Hulst wrote:Why is the size parameter a double? If the only reason is to perform an implicit cast, then you're sacrificing the principle of least astonishment to make your code just a tiny little bit less verbose.

I have a hard time seeing anything astonishing in that code, but that was the reason, yes.

Stephan van Hulst wrote:Personally I would have your remap() method operate on the current map, rather than a new one. It saves a lot of copying, and if you actually need a copy then just make the copy before you pass it into the method.

Hmmm, mutating the input map seems not very FP.

Actually I did consider that - but note that we're also changing the return type, transforming a Map<T, Long> to a Map<Double, T>.  We can hack that within the method with some creative casting tricks, but it won't be pretty.  And anyone retaining a reference to the original map will find some unpleasant surprises if they try to use it.  Also, since we're swapping the key and value, among other things, we need to remove and reinsert the entry anyway for the new key.  Seems much cleaner and more intuitive to do it in a new map.

Stephan van Hulst wrote:Don't use the return values of assignments in enclosing expressions in any but the most common idioms, such as incrementing an index while accessing an array.

In cases where it might be overlooked, I agree.  But in a tiny loop like that it's pretty obvious.  This should be a common idiom.  You're lucky I bothered to put in curly braces around the loop body at all. ;)

Mike Simmons
Master Rancher
Stephan, regarding your Accumulator code - nice!

I see you're still doing a CDF, which is to say a map of observation to cumulative probability, rather than the inverse function - i.e. Map<T, BigDecimal> rather than Map<BigDecimal, T>.  I think the inverse function is what Piet actually needs, as previously noted, but I'll go with your interpretation here.

Also I've already noted my feelings on BigDecimal for this problem, but here I'll accept it and move on.  I guess there is a possible benefit in being able to pass in a MathContext, at least for some applications.

I found one small optimization to make in the combine() method, always merging the smaller map into the bigger one:
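Stephan's Accumulator class itself isn't preserved here, but the idea of the optimization stands on its own: a combiner that always folds the smaller map into the bigger one (a sketch with Long counts for brevity, where the thread used BigDecimal):

```java
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class Combine {
    // Merge the smaller counts map into the bigger one, so the number of
    // insertions is bounded by the smaller map's size.
    static <T> BinaryOperator<TreeMap<T, Long>> combiner() {
        return (a, b) -> {
            TreeMap<T, Long> big   = a.size() >= b.size() ? a : b;
            TreeMap<T, Long> small = big == a ? b : a;
            small.forEach((k, v) -> big.merge(k, v, Long::sum));
            return big;
        };
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> a = new TreeMap<>();
        a.put(1, 2L);
        TreeMap<Integer, Long> b = new TreeMap<>();
        b.put(1, 1L);
        b.put(2, 3L);
        System.out.println(Combine.<Integer>combiner().apply(a, b));  // {1=3, 2=3}
    }
}
```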

As for the general design... I see you're aggressively reusing the same Map instance throughout.  That can work.  But I feel it imposes a lot of costs as well: doing all your counting by adding BigDecimal.ONE for every single observation, when long would be much faster, and doing a log(N) TreeMap lookup for every access, while you only need the sorted nature of the map after the counting has been done.  I'm thinking it's better to let Collectors.groupingBy() and counting() do most of that work.  If BigDecimal is desired, it is really only needed for the division; that can be done in a downstream transformation.  And if we really want to reuse the map rather than recopying it, we can still do that too, with a little... ummm... questionable casting. ;)

Of course, if you reverse the map as I think Piet intended, then you might as well make at least one new Map along the way, since you need to map on different keys.  But here we're assuming that is not what is needed.

Mike Simmons
Master Rancher
Or, just for fun, the condensed version:
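The condensed version is lost as well; one plausible single-statement form, keeping the CDF keyed by observation as discussed above (hypothetical, with the usual one-element-array trick):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import static java.util.stream.Collectors.*;

public class Condensed {
    static TreeMap<Integer, Double> cdf(List<Integer> obs) {
        double n = obs.size();
        double[] running = {0.0};   // running cumulative probability
        return obs.stream().collect(collectingAndThen(
            groupingBy(t -> t, TreeMap::new, counting()),
            counts -> counts.entrySet().stream().collect(toMap(
                Map.Entry::getKey,
                e -> running[0] += e.getValue() / n,   // accumulate in sorted order
                (a, b) -> a,
                TreeMap::new))));
    }

    public static void main(String[] args) {
        System.out.println(cdf(List.of(1, 1, 2, 3)));  // {1=0.5, 2=0.75, 3=1.0}
    }
}
```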

Piet Souris
Bartender
BigDecimals are out of the question.

Nowadays mortality tables come (per gender) as a 2D array (the development package we use does not have Maps as far as I know), where the first dimension is the age (currently up to 120), and the second dimension is the year of consideration, so a 30-year-old has 120 different probabilities of dying at that age. Traversing a person through time means walking along the diagonals.

Lastly, and then I'll shut up: I tried to implement the collectingAndThen, expecting it to be easy. This is what I came up with:

but I can't check whether it is correct. I seem to have stumbled upon a NetBeans bug (9RC): with every key press I get IndexOutOfBoundsExceptions, AssertionErrors, et cetera, I keep sending bug reports to Apache, and when I try to run the code, I get a ClassNotFoundError. Hair-graying misery.

Mike Simmons
Master Rancher
Hi Piet - thanks for clarifications.

I'm not clear which code you're using that latest snippet with, but I don't believe it works.  The second argument to collectingAndThen() needs to be something like a

Function<TreeMap<T, Long>, TreeMap<T, Double>>

whereas you have a

Function<Long, Double>

Piet Souris
Bartender
hi Mike,

the API says that the Collector will give you a Collector<T, A, R>, and the finisher should be a Function<R, RR>, so as I read it I should indeed use a Function<Long, Double>.
Here's the API

public static <T,A,R,RR> Collector<T,A,RR> collectingAndThen(Collector<T,A,R> downstream,
                                                             Function<R,RR> finisher)

I tried it in a different context, but still I got all these errors. Time to update my NetBeans, maybe?

Piet Souris
Bartender
I apologize, you are absolutely right, I misunderstood the API.

But wouldn't using a Function<Map<T, Long>, Map<T, Double>> require an additional Map - the very thing I was criticized for in the first place?

What efficient method do you have for the Function?

Mike Simmons
Master Rancher
Well, in the first place, I wouldn't be so concerned about creating a new Map now and then - sometimes it's warranted. Especially if the keys are changing, as was the case in your original code.  Didn't you originally want a TreeMap<Double, T> rather than a TreeMap<T, Double>?  If that's the ultimate goal, you might as well do it all at once, and the cost of a new map is essentially nothing, since you need to reinsert each entry anyway.  I originally did this with that custom remap() function I wrote on the fly, but I now realize Collectors.toMap() works as well:
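The snippet didn't survive archiving; the Collectors.toMap() approach presumably inverted the counts into a TreeMap<Double, T> in one pass, along these lines (names hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class InvertCounts {
    // Turn a sorted counts map into an inverse-CDF map keyed by cumulative
    // probability; building the new map costs no more than re-inserting entries.
    static <T> TreeMap<Double, T> toInverseCdf(TreeMap<T, Long> counts, double total) {
        double[] running = {0.0};   // running cumulative probability
        return counts.entrySet().stream()
            .collect(Collectors.toMap(
                e -> running[0] += e.getValue() / total,  // new key
                Map.Entry::getKey,
                (a, b) -> a,
                TreeMap::new));
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> counts = new TreeMap<>();
        counts.put(1, 2L);
        counts.put(2, 1L);
        counts.put(3, 1L);
        System.out.println(toInverseCdf(counts, 4));  // {0.5=1, 0.75=2, 1.0=3}
    }
}
```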

However if you want a TreeMap<T, Double>, and are willing to mutate the input map (fine here since you just created it with the counting collector, and no one else has a reference) then you can use some creative casting to accomplish what you want even faster:

The above code generates warnings for unchecked casts, which can be cleaned up with the previously shown coerciveCast util method:

This works as long as you know the thing passed in really is a TreeMap, and the values can really be any Object type.  So the same map can have new types for its values.  But make sure no one tries to access the original map using the TreeMap<T, Long> reference, as that will probably throw ClassCastException if you try to access any Long values in the map - they aren't Longs any more.

Piet Souris
Bartender

Mike Simmons wrote:(...) Didn't you originally want a TreeMap<Double, T> rather than a TreeMap<T, Double>?

The reason I want to have a Map<T, Double> is that my input will probably be in that form, or even a Map<T, Map<Year, Double>> (see my remark about modern mortality tables).

I used toMap myself in the Function of collectingAndThen, but the method, used this way, is less useful than I thought.

Well, anyway, I have gained experience and some wise lessons, so: thanks Stephan and Mike!

Greenhorn
thank you... fruitful article

Bartender
Congratulations Piet Souris,