Alicia Perry

Ranch Hand
since Jun 27, 2014

Recent posts by Alicia Perry

Good news, now it all works! I fixed it using Piet's advice.
That is, the tree now has the "empty" nodes to indicate the end of a word, and the entropy of each node changes accordingly.

Thank you so much :-)
7 years ago
Thank you both for your help and I apologize for the delay in responding.

Winston Gutkowski wrote:Actually, you've stored a bit more than that, and I wonder if some of the things that you store aren't redundant.


True, but honestly in this case I don't mind: it's not really a programming exercise, the important thing is to get the correct results. Let's say it is more of a "theoretical" exercise.

Winston Gutkowski wrote:Also: DON'T make instance variables public. EVER.


You're absolutely right, but to save time I preferred to do it that way.

Winston Gutkowski wrote:Well, Java has a very useful structure called an IdentityHashMap, that allows you to quickly retrieve values for specific objects. So, you could either set up one for each value, or you could create a NodeInfo class that contains all "non-essential" information about a TrieNode, and store that in an IdentityHashMap<TrieNode, NodeInfo>.

Then, whenever you encounter a TrieNode and you want to know its "count", you simply look it up in the Map.
Some of these values (eg, 'count') could also probably be calculated while the Trie is being built; although I have to admit that it "smells" like a value that might be better calculated via a method as required - but TBH I don't know.


I probably didn't use the most efficient data structure but, as I said before, the important thing is to get the correct results, and the Trie still lets me work on the dictionary.
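(For reference, a minimal sketch of the IdentityHashMap idea Winston describes; NodeInfo and its fields are assumptions rather than code from this thread, and TrieNode is the node class discussed above:)

    import java.util.IdentityHashMap;
    import java.util.Map;

    // Hypothetical holder for the "non-essential" per-node information.
    class NodeInfo {
        int count;       // how many words pass through the node
        double entropy;  // entropy of the next character
    }

    class TrieNodeInfoMap {
        // IdentityHashMap compares keys by reference, so every TrieNode object
        // gets its own entry even if two nodes were to compare equal().
        private final Map<TrieNode, NodeInfo> info = new IdentityHashMap<>();

        // Look up (or lazily create) the extra data for a node.
        NodeInfo infoFor(TrieNode node) {
            return info.computeIfAbsent(node, n -> new NodeInfo());
        }
    }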

Piet Souris wrote:
Not sure about your last reply. The way I look at it is what I already wrote. Given that I have an 'a' at level 1, then according to me there are two possibilities: either we're dealing with the word 'a' or we are dealing with the word 'ab'. Two possibilities therefore, with an entropy of 1. Now, in order to get that entropy, we must calculate the p for every child, taking that child's 'count' into consideration.
And that's why I talked about a special kind of null-character, so that the formula would only have to look at the children, and not also to inspect the 'isEnd' variable.


I think you're right, you've convinced me.


Winston Gutkowski wrote:
In fact, they include an even "simpler" structure:

where, instead of a List, each Node points to its "first" child; but that child may have "siblings".


Thanks for the alternative, but I prefer to use my own data structure. For me it's easier, and I don't have much time to rewrite all the code.
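(For context, the "first child / sibling" layout Winston mentions usually looks roughly like this; the names are a guess, not his code:)

    // Each node keeps one pointer to its first child and one to its next
    // sibling, instead of holding a List of children.
    class SiblingTrieNode {
        char value;
        SiblingTrieNode firstChild;   // first child, or null if the node is a leaf
        SiblingTrieNode nextSibling;  // next child of the same parent, or null
    }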

Piet Souris wrote:Anyway, I've written my own version, according to what I understand of it.
I think it is roughly Alicia's version, but I added my own 'TrieEndOfWord' class, to see if it works


Thanks. Now I'll take a calm look at your code.


By the entropy of the dictionary I mean the entropy calculated with the chain rule:
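(Presumably the standard chain rule for entropy is meant here: H(X_1, ..., X_n) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_1, ..., X_{n-1}).)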



7 years ago

Piet Souris wrote:Given an ‘a’ at level 1 (so in fact I have a word that starts with an ‘a’), and I ask what possible outcomes I have one level down, I see the following possibilities:

1) The ‘empty’ character, meaning end-of-word
2) ‘b’

Both with probability 0.5.


Why do both have probability 0.5? In my view, the (blank) " " character has probability 0 while the "b" character has probability 1. In fact, once you are on the "*"->"a" node, you can only choose the character "b"; you have no other alternatives.
In the dictionary there is no " " character, only the letters a-z, so it makes no sense to assign a probability to a character that does not exist.

Piet Souris wrote:
If I define my own notation, I’d say that the ‘entropy’, given an ‘a’ at level 1, is:

- { 0.5 * log_2(0.5) + 0.5 * log_2(0.5) } = - log_2(0.5) = - (-1) = 1

So H(a, 1, 1) (is entropy, given an ‘a’ at level 1 and going 1 level deep) = 1.


I would say that the entropy, given an "a" at level 1, is 0 because there is no uncertainty about which character to choose, since I can only choose "b".

If the "*"->"a" node had two children equally likely, then entropy would be 1.
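In formulas, with Piet's notation: a single child with probability 1 gives H = -(1 * log_2(1)) = 0, while two equally likely children give H = -(0.5 * log_2(0.5) + 0.5 * log_2(0.5)) = 1.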



7 years ago
Hi Piet, I'm fine, thanks, and you? :-)

I created the dictionary you suggested and I think I noticed a possible error.
I had already built simple dictionaries before, but I had never noticed this error.

Here is the figure:


The nodes on the paths {*, A} and {*, C} should have entropy 0 and not 0.5, right? In that case there is no uncertainty, so the entropy should be 0.

OK, I don't know how to fix it yet, but I'm "happy" to have found it (if this is indeed the error).

7 years ago
I have a trie data structure that stores a set of English words. For example, given these (nonsense) words, the dictionary is the following:

aa abc aids aimed ami amo b browne brownfield brownie browser brut
butcher casa cash cicca ciccio cicelies cicero cigar ciste conv cony
crumply diarca diarchial dort eserine excursus foul gawkishly he
insomniac mehuman occluding poverty pseud rumina skip slagging
socklessness unensured waspily yes yoga z zoo



The nodes in blue are those at which a word ends.

In each node I saved:
  • the character that it represents
  • the level at which the node is located
  • a counter that indicates how many words "pass" through that node
  • the entropy of the next character of the node.


I would like to find the entropy of each level of the tree and the total entropy of the dictionary.

This is a piece of the TrieNode class that represents a single node:
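(A minimal sketch of such a TrieNode, with assumed field names and types based on the list above, might look like this:)

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: names and types are assumptions, not the original code.
    public class TrieNode {
        char value;        // the character this node represents
        int level;         // level at which the node is located (root = 0)
        int count;         // how many words "pass" through this node
        double entropy;    // entropy of the next character
        boolean isEnd;     // true if a word ends at this node
        List<TrieNode> children = new ArrayList<>();
    }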



    And this is a piece of class Trie with some methods to manipulate the trie data structure:
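(Again as a sketch with assumed names rather than the original code, an insert method over such nodes could look like:)

    public class Trie {
        private final TrieNode root = new TrieNode();

        // Insert a word, creating nodes as needed and updating level and count.
        public void insert(String word) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                TrieNode child = findChild(node, c);
                if (child == null) {
                    child = new TrieNode();
                    child.value = c;
                    child.level = node.level + 1;
                    node.children.add(child);
                }
                child.count++;            // one more word passes through this node
                node = child;
            }
            node.isEnd = true;            // a complete word ends here
        }

        private TrieNode findChild(TrieNode node, char c) {
            for (TrieNode child : node.children) {
                if (child.value == c) {
                    return child;
                }
            }
            return null;
        }
    }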



Now I would like to find the entropy of each level of the tree.

To do this I created the following method, which builds two data structures (level and levelArray).
levelArray is an array of double containing the result, that is, index i holds the entropy of level i.
level is an ArrayList of ArrayLists; each inner list contains the weighted entropy of each node at that level.
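(A sketch of what such a method could look like, assuming each node's entropy field is already filled in and weighting every node by its count; this is my reading of the description, not the original code:)

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class LevelEntropy {
        // Breadth-first walk that accumulates, per level, each node's entropy
        // weighted by the fraction of dictionary words passing through it.
        public static double[] entropyPerLevel(TrieNode root, int maxLevel, int totalWords) {
            List<List<Double>> level = new ArrayList<>();
            for (int i = 0; i < maxLevel; i++) {
                level.add(new ArrayList<>());
            }
            double[] levelArray = new double[maxLevel];

            Deque<TrieNode> queue = new ArrayDeque<>();
            queue.add(root);
            while (!queue.isEmpty()) {
                TrieNode node = queue.poll();
                if (node.level > 0) {
                    double weighted = node.entropy * node.count / totalWords;
                    level.get(node.level - 1).add(weighted);
                    levelArray[node.level - 1] += weighted;
                }
                queue.addAll(node.children);
            }
            return levelArray;
        }
    }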



    If I run this method on the sample dictionary I get:

    [lev 1] 10.355154029112995
    [lev 2] 0.6557748405420764
    [lev 3] 0.2127659574468085
    [lev 4] 0.23925771271992619
    [lev 5] 0.17744361708265158
    [lev 6] 0.0
    [lev 7] 0.0
    [lev 8] 0.0
    [lev 9] 0.0
    [lev 10] 0.0
    [lev 11] 0.0
    [lev 12] 0.0

The problem is that I can't tell whether this result is plausible or completely wrong.

Another problem: I would like to calculate the entropy of the entire dictionary. To do so I thought I'd sum the values in levelArray. Is that the right procedure? If I do that, I obtain an entropy of 11.64 for the entire dictionary.

I need some advice. I'm not asking for finished code, but I would like to understand whether the solutions I proposed for the two problems are correct.

The example I proposed is very simple. In reality these methods must work on a real English dictionary of about 200800 words. If I apply these methods to that dictionary I get numbers like these (in my opinion they are excessive).

    Entropy for each level:

    [lev 1] 65.30073504641602
    [lev 2] 44.49825655981045
    [lev 3] 37.812193162250765
    [lev 4] 18.24599038562219
    [lev 5] 7.943507700803994
    [lev 6] 4.076715421729149
    [lev 7] 1.5934893456776191
    [lev 8] 0.7510203704630074
    [lev 9] 0.33204345165280974
    [lev 10] 0.18290941591943546
    [lev 11] 0.10260282173581108
    [lev 12] 0.056284946780556455
    [lev 13] 0.030038717136269627
    [lev 14] 0.014766733727532396
    [lev 15] 0.007198162552512713
    [lev 16] 0.003420610593927708
    [lev 17] 0.0013019239303215001
    [lev 18] 5.352246905990619E-4
    [lev 19] 2.1483959981088307E-4
    [lev 20] 8.270156797847352E-5
    [lev 21] 7.327868866691726E-5
    [lev 22] 2.848394217759738E-6
    [lev 23] 6.6648152186416716E-6
    [lev 24] 0.0
    [lev 25] 8.545182653279214E-6
    [lev 26] 0.0
    [lev 27] 0.0
    [lev 28] 0.0
    [lev 29] 0.0
    [lev 30] 0.0
    [lev 31] 0.0

I think they are wrong. And I can't calculate the entropy of the entire dictionary: I think the sum is such a long process that I'm not able to see the result.

That's why I would like to understand whether the methods I have written are right.

    I don't understand what is wrong.
    Thanks a lot in advance
    7 years ago
Help...
The problem is to create a stream of shorts. I decided to use short because I think it makes it easier to manipulate single bits.

This is the loop I plan to use once I have a list (buffer) of shorts:



In this case my problem is to create a stream of shorts, that is, to create an array of short.
This is what I've done:



But I lose a byte, and I don't think this is the best way to proceed.
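(For what it's worth, a sketch of building a short[] buffer from a file without dropping the trailing byte; class and method names are placeholders, not the code used here:)

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ShortBuffer {
        // Read the whole file and pack each pair of bytes into one short
        // (high byte first). An odd trailing byte is kept and padded with zeros.
        static short[] readShorts(String path) throws IOException {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            short[] buffer = new short[(bytes.length + 1) / 2];
            for (int i = 0; i < bytes.length; i++) {
                int shift = (i % 2 == 0) ? 8 : 0;
                buffer[i / 2] |= (bytes[i] & 0xFF) << shift;
            }
            return buffer;
        }
    }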

If I follow the advice to use integers instead of shorts, I don't know how to manipulate the bits or how to create the int[] buffer array from the file.
I don't think that a loop like this would be fine.



So I'm not sure how to continue... I know it is easy but I'm stuck.

    Thanks
    8 years ago
I wrote this in pseudocode... I don't know if it is correct, could you help me?
I treat each item as an integer (32 bits), as recommended:



    Where:
  • n is an item in the buffer, i.e. buffer[i] with i = 0 ... buffer.length-1
  • m is the length of the sequences I must build


This seems to work for me for m = 1, m = 2, m = 4, m = 8 and m = 16.
It does not work in the other cases, because those values do not divide 32 evenly, so I should discard the leftover part.
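(In Java, that pseudocode idea might be sketched like this, for 1 <= m <= 16 with m dividing 32; the names are mine:)

    public class BitGroups {
        // Split one 32-bit int n into (32 / m) groups of m bits each,
        // most significant group first. Intended for m that divides 32 evenly.
        static int[] split(int n, int m) {
            int groups = 32 / m;
            int mask = (1 << m) - 1;
            int[] out = new int[groups];
            for (int i = 0; i < groups; i++) {
                out[i] = (n >>> (32 - (i + 1) * m)) & mask;
            }
            return out;
        }
    }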
    8 years ago
Mmm, OK, I read up on bitwise operators but I don't know how to write the code.

The code would be something like:
  • take 1 byte
  • split the byte into sequences of length m
  • put all these sequences in a list


I found that:


but it returns only the single bit at index position, not a sequence of bits...
I would like to take 2 bits (if m = 2), 3 bits (if m = 3), ...
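(Generalising that kind of snippet, a group of m bits starting at a given bit position can usually be taken with a shift and a mask; this is just an illustrative fragment, not code from the thread:)

    // Extract the m bits of 'value' that start at bit position 'pos'
    // (pos = 0 being the least significant bit).
    static int bitsAt(int value, int pos, int m) {
        return (value >> pos) & ((1 << m) - 1);
    }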

And if I read bytes, how can I then handle sequences with m > 8?
Should I change my InputStream from Byte to Integer?
I'm confused...
    8 years ago

    Joel McNary wrote:I would suggest using Integer as the type for your list. Integers in Java are 32 bits, and you need to handle m <= 16, so Integer will be more than enough. If you want, you could use Short (which has a length of 16 bits), but if you start debugging you could see some unexpected display results and get confused. Short is a type really only used in memory-limited applications, though, and so isn't commonly used in classroom assignments. Other than that, your next steps look correct to me.


OK, then tomorrow I'll go ahead following your advice. Thank you.
    8 years ago
    Ok all clear. So now I have to change the procedure.

    My idea:
- Transform the buffer array into a list list whose elements have length m. But a list of what type? Int no, Double no, of course not bytes, uhm...
- For now I consider only the simplest case, that is, a file whose length is a multiple of m, otherwise I simply ignore the unnecessary bits.
- I pass list to the calculateShannonEntropy method, which does the same things it does now, but the maps use the same type as list instead of Byte.


Thank you so much for the help you are giving me.
    8 years ago
First of all, I'm sorry but I don't speak English very well (rather badly, in fact), so I may misinterpret and/or answer incorrectly.

    Joel McNary wrote:1). I would not use Byte as the keys in my maps, because that will not handle your larger m-values (Bytes handle a maximum of m = 8, and even then you might see some unexpected results if you print out the Map, since bytes in Java are signed!)


OK, so you're saying that it's better not to use a Byte as a key because it would create problems when I have to change the value of m (m = 1, m = 2, etc.)?

    Joel McNary wrote:2). You will need another method to transform your byte[] input to an array (or List) of something else. For example, an input of byte[] {114} and m = 2 should give you back an array of two-bit numbers ({1, 3, 0, 2}). I leave it as an exercise to you to figure out why 114 would become those four numbers, but I gave you the answer already in this post. What should an input of byte[]{114} and m = 4 return? If you can figure that out, you are most of the way to writing the transformation function.


    With m= 2 --> (114)_10 = (01 11 00 10)_2 where 01 = 1, 11 = 3, 00 = 0 and 10 = 2.
    With m= 4 --> (114)_10 = (0111 0010)_2 where 0111 = 7 and 0010 = 2.

So I would need a Java function that, given one byte (which I see as a decimal number), decomposes it into portions of length m, with m variable.
Right? And how can I do that?
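(A sketch of such a function, with my own names; for new byte[]{114} and m = 2 it gives {1, 3, 0, 2}, and for m = 4 it gives {7, 2}:)

    public class ByteSplitter {
        // Decompose one byte into 8/m groups of m bits, most significant group
        // first, so 114 = 01110010 with m = 2 becomes {1, 3, 0, 2}.
        static int[] decompose(byte b, int m) {
            int value = b & 0xFF;          // treat the byte as unsigned
            int groups = 8 / m;
            int mask = (1 << m) - 1;
            int[] out = new int[groups];
            for (int i = 0; i < groups; i++) {
                out[i] = (value >>> (8 - (i + 1) * m)) & mask;
            }
            return out;
        }
    }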


    Joel McNary wrote:3). Do you have to handle cases like m = 3? That can be interesting, because there is no guarantee that a file will have an even multiple of 3 bits. What should be done in those cases? (What would byte[]{114} and m=3 return?


With m = 3 --> (114)_10 = (01 110 010)_2 where 110 = 6 and 010 = 2.
I don't consider the leftover two bits because they do not form a complete pattern.
Question: is it right to throw away the two most significant bits or the two least significant ones?
Is it correct not to consider the "excess" bits?

    Joel McNary wrote:But before you start those changes, can you get it working front-to-back for m = 8? What entropy value does it give you for that text input?


I admit that I haven't understood very well what result I should expect...



    8 years ago
    Hi Joel McNary, thanks for the reply.
Today I thought more carefully about the problem and I was able to write some code in Java.
But I don't know whether it is correct or not.



So in this way it should work properly.
The problem is that I don't know whether the readFileInputStream method really reads the bytes.



For example, if I take as input a text file like the one below, I get that mapO and mapP are:

    Text:
    Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?

    mapO:
    [
    32 = 5
    65 = 0
    97 = 0
    98 = 0
    99 = 0
    100 = 0
    101 = 4
    103 = 2
    105 = 3
    108 = 0
    110 = 2
    111 = 0
    114 = 1
    115 = 0
    116 = 2
    118 = 0
    119 = 0
    121 = 0
    ]

    mapP (I removed symbols with probability=0 because I don't want to consider them):
    [
    32 = 0.2777777777777778
    114 = 0.05555555555555555
    116 = 0.1111111111111111
    101 = 0.2222222222222222
    103 = 0.1111111111111111
    105 = 0.16666666666666666
    110 = 0.1111111111111111
    ]

Why are there 2- and 3-digit numbers (32, 65, 97, 98, 99, 100, 101, ...)? Shouldn't they be 8-bit sequences like 00000000, 00000001, 00000010, etc.?
Did I use the method incorrectly?


I don't know how to explain it: I need to calculate the entropy for different values of m, so I need to be able to read the input file as a sequence of 2-bit groups (m = 2), as a sequence of bytes (m = 8, which is what I tried to do), as a sequence of 4-bit groups, etc.

How can I do this?
I must definitely use another method. I read that in Java you can read files only down to byte granularity, not as a bit sequence, for example...
    8 years ago
    Hi, I have to write a program that is able to estimate the probability of each symbol of a file and then evaluate the entropy.

I know that the entropy represents the minimum number of bits needed to represent a symbol from a source of information.
Let P_i denote the probability of the i-th symbol of the file, where a symbol is taken to be an m-bit string.
These probabilities can be estimated numerically by evaluating the corresponding statistical frequencies.
Starting from the estimated probabilities I can calculate the entropy of order m.
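(That is, presumably H_m = - sum_i P_i * log_2(P_i), taken over all observed m-bit symbols.)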

So I have to write a program that can estimate the entropy of order m of a file, for values of m <= 16 bits.
    Well, I don't know where to begin.
    I don't want the task done, only suggestions to help me figure out how to start and what direction to take.

    My ideas are:
    1. create a program (in Java) that reads in a text input file
2. take for example m = 2. Now, knowing that there are 10 digits (0...9) and 26 letters (A...Z), I should find out the frequency with which each pair formed from those characters appears in the text.
3. since the frequency approximates the probability, I can then calculate the entropy.

This is the "easy" case, in which I considered only alphanumeric symbols and m = 2.
Then I can play with it by adding symbols and increasing or decreasing m.
Is that correct?
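(As a rough illustration of steps 1-3 above for the simplest byte-per-symbol case, i.e. m = 8; the class name and the way the file is passed in are placeholders:)

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class EntropyDemo {
        public static void main(String[] args) throws Exception {
            // 1. read the input file
            byte[] data = Files.readAllBytes(Paths.get(args[0]));

            // 2. count how often each symbol (here: each byte) occurs
            Map<Byte, Integer> counts = new HashMap<>();
            for (byte b : data) {
                counts.merge(b, 1, Integer::sum);
            }

            // 3. frequencies approximate probabilities; sum -p * log2(p)
            double entropy = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / data.length;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
            System.out.println("Entropy (bits per symbol): " + entropy);
        }
    }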

    I am very confused, I need your help.
    Thank you very much
    8 years ago
Thanks to all. I've solved it!
    8 years ago