Some Meaningless Statistics

John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
I remember I was reading a post by an unregistered user in MD, and I was thinking, "Is it possible to determine who the poster is by analyzing the distribution of words in his post and comparing it with that of a known user?"
So, I wrote a program that retrieves all the posts of the user from the specified JR forum and runs some statistics. The program is smart enough to parse multi-page topics, and to discard the quotes from other people in the user's posts. I made a somewhat arbitrary decision to use the top 10 most frequently used non-trivial words as a poster's "fingerprint". I defined a non-trivial word as one with at least 6 characters. I haven't programmed the analysis itself yet; for now it just does the stats. Unfortunately, the JR search interface only allows a maximum of 200 topics, so the volume of posts to analyze is limited to a few hundred.
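The counting step described above can be sketched roughly as follows. This is not the MeaninglessStats source (which is not shown in the thread). The class name WordFingerprint, the letters-only tokenizer, and the alphabetical tie-breaking are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the original program.
class WordFingerprint {
    static final int MIN_LENGTH = 6; // "non-trivial" threshold from the post

    /** Counts how often each word of at least MIN_LENGTH letters occurs. */
    static Map<String, Integer> countNonTrivial(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a word is a run of ASCII letters; everything else separates words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() >= MIN_LENGTH) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n most frequent words; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        String posts = "People really like people. Really! People and cats.";
        for (Map.Entry<String, Integer> e : top(countNonTrivial(posts), 10)) {
            System.out.println(e.getKey() + " (" + e.getValue() + " times)");
        }
    }
}
```

Note that a letters-only tokenizer turns "wouldn't" into "wouldn" and "t", so a real run would probably want smarter handling of apostrophes.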
Below are the results for Jim, Tom, and Map (they are the frequent posters in MD, and the stats are most meaningful if there is a large volume of posts to analyze). You can also run the stats for yourself, but please don't abuse it -- the program does some heavy network interaction with JR and makes "search" requests to JR, and I hear that search is a memory hog on the JR server. Specify your member id and the forum id in the main() method of the MeaninglessStats class, and run it.
User: Jim Yingst
Total posts: 417
Total words: 42301
Total non-tivial words (at least 6 characters long): 10358
Total unique non-tivial words: 3519
Top 10 most frequently used non-trivial words:
1. people (91 times)
2. really (88 times)
3. actually (71 times)
4. problem (70 times)
5. probably (65 times)
6. something (59 times)
7. should (57 times)
8. though (55 times)
9. number (46 times)
10. english (45 times)
User: Mapraputa Is
Total posts: 706
Total words: 52245
Total non-tivial words (at least 6 characters long): 13885
Total unique non-tivial words: 4267
Top 10 most frequently used non-trivial words:
1. russian (142 times)
2. people (125 times)
3. something (109 times)
4. should (93 times)
5. english (89 times)
6. language (88 times)
7. another (87 times)
8. because (85 times)
9. example (76 times)
10. michael (73 times)
User: Thomas Paul
Total posts: 508
Total words: 22008
Total non-tivial words (at least 6 characters long): 5583
Total unique non-tivial words: 2519
Top 10 most frequently used non-trivial words:
1. people (89 times)
2. because (67 times)
3. someone (37 times)
4. should (36 times)
5. actually (32 times)
6. really (27 times)
7. little (26 times)
8. government (24 times)
9. better (22 times)
10. country (21 times)
The immediate observations are:
-- if you see the words "little" and "better" in the post, the author is Tom
-- the words "russian", "language", and "example" identify Map
-- Jim seems to favor the words "problem", "number", and "though"
I was trying to post the source code, but JR wouldn't accept it because of some "illegal html tags" or something. But you can download it.
Eugene.
[ May 05, 2003: Message edited by: Eugene Kononov ]
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
!
This is amazing, what MD people are doing!
Eugene, wanna have intercourse?


Uncontrolled vocabularies
"I try my best to make *all* my posts nice, even when I feel upset" -- Philippe Maquet
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Did you try to analyze prepositions instead? I read an article about identifying authors of literature texts, and they found that each writer (Russians ) has his own unique "spectrum" of prepositions (relative frequency). One page of text is enough for identification.
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
Does anyone think it is amusing that Map has "Michael" as her 10th most common word?


Associate Instructor - Hofstra University
Amazon Top 750 reviewer - Blog - Unresolved References - Book Review Blog
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Eugene, wanna have intercourse?
Nah, I am faithful to Tom.
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
What you are attempting is a form of frequency analysis. It is one method used to break simple ciphers. Such statistics can only be analyzed meaningfully when the average occurrence of a particular word is known, and when a significant sampling of each poster is available.
Let's say a word frequency count is taken for all posts in MD. The number of times a word appears, divided by the total number of words in MD (including duplicates), would give us a frequency for any particular word. Store each unique word-frequency pair in a hash table and we can treat these as our "average" frequencies, on which we will base our analysis. You will have to have this data in order to make statements such as "Tom's posts use the words 'little' and 'better' more often than most".
You might then profile each user using the same methods as above to come up with their word frequencies. Comparisons would be made between each user and the baseline to make determinations. This will be more accurate the more words a person posts and less accurate the less one posts.
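A minimal sketch of the baseline comparison described above, assuming word counts have already been collected for the whole forum and for one user. The class and method names are hypothetical, and the simple ratio used here is only one possible way to compare a user against the baseline.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of comparing a user's word frequencies to a forum-wide baseline.
class BaselineComparison {
    /** Converts raw counts to relative frequencies (count / total words, duplicates included). */
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> freq = new HashMap<>();
        if (total == 0) {
            return freq; // nothing to normalize
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    /**
     * For each word the user wrote, returns userFrequency / baselineFrequency.
     * A ratio above 1 means the user uses the word more often than the forum average.
     * Words missing from the baseline are skipped (an assumption of this sketch).
     */
    static Map<String, Double> overuse(Map<String, Double> user, Map<String, Double> baseline) {
        Map<String, Double> ratios = new HashMap<>();
        for (Map.Entry<String, Double> e : user.entrySet()) {
            Double base = baseline.get(e.getKey());
            if (base != null && base > 0) {
                ratios.put(e.getKey(), e.getValue() / base);
            }
        }
        return ratios;
    }
}
```

With a small sample the ratios swing wildly, which is the point made above about needing a significant sampling of each poster.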
Given an unknown poster, generate a profile and compare to the known posters. While the chances are slim you could make a positive match, you could probably come up with a list of the top n most likely candidates, and manually compare against the baseline to narrow your choices further. More importantly though would be to analyze style and usage.
More meaningful than simple word frequencies would be to note misspellings frequently used (although these would stand out pretty well in the overall freq count) as well as frequency counts using combinations or groups of words. Someone who often used "totaly" instead of "totally" or someone who often uses the phrase "in accordance with" would stand out pretty well.
This is all fun and interesting stuff imho.
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
It's not a joke what you are doing!
There is an interview with jGuru guys in "Design for Community" book, and frankly, I was envious when reading what they have!
Terence Parr: What we do is, when somebody submits a question, we search for an appropriate answer and provide you with a list of potentials, in an effort to say, "Here's the answer." That way the system automatically tries to reduce the amount of noise in the forums.
Further, it tries to guess when you're in the wrong forum. You do not want someone in the database forum asking a question about building a GUI. So, if someone does that, the system says, "there's an above average chance that you're in the wrong forum. I suggest one of these topics." Then the user can just click on it and switch to the right forum.
The system also tries to detect when you haven't said anything about Java. So if somebody just says, "hey, what's this site about?" or posts a thigh cream commercial, the system says, "You know, there is not a single word in there that I recognize as part of the java lexicon. So click here to re-edit and add something that has to do with java"
How do you know what is and isn't java talk?
Parr: We have a fuzzy logic search engine that tries to strip out everything but the important keywords in your question. Then I do a fuzzy comparison against all of the other FAQ entries in our system. I do this not by which FAQs have these keywords, but by how important these keywords are in that particular FAQ. In other words, how often they are used, and the frequency of the use of that word, and how important this document is compared with the rest, so I can bubble the most important one to the top.
The way we started this system was to spider the New York Times website. I got, I dunno, 250,000 words. And I said, okay, that's English. And then I spidered our own website, and I said, that's Java. And then, to distinguish the Java lexicon from the English, I did some complicated fuzzy logic stuff that revealed a set of keywords that are specific to Java.
Because those words don't appear in the New York Times?
Parr: Well, they may appear in the New York Times, but it's a difference in their usage. For example, the word "compile" is probably at the New York Times website. However, when you are talking about java, it's used way more. So, if a word is overused in the java lexicon, and underused, relatively speaking, in English, I say, "Aha! That's Java."
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
I noticed that non-native English speakers are especially fond of abusing word frequency.
I knew a woman from Colombia who used the verb "organize" every time she couldn't think about a better alternative. I knew a Japanese guy who started about a half of sentences with "Soooooo..." Then I thought which word *I* abuse, and figured it must be "just". Sooooo... I decided to organize my thoughts just soooooo I wouldn't use this word just too often...
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
Originally posted by Mapraputa Is:
Then I thought which word *I* abuse, and figured it must be "just".

While it's not a word, didn't we decide that it's the hyphen that you abuse? On the other hand, you are far too forgiving on articles. You need to slap them around a bit more.
We all have patterns of use and abuse though, whether or not we are native speakers. I'm particularly harsh on the prepositional phrase, at least when writing.
This abuse can extend regionally as well. In the Baltimore area there are some who use the word "Hon" (short for 'Honey') like it's punctuation, as in "Here you go, Hon", or "What can I get you, Hon?" Growing up in southern New England we criminally abused the word "wicked" (usually came out sounding like 'wicket'), as in "That's wicked cool", "Wicked pissa", "Way wicked", or sometimes just "Wicked".
[ May 05, 2003: Message edited by: Jason Menard ]
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
While it's not a word, didn't we decide that it's the hyphen that you abuse?

This is an interesting question!
"Hyphen abuse" seems to be a part of my alternative upbringing.
Here is an interview with a professional translator, and he said:
"... even punctuation in different languages serves different, sometimes the opposite, goals. Say, where in English a colon is used, in Russian usually a hyphen is needed, and the vice versa, a hyphen in English is a very strong sign, and it is used, especially in American variant, only for strong contrast."
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Does anyone think it is amusing that Map has "Michael" as her 10th most common word?
Yeah, is it my Texan friend that we are talking about?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Does anyone think it is amusing that Map has "Michael" as her 10th most common word
Is that all? The sample set must be limited to relatively recent posts.


"I'm not back." - Bill Harding, Twister
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
JY: Is that all? The sample set must be limited to relatively recent posts.
Yes, 200 last topics, that's the max number of topics that JavaRanch search returns. I would love to run it for a larger set of topics, but JR apparently limits it.
[ May 05, 2003: Message edited by: Eugene Kononov ]
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Did you try to analyze prepositions instead? I read an article about identifying authors of literature texts, and they found that each writer (Russians ) has his own unique "spectrum" of prepositions (relative frequency). One page of text is enough for identification.
Here is the analysis of the prepositions, for the same three people, plus myself.
Summary:
-- Jim has an abnormally high frequency of "at" and a distaste for "about"
-- Map abuses "about"
-- Tom is flawless, I can't see any abnormal peaks
-- Eugene is the only one with "by" in the top 10

Full Results:
User: Jim Yingst
Total posts: 417
Total words: 42386
Total prepositions: 5408
Ratio Total prepositions / Total words: 0.13
Total unique prepositions: 58
Top 10 most frequently used prepositions:
1. to(1210 times,22.37% of total prepositions)
2. of(865 times,15.99% of total prepositions)
3. in(591 times,10.93% of total prepositions)
4. for(340 times,6.29% of total prepositions)
5. but(330 times,6.10% of total prepositions)
6. as(323 times,5.97% of total prepositions)
7. on(219 times,4.05% of total prepositions)
8. with(215 times,3.98% of total prepositions)
9. at(189 times,3.49% of total prepositions)
10. from(166 times,3.07% of total prepositions)
User: Mapraputa Is
Total posts: 706
Total words: 52245
Total prepositions: 5931
Ratio Total prepositions / Total words: 0.11
Total unique prepositions: 56
Top 10 most frequently used prepositions:
1. to(1399 times,23.59% of total prepositions)
2. of(882 times,14.87% of total prepositions)
3. in(713 times,12.02% of total prepositions)
4. for(406 times,6.85% of total prepositions)
5. but(389 times,6.56% of total prepositions)
6. as(313 times,5.28% of total prepositions)
7. with(273 times,4.60% of total prepositions)
8. about(269 times,4.54% of total prepositions)
9. on(219 times,3.69% of total prepositions)
10. from(176 times,2.97% of total prepositions)
User: Thomas Paul
Total posts: 508
Total words: 22008
Total prepositions: 2842
Ratio Total prepositions / Total words: 0.13
Total unique prepositions: 48
Top 10 most frequently used prepositions:
1. to(606 times,21.32% of total prepositions)
2. of(478 times,16.82% of total prepositions)
3. in(384 times,13.51% of total prepositions)
4. on(161 times,5.67% of total prepositions)
5. for(142 times,5.00% of total prepositions)
6. as(123 times,4.33% of total prepositions)
7. but(118 times,4.15% of total prepositions)
8. with(111 times,3.91% of total prepositions)
9. from(100 times,3.52% of total prepositions)
10. about(84 times,2.96% of total prepositions)
User: Eugene Kononov
Total posts: 202
Total words: 16711
Total prepositions: 2058
Ratio Total prepositions / Total words: 0.12
Total unique prepositions: 50
Top 10 most frequently used prepositions:
1. to(502 times,24.39% of total prepositions)
2. of(365 times,17.74% of total prepositions)
3. in(279 times,13.56% of total prepositions)
4. for(118 times,5.73% of total prepositions)
5. with(105 times,5.10% of total prepositions)
6. from(89 times,4.32% of total prepositions)
7. as(82 times,3.98% of total prepositions)
8. but(77 times,3.74% of total prepositions)
9. on(65 times,3.16% of total prepositions)
10. by(64 times,3.11% of total prepositions)
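One way to turn preposition tables like these into a comparable "spectrum" is sketched below. The short preposition list, the class name, and the sum-of-absolute-differences distance are assumptions for illustration. The thread does not show which prepositions the program recognizes or how two profiles would be compared.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a preposition "spectrum" and a distance between two spectra.
class PrepositionSpectrum {
    // Illustrative subset only; the original program reported 48 to 58 unique prepositions per user.
    static final Set<String> PREPOSITIONS = new HashSet<>(Arrays.asList(
            "to", "of", "in", "for", "but", "as", "on", "with", "at", "from", "about", "by"));

    /** Share of each preposition among all prepositions found in the text. */
    static Map<String, Double> spectrum(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (PREPOSITIONS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> shares = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            shares.put(e.getKey(), e.getValue() / (double) total);
        }
        return shares;
    }

    /** Sum of absolute differences: 0 for identical spectra, 2 for completely disjoint ones. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0;
        for (String k : keys) {
            sum += Math.abs(a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0));
        }
        return sum;
    }
}
```

An unknown post could then be matched to the known poster whose spectrum has the smallest distance, though with shares as close as the ones above the match would be far from certain.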
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Can you run your program on whole MD? We could compare top 10 abnormalities weekly. I bet, before a document about fallacies was composed, "Iraq" and "UN" were high on the list, and later it would be something like "penis"
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Did anybody notice that I am a clear winner in
"Total unique non-trivial words" category???
Francis Siu
Ranch Hand

Joined: Jan 04, 2003
Posts: 867
oh... great! Eugene
Is it a assignment in java college?
The comment of this assignment
Best programming technique and good marketing view.But lack programme documentation
Marks: B+
Did you study the subject called "Data mining"?


Francis Siu
SCJP, MCDBA
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Map]: Did anybody notice that I am a clear winner in "Total unique non-trivial words" category???
Only because you posted more in all. Look at the ratio of non-unique words to total words - I'm slightly ahead of you, and Tom has us both beat by a fair margin. Though it's reasonable to expect this metric to diminish gradually the more a person posts - each new post is decreasingly likely to require the use of words not already used in previous posts.
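The ratio can be checked against the numbers in the first post, reading it as unique non-trivial words divided by total words (an interpretation, since the post does not spell out the formula). The class name is made up for this sketch.

```java
// Hypothetical sketch: vocabulary ratio from the figures in the first post.
class VocabularyRatio {
    /** Unique non-trivial words divided by total words. */
    static double ratio(int uniqueNonTrivial, int totalWords) {
        return uniqueNonTrivial / (double) totalWords;
    }

    public static void main(String[] args) {
        // Figures copied from the first round of stats in this thread.
        System.out.printf("Jim: %.4f%n", ratio(3519, 42301));
        System.out.printf("Map: %.4f%n", ratio(4267, 52245));
        System.out.printf("Tom: %.4f%n", ratio(2519, 22008));
    }
}
```

On those figures Jim comes out just ahead of Map, and Tom is well ahead of both, which matches the ordering described above.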
It also occurs to me that a lot of Mapposts contain large sections quoted from other sources (often with no explanation). This doubtless has an effect on Eugene's statistics.
[ May 06, 2003: Message edited by: Jim Yingst ]
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
It also occurs to me that a lot of the Map's posts contain large sections quoted from other sources (often with no explanation). This doubtless has an effect on Eugene's statistics.
Yes, Map is known to include large quotes in her posts, and it probably explains Map's high number of unique words. My program only filters out whatever is in bold, or between the quote blocks, -- it has no way of knowing if the paragraph belongs to author, or the author just included it. I could perhaps filter out whatever is between the quotes, but that also doesn't guarantee that a particular passage belongs to the member who may have used some other means of quoting.
In addition, some people back in 2001-2002 used some nonstandard ways of quoting, such as
> here is a quote
> end of quote
This makes it somewhat non-trivial to find the end of quote.
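For the simplest case, where every quoted line carries its own ">" marker, a filter could look like the sketch below. The class and method names are hypothetical, and it does nothing for quotes where only the first line is marked, which is the hard case mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: drop email-style quoted lines from a post.
class QuoteStripper {
    /** Removes every line whose first non-blank character is '>'. */
    static String stripEmailStyleQuotes(String post) {
        List<String> kept = new ArrayList<>();
        for (String line : post.split("\\R")) { // \R matches any line break
            if (!line.trim().startsWith(">")) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }
}
```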
Nevertheless, in my tests the program did a good job in considering only the user's words. It's just that in some cases, the statistics may be biased.
I also found a minor bug in the code that didn't properly remove quotes in some cases. I fixed it, and I am also adding some frequency analysis code for "fingerprinting".
Guess who is the easiest person to identify? It's Don Liu, -- his histogram looks way out of place, with some bizarre words such as "thank you" and "scja" somewhere on the top. I don't think Don Liu is a human, -- I bet you that it is just a Java program that posts to JR.
[ May 06, 2003: Message edited by: Eugene Kononov ]
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Originally posted by Jim Yingst:
[Map]: Did anybody notice that I am a clear winner in "Total unique non-trivial words" category???
Only because you posted more in all. Look at the ratio of non-unique words to total words - I'm slightly ahead of you, and Tom has us both beat by a fair margin.
[Rest of the post deleted as particularly offensive]



That's right. Every time somebody says something good about Map, here is this JY guy, who put her in her place...

[ May 06, 2003: Message edited by: Mapraputa Is ]
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
That's right. Every time somebody says something good about Map, here is this JY guy, who put her in her place...
It's a dirty job, but somebody's gotta do it. And I think Jimbo is doing a fantastic job.


Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage - to move in the opposite direction. - Ernst F. Schumacher
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Eugene - excellent. I hadn't looked closely at the code - didn't realize you were filtering a lot of that stuff out already. Very cool.
[EK]: Map is known to include large quotes in her posts, and it probably explains Map's high number of unique words.
Well to be fair, she does employ a fairly large vocabulary too - I doubt it's all from the quotes.
Hmmm... I don't suppose you do anything to filter out misspellings?
[Map]: That's right. Every time somebody says something good about Map, here is this JY guy, who put her in her place...
Absolutely. Especially when "somebody" is Map herself. I wouldn't want you to think I wasn't reading your posts. And anyway, the other person I was picking on today was Peter den Haan - so you're in pretty good company.
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Hmmm... I don't suppose you do anything to filter out misspellings?
Actually, this is why I commented about my "victory" in "Total unique non-trivial words" category. Looked funny to me, you know. I thought those "unique non-trivial words" must be all misspelled
[ May 06, 2003: Message edited by: Mapraputa Is ]
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
I suppose I could post each word to dictionary.com and parse the response to validate the existence of the word, but it would take a very long time. Are there any dictionaries that I could install and look up locally?
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Ok, I added some enhancements. The program now checks if the words are spelled correctly, and it also does a better job of filtering out the quotes from the user's post. The updated source is here, if you need it.
This run took all night, -- the spell check is a very expensive operation, -- the program made some 35,000 HTTP requests to an online dictionary to validate the words. So don't try it over a modem connection (you can disable the spell check though).
The stats are now much more meaningful:
User: Jim Yingst
Total words: 42009
Total non-tivial correctly spelled words (at least 6 characters long): 9838
Total non-tivial misspelled words: 660
Total unique non-tivial correctly spelled words: 3177
Ratio (Total unique non-tivial correctly spelled words / Total words) : 0.07562665
Top 10 most frequently used non-trivial words:
1. people (91 times)
2. really (88 times)
3. problem (70 times)
4. probably (65 times)
5. something (59 times)
6. should (57 times)
7. though (55 times)
8. number (46 times)
9. english (45 times)
10. someone (41 times)
User: Mapraputa Is
Total words: 51388
Total non-tivial correctly spelled words (at least 6 characters long): 13028
Total non-tivial misspelled words: 833
Total unique non-tivial correctly spelled words: 3694
Ratio (Total unique non-tivial correctly spelled words / Total words) : 0.07188448
Top 10 most frequently used non-trivial words:
1. russian (142 times)
2. people (125 times)
3. something (109 times)
4. should (93 times)
5. english (89 times)
6. language (87 times)
7. another (86 times)
8. because (85 times)
9. example (75 times)
10. course (71 times)
User: Eugene Kononov
Total words: 17471
Total non-tivial correctly spelled words (at least 6 characters long): 4365
Total non-tivial misspelled words: 262
Total unique non-tivial correctly spelled words: 1946
Ratio (Total unique non-tivial correctly spelled words / Total words) : 0.11138458
Top 10 most frequently used non-trivial words:
1. people (72 times)
2. government (59 times)
3. prepositions (57 times)
4. should (54 times)
5. because (25 times)
6. really (24 times)
7. russian (23 times)
8. something (22 times)
9. country (21 times)
10. someone (19 times)
User: Michael Morris
Total words: 13086
Total non-tivial correctly spelled words (at least 6 characters long): 3003
Total non-tivial misspelled words: 231
Total unique non-tivial correctly spelled words: 1735
Ratio (Total unique non-tivial correctly spelled words / Total words) : 0.13258444
Top 10 most frequently used non-trivial words:
1. because (24 times)
2. little (18 times)
3. always (16 times)
4. americans (15 times)
5. should (14 times)
6. probably (12 times)
7. saying (11 times)
8. middle (10 times)
9. english (9 times)
10. consider (8 times)
User: Thomas Paul
Total words: 21955
Total non-tivial correctly spelled words (at least 6 characters long): 5292
Total non-tivial misspelled words: 293
Total unique non-tivial correctly spelled words: 2326
Ratio (Total unique non-tivial correctly spelled words / Total words) : 0.10594398
Top 10 most frequently used non-trivial words:
1. people (89 times)
2. because (67 times)
3. someone (37 times)
4. should (36 times)
5. actually (32 times)
6. person (27 times)
7. little (26 times)
8. anything (24 times)
9. better (22 times)
10. country (21 times)
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Yet more coolness. For spelling, I suggest that while online lookup may be the best way to go initially, it may well be more efficient in the long run to build a local list, indicating all words looked up, and whether they're valid or not. Each time your program runs, it can load a HashMap (or two HashSets) with this info for fast lookup. Then any time you need to look up another word online because it's not yet in your local list, use the result of the online lookup to update the local listing. Eventually you'll have a nice list of valid words and common misspellings. This would also allow investigation of the list of "misspellings" to see how accurate the results are. Calling them misspellings may be inaccurate; my signature for example is spelled correctly (errr... mostly), but probably isn't in m-w.com. Though from your stats it appears that I mistype far more often than I realized - probably inverted letters mostly.
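The caching idea could be sketched like this. The onlineLookup predicate is a hypothetical stand-in for the HTTP request to the dictionary site. Loading and saving the map between runs is left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a spell checker that remembers every word it has looked up.
class CachedSpellChecker {
    private final Map<String, Boolean> known = new HashMap<>(); // word -> valid or not
    private final Predicate<String> onlineLookup;               // stand-in for the slow HTTP lookup
    private int onlineCalls = 0;

    CachedSpellChecker(Predicate<String> onlineLookup) {
        this.onlineLookup = onlineLookup;
    }

    /** Answers from the local map when possible; otherwise asks online and records the result. */
    boolean isValid(String word) {
        Boolean cached = known.get(word);
        if (cached != null) {
            return cached;
        }
        onlineCalls++;
        boolean valid = onlineLookup.test(word);
        known.put(word, valid);
        return valid;
    }

    /** Number of lookups that actually went to the online dictionary. */
    int onlineCalls() {
        return onlineCalls;
    }
}
```

Because misspellings are cached as well as valid words, a frequently repeated typo costs only one online request.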
On another note, you needn't be limited to just the last 200 posts from any one person. You can access any thread from the past (if not deleted) by constructing the appropriate URL; they're sequentially numbered according to the order in which each thread was started. You can't tell in advance who has posted to the thread - but what I'm saying is you could just download the complete contents of MD to local files, and then search them all you want for particular posters. Be careful about downloading too much at once; our server may not appreciate such aggressive use, so keep an eye on the pace. (If you start seeing "server too busy" messages, that's a clue to slow it down.) But I don't think it would be too difficult to download everything eventually. After that you just need to keep pace with the threads that have been updated; you can parse this from the main MD page.
[ May 07, 2003: Message edited by: Jim Yingst ]
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
Wow, I feel like Steve Martin in The Jerk when he got the brand new phone book, found his name listed in the white pages and shouted "I'M SOMEBODY!". First, I'm declared an Infidel, now my posts have been statistically analyzed! On the other hand, this ain't some kind of devious backdoor IRS audit is it? My misspelled ratio is bigger than everybody else's NGYA, NGYA, NGYA, NGYA.
[ May 07, 2003: Message edited by: Michael Morris ]
paul wheaton
Trailboss

Joined: Dec 14, 1998
Posts: 20720

It might be fun to set that up on a web page and be able to run it with any ID.


permaculture Wood Burning Stoves 2.0 - 4-DVD set
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
It might be fun to set that up on a web page and be able to run it with any ID.
Sounds like a good idea to me.
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Originally posted by Paul Wheaton:
It might be fun to set that up on a web page and be able to run it with any ID.

Ok, I put up a web site so the meaningful stats now run online.
It runs on a free server, so it's very slow (takes about 12 seconds to parse a topic, so if you have posted in a lot of topics, it may take some time). However, you can specify the max number of topics to parse.
I also fixed a few bugs (thanks to Jim who reviewed my code), and optimized it a little.
If you guys in charge ever want to house it on the javaranch server, I'll be happy to help. All source code is available (although I have not posted it yet), and the stats should run much faster from JR server, even if it is still a networked version. My estimate is about 1 second per topic.
Feedback and suggestions are welcome,
Eugene.
R K Singh
Ranch Hand

Joined: Oct 15, 2001
Posts: 5371
I think you should pick last N number of topics .... That will show latest words used by user...
then it will be much easier to find... who is Moose
R K Singh
Ranch Hand

Joined: Oct 15, 2001
Posts: 5371
Or you are using Ranch's search engine ... My God .... JR will be down if more people use ur meaningless stat prog..
2 cents .. use "recent post" link and remove forum search ..... but then it wont be meaningless
AW good job ..
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Eugene- excellent job! Sorry I didn't get back after your last message. Got distracted again. Anyway...
I note that when using the search page for, say, member 290, I get 200 results dating from April 29, 2001 to May 7, 2002. That's basically going back to the oldest posts we have, and looking forward from there until the limit of 200 is reached. Furthermore, when using your option to limit the number of posts to 20 (or whatever) these 20 posts are selected from the beginning rather than the end. I believe two significant improvements are possible in this area: (1) select posts from the end of the list rather than the beginning, and (2) make use of the search page's option to limit the search by date (last 30 days for example). Though if you do (2) then (1) may be less useful, or vice versa.
If you guys in charge ever want to house it on the javaranch server, I'll be happy to help.
Indeed, I was thinking about it. And site owner Paul sounds like he might be amenable, though he wasn't necessarily talking about setting it up on the same server as UBB. (We've had assorted load issues in the past; we're very cautious about this sort of thing.) In fact though, UBB stores each thread in a separate text file (no, there's no nifty DB behind this) so a program running on the server could just read the thread files directly rather than making an HTTP request for each one. Parsing code would have to be modified a bit to look for different telltales, but I doubt that would be too complex - and I'm sure we could get much better speed this way.
One concern of course is putting too much load on the server - especially if there's a publicly-accessible page that anyone can submit a huge search request to. Perhaps instead we could set up some sort of weekly or monthly batch job to run on a weekend, and then post the results somewhere.
Incidentally, for those of you who are trying to identify the Moose, an alternate strategy is to try to bribe one of us moderators who can access the IP number of posts. Though personally I only used the IP to confirm suspicions I already had for other reasons...
Francis Siu
Ranch Hand

Joined: Jan 04, 2003
Posts: 867
Eugene
Are you curious to know who The Moose is? What if you extract all the posts by The Moose, do some statistical analysis on the words The Moose uses, and compare it with that of all the sheriffs?
Would it be possible to identify who The Moose is?
Answer: The Moose is Theodore,isn't him?
Because some traces could be found from a read only forum/museum but now removed
Could you tell me your predication who is The Moose?

And Jim
I only used the IP to confirm suspicions
You must know who is The Moose
Could you check my answer and tell me the truth?
I think that it is not privacy
thanks
[ May 14, 2003: Message edited by: siu chung man ]
The Moose
Bartender

Joined: Apr 01, 2003
Posts: 73
What! You doubt my unique identity? :roll:
No mere human sheriff could cause as much trouble as a true Moose can! Come join my new crusade in the JavaRanch Forum. The crusade to keep JavaRanch primitive!!.
Animal lovers UNITE!!


Finally! Animal rights in action!
 