Radu Gheorghe

author

Recent posts by Radu Gheorghe

Hello!

It sounds like both big data and data mining are buzzwords, which makes them very fuzzy as concepts. My understanding is that data mining includes scraping content one way or another, processing it according to your needs, then storing that huge amount of content and pulling meaningful info out of it (mostly statistics). Big data is about the last two steps (how to store and how to analyze tons of data), so it sounds like data mining includes big data.

Some people also use the two terms interchangeably (because you can consider the data source you're scraping to be big data as well), but in either case one can't really say which is more important or anything like that. Also, what counts as "big" is highly debatable.

So I guess the short answer is "I don't know", but I had to try.
9 years ago
Hello,

With Elasticsearch you get a lot of stuff out of the box that many people would otherwise have to implement themselves when using raw Lucene. Aggregations (used for analytics) are specific to Elasticsearch, as are the REST API and the distributed model (which allows for sharding and replication). Also, because of the REST API you can decouple search from your application and upgrade them independently, assuming you reach a point where you treat Lucene as just another data store.
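
Just to give an idea of what I mean by aggregations over the REST API, here's a quick sketch (the index and field names are made up, and this is the 1.x syntax):

    curl -XPOST 'localhost:9200/logs/_search?pretty' -d '{
      "size": 0,
      "aggs": {
        "errors_per_day": {
          "date_histogram": { "field": "timestamp", "interval": "day" }
        }
      }
    }'

This would return per-day counts as JSON buckets, without returning the documents themselves.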

I don't know how it compares to Google's solutions; from what I hear (from clients), Elasticsearch is more configurable. You have pretty much all the Lucene knobs exposed (plus lots of Elasticsearch-specific ones).
9 years ago
Hello and thanks

I think it's a good book for beginners: it doesn't assume any prior search knowledge at all (you can probably see that from chapter 1). That's also why it ended up so big - we wanted to make it work for people who are completely new to this stuff, but it also had to go deep enough to let the reader be "independent" in running Elasticsearch. For me, Elasticsearch was the first search engine/big data technology I got to learn, and I wanted to write as if the reader were in the same situation I was in a few years ago.
9 years ago
Hello, and thanks for the welcome!

Elasticsearch is definitely relevant for Java developers. You can work with it from any programming language since it has a REST API, and there's also a native Java API, which we don't cover in the book because it changes more often (Elasticsearch is written in Java).

Elasticsearch is good for indexing stuff - whether you want to do that for filtering (think "grepping" through billions of logs and getting sub-second replies) or for getting counters out of your data (like how many unique visitors landed on your site each day). Centralized logging is the typical use-case for Elasticsearch, but it's good for anything that involves full-text search and/or realtime analytics: product search, social media search and analytics, and so on.
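
To make that concrete, here's roughly what indexing and "grepping" look like over the REST API (the index, type and field names are just for illustration):

    # index a log entry
    curl -XPOST 'localhost:9200/logs/syslog' -d '{
      "timestamp": "2015-03-03T10:15:00",
      "host": "web01",
      "message": "connection timeout"
    }'
    # then search across everything that was indexed
    curl -XGET 'localhost:9200/logs/_search?q=message:timeout&pretty'

You'd get the matching documents back as JSON, usually in milliseconds.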

It's similar to Hadoop in the sense that it's distributed - you can shard your data across N servers. Plus, the real-time analytics part is a bit like map-reduce. I'm by no means a Hadoop guy, but it sounds like Hadoop is for longer, batch-processing jobs while Elasticsearch is for realtime stuff.

I hope this answers your question - you can find more about Elasticsearch use-cases in the first chapter of our book, which is free.
9 years ago
Hi Raymond,

Thanks for the welcome! We try to cover all the ground in the book: the first two chapters should help you get started, then chapters 3-8 focus on managing the mapping and the queries (how to index data, how to set up text analysis, how to make queries relevant, how to manage relationships, how to run aggregations to get real-time statistics). Chapters 9-11 focus more on administration (scaling, performance tuning, monitoring). The appendices cover features that are very nice but maybe not needed by everyone (like highlighting or monitoring tools).

At the moment, the book is written against 1.4.x. The final version will include all the upcoming changes in the 1.x branch (what you'll see in 1.5 and probably 1.6) and account for what we know will change in 2.0. We're trying to future-proof it as much as possible - most things stay the same, but there are important differences that we need to point out, especially when it comes to best practices.

You can use Sense, but it's now part of Marvel (a commercial monitoring product by Elasticsearch which is free for development). Alternatively, you can send requests from your browser with plugins like Head and Kopf. I personally use curl most of the time, and you'll see curl examples in our book as well.
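
For example, checking whether your cluster is up is a single request from the command line (adjust the host and port to your setup):

    curl -XGET 'localhost:9200/_cluster/health?pretty'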
9 years ago
I don't think there are plans for a complete rewrite, but they're certainly working on the performance side. Check out the full roadmap here: http://www.elasticsearch.org/guide/en/logstash/roadmap/current/index.html

You have some options now, too:
- try 1.5 Beta if you haven't already, because it's faster than 1.4.2
- try parallelizing more. For example, the elasticsearch output has a "workers" option that lets you push more stuff to ES, and you can also tune flush_size to something that works best for your ES cluster (see the sketch after this list). You can also make filters work on multiple threads by starting Logstash with multiple filter workers (the -w option). If the input is the bottleneck, maybe you can start more inputs and load balance?
- try something else that's faster. rsyslog and Apache Flume come to mind; they can parse stuff and send it to Elasticsearch like Logstash does, and they have configurable in-memory and on-disk buffers. That said, you might miss some features and they're more difficult to set up (at least that's my experience with them)
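
Here's the sketch I mentioned for the second point (the option names are from the elasticsearch output of Logstash 1.4/1.5; the values are just examples you'd have to tune for your setup):

    # run filters on 4 worker threads
    bin/logstash -f logstash.conf -w 4

    # in logstash.conf, the output section could look like this:
    output {
      elasticsearch {
        host => "localhost"
        protocol => "http"
        workers => 4        # more threads pushing to ES
        flush_size => 1000  # documents per bulk request
      }
    }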

In our book, we don't talk much about Logstash and Kibana, but we do cover using Elasticsearch for logs and other time-based data; chapters 9 (scaling), 10 (performance) and 11 (administration) in particular might be interesting for you.
9 years ago
Hi Alex,

I'm not super-familiar with what MongoDB has to offer (I have to catch up on that), but Elasticsearch's strong points are real-time search and analytics. Some example use-cases that fit Elasticsearch very well:
- log centralization. This is by far the most popular. You can drop lots of logs in there (or any other time-based data, really - it could be metrics, for instance), "grep" through them very quickly, and also do lots of statistics. If you want an example (this is actually a product I've been working on), go to https://apps.sematext.com/demo and open the Logsene tab. Logsene is a logging SaaS with an Elasticsearch backend. You can click the Kibana button there to explore logs through the open-source Kibana UI, which was built specifically for slicing and dicing logs stored in Elasticsearch. If you've also heard about Logstash (which can mangle events on their way to Elasticsearch), together they make up what is called the "ELK stack", which many use for centralizing logs. There are lots of other tools in the logging ecosystem that work with Elasticsearch; rsyslog is one of them, and I'm a big rsyslog fan
- social media. This is another kind of time-based data, but I'm putting it separately because you might have other search needs, like stemming or fuzzy searches (or statistics - for example, you may not want to count "search" and "searching" separately)
- product search. This is closer to the typical search engine case: you have a bunch of products, so how do you build relevant search on top of them? Elasticsearch has tons of features in the way text is analyzed (i.e. making tokens from the original text and from the query string and matching them) and in the way you can run queries. For example, you can rank exact matches higher, or you can rank newer or promoted products higher (like in the sketch below)
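
And here's the sketch for ranking newer products higher (again 1.x syntax, with invented index and field names) - a function_score query that decays the score of older products:

    curl -XPOST 'localhost:9200/products/_search?pretty' -d '{
      "query": {
        "function_score": {
          "query": { "match": { "name": "coffee maker" } },
          "functions": [
            { "gauss": { "release_date": { "origin": "now", "scale": "30d" } } }
          ]
        }
      }
    }'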

Depending on your use-case, Elasticsearch may have advantages in the search/analytics performance area (because of the way it indexes everything) or in the search relevancy area. That's how I'd divide them, at least.
9 years ago
Thanks Henry, Tim, Alex! Ranch is nice, I like the Ranch
9 years ago