
Mailing List Archive: RSyslog: users

Who is interested in ElasticSearch?

 

 



rgerhards at hq

Apr 10, 2012, 3:56 AM

Post #1 of 17
Who is interested in ElasticSearch?

Hi all,

I am doing some experimental work on ElasticSearch integration. I started off
with a contribution and will extend it in the coming days/weeks. I wonder who
else is interested in that topic? I'd like to get feedback on
suggested/required features, and I am also looking for some folks to test the
things that have been implemented.

Anyone out there? Please feel free to share/forward this mail if you happen
to know somebody else who might be interested.

Thanks!
Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/


bodik at civ

Apr 10, 2012, 4:10 AM

Post #2 of 17
Re: Who is interested in ElasticSearch?

Hi,

I did some testing recently. I tried omelasticsearch, but I stopped using
the direct output plugin in favor of a logstash push agent:

a) I won't connect the log server to the els (ElasticSearch) cluster
directly (because of security).

a1) There are also SIGSEGV issues when setting configuration parameters
of omelasticsearch.

b) logstash has better functionality for parsing and mangling data before
it is pushed to els.

c) els clients are very sensitive to input data. There were cases where
the logs contained binary data that could not be pushed, and the whole
cluster crashed because of it. For example:

Feb 13 19:30:19 127.0.0.1 sshd[22862]: Invalid user imu\361oz from a.b.c.d
Feb 13 19:30:19 127.0.0.1 sshd[22862]: pam_krb5(sshd:auth):
authentication failure; logname=imu�oz uid=0 euid=0 tty=ssh ruser=
rhost=a.b.c.d

That's also why I switched to logstash, with:

tr -c '[:print:][:cntrl:]' '?' | $JAVA_HOME/bin/java $JAVA_OPTS -jar
$JAR agent -f lsloader-stdin.conf
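
For readers who want the same sanitizing step without `tr`, here is a minimal Python sketch. It assumes the C/POSIX locale, where `[:print:]` and `[:cntrl:]` together cover every byte from 0x00 to 0x7F, so the complement is exactly the bytes 0x80 to 0xFF. The function name `sanitize` is only illustrative and is not part of logstash or rsyslog.

```python
def sanitize(line: bytes) -> bytes:
    """Replace every byte outside 7-bit ASCII with '?'.

    Mirrors `tr -c '[:print:][:cntrl:]' '?'` under the C locale (an
    assumption; other locales classify bytes differently).
    """
    return bytes(b if b < 0x80 else ord("?") for b in line)


# \xf1 is the same byte that syslog printed as the octal escape \361.
raw = b"Invalid user imu\xf1oz from a.b.c.d"
print(sanitize(raw).decode("ascii"))
```

Like the `tr` pipeline, this loses information (the original byte cannot be recovered), which may or may not be acceptable for a given log archive.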



rgerhards at hq

Apr 10, 2012, 4:26 AM

Post #3 of 17
Re: Who is interested in ElasticSearch?

> -----Original Message-----
> From: rsyslog-bounces [at] lists [mailto:rsyslog-
> bounces [at] lists] On Behalf Of Radoslav Bodó
> Sent: Tuesday, April 10, 2012 1:10 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Who is interested in ElasticSearch?
>
> hi,
>
> recently i did some testing. i tried omelasticsearch but i stopped
> using
> direct output plugin in the favor of logstash push agent
>
> a) i wont connect logserver to the els cluster directly (because of
> security)

I guess this cannot be solved in any case? You are saying that you avoid the
direct connection because you want it to be indirect, right?
>
> a1) there are also issues sigsegv when setting configuration parameters
> of omelasticsearch
Was that from the recently refactored git branch? I am asking because I have
completely rewritten the config part and would be very interested in any
problems encountered. The relevant branch is here:

http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/master-elasticsearch

> b) logstash has better functionality in parsing and mangling data
> before
> they are pushed to els

What is missing?

> c) els clients are very sensitive to input data. there were case when
> there were binary data in logs and those cannt be pushed and whole
> cluster crashed because of this.
>
> Feb 13 19:30:19 127.0.0.1 sshd[22862]: Invalid user imu\361oz from
> a.b.c.d
> Feb 13 19:30:19 127.0.0.1 sshd[22862]: pam_krb5(sshd:auth):
> authentication failure; logname=imu�oz uid=0 euid=0 tty=ssh ruser=
> rhost=a.b.c.d

The original ,JSON template option was a hack and did not cover all cases.
With the recent commit, JSON coding is much more solid - but obviously still
experimental.
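
To illustrate what solid JSON coding has to cope with, here is a small Python sketch. It is not the omelasticsearch implementation (which is written in C), only one possible approach: decode the raw bytes as UTF-8, substitute U+FFFD for invalid sequences, and let the standard `json` module escape quotes and control characters. The function name and the field names `host` and `message` are made up for the example.

```python
import json


def to_json_doc(raw_msg: bytes, host: str) -> str:
    # errors="replace" turns each undecodable byte into U+FFFD instead of
    # raising, so arbitrary binary input still yields valid JSON.
    text = raw_msg.decode("utf-8", errors="replace")
    # json.dumps escapes quotes, backslashes and control characters and,
    # by default, emits pure ASCII (non-ASCII becomes \uXXXX escapes).
    return json.dumps({"host": host, "message": text})


doc = to_json_doc(b'Invalid user imu\xf1oz "quoted"\x01', "127.0.0.1")
print(doc)
```

The trade-off is the same as with any replacement strategy: the output is always parseable, but the offending byte itself is not preserved.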

Thanks for the feedback!
Rainer


vladg at illinois

Apr 10, 2012, 6:05 AM

Post #4 of 17
Re: Who is interested in ElasticSearch?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

First off, I'm very interested in ElasticSearch. I tried several different backend databases for log storage, and none of them could scale as well. With a single moderately-sized ES server, I was able to index over 3000 DNS query logs per second, and querying the data was very fast. I have a lot more data to index (~50k/s), and am currently building out the ES cluster.

On 4/10/12 6:10 AM, Radoslav Bodó wrote:
> b) logstash has better functionality in parsing and mangling data before
> they are pushed to els

Logstash is easier to configure, yes. But in my experience, it was unstable and couldn't keep up with any significant amount of logs. I wasn't using any feature in logstash that rsyslog doesn't have - I was using it for message filtering and normalization (which it only does via regular expressions, which were slow).

> c) els clients are very sensitive to input data. there were case when
> there were binary data in logs and those cannt be pushed and whole
> cluster crashed because of this.

You can easily escape this in rsyslog, and configure the character used to escape it. Also, with the newer ES versions, I have yet to experience a crash in sending the data to ES.

Here's my current wishlist for rsyslog/elasticsearch integration:

1) Support bulk inserts (<http://www.elasticsearch.org/guide/reference/api/bulk.html>).
2) Parse the reply, for two things:
   a) Messages that didn't get successfully inserted should probably be queued and reattempted once or twice before being discarded. Unfortunately, the new transactional interface won't be sufficient here - if messages 1, 2, 4, and 5 are successfully inserted, but message 3 fails, as far as I know, there's no way in the transactional interface to communicate that only message 3 failed, instead of messages 3-5.
   b) Messages that matched a percolator should be processed differently. A percolator (<http://www.elasticsearch.org/guide/reference/api/percolate.html>) is a saved query on the ES cluster. Whenever a message is inserted that matches a percolator, it is indicated in the response {"matches": "system_failed"}. This provides near-realtime search functionality. Anything that matches a percolator should somehow be reentered into the queue, so it can be passed to another output plugin (out to a file, ommail, etc.).
3) The ES server and port should be configured via config directives.
4) Somehow, the index and type for each message should be passed to the elasticsearch plugin. This is a bit tricky, because if it's part of the message itself, it takes some time to parse that data out of the message.
5) ES has an automatic discovery feature, where it will detect other cluster members (<http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html>). Ideally, rsyslog would also use this, so that if a cluster member goes down, it can find a new cluster member, and the system benefits from the high-availability of elasticsearch.
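
As a rough illustration of item 1 on the list above, the following Python sketch builds a bulk request body: one action line followed by one source line per document, newline-delimited, with a trailing newline. The helper name `build_bulk_body`, the index name `logs` and the type name `syslog` are assumptions for the example, not anything rsyslog defines.

```python
import json


def build_bulk_body(docs, index, doc_type):
    """Return a newline-delimited bulk request body.

    Each document becomes two lines: an action line and the source line.
    The body ends with a newline, as the bulk format expects.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


body = build_bulk_body([{"msg": "a"}, {"msg": "b"}], "logs", "syslog")
print(body, end="")
```

Because no `_id` is given in the action lines, the server would generate the IDs itself, which is the behavior discussed later in the thread.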

We have a developer that's currently working on many of these features, so I'm happy to offer some assistance with building this out.

--Vlad Grigorescu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.18 (Darwin)

iQIcBAEBCgAGBQJPhDATAAoJEMEVj6tjLlJyRi0QAJduzSmE/xZDOmpkRuDAFfm9
UNGeRQNJIUQJGlNS3+Auk+k7714KoQhGkjHiUqKb23QpPxTVEbOSCoRMdfwrrzq/
zQ9F58XdKbDd29/+0YBuO0m6l4CAqB8x6IlRnYjcWNdjLV8EjhXZrff8vV6MDOPc
WZZZ/GRTbKHdhVPhfLJMCtmqau3hYdR7qTW8hIkMpwS8nL9JrHrhTY6+F3bPzjI7
YF3IGKed+raV/3/VgV+aoBucjRwk8A5TSo8DuXJqDOZHjxLsjZ8t2K9PdSvZPjY9
gG/eK8dCKdswgZM+tv9TkJurwV+NOFPEgfvcpehJowuY3UzfsRg/tzHWehn84pWg
iBSUbWJ3J7f+4Q9ky3XARS/R0Ebx4Igs5DODqsI2SXg11DCg4Ll0D5fF12+ybZDh
VE1n6vLLuPxE2z8rXq8Oj/SQVvyWJBEu/jA3ibtcLi07fsEHP/3bQNc3LHR/ZTc5
/thotJscKrKY5ETpIYxBRdd33bVN+NxydBAbgcDJl4dt41hs2s6WP+Fb7ilWDmOt
H1i3CLeTiFoyEx/9EqRDvNpjed29tr4x8KXMUU9l1Zm+4Ul2rDJtpB8adUrvT7Jr
tQ7kbewlfxvCME1wZm5BHglb1C034B5yRdZcesg57CGmueNtptnkq+e983ezg/Ln
g12Bf2Uvx24X5t8W7grt
=RCu3
-----END PGP SIGNATURE-----


david at lang

Apr 10, 2012, 6:16 AM

Post #5 of 17
Re: Who is interested in ElasticSearch?

On Tue, 10 Apr 2012, Vlad Grigorescu wrote:

> a) Messages that didn't get successfully inserted should probably be
> queued and reattempted once or twice before being discarded.
> Unfortunately, the new transactional interface won't be sufficient here
> - if messages 1, 2, 4, and 5 are successfully inserted, but message 3
> fails, as far as I know, there's no way in the transactional interface
> to communicate that only message 3 failed, instead of message 3-5.

actually, what happens is that rsyslog sends a transaction and gets a
single success or failure message.

if success, all messages were inserted

if failure, it tries again with half as many messages to see if that goes
through. If it gets down to one message and that fails, then it considers
it a failure (and either retries, or drops the failed message)

so if elasticsearch doesn't have transactions (all or none succeed), then
some messages will be inserted multiple times.
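
The halving behavior described above can be sketched in Python as follows. This is a simplified model, not rsyslog's actual code: `send` is a hypothetical callable that returns True when the whole list was accepted and False otherwise, and the example backend is deliberately non-transactional to show where the duplicates come from.

```python
def submit_with_halving(batch, send):
    """Submit `batch` via `send`, halving the batch on failure.

    Returns the messages that still failed when sent on their own.
    """
    failed = []
    pending = [list(batch)]
    while pending:
        chunk = pending.pop(0)
        if not chunk or send(chunk):
            continue
        if len(chunk) == 1:
            failed.append(chunk[0])  # caller decides: retry or drop
            continue
        mid = len(chunk) // 2
        pending[:0] = [chunk[:mid], chunk[mid:]]  # keep original order
    return failed


stored = []


def send(messages):
    # Non-transactional backend: stores the good messages even when it
    # reports a failure for the batch as a whole.
    ok = True
    for m in messages:
        if m == "bad":
            ok = False
        else:
            stored.append(m)
    return ok


failed = submit_with_halving(["m1", "m2", "bad", "m4", "m5"], send)
print(failed)
print(stored)
```

In this run only "bad" ends up in `failed`, but the good messages are stored more than once, because each failed batch had already stored them before the retry with a smaller batch.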

David Lang


radu0gheorghe at gmail

Apr 10, 2012, 7:49 AM

Post #6 of 17
Re: Who is interested in ElasticSearch?

2012/4/10 <david [at] lang>:
> On Tue, 10 Apr 2012, Vlad Grigorescu wrote:
>
>>  a) Messages that didn't get successfully inserted should probably be
>> queued and reattempted once or twice before being discarded. Unfortunately,
>> the new transactional interface won't be sufficient here - if messages 1, 2,
>> 4, and 5 are successfully inserted, but message 3 fails, as far as I know,
>> there's no way in the transactional interface to communicate that only
>> message 3 failed, instead of message 3-5.
>
>
> actually, what happens is that rsyslog sends a transaction and gets a single
> success or failure message.
>
> if success, all messages were inserted
>
> if failure, it tries again with half as many messages to see if that goes
> through. If it gets down to one message and that fails, then it considers it
> a failure (and either retries, or drops the failed message)
>
> so if elasticsearch doesn't have transactions (all or none succeed), then
> some messages will be inserted multiple times.

Maybe a solution to this is to use IDs somehow to avoid entering
duplicates. Trying to add the same bulk (with the same IDs) will only
"update" existing documents, and increment the "_version" number.

I'm not sure how this could actually be implemented, but it might be an option.
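
One conceivable way to get such IDs, sketched in Python under stated assumptions: derive the ID from a hash of fields that identify the message, so that a retry produces the same ID and overwrites the earlier copy instead of adding a duplicate. The choice of fields (host, timestamp, raw message) and the function name `message_id` are assumptions for illustration, not an rsyslog feature.

```python
import hashlib


def message_id(host: str, timestamp: str, raw_msg: bytes) -> str:
    """Derive a stable document ID from fields that identify a message."""
    h = hashlib.sha1()
    for part in (host.encode("utf-8"), timestamp.encode("utf-8"), raw_msg):
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.hexdigest()


first = message_id("10.0.0.1", "Feb 13 19:30:19", b"sshd[22862]: Invalid user")
retry = message_id("10.0.0.1", "Feb 13 19:30:19", b"sshd[22862]: Invalid user")
print(first == retry)
```

The obvious limitation is that two genuinely distinct messages with identical host, timestamp and text would collapse into one document, so a real scheme would probably need an extra distinguishing field such as a sequence number.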

BTW, I'm also interested in Elasticsearch :). But since I'm using it
for logs, I'm not so much affected by duplicates.


vladg at illinois

Apr 10, 2012, 7:54 AM

Post #7 of 17
Re: Who is interested in ElasticSearch?

The thing to consider here is what happens when you have multiple rsyslog servers logging to ElasticSearch. Does there need to be some kind of concurrency control, so that each of them generates unique IDs for the messages? What happens if two messages have the same ID?

These are questions I'm unsure of, but for now, I'm happy to use ElasticSearch's automatic ID generation features.

--Vlad

--
Vlad Grigorescu | IT Security Engineer
Office of Privacy and Information Assurance
University of Illinois at Urbana-Champaign
0x632E5272 | 217.244.1922


rgerhards at hq

Apr 10, 2012, 8:10 AM

Post #8 of 17
Re: Who is interested in ElasticSearch?

> 1) Support bulk inserts
> (<http://www.elasticsearch.org/guide/reference/api/bulk.html>).
> 2) Parse the reply, for two things:
> a) Messages that didn't get successfully inserted should probably be
> queued and reattempted once or twice before being discarded.
> Unfortunately, the new transactional interface won't be sufficient here
> - if messages 1, 2, 4, and 5 are successfully inserted, but message 3
> fails, as far as I know, there's no way in the transactional interface
> to communicate that only message 3 failed, instead of message 3-5.
> b) Messages that matched a percolator should be processed
> differently. A percolator
> (<http://www.elasticsearch.org/guide/reference/api/percolate.html>) is
> a saved query on the ES cluster. Whenever a message is inserted that
> matches a percolator, it is indicated in the response {"matches":
> "system_failed"}. This provides near-realtime search functionality.
> Anything that matches a percolator should somehow be reentered into the
> queue, so it can be passed to another output plugin (out to a file,
> ommail, etc.)

I need to dig into this feature in more detail. It looks like we could use
a kind of status indicator, just like the one recently added for the
normalizer, for this functionality (as far as the rsyslog engine is concerned).

> 3) The ES server and port should be configured via config directives.

That's already available in the new version. Documentation is missing, simply
because things are not yet finalized:

Action(type="omelasticsearch" server="" port="" ...)

> 4) Somehow, the index and type for each message should be passed to the
> elasticsearch plugin. This is a bit tricky, because if it's part of the
> message itself, it takes some time to parse that data out of the
> message.
It's currently a per-instance fixed (but configurable) string. I've already
begun to discuss a functionality similar to dynafiles to permit
message-contained values.
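
To make the idea of message-contained values concrete, here is a hypothetical Python sketch of expanding an index name per message, loosely analogous to how a dynafile name is built from message properties. The pattern syntax, the field names `timestamp` and `program`, and the helper `index_for` are all assumptions; they do not describe an existing omelasticsearch option.

```python
from datetime import datetime, timezone


def index_for(msg: dict, pattern: str = "logs-{program}-{day}") -> str:
    """Expand a per-message index name from fields of the message."""
    ts = datetime.fromtimestamp(msg["timestamp"], tz=timezone.utc)
    return pattern.format(
        program=msg["program"].lower(),
        day=ts.strftime("%Y.%m.%d"),
    )


# 1334044800 is 2012-04-10 08:00:00 UTC.
print(index_for({"timestamp": 1334044800, "program": "sshd"}))
```

A per-day index name like this is only one possible choice; the point is that the name is computed from already-parsed fields rather than by re-parsing the message text.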


> 5) ES has an automatic discovery feature, where it will detect other
> cluster members
> (<http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html>).
> Ideally, rsyslog would also use this, so that if a cluster member
> goes down, it can find a new cluster member, and the system benefits
> from the high-availability of elasticsearch.
>
> We have a developer that's currently working on many of these features,
> so I'm happy to offer some assistance with building this out.
It would probably be quite good for all parties involved if we could join
forces. I can obviously provide strong rsyslog knowledge and would love to
work with someone who is far more fluent in ES than I am ;)

Just reply (on- or off-list as you like) to sort out any questions.

Rainer


rgerhards at hq

Apr 10, 2012, 8:14 AM

Post #9 of 17
Re: Who is interested in ElasticSearch?

Of all the features, I'd probably like to tackle this "catch and solve insert
error" issue last. The main reason is that there already is a lot of support
for handling output errors.

Rainer


bodik at civ

Apr 10, 2012, 8:50 AM

Post #10 of 17
Re: Who is interested in ElasticSearch?

>> recently i did some testing. i tried omelasticsearch but i stopped
>> using direct output plugin in the favor of logstash push agent
>>
>> a) i wont connect logserver to the els cluster directly (because of
>> security)
>
> I guess this cannot be solved in any case? You talk about not using direct
> connection because you want it indirect, right?

Yes. The log server should log and not do anything else. Also, I'd like to have a
way to push some old data into the els cluster, and that can be done with logstash
in the same way for both new and old logs (`cat | logstash`). This just fits my
needs better right now ...

>> a1) there are also issues sigsegv when setting configuration parameters
>> of omelasticsearch
> Was that from the recently refactored git branch? I am asking because I have
> completely rewritten the config part and would be very interested in any
> problems encountered. The relevant branch is here:
>
> http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/master-elasticsearch

no, i used just origin/master 2 months ago


>> b) logstash has better functionality in parsing and mangling data
>> before they are pushed to els
>
> What is missing?

* pushing old logs from disk
* parsing using grok filters
* deleting some parts of the messages

I think all of those are related to the way I'd like to use my logs, and it's fine
for me not to mix the log server with a search engine ...

I know that many of those could be done with proper templates, but ...

b


bodik at civ

Apr 10, 2012, 8:54 AM

Post #11 of 17
Re: Who is interested in ElasticSearch?

> Also, with the newer ES versions, I have yet to experience a crash in
> sending the data to ES.

Which version are you using? I'm working with 0.18.7.
b


radu0gheorghe at gmail

Apr 11, 2012, 12:33 AM

Post #12 of 17
Re: Who is interested in ElasticSearch?

2012/4/10 Vlad Grigorescu <vladg [at] illinois>:
> The thing to consider here is what happens when you have multiple rsyslog servers logging to ElasticSearch. Does there need to be some kind of concurrency, so that each of them have unique IDs for the messages? What happens if two messages have the same ID?

If two messages have the same ID, the one that gets inserted last
overrides the previous one, and gets an incremented _version. Which
basically means you lose data, because the old message isn't there
anymore.
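
The overwrite behavior described above can be mimicked with a toy in-memory model in Python. `ToyIndex` is purely a stand-in written for this illustration; it is not an Elasticsearch client and ignores everything except the ID and version bookkeeping.

```python
class ToyIndex:
    """In-memory stand-in for index-by-ID semantics (not a real ES client)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        # An existing ID is overwritten and its version is incremented.
        version = self.docs[doc_id]["_version"] + 1 if doc_id in self.docs else 1
        self.docs[doc_id] = {"_version": version, "_source": source}
        return version


idx = ToyIndex()
idx.index("42", {"msg": "from server A"})
idx.index("42", {"msg": "from server B"})
print(idx.docs["42"])
```

After the second call only the message from server B remains under ID "42", with version 2, which is the data loss described in the paragraph above.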

>
> These are questions I'm unsure of, but for now, I'm happy to use ElasticSearch's automatic ID generation features.

Well, if you rely on Elasticsearch to generate the IDs, I don't think
there's a way for rsyslog to know which documents were successfully
inserted and which not:

# curl -XPUT 'http://localhost:9200/test2/'
{"ok":true,"acknowledged":true}
# curl -XPUT 'http://localhost:9200/test2/type1/_mapping' -d '
{
"type2" : {
"properties" : {
"field1" : {"type" : "long"}
}
}
}
'
{"ok":true,"acknowledged":true}
# cat requests
{ "index" : { "_index" : "test2", "_type" : "type1" } }
{ "field1" : 1 }
{ "index" : { "_index" : "test2", "_type" : "type1" } }
{ "field1" : "bla" }
{ "index" : { "_index" : "test2", "_type" : "type1" } }
{ "field1" : 3 }
# curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":29,"items":[{"create":{"_index":"test2","_type":"type1","_id":"F5a5Rxt1RCSLXQ0N7wV4_w","_version":1,"ok":true}},{"create":{"_index":"test2","_type":"type1","_id":"vU07l91nQu-Nx9xLoextrA","error":"MapperParsingException[Failed to parse [field1]]; nested: NumberFormatException[For input string: \"bla\"]; "}},{"create":{"_index":"test2","_type":"type1","_id":"q2uJUEleRTmVv0jGoPxZkQ","_version":1,"ok":true}}]}

The only way to know which document was inserted and which not is by
order. Which looks a bit risky in my book.
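
Matching by order could look roughly like the following Python sketch. It assumes the response shape shown in the session above (each item is a one-key object whose value carries either "ok" or "error") and that the items come back in request order; other ES versions may format the reply differently. The sample response uses shortened placeholder IDs.

```python
import json


def failed_positions(response_text):
    """Return (position, error) pairs for items that carry an "error" key.

    Relies on the response items being in the same order as the request.
    """
    failures = []
    for pos, item in enumerate(json.loads(response_text)["items"]):
        # Each item is a one-key object such as {"create": {...}}.
        (result,) = item.values()
        if "error" in result:
            failures.append((pos, result["error"]))
    return failures


sample = json.dumps({"took": 29, "items": [
    {"create": {"_index": "test2", "_type": "type1", "_id": "a",
                "_version": 1, "ok": True}},
    {"create": {"_index": "test2", "_type": "type1", "_id": "b",
                "error": "MapperParsingException[Failed to parse [field1]]"}},
    {"create": {"_index": "test2", "_type": "type1", "_id": "c",
                "_version": 1, "ok": True}},
]})
print(failed_positions(sample))
```

The returned position is an index into the list of documents that was sent, so the caller has to keep that list around until the reply has been processed.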



nathans at aconex

Apr 11, 2012, 2:55 AM

Post #13 of 17
Re: Who is interested in ElasticSearch?

Hi Rainer,

----- Original Message -----

> Hi all,

> I am doing some experimental work on ElasticSearch integration. I
> started off
> with a contribution and will extend it in the coming days/weeks. I
> wonder who
> else is interested in that topic?
I'm interested, and glad to see others continuing to work on it. Time has become a bit
tight for me at the moment, and I can't really contribute much more coding-wise ... but,
great to see you & others taking it on - thanks!

cheers.

--
Nathan


vladg at illinois

Apr 11, 2012, 5:58 AM

Post #14 of 17
Re: Who is interested in ElasticSearch?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 4/11/12 2:33 AM, Radu Gheorghe wrote:
> 2012/4/10 Vlad Grigorescu <vladg [at] illinois>:
>> The thing to consider here is what happens when you have multiple rsyslog servers logging to ElasticSearch. Does there need to be some kind of concurrency, so that each of them have unique IDs for the messages? What happens if two messages have the same ID?
>
> If two messages have the same ID, the one that gets inserted last
> overrides the previous one, and gets an incremented _version. Which
> basically means you lose data, because the old message isn't there
> anymore.

Well, that's certainly not what you want when it comes to logs.

>> These are questions I'm unsure of, but for now, I'm happy to use ElasticSearch's automatic ID generation features.
>
> Well, if you rely on Elasticsearch to generate the IDs, I don't think
> there's a way for rsyslog to know which documents were successfully
> inserted and which not:
>
> The only way to know which document was inserted and which not is by
> order. Which looks a bit risky in my book.

According to the ES documentation[1] the order of the responses is the same as the order you sent to be indexed. I'll try to confirm with the author that that will remain the same down the line, but I suspect that many people rely on that fact at this point.

--Vlad

[1] - https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkResponse.java#L32
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.18 (Darwin)

iQIcBAEBCgAGBQJPhYARAAoJEMEVj6tjLlJyETsP/1SEB9lhS9JjeqvQTAgq6o1W
W7Pgu22FxbWj+5VS3AkoDaBSapagrdc02YuWOrH+obMQ0smQExiCqrMq7JHpaI6G
RVh0plhkwJDbJ2JmISQAED6ImjgRDSqBr9wHJ6Eoytm2SFmBaIzp9oNiCn5B6O/H
CRh4PqL1KJcM99rvNv2MULkM8D0JRdeD+g3kXLhwm/0nUsI5Cw0zkxFcS1cSrXiL
ODKCDgt0Wr4PnpmFAwqGizB2r8FFXTyoxHcYKWCQsCriyCeGe5ow2JLUYu2i/pqR
GelC2SJVRuDdaiuTxV4UdXu8jcFrABxJt8UclOu2Bolq3t9Q2YZitVKzfteoFuKH
OTgcHvQt9Qmns7Ew48cQHq+11oh7F6YG5KM9jtdeAjRyyQerLy8RQhL82T1ljHVl
45rLbgO+3BOCz9nuDlXm7jvjFUq0GO6zar/9TsYxEjNYJQZd0QrjcjSE1kCSMTsH
7eXwyVM9jxonaBzYufQQ6VuvSPznKqwvZe5phIYjmEFJnsL9tsritHATCctQykLm
D5evrdRQq/iZ8Lpvkak1xEQo0Mb2xRJIn1MA1a62gkCWih5LVxf9yMwZhyj51xfc
Ki/ss+BwrLdN7PI+eSouVMvN1s8y/kj0yEcoQBdR++0QlhzM3Vh3h9e6xruqI7B2
ZwEmR7NkZyfe6Y35k4yY
=XE/L
-----END PGP SIGNATURE-----


radu0gheorghe at gmail

Apr 11, 2012, 6:15 AM

Post #15 of 17
Re: Who is interested in ElasticSearch?

2012/4/11 Vlad Grigorescu <vladg [at] illinois>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On 4/11/12 2:33 AM, Radu Gheorghe wrote:
>> 2012/4/10 Vlad Grigorescu <vladg [at] illinois>:
>>> The thing to consider here is what happens when you have multiple rsyslog servers logging to ElasticSearch. Does there need to be some kind of concurrency, so that each of them have unique IDs for the messages? What happens if two messages have the same ID?
>>
>> If two messages have the same ID, the one that gets inserted last
>> overrides the previous one, and gets an incremented _version. Which
>> basically means you lose data, because the old message isn't there
>> anymore.
>
> Well, that's certainly not what you want when it comes to logs.
>
>>> These are questions I'm unsure of, but for now, I'm happy to use ElasticSearch's automatic ID generation features.
>>
>> Well, if you rely on Elasticsearch to generate the IDs, I don't think
>> there's a way for rsyslog to know which documents were successfully
>> inserted and which not:
>>
>> The only way to know which document was inserted and which not is by
>> order. Which looks a bit risky in my book.
>
> According to the ES documentation[1], the responses come back in the same order as the items you sent to be indexed. I'll try to confirm with the author that this will remain the case down the line, but I suspect that many people already rely on it.
>
>  --Vlad
>
> [1] - https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkResponse.java#L32

Cool!

Then maybe there's a hackish way to actually avoid duplicates while
still relying on ES to generate the IDs. Something like:
1. when the bulk is first sent, don't put in any IDs
2. if the reply has errors, take the original bulk and complete it
with IDs where you have them
3. take half of the original bulk and re-insert
4. repeat steps 2 and 3

But I guess implementing this sort of logic is nearly as complicated as
identifying the failed messages and trying to reindex them, only slower
(and probably less reliable).

As for identifying which messages are malformed and which are worth
retrying, I guess it depends on the exception type.
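
To make the last point concrete, here is a minimal Python sketch of matching bulk response items to the submitted documents by position and sorting the failures by error type. It is only an illustration under stated assumptions: that each response item is a one-key dict keyed by its action type, that failed items carry an "error" field, and that a mapper parsing error marks a document that will never be accepted. The function name and the marker list are hypothetical and are not part of rsyslog or omelasticsearch.

```python
# Assumed markers for errors that will not go away on retry (illustrative only).
PERMANENT_MARKERS = ("MapperParsingException", "mapper_parsing_exception")


def classify_failures(docs, items):
    """Pair each bulk response item with the document at the same position.

    Returns (retry, malformed): documents worth sending again, and
    documents whose error suggests they would be rejected every time.
    """
    if len(docs) != len(items):
        raise ValueError("response does not line up with the request")
    retry, malformed = [], []
    for doc, item in zip(docs, items):
        # Each item is assumed to look like {"index": {...}} or {"create": {...}}.
        result = next(iter(item.values()))
        error = result.get("error")
        if error is None:
            continue  # no error reported, so treat it as inserted
        if any(marker in str(error) for marker in PERMANENT_MARKERS):
            malformed.append(doc)
        else:
            retry.append(doc)
    return retry, malformed
```

Everything here depends on the response order matching the request order, which is why the length check comes first. Any error not recognised as permanent is treated as retryable, which is a choice of the sketch rather than something settled in this thread.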


rgerhards at hq

Apr 11, 2012, 6:52 AM

Post #16 of 17 (688 views)
Permalink
Re: Who is interested in ElasticSearch? [In reply to]

> > I am doing some experimental work on ElasticSearch integration. I
> > started off
> > with a contribution and will extend it in the coming days/weeks. I
> > wonder who
> > else is interested in that topic?
> I'm interested, and glad to see others continuing to work on it. Time
> has become a bit
> tight for me at the moment, and I can't really contribute much more
> coding-wise ... but,
> great to see you & others taking it on - thanks!

That's the beauty of open source where contributions grow and grow :-) Thanks
for doing the initial step!

Rainer


david at lang

Apr 11, 2012, 11:36 AM

Post #17 of 17 (696 views)
Permalink
Re: Who is interested in ElasticSearch? [In reply to]

On Wed, 11 Apr 2012, Radu Gheorghe wrote:

> 2012/4/11 Vlad Grigorescu <vladg [at] illinois>:
>> On 4/11/12 2:33 AM, Radu Gheorghe wrote:
>>> 2012/4/10 Vlad Grigorescu <vladg [at] illinois>:
>>>> The thing to consider here is what happens when you have multiple rsyslog servers logging to ElasticSearch. Does there need to be some kind of concurrency, so that each of them have unique IDs for the messages? What happens if two messages have the same ID?
>>>
>>> If two messages have the same ID, the one that gets inserted last
>>> overrides the previous one, and gets an incremented _version. Which
>>> basically means you lose data, because the old message isn't there
>>> anymore.
>>
>> Well, that's certainly not what you want when it comes to logs.
>>
>>>> These are questions I'm unsure of, but for now, I'm happy to use ElasticSearch's automatic ID generation features.
>>>
>>> Well, if you rely on Elasticsearch to generate the IDs, I don't think
>>> there's a way for rsyslog to know which documents were successfully
>>> inserted and which not:
>>>
>>> The only way to know which document was inserted and which not is by
>>> order. Which looks a bit risky in my book.
>>
>> According to the ES documentation[1], the responses come back in the same order as the items you sent to be indexed. I'll try to confirm with the author that this will remain the case down the line, but I suspect that many people already rely on it.
>>
>>  --Vlad
>>
>> [1] - https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkResponse.java#L32
>
> Cool!
>
> Then maybe there's a hackish way to actually avoid duplicates while
> still relying on ES to generate the IDs. Something like:
> 1. when the bulk is first sent, don't put in any IDs
> 2. if the reply has errors, take the original bulk and complete it
> with IDs where you have them
> 3. take half of the original bulk and re-insert
> 4. repeat steps 2 and 3
>
> But I guess implementing this sort of logic is nearly as complicated as
> identifying the failed messages and trying to reindex them, only slower
> (and probably less reliable).
>
> As for identifying which messages are malformed and which are worth
> retrying, I guess it depends on the exception type.

Actually, I think that the way the default batch handling works would want
logic more like:

when the bulk is sent, try to insert it all

if any inserts fail, remove any messages that were successfully inserted
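
A rough Python sketch of that batch logic follows, assuming the messages left over after removing the successes are the ones to resend. The send_bulk callable is hypothetical: it stands in for whatever actually posts the bulk request, and it is assumed to return one result dict per message, in the same order, with an "error" key on the ones that failed. The attempt limit is also an assumption of the sketch.

```python
def insert_with_retries(batch, send_bulk, max_attempts=3):
    """Send the batch, then keep resending only what failed.

    Returns the messages still not inserted after max_attempts.
    """
    pending = list(batch)
    for _ in range(max_attempts):
        if not pending:
            break
        results = send_bulk(pending)
        if len(results) != len(pending):
            raise ValueError("results do not line up with the batch")
        # Drop everything that made it in; what is left is sent again.
        pending = [msg for msg, res in zip(pending, results) if "error" in res]
    return pending
```

Because successful messages are removed before each retry, nothing that was already inserted is sent twice, so duplicates are avoided without rsyslog having to supply its own IDs.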

David Lang
