ab at getopt
Jan 9, 2012, 4:07 PM
Post #2 of 2
On 09/01/2012 16:27, Mike C wrote:
> I'm investigating storing syslog data using Lucene (via Solr or
> Elasticsearch, undecided at present). The syslogs belong to systems
> under the scope of the PCI DSS (Data Security Standard), and one of
> the requirements is to ensure logs aren't tampered with. I'm looking
> for advice on how to accomplish this.
> Looking through the Lucene documentation, I believe there isn't
> any built-in functionality to secure index data through digital
> signatures or HMACs. Is this the case, or have I overlooked something?
> I see there is a lucenetransform project
> (http://code.google.com/p/lucenetransform/) that offers encryption,
> but not digital signatures. I'm not concerned about hiding the
> contents of the data, just need to ensure it hasn't been tampered
> with. At present I use Splunk, which signs and verifies blocks of
> indexed data. Unfortunately its pricing model doesn't scale well,
> hence looking for a Lucene-based solution.
> I suppose I could add a digital signature programmatically to each
> Lucene Document/Syslog, though it seems like a lot of overhead.
> Lucenetransform's approach does seem to suggest that I could provide a
> digital-signature version of Directory (and IndexInput/IndexOutput);
> however, before I go down that rabbit hole, I decided to check in here.
> Any advice or suggestions appreciated.
This is an interesting and important problem.

I assume that the signatures should be created as part of the regular
indexing process. In a sense they would then also depend on, and provide
a way to verify, the authenticity of the application that created the
index (because the application has to know how to create valid
signatures). You would obviously need a counterpart application that can
verify such signatures.

Per-document sigs do add some overhead, but if you can keep them small
(128 bits?) then you can still use stored fields (or DocValues in trunk,
which offer a more efficient, compact representation). Still, if you
need non-repudiation for certain sequences of events then you need to
sign those sequences too. In Lucene terms this would probably mean
segments or Directory files.
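
To make the per-document idea concrete, here is a minimal sketch in plain
Java (JDK only, no Lucene calls). It assumes HMAC-SHA256 truncated to 128
bits as the "small" signature; the class name DocSigner and the key
handling are hypothetical, and the resulting 16-byte tag is what you would
put into a stored field next to the message. Keep in mind that an HMAC is a
shared-secret construction: it detects tampering, but anyone who holds the
key can produce valid tags.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: one truncated HMAC tag per syslog line. */
final class DocSigner {
    private static final String ALGO = "HmacSHA256";
    private static final int TAG_BYTES = 16; // 128 bits, an assumed size

    private final SecretKeySpec key;

    DocSigner(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGO);
    }

    /** Returns the first 16 bytes of HMAC-SHA256 over the UTF-8 bytes of the line. */
    byte[] sign(String syslogLine) {
        try {
            Mac mac = Mac.getInstance(ALGO);
            mac.init(key);
            byte[] full = mac.doFinal(syslogLine.getBytes(StandardCharsets.UTF_8));
            return Arrays.copyOf(full, TAG_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cannot compute HMAC", e);
        }
    }

    /** Recomputes the tag and compares it in constant time. */
    boolean verify(String syslogLine, byte[] tag) {
        return MessageDigest.isEqual(sign(syslogLine), tag);
    }
}
```

The verifying application would read the message and the tag back from the
stored fields and call verify() with the same key.
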
So the "transformation" approach can work well for creating global (per
segment and per file) signatures. Instead of encrypting, you would pass
all data that is written to the Directory through an HMAC algorithm,
which on stream close would simply write a signature to a separate file
in the Directory. This can be easily implemented as a Directory wrapper.
The only complication is that you would have to handle changes related
to segment merges yourself, i.e. you would have to do something with the
sig files that correspond to obsolete segments (discard them?).
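
The stream part of such a wrapper can be sketched with the JDK alone. The
code below is not a Lucene Directory or IndexOutput implementation; it uses
java.io.OutputStream as a stand-in to show the mechanics: every byte
written is also fed to an HMAC, and close() writes the result to a separate
signature file. HmacOutputStream, the ".sig" naming, and the choice of
HMAC-SHA256 are assumptions for illustration.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical stand-in for a signing IndexOutput: HMACs everything written. */
final class HmacOutputStream extends FilterOutputStream {
    private static final String ALGO = "HmacSHA256";

    private final Mac mac;
    private final Path sigFile;
    private boolean closed;

    HmacOutputStream(OutputStream out, byte[] secret, Path sigFile) throws IOException {
        super(out);
        this.mac = newMac(secret);
        this.sigFile = sigFile;
    }

    private static Mac newMac(byte[] secret) throws IOException {
        try {
            Mac m = Mac.getInstance(ALGO);
            m.init(new SecretKeySpec(secret, ALGO));
            return m;
        } catch (GeneralSecurityException e) {
            throw new IOException("cannot initialise HMAC", e);
        }
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        mac.update((byte) b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        mac.update(b, off, len);
    }

    /** Closes the data stream, then writes the signature to its own file. */
    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        super.close();
        Files.write(sigFile, mac.doFinal());
    }

    /** Recomputes the HMAC of dataFile and compares it with the stored signature. */
    static boolean verify(Path dataFile, Path sigFile, byte[] secret) throws IOException {
        // Reads the whole file for brevity; a real verifier would stream it.
        byte[] actual = newMac(secret).doFinal(Files.readAllBytes(dataFile));
        return MessageDigest.isEqual(actual, Files.readAllBytes(sigFile));
    }
}
```

In a real wrapper the same logic would sit behind whatever the Directory
returns for output, and the sig file for a deleted segment file would be
removed along with it.
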
In Lucene trunk you can use the Codec API to do essentially the same as
explained above, only this time you can interpret the data more easily.
For example, if some aspects of the data (postings, payloads, term
dictionary) are not as important for the signature as stored fields are,
you can skip them. When a batch of documents (corresponding to a Lucene
segment) is finished, you would write the signatures to additional
files. This time the sig files would be known to belong to that segment,
so you would get some help from Lucene during segment merging: you could
handle merging of the data (create additional sigs for every merge? or
recompute the sig for the new segment?), and old sigs would be deleted
whenever old segments are deleted due to merging.
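
As a rough illustration of the "recompute the sig for the new segment"
option, the sketch below keeps one HMAC per segment name in memory and, on
merge, signs the merged bytes and drops the entries of the segments that
were merged away. It does not use the Codec API at all; SegmentSigRegistry,
the segment names, and the byte arrays standing in for stored-field data
are all hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical bookkeeping of one signature per segment. */
final class SegmentSigRegistry {
    private static final String ALGO = "HmacSHA256";

    private final SecretKeySpec key;
    private final Map<String, byte[]> sigs = new HashMap<>();

    SegmentSigRegistry(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGO);
    }

    private byte[] hmac(byte[] data) {
        try {
            Mac mac = Mac.getInstance(ALGO);
            mac.init(key);
            return mac.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cannot compute HMAC", e);
        }
    }

    /** Signs the data of a freshly written segment. */
    void addSegment(String name, byte[] segmentData) {
        sigs.put(name, hmac(segmentData));
    }

    /** Signs the merged segment, then discards the sigs of the obsolete ones. */
    void onMerge(List<String> mergedAway, String newName, byte[] mergedData) {
        sigs.put(newName, hmac(mergedData));
        for (String old : mergedAway) {
            sigs.remove(old);
        }
    }

    boolean hasSignature(String name) {
        return sigs.containsKey(name);
    }

    /** True only if a signature exists for the segment and matches the data. */
    boolean verify(String name, byte[] segmentData) {
        byte[] expected = sigs.get(name);
        return expected != null && MessageDigest.isEqual(expected, hmac(segmentData));
    }
}
```

Recomputing is the simpler choice; the alternative of keeping additional
sigs per merge would preserve a trail back to the original segments at the
cost of more bookkeeping.
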

I'd give it a shot with the Directory-based approach first, because it's
easy to implement, and then see if it's good enough.
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com